Before the pandemic, I started looking for a position because my company refused to let me work from home, even part-time. It took me six months, but I found a well-paid remote position.
The first day of my remote position was the first day of California’s lockdown (March 13th, 2020), which was a bit ironic, because most employees became remote on that day.
Pre-pandemic, remote positions were a novelty; I applied to one or two remote jobs a week, and I scoured all the job boards (Indeed, LinkedIn, Glassdoor, Dice). Most companies wanted you in the office at least part of the time. Managers wanted to see you in a cubicle; you might not be doing anything, but seeing you put them at ease.
The problem with going into the office is that it drastically limits the number of jobs available to you. Remote work opens up the entire United States.
Fast forward to February 2022: I received an email stating my contract wouldn’t be renewed. So, starting in March 2022, I was back in the job market. This time, my experience was the opposite. I changed my status on LinkedIn to “Open to Work,” and for the next week and a half, I received 30 to 50 emails a day, 99% of them for remote positions. When I received an email asking if I was open to relocation, I felt bad for the recruiter, because that position would never be filled in the current job market.
At one point, I was interviewing for 5 positions at the same time, with base salary expectations of $175k to $200k per year.
After a week and a half of searching, I accepted an offer for a remote position from a large company that everyone would recognize.
If companies hadn’t been forced into remoteness, we’d likely be stuck driving into the office to appease our insecure managers.
When a system (e.g., a database) can no longer enforce the shape of the data, something else must pick up the slack. When might this happen?
The phone number format in the US is (area code) (prefix)-(number); for example: (734) 555-3212. We’ll talk about a database in this article for simplicity’s sake, but the datastore doesn’t have to be a database.
Phone numbers in the US always have ten digits (we are ignoring the country code). Phone numbers can come in a variety of formats:
Most databases are limited to data types (i.e., numbers, strings, dates, etc.) and don’t support formatting. Many applications opt to use the string data type to store the phone number. However, the string data type accepts ANY string. To ensure the phone number is valid, we need an additional layer of validation.
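As a sketch of that validation layer, here’s one way to validate and normalize US phone numbers in Python. The regex and the `normalize_phone` helper are illustrative assumptions, not from any particular application:

```python
import re
from typing import Optional

# Illustrative pattern: accepts "(734) 555-3212", "734-555-3212",
# "734.555.3212", or "7345553212" (country code ignored)
PHONE_RE = re.compile(r"^\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})$")

def normalize_phone(raw: str) -> Optional[str]:
    """Return the number in a canonical "(XXX) XXX-XXXX" form, or None if invalid."""
    match = PHONE_RE.match(raw.strip())
    if not match:
        return None
    area, prefix, number = match.groups()
    return f"({area}) {prefix}-{number}"
```

If every application (or, better, one central service) funnels writes through a helper like this, the database only ever sees a single format.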
In a single application connecting to a single database, data validation is typically enforced in the application.
When your architecture grows to two or more applications sharing a database, two things can happen:
1. Each application has its own data validation:
2. There is a central service the applications call to validate the data and persist the data:
The risk of data validation in multiple places is that the validations might drift out of sync. A format valid for one application might not be valid in another. In the worst case, a bad format will throw an error or, in extreme cases, crash the application.
The best case is to centralize the data validation so the format stored in the database is consistent for the entire organization. There are exceptions, of course, and I’m assuming multiple applications read and write to a shared database.
Part 0: Introduction
Data science articulated, data science examples, history and context, technology landscape
Part 1: Data Manipulation, at Scale
Databases and the relational algebra
Readings
MapReduce, Hadoop, relationship to databases, algorithms, extensions, language; key-value stores and NoSQL; tradeoffs of SQL and NoSQL
Readings
Data cleaning, entity resolution, data integration, information extraction (NOT COVERED IN LECTURES)
Readings / Talks
Part 2: Analytics
Topics in statistical modeling and experiment design
Readings
Introduction to Machine Learning, supervised learning, decision trees/forests, simple nearest neighbor
Readings
Unsupervised learning: k-means, multi-dimensional scaling
Readings
Part 3: Interpreting and Communicating Results
Visualization, visual data analytics
Readings (well, watchings)
Backlash: Ethics, privacy, unreliable methods, irreproducible results
Part 4: Graph Analytics
Readings
Numerical – This is some sort of quantitative measurement, e.g., heights of people, page load times, stock prices
There are two types of numerical data: Discrete Data – integer-based; often counts of some event
How many purchases did a customer make in a year?
How many times did I flip "heads"?
Continuous Data
Has an infinite number of possible values
Categorical – Qualitative data that has no inherent mathematical meaning
Ordinal – This is a mixture of numerical and categorical data: categorical values whose order has mathematical meaning
Mean
Median
Mode
Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)
Variance – measures how "spread-out" the data is.
Standard Deviation is the square root of the variance
This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.
You can talk about how extreme a data point is by talking about, "how many sigmas" away from the mean it is.
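These measures can be sketched with Python’s standard `statistics` module; the data set here is made up for illustration:

```python
import statistics

data = [1, 4, 4, 6, 8, 9, 9, 9, 12]   # made-up sample

mean = statistics.mean(data)           # average value
median = statistics.median(data)       # middle value when sorted
mode = statistics.mode(data)           # most frequent value
variance = statistics.pvariance(data)  # population variance: the "spread"
stdev = statistics.pstdev(data)        # square root of the variance

# Points more than one standard deviation from the mean are "unusual"
outliers = [x for x in data if abs(x - mean) > stdev]
print(outliers)  # [1, 12]
```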
Population vs. Sample
Probability Density Function – Continuous Data
“Gives you the probability of a data point falling within some given range of a given value.”
This is the probability of a range occurring; it’s NOT the probability of a specific number occurring.
Probability Mass Function – Discrete Data
Examples of Data Distributions
Uniform Distribution – there is a flat, constant (equal) probability of the data occurring.
Normal / Gaussian
Exponential PDF / “Power Law” – Things fall off in an exponential manner.
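A quick way to get a feel for these distributions is to sample from them. This sketch uses Python’s standard `random` module; the parameters are arbitrary, chosen only for illustration:

```python
import random

random.seed(42)
N = 100_000

# Arbitrary parameters, chosen only to illustrate each distribution's shape
uniform = [random.uniform(0, 10) for _ in range(N)]        # flat: every value equally likely
normal = [random.gauss(5, 2) for _ in range(N)]            # bell curve centered on the mean
exponential = [random.expovariate(1.0) for _ in range(N)]  # probability falls off exponentially

# Sample means land near the theoretical means (5.0, 5.0, and 1.0)
print(sum(uniform) / N, sum(normal) / N, sum(exponential) / N)
```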
If you’re a DBA, you need to learn to deal with unstructured data
If you are a statistician, you need to learn to deal with data that does not fit in memory
If you are a software engineer, you need to learn statistical modeling and how to communicate results.
If you are a business analyst, you need to learn about algorithms and tradeoffs at scale.
Week 2
pwd – “print working directory”
Week 3
Descriptive – Just to describe a set of data (i.e. census data, ngram viewer)
Exploratory – Find relationships you didn’t know about
Inferential – Use a relatively small sample of data to say something about a bigger population
Predictive – To use the data on some objects to predict values for another object
Causal – To find out what happens to one variable when you make another variable change.
Mechanistic – Understand the exact changes in variables that lead to changes in other variables for individual objects.
What is data – Data are values of qualitative or quantitative variables belonging to a set of items.
Data rarely comes processed.
Data is the second most important thing (the question you’re trying to answer comes first)
What about big data?
Big or small data: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”
Why should I care about experimental design? It’s really easy to focus on the outcome and overlook an error in the numbers.
Question: Does changing the text on your website improve donations?
Formulate your question in advance
Data Science is a scientific discipline. Science demands you are answering a specific question when you are using data.
Compare two versions of the website: randomly show visitors one of the two versions, then measure how much they donate to figure out which is better.
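That experiment can be sketched as a simulation. The 5% and 7% donation rates below are made-up assumptions, not real data:

```python
import random

random.seed(0)

# Hypothetical A/B test: visitors are randomly shown one of two versions of a
# donation page. The "true" donation rates (5% vs. 7%) are made-up assumptions.
def donations(rate, visitors=10_000):
    """Count how many simulated visitors donate at the given rate."""
    return sum(1 for _ in range(visitors) if random.random() < rate)

donations_a = donations(0.05)  # version A
donations_b = donations(0.07)  # version B

# With enough visitors, the better-converting version wins
print(donations_b > donations_a)
```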
Statistical inference – a key component of data science.
Confounding – What are the other variables that are causing a relationship?
Consider shoe size and literacy: the bigger the shoe, the more literate someone is. But what’s happening is that babies and children have small feet and less literacy. Age is actually the factor, not shoe size.
Correlation is not causation
Prediction – Take a sample of people with cancer. Separate out the folks that responded to chemotherapy from the ones that did not. Then create a function to determine who will and who won’t respond to chemotherapy.
Prediction is more challenging than inference.
Prediction vs. Inference – the more separated the groupings, the easier prediction becomes.
Good experiments
Prediction is not inference
Beware of data dredging
Could a computer surprise us? Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?
To control something, first you need to be able to observe it.
Loss Function – The loss function takes the predictions of the network and the true target (what you wanted the network to output) and computes a distance score, capturing how well the network has done
Backpropagation Algorithm – This adjustment is the job of the optimizer, which implements what’s called the Backpropagation algorithm: the central algorithm in deep learning.
Training Loop – This is the training loop, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function. A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network. Once again, it’s a simple mechanism that, once scaled, ends up looking like magic.
Layer – The core building block of neural networks is the layer, a data-processing module that you can think of as a filter for data. Some data goes in, and it comes out in a more useful form.
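The training loop can be sketched for a toy one-weight linear model. The data, loss (mean squared error), learning rate, and hand-derived gradient here are illustrative assumptions, not a real deep-learning framework:

```python
# Minimal sketch of a training loop for a one-weight linear model y = w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # true relationship: y = 2 * x

w = 0.0                          # initial weight
learning_rate = 0.01             # step size used by the "optimizer"

for _ in range(200):             # the training loop
    # Gradient of the mean-squared-error loss with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad    # adjust the weight to reduce the loss

print(round(w, 3))  # converges toward the true weight, 2.0
```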
Representations – Specifically, layers extract representations out of the data fed into them—hopefully, representations that are more meaningful for the problem at hand.
Data distillation – Most of deep learning consists of chaining together simple layers that will implement a form of progressive data distillation. A deep-learning model is like a sieve for data processing, made of a succession of increasingly refined data filters—the layers.
The Compilation Step
Overfitting – The test-set accuracy turns out to be 97.8%—that’s quite a bit lower than the training set accuracy. This gap between training accuracy and test accuracy is an example of overfitting: the fact that machine-learning models tend to perform worse on new data than on their training data.
Tensor – Numpy arrays, also called tensors. At its core, a tensor is a container for data—almost always numerical data. So, it’s a container for numbers. You may be already familiar with matrices, which are 2D tensors: tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a dimension is often called an axis).
Scalars (0D tensors) A tensor that contains only one number is called a scalar (or scalar tensor, or 0-dimensional tensor, or 0D tensor). In Numpy, a float32 or float64 number is a scalar tensor (or scalar array). You can display the number of axes of a Numpy tensor via the ndim attribute; a scalar tensor has 0 axes (ndim == 0). The number of axes of a tensor is also called its rank.
Vectors (1D tensors) An array of numbers is called a vector, or 1D tensor. A 1D tensor is said to have exactly one axis.
Dimensionality Dimensionality can denote either the number of entries along a specific axis (as in the case of our 5D vector) or the number of axes in a tensor (such as a 5D tensor), which can be confusing at times. In the latter case, it’s technically more correct to talk about a tensor of rank 5 (the rank of a tensor being the number of axes), but the ambiguous notation 5D tensor is common regardless.
Matrices (2D tensors) – An array of vectors is a matrix, or 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers.
3D tensors If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually interpret as a cube of numbers.
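A quick NumPy sketch of these ranks (assuming NumPy is available):

```python
import numpy as np

scalar = np.array(12)                   # 0D tensor (rank 0): a single number
vector = np.array([12, 3, 6, 14, 7])    # 1D tensor (rank 1): a 5-dimensional vector
matrix = np.array([[1, 2], [3, 4]])     # 2D tensor (rank 2): rows and columns
cube = np.zeros((3, 2, 2))              # 3D tensor (rank 3): a "cube" of numbers

print(scalar.ndim, vector.ndim, matrix.ndim, cube.ndim)  # 0 1 2 3
```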
Tensor Attributes
Types of Data
Vector Data – 2D tensors of shape (samples, features). Each single data point can be encoded as a vector, and thus a batch of data will be encoded as a 2D tensor (that is, an array of vectors), where the first axis is the samples axis and the second axis is the features axis.
Timeseries data or sequence data— 3D tensors of shape (samples, timesteps, features) Whenever time matters in your data (or the notion of sequence order), it makes sense to store it in a 3D tensor with an explicit time axis. Each sample can be encoded as a sequence of vectors (a 2D tensor), and thus a batch of data will be encoded as a 3D tensor.
The time axis is always the second axis (axis of index 1), by convention.
Images— 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width) Images typically have three dimensions: height, width, and color depth. Although grayscale images (like our MNIST digits) have only a single color channel and could thus be stored in 2D tensors, by convention image tensors are always 3D, with a one-dimensional color channel for grayscale images.
There are two conventions for shapes of images tensors: the channels-last convention (used by TensorFlow) and the channels-first convention (used by Theano).
5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
Video data is one of the few types of real-world data for which you’ll need 5D tensors. A video can be understood as a sequence of frames, each frame being a color image. Because each frame can be stored in a 3D tensor (height, width, color_depth), a sequence of frames can be stored in a 4D tensor (frames, height, width, color_depth), and thus a batch of different videos can be stored in a 5D tensor of shape (samples, frames, height, width, color_depth).
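A sketch of these shapes in NumPy; the batch sizes and dimensions are arbitrary, for illustration only:

```python
import numpy as np

# Arbitrary batch sizes and dimensions, for illustration only
vector_data = np.zeros((1000, 20))     # (samples, features)
timeseries = np.zeros((100, 60, 3))    # (samples, timesteps, features)
images = np.zeros((32, 64, 64, 3))     # (samples, height, width, channels)
video = np.zeros((2, 16, 32, 32, 3))   # (samples, frames, height, width, channels)

print(vector_data.ndim, timeseries.ndim, images.ndim, video.ndim)  # 2 3 4 5
```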
Interview question
"You could interview the same candidate twice with the same set of interviews and come to different conclusions each time."
Steve Yegge has an idea that each candidate has a perfect slate of interviewers and a perfect anti-slate of interviewers.
Google’s philosophy is "Missing someone who is good is OK, compared to hiring someone who is bad." A false positive is much worse than a false negative.
Interviewing is a team effort; if one person goes deep in one area, the other interviewers don’t need to ask the same questions.
The hiring committee was cross-team. No one person had the power to say yea or nay.
Be prepared. Have a set of questions you plan to ask, and have an idea of your follow-up questions.
An interview is a conversation; you’re both humans, so treat it as such.
Good interviewers are generous; they teach, even if the candidate isn’t the right fit.
Good questions are like onions
Strive for higher bandwidth
Strive for more signal and less noise; your job is to get them to show their best work.
Benefits and problem
Code world
Phase zero
Strategic advisor
Code Monkey
Microsoft is pivoting its business from desktop software to cloud services. The question I have is: where does this leave the developer who has made a career using their technologies?
Learn about the McKinsey Three Horizons model; specifically, look at the Horizon 3 (H3) initiatives
Keys to being valuable (time to value): get value out of data as fast as possible.
Do you want to be the architect, who puts the vision together, or the plumber, the electrician, or the carpenter?
Characterizing data, self-service data. Vanguard data architect. Understand the business problem. Actionable business value. Help a business executive get business insights from the data.
Do I move forward as an entrepreneur or do I move in the direction of an architect/manager?
To simplify the world, we break ideas and tasks into smaller pieces, but then we fall into the trap of believing the smaller pieces are reality. They are only small windows into the world, and things interact outside the boundaries of our perception.
A learning organization is a group of people who are continually enhancing their capacity to create what they want to create.
It’s important that organizations pick up the ability to learn together on a reliable, regular, and predictable schedule.
Traditional, authoritarian, hierarchical business organizations fail to tap the abilities of their people. For years and years, we’ve acted as if workers checked their brains at the door; we just wanted them to do “their work, not to think.”
The notion that people are interchangeable units is going to change. Knowledge and learning are always embodied in a person. This makes the person the organization’s most important asset.
Ultimately, it’s a change in ourselves; that will drive the change in our organization.
Systemic structures are the underlying patterns of interdependencies.
“Software is getting slower more rapidly than hardware becomes faster.” – Niklaus Wirth, “A Plea for Lean Software”
There are two sides to the performance coin:
Efficiency through Algorithms – How much work is required by a task.
Performance through Data Structures – How quickly a program does its work.
It All Comes Back to Watts
```cpp
// Without reserve: push_back may repeatedly reallocate as the vector grows.
std::vector<X> f(int n) {
  std::vector<X> result;
  for (int i = 0; i < n; ++i)
    result.push_back(X(...));
  return result;
}
```

```cpp
// With reserve: one up-front allocation; each push_back does less work.
std::vector<X> f(int n) {
  std::vector<X> result;
  result.reserve(n);
  for (int i = 0; i < n; ++i)
    result.push_back(X(...));
  return result;
}
```
```cpp
// Three hash lookups on the miss path: cache[key] twice, then again to return.
X *getX(std::string key,
        std::unordered_map<std::string, std::unique_ptr<X>> &cache) {
  if (cache[key])
    return cache[key].get();
  cache[key] = std::make_unique<X>(...);
  return cache[key].get();
}
```

```cpp
// One hash lookup: keep a reference to the entry and reuse it.
X *getX(std::string key,
        std::unordered_map<std::string, std::unique_ptr<X>> &cache) {
  std::unique_ptr<X> &entry = cache[key];
  if (entry)
    return entry.get();
  entry = std::make_unique<X>(...);
  return entry.get();
}
```
Always do less work!
Design API’s to Help
CPUs Have Hierarchical Cache System