Flavors of Data
Numerical
 This is some sort of quantitative measurement, e.g. heights of people, page load times, stock prices
 There are two types of numerical data:
Discrete Data
 Integer-based; often counts of some event
 How many purchases did a customer make in a year?
 How many times did I flip “heads”?
Continuous Data

 Has an infinite number of possible values
 How much time did it take for a user to check out?
 How much rain fell on a given day?
Categorical
 Qualitative data that has no inherent mathematical meaning
 Gender, Yes/No (binary data), Race, State of residence, Product Category, Political Party, etc.
 You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have a mathematical meaning
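As a minimal sketch of assigning numbers to categories, the snippet below maps each distinct label to an integer code. The `states` list and the alphabetical ordering of the codes are assumptions made up for illustration; the codes are only labels, so averaging or comparing them would be meaningless.

```python
# Hypothetical categorical data: state of residence for five customers.
states = ["Texas", "Ohio", "Texas", "Utah", "Ohio"]

# Give every distinct category an arbitrary integer code (alphabetical order here).
codes = {name: index for index, name in enumerate(sorted(set(states)))}

# Replace each label with its code for a more compact representation.
encoded = [codes[state] for state in states]

print(codes)
print(encoded)
```

Any other one-to-one assignment of codes would carry exactly the same information, which is what it means for the numbers to have no mathematical meaning.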
Ordinal
 This is a mixture of numerical and categorical data
 Categorical data that has mathematical meaning
 Example: movie ratings on a 1 to 5 scale
 Ratings must be 1, 2, 3, 4, or 5
 But these values have mathematical meanings; 1 means it’s a worse movie than a 2
Statistics 101
Mean
 This is the average. Sum all the values and divide by the number of values.
Median
 Sort the values, and take the value at the midpoint
 If you have an odd number of data points, the median is the single value in the middle.
 If you have an even number of data points, the median falls between the two in the middle, so take the average of those two.
 Median is less susceptible to outliers than the mean.
 Example: mean household income in the US is $72,641, but the median is only $51,939 – because the mean is skewed by a handful of billionaires
 Median better represents the “typical” American in this example.
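A small sketch of how an outlier pulls the mean but not the median, using Python’s standard `statistics` module. The income figures are invented for illustration and are not real survey data.

```python
import statistics

# Hypothetical household incomes in dollars; the last value is an extreme outlier.
incomes = [32_000, 41_000, 48_000, 52_000, 55_000, 61_000, 1_000_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)  # odd count: the single middle value

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")

# With an even number of values, the median is the average of the two middle ones.
even_median = statistics.median([1, 2, 3, 4])
print(even_median)
```

In this made-up sample the mean lands far above every household except the outlier, while the median stays at a value that an actual household in the list earns.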
Mode
 The most common value in a data set
 Not relevant to continuous numerical data
 Example: the number of kids in each house, where the mode is the count that shows up most often.
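A minimal sketch of finding the mode with the standard `statistics` module. The `kids_per_house` values are hypothetical, chosen only to illustrate a discrete data set.

```python
import statistics

# Hypothetical number of kids in each of ten houses (discrete data).
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 2, 3, 0]

# The mode is the value that occurs most often.
most_common = statistics.mode(kids_per_house)
print(most_common)

# multimode returns every value tied for most common.
tied = statistics.multimode([1, 1, 2, 2])
print(tied)
```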
Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)
Variance – measures how “spread out” the data is.
 Variance (sigma squared) is simply the average of the squared differences from the mean.
 Example: What is the variance of the data set (1, 4, 5, 4, 8)?
 First find the mean: (1+4+5+4+8)/5 = 4.4
 Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
 Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
 Find the average of the squared differences:
 sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
Standard Deviation (sigma) is the square root of the variance; for the example data set, the square root of 5.04 is about 2.24.
Standard deviation is often used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.
You can talk about how extreme a data point is by talking about “how many sigmas” away from the mean it is.
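The worked example can be sketched step by step in Python. The helper name `sigmas_from_mean` is hypothetical, introduced here only to illustrate the “how many sigmas” idea.

```python
import math

data = [1, 4, 5, 4, 8]

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: differences from the mean (these can be negative).
differences = [x - mean for x in data]

# Step 3: squared differences (always non-negative).
squared_differences = [d ** 2 for d in differences]

# Step 4: population variance is the average of the squared differences.
variance = sum(squared_differences) / len(data)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)


def sigmas_from_mean(x):
    """How many standard deviations x lies from the mean (signed)."""
    return (x - mean) / std_dev


print(mean, variance, std_dev)
print(sigmas_from_mean(8), sigmas_from_mean(4))
```

For this data set the variance comes out to about 5.04 and the standard deviation to about 2.24, so the value 8 sits roughly 1.6 sigmas above the mean while 4 is well within one sigma.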
Population vs. Sample
 If you’re working with a sample of data instead of an entire data set (the entire population)…
 Then you want to use the sample variance instead of the population variance
 For N samples, you just divide the sum of the squared differences by N - 1 instead of N.
 So, in our example, we computed the population variance like this:
 sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
 But the sample variance would be:
 S squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 4 = 6.3
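As a quick check of the two formulas, Python’s standard `statistics` module provides both versions: `pvariance` divides by N and `variance` divides by N - 1. This sketch reuses the example data set.

```python
import statistics

data = [1, 4, 5, 4, 8]

# Population variance: divide the sum of squared differences by N.
population_variance = statistics.pvariance(data)

# Sample variance: divide the sum of squared differences by N - 1.
sample_variance = statistics.variance(data)

print(population_variance)
print(sample_variance)
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as N grows.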
The Why
Probability Density Functions
For continuous data, a probability density function gives the probability of a value falling within a given range. It’s NOT the probability of a specific number occurring.
“Gives you the probability of a data point falling within some given range of a given value.”
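A minimal sketch of reading a probability off a density, assuming a standard normal distribution (mean 0, standard deviation 1) purely for illustration. It uses `statistics.NormalDist` from the standard library (Python 3.8+); the helper `probability_between` is a hypothetical name.

```python
from statistics import NormalDist

# Assumption for illustration: a standard normal distribution.
dist = NormalDist(mu=0.0, sigma=1.0)


def probability_between(low, high):
    """Probability that a value falls in [low, high], via the cumulative distribution."""
    return dist.cdf(high) - dist.cdf(low)


# Probability of landing within one standard deviation of the mean.
within_one_sigma = probability_between(-1.0, 1.0)

# The density at a single point is a height on the curve, not a probability.
density_at_mean = dist.pdf(0.0)

print(within_one_sigma)
print(density_at_mean)
```

The range probability comes out to roughly 0.68, while the probability of hitting any one exact value is zero for continuous data; that is why the density has to be read over a range.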
Probability Mass Function – the equivalent for discrete data; it gives the probability of each specific value occurring.
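For discrete data, a probability mass function can be estimated by counting how often each value occurs. This is a sketch using invented purchase counts, not data from the text.

```python
from collections import Counter

# Hypothetical number of purchases made by each of ten customers.
purchases = [0, 1, 1, 2, 2, 2, 3, 3, 1, 2]

# Count occurrences of each value, then divide by the total to get probabilities.
counts = Counter(purchases)
pmf = {value: count / len(purchases) for value, count in sorted(counts.items())}

for value, probability in pmf.items():
    print(value, probability)
```

Unlike a density, each entry here is a genuine probability of one specific value, and the entries sum to 1.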
Examples of Data Distributions
Uniform Distribution – there is a flat, constant probability of a value occurring anywhere within a given range, so every value in that range is equally likely.
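A sketch of what “flat” means for a continuous uniform distribution. The function names and the range 0 to 10 are assumptions chosen for illustration.

```python
def uniform_pdf(x, low, high):
    """Density of a uniform distribution on [low, high]: constant inside, zero outside."""
    if low <= x <= high:
        return 1.0 / (high - low)
    return 0.0


def uniform_probability(a, b, low, high):
    """Probability of falling in [a, b]: the overlapping width divided by the total width."""
    a = max(a, low)
    b = min(b, high)
    return max(0.0, b - a) / (high - low)


# The density is the same everywhere inside the range.
print(uniform_pdf(1.0, 0.0, 10.0), uniform_pdf(9.0, 0.0, 10.0))

# Equal-width sub-ranges are equally likely, wherever they sit.
print(uniform_probability(2.0, 5.0, 0.0, 10.0))
print(uniform_probability(6.0, 9.0, 0.0, 10.0))
```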
Normal / Gaussian
Exponential PDF / “Power Law” – Things fall off in an exponential manner: values near zero are the most likely, and the probability drops quickly as values get larger.
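To illustrate the fall-off, here is a sketch of the exponential density, rate * exp(-rate * x) for x >= 0. The function name and the default rate of 1.0 are assumptions for illustration. It shows only the exponential form.

```python
import math


def exponential_pdf(x, rate=1.0):
    """Density of an exponential distribution; zero for negative x."""
    if x < 0:
        return 0.0
    return rate * math.exp(-rate * x)


# Evaluate the density at x = 0, 1, 2, 3, 4.
densities = [exponential_pdf(x) for x in range(5)]

for x, density in enumerate(densities):
    print(x, round(density, 4))
```

Each step of 1 along the x axis multiplies the density by the same factor (about 0.37 when the rate is 1), which is what falling off in an exponential manner means.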