Flavors of Data
Numerical – This is some sort of quantitative measurement, e.g. heights of people, page load times, stock prices
There are two types of Numerical Data:
Discrete Data – Integer based; often counts of some event
- How many purchases did a customer make in a year?
- How many times did I flip "heads"?
Continuous Data
- Has an infinite number of possible values
- How much time did it take for a user to check out?
- How much rain fell on a given day?
Categorical – Qualitative data that has no inherent mathematical meaning
- Gender, Yes/No (binary data), Race, State of residence, Product Category, Political Party, etc.
- You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have a mathematical meaning
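A minimal sketch of assigning numbers to categories. The state names are made-up sample values, and the integer codes are arbitrary labels: comparing or averaging them would be meaningless.

```python
# Hypothetical "state of residence" values, for illustration only.
states = ["Texas", "Ohio", "Texas", "Maine"]

# Give each distinct category an arbitrary integer code (alphabetical order here).
codes = {label: i for i, label in enumerate(sorted(set(states)))}

# Replace each label with its code for a more compact representation.
encoded = [codes[s] for s in states]

print(codes)    # mapping from category to code
print(encoded)  # the same data, stored as integers
```

The codes only identify categories; a 2 is not "more" than a 0 in any sense.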
Ordinal – This is a mixture of numerical and categorical data
Categorical data that has mathematical meaning
- Example: movie ratings on a 1-5 scale
- Ratings must be 1, 2, 3, 4, or 5
- But these values have mathematical meaning: 1 means it’s a worse movie than a 2
Statistics 101
Mean
- This is the average. Sum all the values and divide by the number of values.
Median
- Sort the values, and take the value at the midpoint
- If you have an even number of data points, the median falls in between the two middle data points, so take the average of those two.
- Median is less susceptible to outliers than the mean.
- Example: mean household income in the US is $72,641, but the median is only $51,939, because the mean is skewed by a handful of billionaires
- Median better represents the "typical" American in this example.
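A small sketch of how one outlier pulls the mean but not the median, using Python's standard statistics module. The income figures are invented for illustration and are not real data.

```python
import statistics

# Hypothetical incomes: five typical households plus one extreme outlier.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000_000]

mean_income = statistics.mean(incomes)      # dragged far upward by the outlier
median_income = statistics.median(incomes)  # average of the two middle values

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

With an even number of values, the median here is the average of 50,000 and 55,000.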
Mode
- The most common value in a data set
- Not relevant to continuous numerical data
- Back to our number of kids in each house example.
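A sketch of finding the mode for discrete data like the number of kids per house. The counts below are hypothetical, since the original example data is not in these notes.

```python
import statistics
from collections import Counter

# Hypothetical number of kids in each of ten houses.
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 2, 3, 0]

# Count how often each value occurs, then take the most frequent one.
value_counts = Counter(kids_per_house)
most_common_value = statistics.mode(kids_per_house)

print(value_counts)
print(most_common_value)
```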
Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)
Variance – measures how "spread-out" the data is.
- Variance (sigma squared) is simply the average of the squared differences from the mean.
- Example: What is the variance of the data set (1, 4, 5, 4, 8)?
- First find the mean: (1+4+5+4+8)/5 = 4.4
- Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
- Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
- Find the average of the squared differences:
- sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
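The steps above can be sketched in plain Python. This follows the same data set and the same order of operations; the variable names are just illustrative.

```python
data = [1, 4, 5, 4, 8]

# Step 1: find the mean.
mean = sum(data) / len(data)

# Step 2: find the differences from the mean.
differences = [x - mean for x in data]

# Step 3: square each difference.
squared_differences = [d ** 2 for d in differences]

# Step 4: average the squared differences (population variance).
variance = sum(squared_differences) / len(data)

print(f"mean = {mean:.2f}")
print(f"variance = {variance:.2f}")
```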
Standard Deviation (sigma) is the square root of the variance
- In our example: sigma = square root of 5.04, which is approximately 2.24
This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.
You can talk about how extreme a data point is by talking about "how many sigmas" away from the mean it is.
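A rough sketch of the "how many sigmas" idea on the same data set (1, 4, 5, 4, 8). The one-standard-deviation cutoff is the rule of thumb from these notes, not a universal threshold.

```python
import math

data = [1, 4, 5, 4, 8]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)

# How many standard deviations each point sits from the mean.
sigmas = [(x - mean) / std_dev for x in data]

# Flag points more than one standard deviation away from the mean.
unusual = [x for x in data if abs(x - mean) > std_dev]

print(f"std dev = {std_dev:.2f}")
print([f"{s:+.2f}" for s in sigmas])
print(unusual)
```

Here only 1 and 8 fall outside one standard deviation of the mean.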
Population vs. Sample
- If you’re working with a sample of data instead of an entire data set (the entire population)…
- Then you want to use the sample variance instead of the population variance
- For N samples, you just divide the sum of the squared differences by N-1 instead of N.
- So, in our example, we computed the population variance like this:
- sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
- But the sample variance would be:
- S squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 4 = 6.3
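As a quick check, Python's standard statistics module has both versions: pvariance divides by N and variance divides by N-1. A minimal sketch on the same data:

```python
import statistics

data = [1, 4, 5, 4, 8]

# Population variance: divide by N.
population_variance = statistics.pvariance(data)

# Sample variance: divide by N - 1.
sample_variance = statistics.variance(data)

print(f"population variance = {population_variance:.2f}")
print(f"sample variance     = {sample_variance:.2f}")
```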
The Why
Probability Density Functions
A probability density function gives the probability of a data point falling within a given range of values. It's NOT the probability of a specific number occurring.
“Gives you the probability of a data point falling within some given range of a given value.”
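A sketch of the range idea using statistics.NormalDist (Python 3.8+). The standard normal distribution (mean 0, standard deviation 1) is an assumed example; the probability of a range is the difference of the cumulative distribution at its two ends.

```python
from statistics import NormalDist

# Assumed example distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# The PDF value at a point is a density (curve height), not a probability.
density_at_zero = standard_normal.pdf(0.0)

# Probability of landing within one standard deviation of the mean.
p_within_one_sigma = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(f"density at 0 = {density_at_zero:.3f}")
print(f"P(-1 <= x <= 1) = {p_within_one_sigma:.3f}")
```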
Probability Mass Function – The discrete data counterpart: gives the probability of each specific value occurring
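A sketch of a probability mass function for a discrete count, here the number of heads in 4 fair coin flips. The function name heads_pmf and the choice of 4 flips are illustrative assumptions.

```python
from math import comb

def heads_pmf(k, flips):
    """Probability of exactly k heads in the given number of fair coin flips."""
    return comb(flips, k) / 2 ** flips

flips = 4
pmf = [heads_pmf(k, flips) for k in range(flips + 1)]

for k, p in enumerate(pmf):
    print(f"P({k} heads) = {p}")
```

Unlike a density, each value here is a real probability, and they sum to 1.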
Examples of Data Distributions
Uniform Distribution – There is a flat, constant probability of any value in the range occurring; every value has an equal chance.
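A rough sketch of what "flat" looks like: draw uniform random numbers and count how many land in each of 10 equal-width bins. The seed, sample size, and bin count are arbitrary choices.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the run is repeatable (arbitrary value)

# Draw samples uniformly distributed over [0, 1).
samples = [random.random() for _ in range(100_000)]

# Count how many samples fall in each of 10 equal-width bins.
bin_counts = Counter(int(x * 10) for x in samples)

for b in range(10):
    print(f"{b / 10:.1f}-{(b + 1) / 10:.1f}: {bin_counts[b]}")
```

Each bin should hold roughly the same number of samples, about 10,000.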
Normal / Gaussian
Exponential PDF / “Power Law” – Things fall off in an exponential manner.
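A minimal sketch of the exponential fall-off, using the standard exponential density rate * e^(-rate * x). The rate of 1.0 is an assumed value, and this covers only the exponential case.

```python
import math

def exponential_pdf(x, rate=1.0):
    """Density of an exponential distribution with the given rate, for x >= 0."""
    if x < 0:
        return 0.0
    return rate * math.exp(-rate * x)

# Evaluate the density at x = 0, 1, 2, 3, 4.
densities = [exponential_pdf(x) for x in range(5)]

for x, d in enumerate(densities):
    print(f"f({x}) = {d:.4f}")
```

Each step to the right multiplies the density by the same factor, which is what gives the curve its shape.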