UW Data Science Course: Week One

Flavors of Data

Numerical

This is some sort of quantitative measurement, e.g. heights of people, page load times, stock prices.

There are two types of numerical data.

Discrete Data
Integer-based; often counts of some event
- How many purchases did a customer make in a year?
- How many times did I flip “heads”?

Continuous Data
Has an infinite number of possible values
- How much time did it take for a user to check out?
- How much rain fell on a given day?

Categorical

Qualitative data that has no inherent mathematical meaning

Gender, Yes/No (binary data), Race, State of residence, Product Category, Political Party, etc.
You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have a mathematical meaning.
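As a rough illustration of assigning numbers to categories, here is a minimal sketch. The product categories are made up for the example, and the integer codes are only labels.

```python
# Hypothetical sample: product categories for six purchases.
categories = ["books", "toys", "books", "garden", "toys", "books"]

# Assign each distinct category an integer code (sorted for a stable mapping).
code_for = {name: code for code, name in enumerate(sorted(set(categories)))}
encoded = [code_for[name] for name in categories]

print(code_for)  # {'books': 0, 'garden': 1, 'toys': 2}
print(encoded)   # [0, 2, 0, 1, 2, 0]
```

Averaging the encoded list would produce a number, but it would not mean anything: "toys" is not twice "garden".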

Ordinal

This is a mixture of numerical and categorical data

Categorical data that has mathematical meaning

Example: movie ratings on a 1-5 scale
- Ratings must be 1, 2, 3, 4, or 5
- But these values have mathematical meanings; 1 means it’s a worse movie than a 2

Statistics 101

Mean

This is the average. Sum all the values and divide by the number of values.

Median

Sort the values, and take the value at the midpoint.
- If you have an odd number of samples, the median is the single middle value.
- If you have an even number of samples, the median falls between two data points, so take the average of the two in the middle.
Median is less susceptible to outliers than the mean.
- Example: mean household income in the US is $72,641, but the median is only $51,939, because the mean is skewed by a handful of billionaires.
Median better represents the “typical” American in this example.
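To see the effect of an outlier on the mean versus the median, here is a small sketch using Python's standard statistics module. The income figures are invented for illustration and are not the US numbers quoted above.

```python
import statistics

# Hypothetical household incomes; the last one is an extreme outlier.
incomes = [30_000, 42_000, 51_000, 58_000, 65_000, 10_000_000]

mean_income = statistics.mean(incomes)      # pulled up to roughly 1.7 million
median_income = statistics.median(incomes)  # even count: average of 51,000 and 58,000

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")      # median: 54,500
```

One very large value moves the mean far away from every other household, while the median stays in the middle of the pack.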

Mode

The most common value in a data set
- Not relevant to continuous numerical data
Example: back to the number of kids in each house, where the mode is the count that shows up most often.
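A quick sketch of finding the mode for a kids-per-house style data set. The counts below are hypothetical, not data from the course.

```python
import statistics

# Hypothetical number of kids in each of ten households.
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 2, 1, 0]

most_common = statistics.mode(kids_per_house)     # single most common value
all_modes = statistics.multimode(kids_per_house)  # every value tied for most common

print(most_common)  # 0
print(all_modes)    # [0]
```

Here 0 appears four times, more than any other count, so it is the mode.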

Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)

Variance – measures how “spread-out” the data is.

Variance (sigma squared) is simply the average of the squared differences from the mean.
Example: What is the variance of the data set (1, 4, 5, 4, 8)?
- First find the mean: (1+4+5+4+8)/5 = 4.4
- Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
- Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
- Find the average of the squared differences:
  - sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
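The steps above can be sketched in Python as follows. The variable names are my own, and statistics.pvariance is included only as a cross-check.

```python
import statistics

data = [1, 4, 5, 4, 8]

mean = sum(data) / len(data)            # 4.4
differences = [x - mean for x in data]  # distance of each point from the mean
squared = [d ** 2 for d in differences]
variance = sum(squared) / len(data)     # population variance: divide by N

print(round(variance, 2))                    # 5.04
print(round(statistics.pvariance(data), 2))  # 5.04
```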

Standard Deviation is the square root of the variance

This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.

You can describe how extreme a data point is by saying “how many sigmas” away from the mean it is.
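As a minimal sketch of the “how many sigmas” idea, reusing the data set from the variance example: sigmas_from_mean is a helper name invented here, and the one-sigma cutoff simply follows the rule of thumb in these notes.

```python
import statistics

data = [1, 4, 5, 4, 8]

mean = statistics.mean(data)     # 4.4
sigma = statistics.pstdev(data)  # square root of 5.04, about 2.24

def sigmas_from_mean(x):
    """How many standard deviations x lies from the mean (signed)."""
    return (x - mean) / sigma

for x in data:
    flag = "unusual" if abs(sigmas_from_mean(x)) > 1 else "typical"
    print(x, round(sigmas_from_mean(x), 2), flag)
```

In this data set, 1 and 8 land more than one sigma from the mean, while 4 and 5 do not.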

Population vs. Sample

If you’re working with a sample of data instead of an entire data set (the entire population)…
- Then you want to use the sample variance instead of the population variance
- For N samples, you just divide the sum of the squared differences by N-1 instead of N.
- So, in our example, we computed the population variance like this:
  - Sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
- But the sample variance would be:
  - S squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 4 = 6.3
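Python's standard statistics module has both versions, so the two results can be compared directly. This is just a sketch on the same five-point data set.

```python
import statistics

data = [1, 4, 5, 4, 8]

population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by N - 1

print(round(population_variance, 2))  # 5.04
print(round(sample_variance, 2))      # 6.3
```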

The Why

Probability Density Functions

A probability density function gives the probability of a value falling within a given range. It’s NOT the probability of a specific number occurring.

“Gives you the probability of a data point falling within some given range of a given value.”
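To make the range idea concrete, here is a sketch using statistics.NormalDist from the standard library. The choice of a normal distribution with mean 0 and standard deviation 1 is an assumption for the example only.

```python
from statistics import NormalDist

# Assumed example distribution: mean 0, standard deviation 1.
dist = NormalDist(mu=0.0, sigma=1.0)

# Probability of landing in a range = difference of the cumulative distribution.
p_within_one_sigma = dist.cdf(1.0) - dist.cdf(-1.0)  # about 0.68

# The density at a single point is the height of the curve, not a probability.
density_at_zero = dist.pdf(0.0)                      # about 0.40

print(round(p_within_one_sigma, 4))
print(round(density_at_zero, 4))
```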

Probability Mass Function – Discrete Data
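For discrete data, each individual value does get its own probability. A small sketch, using the total of two fair six-sided dice as an assumed example:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]

# PMF: map each possible total to its exact probability.
counts = Counter(totals)
pmf = {total: Fraction(count, len(totals)) for total, count in sorted(counts.items())}

print(pmf[7])             # 1/6
print(sum(pmf.values()))  # 1
```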

Examples of Data Distributions

Uniform Distribution – there is a flat, constant probability of a value occurring anywhere within the range, so every value is equally likely.
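One way to see the flat shape is to draw many uniform samples and count how many land in each equal-width bin. The range 0 to 10, the sample size, and the seed are arbitrary choices for this sketch.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Draw 100,000 values uniformly from 0 to 10 and count them in 10 equal-width bins.
samples = [random.uniform(0, 10) for _ in range(100_000)]
bin_counts = [0] * 10
for x in samples:
    bin_counts[min(int(x), 9)] += 1  # clamp so a value of exactly 10.0 goes in the last bin

# Each bin should hold roughly 10% of the samples.
bin_fractions = [count / len(samples) for count in bin_counts]
print([round(f, 3) for f in bin_fractions])
```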

Normal / Gaussian

Exponential PDF / “Power Law” – Things fall off in an exponential manner.
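A minimal sketch of an exponential density, assuming a rate parameter of 1. The function name exponential_pdf is made up for this example.

```python
import math

def exponential_pdf(x, rate=1.0):
    """Density of an exponential distribution with the given rate (lambda)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

# Each step of 1 along the x axis shrinks the density by the same factor (1/e here).
values = [exponential_pdf(x) for x in range(5)]
print([round(v, 4) for v in values])  # [1.0, 0.3679, 0.1353, 0.0498, 0.0183]
```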