# UW Data Science Course: Week One

## Flavors of Data

Numerical – This is some sort of quantitative measurement, e.g. heights of people, page load times, stock prices

There are two types of numerical data:

Discrete Data – Integer-based; often counts of some event

• How many purchases did a customer make in a year?

• How many times did I flip "heads"?

Continuous Data

• Has an infinite number of possible values

• How much time did it take for a user to check out?
• How much rain fell on a given day?

Categorical – Qualitative data that has no inherent mathematical meaning

• Gender, Yes/No (binary data), Race, State of residence, Product Category, Political Party, etc.
• You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have a mathematical meaning
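As a rough sketch of assigning numbers to categories, the snippet below maps a few state abbreviations to integer codes. The `states` list is hypothetical data made up for illustration. The codes are only labels, so arithmetic on them (such as averaging) would be meaningless.

```python
# Hypothetical category values, chosen only for illustration.
states = ["WA", "OR", "WA", "CA", "OR", "WA"]

# Map each distinct category to an integer code (sorted for a stable order).
codes = {state: i for i, state in enumerate(sorted(set(states)))}

# Replace each category with its compact integer label.
encoded = [codes[s] for s in states]

print(codes)    # {'CA': 0, 'OR': 1, 'WA': 2}
print(encoded)  # [2, 1, 2, 0, 1, 2]
```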

Ordinal – This is a mixture of numerical and categorical data

Categorical data that has mathematical meaning

• Example: movie ratings on a 1-5 scale
• Ratings must be 1, 2, 3, 4, or 5
• But these values have mathematical meanings; 1 means it’s a worse movie than a 2

## Statistics 101

Mean

• This is the average. Sum all the values and divide by the number of values.

Median

• Sort the values, and take the value at the midpoint
• If you have an odd number of data points, the median is the single middle value.
• If you have an even number of data points, the median falls between the two middle values, so take the average of those two.
• Median is less susceptible to outliers than the mean.
• Example: mean household income in the US is \$72,641, but the median is only \$51,939, because the mean is skewed by a handful of billionaires.
• Median better represents the "typical" American in this example.
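To see how an outlier pulls the mean but not the median, here is a minimal sketch using Python's standard `statistics` module. The `incomes` list is hypothetical data invented for illustration, not real income figures.

```python
import statistics

# Hypothetical incomes in dollars; the last value is a deliberate outlier.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000, 5_000_000]

# Mean: sum of the values divided by the number of values.
mean_income = statistics.mean(incomes)

# Median: six values (an even count), so this averages the two middle ones.
median_income = statistics.median(incomes)

print(mean_income)    # 875000
print(median_income)  # 52500.0
```

The single large value drags the mean far above every other income, while the median stays near the middle of the group.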

Mode

• The most common value in a data set
• Not relevant to continuous numerical data
• Example: the number of kids in each house. This is discrete data, so the mode is meaningful.
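A small sketch of the mode for the kids-per-house idea, using `statistics.mode`. The counts in `kids_per_house` are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical number of kids in each of nine houses.
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 2, 0]

# The mode is the value that appears most often (0 appears four times here).
most_common = statistics.mode(kids_per_house)

print(most_common)  # 0
```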

Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)

Variance – measures how "spread-out" the data is.

• Variance (sigma squared) is simply the average of the squared differences from the mean.
• Example: What is the variance of the data set (1, 4, 5, 4, 8)?
• First find the mean: (1+4+5+4+8)/5 = 4.4
• Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
• Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
• Find the average of the squared differences:
• sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
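The steps above can be sketched in plain Python. The variable names are just illustrative, and the result is rounded when printed because floating-point arithmetic can leave tiny rounding errors.

```python
data = [1, 4, 5, 4, 8]

# Step 1: find the mean.
mean = sum(data) / len(data)

# Step 2: find the differences from the mean.
differences = [x - mean for x in data]

# Step 3: square each difference.
squared_differences = [d ** 2 for d in differences]

# Step 4: average the squared differences (population variance, divide by N).
population_variance = sum(squared_differences) / len(data)

print(round(mean, 2))                 # 4.4
print(round(population_variance, 2))  # 5.04
```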

Standard Deviation is the square root of the variance

This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.

You can describe how extreme a data point is by saying "how many sigmas" away from the mean it is.
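As a minimal sketch of the "how many sigmas" idea, the snippet below reuses the data set from the variance example. The helper name `sigmas_from_mean` is hypothetical, introduced only for this illustration.

```python
import math

data = [1, 4, 5, 4, 8]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)  # population variance

# Standard deviation is the square root of the variance (about 2.245 here).
sigma = math.sqrt(variance)

def sigmas_from_mean(x):
    """Return how many standard deviations x lies from the mean."""
    return abs(x - mean) / sigma

for x in data:
    print(x, round(sigmas_from_mean(x), 2))
```

With this data, 1 and 8 lie more than one standard deviation from the mean, while 4 and 5 lie well within it.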

Population vs. Sample

• If you’re working with a sample of data instead of the entire data set (the entire population)…
• Then you want to use the sample variance instead of the population variance.
• For N samples, you just divide the sum of the squared differences by N-1 instead of N.
• So, in our example, we computed the population variance like this:
• Sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
• But the sample variance would be:
• S squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 4 = 6.3
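Python's standard `statistics` module offers both calculations, which makes for a quick check of the numbers above: `pvariance` divides by N and `variance` divides by N-1. Results are rounded when printed to hide floating-point noise.

```python
import statistics

data = [1, 4, 5, 4, 8]

# Population variance: divide the sum of squared differences by N.
population_variance = statistics.pvariance(data)

# Sample variance: divide the sum of squared differences by N - 1.
sample_variance = statistics.variance(data)

print(round(population_variance, 2))  # 5.04
print(round(sample_variance, 2))      # 6.3
```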

## The Why

Probability Density Functions

A probability density function describes the probability of a value falling within a given range. It’s NOT the probability of a specific number occurring.

“Gives you the probability of a data point falling within some given range of a given value.”
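A short sketch of the range idea, using `statistics.NormalDist` from the standard library (Python 3.8+). The parameters (mean 0, standard deviation 1) are assumptions chosen for illustration. The probability of a range comes from the difference of two cumulative values, while the density at a single point is only a curve height.

```python
from statistics import NormalDist

# Assumed parameters for illustration: mean 0, standard deviation 1.
dist = NormalDist(mu=0, sigma=1)

# Probability of a value falling within one standard deviation of the mean.
p_within_one_sigma = dist.cdf(1) - dist.cdf(-1)

# The density at a single point is a height on the curve, not a probability.
density_at_mean = dist.pdf(0)

print(round(p_within_one_sigma, 4))  # 0.6827
print(round(density_at_mean, 4))     # 0.3989
```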

Probability Mass Function – Discrete Data
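For discrete data, each specific value can have its own probability. One rough way to illustrate this is an empirical estimate built from counts. The `kids_per_house` data is hypothetical, and dividing counts by the total is only an estimate from a sample, not a true underlying distribution.

```python
from collections import Counter

# Hypothetical discrete data: number of kids in each of nine houses.
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 2, 0]

counts = Counter(kids_per_house)
total = len(kids_per_house)

# Estimated probability of each distinct value: its count divided by the total.
pmf = {value: count / total for value, count in sorted(counts.items())}

for value, probability in pmf.items():
    print(value, round(probability, 3))
```

Unlike a density, these probabilities belong to individual values, and they add up to 1.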

Examples of Data Distributions

Uniform Distribution – there is a flat, constant probability of the data occurring anywhere within the range. Every value in the range has an equal chance of occurring.
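A minimal sketch of a continuous uniform distribution follows. The function names `uniform_pdf` and `uniform_range_probability` and the range 0 to 10 are hypothetical choices for illustration. The density is flat inside the range and zero outside it, so the probability of a sub-range depends only on its width.

```python
def uniform_pdf(x, low, high):
    """Flat density inside [low, high], zero outside."""
    return 1 / (high - low) if low <= x <= high else 0.0

def uniform_range_probability(a, b, low, high):
    """Probability of landing in [a, b] for a uniform distribution on [low, high]."""
    a, b = max(a, low), min(b, high)  # clip the query range to the distribution's range
    return max(b - a, 0) / (high - low)

print(uniform_pdf(2, 0, 10))                   # 0.1
print(uniform_pdf(11, 0, 10))                  # 0.0
print(uniform_range_probability(2, 5, 0, 10))  # 0.3
```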

Normal / Gaussian

Exponential PDF / “Power Law” – Things fall off in an exponential manner.
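To illustrate the "falling off" behavior, here is a sketch of the exponential probability density function. The function name `exponential_pdf` and the `rate` parameter (set to 1.0) are assumptions made for this example.

```python
import math

def exponential_pdf(x, rate=1.0):
    """Exponential density: rate * e^(-rate * x) for x >= 0, zero otherwise."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

# The density is highest at 0 and shrinks by the same factor with every step in x.
for x in range(4):
    print(x, round(exponential_pdf(x), 4))
```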