Flavors of Data

Numerical – This is some sort of quantitative measurement, e.g. heights of people, page load times, stock prices

There are two types of Numerical Data:

Discrete Data – Integer based; often counts of some event

  • How many purchases did a customer make in a year?

  • How many times did I flip "heads"?

Continuous Data

  • Has an infinite number of possible values

    • How much time did it take for a user to check out?
    • How much rain fell on a given day?

Categorical – Qualitative data that has no inherent mathematical meaning

  • Gender, Yes/No (binary data), Race, State of residence, Product Category, Political Party, etc.
  • You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have a mathematical meaning
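
As a rough illustration of assigning numbers to categories, here is a minimal Python sketch. The product category labels are made up for the example, and the integer codes are arbitrary labels, so doing arithmetic on them would be meaningless.

```python
# Hypothetical product categories, invented for this example.
categories = ["Books", "Electronics", "Grocery", "Books", "Grocery"]

# Give each distinct label an integer code (sorted so the mapping is stable).
code_for = {label: code for code, label in enumerate(sorted(set(categories)))}
encoded = [code_for[label] for label in categories]

print(code_for)
print(encoded)
```

The codes only save space; "Grocery" having a larger code than "Books" says nothing about either category.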

Ordinal – This is a mixture of numerical and categorical data

Categorical data that has mathematical meaning

  • Example: movie ratings on a 1-5 scale
    • Ratings must be 1, 2, 3, 4, or 5
    • But these values have mathematical meanings; 1 means it’s a worse movie than a 2

Statistics 101

Mean

  • This is the average. Sum all the values and divide by the number of values.

Median

  • Sort the values, and take the value at the midpoint
  • If you have an even number of data points, the median falls in between the two middle data points.
    • In that case, take the average of the two in the middle.
  • Median is less susceptible to outliers than the mean.
    • Example: mean household income in the US is $72,641, but the median is only $51,939 – because the mean is skewed by a handful of billionaires
  • Median better represents the "typical" American in this example.
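
The effect of an outlier on the mean versus the median can be sketched with Python's standard statistics module. The income figures below are hypothetical and chosen only to make the skew obvious; they are not the US figures quoted above.

```python
import statistics

# Hypothetical incomes; the last value is a deliberate outlier.
incomes = [30_000, 40_000, 50_000, 60_000, 1_000_000]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value of the sorted list

# With an even number of values, median() averages the two middle values.
even_median = statistics.median([1, 2, 3, 4])

print(mean_income, median_income, even_median)
```

Here the mean lands far above what four of the five people earn, while the median stays at the middle value.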

Mode

  • The most common value in a data set
    • Not relevant to continuous numerical data
  • Example: the number of kids in each house; the mode is the count that occurs most often.
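
A minimal sketch of finding the mode, using the standard statistics module. The list of kids per house is hypothetical data invented for this example.

```python
import statistics

# Hypothetical number of kids in each of ten houses.
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 3, 2, 0]

# The single most common value.
most_common = statistics.mode(kids_per_house)

# multimode() (Python 3.8+) returns every value tied for most common.
all_modes = statistics.multimode(kids_per_house)

print(most_common, all_modes)
```

multimode() is the safer choice when the data might have several equally common values.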

Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)

Variance – measures how "spread-out" the data is.

  • Variance (sigma squared) is simply the average of the squared differences from the mean.
  • Example: What is the variance of the data set (1, 4, 5, 4, 8)?
    • First find the mean: (1+4+5+4+8)/5 = 4.4
    • Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
    • Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
    • Find the average of the squared differences:
      • sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
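
The steps of the worked example above can be sketched in plain Python, one line per step. The variable names are only illustrative.

```python
data = [1, 4, 5, 4, 8]

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: differences from the mean.
differences = [x - mean for x in data]

# Step 3: squared differences.
squared = [d ** 2 for d in differences]

# Step 4: average of the squared differences (population variance).
variance = sum(squared) / len(data)

print(round(mean, 2), round(variance, 2))
```

The results are rounded for printing because floating-point arithmetic can leave tiny errors in the last digits.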

Standard Deviation is the square root of the variance

This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.

You can talk about how extreme a data point is by talking about "how many sigmas" away from the mean it is.
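
A small sketch of the "how many sigmas" idea, reusing the data set (1, 4, 5, 4, 8) from the variance example. The helper name sigmas_from_mean is hypothetical, and the one-sigma cutoff simply follows the rule of thumb stated above.

```python
import statistics

data = [1, 4, 5, 4, 8]
mean = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

def sigmas_from_mean(x):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sigma

# Flag points more than one standard deviation away from the mean.
unusual = [x for x in data if abs(sigmas_from_mean(x)) > 1]

print(round(sigma, 3), unusual)
```

In this data set only the smallest and largest values lie more than one sigma from the mean.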

Population vs. Sample

  • If you’re working with a sample of data instead of an entire data set (the entire population)…
    • Then you want to use the sample variance instead of the population variance
    • For N samples, you just divide the sum of the squared differences by N-1 instead of N.
    • So, in our example, we computed the population variance like this:
      • Sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
    • But the sample variance would be:
      • S squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 4 = 6.3
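
Python's standard statistics module exposes both versions, so the N versus N-1 difference can be checked directly on the same data set. This is just one way to verify the arithmetic above.

```python
import statistics

data = [1, 4, 5, 4, 8]

population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by N - 1

print(round(population_variance, 2), round(sample_variance, 2))
```

The sample variance is always the larger of the two, and the gap shrinks as N grows.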

The Why

Probability Density Functions

A probability density function describes the probability of a range of values occurring. It’s NOT the probability of a specific number occurring.

“Gives you the probability of a data point falling within some given range of a given value.”
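
To make the range-versus-point distinction concrete, here is a minimal sketch using statistics.NormalDist (Python 3.8+). The standard normal distribution (mean 0, standard deviation 1) is an assumption chosen for illustration; any continuous distribution would show the same idea.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of a value falling within one standard deviation of the mean:
# the area under the density curve between -1 and 1.
p_within_one_sigma = standard_normal.cdf(1) - standard_normal.cdf(-1)

# The density at a single point is a curve height, not a probability.
density_at_zero = standard_normal.pdf(0)

print(round(p_within_one_sigma, 4), round(density_at_zero, 4))
```

The probability of hitting exactly one value is zero for continuous data, which is why only ranges carry probability.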

Probability Mass Function – Discrete Data
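
For discrete data each individual value does have its own probability. As a sketch, this builds the probability mass function for the number of heads in three fair coin flips; the three-flip setup is an assumption made for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

flips = 3

# Every equally likely sequence of heads (H) and tails (T).
outcomes = list(product("HT", repeat=flips))

# Count how many sequences produce each number of heads.
head_counts = Counter(seq.count("H") for seq in outcomes)

# Probability of each head count = matching sequences / all sequences.
pmf = {k: Fraction(n, len(outcomes)) for k, n in sorted(head_counts.items())}

for heads, probability in pmf.items():
    print(heads, probability)
```

The probabilities across all possible values sum to 1.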

Examples of Data Distributions

Uniform Distribution – there is a flat, constant probability of the data occurring; every value within the range is equally likely.
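
A rough way to see the flat shape is to draw samples and count how many land in equal-width bins. The range 0 to 10, the sample size, and the seed are all arbitrary choices for this sketch.

```python
import random

rng = random.Random(0)  # fixed seed so reruns give the same samples
samples = [rng.uniform(0, 10) for _ in range(10_000)]

# Count samples in ten equal-width bins: [0, 1), [1, 2), ..., [9, 10].
bins = [0] * 10
for x in samples:
    bins[min(int(x), 9)] += 1  # min() keeps a value of exactly 10 in the last bin

print(bins)
```

Each bin should hold roughly a tenth of the samples, with some random variation.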

Normal / Gaussian

Exponential PDF / “Power Law” – Things fall off in an exponential manner.