Chuck Conway

Building Inspiring Software

UW Data Science Course: Week One

Posted on April 15, 2021 by Chuck Conway

Flavors of Data

Numerical – a quantitative measurement, e.g. heights of people, page load times, stock prices

There are two types of numerical data:

Discrete Data – integer-based; often counts of some event

  • How many purchases did a customer make in a year?
  • How many times did I flip "heads"?

Continuous Data – has an infinite number of possible values

  • How much time did it take for a user to check out?
  • How much rain fell on a given day?

Categorical – qualitative data that has no inherent mathematical meaning

  • Gender, Yes/No (binary data), Race, State of residence, Product Category, Political Party, etc.
  • You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have a mathematical meaning
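
As a small illustration of that last point, here is a minimal sketch that maps categories to integer codes. The category names are made up for this example. The codes are only labels, so adding or averaging them would be meaningless.

```python
# Hypothetical example data: product categories for five orders.
categories = ["books", "toys", "books", "garden", "toys"]

# Assign each distinct category an integer code, in alphabetical order.
codes = {name: i for i, name in enumerate(sorted(set(categories)))}
encoded = [codes[name] for name in categories]

print(codes)    # {'books': 0, 'garden': 1, 'toys': 2}
print(encoded)  # [0, 2, 0, 1, 2]
```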

Ordinal – a mixture of numerical and categorical data

Ordinal data is categorical data whose values have a mathematical meaning (an order)

  • Example: movie ratings on a 1-5 scale
    • Ratings must be 1, 2, 3, 4, or 5
    • But these values have mathematical meanings; 1 means it’s a worse movie than a 2

Statistics 101

Mean

  • This is the average. Sum all the values and divide by the number of values.

Median

  • Sort the values, and take the value at the midpoint
  • If you have an odd number of data points, the median is the middle value.
    • If you have an even number of samples, the median falls between the two middle data points; take the average of the two.
  • Median is less susceptible to outliers than the mean.
    • Example: mean household income in the US is $72,641, but the median is only $51,939, because the mean is skewed by a handful of billionaires.
  • Median better represents the "typical" American in this example.
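
To see the outlier effect in code, here is a minimal sketch using Python's standard `statistics` module. The income figures are invented for illustration and are not the US numbers quoted above.

```python
import statistics

# Made-up household incomes; the last one is an extreme outlier.
incomes = [30_000, 42_000, 51_000, 58_000, 75_000, 10_000_000]

mean_income = statistics.mean(incomes)
# Even number of values: the median is the average of 51,000 and 58,000.
median_income = statistics.median(incomes)

print(round(mean_income, 2))  # 1709333.33
print(median_income)          # 54500.0
```

The single outlier drags the mean far above every "ordinary" household, while the median barely notices it.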

Mode

  • The most common value in a data set
    • Not relevant to continuous numerical data
  • Back to our number-of-kids-in-each-house example: the mode is the most common number of kids per house.
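
A minimal sketch of finding the mode, assuming a made-up list of kids per house (the notes don't record the actual numbers used in the course):

```python
import statistics
from collections import Counter

# Made-up data: number of kids in each of ten houses.
kids_per_house = [0, 2, 1, 2, 3, 0, 2, 1, 4, 2]

mode_value = statistics.mode(kids_per_house)
counts = Counter(kids_per_house)

print(mode_value)             # 2
print(counts.most_common(1))  # [(2, 4)]
```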

Standard Deviation and Variance – These concepts are all about the spread of the data (the shape)

Variance – measures how "spread-out" the data is.

  • Variance (sigma squared) is simply the average of the squared differences from the mean.
  • Example: What is the variance of the data set (1, 4, 5, 4, 8)?
    • First find the mean: (1+4+5+4+8)/5 = 4.4
    • Now find the differences from the mean: (-3.4, -0.4, 0.6, -0.4, 3.6)
    • Find the squared differences: (11.56, 0.16, 0.36, 0.16, 12.96)
    • Find the average of the squared differences:
      • sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
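
The same steps can be sketched in plain Python. The variable names are just choices for this example.

```python
data = [1, 4, 5, 4, 8]

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: differences from the mean.
differences = [x - mean for x in data]

# Step 3: squared differences.
squared = [d ** 2 for d in differences]

# Step 4: average of the squared differences (population variance).
variance = sum(squared) / len(data)

print(mean)                # 4.4
print(round(variance, 2))  # 5.04
```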

Standard Deviation is the square root of the variance. In the example above, that is the square root of 5.04, about 2.24.

This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual.

You can talk about how extreme a data point is by saying "how many sigmas" away from the mean it is.
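
Here is a minimal sketch of the "how many sigmas" idea, reusing the data set from the variance example. The `sigmas_from_mean` helper is a hypothetical name introduced only for this illustration.

```python
import math

data = [1, 4, 5, 4, 8]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)  # population standard deviation

def sigmas_from_mean(x):
    """How many standard deviations x lies from the mean."""
    return abs(x - mean) / std_dev

# Points more than one standard deviation from the mean.
unusual = [x for x in data if sigmas_from_mean(x) > 1]

print(round(std_dev, 2))  # 2.24
print(unusual)            # [1, 8]
```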

Population vs. Sample

  • If you’re working with a sample of data instead of an entire data set (the entire population)…
    • Then you want to use the sample variance instead of the population variance
    • For N samples, you just divide the sum of the squared differences by N-1 instead of N.
    • So, in our example, we computed the population variance like this:
      • sigma squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 5 = 5.04
    • But the sample variance would be:
      • S squared = (11.56 + 0.16 + 0.36 + 0.16 + 12.96) / 4 = 6.3
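
Python's standard `statistics` module provides both versions, so the two results can be checked with a short sketch:

```python
import statistics

data = [1, 4, 5, 4, 8]

population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by N - 1

print(round(population_variance, 2))  # 5.04
print(round(sample_variance, 2))      # 6.3
```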

The Why

Probability Density Functions

A probability density function describes the probability of a value falling within a given range. It's NOT the probability of a specific number occurring.

“Gives you the probability of a data point falling within some given range of a given value.”
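
A minimal sketch of the range-versus-point distinction, assuming a normal distribution with mean 0 and standard deviation 1 (chosen only for illustration) and using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

# Assumed distribution for illustration: mean 0, standard deviation 1.
dist = NormalDist(mu=0, sigma=1)

# Probability of a value landing between -1 and 1: the area under the PDF.
p_range = dist.cdf(1) - dist.cdf(-1)

# The PDF at a single point is a density (a curve height), not a probability.
density_at_zero = dist.pdf(0)

print(round(p_range, 4))          # 0.6827
print(round(density_at_zero, 4))  # 0.3989
```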

Probability Mass Function – the counterpart for discrete data: it gives the probability of each specific value occurring.
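
As a sketch of a probability mass function, the example below computes the chance of each possible number of heads in four fair coin flips. The binomial formula and the `heads_pmf` name are choices made for this illustration, not something from the course notes.

```python
from math import comb

def heads_pmf(k, n=4, p=0.5):
    """Probability of exactly k heads in n flips (binomial PMF)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

pmf = {k: heads_pmf(k) for k in range(5)}

print(pmf)                # {0: 0.0625, 1: 0.25, 2: 0.375, 3: 0.25, 4: 0.0625}
print(sum(pmf.values()))  # 1.0
```

Unlike a density, each value here is a genuine probability, and together they sum to 1.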

Examples of Data Distributions

Uniform Distribution – there is a flat, constant (equal) probability of a value occurring anywhere within the given range.
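
A minimal sketch of a uniform distribution on an interval [a, b]. Both function names are hypothetical helpers written for this example.

```python
def uniform_pdf(x, a, b):
    """Density of a uniform distribution on [a, b]: flat inside, zero outside."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_range_probability(lo, hi, a, b):
    """Probability of landing in [lo, hi]: the overlap width over the total width."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

print(uniform_pdf(2, 0, 10))                      # 0.1
print(uniform_pdf(7, 0, 10))                      # 0.1
print(uniform_pdf(11, 0, 10))                     # 0.0
print(uniform_range_probability(2, 5, 0, 10))     # 0.3
```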

Normal / Gaussian

Exponential PDF / “Power Law” – Things fall off in an exponential manner.
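
A minimal sketch of the fall-off, assuming the standard exponential density rate * e^(-rate * x) for x >= 0. The `rate` parameter and the function name are choices made for this illustration.

```python
import math

def exponential_pdf(x, rate=1.0):
    """Exponential density: rate * e^(-rate * x) for x >= 0, zero otherwise."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

values = [exponential_pdf(x) for x in range(4)]

print([round(v, 4) for v in values])  # [1.0, 0.3679, 0.1353, 0.0498]
```

Each step to the right multiplies the density by the same factor (here e^-1), which is what falling off exponentially means.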


    ©2023 Chuck Conway