UW Data Science Course | Chuck Conway

If you’re a DBA, you need to learn to deal with unstructured data.

If you are a statistician, you need to learn to deal with data that does not fit in memory

If you are a software engineer, you need to learn statistical modeling and how to communicate results.

If you are a business analyst, you need to learn about algorithms and tradeoffs at scale.

Week 2

Structures
1. Rows and columns
2. Nodes and edges
3. Key value pairs
4. A sequence of bytes
Constraints
1. All rows must have the same number of columns
2. All values in one column must have the same type
3. A child cannot have two parents
Operations
1. Find the value of key x
2. Find the rows where column “lastname” is “Jordan”
3. Get the next N bytes
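
The structures and operations above can be sketched in a few lines of Python. This is only an illustration: the `store` and `rows` contents and the byte string are made-up sample data, not something from the course.

```python
import io

# Key-value pairs: find the value of key x
store = {"x": 42, "y": 7}
value_of_x = store["x"]

# Rows and columns: every row has the same columns
rows = [
    {"firstname": "Michael", "lastname": "Jordan"},
    {"firstname": "Ada", "lastname": "Lovelace"},
    {"firstname": "Barbara", "lastname": "Jordan"},
]
# Constraint check: all rows share the same set of columns
same_columns = all(row.keys() == rows[0].keys() for row in rows)
# Operation: find the rows where column "lastname" is "Jordan"
jordans = [row for row in rows if row["lastname"] == "Jordan"]

# A sequence of bytes: get the next N bytes
stream = io.BytesIO(b"hello world")
first_five = stream.read(5)   # reads b"hello"
next_three = stream.read(3)   # continues where the last read stopped
```

Each structure makes one kind of operation natural: a lookup by key, a filter over rows, or a sequential read.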

PWD – “print working directory”

Week 3

Descriptive – Just to describe a set of data (e.g. census data, ngram viewer)

Description and the interpretation are different steps
Descriptions usually cannot be generalized without additional statistical modeling

Exploratory – Find relationships you didn’t know about

Exploratory models are good for discovering new connections
They are also useful for defining future studies
Exploratory analyses are usually not the final say
Exploratory analyses alone should not be used for generalizing / predicting
Correlation does not imply causation

Inferential – Use a relatively small sample of data to say something about a bigger population

Inference is commonly the goal of statistical models
Inference involves estimating both the quantity you care about and your uncertainty about your estimate
Inference depends heavily on both the population and the sampling scheme
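
A minimal sketch of estimating a quantity together with its uncertainty, assuming a simple random sample and a normal approximation for the interval. The simulated population of heights (mean 170, standard deviation 10) is invented for illustration.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical population we normally could not measure in full
population = [random.gauss(170, 10) for _ in range(100_000)]

# The sampling scheme: a simple random sample of 200 items
sample = random.sample(population, 200)

# The quantity we care about: the population mean
estimate = statistics.mean(sample)

# Our uncertainty about the estimate: the standard error of the mean
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% interval using the normal critical value 1.96
interval = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)
```

A different sampling scheme (for example, sampling only one subgroup) would need a different analysis, which is why the scheme matters as much as the population.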

Predictive – To use the data on some objects to predict values for another object

If X predicts Y, it does not mean that X causes Y
Accurate prediction depends heavily on measuring the right variables.
Although there are better and worse prediction models, more data and a simple model often work really well.
Prediction is very hard, especially about the future.

Causal – To find out what happens to one variable when you make another variable change.

Usually randomized studies are required to identify causation
There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
Causal relationships are usually identified as average effects, but may not apply to every individual.
Causal models are usually the “gold standard” for data analysis.

Mechanistic – Understand the exact changes in variables that lead to changes in other variables for individual objects.

Incredibly hard to infer, except in simple situations
Usually modeled by a deterministic set of equations (physical/engineering sciences)
Generally the random component of the data is measurement error
If the equations are known but the parameters are not, they may be inferred with data analysis.
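
As a rough illustration of inferring a parameter when the equation is known, the sketch below assumes a free-fall model d = 0.5 * g * t^2 with random measurement error and estimates g by least squares. The free-fall setup, the noise level, and the name `G_TRUE` are assumptions chosen for the example, not part of the course notes.

```python
import random

G_TRUE = 9.81  # assumed value, used only to simulate the measurements

rng = random.Random(0)
times = [0.1 * k for k in range(1, 21)]  # 0.1 s to 2.0 s

# Deterministic equation plus random measurement error
distances = [0.5 * G_TRUE * t**2 + rng.gauss(0, 0.05) for t in times]

# Least-squares estimate of g for the model d = 0.5 * g * t^2:
# minimizing sum((d - 0.5*g*t^2)^2) gives g = 2 * sum(d*t^2) / sum(t^4)
g_hat = 2 * sum(d * t**2 for d, t in zip(distances, times)) / sum(t**4 for t in times)
```

The only randomness here is the measurement error, which matches the mechanistic setting described above.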

What is data – Data are values of qualitative or quantitative variables belonging to a set of items.

Set of Items: Sometimes called the population, the set of objects you are interested in.
Variables: A measurement or characteristic of an item.
Qualitative: Country of origin, sex, treatment
Quantitative: Height, weight, blood pressure

Data rarely comes processed.

Data is the second most important thing

The most important thing in data science is the question
The second most important is the data
Often the data will limit or enable the questions
But having data can’t save you if you don’t have a question

What about big data?

You can collect much more data, much more cheaply, but with a lot of noise relative to signal.

Big or small data

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

Experimental Design

Why should I care about experimental design? It’s really easy to focus on the outcome and overlook an error with the numbers.

Care about the analysis plan.
It’s critical to pay attention to all aspects of the design and analysis of a study. Pay attention to the data cleaning, the data analysis, and the reporting so the key issues in the study don’t trip you up.

Question: Does changing the text on your website improve donations?

Experiment:

Formulate your question in advance

Randomly show visitors one version or the other
Measure how much they donate
Determine which is better

Data Science is a scientific discipline. Science demands that you answer a specific question when you use data.

Compare two versions of the website: randomly show each visitor one of the two versions, then measure how much they donate to figure out which is better.
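
The experiment can be sketched as a small simulation. Everything here is hypothetical: `run_experiment` and `simulated_donate` are illustrative names, and the donation amounts (version B assumed to raise the average by 1) are made up so the code has something to measure.

```python
import random
import statistics

def run_experiment(visitors, donate, rng):
    """Randomly show each visitor version A or B and record the donation."""
    donations = {"A": [], "B": []}
    for visitor in visitors:
        version = rng.choice(["A", "B"])  # random assignment
        donations[version].append(donate(visitor, version))
    return donations

rng = random.Random(1)

def simulated_donate(visitor, version):
    # Made-up behavior: version B has a slightly higher average donation
    base = 10.0 if version == "A" else 11.0
    return max(0.0, rng.gauss(base, 3.0))

donations = run_experiment(range(2000), simulated_donate, rng)
mean_a = statistics.mean(donations["A"])
mean_b = statistics.mean(donations["B"])
better = "A" if mean_a > mean_b else "B"
```

In a real study the comparison of means would be accompanied by a measure of uncertainty rather than a bare "which is larger" check.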

Statistical inference – a key component of data science.

Confounding – What other variables might be causing the relationship?

Randomization and blocking
- If you can, and want to, fix a variable.
  - Website always says Obama 2012
- If you don’t fix a variable, stratify it.
  - If you are testing sign-up phrases and have two website colors, use both phrases equally on both.
- If you can’t fix a variable, randomize it.
- Why does randomization help?
  - Because, on average, it balances the variables you did not control across the groups, so they are unlikely to explain the difference you measure.
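
One way to use both phrases equally on both website colors is to randomize within each color. The sketch below assumes visitors are already grouped by the color they see; `blocked_assignment`, the visitor IDs, and the two phrases are hypothetical.

```python
import random
from collections import Counter

def blocked_assignment(visitors_by_color, phrases, rng):
    """Within each color block, shuffle visitors and alternate the phrases."""
    assignment = {}
    for color, visitors in visitors_by_color.items():
        shuffled = list(visitors)
        rng.shuffle(shuffled)  # random order within the block
        for i, visitor in enumerate(shuffled):
            assignment[visitor] = (color, phrases[i % len(phrases)])
    return assignment

rng = random.Random(0)
visitors_by_color = {
    "blue": [f"blue-{i}" for i in range(10)],
    "red": [f"red-{i}" for i in range(10)],
}
assignment = blocked_assignment(visitors_by_color, ["Sign up", "Join us"], rng)

# Each (color, phrase) combination gets the same number of visitors
counts = Counter(assignment.values())
```

Because every color sees each phrase equally often, a difference between phrases cannot be explained by color.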

Consider shoe size and literacy: the bigger the shoe, the more literate someone is. What’s actually happening is that babies and young children have small feet and lower literacy. Age is the real factor, not shoe size.
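
The shoe size example can be simulated to show how a confounder creates a correlation. The numbers are invented: age is assumed to drive both shoe size and a literacy score, with no direct link between the two.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
ages = [rng.uniform(1, 18) for _ in range(5000)]
shoe = [10 + 1.0 * a + rng.gauss(0, 1.5) for a in ages]  # depends on age only
literacy = [5 * a + rng.gauss(0, 8) for a in ages]       # depends on age only

# Across all ages, shoe size and literacy look strongly related
overall = pearson(shoe, literacy)

# Holding age roughly fixed (9 to 10 years old), the relationship mostly vanishes
idx = [i for i, a in enumerate(ages) if 9 <= a < 10]
within = pearson([shoe[i] for i in idx], [literacy[i] for i in idx])
```

Comparing `overall` with `within` shows the correlation coming from age rather than from shoe size itself.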

Correlation is not causation

Prediction

Take a sample of people with cancer. Take a set of data and separate the people who responded to chemotherapy from the ones who did not. Then create a function that predicts who will and who won’t respond to chemotherapy.
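
A toy version of such a prediction function, assuming a single numeric measurement per patient. The `marker` values and `responded` labels are fabricated for illustration, and the midpoint-between-means rule is just one simple choice, not the method used in the course.

```python
import statistics

def fit_threshold(values, responded):
    """Learn a cutoff halfway between the means of the two groups."""
    pos = [v for v, r in zip(values, responded) if r]
    neg = [v for v, r in zip(values, responded) if not r]
    mean_pos, mean_neg = statistics.mean(pos), statistics.mean(neg)
    cutoff = (mean_pos + mean_neg) / 2
    higher_means_responder = mean_pos > mean_neg

    def predict(value):
        # Predict "responds" when the value falls on the responders' side
        return (value > cutoff) == higher_means_responder

    return predict

# Hypothetical training data: one measurement per patient
marker = [2.1, 2.5, 1.9, 2.8, 0.7, 1.1, 0.9, 1.3]
responded = [True, True, True, True, False, False, False, False]

predict = fit_threshold(marker, responded)
new_patient_high = predict(2.4)
new_patient_low = predict(1.0)
```

The function should be judged on patients it has not seen, not on the data used to build it.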

Prediction is more challenging than inference

Prediction vs. Inference – the more separated the groups, the easier prediction becomes

Prediction key quantities

Sensitivity
- The probability that the test is positive, given that you have the disease
Specificity
- The probability that the test is negative, given that you do not have the disease
Positive Predictive Value
- The probability that you have the disease, given that the test is positive
Negative Predictive Value
- The probability that you do not have the disease, given that the test is negative
Accuracy
- The probability that the test gives the correct outcome
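
These quantities can be computed from the four cells of a confusion matrix, using the standard definitions in terms of true and false positives and negatives. The function name `diagnostic_metrics` and the counts below are made up for the example.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the key prediction quantities from confusion-matrix counts.

    tp: has disease, test positive    fp: no disease, test positive
    fn: has disease, test negative    tn: no disease, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # P(positive test | disease)
        "specificity": tn / (tn + fp),  # P(negative test | no disease)
        "ppv": tp / (tp + fp),          # P(disease | positive test)
        "npv": tn / (tn + fn),          # P(no disease | negative test)
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical screening results for 1000 people, 100 of whom have the disease
metrics = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)
```

With these counts the test catches 90% of the sick people, yet only about 64% of positive results are true cases, because the disease is rare in this sample.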

Beware data dredging

Summary

Good experiments

Have replication
Measure variability
Generalize to the problem you care about
Are transparent

Prediction is not inference

Both can be important

Beware of data dredging

Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
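
A small simulation shows why this is dangerous. In the sketch below every comparison is pure noise, yet some still fall under the usual 0.05 threshold. The helper `two_sided_p` uses a normal approximation, and the group sizes and number of hypotheses are arbitrary choices.

```python
import random
import statistics

def two_sided_p(group_a, group_b):
    """Approximate p-value for a difference in means (normal approximation)."""
    se = (statistics.variance(group_a) / len(group_a)
          + statistics.variance(group_b) / len(group_b)) ** 0.5
    z = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

rng = random.Random(0)
n_hypotheses = 200
false_positives = 0
for _ in range(n_hypotheses):
    # Both groups come from the same distribution, so any "effect" is noise
    group_a = [rng.gauss(0, 1) for _ in range(50)]
    group_b = [rng.gauss(0, 1) for _ in range(50)]
    if two_sided_p(group_a, group_b) < 0.05:
        false_positives += 1
```

Around 5% of the comparisons are expected to look significant even though no real effect exists, which is why the hypothesis should be stated before looking at the data.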