Part 0: Introduction
Data science articulated, data science examples, history and context, technology landscape
- example: Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-László Barabási, Flavor network and the principles of food pairing, Scientific Reports 1, Article number: 196 doi:10.1038/srep00196
- example: Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
- example: Google Flu Trends
- example: Eigenfactor, and publications
- example: L’Aquila quake: Italy scientists guilty of manslaughter, BBC
- Discussion of data science and data scientists
- eScience: The Fourth Paradigm, Foreword and Introduction, pages xi – xxxi; Gray’s Laws, pages 5-12
- Chris Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired magazine, 2008
- Responses to Chris Anderson, 2008
Part 1: Data Manipulation, at Scale
Databases and the relational algebra
Readings
- How Vertica Was the Star of the Obama Campaign, and Other Revelations
- E. F. Codd, 1981 Turing Award Lecture, "Relational Database: A Practical Foundation for Productivity", 1981 (Think about which arguments from this short piece are still relevant today.)
- [Advanced] Cohen et al., “MAD Skills: New Analysis Practices for Big Data”, 2009
- [Advanced] Erik Meijer, Gavin Bierman, A co-Relational Model of Data for Large Shared Data Banks, Communications of the ACM, 2011
MapReduce, Hadoop, relationship to databases, algorithms, extensions, language; key-value stores and NoSQL; tradeoffs of SQL and NoSQL
Readings
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 2
- Stonebraker et al., “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM, January 2010.
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM, January 2010.
- Rick Cattell, “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, December 2010 (39:4)
- Optional Technical Background: The Hadoop Distributed File System
Data cleaning, entity resolution, data integration, information extraction (NOT COVERED IN LECTURES)
Readings / Talks
- Elmagarmid et al., Duplicate Record Detection: A Survey
- Koudas et al., Record Linkage: Similarity Measures and Algorithms
Part 2: Analytics
Topics in statistical modeling and experiment design
Readings
- Statistics is Easy! Dennis Shasha, Manda Wilson, Morgan and Claypool
- Chapter 3 of A Handbook of Statistical Analyses Using R
- Gregory Park on overfitting to the leaderboard in a Kaggle Competition
Introduction to Machine Learning, supervised learning, decision trees/forests, simple nearest neighbor
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read section on C4.5)
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 1
- Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
Unsupervised learning: k-means, multi-dimensional scaling
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read section on k-means)
Part 3: Interpreting and Communicating Results
Visualization, visual data analytics
Readings (well, watchings)
- Hans Rosling, The Joy of Stats
- Pat Hanrahan, Tools for Data Enthusiasts
- Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, A Tour through the Visualization Zoo, Communications of the ACM, Volume 53 Issue 6, June 2010
Backlash: Ethics, privacy, unreliable methods, irreproducible results
- Howard Wen, "Big Ethics for Big Data", O’Reilly Media
- John Markoff, New York Times, Unreported Side Effects of Drugs Are Found Using Internet Search Data, March 13, 2013
- Mike Loukides, Data Skepticism, O’Reilly Media, April 2013
- Gary Marcus and Ernest Davis, Eight (No, Nine!) Problems With Big Data, New York Times, April 6, 2014
- Tim Harford, Big data: are we making a big mistake?, Financial Times, March 28, 2014
- K.N.C., The backlash against big data, The Economist, Apr 20th 2014 (very short)
- See also: Gartner Hype cycle
- George Johnson, New Truths That Only One Can See, New York Times, January 20, 2014
- John P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine, August 30, 2005
- Dan McKinley, Whom the Gods Would Destroy, They First Give Real-Time Analytics
Part 4: Graph Analytics
Readings
- Sherif Sakr, Processing large-scale graph data: A guide to current technology, June 2013
- (more to come)