Many products of human invention — political speeches, product reviews, status updates on Twitter and Facebook, literary texts, music, and paintings — have been analyzed, not uncontroversially, as “data”.
In this graduate-level course (open to all departments, and especially to students in the humanities and social sciences), we will pursue two ends. First, we will investigate the landscape of modern quantitative methods for treating data as a lens onto the world, surveying a range of methods in machine learning and data analysis that leverage information produced by people in order to draw inferences (such as discerning the authorship of documents and the political position of social media users, charting the reuse of language in legislative bills, tagging the genres of songs, and extracting social networks from literary texts). Second, we will cast a critical eye on those methods, investigating the assumptions those algorithms make about the world and about the data through which we see it, in order to understand their limitations and when to apply them. How and when can empirical methods support other forms of argumentation, and what are their limits?
Many of these techniques are shared among the nascent communities of practice known as “computational social science”, “computational journalism” and the “digital humanities”; this course provides foundational skills for students to conduct their own research in these areas.
No computational background is required; instruction will use the Python programming language. Homeworks are designed to give students a choice depending on their background: either a.) implementing and evaluating a quantitative method on a dataset, or b.) writing an analysis/critique of an algorithm and of published work that has used it. The course will be capped with a final collaborative project.
(Subject to final changes.) We'll spend the first three weeks introducing the major pillars of data science (clustering, classification, and regression as predictive/descriptive tasks) from the perspective of designing and evaluating experiments, and then dive more deeply into the individual models behind them. Most lectures will be structured as a.) an in-depth description of a model/algorithm, followed by b.) discussion and critique of a specific application that makes use of that method. Our goal is to cultivate critical computational thinking by example.
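To make the "design and evaluation" framing of those first weeks concrete: evaluating a predictive method means holding data out, fitting on the rest, and scoring predictions on the held-out portion. The sketch below (a hypothetical illustration in plain Python, not course material; the synthetic two-cluster dataset is invented for the example) evaluates a 1-nearest-neighbor classifier this way:

```python
import random

def nearest_neighbor_predict(train, x):
    """Predict the label of x from its single nearest training point."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(((dist(xt, x), y) for xt, y in train), key=lambda t: t[0])
    return label

def accuracy(train, test):
    """Fraction of held-out points whose label is predicted correctly."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Synthetic data: class 0 clusters near (0, 0), class 1 near (5, 5).
random.seed(0)
data = [((random.gauss(0, 1), random.gauss(0, 1)), 0) for _ in range(50)] + \
       [((random.gauss(5, 1), random.gauss(5, 1)), 1) for _ in range(50)]
random.shuffle(data)

# The train/test split is the heart of the evaluation design:
# the classifier never sees the test points during "training".
train, test = data[:80], data[80:]
acc = accuracy(train, test)
print(f"held-out accuracy: {acc:.2f}")  # near 1.0 on well-separated clusters
```

The same split-fit-score loop applies no matter which model fills the middle step, which is why the course treats experimental design separately from the individual models.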
Date | Topic | Readings |
--- | --- | --- |
W Jan 20 | Overview [slides] [perceptron.ipynb] | |
M Jan 25 | Survey of methods [slides] | |
W Jan 27 | Classification (design + evaluation) [slides] | |
M Feb 1 | Regression (design + evaluation) [slides] | ML ch. 3 (cont'd), ch. 10 |
W Feb 3 | Clustering (design + evaluation) [slides]. Homework 1 out (due Feb 17) | |
M Feb 8 | Validity (frequentist hypothesis tests; multiple hypothesis tests; A/B tests; Bayes factors) [slides] | |
W Feb 10 | Decision trees; random forests [slides] | |
M Feb 15 | No class (holiday) | |
W Feb 17 | Probabilistic models: probability/stats review; Naive Bayes; Authorship attribution [slides]. Homework 1 due; homework 2 out | |
F Feb 19 | Project proposal/literature reviews due | |
M Feb 22 | Probabilistic models: logistic regression; (stochastic) gradient descent; regularization; Attribute prediction [slides] | |
W Feb 24 | Probabilistic models: latent variable models; generative models [slides] | |
M Feb 29 | Probabilistic models: latent variable models; topic models [slides] | |
W Mar 2 | Interpretability. Homework 2 due; homework 3 out | |
M Mar 7 | Linear models: linear regression; Predicting movie revenue [slides] | |
W Mar 9 | Linear models: PCA; Dimensionality reduction [slides] | |
M Mar 14 | Linear models: SVM; nonlinear models: kernelized SVM; Music genre classification [slides] | |
W Mar 16 | Nonlinear models: neural networks; Visual style classification [slides]. Homework 3 due; homework 4 out | |
F Mar 18 | Project midterm reports due | |
M Mar 21 | No class (spring break) | |
W Mar 23 | No class (spring break) | |
M Mar 28 | Distance models: classification (nearest neighbors) and similarity; Text reuse [slides] | |
W Mar 30 | Distance models: clustering (K-means; hierarchical); Genre clustering [slides] | |
M Apr 4 | Ethics; Predictive policing | |
W Apr 6 | Networks: structural properties; strong and weak ties; Homophily [slides]. Homework 4 due | |
M Apr 11 | Networks: information diffusion [slides] | |
W Apr 13 | Affective computing (Noura) | |
M Apr 18 | Fairness and accountability | |
W Apr 20 | Review; Predicting elections and the stock market. Homework 5 out | |
M Apr 25 | Student project presentations | |
W Apr 27 | Student project presentations | |
F May 6 | Final project report due | |
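The overview session links a perceptron.ipynb notebook. As a taste of the kind of algorithm the course begins with, here is a minimal perceptron sketch in plain Python (a hypothetical illustration; the actual notebook's contents may differ). The perceptron keeps a weight vector and bias, and updates them only when a prediction is wrong:

```python
def perceptron_train(data, epochs=10, lr=1.0):
    """Learn weights w and bias b with the perceptron mistake-driven update."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:  # labels y are in {-1, +1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def perceptron_predict(w, b, x):
    """Classify x by the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Learn logical OR, a linearly separable function.
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = perceptron_train(data)
preds = [perceptron_predict(w, b, x) for x, _ in data]
print(preds)  # [-1, 1, 1, 1]
```

Because the data are linearly separable, the perceptron convergence theorem guarantees the loop eventually stops making mistakes; on non-separable data the updates never settle, a limitation worth keeping in mind when we critique applications of such models.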
The course will be capped by a semester-long collaborative project (in teams of 2 or 3 students), in which the methods learned in class will be used to draw inferences about the world and to critically assess the quality of those results. The project will consist of four components:
- Project proposal and literature review. Students will formulate a hypothesis to be examined, motivate it as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within the related literature in the scientific community. (2 pages; 5 sources)
- Midterm report. By the middle of the course, students should have a.) completed data collection; b.) established a validation strategy to be performed at the end of experimentation; and c.) obtained initial experimental results. (4 pages; 10 sources)
- Final report. The final report will include a complete description of the work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to the standards of conference publication: clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
- Presentation. At the end of the semester, teams will present their work in a conference-style presentation. (10-15 minutes, with time for questions)
All reports should use the ACL 2015 style files for either LaTeX or Microsoft Word.
Academic Integrity
All students are expected to follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks must be completed independently. All writing must be your own; if you draw on the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. Late homeworks will not be accepted.
Students with Disabilities
Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me as soon as possible. I'm happy to discuss privately after class or in my office.