Info 290. Deconstructing Data Science

Info

Many products of human invention — political speeches, product reviews, status updates on Twitter and Facebook, literary texts, music, and paintings — have been analyzed, not uncontroversially, as “data”.

In this graduate-level course (open to all departments, especially those in the humanities and social sciences), we will pursue two ends. First, we will investigate the landscape of modern quantitative methods for treating data as a lens onto the world, surveying a range of methods in machine learning and data analysis that leverage information produced by people in order to draw inferences (such as discerning the authorship of documents and the political position of social media users, charting the reuse of language in legislative bills, tagging the genres of songs, and extracting social networks from literary texts). Second, we will cast a critical eye on those methods and investigate the assumptions those algorithms make about the world and the data through which we see it, in order to understand their limitations and when to apply them. How and when can empirical methods support other forms of argumentation, and what are their limits?

Many of these techniques are shared among the nascent communities of practice known as “computational social science”, “computational journalism” and the “digital humanities”; this course provides foundational skills for students to conduct their own research in these areas.

No computational background is required; the Python programming language will be used during instruction. Homeworks will be designed to give students a choice depending on their background — either a.) implementing and evaluating a quantitative method on a dataset, or b.) writing an analysis/critique of an algorithm and published work that has used it. The course will be capped with a final collaborative project.

Texts

[ML] Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Cambridge, 2012) [Amazon]
[NCM] Easley and Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World (Cambridge, 2010) [online]

Syllabus

(Subject to final changes.) We'll spend the first three weeks introducing the big pillars of data science (clustering, classification and regression as predictive/descriptive tasks) from the perspective of designing and evaluating experiments, and then dive more deeply into the different models that comprise them. Most lectures will be structured as a.) an in-depth description of a model/algorithm, followed by b.) discussion and critique of a specific application that makes use of that method. Our goal is to cultivate critical computational thinking by example.

Date	Topic	Readings
T Jan 17	Overview [slides]
Th Jan 19	Survey of methods [slides]	ML ch. 1 boyd and Crawford (2012), Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon [Optional] Iliadis and Russo (2016), Critical data studies: an introduction
T Jan 24	Classification (design + evaluation) [slides]	ML ch. 2, 3 Optional (computational social science): Lazer et al. (2009), Computational Social Science Grimmer (2015), We Are All Social Scientists Now: How Big Data, Machine Learning, and Causal Inference Work Together
Th Jan 26	Regression (design + evaluation) [slides]	ML ch. 3 (cont'd), ch. 10 Optional (computational journalism): Cohen (2011), Computational journalism Parasie (2015), Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of "big data"
T Jan 31	Clustering (design + evaluation) [slides] Homework 1 out.	ML ch. 3 (cont'd) Optional (digital humanities): Marche (2012), Literature Is not Data: Against Digital Humanities Underwood (2015), Seven ways humanists are using computers to understand text.
Th Feb 2	Decision trees; random forests [slides]	ML ch. 5 Silverstein and Shieber (1996), Predicting individual book use for off-site storage using decision trees
T Feb 7	Data and representation [slides]	Gitelman (2013), Raw Data is an Oxymoron The Quartz Guide to Bad Data (2015)
Th Feb 9	Probabilistic models: probability/stats review; Naive Bayes. Authorship attribution [slides]	ML ch. 9 (intro), 9.2 Koppel et al. (2009), Computational methods in authorship attribution [Optional] Long and So (2016), Literary Pattern Recognition: Modernism between Close Reading and Machine Learning
Fri Feb 10	Homework 1 due
T Feb 14	Probabilistic models: logistic regression; (stochastic) gradient descent; regularization; Attribute prediction [slides]	ML ch. 9.3 Rao et al. (2010), Classifying Latent User Attributes in Twitter Cohen and Ruths (2013), Classifying Political Orientation on Twitter: It’s Not Easy!
Th Feb 16	Validity: hypothesis tests. [slides] Project proposals due	Krippendorff (2004), "Validity," Content Analysis [on bCourses] Introduction to Hypothesis Testing Kohavi et al. (2007), Practical Guide to Controlled Experiments on the Web
T Feb 21	Validity: causal inference [slides] Homework 2 out.	Gelman and Hill (2009), Causal inference using regression on the treatment variable, Data Analysis Using Regression and Multilevel/Hierarchical Models Lopez (2017), Matching to estimate the causal effects of firing an NFL coach
Th Feb 23	Probabilistic models: latent variable models; generative models [slides]	ML ch. 9.4 Blei (2014), Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models (pp. 203-218)
T Feb 28	Probabilistic models: latent variable models; topic models [slides]	Blei (2012), Probabilistic Topic Models Goldstone and Underwood (2014), The Quiet Transformations of Literary Studies Grimmer (2010), A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases
Th Mar 2	Interpretability	Burrell (2016), How the Machine 'Thinks:' Understanding Opacity in Machine Learning Algorithms Freitas (2014), Comprehensible classification models [Optional] Chen et al. (2015), Enhancing transparency and control when drawing data-driven inferences about individuals
Sun Mar 5	Homework 2 due
T Mar 7	Linear models: linear regression; Predicting movie revenue [slides] Homework 3 out.	ML ch. 7.1 Joshi et al. (2010), Movie Reviews and Revenues: An Experiment in Text Regression
T Mar 14	Linear models: PCA; Dimensionality reduction [slides]	Smith (2002), A tutorial on Principal Components Analysis Powell, Principal Component Analysis, Explained Visually Witmore (2015), Finding "Distances" Between Shakespeare’s Plays 2: Projecting Distances onto New Bases with PCA
Th Mar 16	Neural networks. Word embeddings [slides]	Nielsen (2015), Using neural nets to recognize handwritten digits Bolukbasi et al. (2016), Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings Daumé III (2016), Language bias and black sheep
Sun Mar 19	Homework 3 due
T Mar 21	Neural networks. Visual style classification [slides]	Britz (2015), Understanding Convolutional Neural Networks for NLP [Optional] Goodfellow et al. (2016), Introduction, Deep Learning [Optional] Gatys et al. (2015), A Neural Algorithm of Artistic Style
Th Mar 23	Ethics; Predictive policing	Crawford and Calo (2016), There is a blind spot in AI research Markham (2016), OKCupid data release fiasco Rand (2013), Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations, chs. 1 and 2 [Optional] Goel et al. (2016), Combatting Police Discrimination in the Age of Big Data
Fri Mar 24	Midterm reports due
T Mar 28	No class (spring break)
Th Mar 30	No class (spring break)
T Apr 4	Distance models: classification (nearest neighbors) and similarity; Text reuse [slides]	ML ch. 8.1-8.3 Leskovec et al. (2014), Finding Similar Items, Mining Massive Datasets [Optional] Leskovec (2009), Meme-tracking and the Dynamics of the News Cycle [Optional] Smith et al. (2014), Detecting and Modeling Local Text Reuse
Th Apr 6	Distance models: clustering (K-means; hierarchical); Genre clustering [slides] Homework 4 out.	ML ch. 8.4-8.5 Allison et al. (2011), Quantitative Formalism: an Experiment
T Apr 11	Networks: structural properties; strong and weak ties. Homophily	NCM ch. 2 ("Graphs") and 3 ("Strong and Weak Ties") Al Zamal et al. (2011), Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors [Optional] Hargittai (2015), Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites
Th Apr 13	Networks (Rob Kuvinka)
T Apr 18	Networks: information diffusion	NCM ch. 16 ("Information Cascades") Kramer et al. (2015), Experimental evidence of massive-scale emotional contagion through social networks [Optional] Adar et al. (2004), Implicit Structure and the Dynamics of Blogspace [Optional] Tufekci (2015), Facebook and Engineering the Public
Th Apr 20	Fairness and accountability	Wallach (2014), Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency Moritz Hardt (2014), How big data is unfair: Understanding sources of unfairness in data driven decision making [Optional] Zafar et al. (2015), Fairness Constraints: A Mechanism for Fair Classification
Fri Apr 21	Homework 4 due
T Apr 25	Student project presentations
Th Apr 27	Student project presentations
Fri May 5	Final projects due

Grading

10%	Class participation
50%	Homeworks (4 x 12.5%)
40%	Project:
	5% Proposal/literature review
	10% Midterm report
	20% Final report
	5% Presentation
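
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The component names, the 0-100 scoring scale, and the final_grade helper are assumptions made for illustration only; the sketch does not describe how grades are actually recorded, rounded, or converted to letter grades.

```python
# Weights (percent of the final grade) copied from the grading table above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "participation": 10.0,
    "homework_1": 12.5,
    "homework_2": 12.5,
    "homework_3": 12.5,
    "homework_4": 12.5,
    "proposal": 5.0,
    "midterm_report": 10.0,
    "final_report": 20.0,
    "presentation": 5.0,
}

# Sanity check: the components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100.0


def final_grade(scores):
    """Return the weighted total, assuming each score is on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100.0


if __name__ == "__main__":
    # Example: a student who scores 80 on every component.
    example = {name: 80.0 for name in WEIGHTS}
    print(final_grade(example))
```

Under these assumptions, the four homeworks together carry half of the total and the four project components together carry 40%.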

Project

The course will be capped by a semester-long collaborative project (involving 2 or 3 students), in which the methods learned in class will be used to draw inferences about the world and critically assess the quality of those results. The project will comprise four components:

— Project proposal and literature review. Students will formulate a hypothesis to be examined, motivate its rationale as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community. (2 pages; 5 sources)
— Midterm report. By the middle of the course, students should have a.) completed data collection; b.) established a validation strategy to be performed at the end of experimentation; and c.) presented initial experimental results. (4 pages; 10 sources)
— Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to standards for conference publication—including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
— Presentation. At the end of the semester, teams will present their work in a conference-style presentation. (10-15 minutes, with time for questions).

All reports should use the ACL 2015 style files for either LaTeX or Microsoft Word.

Participation

Most classes will include discussion of an application as documented in a research paper. While everyone is expected to read these papers, one student each class will act as a discussion leader, coming prepared with questions and discussion topics for the class as a whole to discuss.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks should be completed independently. All writing must be your own; if you mention the work of others, you must be clear in citing the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. Late homeworks will not be accepted.

Students with Disabilities

Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or at my office.

Late assignments

All assignments are expected to be turned in by the specified date and time. However, students are free to use a total of 2 "free days," each of which extends the due date by 24 hours. Use them wisely! Assignments turned in late after both free days have been used up will not receive credit.

Acknowledgments

This course draws inspiration from courses by Jacob Eisenstein (Georgia Tech), Andrew Goldstone (Rutgers), Justin Grimmer (Stanford), David Mimno (Cornell), Brendan O'Connor (UMass), Cosma Shalizi (CMU), Jonathan Stray (Columbia) and Emily Bender (UW).