Many products of human invention — political speeches, product reviews, status updates on Twitter and Facebook, literary texts, music, and paintings — have been analyzed, not uncontroversially, as “data”.
In this graduate-level course (open to all departments, and especially to students in the humanities and social sciences), we will pursue two ends. First, we will investigate the landscape of modern quantitative methods for treating data as a lens onto the world, surveying a range of methods in machine learning and data analysis that leverage information produced by people in order to draw inferences (such as discerning the authorship of documents and the political position of social media users, charting the reuse of language in legislative bills, tagging the genres of songs, and extracting social networks from literary texts). Second, we will cast a critical eye on those methods, investigating the assumptions those algorithms make about the world and about the data through which we see it, in order to understand their limitations and when to apply them. How and when can empirical methods support other forms of argumentation, and what are their limits?
Many of these techniques are shared among the nascent communities of practice known as “computational social science”, “computational journalism” and the “digital humanities”; this course provides foundational skills for students to conduct their own research in these areas.
No computational background is required; instruction will use the Python programming language. Homeworks are designed to give students a choice depending on their background: either a.) implementing and evaluating a quantitative method on a dataset, or b.) writing an analysis/critique of an algorithm and of published work that has used it. The course will be capped with a final collaborative project.
(Subject to final changes.) We'll spend the first three weeks introducing the major pillars of data science (clustering, classification, and regression as predictive/descriptive tasks) from the perspective of designing and evaluating experiments, and then dive more deeply into the individual models behind them. Most lectures will be structured as a.) an in-depth description of a model/algorithm, followed by b.) discussion and critique of a specific application that makes use of that method. Our goal is to cultivate critical computational thinking by example.
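To make the "design and evaluation" framing of those first weeks concrete: evaluating a predictive method means holding data out, fitting on the rest, and scoring predictions on the held-out portion. The sketch below (a hypothetical illustration in plain Python, not course material; the synthetic two-cluster dataset is invented for the example) evaluates a 1-nearest-neighbor classifier this way:

```python
import random

def nearest_neighbor_predict(train, x):
    """Predict the label of x from its single nearest training point."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(((dist(xt, x), y) for xt, y in train), key=lambda t: t[0])
    return label

def accuracy(train, test):
    """Fraction of held-out points whose label is predicted correctly."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Synthetic data: class 0 clusters near (0, 0), class 1 near (5, 5).
random.seed(0)
data = [((random.gauss(0, 1), random.gauss(0, 1)), 0) for _ in range(50)] + \
       [((random.gauss(5, 1), random.gauss(5, 1)), 1) for _ in range(50)]
random.shuffle(data)

# The train/test split is the heart of the evaluation design:
# the classifier never sees the test points during "training".
train, test = data[:80], data[80:]
acc = accuracy(train, test)
print(f"held-out accuracy: {acc:.2f}")  # near 1.0 on well-separated clusters
```

The same split-fit-score loop applies no matter which model fills the middle step, which is why the course treats experimental design separately from the individual models.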
Date | Topic | Readings |
--- | --- | --- |
W Jan 20 | Overview [slides] [perceptron.ipynb] | |
M Jan 25 | Survey of methods [slides] | |
W Jan 27 | Classification (design + evaluation) [slides] | |
M Feb 1 | Regression (design + evaluation) [slides] | ML ch. 3 (cont'd), ch. 10 |
W Feb 3 | Clustering (design + evaluation) [slides]. Homework 1 out (due Feb 17) | |
M Feb 8 | Validity (frequentist hypothesis tests; multiple hypothesis tests; A/B tests; Bayes factors) [slides] | |
W Feb 10 | Decision trees; random forests [slides] | |
M Feb 15 | No class (holiday) | |
W Feb 17 | Probabilistic models: probability/stats review; Naive Bayes; Authorship attribution [slides]. Homework 1 due; homework 2 out | |
F Feb 19 | Project proposal/literature reviews due | |
M Feb 22 | Probabilistic models: logistic regression; (stochastic) gradient descent; regularization; Attribute prediction [slides] | |
W Feb 24 | Probabilistic models: latent variable models; generative models [slides] | |
M Feb 29 | Probabilistic models: latent variable models; topic models [slides] | |
W Mar 2 | Interpretability. Homework 2 due; homework 3 out | |
M Mar 7 | Linear models: linear regression; Predicting movie revenue [slides] | |
W Mar 9 | Linear models: PCA; Dimensionality reduction [slides] | |
M Mar 14 | Linear models: SVM; nonlinear models: kernelized SVM; Music genre classification [slides] | |
W Mar 16 | Nonlinear models: neural networks; Visual style classification [slides]. Homework 3 due; homework 4 out | |
F Mar 18 | Project midterm reports due | |
M Mar 21 | No class (spring break) | |
W Mar 23 | No class (spring break) | |
M Mar 28 | Distance models: classification (nearest neighbors) and similarity; Text reuse [slides] | |
W Mar 30 | Distance models: clustering (K-means; hierarchical); Genre clustering [slides] | |
M Apr 4 | Ethics; Predictive policing | |
W Apr 6 | Networks: structural properties; strong and weak ties; Homophily [slides]. Homework 4 due | |
M Apr 11 | Networks: information diffusion [slides] | |
W Apr 13 | Affective computing (Noura) | |
M Apr 18 | Fairness and accountability | |
W Apr 20 | Review; Predicting elections and the stock market. Homework 5 out | |
M Apr 25 | Student project presentations | |
W Apr 27 | Student project presentations | |
F May 6 | Final project report due | |
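The overview session links a perceptron.ipynb notebook. As a taste of the kind of algorithm the course begins with, here is a minimal perceptron sketch in plain Python (a hypothetical illustration; the actual notebook's contents may differ). The perceptron keeps a weight vector and bias, and updates them only when a prediction is wrong:

```python
def perceptron_train(data, epochs=10, lr=1.0):
    """Learn weights w and bias b with the perceptron mistake-driven update."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:  # labels y are in {-1, +1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def perceptron_predict(w, b, x):
    """Classify x by the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Learn logical OR, a linearly separable function.
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = perceptron_train(data)
preds = [perceptron_predict(w, b, x) for x, _ in data]
print(preds)  # [-1, 1, 1, 1]
```

Because the data are linearly separable, the perceptron convergence theorem guarantees the loop eventually stops making mistakes; on non-separable data the updates never settle, a limitation worth keeping in mind when we critique applications of such models.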
The course will be capped by a semester-long collaborative project (in teams of 2 or 3 students), in which the methods learned in class will be used to draw inferences about the world and to critically assess the quality of those results. The project will consist of four components:
- Project proposal and literature review. Students will formulate a hypothesis to be examined, motivate it as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within the related literature in the scientific community. (2 pages; 5 sources)
- Midterm report. By the middle of the course, students should have a.) completed data collection; b.) established a validation strategy to be performed at the end of experimentation; and c.) obtained initial experimental results. (4 pages; 10 sources)
- Final report. The final report will include a complete description of the work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to the standards of conference publication: clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
- Presentation. At the end of the semester, teams will present their work in a conference-style presentation. (10-15 minutes, with time for questions)
All reports should use the ACL 2015 style files for either LaTeX or Microsoft Word.
Academic Integrity
All students are expected to follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks must be completed independently. All writing must be your own; if you draw on the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. Late homeworks will not be accepted.
Students with Disabilities
Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me as soon as possible. I'm happy to discuss privately after class or in my office.