Info

Many products of human invention — political speeches, product reviews, status updates on Twitter and Facebook, literary texts, music, and paintings — have been analyzed, not uncontroversially, as “data”.

In this graduate-level course (open to all departments, especially those in the humanities and social sciences), we will pursue two ends. First, we will investigate the landscape of modern quantitative methods for treating data as a lens onto the world, surveying a range of methods in machine learning and data analysis that leverage information produced by people in order to draw inferences (such as discerning the authorship of documents and the political position of social media users, charting the reuse of language in legislative bills, tagging the genres of songs, and extracting social networks from literary texts). Second, we will cast a critical eye on those methods and investigate the assumptions those algorithms make about the world and the data through which we see it, in order to understand their limitations and when to apply them. How and when can empirical methods support other forms of argumentation, and what are their limits?

Many of these techniques are shared among the nascent communities of practice known as “computational social science”, “computational journalism” and the “digital humanities”; this course provides foundational skills for students to conduct their own research in these areas.

No computational background is required; the Python programming language will be used during instruction. Homeworks will be designed to give students a choice depending on their background — either a.) implementing and evaluating a quantitative method on a dataset, or b.) writing an analysis/critique of an algorithm and published work that has used it. The course will be capped with a final collaborative project.

Texts

  • [ML] Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Cambridge, 2012) [Amazon]
  • [NCM] Easley and Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World (Cambridge, 2010) [online]

Syllabus

(Subject to final changes.) We'll spend the first three weeks introducing the big pillars of data science (clustering, classification and regression as predictive/descriptive tasks) from the perspective of designing and evaluating experiments, and then dive more deeply into the different models that comprise them. Most lectures will be structured as a.) an in-depth description of a model/algorithm, followed by b.) discussion and critique of a specific application that makes use of that method. Our goal is to cultivate critical computational thinking by example.

Date | Topic | Readings
W Jan 20 | Overview [slides] [perceptron.ipynb] |
M Jan 25 | Survey of methods [slides] |
W Jan 27 | Classification (design + evaluation) [slides] |
M Feb 1 | Regression (design + evaluation) [slides] |
W Feb 3 | Clustering (design + evaluation) [slides]. Homework 1 out (due Feb 17) |
M Feb 8 | Validity (frequentist hypothesis tests; multiple hypothesis tests; A/B tests; Bayes factors) [slides] |
W Feb 10 | Decision trees; random forests [slides] |
M Feb 15 | No class (holiday) |
W Feb 17 | Probabilistic models: probability/stats review; Naive Bayes. Homework 1 due; homework 2 out. Authorship attribution [slides] |
F Feb 19 | Project proposal/literature reviews due |
M Feb 22 | Probabilistic models: logistic regression; (stochastic) gradient descent; regularization [slides]; Attribute prediction |
W Feb 24 | Probabilistic models: latent variable models; generative models [slides] |
M Feb 29 | Probabilistic models: latent variable models; topic models [slides] |
W Mar 2 | Interpretability. Homework 2 due; homework 3 out. |
M Mar 7 | Linear models: linear regression; Predicting movie revenue [slides] |
W Mar 9 | Linear models: PCA; Dimensionality reduction [slides] |
M Mar 14 | Linear models: SVM; nonlinear models: kernelized SVM; Music genre classification [slides] |
W Mar 16 | Nonlinear models: neural networks. Homework 3 due; homework 4 out. Visual style classification [slides] |
F Mar 18 | Project midterm reports due |
M Mar 21 | No class (spring break) |
W Mar 23 | No class (spring break) |
M Mar 28 | Distance models: classification (nearest neighbors) and similarity; Text reuse [slides] |
W Mar 30 | Distance models: clustering (K-means; hierarchical); Genre clustering [slides] |
M Apr 4 | Ethics; Predictive policing |
W Apr 6 | Networks: structural properties; strong and weak ties. Homework 4 due. Homophily [slides] |
M Apr 11 | Networks: information diffusion [slides] |
W Apr 13 | Affective computing (Noura) |
M Apr 18 | Fairness and accountability |
W Apr 20 | Review; Predicting elections and the stock market. Homework 5 out. |
M Apr 25 | Student project presentations |
W Apr 27 | Student project presentations |
F May 6 | Final project report due |

Grading

10% Class participation
50% Homeworks (4 x 12.5%)
40% Project:
    5% Proposal/literature review
    10% Midterm report
    20% Final report
    5% Presentation

Project

The course will be capped by a semester-long collaborative project (involving 2 or 3 students), in which the methods learned in class will be used to draw inferences about the world and to critically assess the quality of those results. The project will consist of four components:

  • Project proposal and literature review. Students will formulate a hypothesis to be examined, motivate its rationale as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community. (2 pages; 5 sources)
  • Midterm report. By the middle of the course, students should have a.) completed data collection; b.) established a validation strategy to be performed at the end of experimentation; and c.) presented initial experimental results. (4 pages; 10 sources)
  • Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to standards for conference publication, including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
  • Presentation. At the end of the semester, teams will present their work in a conference-style presentation. (10-15 minutes, with time for questions)
All reports should use the ACL 2015 style files for either LaTeX or Microsoft Word.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks should be completed independently. All writing must be your own; if you mention the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. Late homeworks will not be accepted.

Students with Disabilities

Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or at my office.