Info

Many products of human invention — political speeches, product reviews, status updates on Twitter and Facebook, literary texts, music, and paintings — have been analyzed, not uncontroversially, as “data”.

In this graduate-level course (open to all departments, especially those in the humanities and social sciences), we will pursue two ends. First, we will investigate the landscape of modern quantitative methods for treating data as a lens onto the world, surveying a range of methods in machine learning and data analysis that leverage information produced by people in order to draw inferences (such as discerning the authorship of documents and the political position of social media users, charting the reuse of language in legislative bills, tagging the genres of songs, and extracting social networks from literary texts). Second, we will cast a critical eye on those methods and investigate the assumptions those algorithms make about the world and the data through which we see it, in order to understand their limitations and when to apply them. How and when can empirical methods support other forms of argumentation, and what are their limits?

Many of these techniques are shared among the nascent communities of practice known as “computational social science”, “computational journalism” and the “digital humanities”; this course provides foundational skills for students to conduct their own research in these areas.

No computational background is required; the Python programming language will be used during instruction. Homeworks will be designed to give students a choice depending on their background — either a.) implementing and evaluating a quantitative method on a dataset, or b.) writing an analysis/critique of an algorithm and published work that has used it. The course will be capped with a final collaborative project.

Texts

  • [ML] Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Cambridge, 2012) [Amazon]
  • [NCM] Easley and Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World (Cambridge, 2010) [online]

Syllabus

(Subject to final changes.) We'll spend the first three weeks introducing the big pillars of data science (clustering, classification and regression as predictive/descriptive tasks) from the perspective of designing and evaluating experiments, and then dive more deeply into the different models that comprise them. Most lectures will be structured as a.) an in-depth description of a model/algorithm, followed by b.) discussion and critique of a specific application that makes use of that method. Our goal is to cultivate critical computational thinking by example.

Date Topic Readings
T Jan 17 Overview [slides]  
Th Jan 19 Survey of methods [slides]
T Jan 24 Classification (design + evaluation) [slides]
Th Jan 26 Regression (design + evaluation) [slides]
T Jan 31 Clustering (design + evaluation) [slides] Homework 1 out.
Th Feb 2 Decision trees; random forests [slides]
T Feb 7 Data and representation [slides]
Th Feb 9 Probabilistic models: probability/stats review; Naive Bayes. Authorship attribution [slides]
Fri Feb 10 Homework 1 due
T Feb 14 Probabilistic models: logistic regression; (stochastic) gradient descent; regularization; Attribute prediction [slides]
Th Feb 16 Validity: hypothesis tests. [slides] Project proposals due
T Feb 21 Validity: causal inference [slides] Homework 2 out.
Th Feb 23 Probabilistic models: latent variable models; generative models [slides]
T Feb 28 Probabilistic models: latent variable models; topic models [slides]
Th Mar 2 Interpretability
Sun Mar 5 Homework 2 due
T Mar 7 Linear models: linear regression; Predicting movie revenue [slides] Homework 3 out.
T Mar 14 Linear models: PCA; Dimensionality reduction [slides]
Th Mar 16 Neural networks. Word embeddings [slides]
Sun Mar 19 Homework 3 due
T Mar 21 Neural networks. Visual style classification [slides]
Th Mar 23 Ethics; Predictive policing
Fri Mar 24 Midterm reports due  
T Mar 28 No class (spring break)  
Th Mar 30 No class (spring break)  
T Apr 4 Distance models: classification (nearest neighbors) and similarity; Text reuse [slides]
Th Apr 6 Distance models: clustering (K-means; hierarchical); Genre clustering [slides] Homework 4 out.
T Apr 11 Networks: structural properties; strong and weak ties. Homophily
Th Apr 13 Networks (Rob Kuvinka)  
T Apr 18 Networks: information diffusion
Th Apr 20 Fairness and accountability
Fri Apr 21 Homework 4 due
T Apr 25 Student project presentations  
Th Apr 27 Student project presentations  
Fri May 5 Final projects due

Grading

10% Class participation
50% Homeworks (4 x 12.5%)
40% Project:
      5% Proposal/literature review
      10% Midterm report
      20% Final report
      5% Presentation
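
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The names WEIGHTS and final_grade are hypothetical, and the sketch assumes every component is scored on a 0-100 scale; it is not an official grade calculator.

```python
import math

# Weights copied from the grading breakdown above, as fractions of the final grade.
# The component names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "participation": 0.10,
    "homework_1": 0.125,
    "homework_2": 0.125,
    "homework_3": 0.125,
    "homework_4": 0.125,
    "proposal": 0.05,
    "midterm_report": 0.10,
    "final_report": 0.20,
    "presentation": 0.05,
}

# Sanity check: the listed weights should account for the whole grade.
assert math.isclose(sum(WEIGHTS.values()), 1.0)


def final_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` maps each component name in WEIGHTS to a score in [0, 100]
    (an assumption of this sketch, not something the syllabus specifies).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Example: 90 on every component except an 80 on the final report.
example = {name: 90.0 for name in WEIGHTS}
example["final_report"] = 80.0
print(f"{final_grade(example):.1f}")  # prints 88.0
```

Because the final report carries 20% of the grade, a ten-point drop on it alone lowers the overall grade by two points.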

Project

The course will be capped by a semester-long collaborative project (involving 2 or 3 students), in which the methods learned in class will be used to draw inferences about the world and to critically assess the quality of those results. The project will consist of four components:

  • Project proposal and literature review. Students will formulate a hypothesis to be examined, motivate its rationale as an interesting question worth asking, and assess its potential to contribute new knowledge by situating it within related literature in the scientific community. (2 pages; 5 sources)
  • Midterm report. By the middle of the course, students should have a.) completed data collection; b.) established a validation strategy to be performed at the end of experimentation; and c.) presented initial experimental results. (4 pages; 10 sources)
  • Final report. The final report will include a complete description of work undertaken for the project, including data collection, development of methods, experimental details (complete enough for replication), comparison with past work, and a thorough analysis. Projects will be evaluated according to standards for conference publication, including clarity, originality, soundness, substance, evaluation, meaningful comparison, and impact (of ideas, software, and/or datasets). (8 pages)
  • Presentation. At the end of the semester, teams will present their work in a conference-style presentation. (10-15 minutes, with time for questions)
All reports should use the ACL 2015 style files for either LaTeX or Microsoft Word.

Participation

Most classes will include discussion of an application as documented in a research paper. While everyone is expected to read these papers, one student each class will act as a discussion leader, coming prepared with questions and discussion topics for the class as a whole to discuss.

Policies

Academic Integrity

All students will follow the UC Berkeley code of conduct. While the group project is a collaborative effort, all homeworks should be completed independently. All writing must be your own; if you mention the work of others, you must clearly cite the appropriate source. (For additional information on plagiarism, see here.) This holds for source code as well: if you use others' code (e.g., from StackOverflow), you must cite its source. Late homeworks will not be accepted.

Students with Disabilities

Our goal is to make class a learning environment accessible to all students. If you need disability-related accommodations and have a Letter of Accommodation from the DSP, have emergency medical information you wish to share with me, or need special arrangements in case the building must be evacuated, please inform me immediately. I'm happy to discuss privately after class or at my office.

Late Assignments

All assignments are expected to be turned in by the specified date and time. However, students are free to use a total of 2 "free days," each of which extends the due date by 24 hours. Use them wisely! Assignments turned in late after both free days have been used up will not receive credit.