Homework 1: Representation (due 2/17)

In this homework, we'll be predicting the winners of the 2016 Oscars (taking place on Feb. 28) for six categories: Best Picture, Best Director, Best Actor/Actress, and Best Supporting Actor/Actress. The quality of predictive models is heavily dependent on the representation of the input data through features, so your choice of how you represent your data has important consequences for what can be learned. Consider the category of "Best Picture," with 2015 nominees "The Big Short", "Bridge of Spies", "Brooklyn", "Mad Max: Fury Road", "The Martian", "The Revenant", "Room" and "Spotlight". One example featurization would be to consider only whether a movie stars Matt Damon or Charlize Theron, and to represent each data point as a binary vector indicating whether it stars those actors.

Movie                  stars Matt Damon?  stars Charlize Theron?
The Big Short          0                  0
Bridge of Spies        0                  0
Brooklyn               0                  0
Mad Max: Fury Road     0                  1
The Martian            1                  0
The Revenant           0                  0
Room                   0                  0
Spotlight              0                  0
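
To make this concrete, here is a minimal Python sketch of the featurization above; the cast entries are hand-filled stand-ins for what you'd pull from a real source like IMDB.

    # A minimal sketch of the binary featurization above. The CAST
    # entries here are hand-filled placeholders; in practice you would
    # pull cast lists from a source such as IMDB or Wikipedia.
    NOMINEES = ["The Big Short", "Bridge of Spies", "Brooklyn",
                "Mad Max: Fury Road", "The Martian", "The Revenant",
                "Room", "Spotlight"]
    CAST = {"The Martian": {"Matt Damon"},
            "Mad Max: Fury Road": {"Charlize Theron"}}
    ACTORS = ["Matt Damon", "Charlize Theron"]

    def featurize(movie):
        """Return one binary indicator per actor in ACTORS."""
        cast = CAST.get(movie, set())
        return [int(actor in cast) for actor in ACTORS]

    vectors = {movie: featurize(movie) for movie in NOMINEES}
    # vectors["The Martian"]        == [1, 0]
    # vectors["Mad Max: Fury Road"] == [0, 1]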
This featurization is clearly incomplete, with limited predictive (and explanatory) power; what would a better one look like? Keep in mind that (especially for this task) features need not be scoped only over a single movie in isolation: they could be scoped over all movies in the dataset, or (perhaps more appropriately) over all the movies a given nominee is competing against. For example, rather than having a single feature that captures a movie's raw box office revenue (would this be a good feature given our historical data?), we could assess the competitors in a given year and create a feature signifying whether a given movie has the maximum box office among that set.

Movie                  Max box office      Min box office      Max runtime         Min runtime
                       among competitors?  among competitors?  among competitors?  among competitors?
The Big Short          0                   0                   0                   0
Bridge of Spies        0                   0                   0                   0
Brooklyn               0                   0                   0                   1
Mad Max: Fury Road     0                   0                   0                   0
The Martian            1                   0                   0                   0
The Revenant           0                   0                   1                   0
Room                   0                   1                   0                   0
Spotlight              0                   0                   0                   0
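
Features like these are computed over the pool of competitors rather than over a single movie in isolation. Here is a minimal sketch of that computation; the box-office numbers below are made-up placeholders, not real figures (a source like Box Office Mojo would supply the real ones).

    # Sketch: turn a raw per-movie statistic into competitor-relative
    # indicators. The numbers are arbitrary placeholders.
    box_office = {"The Big Short": 70, "Bridge of Spies": 72,
                  "Brooklyn": 38, "Mad Max: Fury Road": 154,
                  "The Martian": 228, "The Revenant": 170,
                  "Room": 14, "Spotlight": 39}

    def relative_features(stats):
        """Map {movie: value} to {movie: [is_max, is_min]} within one pool."""
        hi, lo = max(stats.values()), min(stats.values())
        return {m: [int(v == hi), int(v == lo)] for m, v in stats.items()}

    features = relative_features(box_office)
    # features["The Martian"] == [1, 0]  (max box office among competitors)
    # features["Room"]        == [0, 1]  (min box office among competitors)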
Other possibilities include critical reviews and reactions from professional critics and users on Twitter. Your task here is to design an optimal representation for this prediction problem.

I. (everyone)

  • a.) Consider the set of ideal features with which you'd represent the input data, given that our task is to optimize predictive accuracy while retaining some measure of explanatory power: if you had access to any information imaginable (excluding knowledge of the future), how would you represent the data points? Be thorough in your enumeration, and argue why, a priori, those features might help discriminate Oscar winners from losers.
  • b.) From that ideal set, select a subset of features that you can tangibly instantiate for the data points, and describe where you'd find the information to instantiate them. Resources to consider include Wikipedia, IMDB, Twitter, Rotten Tomatoes, plot descriptions or transcripts of the movies themselves, information about the actors, and many more. In this part, you do not need to actually extract data from these resources; simply identify a real source (with URL) you might be able to pull it from. The goal of this section is not simply to be opportunistic with data as you think about future models, but to think first about what ideal data would look like and how you can find (or create) it.
  • The deliverable for part I is a written document (2 pages, single-spaced).

II. (pick one)

a.) Implementation

From some subset of the features you identified in part I.b above, create a representation for the data being predicted. In this part, you will submit 6 files, one for each of the categories being predicted. Each file should contain features for the training data (1960-2014) and for the nominees you will make predictions about (2015). You may include features that are defined only over specific date ranges (such as Golden Globe nominations from 1981-present).
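
The authoritative file format is the one described in the GitHub repository (see below). Purely as an illustration of handling a date-ranged feature, the sketch below assumes a simple layout of (year, nominee, label, feature values) and a hypothetical golden_globe_nominee lookup, backfilling a constant 0 for years before the feature is defined.

    # Illustrative only: the column layout here is an assumption; follow
    # the format description in the GitHub repo for your actual files.
    import csv

    def golden_globe_nominee(nominee, year):
        # Hypothetical stub; replace with a real lookup (e.g., data
        # scraped from Wikipedia's Golden Globe pages).
        return False

    def write_feature_file(path, rows):
        # rows: iterable of (year, nominee, label) triples, 1960-2015.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            for year, nominee, label in rows:
                # The feature is undefined before 1981; emit a constant 0
                # there so every row has the same number of columns.
                gg = int(golden_globe_nominee(nominee, year)) if year >= 1981 else 0
                writer.writerow([year, nominee, label, gg])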

We'll train a model on your supplied featurizations and use that model to make predictions for each category. In this task, your only degree of freedom is in the representation of the data; the model itself will be fixed, and everyone will use that same model. Since we've only covered the details of the perceptron so far, we'll use a very similar model (binary logistic regression) so that you understand its assumptions. We'll train the model to learn the binary classification of {winner, not winner} for each data point in the training set, with cross-validated L2 regularization to minimize overfitting. To predict the winner for 2015, we'll use the model trained on your data to predict the probability of being a winner for each of the nominees in your representation; the nominee with the highest probability among its competitors will be selected as the winner.
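
If you'd like to sanity-check your features locally, the fixed pipeline is easy to approximate. The sketch below uses scikit-learn's LogisticRegressionCV as a stand-in (an assumption about tooling on our part; the script in the repo is authoritative).

    # Sketch of the fixed model: L2-regularized binary logistic regression
    # with the regularization strength chosen by cross-validation, then a
    # per-category argmax over predicted winner probabilities.
    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV

    def predict_winner(X_train, y_train, X_nominees, nominees):
        # y_train holds {1 = winner, 0 = not winner} for 1960-2014 nominees.
        model = LogisticRegressionCV(penalty="l2", cv=5)
        model.fit(X_train, y_train)
        probs = model.predict_proba(X_nominees)[:, 1]  # P(winner) per nominee
        return nominees[int(np.argmax(probs))]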

Data for this challenge can be found in the GitHub repository. This includes a description of the format of the files you will turn in, a list of the nominees and winners for all of the years we'll be considering (1960-present), supplementary data (such as the Wikipedia pages for all of the nominees), and the code for the training/predicting script we'll be running on your data. Feel free to use that script to optimize the features you turn in.

The deliverables for II.a are a total of 6 feature files (in the tab-separated format illustrated in the GitHub examples), one for each category being predicted (Best Picture, Best Director, Best Actor/Actress, and Best Supporting Actor/Actress), and a brief description of the feature classes you've implemented (as a README).

b.) Critical assessment

The feature representations and predictions we're making in the first part of this homework are, in fact, conditional predictions: conditioned on being nominated, who is the winner? This implicit conditioning may lead us to forget that a very important process has already taken place, that of selecting who the nominees are in the first place. [This has been the subject of public critique during this year's Academy Awards, where no minority actors were nominated.]

In this section, consider two larger questions. First, how would you design a model of the Academy's current, human nomination process using the concepts we've been studying? What are the ways in which such a (human) process could result in systemic biases like the underrepresentation of minorities? Second, now think toward an algorithmic approach to predicting nominees from among a pool of candidate actors. What are the ways in which a similar underrepresentation could occur? What are the risks of training a predictive model of nomination in a supervised fashion, using historical data? For both questions, to what degree can your choice of how you represent the data you're making predictions about influence these processes?

The deliverable for II.b is a written document (3 pages, single-spaced).

Due date

This homework is due on February 17 at 11:59pm. All files (.pdfs for written documents and .tsv files for feature representations) should be submitted through bCourses. (Late homework won't be accepted!)

Integrity

Feel free to use feature sets others have proposed for Oscar prediction (e.g., 538); be sure to cite ideas you borrow. You should also feel free to brainstorm ideas with others in class (and outside); again be sure to give credit where it's due. All writing must be your own.