I 256: Applied Natural Language Processing

   Fall 2006, Prof. Marti Hearst


Assignment 3

You are encouraged to work in pairs on this assignment, but only if you truly collaborate. If you work with someone else, please include in the writeup a description of who did what and what you did together.

Part A: Community-based Summarizer


The feature recognizers are due in class on Wed, Oct 11; they should be uploaded to the wiki (shown below). The classifier and the evaluator are due before class on Monday, Oct 16. This code should also be uploaded to the wiki. Only the people doing the evaluation will have access to the test data.

We'll put it all together in class on Mon, Oct 16.

Together we're going to write the components for a text summarizer, and put the components together in class. Here are the pieces that need to get done:
  1. Feature recognizers, each of which reads in the input file and produces an output file (many students can work on this part).
  2. A Naive Bayes classifier that takes the feature files and learns the probabilities, and also reports scores for the sentences in the test set.
  3. Code that evaluates the results of the classifier on the test collection (a sketch follows this list).
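
For the evaluator, something along these lines would work. This is only a minimal sketch; the function name and the data shapes (sets of (doc, sentence) pairs) are my assumptions, not a required interface:

    # selected, gold: sets of (doc_number, sentence_number) pairs.
    # Returns precision, recall, and F1 for the extracted sentences.
    def evaluate(selected, gold):
        tp = len(selected & gold)
        precision = tp / len(selected) if selected else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1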

Some students will write feature recognizers. Each recognizer reads the input file and produces as output another file in the following tab-separated format:
    DocNumber SentNumber FeatureCount
The feature count is a score or weight indicating how often the feature occurs in the sentence, if at all.
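
As a rough sketch of what a recognizer might look like (the names write_features and cap_words are just illustrations, and the assumption that each segmented file holds one document is mine):

    # Emit one tab-separated line per sentence: DocNumber, SentNumber,
    # FeatureCount. feature_fn maps a sentence string to a numeric score.
    def write_features(doc_number, segmented_path, out, feature_fn):
        with open(segmented_path) as f:
            for line in f:
                # Split on the first two tabs only, since the sentence
                # text itself may contain tabs.
                sent_num, _, words = line.rstrip("\n").split("\t", 2)
                out.write(f"{doc_number}\t{sent_num}\t{feature_fn(words)}\n")

    # Example feature: number of capitalized words in the sentence.
    def cap_words(sentence):
        return sum(1 for w in sentence.split() if w[:1].isupper())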

The Bayesian reasoner will have to compute the probabilities for each feature from the training set, and then use those probabilities to score the sentences in the test collection.
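
As a sketch of one way to do this (treating each feature as binary and using add-one smoothing; the names and data shapes here are assumptions, not a spec, and it presumes both classes occur in the training data):

    import math

    # labels: {(doc, sent): True/False}, True = sentence is in the extract.
    # features: {feature_name: {(doc, sent): count}} from the recognizers.
    def train(features, labels):
        n_pos = sum(1 for v in labels.values() if v)
        n_neg = len(labels) - n_pos
        prior = n_pos / len(labels)
        probs = {}
        for fname, counts in features.items():
            pos = sum(1 for k, lab in labels.items()
                      if lab and counts.get(k, 0) > 0)
            neg = sum(1 for k, lab in labels.items()
                      if not lab and counts.get(k, 0) > 0)
            # Add-one smoothing so unseen feature/label combinations
            # don't zero out the whole product.
            probs[fname] = ((pos + 1) / (n_pos + 2),   # P(f | in extract)
                            (neg + 1) / (n_neg + 2))   # P(f | not in extract)
        return prior, probs

    # Log-odds that test sentence `key` belongs in the extract.
    def score(key, features, prior, probs):
        log_odds = math.log(prior / (1 - prior))
        for fname, (p_pos, p_neg) in probs.items():
            if features[fname].get(key, 0) > 0:
                log_odds += math.log(p_pos / p_neg)
            else:
                log_odds += math.log((1 - p_pos) / (1 - p_neg))
        return log_odds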

I downloaded some DUC data and wrote code to massage it into an easy form for us to use. We are only looking at extracts taken verbatim from the documents (which is why there isn't much data; most of the DUC data has hand-generated abstracts).

There are 100 training docs and 43 testing docs, but most of you will only see the training data. It contains files in two formats.

In the ``annotation'' format, the original text markup is retained, and the boundaries of candidate extract sentences are marked with annotations placed inline in the text. The annotations include sentence numbers (beginning at 1) and whether or not the sentence is included in the gold-standard extract. For example,
    President Bush on Monday nominated Clarence Thomas, a conservative Republican with a controversial record on civil rights, to replace retiring Justice Thurgood Marshall on the Supreme Court.
In the ``segmented'' format, the original markup is removed and only those sentences that are candidates for inclusion are presented. Each line of the file is tab-separated, of the format:

    sentence_number is_in_extract? sentence_words
The sentence words are not tokenized and may contain tabs.
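
A parsing sketch for this format (the assumption that is_in_extract? is encoded as "1"/"0" is mine; check the actual files):

    # Parse one segmented-format file into (sent_num, in_extract, words).
    def read_segmented(path):
        sentences = []
        with open(path) as f:
            for line in f:
                # maxsplit=2: only the first two tabs are delimiters,
                # since the sentence text may itself contain tabs.
                sent_num, flag, words = line.rstrip("\n").split("\t", 2)
                sentences.append((int(sent_num), flag == "1", words))
        return sentences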

I have set up a wiki so people can sign up for the different tasks; we'll post the code there when it is due. The wiki is called anlp06_a3. The URL is http://123.writeboard.com/4377b62f9ab431267/login and the password is the name of the email list for this course.

I have signed agreements stating that we will keep the data confidential. Thus the data is NOT TO BE REDISTRIBUTED. After the assignment is done, I want everyone to DELETE THE DATA from their filesystems. Anyone who cannot or does not agree to these conditions can do an alternative assignment.

In class on Monday Oct 16th we'll run the code and see how it works.


Part B: Practice with Shallow Parser


This part is due Monday, Oct 16.

In the previous assignment, we looked at Amazon.com's concordance feature. They have some other text statistics as well, including Statistically Improbable Phrases (SIPs) and Capitalized Phrases (CAPs), and various readability and complexity scores, one of which is the Gunning fog index (see the definition of the index at Wikipedia). See the Amazon statistics run on de Tocqueville's Democracy in America.

Your job is to write code to compute the Gunning fog index, and test it on stretches of text from several different kinds of articles, e.g., a Wikipedia article, an academic research paper, and a blog post. Be sure to tell me about the results in your writeup.
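
For reference, the standard formulation of the index (this matches the Wikipedia definition) is:

    0.4 * [ (words / sentences) + 100 * (complex words / words) ]

where a ``complex'' word is, roughly, one of three or more syllables, not counting proper nouns (Wikipedia lists further exclusions); that exclusion is why the tagger and chunker mentioned below come in.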

Use a stemmer, preferably the one I showed in class using WordNet, to normalize the inflectional morphological variants. Use a tagger and chunker to find and exclude the proper nouns. There are heuristics for counting syllables in English; the standard one is that every vowel that is not adjacent to another vowel starts a new syllable. However, there are exceptions: for example, a final "e" is usually not a new syllable, unless it is preceded by an "l". Your syllable recognizer doesn't have to be perfect, but it should work decently well.
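
Here is a minimal sketch of the syllable heuristic just described, plus the index computation; count_syllables and gunning_fog are hypothetical names, and the code assumes tokenized sentences with proper nouns already filtered out:

    import re

    # Each maximal run of vowels counts as one syllable, with the
    # final-"e" exception described above. Heuristic only.
    def count_syllables(word):
        word = word.lower()
        count = len(re.findall(r"[aeiouy]+", word))
        # A final "e" is usually silent ("make"), unless preceded
        # by "l" ("little").
        if word.endswith("e") and not word.endswith("le") and count > 1:
            count -= 1
        return max(count, 1)

    # sentences: list of lists of word tokens, with proper nouns
    # already removed (e.g., via the tagger/chunker suggested above).
    def gunning_fog(sentences):
        words = [w for sent in sentences for w in sent]
        if not sentences or not words:
            return 0.0
        complex_words = [w for w in words if count_syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100.0 * len(complex_words) / len(words))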

Use a chunker to separate the text into noun phrases and verb phrases (as defined for shallow parsers). Does the length of the noun phrases correlate at all with the Gunning fog index in the documents that you tested?
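
If you use NLTK, a regular-expression chunker is one easy way to get shallow noun phrases; the grammar below is only illustrative, not a definitive NP/VP definition, and np_lengths is a hypothetical helper:

    import nltk

    # Simple shallow-parse patterns over Penn Treebank tags.
    grammar = r"""
      NP: {<DT|PRP\$>?<JJ.*>*<NN.*>+}
      VP: {<MD>?<VB.*>+<RB.*>*}
    """
    chunker = nltk.RegexpParser(grammar)

    # Return the lengths (in words) of the NP chunks found in text,
    # for correlating against the fog index.
    def np_lengths(text):
        lengths = []
        for sent in nltk.sent_tokenize(text):
            tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(sent)))
            for subtree in tree.subtrees(lambda t: t.label() == "NP"):
                lengths.append(len(subtree.leaves()))
        return lengths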