Course Information
Assignment 3
You are encouraged to work in pairs on this assignment, but only if
you work together. If you work with someone else, please
include a description of who did what, and what you did together, in
the writeup.
Part A: Community-based Summarizer
Feature code and documentation.
The feature recognizers are due in class on Wed, Oct 11; they should be uploaded to
the wiki (shown below). The classifier and the
evaluator are due before class on Monday, Oct 16. This code
should also be uploaded to the wiki. Only the people doing the
evaluation will have access to the test data.
We'll put it all together in class on Mon, Oct 16.
Together we're going to write the components for a text summarizer,
and put the components together in class. Here are pieces that need
to get done:
- Feature recognizers, each of which read in the input file and produce an
output file (many students can work on this part).
- Naive Bayes classifier that takes the feature files and learns
the probabilities, and also reports scores for the sentences in the
test set.
- Code that evaluates the results of the classifier on the test
collection.
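For the evaluation piece, one simple option is sentence-level precision, recall, and F1 against the gold-standard extract. The metric choice here is my suggestion, not something the assignment mandates; a minimal sketch:

```python
def evaluate_extract(predicted, gold):
    """Precision/recall/F1 for extract selection.

    predicted, gold: collections of (doc_number, sent_number) pairs
    marking the sentences chosen for (or belonging in) the extract.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                      # correctly selected sentences
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Averaging these per document rather than over the whole pool is a reasonable variant; either way, report which you used in the writeup.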
Some students will write feature recognizers. These students will read
the input file and produce as output another file of the format (with
tab separators):
DocNumber SentNumber FeatureCount
The feature count is a score or weight indicating how often the
feature occurs in the sentence, if at all.
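To illustrate the format above, here is a toy feature recognizer. The particular feature (counting numeric tokens in a sentence) is just a placeholder; your recognizers will compute whatever feature you sign up for.

```python
import re

def numeric_token_feature(text):
    # Placeholder feature: number of numeric tokens in the sentence.
    return len(re.findall(r"\d+", text))

def feature_lines(sentences, feature=numeric_token_feature):
    """sentences: iterable of (doc_number, sent_number, text) triples.

    Yields one tab-separated line per sentence in the required format:
    DocNumber <tab> SentNumber <tab> FeatureCount
    """
    for doc, sent, text in sentences:
        yield f"{doc}\t{sent}\t{feature(text)}"
```

Writing the yielded lines to a file, one per line, produces the feature file described above.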
The Bayesian reasoner will have to compute the probabilities for each
feature for the training set, and then use the output of this as input
for the test collection.
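One way to structure the Bayesian reasoner is to binarize each feature (present/absent) and estimate, with add-one smoothing, how often it is present in extract versus non-extract sentences. Binarizing the counts is my simplification; you may well want to use the counts themselves. A sketch under that assumption:

```python
import math

def train_nb(labels, feature_files):
    """labels: {(doc, sent): bool} -- True if the sentence is in the gold extract.
    feature_files: {feature_name: {(doc, sent): count}} -- one dict per recognizer.

    Returns log priors and, per feature, log P(present|class) / P(absent|class),
    estimated with add-one smoothing. Assumes both classes occur in the data.
    """
    n = len(labels)
    n_pos = sum(labels.values())
    priors = {True: math.log(n_pos / n), False: math.log((n - n_pos) / n)}
    likelihoods = {}
    for name, counts in feature_files.items():
        ll = {}
        for cls in (True, False):
            n_cls = n_pos if cls else n - n_pos
            present = sum(1 for key, lab in labels.items()
                          if lab == cls and counts.get(key, 0) > 0)
            p = (present + 1) / (n_cls + 2)  # add-one smoothing
            ll[cls] = (math.log(p), math.log(1 - p))  # (present, absent)
        likelihoods[name] = ll
    return priors, likelihoods

def score(key, priors, likelihoods, feature_files):
    """Log-odds that sentence `key` belongs in the extract."""
    s = priors[True] - priors[False]
    for name, ll in likelihoods.items():
        idx = 0 if feature_files[name].get(key, 0) > 0 else 1
        s += ll[True][idx] - ll[False][idx]
    return s
```

The scores (log-odds) are what the classifier reports for the test-set sentences; ranking by score and taking the top few per document gives an extract.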
I have downloaded some DUC data and written code to massage it into an
easy form for us to use. We are only looking at extracts taken
verbatim from the documents (which is why there isn't much data; most
of the data has hand-generated abstracts).
There are 100 training docs and 43 testing docs, but most of you will
only see the training data.
It contains files in two formats.
In the ``annotation'' format, the original text markup is retained,
and the boundaries of candidate extraction sentences are marked with
annotations placed inline in the text. Each annotation gives the
sentence number (starting at 1) and whether or not the sentence is
included in the gold-standard extract. For example,
President Bush on Monday nominated
Clarence Thomas, a conservative Republican with a controversial record
on civil rights, to replace retiring Justice Thurgood Marshall on the
Supreme Court.
In the ``segmented'' format, the original markup is removed and only
those sentences that are candidates for inclusion are presented. Each
line of the file is tab-separated, of the format:
sentence_number is_in_extract? sentence_words
The sentence words are not tokenized and may contain tabs.
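Because the sentence text can itself contain tabs, only the first two tabs on a line delimit fields. A sketch of a parser for the segmented format (I'm assuming the is_in_extract? field is encoded as "1"/"0"; check the actual files):

```python
def parse_segmented_line(line):
    """Parse one line of the segmented format:
    sentence_number <tab> is_in_extract? <tab> sentence_words

    Split on at most two tabs, since the sentence words may contain tabs.
    """
    sent_num, in_extract, words = line.rstrip("\n").split("\t", 2)
    return int(sent_num), in_extract == "1", words
```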
I have set up a wiki so people can sign up for different tasks. We'll
post the code there when it is due. I called the wiki
anlp06_a3. The url is
http://123.writeboard.com/4377b62f9ab431267/login and the
password is the name of the email list for this course.
I have signed agreements stating that we will keep the data confidential;
thus the data is NOT TO BE REDISTRIBUTED. After the assignment is done, I
want everyone to DELETE THE DATA from their filesystems. Anyone who
cannot or does not agree to these conditions can do an alternative assignment.
In class on Monday Oct 16th we'll run the code and see how it works.
Part B: Practice with Shallow Parser
Turn in
second part of assignment here
This part is due Monday, Oct 16.
In the previous assignment, we looked at Amazon.com's concordance
feature. They have some other text statistics as well, including
Statistically Improbable Phrases (SIPs) and Capitalized Phrases
(CAPs), and various readability and complexity scores, one of which is
the Gunning-Fog index (see the definition of the index at
Wikipedia).
See the Amazon statistics run on de Tocqueville's
Democracy in America.
Your job is to write code to compute the Gunning-Fog index, and test it on
stretches of text from several different kinds of articles, e.g., a
Wikipedia article, an academic research paper, and a blog post.
Be sure to tell me about the results in your writeup.
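For reference, the standard formula is 0.4 times the sum of the average sentence length (in words) and the percentage of "complex" words, i.e., words with three or more syllables (with proper nouns and certain other cases excluded). Once you have those three counts, the index itself is a one-liner:

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Gunning-Fog index:
    0.4 * (average sentence length + percentage of complex words),
    where 'complex' words have three or more syllables.
    """
    return 0.4 * (num_words / num_sentences
                  + 100.0 * num_complex_words / num_words)
```

The real work is producing the counts: sentence splitting, syllable counting, and filtering out proper nouns, as described below.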
Use a stemmer, preferably the one I showed in class using
WordNet, to normalize the inflectional morphological variants.
Use a tagger and chunker to find and exclude the proper
nouns. There are heuristics for how to count syllables in English; the
standard one is that every new vowel that is not adjacent to another
vowel is a new syllable. However, there are exceptions, such as that
most of the time, a final "e" is not a new syllable, unless preceded
by an "l". Your syllable recognizer doesn't have to be perfect, but
it should work decently well.
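A direct translation of the heuristic above: count maximal vowel groups, then discount a silent final "e" (except after "l", as in "table"). This deliberately stays rough, as allowed:

```python
import re

def count_syllables(word):
    """Heuristic syllable count: each run of adjacent vowels is one syllable;
    a final 'e' is usually silent ("make"), except when preceded by 'l'
    ("table"). Imperfect by design, per the assignment.
    """
    word = word.lower()
    n = max(1, len(re.findall(r"[aeiouy]+", word)))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1  # drop the silent final 'e'
    return n
```

Words like "agree" will still be undercounted; tell me in the writeup about any systematic errors you notice.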
Use a chunker to separate the text into noun phrases and verb
phrases (as defined for shallow parsers). Does the length of the noun
phrases correlate at all with the Gunning index in the documents that
you tested?
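To answer the correlation question, one simple approach is to compute, per document, the mean noun-phrase length (from your chunker's output) and the Gunning-Fog index, then take the Pearson correlation across documents. A self-contained sketch of the correlation step:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g., per-document mean NP length vs. per-document Gunning-Fog index.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With only a handful of test documents the correlation will be noisy, so report the raw numbers alongside it.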