I 256: Applied Natural Language Processing

   Fall 2006, Prof. Marti Hearst

Assignment 2

Sample solutions.

For Wed, Sep 27:

Word stats and tagging, due before class on Sep 27.

You are encouraged to work in pairs on this assignment, but only if you work together. If you work with someone else, please include a description of who did what, and what you did together, in the writeup.

Part A

Amazon.com has full-text content for some books, and for some of these it provides what it calls a concordance, shown in the form of a tag cloud. For example, for the book The Social Life of Information, here is the concordance tag cloud. Note that very common words are omitted.

A real concordance lists the words in a document along with their immediate context. The word itself is centered and boldfaced with a whitespace "gutter" on either the left or the right, along with k words of surrounding context (it's ok for the context to cross sentence boundaries).
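The gutter layout described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code shown in class: the function name `concordance` and the parameters `k` and `width` are assumptions made for this example, and plain text cannot boldface the keyword, so it is only set off by the gutters.

```python
def concordance(tokens, target, k=3, width=30):
    """Return one formatted line per occurrence of `target` in `tokens`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target.lower():
            continue
        # k words of context on each side; slices may cross sentence boundaries.
        left = " ".join(tokens[max(0, i - k):i])
        right = " ".join(tokens[i + 1:i + 1 + k])
        # Right-align the left context so every keyword starts in the same
        # column; the two-space gaps on either side act as the gutters.
        lines.append(f"{left[-width:]:>{width}}  {tok}  {right[:width]}")
    return lines


# Invented sample text, for illustration only.
text = "the cat sat on the mat and the dog sat on the cat"
for line in concordance(text.split(), "sat", k=2, width=12):
    print(line)
```

Truncating the left context from its right end (`left[-width:]`) keeps the words nearest the keyword when the context is wider than the column.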

Your job is to find a book whose text is available freely online, and make (1) an Amazon-style concordance, and (2) a real concordance for a word given as input. Feel free to put the results of (1) into a tag cloud visualization (you can use one that is available on the web, or output the results in a standard paragraph style). Be sure to output (2) in the "gutter" format (hint: look at the code we saw in class for outputting the modal statistics).

Hint: you might want to use a stoplist. Here is a sample stoplist.
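One possible way to get the raw material for an Amazon-style concordance is to count word frequencies after dropping stoplist words. The sketch below uses only the standard library; the tiny `STOPLIST`, the tokenizing regular expression, and the function name `top_words` are assumptions made for illustration, and a real stoplist would be much longer.

```python
import re
from collections import Counter

# Hypothetical mini stoplist for illustration; substitute a full one.
STOPLIST = {"the", "a", "of", "and", "to", "in", "is", "on"}


def top_words(text, stoplist=STOPLIST, n=100):
    """Return the n most frequent non-stoplist words with their counts."""
    # Lowercase, then keep runs of letters (with an optional apostrophe part).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stoplist)
    return counts.most_common(n)


# Invented sample sentence, for illustration only.
sample = "The life of information is the social life of the information age."
print(top_words(sample, n=3))
```

The counts returned here could be mapped to font sizes for a tag cloud, or simply printed as a paragraph of words.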

Here is a page pointing to a large collection of online books.

Be sure to describe your code.

Part B

(1) Write code to compute the following for the treebank corpus's part-of-speech tags:
  • (a) Which word has the greatest number of distinct tags?
  • (b) How many words are ambiguous, meaning they appear with at least two tags?
  • (c) What percentage of word occurrences involve these ambiguous words?
  • (d) Which nouns are more common in their plural form than their singular form (only consider regular plurals, formed with the s-suffix)?
(2) In class we computed the likelihood of one word following another. Choose a character from one of the Gutenberg novels and write a program that analyzes the words that follow his or her name. Make use of part-of-speech tags and conditional frequency distributions. Summarize your findings.
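The bookkeeping behind items (a) through (c) and the conditional frequency distribution in (2) can be sketched with standard-library containers. This is a rough sketch on invented toy data, not the corpus readers used in class: the `tagged` list stands in for a tagged corpus, and every variable name is an assumption made for this example.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs standing in for a tagged corpus; invented data.
tagged = [("the", "DT"), ("back", "NN"), ("door", "NN"), ("is", "VBZ"),
          ("back", "RB"), ("to", "TO"), ("back", "VB"), ("the", "DT"),
          ("bill", "NN"), ("Emma", "NNP"), ("was", "VBD"), ("Emma", "NNP"),
          ("smiled", "VBD"), ("Emma", "NNP"), ("and", "CC")]

# Map each (lowercased) word to the set of distinct tags it occurs with.
tags_for = defaultdict(set)
for word, tag in tagged:
    tags_for[word.lower()].add(tag)

# (a) the word with the most distinct tags
most_tags = max(tags_for, key=lambda w: len(tags_for[w]))
# (b) words that appear with at least two tags
ambiguous = {w for w, tags in tags_for.items() if len(tags) >= 2}
# (c) share of word occurrences that involve an ambiguous word
ambiguous_tokens = sum(1 for w, _ in tagged if w.lower() in ambiguous)
percent = 100.0 * ambiguous_tokens / len(tagged)

# Conditional frequency distribution for (2):
# condition = a word, sample = the tag of the token that follows it.
follows = defaultdict(Counter)
for (w1, _), (_, t2) in zip(tagged, tagged[1:]):
    follows[w1][t2] += 1

print(most_tags, len(ambiguous), percent)
print(follows["Emma"].most_common())
```

Storing a `Counter` per condition is one simple way to model a conditional frequency distribution; swapping the sample from the following tag to the following word only changes the inner loop.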

Part C

We saw in class that when we used a bigram tagger that backed off to a unigram tagger which in turn backed off to a default tagger, the bigram tagger did worse than the unigram tagger.

Try to figure out why this is, and try to improve the bigram tagger's accuracy. (Hint: think about the parameters, and think about how the training set is used as well.) Use the error analysis module to help you see what is and isn't working.

After you improve the bigram tagger, see if this makes the trigram tagger work accurately when it backs off to the bigram tagger. If it doesn't work better now, see if you can make it work better.
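To experiment with the backoff chain described in Part C, it can help to see the mechanism in isolation. The class below is a hand-rolled sketch in plain Python, not the tagger classes used in class: the name `BackoffTagger`, the `cutoff` parameter, and the toy training sentences are all assumptions made for illustration, and the default tagger is folded into the last tagger in the chain through its `default` argument.

```python
from collections import Counter, defaultdict


class BackoffTagger:
    """Illustrative n-gram tagger with backoff (hypothetical, stdlib only)."""

    def __init__(self, n, backoff=None, default="NN", cutoff=0):
        self.n = n                # 1 = unigram, 2 = bigram, 3 = trigram
        self.backoff = backoff    # tagger consulted when the context is unseen
        self.default = default    # tag used when nothing else applies
        self.cutoff = cutoff      # keep a context only if its count > cutoff
        self.table = {}

    def _context(self, words, i, prev_tags):
        # Context = the previous n-1 tags plus the current word.
        return (tuple(prev_tags[max(0, i - self.n + 1):i]), words[i])

    def train(self, tagged_sents):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            words = [w for w, _ in sent]
            tags = [t for _, t in sent]
            for i in range(len(sent)):
                counts[self._context(words, i, tags)][tags[i]] += 1
        for ctx, tag_counts in counts.items():
            tag, freq = tag_counts.most_common(1)[0]
            if freq > self.cutoff:
                self.table[ctx] = tag

    def choose(self, words, i, prev_tags):
        tag = self.table.get(self._context(words, i, prev_tags))
        if tag is None and self.backoff is not None:
            return self.backoff.choose(words, i, prev_tags)
        return tag if tag is not None else self.default

    def tag(self, words):
        tags = []
        for i in range(len(words)):
            tags.append(self.choose(words, i, tags))
        return list(zip(words, tags))


# Invented training sentences, for illustration only.
train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
unigram = BackoffTagger(1, default="NN")
bigram = BackoffTagger(2, backoff=unigram)
unigram.train(train)
bigram.train(train)
print(bigram.tag(["the", "cat", "barks", "loudly"]))
```

With a structure like this, the parameters and the training data are easy to vary one at a time while measuring accuracy on held-out sentences.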