I 256: Applied Natural Language Processing

   Fall 2006, Prof. Marti Hearst

Course Information

Assignment 1

For Wed, Sep 13:

Tokenizer assignment, due before class on Sep 13.

Sample solutions.

You are encouraged to work in pairs on this assignment.

Assignment: Write a good word tokenizer and sentence boundary recognizer in Python. Design it to work well on the WSJ collection. First decide how you want to define what word tokens are (e.g., should you be combining multi-word proper nouns or not?) and then write the code. You may want to refine your definitions as you run your code and see what things you are getting wrong.
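
To give a sense of the shape of the task, here is a minimal sketch of one possible starting point. It is not a reference solution: the token pattern and the tiny abbreviation set are placeholders you would extend (for example, with the abbreviation list provided below) and refine against the WSJ text.

import re

# A few common abbreviations that should not end a sentence; extend this
# from the abbreviation list provided with the assignment resources.
ABBREVIATIONS = set(['mr.', 'mrs.', 'dr.', 'inc.', 'corp.', 'co.'])

# One alternation per token type, tried in order: abbreviations written with
# internal periods, numbers (commas, decimals, percent signs), hyphenated
# words and contractions, and any remaining single non-space character.
TOKEN_PATTERN = re.compile(r"""
      [A-Za-z]\.(?:[A-Za-z]\.)+         # U.S., N.Y.
    | \d+(?:[,.]\d+)*%?                 # 1,234.56  7%
    | \w+(?:-\w+)*(?:'\w+)?             # co-op, doesn't, company's
    | \S                                # anything else, one character at a time
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

def split_sentences(tokens):
    """Group tokens into sentences, treating . ! ? as boundaries unless the
    previous token looks like a known abbreviation or a single initial."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok in ('.', '!', '?'):
            prev = ''
            if i > 0:
                prev = tokens[i - 1].lower()
            if prev + '.' in ABBREVIATIONS or len(prev) == 1:
                continue   # probably an abbreviation or initial, not a boundary
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

For example, split_sentences(tokenize("Mr. Smith bought 1,200 shares. Prices rose 7%.")) yields two sentences and keeps "Mr." from triggering a false boundary. Notice the design choices buried in the pattern (contractions kept whole, possessives left attached); those are exactly the decisions your write-up should explain and defend.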

The WSJ collection is actually the text from the nltk_lite.corpora treebank corpus. Since that text is already tokenized, tagged, and parsed, it's not in a good format for this assignment. I've written some code to convert it into an untokenized format and placed the text below (I've also included a zip file so you can access it without the browser adding extraneous characters). In my conversion I used only the straight-quote mark for both single and double quotations. Also, this file doesn't seem to contain any parentheses for some reason. It does have square brackets for text inserted by the newspaper.
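
To make clear what that conversion involves (this is not the instructor's script, just a rough illustration): given a treebank sentence as a list of word strings, you essentially join the tokens with spaces and then undo the spacing the join puts before punctuation and clitics.

import re

def detokenize(sentence_tokens):
    # Join tokens, then remove the space that ends up before punctuation
    # and before clitics such as 's and n't.
    text = ' '.join(sentence_tokens)
    text = re.sub(r"\s+([.,;:!?%])", r"\1", text)
    text = re.sub(r"\s+('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text

print(detokenize(['Prices', 'rose', '7', '%', 'yesterday', '.']))
# prints: Prices rose 7% yesterday.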

Feel free to look at the parsed text in the treebank corpus to check your results against it. You may also use the abbreviation list below and any other wordlist or related resource you like.

To turn in: Write up a description of your tokenizer, explaining how the regular expressions and other aspects work. Illustrate the kinds of tokens you get right and what you get wrong, for words and for sentence boundaries.

Resources:


For Aug 30:

Activity: Download and install the needed software (Python version 2.4.3, the IDLE programming environment, and the NLTK-Lite toolkit) onto your laptop and whatever other machines you'll be using. (I've only tested the Windows version.)
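
A quick way to confirm the installation worked (just a suggested check, not part of the assignment): start IDLE and run the lines below. If they print without an ImportError, Python and NLTK-Lite are on your path.

# Sanity check for the installed tools.
import sys
import nltk_lite

print(sys.version)         # should report 2.4.3
print(nltk_lite.__file__)  # shows where the toolkit was installed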