Course Information
Assignment 1
For Wed, Sep 13:
Tokenizer assignment, due before class on Sep 13.
Sample solutions.
You are encouraged to work in pairs on this assignment.
Assignment: Write a good word tokenizer and sentence boundary recognizer in
Python. Design it to work well on the WSJ collection. First decide
how you want to define what word tokens are (e.g., should you be
combining multi-word proper nouns or not?) and then write the code.
You may want to refine your definitions as you run your code and see
what things you are getting wrong.
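As one possible starting point (not a reference solution), here is a
minimal sketch of a regular-expression word tokenizer. The pattern, the
name TOKEN_RE, and the function tokenize are illustrative assumptions,
and the code is written for a current Python 3 interpreter rather than
the Python version listed for this course. It keeps dotted
abbreviations such as "U.S." and money amounts together, keeps
contractions such as "isn't" as one token, and splits all other
punctuation off one character at a time. It does not handle titles
such as "Mr." or multi-word proper nouns; those are exactly the
definitional choices the assignment asks you to make.

```python
import re

# Illustrative token pattern; alternatives are tried left to right.
TOKEN_RE = re.compile(r"""
    \$?\d+(?:[.,]\d+)*%?          # numbers, money, percentages
  | (?:[A-Za-z]\.){2,}            # dotted abbreviations such as U.S.
  | [A-Za-z]+(?:[-'][A-Za-z]+)*   # words with internal hyphens or apostrophes
  | \S                            # any other non-space character, one at a time
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("The U.S. firm paid $1,200.50."))
# ['The', 'U.S.', 'firm', 'paid', '$1,200.50', '.']
```

Because the alternatives are ordered, moving or adding a branch changes
which definition of "token" wins, so it is worth rerunning the tokenizer
on the WSJ text after each change.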
The WSJ collection is actually the text from the
nltk_lite.corpora treebank corpus. Since that text is already
tokenized, tagged, and parsed, it's not in a good format for this
assignment. I've written some code to convert it into an untokenized
format and placed the text below. I've also included a zip file so you
can access it without the browser adding extraneous characters:
In my conversion I used only the "straight-quote" mark for both single
and double quotations. Also, this file doesn't seem to contain any
parentheses for some reason. It does have square brackets for text
inserted by the newspaper.
Feel free to compare your results against the parsed text in the
treebank corpus. You may also use the abbreviation list below and any
other wordlist or related resource you like.
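To show how an abbreviation list can feed into sentence boundary
recognition, here is a minimal sketch. The small ABBREVIATIONS set is a
hypothetical stand-in for the course list, and the function name
split_sentences and the boundary heuristic are assumptions made for
illustration: a ".", "!" or "?" (optionally followed by a straight
quote) counts as a boundary only if the next word starts with a capital
letter and the word ending at the punctuation is not a known
abbreviation.

```python
import re

# Hypothetical stand-in for the course abbreviation list (lowercased).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "co.", "corp.", "inc.", "u.s."}

# Candidate boundary: end punctuation, an optional closing straight quote,
# then whitespace and a capitalized word (optionally after a quote or "[").
BOUNDARY_RE = re.compile(r'[.!?]["\']?(?=\s+["\'\[]?[A-Z])')


def split_sentences(text):
    """Split text into sentences, skipping periods that end abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.end()
        # The whitespace-delimited word that ends at this punctuation mark.
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # treat the period as part of the abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences(
        "Mr. Vinken joined the board. He is 61 years old! "
        "The U.S. Treasury declined to comment."):
    print(sentence)
```

This heuristic will miss a real boundary when a sentence ends in an
abbreviation (for example "... of Elsevier N.V. He ..." if "n.v." is in
the list), which is one of the error types worth discussing in the
write-up.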
To turn in:
Write up a description of your tokenizer, explaining how the regular
expressions and other aspects work. Illustrate the kinds of tokens
you get right and what you get wrong, for words and for sentence
boundaries.
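One way to collect examples of what the tokenizer gets right and wrong
is to line up its output against a reference token list and print only
the places where they differ. The sketch below uses the standard
library's difflib for the alignment; the function name token_errors and
the two sample token lists are made up for illustration, and how you
obtain the reference tokens from the treebank corpus is left to you.

```python
import difflib


def token_errors(gold, predicted):
    """Return (gold_span, predicted_span) pairs where the token lists differ."""
    matcher = difflib.SequenceMatcher(a=gold, b=predicted, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((gold[i1:i2], predicted[j1:j2]))
    return errors


# Made-up example: the reference keeps "Mr." whole and splits "is" + "n't".
gold = ["Mr.", "Vinken", "is", "n't", "here", "."]
predicted = ["Mr", ".", "Vinken", "isn't", "here", "."]

for gold_span, predicted_span in token_errors(gold, predicted):
    print("expected", gold_span, "but got", predicted_span)
```

Grouping the reported mismatches by type (abbreviations, contractions,
numbers, quotes) gives a natural outline for the write-up.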
Resources:
For Aug 30:
Activity: Download and install the software needed: Python version
2.4.3, the IDLE programming environment, and the NLTK-Lite toolkit.
Install them on your laptop and on whatever other machines you'll be
using. (I've only tested the Windows version.)