SIMS 290-2: Applied Natural Language Processing


SIMS 290-2: ANLP Assignments

Assignment 4: Enron Email Corpus

Part I Due Monday Nov 8th

Monday, Nov 8

The Enron Email Search Interface.

The annotated dataset is now available

enron_with_categories.tar.gz

categories.txt

Part II Due Friday Nov 19th

enronEmail.py

Assignment Suggestions:

Text classification. Use the emails that we have labeled as a training/testing set for classification. Experiment with different feature sets and learning algorithms, keeping in mind what we learned from Assignment 3. You may want to distinguish among only a subset of the category types, such as document genre, to make the task more feasible.
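As a starting point, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with only the Python standard library. The labels "business" and "personal" and the four training messages are invented for illustration; the real labels come from categories.txt, and any of the learning algorithms from Assignment 3 could be substituted.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(labeled_docs):
    """Collect per-label document counts and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability under add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy examples, NOT taken from the Enron collection.
toy_training = [
    ("please review the attached rate schedule filing", "business"),
    ("the commission issued a new tariff order today", "business"),
    ("are we still on for lunch on friday", "personal"),
    ("happy birthday hope you have a great weekend", "personal"),
]
model = train_naive_bayes(toy_training)
prediction = classify("see the attached tariff filing", model)
print(prediction)
```

With real data, hold out part of the labeled set for testing so that the reported accuracy is not measured on the training messages.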
Acronym Dictionary Creation. Automatically create an acronym dictionary for the collection. By this I mean identify both the acronyms and their definitions, e.g.: RSP: rate stabilization plan, FERC: Federal Energy Regulatory Commission. You may be able to make use of the Schwartz & Hearst algorithm and code directly.
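To illustrate the kind of pattern involved, here is a naive initial-letter heuristic for text of the form "long form (ACRONYM)". This is a deliberate simplification and is not the Schwartz & Hearst algorithm, which handles many more cases, such as letters taken from inside words and skipped function words. The function name and the sample sentence are made up for this sketch.

```python
import re

# An acronym candidate: 2 to 10 capital letters inside parentheses.
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,10})\)")

def find_acronym_definitions(text):
    """Return {acronym: long form} for 'long form (ACRONYM)' patterns in text."""
    found = {}
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # All words that appear before the opening parenthesis.
        preceding_words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Take one preceding word per acronym letter.
        candidate = preceding_words[-len(acronym):]
        if len(candidate) < len(acronym):
            continue
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            found[acronym] = " ".join(candidate)
    return found

# Hypothetical sentence built from the two examples above.
sample = ("The rate stabilization plan (RSP) was filed with the "
          "Federal Energy Regulatory Commission (FERC).")
definitions = find_acronym_definitions(sample)
print(definitions)
```

Definitions that skip words, such as "Department of Energy (DOE)", are missed by this heuristic, which is one reason to look at the Schwartz & Hearst approach.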
Named Entity Dictionary Creation. Automatically create a list of names of people, organizations, document types, and/or projects. There are quite a variety of these in the collection, e.g.:
- Government actors: California Senator Dianne Feinstein, California Attorney General Bill Lockyer
- Document types: Advice Letter 2057-E, Rate Schedules S, E-19, E-20 and E-25
- Government agencies: CAISO, CPUC
This task is a bit harder than the others; you can use a combination of regular expression pattern matching and machine learning techniques. Toolkits include GATE, Mallet and Minorthird, but I think these are hard to learn and you might want to augment the tools we have already worked with instead.
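The regular-expression side of this can be sketched as follows. The three patterns are assumptions modeled only on the examples listed above; they are not a complete grammar for the collection, and the all-capitals pattern will also pick up shouted words and other noise that a later filtering or learning step would have to remove.

```python
import re
from collections import Counter

# Hypothetical patterns based on the examples above; real data will need more.
PATTERNS = (
    ("document type", re.compile(r"\bAdvice Letter \d+-[A-Z]\b")),
    ("rate schedule", re.compile(r"\bE-\d+\b")),
    ("agency candidate", re.compile(r"\b[A-Z]{3,6}\b")),
)

def extract_candidates(text):
    """Count (kind, surface string) pairs for every pattern match in text."""
    counts = Counter()
    for kind, pattern in PATTERNS:
        for match in pattern.finditer(text):
            counts[(kind, match.group(0))] += 1
    return counts

# Made-up sentence that reuses the entity names from the list above.
sample = ("Advice Letter 2057-E was filed with the CPUC. "
          "CAISO and the CPUC reviewed Rate Schedules E-19 and E-20.")
candidates = extract_candidates(sample)
for (kind, name), count in sorted(candidates.items()):
    print(kind, name, count)
```

Counting matches across the whole collection gives a frequency-ranked candidate list, which can then be pruned by hand or used as features for a learned classifier.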
Cluster the Collection. Use a clustering algorithm, such as the one supplied with Weka, to try to organize the documents into meaningful groups. The goal is to help us understand something about the contents of the collection, giving us a useful, big-picture view.
Clustering can be compute- and memory-intensive, so only try this if you can run it on an appropriately equipped machine. You'll also need to do a lot of feature reduction to make the clustering work. It is probably best to cluster the results of a search rather than the entire collection, both because of computational limitations and for better coherence of the results. You'll need to experiment with different numbers (and hence sizes) of clusters as well. If available, you may want to use a clustering algorithm that allows "soft" clusters, meaning a document can be assigned to more than one cluster. In your assessment of the clusters, be sure to do a good job of characterizing their contents and whether or not they appear meaningful. Weka's visualization facilities will help with understanding the clusters.
Here is a helpful guide to Weka clustering.
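One common form of feature reduction is to drop terms that appear in too few or too many documents before building the vectors that get clustered. The sketch below shows that idea in plain Python; the thresholds min_df and max_df_ratio, the function names, and the four short documents are all assumptions chosen for illustration, and the resulting vectors would still need to be exported in whatever format your clustering tool expects.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def select_features(documents, min_df=2, max_df_ratio=0.8):
    """Keep terms found in at least min_df documents and at most
    max_df_ratio of all documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))  # count each term once per document
    max_df = max_df_ratio * len(documents)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_df)

def vectorize(text, features):
    """Return the term counts of text, in the order given by features."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in features]

# Hypothetical stand-ins for the messages returned by a search.
documents = [
    "the rate schedule was filed",
    "the rate increase was approved",
    "the meeting schedule changed",
    "the lunch meeting was moved",
]
features = select_features(documents)
vectors = [vectorize(text, features) for text in documents]
print(features)  # ['meeting', 'rate', 'schedule', 'was']
print(vectors[0])  # [0, 1, 1, 1]
```

Here "the" is dropped because it occurs in every document and the one-off words are dropped because they occur in only one, leaving four features out of eleven distinct terms.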
Social network analysis. Do some kind of social network analysis of the connections between people, organizations and/or concepts in the collection.
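One simple way to start is to build a directed graph of who sends mail to whom from the message headers. The sketch below uses Python's standard email package and assumes each message is available as a raw string with ordinary From, To, and Cc headers; the example.com addresses and the function name are invented for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

def sender_recipient_edges(raw_messages):
    """Count directed (sender, recipient) pairs across raw email strings."""
    edges = Counter()
    for raw in raw_messages:
        message = message_from_string(raw)
        senders = [addr.lower()
                   for _, addr in getaddresses(message.get_all("From", []))]
        recipient_headers = message.get_all("To", []) + message.get_all("Cc", [])
        recipients = [addr.lower() for _, addr in getaddresses(recipient_headers)]
        for sender in senders:
            for recipient in recipients:
                # Skip unparsable (empty) addresses and self-addressed mail.
                if sender and recipient and recipient != sender:
                    edges[(sender, recipient)] += 1
    return edges

# Hypothetical messages, NOT taken from the Enron collection.
raw_messages = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: rates\n\nBody",
    "From: alice@example.com\nTo: bob@example.com\nSubject: re\n\nBody",
    "From: bob@example.com\nTo: alice@example.com\nCc: carol@example.com\n\nBody",
]
edges = sender_recipient_edges(raw_messages)
for (sender, recipient), count in sorted(edges.items()):
    print(sender, "->", recipient, count)
```

The edge counts can then be fed into measures such as in-degree and out-degree, or loaded into a graph tool for visualization.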
Your idea. If you have some other idea, please run it past me first. Keep in mind that you'll be able to choose your own project and collection for the final project.