Assignment 4: Enron Email Corpus
Part I Due Monday Nov 8th
This assignment has two parts. In the first part, everyone will help create a
category scheme for the collection and will then annotate a small set of emails
with those categories. This will be done in pairs, so that each message is
annotated by at least two people. We'll do much of this work in class; all
annotations must be done within a week, by Monday, Nov 8.
The Enron Email Search Interface.
The annotated dataset is now available:
This compressed TAR file can also be processed with WinZip.
Inside are directories 1, 2, 3, 4, 5, 6, 7, 8, corresponding to the
coarse genres (top-level category 1). The messages in each directory
were assigned to the corresponding coarse genre. Each message is
identified by a numeric ID, which matches the "database ID" from the Web
interface. For each message you will find a ".txt" file, which contains the
raw text of the message, including original headers, and a ".cats" file, which
contains one line of the form "n1,n2,n3" for each category the message is
assigned to. n1 is the top-level category number; n2 is the
second-level category number; and n3 is the number of times this
message was assigned to this category. This file format is also
described briefly in the file categories.txt, which lists the
categories for you.
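As a rough illustration of the ".cats" format described above, here is a minimal sketch of a reader. It assumes each non-blank line holds exactly three comma-separated integers; the names parse_cats and CategoryLabel are made up for this example and are not part of any posted code. If categories.txt says something different, trust that file.

```python
from collections import namedtuple

# Hypothetical field names; the assignment text calls these n1, n2, n3.
CategoryLabel = namedtuple("CategoryLabel", ["top_level", "second_level", "count"])


def parse_cats(text):
    """Parse the contents of a .cats file into a list of CategoryLabel tuples."""
    labels = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: every line is "n1,n2,n3" with integer fields.
        n1, n2, n3 = [int(field) for field in line.split(",")]
        labels.append(CategoryLabel(n1, n2, n3))
    return labels


# Made-up sample contents, only to show the shape of the result.
sample = "1,1,2\n3,6,1\n"
labels = parse_cats(sample)
```

To use it on a real file, read the file's contents into a string and pass that string to parse_cats.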
All the files have unique names, so you can move them all into one
directory if you would prefer to work with them that way instead of in
Part II Due Friday Nov 19th
In the second part of the assignment, you will choose a computational
analysis to carry out using the Enron email collection as the corpus. A set of suggestions
is listed below. This is not meant to be a huge project, which is why it is due
in 2.5 weeks. If you become intrigued by your
efforts on this assignment, you can continue working on it as your class project.
You may work on this in pairs if you like.
For several of these ideas you'll need to download a subset of documents.
We've provided a mechanism for doing this: you can save the results of a search
into a zipped file.
Andrew Fiore has posted code for parsing individual messages: enronEmail.py
To use it, specify a filename as a string with the full path of the message you
want to read and then call the parsing code:
>>> (headers, body) = enronEmail.parse_email(filepath)
After the call, headers will be a key-value dictionary with the email
header names ('Subject', 'From', etc.) as keys and the contents of
those headers as values. body will be a string containing the body of
the message with line breaks preserved.
To read many messages, you'll have to loop over a list of files on
disk. Try the os and os.path built-in Python modules. You can use
os.chdir() to change directories, os.listdir() to list the contents,
and os.path.walk() (or os.walk(), available since Python 2.3; note that
os.path.walk() was removed in Python 3) to go through a
hierarchy of directories and files.
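A minimal sketch of such a loop, using os.walk(), might look like the following. The function names find_message_files and cats_path_for are invented for this example, and it assumes that a message's ".cats" file sits next to its ".txt" file under the same numeric name.

```python
import os


def find_message_files(root_dir):
    """Return sorted paths of all .txt files under root_dir, including subdirectories."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(".txt"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def cats_path_for(txt_path):
    """Return the path of the .cats file assumed to sit beside a .txt message file."""
    base, _ext = os.path.splitext(txt_path)
    return base + ".cats"
```

Each path returned by find_message_files can then be passed to enronEmail.parse_email as shown earlier.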
- Text classification. Use the emails that we have labeled as a training/testing set
for classification. Experiment with different feature sets and learning
algorithms, keeping in mind what we learned from Assignment 3. You may want to
classify on only a subset of the category types, such as document genre, to
make the task more feasible.
- Acronym Dictionary Creation. Automatically create an acronym dictionary for the collection.
By this I mean identify both the acronyms and their definitions, e.g.:
RSP: rate stabilization plan, FERC: Federal Energy Regulatory Commission.
You may be able to make use of the
Schwartz & Hearst
- Named Entity Dictionary Creation. Automatically create a list of names of people,
organizations, document types, and/or projects. There are quite a variety of
these in the collection, e.g.:
  - Government actors: California Senator Dianne Feinstein, California Attorney General Bill Lockyer
  - Document types: Advice Letter 2057-E, Rate Schedules S, E-19, E-20 and E-25
  - Government agencies: CAISO, CPUC
This is a somewhat harder task than the others; you can use a combination of
regular expression pattern matching and machine learning techniques. Toolkits
include GATE, Mallet and Minorthird, but I think these are hard to learn and you might want
to augment the tools we have already worked with instead.
- Cluster the Collection. Use a clustering algorithm, such as the one
supplied with Weka, to try to organize the documents into meaningful groups.
The goal is to help us understand something about the contents of the
collection, giving us a useful, big-picture view.
Clustering can be compute- and memory-intensive, so only try this if you can run
it on an appropriately equipped machine. You'll need to do a lot of feature
reduction to make the clustering work. It is probably best to cluster the
results of a search rather than the entire collection, both because of
computation limitations and for better coherence of the results. You'll also
need to experiment with different numbers (and hence sizes) of clusters. If
available, you may want to use a clustering algorithm that allows "soft"
clusters, meaning a document can be assigned to more than one cluster.
In your assessment of the clusters, be sure to do a good job
of characterizing their contents and whether or not they appear
meaningful. Weka's visualization facilities will help with understanding the clusters.
Here is a helpful guide to Weka clustering.
- Social network analysis. Do some kind of social network analysis of
the connections between people, organizations and/or concepts in the collection.
- Your idea. If you have some other idea, please run it past me first.
Keep in mind that you'll be able to choose your own project and collection for
the final project.
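As one possible starting point for the acronym dictionary suggestion above, here is a minimal sketch of a simple heuristic. It is not the Schwartz & Hearst method. It looks only for an uppercase acronym in parentheses and checks whether the initials of the words just before it spell the acronym. The function name find_acronyms, the 200-character look-back window, and the sample sentence are all made up for this example.

```python
import re

# Matches a parenthesized run of 2 to 6 capital letters, e.g. "(FERC)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,6})\)")


def find_acronyms(text):
    """Return a dict mapping each acronym to a candidate definition.

    Heuristic only: accepts "long form (ACRONYM)" when the first letters of
    the preceding words, one word per acronym letter, spell the acronym.
    """
    found = {}
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # Look back over a short window of text before the parenthesis.
        window = text[max(0, match.start() - 200):match.start()]
        words = re.findall(r"[A-Za-z]+", window)
        candidate = words[-len(acronym):]
        if len(candidate) < len(acronym):
            continue  # not enough preceding words to compare
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            # Keep the first definition seen for each acronym.
            found.setdefault(acronym, " ".join(candidate))
    return found


# Invented sentence built from the two example acronyms given earlier.
sample = ("The Federal Energy Regulatory Commission (FERC) approved the "
          "rate stabilization plan (RSP) on Monday (see below).")
acronyms = find_acronyms(sample)
```

This will miss acronyms whose definitions skip small words such as "of" or "and", as well as acronyms that are never spelled out next to a parenthesis, so treat it as a baseline to improve on.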