
SIMS 290-2: ANLP Assignments

Assignment 4: Enron Email Corpus

Part I   Due Monday Nov 8th

    This assignment has two parts. In the first part everyone will help create a category scheme for the collection, and then will annotate a small set of emails according to that scheme. This will be done in pairs, so that each message is annotated by at least two people. We'll do much of this work in class; all annotations must be completed within a week, by Monday, Nov 8.

    The Enron Email Search Interface.

    The annotated dataset is now available: enron_with_categories.tar.gz
    This compressed TAR file can also be processed with WinZip.

    Inside are directories 1, 2, 3, 4, 5, 6, 7, 8, corresponding to the coarse genres (top-level category 1). The messages in each directory were assigned to the corresponding coarse genre. Each message is identified by a numeric ID that matches the "database ID" from the Web interface, and comes as two files:

      • a ".txt" file, which contains the raw text of the message, including the original headers, and
      • a ".cats" file, which contains a line of the form "n1,n2,n3" for each category that message is assigned to: n1 is the top-level category number, n2 is the second-level category number, and n3 is the number of times this message was assigned to this category.

    This file format is also described briefly in the file categories.txt, which lists the categories for you. All the files have unique names, so you can move them all into one directory if you would prefer to work with them that way instead of in genre-separated directories.
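    As a minimal sketch, a ".cats" file in the format described above could be read like this (the function name read_cats is my own, not part of the distributed code):

```python
# Sketch: read category assignments from a ".cats" file. Each line is
# "n1,n2,n3": top-level category, second-level category, and the number
# of times the message was assigned to that category.
def read_cats(path):
    cats = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            n1, n2, n3 = (int(x) for x in line.split(','))
            cats.append((n1, n2, n3))
    return cats
```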

Part II Due Friday Nov 19th

    In the second part of the assignment, you choose some kind of computational analysis to do using the Enron email collection as the corpus. A set of suggestions is listed below. This is not meant to be a huge project, which is why it is due in 2.5 weeks. If you become intrigued by your efforts on this assignment you can continue working on it as your class project. You may work on this in pairs if you like.

    For several of these ideas you'll need to download a subset of documents. We've provided a mechanism for doing this: you can save the results of a search into a zipped file. Andrew Fiore has posted code for parsing individual messages: enronEmail.py

    To use it, specify a filename as a string with the full path of the message you want to read and then call the parsing code:

    >>> (headers, body) = enronEmail.parse_email(filepath)

    After the call, headers will be a key-value dictionary with the email header names ('Subject', 'From', etc.) as keys and the contents of those headers as values, and body will be a string containing the body of the message with linebreaks preserved.

    To read many messages, you'll need to loop over the files on disk. Try the os and os.path standard-library Python modules: os.chdir() changes directories, os.listdir() lists the contents of a directory, and os.path.walk() (or os.walk() in Python 2.3) traverses a hierarchy of directories and files.
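    That traversal might be sketched like this, assuming the genre-numbered directory layout described in Part I (the helper name find_messages is my own):

```python
import os

# Sketch: walk a root directory and collect every ".txt" message file,
# pairing each with its ".cats" file when one exists alongside it.
def find_messages(root):
    pairs = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith('.txt'):
                txt = os.path.join(dirpath, name)
                cats = txt[:-4] + '.cats'
                pairs.append((txt, cats if os.path.exists(cats) else None))
    return pairs
```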

    Assignment Suggestions:

    1. Text classification. Use the emails that we have labeled as a training/testing set for classification. Experiment with different feature sets and learning algorithms, keeping in mind what we learned from Assignment 3. You may want to distinguish on only a subset of the category types, such as document genre, to make the task more feasible.
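      One simple starting point for a feature set, in the spirit of Assignment 3, is bag-of-words presence features; this sketch is illustrative only, and the vocabulary argument is a placeholder you would fill however you like:

```python
# Sketch: bag-of-words presence features for a message body. The
# "vocabulary" list is hypothetical; in practice you might use the
# most frequent terms from the training set.
def word_features(body, vocabulary):
    words = set(body.lower().split())
    return {term: (term in words) for term in vocabulary}
```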

    2. Acronym Dictionary Creation. Automatically create an acronym dictionary for the collection. By this I mean both identify acronyms and their definition, e.g.: RSP: rate stabilization plan, FERC: Federal Energy Regulatory Commission. You may be able to make use of the Schwartz & Hearst algorithm and code directly.
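      The core parenthetical heuristic can be sketched roughly as follows; this is far simpler than the full Schwartz & Hearst algorithm (it only handles definitions whose word initials exactly spell the acronym), and the function name is my own:

```python
import re

# Rough sketch of the parenthetical-definition heuristic: when an
# all-caps token appears in parentheses, look back at the preceding
# words for a phrase whose initials spell out the acronym.
def find_acronyms(text):
    found = {}
    for m in re.finditer(r'\(([A-Z]{2,6})\)', text):
        acro = m.group(1)
        words = text[:m.start()].split()
        cand = words[-len(acro):]
        if len(cand) == len(acro) and all(
                w[0].upper() == c for w, c in zip(cand, acro)):
            found[acro] = ' '.join(cand)
    return found
```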

    3. Named Entity Dictionary Creation. Automatically create a list of names of people, organizations, document types, and/or projects. There are quite a variety of these in the collection, e.g.:

      • Government actors: California Senator Diane Feinstein, California Attorney General Bill Lockyer
      • Document types: Advice Letter 2057-E, Rate Schedules S, E-19, E-20 and E-25
      • Government agencies: CAISO, CPUC

      This task is a bit harder than the others; you can use a combination of regular-expression pattern matching and machine learning techniques. Toolkits include GATE, Mallet, and Minorthird, but I think these are hard to learn, and you might want to augment the tools we have already worked with instead.
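      For the pattern-matching side, hypothetical regexes for two of the entity types in the examples above might look like this; real coverage would need many more patterns or a learned model:

```python
import re

# Illustrative regex sketches for two entity patterns seen in the
# collection: advice letters ("Advice Letter 2057-E") and rate
# schedules ("Rate Schedules S", "E-19"). These are assumptions about
# the patterns, not exhaustive rules.
ADVICE_LETTER = re.compile(r'Advice Letter\s+\d+-[A-Z]')
RATE_SCHEDULE = re.compile(r'Rate Schedules?\s+[A-Z]+(?:-\d+)?')
```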

    4. Cluster the Collection. Use a clustering algorithm, such as the one supplied with Weka, to try to organize the documents into meaningful groups. The goal is to help us understand something about the contents of the collection, giving us a useful, big-picture view.

      Clustering can be compute- and memory-intensive, so only try this if you can run it on an appropriately equipped machine, and you'll need to do a lot of feature reduction to make the clustering work. It is probably best to cluster the results of a search rather than the entire collection, both because of computational limitations and for better coherence of the results. You'll also need to experiment with different numbers (and hence sizes) of clusters. If available, you may want to use a clustering algorithm that allows "soft" clusters, meaning a document can be assigned to more than one cluster. In your assessment, be sure to do a good job of characterizing the contents of the clusters and whether or not they appear meaningful. Weka's visualization facilities will help with understanding the clusters.

      Here is a helpful guide to Weka clustering.
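      One crude feature-reduction step, sketched here under my own naming, is to keep only the k terms that occur in the most documents and use those as the clustering features:

```python
from collections import Counter

# Sketch: rank terms by document frequency and keep the top k as
# features for clustering. "docs" is assumed to be a list of message
# bodies as plain strings.
def top_terms(docs, k):
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return [term for term, _ in df.most_common(k)]
```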

    5. Social network analysis. Do some kind of social network analysis of the connections between people, organizations and/or concepts in the collection.
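      A minimal starting point is to tally sender-to-recipient edges from the parsed headers; this sketch assumes 'From' and 'To' keys as produced by enronEmail.parse_email, and that 'To' is a comma-separated address list (an assumption about the data):

```python
from collections import Counter

# Sketch: count sender -> recipient edges across a set of parsed
# header dictionaries. The edge counts form a weighted directed graph
# suitable as input to social-network-analysis tools.
def build_edges(header_dicts):
    edges = Counter()
    for h in header_dicts:
        sender = h.get('From', '').strip().lower()
        for rcpt in h.get('To', '').split(','):
            rcpt = rcpt.strip().lower()
            if sender and rcpt:
                edges[(sender, rcpt)] += 1
    return edges
```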

    6. Your idea. If you have some other idea, please run it past me first. Keep in mind that you'll be able to choose your own project and collection for the final project.