I 256: Applied Natural Language Processing

   Fall 2006, Prof. Marti Hearst

Course Information

Assignment 4: Text Classification

Due Mon November 13 before class. I'd like you to work individually on this assignment. Sample solutions.

In this assignment we will experiment with different features and learning algorithms for classifying newsgroup articles into different topical categories.

We'll be using the Weka toolkit for running experiments. Weka is a powerful tool with a huge number of features and capabilities. These include tools to help you analyze which items were incorrectly labeled, and which features seem particularly important. Download Weka-3.4 from http://www.cs.waikato.ac.nz/ml/weka/

We'll use our python tools for processing the data files to select features of interest, writing these features out to files that we'll use as input to Weka. I am supplying you with some code (weka.py) that can get you started.

Evaluation Procedure

We want to realistically simulate the testing process. Therefore, we'll do training and testing on two disjoint sets of newsgroup articles. You are to first do training only on the training set. Experiment as much as you want with the classifiers and feature selectors.

Before running it on the test set you must freeze your code and decide in advance which Weka parameters you're going to use. Write these down and take screenshots of the output on the training set (using cross-validation, or by dividing the training set into training and development sets).

Once everything is set, run your best model and parameters on the test set. Do this only once and take screenshots of the results of testing.
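If you go the training/development route rather than cross-validation, the split can be as simple as the following hypothetical helper (not part of weka.py; the 80/20 fraction and fixed seed are just example choices):

```python
import random

def train_dev_split(examples, dev_fraction=0.2, seed=0):
    """Shuffle examples reproducibly, then hold out a development slice."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed => same split every run
    n_dev = int(len(items) * dev_fraction)
    return items[n_dev:], items[:n_dev]  # (train, dev)
```

The fixed seed matters: if the split changes between runs, accuracy numbers on the development set are not comparable across experiments.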

Datasets

We'll be using data from the 20 newsgroups collection. To save space, I've organized just a subset of the directories and labeled them a bit more cleanly. I've put them in separate training and testing directories.
  1. Diverse set. First, everyone should train and test comparing these two newsgroups, which are intended to be quite different from one another and hence easier to get good scores on:
    • rec.motorcycles, sci.space
  2. Homogeneous set. Second, choose from one of the following two sets of newsgroups:
    • rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
    • sci.crypt, sci.electronics, sci.med.original, sci.space

I've put the data and weka.py in a zip file (8.4M compressed).
It's also available in a gzip'd tar file (5.8M compressed).

Getting Started

It's always a good idea to take a look at the data before you get started, so look at some of the plain text of the different groups to get a feeling for how long they are, what format they take, what kind of language they use, and so on.

The code we are giving you creates features consisting of all the words in the text minus the stopwords, which you can (optionally) remove. These features are weighted by their TF (term frequency) values. Treat these features and weights as a baseline against which to compare other feature/weighting approaches. To compare two approaches, run the same learning algorithm on the different feature sets. You can also pair different learning algorithms with different feature sets, since some algorithms do better than others with reduced or weighted features. Finally, you can experiment with Weka's built-in feature selection methods rather than doing the selection yourself in code.
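The baseline just described amounts to something like the sketch below; the tokenization and the tiny stopword list are illustrative assumptions, not the exact behavior of the supplied code:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def tf_features(text, remove_stopwords=True):
    """Bag-of-words features weighted by raw term frequency (TF)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)
```

Anything you build should be compared against this: same classifier, same data, only the features (or their weights) changed.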

You may want to experiment with changing the parameters of some of the Weka learning programs on the training set. You can also use their feature analysis programs to see which features are doing well. This may give you ideas about which features will result in better results.

Ideas for Feature Weighting

  • Give more weight to features from the subject line.
  • Use tf.idf weighting on words.
  • Some of the articles make use of technical terminology, which may include noun compounds, numbers (or alpha-numeric terms) and/or abbreviations. It may be a good idea to give these features more weight.
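As an illustration of the tf.idf idea from the list above, here is one common variant (tf × log(N/df), with raw counts as tf; other weighting schemes exist):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists, one per document.
    Returns one {term: weight} dict per document, weighting each term
    by tf * log(N / df): frequent-in-this-document but rare-overall
    terms get high weight; terms in every document get weight 0."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))             # count each term once per document
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weighted
```

Note that a term appearing in every document (df = N) gets weight log(1) = 0, which is exactly the behavior you want for uninformative words the stopword list missed.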

To Turn In

This assignment is due Nov 13.

You must try at least 2 different additional types of features and at least 2 different classifiers in your experiments. You must experiment on the diverse set and on one homogeneous document set. Tell me which homogeneous set you used -- rec or sci.

Turn in a description of which features/feature weighting/classifiers you tried and the accuracy scores for how well the best ones worked. Contrast how things worked with the diverse set vs. the homogeneous set of newsgroups. Describe the results of your experiments:
    Which features helped/hurt -- why?
    Did you use Weka for feature selection? Did it help?
    Which classifiers helped/hurt -- which setting variations did you try?
    Did you try binary or multi-way classification?
    How were the results for the two different collections different or similar?
    How do you think the results could be further improved?
Turn in screenshot(s) of the window(s) that show the results of running your best classifier(s) on the training set just before freezing your approach. Then run your settings on the test set and take screenshots of that as well. (You can do this from either the Explorer or the Experimenter.) Turn in screenshots for training and testing on both the diverse set and the homogeneous set of your choosing. Include these screenshots in your narrative writeup.

Turn in assignment here