
SIMS 290-2: ANLP Assignments

Assignment 3: Text Classification

In this assignment we will experiment with different features and learning algorithms for classifying newsgroup articles into different topical categories.

We'll use the Weka toolkit to run experiments, and our Python tools to process the data files, select features of interest, and write those features to files that serve as input to Weka. We are supplying you with some code (weka.py) to get you started. We have also just added some code (sentence_split.py) that splits raw strings into sentences.
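For orientation, the ARFF files Weka reads have a simple structure: a list of attribute declarations followed by one comma-separated row per document. The sketch below is illustrative only; the function name, the feature-dictionary format, and the attribute naming are our assumptions, not the actual weka.py interface.

    def write_arff(path, docs, labels, vocab, relation='newsgroups'):
        """Write a minimal ARFF file (illustrative sketch, not weka.py).

        docs   -- list of {word: weight} dictionaries, one per document
        labels -- list of newsgroup names, parallel to docs
        vocab  -- list of feature words, fixing the attribute order
        """
        out = open(path, 'w')
        out.write('@relation %s\n\n' % relation)
        for w in vocab:
            out.write('@attribute "%s" numeric\n' % w)  # quotes guard odd characters
        names = {}
        for label in labels:
            names[label] = 1                            # collect the distinct classes
        names = names.keys()
        names.sort()
        out.write('@attribute newsgroup {%s}\n\n@data\n' % ','.join(names))
        for i in range(len(docs)):
            row = [str(docs[i].get(w, 0)) for w in vocab]
            out.write(','.join(row + [labels[i]]) + '\n')
        out.close()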

We will classify a subset of the twenty_newsgroups corpus (we use the version that comes with NLTK). Each newsgroup contains nearly 1000 documents. You may train and test on the first 780 of these (you need not use all 780), but DO NOT train or test on the last 200 documents. We will be using these last 200 to compare how people's algorithms did after the assignments are turned in. So you should act as if the last 200 documents are not available even to look at; just leave them alone. You can, however, use any subset of the first 780 for training and testing.
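Here is a minimal sketch of how to enforce this split, where items stands for the ordered list of documents in one newsgroup (how you obtain that list depends on your NLTK version):

    # items: the ordered list of documents for one newsgroup
    usable   = items[:780]   # train and test on any subset of these
    held_out = items[780:]   # the last ~200 documents: do not read these at all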

Weka is a powerful tool with a huge number of features and capabilities. These include tools to help you analyze which items were incorrectly labeled, and which features seem particularly important.

You may want to experiment with changing the parameters of some of the learning algorithms. You can also use Weka's feature analysis tools to see which features are doing well; this may suggest which features will lead to better results.

Datasets

We'll do training and testing on two different groups of newsgroups.

  1. Diverse set. First, everyone should train and test comparing these two newsgroups, which are intended to be quite different from one another and hence easier to get good scores on:
    • rec.motorcycles, sci.space
  2. Homogeneous set. Second, choose from one of the following two sets of newsgroups:
    • rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
    • sci.crypt, sci.electronics, sci.med, sci.space

Getting Started

First, be sure you have the twenty_newsgroups corpus (sometimes called 20_newsgroups) and the stopwords list installed in Python23/nltk (get them from the NLTK data zip file). Second, if you're running on Windows on your own machine, be sure to modify the __init__.py file as described below to get twenty_newsgroups working (we've already made this fix on the lab machines).

It's always a good idea to take a look at the data before you get started, so look at some of the plain text of the different groups to get a feeling for how long they are, what format they take, what kind of language they use, and so on.

The code we are giving you creates features consisting of all the words in the text minus the stopwords, which you can (optionally) remove. These features are weighted by their document frequency (DF) values: the number of training documents each word appears in. Treat this feature set and weighting as the baseline to compare other feature/weighting approaches against. For example, you can check whether stemming improves on this set of features or not. To compare two approaches, run the same learning algorithms on the two feature sets. You can also use different learning algorithms for different feature sets, since some learning algorithms may do better with reduced or weighted features than others.
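For concreteness, here is a minimal sketch of DF-weighted features; the supplied weka.py may differ in details such as tokenization and stopword handling, so treat this only as an illustration of the idea.

    def df_counts(docs, stoplist):
        """Count, for each non-stopword, how many documents it appears in.

        docs -- list of token lists, one per training document
        """
        df = {}
        for words in docs:
            seen = {}
            for w in words:
                if w not in stoplist and w not in seen:
                    seen[w] = 1                 # count each word once per document
                    df[w] = df.get(w, 0) + 1
        return df

    def featurize(words, df):
        """One feature per distinct known word, weighted by its DF value."""
        feats = {}
        for w in words:
            if w in df:
                feats[w] = df[w]
        return feats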

After you've experimented with this feature set, you should then try other kinds of features and weighting strategies; some ideas are suggested below.

Ideas for Features

You should try a subset of these ideas, or introduce your own. (A short sketch of two of them appears after this list.)
  • Tag the words with POS and experiment with excluding certain parts of speech, or with using phrase types that may be of interest, such as noun-noun compounds or sequences of proper nouns. (The corpus is not divided into sentences, but you can use the sentence_split.py code mentioned above.)
  • Use the Porter stemmer or the morphological analyzer (morphy) from WordNet. Converting plural nouns to singular and verbs to their root form may be of particular interest.
  • Some messages contain quoted text, i.e., text repeated from earlier articles posted to the list. It may be useful to detect this quoted text and, for example, exclude it or weight it separately.
  • Similarly, the same person may send multiple messages, so the sender's identity could potentially help in some cases.
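Below is a short sketch of two of these ideas: stripping quoted text and Porter stemming. The PorterStemmer import path shown is the one used by recent NLTK releases; older versions place the stemmer elsewhere, so adjust it for your installation.

    from nltk.stem.porter import PorterStemmer   # path varies across NLTK versions

    stemmer = PorterStemmer()

    def strip_quoted(text):
        """Drop lines that look like quoted text from earlier articles."""
        kept = [line for line in text.split('\n')
                if not line.lstrip().startswith('>')]
        return '\n'.join(kept)

    def stem_tokens(words):
        """Reduce each token to its Porter stem, e.g. 'riding' -> 'ride'."""
        return [stemmer.stem(w.lower()) for w in words]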

Ideas for Feature Weighting

  • Give more weight to features from the subject line.
  • Use tf.idf weighting on words (a sketch of one common variant appears after this list).
  • Give all features the same weight.
  • Some of the articles make use of technical terminology, which may include noun compounds, numbers (or alpha-numeric terms) and/or abbreviations. It may be a good idea to give these features more weight.
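As a sketch, here is one common tf.idf variant (raw term frequency times log inverse document frequency); many other variants exist, and df here is the document-frequency table from the baseline sketch above.

    import math

    def tfidf(words, df, n_docs):
        """weight(w) = tf(w, doc) * log(N / df(w)) -- one common variant."""
        tf = {}
        for w in words:
            if w in df:
                tf[w] = tf.get(w, 0) + 1                 # raw term frequency
        for w in tf.keys():
            tf[w] = tf[w] * math.log(float(n_docs) / df[w])
        return tf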

To Turn In

This assignment is due at 10:30am on Monday Oct 18 (note the extension).
Turn in using this link.

You must try at least 2 different additional types of features and at least 3 different classifiers in your experiments. You must experiment on the diverse set and on one homogeneous document set. Tell us which homogeneous set you used -- rec or sci.

Turn in a description of which features/feature weightings/classifiers you tried and the accuracy scores for how well the best ones worked. Contrast how things worked with the diverse set vs. the homogeneous set of newsgroups. Describe the results of your experiments:
    Which features helped/hurt -- why?
    Did you use Weka for feature selection? Did it help?
    Which classifiers helped/hurt -- what setting variations did you try?
    Did you try binary or multi-way classification?
    How were the results for the two different collections different or similar?
    How do you think the results could be further improved?
Turn in your best models and the corresponding arff files for the test collection, as well as the window that shows the results of running your best classifier on the test set. (You can do this from either the explorer or the experimenter.)

To save models and results buffers, right-click on the model name in the Explorer view; see the illustrative screenshot.

You can turn in 2 models for each document set, but you have to pick them before running on the test data.

You'll need to put everything into a single tar or zip file. Ideally you will have reduced your feature space enough that your arff and model files are not enormous.

Please use the following filename structure for your data files, to make our jobs easier. The "_1" and "_2" suffixes are for your first and second model in each case (you may submit just one if you like). "Output" refers to the output buffer that shows the results of running the model on the test data.

Your Writeup
    yourlastname_writeup.{txt,doc,pdf,html,whatever}

Set1 (the "diverse" groups)

    yourlastname_diverse_test_1.arff
    yourlastname_diverse_model_1.model
    yourlastname_diverse_output_1.txt

    yourlastname_diverse_test_2.arff
    yourlastname_diverse_model_2.model
    yourlastname_diverse_output_2.txt

Set2 (the four "homogeneous" groups)

    yourlastname_rec_model_1.model
    yourlastname_rec_test_1.arff
    yourlastname_rec_output_1.txt

    yourlastname_rec_model_2.model
    yourlastname_rec_test_2.arff
    yourlastname_rec_output_2.txt

    OR

    yourlastname_sci_model_1.model
    yourlastname_sci_test_1.arff
    yourlastname_sci_output_1.txt

    yourlastname_sci_model_2.model
    yourlastname_sci_test_2.arff
    yourlastname_sci_output_2.txt

Information about Weka 3-4

Download Weka-3.4 from http://www.cs.waikato.ac.nz/ml/weka/

Unfortunately, the documentation is somewhat sparse, and much of what exists describes different, mutually incompatible versions of Weka. However, we are providing a resource in the form of Wednesday's lecture notes, and some helpful additional information can be found below. We think you'll mainly want to use the Explorer interface for experimenting with features and learning algorithms, and the Experimenter interface for comparing algorithms.

Technical Notes

  1. Under Windows, in order to make sure Weka has enough memory for some of the more intensive operations, you need to modify the RunWeka.bat file to read something like:
      java -Xms6m -Xmx196m -jar weka.jar
    If this is too much memory for your machine, you can make the -Xmx number smaller. You can give it more memory if your machine can handle it; java -Xmx1000m -jar weka.jar works well if you have a lot of memory.

  2. In order to make twenty_newsgroups work under Windows, you need to make some changes in the file "__init__.py" in
    C:\Python23\Lib\site-packages\nltk\corpus

    You have to change "/" to "\\" as shown below.

    <<< ORIGINAL >>>

    groups = [(ng, ng+'/.*') for ng in '''
        alt.atheism               rec.autos            sci.space
        comp.graphics             rec.motorcycles      soc.religion.christian
        comp.os.ms-windows.misc   rec.sport.baseball   talk.politics.guns
        comp.sys.ibm.pc.hardware  rec.sport.hockey     talk.politics.mideast
        comp.sys.mac.hardware     sci.crypt            talk.politics.misc
        comp.windows.x            sci.electronics      talk.religion.misc
        misc.forsale              sci.med'''.split()]
    twenty_newsgroups = SimpleCorpusReader(
        '20_newsgroups', '20_newsgroups/', '.*/.*', groups,
        description_file='../20_newsgroups.readme')
    del groups

    <<< THE FIX >>>

    groups = [(ng, ng+'\\.*') for ng in '''
        alt.atheism               rec.autos            sci.space
        comp.graphics             rec.motorcycles      soc.religion.christian
        comp.os.ms-windows.misc   rec.sport.baseball   talk.politics.guns
        comp.sys.ibm.pc.hardware  rec.sport.hockey     talk.politics.mideast
        comp.sys.mac.hardware     sci.crypt            talk.politics.misc
        comp.windows.x            sci.electronics      talk.religion.misc
        misc.forsale              sci.med'''.split()]
    twenty_newsgroups = SimpleCorpusReader(
        '20_newsgroups', '20_newsgroups\\', '.*\\.*', groups,
        description_file='..\\20_newsgroups.readme')