I 256: Applied Natural Language Processing
Fall 2006, Prof. Marti Hearst
Assignment 4: Text Classification

Due Mon November 13 before class. I'd like you to work individually on this assignment. Sample solutions.

In this assignment we will experiment with different features and learning algorithms for classifying newsgroup articles into different topical categories. We'll be using the Weka toolkit for running experiments. Weka is a powerful tool with a huge number of features and capabilities, including tools to help you analyze which items were incorrectly labeled and which features seem particularly important. Download Weka-3.4 from http://www.cs.waikato.ac.nz/ml/weka/

We'll use our python tools to process the data files, select features of interest, and write those features out to files that we'll use as input to Weka. I am supplying you with some code (weka.py) that can get you started.

Evaluation Procedure

We want to realistically simulate the testing process. Therefore, we'll do training and testing on two different groups of newsgroups. First, do training only on the training set. Experiment as much as you want with the classifiers and feature selectors. Before running anything on the test set, you must freeze your code and decide in advance which Weka parameters you're going to use. Write these down and take screenshots of the output on the training set (using cross-validation or dividing the training set into training and development sets). Once everything is set, run your best model and parameters on the test set. Do this only once, and take screenshots of the test results.

Datasets

We'll be using data from the 20 newsgroups collection. To save space, I've organized just a subset of the directories and labeled them a bit more cleanly. I've put them in separate training and testing directories.
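One simple way to carve a development set out of the training data so you can tune parameters without ever touching the test directory is sketched below. This is illustrative only, not part of the supplied weka.py; the function name and the 80/20 split are assumptions.

```python
import random

def split_train_dev(doc_paths, dev_fraction=0.2, seed=0):
    """Reproducibly shuffle document paths and hold out a development set.

    Freeze `seed` (along with every Weka parameter) before running on the
    real test set, so your dev-set results honestly predict test results.
    """
    paths = sorted(doc_paths)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(paths)
    n_dev = max(1, int(len(paths) * dev_fraction))
    return paths[n_dev:], paths[:n_dev]  # (train, dev)

# Example with made-up file names:
train, dev = split_train_dev([f"rec.autos/{i}.txt" for i in range(10)])
```

You would then build one Weka input file from the train portion and evaluate each parameter setting on the dev portion (or use Weka's built-in cross-validation instead).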
It's also available in a gzip'd tar file (5.8M compressed).

Getting Started

It's always a good idea to take a look at the data before you get started, so look at some of the plain text of the different groups to get a feeling for how long the articles are, what format they take, what kind of language they use, and so on.

The code we are giving you creates features that consist of all the words in the text minus the stopwords, which you can (optionally) remove. These features are weighted by their TF values (note: this erroneously said DF earlier). You should consider these features and their weights as a baseline against which to compare different feature/weighting approaches. To compare two approaches, run the same learning algorithms using the different feature sets. You can also use different learning algorithms for the different feature sets, since some learning algorithms may do better with reduced or weighted features than others. You can also experiment with the feature selection methods in Weka rather than doing the selection yourself in the code. You may want to experiment with changing the parameters of some of the Weka learning programs on the training set, and you can use Weka's feature analysis programs to see which features are doing well. This may give you ideas about which features will yield better results.

Ideas for Feature Weighting
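As a concrete starting point, here is a small sketch of the baseline TF weighting described above and two alternative weightings you might contrast with it. This is not the supplied weka.py code; the function names and the tiny stopword list are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list -- the real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def tokenize(text, remove_stopwords=True):
    """Lowercase, split into word tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def tf_weights(tokens):
    """Baseline: weight each word by its raw term frequency."""
    return Counter(tokens)

def binary_weights(tokens):
    """Alternative: 1 if the word occurs in the document, else absent."""
    return {t: 1 for t in tokens}

def tfidf_weights(tokens, doc_freq, n_docs):
    """Alternative: TF x IDF, down-weighting words that are common
    across the whole collection.  `doc_freq` maps each word to the
    number of training documents containing it."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / (1 + doc_freq.get(t, 0)))
            for t, c in tf.items()}
```

Whichever weighting you choose, the resulting word-to-weight vectors get written out as numeric attributes in a Weka input file, one attribute per vocabulary word plus the class attribute.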
To Turn In

This assignment is due Nov 13. You must try at least 2 additional types of features and at least 2 different classifiers in your experiments. You must experiment on the diverse document set and on one homogeneous document set. Tell me which homogeneous set you used -- rec or sci. Turn in a description of which features/feature weightings/classifiers you tried and the accuracy scores for how well the best ones worked. Contrast how things worked with the diverse set vs. the homogeneous set of newsgroups. Describe the results of your experiments:
Did you use Weka for feature selection? Did it help?
Which classifiers helped/hurt -- which setting variations did you try?
Did you try binary or multi-way classification?
How were the results for the two collections different or similar?
How do you think the results could be further improved?