SIMS 290-2: ANLP Assignments

Assignment 3: Text Classification

In this assignment we will experiment with different features and learning algorithms for classifying newsgroup articles into topical categories. We'll use the Weka toolkit for running experiments, and our Python tools for processing the data files to select features of interest and write them out to files that serve as input to Weka. We are supplying you with some starter code (weka.py), and we have also just added some code (sentence_split.py) that splits raw strings into sentences.

We will classify a subset of the twenty_newsgroups corpus. (We use the last one listed, which comes with NLTK.) Each newsgroup contains nearly 1000 documents. You may train and test on the first 780 of these (you don't have to use all 780), but DO NOT train or test on the last 200 documents. We will use these last 200 to compare how people's algorithms did after the assignments are turned in, so act as if the last 200 documents are not available even to look at; just leave them alone. You can, however, use any subset of the first 780 for training and testing.

Weka is a powerful tool with a huge number of features and capabilities. These include tools to help you analyze which items were incorrectly labeled and which features seem particularly important. You may want to experiment with changing the parameters of some of the learning programs, and you can use Weka's feature analysis programs to see which features are doing well. This may give you ideas about which features will yield better results.

Datasets

We'll do training and testing on two different groups of newsgroups.
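The 780/200 split described above can be enforced with a small helper so you never accidentally touch the held-out documents. This is a minimal sketch (the function name is ours, not part of the supplied code):

```python
def split_holdout(docs, usable=780, holdout=200):
    """Split a newsgroup's documents into a usable portion and a held-out
    portion. Only the first `usable` documents may be used for training and
    testing; the last `holdout` documents must be left untouched."""
    if len(docs) < usable + holdout:
        raise ValueError("newsgroup has fewer documents than expected")
    return docs[:usable], docs[-holdout:]

# Example: with ~1000 documents per newsgroup, you get the first 780 to
# work with and the last 200 set aside for the instructors' comparison.
docs = ["doc %d" % i for i in range(1000)]
usable_docs, heldout_docs = split_holdout(docs)
```

Any slicing of `usable_docs` into training and test subsets is then fine; `heldout_docs` should simply never be read.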
Getting Started

First, be sure you have twenty_newsgroups (sometimes called 20_newsgroups) and stopwords installed in Python23/nltk (get them from the NLTK data zip file). Second, if you're running on Windows on your own machine, be sure to modify the init file as described below to get twenty_newsgroups working (we've made this fix on the lab machines).

It's always a good idea to look at the data before you get started, so read some of the plain text from the different groups to get a feeling for how long the documents are, what format they take, what kind of language they use, and so on.

The code we are giving you creates features consisting of all the words in the text minus the stopwords, which you can (optionally) remove. These features are weighted by their DF (document frequency) values. Treat this feature set and its weights as a baseline against which to compare other feature/weighting approaches. For example, you can see whether stemming improves on this set of features. To compare two approaches, run the same learning algorithms on the different feature sets. You can also use different learning algorithms for the different feature sets, since some learning algorithms may do better with reduced or weighted features than others. After you've experimented with this feature set, try other kinds of features and weighting strategies; some ideas are suggested below.

Ideas for Features

You should try a subset of these ideas, or introduce your own.
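The baseline described above (all words minus stopwords, weighted by DF) can be reproduced in a few lines. This is a sketch of the idea, not the supplied weka.py code:

```python
from collections import Counter

def df_weighted_features(docs, stopwords=frozenset()):
    """docs: a list of token lists, one per document.
    Returns one {word: weight} dict per document, where each word
    (minus stopwords) is weighted by its DF value: the number of
    documents in the collection that contain it."""
    df = Counter()
    for tokens in docs:
        # Count each word at most once per document.
        df.update(set(t.lower() for t in tokens) - stopwords)
    features = []
    for tokens in docs:
        words = set(t.lower() for t in tokens) - stopwords
        features.append({w: df[w] for w in words})
    return features
```

Passing an empty stopword set keeps all words, which matches the "optionally remove" behavior described above.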
Ideas for Feature Weighting
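One weighting scheme you might try (offered here as an illustration, not necessarily one of the course's suggestions) is TF-IDF, which rewards words that are frequent in a document but rare across the collection. A minimal sketch:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: a list of token lists, one per document.
    Returns one {term: tf-idf} dict per document, using raw term
    frequency and idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    out = []
    for tokens in docs:
        tf = Counter(tokens)
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out
```

Note that a term appearing in every document gets weight 0, which is one way this scheme differs from the DF-weighted baseline.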
To Turn In

This assignment is due at 10:30am on Monday, Oct 18 (note the extension). Turn it in using this link.

You must try at least 2 additional types of features and at least 3 different classifiers in your experiments. You must experiment on the diverse set and on one homogeneous document set; tell us which homogeneous set you used -- rec or sci. Turn in a description of which features, feature weightings, and classifiers you tried, along with the accuracy scores showing how well the best ones worked. Contrast how things worked on the diverse set vs. the homogeneous set of newsgroups. Describe the results of your experiments:
- Did you use Weka for feature selection? Did it help?
- Which classifiers helped/hurt -- what setting variations did you try?
- Did you try binary or multi-way classification?
- How were the results for the two different collections different or similar?
- How do you think the results could be further improved?
yourlastname_diverse_model_1.model
yourlastname_diverse_output_1.txt
yourlastname_diverse_test_2.arff
yourlastname_diverse_model_2.model
yourlastname_diverse_output_2.txt
yourlastname_rec_test_1.arff
yourlastname_rec_output_1.txt
yourlastname_rec_model_2.model
yourlastname_rec_test_2.arff
yourlastname_rec_output_2.txt

OR

yourlastname_sci_model_1.model
yourlastname_sci_test_1.arff
yourlastname_sci_output_1.txt
yourlastname_sci_model_2.model
yourlastname_sci_test_2.arff
yourlastname_sci_output_2.txt

Information about Weka 3-4

Download Weka 3.4 from http://www.cs.waikato.ac.nz/ml/weka/

Unfortunately, the documentation is somewhat sparse and marred by the fact that different pieces of documentation describe different, incompatible versions of Weka. However, we are providing a resource in the form of Wednesday's lecture notes, and some helpful additional information can be found below. We think you'll mainly want to use the Explorer interface for experimenting with features and learning algorithms, and the Experimenter interface for comparing algorithms.
Technical Notes