SIMS 290-2: Applied Natural Language Processing


SIMS 290-2: ANLP Assignments

Final Project

Due dates:

Mon Nov 29: Proposal Due
Dec 6 and 8: Class Presentations
Fri Dec 17: Writeup Due

Project Expectations and Grading:

The goal of this project is to allow you to tie together some of the different ideas and skills you've acquired in this class (and elsewhere), and to learn them in more depth by applying them to a topic that interests you. I've deliberately given you a little less than a month for this so that you won't treat it as a huge project. It should be a bit more ambitious than what you did for assignment 4, but not hugely more.

You may (and are encouraged to) work in pairs for the final project. I expect a bit more from people working in pairs than those working individually, but projects done in pairs often turn out better.

The most important aspect of the project is the quality of the work. I also give weight to creativity, both in what you attempt to do (provided it is realistic) and in how you do it. For your writeup, I give as much credit to thoughtful analysis of the data and results as to implementation and algorithms. Your writeup should show me what you've learned by doing the project and how you've incorporated things you've learned from this class and elsewhere. You should be thorough in your descriptions of your algorithms, resources used, and results obtained.

Project Proposal:

The purpose of the proposal is to let me check that what you propose is doable in the time available. Please be as specific as you can about the project goals and the means by which you intend to achieve them. Two to three paragraphs should be enough, but you can write more if you like. In an earlier email I said text only, but if you want to turn in a PDF or doc file, that's ok. If you are working in a pair, be sure to state who is working on the project. If you want feedback earlier, feel free to send me an email with your proposal before the due date (plain text only).

Class Presentation:

The purpose of the class presentation is to let other students know what you're doing and to get their feedback and ideas. The amount of time you'll have to speak depends on how many groups we end up with; you'll be notified of your talk slot by Wed Dec 1. It's best if you prepare a few slides or a webpage and make them accessible from the web so we can switch quickly between speakers (if needed, feel free to send your materials to me and I can make them easily available).

Final Project Suggestions:

These suggestions are entirely optional; you are free to do a project of your own devising.

Improve an Automatic Faceted Hierarchy Creation Tool. Hierarchical faceted classification for search and browse interfaces (a la Flamenco) is "an idea whose time has come" according to Peter Brantley, technical lead of the California Digital Library. However, there is a dearth of tools for helping create such category systems. Emilia Stoica and I have created an algorithm and a tool that makes use of WordNet for both category system creation and for assignment of items to categories. It seems to work well, but work remains to be done. In particular:
- Semantically related categories in WordNet appear in different places and in some cases should be linked together (e.g., different occupations occur in different places; there are many other examples). This paper may have some good ideas: Piek Vossen, "Extending, Trimming and Fusing WordNet for Technical Documents," NAACL 2001.
- When items are assigned to categories, the word senses of the categories should be taken into account. Currently the system doesn't know which sense of "bass" (fish or musical instrument) to use when classifying an item. There is a large literature on word sense disambiguation to draw on here.
If you are interested, we can make the code available to you (it is written in Perl) and also provide you with example output and information about what needs to be improved.
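To make the "bass" problem concrete, here is a minimal sketch of one classic disambiguation idea, a simplified Lesk-style gloss overlap: pick the sense whose definition shares the most words with the item's text. This is only an illustration, not how the existing tool works. The sense glosses, the stopword list, and the function names are all made up for the example; a real version would pull glosses from WordNet.

```python
import re

# Hypothetical sense inventory: hand-written glosses, NOT taken from WordNet.
SENSES = {
    "bass": {
        "fish": "edible freshwater or marine fish caught by anglers in a lake or sea",
        "music": "lowest part in musical harmony played on a guitar or other instrument",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "or", "and", "by", "for", "with", "to"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense of `word` whose gloss overlaps most with `context`."""
    ctx = tokens(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For example, calling `simplified_lesk("bass", "Grilled bass, a fish caught in the lake")` selects the fish sense because the context shares "fish", "caught", and "lake" with that gloss. Ties fall to whichever sense is listed first, which a real system would need to handle more carefully.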
Implement and compare some WordNet similarity algorithms. In particular, it would be interesting to compare the algorithm mentioned in Ramakrishnan '04 with those implemented in the WordNet Similarity package by Ted Pedersen. The hardest part of this project is determining how to evaluate the results. You have to establish what the evaluation metric will be before starting this project.
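As a rough illustration of what "similarity algorithm plus evaluation metric" could look like, the sketch below computes a simple path-based similarity over a tiny hand-made hypernym tree and then scores it against human ratings with Spearman rank correlation. Everything here is an assumption made for the example: the taxonomy, the word pairs, and the human ratings are invented, the measure is a generic path measure rather than any specific published algorithm, and the Spearman formula used assumes there are no tied values.

```python
# Hypothetical toy taxonomy (child -> parent); a real project would use WordNet.
PARENT = {
    "animal": "entity", "artifact": "entity",
    "fish": "animal", "dog": "animal",
    "bass_fish": "fish", "trout": "fish",
    "instrument": "artifact", "guitar": "instrument",
}


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    depth_from_a = {n: i for i, n in enumerate(ancestors(a))}
    for j, n in enumerate(ancestors(b)):
        if n in depth_from_a:
            return depth_from_a[n] + j
    raise ValueError("nodes share no ancestor")


def path_similarity(a, b):
    """Simple path-based similarity: 1 / (1 + path length)."""
    return 1.0 / (1 + path_length(a, b))


def ranks(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Invented word pairs and invented "human" similarity ratings, for illustration only.
PAIRS = [("trout", "bass_fish"), ("trout", "dog"), ("trout", "guitar")]
HUMAN = [0.9, 0.5, 0.1]
SYSTEM = [path_similarity(a, b) for a, b in PAIRS]
```

Here `spearman(SYSTEM, HUMAN)` measures how well the system's ordering of the pairs agrees with the human ordering. Comparing two similarity algorithms then amounts to comparing their correlation scores on the same rated pairs, which is one possible evaluation design among several.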
Create a Negativity/Emotion/Flame Recognizer. I've been told that Google has a text categorizer that recognizes "negative" news or information; it is apparently used as a filter to avoid placing advertisements in tasteless locations. Additionally, there has been work done on automatically recognizing "flames" in newsgroup postings (see E. Spertus, Smokey: Automatic Recognition of Hostile Messages, '97, and Pang et al., Thumbs up?, '02). Finally, there has been a bit of work done on "sentiment" recognition. The project would be to try to use text classification methods to create a recognizer of this sort. The hardest part is collecting the testing/training data, but there is probably a good resource out there.
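One possible starting point for such a recognizer is a multinomial Naive Bayes classifier with add-one smoothing, sketched below. This is just one text classification method among many, chosen here for brevity. The four training messages and the labels "flame" and "ok" are invented for illustration; a real project would need a properly collected and labeled corpus.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy training data; real work needs a real labeled collection.
TRAIN = [
    ("you are an idiot and your post is garbage", "flame"),
    ("what a stupid worthless idea", "flame"),
    ("thanks for the helpful answer", "ok"),
    ("great post and a very useful idea", "ok"),
]


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAIN)
```

With this toy model, `classify(MODEL, "your idea is stupid garbage")` comes out as "flame". With so little data the result says nothing about real accuracy; the interesting work is in the data collection and the error analysis.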
Apply Text Segmentation/Categorization/Clustering/Anchor Text Analysis to Blog Analysis. A big problem with automatically analyzing blogs is that many of them cover a large number of topics. This impedes any effort to automatically build a directory of blogs. Web search engines make heavy use of the anchor text on web page inlinks to determine the main topic of a web page. This may be useful for understanding the different topics that a blog covers (some inlinks probably target individual pages and others the blog author as a whole). Text segmentation algorithms might be useful for determining when topics switch within a blog (although this is probably indicated pretty well by formatting). One idea, though, is to use text clustering methods on different subparts of different blogs to see if some parts are in common with one another. The work from the TDT competition might be helpful here. (Note: I think this would be a pretty difficult project unless you can get a good angle on it.)
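To give a feel for the segmentation idea, here is a minimal lexical cohesion sketch: represent each block of text (for instance, a paragraph or post) as a bag of words, and mark a topic boundary wherever the cosine similarity between adjacent blocks falls below a threshold. This is a heavily simplified illustration, not a full segmentation algorithm. The example blocks and the threshold value of 0.1 are arbitrary assumptions.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words term counts for a block of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topic_boundaries(blocks, threshold=0.1):
    """Return indices i such that a new topic is judged to start at blocks[i]."""
    bags = [bag(b) for b in blocks]
    return [
        i + 1
        for i in range(len(bags) - 1)
        if cosine(bags[i], bags[i + 1]) < threshold
    ]


# Invented example blocks, standing in for consecutive blog paragraphs.
BLOCKS = [
    "the senate vote on the tax bill",
    "tax bill passes senate vote",
    "my sourdough bread recipe",
    "bread recipe needs more flour",
]
```

Here `topic_boundaries(BLOCKS)` reports a single boundary before the third block, where the vocabulary shifts from politics to baking. The same `cosine` function could also serve as the similarity measure for clustering subparts of different blogs.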
Continue analyzing the Enron collection. Do social network analysis, emotion recognition, incorporate results from the acronym recognizers developed for assignment 4, etc.
Create a back-of-the-book indexer. Automatically creating an index for a nonfiction book is still an unsolved problem. It seems, though, that a semi-automated solution should be able to help authors at least see terms that they neglected to include. Some ideas are to find important phrases from each segment, titles, etc., use distributional similarity to group similar items, use WordNet to group similar topics, and find compound nouns and then find differences in their modifiers (e.g., peanut butter, chunky; peanut butter, smooth). (Note: this is a difficult project; I'd expect anyone doing this to only get partway through it.)
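The compound-noun idea can be sketched as follows: given noun phrases that have already been extracted, group them by a shared head and emit inverted index entries such as "peanut butter, chunky". This sketch makes a strong simplifying assumption, namely that the compound head is always the last two words of the phrase; finding the real head is part of the actual problem. The function name and the example phrases are made up for illustration.

```python
from collections import defaultdict


def index_entries(phrases, head_len=2):
    """Group phrases by their final `head_len` words and invert the modifiers.

    Assumes (unrealistically) that the last `head_len` words form the head.
    """
    groups = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) < head_len:
            continue  # too short to contain a head of this length
        head = " ".join(words[-head_len:])
        modifier = " ".join(words[:-head_len])
        groups[head].add(modifier)

    entries = []
    for head in sorted(groups):
        for modifier in sorted(groups[head]):
            entries.append(f"{head}, {modifier}" if modifier else head)
    return entries


# Invented example phrases, standing in for phrases extracted from a book.
PHRASES = [
    "chunky peanut butter",
    "smooth peanut butter",
    "peanut butter",
    "unsalted almond butter",
]
```

Calling `index_entries(PHRASES)` yields entries for "almond butter, unsalted", "peanut butter", "peanut butter, chunky", and "peanut butter, smooth", which is the kind of candidate list an author could review and prune.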