Final Project
Due dates:
Mon Nov 29: Proposal Due
Dec 6 and 8: Class Presentations
Fri Dec 17: Writeup Due
Project Expectations and Grading:
The goal of this project is to allow you to tie together some of the different
ideas and skills you've acquired in this class (and elsewhere), and to learn
them in more depth by applying them to a topic that interests you.
I've deliberately given you a little less than a month for this so that you won't
treat it as a huge project. It should be a bit more ambitious than what you did
for assignment 4, but not hugely more.
You may (and are encouraged to) work in pairs for the final project.
I expect a bit more from people
working in pairs than those working individually, but projects done in pairs
often turn out better.
The most important aspect of the project is the quality of the work.
I also give weight to creativity, both
in what you attempt to do (provided it is realistic) and in how you do it.
For your writeup, I give as much credit to thoughtful analysis of the data and
results as to implementation and algorithms. Your writeup should show me what
you've learned by doing the project and how you've incorporated things you've
learned from this class and elsewhere. You should be thorough in your
descriptions of your algorithms, resources used, and results obtained.
Project Proposal:
The purpose of the proposal is to allow me to ensure that you've proposed
something doable in the time available. Please be as specific as you can about
the project goals and the means by which you intend to achieve those goals.
This need only be 2-3 paragraphs, but you can write more if you like.
In an earlier email I said text only, but if you want to turn in a pdf or doc
file, that's ok. Be sure to state who is working on the project if you are
working in pairs. If you want feedback earlier, feel free to send me an email
with your proposal before the due date (plain text only).
Class Presentation:
The purpose of the class presentation is to let other students know what you're
doing and to get their feedback and ideas. The amount of time you'll have to
speak depends on how many groups we end up with; you'll be notified of your talk
slot by Wed Dec 1. Please prepare a few slides or a webpage, ideally
accessible from the web so we can switch quickly between speakers
(if needed, feel free to send your information to me and I can make it easily
available).
Final Project Suggestions:
These suggestions are entirely optional; you are free to do a project of your
own devising.
- Improve an Automatic Faceted Hierarchy Creation Tool. Hierarchical
faceted classification for search and browse interfaces (a la Flamenco) is "an idea whose time has come"
according to Peter Brantley, technical lead of the California Digital Library.
However, there is a dearth of tools for helping create such category systems.
Emilia Stoica and I have created an algorithm and a tool that makes use of
WordNet for both category system creation and assignment of items to categories.
It seems to work well, but work remains to be done. In particular:
- Semantically related categories in WordNet appear in different places and
in some cases should be linked together (e.g., different occupations occur in
different places; there are many other examples). This paper may have some good ideas:
Piek Vossen, Extending, Trimming and Fusing WordNet for Technical Documents, NAACL 2001.
- When items are assigned to categories, the word senses of the categories
should be taken into account. Currently the system doesn't know which sense of
"bass" (fish or musical instrument) to use to classify different items. There
is a big literature on Word Sense Disambiguation that can be tapped into here.
If you are interested, we can make the code available to you (it is written in
Perl) and also provide you with example output and information about what needs
to be improved.
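As a starting point for the word sense problem, here is a minimal sketch of a simplified gloss-overlap (Lesk-style) disambiguator in Python. It is not the method used in the existing Perl tool. The `SENSES` table, its glosses, and the function names are hypothetical stand-ins for real WordNet data; a real version would read senses and glosses from WordNet itself.

```python
# Simplified gloss-overlap (Lesk-style) sense selection.
# SENSES is a hand-made, hypothetical sense inventory, NOT real WordNet data.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "or", "in", "with", "to", "is"}

SENSES = {
    "bass": {
        "fish": "edible freshwater or marine fish caught by anglers in a lake or sea",
        "instrument": "musical instrument with low pitch played in a band or orchestra",
    }
}

def tokens(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return words - STOPWORDS - {""}

def disambiguate(word, context):
    """Pick the sense whose gloss shares the most words with the item text."""
    ctx = tokens(context)
    scores = {sense: len(tokens(gloss) & ctx)
              for sense, gloss in SENSES[word].items()}
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best, scores

sense, _ = disambiguate("bass", "Grilled bass caught fresh from the lake")
print(sense)
```

Raw word overlap is a weak signal on short item descriptions, so this is best treated as a baseline to compare better methods against.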
- Implement and compare some WordNet similarity algorithms.
In particular, it would be interesting to compare the algorithm mentioned in
Ramakrishnan '04 with
those implemented in the
WordNet Similarity
package by Ted Pedersen. The hardest part of this project is determining how to
evaluate the results. You have to establish what the evaluation metric will be
before starting this project.
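One possible evaluation setup, sketched below in Python, is to score word pairs with each similarity measure and then compute the rank correlation between those scores and human similarity ratings. The toy `PARENT` hierarchy, the word pairs, and the ratings are all made up for illustration. The path-based measure shown is a generic one, not the algorithm from Ramakrishnan '04 or any specific measure from the WordNet Similarity package.

```python
from math import sqrt

# Toy is-a hierarchy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "trout": "fish", "fish": "animal",
}

def ancestors(node):
    """Return the path from node up to the root, node first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (1 + edges on the path through the lowest common ancestor)."""
    up_b = {n: i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if n in up_b:
            return 1.0 / (1 + i + up_b[n])
    return 0.0

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks.
    Assumes neither list is constant (otherwise this divides by zero)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

pairs = [("dog", "wolf"), ("dog", "cat"), ("dog", "trout")]
human = [9.0, 6.5, 2.0]  # made-up ratings, for illustration only
system = [path_similarity(a, b) for a, b in pairs]
print(spearman(system, human))
```

Rank correlation only checks that the measure orders pairs the way people do, which sidesteps the fact that different measures produce scores on different scales.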
- Create a Negativity/Emotion/Flame Recognizer. I've been told that
Google has a text categorizer that recognizes "negative" news or information; it
is apparently used as a filter to avoid placing advertisements in tasteless
locations. Additionally, there has been work done on automatically recognizing
"flames" in newsgroup postings (see E. Spertus, Smokey:
Automatic Recognition of Hostile Messages, '97, and
Pang et al., Thumbs up?, '02). Finally, there has been
a bit of work done on "sentiment" recognition. The project would be to try to
use text classification methods to create a recognizer of this sort. The
hardest part is collecting the testing/training data, but there is probably a
good resource out there.
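To make the text classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing in Python. Naive Bayes is just one reasonable baseline, not the method used in the cited work. The class name, the labels, and the four training sentences are invented for illustration; a real project would need a properly collected and labeled corpus.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of docs
        self.vocab = set()

    def train(self, docs):
        for text, label in docs:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus summed log likelihoods of the known words.
            score = log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in text.lower().split():
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][w]
                score += log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy examples; far too few to train a usable recognizer.
TRAINING = [
    ("you are an idiot and your post is garbage", "flame"),
    ("what a stupid worthless argument", "flame"),
    ("thanks for the helpful answer", "ok"),
    ("great post , I learned a lot", "ok"),
]

clf = NaiveBayes()
clf.train(TRAINING)
print(clf.classify("your argument is stupid garbage"))
```

Bag-of-words features miss sarcasm and quoted text, so a simple baseline like this is mainly useful for measuring how much richer features help.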
- Apply Text Segmentation/Categorization/Clustering/Anchor Text Analysis to
Blog Analysis.
A big problem with automatically analyzing blogs is that many
of them cover a large number of topics. This impedes any effort to
automatically build a directory of blogs.
Web search engines make heavy use of the anchor text on web page inlinks to
determine the main topic of a web page. This may be useful for understanding
the different topics that a blog covers (some probably target individual pages
and others the blog author as a whole).
Text segmentation algorithms might be useful for determining when topics switch
within a blog (although this is probably indicated pretty well with
formatting). One idea, though, is to use text clustering methods on different
subparts of different blogs to see if some parts are in common with one another.
The work from the
TDT competition might be helpful here.
(Note: I think this would be a pretty difficult project unless you can get a good
angle on it.)
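As a rough illustration of the segmentation idea, the Python sketch below guesses a topic shift wherever the cosine similarity between adjacent paragraphs' word counts drops below a threshold. It is a much-simplified lexical-cohesion heuristic, not a published segmentation algorithm. The function names, the threshold of 0.1, and the example paragraphs are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_boundaries(paragraphs, threshold=0.1):
    """Return each index i where a topic shift is guessed between
    paragraph i and paragraph i + 1. The threshold is an arbitrary choice."""
    bags = [Counter(p.lower().split()) for p in paragraphs]
    return [i for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]

# Invented example: two posts about code, then two about cooking.
posts = [
    "python code parser bug",
    "parser bug fixed in python",
    "recipe for tomato soup",
    "tomato soup with basil",
]
print(topic_boundaries(posts))
```

The same `cosine` function could serve as the distance measure for clustering subparts across different blogs, though real posts would need stopword removal and term weighting first.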
- Continue analyzing the Enron collection. Do social network
analysis, emotion recognition, incorporate results from the acronym recognizers
developed for assignment 4, etc.
- Create a back-of-the-book indexer. Automatically creating an index
for a nonfiction book is still an unsolved problem. It seems though that a
semi-automated solution should be able to help authors at least see terms that
they neglected to include. Some ideas are to find important phrases from each
segment, titles, etc., use distributional similarity to group similar items, use
WordNet to group similar topics, find compound nouns and then find differences
in their modifiers (e.g., peanut butter, chunky; peanut butter, smooth).
(Note: this is a difficult project; I'd expect anyone doing this to only get
partway through it.)
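The modifier idea can be sketched in a few lines of Python: given candidate phrases and a list of compound nouns, group each phrase under the compound it ends with and emit inverted index entries. The function name and inputs are hypothetical. The sketch assumes the compound nouns have already been found by some other step, which is the harder part of the problem.

```python
from collections import defaultdict

def index_entries(phrases, compounds):
    """Group phrases under the compound noun they end with and list the
    leading modifiers as inverted entries, e.g. 'peanut butter, chunky'."""
    entries = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        for compound in compounds:
            head = compound.split()
            # Require at least one modifier word before the compound.
            if len(words) > len(head) and words[-len(head):] == head:
                entries[compound].add(" ".join(words[:-len(head)]))
    return [f"{c}, {m}" for c in sorted(entries) for m in sorted(entries[c])]

phrases = ["chunky peanut butter", "smooth peanut butter",
           "peanut butter", "apple butter"]
for entry in index_entries(phrases, ["peanut butter"]):
    print(entry)
```

An author could scan the grouped entries to spot modifiers that deserve their own subentries or that were left out of a hand-built index.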