A9. Text Toolkit (due 11/30)

Assigned: November 16, 2011
Due: November 30, 2011 - 09:00

Create a new Assignment Submission Page titled “A9 - Your Name”. Tag it “A9”.

Assignment Overview

In this assignment, you will:

1.     Learn how to use a toolkit for text analysis

2.     Use the toolkit to analyze various texts related to the topics of the class

3.     Reflect on your experiences

Deadline

You must submit your work by creating a new assignment
submission page before 9 a.m. on Wednesday, November 30. Late assignments will
not be accepted unless you have an exceptionally good excuse.

Submission Requirements

You will submit a PDF file called YourNameA9report.pdf. Be sure that the file
is a PDF. It should include a short answer (50-100 words) to each of the
eight reflections below.

Instructions

1.      Voyeur. This is a web-based text analysis tool
designed to work on text collections.

a.           Skim this description of Voyeur, a web-based text analysis tool:
http://hermeneuti.ca/book/export/html/2.

b.           Go to http://voyeur.hermeneuti.ca/.

c.           We’ve gathered five recent news articles from different sources.
Most of them document the Occupy Wall Street movement in New York City. Paste
the links to these stories into the “Add Texts” section (one per line):

http://people.ischool.berkeley.edu/~pdgoodman/202/nyt.html

http://people.ischool.berkeley.edu/~pdgoodman/202/nypost.html

http://people.ischool.berkeley.edu/~pdgoodman/202/wsj.html

http://people.ischool.berkeley.edu/~pdgoodman/202/watimes.html

http://people.ischool.berkeley.edu/~pdgoodman/202/nytk.html

d.           Click “reveal”.

e.           Look at the Distinctive words in the summary tab. Why do these
words appear on this list? Can you learn anything about an article from these
words? How much about the article can be inferred from the words on this list?
Are there things we can’t infer from the distinctive words? (Reflection 1)

f.             Click on one of the documents linked in blue under the
distinctive word list. On the right you will see the words that appear in that
document, while on the left you see the words that appear in the collection.
What is the benefit of seeing the term frequency compared with the other
documents in the collection? (Reflection 2)

g.           Voyeur does not remove stop words. What would be the effect of
stop-word removal? Looking at the word frequency (count), where would you draw
the line to eliminate stop words? (Reflection 3)

h.           Voyeur does not stem words. What is the effect of this? What would
the frequency distribution list look like if Voyeur stemmed words?
(Reflection 4)

i.             Describe a way you might use this tool to improve your graduate
school life. (Reflection 5)
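For intuition about steps e through h, the ideas can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the three-sentence collection, the tiny STOP_WORDS set, and the naive_stem function are all made up for the example, and TF-IDF is just one common way to score how distinctive a word is. Whether Voyeur computes its Distinctive words list this way is an assumption, not something this handout confirms.

```python
import math
import re
from collections import Counter

# Hypothetical mini-collection standing in for the five news articles.
DOCS = {
    "doc_a": "The protesters marched to the park and the police watched the march",
    "doc_b": "The police cleared the park and the mayor defended the police",
    "doc_c": "The market fell and the banks reported losses as the market closed",
}

# A tiny hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "to", "as", "a", "of", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Toy suffix stripper for illustration. This is NOT the Porter algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "tagg" -> "tag".
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def term_counts(text, remove_stop_words=False, stem=False):
    """Count terms, optionally dropping stop words and stemming first."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return Counter(tokens)


def distinctive_words(docs, name, top_n=3):
    """Rank the words of one document by TF-IDF against the whole collection."""
    counts = {n: term_counts(t) for n, t in docs.items()}
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for c in counts.values() for w in c)
    scores = {
        w: tf * math.log(len(docs) / doc_freq[w])
        for w, tf in counts[name].items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Calling term_counts on a document with and without remove_stop_words and stem shows how the top of the frequency list changes, and distinctive_words shows that a word occurring in every document scores zero no matter how often it appears.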

2.      Man vs. Machine. We’ll
use this tool to compare tagging by humans to the tags automatically extracted
via computational algorithms. This tool was created by Ram Joshi (MIMS 2012),
Arthur Suermondt (MIMS 2012), Daniel Chiang (MIMS 2011), and David Rolnitzky
(MIMS 2011).

a.           Go to http://tagfight.appspot.com/

b.           In the URL box, you can put the link to the web page you want to
analyze. Paste the following link and click on Go:
http://www.shirky.com/writings/ontology_overrated.html

c.           You will see a graph showing the number of users who have used a
given tag to describe the URL. The graph also shows the raw count of the terms
in the document. Do you see some overlap between the most frequent words in the
text and the tags used by Delicious users? What types of tags can be inferred
from the specific content of the text and what types cannot? (Reflection 6)

d.           The second graph shows the distribution of tags if the tags had
been run through a stemmer (in this case, the Porter 2 stemmer). What is the
effect of stemming on the long tail and the shape of the curve? What kind of
benefits and problems does it introduce? (Reflection 7)

e.           The Top Machine Tags section lists entities that were recognized
by entity extraction services (these services use dictionaries and NLP
techniques to identify named entities like persons, places and organizations in
the document). Do you see overlap between the extracted entities and the tags
used by humans? What types of tags can be inferred by these services and what
types cannot? (Reflection 8)
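The comparisons in steps c and d can also be sketched in Python. The sketch below is a rough illustration only: the tag counts and term counts are invented for the example (they are not real Delicious data or real counts from the linked page), and naive_stem is a toy suffix stripper standing in for the Porter 2 stemmer that the tool uses. It shows one way to check which tags also appear among the most frequent terms, and how merging tags by stem changes the number of distinct tags.

```python
from collections import Counter

# Hypothetical tag counts (tag -> number of users); not real Delicious data.
TAG_COUNTS = Counter({
    "tagging": 40, "tags": 25, "tag": 10,
    "ontology": 30, "ontologies": 5,
    "folksonomy": 20, "toread": 8, "classification": 12,
})

# Hypothetical raw term counts from the page text.
TERM_COUNTS = Counter({
    "the": 300, "tags": 35, "ontology": 22, "classification": 18,
    "categories": 15, "tagging": 9,
})


def naive_stem(word):
    """Toy suffix stripper for illustration. This is NOT the Porter 2 algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "tagg" -> "tag".
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def overlap(tags, terms, top_n=5):
    """Return the tags that also appear among the top_n most frequent terms."""
    top_terms = {w for w, _ in terms.most_common(top_n)}
    return sorted(t for t in tags if t in top_terms)


def merge_by_stem(tags):
    """Add up the user counts of all tags that share the same stem."""
    merged = Counter()
    for tag, n in tags.items():
        merged[naive_stem(tag)] += n
    return merged
```

Comparing len(TAG_COUNTS) with len(merge_by_stem(TAG_COUNTS)) gives a quick sense of how stemming shortens the tail of the distribution in this made-up data, while overlap(TAG_COUNTS, TERM_COUNTS) lists the tags that a simple frequency count of the text would also surface.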