A7. Text Toolkit (due 11/22)

Create a new Assignment Submission Page titled: "A7 - Your Name". Make sure to tag it with the correct assignment tag ("A1", "A2", etc.). You must do this so that we can see your assignment once you submit it. If you forget to tag your assignment, you may receive a late penalty since we will not be able to find your work.

 

Assignment 7: Text Toolkit

Posted: November 15, 2010

Due: November 22, 2010

Author: Bob Glushko, glushko@ischool.berkeley.edu

Lead TA: Julián Limón Núñez (limon@ischool.berkeley.edu)

Course: Information Organization and Retrieval (INFO 202)


Assignment Overview

In this assignment, you will:

  1. Learn how to use a toolkit for text analysis
  2. Use the toolkit to analyze various texts related to the topics of the class
  3. Reflect on your experiences

Deadline

You must submit your work by creating a new assignment submission page before 9 a.m. on Monday, November 22. Late assignments will not be accepted unless you have an exceptionally good excuse.

Submission Requirements

You will submit a document called YourNameA7report.doc. The file will include short answers (50-100 words each) to the 10 reflections in this assignment.

Detailed Instructions

You will use two text tools to analyze a collection of documents, and you will be asked to answer some questions. Please include a short answer (50-100 words) to every question marked as “Reflection” in your assignment submission.

1.      Voyeur. This is a web-based text analysis tool designed to work on text collections.

a.       Go to http://voyeur.hermeneuti.ca/

b.      In the Add Texts section you can put the links to the documents you want to analyze. Paste the following links (one per line). You will probably recognize them: they are five chapters of the IFIOIR book.

http://people.ischool.berkeley.edu/~glushko/IFIOIR/Chapter2-20100908.pdf

http://people.ischool.berkeley.edu/~glushko/IFIOIR/Chapter3-20100909.pdf

http://people.ischool.berkeley.edu/~glushko/IFIOIR/Chapter4-20100917.pdf

http://people.ischool.berkeley.edu/~glushko/IFIOIR/Chapter5-20100917.pdf

http://people.ischool.berkeley.edu/~glushko/IFIOIR/Chapter6-20100917.pdf

c.       Click on reveal.

d.      You will see a number of statistics. In the upper center box, go to Distinctive words. Take note of the five most distinctive words for each document. Why do you think they are distinctive? What do they tell you about the document? Which of them could be used as the “big concepts” of each chapter? Which of them look more accidental? What is the benefit of seeing a term's frequency in one document compared to the other documents in the collection? (Reflection 1)

e.       The upper left box contains the term frequency for all the words in the collection. Select classification and descriptions. In the lower left you will see a graph based on the frequency of these terms in every document. What does this graph tell you? (Reflection 2)

f.        Now click on people and information. The graph will change. What does this graph tell you? (Reflection 3)

g.       Voyeur does not remove stop-words. What would be the effect of stop-word removal? (Reflection 4)

h.      Voyeur does not stem words. What is the effect of this? (Reflection 5)

i.         Would this tool have been of help in Assignment 1 (202 in the news)? Why or why not? (Reflection 6)

j.         Would this tool have been helpful for your midterm study? Why or why not? (Reflection 7)

 

For more information, check out http://hermeneuti.ca/voyeur
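
If you want to see what these statistics mean in practice, here is a minimal Python sketch of term frequency, stop-word removal, and one possible "distinctiveness" score. It is only an illustration under stated assumptions: the stop-word list, the two miniature "chapters", and the ratio score (a word's share of one document divided by its share of the whole collection) are all made up for this example, and Voyeur may compute its Distinctive words differently.

```python
from collections import Counter
import re

# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "are", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text, remove_stop_words=False):
    """Count how often each token occurs, optionally dropping stop words."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)


def distinctive_words(docs, name, top_n=3):
    """Rank the words of one document by an assumed ratio score:
    (share of the document) / (share of the whole collection)."""
    doc_tf = term_frequencies(docs[name])
    all_tf = Counter()
    for text in docs.values():
        all_tf.update(term_frequencies(text))
    doc_total = sum(doc_tf.values())
    all_total = sum(all_tf.values())
    scores = {
        word: (count / doc_total) / (all_tf[word] / all_total)
        for word, count in doc_tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Made-up miniature "chapters", for illustration only.
docs = {
    "chapter_a": "the classification of resources and the classification of things",
    "chapter_b": "the description of resources is the work of people",
}

print(term_frequencies(docs["chapter_a"]).most_common(2))
print(term_frequencies(docs["chapter_a"], remove_stop_words=True).most_common(2))
print(distinctive_words(docs, "chapter_a"))
```

Try running it with and without stop-word removal, and look at which words the ratio score ranks highest. Comparing a word's frequency in one document against the rest of the collection pushes down words that are common everywhere, but it can also promote words that appear in only one document by chance.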

 

2.      Man vs. Machine. This tool compares tagging by humans to the tags automatically extracted by computational algorithms.

Authors: Ram Joshi (MIMS 2012), Arthur Suermondt (MIMS 2012), Daniel Chiang (MIMS 2011), and David Rolnitzky (MIMS 2011)

a.       Go to http://tagfight.appspot.com/

b.      In the URL box you can put the link to the web page you want to analyze. Paste the following link and click on Go:

http://www.shirky.com/writings/ontology_overrated.html

Note: If the graph doesn't work with this URL, do the assignment with the following URL:

http://www.theatlantic.com/doc/194507/bush

c.       You will see a graph showing the number of users who have used a given tag to describe the URL. You can hover over any dot to see detailed information about the tag.

d.      The graph also shows the raw count of the terms in the document. Do you see some overlap between the most frequent words in the text and the tags used by Delicious users? What types of tags can be inferred from the specific content of the text and what types cannot? (Reflection 8)

e.       The second graph shows the distribution of tags if the tags had been run through a stemmer (in this case, the Porter 2 stemmer). What is the effect of stemming on the long tail and the shape of the curve? What kind of benefits/problems does it introduce? (Reflection 9)

f.        In the right column (Top Machine Tags) you will see some entities that were recognized by entity extraction services (these services use dictionaries and NLP techniques to identify named entities like persons, places and organizations in the document). Do you see overlap between the extracted entities and the tags used by humans? What types of tags can be inferred by these services and what types cannot? (Reflection 10)
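
To get a feel for what stemming does to a tag distribution, here is a small Python sketch that merges tag counts by stem and then checks which stems also occur among the frequent terms of a text. Everything in it is an assumption made for illustration: the tag counts and term list are invented, and crude_stem is a deliberately rough suffix stripper, not the Porter 2 stemmer that the tool uses.

```python
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping for illustration; NOT Porter 2."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling, e.g. "tagg" -> "tag".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def merge_by_stem(tag_counts):
    """Add up the user counts of all tags that share the same stem."""
    merged = Counter()
    for tag, count in tag_counts.items():
        merged[crude_stem(tag)] += count
    return merged


def stem_overlap(tags, terms):
    """Return the stems that occur both among the tags and the terms."""
    tag_stems = {crude_stem(t) for t in tags}
    term_stems = {crude_stem(t) for t in terms}
    return sorted(tag_stems & term_stems)


# Invented tag counts (tag -> number of users) and frequent terms.
tag_counts = {
    "ontology": 5, "ontologies": 2,
    "tagging": 4, "tags": 3, "tag": 2,
    "web2.0": 1,
}
frequent_terms = ["ontology", "categories", "links", "tags"]

print(sorted(tag_counts.items(), key=lambda kv: -kv[1]))
print(merge_by_stem(tag_counts).most_common())
print(stem_overlap(tag_counts, frequent_terms))
```

Compare the two printed distributions: variants of the same word collapse into one entry, so the head of the curve grows and the tail gets shorter. A crude stemmer like this one will also merge or mangle words that should stay apart, which is worth keeping in mind when you think about the problems stemming introduces.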

 

Have fun with the assignment and make sure you answer the ten reflections in your assignment submission. We hope you'll find these tools useful for exploring other text collections that you want to analyze.