Introduction to Content Analysis

Prof. Marti Hearst
SIMS 202, Lecture 14

Topics for Today
Content Analysis
Techniques for Content Analysis
Text Processing
Stemming and Morphological Analysis
Automated Methods
Errors Generated by Porter Stemmer (Krovetz 93)
Statistical Properties of Text
Zipf Distribution
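
As a rough illustration of the rank-frequency relationship behind a Zipf distribution (frequency roughly proportional to 1/rank, so rank times frequency stays roughly constant), the sketch below counts words and lists them by rank. The function name rank_frequency, the tokenizing regular expression, and the sample sentence are assumptions made for illustration; they are not taken from the lecture, and a text this small will not show the pattern clearly.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]

if __name__ == "__main__":
    sample = "the cat and the dog and the bird"  # hypothetical sample text
    for row in rank_frequency(sample):
        print(row)
```

On a large collection, plotting count against rank on log-log axes is the usual way to check whether the data approximately follow this pattern.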
What Kinds of Data Exhibit a Zipf Distribution?
Housing Listing Frequency Data

Words that occur few times (housing listings)

Medium and very frequent words (housing listings)

A More Standard Collection

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

Statistical Independence vs. Dependence
Statistical Independence
Lexical Associations
Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)
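
A minimal sketch of how a lexical association score can be computed, assuming the association ratio of Church & Hanks: log2 of P(x,y) / (P(x) P(y)). If two words were statistically independent, P(x,y) would equal P(x) P(y) and the score would be 0; a large positive score suggests the words co-occur more often than chance. The function name association_ratio and all counts below are hypothetical and are not figures from the AP corpus; how co-occurrences are counted (for example, within a fixed word window) is left out.

```python
import math

def association_ratio(count_xy, count_x, count_y, n):
    """log2( P(x,y) / (P(x) * P(y)) ), with probabilities estimated from counts.

    count_xy: number of times x and y co-occur (must be > 0)
    count_x, count_y: individual word counts
    n: corpus size in words
    """
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))

if __name__ == "__main__":
    # Hypothetical counts chosen for illustration only.
    print(association_ratio(count_xy=40, count_x=100, count_y=100, n=1000))
    print(association_ratio(count_xy=10, count_x=100, count_y=100, n=1000))
```

In the first call the pair co-occurs four times more often than independence would predict; in the second it co-occurs about as often as independence would predict, so the score is approximately 0.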

Document Vectors
Documents in 3D Space 

Documents and Query in 3D Space 
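
One way to picture documents and a query in a three-term space is to treat each as a vector of term weights and compare directions. The sketch below assumes cosine similarity as the comparison; the slide titles do not state which measure the lecture uses. The names cosine and rank_documents, the three-term vectors, and the query are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def rank_documents(docs, query):
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)

if __name__ == "__main__":
    # Hypothetical weights for three terms (one axis per term).
    docs = {"d1": (2, 0, 1), "d2": (0, 3, 0), "d3": (1, 1, 1)}
    query = (1, 0, 1)
    for name in rank_documents(docs, query):
        print(name, round(cosine(docs[name], query), 3))
```

Because cosine depends only on direction, a long document and a short one with the same term proportions score the same against the query.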

Summary