Table of Contents
Introduction to Content Analysis
Topics for Today
Content Analysis
Techniques for Content Analysis
Text Processing
Stemming and Morphological Analysis
Automated Methods
Errors Generated by Porter Stemmer (Krovetz 93)
Statistical Properties of Text
Zipf Distribution
What Kinds of Data Exhibit a Zipf Distribution?
Housing Listing Frequency Data
Words that occur few times (housing listings)
Medium and very frequent words (housing listings)
A More Standard Collection
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
Statistical Independence vs. Dependence
Statistical Independence
Lexical Associations
Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)
Document Vectors
Document Vectors
Documents in 3D Space
Documents and Query in 3D Space
Summary
Author: hearst
Email: hearst@sims.berkeley.edu
Home Page: http://sims.berkeley.edu/~hearst