Introduction to Content Analysis

10/14/97



Table of Contents

Introduction to Content Analysis

Topics for Today

Content Analysis

Techniques for Content Analysis

Text Processing

Stemming and Morphological Analysis

Automated Methods

Errors Generated by Porter Stemmer (Krovetz 93)

Statistical Properties of Text

Zipf Distribution

What Kinds of Data Exhibit a Zipf Distribution?

Housing Listing Frequency Data

Words that occur few times (housing listings)

Medium and very frequent words (housing listings)

A More Standard Collection

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

Statistical Independence vs. Dependence

Statistical Independence

Lexical Associations

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

Document Vectors

Document Vectors

Documents in 3D Space

Documents and Query in 3D Space

Summary

Author: hearst

Email: hearst@sims.berkeley.edu

Home Page: http://sims.berkeley.edu/~hearst
