Content Analysis

10/14/97



Table of Contents

Content Analysis

Review

Zipf Distribution

Consequences of Zipf

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

Statistical Independence vs. Dependence

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

Computing Co-occurrence

Document Vectors

Document Vectors

Topics for Today

Documents in 3D Space

Document Similarity

Document Space has High Dimensionality

Text Clustering

Text Clustering

Text Clustering

Pair-wise Document Similarity

Pair-wise Document Similarity (no normalization for simplicity)

Pair-wise Document Similarity (cosine normalization)

Document/Document Matrix

Agglomerative Clustering

Agglomerative Clustering

Agglomerative Clustering

K-Means Clustering

Scatter/Gather

S/G Example: query on “star”

Another use of clustering

Clustering Multi-Dimensional Document Space (image from Wise et al. 95)

Clustering Multi-Dimensional Document Space (image from Wise et al. 95)

Concept “Landscapes”

Clustering

Author: hearst

Email: hearst@sims.berkeley.edu

Home Page: http://sims.berkeley.edu/~hearst
