Content Analysis

Prof. Marti Hearst

SIMS 202, Lecture 15

Review

Zipf Distribution
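
As a rough illustration of the Zipf distribution: if words are ranked by frequency, the product rank x frequency tends to stay roughly constant. The sketch below computes that table for any token list; the function name `rank_frequency` and the toy token list are assumptions made for illustration, not material from the lecture.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


# Hypothetical toy input; a real corpus would have many thousands of tokens.
rows = rank_frequency("the of the a the of".split())
for row in rows:
    print(row)
```

On a real corpus the last column would be expected to vary far less than the frequency column itself, which is the pattern Zipf observed.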

Consequences of Zipf

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

Statistical Independence vs. Dependence

 

Interesting Associations with "Doctor"
(AP Corpus, N=15 million, Church & Hanks 89)

 

Computing Co-occurrence
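
One common way to score co-occurrence is an association ratio of the kind used by Church & Hanks: log2 of the observed joint probability over the product expected under independence. The sketch below is a simplification that counts two words as co-occurring when they appear in the same document (Church & Hanks use a word window instead); the function name `association_ratios` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def association_ratios(documents):
    """Map each word pair to log2(P(w1, w2) / (P(w1) * P(w2))).

    Probabilities are estimated as the fraction of documents containing
    the word (or both words). This is an assumed simplification.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc))  # count each word once per document
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), joint in pair_counts.items():
        p_joint = joint / n_docs
        p1 = word_counts[w1] / n_docs
        p2 = word_counts[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores


# Hypothetical toy corpus, not the AP corpus from the slide.
docs = [
    ["doctor", "nurse"],
    ["doctor", "nurse"],
    ["doctor", "bill"],
    ["film", "role"],
]
scores = association_ratios(docs)
```

A score near 0 suggests the two words are roughly independent; a large positive score suggests they appear together more often than chance would predict.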

Document Vectors

nova galaxy heat h’wood film role diet fur

1.0 0.5 0.3

0.5 1.0

1.0 0.8 0.7

0.9 1.0 0.5

1.0 1.0

0.9 1.0

0.5 0.7 0.9

0.6 1.0 0.3 0.2 0.8

0.7 0.5 0.1 0.3

 

Topics for Today

 

Documents in 3D Space

Document Similarity

Document Space has High Dimensionality

Clustering

Text Clustering

Clustering is

"The art of finding groups in data."

-- Kaufman and Rousseeuw

 

 

Pair-wise Document Similarity
(no normalization for simplicity)

nova galaxy heat h’wood film role diet fur

1 3 1

5 2

2 1 5

4 1

Pair-wise Document Similarity
(cosine normalization)

Document/Document Matrix
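
The two similarity variants above (raw dot product and cosine-normalized) and the resulting document/document matrix can be sketched as follows. The term order comes from the slides, but the column placement of the counts in `doc_a`, `doc_b`, and `doc_c` is an assumption made for illustration, since the original table layout is not recoverable here.

```python
import math

# Term order taken from the slides.
TERMS = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]


def dot(a, b):
    """Unnormalized similarity: sum of products of matching term weights."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def similarity_matrix(docs, measure=cosine):
    """Document/document matrix: entry [i][j] compares docs[i] with docs[j]."""
    return [[measure(a, b) for b in docs] for a in docs]


# Hypothetical term-count vectors (one position per entry in TERMS).
doc_a = [1, 3, 1, 0, 0, 0, 0, 0]
doc_b = [2, 6, 2, 0, 0, 0, 0, 0]  # same direction as doc_a, twice as long
doc_c = [0, 0, 0, 2, 1, 5, 0, 0]  # shares no terms with doc_a

matrix = similarity_matrix([doc_a, doc_b, doc_c])
```

With these vectors the raw dot product of `doc_a` and `doc_b` is 22, while their cosine is 1 (up to rounding): normalization removes the effect of document length, which is the point of the second slide.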

Agglomerative Clustering
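
A minimal sketch of agglomerative clustering over a document/document similarity matrix: start with one cluster per document and repeatedly merge the most similar pair. The slides do not say which linkage rule is used, so single-link (similarity of the closest members) is assumed here; the function name and the similarity values are hypothetical.

```python
def agglomerative(sim, num_clusters):
    """Single-link agglomerative clustering.

    sim is a symmetric similarity matrix (higher means more similar).
    Returns a list of clusters, each a sorted list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > max(num_clusters, 1):
        best = None  # (link similarity, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the two closest members.
                link = max(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical similarities for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
result = agglomerative(sim, 2)
print(result)
```

Stopping at different values of `num_clusters` corresponds to cutting the merge hierarchy at different levels.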

K-Means Clustering
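
A minimal sketch of K-means: assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat. To keep the example deterministic, the first k points are used as the initial centroids; that seeding choice, the fixed iteration count, and the sample points are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Return (assignment, centroids) after a fixed number of iterations."""
    # Assumed seeding: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical 2-D points; document vectors would have one dimension per term.
points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 6.0)]
assignment, centroids = kmeans(points, 2)
```

Unlike the agglomerative approach, the number of clusters k has to be chosen up front, and the result can depend on the initial centroids.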

 

Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93

Hearst & Pedersen 95

S/G Example: query on "star"

Encyclopedia text

14 sports

8 symbols 47 film, tv

68 film, tv (p) 7 music

97 astrophysics

67 astronomy (p) 12 stellar phenomena

10 flora/fauna 49 galaxies, stars

29 constellations

7 miscellaneous

 

Clustering and re-clustering is entirely automated

Another use of clustering

Clustering Multi-Dimensional Document Space
(image from Wise et al 95)

Concept "Landscapes"

Pharmacology

Clustering