Content Analysis
Review
Zipf Distribution
Consequences of Zipf
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
Statistical Independence vs. Dependence
Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)
Computing Co-occurrence
Document Vectors
Topics for Today
Documents in 3D Space
Document Similarity
Document Space has High Dimensionality
Text Clustering
Pair-wise Document Similarity
Pair-wise Document Similarity (no normalization for simplicity)
Pair-wise Document Similarity (cosine normalization)
Document/Document Matrix
Agglomerative Clustering
Agglomerative Clustering
K-Means Clustering
Scatter/Gather
S/G Example: query on “star”
Another use of clustering
Clustering Multi-Dimensional Document Space (image from Wise et al 95)
Concept “Landscapes”
Clustering
Email: hearst@sims.berkeley.edu
Home Page: http://sims.berkeley.edu/~hearst