Content Analysis
Prof. Marti Hearst
SIMS 202, Lecture 15
Review
Content Analysis:
Transformation of raw text into more computationally useful forms
Words in text collections exhibit interesting statistical properties
Zipf distribution
Word co-occurrences non-independent
Text documents are transformed to vectors
Pre-processing
Vectors represent multi-dimensional space
Zipf Distribution
Consequences of Zipf
There are always a few very frequent tokens that are not good discriminators.
Called "stop words" in IR
Usually correspond to linguistic notion of "closed-class" words
- English examples: to, from, on, and, the, ...
- Grammatical classes that don’t take on new members.
There is always a long tail of tokens that occur only once or twice; these rare tokens can trip up algorithms.
Medium-frequency words are the most descriptive.
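The frequency bands above can be sketched with a few lines of Python. This is a minimal illustration on a tiny hypothetical corpus; the frequency cutoffs are arbitrary and would be tuned on a real collection.

```python
from collections import Counter

# Toy corpus (hypothetical data; a real collection would be much larger).
docs = [
    "the star is a star in the film",
    "the film about the star was a hit",
    "galaxies contain many a star and the odd nova",
]

tokens = [t for d in docs for t in d.split()]
freq = Counter(tokens)

# Zipf-style consequences: very frequent tokens are poor discriminators
# (stop words), and hapaxes (frequency 1) are too rare to help.
stop_words = {w for w, c in freq.items() if c >= 5}   # crude high-frequency cutoff
hapaxes = {w for w, c in freq.items() if c == 1}
medium = set(freq) - stop_words - hapaxes             # the most descriptive band
```

Here "the" lands in the stop-word band, content words like "star" and "film" land in the medium band, and one-off tokens like "nova" are hapaxes.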
Word Frequency vs. Resolving Power
(from van Rijsbergen 79)
Statistical Independence vs. Dependence
- How likely is token W to appear, given that we’ve seen token V?
- Non-independence implies that tokens that co-occur may be related in some meaningful way.
- Even very simple corpus-processing algorithms can produce meaningful results.
Interesting Associations with "Doctor"
(AP Corpus, N=15 million, Church & Hanks 89)
Computing Co-occurrence
Compute for a window of words
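A minimal sketch of windowed co-occurrence counting, plus the kind of association score (pointwise mutual information) used in Church & Hanks-style analyses. The mini-corpus and window size are hypothetical; Church & Hanks worked over the 15-million-word AP corpus.

```python
from collections import Counter
import math

# Hypothetical mini-corpus (Church & Hanks used the AP corpus, N ~ 15M).
text = ("the doctor examined the patient then the doctor wrote a "
        "prescription for the patient").split()

WINDOW = 3  # count a pair as co-occurring if within 3 tokens of each other

unigram = Counter(text)
pair = Counter()
for i, w in enumerate(text):
    for v in text[i + 1 : i + 1 + WINDOW]:
        pair[tuple(sorted((w, v)))] += 1

N = len(text)

def pmi(w, v):
    """Pointwise mutual information: log2 of P(w,v) / (P(w) * P(v))."""
    p_wv = pair[tuple(sorted((w, v)))] / N
    return math.log2(p_wv * N * N / (unigram[w] * unigram[v]))
```

On this toy text, "doctor" and "patient" score higher than "the" and "doctor", reflecting that stop words co-occur with everything and carry little association.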
Document Vectors
Documents are represented as "bags of words"
Represented as vectors when used computationally
A vector is like an array of floating-point numbers
It has direction and magnitude
Each vector holds a place for every term in the collection
Therefore, most vectors are sparse
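The bag-of-words representation above can be sketched as follows; storing only the nonzero entries (a dict) is what keeps each vector sparse. The two example documents are hypothetical.

```python
# A document vector has one slot per term in the collection vocabulary;
# since most documents contain few of those terms, storing only the
# nonzero entries (a dict) keeps the vector sparse.
docs = {
    "d1": "nova galaxy heat",
    "d2": "hollywood film role film",
}

vocab = sorted({t for text in docs.values() for t in text.split()})

def to_vector(text):
    counts = {}
    for t in text.split():
        counts[t] = counts.get(t, 0) + 1
    # the dense form would be [counts.get(t, 0) for t in vocab]
    return counts

vectors = {doc_id: to_vector(text) for doc_id, text in docs.items()}
```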
Document Vectors
[Figure: example document/term weight matrix over the terms nova, galaxy, heat, h'wood, film, role, diet, fur; each document row holds only a few nonzero weights between 0.1 and 1.0, illustrating that the vectors are sparse. Column alignment lost in extraction.]
Topics for Today
High Dimensionality of Document Space
Automatic Methods for
Clustering
Creating Thesaurus Terms
Review and Sample Questions for Midterm
Documents in 3D Space
Document Similarity
Numbers represent how many documents share the indicated subset of terms.
How to represent similarity among five terms? Six?
Document Space has High Dimensionality
What happens beyond three dimensions?
Similarity still has to do with how many tokens are shared in common.
More terms -> harder to understand which subsets of words are shared among similar documents.
One approach to handling high dimensionality:
Clustering
Text Clustering
Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens
Picks out some themes, ignores others
Text Clustering
Clustering is
"The art of finding groups in data."
-- Kaufman and Rousseeuw
Pair-wise Document Similarity
(no normalization for simplicity)
[Figure: example integer term counts for four documents over the terms nova, galaxy, heat, h'wood, film, role, diet, fur; the unnormalized similarity of two documents is the sum of products of their counts on shared terms. Table alignment lost in extraction.]
Pair-wise Document Similarity
(cosine normalization)
Document/Document Matrix
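A document/document matrix of cosine-normalized similarities can be sketched as below. The term weights are hypothetical, echoing the slide's example terms; cosine similarity divides the dot product by the product of vector magnitudes, so long and short documents are comparable.

```python
import math

# Hypothetical term-weight vectors (sparse dicts).
docs = {
    "A": {"nova": 1.0, "galaxy": 0.5, "heat": 0.3},
    "B": {"nova": 0.5, "galaxy": 1.0},
    "C": {"film": 1.0, "role": 0.8},
}

def cosine(u, v):
    """Dot product of u and v, normalized by their magnitudes."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Document/document similarity matrix.
sim = {(a, b): cosine(docs[a], docs[b]) for a in docs for b in docs}
```

Documents A and B share terms and score well above zero; A and C share none, so their similarity is exactly zero.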
Agglomerative Clustering
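A minimal bottom-up (agglomerative) sketch: start with every document in its own cluster and repeatedly merge the two most similar clusters until K remain. For brevity the "documents" here are hypothetical 1-D points and cluster distance is single-link (closest members).

```python
# Toy 1-D "documents" (hypothetical data).
points = {"d1": 0.0, "d2": 0.2, "d3": 5.0, "d4": 5.3}

def single_link(c1, c2):
    # Distance between clusters = distance of their closest members.
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

def agglomerate(k):
    clusters = [{d} for d in points]
    while len(clusters) > k:
        # Find and merge the closest pair of clusters (bottom-up step).
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters.pop(j)
    return clusters
```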
K-Means Clustering
1 Create a pair-wise similarity measure
2 Find K centers using agglomerative clustering
take a small sample
group bottom up until K groups found
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
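Steps 3 and 4 above can be sketched as follows. The slide seeds the K centers with agglomerative clustering on a small sample; here the seeds are simply hand-picked for brevity, and the data are hypothetical 1-D points.

```python
# Hypothetical 1-D data and hand-picked seed centers (the slide would
# obtain these seeds by agglomerative clustering on a sample).
points = [0.0, 0.2, 0.1, 5.0, 5.3, 5.1]
centers = [0.0, 5.0]

for _ in range(10):  # step 4: repeat assignment/update as necessary
    # Step 3: assign each point to its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update each center to the mean of its cluster
    # (with these seeds no cluster ever becomes empty).
    centers = [sum(c) / len(c) for c in clusters]
```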
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95
Cluster sets of documents into general "themes", like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different "themes"
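One piece of the display step can be sketched simply: Scatter/Gather summarizes each cluster by topical terms, and a crude approximation is the most frequent terms across the cluster's documents (real systems also weight and filter). The documents here are hypothetical.

```python
from collections import Counter

# Hypothetical documents in one cluster (stop words already removed).
cluster = [
    "star nova galaxy supernova",
    "galaxy star cluster nebula",
    "nova star dwarf",
]

def topical_terms(docs, n=3):
    """Crude cluster summary: the n most frequent terms."""
    counts = Counter(t for d in docs for t in d.split())
    return [t for t, _ in counts.most_common(n)]
```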
S/G Example: query on "star"
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated
Another use of clustering
Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
"Project" these onto a 2D graphical representation:
Clustering Multi-Dimensional
Document Space
(image from Wise et al 95)
Concept "Landscapes"
Pharmacology
Clustering
Advantages:
See some main themes
Disadvantage:
Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and facets?