Content Analysis
Prof. Marti Hearst
SIMS 202, Lecture 15
Review
Content Analysis:
Transformation of raw text into more computationally useful forms
Words in text collections exhibit interesting statistical properties
Zipf distribution
Word co-occurrences non-independent
Text documents are transformed to vectors
Pre-processing
Vectors represent multi-dimensional space
Zipf Distribution
Consequences of Zipf
There are always a few very frequent tokens that are not good discriminators.
Called "stop words" in IR
Usually correspond to linguistic notion of "closed-class" words
- English examples: to, from, on, and, the, ...
- Grammatical classes that don’t take on new members.
There is always a long tail of tokens that occur only once or twice; these rare tokens can trip up algorithms.
Medium-frequency words are the most descriptive.
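The frequency bands above can be sketched with a few lines of Python. This is a minimal illustration on a tiny hypothetical corpus; the frequency cutoffs are arbitrary and would be tuned on a real collection.

```python
from collections import Counter

# Toy corpus (hypothetical data; a real collection would be much larger).
docs = [
    "the star is a star in the film",
    "the film about the star was a hit",
    "galaxies contain many a star and the odd nova",
]

tokens = [t for d in docs for t in d.split()]
freq = Counter(tokens)

# Zipf-style consequences: very frequent tokens are poor discriminators
# (stop words), and hapaxes (frequency 1) are too rare to help.
stop_words = {w for w, c in freq.items() if c >= 5}   # crude high-frequency cutoff
hapaxes = {w for w, c in freq.items() if c == 1}
medium = set(freq) - stop_words - hapaxes             # the most descriptive band
```

Here "the" lands in the stop-word band, content words like "star" and "film" land in the medium band, and one-off tokens like "nova" are hapaxes.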
Word Frequency vs. Resolving Power
(from van Rijsbergen 79)
Statistical Independence vs. Dependence
- How likely is token W to appear, given that we’ve seen token V?
- Non-independence implies that tokens that co-occur may be related in some meaningful way.
- Even very simple corpus-processing algorithms can produce meaningful results.
Interesting Associations with "Doctor"
(AP Corpus, N=15 million, Church & Hanks 89)
Computing Co-occurrence
Compute for a window of words
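A minimal sketch of windowed co-occurrence counting, plus the kind of association score (pointwise mutual information) used in Church & Hanks-style analyses. The mini-corpus and window size are hypothetical; Church & Hanks worked over the 15-million-word AP corpus.

```python
from collections import Counter
import math

# Hypothetical mini-corpus (Church & Hanks used the AP corpus, N ~ 15M).
text = ("the doctor examined the patient then the doctor wrote a "
        "prescription for the patient").split()

WINDOW = 3  # count a pair as co-occurring if within 3 tokens of each other

unigram = Counter(text)
pair = Counter()
for i, w in enumerate(text):
    for v in text[i + 1 : i + 1 + WINDOW]:
        pair[tuple(sorted((w, v)))] += 1

N = len(text)

def pmi(w, v):
    """Pointwise mutual information: log2 of P(w,v) / (P(w) * P(v))."""
    p_wv = pair[tuple(sorted((w, v)))] / N
    return math.log2(p_wv * N * N / (unigram[w] * unigram[v]))
```

On this toy text, "doctor" and "patient" score higher than "the" and "doctor", reflecting that stop words co-occur with everything and carry little association.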
Document Vectors
Documents are represented as "bags of words"
Represented as vectors when used computationally
A vector is like an array of floating-point numbers
It has direction and magnitude
Each vector holds a place for every term in the collection
Therefore, most vectors are sparse
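The bag-of-words representation above can be sketched as follows; storing only the nonzero entries (a dict) is what keeps each vector sparse. The two example documents are hypothetical.

```python
# A document vector has one slot per term in the collection vocabulary;
# since most documents contain few of those terms, storing only the
# nonzero entries (a dict) keeps the vector sparse.
docs = {
    "d1": "nova galaxy heat",
    "d2": "hollywood film role film",
}

vocab = sorted({t for text in docs.values() for t in text.split()})

def to_vector(text):
    counts = {}
    for t in text.split():
        counts[t] = counts.get(t, 0) + 1
    # the dense form would be [counts.get(t, 0) for t in vocab]
    return counts

vectors = {doc_id: to_vector(text) for doc_id, text in docs.items()}
```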
Document Vectors
[Figure: example document/term weight matrix over the terms nova, galaxy, heat, h'wood, film, role, diet, fur; each document row holds only a few nonzero weights between 0.1 and 1.0, illustrating that the vectors are sparse. Column alignment lost in extraction.]
Topics for Today
High Dimensionality of Document Space
Automatic Methods for
Clustering
Creating Thesaurus Terms
Review and Sample Questions for Midterm
Documents in 3D Space
Document Similarity
Numbers represent how many documents share the indicated subset of terms.
How to represent similarity among five terms? Six?
Document Space has High Dimensionality
What happens beyond three dimensions?
Similarity still has to do with how many tokens are shared in common.
More terms -> harder to understand which subsets of words are shared among similar documents.
One approach to handling high dimensionality:
Clustering
Text Clustering
Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens
Picks out some themes, ignores others
Text Clustering
Clustering is
"The art of finding groups in data."
-- Kaufman and Rousseeuw
Pair-wise Document Similarity
(no normalization for simplicity)
[Figure: example integer term counts for four documents over the terms nova, galaxy, heat, h'wood, film, role, diet, fur; the unnormalized similarity of two documents is the sum of products of their counts on shared terms. Table alignment lost in extraction.]
Pair-wise Document Similarity
(cosine normalization)
Document/Document Matrix
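A document/document matrix of cosine-normalized similarities can be sketched as below. The term weights are hypothetical, echoing the slide's example terms; cosine similarity divides the dot product by the product of vector magnitudes, so long and short documents are comparable.

```python
import math

# Hypothetical term-weight vectors (sparse dicts).
docs = {
    "A": {"nova": 1.0, "galaxy": 0.5, "heat": 0.3},
    "B": {"nova": 0.5, "galaxy": 1.0},
    "C": {"film": 1.0, "role": 0.8},
}

def cosine(u, v):
    """Dot product of u and v, normalized by their magnitudes."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Document/document similarity matrix.
sim = {(a, b): cosine(docs[a], docs[b]) for a in docs for b in docs}
```

Documents A and B share terms and score well above zero; A and C share none, so their similarity is exactly zero.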
Agglomerative Clustering
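A minimal bottom-up (agglomerative) sketch: start with every document in its own cluster and repeatedly merge the two most similar clusters until K remain. For brevity the "documents" here are hypothetical 1-D points and cluster distance is single-link (closest members).

```python
# Toy 1-D "documents" (hypothetical data).
points = {"d1": 0.0, "d2": 0.2, "d3": 5.0, "d4": 5.3}

def single_link(c1, c2):
    # Distance between clusters = distance of their closest members.
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

def agglomerate(k):
    clusters = [{d} for d in points]
    while len(clusters) > k:
        # Find and merge the closest pair of clusters (bottom-up step).
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters.pop(j)
    return clusters
```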
K-Means Clustering
1 Create a pair-wise similarity measure
2 Find K centers using agglomerative clustering
take a small sample
group bottom up until K groups found
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
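Steps 3 and 4 above can be sketched as follows. The slide seeds the K centers with agglomerative clustering on a small sample; here the seeds are simply hand-picked for brevity, and the data are hypothetical 1-D points.

```python
# Hypothetical 1-D data and hand-picked seed centers (the slide would
# obtain these seeds by agglomerative clustering on a sample).
points = [0.0, 0.2, 0.1, 5.0, 5.3, 5.1]
centers = [0.0, 5.0]

for _ in range(10):  # step 4: repeat assignment/update as necessary
    # Step 3: assign each point to its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update each center to the mean of its cluster
    # (with these seeds no cluster ever becomes empty).
    centers = [sum(c) / len(c) for c in clusters]
```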
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95
Cluster sets of documents into general "themes", like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different "themes"
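One piece of the display step can be sketched simply: Scatter/Gather summarizes each cluster by topical terms, and a crude approximation is the most frequent terms across the cluster's documents (real systems also weight and filter). The documents here are hypothetical.

```python
from collections import Counter

# Hypothetical documents in one cluster (stop words already removed).
cluster = [
    "star nova galaxy supernova",
    "galaxy star cluster nebula",
    "nova star dwarf",
]

def topical_terms(docs, n=3):
    """Crude cluster summary: the n most frequent terms."""
    counts = Counter(t for d in docs for t in d.split())
    return [t for t, _ in counts.most_common(n)]
```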
S/G Example: query on "star"
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated
Another use of clustering
Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
"Project" these onto a 2D graphical representation:
Clustering Multi-Dimensional
Document Space
(image from Wise et al 95)
Concept "Landscapes"
Pharmacology
Clustering
Advantages:
See some main themes
Disadvantage:
Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and facets?