Introduction to Content Analysis
Prof. Marti Hearst
SIMS 202, Lecture 14
Topics for Today
- Overview of Content Analysis
- Text Representation
- Statistical Characteristics of Text Collections
Content Analysis
- Automated Transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to:
  - Automated Thesaurus Generation
  - Phrase Detection
  - Categorization
  - Clustering
  - Summarization
Techniques for Content Analysis
- Statistical
  - Single Document
  - Full Collection
- Linguistic
  - Syntactic
  - Semantic
  - Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)
Text Processing
- Standard Steps:
  - Recognize document structure
    - titles, sections, paragraphs, etc.
  - Break into tokens
    - usually space and punctuation delineated
    - special issues with Asian languages
  - Stemming/morphological analysis
  - Store in inverted index (to be discussed later)
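A minimal sketch of the tokenization step above, using a simple regex over space and punctuation in plain Python (the function name and sample sentence are illustrative, not from the lecture):

```python
# Simple space/punctuation-delimited tokenizer.
import re

def tokenize(text):
    """Lowercase the text and break it into tokens delimited by space and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Content Analysis: transforming raw text into useful forms."))
# ['content', 'analysis', 'transforming', 'raw', 'text', 'into', 'useful', 'forms']
```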
Stemming and Morphological Analysis
- Goal: “normalize” similar words
- Morphology (“form” of words)
  - Inflectional Morphology
    - E.g., inflect verb endings and noun number
    - Never changes grammatical class
      - dog, dogs
      - tengo, tienes, tiene, tenemos, tienen
  - Derivational Morphology
    - Derives one word from another
    - Often changes grammatical class
      - build, building; health, healthy
Automated Methods
- Powerful multilingual tools exist for morphological analysis
  - PCKimmo, Xerox lexical technology
  - Require a grammar and dictionary
  - Use “two-level” automata
- Stemmers:
  - Very dumb rules work well (for English)
  - Porter Stemmer: iteratively remove suffixes
  - Improvement: pass results through a lexicon
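A minimal sketch of rule-based suffix stripping, assuming NLTK's implementation of the Porter stemmer (NLTK is not part of the lecture; any Porter implementation behaves similarly):

```python
# Iterative suffix removal with the Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dogs", "running", "caresses", "ponies", "organization"]:
    print(word, "->", stemmer.stem(word))
# Typical output: dog, run, caress, poni, organ
# ("poni" and "organ" illustrate the kinds of errors Krovetz (93) catalogs)
```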
Errors Generated by Porter Stemmer (Krovetz 93)
Statistical Properties of Text
- Token occurrences in text are not uniformly distributed
- They are also not normally distributed
- They do exhibit a Zipf distribution
  - (in-class demonstration of distribution types)
Zipf Distribution
- The product of the frequency of words (f) and their rank (r) is approximately constant
  - Rank = order of words’ frequency of occurrence
- Main Characteristics
  - a few elements occur very frequently
  - a medium number of elements have medium frequency
  - many elements occur very infrequently
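A small sketch of checking the f × r ≈ constant relationship on an arbitrary text file ("corpus.txt" is a placeholder path, not lecture data):

```python
# Count word frequencies, rank them, and inspect f * r for the top words.
import re
from collections import Counter

counts = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for rank, (word, freq) in enumerate(ranked[:20], start=1):
    print(f"{rank:>4}  {word:<15} f={freq:<8} f*r={rank * freq}")
```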
What Kinds of Data Exhibit a Zipf Distribution?
- Words in a text collection
- Library book checkout patterns
- Incoming Web Page Requests (Nielsen)
- Outgoing Web Page Requests (Cunha & Crovella)
- Document Size on Web (Cunha & Crovella)
Housing Listing Frequency Data
Words that occur few times (housing listings)
Medium and very frequent words (housing listings)
A More Standard Collection
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
Statistical Independence vs. Dependence
- How likely is a red car to drive by, given we've seen a black one?
- How likely is word W to appear, given that we’ve seen word V?
- Colors of cars driving by are independent (although more frequent colors are more likely)
- Words in text are not independent (although again more frequent words are more likely)
Statistical Independence
- Compute for a window of words
Lexical Associations
- Subjects write first word that comes to mind
  - doctor/nurse; black/white (Palermo & Jenkins 64)
- Text Corpora yield similar associations
- One measure: Mutual Information (Church and Hanks 89)
  - If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
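A sketch of the window-based mutual information measure referenced above, following the Church & Hanks formulation I(x,y) = log2( P(x,y) / (P(x) P(y)) ); the toy corpus and window size are invented for illustration, not the AP corpus:

```python
# Window-based co-occurrence counts plus pointwise mutual information.
import math
from collections import Counter

tokens = ("the doctor asked the nurse to call another doctor "
          "while the nurse checked the patient").split()
window = 5  # count a pair if the two words occur within this many positions

unigrams = Counter(tokens)
pairs = Counter()
for i, w in enumerate(tokens):
    for v in tokens[i + 1:i + window]:
        pairs[(w, v)] += 1
        pairs[(v, w)] += 1

N = len(tokens)

def mutual_information(x, y):
    """log2( P(x,y) / (P(x) * P(y)) ); independence would give 0."""
    if pairs[(x, y)] == 0:
        return 0.0
    return math.log2((pairs[(x, y)] / N) / ((unigrams[x] / N) * (unigrams[y] / N)))

print(mutual_information("doctor", "nurse"))  # ~3.5 on this toy corpus: strong association
print(mutual_information("the", "nurse"))     # ~2.9: lower for the function word
```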
Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)
Document Vectors
- Documents are represented as “bags of words”
- Represented as vectors when used computationally
  - A vector is like an array of floating-point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
Document Vectors
(table of example document vectors: rows are documents, columns are weights for the terms nova, galaxy, heat, h’wood, film, role, diet, fur)
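A minimal sketch of the bag-of-words vector representation shown in the table above, assuming raw term counts as weights (the slide's weighting scheme is not specified; the document names and texts are invented):

```python
# Build one sparse vector per document, with a position for every term in the collection.
from collections import Counter

docs = {
    "A": "nova galaxy heat nova",
    "B": "hollywood film role film diet",
}

vocab = sorted({t for text in docs.values() for t in text.split()})
vectors = {name: [Counter(text.split())[t] for t in vocab]
           for name, text in docs.items()}

print(vocab)          # every term in the collection, in a fixed order
print(vectors["A"])   # zeros everywhere except the terms that occur in A
```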
Documents in 3D Space
Documents and Query in 3D Space
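The 3D-space slides picture documents and a query as vectors in term space; one standard way to compare them (not spelled out on these slides) is the cosine of the angle between the vectors. A sketch with invented three-term weights:

```python
# Rank documents against a query by the cosine of the angle between their vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc1 = [1.0, 0.5, 0.0]   # weights for three terms (e.g. nova, galaxy, heat)
doc2 = [0.0, 0.3, 0.9]
query = [0.8, 0.4, 0.0]

print(cosine(query, doc1))  # close to 1: query points in nearly the same direction
print(cosine(query, doc2))  # much smaller: query and doc2 share little
```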
Summary
- Content Analysis: transforming raw text into more computationally useful forms
- Words in text collections exhibit interesting statistical properties
  - Word frequencies have a Zipf distribution
  - Word co-occurrences exhibit dependencies
- Text documents are transformed to vectors
  - pre-processing includes tokenization, stemming, collocations/phrases
- Documents occupy multi-dimensional space