Introduction to Content Analysis
Prof. Marti Hearst
SIMS 202, Lecture 14
Topics for Today
- Overview of Content Analysis
- Text Representation
- Statistical Characteristics of Text Collections
Content Analysis
- Automated Transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to:
  - Automated Thesaurus Generation
  - Phrase Detection
  - Categorization
  - Clustering
  - Summarization
Techniques for Content Analysis
- Statistical
  - Single Document
  - Full Collection
- Linguistic
  - Syntactic
  - Semantic
  - Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)
Text Processing
- Standard Steps:
  - Recognize document structure
    - titles, sections, paragraphs, etc.
  - Break into tokens
    - usually space and punctuation delineated
    - special issues with Asian languages
  - Stemming/morphological analysis
  - Store in inverted index (to be discussed later)
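A minimal sketch of the tokenization step above, using a simple regex over space and punctuation in plain Python (the function name and sample sentence are illustrative, not from the lecture):

```python
# Simple space/punctuation-delimited tokenizer.
import re

def tokenize(text):
    """Lowercase the text and break it into tokens delimited by space and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Content Analysis: transforming raw text into useful forms."))
# ['content', 'analysis', 'transforming', 'raw', 'text', 'into', 'useful', 'forms']
```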
Stemming and Morphological Analysis
- Goal: “normalize” similar words
- Morphology (“form” of words)
  - Inflectional Morphology
    - E.g., inflect verb endings and noun number
    - Never changes grammatical class
      - dog, dogs
      - tengo, tienes, tiene, tenemos, tienen
  - Derivational Morphology
    - Derives one word from another
    - Often changes grammatical class
      - build, building; health, healthy
Automated Methods
- Powerful multilingual tools exist for morphological analysis
  - PCKimmo, Xerox lexical technology
  - Require a grammar and dictionary
  - Use “two-level” automata
- Stemmers:
  - Very dumb rules work well (for English)
  - Porter Stemmer: iteratively remove suffixes
  - Improvement: pass results through a lexicon
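A minimal sketch of rule-based suffix stripping, assuming NLTK's implementation of the Porter stemmer (NLTK is not part of the lecture; any Porter implementation behaves similarly):

```python
# Iterative suffix removal with the Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dogs", "running", "caresses", "ponies", "organization"]:
    print(word, "->", stemmer.stem(word))
# Typical output: dog, run, caress, poni, organ
# ("poni" and "organ" illustrate the kinds of errors Krovetz (93) catalogs)
```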
Errors Generated by Porter Stemmer (Krovetz 93)
Statistical Properties of Text
- Token occurrences in text are not uniformly distributed
- They are also not normally distributed
- They do exhibit a Zipf distribution
  - (in-class demonstration of distribution types)
Zipf Distribution
- The product of the frequency of words (f) and their rank (r) is approximately constant
  - Rank = order of words’ frequency of occurrence
- Main Characteristics
  - a few elements occur very frequently
  - a medium number of elements have medium frequency
  - many elements occur very infrequently
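A small sketch of checking the f × r ≈ constant relationship on an arbitrary text file ("corpus.txt" is a placeholder path, not lecture data):

```python
# Count word frequencies, rank them, and inspect f * r for the top words.
import re
from collections import Counter

counts = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for rank, (word, freq) in enumerate(ranked[:20], start=1):
    print(f"{rank:>4}  {word:<15} f={freq:<8} f*r={rank * freq}")
```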
What Kinds of Data Exhibit a Zipf Distribution?
- Words in a text collection
- Library book checkout patterns
- Incoming Web Page Requests (Nielsen)
- Outgoing Web Page Requests (Cunha & Crovella)
- Document Size on Web (Cunha & Crovella)
Housing Listing Frequency Data
Words that occur few times (housing listings)
Medium and very frequent words (housing listings)
A More Standard Collection
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
Statistical Independence vs. Dependence
- How likely is a red car to drive by, given we've seen a black one?
- How likely is word W to appear, given that we’ve seen word V?
- Colors of cars driving by are independent (although more frequent colors are more likely)
- Words in text are not independent (although again more frequent words are more likely)
Statistical Independence
- Compute for a window of words
Lexical Associations
- Subjects write first word that comes to mind
  - doctor/nurse; black/white (Palermo & Jenkins 64)
- Text Corpora yield similar associations
- One measure: Mutual Information (Church and Hanks 89)
  - If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
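A sketch of the window-based mutual information measure referenced above, following the Church & Hanks formulation I(x,y) = log2( P(x,y) / (P(x) P(y)) ); the toy corpus and window size are invented for illustration, not the AP corpus:

```python
# Window-based co-occurrence counts plus pointwise mutual information.
import math
from collections import Counter

tokens = ("the doctor asked the nurse to call another doctor "
          "while the nurse checked the patient").split()
window = 5  # count a pair if the two words occur within this many positions

unigrams = Counter(tokens)
pairs = Counter()
for i, w in enumerate(tokens):
    for v in tokens[i + 1:i + window]:
        pairs[(w, v)] += 1
        pairs[(v, w)] += 1

N = len(tokens)

def mutual_information(x, y):
    """log2( P(x,y) / (P(x) * P(y)) ); independence would give 0."""
    if pairs[(x, y)] == 0:
        return 0.0
    return math.log2((pairs[(x, y)] / N) / ((unigrams[x] / N) * (unigrams[y] / N)))

print(mutual_information("doctor", "nurse"))  # ~3.5 on this toy corpus: strong association
print(mutual_information("the", "nurse"))     # ~2.9: lower for the function word
```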
Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)
Document Vectors
- Documents are represented as “bags of words”
- Represented as vectors when used computationally
  - A vector is like an array of floating-point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
Document Vectors
(table of example document vectors: rows are documents, columns are weights for the terms nova, galaxy, heat, h’wood, film, role, diet, fur)
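A minimal sketch of the bag-of-words vector representation shown in the table above, assuming raw term counts as weights (the slide's weighting scheme is not specified; the document names and texts are invented):

```python
# Build one sparse vector per document, with a position for every term in the collection.
from collections import Counter

docs = {
    "A": "nova galaxy heat nova",
    "B": "hollywood film role film diet",
}

vocab = sorted({t for text in docs.values() for t in text.split()})
vectors = {name: [Counter(text.split())[t] for t in vocab]
           for name, text in docs.items()}

print(vocab)          # every term in the collection, in a fixed order
print(vectors["A"])   # zeros everywhere except the terms that occur in A
```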
Documents in 3D Space
Documents and Query in 3D Space
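The 3D-space slides picture documents and a query as vectors in term space; one standard way to compare them (not spelled out on these slides) is the cosine of the angle between the vectors. A sketch with invented three-term weights:

```python
# Rank documents against a query by the cosine of the angle between their vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc1 = [1.0, 0.5, 0.0]   # weights for three terms (e.g. nova, galaxy, heat)
doc2 = [0.0, 0.3, 0.9]
query = [0.8, 0.4, 0.0]

print(cosine(query, doc1))  # close to 1: query points in nearly the same direction
print(cosine(query, doc2))  # much smaller: query and doc2 share little
```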
Summary
- Content Analysis: transforming raw text into more computationally useful forms
- Words in text collections exhibit interesting statistical properties
  - Word frequencies have a Zipf distribution
  - Word co-occurrences exhibit dependencies
- Text documents are transformed to vectors
  - pre-processing includes tokenization, stemming, collocations/phrases
- Documents occupy multi-dimensional space