Summary
Content Analysis: transforming raw text into more computationally useful forms
Words in text collections exhibit interesting statistical properties
- Word frequencies follow a Zipf distribution: the frequency of a word is roughly inversely proportional to its frequency rank
- Word co-occurrences exhibit dependencies: words do not occur independently of one another
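
Both properties can be checked on a small corpus. The sketch below is illustrative only: the function names `rank_frequency` and `pmi` are hypothetical, and pointwise mutual information over document-level co-occurrence is just one possible dependency measure, chosen here as an assumption. Under Zipf's law the product rank * count stays roughly constant across ranks on a large collection; a PMI above 0 means two words co-occur more often than independence would predict.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(ordered, start=1)
    ]


def pmi(docs, w1, w2):
    """Pointwise mutual information (base 2) of w1 and w2 sharing a document.

    docs is a list of token lists; probabilities are estimated as the
    fraction of documents that contain the word(s).
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    p1 = sum(w1 in s for s in doc_sets) / n
    p2 = sum(w2 in s for s in doc_sets) / n
    p12 = sum(w1 in s and w2 in s for s in doc_sets) / n
    if p12 == 0:
        # The pair never co-occurs, so the log ratio is unbounded below.
        return float("-inf")
    return math.log2(p12 / (p1 * p2))
```

On real data, plotting log(count) against log(rank) from `rank_frequency` should give a roughly straight line if the collection is Zipfian.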
Text documents are transformed into vectors
- Pre-processing includes tokenization, stemming, collocations/phrases
- Documents occupy a multi-dimensional space, with one dimension per term
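
The pipeline from raw text to vectors can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: all function names are hypothetical, `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, and adjacent-word bigrams stand in for collocation detection, which would normally keep only pairs with a high association score.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common suffix if at least three letters remain (toy stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Stemmed unigrams plus adjacent-word bigrams as candidate phrases."""
    stems = [crude_stem(t) for t in tokenize(text)]
    bigrams = [f"{a}_{b}" for a, b in zip(stems, stems[1:])]
    return stems + bigrams


def vectorize(texts):
    """Map each text to a term-count vector over a shared, sorted vocabulary."""
    bags = [Counter(terms(t)) for t in texts]
    vocab = sorted(set().union(*bags))
    # Counter returns 0 for missing terms, so every vector has len(vocab) entries.
    return vocab, [[bag[term] for term in vocab] for bag in bags]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Each vocabulary entry is one dimension of the space, so `cosine` gives a simple way to compare where two documents sit in it.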