Summary

Words in text collections exhibit interesting statistical properties
- Word frequencies follow a Zipf distribution: frequency is roughly inversely proportional to rank
- Word co-occurrences exhibit dependencies: words do not occur independently of one another
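
As a rough illustration of the Zipf property, the sketch below builds a rank-frequency table from raw text. The function name `rank_frequency`, the regex tokenizer, and the sample string are assumptions made for this example, not part of the original material. On a large collection, the product rank × count tends to stay roughly constant; the toy string here is far too small to show that and only demonstrates the mechanics.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    # Simplistic tokenization (an assumption): lowercase runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, rank * count))
    return rows

# Toy input, chosen only to make the function easy to follow.
sample = "the cat sat on the mat and the dog sat on the log"
zipf_rows = rank_frequency(sample)
for row in zipf_rows[:3]:
    print(row)
```

To check a real collection, pass its full text to `rank_frequency` and inspect the last column of the top rows, or plot log(count) against log(rank) and look for an approximately straight line.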

Text documents are transformed to vectors
- Pre-processing includes tokenization, stemming, and detection of collocations/phrases
- Documents occupy a multi-dimensional space, with one dimension per term
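
The transformation from text to vectors can be sketched as follows. This is a minimal example under stated assumptions: `tokenize`, `crude_stem`, and `vectorize` are hypothetical helpers, the suffix-stripping rule is only a stand-in for a real stemmer such as Porter's, and collocation/phrase detection is left out entirely. Each document becomes a list of term counts over a shared vocabulary.

```python
from collections import Counter
import re

def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strip "ing" or a trailing "s"
    # when at least three characters remain.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def vectorize(docs):
    """Map each document to a term-count vector over a shared, sorted vocabulary."""
    token_lists = [[crude_stem(t) for t in tokenize(d)] for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["Dogs chase cats", "The cat chases the dog"]
vocab, vectors = vectorize(docs)
print(vocab)    # ['cat', 'chase', 'dog', 'the']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 1, 2]]
```

Each position in a vector corresponds to one vocabulary term, so the two documents are points in a four-dimensional space that differ only along the "the" axis.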

Content Analysis: transforming raw text into more computationally useful forms