Text Processing
Standard Steps:
- Recognize document structure
  - titles, sections, paragraphs, etc.
- Break into tokens
  - usually delimited by whitespace and punctuation
  - special issues with Asian languages (e.g., Chinese and Japanese are written without spaces between words)
- Stemming/morphological analysis
- Store in inverted index (to be discussed later)
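
As a rough illustration of the tokenizing, stemming, and indexing steps above, here is a minimal Python sketch. It is not a reference implementation: the names `tokenize`, `naive_stem`, and `build_inverted_index` are hypothetical, the suffix stripper is a toy stand-in for a real stemmer or morphological analyzer, and document-structure recognition is skipped entirely.

```python
import re
from collections import defaultdict

# Assumption: a token is a maximal run of word characters, so both
# whitespace and punctuation act as delimiters.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase tokens on whitespace and punctuation."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def naive_stem(token):
    """Toy suffix stripper (hypothetical); a real system would use a
    proper stemming algorithm or morphological analysis instead."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverted_index(docs):
    """Map each stemmed term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[naive_stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "Indexing documents, quickly.",
    2: "The document was indexed.",
}
index = build_inverted_index(docs)
print(index["index"])     # [1, 2]
print(index["document"])  # [1, 2]
```

Both "Indexing" and "indexed" reduce to the term `index`, so a query on either form finds both documents. The same tokenizer returns an unspaced string such as "信息检索" as a single token, which is why languages written without spaces need a dedicated word segmenter.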