Consequences of Zipf
There are always a few very frequent tokens that are not good discriminators.
- Called “stop words” in IR
- Usually correspond to the linguistic notion of “closed-class” words: grammatical classes that don’t take on new members
- English examples: to, from, on, and, the, ...
There is always a long tail of tokens that occur only once or a few times; these are too sparse for reliable statistics and can mess up algorithms.
Medium-frequency words are the most descriptive.
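
The three frequency bands above can be illustrated with a small Python sketch. The function name `frequency_bands`, the thresholds `high_min` and `low_max`, and the toy sentence are all assumptions made for illustration; real systems usually tune such cutoffs on a large corpus or rely on a curated stop-word list.

```python
from collections import Counter


def frequency_bands(tokens, high_min=3, low_max=1):
    """Split a vocabulary into high-, medium-, and low-frequency bands.

    high_min and low_max are hypothetical cutoffs chosen for this toy
    example, not recommended values.
    """
    counts = Counter(tokens)
    # Very frequent tokens: poor discriminators (stop-word candidates).
    high = {word for word, c in counts.items() if c >= high_min}
    # Rare tokens: too few occurrences to support reliable statistics.
    low = {word for word, c in counts.items() if c <= low_max}
    # Everything in between: the medium-frequency, most descriptive band.
    medium = set(counts) - high - low
    return high, medium, low


text = "the cat and the dog and the bird saw the cat and the dog near the old barn"
high, medium, low = frequency_bands(text.split())

print("high:  ", sorted(high))
print("medium:", sorted(medium))
print("low:   ", sorted(low))
```

On this toy sentence the high band holds the closed-class words "the" and "and", the low band holds the words seen once, and "cat" and "dog" land in the medium band.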