L20. TEXT PROCESSING; BOOLEAN MODELS (11/4)

4 November 2009

The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance. An IR model explains how these problems are solved by (1) designing the representations of queries and of the documents in the collection being searched and (2) specifying the information used, and the calculations performed, to order the retrieved documents by relevance. Different IR models solve these problems in different ways; in general, the better a model solves them, the more computationally complex it is, so there are tradeoffs. The simplest, most familiar, and least effective model is the Boolean model: representations are sets of index terms, and relevance is calculated in an all-or-none way using set operations that correspond to the Boolean operators AND, OR, and NOT.
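To make the Boolean model concrete, here is a minimal sketch in Python. It assumes a toy three-document collection and whitespace tokenization; the function names (build_index, boolean_and, boolean_or, boolean_not) and the sample documents are illustrative and not part of the lecture materials.

```python
from collections import defaultdict


def build_index(docs):
    """Map each index term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents containing every term: set intersection."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents containing at least one term: set union."""
    return set().union(*(index.get(t, set()) for t in terms))


def boolean_not(index, all_ids, term):
    """Documents that do not contain the term: set difference."""
    return set(all_ids) - index.get(term, set())


# Hypothetical toy collection, for illustration only.
docs = {
    1: "boolean retrieval uses sets of index terms",
    2: "ranked retrieval orders documents by relevance",
    3: "boolean algebra and set theory",
}
index = build_index(docs)

both = boolean_and(index, "boolean", "retrieval")    # only document 1 has both terms
either = boolean_or(index, "boolean", "retrieval")   # all three documents
neither = boolean_not(index, docs, "boolean")        # only document 2
```

Note that each result is an unordered set: a document either matches the query or it does not, so the model by itself gives no basis for ranking the matches by relevance.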

"Text processing" in IR consists of a sequence of steps that transform the text of documents or other entities so that they can be more efficiently stored and matched. Most of these steps seem conceptually trivial, like extracting the text content from its storage format and separating the text into tokens, but they pose subtle and interesting challenges.
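The two steps named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the storage format is taken to be simple HTML-like markup, tokens are lowercased runs of letters and digits, and the helper names (extract_text, tokenize) are hypothetical.

```python
import re
from html import unescape


def extract_text(raw):
    """Strip tags from simple HTML-like markup and decode character entities."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)  # crude tag removal (assumption)
    return unescape(without_tags)


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


raw = "<p>The U.S. economy &amp; trade</p>"
tokens = tokenize(extract_text(raw))
# tokens is ['the', 'u', 's', 'economy', 'trade']
```

Even this small example shows where the subtlety comes in: the simple rule splits "U.S." into the two tokens "u" and "s", which is unlikely to be what a searcher wants.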

 

Download recorded lecture from http://courses.ischool.berkeley.edu/i202/f09/files/202-20091104.mp3