L22. DIMENSIONALITY REDUCTION (11/16)

16 November 2009

Because simple vector models compute with the frequencies of words and word forms, they cannot distinguish different meanings of the same word (polysemy), and they cannot detect equivalent meanings expressed with different words (synonymy). The dimensionality of the space in the simple vector model is the number of distinct terms in the collection, but the "semantic dimensionality" of the space is the number of distinct topics represented in it, which is much smaller.
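For instance, here is a minimal sketch (with invented counts) of how a pure frequency model misses synonymy: two documents about the same topic look nearly unrelated when one says "car" and the other says "auto", because each word gets its own axis.

```python
import math

# Hypothetical term-frequency vectors over the vocabulary ["car", "auto", "engine"].
doc1 = [3, 0, 1]   # talks about "car" and "engine"
doc2 = [0, 3, 1]   # same topic, but says "auto" instead of "car"

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The only overlap is the shared "engine" count, so the similarity
# is low even though the documents are about the same topic.
print(cosine(doc1, doc2))  # 1/10 = 0.1
```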

Somewhat paradoxically, these reduced-dimensionality vectors that define a "topic space" rather than a "term space" are calculated from the statistical co-occurrence of terms in the collection, so the process is completely automatable -- it requires no human-constructed dictionaries, knowledge bases, ontologies, semantic networks, grammars, syntactic parsers, morphologies, or anything else that explicitly represents "language". For this reason these approaches are said to extract "latent" semantics.
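As a concrete sketch (using an invented toy term-document matrix), truncated SVD -- the standard technique behind latent semantic analysis -- recovers this topic structure purely from co-occurrence statistics. In the example, "car" and "auto" never appear in the same document, yet both co-occur with "engine", so the reduced space pulls them together:

```python
import numpy as np

# Hypothetical term-document count matrix (rows = terms, columns = documents).
# "car" and "auto" never co-occur directly, but both co-occur with "engine".
terms = ["car", "auto", "engine", "apple", "fruit", "pie"]
A = np.array([
    [2, 0, 0, 0],  # car     (doc 1 only)
    [0, 2, 0, 0],  # auto    (doc 2 only)
    [1, 1, 0, 0],  # engine  (docs 1 and 2)
    [0, 0, 3, 1],  # apple
    [0, 0, 1, 3],  # fruit
    [0, 0, 2, 2],  # pie
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In raw term space, the synonym-like pair looks completely unrelated.
print(cosine(A[0], A[1]))  # car vs. auto: 0.0

# SVD factors A = U @ diag(s) @ Vt; keeping only the k largest singular
# values projects each term into a k-dimensional "topic space".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]

print(cosine(term_vecs[0], term_vecs[1]))  # car vs. auto: close to 1
print(cosine(term_vecs[0], term_vecs[3]))  # car vs. apple: close to 0
```

Here k plays the role of the "semantic dimensionality": far fewer topic dimensions than terms, chosen automatically from the statistics of the collection rather than from any hand-built linguistic resource.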

Download recorded lecture from http://courses.ischool.berkeley.edu/i202/f09/files/202-20091116.mp3