Copyright © 2005 Robert J. Glushko
Recall and Relevance
Relevance in the Boolean Model
The Vector Model
Term Weighting
Similarity Calculation
Latent Semantic Analysis
Thursday 11/10: Structure Models
Monday 11/14: Brad Horowitz (Yahoo!) on Multimedia Search and Retrieval; Guest Lecturer in Marti Hearst's Search Engines Class (M 4-6 in 100 GBP, NW campus)
Tuesday 11/15: regular class will not meet
Thursday 11/17: Ray Larson on Probabilistic Models (slight change in reading assignment)
Tuesday 11/22: Marti Hearst on User Interfaces for IR
RECALL is the proportion of the relevant documents that are retrieved
PRECISION is the proportion of the retrieved documents that are relevant
Goal: High recall and precision - Get as much good stuff as possible while getting as little junk as possible
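A minimal Python sketch of the two measures (the document identifiers and query outcome here are invented for illustration):

```python
# Invented example: 4 documents are truly relevant; the system retrieves 5.
relevant = {"d1", "d3", "d5", "d8"}          # ground-truth relevant documents
retrieved = {"d1", "d2", "d3", "d9", "d10"}  # documents the system returned

hits = relevant & retrieved                  # documents both relevant and retrieved

recall = len(hits) / len(relevant)           # 2 / 4 = 0.50
precision = len(hits) / len(retrieved)       # 2 / 5 = 0.40
print(f"recall={recall:.2f}, precision={precision:.2f}")
```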
In the Boolean model, documents and queries are represented as sets of index terms
So index terms are either present or absent in a document
How is the relevance of a document calculated?
On what basis are the retrieved documents ordered in a list presented to the searcher?
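A minimal sketch of Boolean retrieval over an invented three-document collection; because terms are simply present or absent, every document that satisfies the query is returned and there is no principled basis for ranking one match above another:

```python
# Each document is represented only as its set of index terms
docs = {
    "d1": {"cat", "pet", "food"},
    "d2": {"dog", "pet", "food"},
    "d3": {"cat", "dog", "veterinary"},
}

# Boolean query: cat AND pet
and_matches = [d for d, terms in docs.items() if {"cat", "pet"} <= terms]
print(and_matches)   # ['d1'] -- a binary yes/no, not a degree of relevance

# Boolean query: cat OR dog
or_matches = [d for d, terms in docs.items() if terms & {"cat", "dog"}]
print(or_matches)    # ['d1', 'd2', 'd3'] -- all matches are "equally relevant"
```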
The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance
The IR model explains how these problems are solved:
...By specifying the representations of queries and documents in the collection being searched
...And the information used, and the calculations performed, that order the retrieved documents by relevance
(And optionally, the model provides mechanisms for using relevance feedback to improve precision and results ordering)
Boolean model -- representations are sets of index terms, set theory operations with Boolean algebra calculate relevance as binary
Vector models -- representations are vectors with non-binary weighted index terms, linear algebra operations yield continuous measure of relevance
Structure models -- combine representations of terms with information about structures within documents (i.e., hierarchical organization) and between documents (i.e. hypertext links and other explicit relationships) to determine which parts of documents and which documents are most important and relevant
Probabilistic models -- documents are represented by index terms, and the key assumption is that the terms are distributed differently in relevant and non-relevant documents
Vectors
Summation Notation
Cosines
Vectors are an abstract way to think about a list of numbers
Any point in a vector space can be represented as a list of numbers called "coordinates" which represent values on the "axes" or "basis vectors" of the space
Adding and multiplying vectors gives us a way to represent a continuous space in any number of dimensions
We can multiply a coordinate value in a vector to "scale" its length along a particular basis vector, thereby "weighting" that value (or axis)
We will use this notation when we calculate the weightings on the terms in document and query vectors and the similarity of documents represented as vectors
We'll encounter cosines when we compute the similarity of documents and queries in terms of the "distance" between their vectors
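In summation notation, the dot product of a document vector and a query vector over t index terms, and the cosine of the angle between them, take the standard forms:

```latex
\vec{d} \cdot \vec{q} = \sum_{i=1}^{t} d_i q_i
\qquad
\cos\theta = \frac{\vec{d} \cdot \vec{q}}{\lVert\vec{d}\rVert \, \lVert\vec{q}\rVert}
           = \frac{\sum_{i=1}^{t} d_i q_i}{\sqrt{\sum_{i=1}^{t} d_i^2} \, \sqrt{\sum_{i=1}^{t} q_i^2}}
```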
Documents and queries are represented as word or term vectors
Term weights can capture term frequency within a document or importance in discriminating the document in the collection
Vector algebra provides a model for computing similarity between queries and documents, and between documents, because of the assumption that "closeness in space" means "closeness in meaning"
We can create a matrix in which we represent for each document the frequency of the words (or terms created by stemming morphologically related words) that it contains
We can use this same matrix to think of the meaning of a word or term as a vector whose coordinates measure how much the word indicates the concept or context of a document
Terms that appear in every document have no resolving power because including them retrieves every document
Terms that appear very infrequently have great resolving power, but they are by definition rare terms that most people will never use in queries
So the most useful terms are those of intermediate frequency that tend to occur in clusters, so that most of their occurrences fall in a small number of documents in the collection
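A minimal Python sketch tying the term-document matrix to this intuition about resolving power; the collection is invented, and the tf-idf weighting shown here is one common way (not necessarily the exact scheme in the original slides) to discount terms that appear everywhere:

```python
import math

# Invented term-document count matrix: rows = terms, columns = documents
terms = ["cat", "dog", "pet", "the"]
counts = [
    [2, 0, 1],   # "cat" -- clusters in d1 and d3: good resolving power
    [0, 3, 1],   # "dog"
    [1, 1, 0],   # "pet"
    [5, 4, 6],   # "the" -- appears in every document: no resolving power
]

n_docs = len(counts[0])
for term, row in zip(terms, counts):
    df = sum(1 for c in row if c > 0)       # document frequency
    idf = math.log(n_docs / df)             # inverse document frequency
    weights = [tf * idf for tf in row]      # tf-idf weight of the term in each document
    print(term, [round(w, 2) for w in weights])
# "the" gets weight 0 everywhere (idf = log(3/3) = 0), while "cat" and "dog"
# keep most of their weight in the documents where they cluster.
```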
Using the matrix from the "Weighting Using Term Frequency" slide a few slides back
If the weights are not already normalized, we can combine the normalization and the similarity calculation using this equation
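The equation itself is not reproduced in these notes; presumably it is the standard cosine measure, which folds the length normalization into the similarity computation. A minimal sketch, with invented weight vectors:

```python
import math

def cosine(d, q):
    """Dot product divided by the product of the vector lengths, so raw
    (unnormalized) weight vectors can be compared directly."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norms = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norms if norms else 0.0

doc = [0.81, 0.0, 0.41, 0.0]          # tf-idf weights for one document (invented)
query = [1.0, 0.0, 1.0, 0.0]          # weights for a two-term query
print(round(cosine(doc, query), 2))   # ~0.95
```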
If you want to find out about "cats," are documents about "felines" relevant?
Term vectors are affected by polysemy and synonymy, impairing precision and recall
Put another way, term vectors assume that terms are orthogonal or uncorrelated, and that isn't true
How can we take advantage of this latent structure in word usage that is otherwise hidden by word choice?
The basic idea behind LSA is to reduce the dimensionality of vector models by mapping documents and terms to a common conceptual space
By using far fewer representational dimensions than there are unique words, LSA "induces" the "latent" similarities among terms
LSA uses a fully automated statistical approach that makes no use of morphological, syntactic, or semantic relationships among the words
The statistical technique used by LSA is called Singular Value Decomposition, which "factors" a term-document matrix into a set of smaller matrices that approximate it
The axes of the smaller space are linear combinations of the term values in the original space
Instead of needing thousands of terms to index a document collection, LSA can provide superior precision and recall with just a few hundred dimensions (these are not generally interpretable as concepts, however)
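A minimal sketch of the SVD step using NumPy on an invented term-document count matrix; real LSA systems keep a few hundred dimensions over large collections, but the mechanics are the same:

```python
import numpy as np

# Invented term-document matrix: rows = terms, columns = documents
A = np.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [1, 1, 0, 2],
    [0, 0, 2, 3],
], dtype=float)

# Singular Value Decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: a rank-k approximation of A
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Terms and documents now live in the same k-dimensional "conceptual" space;
# e.g. document j is represented by column j of diag(s[:k]) @ Vt[:k, :]
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document
print(doc_vectors.shape)                        # (4, 2)
```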
How is this like using a thesaurus in an IR system? How is it different?
Chapter 6 (182-210) of Finding Out About
"Web Crawling" from Wikipedia
"Anatomy" article by Brin and Page; skim or skip 4.1 and 4.2