Copyright © 2005 Robert J. Glushko
Recall and Relevance
Relevance in the Boolean Model
The Vector Model
Term Weighting
Similarity Calculation
Latent Semantic Analysis
Thursday 11/10: Structure Models
Monday 11/14: Brad Horowitz (Yahoo!) on Multimedia Search and Retrieval; Guest Lecturer in Marti Hearst's Search Engines Class (M 4-6 in 100 GBP, NW campus)
Tuesday 11/15: regular class will not meet
Thursday 11/17: Ray Larson on Probabilistic Models (slight change in reading assignment)
Tuesday 11/22: Marti Hearst on User Interfaces for IR
RECALL is the proportion of the relevant documents that are retrieved
PRECISION is the proportion of the retrieved documents that are relevant
Goal: High recall and precision - Get as much good stuff as possible while getting as little junk as possible
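A minimal Python sketch of the two measures (the document identifiers and query outcome here are invented for illustration):

```python
# Invented example: 4 documents are truly relevant; the system retrieves 5.
relevant = {"d1", "d3", "d5", "d8"}          # ground-truth relevant documents
retrieved = {"d1", "d2", "d3", "d9", "d10"}  # documents the system returned

hits = relevant & retrieved                  # documents both relevant and retrieved

recall = len(hits) / len(relevant)           # 2 / 4 = 0.50
precision = len(hits) / len(retrieved)       # 2 / 5 = 0.40
print(f"recall={recall:.2f}, precision={precision:.2f}")
```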
In the Boolean model, documents and queries are represented as sets of index terms
So index terms are either present or absent in a document
How is the relevance of a document calculated?
On what basis are the retrieved documents ordered in a list presented to the searcher?
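A minimal sketch of Boolean retrieval over an invented three-document collection; because terms are simply present or absent, every document that satisfies the query is returned and there is no principled basis for ranking one match above another:

```python
# Each document is represented only as its set of index terms
docs = {
    "d1": {"cat", "pet", "food"},
    "d2": {"dog", "pet", "food"},
    "d3": {"cat", "dog", "veterinary"},
}

# Boolean query: cat AND pet
and_matches = [d for d, terms in docs.items() if {"cat", "pet"} <= terms]
print(and_matches)   # ['d1'] -- a binary yes/no, not a degree of relevance

# Boolean query: cat OR dog
or_matches = [d for d, terms in docs.items() if terms & {"cat", "dog"}]
print(or_matches)    # ['d1', 'd2', 'd3'] -- all matches are "equally relevant"
```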
The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance
The IR model explains how these problems are solved:
...By specifying the representations of queries and documents in the collection being searched
...And the information used, and the calculations performed, that order the retrieved documents by relevance
(And optionally, the model provides mechanisms for using relevance feedback to improve precision and results ordering)
Boolean model -- representations are sets of index terms, set theory operations with Boolean algebra calculate relevance as binary
Vector models -- representations are vectors with non-binary weighted index terms, linear algebra operations yield continuous measure of relevance
Structure models -- combine representations of terms with information about structures within documents (i.e., hierarchical organization) and between documents (i.e. hypertext links and other explicit relationships) to determine which parts of documents and which documents are most important and relevant
Probabilistic models -- documents are represented by index terms, and the key assumption is that the terms are distributed differently in relevant and non-relevant documents
Vectors
Summation Notation
Cosines
Vectors are an abstract way to think about a list of numbers
Any point in a vector space can be represented as a list of numbers called "coordinates" which represent values on the "axes" or "basis vectors" of the space
Adding and multiplying vectors gives us a way to represent a continuous space in any number of dimensions
We can multiply a coordinate value in a vector to "scale" its length along a particular basis vector, thereby "weighting" that value (or axis)
We will use this notation when we calculate the weightings on the terms in document and query vectors and the similarity of documents represented as vectors
We'll encounter cosines when we compute the similarity of documents and queries in terms of the "distance" between their vectors
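In summation notation, the dot product of a document vector and a query vector over t index terms, and the cosine of the angle between them, take the standard forms:

```latex
\vec{d} \cdot \vec{q} = \sum_{i=1}^{t} d_i q_i
\qquad
\cos\theta = \frac{\vec{d} \cdot \vec{q}}{\lVert\vec{d}\rVert \, \lVert\vec{q}\rVert}
           = \frac{\sum_{i=1}^{t} d_i q_i}{\sqrt{\sum_{i=1}^{t} d_i^2} \, \sqrt{\sum_{i=1}^{t} q_i^2}}
```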
Documents and queries are represented as word or term vectors
Term weights can capture term frequency within a document or importance in discriminating the document in the collection
Vector algebra provides a model for computing similarity between queries and documents, and between documents, because of the assumption that "closeness in space" means "closeness in meaning"
We can create a matrix in which we represent for each document the frequency of the words (or terms created by stemming morphologically related words) that it contains
We can use this same matrix to think of the meaning of a word or term as a vector whose coordinates measure how much the word indicates the concept or context of a document
Terms that appear in every document have no resolving power because including them retrieves every document
Terms that appear very infrequently have great resolving power, but they are by definition rare terms that most people will never use in queries
So the most useful terms are those of intermediate frequency that tend to occur in clusters, so that most of their occurrences fall in a small number of documents in the collection
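A minimal Python sketch tying the term-document matrix to this intuition about resolving power; the collection is invented, and the tf-idf weighting shown here is one common way (not necessarily the exact scheme in the original slides) to discount terms that appear everywhere:

```python
import math

# Invented term-document count matrix: rows = terms, columns = documents
terms = ["cat", "dog", "pet", "the"]
counts = [
    [2, 0, 1],   # "cat" -- clusters in d1 and d3: good resolving power
    [0, 3, 1],   # "dog"
    [1, 1, 0],   # "pet"
    [5, 4, 6],   # "the" -- appears in every document: no resolving power
]

n_docs = len(counts[0])
for term, row in zip(terms, counts):
    df = sum(1 for c in row if c > 0)       # document frequency
    idf = math.log(n_docs / df)             # inverse document frequency
    weights = [tf * idf for tf in row]      # tf-idf weight of the term in each document
    print(term, [round(w, 2) for w in weights])
# "the" gets weight 0 everywhere (idf = log(3/3) = 0), while "cat" and "dog"
# keep most of their weight in the documents where they cluster.
```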
Using the matrix from the "Weighting Using Term Frequency" slide a few slides back
If the weights are not already normalized, we can combine the normalization and the similarity calculation using this equation
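The equation itself is not reproduced in these notes; presumably it is the standard cosine measure, which folds the length normalization into the similarity computation. A minimal sketch, with invented weight vectors:

```python
import math

def cosine(d, q):
    """Dot product divided by the product of the vector lengths, so raw
    (unnormalized) weight vectors can be compared directly."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norms = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norms if norms else 0.0

doc = [0.81, 0.0, 0.41, 0.0]          # tf-idf weights for one document (invented)
query = [1.0, 0.0, 1.0, 0.0]          # weights for a two-term query
print(round(cosine(doc, query), 2))   # ~0.95
```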
If you want to find out about "cats," are documents about "felines" relevant?
Term vectors are affected by polysemy and synonymy, impairing precision and recall
Put another way, term vectors assume that terms are orthogonal or uncorrelated, and that isn't true
How can we take advantage of this latent structure in word usage that is otherwise hidden by word choice?
The basic idea behind LSA is to reduce the dimensionality of vector models by mapping documents and terms to a common conceptual space
By using far fewer representational dimensions than there are unique words, LSA "induces" the "latent" similarities among terms
LSA uses a fully automated statistical approach that makes no use of morphological, syntactic, or semantic relationships among the words
The statistical technique used by LSA is called Singular Value Decomposition, which "factors" a term-document matrix into a set of smaller matrices that approximate it
The axes of the smaller space are linear combinations of the term values in the original space
Instead of needing thousands of terms to index a document collection, LSA can provide superior precision and recall with just a few hundred dimensions (these are not generally interpretable as concepts, however)
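A minimal sketch of the SVD step using NumPy on an invented term-document count matrix; real LSA systems keep a few hundred dimensions over large collections, but the mechanics are the same:

```python
import numpy as np

# Invented term-document matrix: rows = terms, columns = documents
A = np.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [1, 1, 0, 2],
    [0, 0, 2, 3],
], dtype=float)

# Singular Value Decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: a rank-k approximation of A
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Terms and documents now live in the same k-dimensional "conceptual" space;
# e.g. document j is represented by column j of diag(s[:k]) @ Vt[:k, :]
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document
print(doc_vectors.shape)                        # (4, 2)
```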
How is this like using a thesaurus in an IR system? How is it different?
Chapter 6 (182-210) of Finding Out About
"Web Crawling" from Wikipedia
"Anatomy" article by Brin and Page; skim or skip 4.1 and 4.2