Information Retrieval Concepts
Outline of Course Topics
- I. Introduction to Information Retrieval
- A. What is Information Retrieval?
- B. The notion of Relevance.
- C. The IR problems.
- 1. Finding relevant documents.
- 2. Question answering systems.
- 3. Intelligent Filtering.
- 4. Synthesizing and merging information.
- D. Conceptual Models of IR systems.
- E. Characteristics of text collections.
- 1. Structured text and Bibliographic records.
- 2. Multimedia documents.
- 3. Full Text.
- II. Conceptual Models of Information Retrieval Systems.
- A. Boolean Systems.
- B. The Vector Space Model.
- C. Probabilistic Models.
- D. Natural Language Processing models of the IR task.
- E. Document similarity measures.
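The vector space model (B) and the similarity measures (E) above are usually introduced with a small worked example. The following sketch is illustrative only (the function and variable names are our own, not any particular system's API): documents are represented as raw term-frequency vectors and compared with the cosine measure.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)           # inner product
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

# Identical documents score 1.0; documents sharing no terms score 0.0.
```

Real systems weight terms (e.g. tf-idf) rather than using raw counts, but the geometry is the same.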
- III. Term and Language properties for IR collections.
- A. Properties of language collections.
- 1. Zipf's law.
- 2. Statistical distributions.
- 3. Stochastic language models.
- B. Full-text Documents vs. Bibliographic records.
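Zipf's law (A.1) says that when terms are ranked by frequency, frequency falls off roughly as the inverse of rank, so rank times frequency is approximately constant across a large collection. A minimal sketch of the rank-frequency computation (names are illustrative):

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, term, frequency) triples, most frequent first.
    Under Zipf's law, rank * frequency stays roughly constant."""
    counts = Counter(text.lower().split())
    return [(rank, term, freq)
            for rank, (term, freq) in enumerate(counts.most_common(), start=1)]
```

On a realistic corpus, plotting log-frequency against log-rank for this output gives the familiar near-straight Zipf line.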
- IV. Data and File Structures for Information Retrieval.
- A. Inverted files.
- B. Signature files.
- C. Other file structures (PAT trees, Grid Files, Hashing).
- D. DBMS-based Information Retrieval.
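The inverted file (A) is the workhorse structure above: each term maps to a sorted posting list of documents containing it, and Boolean queries become posting-list intersections. A toy sketch, with our own function names and no compression or positional information:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of document ids.
    `docs` is a dict of doc_id -> text."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def boolean_and(index, term1, term2):
    """A Boolean AND query: intersect two posting lists."""
    return sorted(set(index.get(term1, [])) & set(index.get(term2, [])))
```

Production indexes keep postings sorted on disk and intersect them with a merge walk rather than sets, but the logical structure is the same.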
- V. Automatic Indexing.
- A. Indexing goals.
- 1. Passage vs. document retrieval.
- 2. Salton's "Blueprint for automatic indexing."
- B. Lexical Analysis.
- 1. Stoplists.
- 2. Stemming algorithms.
- 3. Stemming and Morphological analysis.
- 4. Part-of-speech tagging and Parsing.
- 5. Segmentation strategies for long texts.
- a. Orthographic and author-identified segments.
- b. Fixed-size windows.
- c. TextTiling.
- 6. Phrase recognition.
- a. Syntactic methods.
- b. Collocational and statistical methods.
- 7. Disambiguation
- a. Algorithms (Liddy, Yarowsky, Wilensky & Chen)
- C. Thesaurus Construction.
- 1. Collection-sensitive thesauri.
- 2. Manually derived thesauri (WordNet, MeSH, LCSH).
- 3. Automatically derived thesauri.
- a. Term associations (Sparck Jones).
- b. Latent Semantic Indexing.
- D. Does linguistic and detailed language analysis help?
- 1. Boggess et al. (semantic tagging)
- 2. Robust discourse analysis (Liddy)
- E. Indexing and storage issues.
- 1. Index compression.
- F. Other automatic indexing issues.
- 1. Automating Hypertext linkages.
- 2. Positional information in indexes.
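The lexical-analysis steps in B above (stoplists, stemming) form a simple indexing pipeline. The sketch below is illustrative only: the stoplist is a tiny stand-in, and the suffix stripping is a crude placeholder, not a real stemming algorithm such as Porter's.

```python
# A tiny illustrative stoplist; real stoplists run to hundreds of words.
STOPLIST = {"the", "a", "an", "of", "and", "in", "to"}

def crude_stem(term):
    """Very crude suffix stripping (illustration only; real systems
    use a proper stemmer such as the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def index_terms(text):
    """Lexical analysis: lowercase, drop stopwords, then stem."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPLIST]
```

Even this toy pipeline shows the effect: morphological variants ("indexing", "documents") collapse to shared index terms, while high-frequency function words are discarded.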
- VI. Automatic Classification.
- A. Origins in manual classification.
- 1. Classification schemes.
- 2. The manual classification task.
- B. Numerical Taxonomy and automatic methods.
- C. Experiments in automatic assignment of pre-defined classifications.
- 1. Early work (Maron, Borko).
- 2. Probabilistic methods of class assignment.
- 3. NLP category assignment methods.
- D. Automatic Classification -- Clustering.
- 1. The cluster hypothesis.
- 2. Hierarchical Classification.
- a. Single Link.
- b. Complete link.
- 3. Heuristic classification.
- a. Rocchio's method.
- b. Dattola's method.
- c. Scatter/Gather.
- E. Using classification for searching and retrieval.
- 1. Library classification as an organizing principle.
- 2. Yahoo and WWW classification.
- 3. Cheshire 2-stage retrieval.
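Single-link clustering (D.2.a) merges, at each step, the two clusters whose *closest* pair of members is nearest, which is what gives it its characteristic chaining behavior. A small agglomerative sketch (names and the distance-threshold stopping rule are our own simplifications; real implementations use efficient nearest-neighbor structures):

```python
def single_link_clusters(points, threshold, dist):
    """Agglomerative single-link clustering: repeatedly merge the two
    clusters whose closest pair of members lies within `threshold`."""
    clusters = [[p] for p in points]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the MINIMUM pairwise distance.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:          # no pair close enough: stop
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)
```

Replacing `min` with `max` in the marked line turns this into complete link (D.2.b), which resists chaining at the cost of tighter, smaller clusters.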
- VII. Machine Learning Techniques in IR.
- A. Techniques for machine learning.
- 1. Neural Networks.
- 2. Genetic Algorithms.
- 3. Symbolic Learning.
- B. Applications in IR.
- VIII. Search strategies.
- A. Information seeking behavior and query formulation.
- 1. Information seeking and information needs.
- 2. Query construction and visualization.
- a. Standard Boolean.
- b. Augmented Boolean.
- c. Similarity Searching (vector and probabilistic).
- d. Weighted queries.
- B. Relevance Feedback and other query modification methods.
- 1. Algorithms for relevance feedback.
- C. Browsing/Navigation vs. Search.
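The classic algorithm for relevance feedback (B.1) is Rocchio's: move the query vector toward the centroid of documents the user judged relevant and away from the centroid of those judged non-relevant. A sketch under the usual formulation (the parameter defaults here are common illustrative choices, not prescribed values):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback.
    q' = alpha*q + (beta/|R|) * sum(relevant) - (gamma/|N|) * sum(nonrelevant).
    All vectors are dicts of term -> weight."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)  # negative weights are conventionally clipped
    return new_query
```

Note how feedback *expands* the query: terms from relevant documents (here "dog") enter the new query even though the user never typed them.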
- IX. Examples of Information Retrieval Systems. (In-depth examination of the IR features of several systems, including:)
- A. Boolean Systems.
- 1. MELVYL.
- 2. Lexis/Nexis.
- 3. Dialog.
- B. Vector Space systems.
- 1. SMART.
- 2. WAIS.
- C. Probabilistic systems.
- 1. Cheshire II.
- 2. Inktomi.
- 3. INQUERY (Bayesian Networks).
- D. NLP/hybrid systems.
- 1. LYCOS.
- E. Frame and Knowledge-based systems.
- 1. Case studies: the MUC task.
- 2. SCISOR
- 3. TOPIC/Verity
- 4. FERRET
- F. Question Answering (robust approaches).
- 1. Kupiec's MURAX (encyclopedia question answering)
- X. Evaluation.
- A. Assumptions in IR performance evaluation.
- 1. Fully automated vs. interactive systems.
- 2. Who determines relevance?
- B. Measures of IR performance.
- 1. Precision.
- 2. Recall.
- 3. Fallout.
- 4. Combining measures.
- a. Van Rijsbergen's E-score.
- b. Frei's preference scores.
- 5. Statistical significance tests.
- C. Test collections.
- 1. Cranfield.
- 2. CACM.
- 3. MEDLINE.
- 4. Other small collections.
- 5. TIPSTER.
- D. TREC (the Text REtrieval Conference).
- 1. Overview of the TREC collection and queries.
- 2. Results and lessons from TREC.
- E. Evaluation methodology.
- 1. What is a good IR experiment?
- 2. Experimental design.
- 3. Selection of test collections.
- 4. Running experiments and collecting results.
- 5. Standard analyses and analysis tools.
- 6. Graphical and tabular display of results.
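The core measures in B can be computed directly from sets of retrieved and relevant documents. The sketch below (function names are our own) computes precision and recall, and combines them with Van Rijsbergen's E-measure, where E = 1 - F and lower is better:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r

def e_score(p, r, beta=1.0):
    """Van Rijsbergen's E-measure: E = 1 - F_beta.
    beta > 1 emphasizes recall; beta < 1 emphasizes precision."""
    if p == 0 and r == 0:
        return 1.0
    return 1.0 - (1 + beta**2) * p * r / (beta**2 * p + r)
```

A perfect result (p = r = 1) gives E = 0; retrieving nothing relevant gives E = 1.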
- XI. Applications of IR technology.
- A. Commercial IR systems.
- B. Library Catalogs.
- C. Digital Libraries.
- D. Routing and filtering.
- E. World Wide Web/Internet search engines.