Information Retrieval Concepts
Outline of Course Topics
- I. Introduction to Information Retrieval
- A. What is Information Retrieval?
- B. The notion of Relevance.
- C. The IR problems.
- 1. Finding relevant documents.
- 2. Question answering systems.
- 3. Intelligent Filtering.
- 4. Synthesizing and merging information.
- D. Conceptual Models of IR systems.
- E. Characteristics of text collections.
- 1. Structured text and Bibliographic records.
- 2. Multimedia documents.
- 3. Full Text.
- II. Conceptual Models of Information Retrieval Systems.
- A. Boolean Systems.
- B. The Vector Space Model.
- C. Probabilistic Models.
- D. Natural Language Processing models of the IR task.
- E. Document similarity measures.
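The vector space model (B) and the similarity measures (E) above are usually introduced with a small worked example. The following sketch is illustrative only (the function and variable names are our own, not any particular system's API): documents are represented as raw term-frequency vectors and compared with the cosine measure.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)           # inner product
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

# Identical documents score 1.0; documents sharing no terms score 0.0.
```

Real systems weight terms (e.g. tf-idf) rather than using raw counts, but the geometry is the same.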
- III. Term and Language properties for IR collections.
- A. Properties of language collections.
- 1. Zipf's law.
- 2. Statistical distributions.
- 3. Stochastic language models.
- B. Full-text Documents vs. Bibliographic records.
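Zipf's law (A.1) says that when terms are ranked by frequency, frequency falls off roughly as the inverse of rank, so rank times frequency is approximately constant across a large collection. A minimal sketch of the rank-frequency computation (names are illustrative):

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, term, frequency) triples, most frequent first.
    Under Zipf's law, rank * frequency stays roughly constant."""
    counts = Counter(text.lower().split())
    return [(rank, term, freq)
            for rank, (term, freq) in enumerate(counts.most_common(), start=1)]
```

On a realistic corpus, plotting log-frequency against log-rank for this output gives the familiar near-straight Zipf line.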
- IV. Data and File Structures for Information Retrieval.
- A. Inverted files.
- B. Signature files.
- C. Other file structures (PAT trees, Grid Files, Hashing).
- D. DBMS-based Information Retrieval.
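The inverted file (A) is the workhorse structure above: each term maps to a sorted posting list of documents containing it, and Boolean queries become posting-list intersections. A toy sketch, with our own function names and no compression or positional information:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of document ids.
    `docs` is a dict of doc_id -> text."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def boolean_and(index, term1, term2):
    """A Boolean AND query: intersect two posting lists."""
    return sorted(set(index.get(term1, [])) & set(index.get(term2, [])))
```

Production indexes keep postings sorted on disk and intersect them with a merge walk rather than sets, but the logical structure is the same.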
- V. Automatic Indexing.
- A. Indexing goals.
- 1. Passage vs. document retrieval.
- 2. Salton's "Blueprint for automatic indexing."
- B. Lexical Analysis.
- 1. Stoplists.
- 2. Stemming algorithms.
- 3. Stemming and Morphological analysis.
- 4. Part-of-speech tagging and Parsing.
- 5. Segmentation strategies for long texts.
- a. Orthographic and author-identified segments.
- b. Fixed-size windows.
- c. TextTiling.
- 6. Phrase recognition.
- a. Syntactic methods.
- b. Collocational and statistical methods.
- 7. Disambiguation
- a. Algorithms (Liddy, Yarowsky, Wilensky & Chen)
- C. Thesaurus Construction.
- 1. Collection-sensitive thesauri.
- 2. Manually derived thesauri (WordNet, MeSH, LCSH).
- 3. Automatically derived thesauri.
- a. Term associations (Sparck Jones).
- b. Latent Semantic Indexing.
- D. Does linguistic and detailed language analysis help?
- 1. Boggess et al. (semantic tagging)
- 2. Robust discourse analysis (Liddy)
- E. Indexing and storage issues.
- 1. Index compression.
- F. Other automatic indexing issues.
- 1. Automating Hypertext linkages.
- 2. Positional information in indexes.
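The lexical-analysis steps in B above (stoplists, stemming) form a simple indexing pipeline. The sketch below is illustrative only: the stoplist is a tiny stand-in, and the suffix stripping is a crude placeholder, not a real stemming algorithm such as Porter's.

```python
# A tiny illustrative stoplist; real stoplists run to hundreds of words.
STOPLIST = {"the", "a", "an", "of", "and", "in", "to"}

def crude_stem(term):
    """Very crude suffix stripping (illustration only; real systems
    use a proper stemmer such as the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def index_terms(text):
    """Lexical analysis: lowercase, drop stopwords, then stem."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPLIST]
```

Even this toy pipeline shows the effect: morphological variants ("indexing", "documents") collapse to shared index terms, while high-frequency function words are discarded.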
- VI. Automatic Classification.
- A. Origins in manual classification.
- 1. Classification schemes.
- 2. The manual classification task.
- B. Numerical Taxonomy and automatic methods.
- C. Experiments in automatic assignment of pre-defined classifications.
- 1. Early work (Maron, Borko).
- 2. Probabilistic methods of class assignment.
- 3. NLP category assignment methods.
- D. Automatic Classification -- Clustering.
- 1. The cluster hypothesis.
- 2. Hierarchical Classification.
- a. Single Link.
- b. Complete link.
- 3. Heuristic classification.
- a. Rocchio's method.
- b. Dattola's method.
- c. Scatter/Gather.
- E. Using classification for searching and retrieval.
- 1. Library classification as an organizing principle.
- 2. Yahoo and WWW classification.
- 3. Cheshire 2-stage retrieval.
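Single-link clustering (D.2.a) merges, at each step, the two clusters whose *closest* pair of members is nearest, which is what gives it its characteristic chaining behavior. A small agglomerative sketch (names and the distance-threshold stopping rule are our own simplifications; real implementations use efficient nearest-neighbor structures):

```python
def single_link_clusters(points, threshold, dist):
    """Agglomerative single-link clustering: repeatedly merge the two
    clusters whose closest pair of members lies within `threshold`."""
    clusters = [[p] for p in points]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the MINIMUM pairwise distance.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:          # no pair close enough: stop
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)
```

Replacing `min` with `max` in the marked line turns this into complete link (D.2.b), which resists chaining at the cost of tighter, smaller clusters.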
- VII. Machine Learning Techniques in IR.
- A. Techniques for machine learning.
- 1. Neural Networks.
- 2. Genetic Algorithms.
- 3. Symbolic Learning.
- B. Applications in IR.
- VIII. Search strategies.
- A. Information seeking behavior and query formulation.
- 1. Information seeking and information needs.
- 2. Query construction and visualization.
- a. Standard Boolean.
- b. Augmented Boolean.
- c. Similarity Searching (vector and probabilistic).
- d. Weighted queries.
- B. Relevance Feedback and other query modification methods.
- 1. Algorithms for relevance feedback.
- C. Browsing/Navigation vs. Search.
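The classic algorithm for relevance feedback (B.1) is Rocchio's: move the query vector toward the centroid of documents the user judged relevant and away from the centroid of those judged non-relevant. A sketch under the usual formulation (the parameter defaults here are common illustrative choices, not prescribed values):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback.
    q' = alpha*q + (beta/|R|) * sum(relevant) - (gamma/|N|) * sum(nonrelevant).
    All vectors are dicts of term -> weight."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)  # negative weights are conventionally clipped
    return new_query
```

Note how feedback *expands* the query: terms from relevant documents (here "dog") enter the new query even though the user never typed them.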
- IX. Examples of Information Retrieval Systems. (In-depth examination of the IR features of several systems, including:)
- A. Boolean Systems.
- 1. MELVYL.
- 2. Lexis/Nexis.
- 3. Dialog.
- B. Vector Space systems.
- 1. SMART.
- 2. WAIS.
- C. Probabilistic systems.
- 1. Cheshire II.
- 2. Inktomi.
- 3. INQUERY (Bayesian Networks).
- D. NLP/hybrid systems.
- 1. LYCOS.
- E. Frame and Knowledge-based systems.
- 1. Case studies: the MUC task.
- 2. SCISOR
- 3. TOPIC/Verity
- 4. FERRET
- F. Question Answering (robust approaches).
- 1. Kupiec's MURAX (encyclopedia question answering)
- X. Evaluation.
- A. Assumptions in IR performance evaluation.
- 1. Fully automated vs. interactive systems.
- 2. Who determines relevance?
- B. Measures of IR performance.
- 1. Precision.
- 2. Recall.
- 3. Fallout.
- 4. Combining measures.
- a. Van Rijsbergen's E-score.
- b. Frei's preference scores.
- 5. Statistical significance tests.
- C. Test collections.
- 1. Cranfield.
- 2. CACM.
- 3. MEDLINE.
- 4. Other small collections.
- 5. TIPSTER.
- D. TREC (the Text REtrieval Conference).
- 1. Overview of the TREC collection and queries.
- 2. Results and lessons from TREC.
- E. Evaluation methodology.
- 1. What is a good IR experiment?
- 2. Experimental design.
- 3. Selection of test collections.
- 4. Running experiments and collecting results.
- 5. Standard analyses and analysis tools.
- 6. Graphical and tabular display of results.
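The core measures in B can be computed directly from sets of retrieved and relevant documents. The sketch below (function names are our own) computes precision and recall, and combines them with Van Rijsbergen's E-measure, where E = 1 - F and lower is better:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r

def e_score(p, r, beta=1.0):
    """Van Rijsbergen's E-measure: E = 1 - F_beta.
    beta > 1 emphasizes recall; beta < 1 emphasizes precision."""
    if p == 0 and r == 0:
        return 1.0
    return 1.0 - (1 + beta**2) * p * r / (beta**2 * p + r)
```

A perfect result (p = r = 1) gives E = 0; retrieving nothing relevant gives E = 1.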
- XI. Applications of IR technology.
- A. Commercial IR systems.
- B. Library Catalogs.
- C. Digital Libraries.
- D. Routing and filtering.
- E. World Wide Web/Internet search engines.