SIMS 202 Information Organization and Retrieval  

Midterm Exam Preparation Guide


The exam will be handed out on Thursday October 19 at the end of class, and will be due Tuesday October 24 at the beginning of class.

This will be an open-book, open-note exam.

Each person must work individually. This means you cannot discuss the exam with anyone except for Marti, Ray, or Jennifer.

To study for the exam,

  • Be sure you understand the material that was covered in lecture and have read and absorbed the corresponding material in the readings.
  • Be sure you can do activities similar to what was done in the homeworks.
  • We will try to write some questions that require you to generalize from what you've learned and synthesize ideas. So be sure you have thought about the ideas covered in lecture, readings, and homeworks.

Below are the major topics we've covered so far, along with some example questions. Please note that these are examples of the types of questions we will ask; they are (probably) not the exact questions we will ask. Furthermore, we will probably ask some other types of questions too, in particular the kind where we give you an example of some information and ask you to do something with it.

·  Topic: Document Representation and Statistical Properties of Text

·  Example Questions:

What is the significance of Zipf's law for weighting of terms in information retrieval?
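For review: Zipf's law says that if the distinct terms of a collection are ranked by frequency, the frequency of the term at rank r is roughly proportional to 1/r. The short Python sketch below (illustrative only, not from the course materials) tabulates rank times frequency, which should stay roughly constant:

    # Illustrative sketch: print rank * frequency for the most common terms;
    # under Zipf's law this product stays roughly constant.
    from collections import Counter

    def zipf_table(text, top=10):
        counts = Counter(text.lower().split())
        for rank, (term, freq) in enumerate(counts.most_common(top), start=1):
            print(rank, term, freq, rank * freq)

    # e.g. zipf_table(open("sample.txt").read())   # any plain-text file will do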

What kinds of errors can a stemming algorithm produce?

·  Topic: Queries, Ranking, and the Vector Space Model

·  Example Questions:

What is the difference between a search engine that uses the vector space ranking algorithm on natural language queries and a system that uses Boolean queries?

What is the role of coordination level ranking in a faceted Boolean system?

Describe the following information need in terms of a faceted Boolean query. What kinds of weighting algorithms can be applied to a faceted query like this?

"I would like to find articles about the effects of the passage of the independent investigator statute by Congress on how the U.S. president chooses an attorney general."
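One possible faceting, shown only as an illustration (other facetings and synonym choices are defensible), ORs related terms within each facet and ANDs the facets together:

    (independent investigator OR independent counsel)
    AND (statute OR law)
    AND (Congress)
    AND (president OR presidential)
    AND (attorney general)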

Why do different web search engines return different sets of documents for the same query?

Redo the computations of Assignment 4 part 3 using different values for TF.
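For practice with that kind of computation, here is a small Python sketch, not the Assignment 4 data or code, that ranks toy documents against a query by cosine similarity and lets you swap the term-frequency component among raw, logarithmic, and binary variants:

    # Illustrative sketch only: toy documents and query, made-up weighting choices.
    import math
    from collections import Counter

    docs = ["shipment of gold damaged in a fire",
            "delivery of silver arrived in a silver truck",
            "shipment of gold arrived in a truck"]
    query = "gold silver truck"

    def tf_weight(f, variant):
        if variant == "log":
            return 1 + math.log(f)
        if variant == "binary":
            return 1
        return f                                  # raw term frequency

    def vectorize(text, df, n_docs, variant):
        counts = Counter(text.lower().split())
        return {t: tf_weight(f, variant) * math.log(n_docs / df[t])   # tf * idf
                for t, f in counts.items() if t in df}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    df = Counter(t for d in docs for t in set(d.lower().split()))
    for variant in ("raw", "log", "binary"):
        dvecs = [vectorize(d, df, len(docs), variant) for d in docs]
        qvec = vectorize(query, df, len(docs), variant)
        print(variant, [round(cosine(qvec, dv), 3) for dv in dvecs])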

·  Topic: IR systems and Implementation

·  Example Questions:

Draw and label a diagram that shows the major components of an IR system.

What are the special features of the Cheshire II information access system?

What is the purpose of an inverted index? How is it used to generate answers to Boolean queries? Convert the contents of a set of documents into an inverted index representation.
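As a reminder of the mechanics, the short sketch below (toy documents, illustrative only) builds an inverted index and answers a conjunctive Boolean query by intersecting postings sets:

    # Minimal illustrative sketch: an inverted index maps each term to the set of
    # documents containing it; a Boolean AND query intersects those postings sets.
    from collections import defaultdict

    def build_index(docs):
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def boolean_and(index, *terms):
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    docs = ["gold shipment damaged in a fire",
            "silver arrived in a silver truck",
            "gold arrived in a truck"]
    index = build_index(docs)
    print(boolean_and(index, "gold", "truck"))   # -> {2}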

·  Topic: Evaluation of IR Systems

·  Example Questions:

Define precision. Define recall. Define relevance. How are the three interrelated?
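One way to keep the definitions straight is as ratios over the retrieved set and the relevant set, as in this sketch with made-up document ids:

    # Illustrative sketch: precision and recall as set ratios.
    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    retrieved = {1, 2, 3, 4}       # documents the system returned (hypothetical)
    relevant  = {2, 4, 5, 6, 7}    # documents judged relevant (hypothetical)
    print(precision(retrieved, relevant))   # 2/4 = 0.5
    print(recall(retrieved, relevant))      # 2/5 = 0.4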

Under what circumstances is high recall desirable? Under what circumstances is high precision desirable?

What is the main purpose of TREC? How does it differ from earlier evaluation efforts?

·  Topic: The Search Process and User Interfaces

·  Example Questions:

Search and retrieval is part of a larger process. Name some other components of that process.

How and why does the Bates berry-picking model fail to fit the standard information retrieval model?

How (fundamentally) does search on a system like Yahoo or Looksmart differ from search on Altavista or Hotbot?

Name the search modes discussed in the O'Day and Jeffries paper. What kinds of triggers did they find caused transitions from one search strategy to another?

Compare and contrast the current approaches to providing user interfaces for overviews of document collections.

What is the purpose of the TileBars graphical user interface? What are its strengths and weaknesses?

Compare and contrast the attempts that have been made to provide user interfaces for searching text collections in which documents have been assigned categories from large category hierarchies.

Practice design question: (Note: we may ask a similar question on the exam.) Consider the DLITE interface for search. Name a type of search task, and design a DLITE workspace that would support this task. Describe the functionality it does and does not support, and sketch a storyboard of a user completing two tasks using your design. Justify your decisions.

·  Topic: Relevance Feedback

·  Example Questions:

What is the main difference between relevance feedback as defined in the literature and the more current web-based notion of "more like this"?

Given a query, three documents marked as relevant, and the Rocchio formula for relevance feedback given in class, compute the vector for the new query that results.
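As a reminder of the shape of that computation, here is a sketch of a commonly cited Rocchio form; the alpha, beta, and gamma constants and the vectors below are placeholders, so use the exact formula given in class on the exam:

    # Sketch of a standard Rocchio-style update (constants given in class may differ):
    #   q_new = alpha*q + beta*(mean of relevant doc vectors)
    #                   - gamma*(mean of nonrelevant doc vectors)
    def rocchio(query, relevant, nonrelevant=(), alpha=1.0, beta=0.75, gamma=0.15):
        terms = set(query)
        for d in list(relevant) + list(nonrelevant):
            terms |= set(d)
        new_q = {}
        for t in terms:
            w = alpha * query.get(t, 0.0)
            if relevant:
                w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
            if nonrelevant:
                w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
            new_q[t] = max(w, 0.0)        # negative weights are usually clipped to zero
        return new_q

    q  = {"gold": 1.0, "truck": 1.0}                  # hypothetical query vector
    d1 = {"gold": 0.5, "shipment": 0.8}               # three documents marked relevant
    d2 = {"truck": 0.9, "silver": 0.4}
    d3 = {"gold": 0.3}
    print(rocchio(q, [d1, d2, d3]))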

The Koenemann & Belkin study found results in three conditions for relevance feedback: opaque, transparent, and penetrable. Consider the different ways people have recently implemented systems for predicting which web page to show the user next. How do the differences in these systems correspond to the different relevance feedback conditions in the K&B study?