UC Berkeley School of Information

I 240: Information Retrieval

Textbooks and Readings

(See also Links)


Christopher D. Manning, Prabhakar Raghavan and Hinrich Schuetze. Introduction to Information Retrieval. Cambridge University Press, 2008. (Also preprint version available online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html)

Karen Sparck Jones and Peter Willett. Readings in Information Retrieval. San Francisco: Morgan Kaufmann, 1997 (ISBN 1-55860-454-5). Highly recommended; there will be readings from this. Parts are available through Google Books.


David A. Grossman and Ophir Frieder. Information Retrieval: Algorithms and Heuristics. Second Edition. Dordrecht, The Netherlands: Springer, 2004 (ISBN 1-4020-3004-5).

Baeza-Yates and Ribeiro-Neto. Modern Information Retrieval, Addison Wesley, 1999.

C. J. van Rijsbergen. Information retrieval. London : Butterworths, 1975. Available through the preceding link in PDF or HTML.

William R. Hersh. Information Retrieval: A Health and Biomedical Perspective. 2nd Edition. Springer-Verlag, 2003; ISBN: 0-387-95522-4

W. Bruce Croft (ed). Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval. Kluwer Academic Publishers, 2000; ISBN: 0-7923-7812-1.

Ian H. Witten, Alistair Moffat and Timothy C. Bell. _Managing Gigabytes: Compressing and Indexing Documents and Images_. 2nd Edition. (Morgan Kaufmann Series in Multimedia Information and Systems) Morgan Kaufmann Publishers, 1999; ISBN: 1558605703

William B. Frakes and Ricardo Baeza-Yates. _Information retrieval: data structures & algorithms_. Englewood Cliffs, N.J. : Prentice Hall, 1992.

Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Reading, Mass. : Addison-Wesley, 1988. [Amazon currently lists this book as "Out of Print--Limited Availability", but it may be available used.]

Charles P. Bourne and Trudi Bellardo Hahn. A History of Online Information Services: 1963-1976. The MIT Press, 2003; ISBN: 0-262-02538-8. For those interested in the early history of online IR services.

Additional Readings:

These are readings for background and discussion. Most of these will be assigned for class discussion; others are for those who wish to dig further into particular subjects. The list is based on the textbook of readings on IR by Peter Willett and Karen Sparck Jones, with some additional items (in case you want to try to hunt down the individual papers in the readings for the course).

In addition, a digital library of early IR report and book literature is being made available through SIGIR at http://www.sigir.org/museum/.

* HISTORICAL: These items cover some early ideas and implementations that provide some of the foundations of information retrieval theory and practice.
Luhn, H.P. (1957). A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1, 309-317.
Fairthorne, R.A. (1958). Automatic Retrieval of Recorded Information. Computer Journal, 1, 36-41. (Also in Fairthorne, R.A. (1961). Towards information retrieval. London: Butterworths).
Joyce, T. and Needham, R.M. (1958). The thesaurus approach to information retrieval. American Documentation, 9 (3), 192-197.
Luhn, H.P. (1961). The automatic derivation of information retrieval encodements from machine-readable texts. Information retrieval and machine translation (Ed A. Kent), Vol 3, Pt 2, 1021-1028; reprinted in C.K. Schultz, Ed, H.P. Luhn: Pioneer of information science, New York: Spartan Books, 1968,
Maron, M.E. and Kuhns, J.L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery, 7, 216-244.
Maron, M.E. (1961). Automatic indexing: an experimental inquiry. Journal of the Association for Computing Machinery, 8, 404-417.
Doyle, L.B. (1962). Indexing and abstracting by association. Part 1. SP-718/001/00, System Development Corporation, Santa Monica CA.
Maron, M.E. (1965). Mechanised documentation: the logic behind a probabilistic interpretation. Statistical methods for mechanised documentation (Ed M.E. Stevens, V.E. Giuliano and L.B. Heilprin), National Bureau of Standards Miscellaneous Publication 269, Washington DC: US Government Printing Office, 9-13.
Cleverdon, C.W. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19, 1967, 173-192.
Salton, G. and Lesk, M.E. (1968). Computer evaluation of indexing and text processing. Journal of the ACM, 15 (1), 8-36; reprinted in G. Salton, Ed, The SMART retrieval system, Englewood Cliffs NJ: Prentice-Hall, 1971, 143-180.

* KEY CONCEPTS: These papers examine the nature of documents, aboutness, indexing and index languages, requests, relevance, users and searching. Note this section deals with these topics primarily in an analytical and descriptive style, rather than by wholesale modelling of the retrieval process, covered in a later section.
Hutchins, W.J. (1978). The concept of 'aboutness' in subject indexing. Aslib Proceedings, 30, 172-181.
Cleverdon, C.W. and Mills, J. (1963). The testing of index language devices. Aslib Proceedings, 15 (4), 106-130; reprinted in L.M. Chan, P.A. Richmond and E. Svenonius, Eds, Theory of Subject Analysis, Littleton CO: Libraries Unlimited, 1986, 223-246.
Foskett, D.J. (1980). Thesaurus. In A. Kent, H. Lancour and J.E. Daily, Eds, Encyclopedia of Library and Information Science, Vol 30, New York: Marcel Dekker, 416-462; reprinted in E.D. Dym, Ed, Subject and information analysis, New York: Marcel Dekker, 1985, 270-316.
Daniels, P.J., Brooks, H.M. and Belkin, N.J. (1985). Using problem structures for driving human-computer dialogues. RIAO-85, Actes: Recherche d'Informations Assistee par Ordinateur, Grenoble: IMAG, 645-660.
Saracevic, T. (1975). Relevance: a review of and a framework for the thinking on the notion in information science. Journal of the American Society for Information Science, 26 (6), 321-343.

* EVALUATION: These papers cover the notions of performance issues, criteria for performance evaluation, test design and methodology, with examples illustrating the methods.
Saracevic, T. et al (1988). A study of information seeking and retrieving, Parts 1, 2, 3. Journal of the American Society for Information Science, 39 (3), 161-216. Pt 1 only
Cooper, W.S. (1973). On selecting a measure of retrieval effectiveness. Pt 1. Journal of the American Society for Information Science, 24 (?2), 87-100.
Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing and Management, 28 (4), 467-490.
Keen, E.M. (1992). Presenting results of experimental retrieval comparisons. Information Processing and Management, 28 (4), 491-502.
Lancaster, F.W. (1969). MEDLARS: Report on the evaluation of its operating efficiency. American Documentation, 20 (2), 119-142; reprinted in T. Saracevic, Ed, Introduction to Information Science, New York: Bowker, 1970, 640-664.
Blair, D.C. and Maron, M.E. (1985). An evaluation of retrieval effectiveness for a full-text document retrieval system. Communications of the ACM, 28 (??), 289-299.
Salton, G. (1986). Another look at text-retrieval systems. Communications of the ACM, 29(7), 648-656.
Blair, D.C. and Maron, M.E. (1990). Full text information retrieval: further analysis and clarification. Information Processing and Management, 26, 437-447.
Blair, D.C. (1996). STAIRS redux: thoughts on the STAIRS evaluations, ten years after. Journal of the American Society for Information Science, 47, 4-22.
Harman, D. (1995). The TREC Conferences. Hypertext - information retrieval - multimedia: synergieeffekte elektronischer informationssysteme, HIM '95, Proceedings (Ed R. Kuhlen and M. Rittberger), Konstanz: Universitaetsverlag Konstanz, 9-28.

* BASIC IR MODELS: These papers cover models of IR, both qualitative and quantitative (eg cognitive, statistical), concentrating on the general notions of the main IR models. Implementation issues are described later in Techniques.
Robertson, S.E. (1977). Theories and models in information retrieval. Journal of Documentation, 33, 126-148.
Belkin, N.J., Oddy, R.N. and Brooks, H.M. (1982). ASK for information retrieval: part 1. Background and theory. Journal of Documentation, 38, 61-71.
Cooper, W.S. (1988). Getting beyond Boole. Information Processing and Management, 24, 243-248.
Robertson, S.E. (1977). The probability ranking principle in IR. Journal of Documentation, 33, 294-304.
Salton, G., Wong, A. and Yang, C.S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18 (11), 613-620.
Robertson, S.E., Maron, M.E. and Cooper, W.S. (1982). Probability of Relevance: A Unification of Two Competing Models for Document Retrieval. Information Technology: Research and Development, 1, 1-21.
Turtle, H.R. and Croft, W.B. (1990). Inference networks for document retrieval. Proceedings of the 13th International Conference on Research and Development in Information Retrieval, 1-24, 1990.
van Rijsbergen, C.J. (1986). A non-classical logic for information retrieval. Computer Journal, 29, 481-485, 1986.

* IR TECHNIQUES: These papers examine the details of various models and other specific techniques and technologies, including reports of testing.
Belkin, N.J. and Croft, W.B. (1987). Retrieval Techniques. Annual Review of Information Science and Technology, 22, 109-145.
Robertson, S.E. and Sparck Jones, K. (1976). Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27(3), 129-146.
Croft, W.B. and Harper, D.J. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35, 285-295.
Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
Robertson, S.E. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. SIGIR 94 - Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 232-241.
Salton, G. and Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24, 513-523.
Salton, G. and Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41, 288-297, 1990.
Sparck Jones, K. (1979). Search term relevance weighting given little relevance information. Journal of Documentation, 35 (1), 30-48.
Strzalkowski, T. (1994). Robust text processing in automated information retrieval. Proceedings of the 4th Conference on Applied Natural Language Processing (Stuttgart), Association for Computational Linguistics, 168-173.
Griffiths, A., Luckhurst, H.C. and Willett, P. (1986). Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37 (1), 3-11.
Belkin, N.J. and Croft, W.B. (1992). Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35(12), 29-38.

* SYSTEMS: This section includes papers describing complete IR systems, focussing on those embodying modern views of what such systems should be like, but also illustrating the status of more 'conventional' systems.
Salton, G. and McGill, M.J. (1983). The SMART and SIRE experimental retrieval systems. In Introduction to Modern Information Retrieval, New York, McGraw-Hill, pp 118-156.
Harman, D. (1992). User-friendly systems instead of user-friendly front-ends. Journal of the American Society for Information Science, 43 (?), 164-174.
Walker, S. (1989). The Okapi online catalogue research projects. In The online catalogue: developments and directions (Ed C. Hildreth), London: The Library Association, 84-106.
Callan, J., Croft, W.B. and Broglio, J. (1995). TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31 (3).
Fox, E.A. and France, R.K. (1987). Architecture of an expert system for composite document analysis, representation and retrieval. Journal of Approximate Reasoning, 1, 151-175.
Fox, E.A. and Koll, M.B. (1988). Practical enhanced Boolean retrieval: experiences with the SMART and SIRE systems. Information Processing and Management, 24, 257-267.
McCune, B.P., Tong, R. and Dean, J. (1985). RUBRIC, a system for rule-based information retrieval. IEEE Transactions on Software Engineering, SE11-9, 939-944.
Jacobs, P.S. and Rau, L.F. (1990). SCISOR: extracting information from on-line news. Communications of the ACM, 33(11), 88-97.
Larson, R.R., McDonough, J., Kuntz, L., O'Leary, P. and Moon, R. "Cheshire II: Designing a Next-Generation Online Catalog." Journal of the American Society for Information Science, 47(7) (July 1996), pp. 555-567.
Tenopir, C. and Cahn, P. (1994). TARGET and FREESTYLE: DIALOG and Mead join the relevance ranks. Online, 18 (3), 31-47. (shorter after ads deleted)

* EXTENSIONS: These papers move outwards from the classical text document/single query situation to consider other types of 'document' and other versions and aspects of the information access task. The object is to illustrate the scope of information retrieval viewed more broadly, and to draw attention to the links between retrieval and other information processing activities. At the same time, since some of the ideas and work covered here also reflect new challenges and possibilities stemming from recent technology developments, this section has papers to be taken as initial leads into the future, rather than as authoritative guides to the established wisdom.
"Hypertext and Information Retrieval: Towards the Next Generation of Information Systems". In: Borgman, C. L. and Pai, E. Y. H. (Eds.) Information and Technology: Proceedings of the 51st ASIS Annual Meeting, Medford, NJ: Learned Information, Inc., 1988.
Agosti, M., Gradenigo, G. and Marchetti, P.G. (1992). A hypertext environment for interacting with large databases. Information Processing and Management, 28 (3), 371-387.
Salton, G., Allan, J., Buckley, C. and Singhal, A. (1994). Automatic analysis, theme generation, and summarisation of machine-readable texts. Science, 264, 3 June, 1421-1426.
Hull, D.A. and Grefenstette, G. (1996). Experiments in multilingual retrieval. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Rose, R.C. (1991). Techniques for information retrieval from speech messages. Lincoln Laboratory Journal, 4 (1), 45-59.
Zhang, H.J., Low, C.Y., Smoliar, S.W. and Wu, J.H. (1995). Video parsing, retrieval and browsing: an integrated and content-based solution. Proceedings of ACM Multimedia '95, 15-24; reprinted in Intelligent multimedia information retrieval (Ed M. Maybury).
Biebricher, B. et al (1988). The automatic indexing system AIR/PHYS - from research to application. Eleventh International Conference on Research and Development in Information Retrieval, 333-342.
Hayes, P.J., Knecht, L. and Cellio, M. (1988). A news story categorisation system. Proceedings of the Second Conference on Applied Natural Language Processing, Association for Computational Linguistics, 9-17.
Rau, L.F. (1988). Conceptual information extraction and retrieval from natural language input. RIAO 88, 424-437.
Marsh, E., Hamburger, H. and Grishman, R. (1984). A production rule system for message summarisation. AAAI-84, Proceedings, American Association for Artificial Intelligence, 243-246.
Johnson, F.C., Paice, C.D., Black, W.J. and Neal, A.P. (1993). The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management, 1 (3), 215-241.
Swanson, D.R. (1988). Historical note: information retrieval and the future of an illusion. Journal of the American Society for Information Science, 39 (2), 92-98.

Handouts and Papers Referenced in Lectures:

Singhal, A., Buckley, C. and Mitra, M. (1996). Pivoted Document Length Normalization. In SIGIR '96, pp. 21-29.

Raghavan, V.V. and Wong, S.K.M. (1986). A Critical Analysis of the Vector Space Model for Information Retrieval. Journal of the American Society for Information Science, 37(5), pp. 279-287.
Salton, G. (1991). Developments in Automatic Text Retrieval. Science, 253 (30 Aug 1991), pp. 974-980.
Cooper, W.S., Gey, F.C. and Dabney, D.P. (1992). Probabilistic Retrieval Based on Staged Logistic Regression. In: SIGIR '92, pp. 198-210.
Ponte, J.M. and Croft, W.B. (1998). A Language Modelling Approach to Information Retrieval. In: SIGIR '98, pp. 275-281.
Froehlich, Thomas J. (1994). Relevance Reconsidered -- Towards an agenda for the 21st Century: Introduction to the special issue on Relevance Research. Journal of the American Society for Information Science, 45(3) (April 1994), pp. 124-134.
Schamber, Linda, Eisenberg, Michael B. and Nilan, Michael S. (1990) A Re-Examination of Relevance: Toward a Dynamic Situational Definition. Information Processing and Management, 26(6), pp. 755-776.