Syllabus (Extended Version)

Information Organization and Retrieval

INFO 202
MW 9:00-10:30, South Hall 202


26 August 2009

This course has been called "Information Organization and Retrieval" and has been the first core course since the school opened, but that title only partly describes what the course is about.  The overall focus is on the intellectual foundations of IO & IR: conceptual modeling, semantic representation, vocabulary and metadata design, classification, and standardization. These issues are important whenever we develop IO and IR applications and apply technology to make information more accessible, useful, and processable. Some people might call this course "Information Architecture," and that would be accurate if we derived the meaning of IA only from "information" and "architecture," but most of the time the phrase is used in a much narrower sense, so I tend to avoid it.

There are lots of interesting and deep ideas and questions here, but that's not why we study them. We study them because understanding the ideas and answering the questions enables us to design, build, and deploy better information systems and applications. So I try to make this course intellectually deep but ruthlessly practical at the same time.  To do so I'll employ lots of case studies and news stories about "information in the wild" and "information-intensive" applications.  All in all, this is a much broader set of contexts than you'd be learning and talking about if you'd gone to a more traditional library school or to an ISchool where the transition from a library school was more incremental.

"As We May Think" by Vannevar Bush is a classic, nicely complemented by "MyLifeBits," which describes an ongoing effort at personal information management that was inspired by the Bush paper.  The Rao paper explains the emergence and evolution of many of the foundational concepts and approaches in this course.  I'll let you decide what to think about the Borges paper.


31 August 2009 

This course will change the way you look at the world.  In this diverse set of short case studies and news stories you'll see that how information is organized - or not organized - is critical in the success or failure of purposeful activities for individuals, groups, enterprises, or governments.   Technology is often a key enabler, but the success of "information-intensive" activities usually depends on organization, architecture, and policy more than technology.

(Don't be intimidated by the number of readings.  Most are very short newspaper stories that you can read quickly.)


2 September 2009

Information organization raises more fundamental issues than retrieval, so we'll start the course with IO and then move on to IR and other language processing applications. But we need a high-level view of the information life cycle and search/retrieval models so that we have some concepts on which to base forward links from IO to IR.  In particular, we'll discuss how organization and retrieval trade off against each other; the more we organize, the easier it is to retrieve and use information later.  How much effort we can or should put into organization depends on the relationship between the producer and consumer of the information and their respective organizational contexts.  Who does the work and who gets the benefit?  As you'll see, Svenonius and Weinberger have strongly contrasting opinions about this.

L4. XML (9/9)

9 September 2009

Many of you already have some familiarity with XML, but perhaps mostly as a data format for applications or programming.  In IO and IR it is essential to take a more abstract and intellectual view of XML and understand how it represents structured information models.  XML encourages the separation of content from presentation, which is the most important principle of information architecture.  Encoding information in XML is an investment in information organization that pays off "downstream" in IR and language processing applications.
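To make the separation of content from presentation concrete, here is a minimal sketch (with an invented document structure, using only Python's standard library) in which the XML records what each piece of information *is*, and different presentations are generated from the same content downstream:

```python
import xml.etree.ElementTree as ET

# the markup names the parts of the content; it says nothing about appearance
record = ET.fromstring("""
<course>
  <number>INFO 202</number>
  <name>Information Organization and Retrieval</name>
  <meets>MW 9:00-10:30</meets>
</course>
""")

# two different "presentations" generated from the same content
as_text = "{}: {} ({})".format(record.findtext("number"),
                               record.findtext("name"),
                               record.findtext("meets"))
as_html = "<h1>{}</h1><p>{}</p>".format(record.findtext("number"),
                                        record.findtext("name"))

print(as_text)  # INFO 202: Information Organization and Retrieval (MW 9:00-10:30)
```

Because the encoding is presentation-neutral, the same investment in markup pays off for every later consumer of the content, whether it renders, searches, or transforms it.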


14 September 2009

What is meaning? Where is meaning?  We impose meaning on the world by "carving it up" into concepts and categories.  We interact daily with a bewildering variety of objects and information types, and we constantly make choices about how to understand and organize them.  The conceptual and category boundaries we impose treat some things or instances as equivalent and others as different.  Sometimes we do this implicitly and sometimes we do it explicitly.  We do this as members of a culture and language community, as individuals, and as members of organizations or institutions.  The mechanisms and outcomes of our categorization efforts differ across these contexts.  In most cases the resulting categories are messier than our information systems and applications would like, and understanding why, and what to do about it, are essential skills for information professionals.


16 September 2009

The easiest way to indicate what something means is to give it a name, label, tag, or description.  This additional information about an object or about an instance or type of information is "metadata" because it is not part of the thing or its content.  "What is being described" can be considered on two separate dimensions: the contexts/containers/collections in which it occurs, and the level of abstraction (how large is the set of instances that are treated as equivalent when metadata is assigned).  How much metadata, what kind, and who should provide it are fundamental concerns. Some "contextual" metadata can be assigned automatically, but this raises questions about the identification and scope of the context.


21 September 2009

The words people use to describe things or concepts are "embodied" in their context and experiences and these  naturally-occurring words are an "uncontrolled vocabulary."  As a result, people or enterprises often use different terms or names for the same thing and the same terms or names for different things.  These mismatches often have serious or even drastic consequences.   It might seem straightforward to control or standardize terms or names, and much technology exists for attacking the "vocabulary problem," but technology alone is not a complete solution because language use constantly evolves and the world  being described does too.

Link to the podcast:  (if you click it, it should play in the browser, but you can also right-click (or Ctrl-click) and save to your local machine) 




23 September 2009

A classification is a system of categories, ordered according to a pre-determined set of principles and used to organize a set of instances or entities. This doesn't mean that the principles are always good or equitable or robust; indeed, every classification is biased in one way or another (for example, compare the Library of Congress classification with the Dewey Decimal System).  Classifications are embodied in every information-intensive activity or application.  Faceted or dimensional classification is especially useful in domains that don't have a primary hierarchical structure.

Link to the podcast:  (if you click it, it should play in the browser, but you can also right-click (or Ctrl-click) and save to your local machine) 

L9. ONTOLOGY (9/28)

28 September 2009

An ontology defines the concepts and terms used to describe and represent an area of knowledge and the relationships among them.  A dictionary can be considered a simplistic ontology, and a thesaurus a slightly more rigorous one, but we usually reserve "ontology" for meaning expressed using more formal or structured language.  Put another way, an ontology relies on a controlled vocabulary for describing the relationships among concepts and terms.



30 September 2009

The "Semantic Web" vision imagines that all information resources and services have ontology-grounded metadata that enables their automated discovery and seamless integration or composition.  Whether it is possible "to get there from here" with today's mostly HTML-encoded Web, or whether "a little semantics goes a long way" are key issues for us to consider.




5 October 2009

Many information-intensive processes and applications involve both "documents" and "data" that are often transformationally related; consider, for example, the close relationship between tax forms and the instructions for filling them out, or between product brochures and purchase orders.  But many people have contrasted "documents" and "data" and concluded that documents and data cannot be understood and handled with the same terminology, techniques, and tools.  I argue that there is no clear boundary between documents and data because there is systematic and continuous variation in document types and instances from the "narrative" end to the "transactional" end of the Document Type Spectrum.   This view leads to a more abstract and more broadly applicable conception of information modeling that emphasizes what document and data modeling have in common rather than how they differ.




7 October 2009

The technology and techniques for content management vary according to content type, required business processes, and organizational context.  For example, traditional "document management" in publishing relies mostly on searchable metadata with workflow support for authoring, versioning, and production of desired information products from "single-source" content modules.  In contrast, content management for customer relationship management, regulatory compliance, or business intelligence applications involves a much greater variety of information types and more sophisticated analysis of the content and the usage patterns over its life cycle.



L13. REVIEW (10/12)

12 October 2009

We will use this lecture to review the material we have covered so far.

Log your questions/confusions here: 


14 October 2009

In this lecture we look at the vocabulary problem we discussed in Lecture 7 as it manifests itself in enterprise contexts.   Within a firm, different information systems might use data models that are incomplete or incompatible with respect to each other, and between firms these differences can be even greater.  Structural, syntactic, and semantic mismatches cause problems when processes and services attempt to span these system and organizational boundaries (for example, to create a complete model of a "customer" or to conduct a business transaction).   We'll consider how technical standards and transformation techniques can help achieve integration and interoperability, but we'll acknowledge that interoperability is not always possible and that non-technical factors play a huge role in determining the approach.


19 October 2009

An important trend in or feature of many information-intensive applications and services is to support the creation and aggregation of metadata, preference information, or other content from users or customers.  This has been described as user-generated content, tagging, collective intelligence, crowdsourcing, folksonomy, and so on.  Such activity is partly self-serving because it enhances the quality of future experiences for the contributors, as when people rate restaurants, hotels, photos, or other service establishments or information sources and subsequently choose only highly-rated ones.  But it is also social, and often an act of generosity or altruism, because many people contribute far more information or effort than pure self-interest would justify.  Social/distributed categorization methods are increasingly being used in enterprise or institutional contexts, and a key question is the extent to which "tag convergence" yields useful semantics.




21 October 2009

Personal information management is "the practice and the study of the activities that people perform to acquire, organize, maintain, and retrieve information for everyday use."  The modern dialog about PIM has been strongly shaped by Bush's Memex, but since PIM is inherently embedded in user activities, things have gotten more complicated as personal information is increasingly managed (or not managed) across multiple devices and contexts (including "the cloud").  People employ a range of strategies for PIM, rarely consciously or explicitly, and generally in sub-optimal ways.


(Download recorded lecture from  )



26 October 2009

We revisit most of the concerns about metadata from Lecture 6 as they apply to non-text and multimedia objects and resources, but some new challenges arise because of the temporal character of audio and video and the semantic opacity of the content.  Because multimedia content can't be (easily) processed to understand what the object means, there is a "semantic gap" between the descriptions that people assign to multimedia content and those that can be assigned by computers or automated processes.  On the other hand, technology for creating multimedia can easily record contextual metadata at the same time.  Thesauri and other aids for professional "metadata makers" are invaluable but rarely used by ordinary people when they tag photos or videos.

Download recorded lecture: [Part 1] from [Part 2] from

L18. MIDTERM EXAM (10/28)

28 October 2009

This will be an in-class short-answer exam and is open book, open notes, but not “open Internet” – your Internet access is limited to the readings, lecture notes, and collaboratively prepared review materials.  Because it is natural in a broad survey course that not every topic is of equal interest or perceived importance for any given student, you'll have a choice of questions to answer.


Exam (doc)

Exam (txt) 

Sample answers 


2 November 2009

NOTE: we have combined two lectures to make room for the review lecture.


A person with an information need must first convert their internalized, abstract concepts into language, and then convert that expression of language into a query expression, which the search system then uses.  The user interface to the IR system can't help at all with the first task and usually offers just a little help with the second (except for so-called "natural language" question answering systems, which really aren't).  Search UIs influence the kinds of queries that the user can express (or express easily).  It wasn't that long ago that information retrieval was carried out mostly by highly trained professionals, and the user interfaces for the systems they used were complex.  Today, web-based search is ubiquitous, and user interfaces must be vastly simpler.  Best practices in designs for user interfaces in query specification, presentation of results, and query reformulation have emerged from laboratory experiments and from continuous incremental modification-and-test cycles in deployed systems.


Two categories of design questions for search UIs are the kind of information the searcher supplies (a spectrum from full natural language sentences, to keywords and key phrases, to syntax-heavy command language-based queries) and the interface mechanism the user interacts with to supply this information (which include command line interfaces, graphical entry form-based interfaces, and interfaces for navigating links).  Once the system determines the results that satisfy the query, it presents some aspects of the matching documents, usually highlighting the query terms and providing some surrounding text to provide context.  Some systems arrange, cluster, or visualize the results to make it easier for the searcher to identify the most relevant results.


Download recorded lecture from


4 November 2009

The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance. The IR model explains how these problems are solved by (1) designing the representations of queries and documents in the collection being searched and  (2) specifying the information used, and the calculations performed, that order the retrieved documents by relevance.   Different IR models solve these problems in different ways; the better they solve it, the more computationally complex they are, so there are tradeoffs.  The  simplest, most familiar, and least effective model is the Boolean model -- representations are sets of index terms, and relevance is calculated in an all-or-none way according to set theory operations with Boolean algebra.
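The all-or-none character of the Boolean model can be sketched in a few lines of Python (the three "documents" and the query are invented): each document is represented as a set of index terms, an inverted index maps each term to the documents containing it, and a query is evaluated with ordinary set operations.

```python
from collections import defaultdict

# each document is just a set of index terms — present or absent
docs = {
    "d1": {"information", "organization", "retrieval"},
    "d2": {"information", "retrieval", "xml"},
    "d3": {"metadata", "classification"},
}

# inverted index: term -> set of documents containing it
index = defaultdict(set)
for doc_id, terms in docs.items():
    for t in terms:
        index[t].add(doc_id)

# evaluate (information AND retrieval) NOT xml with set algebra:
# AND is intersection, OR would be union, NOT is difference
hits = (index["information"] & index["retrieval"]) - index["xml"]
print(hits)  # {'d1'} — a document either matches or it doesn't; no ranking
```

Note that the model gives no way to say that d1 matches "more" or "less" than another matching document; that weakness motivates the vector models discussed later.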

"Text processing" in IR consists of a sequence of steps that transform the text of documents or other entities so that they can be more efficiently stored and matched.  Most of these steps are conceptually trivial - like extracting the text content from its storage format, and separating a set of words into tokens - but they pose subtle and interesting challenges.
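A hedged sketch of such a pipeline (the stopword list and the suffix-stripping rule are toy stand-ins for real components like the Porter stemmer): lowercase, tokenize, drop stopwords, then conflate word forms. Each step looks trivial but hides real decisions, such as whether "U.S." is one token or two.

```python
import re

STOPWORDS = {"the", "of", "and", "a", "in", "is"}

def process(text):
    # tokenize: lowercase, then split on anything that isn't a letter or digit
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # remove stopwords: very frequent terms with little discriminating power
    tokens = [t for t in tokens if t not in STOPWORDS]
    # naive suffix-stripping "stemmer" to conflate related word forms
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

print(process("The organizing of information is messy."))
# ['organiz', 'information', 'messy']
```

Even this toy version shows the subtlety: the stem "organiz" is not a word, and a rule this crude would wrongly strip the "s" from "lens" — real stemmers are long lists of carefully ordered exceptions.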


Download recorded lecture from


9 November 2009

The Boolean model represents documents as a set of index terms that are either present or absent. This binary notion doesn't fit our intuition that terms differ in how much they suggest what the document is about.  Vector models capture this notion by representing documents and queries as word or term vectors and assigning weights that can capture term counts within a document or the importance of the term in discriminating the document in the collection.  Vector algebra provides a model for computing similarity between queries and documents and between documents, because of the assumption that "closeness in space" means "closeness in meaning".

(Don't worry if you haven't thought about vectors in a long time. We'll review everything you need to know to understand how they work... and you'll get to practice your understanding with an assignment).
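As a preview, here is a minimal sketch of the vector model with tf-idf weights and cosine similarity (the three "documents" are invented, and a real system would first run the text processing described earlier):

```python
import math
from collections import Counter

docs = {
    "d1": "information organization information retrieval".split(),
    "d2": "xml information retrieval".split(),
    "d3": "metadata classification metadata".split(),
}
N = len(docs)
# document frequency: in how many documents does each term appear?
df = Counter(t for terms in docs.values() for t in set(terms))

def tfidf_vector(terms):
    # weight = term frequency in the document * log inverse document frequency
    tf = Counter(terms)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf_vector("information retrieval".split())
ranked = sorted(docs, key=lambda d: cosine(query, tfidf_vector(docs[d])),
                reverse=True)
print(ranked)  # ['d1', 'd2', 'd3'] — graded relevance, not all-or-none
```

Unlike the Boolean model, every document gets a similarity score, so the result is an ordering rather than a yes/no partition.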


Download recorded lecture: Part1:




16 November 2009

Because the calculations used by simple vector models use the frequency of words and word forms, they can't distinguish different meanings of the same word (polysemy) and they can't detect equivalent meaning expressed with different words (synonymy).  The dimensionality of the space in the simple vector model is the number of different terms in it, but the "semantic dimensionality" of the space is the number of distinct topics represented in it, which is much smaller.

Somewhat paradoxically, these reduced dimensionality vectors that define "topic space" rather than "term space" are calculated using the statistical co-occurrence of the terms in the collection, so the process is completely automatable -- it requires no humanly constructed dictionaries, knowledge bases, ontologies, semantic networks, grammars, syntactic parsers, morphologies, or anything else that represents "language".  For this reason these approaches are said to extract "latent" semantics.
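Assuming a tiny invented term-document count matrix, the dimensionality reduction can be sketched with a truncated singular value decomposition, the core calculation in latent semantic analysis:

```python
import numpy as np

# term-document matrix: rows are terms, columns are documents (invented counts)
#              d1  d2  d3  d4
A = np.array([[2., 1., 0., 0.],   # car
              [1., 2., 0., 0.],   # auto
              [0., 0., 3., 1.],   # flower
              [0., 0., 1., 3.]])  # petal

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep only the top-k "topic" dimensions
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # each row: one document in topic space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# d1 and d2 (the "vehicle" documents) collapse onto the same topic axis,
# orthogonal to d3 and d4 (the "flower" documents), using nothing but the
# co-occurrence statistics in A — no dictionary or ontology is consulted
print(cos(docs_k[0], docs_k[1]))  # ~1.0: same topic
print(cos(docs_k[0], docs_k[2]))  # ~0.0: unrelated topics
```

The "latent" part is exactly this: the topic axes are never named or defined by a person; they fall out of the statistics of the collection.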



Download recorded lecture from



18 November 2009

Structure-based IR models combine representations of terms with information about structures within documents (i.e., hierarchical organization) and between documents (i.e., hypertext links and other explicit relationships). This structural information tells us what documents and parts of documents are most important and relevant, and provides additional justification for determining relevance and ordering a result set.  The nature and pattern of links between documents has been studied for almost a century by "bibliometricians" who measured patterns of scientific citation to quantify the influence of specific documents or authors. The concepts and techniques of citation analysis seem applicable to the web since we can view it as a network of interlinked articles, and Google's "page rank" algorithm is now the classic example.
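The core idea behind PageRank can be sketched in a few lines (this is the textbook power-iteration formulation over an invented four-page web, not Google's production implementation): a page is important if important pages link to it.

```python
# outgoing links for each page in a tiny invented web
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = sorted(links)
alpha = 0.85                 # damping factor: probability of following a link

# start with rank spread uniformly, then iterate until (approximately) stable
rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):
    new = {p: (1 - alpha) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            # each page passes its rank, split evenly, to the pages it links to
            new[q] += alpha * rank[p] / len(outs)
    rank = new

best = max(rank, key=rank.get)
print(best)  # 'c' — three pages link to it, so it accumulates the most rank
```

Note how the calculation is recursive: "c" ranks highest because it has the most in-links, and "a" ranks second despite a single in-link, because that link comes from the highly ranked "c".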


Download recorded lecture from


23 November 2009

Documents aren't just bags of words; they can have a great deal of internal structure and content encoding. But most IR models don't use anything other than document-level statistics about term occurrence. The use of XML for encoding document models and instances shows where structure can be used to great advantage in IR to add value beyond text retrieval.  We can express queries about document structures (for example, to find all articles written after June 1, 2008 with the words "presidential election" in the title field) and use internal structure to return only the precise parts of large documents that satisfy the query.
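A hedged sketch of such a structure-aware query, using Python's standard library (the document collection, element names, and date attribute are all invented; the stdlib supports only a limited XPath subset, so the date and title conditions are applied with a Python filter):

```python
import xml.etree.ElementTree as ET

collection = ET.fromstring("""
<collection>
  <article date="2008-09-14">
    <title>The Presidential Election Tightens</title>
    <body>Polls narrowed again this week.</body>
  </article>
  <article date="2007-03-02">
    <title>Presidential Election Primer</title>
    <body>An early look at the field.</body>
  </article>
  <article date="2008-11-05">
    <title>Turnout Sets Records</title>
    <body>The presidential election drew long lines.</body>
  </article>
</collection>
""")

# structure-aware query: articles dated after 2008-06-01 whose <title> field
# (not just any part of the text) mentions the presidential election
hits = [a.findtext("title")
        for a in collection.findall("article")
        if a.get("date") > "2008-06-01"                       # ISO dates sort as strings
        and "presidential election" in a.findtext("title").lower()]

print(hits)  # ['The Presidential Election Tightens']
```

A plain full-text search for "presidential election" would also return the third article; restricting the match to the title element is exactly the added value that document structure provides.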


Download recorded lecture from


25 November 2009

Many of the concepts, technologies and techniques in information organization, information retrieval, and user interface design were developed for dedicated (as in a library) or desktop-based computing environments.  Mobile (or context-aware) applications pose new problems and provide new value in IR.   Multimedia content is similarly challenging old approaches to IO and IR.  Instead of trying to overcome the semantic gap, we can use the low-level features that can be extracted automatically to index the multimedia collection and then extract the same ones from a multimedia "query by example"  (as in the Shazam application, which can identify a song from a snippet recorded using a cell phone). 


Download recorded lecture from

L26. APPLIED IR & NLP [1](11/30)

30 November 2009

After three months of lectures packed with new concepts and theory it is helpful to finish the semester with lots of examples and applications that show that this course matters "in the real world."   In particular, we'll discuss several applications of "natural language processing" or NLP.   This is a broad field, and involves computer science, linguistics, cognitive psychology, and statistics in addition to everything we've talked about this semester.   NLP illustrates many IR techniques and in some cases illustrates the tradeoffs between IO and IR.


Download recorded lecture from

L27. APPLIED IR and NLP [2] (12/2)

2 December 2009

Part 2 of Applied IR and NLP.


Download recorded lecture: Part1:


L28. ALUMNI DAY (12/7)

7 December 2009

For several years it has been a tradition to end the semester by having ISchool alumni return to talk about their jobs.  Last year the speakers talked about content management in the Obama presidential campaign, catalog integration and taxonomy induction at eBay, geographical information services at Google, and semantic search metadata generation for IR in museum collections.  This year's guests are... (TBD)




9 December 2009

This will be a review session for the final exam.



L30. FINAL EXAM (12/14)

14 December 2009

The final exam follows the same format and rules as the midterm - you choose a subset of questions to answer, and the exam is open book, open notes, but not "open Internet."