Syllabus (Extended Version)

Information Organization and Retrieval

INFO 202
MW 9:00-10:30, South Hall 202

L1. COURSE OVERVIEW

August 30, 2010

This course is called "Information Organization and Retrieval" and has been the first core course since the school opened, but that title only partly describes what the course is about. The overall focus is on the intellectual foundations of IO & IR: conceptual modeling, semantic representation, vocabulary and metadata design, classification, and standardization. These issues are important whenever we develop IO and IR applications and apply technology to make information more accessible, useful, and processable. Some people might call this course "Information Architecture," and that would be accurate if we derived the meaning of IA only from "information" and "architecture," but the IA phrase is usually used in a much narrower sense, so I tend to avoid it.

There are lots of interesting and deep ideas and questions here, but that's not why we study them. We study them because understanding the ideas and answering the questions enables us to design, build, and deploy better information systems and applications. So I try to make this course intellectually deep but ruthlessly practical at the same time. To do so I'll employ lots of case studies and news stories about "information in the wild" and "information-intensive" applications. All in all, this is a much broader set of contexts than you'd be learning and talking about if you'd gone to a more traditional library school, or to an I School where the transition from a library school was more incremental.

"As We May Think" by Vannevar Bush is a classic, nicely complemented by "My Life Bits", which describes an ongoing effort at personal information management that was inspired by the Bush paper. I'll let you decide what to think about the Borges paper. 

L2. THE ORGANIZING SYSTEM (9/1)

September 1, 2010

Every day, we all organize information: as individuals, organizations, and services. We organize things, information about things, and information in a variety of formats, digital and otherwise. When we analyze these different contexts, we can be easily distracted by the specific information types, organizing principles, technology, functions or features, and individuals or companies involved in any particular example. We can get lost trying to define "information" in ways that fit these different contexts because it is inherently abstract, and most of its hundreds of definitions treat it as an idea that swirls around equally hard-to-define terms like "data," "knowledge," and "communication." These challenges in taking a broader view, one that emphasizes what these contexts have in common rather than how they differ, are the motivation for the concept of the Organizing System.

Our concept of the Organizing System has to confront head-on the duality of information as a physical thing versus information as an intangible concept. When an Organizing System deals with information as a physical thing, it follows different principles and must conform to different constraints than when it deals with information as an intangible one. And many organizing systems, like the modern library with its online catalogs and physical collections, accommodate both notions of information at the same time. The implications for arranging, finding, using, and reusing things in any organizing system directly reflect the mix of these two perspectives on information.

Explicitly or by default, an Organizing System makes many interdependent decisions about the identities of entities and information components, their names and descriptions, the classes, relations, structures, and collections in which they participate, and the people or technologies who create, transform, combine, compare, and use them. Namely: What is being organized? Why is it being organized? How much is it being organized? When is it being organized? By whom (or by what computational processes) is it being organized?

L3. XML (9/8)

September 8, 2010

Many of you already have some familiarity with XML, but perhaps mostly as a data format for applications or programming.  In IO and IR it is essential to take a more abstract and intellectual view of XML and understand how it represents structured information models.  XML encourages the separation of content from presentation, which is the most important principle of information architecture.  Encoding information in XML is an investment in information organization that pays off "downstream" in IR and language processing applications.
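The separation of content from presentation can be made concrete with a small XML instance: the element names describe what each piece of content means, and any number of stylesheets could later decide how it looks. The course record below is invented for illustration, and the sketch uses only Python's standard-library XML parser.

```python
import xml.etree.ElementTree as ET

# A hypothetical catalog record: the markup names the *meaning* of each
# piece of content (number, title), saying nothing about its display.
record = """
<course number="INFO 202">
  <title>Information Organization and Retrieval</title>
</course>
"""

root = ET.fromstring(record)
print(root.get("number"))       # the value of the "number" attribute
print(root.find("title").text)  # the content of the <title> element
```

Because the structure is explicit, a downstream application can extract exactly the component it needs, which is the "investment that pays off downstream" described above.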

L4. IDENTITY AND IDENTIFICATION (9/13)

September 13, 2010


It might seem like the question of identity, of what a single “thing” is, shouldn't be a problem. After all, we live in a world of things, and finding, selecting, organizing, using, and referencing them are everyday activities.  We are used to interacting with things in organizing systems we've created ourselves, that were created by other people, or that have been created through institutional or social processes.  But it’s really not as simple as it first appears. In order to organize these things, we also need to have a sense of how they will be used. How will we look up, select, assemble, reorganize, put away, or otherwise work with these things? Even though we can’t prepare for every possible use, we need to do our best to understand the potential, primary uses as well as the audience, or users. 

Plan for Lecture 4:

  • Identity — What is a thing?
  • Identifiers and Names
  • Identity Over Time

L5. DESCRIBING INSTANCES; CONTROLLED NAMES AND VOCABULARIES (9/15)

September 15, 2010

The words people use to describe things or concepts are "embodied" in their context and experiences and these naturally-occurring words are an "uncontrolled vocabulary."  As a result, people or enterprises often use different terms or names for the same thing and the same terms or names for different things.  These mismatches often have serious or even drastic consequences.   It might seem straightforward to control or standardize terms or names, and much technology exists for attacking the "vocabulary problem," but technology alone is not a complete solution because language use constantly evolves and the world being described does too.
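The simplest technology for attacking the vocabulary problem is a mapping from naturally occurring (uncontrolled) terms to preferred terms. The terms and the function below are hypothetical; real controlled vocabularies such as thesauri and authority files are far richer, but the basic mechanism is just this:

```python
# Hypothetical controlled vocabulary: each variant (uncontrolled) term
# maps to a single preferred term, so "car", "auto", and "automobile"
# are all treated as the same concept.
PREFERRED = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "couch": "sofa",
    "sofa": "sofa",
}

def control(term):
    """Map a naturally occurring term to its preferred form, if known."""
    return PREFERRED.get(term.lower(), term)

print(control("Auto"))  # "automobile"
```

The `get` fallback illustrates why technology alone isn't a complete solution: a term the vocabulary doesn't yet cover passes through uncontrolled, and the vocabulary must evolve as language use does.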

L6. DOCUMENT / DATA MODELS AND MODELING (9/20)

September 20, 2010

Many information-intensive processes and applications involve both "documents" and "data" that are often transformationally related; consider, for example, the close relationship between tax forms and the instructions for filling them out, or between product brochures and purchase orders.  But many people have contrasted "documents" and "data" and concluded that documents and data cannot be understood and handled with the same terminology, techniques, and tools.  I argue that there is no clear boundary between documents and data because there is systematic and continuous variation in document types and instances from the "narrative" end to the "transactional" end of the Document Type Spectrum.   This view leads to a more abstract and more broadly applicable conception of information modeling that emphasizes what document and data modeling have in common rather than how they differ.

 

L7. METADATA (9/22)

September 22, 2010

The easiest way to indicate what something means is to give it a name, label, tag, or description.  This additional information about an object or about an instance or type of information is "metadata" because it is not part of the thing or its content.  "What is being described" can be considered on two separate dimensions - the contexts/containers/collections in which it occurs, and the level of abstraction (how large is the set of instances that are treated as equivalent when metadata is assigned).  How much metadata, what kind, and who should provide it are fundamental concerns. Some "contextual" metadata can be assigned automatically, but this raises questions about the identification and scope of the context. 

L8. METADATA FOR MULTIMEDIA (9/27)

September 27, 2010

We revisit most of the concerns about metadata from Lecture 7 as they apply to non-text and multimedia objects and resources, but some new challenges arise because of the temporal character of audio and video and the semantic opacity of the content.  Because multimedia content can't be (easily) processed to understand what the object means, there is a "semantic gap" between the descriptions that people assign to multimedia content and those that can be assigned by computers or automated processes.  On the other hand, technology for creating multimedia can easily record contextual metadata at the same time. Thesauri and other aids for professional "metadata makers" are invaluable but rarely used by ordinary people when they tag photos or videos.

 

L9. DESCRIBING CLASSES AND TYPES (9/29)

September 29, 2010

What is meaning? Where is meaning?  We impose meaning on the world by "carving it up" into concepts and categories.  We interact daily with a bewildering variety of objects and information types, and we constantly make choices about how to understand and organize them.  The conceptual and category boundaries we impose treat some things or instances as equivalent and others as different.  Sometimes we do this implicitly and sometimes we do it explicitly.  We do this as members of a culture and language community, as individuals, and as members of organizations or institutions.  The mechanisms and outcomes of our categorization efforts differ across these contexts.  In most cases the resulting categories are messier than our information systems and applications would like, and understanding why and what to do about it are essential skills for information professionals.

 

L10. CLASSIFICATION (10/4)

October 4, 2010

A Classification is a system of categories, ordered according to a pre-determined set of principles and used to organize a set of instances or entities. This doesn't mean that the principles are always good or equitable or robust, and indeed, every classification is biased in one way or another (for example, compare the Library of Congress classification with the Dewey Decimal System). Classifications are embodied in every information-intensive activity or application.  Faceted or dimensional classification is especially useful in domains that don't have a primary hierarchical structure.

 

L11. DESCRIBING RELATIONS; ONTOLOGY (10/6)

October 6, 2010

An ontology defines the concepts and terms used to describe and represent an area of knowledge and the relationships among them.  A dictionary can be considered a simplistic ontology, and a thesaurus a slightly more rigorous one, but we usually reserve "ontology" for meaning expressed using more formal or structured language.  Put another way, an ontology relies on a controlled vocabulary for describing the relationships among concepts and terms.
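One way to see the difference from a dictionary or thesaurus is that in an ontology the relationships themselves come from a controlled vocabulary, which makes them something a program can reason over. A minimal sketch, with invented entities and relation names expressed as subject-relation-object triples:

```python
# A hypothetical mini-ontology: a tiny set of triples whose relation
# names ("subclass_of", "instance_of") come from a controlled vocabulary.
TRIPLES = [
    ("Novel",     "subclass_of", "Book"),
    ("Book",      "subclass_of", "Document"),
    ("Moby-Dick", "instance_of", "Novel"),
]

def is_a(entity, cls):
    """Follow subclass_of/instance_of links to test class membership."""
    direct = {o for s, r, o in TRIPLES
              if s == entity and r in ("subclass_of", "instance_of")}
    return cls in direct or any(is_a(parent, cls) for parent in direct)

print(is_a("Moby-Dick", "Document"))  # True, via Novel -> Book -> Document
```

Because the relations are controlled, the inference in `is_a` is possible; with the free-text definitions of a dictionary, it would not be.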

L12. THE SEMANTIC WEB (10/11)

October 11, 2010

The "Semantic Web" vision imagines that all information resources and services have ontology-grounded metadata that enables their automated discovery and seamless integration or composition.  Whether it is possible "to get there from here" with today's mostly HTML-encoded Web, or whether "a little semantics goes a long way" are key issues for us to consider.

L13. THE DOMAINS OF ORGANIZING SYSTEMS (10/13)

October 13, 2010

Now that we've discussed the intellectual foundations for organizing systems - description, classification, vocabulary control, relations, and so on ... we can apply them to a range of domains in which organizing systems are created.  We'll see the issues and principles that are shared by these domains, and those that distinguish or are characteristic of them.

In this lecture we'll cover the "classical" or "core" domains of library and information science -- libraries, archives, and museums -- and then move into other domains to discuss organizing systems in scientific, business, and personal contexts.

 

L14. ENTERPRISE INFORMATION MANAGEMENT (10/18)

October 18, 2010

 

L15. INTER-ENTERPRISE INFORMATION MANAGEMENT; COMBINING DESCRIPTIONS (INTEGRATION AND INTEROPERABILITY) (10/20)

October 20, 2010

In this lecture we look at the vocabulary problem we discussed in Lecture 5 as it manifests itself in enterprise contexts.  Within a firm, different information systems might use data models that are incomplete or incompatible with respect to each other, and between firms these differences can be even greater. Structural, syntactic, and semantic mismatches cause problems when processes and services attempt to span these system and organizational boundaries (for example, to create a complete model of a "customer" or to conduct a business transaction).  We'll consider how technical standards and transformation techniques can help achieve integration and interoperability, but we'll acknowledge that interoperability is not always possible and that non-technical factors play a huge role in determining the approach.

L16. MIDTERM (10/25)

October 25, 2010

This will be an in-class short-answer exam and is open book, open notes, but not “open Internet” – your Internet access is limited to the readings, lecture notes, and collaboratively prepared review materials. Because it is natural in a broad survey course that not every topic is of equal interest or perceived importance for any given student, you'll have a choice of questions to answer.

L17. PERSONAL INFORMATION MANAGEMENT (10/27)

October 27, 2010

Personal information management is "the practice and the study of the activities that people perform to acquire, organize, maintain, and retrieve information for everyday use."  The modern dialog about PIM has been strongly shaped by Bush's Memex, but since PIM is inherently embedded in user activities, things have gotten more complicated as personal information is increasingly managed (or not managed) across multiple devices and contexts (including "the cloud"). People employ a range of strategies for PIM, rarely consciously or explicitly, and generally in sub-optimal ways.

L18. STANDARDS AND GOVERNANCE IN ORGANIZING SYSTEMS (11/1)

November 1, 2010

 

L19. COMPARING DESCRIPTIONS – INTRO TO IR AND NLP (11/3)

November 3, 2010

In this lecture we start applying the "organizing systems" and "description" themes to the Information Retrieval and Natural Language Processing domains.  The relevant chapter from the IFIOIR book isn't ready.  Read the Rao article, and if you didn't read it for Lecture 2, read the selection from Chapter 3 of Marti Hearst's book on Search User Interfaces.

L20. USER INTERFACES FOR SEARCH AND INFORMATION RETRIEVAL (11/8)

November 8, 2010

 

A person with an information need must first convert internalized, abstract concepts into language, and then convert that expression of language into a query expression, which the search system then uses.  The user interface to the IR system can't help at all with the first task and usually offers only a little help with the second (except for so-called "natural language" question-answering systems, which really aren't). Search UIs influence the kinds of queries that the user can express (or express easily).  It wasn't that long ago that information retrieval was carried out mostly by highly trained professionals, and the user interfaces for the systems they used were complex. Today, web-based search is ubiquitous, and user interfaces must be vastly simpler.  Best practices in user interface design for query specification, presentation of results, and query reformulation have emerged from laboratory experiments and from continuous incremental modification-and-test cycles in deployed systems.

Two categories of design questions for search UIs are the kind of information the searcher supplies (a spectrum from full natural language sentences, to keywords and key phrases, to syntax-heavy command language-based queries) and the interface mechanism the user interacts with to supply this information (which include command line interfaces, graphical entry form-based interfaces, and interfaces for navigating links). Once the system determines the results that satisfy the query, it presents some aspects of the matching documents, usually highlighting the query terms and providing some surrounding text to provide context. Some systems arrange, cluster, or visualize the results to make it easier for the searcher to identify the most relevant results.

 

L21. TEXT PROCESSING; BOOLEAN MODELS (11/10)

November 10, 2010

 

The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance. An IR model explains how these problems are solved by (1) designing the representations of queries and of the documents in the collection being searched, and (2) specifying the information used, and the calculations performed, to order the retrieved documents by relevance.  Different IR models solve these problems in different ways; the better they solve them, the more computationally complex they are, so there are tradeoffs. The simplest, most familiar, and least effective model is the Boolean model -- representations are sets of index terms, and relevance is calculated in an all-or-none way using set operations from Boolean algebra.
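The Boolean model can be sketched in a few lines with an invented three-document collection: each document is represented only as the set of index terms it contains, and a query either matches a document or it doesn't.

```python
# Toy inverted index: term -> set of IDs of documents containing it.
# The terms and document IDs are invented for illustration.
index = {
    "library":  {1, 2},
    "metadata": {2, 3},
    "xml":      {3},
}

def AND(a, b):
    """Boolean AND is set intersection over posting lists."""
    return a & b

def OR(a, b):
    """Boolean OR is set union over posting lists."""
    return a | b

# Query: "library AND metadata" -- relevance is all-or-none, so the
# result is just the documents in both posting sets, in no useful order.
result = AND(index["library"], index["metadata"])
print(result)  # {2}
```

Note what the model cannot do: document 2 matches and documents 1 and 3 don't, but there is no notion of one match being better than another, which is exactly the weakness the vector models in the next lecture address.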

"Text processing" in IR consists of a sequence of steps that transform the text of documents or other entities so that they can be more efficiently stored and matched.  Most of these steps are conceptually trivial - like extracting the text content from its storage format, and separating a set of words into tokens - but they pose subtle and interesting challenges.

 

 

L22. VECTOR MODELS (11/15)

November 15, 2010

 

The Boolean model represents documents as a set of index terms that are either present or absent. This binary notion doesn't fit our intuition that terms differ in how much they suggest what the document is about.  Vector models capture this notion by representing documents and queries as term vectors and assigning weights that can capture term counts within a document or the importance of the term in discriminating the document in the collection.  Vector algebra provides a model for computing similarity between queries and documents, and between documents, because of the assumption that "closeness in space" means "closeness in meaning".

(Don't worry if you haven't thought about vectors in a long time. We'll review everything you need to know to understand how they work... and you'll get to practice your understanding with an assignment).
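The core calculation can be sketched with plain cosine similarity over raw term counts. (Real systems use more sophisticated weights such as tf-idf; the vocabulary and counts below are invented.)

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Vocabulary (one dimension per term): ["library", "metadata", "xml"].
# Weights here are just raw term counts.
doc1  = [2, 1, 0]
doc2  = [0, 1, 2]
query = [1, 1, 0]

print(cosine(query, doc1) > cosine(query, doc2))  # True: doc1 is closer
```

Unlike the Boolean model, both documents get a score, and the scores order the result set: the query's direction in term space is closer to doc1's than to doc2's, which is the "closeness in space means closeness in meaning" assumption at work.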

 

L23. DIMENSIONALITY REDUCTION (11/17)

November 17, 2010

 

Because the calculations used by simple vector models use the frequency of words and word forms, they can't distinguish different meanings of the same word (polysemy) and they can't detect equivalent meaning expressed with different words (synonymy).  The dimensionality of the space in the simple vector model is the number of different terms in it, but the "semantic dimensionality" of the space is the number of distinct topics represented in it, which is much smaller.

Somewhat paradoxically, these reduced dimensionality vectors that define "topic space" rather than "term space" are calculated using the statistical co-occurrence of the terms in the collection, so the process is completely automatable -- it requires no humanly constructed dictionaries, knowledge bases, ontologies, semantic networks, grammars, syntactic parsers, morphologies, or anything else that represents "language".  For this reason these approaches are said to extract "latent" semantics.

 

L24. STRUCTURE-BASED MODELS [1] (11/22)

November 22, 2010

 

L25. STRUCTURE-BASED MODELS [2] (11/24)

November 24, 2010

Structure-based IR models combine representations of terms with information about structures within documents (e.g., hierarchical organization) and between documents (e.g., hypertext links and other explicit relationships). This structural information tells us which documents and parts of documents are most important, and provides additional justification for determining relevance and ordering a result set.   The nature and pattern of links between documents has been studied for almost a century by "bibliometricians," who measured patterns of scientific citation to quantify the influence of specific documents or authors. The concepts and techniques of citation analysis apply naturally to the web, which we can view as a network of interlinked articles, and Google's PageRank algorithm is now the classic example.
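The core idea of PageRank -- rank that flows along links, moderated by a damping factor -- can be sketched as a simple power iteration over an invented three-page web. This illustrates the published algorithm's basic recurrence, not Google's actual implementation.

```python
# links[p] = pages that page p links to (a hypothetical three-page web).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(links)
d = 0.85                                # damping factor from the original paper

# Start with rank spread evenly across the pages.
rank = {p: 1.0 / len(pages) for p in pages}

# Power iteration: each page's new rank is a small constant share plus a
# damped sum of the rank flowing in from pages that link to it, with each
# linking page's rank split evenly among its outgoing links.
for _ in range(50):
    new = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * incoming
    rank = new

print(max(rank, key=rank.get))  # "c": it is the only page everyone links to
```

In citation-analysis terms, page "c" is the most-cited document, and being cited by an already influential page ("a") counts for more than the raw link count alone.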

L26. MOBILE AND MULTIMEDIA IR (11/29)

November 29, 2010

Many of the concepts, technologies and techniques in information organization, information retrieval, and user interface design were developed for dedicated (as in a library) or desktop-based computing environments.  Mobile (or context-aware) applications pose new problems and provide new value in IR.  Multimedia content is similarly challenging old approaches to IO and IR.  Instead of trying to overcome the semantic gap, we can use the low-level features that can be extracted automatically to index the multimedia collection and then extract the same ones from a multimedia "query by example" (as in the Shazam application, which can identify a song from a snippet recorded using a cell phone). 

L27. APPLIED IR AND NLP (12/1)

December 1, 2010 

L28. ALUMNI DAY (12/6)

December 6, 2010

For several years it has been a tradition at the end of the semester to have I School alumni return to talk about their jobs.  This year's guests are:

L29. REVIEW FOR FINAL EXAM (12/8)

 

L30. FINAL EXAM (12/14)

December 14, 2010

The final exam will take place from 12:00 noon to 3:00 pm.