Information Organization and Retrieval

202
MW 9-10:30


L1. Course Overview (August 29, 2011 - 09:00)

This course has been called "Information Organization and Retrieval" and has been the first core course since the school opened, but that title only partly describes what the course is about. The overall focus is on the intellectual foundations of IO and IR: conceptual modeling, semantic representation, vocabulary and metadata design, classification, and standardization. These issues are important whenever we develop IO and IR applications and apply technology to make information more accessible, useful, processable, and so on. Some people might call this course "Information Architecture," and that would be accurate if we derived the meaning of IA only from "information" and "architecture," but most of the time the phrase is used in a much narrower sense, so I tend to avoid it.

There are lots of interesting and deep ideas and questions here, but that's not why we study them. We study them because understanding the ideas and answering the questions enables us to design, build and deploy better systems and applications that organize information. So I try to make this course intellectually deep but ruthlessly practical at the same time.  To do so I'll employ lots of case studies and news stories about "information in the wild" and "information-intensive" applications.   All in all, this is a much broader set of contexts than you'd be learning and talking about if you'd gone to a more traditional library school or to an I School where the transition from a library school was more incremental.  That said, there is lots to learn by studying libraries, museums, and other "memory institutions" so we'll discuss them too.

It may seem that there is an overwhelming amount of reading assigned for this first day of the course, but many of these are very short newspaper stories and the total number of pages to read is about 50.   We're taking the broadest possible look at different contexts or situations in which information is being organized.  Stand back and look for common concepts and concerns.

A1. 202 in the News (Due: 09/12/2011)

A0. Bob at Bear's Lair! (Due: 09/12/2011)


L2. The Organizing System (August 31, 2011 - 09:00)

“Organizing” is a fundamental issue in several existing disciplines, most notably library and information science, computer science, informatics, economics, and business.  However, these disciplines have only limited agreement in how they approach problems of organizing and in what they seek as their solutions.  For example, library and information science has traditionally studied organizing from a public sector bibliographic perspective, paying careful attention to user requirements and behaviors, and offering prescriptive methods and solutions.  In contrast, computer science and informatics tend to study organizing in the context of information-intensive business applications with a focus on system architecture and implementation.

This course presents a more abstract framework for issues and problems of organizing that emphasizes the common concepts and goals of the disciplines that study them.  We begin by recognizing that every system of organization involves a collection of resources, and we can treat things, information, and information about things as resources.  Every system of organization involves a choice of properties or principles used to describe and arrange the resources, and ways of accessing and interacting with them.  By comparing and contrasting how these activities take place in different contexts and domains, we can identify patterns of organizing and see that Organizing Systems often follow a common life cycle.  We can create a discipline of organizing in a disciplined way.


L3. The Activities of Organizing Systems (September 7, 2011 - 09:00)

When we take an expansive view of organizing systems we identify four activities that all organizing systems support or perform:  selecting resources, organizing resources, providing resource-based interactions and services, and maintaining resources. These four activities are deeply ingrained in curricula and practice for organizing systems like libraries and museums and can be extended to business, scientific and personal organizing systems.


L4. Design Patterns for Organizing Systems (September 12, 2011 - 09:00)

The great variety in what individuals, groups, and enterprises do is reflected in the huge breadth of organizing systems we encounter and the diversity of the resources that these systems organize. Even so, because every organizing system has a collection of resources at its foundation and shares some of the same general purposes and goals, organizing systems tend to follow patterns in how they organize resources, the interactions they support, and how they are implemented and operated.

A2. Design Patterns (Due: 09/19/2011)


L5. Resources (September 14, 2011 - 09:00)

Resource has an ordinary sense of “anything of value that can support goal-oriented activity.”   This broad definition means that a resource can be a physical thing, information about physical things, information about non-physical things, or anything you want to organize.

Before we can begin to organize any resource we need to identify it. It might seem straightforward to devise an Organizing System around tangible resources, but we must be careful not to beg the question of what a resource is.  In different situations the same thing can be treated as a unique item, as one of many equivalent members of a broad category, or as a sub-part of an item rather than as an item on its own.

When the resources being organized consist of information content, deciding on the unit of organization is challenging because it might be necessary to look beyond physical properties and consider conceptual or intellectual equivalence.

L6. Describing Resources (September 19, 2011 - 09:00)

Many of the design dimensions for organizing systems concern the nature and extent of the descriptions of the resources being organized. Resource descriptions are commonly based on easily perceived intrinsic properties such as size, color, or shape, on intrinsic but intangible properties such as the author or creator’s name and date of creation, or on extrinsic assigned properties such as product or personal identifiers.

The words people use to describe things or concepts are "embodied" in their context and experiences and these naturally-occurring words are an "uncontrolled vocabulary."  As a result, people or enterprises often use different terms or names for the same thing and the same terms or names for different things.  These mismatches often have serious or even drastic consequences.   It might seem straightforward to control or standardize terms or names, and much technology exists for attacking the "vocabulary problem," but technology alone is not a complete solution because language use constantly evolves and the world being described does too.
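
As a minimal sketch of what controlling a vocabulary involves, the Python fragment below maps uncontrolled terms onto preferred terms. The terms and the mapping table are invented for illustration; a real controlled vocabulary is designed and maintained by people and, as noted above, needs revision as language use changes.

```python
# Hypothetical mapping from uncontrolled terms to preferred terms.
PREFERRED_TERM = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "film": "motion picture",
    "movie": "motion picture",
}


def normalize(term):
    """Return the preferred term, or None when the term is not in the vocabulary."""
    return PREFERRED_TERM.get(term.strip().lower())


def group_by_preferred(terms):
    """Group raw terms under their preferred term; unknown terms go under None."""
    groups = {}
    for term in terms:
        groups.setdefault(normalize(term), []).append(term)
    return groups
```

Terms that fall outside the table come back as None, which is one simple way to surface the mismatches that the lookup alone cannot resolve.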

A3. XML and XML Editors (Due: 09/26/2011)


L7. XML (September 21, 2011 - 09:00)

Many of you already have some familiarity with XML, but perhaps mostly as a data format for applications or programming.  In Organizing Systems it is essential to take a more abstract and intellectual view of XML and understand how it represents structured information models for resources and for resource descriptions.  XML encourages the separation of content from presentation, which is the most important principle of information architecture.  Encoding information in XML is an investment in information organization that pays off "downstream" in IR and language processing applications.
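
To illustrate the separation of content from presentation, here is a small Python sketch using the standard library's xml.etree.ElementTree. The element names (book, title, author, year) and the two output formats are assumptions made up for this example, not a vocabulary used in the course.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource description: the XML carries content only, no formatting.
BOOK_XML = """
<book>
  <title>Example Title</title>
  <author>A. Author</author>
  <year>2011</year>
</book>
"""


def parse_book(xml_text):
    """Extract the content model into a plain dictionary keyed by element name."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}


def as_citation(book):
    """One presentation: a single-line citation."""
    return f"{book['author']} ({book['year']}). {book['title']}."


def as_html(book):
    """Another presentation of the same content: an HTML fragment."""
    return f"<h1>{book['title']}</h1><p>{book['author']}, {book['year']}</p>"


book = parse_book(BOOK_XML)
print(as_citation(book))
print(as_html(book))
```

The same encoded content feeds both renderings, and a third could be added without touching the XML.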

L8. Implementing Resource Descriptions (September 26, 2011 - 09:00)

The easiest way to indicate what something means is to give it a name, label, tag, or description.  This additional information about a specific resource, or about a class or collection of resources, is "metadata" because it is not part of the resource or its content.  "What is being described" can be considered on two separate dimensions: the contexts/containers/collections in which it occurs, and the level of abstraction (how large is the set of instances that are treated as equivalent when metadata is assigned).  How much metadata, what kind, and who should provide it are fundamental concerns. Some "contextual" metadata can be assigned automatically, but this raises questions about the identification and scope of the context.

A4. Modeling a Vocabulary (Due: 10/05/2011)


L9. Categories (September 28, 2011 - 09:00)

What is meaning? Where is meaning?  We impose meaning on the world by "carving it up" into concepts and categories.  We interact daily with a bewildering variety of objects and information types, and we constantly make choices about how to understand and organize them.  The conceptual and category boundaries we impose treat some things or instances as equivalent and others as different.  Sometimes we do this implicitly and sometimes we do it explicitly.  We do this as members of a culture and language community, as individuals, and as members of organizations or institutions.  The mechanisms and outcomes of our categorization efforts differ across these contexts.  In most cases the resulting categories are messier than our information systems and applications would like, and understanding why and what to do about it are essential skills for information professionals.

L10. Classification (October 3, 2011 - 09:00)

A Classification is a system of categories, ordered according to a pre-determined set of principles and used to organize a set of instances or entities. This doesn't mean that the principles are always good or equitable or robust, and indeed, every classification is biased in one way or another (for example, compare the Library of Congress classification with the Dewey Decimal System). Classifications are embodied in every information-intensive activity or application.  Faceted or dimensional classification is especially useful in domains that don't have a primary hierarchical structure.

Resources can be classified by people, but also by computational processes.
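
A faceted classification can be sketched in a few lines of Python. The resources and facets below (material, color, size) are hypothetical; the point is only that each facet is an independent dimension and that resources can be selected by any combination of facet values rather than by a position in a single hierarchy.

```python
# Hypothetical resources, each described by independent facets.
RESOURCES = [
    {"name": "wool sweater", "material": "wool", "color": "red", "size": "M"},
    {"name": "cotton shirt", "material": "cotton", "color": "red", "size": "L"},
    {"name": "wool scarf", "material": "wool", "color": "blue", "size": "M"},
]


def select(resources, **facets):
    """Return the names of resources that match every requested facet value."""
    return [
        r["name"]
        for r in resources
        if all(r.get(facet) == value for facet, value in facets.items())
    ]
```

For example, select(RESOURCES, material="wool") and select(RESOURCES, color="red", size="M") cut across the same collection in different ways, and neither facet has to be treated as primary.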

L11. Relations and Structure (October 5, 2011 - 09:00)

A relationship is “an association among several things, with that association having a particular significance.”  The reason for the association is an important part of the relationship. Just identifying the pair of things involved is not enough; several different relationships can exist among the same objects, and the order of the objects in the relationship usually matters.

We care about relationships for two main reasons: 1) everything is related to something, and 2) relationships are powerful means of navigating and finding resources.
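
One common way to make these points concrete is to record each relationship as an ordered (subject, relation, object) triple. The sketch below uses invented names and relations; it is an illustration of the idea, not a representation prescribed by the course.

```python
# Each relationship is an ordered (subject, relation, object) triple.
# The people, relations, and resources here are hypothetical.
TRIPLES = {
    ("Alice", "wrote", "Book A"),
    ("Alice", "edited", "Book A"),
    ("Bob", "reviewed", "Book A"),
}


def relations_between(subject, obj):
    """List every relation that holds from subject to obj, in that order."""
    return sorted(rel for s, rel, o in TRIPLES if s == subject and o == obj)


def subjects_related_to(obj):
    """Navigate from a resource to everything that is related to it."""
    return sorted({s for s, _, o in TRIPLES if o == obj})
```

Here relations_between("Alice", "Book A") finds two distinct relationships for the same pair, while reversing the arguments finds none, and subjects_related_to("Book A") shows relationships being used for navigation.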


An ontology defines the concepts and terms used to describe and represent an area of knowledge and the relationships among them.  A dictionary can be considered a simplistic ontology, and a thesaurus a slightly more rigorous one, but we usually reserve "ontology" for meaning expressed using more formal or structured language.  Put another way, an ontology relies on a controlled vocabulary for describing the relationships among concepts and terms.

A5. Faceted Classification (Due: 10/12/2011)


L12. Relations and Structure (2) (October 10, 2011 - 09:00)



L13. Semantic Web, Linked Data (October 12, 2011 - 09:00)

The "Semantic Web" vision imagines that all information resources and services have ontology-grounded metadata that enables their automated discovery and seamless integration or composition.  Whether it is possible "to get there from here" with today's mostly HTML-encoded Web, or whether "a little semantics goes a long way" are key issues for us to consider.

A6. Ontology (Due: 10/17/2011)


L14. Multimedia IO (October 17, 2011 - 09:00)

We revisit most of the concerns about metadata from Lecture 7 as they apply to non-text and multimedia objects and resources, but some new challenges arise because of the temporal character of audio and video and the semantic opacity of the content.  Because multimedia content can't be (easily) processed to understand what the object means, there is a "semantic gap" between the descriptions that people assign to multimedia content and those that can be assigned by computers or automated processes.  On the other hand, technology for creating multimedia can easily record contextual metadata at the same time. Thesauri and other aids for professional "metadata makers" are invaluable but rarely used by ordinary people when they tag photos or videos.

L15. Course Review (October 19, 2011 - 09:00)


A7. Personal Information Management and Distributed Classification (Due: 11/07/2011)


L16. Enterprise Organizing Systems (October 24, 2011 - 09:00)


L17. Combining Descriptions: Integration and Interoperability (October 26, 2011 - 09:00)


L18. Personal Information Management (October 31, 2011 - 09:00)


L19. Midterm (November 2, 2011 - 09:00)

Midterm (Due: 11/02/2011)


L20. Introduction to IR and NLP (November 7, 2011 - 09:00)


L21. User Interfaces for Search and IR (November 9, 2011 - 09:00)

A8. Search UI Evaluation (Due: 11/16/2011)


L22. Text Processing; Boolean Models (November 14, 2011 - 09:00)


L23. Vector Models & Dimensionality Reduction (November 16, 2011 - 09:00)

A9. Text Toolkit (due 11/30) (Due: 11/30/2011)


L24. Web Search & Structure-Based Models (November 21, 2011 - 09:00)


L25. Structure-Based Models (2) (November 23, 2011 - 09:00)


L26. Mobile and Multimedia IR (November 28, 2011 - 09:00)


L27. Applied IR and NLP (November 30, 2011 - 09:00)


L28. Alumni Day (December 5, 2011 - 09:00)

A10. Optional Alumni Day Reflection (Due: 12/07/2011)


L29. Course Review (December 7, 2011 - 09:00)


L30. Final Exam (December 12, 2011 - 09:00)

Monday, Dec. 12th from 9am to 12pm.

Final (Due: 12/12/2011)