Information Organization and Retrieval
202
MW 9-10:30
This course has been called "Information Organization and Retrieval" and
has been the first core course since the school opened, but that title
only partly describes what the course is about. The overall focus is on
the intellectual foundations of IO & IR: conceptual modeling,
semantic representation, vocabulary and metadata design,
classification, and standardization. These issues are important whenever
we develop IO and IR applications and apply technology to make
information more accessible, useful, processable, and so on. Some people
might call this course "Information Architecture" and that would be
accurate if we derived the meaning of IA only from "information" and
"architecture," but most of the time the IA phrase is used in a much
narrower sense, so I tend to avoid it.
There are lots of interesting and deep ideas and questions here, but
that's not why we study them. We study them because understanding the
ideas and answering the questions enables us to design, build and
deploy better systems and applications that organize information. So I try to make
this course intellectually deep but ruthlessly practical at the same
time. To do so I'll employ lots of case studies and news stories about
"information in the wild" and "information-intensive" applications. All in all,
this is a much broader set of contexts than you'd be learning and
talking about if you'd gone to a more traditional library school or to
an I School where the transition from a library school was more
incremental. That said, there is lots to learn by studying libraries, museums, and other "memory institutions" so we'll discuss them too.
It may seem that there is an overwhelming amount of reading assigned for this first day of the course, but many of these are very short newspaper stories and the total number of pages to read is about 50. We're taking the broadest possible look at different contexts or situations in which information is being organized. Stand back and look for common concepts and concerns.
- R1. Borges, Jorge Luis – "The Library of Babel"
- R2. Bush, Vannevar – "As We May Think"
- R3. Trant, Jennifer – "Emerging Convergence? Thoughts on museums, archives, libraries and professional training" (p. 1-8)
- R4. Poole, Erika Shehan and Grudin, Jonathan, "A Taxonomy of Wiki Genres in Enterprises"
- R5. Boon, Miriam – "Rethinking Scientific Data Management"
- R6. Wang, Joy – "A Seed Library for Heirloom Plants Thrives in the Hudson Valley"
- R7. Wakabayashi, Daisuke – "Japanese Farms Look to the 'Cloud'"
- R8. Bosman, Julie – "Publisher Limits Shelf Life for Library E-Books"
- R9. Homann, Ulrich; Rill, Michael; and Wimmer, Andreas – "Flexible Value Structures in Banking"
- R10. Wright, Alex – "Managing Scientific Inquiry in a Laboratory the Size of the Web"
“Organizing” is a fundamental issue in several existing disciplines, most notably library and information science, computer science, informatics, economics, and business. However, these disciplines have only limited agreement in how they approach problems of organizing and in what they seek as their solutions. For example, library and information science has traditionally studied organizing from a public sector bibliographic perspective, paying careful attention to user requirements and behaviors, and offering prescriptive methods and solutions. In contrast, computer science and informatics tend to study organizing in the context of information-intensive business applications with a focus on system architecture and implementation.
This course presents a more abstract framework for issues and problems of organizing that emphasizes the common concepts and goals of the disciplines that study them. We begin by recognizing that every system of organization involves a collection of resources, and we can treat things, information, and information about things as resources. Every system of organization involves a choice of properties or principles used to describe and arrange the resources, and ways of accessing and interacting with them. By comparing and contrasting how these activities take place in different contexts and domains, we can identify patterns of organizing and see that Organizing Systems often follow a common life cycle. We can create a discipline of organizing in a disciplined way.
When we take an expansive view of organizing systems we identify four activities that all organizing
systems support or perform: selecting resources, organizing resources, providing resource-based interactions and
services, and maintaining resources. These four activities are deeply ingrained
in curricula and practice for organizing systems like libraries and museums and
can be extended to business, scientific and personal organizing systems.
The great variety in what individuals, groups, and enterprises do is reflected in the huge breadth
of organizing systems we encounter and the diversity of the resources that these
systems organize. Even so, because every organizing system has a collection of
resources at its foundation and shares some of the same general purposes and
goals, organizing systems tend to follow patterns in how they organize resources,
the interactions they support, and how they are implemented and operated.
Resource has an ordinary sense of “anything of value that can support goal-oriented
activity.” This broad definition means that a resource can be a physical thing, information about physical things,
information about non-physical things, or anything you want to organize.
Before we can begin to organize any resource we need to identify it. It might seem straightforward to devise an Organizing
System around tangible resources, but we must be careful not to beg the question of what a resource is. In different situations the same thing can be treated as a unique item, as one of
many equivalent members of a broad category, or as a sub-part of an item rather
than as an item on its own. When the resources being organized consist of information
content, deciding on the unit of organization is challenging because it might
be necessary to look beyond physical properties and consider conceptual or
intellectual equivalence.
Many of the design dimensions for organizing systems concern the nature
and extent of the descriptions of the resources being organized. Resource
descriptions are commonly based on easily perceived intrinsic properties such
as size, color, or shape, on intrinsic but intangible properties such as the
author or creator’s name and date of creation, or on extrinsic assigned
properties such as product or personal identifiers. The words people use to describe things or concepts are "embodied" in
their context and experiences and these naturally-occurring words are an
"uncontrolled vocabulary." As a result, people or enterprises often
use different terms or names for the same thing and the same terms or
names for different things. These mismatches often have serious or even
drastic consequences. It might seem straightforward to control or
standardize terms or names, and much technology exists for attacking the
"vocabulary problem," but technology alone is not a complete solution
because language use constantly evolves and the world being described
does too.
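One common response to the vocabulary problem is to map the terms people naturally use onto a single preferred term. The following is a minimal sketch of that idea, not a description of any particular system; the vocabulary entries and the function names (`build_lookup`, `normalize`) are assumptions made up for illustration.

```python
# Hypothetical controlled vocabulary: each preferred term lists variant
# terms that people might use for the same concept (illustrative only).
CONTROLLED_VOCABULARY = {
    "automobile": {"car", "auto", "motorcar"},
    "physician": {"doctor", "md"},
}


def build_lookup(vocabulary):
    """Invert the vocabulary into a variant -> preferred-term table."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant] = preferred
    return lookup


def normalize(term, lookup):
    """Return the preferred term, or None if the term is not covered."""
    return lookup.get(term.strip().lower())


LOOKUP = build_lookup(CONTROLLED_VOCABULARY)
print(normalize("Car", LOOKUP))      # automobile
print(normalize("Doctor", LOOKUP))   # physician
print(normalize("selfie", LOOKUP))   # None
```

The last call returns nothing because the term is newer than the vocabulary, which is the point made above: a lookup table only helps until language use moves on, so someone has to keep maintaining it.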
Many of you already have some familiarity with XML, but perhaps mostly
as a data format for applications or programming. In Organizing Systems it is
essential to take a more abstract and intellectual view of XML and
understand how it represents structured information models for resources and for resource descriptions. XML
encourages the separation of content from presentation, which is the
most important principle of information architecture. Encoding
information in XML is an investment in information organization that
pays off "downstream" in IR and language processing applications.
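As a small illustration of separating content from presentation, the sketch below encodes a resource description once in XML and then renders it two different ways. The element names (`book`, `title`, `author`) and the helper functions are assumptions chosen for this example, not a prescribed schema; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource description; the element names are illustrative.
RECORD = """
<book>
  <title>The Library of Babel</title>
  <author>Jorge Luis Borges</author>
</book>
""".strip()


def parse_book(xml_text):
    """Extract the content fields; nothing here says how they will look."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "author": root.findtext("author"),
    }


def as_citation(book):
    """One presentation: a plain-text citation."""
    return f'{book["author"]}. "{book["title"]}"'


def as_html(book):
    """Another presentation of the same content: an HTML fragment."""
    return f'<p><em>{book["title"]}</em> by {book["author"]}</p>'


book = parse_book(RECORD)
print(as_citation(book))
print(as_html(book))
```

Because the XML records only what the information is, a new presentation can be added later without touching the encoded content.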
The easiest way to indicate what something means is to give it a name,
label, tag, or description. This additional information about a specific resource, or about a
class or collection of resources, is "metadata" because it is
not part of the resource or its content. "What is being described" can be
considered on two separate dimensions - the
contexts/containers/collections in which it occurs, and the level of
abstraction (how large is the set of instances that are treated as
equivalent when metadata is assigned). How much metadata, what kind,
and who should provide it are fundamental concerns. Some "contextual"
metadata can be assigned automatically, but this raises questions about
the identification and scope of the context.
What is meaning? Where is meaning? We impose meaning on the world by
"carving it up" into concepts and categories. We interact daily with a
bewildering variety of objects and information types, and we constantly
make choices about how to understand and organize them. The conceptual
and category boundaries we impose treat some things or instances as
equivalent and others as different. Sometimes we do this implicitly and
sometimes we do it explicitly. We do this as members of a culture and
language community, as individuals, and as members of organizations or
institutions. The mechanisms and outcomes of our categorization efforts
differ across these contexts. In most cases the resulting categories
are messier than our information systems and applications would like,
and understanding why and what to do about it are essential skills for
information professionals.
A Classification is a system of categories, ordered according to a
pre-determined set of principles and used to organize a set of instances
or entities. This doesn't mean that the principles are always good or
equitable or robust, and indeed, every classification is biased in one
way or another (for example, compare the Library of Congress
classification with the Dewey Decimal System). Classifications are
embodied in every information-intensive activity or application.
Faceted or dimensional classification is especially useful in domains
that don't have a primary hierarchical structure.
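To make faceted classification concrete, here is a minimal sketch in which each resource is described on several independent facets and can be selected by any combination of them, with no single hierarchy imposed. The sample resources, the facet names, and the `select` function are all hypothetical.

```python
# Hypothetical collection: each resource is described on independent facets.
RESOURCES = [
    {"name": "oak table", "material": "wood", "color": "brown", "room": "kitchen"},
    {"name": "steel stool", "material": "metal", "color": "gray", "room": "kitchen"},
    {"name": "pine shelf", "material": "wood", "color": "white", "room": "study"},
]


def select(resources, **facets):
    """Return names of resources matching every given facet value."""
    return [
        r["name"]
        for r in resources
        if all(r.get(facet) == value for facet, value in facets.items())
    ]


print(select(RESOURCES, material="wood"))                  # ['oak table', 'pine shelf']
print(select(RESOURCES, room="kitchen", material="wood"))  # ['oak table']
```

No facet is privileged: the same collection can be entered by material, by room, or by both at once, which a strict hierarchy would not allow.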
Resources can be classified by people, but also by computational processes.
- R1. The Discipline of Organizing – Chapter 8
- R2. Marlow, Cameron; Naaman, Mor; Boyd, Danah; and Davis, Marc – "HT06, tagging paper, taxonomy, Flickr, academic article, to read"
- R3. Wright, Alex – "Our Sentiments, Exactly"
- R4. Li, Jiexun, et al. – "From Fingerprint to Writeprint"
A relationship is “an association among several things, with that association having a
particular significance.” The reason for the association is an important part of the relationship. Just identifying the pair of things
involved is not enough; several different relationships can exist among the
same objects, and the order of the objects in the relationship usually matters.
We care about relationships for two main reasons: 1) everything is related to something, and
2) relationships are powerful means of navigating and finding resources.
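One simple way to record relationships so that both the reason and the order are preserved is as subject-predicate-object triples. The sketch below assumes that representation; the people, the predicate names, and the helper functions are invented for illustration.

```python
from collections import namedtuple

# A relationship names its reason (the predicate) and keeps its order:
# the subject comes first, the object second.
Relationship = namedtuple("Relationship", ["subject", "predicate", "object"])

# Hypothetical data: two different relationships hold between the same pair.
RELATIONSHIPS = [
    Relationship("Alice", "is-parent-of", "Bob"),
    Relationship("Alice", "is-employer-of", "Bob"),
    Relationship("Bob", "is-sibling-of", "Carol"),
]


def related(subject, predicate, relationships):
    """Navigate from a subject along one kind of relationship."""
    return [
        r.object
        for r in relationships
        if r.subject == subject and r.predicate == predicate
    ]


def relationships_between(a, b, relationships):
    """List every relationship that holds from a to b, in that order."""
    return [r.predicate for r in relationships if r.subject == a and r.object == b]


print(related("Alice", "is-parent-of", RELATIONSHIPS))          # ['Bob']
print(relationships_between("Alice", "Bob", RELATIONSHIPS))     # ['is-parent-of', 'is-employer-of']
print(relationships_between("Bob", "Alice", RELATIONSHIPS))     # []
```

The second and third calls show why the pair alone is not enough: the same two people are linked by two different relationships, and reversing the order yields none of them.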
An ontology defines the concepts and terms used to describe and
represent an area of knowledge and the relationships among them. A
dictionary can be considered a simplistic ontology, and a thesaurus a
slightly more rigorous one, but we usually reserve "ontology" for
meaning expressed using more formal or structured language. Put another
way, an ontology relies on a controlled vocabulary for describing the
relationships among concepts and terms.
The "Semantic Web" vision imagines that all information resources and
services have ontology-grounded metadata that enables their automated
discovery and seamless integration or composition. Whether it is
possible "to get there from here" with today's mostly HTML-encoded Web,
or whether "a little semantics goes a long way" are key issues for us to
consider.
We revisit most of the concerns about metadata from Lecture 7 as they
apply to non-text and multimedia objects and resources, but some new
challenges arise because of the temporal character of audio and video
and the semantic opacity of the content. Because multimedia content
can't be (easily) processed to understand what the object means, there
is a "semantic gap" between the descriptions that people assign to
multimedia content and those that can be assigned by computers or
automated processes. On the other hand, technology for creating
multimedia can easily record contextual metadata at the same time.
Thesauri and other aids for professional "metadata makers" are
invaluable but rarely used by ordinary people when they tag photos or
videos.
Midterm (Due: 11/02/2011)
Monday, Dec. 12th from 9am to 12pm.
Final (Due: 12/12/2011)