School of
Information Management & Systems
Previously School of Library & Information Studies
296a-1
Seminar: Information Access.
("The Friday Afternoon Seminar")
Summaries
Fall 2001. Fridays 3-5. 107 South Hall.
Schedule.
* indicates student work for credit.
Friday Aug 31:
Clifford LYNCH: Introduction: Overview of the Issues.
I will introduce the seminar, and we will have time for participants
to introduce themselves and speak briefly about their interests.
After this, I will do a brief review of events of interest that have
taken place over the summer, including the Joint Digital Libraries
Meeting, developments in the Library of Congress Digital Preservation
Program and other activities. I will also provide an overview of some
of the topics that I hope to explore in more depth during the
semester and highlight opportunities for student research in some of
these areas.
Sept 7: Michael BUCKLAND: Metadata Research Update.
For more than 10 years the School has had two related
research programs: The "OASIS Research Program," a series of studies
of how the command languages of online bibliographies and catalogs
could be made both easier to use and more powerful, and the
development of the CHESHIRE system by Professor Ray Larson.
More recently, Fredric Gey, Ray Larson, Aitao Chen, student
researchers, and I have been collaborating in federally funded
research on how bibliographic descriptive data ("Metadata") can
be put to wider use.
I will summarize recent work and talk about what
we now want to work on, including the use of catalog records as
linguistic corpora in their own right; the detection and use of
"dialects" among populations of searchers; the automatic construction
of bilingual dictionaries from bibliographic records using statistical
association techniques; the formal evaluation of mappings (relative
indexes) from searchers' "query vocabulary" to indexers' "entry
vocabulary"; and map-based interfaces for geographical searching
of bibliographical databases. (Separate sessions are planned on
cross-lingual retrieval and the design of gazetteer
servers.)
Sept 14: Michael BUCKLAND: The Modern Invention of
'Information': Discourse, History, and Power.
What are the underlying and overlooked assumptions,
values, and consequences of modern discourse about "The Information
Age"? Prof. Ron Day's recent book, The Modern Invention of
"Information": Discourse, History, and Power (Southern
Illinois University Press, 2001) analyzes historically important 20th-century writings about "Information." He shows that the documentalist Paul Otlet (1868-1944) and the librarian Suzanne Briet (1894-1989) associated Information and Information Management (then "Documentation") with cultural progress: efficiency, globalization, and world peace. After World War II, Warren Weaver and Norbert Wiener applied a conduit metaphor (based on Shannon's mathematical theory of signaling, aka "Information Theory") to social applications. More recently the multimedia theorist Pierre Levy has adapted the writings of Deleuze and Guattari to a capitalist understanding of the "virtual society." Common characteristics of these theorists include: utopianism, a reductionist view of human life, totalitarian tendencies, and a narrowly limited conception of "information" as facts or even bits. Further, these assumptions tend to lead to amnesia concerning earlier understandings of information.
Prof. Day writes: "This epistemology and history has blocked a more careful examination of the history of which shows that "the information age" has been occurring again and again and erasing its own history by this epistemology." ... The result is a crisis in the meaning of historical agency and freedom, in so far as the room for a hermeneutic or "critical" rereading of history becomes more and more reduced, especially within "scientific" and professional rhetorics (and the social structures they produce and define), to the "factuality" of the past, present, and future."
The book returns to the mid-century critiques of the philosopher Martin Heidegger and the social critic Walter Benjamin in order to see critical strategies to information that have been more or less forgotten.
Prof. Day teaches Library and Information Science at Wayne State University. He has a Ph.D. in Comparative Literature, a Masters in Philosophy, and a Masters in LIS (from Berkeley). His current work is on Italian Autonomous Marxism and the philosophically and politically informed critiques of globalization and the social meaning of information and communication technologies in, and from, this movement.
Resources: Copies of the first and last chapters are available in the Computer Lab. I have two copies of the book which I can lend. Prof. Day's website contains related material. See item 1, parts of 2, 5, 7, 10, and the second part of 11 at
http://www.lisp.wayne.edu/~ai2398/papers.htm
Sept 21: Tom LEONARD, University Librarian:
Strategic Planning and the Library.
Professor Leonard, our University Librarian, serves on the
campus Strategic Planning Committee and is
trying to explain to it where the Library fits in.
He will review the issues and invite discussion.
See
http://himalia.chance.berkeley.edu/opa/spc/
Also Clifford LYNCH will report on recent developments.
Sept 28: Two topics: A short presentation, then a longer one:
Avi RAPPOPORT,
avirr@searchtools.com:
Integrating Question-Answering into Web Site Search.
While site and intranet search engines tend to operate on standard
information retrieval principles of relevance ranking, user queries
do not. Users seem to waver between information seeking and looking
for answers, sometimes in the same query. To address this, I
recommend that search administrators provide recommended pages for
common searches, and adjust the relevance weighting for various kinds
of documents.
Ruth MOSTERN, Electronic Cultural Atlas Initiative, and
Michael BUCKLAND: Designing Better Gazetteers.
Gazetteers are the familiar lists of place-names commonly
found at the back of an atlas. In an online environment gazetteers
acquire a greatly enhanced significance as a linking mechanism between
maps and texts; and, in a networked environment,
"gazetteer servers" could, in principle, be invoked in conjunction with
any text processing or map-related application.
A gazetteer is a list of entries, each with three data elements:
Place-name; Feature Type (What kind of a place is it? City, shrine,
lake, ...); and Location (longitude and latitude).
However, complexity increases greatly in the humanities and historical
work because places commonly have multiple names in multiple languages
and multiple scripts, and names are unstable over time.
Further, places expand, divide, merge, have unclear and/or disputed
boundaries, and, occasionally, simply move.
Also, the range of Feature Types varies greatly depending on the
application of interest (archeological, architectural, environmental,
linguistic, transportation, military, ...), so no single thesaurus of
feature types can be expected to be satisfactory.
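As a rough illustration of the three-element entry described above, and of why historical work strains it, here is a minimal Python sketch. The class and field names (`PlaceName`, `GazetteerEntry`, `valid_from`, `valid_to`) are assumptions made for this example, not part of any gazetteer standard or of the ECAI specifications, and the sample dates and coordinates are approximate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlaceName:
    """One name for a place; every field other than `text` is illustrative."""
    text: str
    language: str = "und"               # language code; "und" = undetermined
    valid_from: Optional[int] = None    # first year the name was in use, if known
    valid_to: Optional[int] = None      # last year the name was in use, if known

    def in_use(self, year: int) -> bool:
        """True if the name was in use in `year`; an unknown bound always matches."""
        after_start = self.valid_from is None or year >= self.valid_from
        before_end = self.valid_to is None or year <= self.valid_to
        return after_start and before_end


@dataclass
class GazetteerEntry:
    """The three basic data elements: name(s), feature type, and location."""
    names: List[PlaceName]
    feature_type: str                   # e.g. "city", "shrine", "lake"
    longitude: float
    latitude: float

    def names_in(self, year: int) -> List[str]:
        """Names that were in use for this place in the given year."""
        return [n.text for n in self.names if n.in_use(year)]


def find_by_name(entries, name, year=None):
    """Return entries with a matching name, optionally restricted to a year."""
    wanted = name.casefold()
    hits = []
    for entry in entries:
        for n in entry.names:
            if n.text.casefold() == wanted and (year is None or n.in_use(year)):
                hits.append(entry)
                break
    return hits


# Sample entry with approximate, illustrative values.
city = GazetteerEntry(
    names=[
        PlaceName("Constantinople", "en", 330, 1930),
        PlaceName("Istanbul", "tr", 1930, None),
    ],
    feature_type="city",
    longitude=28.98,
    latitude=41.01,
)
```

Even this small sketch needs a list of dated names rather than a single place-name. Changing boundaries and places that move would call for the same treatment of the location element, which the sketch leaves as a single point.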
The National Science Foundation has awarded a grant
to the Electronic Cultural Atlas Initiative (ECAI,
http://ecai.org/) to design improved format
and content specifications for entries in online gazetteers and for
characterizing an online gazetteer as a whole. This is being done
in collaboration with Academia Sinica (Taiwan, whose digital versions
of ancient Chinese histories with archaic place-names in Chinese
ideographs provide suitably challenging test material) and the
Alexandria Digital Library project at UC Santa Barbara.
Gazetteers are also of wider interest as an instance of
a linking mechanism between two different genres.
We will discuss the project tasks and possible
solutions, and report on a recent project workshop in Taipei.
Ruth MOSTERN is project manager on the ECAI NSF
Gazetteer project and is completing a doctoral dissertation on the
long-term stability of political boundaries in China.
Michael BUCKLAND is Co-Director of ECAI in addition to being
a professor in SIMS.
Oct 5: Students' Preliminary Reports on Topics*
Hua AI:
Comparisons of Distributed Information Retrieval Protocols.*
I am planning to look at the area of distributed information retrieval.
I am currently looking at JavaSpaces, and I also plan to look at the Z39.50
protocol and the Open Archives Initiative. My goal is to familiarize myself
with the approaches currently used in distributed information retrieval
and to compare their pros and cons.
Margo E. DUNLAP:
The Internet Archive: The vision of an Entrepreneur.*
I am conducting an ethnographic study of Alexa, a web information
company. Alexa provided the technology enabling the archiving of the
Internet. Brewster Kahle created the Internet Archive and now plans to
make it available to the public. I am interested in reviewing the value
of information access and the possibility of archiving everything.
B. Hoon KANG:
The Challenges and Solutions Toward the World of
Self-Administering Data.*
We have designed and prototyped a novel data management model, called
"self-administering data" [1]. In this model, a declarative specification of
how a data object should behave is associated with the object, perhaps
explicitly by the user, or by the action of a data input device. Typically, the
specification, called a Self-administering Data Description (SDD), expresses how
and to whom the data should be transferred, and how it should be incorporated
when it is received. The actions required to implement the specification are
carried out by a distributed infrastructure of "Self-administering Data
Handlers" (SDHs), which are presumed to exist at various points
in the network.
We have implemented an initial prototype of the Self-administering Data Handler.
In the current prototype, the SDH is configured for co-authoring across
administrative domains, so that the authors' involvement in document management
can be minimized.
The initial prototype suggests that SDH has interesting applications to digital
libraries and information management processes. This semester, we are
planning to design and implement such applications using SDH.
In the progress report, I am going to present the background of SDH, the
challenging problems that need to be addressed, and some preliminary results.
Reference:
[1] Toward a Model of Self-administering Data, B. Hoon Kang and Robert Wilensky,
First ACM/IEEE-CS Joint Conference on Digital Libraries June 24 - 28, 2001,
Roanoke, VA, USA.
http://www.cs.berkeley.edu/~hoon/published/jcdl2001.pdf
Kyungmin KIM, Yueh-Ying HSU, Mengzhi HU:
When MARC meets GIS.*
One benefit of computer technology is the growing ability to
search bibliographic records across multiple libraries. As more
libraries connect with one another, users benefit from being able
to search for records in, and borrow books from, many different
libraries. For a user, finding the copy of a book or magazine
located nearest to him or her would be a high priority.
A researcher, on the other hand, may be interested in how
specific subjects evolve over time or across locations, as reflected
in library records. Current bibliographic records, however, are
displayed mainly as textual descriptions of documents and have
few connections with location data, so these needs are poorly
served. To meet them, we propose a project to
provide an interactive graphic interface connecting
bibliographic records with GIS and gazetteers. Our research
might focus on (1) mapping MARC records to GIS and gazetteers;
(2) cross-language search in digital gazetteers; and
(3) providing graphic boundaries for user queries, and solutions
for inconsistent map boundary data.
Xiaojun PENG: Distributed Information Retrieval.*
The goal of distributed information retrieval is to enable the
identification and retrieval of data sets relevant to a general
description or query, wherever those data sets may be located or hosted.
There are two key components of a distributed IR system. The first is the
application of distributed and parallel computing technology in the area
of information retrieval. An example would be the design, implementation,
and performance evaluation of a distributed architecture for information
retrieval. The second key component is a standard information retrieval
protocol, or a set of interoperable IR protocols. An example would be the
proposal and standardization of communication protocols such as Z39.50.
Oct 12: Students' Reports and David Blundell
Mike KIM & Chan Jean LEE: Atypical product search.*
Often when searching for a product, a person does not know
the name of the particular item they are searching for. However, they
do know the characteristics or attributes of what they seek. If they
know which categories a particular item belongs to, then the search
is simple. But often it is difficult to determine which category an
item will fall under. As an example, if you are looking for a baby
carrier that an adult carries by the handle and that can also be used as a
car seat, where would you look? On many websites, there is more than
one category in which this item could be listed.
Atypical products have attributes of several
categories. We will research the relationships between category names and
the attributes of each category. We would like to determine the
feasibility of allowing searching within multiple categories. We hope
this will reduce the repeated drilling down through categories to find a
product.
Nan ZHOU: Search Engines for Online Stores.*
This project will study search engines for online
stores, show how they work, and compare the search engines used on several
sites. The ultimate goal is to reach conclusions on how online store
search engines could be made more adequate and more effective.
David S. Blundell, Visiting Scholar, International
& Area Studies; Dept of Anthropology, Taiwan National University, Taipei:
Creating an Austronesian Linguistic Atlas.
David Blundell will describe his experience in
language studies, what a linguistic atlas is, the Austronesian
language family (which extends from Vietnam to New Zealand and from Madagascar
to Hawaii), and the steps involved in creating an Austronesian linguistic
atlas and putting it into a digital form.
Oct 19: Marti HEARST: Incorporating Faceted Metadata into Web Site Search.
One of the most pressing usability issues in the design of web sites
is that of how to improve navigation and search. We are conducting a
series of usability studies to address this problem, focusing on web
sites that consist of large collections of loosely organized
information. This talk describes our method and presents
preliminary results which suggest that use of faceted metadata can be
useful both for the initial stages of highly constrained search and
for the intermediate stages of less constrained browsing tasks. We
also find that users state an interest in using different search
interface types to support different search strategies. We are in the
process of conducting usability studies to investigate how to make
this approach scale to very large collections in which each metadata
facet is hierarchically organized.
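To make the idea of faceted metadata concrete, here is a minimal Python sketch of filtering a collection by selected facet values and counting the values that remain, which is the information a faceted interface typically shows next to each link. The sample items, the facet names, and the functions `filter_items` and `facet_counts` are hypothetical and are not taken from the project described above. Facets here are flat, not hierarchical.

```python
from collections import Counter

# Hypothetical collection: each item carries one value for each facet.
ITEMS = [
    {"title": "Tomato soup", "cuisine": "Italian", "course": "starter"},
    {"title": "Minestrone", "cuisine": "Italian", "course": "starter"},
    {"title": "Tiramisu", "cuisine": "Italian", "course": "dessert"},
    {"title": "Miso soup", "cuisine": "Japanese", "course": "starter"},
]


def filter_items(items, selections):
    """Keep only the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selections.items())
    ]


def facet_counts(items, facets):
    """For each facet, count how many of the given items have each value."""
    return {
        facet: Counter(item[facet] for item in items if facet in item)
        for facet in facets
    }


# A user who has chosen cuisine = Italian sees the remaining course options
# together with the number of items behind each one.
italian = filter_items(ITEMS, {"cuisine": "Italian"})
counts = facet_counts(italian, ["cuisine", "course"])
```

Showing counts for the values that remain lets a user narrow a search step by step without reaching an empty result set.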
Joint work with Jennifer English, Kirsten Swearingen, Rashmi Sinha,
and Ping Yee.
See
bailando.sims.berkeley.edu/flamenco.html
Oct 26: Nancy VAN HOUSE:
Who do You Trust? Digital Libraries as Boundary Objects.
Information and communications technologies make information more readily
available. This is not always desirable. In my field work with people
engaged in biodiversity and environmental planning, both of which rely
extensively on shared data, both users and providers of
information expressed concern about the ease with which digital
information can be disseminated. One group, for example, debated restricting
access to photos of endangered plant species to prevent landowners from
identifying and destroying rare specimens.
These concerns led me to look more closely at the practices of knowledge creation
and of cognitive authority in this community, and especially at
boundaries: how they help, how they are created and maintained, and how
they are crossed in interdisciplinary work. I draw on the literature of
social epistemology, science studies, and other areas concerned with
knowledge and knowledge communities, as well as field work with people
actively engaged in designing shared data systems.
I consider the
implications for the design of digital libraries and other shared
information systems.
Nov 2: Students' Progress Reports:
Mike KIM & Chan Jean LEE: Atypical product search:
Lessons Learned and Future Work.*
Our goal is to make better use of the work that webmasters of different
sites have already done to categorize products. To do this, we want to
extract information from different websites and determine which items are
under which category. By comparing the items of two categories, we may be
able to compare the characteristics of the items in the two categories.
These goals directed our research toward parsing the text of HTML pages and
designing a database to store the parsed text. In the database design, we focus
on how to store the hierarchical structure of a website's categories so that
all items under any category can be retrieved.
We will discuss our challenges, such as some limitations of web crawling
robots, and our current approach to extracting data from websites.
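One possible way to store a category hierarchy so that all items under any category can be retrieved is an adjacency list, with each category recording its subcategories, plus a traversal of the subtree. The Python sketch below is an illustration under that assumption, not the database design used in the project. The class `CategoryTree`, its methods, and the sample categories are hypothetical, and the hierarchy is assumed to contain no cycles.

```python
class CategoryTree:
    """In-memory adjacency-list model of a website's category hierarchy."""

    def __init__(self):
        self.children = {}   # category name -> list of subcategory names
        self.items = {}      # category name -> items listed directly in it

    def add_category(self, name, parent=None):
        """Register a category, optionally as a subcategory of `parent`."""
        self.children.setdefault(name, [])
        if parent is not None:
            self.children.setdefault(parent, []).append(name)

    def add_item(self, category, item):
        """List an item directly under a category."""
        self.items.setdefault(category, []).append(item)

    def items_under(self, category):
        """Return the set of items in `category` and in every category below it."""
        found = set()
        stack = [category]
        while stack:
            current = stack.pop()
            found.update(self.items.get(current, []))
            stack.extend(self.children.get(current, []))
        return found


# An atypical product listed in two sibling categories.
tree = CategoryTree()
tree.add_category("Baby")
tree.add_category("Car Seats", parent="Baby")
tree.add_category("Carriers", parent="Baby")
tree.add_item("Car Seats", "infant carrier")
tree.add_item("Carriers", "infant carrier")
tree.add_item("Carriers", "sling")

# Items that two categories have in common.
shared = tree.items_under("Car Seats") & tree.items_under("Carriers")
```

Returning sets makes it straightforward to compare the items of two categories, for example by intersecting them as in the last line.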
ByungHoon KANG: Hash-history based Approach
for Management of Weakly Consistent Replicas.*
Version vectors [Parker83] are used for reconciling replicas by detecting
conflict and partial orderings in weakly consistent replication systems
such as Bayou [Petersen97] and Ficus [Reiher94]. However, it is well
known that this approach does not scale as the number of replicas
increases: Each replica locally maintains its own version vector to track
the number of updates generated by other replicas; the size of the vector
grows in proportion to the number of replicas and the complexity of the
replica creation pattern [Petersen97]. In addition, the management of
version vectors becomes complicated since entries for newly added replicas
have to be either broadcast or incrementally propagated to other
replicas.
We propose a novel "hash history" based scheme to provide a scalable and
simple approach for reconciling replicas. Instead of version vectors,
each site keeps a record of the hash of each version that the site has
created or received from other sites. When an update comes from another
site, the sites exchange their lists of hashes, from which each can decide
which is the newer version. If no version dominates the other, the most
recent common ancestral version can be found instead and used as a useful
hint in a subsequent diffing/merging process. This approach is scalable
since the growth of hash lists is not proportional to the number of
replicas.
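The comparison step described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation. It assumes that each site's history is a linear list ordered from oldest to newest, that a version is identified by a SHA-256 hash of its parent's hash plus its content, and that only detection is needed (merging is left out). The names `Replica`, `version_hash`, and `compare` are hypothetical.

```python
import hashlib


def version_hash(parent_hash, content):
    """Identify a version by hashing its parent's hash together with its content."""
    data = (parent_hash + "\n" + content).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class Replica:
    """One site's copy of an object plus the hashes of the versions it has seen."""

    def __init__(self, content=""):
        self.content = content
        self.history = [version_hash("", content)]   # oldest first

    def clone(self):
        """Create another replica that shares this replica's history so far."""
        other = Replica(self.content)
        other.history = list(self.history)
        return other

    def update(self, content):
        """Record a new local version derived from the current one."""
        self.history.append(version_hash(self.history[-1], content))
        self.content = content


def compare(a, b):
    """Compare two hash histories.

    Returns (status, ancestor), where status is "equal", "a_newer",
    "b_newer", or "conflict", and ancestor is the most recent hash that
    both histories contain (None if they share nothing).
    """
    if a[-1] == b[-1]:
        return "equal", a[-1]
    if b[-1] in a:
        return "a_newer", b[-1]    # a already contains b's latest version
    if a[-1] in b:
        return "b_newer", a[-1]
    seen = set(b)
    for h in reversed(a):          # walk back from a's newest version
        if h in seen:
            return "conflict", h   # most recent common ancestor
    return "conflict", None
```

In this sketch a history grows with the number of versions a site has created or received, not with the number of replicas in the system.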
References
[Parker83] D. Stott Parker, Jr., Gerald Popek, Gerard Rudisin, Allen
Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards,
Stephen Kiser, and Charles Kline. Detection of mutual inconsistency in
distributed systems. IEEE Transactions on Software Engineering,
9(3):240--247, May 1983.
[Petersen97] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and
A. J. Demers. Flexible update propagation for weakly consistent
replication. In Proceedings of the 16th ACM Symposium on Operating Systems
Principles (SOSP-16), Saint Malo, France, October 5-8, 1997, pages 288-301.
[Reiher94] P. Reiher, J. S. Heidemann, D. Ratner, G. Skinner, and G. J.
Popek. Resolving file conflicts in the Ficus file system. In USENIX
Conference Proceedings, June 1994.
Nov 9: Additional Students' Progress Reports:
Hua AI: Comparisons of Distributed Information
Retrieval Protocols.*
Kyungmin KIM, Yueh-Ying HSU, Mengzhi HU:
When MARC meets GIS.*
Bibliographical records do contain location information, but this
information is often left unused. Our project is to make use of
location information in bibliographical records. It can support users
who have specific needs in retrieving location-related data, for
example, "epidemic diseases in southern China." MARC is a standard for
organizing bibliographical records. GIS stands for geographical information system.
Our approach is to use a map interface to help users take advantage of
MARC records when searching. It also offers a new, experimental way
to search bibliographical records. That's why we call it
"MARC meets GIS."
The purposes of the project are as follows:
- Supplement the current text search in library services.
- Help with location search and language search of MARC records.
- Provide an analysis tool for observing the evolution or distribution
of a specific topic.
We will present our plan to implement the project,
current challenges, and possible solutions. We will also raise for
discussion some issues that we have observed in the process.
Xiaojun PENG: Distributed Information Retrieval.
Friday Nov 16: Margo DUNLAP, Nan ZHOU, Clifford LYNCH.
Nan ZHOU: Search
Engines for Online Stores.*
The following is what I have done so far and what I plan to do next:
1. Studied how search engines work, including the whole process by which
spiders work, the use of meta-tags, and how search engines build the
index and carry out a search.
2. Compared a couple of the most popular search engines.
3. Drawing on the study of search engines, I'll proceed to study some
online stores with search engines. By comparing similar online stores, I
want to find out how to improve the search results by
making some changes in the underlying working principles.
Margo E. DUNLAP:
Internet Archive: Documenting Ourselves to Death.*
My inquiry into the Internet Archive, Alexa, and Brewster Kahle is moving
towards a discussion of context in the appropriation of digital cultural
artifacts in on-line collections.
I'm reading Lyman, Kahle, Lesk, and reports by RLG and CLIR on
archiving digital objects for future research, the process of
documentation, and collection development.
Clifford LYNCH:
Personalization in a Distributed Environment.
Personalization through technologies such as recommender systems has been
widely used over the past few years;
however, for reasons of both user privacy and competitive business
advantage, personalization has been highly site-specific. I will
discuss these issues and speculate about how one might begin to think
about reformulating personalization in a more user-centric fashion.
This talk builds in part on a keynote given at the NSF/DELOS
Personalization Workshop in Dublin, Ireland, in June 2001. See
http://www.ercim.org/publication/ws-proceedings/DelNoe02/index.html
The extended abstract for this paper is the second item from the bottom of the list.
Friday Nov 30: Aitao CHEN and Ray LARSON: Retrieval Evaluation Conferences:
TREC and CLEF.
Internationally there are two forums for the
comparative assessment of retrieval performance: the Text Retrieval Conference
(TREC) and the Cross-Language Evaluation Forum (CLEF).
We will report on the latest TREC (TREC-2001)
and CLEF conferences and, if time permits, discuss
cross-lingual retrieval.
Friday Dec 7: ** 3-6 pm ** Students' Presentations.
Margo E. DUNLAP:
Internet Archive: Documenting Ourselves to Death.*
Nan ZHOU:
A Study on Search Engines.*
Description: How do search engines work? What are the differences between
search engines and traditional database search? This study shows in
detail how search engines crawl the web and compares some
popular search engines in several respects.
Xiaojun PENG: Distributed Information Retrieval
Client-Server Model.*
First, I will explain a couple of typical client-server architectures,
such as the two-tier and three-tier models. Second, I will briefly discuss client-side
and server-side web programming. Next, I will introduce JavaServer Pages and
servlets as server-side programming technologies, along with Java Database Connectivity
(JDBC). Finally, I will give an example showing how I use Java servlets and JDBC to
handle clients' requests to query or update a database on the server.
Kyungmin KIM, Yueh-Ying HSU, Mengzhi HU:
When MARC meets GIS.*
Our project is to make use of location information in bibliographical
records. This information can support users with special needs
in retrieving location-related records, for example, "epidemic diseases
in southern China," and may help them considerably in
getting the results they want. Our approach is to use a map interface to help
users take advantage of MARC records when searching. It also
offers a new, experimental way to search bibliographical records.
The research topics of the project are as follows:
(1) Make use of publisher location information;
(2) Use a map as a supporting tool for browsing or searching;
(3) Apply a gazetteer to subject search.
We will present our approaches and prototype.
Hua AI: Distributed Computing and the
Design of CHESHIRE.*
Distributed computing is becoming increasingly important in a networked
world. Cheshire III needs to be adapted to distributed computing.
This report will examine the costs and benefits of Cheshire adopting some
of the major distributed technologies, and measures
of the feasibility and desirability of supporting them in CHESHIRE III.
ByungHoon KANG:
Harnessing P2P for multiple writers: Challenges and Solutions.*
In this talk, I will present the need for harnessing cooperative peer-to-peer
interaction for multiple writers, the new challenges that I have identified, and
the solutions that we are working on. We propose to build a new data management
model to address these challenges. After a brief overview of the new data
management model and the hash-history mechanism, I will present the latest challenge
that we have found. Update-sharing among peers exposes a fundamental security risk; a
compromised or virus-infected state can be easily spread among peers. We
speculate that an "undo-barrier" with a "cooperative defense" approach is useful
for addressing peer-to-peer security risks.
Mike KIM & Chan Jean LEE:
Combining multiple information resources to provide a more unified
search capability.*
Our research focused on the utility of allowing users to choose more than
one category while searching for atypical products. Websites list
items in multiple categories because some products have varied attributes or
characteristics that allow them to be placed in two or more categories. We
wanted to provide better search results by linking category information from
other websites. The presentation will describe our work towards this goal.