296a1 Summary - Seminar Information Access Fall 2001

School of Information Management & Systems
Previously School of Library & Information Studies

296a-1 Seminar: Information Access.
("The Friday Afternoon Seminar")
Summaries

Fall 2001. Fridays 3-5. 107 South Hall. Schedule. * indicates student work for credit.

Friday Aug 31: Clifford LYNCH: Introduction: Overview of the Issues.
I will introduce the seminar, and we will have time for participants to introduce themselves and speak briefly about their interests. After this, I will do a brief review of events of interest that have taken place over the summer, including the Joint Digital Libraries Meeting, developments in the Library of Congress Digital Preservation Program and other activities. I will also provide an overview of some of the topics that I hope to explore in more depth during the semester and highlight opportunities for student research in some of these areas.

Sept 7: Michael BUCKLAND: Metadata Research Update.
For more than 10 years the School has had two related research programs: The "OASIS Research Program," a series of studies of how the command languages of online bibliographies and catalogs could be made both easier to use and more powerful, and the development of the CHESHIRE system by Professor Ray Larson. More recently, Fredric Gey, Ray Larson, Aitao Chen, student researchers, and I have been collaborating in federally funded research on how bibliographic descriptive data ("Metadata") can be put to wider use. I will summarize recent work and talk about what we now want to work on, including the use of catalog records as linguistic corpora in their own right; the detection and use of "dialects" among populations of searchers; the automatic construction of bilingual dictionaries from bibliographic records using statistical association techniques; the formal evaluation of mappings (relative indexes) from searchers' "query vocabulary" to indexers' "entry vocabulary"; and map-based interfaces for geographical searching of bibliographical databases. (It is intended to have separate sessions on cross-lingual retrieval and the design of gazetteer servers.)

Sept 14: Michael BUCKLAND: The Modern Invention of 'Information': Discourse, History, and Power.
What are the underlying and overlooked assumptions, values, and consequences of modern discourse about "The Information Age"? Prof. Ron Day's recent book, The Modern Invention of "Information": Discourse, History, and Power (Southern Illinois University Press, 2001) analyzes historically important 20th century writings about "Information." He shows that the documentalist Paul Otlet (1868-1944) and the librarian Suzanne Briet (1894-1989) associated Information and Information Management (then "Documentation") with cultural progress: efficiency, globalization, and world peace. After World War II, Warren Weaver and Norbert Wiener applied a conduit metaphor (based on Shannon's mathematical theory of signaling, aka "Information Theory") to social applications. More recently the multimedia theorist Pierre Levy has adapted the writings of Deleuze and Guattari to a capitalist understanding of the "virtual society." Common characteristics of these theorists include: utopianism, a reductionist view of human life, totalitarian tendencies, and a narrowly limited conception of "information" as facts or even bits. Further, these assumptions tend to lead to amnesia concerning earlier understandings of information.
Prof. Day writes: "This epistemology and history has blocked a more careful examination of the history of which shows that 'the information age' has been occurring again and again and erasing its own history by this epistemology. ... The result is a crisis in the meaning of historical agency and freedom, in so far as the room for a hermeneutic or 'critical' rereading of history becomes more and more reduced, especially within 'scientific' and professional rhetorics (and the social structures they produce and define), to the 'factuality' of the past, present, and future."
The book returns to the mid-century critiques of the philosopher Martin Heidegger and the social critic Walter Benjamin in order to see critical strategies to information that have been more or less forgotten.
Prof. Day teaches Library and Information Science at Wayne State University. He has a Ph.D. in Comparative Literature, a Masters in Philosophy, and a Masters in LIS (from Berkeley). His current work is on Italian Autonomous Marxism and the philosophically and politically informed critiques of globalization and the social meaning of information and communication technologies in, and from, this movement.
Resources: Copies of the first and last chapters are available in the Computer Lab. I have two copies of the book which I can loan. Prof Day's website contains related material. See item 1, parts of 2, 5, 7, 10, and the second part of 11 at
http://www.lisp.wayne.edu/~ai2398/papers.htm

Sept 21: Tom LEONARD, University Librarian: Strategic Planning and the Library.
Professor Leonard, our University Librarian, serves on the campus Strategic Planning Committee and is trying to explain to its members where the Library fits in. He will review the issues and invite discussion.
See http://himalia.chance.berkeley.edu/opa/spc/
Also Clifford LYNCH will report on recent developments.

Sept 28: Two topics: A short presentation, then a longer one:
Avi RAPPOPORT, avirr@searchtools.com: Integrating Question-Answering into Web Site Search.
While site and intranet search engines tend to operate on standard information retrieval principles of relevance ranking, user queries do not. Users seem to waver between information seeking and looking for answers, sometimes in the same query. To address this, I recommend that search administrators provide recommended pages for common searches, and adjust the relevance weighting for various kinds of documents.
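The two recommendations above can be sketched in a few lines. This is only an illustration under stated assumptions, not the speaker's implementation: the function name `search`, the `best_bets` table of recommended pages, the `type_weights` table, and the example URLs are all hypothetical, and plain term counting stands in for whatever relevance ranking a real site search engine uses.

```python
from typing import Dict, List


def search(query: str,
           documents: List[dict],
           best_bets: Dict[str, List[str]],
           type_weights: Dict[str, float]) -> List[str]:
    """Return URLs: administrator-recommended pages first, then ranked matches."""
    terms = query.lower().split()
    # Recommended pages are keyed by the normalized query (hypothetical table).
    recommended = best_bets.get(" ".join(terms), [])
    scored = []
    for doc in documents:
        words = doc["text"].lower().split()
        matches = sum(words.count(t) for t in terms)
        if matches == 0 or doc["url"] in recommended:
            continue
        # Adjust the raw match count by a per-document-type weight.
        weight = type_weights.get(doc["type"], 1.0)
        scored.append((matches * weight, doc["url"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return recommended + [url for _, url in scored]


# Illustrative data only.
documents = [
    {"url": "/faq/returns", "type": "faq",
     "text": "how to return an item return policy"},
    {"url": "/press/2001-returns", "type": "press",
     "text": "press release on return policy return return"},
    {"url": "/about", "type": "page", "text": "about the company"},
]
best_bets = {"return policy": ["/help/returns"]}
type_weights = {"faq": 2.0, "press": 0.5}

results = search("return policy", documents, best_bets, type_weights)
```

Here the press release contains more matching words than the FAQ page, but the type weights push the FAQ above it, and the recommended page leads the list regardless of its text.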
Ruth MOSTERN, Electronic Cultural Atlas Initiative, and Michael BUCKLAND: Designing Better Gazetteers.
Gazetteers are the familiar lists of place-names commonly found at the back of an atlas. In an online environment gazetteers acquire a greatly enhanced significance as a linking mechanism between maps and texts; and, in a networked environment, "gazetteer servers" could, in principle, be invoked in conjunction with any text processing or map-related application. A gazetteer is a list of datasets, each with three data-elements: Place-name; Feature Type (What kind of a place is it? City, shrine, lake, ...); and Location (longitude and latitude). However, complexity increases greatly in the humanities and historical work because places commonly have multiple names in multiple languages and multiple scripts, and names are unstable over time. Further, places expand, divide, merge, have unclear and/or disputed boundaries, and, occasionally, simply move. Also the range of Feature Types varies greatly depending on the application of interest (archeological, architectural, environmental, linguistic, transportation, military, ...), so no single thesaurus of feature types can be expected to be satisfactory.
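As a rough illustration of the three data-elements and of the complications just described, a gazetteer entry might be modeled as follows. This is a sketch under assumptions, not the project's format specification: the class names, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PlaceName:
    """One name for a place, with optional language, script, and period of use."""
    text: str
    language: Optional[str] = None    # hypothetical code, e.g. "zh"
    script: Optional[str] = None      # hypothetical code, e.g. "Latn"
    valid_from: Optional[int] = None  # first year the name was in use, if known
    valid_to: Optional[int] = None    # last year the name was in use, if known


@dataclass
class GazetteerEntry:
    """Place-name(s), Feature Type, and Location for one place."""
    names: List[PlaceName]
    feature_type: str                 # drawn from an application-specific thesaurus
    location: Tuple[float, float]     # (longitude, latitude)

    def names_in_year(self, year: int) -> List[str]:
        """Return the names that were in use in the given year."""
        result = []
        for n in self.names:
            starts_ok = n.valid_from is None or n.valid_from <= year
            ends_ok = n.valid_to is None or year <= n.valid_to
            if starts_ok and ends_ok:
                result.append(n.text)
        return result


# Placeholder values, not real gazetteer data.
entry = GazetteerEntry(
    names=[PlaceName("Oldname", valid_to=1900),
           PlaceName("Newname", valid_from=1901)],
    feature_type="city",
    location=(0.0, 0.0),
)
```

A single point location is the simplest case; places with unclear or disputed boundaries would need a richer location type than this sketch provides.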
The National Science Foundation has awarded a grant to the Electronic Cultural Atlas Initiative (ECAI, http://ecai.org/) to design improved format and content specifications for entries in online gazetteers and for characterizing an online gazetteer as a whole. This is being done in collaboration with Academia Sinica (Taiwan, whose digital versions of ancient Chinese histories with archaic place-names in Chinese ideographs provide suitably challenging test material) and the Alexandria Digital Library project at UC Santa Barbara.
Gazetteers are also of wider interest as an instance of a linking mechanism between two different genres.
We will discuss the project tasks and possible solutions, and report on a recent project workshop in Taipei.
Ruth MOSTERN is project manager on the ECAI NSF Gazetteer project and is completing a doctoral dissertation on the long-term stability of political boundaries in China. Michael BUCKLAND is Co-Director of ECAI in addition to being a professor in SIMS.

Oct 5: Students' Preliminary Reports on Topics*
Hua AI: Comparisons of Distributed Information Retrieval Protocols.*
I am planning to look at the area of distributed information retrieval. I am right now looking at Java space, and I plan to also look at the Z39.50 protocol and the Open Archives Initiative. My goal is to familiarize myself with the approaches currently used in distributed information retrieval and compare their pros and cons.
Margo E. DUNLAP: The Internet Archive: The vision of an Entrepreneur.*
I am conducting an ethnographic study of Alexa, a web information company. Alexa provided the technology enabling the archiving of the Internet. Brewster Kahle created the Internet Archive and now plans to make it available to the public. I am interested in reviewing the value of information access and the possibility of archiving everything.
B. Hoon KANG: The Challenges and Solutions Toward the World of Self-Administrating Data.*
We have designed and prototyped a novel data management model, called "self-administering data" [1]. In this model, a declarative specification of how a data object should behave is associated with the object, perhaps explicitly by the user, or by the action of a data input device. Typically, the specification, called a Self-administering Data Description (SDD), expresses how and to whom the data should be transferred, and how it should be incorporated when it is received. The actions required to implement the specification are carried out by a distributed infrastructure of "Self-administering Data Handlers" (SDHs), which are presumed to exist at various points in the network.
We have implemented an initial prototype of the Self-administering Data Handler. In the current prototype, the SDH is configured for co-authoring across administering domains, so that the authors' involvement in document management can be minimized. The initial prototype suggests that SDH has interesting applications to digital libraries and information management processes. This semester, we are planning to design and implement such applications using SDH.
In the progress report, I am going to present the background of SDH, the challenging problems that need to be addressed, and some preliminary results.
Reference: [1] Toward a Model of Self-administering Data, B. Hoon Kang and Robert Wilensky, First ACM/IEEE-CS Joint Conference on Digital Libraries June 24 - 28, 2001, Roanoke, VA, USA. http://www.cs.berkeley.edu/~hoon/published/jcdl2001.pdf
Kyungmin KIM, Yueh-Ying HSU, Mengzhi HU: When MARC meets GIS.*
One advantage of computer technology is the trend toward search services over bibliographic records across multiple libraries. As more libraries put effort into connecting with each other, users benefit from being able to search for records in, and borrow books from, multiple libraries. For a user, finding the copy of a book or magazine located nearest to where he or she is would be a high priority. A researcher, on the other hand, may be interested in the evolution of specific subjects over time or location as seen through library records. Current bibliographical records in libraries, however, are mainly displayed as textual descriptions of documents and have few connections with location data, so these needs are hardly satisfied. To satisfy these new search needs, we propose a project to provide an interactive graphical interface connecting bibliographic records with GIS and gazetteers. Our research might focus on (1) mapping MARC records to GIS and gazetteers; (2) cross-language search in digital gazetteers; (3) providing graphical boundaries for user queries and solutions for inconsistent map-boundary data.
Xiaojun PENG: Distributed Information Retrieval.*
The goal of distributed information retrieval is to enable the identification and retrieval of data sets relevant to a general description or query, wherever those data sets may be located or hosted. There are two key components of a distributed IR system. The first is the application of distributed and parallel computing technology in the area of information retrieval. An example would be the design, implementation and performance evaluation of a distributed architecture for information retrieval. The second key component is a standard information retrieval protocol, or a set of interoperable IR protocols. An example would be the proposal and standardization of such communication protocols as Z39.50.

Oct 12: Students' Reports and David Blundell
Mike KIM & Chan Jean LEE: Atypical product search.*
Often when searching for a product, a person does not know the name of the particular item they are searching for. However, they do know the characteristics or attributes of what they seek. If he or she knows which categories a particular item belongs to then the search is simple. But often, it is difficult to determine which category an item will fall under. As an example, if you are looking for a baby carrier that an adult carries by the handle and also can be used as a car seat, where would you look? In many websites, there is more than one category in which this item can be listed.
Atypical products have attributes of several categories. We will research the relationships between category names and the attributes of each category. We would like to determine the feasibility of allowing searching within multiple categories. We hope this will reduce the recursive drilling down through categories to find a product.
Nan ZHOU: Search Engines for Online Stores.*
This project will do research on search engines for online stores, show how they work, and compare different search engines for some sites. The ultimate goal is to reach conclusions on how online store search engines could be made more adequate and more effective.
David S. Blundell, Visiting Scholar, International & Area Studies; Dept of Anthropology, Taiwan National University, Taipei: Creating an Austronesian Linguistic Atlas.
David Blundell will describe his experience in language studies, what a linguistic atlas is, the Austronesian language family (which extends from Vietnam to New Zealand and from Madagascar to Hawaii), and the steps involved in creating an Austronesian linguistic atlas and putting it into a digital form.

Oct 19: Marti HEARST: Incorporating Faceted Metadata into Web Site Search.
One of the most pressing usability issues in the design of web sites is that of how to improve navigation and search. We are conducting a series of usability studies to address this problem, focusing on web sites that consist of large collections of loosely organized information. This talk describes our method and presents preliminary results which suggest that use of faceted metadata can be useful both for the initial stages of highly constrained search and for the intermediate stages of less constrained browsing tasks. We also find that users state an interest in using different search interface types to support different search strategies. We are in the process of conducting usability studies to investigate how to make this approach scale to very large collections in which each metadata facet is hierarchically organized.
Joint work with Jennifer English, Kirsten Swearingen, Rashmi Sinha, and Ping Yee.
See bailando.sims.berkeley.edu/flamenco.html

Oct 26: Nancy VAN HOUSE: Who do You Trust? Digital Libraries as Boundary Objects.
Information and communications technologies make information more readily available. This is not always desirable. In my field work with people engaged in biodiversity and environmental planning, both of which rely extensively on shared data, both users and providers of information expressed concern about the ease with which digital information can be disseminated. One group, for example, debated restricting access to photos of endangered plant species to prevent landowners from identifying and destroying rare specimens.
These concerns led me to look more closely at the practices of knowledge creation and of cognitive authority in this community, and especially at boundaries: how they help, and are created and maintained, as well as how they are crossed in interdisciplinary work. I draw on the literature of social epistemology, science studies, and other areas concerned with knowledge and knowledge communities, as well as field work with people actively engaged in designing shared data systems.
I consider the implications for the design of digital libraries and other shared information systems.

Nov 2: Students' Progress Reports:
Mike KIM & Chan Jean LEE: Atypical product search: Lessons Learned and Future Work.*
Our goal is to make better use of the work that different webmasters have already done to categorize products. To do this, we want to extract information from different websites and determine which items are under which category. By comparing the items of two categories, we may be able to compare the characteristics of the items in the two categories. These goals directed our research focus toward parsing the text of HTML pages and designing a database to store the parsed text. In the database design, we focus on how to store the hierarchical category structure of a website in order to retrieve all items under any category.
We will discuss our challenges, such as some limitations of web crawling robots, and our current approach to extracting data from websites.
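One common way to store a category hierarchy so that all items under any category can be retrieved is an adjacency list (each category row points to its parent) queried with a recursive common table expression. The sketch below uses Python's built-in sqlite3 module; the table names, column names, and sample rows are hypothetical, and nothing in the abstract says the project used this particular design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    id INTEGER PRIMARY KEY,
    name TEXT,
    parent_id INTEGER REFERENCES category(id)  -- NULL for a top-level category
);
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES category(id)
);
""")

# Illustrative rows only.
conn.executemany("INSERT INTO category VALUES (?, ?, ?)", [
    (1, "Baby", None),
    (2, "Car seats", 1),
    (3, "Carriers", 1),
    (4, "Infant car seats", 2),
])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
    (1, "Convertible seat", 2),
    (2, "Carry-handle infant seat", 4),
    (3, "Sling carrier", 3),
])


def items_under(conn, category_id):
    """Return names of all items in the category or any of its descendants."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM category WHERE id = ?
            UNION ALL
            SELECT c.id FROM category c JOIN subtree s ON c.parent_id = s.id
        )
        SELECT item.name
        FROM item JOIN subtree ON item.category_id = subtree.id
        ORDER BY item.name
    """, (category_id,)).fetchall()
    return [row[0] for row in rows]
```

Calling `items_under(conn, 2)` walks from "Car seats" down into "Infant car seats" and returns the items filed at both levels.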
ByungHoon KANG: Hash-history based Approach for Management of Weakly Consistent Replicas.*
Version vectors [Parker83] are used for reconciling replicas by detecting conflict and partial orderings in weakly consistent replication systems such as Bayou [Petersen97] and Ficus [Reiher94]. However, it is well known that this approach does not scale as the number of replicas increases: Each replica locally maintains its own version vector to track the number of updates generated by other replicas; the size of the vector grows in proportion to the number of replicas and the complexity of the replica creation pattern [Petersen97]. In addition, the management of version vectors becomes complicated since entries for newly added replicas have to be either broadcast or incrementally propagated to other replicas.
We propose a novel "hash history" based scheme to provide a scalable and simple approach for reconciling replicas. Instead of version vectors, each site keeps a record of the hash of each version that the site has created or received from other sites. When an update comes from another site, the sites exchange their lists of hashes, from which each can decide which is the newer version. If no version dominates the other, the most recent common ancestral version can be found instead and used as a useful hint in a subsequent diffing/merging process. This approach is scalable since the growth of hash lists is not proportional to the number of replicas.
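The comparison step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it keeps each site's history as a flat list, hashes each version together with its parent's hash (an assumption of this sketch), and the names `Replica`, `version_hash`, and `compare` are hypothetical.

```python
import hashlib
from typing import List, Optional, Tuple


def version_hash(parent_hash: str, content: str) -> str:
    """Hash a version together with its parent's hash (an assumption of this
    sketch), so that reverting to old content still yields a new identifier."""
    data = (parent_hash + "\n" + content).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


class Replica:
    """One site's copy of an object plus the hashes of every version it has seen."""

    def __init__(self, content: str, history: Optional[List[str]] = None):
        self.content = content
        # A fresh object starts with a single root version.
        self.history = list(history) if history else [version_hash("", content)]

    def update(self, new_content: str) -> None:
        """Create a new local version and record its hash."""
        self.history.append(version_hash(self.history[-1], new_content))
        self.content = new_content

    def fork(self) -> "Replica":
        """Copy this replica to another site, history included."""
        return Replica(self.content, self.history)


def compare(a: Replica, b: Replica) -> Tuple[str, Optional[str]]:
    """Decide which replica is newer by exchanging hash lists.

    Returns (verdict, hash), where hash is the most recent common version.
    """
    head_a, head_b = a.history[-1], b.history[-1]
    if head_a == head_b:
        return "equal", head_a
    if head_b in a.history:      # a has already seen b's latest version
        return "a_newer", head_b
    if head_a in b.history:      # b has already seen a's latest version
        return "b_newer", head_a
    # Neither dominates: find the most recent common ancestor as a merge hint.
    seen_by_b = set(b.history)
    for h in reversed(a.history):
        if h in seen_by_b:
            return "conflict", h
    return "conflict", None
```

When two sites edit the same version independently, `compare` reports a conflict along with the hash of the version both started from, which is the hint a later diff or merge step would use.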
References
[Parker83] D. Stott Parker, Jr., Gerald Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. Detection of mutual inconsistency in distributed systems. IEEE Transactions on Software Engineering, 9(3):240--247, May 1983.
[Petersen97] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible update propagation for weakly consistent replication. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP-16), Saint Malo, France, October 5-8, 1997, pages 288-301.
[Reiher94] P. Reiher, J. S. Heidemann, D. Ratner, G. Skinner, and G. J. Popek. Resolving file conflicts in the Ficus file system. In USENIX Conference Proceedings, June 1994.

Nov 9: Additional Students' Progress Reports:
Hua AI: Comparisons of Distributed Information Retrieval Protocols.*
Kyungmin KIM, Yueh-Ying HSU, Mengzhi HU: When MARC meets GIS.*
Bibliographical records do contain location information, but this information is often ignored. Our project is to make use of the location information in bibliographical records. It can support users who have specific needs in retrieving location-related data, for example, "epidemic diseases in southern China." MARC is a standard for organizing bibliographical records. GIS stands for geographical information system. Our approach is to use a map interface to help users take advantage of MARC records when searching. It also provides a new, experimental way of searching bibliographical records. That's why we call it "MARC meets GIS."
The purpose of the project is as follows:
- Supplement the current text search in library services.
- Help with location search and language search of MARC records.
- Serve as an analysis tool for observing the evolution or distribution of a specific topic.
We will present our plan to implement the project, current challenges, and possible solutions. We also propose to discuss some issues that we have observed in the process.
Xiaojun PENG: Distributed Information Retrieval.

Friday Nov 16: Margo DUNLAP, Nan ZHOU, Clifford LYNCH.
Nan ZHOU: Search Engines for Online Stores.*
The following is what I have done so far:
1. How search engines work, including the whole process by which spiders work, the use of meta-tags, and how search engines build an index and carry out a search.
2. A comparison of a couple of the most popular search engines.
3. Drawing on the study of search engines, I'll proceed to study some online stores with search engines. By comparing similar online stores, I want to find out how to improve the search results by making some changes in the underlying working principles.
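The index-building and search steps mentioned in item 1 can be illustrated with a minimal inverted index. This is a teaching sketch, not a description of any particular search engine: the function names and sample pages are hypothetical, and real engines add crawling, ranking, stemming, and much more.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of page URLs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the pages that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Illustrative pages only.
pages = {
    "/shoes": "red running shoes",
    "/boots": "red leather boots",
    "/hats": "wool hats",
}
index = build_index(pages)
```

A query such as "red shoes" is answered by intersecting the page sets for "red" and "shoes", without rescanning the page text.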
Margo E. DUNLAP: Internet Archive: Documenting Ourselves to Death.*
My inquiry into the internet archive, Alexa, and Brewster Kahle is moving towards a discussion of context in the appropriation of digital cultural artifacts in on-line collections. I'm reading Lyman, Kahle, Lesk, and reports by RLG and CLIR on archiving digital objects for future research, the process of documentation, and collection development.
Clifford LYNCH: Personalization in a Distributed Environment.
There has been a great deal of use of personalization through technologies such as recommender systems over the past few years; however, for reasons of both user privacy and competitive business advantage, personalization has been highly site-specific. I will discuss these issues and speculate about how one might begin to think about reformulating personalization in a more user-centric fashion.
This talk builds in part on a keynote given at the NSF/DELOS Personalization Workshop in Dublin, Ireland, in June 2001. See http://www.ercim.org/publication/ws-proceedings/DelNoe02/index.html An extended abstract for this paper is one from the bottom of the list.

Friday Nov 30: Aitao CHEN and Ray LARSON: Retrieval Evaluation Conferences: TREC and CLEF.
Internationally there are two forums for the comparative assessment of retrieval performance: the Text Retrieval Conference (TREC) and the Cross-Language Evaluation Forum (CLEF). A report of the latest TREC (TREC-2001) and CLEF conferences will be provided and, if time permits, a discussion of cross-lingual retrieval.

Friday Dec 7: ** 3-6 pm ** Students' Presentations.
Margo E. DUNLAP: Internet Archive: Documenting Ourselves to Death.*
Nan ZHOU: A Study on Search Engines.*
Description: How do search engines work? What are the differences between a search engine and a traditional database search? This study shows in detail how search engines crawl the web and compares some popular search engines in different respects.
Xiaojun PENG: Distributed Information Retrieval Client-Server Model.*
First, I will explain a couple of typical client-server architectures, such as the two-tier and three-tier models. Second, I will briefly discuss client-side and server-side web programming. Next, I will introduce JavaServer Pages and servlets as server-side programming, along with Java Database Connectivity (JDBC). Finally, I will give an example showing how I use Java servlets and JDBC to handle clients' requests to query or update a database on the server.
Kyungmin KIM, Yueh-Ying HSU, Mengzhi HU: When MARC meets GIS.*
Our project is to make use of the location information in bibliographical records. This information can support users with special needs in retrieving location-related records, for example, "epidemic diseases in southern China," and may considerably help users to get the results they want. Our approach is to use a map interface to help users take advantage of MARC records when searching. It also provides a new, experimental way of searching bibliographical records. The research topics of the project are as follows:
(1) Make use of publisher location information;
(2) Use a map as a supporting tool for browsing or searching;
(3) Apply a gazetteer to subject search.
We will present our approaches and prototype.
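Topic (1) can be illustrated with a small sketch that looks up each record's place of publication in a gazetteer and keeps the records that fall inside a map region selected by the user. It is an illustration under assumptions, not the project's prototype: the records are plain dictionaries with hypothetical titles rather than parsed MARC data (in MARC 21, the place of publication is carried in field 260, subfield a), the gazetteer is a tiny table with approximate coordinates, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Approximate (longitude, latitude) values, for illustration only.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "berkeley": (-122.27, 37.87),
    "taipei": (121.56, 25.03),
    "guangzhou": (113.26, 23.13),
}

# Hypothetical records; a real system would read these from MARC data.
RECORDS: List[dict] = [
    {"title": "Title A", "place_of_publication": "Berkeley"},
    {"title": "Title B", "place_of_publication": "Taipei"},
    {"title": "Title C", "place_of_publication": "Guangzhou"},
    {"title": "Title D", "place_of_publication": "Unknown place"},
]


def records_in_box(records: List[dict],
                   gazetteer: Dict[str, Tuple[float, float]],
                   west: float, south: float,
                   east: float, north: float) -> List[str]:
    """Return titles whose place of publication lies inside the bounding box."""
    hits = []
    for rec in records:
        place = rec.get("place_of_publication", "").strip().lower()
        coords = gazetteer.get(place)
        if coords is None:
            continue  # the place name is not in the gazetteer
        lon, lat = coords
        if west <= lon <= east and south <= lat <= north:
            hits.append(rec["title"])
    return hits


# A box drawn on the map by the user (hypothetical bounds).
selected = records_in_box(RECORDS, GAZETTEER,
                          west=105.0, south=18.0, east=125.0, north=27.0)
```

Records whose place name cannot be resolved are silently skipped here; a real interface would need a way to surface them, since inconsistent place data is one of the challenges the project names.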
Hua AI: Distributed Computing and the Design of CHESHIRE.*
Distributed computing is becoming increasingly important in a networked world. Cheshire III needs to be adapted to distributed computing. This report will examine the costs and benefits if Cheshire adopts some of the major distributed technologies, and will assess the feasibility and desirability of supporting them in CHESHIRE III.
ByungHoon KANG: Harnessing P2P for multiple writers: Challenges and Solutions.*
In this talk, I will present the need for harnessing cooperative peer-to-peer interaction for multiple writers, the new challenges that I have identified, and the solutions that we are working on. We propose to build a new data management model to address these challenges. After a brief overview of the new data management model and the hash-history mechanism, I will present the latest challenge that we have found. Update-sharing among peers exposes a fundamental security risk; a compromised or virus-infected state can be easily spread among peers. We speculate that an "undo-barrier" with a "cooperative defense" approach is useful for addressing peer-to-peer security risks.
Mike KIM & Chan Jean LEE: Combining multiple information resources to provide a more unified search capability.*
Our research focused on the utility of allowing users to choose more than one category while searching for atypical products. Websites list items in multiple categories because some products have varied attributes or characteristics that allow them to be placed in two or more categories. We wanted to provide better search results by linking category information from other websites. The presentation will describe our work towards this goal.
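Letting a user choose more than one category can be sketched as a ranking by category overlap. This is an illustration only, not the students' system: the function name, the ranking rule, and the sample products are assumptions.

```python
from typing import Dict, List, Set


def search_categories(item_categories: Dict[str, Set[str]],
                      selected: Set[str]) -> List[str]:
    """Rank items by how many of the selected categories they belong to."""
    scored = []
    for item, categories in item_categories.items():
        overlap = len(categories & selected)
        if overlap:
            scored.append((overlap, item))
    # Most shared categories first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored]


# Illustrative catalog only.
catalog = {
    "infant carrier seat": {"car seats", "carriers"},
    "booster seat": {"car seats"},
    "sling": {"carriers"},
    "stroller": {"strollers"},
}
ranked = search_categories(catalog, {"car seats", "carriers"})
```

An atypical product that sits in both chosen categories rises to the top, while items in only one of them still appear further down the list.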
