296a3 Summary - Seminar Information Access Spring 2000

School of Information Management & Systems
Previously School of Library & Information Studies

296a-3 Seminar: Information Access.
("The Friday Afternoon Seminar")
Summaries

Fridays 3-5. 107 South Hall. Schedule.

Fri Jan 21: Clifford LYNCH: Introduction. Report on HICSS: Hawaii International Conference on System Sciences, Jan 2000. Also: Authenticity and integrity in a digital environment.

Jan 28: Brett BUTLER, INFOUR Intellectual Property Development, Foster City: AnswerBase and how we restructure a query to get closer to users' needs - and get an answer, not just a citation or a text.
Brett Butler, founding President of Information Access, established Magazine Index and, later, Infotrac, pioneering electronic indexing services, using Library of Congress Subject Headings and an OPAC-like structure to build a service that became the largest access product line in the U.S.
Now he is taking another look at access, starting a company based on the premise that we have interjected too many indexing and browsing tools between the patron's question and the target answer. AnswerBase will be a reference database that links queries directly with specific answers using traditional library classification and other structures in non-traditional ways.
The service will also be collaborative, enabling libraries to capture questions as they are asked and submit answers to a central, editorially reviewed database - a first for reference publishing. Traditionally, information flows only from publisher to library.
He will address the impact of a truly interactive query and response system on library practices and on organization for information delivery.

Feb 4: Fredric GEY & Ray LARSON: TREC8: Report on the 8th Text Retrieval Evaluation Conference.
TREC, the Text REtrieval Conference, has been conducted for the past 8 years by the National Institute of Standards and Technology (NIST) with support from the Defense Advanced Research Projects Agency (DARPA). The TREC workshop series encourages research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Now in its ninth year, the conference has become the major experimental effort in the field. Participants in the previous TREC conferences have examined a wide variety of retrieval techniques, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching. Other related problems such as cross-language retrieval, retrieval of recorded speech, and question answering have also been studied. Details about TREC can be found at the TREC web site, http://trec.nist.gov .
TREC focuses on a number of specific retrieval tasks in a set of "tracks". Below is a brief summary of the tasks. Complete descriptions of tasks performed in previous years are included in the Overview papers in each of the TREC proceedings (in the Publications section of the web site).
The central task for all past TRECs has been the Ad Hoc retrieval task: attempting to find the relevant documents in a fixed database.
In addition there are a number of more specialized tracks:
- Cross-Language Track -- a track that investigates the ability of retrieval systems to find documents that pertain to a topic regardless of the language in which the document is written.
- Filtering Track -- A task in which the user's information need is stable (and some relevant documents are known) but there is a stream of new documents. For each document, the system must make a binary decision as to whether the document should be retrieved (as opposed to forming a ranked list).
- Interactive Track -- A track studying user interaction with text retrieval systems.
- Query Track -- A track designed to foster research on the effects of query variability and analysis on retrieval performance.
- Question Answering Track -- A track designed to take a step closer to *information* retrieval rather than *document* retrieval. For each of a set of 500 questions, systems produce a text extract that answers the question.
- Spoken Document Retrieval Track -- A track that investigates the effects of speech recognition errors on retrieval performance.
- Web Track -- A track featuring ad hoc search tasks on a document set that is a snapshot of the World Wide Web.
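
The difference between the Ad Hoc task (a ranked list over a fixed database) and the Filtering task (a binary decision for each arriving document) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy term-count score, the threshold, the function names and the three sample documents are hypothetical and are not part of any TREC system or evaluation procedure.

```python
from collections import Counter


def score(query, doc):
    """Toy relevance score: how often the query terms occur in the document."""
    terms = Counter(doc.lower().split())
    return sum(terms[t] for t in query.lower().split())


def ad_hoc(query, collection):
    """Ad Hoc task: rank a fixed collection, best match first."""
    scored = [(score(query, text), doc_id) for doc_id, text in collection.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


def filtering(profile, stream, threshold=1):
    """Filtering task: one yes/no decision per arriving document, no ranking."""
    for doc_id, text in stream:
        yield doc_id, score(profile, text) >= threshold


# Hypothetical sample documents.
collection = {
    "d1": "speech recognition errors in spoken document retrieval",
    "d2": "cross language retrieval of documents",
    "d3": "paper permanence in library archives",
}

ranked = ad_hoc("document retrieval", collection)
decisions = list(filtering("document retrieval", collection.items()))
print(ranked)
print(decisions)
```

The ranked list can be cut off wherever the user likes, whereas the filtering decision has to be made for each document as it arrives, before later documents have been seen.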

Feb 11: Jack L. XU, Senior Manager, Search Technology Group, Excite@Home Corp: Internet Search Engines: Real World IR Issues and Challenges.
The Excite Web search engine was created in 1996. As one of the industry-leading engines, Excite Search currently indexes 250 million Web pages from an initial database of over 920 million visited Web pages and supports eleven languages including Japanese, Italian, Spanish and Chinese.
Tens of millions of users per day, hundreds of queries per second, tens of terabytes of data, a database of indexed terms in the hundreds of gigabytes, tens of Sun E4500 servers on the back end, multiple data centers. At that scale, nothing is easy.
Jack Xu joined Excite in 1996 and is one of its founding researchers and developers. He currently manages the Search Technology Group within Excite@Home Corp. This talk discusses real-world IR issues (web collection, users, queries ...) and why the issues underlying Internet search engines are challenging. It will be illustrated with lessons learned along the way in managing the Excite search engine.

Feb 18: Fred GEY, UCDATA; Steve LUSSIER; John McCARTHY, LBL & Frank OLKEN, LBL: ISO/IEC 11179 metadata registries.
Report on the Open Forum on ISO 11179 Metadata Registries, Jan 17-21, 2000, in Santa Fe, New Mexico, US. The fourth in a series of international conferences with participants from private enterprise, government, academe and standards organizations to explore the capabilities, uses, content, development and operation of metadata registries, particularly those based on ISO/IEC 11179. Emphasis is on managing the content (semantics) of data that is shared within and between organizations or disseminated via the World Wide Web.

Feb 25: Richard GEIGER, S.F. Chronicle: From "Morgue" to Electronic Publisher -- The Evolution of Newspaper Libraries.
A survey of the many changes that have taken place in news libraries over the last two decades and a look ahead to the future, addressing such issues as subject access, format standardization, database software and vendor relations. The talk will also discuss the effects of the Web and the global economy on news libraries.

Mar 3: Michel BIEZUNSKI, Infoloom, Paris, France: The Topic Maps International Standard (ISO/IEC 13250:1999).
The Topic Maps International Standard (ISO/IEC 13250:1999) provides a standard syntax for interchanging the information needed to support collaborative creation and maintenance of finding aids such as indexes and glossaries. Topic Maps permit such modeling information to be maintained separately from the materials that are indexed. This presentation will give an overview of the Topic Maps architecture, covering concepts, syntax, and some applications currently under development.
Michel Biezunski is working as an independent consultant. He specializes in SGML applications, and has worked specifically on document architectures based on links within information objects.
For an explanation of Topic Maps see:
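
The basic concepts of the standard are topics, occurrences of topics, and associations between topics. A minimal sketch of that model follows, with the caveat that it only illustrates the ideas and is not the interchange syntax defined by ISO/IEC 13250. The class names, the association type and the resource addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Topic:
    name: str
    # Addresses of resources relevant to the topic. The map stores only
    # pointers, so it can be maintained separately from the indexed materials.
    occurrences: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Association:
    kind: str                  # hypothetical association type label
    members: Tuple[str, ...]   # names of the topics taking part


class TopicMap:
    def __init__(self):
        self.topics: Dict[str, Topic] = {}
        self.associations: List[Association] = []

    def add_occurrence(self, name, resource):
        self.topics.setdefault(name, Topic(name)).occurrences.append(resource)

    def associate(self, kind, *names):
        for n in names:
            self.topics.setdefault(n, Topic(n))
        self.associations.append(Association(kind, names))

    def related(self, name):
        """Names of all topics that share an association with `name`."""
        return sorted({m for a in self.associations if name in a.members
                       for m in a.members if m != name})


tm = TopicMap()
tm.add_occurrence("Topic Maps", "iso13250.sgml#section-1")  # made-up address
tm.add_occurrence("SGML", "iso8879.sgml")                   # made-up address
tm.associate("is-expressed-in", "Topic Maps", "SGML")
tm.associate("is-expressed-in", "Topic Maps", "XML")
print(tm.related("Topic Maps"))
```

Because occurrences are held as addresses rather than as copies of the content, the same set of documents could be described by several independent topic maps.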
Topic Maps.
Welcome to Topic Map Land.
"The new ISO standard ISO/IEC 13250 Topic Maps defines a model and architecture for the semantic structuring of link networks. The basic concepts of the standard are topics, occurrences of topics, and relationships ("associations") between topics. A topic map in its interchange form is an SGML (or XML) document (or set of documents) in which different element types are used to represent topics, occurrences of topics, and associations between topics."

Mar 10: Progress reports:
-- Lincoln CUSHING: "Call for Paper": Paper permanency developments.
It is common knowledge in the library community that most printed documents produced over the past 150 years are slowly deteriorating because the paper is archivally unstable. The consequential lost knowledge has been serious enough that the ALA has described the situation as "...a form of censorship." This report summarizes the extent of the problem, reviews the proactive efforts made to improve the quality of new materials being produced, and raises suggestions for areas of further policy development.
-- Karthik IYER: The application of XML in the e-commerce arena.
Also something about new technologies like tuple spaces and the Jini architecture, and the possible incorporation of XML in those technologies.
-- Sridarshan KOUNDINYA: Ontologies.
What are they? Why are they interesting? Summary of definitions. Various approaches taken by different researchers. Research questions that intrigue me. How does this topic link my past background in public policy with my current and future interest in information management? Progress in clarifying the concept.
-- Kathryn KADA & Steve LUSSIER. Environmental Informatics Portal Prototype.
We are taking a non-profit, "open source" approach to support creators and users of environmental datasets, seeking both to lower entry barriers to the field and to promote structured dialogue and collaborative knowledge development among researchers. A recent draft interface is up at www.sims.berkeley.edu/~slussier/newmain2.html.

Mar 17: Patricia BREIVIK, Dean of the University Library, San Jose State U.: Two Changing Faces of Libraries: Information Literacy and Joint Libraries.
-- While concerns about America's digital divide intensify, some librarians are aggressively confronting this challenge. This presentation will explore examples of two very different approaches to closing the gap between the haves and have-nots in our Information Society. These examples are a $771.5 million project of the San Jose State University and the City of San Jose to build a joint library with integrated services, and the growing impact of information literacy in education.

Mar 31: Spring Break.

Apr 7: Reports on recent developments:
- New project on "Translingual Information Management Using Domain Ontologies";
- Web-Wise: Institute for Museum and Library Services conference for National Library Leadership Grant recipients;
- DARPA TIDES kick-off meeting;
- Coalition for Networked Information Forum; and more!

Apr 14: John L. OBER, California Digital Library, UC Office of the President: Applied Research and Technology Transfer for the California Digital Library.
John Ober, Director of Education and Applied Research at the CDL, will discuss the creative tension between immediate goals, available and emerging technology, and the establishment of an applied research and technology transfer agenda to address the mid and long-term goals of the CDL and its users. A tools and services "wish list," creation of strategic partnerships, and organizational processes are all facets of the topic open for discussion.

Apr 21: Clifford LYNCH: Collaborative filtering, popularity based notification, and "how hits happen."

Apr 28: Ray LARSON: Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric and Spatial Data.
This talk will describe the International Digital Library project sponsored by NSF and JISC in the UK under the NSF/JISC International Digital Library Grant program.
The goals of this project are twofold:
1) Practical application of existing DL technologies to some large-scale cross-domain collections.
2) Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs.
The Participants:
* University of Liverpool
* Arts and Humanities Data Service (http://ahds.ac.uk/)
* OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS (Surrey & Northumbria)
* Consortium of University Research Libraries (CURL)
* UC Berkeley Library
* Making of America II
* Online Archive of California
* Use in NESSTAR
For the first goal, we are implementing a distributed search system based on international standards (Z39.50 and SGML/XML) called "Cheshire II", which will be used for cross-domain searching. Databases include:
* Arts and Humanities Data Service (AHDS)
* Consortium of University Research Libraries (CURL)
* Online Archive of California (OAC)
* Making of America II (MOA2)
The second goal will be addressed in the design, development, and evaluation of the distributed information retrieval system architecture, its client-side systems that aid the user in exploiting distributed resources, and in the design and evaluation of protocols for efficient and effective retrieval in an internationally distributed multi-database environment. (Cheshire III?)
We will be dealing with three types of data:
1) Document databases with information about various topics, ranging from news reports and library catalogue entries to full-text articles from academic journals, including text, images and multimedia elements.
2) Numeric statistical databases which assemble facts about a wide variety of social, economic, and natural phenomena.
3) Geographic databases derived from geographic information systems, digitized maps, and other resource types which have a georeferenced view of the geographic features and boundaries, including georeferenced information derived from place names.
For CHESHIRE see http://cheshire.sims.berkeley.edu
For this project see http://cheshire.sims.berkeley.edu/proposal.html

May 5: Sridarshan KOUNDINYA: Ontologies.
"Ontologies" are a topic of widespread discussion, but what is meant? There is "Ontology" as a field of Philosophy and there are different kinds of "AN ontology." The differences will be explained with special reference to the use of "ontology" in the sense of a metadata language, such as a thesaurus.
Also Lincoln CUSHING: Modern Industrial Papermaking and its Consequences for Librarians.
An overview of why the paper used for producing documents over the past 150 years has had terrible consequences for librarians and archivists. Technical and policy issues are explored, as well as suggestions for future work.

May 12: Karthik IYER: XML Applications and related technologies.
There are various XML-based applications used for commercial purposes. Some, like HotMetal server, are used for e-commerce applications. I will also describe some of the XML technologies like XSL, XLink, XPointer, etc.
Also Michael GEBBIE: Preliminary Research on Subdomain Indexes: Special Vocabulary for Special Information Needs.
Conventional practice in indexing is to create a general index to the entire database or corpus. But searchers are usually looking for something within a specific topic. In the DARPA Metadata project we have been experimenting with creating indexes based on a specialized sub-area ("subdomain") only, with striking results. A progress report.
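
The abstract does not describe how the project builds its indexes, so the following is only a generic sketch of the contrast it draws, assuming a simple inverted index over records that carry a subdomain label. The sample records, the labels and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Map each word to the set of ids of the records that contain it."""
    index = defaultdict(set)
    for rec_id, (subdomain, text) in records.items():
        for word in text.lower().split():
            index[word].add(rec_id)
    return index


def subdomain_records(records, subdomain):
    """Keep only the records labelled with the given subdomain."""
    return {rid: rec for rid, rec in records.items() if rec[0] == subdomain}


# Hypothetical records: id -> (subdomain label, text).
records = {
    "r1": ("medicine", "cell growth in tissue"),
    "r2": ("telecom", "cell tower coverage"),
    "r3": ("medicine", "tissue repair"),
}

general = build_index(records)                                 # whole corpus
medical = build_index(subdomain_records(records, "medicine"))  # one subdomain

print(sorted(general["cell"]))
print(sorted(medical["cell"]))
```

In this toy case the general index sends a search for "cell" to both the medical and the telecom record, while the index built from the medical subdomain alone returns only the medical one.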

Fall 1999 schedule.