Friday, Aug 28: Clifford LYNCH: Introduction. Building Computational Instruments for the Humanities and Social Sciences.
    Introduction to Seminar; Schedule for Semester; Introduction of Participants.
    Building Computational Instruments for the Humanities and Social Sciences (Lynch). Continuing and extending a discussion at last year's seminar, I'll explore some of the potential for building new computational services that can function as new instruments for the humanities and social sciences (and indeed for many other areas of investigation) and relate them to developments in text and data mining and information retrieval. I will highlight a number of recent experiments in this area. Finally, I'll frame questions about how such instruments might be deployed, and by what organizations.

Friday, Sep 11: Doug OARD, Univ of Maryland: Finding Things You Can't Read: Interactive cross-language search for monolingual users.
    Speech recognition and machine translation techniques are evolving rapidly, creating new opportunities to build systems that can support information seeking in large collections of multilingual and multimedia content. Little is presently known, however, about how people would use such systems to accomplish real tasks. In such circumstances, designers naturally rely on their own judgment to decide how component capabilities should be optimized and how those components should be integrated. Once that's been done, the next step is to put the resulting system in the hands of users in order to learn what they do with it. In this talk, I will describe what we have learned so far from such a process. I'll start with some background on user-centered evaluation for cross-language information retrieval at the Cross Language Evaluation Forum (CLEF). I will then introduce Rosetta, an integrated system that supports search and display of live and archived news feeds in four languages for users who know only English, and I'll explain how we have used a formative evaluation process to co-evolve both the design of the system and the ways in which it can be used. I'll conclude the talk with a few design ideas that build on what we have learned to date.
    Douglas Oard is an Associate Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies. He is on sabbatical at Berkeley's School of Information for the Fall 2009 semester. Dr. Oard earned his Ph.D. in Electrical Engineering from the University of Maryland, and his research interests center around the use of emerging technologies to support information seeking by end users. His recent work has focused on interactive techniques for cross-language information retrieval, searching conversational media, and support for sense-making in large digital archival collections. Additional information is available at

Friday, Sep 18: Julian WARNER, Queen's University, Belfast: Creativity in Feist.
    This paper is not about the legal aspects of Feist (1991), but approaches the judgment from an information science perspective. Analogies are found between concepts in the widely circulated public discourse of Feist and distinctions between forms of mental labor recently introduced to information science. The delineation of the absence of creativity in Feist is analogous to syntactic mental labor and the judgment's criteria for creativity can be encompassed by semantic labor. The validity and significance of the distinction between syntactic and semantic mental labor is supported by the discovery of corresponding concepts in the judgment.
    Julian Warner teaches information science and information policy in the Management School at the Queen's University, Belfast, and has been a Visiting Scholar here. He is interested in the history of information and of information technology. His forthcoming book Human Information Retrieval will be the first in the new MIT Press series on the History and Theory of Information Science.
More at

Friday, Sep 25: Catherine MARSHALL, Microsoft Research, Silicon Valley: No Bull, No Spin: Comparing Public Tags with other Descriptive User Metadata.
    User-contributed tags have shown promise as a means of indexing multimedia collections by harnessing the efforts and enthusiasm of online communities. But tags are only one way of creating viable descriptions of multimedia collections. In this talk, I report on a study that takes a close look at the characteristics of public tags by comparing them to other forms of descriptive metadata that users have assigned to an image collection. I also use the study results to formulate design recommendations for tagging tools and to speculate on how photo sharing sites may be used as de facto art and architecture resources.
    Cathy Marshall is currently a senior researcher at Microsoft Research's Silicon Valley laboratory after a stint in Microsoft's product divisions as part of the Advanced Reading Technologies team. Before that, she was a long-time member of the research staff at Xerox PARC. Cathy's non-Microsoft homepage is at There you will find her publications, blog, contact information, and will learn why she was not invited to her high school reunion.
    Also Brief Progress Report: Ryan SHAW: Modeling Colligatory Concepts in Historical Texts.
    The philosopher of history W.H. Walsh introduced the notion of "colligation" to describe how historians gather diverse factual statements under a unifying concept like "The Renaissance" or "The French Revolution." Frank Ankersmit, building on Walsh's ideas, proposed that types of these "colligatory concepts" could be defined extensionally, by clustering overlapping sets of statements from various texts narrating similar concepts under the same name.
    My proposal for this semester is to investigate Ankersmit's theory by analyzing the full text of 10 books on the 1886 Haymarket Square Riot from the Internet Archive. I plan to use sentence alignment techniques (Barzilay & Elhadad 2003) to identify overlapping sets of statements among the 10 texts. I hope to demonstrate that we can extensionally model the Library of Congress Subject Heading "Haymarket Square Riot, Chicago, Ill., 1886" according to Ankersmit's theory and provide an interface for highlighting differences among the individual narratives constructed by the different texts.

Friday, Oct 2: Michael BUCKLAND: Design for the Future Use of Reference Works.
    Understanding depends on knowing the background, context, and relationships of whatever is of interest. Learning comes through adding to or modifying what one already knows. For these purposes a variety of reference books evolved in the print environment. Having a suitable set of explanatory works conveniently at hand is a valuable amenity, but has been slow to evolve in the online environment. How could such an amenity be made part of everyone's personal computing environment? The literature on library reference service has concentrated on empowering librarians to find answers for library users, which is good, but most people prefer to find explanations for themselves if they can do so easily enough. Economic considerations and changes in technology make a compelling basis for a shift in emphasis to the support of reference self-service. A current project of the Electronic Cultural Atlas Initiative and the School of Information entitled "Context and Relationships: Ireland and Irish Studies" seeks to provide a remedy. After several months' experience with an initial prototype "Context Finder", a quite different design is now being worked on. I will lead a discussion of some of the implications of enabling self-service discovery in relatively trustworthy resources, which are commonly digital versions of traditional print-on-paper reference works. We will consider the design implications for three groups: providers of office software (browsers, word processors); publishers of reference works; and librarians, bibliographers, and teachers. More at

Friday, Oct 9: Students' progress reports:
    Krishna JANAKIRAMAN: BellKor and other approaches towards building book recommendation systems.
    The Netflix recommendation system competition has prompted a surge in recommendation systems research. This has resulted in more accurate and scalable approaches towards building recommendation systems (Y. Koren, RecSys 2008). For the seminar, I would like to take a detailed look at the BellKor algorithm (Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model"), the algorithm that won the Netflix recommendation system competition for the year 2008. One motivation is to try to apply the same algorithm to book recommendation using the BookCrossing dataset. Another motivation is to perform a detailed statistical analysis of the BookCrossing dataset itself. Such an analysis, I believe, may lead towards discovering interesting rules and patterns within a large book reading community like BookCrossing. The inferred rules can further be utilized towards engineering rule-based or decision-tree-based recommendation systems for books, an approach seldom taken by recommendation system engineers.
    Nick DOTY: A Meaningful Ontology of Location.
    As more and more devices have the ability to geolocate themselves, we have an increasing ability to map our own geospatial position. Where we are at a given point in time can provide a valuable and meaningful context to our lives, but in practice most location-based services exclusively exchange latitude and longitude coordinates. Though those coordinates are straightforward for storage and transmission, they leave out a lot of the semantic content. I'll report on my work so far looking at existing ontologies of location and then roughly sketch out some of the additions that might be useful in capturing the cultural meaning of our location.
    Clifford LYNCH: Storage Systems, Resilience, and the Research Agenda for Digital Preservation.
    I'll share some reflections on the recent Library of Congress sponsored Symposium on Storage Systems for Digital Preservation, what we are learning about storage systems, and some ideas from the emerging field of resilient systems, and I'll talk about what this may suggest for the future research and development agenda in support of digital preservation.

Friday, Oct 16: Katsumi TANAKA, Kyoto University: Web Search and Information Credibility Analysis.
    We describe a new concept for improving Web search performance and/or increasing the information credibility of search results using Web 1.0 and Web 2.0 content in a complementary manner. Conventional Web search engines still suffer from a low precision/recall ratio, especially for searching multimedia content (images, videos, etc.). The quality control of Web search is generally insufficient due to low publishing barriers. As a result, there is a large amount of mistaken and unreliable information on the Web that can have detrimental effects on users. This calls for technology that facilitates the judging of the trustworthiness or credibility of content and the accuracy of the information that users encounter on the Web. Such technology should be able to handle a wide range of tasks: extracting credible information related to a given topic, organizing this information, detecting its provenance, and clarifying background, facts, and other related opinions and their distribution. We propose and describe a concept of enhancing the search performance of conventional Web search engines and analyzing information credibility of Web information using the interaction between Web 1.0 and Web 2.0 content. We also give an overview of our recent research activities on Web search and information credibility based on this concept.
    Professor Katsumi Tanaka received the BS, MS and PhD degrees in Information Science from Kyoto University, in 1974, 1976 and 1981, respectively. In 1986, he joined the Department of Instrumentation Engineering, Faculty of Engineering at Kobe University, as an associate professor. In 1994, he became a full professor in the Department of Computer and Systems Engineering, Faculty of Engineering, Kobe University. Since 2001, he has been a professor of the Graduate School of Informatics, Kyoto University. He is currently a vice-dean of the school. His research interests include database theory and systems, Web search, video retrieval, and multimedia information systems. More at
    Also Kazutoshi SUMIYA, University of Hyogo: Less-Conscious Information Retrieval Techniques for Location Based Services.
    We have developed methods that can handle users' interactions without requiring conventional, conscious searching. When a user performs map operations with a certain information retrieval intention in mind (less consciously), a system using our method can detect the specific operation sequences. For example, if the user performs zooming-in and centering operations, the user is narrowing down the search area to a certain location. We define such operation sequences as chunks. The system detects the chunks and uses them to analyze the user's operations and thereby detect the user's intentions. We have developed several prototype systems based on the proposed methods.
    Kazutoshi Sumiya is a professor in the School of Human Science and Environment, University of Hyogo, Japan. He specializes in information search, the WWW, content integration and multimedia. He received his BE and ME degrees in instrumentation engineering from Kobe University in 1986 and 1988, respectively. Then he joined Matsushita Electric Industrial Co. He received his Ph.D. in information media from Kobe University in 1998. He left the company and became a lecturer at Kobe University in 1999, and then was promoted to associate professor in 2000. He became an associate professor in 2001 at Kyoto University and a professor at the University of Hyogo in 2004. He developed software development support systems using visual prototyping for embedded software in home appliances and digital satellite data dissemination systems at Matsushita Electric. At Kobe University and Kyoto University, he developed information dissemination systems and fusion techniques for broadcast media and network media. At the University of Hyogo, he is developing next-generation information techniques. He is chair of the Database System special interest group (DBS) in the Information Processing Society of Japan (IPSJ) and a co-editor of IPSJ Transaction on Database.
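    The chunk idea in Sumiya's abstract (operation sequences that signal an intention) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the operation names, the pattern table, and the left-to-right matching strategy are hypothetical and are not taken from the speakers' prototype systems. Only the zoom-in-then-center example comes from the abstract.

```python
# Hypothetical chunk table: operation sequence -> inferred intention.
# Only the first pattern is described in the abstract; the second is invented.
CHUNKS = {
    ("zoom_in", "center"): "narrowing down to a location",
    ("zoom_out", "pan"): "surveying a wider area",
}


def detect_chunks(operations, chunks=CHUNKS):
    """Scan an operation log left to right.

    Returns a list of (start_index, intention) pairs, one per matched chunk.
    Matched operations are consumed, so chunks never overlap.
    """
    found = []
    i = 0
    while i < len(operations):
        for pattern, intention in chunks.items():
            n = len(pattern)
            if tuple(operations[i:i + n]) == pattern:
                found.append((i, intention))
                i += n  # skip past the operations that formed this chunk
                break
        else:
            i += 1  # no chunk starts here; move to the next operation
    return found


if __name__ == "__main__":
    log = ["pan", "zoom_in", "center", "zoom_out", "pan"]
    for start, intention in detect_chunks(log):
        print(start, intention)
```

    A real system would presumably need tolerance for noise and timing between operations; this sketch matches exact, adjacent sequences only.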

Friday, Oct 23: Patrick SCHMITZ: Berkeley Prosopography Services and CollectionSpace.
    Berkeley Prosopography Services (BPS) is an open-source prosopographical toolkit that generates interactive visualizations of the biological and social connections that link documented individuals, providing a dynamic and heuristic tool for researching historical communities documented in legal and administrative archives. We are currently exploring and developing a prototype application with a single target corpus, but will soon expand to support multiple corpora. The initial corpus is a set of Hellenistic Babylonian legal texts (cuneiform tablets). I'll describe our architecture and the tools we're using, and describe our plans for the next year or so.
    CollectionSpace is a collaboration that brings together a variety of cultural and academic institutions with the common goal of developing and deploying an open-source, web-based software application for the description, management, and dissemination of museum collections information. Berkeley is responsible for the development of the services back-end, which follows SOA principles adapted to this domain. I'll talk about the overall project architecture and organization, and some of the new approaches we've developed to services architecture, SOA methodology, and SOA governance. Pilot deployments of CollectionSpace are underway with the Phoebe A. Hearst Museum of Anthropology, and with the Herbaria collections.
    Both of these projects fit into a longer term mission in IST-Data Services to build a platform of reusable, interoperable services that support research and teaching. See "Using Natural Language Processing and Social Network Analysis to study ancient Babylonian society." Also "Collection management systems for campus museums: CollectionSpace 0.1 released."
    Patrick Schmitz is Semantic Services Architect in the campus Information Services and Technology's Data Services section.

Friday, Oct 30: Isaac MAO, Social Brain Foundation: The Future of Sharism: Social Media's Impact in China.
    As we mark 40 years since the transformation of the Internet from a single meme into a global communication tool, it's time for us to imagine that the future of the Internet could be both socialized to connect all people and materialized to connect all things. Considering the speed with which we now connect, a high level of global consciousness could emerge with active sharism around the world. This kind of emergent power could be showcased soon in some rapidly wired countries like China to see its constructive potential in politics and society.
    Isaac MAO is a philosopher on Sharism, social entrepreneur, blogger, software architect and researcher in learning and social technology. He divides his time between research, social work, business and technology. He is now managing director of Social Brain Foundation. As one of the earliest bloggers in the Chinese community, Isaac is not only co-founder of which is the earliest evangelizing site in China on grassroots publishing, but also the co-chair of Chinese Blogger Conference. Isaac Mao's homepage is at

Friday, Nov 6: Clifford LYNCH: Very Large Scale Preservation; Free Speech and Access to Knowledge.
    After a quick around-the-table for announcements, I'm going to first cover questions about new storage and computational models for very large scale digital preservation, with particular focus on the issues raised by work going on in the resilient computing area, which I will summarize. This will help to shape a new research agenda for digital preservation.
    If there's enough time, I'll follow this up with a re-visiting of some of the talk that I gave last week at the inaugural Kaplan symposium at Penn State, which examines the relationships between fundamental American values of free speech and freedom of the press and related ideas of rights of access to knowledge and information. There are some surprisingly deep problems here, including the relationships among "knowledge", "information", "entertainment" and "culture".

Friday, Nov 13: Tom MORITZ: Data as Evidence.
    For decades there has been a general recognition that data should be freely and effectively available for use. (The scientific method assumes the availability of data for replication or falsification of results.) A variety of countervailing pressures have impeded such access and use. Recently, the European Union, the US National Academies, the Ecological Society of America, GBIF (the Global Biodiversity Information Facility), the NSF OCI DataNet initiative and have all been exploring new models for full life cycle management of data.
    In well-funded, "big science" domains, models for data management incorporating community standards, metrics and best practices have evolved to provide for access and use. In "small science" such models are less well developed. This talk will consider data and emerging developments in data curation and dissemination, focusing on "small science" and on effective applications of data to policy formation and decision making.
    Tom Moritz has worked since 1975 as a librarian and knowledge manager in both the public sector and the non-profit private sector, in governmental, academic and museum settings. He has worked as an advisor on knowledge management in Africa, Asia, Europe, the Pacific and Latin America, and was a lead organizer of the Biodiversity Heritage Library Project and the now-UNEP-based Conservation Commons. He led the development and release of the first World Database on Protected Areas and has successfully participated in grants from the Mellon Foundation, the Sloan Foundation and the US National Science Foundation. In the Fall of 2005, he served as Visiting Assoc. Prof. at the Pratt Institute Graduate School of Library and Info. Science in NY.

Friday, Nov 20: Students' Final Progress Reports.
    Nick DOTY: Personalized Ontologies of Location.
    Many valuable location-based services will depend on an understanding of location that is personal rather than universal. How would such a parameterized ontology vary from standard concepts of geography and how can it enable self-reflection, privacy and context? What challenges stand in the way of any implementation?
    Krishna JANAKIRAMAN: Neighborhood Based Approaches Towards Building a Book Recommendation System.
    Collaborative filtering algorithms are recommender systems that predict unknown user ratings for items from previously known ratings. A predominant approach towards building such algorithms is to build a neighborhood model for items (or users) and then predict the unknown rating for an item using ratings from the item's (or user's) neighborhood. In my final report, I will discuss three such approaches towards building a collaborative filtering algorithm for recommending books using the Book Crossing dataset. These include a new approach suggested by Koren et al. in their Netflix prize winning algorithm. In their BellKor algorithm (Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model"), Koren et al. proposed a neighborhood model in which the weights that relate ratings in the neighborhood to the predicted rating are learned from a global optimization scheme. I will be analyzing their method's performance on the Book Crossing dataset against two well known neighborhood based approaches in which the neighborhood models are built using Pearson's correlation coefficient and the SVD of the user-book ratings matrix, respectively.
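    As a rough illustration of the Pearson-based neighborhood approach mentioned in the abstract above, here is a minimal user-based sketch. It is not the speaker's implementation and not the BellKor model: the toy ratings, the function names, and the choice of a mean-centered, similarity-weighted prediction over the k most correlated users are all assumptions made for illustration.

```python
from math import sqrt

# Toy data, invented for illustration (not from the Book Crossing dataset):
# user -> {book: rating}
RATINGS = {
    "ann": {"b1": 5, "b2": 3, "b3": 4},
    "bob": {"b1": 4, "b2": 2, "b3": 3, "b4": 4},
    "cat": {"b1": 1, "b2": 5, "b4": 2},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ratings, u, v):
    """Pearson correlation between users u and v over the books both rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0  # too little overlap to estimate a correlation
    mu_u = mean(ratings[u][b] for b in common)
    mu_v = mean(ratings[v][b] for b in common)
    num = sum((ratings[u][b] - mu_u) * (ratings[v][b] - mu_v) for b in common)
    den_u = sqrt(sum((ratings[u][b] - mu_u) ** 2 for b in common))
    den_v = sqrt(sum((ratings[v][b] - mu_v) ** 2 for b in common))
    den = den_u * den_v
    return num / den if den else 0.0


def predict(ratings, user, book, k=2):
    """Predict a rating as the user's mean plus the similarity-weighted
    deviations (from their own means) of the k most correlated raters."""
    base = mean(ratings[user].values())
    neighbors = [
        (pearson(ratings, user, v), v)
        for v in ratings
        if v != user and book in ratings[v]
    ]
    neighbors = sorted(neighbors, key=lambda t: abs(t[0]), reverse=True)[:k]
    norm = sum(abs(s) for s, _ in neighbors)
    if norm == 0:
        return base  # no informative neighbors; fall back to the user's mean
    adj = sum(s * (ratings[v][book] - mean(ratings[v].values()))
              for s, v in neighbors)
    return base + adj / norm


if __name__ == "__main__":
    print(round(predict(RATINGS, "ann", "b4"), 3))
```

    In this sketch the neighbor weights are the raw correlations themselves; the abstract's point about the Koren et al. model is that such weights are instead learned through a global optimization, which this example does not attempt.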
    Ryan SHAW: The Haymarket Affair/Massacre/Riot: Programmatically Analyzing Full Texts About a Contested Event.
    I will present a progress report on my attempt to investigate Ankersmit's theory of colligation by analyzing the full text of 10 books on the 1886 Haymarket Affair from the Internet Archive.

