IS290-1, Spring '06
After Google, What?
Last modified: January 19, 2006

Assignments

List of assignments

Assignments for term papers and projects are set out below under two headings: research topics and practical exercises.

Students who wish to pursue problems not identified on the list set out below may be permitted to do so, so long as:

Research topics

How are academic libraries perceived and being used? It is incontrovertible that fundamental aspects of the academic library are changing; the trajectory of that change is less easy to discern. Thinking innovatively about your use of methods and sources, identify, explain, and critique some of the developmental trajectories that you see emerging. Don’t hesitate to apply your analysis to the UC libraries, whether at Berkeley or elsewhere.

Re-centralization of academic information services? The paradigm for managing and funding the university’s enterprise-wide information systems seems to cycle between centralized and distributed modes. The mainframe computer environments that were so prevalent in the mid-1980s characterize the extremely centralized mode. The distributed cluster-based scientific computing environment and the departmental networks and service providers characterize the more distributed approach to information service provision that is more prevalent today. There is evidence in the past few years that the pendulum is swinging back from distributed to more centralized, or at least coordinated, norms. Basing your assessment on real experience as reported or observed at a handful of academic institutions, explain which services are being provided in a more coordinated or centralized fashion and account for this shift in their provision.

Managing risk inherent in our digital scholarly assets. The academic establishment is becoming increasingly concerned about the risks inherent in the volatility of digital scholarly information, which has no analog in more stable formats such as print or even film. A great deal of attention has been paid to the persistent management of scholarly publications and to digital research data. Are there other digital assets that the academy should be concerned with? If so, what are they? To whom do they have long-term value (what is the business case for investment in their preservation)? What strategies (technical, organizational, and financial) should be considered in capturing, managing, and ensuring appropriate subsequent use of such assets?

Search in an academic context. This assignment is intended to gather information about how academic users locate scholarly information and is designed to assist in the design of effective resource discovery systems. Students who choose this assignment will, in effect, be asked to participate in and then document their experience of a series of user protocols where they will either:


Students will document their experience of the protocols and compare the search systems they encountered, with a view to identifying essential and desirable attributes of resource discovery systems as used in the different contexts described above.

As part of this assignment, students will have an opportunity to assess features of either a relevance-ranking or recommender service being developed as a prototype by the California Digital Library (CDL).

The economics of scholarly publishing. “Postprint services will not significantly affect the economics of scholarly publishing. Worse, they will divert scarce university resources away from strategic tasks capable of re-shaping scholarly communication processes in general.” Discuss.

Massive digitization. A well-known company is interested in working with the UC libraries to scan 100,000 out-of-copyright books and make them available online, where they may be accessed openly by anyone with an Internet connection. The company gives the libraries only some very general guidance… that the works selected for digitization should be of public and educational interest and focus on Americana: literature, the arts, and history (people, places, events, and ideas), broadly construed. What are the key approaches the libraries should consider for selecting the 100,000 books? For each approach, identify strengths and weaknesses as well as critical obstacles that would need to be overcome. Finally, make a recommendation to the libraries about how they should proceed.

Practical exercises

The following assignments are for the more technically inclined. CDL computing resources may be made available in some circumstances.

Evaluating text mining techniques and tools. Manual cataloging will not scale to the sheer volume of content currently produced, and the value of manually cataloging a discrete collection according to a controlled vocabulary becomes diluted when that collection is aggregated with other collections. Machine-based clustering and classification tools show promise as a scalable way to improve access to large, heterogeneous collections. What are the relative strengths and weaknesses in the application of clustering and classification techniques and tools (e.g., Dave Newman's TopicSeek, Emory's MetaCombine, Marti Hearst's Nearly-Automated Metadata Hierarchy Creation) to digital library content?
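As a concrete point of departure, here is a minimal sketch of the kind of clustering such tools automate: TF-IDF term weighting plus cosine similarity, with a greedy single-pass grouping. The similarity threshold and the toy documents are illustrative assumptions, not features of any of the tools named above.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict of term -> weight) per document."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: a document joins the first existing
    cluster whose seed document it resembles, else it seeds a new one."""
    vecs = tfidf_vectors(docs)
    clusters = []                    # each cluster is a list of doc indices
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(v, vecs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Real tools replace each piece with something stronger (stemming, k-means or LDA, learned thresholds), but the trade-off the assignment asks about — cheap statistical grouping versus controlled-vocabulary precision — is already visible at this scale.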

Resource location before discovery. Federated search integrates access to pre-selected targets. In an ideal world, targets would be dynamically selected as being those most likely to return results appropriate to a user-supplied query. In this project students will investigate methods for a recommender service that selects targets based on a user-supplied query in advance of conducting a federated search. Problems to be solved include determining the nature, level, and type of information the recommendation service would need to gather from potential targets in order to make “good” and “reliable” recommendations, and, of course, how to gather that information and keep it current.
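One candidate method, sketched here under simplifying assumptions, is to treat each target as a bag of sampled terms and rank targets by a smoothed query-likelihood score, in the spirit of collection-selection algorithms such as CORI. The `mu` smoothing constant, the unseen-term floor, and the toy target statistics below are all illustrative.

```python
import math
from collections import Counter

def score_target(query_terms, target_stats, mu=100):
    """Log-likelihood of the query under a target's sampled term statistics.
    target_stats is (Counter of term frequencies, total sample size)."""
    tf, total = target_stats
    score = 0.0
    for t in query_terms:
        # smoothed probability; the 1e-6 floor keeps unseen terms nonzero
        p = (tf.get(t, 0) + mu * 1e-6) / (total + mu)
        score += math.log(p)
    return score

def recommend_targets(query, targets, k=3):
    """Rank candidate targets for a query before running federated search."""
    terms = query.lower().split()
    ranked = sorted(targets.items(),
                    key=lambda kv: score_target(terms, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The harder questions the assignment raises — how to sample each target's vocabulary in the first place (query probing? harvested metadata?) and how to keep the statistics current — sit upstream of this scoring step.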

Low-cost digital preservation? As digital preservation repositories emerge on the academic information landscape, one cannot help but wonder what will happen to their contents if we find ourselves 10 or 20 years out with vast collections of archived objects but insufficient funding to effectively render any but the most treasured ones for the then-current generation of computer hardware and software. What would we do to access the rank-and-file objects?

Cooperative harvesting of web-based content. Web harvesting is a grossly inefficient means of capturing Internet-accessible assets. It misses deep web materials and is either too costly (where conducted with excessive manual intervention) or too inclusive (where conducted less discriminately across whole web domains). What methods might be used to enable data providers that wish to do so (a small set, to be sure) to “push” their content periodically into waiting repositories? What server-side standards and practices (e.g., MOD, OAI, Google Sitemaps) might be used?
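The OAI side of the picture can be sketched concretely: a harvester issues datestamp-bounded ListRecords requests so that only records changed since the last run are transferred — selective pull as a low-cost complement to provider push. A minimal request builder follows; the base URL in the usage test is hypothetical.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for incremental harvesting.

    Datestamp bounds ("from"/"until") limit the harvest to records changed
    in that window; a resumptionToken continues a partial response and,
    per the protocol, must be the only argument besides the verb.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)
```

A scheduled job that records the datestamp of its last successful run and passes it as `from_date` on the next one gets incremental harvesting with no server-side cooperation beyond standard OAI-PMH support.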

Demographics of the .edu domain. Web harvesting techniques may provide an opportunity to capture and leverage valuable at-risk scholarly information assets that reside in the .edu domain. The promise, however, requires a better understanding of the domain’s demographics. What kinds of information assets are available there? Can they be detected automatically? Which are sufficiently valuable to justify the investment involved in their long-term preservation? And how might some of them be leveraged?
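A first pass at automatic detection would likely start from crude genre heuristics before anything trainable. The cue phrases and genre labels below are illustrative assumptions, not a validated classifier; a serious survey would need labeled training data and richer features (URL structure, link context, file type).

```python
import re

# Hypothetical cue phrases for common .edu page genres (assumptions).
GENRE_CUES = {
    "syllabus": ["syllabus", "course description", "required reading",
                 "office hours"],
    "preprint": ["abstract", "keywords", "references", "preprint"],
    "dataset":  ["dataset", "codebook", "data dictionary", "readme"],
    "homepage": ["curriculum vitae", "research interests", "publications"],
}

def classify_page(text):
    """Return the genre whose cue phrases occur most often, or None
    when no cue phrase appears at all."""
    text = text.lower()
    best, best_hits = None, 0
    for genre, cues in GENRE_CUES.items():
        hits = sum(len(re.findall(re.escape(c), text)) for c in cues)
        if hits > best_hits:
            best, best_hits = genre, hits
    return best
```

Even this toy version makes the assignment's core questions concrete: which genres can be told apart cheaply, which need real machine learning, and which are worth the effort at all.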

Recommender systems based on course syllabuses. This assignment relates logically to the demographics of the .edu domain (above), making an obvious pairing. The .edu domain is littered with course syllabuses, assignment lists, and faculty and student papers and publications. Embedded in these are numerous references to scholarly publications of all kinds (journal articles, monographs, textbooks, etc.), which consequently constitute a source of data that may usefully be employed in recommender systems that guide scholarly resource discovery. How would such a recommender system be developed? How would data be gathered and maintained? How would data be weighted, and/or what algorithms would be used to deploy it in a recommender system?
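One plausible core data structure, sketched here, is a co-citation index: works cited together on the same syllabus are presumed related, and recommendations for a work are its most frequent co-citations. The single-letter identifiers are placeholders; a real system would need the citation extraction and deduplication steps the assignment asks about.

```python
from collections import defaultdict

def build_cocitation_index(syllabi):
    """syllabi: list of lists of citation identifiers found on one page.
    Returns a nested dict counting how often each pair is cited together."""
    co = defaultdict(lambda: defaultdict(int))
    for refs in syllabi:
        uniq = sorted(set(refs))          # ignore repeat citations on a page
        for i, a in enumerate(uniq):
            for b in uniq[i + 1:]:
                co[a][b] += 1
                co[b][a] += 1
    return co

def recommend(work, co, k=3):
    """Works most often co-cited with the given work, ties broken by name."""
    neighbors = co.get(work, {})
    ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]
```

Weighting is where the design questions bite: raw counts favor ubiquitous textbooks, so a deployed system would probably discount high-frequency works (an IDF-style penalty) or weight by syllabus level and recency.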