Projects | IO Lab

Delicious Memex

Introduced September 1 – Due September 22

Technologies

Javascript, jQuery, Greasemonkey, JSON, Delicious API

Description

During the first class, students saw, and later implemented, a standalone version that allows users to load bookmarks from any Delicious account and create a new trail. For this assignment, students will create a related implementation that explores the idea of trails as a mechanism for organizing information using Delicious.

Options

Select one of the following as a starting point:
- Trail Browser: Create an interface that displays all of a Delicious user's trails and lets you navigate through them. The navigation interface could be purely textual and display metadata from Delicious. It could also display bookmarked pages themselves within an iframe or by using a screencapture utility like webkit2png or khtml2png.
- Post to Trail: Use Greasemonkey to modify the Delicious Post a Bookmark screen so that a user can add a bookmark to an existing trail or create a new one.
- Trail Maker at Delicious: Use Greasemonkey to modify the Delicious bookmarks page to allow users to create a trail from their bookmarks within the current Delicious interface.

Notes

The syntax for trails that we will be using on Delicious is to apply 'trail:[trail_name]' to an item to indicate that it is part of a trail and 'step:[number]' to indicate where in the trail the item is positioned. Each item in a trail named "Vacation in Paris," for example, has a tag 'trail:vacation_in_paris'. The first item in the trail has a tag 'step:1', the second item has the tag 'step:2', and so on.
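
The tag convention above can be sketched in Python as a grouping step: collect every bookmark carrying a 'trail:' tag and order it by its 'step:' tag. This is a minimal sketch, not the course's reference implementation. The bookmark shape (a dict with "url" and "tags" keys) and the function name build_trails are assumptions made for illustration, and the real Delicious JSON would need to be mapped into that shape first.

```python
import re
from collections import defaultdict

TRAIL_RE = re.compile(r"^trail:(.+)$")
STEP_RE = re.compile(r"^step:(\d+)$")


def build_trails(bookmarks):
    """Group bookmarks into trails, ordered by their step tag.

    `bookmarks` is a list of dicts with the hypothetical keys "url" and
    "tags" (a list of tag strings).
    """
    trails = defaultdict(list)
    for bookmark in bookmarks:
        names = []
        step = None
        for tag in bookmark["tags"]:
            trail_match = TRAIL_RE.match(tag)
            step_match = STEP_RE.match(tag)
            if trail_match:
                names.append(trail_match.group(1))
            elif step_match:
                step = int(step_match.group(1))
        if step is None:
            continue  # no step tag, so the item has no position in a trail
        for name in names:
            trails[name].append((step, bookmark["url"]))
    # Sort each trail by step number and keep only the URLs.
    return {name: [url for _, url in sorted(steps)]
            for name, steps in trails.items()}


bookmarks = [
    {"url": "http://example.com/louvre",
     "tags": ["trail:vacation_in_paris", "step:2"]},
    {"url": "http://example.com/flights",
     "tags": ["trail:vacation_in_paris", "step:1", "travel"]},
    {"url": "http://example.com/other", "tags": ["misc"]},
]
print(build_trails(bookmarks))
```

One limitation of the convention shows up here: a bookmark that belongs to two trails carries only one 'step:' tag, so this sketch reuses the same step number for both trails.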

Controlled Vocabularies

Introduced September 22 – Due October 6

Technologies

Javascript, jQuery, Greasemonkey, JSON, site-specific APIs

Description

In Cory Doctorow's "Metacrap" essay he lists seven problems with explicit metadata. Students will build or modify UI as a way to potentially address one of these problems, making it easier to use a controlled and consistent vocabulary.

Students may also run experiments or analyze existing data to document how much of a problem uncontrolled vocabularies are and how much of a difference a simple fix might make.

Options

Select one of the following as a starting point:
- People are lazy: Attempt to prove Doctorow wrong and show that ease-of-use will help this problem. Add UI onto Delicious (or some other service) that helpfully suggests tags to make it easy for you to follow the strict tagging principles you defined in 202 Assignment 6 last year or the vocabulary you designed in Assignment 3 this year. Or, investigate automatic tagging using the mSense API, the Times Topics API or some other approach.
- People are stupid: There still exist lots of "Plam Treo" listings on eBay. Add UI onto eBay (or some other service) that auto-corrects spelling mistakes. Or, build a UI that suggests similar spellings that are more popular.
- Canonical: Create a metric for the dilution experienced when several non-canonical versions of a link are saved on a service like Delicious. What would the effect be if these links were all consolidated? Or, build an extension for Delicious that automatically inserts the canonical version of a URL.
- Or: tackle another one of Doctorow's strawmen.
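
As one possible starting point for the "similar spellings that are more popular" idea, the sketch below uses Python's standard difflib module to find close spellings and then keeps only those that occur more often than the typed term. The function name suggest_spellings and the term_counts dictionary (lowercase term mapped to how often it appears in listings) are assumptions for illustration, and the counts shown are invented.

```python
import difflib


def suggest_spellings(term, term_counts, limit=3):
    """Return close spellings of `term` that are more popular than it.

    `term_counts` maps lowercase terms to a popularity count (hypothetical
    data; in practice it might come from listing titles or tag frequencies).
    """
    term = term.lower()
    own_count = term_counts.get(term, 0)
    # get_close_matches ranks candidates by string similarity.
    candidates = difflib.get_close_matches(
        term, list(term_counts), n=10, cutoff=0.6)
    better = [c for c in candidates
              if c != term and term_counts[c] > own_count]
    # Among similar spellings, prefer the most popular one.
    better.sort(key=lambda c: term_counts[c], reverse=True)
    return better[:limit]


counts = {"palm": 500, "plam": 3, "plan": 40, "treo": 450}  # invented counts
print(suggest_spellings("Plam", counts))
```

Ranking by popularity rather than by similarity alone is a design choice: it favors the spelling most other users already chose, which is what a controlled vocabulary is trying to encourage.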

Semantic Web and Microformats

Introduced October 6 – Due October 27

Technologies

Python, Google App Engine, RDF, RDFa, FOAF, XFN

Description

The semantic web promises to define content precisely and meaningfully enough that computer agents will be able to make sense of it. Some propose that RDF and SPARQL are the correct way to realize this dream, while others argue that lighter-weight microformats are more practical. Students will build tools that produce or consume either RDF triples or web-based microformats.

Options

Select one of the following as a starting point:
- Build a triple-store as described in Programming the Semantic Web on top of Google App Engine, enter some triple data, and write code to implement some interesting query (like six degrees of Kevin Bacon) as a web app. (Or modify RDFlib to use the Google App Engine datastore for persistent storage.) Describe the techniques for using Google App Engine's datastore for RDF storage, along with its advantages and disadvantages.
- Build an interface to let iSchool users easily create their own FOAF files of iSchool contacts. Or export iSchool users' data from Facebook into FOAF or an RDF store.
- Using the FOAF or XFN connections of iSchool faculty, staff and students, create an application which recommends new iSchool friends or a visualization of the existing social graph using its RDF/FOAF representation. Or, create a SPARQL query to calculate centrality or degree of various iSchoolers (ask Granovetter why this might be helpful).
- Link two or more semantic data sources together to answer some question you think is interesting. (Programming the Semantic Web has some good examples to start with, like calculating degrees of Kevin Bacon. Try using metaweb.py to access Freebase.)
- Design a set of semantic web formats to use on the iSchool website (or some other web resource that you can edit). Choose the set of RDFa attributes that should be applied, using existing namespaces (Dublin Core, etc.) wherever possible or suggest hCard, hCal, XFN or some other microformat. For example, add semantic content so that a computer can understand class schedules or the times and locations of lectures, or add Dublin Core author information to links about papers and books written by iSchool faculty. Mark up at least a few sample pages so that some other group can build a tool to consume that information.
- Build a tool to consume RDFa or microformatted content from iSchool pages (or, if more ambitious, arbitrary pages on the web -- we could help you use 80legs to crawl some substantial portion of the web) and either visualize the data (a graph of who co-wrote a paper with which other faculty member) or draw some programmatic conclusions from it (email alerts to the dean when a lecture overlaps with a career fair).
- Or, build your own tool to either add or consume semantic content, whether it's in RDF or microformat form.

Notes

Our campus O'Reilly Safari subscription gives us all unlimited access to Programming the Semantic Web, a very valuable resource for actually writing code using RDF, FOAF and other semantic web resources.
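
To make the triple-store option more concrete, here is a minimal in-memory sketch in Python. It is not the design from Programming the Semantic Web and does not use the App Engine datastore. The class name TripleStore, the "starred_in" predicate, and the placeholder actors and films are all assumptions for illustration. It stores (subject, predicate, object) triples, answers wildcard pattern queries, and runs a breadth-first search for a "degrees of separation" query.

```python
from collections import deque


class TripleStore:
    """A tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            (s, p, o)
            for (s, p, o) in self.triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)
        ]


def degrees_of_separation(store, start, goal, predicate="starred_in"):
    """Breadth-first search over actors linked by a shared film."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, distance = queue.popleft()
        if actor == goal:
            return distance
        for _, _, film in store.query(subject=actor, predicate=predicate):
            for costar, _, _ in store.query(predicate=predicate, obj=film):
                if costar not in seen:
                    seen.add(costar)
                    queue.append((costar, distance + 1))
    return None  # no chain of shared films connects the two


store = TripleStore()
# Placeholder data, not real filmographies.
store.add("alice", "starred_in", "film_1")
store.add("bob", "starred_in", "film_1")
store.add("bob", "starred_in", "film_2")
store.add("carol", "starred_in", "film_2")
print(degrees_of_separation(store, "alice", "carol"))
```

Every query here scans the full set of triples, which is fine for a demo but is exactly the cost that indexes, or a datastore-backed design, would need to address.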

Social and Distributed Classification

Introduced October 27 – Due November 10

Technologies

Python, Google App Engine, Greasemonkey, Javascript, jQuery

Description

Analyze existing uses of social classification and attempt to evaluate their usefulness.

Options

Select one of the following as a starting point:
- Wikipedia's organization of categories and disambiguation is itself a socially determined classification and may prove very valuable. Build a Greasemonkey script that uses Wikipedia disambiguation pages to suggest narrower, less ambiguous search queries. How does your Wikipedia-generated disambiguation compare to Google's or Bing's suggested searches? There is a (long) list of Wikipedia disambiguation pages that your project can draw from. You don't necessarily have to extract all the disambiguation pages from Wikipedia to demonstrate this idea.
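
One way to prototype the disambiguation idea before writing the Greasemonkey script is to work from the wikitext of a disambiguation page, where each entry is typically a list line containing a [[...]] link. The Python sketch below turns those link titles into narrower queries. The function name suggest_queries is an assumption, and the sample wikitext is a hand-written illustration rather than a copy of any actual Wikipedia page.

```python
import re

# Captures the target of the first wiki link on a list line,
# e.g. "* [[Mercury (planet)]], ..." -> "Mercury (planet)".
ENTRY_RE = re.compile(r"^\*+\s*.*?\[\[([^\]|#]+)")


def suggest_queries(wikitext):
    """Turn disambiguation entries into narrower search queries."""
    suggestions = []
    for line in wikitext.splitlines():
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue
        title = match.group(1).strip()
        # "Mercury (planet)" -> "mercury planet"
        query = " ".join(re.sub(r"[()]", " ", title).lower().split())
        if query not in suggestions:
            suggestions.append(query)
    return suggestions


sample = """'''Mercury''' may refer to:
* [[Mercury (planet)]], the closest planet to the Sun
* [[Mercury (element)]], a chemical element
* [[Mercury (mythology)|Mercury]], a Roman god
"""
print(suggest_queries(sample))
```

The same extraction logic would translate directly to Javascript for use inside a user script.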

Search and Retrieval

Introduced November 10 – Due December 10

Technologies

Python, Google App Engine, Greasemonkey, Javascript, jQuery, 80legs

Description

Boolean search, tf-idf, stemming, stop words, and natural language processing are all techniques used to improve access to information on the retrieval end. In this project students will create a tool that illustrates the effects of various approaches to search.
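
To illustrate how two of these techniques differ, the Python sketch below implements a simple boolean AND search and a tf-idf ranking over the same small set of documents. It is a minimal sketch under stated assumptions: tokenization is a plain lowercase whitespace split, no stemming or stop-word removal is applied, and the function names and sample documents are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def boolean_and_search(query, docs):
    """Return indices of documents that contain every query term."""
    terms = set(tokenize(query))
    return [i for i, doc in enumerate(docs) if terms <= set(tokenize(doc))]


def tfidf_search(query, docs):
    """Rank documents by the summed tf-idf weight of the query terms."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for i, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        if score > 0:  # terms found in every document contribute nothing
            scores.append((score, i))
    # Highest score first; ties broken by document order.
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]


docs = [
    "coffee coffee in berkeley",
    "reading about tf-idf with coffee",
    "lecture notes on stemming",
]
print(boolean_and_search("coffee berkeley", docs))
print(tfidf_search("coffee stemming", docs))
```

Boolean search returns an unordered yes/no match, while tf-idf rewards rarer terms and returns a ranking, so the same query can surface different documents under each method.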

Options

Select one of the following as a starting point:
- Tweet Search. Using a provided framework for Google App Engine or Javascript, write some search methods to compare the effectiveness of different search queries side-by-side when searching a user’s messages on Twitter. You could use a combination of simple boolean search, a stemming algorithm, tf-idf, or an NLP-based method to provide different results. Evaluate what, if any, benefit a vector algorithm has in displaying relevant results for a corpus with very short documents. Implementations of the Porter stemmer or Porter2 stemmer are available in various languages.
- Build the worst possible corpus, a modern-day Library of Babel that contains real readable documents and many, many randomized versions of real documents. Can search engines (Google, a search engine that some other group builds) find the real information amongst the noise? You could use Wikipedia (or Twitter) as a starting corpus.
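
For the Library of Babel option, one simple reading of "randomized versions" is to shuffle the words of each real document. The Python sketch below builds such a corpus. That interpretation, the function names, and the "real" flag on each entry are assumptions made for illustration; other kinds of noise (random characters, swapped sentences) would be equally valid.

```python
import random


def scramble(document, rng):
    """Return a copy of `document` with its words in random order."""
    words = document.split()
    rng.shuffle(words)
    return " ".join(words)


def build_babel_corpus(real_docs, copies_per_doc, seed=0):
    """Mix each real document with `copies_per_doc` scrambled versions."""
    rng = random.Random(seed)  # seeded so the corpus can be rebuilt exactly
    corpus = []
    for doc in real_docs:
        corpus.append({"text": doc, "real": True})
        for _ in range(copies_per_doc):
            corpus.append({"text": scramble(doc, rng), "real": False})
    rng.shuffle(corpus)  # hide the real documents among the noise
    return corpus


real_docs = ["the quick brown fox jumps", "search engines index words"]
corpus = build_babel_corpus(real_docs, copies_per_doc=3)
print(len(corpus), sum(entry["real"] for entry in corpus))
```

Because a word shuffle keeps every term count unchanged, a purely bag-of-words method such as tf-idf scores the scrambled copies the same as the original, which makes this kind of corpus a useful test of whether a search engine uses word order at all.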