ASSIGNMENT 5: WEB INTERMEDIARY
Goal

The purpose of this exercise is to give you some practice in designing an information system.  You will need to use texts and ideas from several different weeks of the course.  The system you will design has multiple pieces.  You do not need to implement any of these pieces, but you need to describe them in enough detail so that we can see how they might be implemented.  Below we present the design exercise and each of the pieces we are looking for.

Problem statement

We would like you to design a web-based interface that can assist one in planning a day of garage sale shopping.  Garage sales -- oftentimes known as "tag sales" on the East coast -- are generally sales of secondhand artifacts held at private residences.  To plan a day of garage sale shopping requires that one look in the newspaper to determine where and when the sales are and then plot on a map of locations each of the sales of interest.  Since enough of the information to do this is now on the web, it should be possible to plan a day of shopping by just looking the details up on the web and then printing out the relevant information.  However, as things now stand, this is still a bit awkward because one needs to look up the garage sales in an online newspaper, find the addresses, and type each of them into a mapfinder to figure out how to get from place-to-place.  Therefore, we want you to design an architecture and an interface that would allow one to open a webpage, type in a date, a time, and an address and receive, as a webpage, a map (or set of maps) and a list of garage sales.  Each of the sales should be plotted on one or more maps and the list of sales should be in the order that one might visit them (i.e., they should be grouped and sequenced according to how close they are geographically).

Relevant readings

(1) This type of information system is called a web intermediary, or WBI (pronounced, "webbie").  You can read about WBI's in the Maglio and Barrett article we read for September 27th.  Note also that the WBI technology described there has come from a long-running research project at IBM Almaden.  You can read more about it here: http://www.almaden.ibm.com/cs/wbi/.  We will neither be using the IBM technologies (e.g., their Java-based WBI Development Kit) nor their notations and formalisms, but it is useful to have a look at these readings so you get a clear idea of the kind of technology you are being asked to design.

(2) To pass the data about garage sales and maps between parts of the WBI you will be designing, you will need to also design two XML DTDs.  Information about XML and DTDs can be found in the readings from Novcember 15th.  Also, there is plenty on this at the w3.org site.  Even though we will not ask you to design an XML Schema, information that might be useful to this exercise can be found in examples used in the W3's primer on XML Schemas: http://www.w3.org/TR/xmlschema-0/

(3) To think about how the garage sales might be extracted from an online newspaper, you will need to think about some of the issues we covered concerning information extraction during our November 6th meeting.

(4) To group and order the garage sales appropriately, you will need to roughly outline a sorting or clustering algorithm.  Remember that we discussed clustering several times in weeks 3 - 7.  You are not being asked to design a sophisticated clustering algorithm, but you might find it useful to remember that there are different ways of clustering data given a distance metric.  In this case the distance metric is quite straightforward: it is mostly geographic distance measured in meters or miles (issues of time are also possibly of concern as detailed below).

(5) Finally, you will need to think about how you want the WBI's output to look.  For this, recall that we have discussed information architecture and interface design several times this term.

Web resources

To do this exercise you will need a source of garage sale listings and a mapfinder.  I suggest you use the following:

(1) Garage sale listings can be found here: http://www.sfgate.com/cgi-bin/jusma/garage_sales/main?daysoffset=0.  You will probably find it easiest to get the data you need if you concentrate on garage sales for San Francisco (since they are more numerous).

(2) Maps of particular locations and distances between locations can be retrived from this site: http://maps.lycos.com/

You can use other resources if you want to, but don't use web resources that do significantly more work for you than these two sites do.  For instance, if you know a garage-sales-planner.com already exists, we won't be satisifed if you simply hand in the URL to the site as your answer to the assignment.

Step-by-Step instructions

(0) Preliminaries: Keep in mind that you are designing a system that could be implemented as a computer program that might access the web page for the garage sales, download the garage sales, extract the relevant information, then repeatedly access the mapfinder web page to find the sales' addresses and their relative distances from one another, and, finally, take all of the results, write some HTML and output the results as a web page. Moreover, it would make sense if this program was written as a CGI script so that it could accept a date, time and address from a user via forms on a web page. Alternatively, we could use IBM's WBI toolkit to build something like this, or a variety of other tools.  The important issue is that this system would be designed as a program that uses the web to get input, output results, and fetch necessary resources off the net (e.g., maps and sales).  You can think of this sort of program -- a web intermediary -- as a specialized web crawler and search engine since it incorporates some of the same functionalities, but in a limited and specialized form.  If you've never written a script to automatically download pages from the web, it might be interesting for you to try the following on a Unix command line (e.g., on irony.sims.berkeley.edu):

irony% telnet www.sfgate.com 80
GET / HTTP/1.0

Type the first line, wait a bit for a response -- you have connected to the web server at the SF Gate -- then type the second line, hit return, wait a bit, and then hit return again.  You should receive, in response, the entire homepage for the SF Gate, in text format, of course.  Note that these two lines access the web and download a page without use of a web browser.  Or, rather, these two lines are sort of the simplest web browser imaginable.  It is lines like these that are incorporated into the kind of program you are designing for this exercise.

(1) Examine the garage sale listings: Go to the SF Gate site and download at least twenty garage sale listings for a given day.  You might have to look at the sales for a Saturday or a Sunday to get twenty of them.  Now, go through the listings an determine the important elements and attributes.  For example, each listing should have an address or, if not exactly and address, then at least some indication of its location.  Thus, obviously, something like address or location will be an eleemnt or an attribute in your XML DTD.

(2) Define an XML DTD for a garage sale listing: Now that you've figured out the important pieces of the sales, define them as an XML DTD including definitions for one or more entities, elements, and attributes.

(3) Use the DTD to mark up the garage sale listings: Return to the list of garage sales that you downloaded and use the DTD you defined to mark up five of the garage sale listings.

(4) Comment on the implied information extraction process: Review the Ralph Grishman paper on information extraction, and the lecture on information extraction.  Think about the kinds of decisions you had to make in order to mark up the garage sale listings.  For instance, not all of the locations given for the sales are in the same format.  Therefore, you need to make a series of decisions to figure out what refers to a street, versus a region of the city, versus someone's name, etc.  Comment on each of the six different levels of analysis that Grishman names in his paper on information extraction (shown in Figure 2 of the paper): lexical analysis, name recognition, partial syntactic analysis, scenario matching, coreference analysis, and inference (your "template" is basically your XML DTD for the sales).  What kinds of decisions did you have to make, at each of these levels, in order to mark up the garage sale listings using your DTD?  Would any of the steps you made in so doing be difficult to implement in a computer program?

(5) Examine the maps: Go to the Lycos map site and type in all of the addresses for five of the garage sales you have downloaded in step (1).  Also, type in the ten address pairs to see how you can use the site to compute distances between the different garage sales.  Examine the HTML, the page source, for the web pages that contain the maps.  Note that any program that uses this site will need to extract the maps and the distances out of the HTML.  For step (7) below assume that you have two functions that do this: (a) a function, extract-map, that given the HTML from a webpage returned from maps.lycos.com returns the gif file containing the image of the map; and, (b) a function, extract-distance, that given the HTML from a webpage returned from maps.lycos.com, returns the distance between the two addresses input.

(6) Define an XML DTD for a map: Include elements and/or attributes to tag location(s) and distance.

(7) Define a sorting procedure and describe it in pseudo code: Include in your sorting procedure multiple fetches from the maps.lycos.com site to compute the distances between addresses.  The input to your sorting procedure should be (1) an address; (2) a time; and, (3) a set of XML tagged garage sales.  The output should be an ordered list of XML tagged garage sales (i.e., the elements of the the input set).  The output list of garage sales should be in an order such that it would be  sense to start at the the address input and visit each of the garage sales in order.  The address input might not be the address of a garage sale per se.  Rather, the address input might simply be one's starting point.  Imagine that there are three clusters of garage sales corresponding to garage sales in three different neighborhoods in town.  It would be sensible to visit the nearest neighborhood first.  However, if the sales in the first neighbood close at a time later than sales in a neighborhood further away, then it might make sense to visit the first neighborhood later in the day.  Also, note that clustering the sales into neighborhoods is not enough.  Rather, your sorting procedure should also order the sales within neighborhoods too so that it would make sense to walk or drive between them in the order output by the sorting procedure.  So, the two criteria that are being used to sort the garage sales are (1) distance; and, (2) time of day.  Specify your sorting procedure in high-level pseudo code; i.e., if something with many small but straightforward details to it needs to be done (e.g., extract-map) then simply make up a function name for that subprocedure and refer to it by name in your pseudo code, rather than spelling out multiple nitty gritty simple details.  Also note that -- despite the fact that I have named this a "sorting" procedure -- this procedure might not have a single best output for any given input.  For example, some of the garage sales I have seen don't have addresses or have vague addresses.  In such a case it might be best to put them at the end or the beginning of the output list, or maybe just weave them into the list in a way that might or might not be correct (which is something one can only determine once one has the correct addresses for all of the sales).  One can also imagine cases in which the addresses are misspelled, or in which two addresses are of equidistance from a third address, etc.  Dont' worry too much about these exceptions, just outline a procedure that would work most of the time.

(8) Lay out your map(s) and list of garage sales:  Use the maps and distances you collected for step (5) to create a web page in which your five tagged garage sales are listed and ordered according the criteria you chose to answer (7).  The easiest layout might be simply one map per garage sale and a list of garage sales and maps organized according to geographical distance.  However, it might, for instance, be much easier to read if the first map was preceded by one or more contextualing, locator maps that -- in one map -- describes where all of the garage sales are.  Or, perhaps the maps can be tiled together in some way so that all of maps are overlayed, one on top of the other.  For this step of the exercise, you should create a webpage that incorporates your ideas about how to walk or drive from one garage sale to the next.  The webpage you create for the 5 garage sales you tagged should be something you could print out with you and take with you during your day of garage sale shopping.  Originally, I conceived of this assignment as an assignment for a possible Palm Pilot application (rather than as a Web application).  Thus, some of the  places I have found ideas for laying out the maps and ordering the garage sales have been in Palm applications.  If you want some ideas about this, check out the demo for the Vindigo Palm application at www.vindigo.com.  Of course, to "borrow"  layout and interface ideas from a Palm app you will need to translate from the Palm interface look to something that can be rendered in HTML or the web in general (easier to do that than to go in the other direction, i.e., from web-based interface to a small format interface like the one of the Palm Pilot

(9) Describe your layout and explain its rationale: Write a couple of paragraphs describing why you layed out the maps and ordered the garage sales as you did for step (8).
 

Deliverables

When you hand in your assignment you will want to give us the following pieces of work:

(a) the XML DTD you defined for a garage sale listing for step (2) above.

(b) Five garage sale notices tagged using the tags of  the XML DTD.  Also, 15 other garage sale listings that you looked at, but didn't tag.

(c) Comments on an information extraction process as written for step (4) above.

(d) An XML DTD for maps as design for step (6) above.

(e) The pseudo code and comments for the sorting procedure you devised for step (7)

(f) One or more web pages (or mock web pages rendered with a graphics program like Photoshop, etc.) that contain the ordered garage sales list (containing the 5 garage sales you tagged) and the maps that show how to get from place to place; i.e., the work you did for step (8) above.

(g) And, finally, the description and rationale you wrote for step (9).
 

Due Date

This assignment is due on November 29th.