INFO 202 Fall 2008 - Assignment 5

Document Types for a Day

Assigned 8 October, due 17 October

Primary TA: Jonathan Breitbart

Follow steps 1-5 below, paying particular attention to what you need to submit in step 5. Please complete this work before 9:00 AM on Friday 17 October. We will discuss it in the section on Monday 20 October.

In this assignment you will create a "diary" of the document and data model types you encounter in a single day. Its purposes are (a) to sensitize you to the number and variety of information models that span the "Document Type Spectrum," (b) to give you more practice thinking about categories, aggregation, abstraction, and ontology, and (c) to teach you a technique for naming and describing a system of ontologically-grounded categories so that others are able to use them effectively.

Before you start, take a deep breath and reflect on the advice I gave you in my feedback on Assignment 3:

In any vocabulary, there is an intricate balance between the material covered (breadth), the specificity of each descriptor (precision), and the number of descriptors available.
There is no perfect answer in this assignment. Every system of categories is biased. But that doesn't mean that every system of categories is equally good.

There is no requirement in this assignment to explicitly test the design and robustness of your document type ontology by exchanging instances with anyone else. But you are welcome to do this if you think that the extra effort this entails will improve your ontology.


1.1 On one day during the week when you come to campus, keep a "diary" of the document and data model types you encounter. The diary can be a simple list that contains the time and location where you encounter an instance and the name of the type into which you've categorized it. Use the Excel spreadsheet we have provided so that we can more easily aggregate and compare your diaries. For example, my diary would probably begin with:

Time     Location          Instance                              Document or Data Model Type
6:00 AM  breakfast table   NY Times story on Palin, p1 10/13/08  Newspaper article
6:10 AM  breakfast table   side panel on Kix cereal box          Nutrition information
6:30 AM  medicine cabinet  label on Centrum Vitamin bottle       Dosage instructions


2.1 Count the document and data model types as you go; you can stop your diary when you reach 15 unique types. (If you haven't identified 15 types in a 24-hour period, you are not paying enough attention to the information around you. And don't use any of the three examples in my table above.) You don't have to stop at 15, but stopping at 15 or shortly thereafter is the "don't get compulsive or crazy about busywork" reminder for this assignment. Depending on how you think of the types, you might see several dozen of them on the way to campus, and it would be silly to record every instance of each one if you are reading a half dozen of the same kind. So list only those you interact with enough that the notion of an identifiable "instance" makes sense.
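
If you like to keep a running tally, the counting rule in 2.1 can be sketched in a few lines of Python. This is only an illustration, not a required tool: the rows, the function name unique_types, and the TARGET constant are all made up for the example, and the spreadsheet remains the thing you submit.

```python
# Hypothetical diary rows: (time, location, instance, document type).
# None of these instances come from the assignment; they are placeholders.
diary = [
    ("8:05 AM", "bus stop", "timetable posted at the stop", "Transit schedule"),
    ("8:40 AM", "cafe", "receipt for coffee", "Sales receipt"),
    ("9:10 AM", "classroom", "printed lecture handout", "Lecture handout"),
    ("9:15 AM", "cafe", "receipt for a bagel", "Sales receipt"),
]

TARGET = 15  # the stopping point suggested in step 2.1


def unique_types(rows):
    """Return the distinct document types in first-seen order."""
    seen = []
    for _time, _location, _instance, doc_type in rows:
        if doc_type not in seen:
            seen.append(doc_type)
    return seen


types_so_far = unique_types(diary)
remaining = max(0, TARGET - len(types_so_far))
print(f"{len(types_so_far)} unique types recorded, {remaining} to go")
```

Note that the two receipts count as one type: the tally is over types, not instances.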


3.1 When your diary is complete, organize your types in a concept hierarchy (starting with "Document" at the root or top of the hierarchy as the "mother of all document types") because this will sharpen your ability to define them. There are examples of this sort of intellectual exercise in the lecture notes from September 8 with the "Bates Fundamental Forms of Information" and there were several others in the September 29 lecture when we discussed Ontology.

3.2 On reflection, you might discover that not all of the types listed in your diary are at the appropriate level of abstraction for the instances. If your types are too narrow, then the only instance that they can describe is the one you've recorded in your diary - and there's no point in doing that. To make sure that your lowest-level types are "equivalence classes" with some greater scope, you should think of at least one other document instance that fits within the type. You don't have to actually find this instance -- it can be hypothetical. Please add these hypothetical instances to your diary spreadsheet (there will be at least 15 of them), indicating "hypothetical" in the "Time" field of your spreadsheet and specifying the instance in the "Instance" field. This might cause you to change the name you've given to a type.

3.3 You must introduce at least one level of more abstract document types ("hypernyms" - some people would call these "document genres") to organize the "hyponyms" (sub-types) in your diary. Don't strive for brilliant visualization and graphical design here because a three level hierarchy is adequate (but if you can do more, that's great); what matters is that you think about the ontology of your types. You can use any notation or tool you want that allows you to represent the hierarchical relations among your types.

3.4 Whenever you introduce a new level in your document type hierarchy, try to come up with some characteristics or attributes of all of the new types or categories at that level. You aren't going to find anything as rigorous and consistent as the levels in biological classification (species -> genus -> family -> order -> class -> phylum), but you want to maintain, as much as possible, a consistent "abstraction gap" between a hypernym and its (at least 2, as I explained in paragraph 3.2) hyponyms.
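
You can use any notation you want for the hierarchy, but if a concrete representation helps you think, here is a minimal Python sketch. It assumes the hierarchy is stored as a mapping from each hypernym to its direct hyponyms; the type names other than "Document" and the two helper functions are hypothetical and only illustrate the three-level and at-least-2-hyponyms checks from 3.3 and 3.4.

```python
# Hypothetical three-level hierarchy: each key is a hypernym and each value
# lists its direct hyponyms. Lowest-level types do not appear as keys.
hierarchy = {
    "Document": ["Instruction", "Record"],
    "Instruction": ["Homework assignment", "Assembly instructions"],
    "Record": ["Sales receipt", "Bank statement"],
}


def depth(node, tree):
    """Count the levels from node down to its deepest descendant."""
    children = tree.get(node, [])
    if not children:
        return 1
    return 1 + max(depth(child, tree) for child in children)


def thin_hypernyms(tree, minimum=2):
    """List hypernyms that have fewer than `minimum` direct hyponyms."""
    return [parent for parent, children in tree.items() if len(children) < minimum]


print(depth("Document", hierarchy))  # number of levels in the hierarchy
print(thin_hypernyms(hierarchy))     # hypernyms that still need more hyponyms
```

A check like this only catches structural gaps. Whether the "abstraction gap" between levels is consistent is a judgment you still have to make yourself.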


4.1 Now that you've created a conceptual hierarchy that captures the relationships among your document and data model types, it should be straightforward to write a precise definition for each of them. Your goal is to write a definition for each of the types in your ontology that would enable an ordinary person given one of the instances to categorize it as an instance of the correct type. Write your definitions following the "Formula for Definitions" in the September 29 lecture on Ontology:

hyponym = {adjective+} hypernym {distinguishing clause+}

EXAMPLE: Homework assignment = {educational} Instruction {given by an instructor to a student to perform a task intended to assist in learning a particular subject}
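
To see the parts of the formula separately, here is a small Python sketch that assembles a definition from its pieces. The function name define and its parameters are made up for illustration, and reading the "+" in the formula as "one or more" is an assumption on my part.

```python
def define(hyponym, adjectives, hypernym, clauses):
    """Assemble a definition as: hyponym = {adjective+} hypernym {clause+}."""
    # Assumption: "+" means one or more, so empty lists are rejected.
    if not adjectives or not clauses:
        raise ValueError("expected at least one adjective and one distinguishing clause")
    adjective_part = " ".join(adjectives)
    clause_part = " ".join(clauses)
    # Doubled braces in an f-string produce literal { and } characters.
    return f"{hyponym} = {{{adjective_part}}} {hypernym} {{{clause_part}}}"


definition = define(
    "Homework assignment",
    ["educational"],
    "Instruction",
    ["given by an instructor to a student to perform a task "
     "intended to assist in learning a particular subject"],
)
print(definition)
```

Keeping the hypernym as its own argument is the useful part: it forces every definition to name the parent type from your hierarchy, so the definitions and the diagram stay in step.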

4.2 You should write these definitions for every type in your ontology, including those you introduced in part 3.3 that are intermediate between your lowest-level types and the "document" mother of all types. Make sure your hierarchy diagram is consistent with the definitions of each type; you may find yourself iterating between the diagram and the definitions because the act of defining will clarify the type distinctions.

4.3 You should review the October 1 lecture notes and readings that introduce the idea of document types as a conceptual distinction that is independent of the syntax and technology in which instances are ultimately implemented. Your ontology and definitions should focus on the rules or constraints that distinguish one type from another, not on aspects of their presentation.

4.4 In most cases colors, size measurements, and other physical or presentational properties are not fundamental characteristics of document types. Presentation is often incidental, or just a property of instances and not of types at any point in the spectrum. While it is better when presentation correlates with structure and content distinctions, the presentation rules aren't type-dependent (i.e., the presentation rule that "big and centered text is important information" is true for all document types, so it can't distinguish them). This is not to say that it isn't occasionally useful to distinguish documents as "electronic" or "printed" or otherwise classify them on the basis of format or presentation -- but these are not ontological distinctions.

4.4 Likewise, try NOT to rely on structural properties of documents to organize them into document types. Focusing on structure will suggest types like "lists" and "collections" that bring together instances that differ significantly in the meaning and purpose of their content. Many document types have conventional organizational structure (like phone books) but the essence of a phone book is not that its entries are alphabetized. Dictionaries are also alphabetized, but dictionaries and phone books aren't that similar from an ontological perspective.

4.5 Granted, it isn't always easy to separate content and structure rules from presentation rules because some kinds of documents have very conventional appearances and structure -- you can think of this as a merger of presentation and content in which part of the meaning comes from the appearance. A STOP sign can be considered a document instance and is a really good example of this merger. (Because I'm using it for teaching purposes here, you can't use it as one of the 15 document types in your diary.)

4.6 Put another way... make sure you are defining the document type using rules that identify it, and not using properties that are specific to your instance and not true of all members of the type. Having to add the hypothetical instances should help you ensure that your definition isn't relying on instance-specific properties.

4.7 Record the definition for each of your types in your diary so that we can more easily compare definitions provided by different people for the same type.

4.8 Sort your diary on the "document type" field. This should bring together the instances you observed and the hypothetical ones you imagined to ensure that your lowest-level types weren't too narrow. (If the sort doesn't do this, make sure you entered the "Document or Data Model Type" for your hypothetical instances.)
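
The effect of the sort in 4.8 can be sketched in Python as follows. The rows and the two helper functions are hypothetical; the field names follow the ones mentioned in this assignment, but your spreadsheet's own sort does the same job.

```python
from itertools import groupby
from operator import itemgetter

TYPE_FIELD = "Document or Data Model Type"

# Hypothetical diary rows. As in 3.2, imagined instances carry
# "hypothetical" in the Time field.
diary = [
    {"Time": "8:40 AM", "Instance": "receipt for coffee", TYPE_FIELD: "Sales receipt"},
    {"Time": "8:05 AM", "Instance": "timetable posted at the stop", TYPE_FIELD: "Transit schedule"},
    {"Time": "hypothetical", "Instance": "receipt for a textbook", TYPE_FIELD: "Sales receipt"},
]


def group_by_type(rows):
    """Sort rows on the type field and collect the rows under each type."""
    ordered = sorted(rows, key=itemgetter(TYPE_FIELD))
    return {
        doc_type: list(group)
        for doc_type, group in groupby(ordered, key=itemgetter(TYPE_FIELD))
    }


def missing_hypothetical(rows):
    """List the types that have no row marked 'hypothetical' in the Time field."""
    return [
        doc_type
        for doc_type, group in group_by_type(rows).items()
        if not any(row["Time"] == "hypothetical" for row in group)
    ]


print(missing_hypothetical(diary))  # types that still need a hypothetical instance
```

In this made-up diary, "Transit schedule" would be flagged because it has only the observed instance.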

5. SUBMITTING YOUR WORK (before 9:00 AM on Friday 17 October 2008)

5.1 Upload your spreadsheet and ontology diagram using the "Assignment Upload" link. Name the spreadsheet 202A5_YourName.xls and the diagram 202A5_YourName.pdf (convert the diagram to PDF so that we can be certain to be able to view it).

5.2 PLEASE FORMAT THE SPREADSHEET SO THAT YOUR DIARY TABLE PRINTS OUT ON A SINGLE PAGE IN LANDSCAPE ORIENTATION. You can format the diagram in whatever way makes it easy to read (so try to fit it on one page; whether this is landscape or portrait orientation will depend on your ontology).