Plan for Today's Class

Analysis tour of the Document Type Spectrum
More on content, structure, and presentation analysis
The Harvest Table

Relationships Between Text and Non-text [1]

Another useful dimension for thinking about content considers the relationship in documents between the text and non-text information that they contain
- Text-dominated
- Text-framework
- Non-text dominated or text-enhanced

Relationships Between Text and Non-text [2]

The relationship between text and non-text information can vary at all points on the document type spectrum
- Narrative document type can be philosophy (all text) or anatomy (lots of non-text)
- Transactional document type can be invoice (all text) or RFQ (lots of non-text)

Dictionaries, Encyclopedias, and Reference Books

Usually very carefully designed, with regular structure that is exploited in information access and navigation features to enhance usability
Often have rich repertoire of content component types (pictures, maps, charts, formulas, tables)
Mixed content in paragraphs or other text blocks will contain numerous content types that are implicit hypertext

Engineering Compendium – Typical Entry

Encyclopedia Britannica Entry

Encyclopedia Americana Index

Oxford English Dictionary – Typical Entry

Procedures, Policies, Laws, and Regulations

Usually mostly text, created and used by people
Information that is often extremely important to companies and highly-paid professionals because the cost of finding (or not finding) information can be high
Often has high "intrinsic hypertext" character with many explicit and implicit links between content components
Often follow structural conventions and standards with regular numbering and naming schemes
Versioning and configuration requirements can pose problems
Making this type of content computable or executable is a huge R&D area (XML standards like XACML, policy engines and wizards, expert systems)

Code of Federal Regulations

Catalogs

Many different types
Some are extracted from an ERP system or a product database
Often contain a mixture of structured and unstructured content
Vocabulary and ontology variation makes it challenging to aggregate or align catalogs

Industrial Parts Catalog

Transaction Documents

Printed or electronic forms
Data-intensive, designed to capture and present small information components
Inputs and outputs of business processes and often created and consumed by computers
Few and somewhat arbitrary presentational characteristics
Strongly datatyped, with restrictions on field length, range, permitted values, and the like
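
The kinds of restrictions listed above can be sketched as a small field validator. This is only an illustration, not a real schema language; the field names (Quantity, StateCode), the limits, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical constraints on one field of a transaction document."""
    name: str
    datatype: type                        # e.g. int or str
    max_length: Optional[int] = None      # field length restriction
    min_value: Optional[int] = None       # range restriction (lower bound)
    max_value: Optional[int] = None       # range restriction (upper bound)
    allowed: Optional[frozenset] = None   # value restriction (code set)


def violations(rule: FieldRule, raw: str) -> list[str]:
    """Return the constraint violations found in one raw field value."""
    problems = []
    if rule.max_length is not None and len(raw) > rule.max_length:
        problems.append(f"{rule.name}: longer than {rule.max_length} characters")
    try:
        value = rule.datatype(raw)
    except ValueError:
        # The value cannot be typed, so range and code-set checks are skipped.
        return problems + [f"{rule.name}: not a valid {rule.datatype.__name__}"]
    if rule.min_value is not None and value < rule.min_value:
        problems.append(f"{rule.name}: below minimum {rule.min_value}")
    if rule.max_value is not None and value > rule.max_value:
        problems.append(f"{rule.name}: above maximum {rule.max_value}")
    if rule.allowed is not None and value not in rule.allowed:
        problems.append(f"{rule.name}: not in the allowed code set")
    return problems


# Illustrative rules only; a real form would define its own.
quantity = FieldRule("Quantity", int, max_length=4, min_value=1, max_value=9999)
state = FieldRule("StateCode", str, max_length=2, allowed=frozenset({"CA", "NY"}))

print(violations(quantity, "12"))
print(violations(state, "ZZ"))
```

Returning a list of problems rather than raising on the first one suits forms, where a user or upstream system usually wants every error reported at once.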

Document Type Prescriptiveness

The prescriptiveness of a document type and the homogeneity of its instances reflect the number and strength of the constraints on content and structure that you identify in your document analysis
Sometimes the document type is defined with weak constraints and is merely descriptive, and thus the instances are heterogeneous in content and structure
But is this heterogeneity an inherent property of the document type, or just the way it has been (implicitly) defined? Could the type be more prescriptive?

Document Type Prescriptiveness in "Modeling SylViA"

"Even within the very limited scope of recent SIMS syllabi, we found a great variety of document types.
Our syllabi ranged from fairly transactional forms, with tables of class titles, readings, and assignments...
...to more narrative documents with long descriptions of topics and discussion questions and a notable lack of specific dates and assignments."

Presentations that Mask Content Components

A form may ask you to enter your address this way

Address:
        Line 1: _________________
        Line 2: _________________
        City: ____________  State: ________  ZipCode: _________

But "line 1" and "line 2" are presentation labels that are not useful for any purpose other than printing out an address label
They are not candidate content components
They are masking content components like "number," "street," etc.
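
To see why "Line 1" and "Line 2" are presentation rather than content, compare a model built on content components with the label lines it can produce. The component names below (number, street, unit, and so on) and the sample address are illustrative assumptions, not a standard address model.

```python
from dataclasses import dataclass


@dataclass
class Address:
    """Content components of an address (hypothetical names)."""
    number: str
    street: str
    unit: str        # may be empty
    city: str
    state: str
    zip_code: str

    def label_lines(self) -> list[str]:
        """Derive the presentational 'Line 1' / 'Line 2' form for printing."""
        lines = [f"{self.number} {self.street}"]
        if self.unit:
            lines.append(self.unit)
        lines.append(f"{self.city}, {self.state} {self.zip_code}")
        return lines


# Made-up sample data.
addr = Address("123", "Main Street", "Suite 4", "Springfield", "CA", "00000")
for line in addr.label_lines():
    print(line)
```

The label lines can always be generated from the components, but the components cannot be reliably recovered from free-text lines, which is the reason to model the components.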

Generated or Derived Components

"Table of Contents," "Permuted Index," and lists of figures, tables, or other types of components can usually be generated or derived from other components and are not components in their own right
Similarly, if "ExtendedPrice" is "Quantity" x "UnitPrice", we might want only the latter two components in our model, since collecting "ExtendedPrice" separately could lead to data integrity problems
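
A minimal sketch of the "ExtendedPrice" case: the model stores only the two source components and computes the derived one on demand. The class name LineItem and the sample values are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    quantity: int
    unit_price: Decimal

    @property
    def extended_price(self) -> Decimal:
        # Derived on demand, so it can never disagree with its inputs.
        return self.quantity * self.unit_price


item = LineItem(quantity=3, unit_price=Decimal("19.99"))
print(item.extended_price)

item.quantity = 4             # the derived value follows automatically
print(item.extended_price)
```

Had the extended price been stored as a third field, changing the quantity would have left a stale value behind, which is the data integrity problem in question.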

Analyzing Structural Components

Structural components are often identified by the names attached to pieces of information – think of the outline or table of contents or lists of various kinds
Your analysis goal is to capture the rules for applying numbers or names to content in the hierarchy
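
One such rule, for dotted section numbers of the kind used in standards and regulations, can be sketched as follows: a component's depth is the count of its number parts, and its parent is the number with the last part removed. The function names are hypothetical, and real numbering schemes often need more cases (appendices, lettered subsections).

```python
from typing import Optional


def parse_number(number: str) -> tuple[int, ...]:
    """Split a dotted section number such as '5.4.2' into integer parts."""
    return tuple(int(part) for part in number.strip(".").split("."))


def depth(number: str) -> int:
    """Depth in the hierarchy is the count of number parts."""
    return len(parse_number(number))


def parent(number: str) -> Optional[str]:
    """The parent's number is this number with its last part removed."""
    parts = parse_number(number)
    if len(parts) == 1:
        return None
    return ".".join(str(p) for p in parts[:-1])


def orphans(numbers: list[str]) -> list[str]:
    """Numbers whose parent is missing from the list (rule violations)."""
    known = {parse_number(n) for n in numbers}
    result = []
    for n in numbers:
        p = parent(n)
        if p is not None and parse_number(p) not in known:
            result.append(n)
    return result


# Made-up outline: "5.2.1" has no parent "5.2" in the list.
outline = ["5", "5.1", "5.1.1", "5.2.1"]
print(orphans(outline))
```

Stating the rule as code also gives a cheap consistency check for numbering errors introduced when documents are retyped or converted.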

False Content Hierarchy

Levels in a structural hierarchy can suggest distinctions between types of content that aren't real

Many documents, especially reference and legal document types, are very hierarchical. For example, MIL-STD 1472D looks like this:

5.4.2.2. Continuous adjustment rotary controls
        5.4.2.2.1 Knobs
                5.4.2.2.1.1 Use.  Knobs should be used when...
                5.4.2.2.1.2 Dimensions, torque, and separation.  ...
                5.4.2.2.1.3 Knob style.  Unless otherwise specified...
        5.4.2.2.2 Ganged control knobs
                5.4.2.2.2.1 Application. Ganged knob assemblies...
                5.4.2.2.2.2 Dimensions and separation.  ...
                5.4.2.2.2.3 Resistance. ...
                5.4.2.2.2.4 Marking. An indexing mark or pointer...

Relationships Among Content Components

Content components can be related to one another
- Derivational relationships
  - A graduate student is a specialized kind of student
  - A student is a specialized kind of person
- Referential relationships
  - There is a very large set of possible referential relationships between components
  - The relationship is sometimes signaled with some presentational or structural component
  - The type of (or reason for) the relationship is less likely to be explicit
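
The derivational examples above map naturally onto class inheritance, while a referential relationship is simply a pointer from one component to another whose reason is often left unstated. A minimal sketch; the attribute names (student_id, advisor) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str


@dataclass
class Student(Person):             # a student is a specialized kind of person
    student_id: str


@dataclass
class GraduateStudent(Student):    # a graduate student is a specialized kind of student
    # Referential relationship: points at another component. Nothing in the
    # reference itself records *why* the two are related; only the attribute
    # name hints at it.
    advisor: Optional[Person] = None


grad = GraduateStudent("Ana", "S-001", advisor=Person("Dr. Lee"))
print(isinstance(grad, Student), isinstance(grad, Person))
print(grad.advisor.name)
```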

The Complexity of Link Types [1]

The Complexity of Link Types [2]

Law review articles contain an incredibly rich set of links for cross-referencing and footnoting (including the use of Latin to signal semantic types and direction). Can you locate:
- An "ordinary" footnote from the article text to an associated footnote
- A footnote that cites another document
- A footnote that cites another part of the same article
- A footnote that cites another footnote
- A footnote that cites another footnote that contains the citation to another document's footnote?
Why are there so many different kinds of footnotes? Why are they so indirect?
What are the consequences of this variety for transforming printed law review articles into electronic versions?

Harvesting Components

As we identify candidate content components, we need to record the properties (or attributes or behaviors) of each one that let us understand it and distinguish it from the others
A practical way to do this is to create, for each document or information source being analyzed, a table or spreadsheet containing each candidate component and the metadata useful in understanding and distinguishing it
A component won't always have a name, so if you must invent one, it is helpful to start a dictionary of the words that names contain
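
A name dictionary of this kind can be as simple as a count of the words used across the invented names, which makes inconsistent vocabulary easy to spot. A minimal sketch, assuming names are written in CamelCase; the function name and the sample names are examples only.

```python
import re
from collections import Counter


def name_words(name: str) -> list[str]:
    """Split a CamelCase component name like 'UnitPrice' into its words."""
    # Capitalized words, lowercase runs, acronyms, and digit runs.
    return re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])|\d+", name)


names = ["UnitPrice", "ExtendedPrice", "ZipCode"]
vocabulary = Counter(word for name in names for word in name_words(name))

for word, count in sorted(vocabulary.items()):
    print(word, count)
```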

Harvest Table - Syllabus

Harvest Table - Schedule of Classes

What Metadata to Record About Candidate Components

What attributes of each type of content might we record in our analysis?
- Names/synonyms/homonyms (what it is called)
- Definition (what it "means")
- Identifiers
- Cardinality/Optionality (occurrence rules)
- Restricted values, code sets, defaults
- Data Type (text, numbers, date, video)
- Relationships/Associations (participation in structures)
- Origin (Is this new information, or from some other source? Who maintains it?)
- Access (who is allowed to view/change/copy/etc. it)
- Permanence (is it static or dynamic? how often does it change?)
- Business processes in which it participates
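
One way to keep these attributes consistent across harvest tables is to give every row the same fields. The sketch below covers only a subset of the attributes listed above and writes rows as CSV text for a spreadsheet; the field names and the sample row are illustrative assumptions, not taken from the course's harvest tables.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class HarvestRow:
    """One candidate component and a subset of its metadata."""
    name: str
    definition: str
    cardinality: str           # occurrence rule, e.g. "1", "0..1", "1..*"
    data_type: str
    source_document: str       # where the candidate was harvested
    synonyms: str = ""
    restricted_values: str = ""


def to_csv(rows: list[HarvestRow]) -> str:
    """Serialize harvest rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(HarvestRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Made-up sample row.
rows = [
    HarvestRow("MeetingTime", "When the class meets", "1", "text",
               "syllabus", synonyms="Class Time"),
]
print(to_csv(rows))
```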

21. Document Analysis [3]

DE + IA (IS 243) - 10 April 2006

Plan for Today's Class

Relationships Between Text and Non-text [1]

Home Blueprint

Relationships Between Text and Non-text [2]

Dictionaries, Encyclopedias, and Reference Books

Engineering Compendium – Typical Entry

Encyclopedia Britannica Entry

Encyclopedia Americana Index

Oxford English Dictionary – Typical Entry

Procedures, Policies, Laws, and Regulations

Code of Federal Regulations

Catalogs

Industrial Parts Catalog

Transaction Documents

Tax Form

Document Type Prescriptiveness

Recipe [1]

Recipe [2]

Document Type Prescriptiveness in "Modeling SylViA"

Presentations that Mask Content Components

Generated or Derived Components

Analyzing Structural Components

False Content Hierarchy

Relationships Among Content Components

The Complexity of Link Types [1]

The Complexity of Link Types [2]

Harvesting Components

Harvest Table - Syllabus

Harvest Table - Schedule of Classes

What Metadata to Record About Candidate Components

Readings for 12 April