22. Document Analysis [1]
DE + IA (INFO 243) - 2 April 2007
Bob Glushko
Plan for Today's Class
- Introducing Document Analysis
- Presentation, Structure, and Content
The Document Engineering Approach
Document Analysis: From Physical to Conceptual Models
- When we analyze information sources (interviews, documents, sets of data, whatever), our goal is to identify and describe the "significant things" or the "information components" and their characteristics or attributes
- But when you analyze documents the information components aren't as immediately apparent because they are contained in structures and rendered in some presentation
- So we have to remove the presentational information and disassemble the structural information to find the content information that is our highest priority
- As we take away presentation and structure, we are abstracting away or generalizing from a physical implementation and creating our first conceptual or logical model of the information components
Three Types of Information In Documents
- We need a vocabulary to classify different kinds of information that we find in documents and sets of data
- Content – "what does it mean" information
- Structure – "where is it" or "how it is organized or assembled" information
- Presentation – "how does it look" or "how is it displayed" information
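As a minimal sketch of this vocabulary, the three kinds of information can be recorded separately for one document fragment. The `Fragment` class, the price example, and every field name here are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    content: dict       # "what does it mean"
    structure: dict     # "where is it" / "how is it organized or assembled"
    presentation: dict  # "how does it look" / "how is it displayed"


# Hypothetical example: a price printed in bold in the third cell of a table row.
price = Fragment(
    content={"type": "Price", "value": "19.95", "currency": "USD"},
    structure={"container": "table-row", "position": 3},
    presentation={"font-weight": "bold", "align": "right"},
)


def conceptual_view(fragment):
    """Drop presentation and structure, keeping only the content information."""
    return dict(fragment.content)


print(conceptual_view(price))
```

Keeping the three dictionaries apart is the point of the sketch: the content survives unchanged if the table becomes a list or the bold becomes a color.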
Content Components
- We can identify components as the separate units of content to be organized -- "pure content" with no structure or presentation assigned or implied
Document Engineering: Optimizing "Content + Structure + Presentation"
- The "Document Engineering Methodology" can be thought of as:
- Distinguishing the three kinds of information in instances or artifacts
- Carefully describing their current and desired relationships
- Creating conceptual models that describe the content information as it is and as it could be
- Using principles of "good design" and patterns to refine the conceptual model
- Reassembling or recombining the three kinds of information to achieve the desired relationships in the "instances" or "artifacts," beginning with the conceptual model and then adding structure (creating document schemas) and then adding presentation (with transforms or stylesheets)
Document Engineering and Information Architecture
- This formulation of the Document Engineering approach is essentially equivalent to how Information Architecture is defined:
- Information Architecture = (((content + information structure) + navigation structure) + presentation structure) + presentation design
The Most Important Principle for Information Architecture
- We say "the document is about … the photograph is about … the movie is about"
- We're expressing a distinction between information as conceptual or as content, and the physical container or medium, format, or technology in which the information is conveyed
- It is very useful to think abstractly about "information content" without making any assumptions or statements about the "presentation" or "rendition" or "implementation"
- Separating content from its structure and presentation is the most important principle of Information Architecture
Presentation Information
- Human-oriented attributes for visual (or other sensory) differentiation (type font, type size, color, background, indentation, pitch, ...)
- In general, presentation information is the least important information you find in documents, but:
- Good information architecture and user interface design correlates this with structural or content information
- You might have a requirement to preserve it or make it more consistent
Presentation Fidelity and Integrity
- Presentation Fidelity is a requirement to preserve the original presentation, often exactly
- For example, with International Letters of Credit and Bills of Lading you can readily imagine a bank or customs inspector carefully comparing computer-generated and original printed documents.
- More common is the requirement to replace ad hoc, inconsistent or incomplete presentation components with rule-governed presentation
- Presentation Integrity is a requirement to assemble the document model in "document order" – that is, to organize the elements so that their valid order matches the order in which they are meant to appear in a document instance
Extracting Presentation Rules
- Presentation affects structure and content by applying transformation rules to them
- To understand the structure and content we must identify and record what the rules of the transformation were
- Explicit transform rules can be encoded in templates, stylesheets or source code
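A minimal sketch of the idea, assuming a tiny rule table that stands in for a stylesheet. The role names, sizes, and function names are hypothetical. Applying the rules produces presentation; inverting them recovers the structural role, and the inversion fails when the formatting follows no rule.

```python
# Hypothetical rule set: structural role -> presentation attributes.
STYLE_RULES = {
    "title": {"size": 24, "bold": True},
    "heading": {"size": 18, "bold": True},
    "body": {"size": 12, "bold": False},
}


def render(role, text):
    """Apply the presentation rule for a role to a piece of content."""
    return {"text": text, **STYLE_RULES[role]}


def recover_role(rendered):
    """Invert the rules: return the role whose presentation matches, if any."""
    observed = {"size": rendered.get("size"), "bold": rendered.get("bold")}
    for role, style in STYLE_RULES.items():
        if style == observed:
            return role
    return None  # ad hoc formatting: no recorded rule explains it


styled = render("heading", "Structural Information")
print(recover_role(styled))
print(recover_role({"text": "odd one", "size": 13, "bold": True}))
```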
But Sometimes Rules Can't be Extracted
- No access to source formats or source code
- Rules may be inaccessible in source formats ("override" formatting in word processors instead of style tags)
- Rules don't exist or are inconsistently followed (author has "fontitis" with "ransom note" presentation style)
Correlations or Conventions with Presentation Information
- Color, pitch, other perceptual dimensions can be correlated with semantic distinctions
- Type size is usually correlated with the structural hierarchy
- Content types can have characteristic layouts or text attributes
- Adjacency can suggest a semantic relationship, like that between figure and caption
- Presentation order is sometimes semantically significant
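When no explicit rules are available, the type-size convention can be used as a heuristic. The sketch below assumes text runs arrive as (text, font size) pairs and treats the largest size as the top of the hierarchy. The input format and the function name are assumptions, and the correlation is only a convention, so the result is a guess to be checked by an analyst.

```python
def infer_levels(runs):
    """Map each distinct font size to a hierarchy level (largest size = level 1)."""
    sizes = sorted({size for _, size in runs}, reverse=True)
    level_of = {size: index + 1 for index, size in enumerate(sizes)}
    return [(text, level_of[size]) for text, size in runs]


# Hypothetical runs extracted from a rendered page.
runs = [
    ("Document Analysis", 24),
    ("Structural Information", 18),
    ("Often hierarchical", 12),
    ("Content Components", 18),
]

for text, level in infer_levels(runs):
    print("  " * (level - 1) + text)
```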
Binding Structure to Presentation - Alternatives
Gestalt Principles -- Reinforcing Structure with Presentation
Presentation View of a Lecture Slide
Structural Information
- Physical piece of a document or user interface (e.g. table, section, header, footer, panel, window)
- Embodies the rules on how content components fit together, often hierarchical
- Often driven by context of document use
- Most applications and web sites are organized with a small set of structures:
- Lists/hierarchies
- Networks/links
Structure is Independent of Content
Structural Integrity
- A requirement to preserve some aspects of structure, but not necessarily any presentation:
- Identical page boundaries for the electronic and printed versions of documents, especially when document revisions are highly localized (as in "looseleaf" publications with their placeholder pages that say "this page intentionally left blank")
- Chronological order for a narrative biography or history
- "Putting it together" instructions (don't want to say "assembly" here) for a bicycle or piece of furniture need to follow the order in which they are most easily or safely put together.
Analyzing Structural Components
- The structural components can provide the hierarchical "skeleton" or "scaffold" into which the content components are arranged
- Presentational Structures provide a framework for presentation -- table, section, title, header, footer
- Semantic Structures are logical groups of conceptually-related components - parts of an Address, Phone number
- Structural components are often identified by the names attached to pieces of information – think of the outline or table of contents or lists of various kinds
- Metadata to capture
- Depth of hierarchy
- Sub-structures included within a structural container
- Rules for applying numbers or names to content in the hierarchy
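Two of these metadata items, depth of hierarchy and the sub-structures found inside each container, can be computed mechanically once a document is available as XML. This is a sketch using Python's standard `xml.etree.ElementTree`; the sample document and its tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<report>
  <section>
    <title>Overview</title>
    <section>
      <title>Scope</title>
      <para>Some text</para>
    </section>
  </section>
  <section>
    <title>Findings</title>
  </section>
</report>
"""


def depth(element):
    """Depth of the hierarchy rooted at element (a leaf has depth 1)."""
    return 1 + max((depth(child) for child in element), default=0)


def substructures(element):
    """Map each container tag to the set of tags found directly inside it."""
    found = {}
    for node in element.iter():
        children = {child.tag for child in node}
        if children:
            found.setdefault(node.tag, set()).update(children)
    return found


root = ET.fromstring(SAMPLE.strip())
print(depth(root))
print(substructures(root))
```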
Structural Relationships Among Components Expressed as a Hierarchy
Entry Points
- Similar to list structures are "entry points" -- structures that are "wrapped around" some set of content components to provide an organized way to access them
- Most familiar examples are tables of contents and topical indexes; these are created from the names or other descriptive metadata for each component (which might first be extracted by processing the component content)
- An entry point can be created as a static structure at design time, but preferably would be dynamically generated at run time
- There are many similar examples of entry point structures generated from the names or descriptors of content components (Tables or Lists of content of type "X")
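A dynamically generated entry point can be sketched as a function that builds a table of contents from component metadata at run time. The (level, title) input format and the dotted numbering rule are assumptions made for this example.

```python
def table_of_contents(components):
    """Build numbered, indented TOC lines from (level, title) metadata."""
    counters = []
    lines = []
    for level, title in components:
        # Drop counters for deeper levels, then make sure this level has one.
        del counters[level:]
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1
        number = ".".join(str(n) for n in counters)
        lines.append("  " * (level - 1) + number + " " + title)
    return lines


# Metadata taken from slide titles in this lecture.
components = [
    (1, "Presentation Information"),
    (2, "Presentation Fidelity and Integrity"),
    (2, "Extracting Presentation Rules"),
    (1, "Structural Information"),
]

print("\n".join(table_of_contents(components)))
```

Because the list is rebuilt from the component names each time, adding or renaming a component never leaves the entry point out of date, which is the advantage of run-time generation over a static structure.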
Table of Contents as "Entry Point"
Topical Index as "Entry Point"
Structural Relationships Among Components Expressed as a Network
Links
- Links are relationships between components that can express content as well as structural information
- A link is represented in a logical model by its:
- Anchors -- the point, region, or span within the components to which it refers
- Type -- the semantics that the link relationship represents; not always explicit
- Directionality -- is the link one or two-way? Is the relationship meaningful in both directions? Does the reverse direction link mean the inverse?
- Cardinality -- 1 to 1? 1 to many?
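The four properties of a link can be sketched as a small logical model. The class names, field names, and the example link are hypothetical; a real link model may need richer anchors and link types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Anchor:
    component_id: str
    start: Optional[int] = None  # None means the whole component
    end: Optional[int] = None


@dataclass(frozen=True)
class Link:
    source: Anchor
    targets: tuple                   # one or more Anchor objects
    link_type: Optional[str] = None  # semantics; None when not explicit
    bidirectional: bool = False      # is the relationship meaningful both ways?

    @property
    def cardinality(self):
        return "1:1" if len(self.targets) == 1 else "1:many"


# Hypothetical example: characters 10-25 of a paragraph refer to a figure.
figure_ref = Link(
    source=Anchor("para-7", start=10, end=25),
    targets=(Anchor("fig-2"),),
    link_type="illustrated-by",
)
print(figure_ref.cardinality)
```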
Navigation Structures
- Navigation structures support finding or moving between components
- Forward or back in some structural organization
- Forward or back in a temporal organization (history list) or according to other relationships associated with the content components
Structural View of a Lecture Slide
Content Components
- Content components are the "nouns" in our documents or sets of data – things like "topic," "summary," "name," "address," "price"
- In publications a lot of the content isn't easily identified by "component type" – it may be "just text" that could be playing any of a very large number of roles in the document
- And sometimes you get no help from the set of style or formatting tags in word processors or in HTML, which are very format or structure oriented and not content oriented at all
- We need XML so we can invent the vocabulary of tags needed to describe component content in a specific document type
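The difference can be sketched with two encodings of the same information, the instructor and date of this lecture. The `lecture`, `instructor`, and `date` tags are an invented vocabulary used only for illustration.

```python
import xml.etree.ElementTree as ET

# Format-oriented markup: the tags say how the text looks, not what it is.
format_markup = "<p><b>Bob Glushko</b><br/><i>2 April 2007</i></p>"

# Content-oriented markup using an invented vocabulary.
content_markup = (
    "<lecture>"
    "<instructor>Bob Glushko</instructor>"
    "<date>2007-04-02</date>"
    "</lecture>"
)


def find_by_meaning(xml_text, tag):
    """Return the text of the first child element named tag, or None."""
    element = ET.fromstring(xml_text).find(tag)
    return element.text if element is not None else None


print(find_by_meaning(content_markup, "instructor"))
print(find_by_meaning(format_markup, "instructor"))
```

With the format-oriented tags, a program can only ask for "the bold text"; with the content-oriented tags it can ask for the instructor directly.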
Identifying Content Components
- Easier in Transactional-type documents:
- Documents designed to convey explicit content
- Strong data typing with metadata for field length, range and value, other restrictions.
- Few and somewhat arbitrary presentational characteristics
- Information about content components in:
- Physical implementation models (schemas)
- Source code of any relevant applications that process documents
Relationships Among Content Components
- Content components can be related to one another
- Derivational relationships
- Referential relationships
"Mixed Content"
- Narrative documents can hide or obscure candidate components in paragraphs or other blocks of text
- Document analysts refer to these as "Mixed Content" components because they are mixed into surrounding text that may be more generic or untyped
- A common form of mixed content is an otherwise unstructured text paragraph that contains emphasized words, glossary terms, references to tables or figures, citations to supporting documents, or links to footnotes or endnotes
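A sketch of how mixed content looks to a program, using Python's standard `xml.etree.ElementTree`. The paragraph and its tag names (`term`, `figref`, `cite`) are hypothetical. In ElementTree the untyped text before the first child is the parent's `text`, and the untyped text after each child is that child's `tail`.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph with typed components embedded in untyped text.
PARA = (
    "<para>Presentation <term>fidelity</term> is discussed in "
    '<figref target="fig3">Figure 3</figref> and in <cite>DE 12</cite>.</para>'
)


def typed_components(xml_text):
    """Return (tag, text) for each typed component embedded in the paragraph."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


def untyped_text(xml_text):
    """Return the generic text pieces that surround the typed components."""
    root = ET.fromstring(xml_text)
    pieces = [root.text or ""]
    for child in root:
        pieces.append(child.tail or "")
    return pieces


print(typed_components(PARA))
print(untyped_text(PARA))
```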
Analyzing Content Components
- What attributes about each type of content should we record in our analysis?
- Names/synonyms/homonyms (what it is called)
- Definition (what it "means")
- Roles (what it does)
- Cardinality/Optionality (occurrence rules)
- Restricted values, code sets, defaults
- Data Type (text, numbers, date, video)
- Relationships/Associations
- Origin (Is this new information, or from some other source? Who maintains it?)
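One way to keep these attributes together during analysis is a record per component type. The sketch below is an assumption about how such a record might look; the field names follow the list above, and the `CurrencyCode` example and its values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ComponentRecord:
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    role: str = ""
    min_occurs: int = 1                    # 0 means the component is optional
    max_occurs: Optional[int] = 1          # None means unbounded
    allowed_values: Optional[list] = None  # restricted values or code set
    default: Optional[str] = None
    data_type: str = "text"
    related_to: list = field(default_factory=list)
    origin: str = ""

    @property
    def optional(self):
        return self.min_occurs == 0

    def accepts(self, value):
        """True if value is allowed by the code set (or no code set exists)."""
        return self.allowed_values is None or value in self.allowed_values


# Hypothetical record for a currency code found in a price list.
currency = ComponentRecord(
    name="CurrencyCode",
    definition="Currency in which a price is stated",
    synonyms=["Currency"],
    min_occurs=0,
    allowed_values=["USD", "EUR"],
    default="USD",
    related_to=["Price"],
    origin="hypothetical external code list",
)
print(currency.optional, currency.accepts("EUR"), currency.accepts("XYZ"))
```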
Content View of a Lecture Slide
Readings for 4 April
- Read DE 12 Again!
- "The Domain of Domains" Robert Schmidt. Extreme Markup Languages 2002
- "When 'It Doesn't Matter' Means 'It Matters'" B. Tommie Usdin. Extreme Markup Languages 2002