23. Document Analysis [2]
DE + IA (INFO 243) - 4 April 2007
Bob Glushko
Plan for Today's Class
- Some significant syllabus surgery
- Feedback on Assignments 2-4
- Tour of the Document Type Spectrum to learn about content, structure, and presentation analysis
Revised Syllabus
- M April 9 - 5 minute project presentations; Document Analysis [3]; Document Analysis Assignment
- W April 11 - Document Component Design [1]
- M April 16 - Document Component Design [2]
- W April 18 - IN CLASS PROJECT WORKING SESSION
- M April 23 - IN CLASS PROJECT WORKING SESSION
- W April 25 - Document Model Assembly
- Sun April 29 - St Helena (optional)
- M April 30 - NO LECTURE -- work on projects!
- W May 2 - Model-Based Applications; Justsystems Demo
- M May 7 - Final Project Presentations
General Comments About Assignments
- Letter grade is from me; numerical grades and comments are generally from Anya
- Anya grades very objectively according to a checklist or answer key; I "read between the lines" to assess whether you seem to know more than you are literally saying or whether you are so brilliant or cryptic that your answer doesn't fit the answer key or checklist
- Most of you are doing consistently excellent work, and when you aren't you probably know why better than I do
- In general: make sure you answer all parts of each question or activity
Assignment by Assignment Comments
- Scavenger Hunt
- OASIS (designing new highways) vs Microformats (paving the cow paths)
- Business Patterns
- Easiest questions were about channel conflict and Bullwhip Effect; hardest one was analyzing the 2002 dock strike/lockout
- DIMENSIONS of the exit/voice pattern, not the pattern per se
- Requirements and Inventory
- On the surface, a very straightforward assignment to get you some practice using the "generic requirements" and considering them from different perspectives
The Document Type Spectrum
Systematic Variation in Document Types Across the Spectrum
- Instances more heterogeneous on narrative end
- Types are "broader" and more descriptive, less prescriptive on narrative end
- The set of content types within a document type is much greater on the transactional end because the leaves aren't "just text"
- More need for "metadata" augmentation of documents on narrative end, because on transactional end what would be metadata is more likely to be explicitly contained in the content already
- Presentational information more likely to be correlated with content and structure on narrative end
Relationships Between Text and Non-text [1]
- Another useful dimension for thinking about content considers the relationship in documents between the text and non-text information that they contain
- Text-dominated – most of the content is conveyed by text components, with non-text components unnecessary or in an incidental role (examples: legal documents, accounting information, invoice)
- Text-framework – the document reflects the organization defined by the text components, but non-text components provide content enhancements (examples: encyclopedia, maintenance manual, product catalog, purchase order)
- (multimedia) Non-text dominated or text-enhanced – most of the content is conveyed by non-text components, which provide the framework for the text; text components carry metadata, or annotate or explain intrinsically non-textual content (examples: photos, video, engineering drawing, atlas, art book)
Relationships Between Text and Non-text [2]
- The relationship between text and non-text information can vary at all points on the document type spectrum
- Narrative document type can be philosophy (all text) or anatomy (lots of non-text)
- Transactional document type can be invoice (all text) or RFQ (lots of non-text)
Dictionaries, Encyclopedias, and Reference Books
- Usually very carefully designed, with regular structure that is exploited in information access and navigation features to enhance usability
- Often have rich repertoire of content component types (pictures, maps, charts, formulas, tables)
- Mixed content in paragraphs or other text blocks will contain numerous content types
Engineering Compendium – Typical Entry
Oxford English Dictionary – Typical Entry
Procedures, Policies, Laws, and Regulations
- Usually mostly text, created and used by people
- Information that is often extremely important to companies and highly-paid professionals because the cost of finding (or not finding) information can be high
- Often has high "intrinsic hypertext" character with many explicit and implicit links between content components
- Often follow structural conventions and standards with regular numbering and naming schemes
- Versioning and configuration requirements can pose problems
- Making this type of content computable or executable is a huge R&D area (XML standards like XACML, policy engines and wizards, expert systems)
Code of Federal Regulations
Catalogs
- Many different types
- Some are extracted from ERP system or product database
- Often contain a mixture of structured and unstructured content
- Often a challenge to match the user's vocabulary and ontology for a product domain
Transaction Documents
- Printed or electronic forms
- Data-intensive, designed to capture and present small information components
- Inputs and outputs of business processes and often created and consumed by computers
- Few and somewhat arbitrary presentational characteristics
- Strongly datatyped with field length, range and value, other restrictions
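A minimal sketch of what "strongly datatyped" can mean in practice, assuming a hypothetical FieldSpec record that holds the length, range, and value restrictions for one field. The field names and limits below are invented for illustration and do not come from any real invoice standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical restrictions for one field of a transactional document."""
    name: str
    datatype: type
    max_length: Optional[int] = None     # field length limit (text fields)
    min_value: Optional[float] = None    # inclusive lower bound (numeric fields)
    max_value: Optional[float] = None    # inclusive upper bound (numeric fields)
    allowed: Optional[frozenset] = None  # enumerated values, i.e. a code set


def check(spec, value):
    """Return a list of violations; an empty list means the value is valid."""
    if not isinstance(value, spec.datatype):
        return [f"{spec.name}: expected {spec.datatype.__name__}"]
    problems = []
    if spec.max_length is not None and len(value) > spec.max_length:
        problems.append(f"{spec.name}: longer than {spec.max_length}")
    if spec.min_value is not None and value < spec.min_value:
        problems.append(f"{spec.name}: below {spec.min_value}")
    if spec.max_value is not None and value > spec.max_value:
        problems.append(f"{spec.name}: above {spec.max_value}")
    if spec.allowed is not None and value not in spec.allowed:
        problems.append(f"{spec.name}: not in code set")
    return problems


# Invented example fields
CURRENCY = FieldSpec("currency", str, max_length=3,
                     allowed=frozenset({"USD", "EUR"}))
QUANTITY = FieldSpec("quantity", int, min_value=1, max_value=999)

print(check(CURRENCY, "USD"))
print(check(QUANTITY, 0))
```

A narrative document type rarely supports this kind of check, because its leaves are mostly free text.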
Harvesting and Consolidation
- Harvesting – Create a set of candidate content components by extracting them from the information sources while removing presentation and structure
- As we identify candidate content components, we need to record the properties (or attributes or behaviors) of each one that let us understand it and distinguish it from the others
- A practical way to do this is to create, for each document or information source being analyzed, a table or spreadsheet containing the candidate components and the useful metadata
- Consolidation – Identify synonyms and homonyms among the candidate content components, assigning a unique name to each unique meaning as part of a controlled vocabulary
- How rigorously we must assign "good names" and "good definitions" depends on the size of the document inventory and the scope of the project
- Names might follow precise rules to ensure that they can be reliably stored and located in a data dictionary a la ISO 11179
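The harvesting table and the consolidation step can be sketched as follows. This is an illustration under a strong assumption: the analyst has already written definitions so that two components with the same meaning carry exactly the same definition text. The rows and the function name consolidate are hypothetical.

```python
from collections import defaultdict

# Hypothetical harvest table: (source document, label as found, definition)
harvested = [
    ("invoice", "Cust No", "identifier of the buying party"),
    ("purchase order", "Customer Number", "identifier of the buying party"),
    ("invoice", "Date", "date the invoice was issued"),
    ("purchase order", "Date", "date the order was placed"),
]


def consolidate(rows):
    """Find synonyms (many labels, one meaning) and homonyms (one label, many meanings)."""
    labels_by_meaning = defaultdict(set)
    meanings_by_label = defaultdict(set)
    for _source, label, definition in rows:
        labels_by_meaning[definition].add(label)
        meanings_by_label[label].add(definition)
    synonyms = {d: sorted(ls) for d, ls in labels_by_meaning.items() if len(ls) > 1}
    homonyms = {l: sorted(ds) for l, ds in meanings_by_label.items() if len(ds) > 1}
    return synonyms, homonyms


synonyms, homonyms = consolidate(harvested)
print(synonyms)
print(homonyms)
```

Each synonym group needs one agreed name, and each homonym needs to be split into distinct names, before the components enter the controlled vocabulary.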
Seek Semantic Clarity and Precision
- It seems obvious that we need "good names" and "good definitions" for the components we identify and design but what does that mean?
- "What's in a Name?" (http://www.vertaasis.com/articles/whats_in_a_name.htm) recommends three "levels" of models (or names) that line up nicely with our three stages of analysis, design, and encoding
- Business names – a format that lets the requirement or semantics be easily readable and verifiable by a business person (not a modeling or XML expert). This should use familiar words and be completely technology-independent
- Logical names – a format optimized for the expression of the design or model; essential that they are expressive enough to reflect the relationships between model components. Logical names might follow precise rules to ensure that they can be reliably stored and located in a data dictionary ("qualified names" specialize general terms to convey the context of use)
- Physical names – the format required by the implementation technology for the model
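One way to picture the three levels is to carry a single component through them. The sketch below is an assumption for illustration only: the dotted three-part logical form is loosely modelled on ISO 11179-style naming, and the two physical conventions (an XML element name and a SQL column name) are common choices rather than rules taken from the article. All function names are hypothetical.

```python
def logical_name(object_class, property_term, representation):
    """Build a dotted logical name from its three parts (illustrative convention)."""
    return f"{object_class}. {property_term}. {representation}"


def physical_xml_name(logical):
    """Derive an UpperCamelCase XML element name from a logical name."""
    words = logical.replace(".", " ").split()
    return "".join(w[:1].upper() + w[1:] for w in words)


def physical_sql_name(logical):
    """Derive a lower_snake_case column name from a logical name."""
    return "_".join(logical.replace(".", " ").lower().split())


business = "Invoice issue date"                     # readable by a business person
logical = logical_name("Invoice", "Issue", "Date")  # design-level name
print(business)
print(logical)
print(physical_xml_name(logical))
print(physical_sql_name(logical))
```

The same logical name yields different physical names for different technologies, which is the reason for keeping the levels separate.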
Defining What Something Means
- Definitions
- Definitions in a controlled vocabulary
- Data types
- Metadata
- Metamodels
- Formal assertions
- Ontologies and thesauri
The Simplest Information Component Model
- The simplest or minimal information component model is a glossary – a list of the words used to describe or name the "things of significance" and what they mean
- This simple data model is augmented as attributes or characteristics of the significant things are identified and recorded
- The model is further developed as relationships or associations or links between the "significant things" are identified and recorded
What Metadata to Record About Candidate Components
- What attributes about each type of content might we record in our analysis?
- Names/synonyms/homonyms (what it is called)
- Definition (what it "means")
- Identifiers
- Cardinality/Optionality (occurrence rules)
- Restricted values, code sets, defaults
- Data Type (text, numbers, date, video)
- Relationships/Associations (participation in structures)
- Origin (Is this new information, or from some other source? Who maintains it?)
- Access (who is allowed to view/change/copy/etc. it)
- Permanence (is it static or dynamic? how often does it change?)
- Business processes in which it participates
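The attributes listed above map naturally onto one record per candidate component. The sketch below is a hypothetical layout, not a prescribed schema: the class name, field names, defaults, and example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CandidateComponent:
    """One row of the analysis table for a harvested content component."""
    name: str                                                 # what it is called
    definition: str                                           # what it "means"
    synonyms: List[str] = field(default_factory=list)
    identifier: Optional[str] = None
    min_occurs: int = 1                                       # 0 means optional
    max_occurs: Optional[int] = 1                             # None means unbounded
    allowed_values: List[str] = field(default_factory=list)   # code set
    default: Optional[str] = None
    data_type: str = "text"
    related_to: List[str] = field(default_factory=list)       # associations
    origin: str = ""                                          # source and maintainer
    access: str = ""                                          # who may view or change it
    permanence: str = ""                                      # static or dynamic
    processes: List[str] = field(default_factory=list)        # business processes

    @property
    def optional(self):
        return self.min_occurs == 0


# Invented example entry
country = CandidateComponent(
    name="Ship-to country",
    definition="country to which the goods are delivered",
    synonyms=["Destination country"],
    allowed_values=["US", "CA", "MX"],
    data_type="code",
    related_to=["Ship-to address"],
    processes=["order fulfillment"],
)
print(country.optional, country.allowed_values)
```

A spreadsheet with one column per field serves the same purpose; the record form simply makes the occurrence rules and value restrictions explicit.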
Readings for 9 April
- Lisa De Larios-Heiman and Carolyn Cracraft. "Overview and The Sylvia Data Model. The Syllabus Viewing Application."
- Allison Bloodworth and Robert Glushko. "Model-driven Application Design for a Campus Calendar Network" (Sections 1-3.2.3.2.2). XML 2004 Conference
First Project Presentation
- 5 minutes per team
- Original vs revised scope, and why...
- What you've learned about the context and requirements from documents, interviews, or other "anthropology" or "archeology"
- Any modeling artifacts to show us or describe?
- Any interesting/unexpected techno-political challenges?
- What are your key activities this and next week?