25. Document Component Design
DE + IA (INFO 243) - 11 April 2007
Bob Glushko
Plan for Today's Class
- Some Leftover Harvesting Topics
- Transforming Presentation to Content
- Code Sets
- Consolidation
- Aggregate Components
Transforming Presentation to Content
- Deconstructing tables into their content types is an instance of the more general goal of transforming presentation to content
- Other presentation components and conventions that carry semantic information should be made explicit as content components
- The mere existence or non-existence of values within the cells of a table can have semantic significance.
- Color coding: Red text or box around text -> warning
- Adjacency: figure and caption -> illustration aggregate
Analyzing "Possible Values"
- It is critical to capture any rules governing the possible values for a component
- Sometimes possible values are conventional, fixed, and span the entire semantic range for some domain (days of week, AM/PM)
- Determine who can control the value sets (internal [Manufacturer part #s] vs external [Bar codes])
- Patterns like regular expressions are often useful but not sufficient for validation
- And if the set of possible values is just historical and not well motivated, fix it in your component design
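A minimal sketch of why a pattern alone is not sufficient for validation. The YYYY-MM-DD format and the function names are assumptions chosen for illustration, not part of the lecture: the pattern accepts any value that looks like a date, so a second, semantic check is needed to reject dates that do not exist.

```python
import re
from datetime import date

# Hypothetical lexical rule for a date component: YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_pattern(value: str) -> bool:
    """Lexical check only: does the value look like a date?"""
    return DATE_PATTERN.fullmatch(value) is not None


def is_valid_date(value: str) -> bool:
    """Lexical check plus a semantic check that the date really exists."""
    if not matches_pattern(value):
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True
```

With these definitions, "2007-02-30" satisfies the pattern but fails the semantic check, while "2007-04-11" passes both.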
Code Sets
- Codes are constrained sets of values
- Codes establish their meaning by reference to those values, often by abbreviations
- Using codes in vocabularies and metadata promotes consistency and makes meaning unambiguous
- You especially want to avoid doing a partial enumeration in a domain where a standard set of enumerated values already exists
- Most organizations have internal code sets or business rules that implicitly define them
External and Internal Codes
- External codes are those maintained by some entity or organization outside of your control (ISO, ANSI, etc.)
- Internal codes are code sets that you can define and control
How This All Relates to Content Models in Vocabularies
- EXAMPLE: "country code" or "currency code" are "Fregean" and can be reduced to context-free enumerations, but "country" or "money" can't because they're "Wittgensteinian"
- Put very simply: The meaning of a tag can rarely be defined in terms of its legal values
- This doesn't mean that we can't use money as a component in an information model, but it warns us that we can be more precise if we pretend that money can be understood as "currency code" and an "amount"
- And whenever a "code set" exists in the world, make sure you capture it in your semantic description
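One way to sketch the "money as currency code and amount" idea in code. The class and function names are hypothetical, and the three codes shown are only an illustrative excerpt of an external code set (ISO 4217); a real model would reference the complete standard set rather than a partial enumeration like this one.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class CurrencyCode(Enum):
    """Illustrative excerpt only; the full set is maintained externally."""
    USD = "USD"
    EUR = "EUR"
    JPY = "JPY"


@dataclass(frozen=True)
class Money:
    """Money modeled as an amount plus a code from an enumerated set."""
    amount: Decimal
    currency: CurrencyCode


def parse_money(amount: str, code: str) -> Money:
    """Build a Money value; a code outside the set raises ValueError."""
    return Money(Decimal(amount), CurrencyCode(code))
```

The enumeration makes the legal values of the code explicit, while the meaning of "money" itself is left to the surrounding model.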
Consolidating The Harvest
- We can begin our consolidation with the candidate components from any of the information sources, but we recommend using the one you believe is the most authoritative or that yielded the most components
- The goal is to combine components that are synonyms (different names for the same [or highly similar] meaning) and to distinguish any homonyms (same names for different meanings)
Consolidation -- Seeking Semantic Clarity and Precision
- The component names we harvest from information sources might not be consistent; resolving synonyms and homonyms is almost always necessary
- How rigorous we need to be in naming (and re-naming) components depends on the size of the inventory and the scope of the project
Guidelines for Minimizing Synonymy
- Components that are similar but not identical in semantics often pose the most problems because they encourage multiple inconsistent ways to tag the same content
- This is not only not a good thing, it is a very bad thing
- Synonymous components often arise in harvests from information sources from different authors, organizations, and perspectives on the domain
- Are the differences between the proposed components substantive (that you can explain using the metadata in your harvest table) or stylistic (based on writing or encoding style)?
- Are the differences "real" but "unimportant" to users or applications? (spurious precision)
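A rough sketch of how synonym and homonym candidates could be flagged in a harvest table. The rows, sources, and meanings below are invented for illustration, and the sketch assumes an analyst has already normalized each meaning to a comparable string; deciding whether two meanings are really "the same" remains a judgment call.

```python
from collections import defaultdict

# Hypothetical harvest rows: (source, harvested name, normalized meaning).
HARVEST = [
    ("Calendar A", "Start Time", "when the event begins"),
    ("Calendar B", "Begins", "when the event begins"),
    ("Calendar A", "Title", "name of the event"),
    ("Calendar B", "Title", "honorific of the speaker"),
]


def find_synonyms(rows):
    """Different names that share one meaning (candidates to combine)."""
    names_by_meaning = defaultdict(set)
    for _source, name, meaning in rows:
        names_by_meaning[meaning].add(name)
    return {m: sorted(n) for m, n in names_by_meaning.items() if len(n) > 1}


def find_homonyms(rows):
    """One name used for different meanings (candidates to distinguish)."""
    meanings_by_name = defaultdict(set)
    for _source, name, meaning in rows:
        meanings_by_name[name].add(meaning)
    return {n: sorted(m) for n, m in meanings_by_name.items() if len(m) > 1}
```

Here "Start Time" and "Begins" surface as synonyms to combine under one name, and "Title" surfaces as a homonym that needs two distinct component names.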
Example Consolidation Table -- Event Calendars
After Consolidation, Then What?
- We have now reached the point where we have captured the business rules and content components of the domain / document inventory in which we're working
- We have separated the Presentational, Structural and Content Components
- We have developed a conceptual model of our consolidated and essential "atomic" content components -- semantic equivalence classes
- We will have some sense of the distribution of the content components and can distinguish those that are "core" -- that appear in all or almost all contexts in the domain -- from those that are more context-dependent
- Now we have to ensure that we can reuse these components when we assemble document models from them
Why Analysis Models Aren't Good Enough
- Document artifacts differ a great deal in how they combine content, structure, and presentation components
- Some combinations are idiosyncratic and ad hoc or represent compromises between incompatible requirements that make structures less than optimal
- If we are completely constrained by the artifacts as they exist in our component model, we will preserve both their good and bad aspects – which may be influenced by factors which are not part of our new requirements
- So our analysis models of components and aggregates may need to be revised to allow alternative ways of satisfying our requirements that relax the (implicit) constraint to preserve the original artifacts
Design and Re-design with Conceptual Models
- The component model may present many attractive options for re-design and reuse of our content components
- Design means changing our model, not simply improving the way we view it. This is when we actually get to apply our insights about reuse and patterns
- During design we can devise more consistent component names, remove repeating or reoccurring content and structure, increase reuse of standard patterns or components, replace implicit components with explicit ones, and otherwise create a more abstract, concise, and context-free representation of the essential characteristics
- Once design begins, we cannot guarantee that we will be able to recreate the original artifacts from the model
- Unless, of course, we have presentational or structural fidelity or integrity requirements
Analogy: The Build-to-Order Computer Factory
- Designing a factory that makes "build-to-order" computers:
- You might start with some collection of computers and take them apart to see what pieces are needed to assemble them (ANALYSIS)
- Because you want to be able to make these items with reasonable quality but at less cost and at greater efficiency, you redesign the computers to use standard components (DESIGN FOR REUSE)
- You organize the components and the assembly lines to make it easy to locate components when you get an order (ORGANIZE FOR REUSE)
The Concept Factory in Document Engineering
- Designing a conceptual model of some domain:
- You might start with a set of hand-crafted applications with printed or online data entry forms and take them apart to see what pieces of information each of them needs (ANALYSIS)
- Because you want the complete "enterprise model" for the domain to be able to represent any application or form with reasonable quality but at less cost and at greater efficiency, you redesign the pieces of information from analysis to be more standard and context-free (DESIGN FOR REUSE)
- You organize the components to make it easy to locate the components when you build the specific contextualized model for an application or form (ORGANIZE FOR REUSE)
Generalizing and Specializing Components
- You can add a context qualifier to specialize a component rather than defining a completely new one (this reuses the "base" component type)
- In the Engineering Compendium both Figures and Tables have Captions, and the Caption is similar enough in both to allow it to be re-used by both.
- This suggests components for FigureCaption, TableCaption
- You can also remove or "factor out" context to define a more general or abstract component that can be used more broadly
- "Delivery Date" and "Ship Date" suggest a "Date" component
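The two directions, factoring context out and adding it back as a qualifier, can be sketched as follows. The class names and the name-building rule are assumptions made for illustration, not a prescribed naming convention.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class Date:
    """General, context-free base component."""
    value: datetime.date


@dataclass(frozen=True)
class QualifiedDate:
    """The base Date component reused with a context qualifier."""
    qualifier: str  # for example "Delivery" or "Ship"
    base: Date

    @property
    def name(self) -> str:
        """Component name formed from the qualifier and the base type."""
        return f"{self.qualifier}Date"
```

Both "Delivery Date" and "Ship Date" then share one definition of Date and differ only in their qualifier.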
The Contextualization Continuum
- Your set of components has to find a balance between precision and generality (or flexibility)
- You want a set of components that can be reused across related document types in some context (or group of related contexts)
- Contexts fit into a continuum:
- At one end is an extremely specific context in which component definitions are suitable for a very narrow class of documents (example: a component model that describes the existing course catalog for a single department or school)
- At the other end is an extremely loose or general context, suitable for a very broad class of documents (example: a component model that describes a catalog as a set of items of any type)
Components: A Reminder
- Components – the units of content
- Any piece of information that has a unique label or identifier is a candidate component
- Any piece of information that is self-contained and comprehensible on its own is a candidate component
- A component is a logical unit, with no presentation implied; it may be organized structurally
- These definitions are very helpful for finding (aggregate) components in some types of documents but less so in others
- It depends on the presence of, and relationships with, the structural and presentational information
Motivating Aggregate Components
- Atomic components that hold individual pieces of information
- Especially in transactional documents, where atomic components have a natural representation as primitive data types ("string," "Boolean," "date") or as datatypes that are derived from these by restriction
- Document components that assemble smaller components into the set of information needed to carry out a self-contained purposeful activity
- Especially in transactional contexts, where documents have a natural correspondence to some unit of work that initiates, records, or responds to a clearly-defined event
Motivating Aggregate Components [2]
- Aggregate components are composed of atomic ones and are reused in the assembly of document components
- They are easier to identify in transactional contexts because they are often the key information that flows from one document to another
- "Address" or "Person" are obvious examples of aggregates composed of smaller ones
- Two key questions:
- How do we select and group atomic components into aggregates?
- How many aggregates should we create?
Identifying Two Kinds of Component Aggregates
- Structural Aggregates -- sets of components defined by parent-child or containment relationships
- One way to do this is by putting all the unique components on index cards and then sorting them into clumps or clusters. First sort all the components that go together because of containment or structural rules (X contains Y and Z).
- Conceptual Aggregates -- sets of components that "go together" because of logical dependency
- After you've identified all the structural aggregates, you can start to further cluster those intermediate clusters on the basis of dependency rules - what things "go together" logically
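The first sorting pass, by containment, can be sketched in a few lines. The containment rules listed here are invented examples; the second pass, clustering by logical dependency, depends on analyst judgment and is not shown.

```python
from collections import defaultdict

# Hypothetical containment rules harvested from documents: (parent, child).
CONTAINS = [
    ("Address", "Street"),
    ("Address", "City"),
    ("Address", "PostalCode"),
    ("Person", "Name"),
    ("Person", "Address"),
]


def structural_aggregates(pairs):
    """Group children under the parent that contains them."""
    children = defaultdict(list)
    for parent, child in pairs:
        children[parent].append(child)
    return dict(children)
```

Each key in the result is a candidate structural aggregate, and a child such as "Address" that is also a key shows one aggregate nested inside another.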
Identifying Aggregate Components in Non-Transactional Documents [1]
- Aggregates are more elusive on the narrative end of the Document Type Spectrum (DTS) because there are limits to the rigor with which components can be grouped
- "Mixed content" models arise when there are few or weak constraints on where atomic components can appear
- Presentation often masks the atomic components in potential aggregates
- Structures are often based more on conventions for organization and presentation than on semantic relationships
- But there will still generally be components that "go together" to form reusable structures
- And "going together" means different things for each set of components
Identifying Aggregate Components in Non-Transactional Documents [2]
- Aggregates can be created in two "bottom-up" ways that focus on the atomic components:
- The first is by rebuilding or making explicit the structures that we took apart in document analysis
- The second is by creating structure in "blobs" of poorly structured information written in an overly narrative style (with mixed content at best)
- A more modular style for the information will increase its regularity and reusability; it will eliminate content that has little value to users and reinforce its use as "boilerplate" or via links
Extracting Repetitions
- Different aggregates might have the same components
- "Contract" and "Shipment" might both have "Start Date," "End Date" and "Duration"
- The repeated components can be extracted and created as a reusable aggregate
- In this example we might call the common pattern the "Period"
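A sketch of the extracted "Period" pattern reused by both aggregates. The identifier fields (contract_id, tracking_number) are hypothetical, and deriving Duration from the two dates instead of storing it is a design assumption of this sketch, made so the three values cannot disagree.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Period:
    """Reusable aggregate extracted from the repeated date components."""
    start_date: date
    end_date: date

    def __post_init__(self):
        if self.end_date < self.start_date:
            raise ValueError("end_date precedes start_date")

    @property
    def duration(self) -> timedelta:
        """Duration computed from the start and end dates."""
        return self.end_date - self.start_date


@dataclass(frozen=True)
class Contract:
    contract_id: str  # hypothetical identifying field
    period: Period


@dataclass(frozen=True)
class Shipment:
    tracking_number: str  # hypothetical identifying field
    period: Period
```

Contract and Shipment no longer define their own date fields; each simply contains a Period.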
Reuse of Existing or External Patterns [1]
- Many of the patterns that you might identify as repetitions would have also been identified in a previous analysis of your domain or context
- You should determine if their analysis yielded components for you to reuse
- Pay particular attention to "standards" if they come from credible sources
- Allison Bloodworth discussed her analysis of iCAL ("Internet Calendaring and Scheduling Core Object Specification") and SKICal ("Structured Knowledge Initiative Calendar") in the report on the event modeling
Reuse of Existing or External Patterns [2]
- But don't accept someone else's analysis and models if you don't understand them
- And NEVER assume that a component model is appropriate solely on the basis of its name because...
- UBL has invested three years in creating conceptual component models so that people can confidently reuse the XML schemas based on them
For 16 April
- Read Sections 3.2.4-end of "Model-driven Application Design for a Campus Calendar Network" (Bloodworth and Glushko).
- Database Normalization (Gilfillan) [Don't bother if you already understand database normalization]
- Creating and Maintaining Large Families of Related Schemas (Coates) [Don't bother if you aren't immediately interested in implementing document models in XML schemas]