Namespaces, Trees and Infosets, Base URIs, and Modular Documents
R. Alexander Milowski
milowski at sims.berkeley.edu
#1
Overview
Topics:
Namespace processing.
Trees, Parsing, and Infosets
Base URIs and XML Base
Modular documents via XInclude
References will be provided at the end.
#2
Namespace Declarations have Scope
A namespace declaration's scope is the element where it occurs.
There is no difference between declarations on the root element and elsewhere.
The element, its attributes, and its children may use that prefix in their names.
You can re-define a prefix to point to a different namespace.
#3
How it Works - An Example
The simplest thing is to default the namespace:
<order xmlns="urn:publicid:IDN+cde.berkeley.edu:example:order:en"> <from>me</from> <to>you</to> <items> <item>A clue</item> <item>A better clue!</item> </items> <memo> <p>I need this order now!</p> </memo> </order>
#4
How it Works - An Example - Part 2
But we can use a prefix and get the exact same names for the prefixed elements (the unprefixed 'p' now has no namespace):
<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en"> <o:from>me</o:from> <o:to>you</o:to> <o:items> <o:item>A clue</o:item> <o:item>A better clue!</o:item> </o:items> <o:memo> <p>I need this order now!</p> </o:memo> </o:order>
#5
How it Works - An Example - Part 3
If we change that prefix binding later on, we'll get different names:
<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en"> <o:from>me</o:from> <o:to>you</o:to> <o:items xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:items:en"> <o:item>A clue</o:item> <o:item>A better clue!</o:item> </o:items> <o:memo> <p>I need this order now!</p> </o:memo> </o:order>
#6
How it Works - An Example - Part 4
We can also mix in a default if we wish:
<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en"> <o:from>me</o:from> <o:to>you</o:to> <o:items xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:items:en"> <o:item>A clue</o:item> <o:item>A better clue!</o:item> </o:items> <o:memo xmlns="http://www.w3.org/1999/xhtml"> <p>I need this order now!</p> </o:memo> </o:order>
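A minimal sketch of how a namespace-aware parser resolves these names, assuming Python's standard xml.etree.ElementTree (which reports each name as '{namespace}local'). The document is an abbreviated copy of the example above; the helper split_name is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en">
  <o:items xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:items:en">
    <o:item>A clue</o:item>
  </o:items>
  <o:memo xmlns="http://www.w3.org/1999/xhtml">
    <p>I need this order now!</p>
  </o:memo>
</o:order>"""

def split_name(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag

# Map each local name to the namespace it resolved to.
names = {}
for element in ET.fromstring(doc).iter():
    namespace, local = split_name(element.tag)
    names[local] = namespace

for local, namespace in names.items():
    print(f"{local}: {namespace}")
```

The same prefix 'o' yields two different namespaces, and 'p' picks up the default declared on 'o:memo'.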
#7
Syntax vs. Resolved Names
This example is a bit pathological!
All the 'include' elements below have the same name of {http://www.w3.org/2001/XInclude}include:
<top xmlns="http://www.w3.org/2001/XInclude" xmlns:parent="http://www.w3.org/2001/XInclude"> <!-- This one declares its own prefix 'xi' --> <xi:include href="mydoc.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/> <!-- This one declares its own prefix 'xmlinclude' --> <xmlinclude:include href="mydoc.xml" xmlns:xmlinclude="http://www.w3.org/2001/XInclude"/> <!-- This one declares the default namespace itself --> <include href="mydoc.xml" xmlns="http://www.w3.org/2001/XInclude"/> <!-- This one uses the default declared on 'top' --> <include href="mydoc.xml"/> <!-- This one uses the prefix 'parent' declared on 'top' --> <parent:include href="mydoc.xml"/> </top>
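To check the claim, here is a small sketch, assuming Python's standard xml.etree.ElementTree as the parser. It parses a well-formed copy of the example (comments dropped) and collects the resolved names of the five children.

```python
import xml.etree.ElementTree as ET

XI = "http://www.w3.org/2001/XInclude"

doc = f"""<top xmlns="{XI}" xmlns:parent="{XI}">
  <xi:include href="mydoc.xml" xmlns:xi="{XI}"/>
  <xmlinclude:include href="mydoc.xml" xmlns:xmlinclude="{XI}"/>
  <include href="mydoc.xml" xmlns="{XI}"/>
  <include href="mydoc.xml"/>
  <parent:include href="mydoc.xml"/>
</top>"""

root = ET.fromstring(doc)

# A set collapses duplicates, so five spellings of one name leave one entry.
include_names = {child.tag for child in root}
print(len(root), include_names)
```

The prefixes are gone after parsing; only the namespace name and local name remain.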
#8
Syntax vs. Resolved Names
There is an infinite number of ways to encode the same element.
For your sanity, don't associate a namespace with more than one prefix or default unless you absolutely must do so!
...but sometimes that's what a program outputs (e.g. XSLT might do this).
#9
What is XML to an Application?
Syntax? Passing "<tag>" and "</tag>"?
Names and Values? Passing start "tag" and end "tag"?
Whitespace? Text? Is there a difference?
How about comments, processing instructions, attributes, etc.?
Is it an API? (SAX - Simple API for XML)
#10
Information Sets (infosets)
We really don't want syntax--that's too much.
But a lot of the rest we do want (e.g. the application needs to decide when whitespace is ignorable and not text).
Standard APIs are great but insufficient to standardize across processing environments or programming languages.
Answer: An XML Processor conveys an Information Set to the application:
Figure 1. Figure
#11
The XML Infoset
From the standard:
"Its purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document [XML].
It does not attempt to be exhaustive; the primary criterion for inclusion of an information item or property has been that of expected usefulness in future specifications. Nor does it constitute a minimum set of information that must be returned by an XML processor."
An infoset is a hierarchy (or tree) of items with named properties.
Each property can have a simple value (e.g. a name, text, URI, etc.), other items as its value, or lists of either simple values or items.
#12
An Example Infoset
Figure 2. Figure
<?xml version="1.0"?> <doc><title>My Document</title> <body><p>something</p> </body><citations/></doc>
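A rough sketch of the tree an application sees for this document, assuming Python's standard xml.etree.ElementTree. The outline function is hypothetical and deliberately lossy: it lists element items and non-whitespace character data only, whereas a real infoset also carries the whitespace characters, the document item, and many more properties.

```python
import xml.etree.ElementTree as ET

doc = ('<?xml version="1.0"?> <doc><title>My Document</title> '
       '<body><p>something</p> </body><citations/></doc>')

def outline(element, depth=0):
    """Return indented lines for the element and text items under element."""
    pad = "  " * depth
    lines = [f"{pad}element: {element.tag}"]
    # Leading text of the element, skipped when it is whitespace only.
    if element.text and element.text.strip():
        lines.append(f"{pad}  characters: {element.text.strip()!r}")
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```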
#13
Document Info Item
[children] - An ordered list of child info items. There must be exactly one element info item amongst them. There are other constraints too...
[document element] - The element corresponding to the root of the element tree.
[base URI] - The base URI of this document.
[character encoding scheme] - The name of the Unicode character encoding scheme used.
[standalone] - The standalone value from the XML declaration.
[version] - The version value from the XML declaration.
#14
Element Info Item
[local name] - The local part of the element's name.
[children] - An ordered list of the children (elements, characters, comments, or processing instructions) info items.
[attributes] - An unordered set of attribute info items.
[in-scope namespaces] - The set of namespace mappings that are in-scope at this element's position.
[base URI] - The base URI of the entity from which this element was parsed.
[parent] - The parent element or document info item.
#15
Namespace Info Item
In-scope namespace bindings are encoded as Namespace Info Items.
They have two properties:
[prefix] - the prefix whose binding this info item describes.
[namespace name] - the namespace name bound to this prefix.
These are not the namespace declarations.
Serialization of XML Infosets with namespaces is tricky.
#16
In-Scope Namespaces
Every namespace binding that is "in-scope" is listed in the element's [in-scope namespaces] property.
Again, this is not just those declared.
You have to look at the differences between parent and child to see what has changed.
This property is only on elements.
The prefixes 'xml' and 'xmlns' are always defined:
xml → http://www.w3.org/XML/1998/namespace
xmlns → http://www.w3.org/2000/xmlns/
The document implicitly defines 'xml' and 'xmlns', but [in-scope namespaces] only ever lists 'xml'; it never contains an item for 'xmlns'.
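One way to see the difference between declared and in-scope bindings is to rebuild the in-scope set from the declarations. This is a sketch only, assuming the 'start-ns' events of Python's xml.etree.ElementTree.iterparse; the function name in_scope_namespaces and the abbreviated order document are illustrative.

```python
import io
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
ORDER = "urn:publicid:IDN+cde.berkeley.edu:example:order:en"
ITEMS = "urn:publicid:IDN+cde.berkeley.edu:example:order:items:en"
XHTML = "http://www.w3.org/1999/xhtml"

def in_scope_namespaces(xml_text):
    """Return (resolved name, in-scope prefix bindings) for each element."""
    scopes = [{"xml": XML_NS}]  # 'xml' is always bound
    pending = {}                # declarations seen just before a start tag
    result = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start", "end")):
        if event == "start-ns":
            prefix, uri = payload       # prefix is '' for the default
            pending[prefix] = uri
        elif event == "start":
            scope = dict(scopes[-1])    # inherit the parent's bindings
            scope.update(pending)       # then apply this element's declarations
            pending = {}
            scopes.append(scope)
            result.append((payload.tag, scope))
        else:
            scopes.pop()                # bindings end with the element
    return result

doc = (f'<o:order xmlns:o="{ORDER}"><o:items xmlns:o="{ITEMS}"><o:item/></o:items>'
       f'<o:memo xmlns="{XHTML}"><p/></o:memo></o:order>')
scopes_by_name = dict(in_scope_namespaces(doc))
```

Here 'o:item' declares nothing yet has 'o' in scope, and 'o:memo' sees the original binding of 'o' again because the redefinition ended with 'o:items'.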
#17
Attribute Info Item
[local name] - The local part of the attribute's name.
[normalized value] - The value of the attribute as normalized by the XML Processor.
[owner element] - The element upon which this attribute was specified.
#18
Character Info Item
[character code] - The Unicode character code (code point).
[parent] - The element containing this character.
Usually implementations collect character info items into strings.
#19
There's More...
There are more properties than what has been shown here.
There are info items for processing instructions, comments, and a few more things.
The XML Information Set recommendation is one of your reading assignments.
#20
Information Sets are Extensible
New recommendations can extend info items by adding properties.
For example, XML Schema adds properties to the infoset to record the results of validation.
Proprietary software can add its own properties too.
#21
The Base URI
An XML document has a base URI.
Typically, the value represents where the document came from.
These values propagate down to each element.
They allow relative URI links to work:
<graphic src='example.png'/> <next href='slide2.xml'/> <top href='../index.xml'/>
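A quick illustration of relative resolution, assuming Python's standard urllib.parse.urljoin. The base URI below is hypothetical; only the three relative links come from the example.

```python
from urllib.parse import urljoin

# Hypothetical base URI: where this slide document was retrieved from.
base_uri = "http://www.example.com/course/slides/slide1.xml"

relative_links = ["example.png", "slide2.xml", "../index.xml"]
resolved = {link: urljoin(base_uri, link) for link in relative_links}

for link, absolute in resolved.items():
    print(f"{link} -> {absolute}")
```

Without a base URI there is nothing to join against, which is the problem the next slide describes.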
#22
Unknown Base URI
Not all documents have a base URI.
What happens when you POST an XML document? There is no URI for the client...
What happens in a pipeline when you chain two XSLT stylesheets together?
The result of the first transform doesn't have a base URI.
Relative links won't work until it is serialized.
This means that documents can be broken after you author them.
Example:
This link to 'bio.xml' has no resolvable URI under HTTP POST:
<person bio="bio.xml"> <name>Alex Milowski</name> </person>
#23
IETF RFC 2396
This RFC provides the means for embedding base URIs in documents.
There are rules for determining the base URI, in order of precedence:
1. The base URI is embedded in the document's content.
2. The base URI is that of the encapsulating entity (message, document, or none).
3. The base URI is the URI used to retrieve the entity.
4. The base URI is defined by the context of the application.
There is a W3C Recommendation called XML Base that covers (1).
#24
xml:base Attribute
The xml:base attribute annotates an element with a base URI value.
Example:
<graphic xml:base='http://www.milowski.com/thesis/' src='issac-scarfex.jpg'/>
Rules for finding the base URI of an element (the first that applies wins):
The value of the xml:base attribute.
The base URI of the parent element if there is a parent element.
The base URI of the document.
Keep in mind that the 'xml' prefix is pre-defined to map to 'http://www.w3.org/XML/1998/namespace'.
#25
Interpreting Values
For any attribute, use the base URI of the element.
<link href='mystuff.xml'/>
For any text content, use the base URI of the parent element.
<link>mystuff.xml</link>
For xml:base, use the base URI of the parent element.
<p xml:base="http://www.example.com"> A <a xml:base="/bingo/" href='stuff.xml'>link</a> </p>
In the first case, the base URI of the parent of 'p' (or of the document) is used but not needed, as the value of the xml:base attribute is an absolute URL.
In the second case, the base URI of 'p' is used to resolve '/bingo/'. This gives the base URI 'http://www.example.com/bingo/' for the 'a' element, so its 'href' resolves to 'http://www.example.com/bingo/stuff.xml'.
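The lookup rules can be sketched as a walk down the tree. This assumes Python's xml.etree.ElementTree and urllib.parse.urljoin; the function base_uris and the document base URI are hypothetical, and this is not a full XML Base implementation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree reports xml:base under the pre-defined 'xml' namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

def base_uris(root, document_base):
    """Return {element: base URI} for root and all of its descendants."""
    bases = {}

    def visit(element, parent_base):
        own = element.get(XML_BASE)
        # xml:base itself is resolved against the parent's base URI.
        base = urljoin(parent_base, own) if own is not None else parent_base
        bases[element] = base
        for child in element:
            visit(child, base)

    visit(root, document_base)
    return bases

doc = ('<p xml:base="http://www.example.com"> A '
       '<a xml:base="/bingo/" href="stuff.xml">link</a> </p>')
p = ET.fromstring(doc)
a = p.find("a")

# Hypothetical document base URI; 'p' overrides it with an absolute URL.
bases = base_uris(p, "http://server.invalid/docs/slides.xml")
link_target = urljoin(bases[a], a.get("href"))
print(bases[p], bases[a], link_target)
```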
#26
Modular Content
Sometimes you just want to define reusable "text".
Other times you want reusable content modules.
XInclude defines a way to do both.
This is a "replacement" for general entities (e.g. &mystuff;).
#27
XInclude
XInclude is a W3C Recommendation as of 20 December 2004 (see http://www.w3.org/TR/xinclude/).
XInclude is a vocabulary that tells a processor how to transform your document.
The vocabulary has replacement semantics (e.g. replace this xinclude with the specified text/module).
#28
Including External Content
Just point to the content:
<doc xmlns:xi="http://www.w3.org/2001/XInclude"> <title>My Document</title> <xi:include href='abstract.xml'/> </doc>
'abstract.xml' contains:
<abstract> <p>Blah, blah, blah,...</p> </abstract>
So you get:
<doc xmlns:xi="http://www.w3.org/2001/XInclude"> <title>My Document</title> <abstract> <p>Blah, blah, blah,...</p> </abstract> </doc>
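The same inclusion can be tried with Python's standard xml.etree.ElementInclude module, which handles a subset of XInclude. In this sketch the DOCUMENTS table and the loader function are assumptions that stand in for real files.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# In-memory stand-in for the file system; 'abstract.xml' mirrors the slide.
DOCUMENTS = {
    "abstract.xml": "<abstract><p>Blah, blah, blah,...</p></abstract>",
}

def loader(href, parse, encoding=None):
    """Resolve an href from DOCUMENTS instead of reading a file."""
    text = DOCUMENTS[href]
    return ET.fromstring(text) if parse == "xml" else text

doc = ET.fromstring(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    "<title>My Document</title>"
    "<xi:include href='abstract.xml'/>"
    "</doc>"
)

# Replaces each xi:include element in place with the loaded content.
ElementInclude.include(doc, loader=loader)
child_names = [child.tag for child in doc]
print(child_names)
```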
#29
Try this at home!
For all the features, you need a full XInclude processor.
This XSLT stylesheet will handle transclusion of those with parse='xml'.
It will also handle the fallback if the document does not exist.
Example input: doc.xml abstract.xml
Example output: doc_output.xml
#30
Including Internal Content
Now we're going to go out on the edge of reality...
Just point to the content:
<doc xmlns:xi="http://www.w3.org/2001/XInclude"> <string id="cde">Center for Document Engineering</string> <title>Where I Work</title> <p>I teach at the <xi:include xpointer="id('cde')"/>.</p> </doc>
So you get:
<doc xmlns:xi="http://www.w3.org/2001/XInclude"> <string id="cde">Center for Document Engineering</string> <title>Where I Work</title> <p>I teach at the <string id="cde">Center for Document Engineering</string>.</p> </doc>
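A hand-rolled sketch of this replacement, assuming Python's xml.etree.ElementTree. It is not an XPointer implementation: the function include_by_id is hypothetical, it understands only the id('...') form used above, and it treats any attribute named 'id' as an ID even though no DTD or schema declares it.

```python
import copy
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

def include_by_id(root):
    """Replace each xi:include with xpointer="id('...')" by a copy of its target."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != XI_INCLUDE:
                continue
            pointer = child.get("xpointer", "")
            if not (pointer.startswith("id('") and pointer.endswith("')")):
                continue  # anything other than id('...') is left alone
            target = copy.deepcopy(by_id[pointer[4:-2]])
            target.tail = child.tail  # keep the text that followed the include
            parent[index] = target
    return root

doc = ET.fromstring("""<doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <string id="cde">Center for Document Engineering</string>
  <title>Where I Work</title>
  <p>I teach at the <xi:include xpointer="id('cde')"/>.</p>
</doc>""")

paragraph = include_by_id(doc).find("p")
sentence = "".join(paragraph.itertext())
print(sentence)
```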
#31
Inclusion Chains
You can't have circular includes (e.g. A includes B and B includes A).
Inclusion chains are followed:
<doc xmlns:xi="http://www.w3.org/2001/XInclude"> <title>My Document</title> <xi:include href='abstract.xml'/> </doc>
'abstract.xml' contains:
<abstract xmlns:xi="http://www.w3.org/2001/XInclude"> <string id="b">blah</string> <p><xi:include xpointer="id('b')"/>, <xi:include xpointer="id('b')"/>, <xi:include xpointer="id('b')"/>,...</p> </abstract>
So you get:
<doc xmlns:xi="http://www.w3.org/2001/XInclude"> <title>My Document</title> <abstract> <string id="b">blah</string> <p><string id="b">blah</string>, <string id="b">blah</string>, <string id="b">blah</string>,...</p> </abstract> </doc>
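Chain following and the no-circular-includes rule can be sketched together. This assumes Python's xml.etree.ElementTree and handles href-based includes only; the expand function and the file names 'para.xml', 'a.xml' and 'b.xml' are hypothetical.

```python
import xml.etree.ElementTree as ET

XI = "http://www.w3.org/2001/XInclude"
XI_INCLUDE = "{" + XI + "}include"

def expand(href, documents, active=()):
    """Return the tree for href with nested href includes expanded."""
    if href in active:
        chain = " -> ".join(active + (href,))
        raise ValueError("circular include: " + chain)
    root = ET.fromstring(documents[href])
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == XI_INCLUDE and child.get("href"):
                # Follow the chain, remembering which documents are open.
                included = expand(child.get("href"), documents, active + (href,))
                included.tail = child.tail
                parent[index] = included
    return root

documents = {
    "doc.xml": f'<doc xmlns:xi="{XI}"><title>My Document</title>'
               '<xi:include href="abstract.xml"/></doc>',
    "abstract.xml": f'<abstract xmlns:xi="{XI}"><xi:include href="para.xml"/></abstract>',
    "para.xml": "<p>blah, blah, blah,...</p>",
}
chain_result = expand("doc.xml", documents)

# A includes B and B includes A: expand() raises ValueError for this pair.
circular = {
    "a.xml": f'<a xmlns:xi="{XI}"><xi:include href="b.xml"/></a>',
    "b.xml": f'<b xmlns:xi="{XI}"><xi:include href="a.xml"/></b>',
}
```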