XML Documents, Namespaces, and Unicode

#1

Overview

Topics:
- XML syntax
- Names and Namespaces
- Unicode
References will be provided at the end.

#2

History of XML

XML's development started around 1996.
The motivation was in part the success of HTML and in part the crushing need to simplify SGML.
Jon Bosak from Sun Microsystems was the prime motivator for the creation of XML and was inspired by Yuri Rubinsky's[1] "SGML on the Web".
The W3C (World Wide Web Consortium) was chosen to host the standards process and the original XML Committee was formed.
The first draft was produced in November 1996 and shown at the SGML conference in Boston, MA in December 1996.[2]
In February 1998, XML became a W3C Recommendation[3].
As of April 2004 there is a "third edition" of XML 1.0[4] and an XML 1.1 Recommendation[5].

[1] http://www.oasis-open.org/cover/yuriMemColl.html

[2] http://www.w3.org/TR/WD-xml-961114.html

[3] http://www.w3.org/TR/1998/REC-xml-19980210

[4] http://www.w3.org/TR/2004/REC-xml-20040204/

[5] http://www.w3.org/TR/xml11/

#3

The W3C Process

Interest in a particular topic is made known by the consortium members either as a result of a published note or from the results of a workshop.
Either a Working Group is formed or the topic is assigned to an existing group.
A Requirements Document is drafted and approved first by the working group and then by the consortium.
All documents go through these stages: Initial Draft, Working Draft, Last Call Draft, Candidate Recommendation, Proposed Recommendation, and Recommendation.

#4

What is XML?

XML is about encoding and structuring single "instances" of documents.
XML and related standards also define "what is a valid instance".
XML is not a programming language. It has NO SEMANTICS.
Erase "XML Programming" from your vocabulary.

#5

What is a Document?

Document

1 a archaic : PROOF, EVIDENCE b : an original or official paper relied on as the basis, proof, or support of something c : something (as a photograph or a recording) that serves as evidence or proof 2 a : a writing conveying information b : a material substance (as a coin or stone) having on it a representation of thoughts by means of some conventional mark or symbol c : DOCUMENTARY [1]

Documentary

1 : being or consisting of documents : contained or certified in writing <documentary evidence> 2 : of, relating to, or employing documentation in literature or art; broadly : FACTUAL, OBJECTIVE [1]

Markup Language

: a system (as HTML or SGML) for marking or tagging a document that indicates its logical structure (as paragraphs) and gives instructions for its layout on the page for electronic transmission and display [1]

[1] Merriam-Webster Online: http://www.m-w.com/

#6

What is an XML Document?

It can be thought of as a tree of elements--containers of information.
Traditionally, from a markup perspective:
- Articles, books, notes, poems, novels
- Technical manuals, slip sheets, product packaging
More abstractly, documents are instances of information that typically have structure.
So, lots of things are documents:
- Messaging, e-mails, web sites, etc.
- Business transactions, invoices, statements, etc.
- Log files, configuration files, install scripts, etc.
- Ontologies, medical lab data, instrument experiment results, human genome annotations, etc.
XML is a useful common syntax for encoding these documents.

#7

The Real Answer

Documents are instances of units of information*.

* With XML you get to define "instance", "unit", and "information".

#8

What XML Provides

Internationalization via Unicode
Validation of instances.
Localization of names via namespaces (e.g. My 'tomato' isn't your 'tomato').
A "human readable" format.
Hierarchical structure.
A "motif" for extensibility.

#9

An Example


  <slide>
    <title>What XML Provides</title>
    <contents>
      <ul>
        <li><p>Internationalization via Unicode</p></li>
        <li><p>Validation of instances.</p></li>
        <li><p>Localization of names via namespaces 
              (e.g. My &#39;tomato&#39; isn&#39;t your &#39;tomato&#39;).</p>
        </li>
        <li><p>A &#34;human readable&#34; format.</p></li>
        <li><p>Hierarchical structure.</p></li>
        <li><p>A &#34;motif&#34; for extensibility.</p></li>
      </ul>
    </contents>
  </slide>

#10

A More Complicated Example

<c:pseudocode name="Adj" 
   xmlns:c="urn:publicid:IDN+mathdoc.org:schema:pseudocode:2004:1.0:us"
>
   <c:args><c:arg>v</c:arg><c:arg>j</c:arg></c:args>
   <c:for>
      <c:varassign> 
         <c:var>i</c:var>
         <c:constant>1</c:constant>
      </c:varassign>
      <c:to><c:constant>j</c:constant></c:to>
      <c:do>
         <c:varassign>
            <c:var>r</c:var>
            <c:func name="find-simplex">
               <c:value><c:var>v</c:var></c:value>
               <c:value><c:var>i</c:var></c:value>
            </c:func>
         </c:varassign>
      </c:do>
   </c:for>
   <c:return><c:value><c:var>r</c:var></c:value></c:return>
</c:pseudocode>

Adj(v, j):
   for i ← 1 to j
      do r ← find-simplex(v, i)
   return r

#11

Reading EBNF

[22]    prolog      ::=    XMLDecl? Misc* (doctypedecl Misc*)?
[23]    XMLDecl     ::=    '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[24]    VersionInfo ::=    S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[25]    Eq          ::=    S? '=' S?
[26]    VersionNum  ::=    ([a-zA-Z0-9_.:] | '-')+
[27]    Misc        ::=    Comment | PI | S

'+' means 'one or more', '?' means 'optional', '*' means 'zero or more'.
Parentheses group constructs.
'|' (pipe character) means 'or'.
'string' means the occurrence of the literal string.
[c-c] is a character class and represents a single character in the specified range.
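
One way to get a feel for the notation is to transcribe a few productions into a regular expression. The sketch below does this in Python for productions [24] to [26]. The definition of S (one or more space, tab, carriage return, or line feed characters) comes from the XML specification rather than from this slide, and the names VERSION_INFO and is_version_info are made up for illustration.

```python
import re

# S is not shown on the slide; in the XML spec it is one or more of
# space, tab, carriage return, line feed.
S = r"[ \t\r\n]+"

# [25] Eq ::= S? '=' S?
EQ = "(?:" + S + ")?=(?:" + S + ")?"

# [26] VersionNum ::= ([a-zA-Z0-9_.:] | '-')+
VERSION_NUM = r"[a-zA-Z0-9_.:\-]+"

# [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
VERSION_INFO = (
    S + "version" + EQ
    + "(?:'" + VERSION_NUM + "'|\"" + VERSION_NUM + "\")"
)


def is_version_info(text):
    """Return True if text matches the VersionInfo production exactly."""
    return re.fullmatch(VERSION_INFO, text) is not None


print(is_version_info(" version='1.0'"))
print(is_version_info("version='1.0'"))
```

Each EBNF operator maps onto its regex counterpart: '?', '*', and '+' keep their meaning, '|' stays alternation, and quoted strings become literal text. The second call is rejected because the production requires leading whitespace (S).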

#12

Motivation for Namespaces

When all you have is a name:
- What if I want to mix my elements with yours?
- How do I associate semantics with mixed elements?
- How do I associate a schema (or rules) with elements and attributes?
Namespaces are a necessary part of XML.
They aren't evil.
But they can get a bit complex when the markup gets dense or vocabularies are mixed.

#13

Names in XML

Names are always a pair:
- A local name that is an identifier.
- A namespace name that is an absolute URI or "no value".
The namespace name does not occur syntactically in a name.

But there is the non-standardized "Clark Notation" for names: {uri}local-name

{http://www.w3.org/1999/xhtml}a
{urn:publicid:IDN+mathdoc.org:schema:slides:2004:1.0:us}slide
{}doc
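
As an illustration, Python's standard xml.etree.ElementTree module reports element names in this notation, except that a name with no namespace is reported as the bare local name rather than as {}doc. The helper split_clark below is a hypothetical name introduced for this sketch.

```python
import xml.etree.ElementTree as ET


def split_clark(tag):
    """Split '{uri}local' into (uri, local); uri is None for 'no value'."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return (uri or None), local
    return None, tag


anchor = ET.fromstring('<a xmlns="http://www.w3.org/1999/xhtml"/>')
print(anchor.tag)               # the name in Clark notation
print(split_clark(anchor.tag))  # the (namespace name, local name) pair
print(split_clark("{}doc"))
```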

#14

Local Names and Identifiers

In XML 1.0, explicit Unicode code point ranges are specified per language.
In XML 1.1, anything that isn't disallowed is allowed in a name.
In code-page zero (basic latin) this means: letters, digits, hyphens, periods, and underscores, where the first character must be a letter or an underscore.
The colon (:) is not allowed in a local name.
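
A rough sketch of these rules for ASCII-only local names is shown below. It is a deliberate simplification: the real productions admit a large set of non-ASCII characters, and is_ascii_ncname is a name invented for this example.

```python
import re

# ASCII-only approximation of a local name: starts with a letter or
# underscore, continues with letters, digits, periods, underscores, hyphens.
# The colon is excluded on purpose.
ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ascii_ncname(name):
    """Return True if name is an acceptable ASCII-only local name."""
    return ASCII_NCNAME.fullmatch(name) is not None


for candidate in ("find-simplex", "h:a", "1st"):
    print(candidate, is_ascii_ncname(candidate))
```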

#15

Syntax of Names

The namespace name does not occur syntactically in a name.
But a shortened form of the name called a QName does occur.

The syntax of a name is a QName (qualified name):

[6]   QName          ::=  PrefixedName | UnprefixedName
[6a]  PrefixedName   ::=  Prefix ':' LocalPart
[6b]  UnprefixedName ::=  LocalPart
[7]   Prefix         ::=  NCName
[8]   LocalPart      ::=  NCName

Instance examples:

<c:pseudocode id="foo" xml:lang="en">
<?hack format=large?>
<c:description>
<p>This is the algorithm</p>
</c:description>
</c:pseudocode>

#16

QNames and Prefixes

QNames must resolve to names.
So, prefixes in QNames must resolve to a namespace name.
That means prefixes must be declared to resolve a QName.
In an XML document, this is done via special namespace declaration attributes that you'll see in a few slides.
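
The resolution step can be sketched as a small function. This is only an illustration, assuming the prefix bindings in scope are already available as a dictionary; resolve_qname and its parameters are hypothetical names, not part of any library.

```python
def resolve_qname(qname, bindings, default_ns=None):
    """Return the (namespace name, local name) pair for a QName.

    bindings maps declared prefixes to namespace names. default_ns is the
    default namespace in scope (it applies to element names only), or None.
    """
    prefix, colon, local = qname.partition(":")
    if not colon:
        # UnprefixedName: the whole QName is the local part.
        return default_ns, qname
    if prefix not in bindings:
        raise ValueError("undeclared prefix: " + prefix)
    return bindings[prefix], local


bindings = {"h": "http://www.w3.org/1999/xhtml"}
print(resolve_qname("h:a", bindings))
print(resolve_qname("doc", bindings))
```

An undeclared prefix raises an error here, which mirrors the rule that a QName with an undeclared prefix cannot be resolved to a name.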

#17

Basic Syntax - Documents & Elements

Here's the basic syntax of a document:

[1]  document     ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
[39] element      ::=   EmptyElemTag | STag content ETag
[9]  STag         ::=   '<' QName (S Attribute)* S? '>' 
[10] ETag         ::=   '</' QName S? '>'
[11] EmptyElemTag ::=   '<' QName (S Attribute)* S? '/>'

Examples:

<?xml version="1.0"?>
<doc>
<h:a href="something.xml" xmlns:h="http://www.w3.org/1999/xhtml">something</h:a>
</doc>

The 'prolog' and 'misc' can only contain comments and processing instructions--no text content other than whitespace.
If there is an XML declaration (see next slide) it must occur before any characters in the document.
The 'content' production consists of any mix of elements, character data, references, CDATA sections, processing instructions, and comments.
There can only be one element at the root of the document and it is called the document element.

#18

The XML Declaration

The XML Declaration must be at the exact start of the document (i.e. no whitespace before the declaration).
It consists of up to three parts specified by pseudo-attributes, in this order:
1. version='n.n' -- required; specifies version information -- either 1.0 or 1.1.
2. encoding='type' -- optional; specifies the character encoding (e.g. UTF-8) used by this "data stream" to encode the Unicode characters of the XML document.
3. standalone='yes|no' -- optional; declares whether the document is self-contained ('yes') or depends on markup declarations external to it, such as an external DTD subset ('no').

An example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
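
To see the declaration at work, the sketch below feeds raw bytes to Python's standard xml.etree.ElementTree parser. It assumes the underlying parser supports ISO-8859-1 (the bundled expat parser does); the helper name parses is made up for this example.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is 'é' in ISO-8859-1. The encoding pseudo-attribute tells
# the parser how to decode the byte stream into Unicode characters.
latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc>\xe9</doc>'
text = ET.fromstring(latin1).text


def parses(data):
    """Return True if data parses without a fatal error."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# The same declaration preceded by a single space is rejected.
at_start = parses(b'<?xml version="1.0"?><doc/>')
after_space = parses(b' <?xml version="1.0"?><doc/>')
print(repr(text), at_start, after_space)
```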

#19

Basic Syntax - Attributes

Attribute syntax:

[12] Attribute   ::=   NSAttName Eq AttValue | QName Eq AttValue

Example:

<a xmlns="http://www.w3.org/1999/xhtml" href='http://www.w3.org/1999/xhtml'/>

In the above, the first is a namespace declaration attribute and the second is a regular attribute.

The attribute value must be a quoted literal, so there are syntax constraints on it -- see the next few slides.

#20

Those Pesky Namespaces and Prefixes

Now you've seen that QNames occur all over the place.
Those prefixes must be declared:
- On the current element or one of its ancestors there must be a namespace declaration attribute that declares that prefix; the nearest declaration wins.
- A namespace declaration attribute is an attribute whose name starts with the special prefix 'xmlns' followed by a colon ':' and the prefix being declared.
- The value of the attribute is an absolute URI that names the namespace.
- You can default the namespace for elements by omitting the colon and prefix in the declaration (i.e. plain 'xmlns').
- A declaration is "in scope" for the element it appears on and all of its descendants, unless a descendant overrides it.
- When a default is in scope, element names with no prefix use that namespace.
- Unprefixed attributes do not use the default namespace; they have no namespace.

#21

Namespace Example

<doc xmlns:m="http://www.w3.org/1998/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<title>My Document</title>
<body xmlns="http://www.w3.org/1999/xhtml">
<p xlink:href="alternate.xml">I am a paragraph containing some mathematics: 
<m:math><m:mi>x</m:mi></m:math>
</p>
</body>
</doc>

The 'doc' element--the document element--has no namespace.
The 'title' element has no namespace.
The 'body' element declares the XHTML namespace as the default and is itself in that namespace.
The 'p' element is in the default XHTML namespace.
The 'href' attribute is in the namespace associated with the 'xlink' prefix declared on the 'doc' element.
The 'math' element is in the namespace associated with the 'm' prefix declared on the 'doc' element.
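
These statements can be checked mechanically. The sketch below parses a copy of the slide's document (with consistently quoted attribute values) using Python's standard xml.etree.ElementTree, which reports each name as {namespace}local-name and leaves names with no namespace bare.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/MathML"
XLINK = "http://www.w3.org/1999/xlink"

source = """<doc xmlns:m="http://www.w3.org/1998/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<title>My Document</title>
<body xmlns="http://www.w3.org/1999/xhtml">
<p xlink:href="alternate.xml">I am a paragraph containing some mathematics:
<m:math><m:mi>x</m:mi></m:math>
</p>
</body>
</doc>"""

doc = ET.fromstring(source)
title, body = list(doc)   # the two element children of 'doc'
p = body[0]
math = p[0]

# Print each element name, then the attribute names on 'p'.
for element in (doc, title, body, p, math):
    print(element.tag)
print(list(p.attrib))
```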

#22

Well-Formed Documents

Documents must pass a minimum syntax check called being 'well-formed'.
Documents must conform to the syntax of XML to be well-formed.
There are additional constraints that are also considered in checking for well-formed documents.
Violating these rules is considered a "fatal error":

"Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way)."
Here's a bad document: bad.xml
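
A minimal way to test for well-formedness is to hand the text to a parser and see whether it reports a fatal error. The sketch below uses Python's standard xml.etree.ElementTree, whose parser is non-validating; is_well_formed is a name chosen for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser stops at the first fatal error.
        return False
    return True


samples = [
    "<doc><a></a></doc>",   # properly nested
    "<doc><a></doc></a>",   # end tags in the wrong order
    "<a/><b/>",             # two root elements
    '<a x="1" x="2"/>',     # duplicate attribute name
]
for sample in samples:
    print(sample, is_well_formed(sample))
```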

#23

Well-Formed Elements

Constraints:
- All elements must conform to the syntax of XML.
- The names of the start tag and end tag must be the same.
- Attribute names must be unique within a start tag.
- Elements must nest properly: children cannot span past their parent's end tag.
Contents of an element are considered its children.
Children are ordered in "document order"--the order in which they appear in the character stream representing the document.

#24

Well-Formed Empty Elements

Elements can be empty.
They have no content but can have attributes.

They have a special syntax:

<nothing/>
<almost-nothing but="something"/>

#25

Well-Formed Attributes

Constraints:
- In the attribute value, '<' (less than) must be escaped as &lt;.
- '&' (ampersand) must be escaped as &amp;; otherwise the parser will assume you've started an entity reference.
Attributes are only associated with the start tag.
Attributes cannot be duplicated on the start tag (i.e. have the same name).
They contain "simple" text content -- no elements.
They are unordered.

Examples:

<doc status='final' xmlns:xlink="http://www.w3.org/1999/xlink">
 <author xlink:href="http://www.milowski.com">Alex Milowski</author>
</doc>
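
When attribute values are generated by a program, the escaping is best left to a library. The sketch below uses quoteattr from Python's standard xml.sax.saxutils module to escape and quote a value, then parses the result back; the sample value is invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

value = "Q&A: is 1 < 2?"

# quoteattr escapes '&' and '<' and wraps the result in quotes.
tag = "<doc title=" + quoteattr(value) + "/>"
print(tag)

# Parsing the tag recovers the original, unescaped value.
parsed = ET.fromstring(tag).get("title")
print(parsed == value)
```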

#26

Well-Formed Text

Constraints:
- '<' (less-than) must be escaped as &lt;; otherwise the parser will assume you've started a tag.
- '&' (ampersand) must be escaped as &amp;; otherwise the parser will assume you've started an entity reference.
Text is allowed within an element, comment, or processing instruction.
Text inside comments and processing instructions need not be escaped.
An entity reference allows you to escape characters.
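
The two escaping constraints can be exercised with Python's standard library. This is a small sketch using escape and unescape from xml.sax.saxutils, which by default handle '&', '<', and '>'; the sample string is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "if a < b && b > c"

# Replace the markup-significant characters with entity references.
escaped = escape(raw)
print(escaped)

# Both the parser and unescape() turn the references back into characters.
from_parser = ET.fromstring("<t>" + escaped + "</t>").text
from_unescape = unescape(escaped)
print(from_parser == raw, from_unescape == raw)
```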

These are pre-defined:

Character	Entity Reference
&	&amp;
<	&lt;
>	&gt;
'	&apos;
"	&quot;

#27

Comments

You can put comments in the prolog, epilogue, and as content of an element.
Syntax:
```
[15]   Comment    ::=    '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
```
That is, a comment contains text except that '--' is not allowed.

Example:

<!-- Where oh where did my content go? -->

Not allowed:
```
<!-- this comment -- is not allowed -->
```

#28

Processing Instructions

Processing Instructions are like comments except they are meant for application-specific processing.
Syntax:
```
[16]    PI        ::=    '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17]    PITarget  ::=    Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
```
That is, a processing instruction contains text except that '?>' is not allowed in the content and 'xml' (in any mix of upper and lower case) is not allowed as the target name.
Example:
```
<?insert 5/10/03?>
```
Not allowed:
```
<?very ?> bad ?>
```
But names that merely begin with 'xml' are reserved for the W3C, which does use them:
```
<?xml-stylesheet href="mystyle.xsl"?>
```

#29

CDATA Sections

A CDATA section marks text as "just text".
Markup-like syntax inside it is not interpreted as markup.

Syntax:

[18]    CDSect    ::=    CDStart CData CDEnd
[19]    CDStart   ::=    '<![CDATA['
[20]    CData     ::=    (Char* - (Char* ']]>' Char*))
[21]    CDEnd     ::=    ']]>'

That is, a CDATA section contains text except that ']]>' is not allowed in the content.

Example:
```
<![CDATA[<greeting>Hello, world!</greeting>]]> 
```
This generates the text '<greeting>Hello, world!</greeting>' and not the element 'greeting'.
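
A quick check of this behavior with Python's standard xml.etree.ElementTree is sketched below. The wrapping 'doc' element is added here only because a CDATA section must appear inside an element.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<doc><![CDATA[<greeting>Hello, world!</greeting>]]></doc>"
)
with_escapes = ET.fromstring(
    "<doc>&lt;greeting&gt;Hello, world!&lt;/greeting&gt;</doc>"
)

# Both forms yield the same text and no child elements.
print(with_cdata.text)
print(len(with_cdata), with_cdata.text == with_escapes.text)
```

To the application the two forms are equivalent; a CDATA section is only a more convenient way to write text that contains many '<' and '&' characters.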

#30

Characters and Unicode

All text, names, values, syntax, etc. in an XML document are Unicode Characters.
Unicode assigns each character a "code point" (a number) and arranges these code points into code blocks.
Fonts map glyphs to code points (hopefully correctly).
The familiar ASCII is the first code block (e.g. decimal 65 is the letter 'A').
Unicode attempts to represent all languages with code blocks and code points.
Code points also have properties like: 'whitespace character' or 'name character'.
The Greek code block starts at U+0370.
Neither Microsoft nor Apple ships complete Unicode fonts even though their operating systems fully support Unicode. Complain!!!
Check out the STIX page or the Unicode consortium's page for more info on complete fonts.

#31

Unicode in XML

Unicode characters can either be encoded directly in the document or accessed by reference.
You need a Unicode-aware editor to encode them directly and then you just type in the character you want.

Otherwise you use a character reference:

[66]    CharRef   ::=    '&#' [0-9]+ ';'
                       | '&#x' [0-9a-fA-F]+ ';'

An example, some mathematics:

&#x2200;&#x03b1;&#x2208;&#x0393;(...)

&#8704;&#945;&#8712;&#915;(...)

∀α∈Γ(...)
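
The equivalence of the three forms can be verified with a parser, and the references can be generated from the characters. The sketch below uses Python's standard xml.etree.ElementTree; the element name 'm' and the helper char_refs are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hexadecimal and decimal references to the same four code points.
hex_form = ET.fromstring("<m>&#x2200;&#x03b1;&#x2208;&#x0393;</m>").text
dec_form = ET.fromstring("<m>&#8704;&#945;&#8712;&#915;</m>").text


def char_refs(text):
    """Encode every non-ASCII character as a hexadecimal character reference."""
    return "".join(c if ord(c) < 128 else "&#x%04X;" % ord(c) for c in text)


print(hex_form == dec_form)
print(char_refs(hex_form + "(...)"))
```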

#32

Whitespace Handling

Often, whitespace (space characters, newlines, tabs, etc.) is added to make the XML more "readable".
Whitespace can be marked as not significant and the XML Processor will convey that to the application.
An attribute xml:space can be added to any element to control this behavior.
A value of 'default' means that a validating processor can mark whitespace as ignorable.
A value of 'preserve' means it is significant.
Typically you need to validate to have ignorable whitespace.
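
For comparison, a non-validating parser simply passes all whitespace through. The sketch below shows this with Python's standard xml.etree.ElementTree, which also exposes xml:space as an ordinary attribute in the XML namespace and leaves it to the application to act on; the sample document is invented.

```python
import xml.etree.ElementTree as ET

# The 'xml' prefix is always bound to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

doc = ET.fromstring('<doc>\n  <code xml:space="preserve">  x  </code>\n</doc>')
code = doc[0]

# The indentation before <code> is reported as text, not dropped.
print(repr(doc.text))

# The attribute is visible, and the text is passed through unchanged.
print(code.get(XML_SPACE), repr(code.text))
```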