Namespaces and Introduction to XPath

#1

Motivation for Namespaces

When all you have is a name:
- What if I want to mix my markup with yours?
- How do I associate semantics with mixed markup?
- How do I associate a schema (or rules) with markup?

#2

Problems - Example 1

These are certainly different:

<date>1/27</date>
<date><year>2004</year><day>1</day><month>27</month></date>

What if they are both correct?

#3

Problems - Example 2

How about:

<order>
<item id="i1"/>
<item id="i2"/>
<item id="i3"/>
</order>

How many different interpretations are there of 'order'?

#4

Solutions?

What do we need?
What is "in-scope" for XML syntactically?
What is "out of scope" for XML syntactically?

#5

Names and Namespaces

Names are broken into two parts:
- prefix - an identifier for a namespace
- local name - an identifier for a name in that namespace
These parts are separated by a colon and called QNames (Qualified Names).

Example:

<c:pseudocode>

   <c:comment xlink:href="http://somewhere..."/>

</c:pseudocode>

This applies only to element and attribute names.
Any prefix containing 'xml' is reserved for the W3C.

#6

Prefixes and Namespaces

Prefixes must be declared by associated prefixes with namespaces "before" they are used.
You can only declare this association on elements.
The syntax is: xmlns:prefix="some:uri"

Example:

<c:pseudocode xmlns:c="urn:publicid:IDN+mathdoc.org:pseudocode:en">

   <c:comment xlink:href="http://somewhere..." 
              xmlns:xlink="http://www.w3.org/..."/>

</c:pseudocode>

"before" means on the element where the prefix occurs or on an ancestor element.
The 'xml' prefix is pre-defined to be http://www.w3.org/XML/1998/namespace

#7

Defaulting Namespaces

Namespaces can be defaulted.
This applies to only element names without prefixes.
The syntax is: xmlns="some:uri"

Example:

<c:pseudocode xmlns:c="urn:publicid:IDN+mathdoc.org:pseudocode:en">

   <c:comment xmlns="http://www.w3.org/1999/xhtml">
     <p>This code does the following:</p>
     ...
   </c:comment>

</c:pseudocode>

You can undeclare the default: xmlns=""

#8

Namespace Names and URI Values

The namespace name is a URI (Uniform Resource Identifier).
URI values can be web address (e.g. http://youdomain.com/...)
- web locations: http://yourdomain.com/...
- URN (names): urn:...
- Or other schemes: scheme:scheme-specific-part
But the processor uses the exact string as the URI value and not URI equivalences:
- http://www.somwhere.com/A
- http://www.somwhere.com/%41

#9

Declarations have Scope

A namespace declaration's scope is the element where it occurs.
There is no different between declarations on the root element and elsewhere.
The element, its attributes, and its children may use that prefix in their names.
You can re-define namespaces.

#10

What's a Name?

It's not what you think.
The prefix is just an abbreviation of the actual namespace name (i.e. the value of the declaration).
A name now consists of two parts:
- namespace name - the name of the namespace associated with the prefix.
- local name - the part of the name after the colon.
The prefix is now non-critical syntax.

#11

The Infoset "Changes"

Namespace processing is optional.
When used, names change.
But this only affects elements and attributes.
You can get different property values depending on whether you process namespaces.
- Names now have two parts (i.e. local names and namespace names).
- Namespace declarations and their scopes occur.

#12

The Infoset "Changes" - Element

[namespace name] - the namespace name or "no value" if there isn't one.
[local name] - the local part of the name (i.e. after the colon).
[prefix] - The namespace prefix used on the element or "no value" if there isn't one.
[in-scope namespaces] - An unordered list of namespace info items.
[namespace attributes] - An unordered list of attribute information items corresponding to the namespace declaration attributes actually on this element.

#13

The Infoset "Changes" - Attribute

[namespace name] - the namespace name or "no value" if there isn't one.
[local name] - the local part of the name (i.e. after the colon).
[prefix] - The namespace prefix used on the attribute or "no value" if there isn't one.

#14

How it Works - An Example

The simplest thing is to default the namespace:

<order xmlns="urn:publicid:IDN+cde.berkeley.edu:example:order:en">
  <from>me</from>
  <to>you</to>
  <items>
    <item>A clue</item>
    <item>A better clue!</item>
  </items>
  <memo>
    <p>I need this order now!</p>
  </memo>
</order>

#15

How it Works - An Example - Part 2

But we can use a prefix and get the exact same names:

<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en">
  <o:from>me</o:from>
  <o:to>you</o:to>
  <o:items>
    <o:item>A clue</o:item>
    <o:item>A better clue!</o:item>
  </o:items>
  <o:memo>
    <p>I need this order now!</p>
  </o:memo>
</o:order>

#16

How it Works - An Example - Part 3

If we change that prefix binding later on, we'll get different names:

<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en">
  <o:from>me</o:from>
  <o:to>you</o:to>
  <o:items xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:items:en">
    <o:item>A clue</o:item>
    <o:item>A better clue!</o:item>
  </o:items>
  <o:memo>
    <p>I need this order now!</p>
  </o:memo>
</o:order>

#17

How it Works - An Example - Part 4

We can also mix in a default if we wish:

<o:order xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:en">
  <o:from>me</o:from>
  <o:to>you</o:to>
  <o:items xmlns:o="urn:publicid:IDN+cde.berkeley.edu:example:order:items:en">
    <o:item>A clue</o:item>
    <o:item>A better clue!</o:item>
  </o:items>
  <o:memo xmlns="http://www.w3.org/1999/xhtml">
    <p>I need this order now!</p>
  </o:memo>
</o:order>

#18

Namespace Info Item

Namespace declarations are encoded as Namespace Info Items.
They have two properties:
- [prefix] - the prefix whose binding this info item describes.
- [namespace name] - the namespace name bound to this prefix.
These are not the namespace declarations.
Serialization of XML Infosets with namespaces is tricky.

#19

In-Scope Namespaces

Every namespace binding that is "in-scope" is listed in the element's [in-scope namespaces] property.
Again, this is not just those declared.
You have to look at the differences between parent and child to see what has changed.
This property is only on elements.
The prefixes 'xml' and 'xmlns' are always defined:
- xml → http://www.w3.org/XML/1998/namespace
- xmlns → http://www.w3.org/2000/xmlns/
The document implicitly defines 'xml' and 'xmlns'.

#20

And now for something completely different...

We're now actually going to do something with XML...

...but we need some tools.

#21

XPath

XPath is a syntax for "addressing" into a document.
They are "path expressions".
It allows you to expression things like:
- The "para" child element of the "contents" element.
- The next sibling of the "item" element.
- The "item" element where the attribute "overbid" has value "true".
It is its own "mini standard" used by many specifications.

#22

Like Directory Paths

XPath expressions have a directory-path-like syntax.
A single "/" (forward slash) represents the Document info item--also know as the root.
Subsequent named "steps" in the path represent children:
```
/doc/title
```
selects the 'title' child element of the document element 'doc'.
But they don't have to be "rooted":
```
contents/para
```
selects the 'para' child element of the 'content' element.

#23

Node Set Results

The result of evaluating an XPath expression is a Node Set.
A node is just another term for "info item".
For example, given the content:
```
<contents>
<para>One</para>
<para>Two</para>
<para>Three</para>
</contents>
```
the expression:
```
/content/para
```
would return three 'para' elements as a set.

#24

Selecting Attributes

You can also select attributes by adding the step: @name

For example, given the content:

<contents>
<para><a href="one.html">One</a></para>
<para><a href="two.html">Two</a></para>
<para><a href="three.html">Three</a></para>
</contents>

the expression:

/content/para/a/@href

would return the attribute 'href' of each of the three paragraphs as set.

#25

Names and Namespaces

Any step expression can use a QName: h:body
The prefix binding is defined external to the expression (e.g. application specific).
Matching is based on the local name and namespace name and not the prefix.
We could add namespaces to the previous examples:
```
/s:contents/d:para
/s:contents/d:para/h:a/@href
```
The application would have to define the prefixes 's', 'd', and 'h'.

#26

No Prefix = No Namespace

A name test without a prefix only matches something without a namespace.

For example:

m:section/title

matches

<m:section xmlns:m='urn:...'>
<title>No Namespace</title>
</m:section>

but not

<m:section xmlns:m='urn:...' xmlns='urn:something-else...'>
<title>I've got a namespace</title>
</m:section>

Remember, name matching is based on local name and namespace name alone!!!

#27

Wildcards

The '*' (asterisk) can be used to wildcard names.
Elements: All the elements contained in a 'content' element.
```
contents/*
```
Elements: All the attributes of 'para'.
```
para/@*
```
Namespaces can also be used:
```
s:contents/d:*
d:para/@h:*
```

#28

Context Node

Evaluation is always with respect to a context node.
You can address the context node as '.' (period):
For example, the attributes of the context node:
```
./@*
```
The context node is implicit.
For example, these are equivalent:
```
contents/para
./contents/para
```
The context node does not have to be an element.

#29

Parent and Ancestors

From the context node you can access your parent and ancestors.
Just like directories, '..' represents the parent.
You can go back many levels:
```
../../section
```
This selects the 'section' element that is the context node's parent's parent

#30

Conditional Matching

Predicates on the step allow conditions to be specified.
They follow the step and are wrapped in square brackets ('[' and ']').
For example:
```
para[@id='mine']
```
selects 'para' elements where the 'id' attribute has value 'mine'.
There is a whole wealth of operators (including boolean logic) that can be used.
You can also have sub-expressions:
```
contents[para/@id='mine']
```
This selects a 'contents' element that has a child 'para' with an attribute 'id' of value 'mine'.

#31

Skipping Levels

You can match elements that aren't direct children with the "//" (double forward slash).
This looks through the descendants of the "current context".
For example:
```
/section//cite
```
will match all 'cite' elements that are descendants of 'section'. But:
```
//cite
```
will match all 'cite' elements in the document.

#32

Special Functions

There are some special functions that can be used as steps.

Function	Result
node()	Matches any kind of node.
text()	Matches text.
processing-instruction()	Matches a processing-instruction.
comment()	Matches a comment.

For example:
```
para/node()
```
matches all the children of a 'para' element including comments, text, and processing instructions.

#33

The Real Story

This is all just an abbreviated syntax.
There is a lot more...
We'll start by explaining the "axes" of a document.

#34

Trees and XML

For the computer scientists:

Formally, a tree is a connected, acyclic, undirected graph.
It would be nice if XML was a rooted positional n-ary tree.
XML isn't that simple.
But the parent-child relationships in the infoset do form a tree.
Attributes and namespaces mess this up a bit.

#35

Relationships in a Tree

Relationships from the red node.

#36

Additional XML Relationships

Attributes:
- Each element can have attributes.
- Attribute info items aren't children.
Namespaces:
- Each element can have in-scope namespaces.
- Namespace info items aren't children.

#37

Axes are Directions on Relationships

Axes are just a traversal of a relationship.
Some are tree relationships:
- ancestor, ancestor-or-self
- parent, child, self
- descendant, descendant-or-self
- following, following-sibling
- preceding, preceding-sibling

And some extras:
- attribute
- namespace

#38

Axis Syntax

Steps can be preceded by an axis name and a double colon:

contents/child::para
para/preceding-sibling::*
ancestor::section/title

If that relationship doesn't exist, you get an empty node set.

#39

Axis Specifics

Each axis has:
- A direction of forward or reverse.
- A principal node type: one of attribute, namespace, or element
The direction refers to the order in which items will be traversed.

#40

Principal Node Type

The principal node type refers to what a name matches:
- On the 'child' axis, a name matches an element.
- On the 'attribute' axis, a name matches an attribute.
Principal node type "special cases" the extra relationships:
- Only the attribute axis has type 'attribute'.
- Only the namespace axis has type 'namespace'.
- Everything else has type 'element'.

#41

Axis Direction

Reverse Axes:
- ancestor
- ancestor-or-self
- preceding
- preceding-or-self
Everything else has a forward direction.

#42

Axis Direction - Example

For example, give the following

<doc>
<a/><b/><c/>
<target/>
<d/><e/><f/>
</doc>

These expressions evaluate:

target/preceding-sibling::* → elements 'c' b' 'a'.

target/following-sibling::* → elements 'd' 'e' 'f'.

#43

Abbreviated Syntax Equivalences

Abbreviation	Equivalence
../name	parent::name
name	child::name
//name	descendant::name
.	self::node()
*	child::*
@*	attribute::*
@name	attribute::name

#44

What use is this?

XPath is used extensively in XSLT to process and transform XML documents:

<xsl:transform version='1.0' 
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
<html>
<head><xsl:apply-templates select="doc/title"/></head>
<body>
<xsl:apply-templates select="doc/contents"/>
</body>
</html>
</xsl:template>

</xsl:transform>

XML Schema and other standards use this for similar matching needs.
Many programming APIs & commercial products provide XPath for traversing and manipulating XML.

#45

Next Time

There's still more XPath to discuss.
But we will introduce XSLT first.
Then we will dig deeper in XPath semantics in relation to XSLT.