Basics of XPath

#1

XPath

XPath is a syntax for "addressing" into a document.
XPath expressions are "path expressions".
It allows you to express things like:
- The "para" child element of the "contents" element.
- The next sibling of the "item" element.
- The "item" element where the attribute "overbid" has value "true".
It is its own "mini standard" used by many specifications.

#2

Like Directory Paths

XPath expressions have a directory-path-like syntax.
A single "/" (forward slash) represents the Document info item--also known as the root.
Subsequent named "steps" in the path represent children:
```
/doc/title
```
selects the 'title' child element of the document element 'doc'.
But they don't have to be "rooted":
```
contents/para
```
selects the 'para' child element of the 'contents' element.
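
As a rough illustration of rooted versus non-rooted paths, here is a minimal sketch assuming Python's standard xml.etree.ElementTree module, which implements only a subset of XPath and evaluates every path relative to the element it is called on. The sample document is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the slides.
doc = ET.fromstring(
    "<doc><title>Basics</title>"
    "<contents><para>One</para></contents></doc>"
)

# With the document element 'doc' as context, the step 'title'
# plays the role of /doc/title.
title = doc.find("title")

# A non-rooted path: the 'para' child of the 'contents' child.
para = doc.find("contents/para")

print(title.text, para.text)
```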

#3

Node Set Results

The result of evaluating an XPath path expression is a node set.
A node is just another term for "info item".
For example, given the content:
```
<contents>
<para>One</para>
<para>Two</para>
<para>Three</para>
</contents>
```
the expression:
```
/contents/para
```
would return three 'para' elements as a set.
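
A small sketch of the same selection, assuming Python's standard xml.etree.ElementTree module. Its findall returns a list in document order rather than a true set, and the path is written relative to the 'contents' element because that module does not accept rooted paths on an element.

```python
import xml.etree.ElementTree as ET

contents = ET.fromstring(
    "<contents>"
    "<para>One</para><para>Two</para><para>Three</para>"
    "</contents>"
)

# 'contents' is the context node, so the step 'para' selects
# all of its 'para' children.
paras = contents.findall("para")
texts = [p.text for p in paras]

print(len(paras), texts)
```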

#4

Selecting Attributes

You can also select attributes by adding the step: @name

For example, given the content:

<contents>
<para><a href="one.html">One</a></para>
<para><a href="two.html">Two</a></para>
<para><a href="three.html">Three</a></para>
</contents>

the expression:

/contents/para/a/@href

would return the 'href' attribute of each of the three 'a' elements as a set.
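
The sketch below approximates this with Python's standard xml.etree.ElementTree module. That module cannot return attribute nodes, so (as an assumption of this example, not a feature of XPath) it selects the 'a' elements that carry an 'href' and then reads the attribute value from each.

```python
import xml.etree.ElementTree as ET

contents = ET.fromstring(
    '<contents>'
    '<para><a href="one.html">One</a></para>'
    '<para><a href="two.html">Two</a></para>'
    '<para><a href="three.html">Three</a></para>'
    '</contents>'
)

# para/a[@href] keeps only 'a' elements that have an href attribute;
# the attribute value is then read through the element API.
hrefs = [a.get("href") for a in contents.findall("para/a[@href]")]

print(hrefs)
```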

#5

Names and Namespaces

Any step expression can use a QName: h:body
The prefix binding is defined external to the expression (e.g. application specific).
Matching is based on the local name and namespace name and not the prefix.
We could add namespaces to the previous examples:
```
/s:contents/d:para
/s:contents/d:para/h:a/@href
```
The application would have to define the prefixes 's', 'd', and 'h'.
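
To show that the prefix binding lives outside the expression, here is a minimal sketch assuming Python's standard xml.etree.ElementTree module. The namespace URIs are invented for the example, and the document deliberately uses different prefixes ('x', 'y') from the expression ('s', 'd').

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the document binds them to x and y.
contents = ET.fromstring(
    '<x:contents xmlns:x="urn:example:structure" xmlns:y="urn:example:doc">'
    '<y:para>One</y:para><y:para>Two</y:para>'
    '</x:contents>'
)

# The application supplies its own bindings for the expression.
bindings = {"s": "urn:example:structure", "d": "urn:example:doc"}

# Matching uses namespace name + local name, so 'd:para' finds 'y:para'.
paras = contents.findall("d:para", bindings)

print([p.text for p in paras])
```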

#6

No Prefix = No Namespace

A name test without a prefix only matches something without a namespace.

For example:

m:section/title

matches

<m:section xmlns:m='urn:...'>
<title>No Namespace</title>
</m:section>

but not

<m:section xmlns:m='urn:...' xmlns='urn:something-else...'>
<title>I've got a namespace</title>
</m:section>

Remember, name matching is based on local name and namespace name alone!!!
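
The two cases can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree module. The URIs 'urn:example:m' and 'urn:example:other' are placeholders standing in for the elided ones in the slides.

```python
import xml.etree.ElementTree as ET

no_ns = ET.fromstring(
    "<m:section xmlns:m='urn:example:m'>"
    "<title>No Namespace</title></m:section>"
)
default_ns = ET.fromstring(
    "<m:section xmlns:m='urn:example:m' xmlns='urn:example:other'>"
    "<title>I've got a namespace</title></m:section>"
)

# An unprefixed name test only matches a 'title' in no namespace.
found = no_ns.find("title")
missed = default_ns.find("title")

# The namespaced 'title' is reachable once its namespace is named
# (ElementTree's {uri}local notation is used here).
qualified = default_ns.find("{urn:example:other}title")

print(found is not None, missed is None, qualified.text)
```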

#7

Wildcards

The '*' (asterisk) can be used to wildcard names.
Elements: All the elements contained in a 'contents' element.
```
contents/*
```
Attributes: All the attributes of 'para'.
```
para/@*
```
Namespaces can also be used:
```
s:contents/d:*
d:para/@h:*
```
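
A brief sketch of the element wildcard, assuming Python's standard xml.etree.ElementTree module and a made-up document. That module has no attribute wildcard, so the element's attribute mapping is used here as a rough stand-in for 'para/@*'.

```python
import xml.etree.ElementTree as ET

contents = ET.fromstring(
    '<contents>'
    '<title>T</title>'
    '<para id="p1" class="intro">One</para>'
    '</contents>'
)

# '*' selects every child element of the context node (contents/*).
children = [e.tag for e in contents.findall("*")]

# Stand-in for para/@*: all attributes of the 'para' element.
attrs = dict(contents.find("para").attrib)

print(children, attrs)
```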

#8

Context Node

Evaluation is always with respect to a context node.
You can address the context node as '.' (period).
For example, the attributes of the context node:
```
./@*
```
The context node is implicit.
For example, these are equivalent:
```
contents/para
./contents/para
```
The context node does not have to be an element.

#9

Parent and Ancestors

From the context node you can access your parent and ancestors.
Just like directories, '..' represents the parent.
You can go back many levels:
```
../../section
```
This selects the 'section' children of the context node's parent's parent.
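
A minimal sketch of stepping upwards, assuming Python's standard xml.etree.ElementTree module and an invented document. That module can only follow '..' within the subtree it was asked to search, so the path starts at 'doc' and first descends to 'para' before going back up two levels.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    "<section id='s1'/>"
    "<chapter><para>One</para></chapter>"
    "</doc>"
)

# From each 'para': up to 'chapter', up to 'doc', then down to 'section'.
sections = doc.findall(".//para/../../section")
ids = [s.get("id") for s in sections]

print(ids)
```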

#10

Conditional Matching

Predicates on the step allow conditions to be specified.
They follow the step and are wrapped in square brackets ('[' and ']').
For example:
```
para[@id='mine']
```
selects 'para' elements where the 'id' attribute has value 'mine'.
There is a whole wealth of operators (including boolean logic) that can be used.
You can also have sub-expressions:
```
contents[para/@id='mine']
```
This selects a 'contents' element that has a child 'para' with an attribute 'id' of value 'mine'.
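
Both forms can be tried in a short sketch, assuming Python's standard xml.etree.ElementTree module and a made-up document. That module does not support nested paths inside a predicate, so the second query approximates contents[para/@id='mine'] by finding the matching 'para' and stepping up to its parent.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    "<contents><para id='mine'>One</para><para id='yours'>Two</para></contents>"
    "<contents><para id='other'>Three</para></contents>"
    "</doc>"
)

# para[@id='mine'], evaluated under each 'contents' element.
mine = doc.findall("contents/para[@id='mine']")

# Approximation of contents[para/@id='mine']: the parents of the matches.
owners = doc.findall("contents/para[@id='mine']/..")

print([p.text for p in mine], len(owners))
```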

#11

Skipping Levels

You can match elements that aren't direct children with the "//" (double forward slash).
This looks through the descendants of the "current context".
For example:
```
/section//cite
```
will match all 'cite' elements that are descendants of 'section'. But:
```
//cite
```
will match all 'cite' elements in the document.
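
The difference between the two searches can be sketched as follows, assuming Python's standard xml.etree.ElementTree module and an invented document. Because that module evaluates paths relative to an element, './/cite' on the document element stands in for '//cite'.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    "<section><para><cite>A</cite></para><cite>B</cite></section>"
    "<appendix><cite>C</cite></appendix>"
    "</doc>"
)

# 'cite' descendants of the 'section' child only.
in_section = [c.text for c in doc.findall("section//cite")]

# Every 'cite' below the document element.
everywhere = [c.text for c in doc.findall(".//cite")]

print(in_section, everywhere)
```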

#12

Special Functions

There are some special node tests, written like functions, that can be used as steps.

Function	Result
node()	Matches any kind of node.
text()	Matches text.
processing-instruction()	Matches a processing-instruction.
comment()	Matches a comment.

For example:
```
para/node()
```
matches all the children of a 'para' element including comments, text, and processing instructions.

#13

The Real Story

This is all just an abbreviated syntax.
There is a lot more...
We'll start by explaining the "axes" of a document.

#14

Trees and XML

For the computer scientists:

Formally, a tree is a connected, acyclic, undirected graph.
It would be nice if XML were a rooted positional n-ary tree.
XML isn't that simple.
But the parent-child relationships in the infoset do form a tree.
Attributes and namespaces mess this up a bit.

#15

Relationships in a Tree

Figure 1. Relationships from the red node.

#16

Additional XML Relationships

Attributes:
- Each element can have attributes.
- Attribute info items aren't children.
Namespaces:
- Each element can have in-scope namespaces.
- Namespace info items aren't children.

#17

Axes are Directions on Relationships

Axes are just a traversal of a relationship.
Some are tree relationships:
- ancestor, ancestor-or-self
- parent, child, self
- descendant, descendant-or-self
- following, following-sibling
- preceding, preceding-sibling

And some extras:
- attribute
- namespace

#18

Axis Syntax

Steps can be preceded by an axis name and a double colon:

contents/child::para
para/preceding-sibling::*
ancestor::section/title

If that relationship doesn't exist, you get an empty node set.

#19

Axis Specifics

Each axis has:
- A direction of forward or reverse.
- A principal node type: one of attribute, namespace, or element.
The direction refers to the order in which items will be traversed.

#20

Principal Node Type

The principal node type refers to what a name matches:
- On the 'child' axis, a name matches an element.
- On the 'attribute' axis, a name matches an attribute.
Principal node type "special cases" the extra relationships:
- Only the attribute axis has type 'attribute'.
- Only the namespace axis has type 'namespace'.
- Everything else has type 'element'.

#21

Axis Direction

Reverse Axes:
- ancestor
- ancestor-or-self
- preceding
- preceding-sibling
Everything else has a forward direction.

#22

Axis Direction - Example

For example, given the following:

<doc>
<a/><b/><c/>
<target/>
<d/><e/><f/>
</doc>

These expressions evaluate:

target/preceding-sibling::* → elements 'c' 'b' 'a'.

target/following-sibling::* → elements 'd' 'e' 'f'.
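
Python's standard xml.etree.ElementTree module does not implement these axes, so the sketch below walks the siblings by hand to show the two traversal orders. The helper names preceding_siblings and following_siblings are hypothetical, written only for this illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<doc><a/><b/><c/><target/><d/><e/><f/></doc>")


def preceding_siblings(parent, node):
    """Siblings before `node`, nearest first (reverse axis order)."""
    children = list(parent)
    index = children.index(node)
    return children[:index][::-1]


def following_siblings(parent, node):
    """Siblings after `node`, nearest first (forward axis order)."""
    children = list(parent)
    index = children.index(node)
    return children[index + 1:]


target = doc.find("target")
before = [e.tag for e in preceding_siblings(doc, target)]
after = [e.tag for e in following_siblings(doc, target)]

print(before, after)
```

In both helpers the first item is the sibling closest to 'target', which is the order in which each axis is traversed.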

#23

Abbreviated Syntax Equivalences

Abbreviation	Equivalence
../name	parent::node()/child::name
name	child::name
//name	/descendant-or-self::node()/child::name
.	self::node()
*	child::*
@*	attribute::*
@name	attribute::name

#24

What use is this?

XPath is used extensively in XSLT to process and transform XML documents:

<xsl:transform version='1.0' 
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
<html>
<head><xsl:apply-templates select="doc/title"/></head>
<body>
<xsl:apply-templates select="doc/contents"/>
</body>
</html>
</xsl:template>

</xsl:transform>

XML Schema and other standards use XPath for similar matching needs.
Many programming APIs & commercial products provide XPath for traversing and manipulating XML.