Schema Validation and the PSVI

#1

Schema Validation & Outcomes

A schema processor validates a document by:
1. Loading a complete set of schema components--possibly from some number of schema documents.
2. Validating an infoset with those components.
3. Augmenting the infoset by adding its own properties that signify validity and type information.
Both the input and the output are infosets.
The output contains the PSVI--which is an augmentation of the original infoset.
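
The first two steps can be sketched with the JAXP validation API that ships with the JDK. This is only an illustration, not how any particular processor works internally: the class name ValidateSketch and the one-element schema are made up for the example, and JAXP exposes only a small part of the PSVI.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name and the schema are assumptions for this sketch.
class ValidateSketch {
    // A made-up schema with a single global element of type xs:int.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='count' type='xs:int'/>"
      + "</xs:schema>";

    static boolean isValid(String xml) throws Exception {
        // Step 1: load the schema components from a schema document.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        // Step 2: validate an instance against those components.
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // With no ErrorHandler set, validation errors are thrown.
            // Ill-formed XML also ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<count>42</count>"));   // prints true
        System.out.println(isValid("<count>many</count>")); // prints false
    }
}
```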

#2

The PSVI

PSVI: Post Schema Validation Infoset
It is a set of properties and info items that define schema outcomes.
It also provides components (info items) for the schema definitions/declarations themselves.

#3

Assessment Outcomes

When validation is applied, the validity of an element or attribute is expressed as an Assessment Outcome.
It is a set of properties:
- [validity] - expresses whether the item is valid with values 'valid', 'invalid', or 'notKnown'.
- [validation attempted] - indicates how much validation was performed, with values 'full', 'partial', or 'none'.
- [validation context] - the nearest element with a global schema declaration (this handles local declarations).
- [schema specified] - indicates whether the value was defaulted by the schema with values 'infoset' or 'schema'.
During validation, some elements may be skipped because they occur in wildcarded content (e.g. the content of xs:documentation).
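
As a rough illustration of [schema specified], the JAXP ValidatorHandler in the JDK reports through TypeInfoProvider.isSpecified() whether an attribute came from the document or was added by the validator. The class name, the schema, and the 'item' element below are invented for this sketch, and JAXP covers only a fraction of the assessment outcome.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; names and schema are assumptions for this sketch.
class SchemaSpecifiedSketch {
    // Made-up schema: 'unit' has a default value, 'qty' does not.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='item'><xs:complexType>"
      + "<xs:attribute name='unit' type='xs:string' default='each'/>"
      + "<xs:attribute name='qty' type='xs:int'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:schema>";

    // Maps "name=value" to 'infoset' or 'schema', mirroring [schema specified].
    static Map<String, String> attributeOrigins(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        ValidatorHandler validator = schema.newValidatorHandler();
        TypeInfoProvider types = validator.getTypeInfoProvider();
        Map<String, String> origins = new LinkedHashMap<>();
        // This handler sees the events after validation, including defaulted attributes.
        validator.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    // isSpecified() is false when the validator added the attribute.
                    origins.put(atts.getLocalName(i) + "=" + atts.getValue(i),
                                types.isSpecified(i) ? "infoset" : "schema");
                }
            }
        });
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(validator); // the parser feeds the validator
        reader.parse(new InputSource(new StringReader(xml)));
        return origins;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attributeOrigins("<item qty='3'/>"));
    }
}
```

Here the document supplies only 'qty', so 'unit' should be reported as coming from the schema.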

#4

What gets validated?

Elements get validated in that:
- What attributes have occurred is checked against those allowed and those required.
- Attributes are validated.
- If the content is simple typed, the character children are checked against this type.
- The order of the element children is checked against the type of the element.
- Element children are validated against their declarations.
Attributes get validated against their type and "use" (defaults, optional vs. required, etc.).
Assessment starts by finding an element declaration and a processing mode of 'lax' or 'strict'.

#5

Processing Modes

A schema processor can process in one of three modes:
- skip - no assessment is performed.
- lax - apply assessment if you find a declaration.
- strict - all elements/attributes must have declarations.
Usually processors start with "strict" assessment.
Starting with "skip" doesn't make much sense as nothing would be validated.
Wildcards in an element's content model can specify a processing mode different from the one with which validation was initiated.

#6

Wildcards in Content Models

An element can have a wildcard for its child content.

This is specified via the 'any' element:

<xs:element name="description">
<xs:complexType>
<xs:sequence>
<xs:any namespace="##other"
        processContents="lax"
        minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>

The 'processContents' attribute can have values 'strict', 'lax', or 'skip' and defaults to 'strict'.
'processContents' controls whether the schema processor must find a schema declaration for the contained elements.
The 'namespace' attribute specifies the allowed element namespaces. Its value is '##any', '##other', or a list of URIs that may also contain '##targetNamespace' and '##local':
- ##other - any namespace other than the target namespace.
- ##targetNamespace - the target namespace.
- ##local - no namespace.
- ##any - any namespace, which is the default.

XHTML content wildcard:

<xs:element name="description">
<xs:complexType>
<xs:sequence>
<xs:any namespace="http://www.w3.org/1999/xhtml"
        processContents="lax"
        minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
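
To see what 'processContents' changes in practice, here is a small sketch using the JDK's JAXP validation API. The class name, the 'urn:example:notes' namespace, and the 'note' element are invented for the example; the same instance is validated against three variants of the wildcard, and no declaration for 'note' exists in any of them.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical class; names and namespaces are assumptions for this sketch.
class WildcardSketch {
    // Builds the 'description' schema with the requested processContents value.
    static String schemaFor(String processContents) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
             + "<xs:element name='description'><xs:complexType><xs:sequence>"
             + "<xs:any namespace='##other' processContents='" + processContents + "'"
             + " minOccurs='0' maxOccurs='unbounded'/>"
             + "</xs:sequence></xs:complexType></xs:element>"
             + "</xs:schema>";
    }

    static boolean isValid(String processContents, String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(schemaFor(processContents))));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error reported by the default error handling
        }
    }

    public static void main(String[] args) throws Exception {
        // 'note' is in another namespace and has no declaration anywhere.
        String doc = "<description><note xmlns='urn:example:notes'>hi</note></description>";
        System.out.println("strict: " + isValid("strict", doc)); // prints strict: false
        System.out.println("lax:    " + isValid("lax", doc));    // prints lax:    true
        System.out.println("skip:   " + isValid("skip", doc));   // prints skip:   true
    }
}
```

Only 'strict' requires a declaration for the matched element, so it is the only mode that rejects the undeclared 'note'.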

#7

Schema Related Properties

These properties are available on elements and attributes:
- [schema default] - if specified, the schema default value.
- [schema error code] - if the item is invalid, this property tells you why.
- [schema normalized value] - if the content is simple typed, the lexical value of the simple type after normalization.
Some of these are contextual and depend on the declaration.

#8

Element/Attribute Declaration

These properties provide the actual declaration from the schema:
- [attribute declaration] - the attribute declaration used to validate the attribute.
- [element declaration] - the element declaration used to validate the element.
Only the appropriate property occurs on the element or attribute.
Their value is a "schema component" as specified by the recommendation.
Think of the value as another kind of info item.

#9

Element/Attribute Schema Information

Elements are the only infoset item to contain pointers to the schema.
[schema information] - a set of schema components for each namespace of any schema used.
Each member of the set is a namespace schema information item with:
- [schema namespace] - the schema target namespace
- [schema components] - a set of schema components (the element/attribute declarations, type definitions, etc.)
- [schema documents] - a URL location or document info item of each schema document representation of the schema.

#10

Element/Attribute Type Information

All these properties occur on element or attribute info items:
- [type definition type] - 'simple' or 'complex'
- [type definition namespace] - The target namespace of the type.
- [type definition anonymous] - 'true' if it was declared locally.
- [type definition name] - The local name of the type.
- [member type definition namespace] - For unions, the namespace of the type definition used.
- [member type definition anonymous] - For unions, indicates whether it was declared locally.
- [member type definition name] - For unions, the local name of the type.
The type definition is found by...
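
The name and namespace properties have a rough counterpart in JAXP: TypeInfoProvider.getElementTypeInfo() returns a DOM TypeInfo with getTypeName() and getTypeNamespace(). The sketch below is an approximation using the JDK only; the class name, the 'urn:example:orders' namespace, and the schema are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; names, namespace, and schema are assumptions for this sketch.
class TypeInfoSketch {
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " xmlns:o='urn:example:orders' targetNamespace='urn:example:orders'"
      + " elementFormDefault='qualified'>"
      + "<xs:complexType name='OrderType'><xs:sequence>"
      + "<xs:element name='count' type='xs:int'/>"
      + "</xs:sequence></xs:complexType>"
      + "<xs:element name='order' type='o:OrderType'/>"
      + "</xs:schema>";

    // Returns one "element : {type namespace}type name" entry per element, in document order.
    static List<String> elementTypes(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        ValidatorHandler validator = schema.newValidatorHandler();
        TypeInfoProvider types = validator.getTypeInfoProvider();
        List<String> seen = new ArrayList<>();
        validator.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Type of the element whose start tag is being reported.
                TypeInfo type = types.getElementTypeInfo();
                seen.add(local + " : " + (type == null ? "unknown"
                    : "{" + type.getTypeNamespace() + "}" + type.getTypeName()));
            }
        });
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(validator); // the parser feeds the validator
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<order xmlns='urn:example:orders'><count>42</count></order>";
        for (String line : elementTypes(doc)) {
            System.out.println(line);
        }
    }
}
```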

#11

Schema Components

Every schema declaration or definition translates to a schema component.
This is what the [schema components] property contains.
Every part of each declaration/definition is mapped.
They are designed so that XML Schema validation can "run" upon them.
This means nothing is hidden or magic.
You'll find them in the XML Schema specification--which isn't very easy to read.

#12

Xerces

Xerces provides access to both the PSVI and Schema Components.
Check out the documentation at the Apache website.
The main implementation class for loading schemata is org.apache.xerces.impl.xs.XMLSchemaLoader

#13

Using Xerces' XSModel to get Schema Components

You can load schemata quite easily with Xerces:

Just import the right classes:

import java.io.File;
import org.apache.xerces.util.*;
import org.apache.xerces.xs.*;
import org.apache.xerces.impl.xs.XMLSchemaLoader;

Instantiate the XSLoader implementation:

XMLSchemaLoader loader = new XMLSchemaLoader();

Set up the catalog:

String [] catalogs = { "catalog.xml" };
XMLCatalogResolver resolver = new XMLCatalogResolver(catalogs);
loader.setEntityResolver(resolver);

Load the schema:

XSModel model = loader.loadURI(new File("myschema.xsd").toURI().toString());

For example, list all the types:

XSNamedMap map = model.getComponents(XSConstants.TYPE_DEFINITION);
for (int j=0; j<map.getLength(); j++) {
   XSObject o = map.item(j);
   System.out.println("{"+o.getNamespace()+"}"+o.getName());
}

You can browse the javadoc for the package org.apache.xerces.xs at the Apache website.

#14

Parsing and Using the PSVI in Xerces

The XNI (XML Native Interface) provides direct access to the PSVI along with the infoset.
You can look at all the type information and validity information as the document goes by.
But this is a low-level interface.

#15

XNI Parsing Example

Import the right classes:

import java.io.File;
import org.apache.xerces.util.*;
import org.apache.xerces.xs.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;
import org.apache.xerces.parsers.*;

Create the parser and set the right features (much like JAXP):

XMLParserConfiguration parser = new StandardParserConfiguration();
parser.setFeature("http://xml.org/sax/features/validation",true);
parser.setFeature("http://apache.org/xml/features/validation/schema",true);
parser.setFeature("http://apache.org/xml/features/validation/schema-full-checking",true);

Set up the catalog:

String [] catalogs = { "catalog.xml" };
XMLCatalogResolver resolver = new XMLCatalogResolver(catalogs);
parser.setEntityResolver(resolver);

Set your document and error handlers:

parser.setDocumentHandler(new MyDocumentHandler());
// DefaultErrorHandler is a utility class from Xerces that sends
// errors to stderr.
parser.setErrorHandler(new DefaultErrorHandler());

Parse your document:

String uri = new File("doc.xml").toURI().toString();
XMLInputSource source = new XMLInputSource(null,uri,uri);
parser.parse(source);

#16

XNI PSVI for Elements

You can get the PSVI information in either startElement() or endElement() on the XMLDocumentHandler interface.
It is passed via the "augmentations" argument.

For example, we can check the validity PSVI property:

public void endElement(QName qName, Augmentations augmentations) throws XNIException
{
   // The augmentations, or the PSVI item in them, can be missing
   // (e.g. when schema validation is not enabled).
   if (augmentations == null) {
      return;
   }
   ElementPSVI psvi = (ElementPSVI)augmentations.getItem("ELEMENT_PSVI");
   if (psvi == null) {
      return;
   }
   // The type definition can be null, e.g. when no declaration was found.
   XSTypeDefinition typedef = psvi.getTypeDefinition();
   String type = (typedef == null)
      ? "(no type)"
      : "{"+typedef.getNamespace()+"}"+typedef.getName();
   String name = "Element {"+qName.uri+"}"+qName.localpart;
   switch (psvi.getValidity()) {
      case ItemPSVI.VALIDITY_VALID:
         System.out.println(name+"\n\tvalid\n\tagainst "+type);
         break;
      case ItemPSVI.VALIDITY_NOTKNOWN:
         System.out.println(name+"\n\tvalidity not known.");
         break;
      case ItemPSVI.VALIDITY_INVALID:
         System.out.println(name+"\n\tNOT valid\n\tagainst "+type);
         break;
   }
}

Keep in mind that validity is only available at the end of the element.