Schema Best Practices and Packaging Schema Libraries

#1

Best Practices

Schema design:
- Local types vs. global types.
- Local elements vs. global elements.
- Loosely vs. tightly constrained.
Namespace management.
Instances & Versioning

#2

Simple Things First

You need to give your schema a name and a target namespace.
You should decide on:
- standard file naming
- namespace URI structure
- URN vs. URL (both are URIs).
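
As a minimal sketch, a schema document that declares its target namespace might start like this. The namespace URI and the file name `example.xsd` are placeholders invented for illustration; they are not values from this course.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file: example.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.org/schema/example/200403"
           targetNamespace="http://example.org/schema/example/200403">

  <!-- Global element declared in the target namespace -->
  <xs:element name="note" type="xs:string"/>

</xs:schema>
```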

#3

Namespace URI

Besides the "base URI", your namespace URI has structure:
- http://www.w3.org/1999/xhtml
  
  Means: "xhtml published in 1999 by the W3C"
- urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:course:200401:en
  
  Means: "The English version of the course schema published in 01-2004 by the CDE at Berkeley"
You need to choose between URL-style URIs and URNs.

#4

The "Dark Side" is swaying me...

In some people's minds (maybe a strange place), a namespace is a name and so using URLs is bad!
Other people say that a URI needs to be resolvable, and URNs are somewhat dysfunctional when it comes to resolving them to a resource.
Well-respected people have been changing my thinking on using URNs because:
- You can't get a NID (the registered namespace identifier that follows 'urn:').
- You always need a catalog to resolve URNs as there is no default resolving mechanism.
- It is really super nice, actually fabulous, to type a namespace name into your browser and get information about the schema. ...maybe even the schema itself.
- The W3C doesn't use them and so why should you?

#5

Public Identifiers

Officially, public identifiers have three basic parts:
1. An owner
2. An owner-specific name.
3. A "text class" (e.g. schema)
But you can do what you want.
If you are a standards consortium, you'd probably want to make it more "formal".

#6

Example Public Identifiers

You can use your Internet domain name as the owner identifier:
```
urn:publicid:IDN+cde.berkeley.edu:...
```

Separate segments by colons (':')

urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:

Add some version information:

urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200403

Maybe some language/locale information

urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200403:en

Note that spaces are encoded as '+' as spaces aren't allowed in URI values.

#7

Where to go for More

This is the only spec: RFC 3151
Here's an article on formal public identifiers, but it is old (1994)

http://www.oasis-open.org/cover/petersonFPI-TAG7030101.html
The SGML standard (ISO 8879) -- good luck on that one!
Book: The SGML Handbook - by Charles Goldfarb

#8

Structuring a Namespace URI

People use URI values for many things.
You want to be able to look at the URI and understand what it represents.
Some common metadata in a namespace URI:
- A keyword like 'schema' to denote that it is a schema.
- A path-like structure to "describe" the namespace.
- An owner identifier (e.g. domain name, e-mail address, etc.)
- Version information (e.g. a date or version number)
- Locale information (e.g. 'us', 'en', etc.)

#9

Versioning & Instances

If you change the namespace of your schema:
- You can't validate existing instances.
- XSLT/XPath/etc. will have to change.
- Applications may have to change.
Rules for changing schemata:
- If your change is compatible with existing instances, consider whether to keep the namespace the same.
- If your change is incompatible with existing instances, you should change the namespace.
- Ultimately, content and applications should always remain valid against the schema they were authored against.
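
One way to picture these rules, as a sketch with made-up namespace URIs: a compatible revision keeps the target namespace and only bumps the optional `version` attribute on `xs:schema`, while an incompatible revision is published under a new namespace. The `order` vocabulary here is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Compatible revision: same targetNamespace as before, so existing
     instances, XPath expressions, and stylesheets keep working.
     Only the informational version attribute changes. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/schema/order/200401"
           version="1.1">

  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
        <!-- New in 1.1: optional, so 1.0 instances remain valid -->
        <xs:element name="comment" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
<!-- An incompatible revision (for example, making comment required)
     would instead be published under a new targetNamespace such as
     http://example.org/schema/order/200403 (hypothetical). -->
```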

#10

Organizing Your Schema Documents

You should consider how many schema parts you have.
Maybe you'll want different directories for different namespaces.
But your directory structure should be similar to how you intend to:
- distribute your schemata
- use them in tools
- use them in applications.
Changing 'schemaLocation' attributes on 'import' and 'include' elements can be a real pain!
Changing 'schemaLocation' attributes will probably introduce bugs and so orchestrate things so you don't have to do that.

#11

A Simple Recommendation

A recommendation:
- Make a folder called 'schemas'
- Put an OASIS XML Catalog in that directory called 'catalog.xml'.
- Put all your schemata in this directory or subdirectories of this directory.
- Make all your schemaLocation attributes relative to this directory structure.
- Put entries in the catalog, relative to this directory structure, for all your namespace names.
The result is a portable directory that anyone can use:
- All the schemas are contained in a parent directory.
- The catalog lists the namespaces contained.
- All the schema documents refer to each other with relative URIs and so the URL starting the schema processing should work without special non-standard handling of the URI values.
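
A catalog for such a directory could look roughly like the sketch below. The `uri` entries map namespace names to schema documents; relative `uri` values are resolved against the location of the catalog file. The file paths and the second namespace name are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- schemas/catalog.xml: maps namespace names to files in this directory -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- A URN namespace name resolved to a local schema document -->
  <uri name="urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200403:en"
       uri="coursedoc/slides.xsd"/>

  <!-- A URL-style namespace name resolved the same way -->
  <uri name="http://example.org/schema/example/200403"
       uri="example/example.xsd"/>

</catalog>
```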

#12

Yours vs. Theirs

You may want to separate other schemata (e.g. XHTML, UBL, etc.) from your schemata.
Version numbers in file or directory names are a double-edged sword:
- If you have version numbers, it is obvious which version you are using.
- When you change versions, you may have to modify the schemaLocation pointers in your schemata.
Always separate other people's schemata that have multiple schema documents:
- Use different directories so that you don't have file name collisions.
- Subdirectories are great when you want to bundle the other schemata.
- Sibling directories are better when your schemata augment other existing schemata.

#13

Handling Imports

When you import schemata, you may specify a 'schemaLocation' attribute.
Keep in mind that the schemaLocation attribute is a hint and that schema processors can use other methods of locating the schema documents for that namespace.
The value of 'schemaLocation' isn't always the same when you distribute schemata--but you should avoid having to change it.
My recommendations:
- Use a 'schemaLocation' attribute when you are distributing the imported schema documents.
- You can choose to use a catalog by:
  - Not using the schemaLocation attribute on the import.
  - Put the imported namespace in the catalog.
  - But then everyone will be required to use the catalog.
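
The two options can be sketched side by side. The namespace names and the relative path below are hypothetical; only the `xs:import` syntax itself comes from XML Schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/schema/example/200403">

  <!-- Option 1: the imported schema ships with yours, so give a
       relative location hint (hypothetical path) -->
  <xs:import namespace="http://www.w3.org/1999/xhtml"
             schemaLocation="xhtml/xhtml.xsd"/>

  <!-- Option 2: omit the hint and rely on a catalog entry for the
       namespace (hypothetical namespace) -->
  <xs:import namespace="http://example.org/schema/other"/>

</xs:schema>
```

With the second form, a processor that has no catalog (or other out-of-band configuration) has nothing to go on, which is the cost noted in the last bullet.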

#14

General Strategy for Re-usable Schemata

You want to balance these against each other:
- Re-usability - using your definitions/declarations in other schemata.
- Composition - mixing your schemata with others
- Complexity - levels in the document, optionality, occurrences, etc.
- Compactness - of definitions, declarations, and instances.
The "balancing decisions" should be made against your requirements and user community.

#15

Naming

It is ridiculous to say that because you use one naming convention over another that somehow, magically, your schema is "better".
Such arguments aren't based on fact.
You should be consistent.
Your target community should accept your naming style.
Using "XML names" directly in other applications (e.g. code, class names, database tables, etc.) may lead to problems.
If your application requires specific naming conventions, it will break!

#16

Naming Elements

Elements are what applications process.
So the name can be important for usability.
Many markup applications only use lower case and hyphens.
But some people like capitalization (e.g. NamedElement, namedElement).
Case isn't a great concept for non-European languages.
XHTML, MathML, XSLT, etc. use all lower-case names.

#17

Naming Types

It may be nice to differentiate between element naming and type naming.
This isn't required.
The syntax is the same.
There is no set precedent here.

#18

What to Make Global

You want to minimize those things that are global because:
- It reduces the number of "things" to peruse.
- It makes you consider what parts should be used by reference.
You need a sufficient number of things global because:
- It is your only way to allow extension.
- This is what people use!

#19

What to make Global

Strategy: Types are the basis of re-use and not elements.
Make all your types (complex & simple) global.

The exceptions:

Children that are lists of other elements:

<xs:complexType name="department">
<xs:sequence>
   <xs:element name="name" type="xs:string"/>
   <xs:element name="people">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="my:person" maxOccurs="unbounded"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:sequence>
</xs:complexType>

Any element whose type can be justified as always being "local" to that element (a slippery slope):

<xs:complexType name="pipeline">
<xs:sequence>
   <xs:element name="version-information" minOccurs="0">
      <xs:complexType mixed="true">
          <xs:sequence>
             <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:anyAttribute processContents="lax"/>
      </xs:complexType>
   </xs:element>
</xs:sequence>
</xs:complexType>

#20

Global Elements

The global elements should be:
- Document elements.
- Substitution group elements.
- Major components.
- "Utility" elements (e.g. links, math, etc.)
All other elements should be declared locally:
- Using a global type reference.
- Using a local type definition--but think hard about why.
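
A small sketch of this split, using invented names and a placeholder namespace: one global document element, global types, and local elements that point at those types.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.org/schema/example/200403"
           targetNamespace="http://example.org/schema/example/200403"
           elementFormDefault="qualified">

  <!-- Global: a document element -->
  <xs:element name="department" type="my:Department"/>

  <!-- Global types are the unit of re-use -->
  <xs:complexType name="Department">
    <xs:sequence>
      <!-- Local elements using a global type reference -->
      <xs:element name="manager" type="my:Person"/>
      <xs:element name="member" type="my:Person"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="Person">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```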

#21

Qualified vs Unqualified

The more I think about it...
...the more 'qualified' makes sense.
Unqualified causes your developers to go crazy:
- Where is the element defined?
- Which children are qualified and which are not?
Users won't care because they are supposed to have tools, right?!?
My recommendation:
- Make elements qualified (not the default).
- Make attributes unqualified (the default).
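
In schema terms, the recommendation amounts to the two attributes on `xs:schema` shown in this sketch. The namespace URI and the element names are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/schema/example/200403"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- Local, but in the target namespace because of
             elementFormDefault="qualified" -->
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
      <!-- In no namespace: the attribute default -->
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
<!-- A matching instance: every element is in the one namespace.
  <person xmlns="http://example.org/schema/example/200403" id="p1">
    <name>Ada</name>
  </person>
-->
```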

#22

How Many Namespaces?

There is a natural tendency to have one target namespace for every "kind" of schema construct.
That can cause problems (e.g. rainbows of namespaces).
It is better to have one target namespace for every "major area".
It is best to have just one.
But if you find conflicts, you probably need to have another namespace somewhere.
Do not alias someone else's elements into your namespace (e.g. XHTML in your namespace without XHTML's namespace).

#23

Documenting your Schemata

You should document your schema before you release it.
The "minimum" documentation:
- The target namespace with one annotation for the schema.
- Each element with at least one annotation.
- Each type with at least one annotation.
- For simple type enumerations, document each enumeration value.
Make sure you describe the use patterns, intended use of types, meanings, optionality, etc.
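
As an illustration of the minimum, here is a sketch of a schema-level annotation plus a simple type whose enumeration values are each documented. The type name, values, and wording are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/schema/example/200403">

  <xs:annotation>
    <xs:documentation>Hypothetical schema describing course documents.</xs:documentation>
  </xs:annotation>

  <xs:simpleType name="Status">
    <xs:annotation>
      <xs:documentation>The publication status of a document.</xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:enumeration value="draft">
        <xs:annotation>
          <xs:documentation>Still being edited; not for distribution.</xs:documentation>
        </xs:annotation>
      </xs:enumeration>
      <xs:enumeration value="final">
        <xs:annotation>
          <xs:documentation>Approved and released.</xs:documentation>
        </xs:annotation>
      </xs:enumeration>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```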

#24

Documenting Element Declarations or Complex Type Definitions

You really should document each:
- Attribute declaration.
- Embedded element declaration.
- Unique, Key, or Keyref Constraint.

An example:

<xs:schema xmlns:h="http://www.w3.org/1999/xhtml" ...>
...
<xs:element name="person" type="my:Person">
  <xs:annotation>
  <xs:documentation>
    <h:p>This element is used to describe a person's information such as name, email, address.</h:p>
  </xs:documentation>
  </xs:annotation>
</xs:element>

<xs:complexType name="Person">
  <xs:sequence>
    <xs:element name="name" type="xs:string">
      <xs:annotation>
        <xs:documentation>
        <h:p>This element contains the name of the person.</h:p>
        </xs:documentation>
      </xs:annotation>
    </xs:element>
    <xs:element name="email" type="xs:string" minOccurs="0">
      <xs:annotation>
        <xs:documentation>
        <h:p>This element contains the person's email address in full domain format (e.g. you@yourdomain.com).</h:p>
        </xs:documentation>
      </xs:annotation>
    </xs:element>
  </xs:sequence>
  <xs:attribute name="id" type="xs:ID">
     <xs:annotation>
       <xs:documentation>
       <h:p>This attribute is used to provide an identifier for cross referencing within the document.</h:p>
       </xs:documentation>
     </xs:annotation>
  </xs:attribute>
</xs:complexType>

</xs:schema>
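
Identity constraints can carry documentation in the same way. The following is a sketch only, with an invented `people` list and a placeholder namespace, showing an annotated `xs:key`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.org/schema/example/200403"
           targetNamespace="http://example.org/schema/example/200403"
           elementFormDefault="qualified">

  <xs:element name="people">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="email" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- The annotation must come first inside the key -->
    <xs:key name="personByEmail">
      <xs:annotation>
        <xs:documentation>Each person must have an email attribute, and no two people in the list may share the same value.</xs:documentation>
      </xs:annotation>
      <xs:selector xpath="my:person"/>
      <xs:field xpath="@email"/>
    </xs:key>
  </xs:element>

</xs:schema>
```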

#25

Rules for Documentation

Document the schema so that everyone knows the purpose and scope of your schema.
Document all the global components of your schema.
Document your includes, imports, and redefines so the next schema author knows why they are there.
Document local type definitions as they aren't going to be documented elsewhere.
Any enumeration or options should be documented so a user can choose between them.
When there is any doubt, add some kind of documentation.
Document your assumptions. Keep in mind that the schema definitions and declarations are not enough. You might think a name is obvious, but it might have different definitions in different contexts.
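
For includes and imports, the annotation goes inside the element itself. A sketch, with hypothetical file names and reasons:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/schema/example/200403">

  <xs:import namespace="http://www.w3.org/1999/xhtml"
             schemaLocation="xhtml/xhtml.xsd">
    <xs:annotation>
      <xs:documentation>Imported so that description elements can contain XHTML paragraphs and links.</xs:documentation>
    </xs:annotation>
  </xs:import>

  <xs:include schemaLocation="common-types.xsd">
    <xs:annotation>
      <xs:documentation>Shared simple types (dates, identifiers) used by every module.</xs:documentation>
    </xs:annotation>
  </xs:include>

</xs:schema>
```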

#26

What Goes in a Release

You should have the following:
- All the schema documents necessary to validate (except dependencies not packaged).
- A main "readme" document that describes:
  1. Technology assumptions and dependencies.
  2. Any information about where to get additional schema documents if you aren't going to package dependencies.
  3. How to install.
  4. A description of where the parts are in the distribution.
  5. How to use the schemata.
- Schema documentation.
- Schema reference materials (e.g. xsddoc) generated from the schemata.
- Valid examples along with descriptions of the examples.
- License, ownership, and where to get more information.
- A catalog for using the schemata.

#27

Structuring for Release

This is my suggested structure:

/readme.html (or index.html)
/catalog.xml (XML catalog for using schemata)
/license.html (your licensing)
/schemas/
   yourschemas.xsd
   otherschemas.xsd
   somemodule-0.1/
      module.xsd
/doc/
   index.html (general intro)
   reference/
      index.html (schema reference - xsddoc)
/examples/
   guide.html (guide to samples)
   sample1.xml
   sample2.xml
   ...