Schema Best Practices
R. Alexander Milowski
milowski@sims.berkeley.edu
School of Information Management and Systems
#1
Best Practices
Schema design:
Local types vs. global types.
Local elements vs. global elements.
Loosely vs. tightly constrained.
Namespace management.
Instances & Versioning
#2
Simple Things First
You need to give your schema a name and a target namespace.
You should decide on:
standard file naming
namespace URI structure
URN vs. URL.
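A minimal sketch of these first decisions, assuming a file named course.xsd (a hypothetical naming convention) and reusing the course URN from the Namespace URI slide as the target namespace:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- course.xsd: hypothetical file name, chosen to match the last
     meaningful segment of the namespace name -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:c="urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:course:200401:en"
           targetNamespace="urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:course:200401:en">
  <!-- element and type declarations go here -->
</xs:schema>
```

The `c` prefix is only an illustration; any prefix bound to the target namespace works for referring to the schema's own components.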
#3
Namespace URI
Besides the "base URI", your namespace URI has structure:
http://www.w3.org/1999/xhtml
Means: "xhtml published in 1999 by the W3C"
urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:course:200401:en
Means: "The English version of the course schema published in January 2004 by the CDE at Berkeley"
I recommend using a URN rather than a URL (e.g. http://somewhere.com/).
#4
Public Identifiers
Officially, public identifiers have three basic parts:
An owner
An owner-specific name.
A "text class" (e.g. schema)
But you can do what you want.
If you are a standards consortium, you'd probably want to make it more "formal".
#5
Example Public Identifiers
You can use your Internet domain name as the owner identifier:
urn:publicid:IDN+cde.berkeley.edu:...
Separate segments by colons (':')
urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:
Add some version information:
urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200403
Maybe some language/locale information
urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200403:en
Note that spaces are encoded as '+' because spaces aren't allowed in URI values.
#6
Where to go for More
This is the only spec: RFC 3151
Here's an article on formal public identifiers, but it is old (1994)
The SGML standard (ISO 8879) -- good luck on that one!
Book: The SGML Handbook - by Charles Goldfarb
#7
Organizing Your Schema Documents
You should consider how many schema components you have.
Maybe you'll want different directories for different namespaces.
But your directory structure should be similar to how you intend to:
distribute your schemata
use them in tools
use them in applications.
Changing 'schemaLocation' attributes on 'import' and 'include' elements can be a real pain!
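As a rough illustration of why the layout matters, the relative paths in 'schemaLocation' mirror the directory structure, so moving a file means editing every reference to it. The file names, directory names, and the urn:example:slides namespace below are all hypothetical:

```xml
<!-- Fragment of a schema document stored at schemas/course.xsd
     (assumed layout); the xs prefix is bound on the enclosing xs:schema -->

<!-- Same target namespace, same directory: xs:include -->
<xs:include schemaLocation="course-types.xsd"/>

<!-- Different namespace, bundled in a versioned subdirectory: xs:import -->
<xs:import namespace="urn:example:slides"
           schemaLocation="slides-0.1/slides.xsd"/>
```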
#8
Yours vs. Theirs
You may want to separate other schemata (e.g. XHTML, UBL, etc.) from your schemata.
Including version numbers in either file names or directories is a good idea.
Always separate other people's schemata that have multiple schema documents:
Use different directories.
Subdirectories are great when you want to bundle the other schemata.
Sibling directories are better when your schemata augment other existing schemata.
#9
Handling Imports
When you import schemata, you may specify a 'schemaLocation' attribute.
The right location isn't always the same once you distribute your schemata.
My recommendations:
Use a 'schemaLocation' attribute when you are distributing the imported schema documents.
Don't specify a 'schemaLocation' attribute when you aren't distributing them. Instead:
Use a catalog to locate the imported schema.
Document the need for the imported schema document and recommend solutions.
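The two cases might look like the fragments below. They are alternatives, not meant to appear together, and the bundled path is an assumed layout:

```xml
<!-- Case 1: the imported schema ships with yours, so point at the bundled copy -->
<xs:import namespace="http://www.w3.org/1999/xhtml"
           schemaLocation="xhtml/xhtml1-strict.xsd"/>

<!-- Case 2: the imported schema is not distributed; leave the location to the tool -->
<xs:import namespace="http://www.w3.org/1999/xhtml"/>
```

For the second case, an OASIS XML catalog can map the namespace name to a local copy. This is a sketch only: whether a given processor consults a catalog for namespace names, and how it is configured to do so, varies by tool.

```xml
<!-- catalog.xml: the uri attribute is a hypothetical local path -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://www.w3.org/1999/xhtml"
       uri="schemas/xhtml/xhtml1-strict.xsd"/>
</catalog>
```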
#10
Structuring a Namespace URI
People use URI values for many things.
You want to be able to look at the URI and understand what it represents.
Some common metadata in a namespace URI:
A keyword like 'schema' to denote that it is a schema.
A path-like structure to "describe" the namespace.
An owner identifier (e.g. domain name, e-mail address, etc.)
Version information (e.g. a date or version number)
Locale information (e.g. 'us', 'en', etc.)
#11
Versioning & Instances
If you change the namespace of your schema:
You can't validate existing instances.
XSLT/XPath/etc. will have to change.
Applications may have to change.
Rules for changing schemata:
If your change is compatible with existing instances, consider whether to keep the namespace the same.
If your change is incompatible with existing instances, you should change the namespace.
Ultimately, content and applications should always remain valid against the schema they were authored against.
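To see why a namespace change ripples through everything, compare two instances. The element names and the 200409 version segment are hypothetical; the first URN is the slides example from earlier:

```xml
<!-- Authored against the March 2004 schema -->
<slides xmlns="urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200403">
  <slide>...</slide>
</slides>

<!-- After a hypothetical incompatible revision with a new namespace:
     every element now has a different expanded name, so XPath
     expressions and XSLT match patterns written for the first
     instance no longer select anything here -->
<slides xmlns="urn:publicid:IDN+cde.berkeley.edu:schema:coursedoc:slides:200409">
  <slide>...</slide>
</slides>
```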
#12
General Strategy for Re-usable Schemata
You want to balance these against each other:
Re-usability
Composition
Complexity
Compactness
The "balancing decisions" should be made against your requirements and user community.
#13
Naming
It is ridiculous to say that because you use one naming convention over another that somehow, magically, your schema is "better".
Such arguments aren't based on fact.
You should be consistent.
Your target community should accept your naming style.
Using "XML names" directly in other applications (e.g. code, class names, database tables, etc.) may lead to problems.
If your application requires specific naming conventions, it will break!
#14
Naming Elements
Elements are what applications process.
So the name can be important for usability.
Many markup applications only use lower case and hyphens.
But some people like capitalization (e.g. NamedElement, namedElement).
Case isn't a great concept for non-European languages.
XHTML, MathML, XSLT, etc. use all lower-case names.
#15
Naming Types
It may be nice to differentiate between element naming and type naming.
This isn't required.
The syntax is the same.
There is no set precedent here.
#16
What to make Global
You want to minimize those things that are global because:
It reduces the number of "things" to peruse.
It makes you consider what parts should be used by reference.
You need a sufficient number of things global because:
It is your only way to allow extension.
This is what people use!
#17
What to make Global - My Recommendations
Strategy: Types are the basis of re-use and not elements.
Make all your types (complex & simple) global.
The only exception might be for elements that are lists of a single type/element.
The global elements should be:
Document elements.
Substitution group elements.
Major components.
"Utility" elements (e.g. links, math, etc.)
All other elements should be declared locally--typically with a global type.
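One way these recommendations could play out, as a sketch: the urn:example:course namespace and all element and type names are invented for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:c="urn:example:course"
           targetNamespace="urn:example:course"
           elementFormDefault="qualified">

  <!-- Global element: a document element -->
  <xs:element name="course" type="c:CourseType"/>

  <!-- Global types: the units of re-use -->
  <xs:complexType name="CourseType">
    <xs:sequence>
      <!-- Local elements, each declared with a global type -->
      <xs:element name="title" type="c:TitleType"/>
      <xs:element name="instructor" type="c:PersonType"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:simpleType name="TitleType">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

Another schema can extend or restrict CourseType and PersonType because they are global, while title and instructor stay out of the list of top-level components.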
#18
Qualified vs Unqualified
The more I think about it...
...the more 'qualified' makes sense.
Unqualified is nice for simple schemata with a few namespaces.
Otherwise, 'unqualified' is confusing for the developer (e.g. programmer, XSLT author, etc.)
Users won't care because they are supposed to have tools, right?!?
My recommendation:
Make elements qualified (not the default).
Make attributes unqualified (the default).
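In schema terms, the recommendation corresponds to the two form-default attributes on the schema element. The namespace and the names below are hypothetical:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:course"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:element name="course">
    <xs:complexType>
      <xs:sequence>
        <!-- Local element: in the target namespace because of 'qualified' -->
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
      <!-- Local attribute: in no namespace because of 'unqualified' -->
      <xs:attribute name="code" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A matching instance can then use a single default namespace declaration:

```xml
<course xmlns="urn:example:course" code="X101">
  <title>XML Schema</title>
</course>
```

With elementFormDefault left at 'unqualified', the local title element would have to appear in no namespace while course stays in the target namespace, which is the mix that tends to confuse developers.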
#19
How Many Namespaces?
There is a natural tendency to have one target namespace for every "kind" of schema construct.
That can cause problems (e.g. rainbows of namespaces).
It is better to have one target namespace for every "major area".
Elements give you ways to segment that "major area".
But if you find conflicts, you probably need to have another namespace somewhere.
#20
Documenting your Schemata
You should document your schema before you release it.
Document the following:
The target namespace with one annotation for the schema.
Each element with one annotation.
Each type with one annotation.
For simple type enumerations, document each enumeration value.
Make sure you describe the use patterns, intended use of types, meanings, optionality, etc.
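A sketch of where each annotation goes, using invented names and documentation text:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:c="urn:example:course"
           targetNamespace="urn:example:course"
           elementFormDefault="qualified">
  <!-- One annotation for the schema and its target namespace -->
  <xs:annotation>
    <xs:documentation xml:lang="en">Hypothetical schema for course descriptions.</xs:documentation>
  </xs:annotation>

  <!-- One annotation per element -->
  <xs:element name="course" type="c:CourseType">
    <xs:annotation>
      <xs:documentation>Document element for a single course.</xs:documentation>
    </xs:annotation>
  </xs:element>

  <!-- One annotation per type, covering intended use and optionality -->
  <xs:complexType name="CourseType">
    <xs:annotation>
      <xs:documentation>A course; 'level' may be omitted when unknown.</xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="level" type="c:LevelType" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:simpleType name="LevelType">
    <xs:annotation>
      <xs:documentation>The academic level a course is offered at.</xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <!-- One annotation per enumeration value -->
      <xs:enumeration value="undergraduate">
        <xs:annotation>
          <xs:documentation>Open to undergraduate students.</xs:documentation>
        </xs:annotation>
      </xs:enumeration>
      <xs:enumeration value="graduate">
        <xs:annotation>
          <xs:documentation>Restricted to graduate students.</xs:documentation>
        </xs:annotation>
      </xs:enumeration>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```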
#21
What Goes in a Release
You should have the following:
All the schema documents necessary to validate (except those users are expected to obtain themselves)
A main "readme" document that describes:
Dependencies/assumptions.
How to install.
A description of where the parts are in the distribution.
How to use the schemata.
Schema documentation.
Schema reference materials (e.g. xsddoc) generated from the schemata.
Valid examples.
License, ownership, and where to get more information.
A catalog for using the schemata.
#22
Structuring for Release
This is my suggested structure:
/readme.html (or index.html)
/catalog.xml (XML catalog for using schemata)
/license.html (your licensing)
/schemas/
    youschemas.xsd
    otherschemas.xsd
    somemodule-0.1/
        module.xsd
/doc/
    index.html (general intro)
    reference/
        index.html (schema reference - xsddoc)
/examples
    guide.html (guide to samples)
    sample1.xml
    sample2.xml
    ...