Week 3 Exercises:
Native XML Databases

Allow roughly one hour for each problem.

  1. Design an application that uses a native XML database to build complete documents from a set of document fragments. For example, suppose a car company builds several different models of the same car -- a luxury model with a five-speed transmission, air conditioning, and ABS brakes; a middle-of-the-road model with a four-speed transmission and ABS brakes; and a basic model with a four-speed transmission. The company builds model-specific manuals from the procedures required by that model.

    The description of your application should include:

    • An overall description of the application, including the problem it is trying to solve, how it will be used, and its software architecture.

    • A description of the documents (maintenance manuals, class schedules, mortgage contracts, magazines, etc.).

    • A very simple XML schema (DTD, XML Schema, RELAX NG schema) for the documents / fragments.

    • A description of the document fragments to be stored in the database and why these size fragments were chosen. For example, do you store whole procedures, groups of procedures, individual procedure steps, etc.?

    • A description of how fragments will be assembled into complete documents. For example, does the application use XQuery queries, XInclude, XLink, proprietary language, etc.? Also discuss why that technology was chosen.

    • What native XML database features will be used by your application and what features your application will have to add.

    • Anything else you think is relevant.

  2. del.icio.us is an application that allows users to add links to Web pages to a central repository. Each link is described by one or more keywords. For example, a person might add a link to the Infosys 290A/Section 2 home page with the keywords XML, databases, SIMS, and UC_Berkeley. Users can look at the list of links compiled by a particular user or search for links by keyword.

    Design an application that is similar to del.icio.us, but that uses XML-based metadata instead of keywords. The application stores metadata documents in a native XML database and allows users to query them to find Web pages of interest. For example, the link to the Infosys 290A/Section 2 home page might have the following XML documents as metadata:

          <school>
             <name>University of California, Berkeley</name>
             <description>Information science</description>
          </school>

          <course>
             <number>290A/Section 2</number>
             <name>XML and Databases</name>
             <instructor>Ronald Bourret</instructor>
          </course>

    This allows more complex queries than are possible with simple keywords. Also, by returning the metadata documents, users can understand that the Web page describes a course about XML and databases, rather than describing XML and databases themselves.

    The description of your application should include:

    • An overall description of the application, including its software architecture.

    • How users interact with the system, including how users enter new metadata, how users query the system and what kinds of queries are available, and how the system displays results.

    • What native XML database features will be used by your application and what features your application will have to add.

    • Technical problems such as how/if the system will handle collisions between element/attribute names in different documents.

    • How/if the system will handle synonyms (different element/attribute names that mean the same thing) and homonyms (identical element/attribute names that mean different things).

    • The pros and cons of your application vs. del.icio.us.

    • Anything else you think is relevant.


    There are, of course, no right answers to this question. The idea is to explore some of the problems in writing applications that use semi-structured data.

    Application architecture

    The application stores XML-based metadata for any page on the Web. Users interact through a (possibly complex) GUI and data is stored in a native XML database. Documents in the database consist of a root element, a child for the URL of the Web page, an optional child for the user who entered the metadata, and a complex child containing the metadata.

    More than one document can point to the same URL. This allows multiple users to enter metadata about the same Web page, or a single user to enter multiple metadata documents about a single Web page. (For example, a user documenting a Web page describing a movie might want one set of metadata to view the Web page as a page about a movie and another set to view it as a page about a particular director or actor. Obviously, both sets of metadata could be wrapped into a single document, but there is no reason to prevent users from entering separate documents.)
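    As a concrete sketch of this structure, the following Python code builds such a document with the standard library. The element names (link, url, user, metadata) are invented for illustration -- the text above does not prescribe them:

```python
# A minimal sketch of the document structure described above: a root
# element, a child for the URL, an optional child for the user, and a
# complex child containing the metadata. All element names are
# hypothetical.
import xml.etree.ElementTree as ET

def make_metadata_doc(url, user, metadata_elem):
    """Wrap a metadata fragment in the root/url/user structure."""
    root = ET.Element("link")
    ET.SubElement(root, "url").text = url
    if user is not None:               # the user child is optional
        ET.SubElement(root, "user").text = user
    root.append(metadata_elem)         # arbitrarily complex metadata
    return root

# Multiple documents may point at the same URL (different users, or one
# user entering several views of the same page).
movie = ET.fromstring("<movie><director>Steven Spielberg</director></movie>")
doc = make_metadata_doc("http://example.com/jaws", "rbourret",
                        ET.Element("metadata"))
doc.find("metadata").append(movie)
print(ET.tostring(doc, encoding="unicode"))
```

    In a real system the same URL could appear in many such documents; the database, not the document structure, enforces nothing about uniqueness.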

    Entering metadata

    The biggest problem in entering XML metadata is in creating a schema. This is more work than most people want to do. One solution to this is to allow people to reuse schemas. That way, a few motivated users can create schemas and others can search for them (e.g. by category name or simply by guessing a root element name). The system can then prompt the user for each element or attribute in the schema. People might also have the opportunity to modify schemas, but this is likely to be needlessly complex, as well as reducing the queryability of documents that match the existing schema. A better solution would be to let people use existing schemas as a starting point for new schemas. That would at least raise the possibility that existing documents would match some of the queries against the new schema. One other thing worth pointing out is that metadata schemas shouldn't be allowed to use mixed content, as there is little reason to use it for metadata and it makes user interface features such as drill-downs much more complex.
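    One way to sketch schema reuse is to treat an existing metadata document as a template and collect a value for each leaf element. In a real UI the values would come from form fields generated by prompting the user; here they come from a dict. The template and element names below are made up for illustration:

```python
# A sketch of schema-driven data entry: copy a template document and
# fill in each leaf element. Mixed content is deliberately unsupported,
# per the discussion above.
import copy
import xml.etree.ElementTree as ET

def fill_template(template, values):
    """Return a copy of the template with each leaf element's text
    taken from values, keyed by element name."""
    doc = copy.deepcopy(template)      # never modify the shared template
    for elem in doc.iter():
        if len(elem) == 0:             # leaf element -> prompt for a value
            elem.text = values.get(elem.tag, "")
    return doc

template = ET.fromstring("<movie><title/><director/><year/></movie>")
doc = fill_template(template, {"title": "Jaws",
                               "director": "Steven Spielberg",
                               "year": "1975"})
```

    Because every document produced this way shares the template's structure, queries written against the original schema keep working -- which is the whole point of encouraging reuse.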

    Querying the data

    How do you create a user interface that allows people to explore a data set where the schema of the data set is (a) not known at design time and (b) not internally consistent? Here are some possible answers:

    • Metadata documents. Regardless of the search mechanism that is used, it is advantageous to return the entire metadata document that was found by the search. This allows the user to see the context in which a particular piece of data is used. For example, if I do a full-text search for "Steven Spielberg", the metadata document is likely to tell me if "Steven Spielberg" is a director, a producer, or perhaps just a name in a news article.

      Returning the entire metadata document also provides several other advantages. For example:

      • Users can look at it and get an idea of what the related Web page is about. For example, is it a Web page that describes a movie directed by Steven Spielberg or is it a news article about Steven Spielberg (or one that perhaps just mentions him)?

      • Users can learn what element and attribute names are actually used, which allows them to create more sophisticated searches that take advantage of these names and their structure.

      • Users can explore sideways. For example, if they find out that Steven Spielberg is a director of a movie, they can also see who acted in that movie and then go searching for other metadata documents that describe those actors.

    • Full-text search. This is the simplest kind of search and, as long as metadata documents are returned, provides users some information about how a given phrase is used, as described above.

    • Keyword/value searches. If users are allowed to guess at keywords and values, they will get back a certain number of useful results. For example, the keyword/value pair (director, "Steven Spielberg") can easily be translated into the query //director[./string()="Steven Spielberg"]. Of course, there are all sorts of problems in guessing keywords, so an optimal translation mechanism would have to worry about differences in both keywords and values across multiple documents. For example, the query could look in both elements and attributes and also flatten case differences between various XML names. (Synonym handling is described below.) Differences in values could be handled with various full-text techniques, such as word stemming and algorithms that check for misspelled words.

    • Path-based searches. A more sophisticated variation of keyword/value searches is to let people enter paths, which provide context for keyword searches. For example, a user might want to look for the keyword/value pair (director, "Steven Spielberg") only when it is used inside a movie element (//movie[.//director/string()="Steven Spielberg"]).

    • More sophisticated searches. Other easy-to-implement possibilities include things like allowing users to insert and/or conditions in their queries. And, of course, one could always directly expose XQuery to the user, but few users are likely to take advantage of it.

    • Drill-down searches. Rather than having people guess element and attribute names, these could simply be exposed to the user. To do this, the application would need to query the metadata documents and build a list of root element names, then present these to the user. (Having the people who add metadata categorize it somehow would keep this list from becoming impossibly large. Alternatively, the process could be started by having a user guess at a keyword (element name) and then starting the drill-down process from there.)

      Once a user has decided on a starting element, the system could display a list of known children of that element. The user could then choose one of those children and drill down further. The process is repeated until the user decides they don't want to drill down any further, at which point the user can retrieve all matching metadata documents (e.g. a list of directors inside movie documents) or enter a specific value (e.g. a list of director elements inside movie documents whose value is "Steven Spielberg").

    • Synonyms and homonyms. There exists the very real possibility that people will enter different element and attribute names meaning the same thing (movie, film) or use the same name for multiple meanings (title = movie title, title = position in corporation).

      Synonyms are best sorted out with the use of a thesaurus, which groups related terms into lists. The reality is that thesauruses (thesauri? thesaurii? Thesaurus Rex? I don't know) are expensive to build, so it is best to let end users create them. For example, when a set of metadata documents is returned, the user could graphically link element or attribute names that mean the same thing.

      Homonyms are harder to deal with. As a general rule, people will discover that an element or attribute name is a homonym by looking at the context in which it is used. For example, is title a child of movie or a child of person? While XML namespaces were designed to handle homonyms, they probably won't help in this case. The reason is that, when people explore the names used in metadata documents, they are most likely to query local names without regard to namespace URI. For example, "title" is a meaningful term; "http://www.rpbourret.com/schemas/movies" is not.

      None of the options for handling this is very satisfying. For example, the application could allow people to rename homonyms. On the one hand, this might anger the person who created the schema. On the other hand, it is simple and would allow people to build a consensus about names. As long as it was applied sparingly, it might work. Another solution is to allow people to essentially create a thesaurus of names like movie_title and corporate_title. The more descriptive name (e.g. movie_title) could be presented to the user and the actual name (e.g. title) could be used in the underlying query. This could, however, get fairly complex.
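    The keyword/value, synonym, and drill-down ideas above can be sketched over a handful of parsed documents. Everything here -- the sample documents, the synonym table, and the function names -- is invented for illustration; a real native XML database would run such searches as XQuery rather than Python tree walks:

```python
# Sketch of three search styles: keyword/value search with case
# flattening, synonym expansion via a user-built thesaurus, and
# drill-down listing of child element names. All data is made up.
import xml.etree.ElementTree as ET

docs = [ET.fromstring(s) for s in (
    "<movie><title>Jaws</title><director>Steven Spielberg</director></movie>",
    "<film><Director>Steven Spielberg</Director><title>E.T.</title></film>",
    "<person><name>Steven Spielberg</name><title>CEO</title></person>",
)]

# User-built thesaurus: each name maps to the names users linked to it.
SYNONYMS = {"movie": {"film"}}

def names_for(keyword):
    """Expand a keyword to itself plus any user-linked synonyms."""
    return {keyword} | SYNONYMS.get(keyword, set())

def keyword_value_search(docs, keyword, value):
    """Return whole metadata documents containing a matching element or
    attribute, flattening case differences in names."""
    wanted = {n.lower() for n in names_for(keyword)}
    hits = []
    for doc in docs:
        for elem in doc.iter():
            if elem.tag.lower() in wanted and (elem.text or "") == value:
                hits.append(doc)
                break
            if any(k.lower() in wanted and v == value
                   for k, v in elem.attrib.items()):
                hits.append(doc)
                break
    return hits

def children_of(docs, parent):
    """Drill-down support: the child names seen under a given element."""
    wanted = {n.lower() for n in names_for(parent)}
    return sorted({child.tag for doc in docs for elem in doc.iter()
                   if elem.tag.lower() in wanted for child in elem})

hits = keyword_value_search(docs, "director", "Steven Spielberg")
# Matches the <movie> and <film> documents (case flattening catches
# <Director>) but not <person>, where the match is under <name> and
# "title" is a homonym.
```

    Note that returning the whole documents, as argued above, is what lets a user notice that the person document's title means something entirely different from the movie documents' title.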

    Native XML database features

    The most important features required of a native XML database are: the ability to store XML documents that don't conform to any known schema, the ability to query all elements and attributes in an XML document without knowing what they are at design time, and a query language such as XQuery. Also important is scalability, so the system can handle a large number of metadata documents with reasonable performance, something that wouldn't be possible if the documents were stored, say, in the file system.

    Less important features are transactions (so no data is lost) and concurrent access (so multiple users can query and modify the data at once).

    Pros and cons vs. del.icio.us

    In reality, I can see little advantage to using XML-based metadata instead of the simple keywords used by del.icio.us. One reason is that the application is much more complex to build. But the main reason, as most students pointed out, is social.

    While XML-based metadata does provide more information and better query results, it is much harder to create and query. del.icio.us works because people can use it almost as an afterthought. This encourages people to enter information, which in turn makes the site more valuable, which in turn encourages more people to enter more information, and so on. An application using XML-based metadata is likely to find few people willing to enter metadata, which makes the application less valuable, which means fewer people will enter metadata, and so on. In other words, it is more important for socially based applications to be accessible to society at large than to be technically nifty.

    This is not to say that this whole exercise is a waste of time. Exploring semi-structured data sets is a very real problem -- it just isn't a good fit in this case. A more realistic example is when an intelligence agency retrieves documents and data sets from various police departments, the FBI, the CIA, the customs bureau, etc., all marked up as XML. Obviously, there will be tremendous variability in the XML that is used, which makes the data semi-structured. While full-text searches are obviously a useful tool for searching the data, other kinds of searches, especially those that take advantage of the XML markup in the documents, are valuable as well.

Copyright (c) 2006, Ronald Bourret