Week 4 Exercises:
XQuery

Software

To do this homework, you will need software that supports XQuery. Although there are many choices, two are recommended:

  • The <oXygen/> XML editor. This is available in the SIMS lab.

  • Saxon-B 8.6.1. This is open-source Java software you can download and use on your own laptop. To use Saxon:

    1. Download from the Saxon SourceForge page.

    2. Unzip the package.

    3. Add saxon8.jar to your classpath. Note that Saxon has its own XML parser, so you will want to add this before any other parsers in your classpath.

    4. Run Saxon from the command line using the command

         java net.sf.saxon.Query query-file
      

      where query-file is the name of the file containing your XQuery query.
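
To check that Saxon is set up correctly before starting the exercises, you can run a query that does not read any documents. The following is only a sketch; the file name check.xq is an arbitrary choice, not something Saxon requires:

       (: check.xq -- hypothetical file name; any name works :)
       <result>{1 + 1}</result>

Running java net.sf.saxon.Query check.xq should print a result element containing 2. A class-not-found error from Java instead usually points to a problem with the classpath.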

Data

Use the XML documents book.xml and reviews.xml in the following exercises.

Comments

  • You will generally use the doc function to open XML documents.

  • If documents are stored on the file system, such as when using Saxon, you will use a file URL as the argument to the doc function. For example:

       doc("file://localhost/path/filename.xml")
    

    Remember to use %20 for spaces. If you are using Windows, you may need to replace colons (:) and backslashes (\) in the path with vertical bars (|) and forward slashes (/), respectively. For example:

       doc("file://localhost/c|/uc%20berkeley/homework/week4/book.xml")
    

    Exactly what substitutions you need to make depends on your processor. Saxon uses colons and forward slashes.

  • If documents are stored in a database, such as Berkeley DB XML or eXist, you will need to check the documentation for the syntax of the URIs used by the doc function.

  • It is not necessary for the white space in your results to exactly match the white space shown in the results listed in each question.

  • If you have questions about syntax, see the XQuery and XQuery Functions and Operators specifications. The Functions and Operators spec is easy to read; the XQuery spec less so, although still approachable. Both contain examples. You can download these at:

       http://www.w3.org/TR/xquery/
       http://www.w3.org/TR/xpath-functions/
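
As a sketch of the file URL advice above, the following query opens book.xml from a Windows directory using the colon-and-forward-slash form. The path is the hypothetical one from the earlier example, and the file:/// form (empty host name) is an assumption; check your processor's documentation if it is rejected:

       (: assumes book.xml is in C:\uc berkeley\homework\week4 :)
       let $book := doc("file:///c:/uc%20berkeley/homework/week4/book.xml")/book
       return $book/title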
    

Hints

  • To return a sequence from a single return clause in a FLWOR expression, use the comma operator (,) and enclose the expression in the return clause in parentheses. For example:

       for $n in (<a/>, <b/>, <c/>)
       return ("Name: ", node-name($n))
    

    returns the sequence ("Name: ", "a", "Name: ", "b", "Name: ", "c"). Note that this is different from:

       for $n in (<a/>, <b/>, <c/>)
       return "Name: ", node-name($n)
    

    which returns an error stating that the $n variable is not in scope (or similar message). This is because the comma operator has lower precedence than the FLWOR expression, so the return clause ends at the comma. The above expression actually says, "First evaluate the FLWOR expression. When you are done, concatenate the result of node-name($n) to the sequence generated by the FLWOR expression." At that point, $n is no longer bound.

  • If you want to insert carriage returns or line feeds into your results, use character reference literals. These have the form "&#nn;" or "&#xnn;", where nn is the decimal or hexadecimal (respectively) value of the character's code point in Unicode. For example:

       for $n in (<a/>, <b/>, <c/>)
       return ("Name: ", node-name($n), "&#10;")
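
    If all you want is one line of text per item, another option is to build each line with concat and join the lines with string-join, which keeps the return clause to a single expression. A minimal sketch, assuming local-name is an acceptable substitute for node-name here (it returns the name as a string, without any prefix):

       string-join(
          for $n in (<a/>, <b/>, <c/>)
          return concat("Name: ", local-name($n)),
          "&#10;")

    This returns a single string with the three lines separated by line feeds, so no extra spaces are inserted between items when the result is serialized.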
    

Exercises

  1. Write an XPath expression that returns the title of the book:

       <title>XQuery Goes to School</title>
    

    Answer:

       doc("book.xml")/book/title
    
  2. Write a FLWOR expression that returns the title of the book:

       <title>XQuery Goes to School</title>
    

    Answer:

       for $t in doc("book.xml")
       return $t/book/title
    
  3. Write a FLWOR expression that retrieves the author elements in the book and returns them inside an authors element:

       <authors>
          <author>
             <name>Sly</name>
             <email>sly@rpbourret.com</email>
          </author>
          <author>
             <name>Scrumps</name>
             <email>scrumps@rpbourret.com</email>
          </author>
          <author>
             <name>Dog</name>
          </author>
       </authors>
    

    Answer:

       <authors>
       {
       for $a in doc("book.xml")/book/author
       return $a
       }
       </authors>
    
  4. Same as exercise 3, except that, if an author does not have an email address, insert an email element with the content NO EMAIL ADDRESS.

       <authors>
          <author>
             <name>Sly</name>
             <email>sly@rpbourret.com</email>
          </author>
          <author>
             <name>Scrumps</name>
             <email>scrumps@rpbourret.com</email>
          </author>
          <author>
             <name>Dog</name>
             <email>NO EMAIL ADDRESS</email>
          </author>
       </authors>
    

    Answer:

       <authors>
       {
          for $a in doc("book.xml")/book/author
          return <author>{
          $a/name,
          if ($a/email) then $a/email else <email>NO EMAIL ADDRESS</email>
          }
          </author>
       }
       </authors>
    
  5. Write an XQuery expression that returns the string "Hello, world". (Hint: This is very easy.)

    Answer:

       "Hello, world"
    
  6. Write an XQuery expression that returns the number of authors inside a number_of_authors element:

       <number_of_authors>3</number_of_authors>
    

    Answer:

       <number_of_authors>
       {
       let $a:=doc("book.xml")/book/author
       return count($a)
       }
       </number_of_authors>
    
  7. Write an XQuery expression that returns a query_languages element. The children of this element should be based on the data in the table element in the document -- the value in the second column in each row should be used as an element name and the value in the first column in each row should be used as the element's value.

       <query_languages>
          <PathLanguage>XPath</PathLanguage>
          <ExpressionLanguage>XQuery</ExpressionLanguage>
          <RulesBasedLanguage>XSLT</RulesBasedLanguage>
       </query_languages>
    

    Hint: Use a positional filter ([n]) to distinguish between the column elements. For example:

       let $s := (<a/>, <b/>, <c/>)
       return $s[2]
    

    uses the positional filter [2] to return the second node in the sequence represented by $s. That is, it returns <b />.

    Answer:

       <query_languages>
       {for $r in doc("book.xml")//row
       return element {$r/column[2]} {$r/column[1]/string()}
       }
       </query_languages>
    
  8. Write an XQuery expression that creates a document of the following form. The root element is summary. The first child of the summary element is title, which is the book's title. The following children are pairs of p and figure elements, where the p element contains the figure_ref element that refers to the figure.

       <summary>
          <title>XQuery Goes to School</title>
          <p>"Who knows how to transform this into HTML?" asked Mr. W3C, pointing
    to a simple bit of XML. (See <figure_ref idref="1" >figure 1</figure_ref>.)</p>
          <figure id="1" href="simple.xml">
             <caption>A simple bit of XML</caption>
          </figure>
          <p>"XPath, I'm sorry about making fun of you," said XPath. "I was just
    jealous of the way you transform things so easily." XPath hugged XSLT (See
    <figure_ref idref="2">figure 2</figure_ref>.)</p>
          <figure id="2" href="hug.jpg">
             <caption>XPath and XSLT make up.</caption>
          </figure>
       </summary>
    

    Hint: Assume that the figure and the paragraph that references it are in the same section.

    Answer:

    One way to do this is to find each figure reference, go up to the containing paragraph, and then go up to the containing section and back down to the referenced figure. (If you weren't sure that the figure and the referencing paragraph were in the same section, you could go all the way back up to the root (book) element.)

       <summary>
       {
       doc("book.xml")/book/title,
       for $f in doc("book.xml")//figure_ref
       return ($f/ancestor::p, $f/ancestor::section//figure[./@id = $f/@idref])
       }
       </summary>
    

    Another way to do this is to get all figure elements and paragraph elements that contain figure references, then match them up. This is easier to understand and also avoids using the ancestor axis, which is not always supported. (Kudos to the students who figured this out -- I didn't.)

       let $b := doc("book.xml")/book
       return  
           <summary>
           {$b/title,
            for $f in $b//figure, $p in $b//p[figure_ref]
            where $f/@id=$p/figure_ref/@idref
            return ($p, $f)
           }
           </summary>
    
  9. Write an XQuery expression that creates a document of the following form from the book.xml and the reviews.xml documents. The root element is a book_review element. Its children are a title element, one or more author elements, and a review element. The query must use the value of the title element to retrieve the correct review from the reviews.xml document.

       <book_review>
          <title>XQuery Goes to School</title>
          <author>
             <name>Sly</name>
             <email>sly@rpbourret.com</email>
          </author>
          <author>
             <name>Scrumps</name>
             <email>scrumps@rpbourret.com</email>
          </author>
          <author>
             <name>Dog</name>
          </author>
          <review>An almost incomprehensibly bad book. The plot is contrived
    and the writing simplistic. It is clear that only a vanity press would
    touch this book.</review>
       </book_review>
    

    Answer:

       <book_review>
       {
          let $b := doc("book.xml")/book
          return ($b/title,
                  $b/author,
                  for $i in doc("reviews.xml")/reviews/item
                  where $b/title = $i/title
                  return $i/review)
       }
       </book_review>
    
  10. Write an XQuery function that summarizes statistics for a chapter, then call the function as part of creating a statistical summary of the book. In particular, the function should accept a chapter element and return a statistics element with children listing the number of sections (not subsections), the number of paragraphs, the numbers of figures and figure references, and the number of tables.

    The output of the query should be as follows, with each statistics element produced by a function call:

       <summary>
          <title>XQuery Goes to School</title>
          <statistics title="The first day">
             <number_of_sections>3</number_of_sections>
             <number_of_paragraphs>14</number_of_paragraphs>
             <number_of_figures>2</number_of_figures>
             <number_of_figure_refs>2</number_of_figure_refs>
             <number_of_tables>0</number_of_tables>
          </statistics>
          <statistics title="The second day">
             <number_of_sections>1</number_of_sections>
             <number_of_paragraphs>10</number_of_paragraphs>
             <number_of_figures>0</number_of_figures>
             <number_of_figure_refs>0</number_of_figure_refs>
             <number_of_tables>1</number_of_tables>
          </statistics>
       </summary>
    

    Note: Use the local: namespace prefix for the function. This is predefined to correspond to the URI in which local functions are defined. For example, the function name might be local:statistics.

    Answer:

       declare function local:statistics($c as element(chapter))
          as element(statistics)
       {
          <statistics title="{$c/title}">
             <number_of_sections>{count($c/section)}</number_of_sections>
             <number_of_paragraphs>{count($c//p)}</number_of_paragraphs>
             <number_of_figures>{count($c//figure)}</number_of_figures>
             <number_of_figure_refs>{count($c//figure_ref)}</number_of_figure_refs>
             <number_of_tables>{count($c//table)}</number_of_tables>
          </statistics>
       };
       
       <summary>
       {
       let $b:=doc("book.xml")/book
       return ($b/title, for $c in $b/chapter return local:statistics($c))
       }
       </summary>
    
  11. OPTIONAL (NO CREDIT -- THIS IS JUST FOR FUN). Write an XQuery expression that creates a table of contents for the book. At a minimum, include the title of the book and the designations, "Chapter: " or "Section: ", as well as the chapter or section name. Including chapter numbers and proper indenting would be better. Write the query as if you do not know the number of chapters, the number of sections, or the depth of section nesting.

       XQuery Goes to School
          Chapter 1: The first day
             Section 1: Morning
                Section 1.1: The teacher arrives
                Section 1.2: A simple transformation
             Section 2: Lunch
                Section 2.1: XPath and XSLT make up
             Section 3: Afternoon
          Chapter 2: The second day
             Section 1: The morning
                Section 1.1: Roll call
                Section 1.2: Another transformation
    

    Hints:

    • Use a recursive function to retrieve nested section elements.
    • Use the typeswitch expression to distinguish between book, chapter, and section elements.
    • Use the concat function to concatenate strings. (When a sequence of atomic values is serialized, a space is placed between adjacent values. This might not be the desired behavior.)

    Answer:

       declare function local:toc ($e as element()*, $indent as xs:string, $number as xs:string)
       {
          for $n at $position in $e
          return
          typeswitch ($n)
             case element(book)
                return ($indent,
                        $n/title/string(),
                        "&#10;",
                        local:toc($n/chapter, fn:concat($indent, "   "), ""))
             case element(chapter)
                return ($indent,
                        fn:concat("Chapter ", $position, ": "),
                        $n/title/string(),
                        "&#10;",
                        (: section numbering restarts in each chapter, so pass an empty prefix :)
                        local:toc($n/section, fn:concat($indent, "   "), ""))
             case element(section)
                return ($indent,
                        fn:concat("Section ", $number, $position, ": "),
                        $n/title/string(),
                        "&#10;",
                        (: carry the full number so far down to nested sections :)
                        local:toc($n/section, fn:concat($indent, "   "), fn:concat($number, $position, ".")))
             default return ()
       };
       
       let $b := doc("book.xml")/book
       return local:toc($b, "", "")
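
    The section numbers in the table of contents come from passing the number built so far down through each recursive call. The following sketch isolates that idea. It is not the answer above: the function name local:outline and the inline chapter are made up for illustration, so the query can be run without book.xml:

       declare function local:outline($sections as element(section)*, $prefix as xs:string)
          as xs:string*
       {
          for $s at $i in $sections
          (: the number of this section is the inherited prefix plus its position :)
          let $number := concat($prefix, $i)
          return (concat("Section ", $number, ": ", $s/title),
                  local:outline($s/section, concat($number, ".")))
       };

       let $chapter :=
          <chapter>
             <section><title>Morning</title>
                <section><title>The teacher arrives</title></section>
                <section><title>A simple transformation</title></section>
             </section>
             <section><title>Lunch</title></section>
          </chapter>
       return string-join(local:outline($chapter/section, ""), "&#10;")

    Each call appends the current position to the prefix it was given, so the subsections of the first section are numbered 1.1 and 1.2, and the numbering works to any depth of nesting.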
    

General notes on answers

There were a number of common problems people had while doing this homework:

  • // queries. As a general rule, avoid // queries, as these tend to be expensive to process. This is doubly true when XQuery is implemented over a relational database. If you know the exact path of something you are searching for, then spell it out (e.g. a/b/c). Only use // when you don't know the exact path. Note that PowerPoint presentations like my lectures tend to use // because it uses less screen real estate.

  • Variable names. Use meaningful variable names, as these tend to make queries easier to read, modify, and debug. For example,

       for $i in doc("book.xml")/book/author
    

    is confusing. $a would be a better name, and $author would be better still. Note that the use of full English names also allows the use of plurals, which are useful in distinguishing between individual values and sequences of values:

       for $author in doc("book.xml")/book/author    (: $author handles one element at a time :)
    
       let $authors := doc("book.xml")/book/author   (: $authors is a sequence of elements :)
    
  • doc function. Avoid multiple calls to the doc and collection functions when possible. Depending on how XQuery is implemented and the exact query, these may require reparsing of documents, which is expensive. For example, the following query needlessly calls doc twice:

       <summary>
          {doc("book.xml")/book/title}
          {doc("book.xml")/book/author}
       </summary>
    

    This can be avoided as follows:

       
       let $doc_node := doc("book.xml")
       return
          <summary>
             {$doc_node/book/title}
             {$doc_node/book/author}
          </summary>
    

    or:

       
       let $b := doc("book.xml")/book
       return
          <summary>
             {$b/title}
             {$b/author}
          </summary>
    
  • Nested for loops. The construct "for $a in seq_1, $b in seq_2" sets up nested for loops. That is, for each value of $a, all values of $b are evaluated. This is potentially very expensive, so it should be avoided if possible. Whether this can be optimized depends on the implementation. However, I'm betting that implementations will be more likely to recognize and optimize:

       for $a in doc("a.xml")
       return
          <joined_data>
          {
             $a/child1,
             $a/child2,
             for $b in doc("b.xml")
             where $a/child1 = $b/child2
             return $b/child3
          }
          </joined_data>
    

    than:

       for $a in doc("a.xml"), $b in doc("b.xml")
       where $a/child1 = $b/child2
       return
          <joined_data>
          {
             $a/child1,
             $a/child2,
             $b/child3
          }
          </joined_data>
    
  • Specifying data types. You can use the element() and attribute() tests to restrict a function parameter or return value to a particular element or attribute. For example:

       declare function local:statistics($c as element(chapter))
       as element(statistics)
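
    As an illustration of the attribute() test, the following sketch declares a function that accepts only an id attribute. The function name local:figure-label and the inline figure element are assumptions made for the example:

       declare function local:figure-label($id as attribute(id))
          as xs:string
       {
          (: the attribute is atomized to its string value by concat :)
          concat("Figure ", $id)
       };

       let $figure := <figure id="1" href="simple.xml"/>
       return local:figure-label($figure/@id)

    Passing anything other than a single id attribute, such as $figure/@href, should raise a type error rather than silently produce a wrong label.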
    

Copyright (c) 2006, Ronald Bourret