
Processing Models and XML Pipelines

R. Alexander Milowski

milowski at sims.berkeley.edu

#1

Processing is Always a Multi-step Procedure

#2

Parsing & Processing

#3

Example: Aggregation

#4

Pipeline Motif

Definition 1:

An XML Pipeline is a sequence of components each of which consumes a "primary" infoset and produces a "primary" infoset.
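The definition can be illustrated with a minimal sketch in which every component is a function that takes a document tree and returns a document tree. The component names (`rename`, `delete`) and the `pipeline` helper are hypothetical, chosen only for this illustration, and an in-memory ElementTree stands in for the "primary" infoset.

```python
import xml.etree.ElementTree as ET
from functools import reduce

def rename(old, new):
    """Build a component that renames every `old` element to `new`."""
    def component(root):
        for el in list(root.iter(old)):
            el.tag = new
        return root
    return component

def delete(name):
    """Build a component that removes every `name` element (and its tail text)."""
    def component(root):
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag == name:
                    parent.remove(child)
        return root
    return component

def pipeline(*components):
    """Chain components so that each one's output document feeds the next."""
    return lambda root: reduce(lambda doc, step: step(doc), components, root)

pipe = pipeline(delete("draft"), rename("para", "p"))
result = pipe(ET.fromstring("<doc><para>kept</para><draft>gone</draft></doc>"))
print(ET.tostring(result, encoding="unicode"))  # <doc><p>kept</p></doc>
```

Because each component has the same input and output type, components can be reordered or replaced without touching the others.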

#5

Example: Aggregation Implemented via JAXP

#6

Application Needs

#7

Pipelines

#8

Pipeline Example 1

#9

Pipeline Example 2

#10

A "Real" Pipeline

#11

Possible Pipeline Components

#12

Data Flow & Pipelines

#13

Pipelining Languages/Technologies

  1. Sun's XML Pipeline Note at the W3C - language specification

  2. Norm Walsh's sxpipe project at Java.net - open source - language and implementation

  3. My smallx project at Java.net - open source - language and implementation

  4. Apache's Cocoon - open source - chaining of SAX filters in the sitemap.

  5. Markup Technology - commercial - language and implementation - also implements (1).

  6. Orbeon - open source/commercial - language and implementation embedded in product.

#14

Smallx

#15

Smallx History

#16

Smallx Pipelines

#17

Smallx Pipelines - Components

Smallx pipelines contain a growing set of components:

#18

Smallx - Large Document Example

This pipeline processes a large data file that is an XML document. Each 'training-scenario' element can be processed by XSLT but the whole document is too big to load into one in-memory tree. The output is a text data file that can be read by statistical software.

<p:pipe xmlns:p="urn:publicid:IDN+smallx.com:pipeline:1.0" name="scenario2text">

<!-- Limits the XSLT to the 'training-scenario' element -->
<p:subtree-view select="training-scenario">

<!-- Converts the scenario data to a text file for R -->
<p:xslt src="scenario2text-xt.xsl"/>

</p:subtree-view>

</p:pipe>
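The idea behind limiting the transformation to one subtree at a time can be sketched outside smallx as well. The following is a rough Python analogue, not the smallx implementation: `scenario_to_text` is a hypothetical stand-in for the stylesheet (the Python standard library has no XSLT processor), and the sketch assumes the 'training-scenario' elements are not nested inside one another.

```python
import io
import xml.etree.ElementTree as ET

def scenario_to_text(scenario):
    """Hypothetical stand-in for scenario2text-xt.xsl: one tab-separated line."""
    return "\t".join((child.text or "").strip() for child in scenario)

def stream_scenarios(source, tag="training-scenario"):
    """Yield one text line per `tag` element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the document element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield scenario_to_text(elem)
            root.clear()  # drop subtrees already processed to bound memory use

data = io.BytesIO(
    b"<scenarios>"
    b"<training-scenario><id>1</id><score>0.5</score></training-scenario>"
    b"<training-scenario><id>2</id><score>0.9</score></training-scenario>"
    b"</scenarios>"
)
lines = list(stream_scenarios(data))
print("\n".join(lines))
```

Only one scenario subtree needs to be complete at any moment, which is what makes the approach workable for documents too large for a single in-memory tree.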

#19

Smallx - Aggregation Example

This pipeline aggregates the input document with two other files. The template step wraps the input in a 'result' element between two 'c:file' references, one to a header file and one to a trailer file. Running the result through the file component then performs the aggregation.

<p:pipe xmlns:p="urn:publicid:IDN+smallx.com:pipeline:1.0" name="scenario2text"
        xmlns:c="urn:publicid:IDN+smallx.com:component-language:1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>

<!-- add the aggregation specification to the input -->
<p:template>
<result>
<c:file href="header.xml"/>
<xsl:copy-of select="."/>
<c:file href="trailer.xml"/>
</result>
</p:template>

<!-- Aggregate by running through the file component -->
<p:file/>

</p:pipe>
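The two steps of this pipeline can be sketched in Python to show the shape of the technique. This is an illustration under assumptions, not the smallx implementation: the function names are hypothetical, and a small in-memory dictionary (`FILES`) stands in for the header and trailer files so that the sketch is self-contained.

```python
import xml.etree.ElementTree as ET

C_NS = "urn:publicid:IDN+smallx.com:component-language:1.0"
C_FILE = f"{{{C_NS}}}file"

def add_aggregation_spec(doc):
    """Step 1: wrap the input in <result> between two c:file placeholders."""
    result = ET.Element("result")
    ET.SubElement(result, C_FILE, href="header.xml")
    result.append(doc)
    ET.SubElement(result, C_FILE, href="trailer.xml")
    return result

def resolve_files(root, load):
    """Step 2: replace each c:file placeholder with the document load(href) returns.

    Documents that are loaded are not scanned again for nested placeholders.
    """
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == C_FILE:
                parent[index] = load(child.get("href"))
    return root

# Hypothetical file contents, kept in memory for the sake of the example.
FILES = {
    "header.xml": "<header>Report</header>",
    "trailer.xml": "<trailer>End</trailer>",
}

def load(href):
    return ET.fromstring(FILES[href])

doc = ET.fromstring("<data><item>1</item></data>")
aggregated = resolve_files(add_aggregation_spec(doc), load)
print(ET.tostring(aggregated, encoding="unicode"))
```

Separating the specification step from the resolution step means the same resolving component can serve any pipeline that emits placeholders.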

#20

sxpipe

#21

sxpipe Components

#22

Cocoon Pipelines

#23

Event Application Example

#24

Event Application Example - Syntax

#25

Sun Pipeline Language

#26

An Example:

#27

Processors

#28

Steps

#29

Inputs and Outputs

#30

LaTeX Example

#31

Pipelines as Web Services

#32

Example Web Service - BART Schedule

We want to send a simple request containing a start station, an end station, and a departure time, and get back a train schedule:

  1. We'll receive an XML document with the necessary information.

  2. We'll need to post that information to the BART website.

  3. We'll process the XHTML that comes back to extract the schedule results.

  4. We'll return that schedule as XML.

#33

BART Schedule Input

The request:

<bart-schedule>
<from>BRK</from>
<to>EMBAR</to>
<departing><month>2</month><day>17</day><time>5:00 PM</time></departing>
</bart-schedule>

#34

BART Schedule Output

The response:

<routes>
<from>BRK</from>
<to>EMBAR</to>
<departing><month>2</month><day>17</day><time>5:00 PM</time></departing>
<route-option>
<train>
<depart>Downtown Berkeley at 4:55p</depart>
<board>Millbrae train</board>
<arrive>Embarcadero at 5:17p</arrive>
</train>
</route-option>
<route-option>
<train>
<depart>Downtown Berkeley at 5:02p</depart>
<board>Fremont train</board>
<arrive>MacArthur at 5:08p</arrive>
</train>
<train>
<depart>MacArthur at 5:08p</depart>
<board>Millbrae train</board>
<arrive>Embarcadero at 5:24p</arrive>
</train>
</route-option>
</routes>

#35

BART Schedule - Procedure

  1. We need to turn the input XML into an HTTP request while keeping a copy of the from/to/departing information.

  2. Make the request to the resource over HTTP.

  3. The result isn't quite valid XHTML--darn! So we'll use TagSoup (a SAX HTML parser) to parse it as HTML and feed it to the pipeline as XML.

  4. www.bart.gov returns complicated stuff with lots of tables. We need to locate the right table and dump the rest.

  5. The remaining table holds the train schedules, so we convert it to the right XML elements.
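Step 1 of the procedure can be sketched in Python as follows. The base URL and query parameter names are the ones that appear in the smallx pipeline for this service. The function name `prepare_request` is hypothetical, and the sketch only builds the URL; it does not contact the site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

# Base URL and parameter names as used in the deck's BART pipeline.
BASE_URL = "http://www.bart.gov/index.asp"

def prepare_request(request_xml):
    """Return (routes, url): a copy of the request data plus the URL to fetch."""
    request = ET.fromstring(request_xml)
    routes = ET.Element("routes")
    routes.extend(list(request))  # keep the from/to/departing information
    params = {
        "origin": request.findtext("from"),
        "destination": request.findtext("to"),
        "time_mode": "departs",
        "depart_month": request.findtext("departing/month"),
        "depart_date": request.findtext("departing/day"),
        "depart_time": request.findtext("departing/time"),
        "cookiesTested": "1",
    }
    # quote() encodes the space in "5:00 PM" as %20; ":" is left unescaped.
    url = BASE_URL + "?" + urlencode(params, safe=":", quote_via=quote)
    return routes, url

routes, url = prepare_request(
    "<bart-schedule><from>BRK</from><to>EMBAR</to>"
    "<departing><month>2</month><day>17</day><time>5:00 PM</time></departing>"
    "</bart-schedule>"
)
print(url)
```

Returning the copied request data alongside the URL mirrors the requirement that the response carry the original from/to/departing values.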

#36

BART Schedule as a Smallx Pipeline

<p:pipe xmlns:p="urn:publicid:IDN+smallx.com:pipeline:1.0" name="bart-schedule"
    xmlns:c="urn:publicid:IDN+smallx.com:component-language:1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
>

<!-- Scope the service to the bart-schedule elements -->
<p:subtree select="bart-schedule">

   <!-- Translate the request to a 'routes' element and add the c:url-get input 
        for the url component -->
   <p:template>
   <xsl:for-each select="bart-schedule">
   <routes>
   <xsl:copy-of select="node()"/>
   <c:url-get href="http://www.bart.gov/index.asp?origin={from}&amp;destination={to}&amp;time_mode=departs&amp;depart_month={departing/month}&amp;depart_date={departing/day}&amp;depart_time={substring-before(departing/time,' ')}%20{substring-after(departing/time,' ')}&amp;cookiesTested=1"
              parse-as-html="true"/>
   </routes>
   </xsl:for-each>
   </p:template>

   <!-- Get the schedule -->
   <p:url/>

   <!-- delete unnecessary elements -->
   <p:subtree select='h:script|h:meta|h:head'>
      <p:delete/>
   </p:subtree>

   <!-- Find the contents table and drop the rest -->
   <p:subtree select="h:table">
      <p:template>
      <xsl:copy-of select="h:table/h:tbody/h:tr/h:td[h:a/@name='content']/h:div[@id='bodytext']/h:table[contains(h:tr[1]/h:td[1],'Your Schedule')]"/>
      </p:template>
   </p:subtree>

   <!-- Find the schedule tables and drop the rest -->
   <p:subtree select="h:table">
      <p:xslt>
      <xsl:transform version="1.0">

         <xsl:template match="/">
         <search-results>
         <xsl:apply-templates select="h:table/h:tr/h:td/h:table/h:tr/h:td/h:table"/>
         </search-results>
         </xsl:template>

         <xsl:template match="h:table|h:tr|h:td|h:a">
         <xsl:copy> 
         <xsl:apply-templates select="@href|node()"/>
         </xsl:copy>
         </xsl:template>

         <xsl:template match="h:br"><xsl:text> </xsl:text></xsl:template>
  
         <xsl:template match="@*"><xsl:copy/></xsl:template>

      </xsl:transform>
      </p:xslt>

      <!-- Translate the schedule tables into the 'route-option' element -->
      <p:xslt>
      <xsl:transform version="1.0">

         <xsl:template match="h:table[not(preceding-sibling::*)]"/>

         <xsl:template match="h:table" xml:space='preserve'>
         <route-option>
         <xsl:apply-templates select="h:tr"/>
         </route-option>
         </xsl:template>

         <xsl:template match="h:tr[not(preceding-sibling::*)]"/>
  
         <xsl:template match="h:tr" xml:space='preserve'>
         <train>
         <xsl:apply-templates select="h:td"/>
         </train>
         </xsl:template>

         <xsl:template match="h:td[1]">
         <depart><xsl:value-of select="normalize-space(.)"/></depart>
         <xsl:text>
         </xsl:text>
         </xsl:template>

         <xsl:template match="h:td[3]">
         <board><xsl:value-of select="normalize-space(.)"/></board>
         <xsl:text>
         </xsl:text>
         </xsl:template>

         <xsl:template match="h:td[4]">
         <arrive><xsl:value-of select="normalize-space(.)"/></arrive>
         <xsl:text>
         </xsl:text>
         </xsl:template>

         <xsl:template match="h:td"/>

      </xsl:transform>
      </p:xslt>
   </p:subtree>

   <!-- drop extras that remain -->
   <p:subtree select='h:br'>
      <p:delete/>
   </p:subtree>

   <!-- Unwrap the route-options from the XHTML -->
   <p:unwrap select='h:html'/>
   <p:unwrap select='h:body'/>

</p:subtree>

</p:pipe>
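The mapping performed by the last stylesheet (skip the first table, skip each table's header row, then read cells 1, 3, and 4 as depart, board, and arrive) can be sketched in Python for readers less familiar with XSLT. The sample input below is invented for illustration and is not actual output from www.bart.gov; the function names are hypothetical, and the sketch assumes the first child of 'search-results' is a table.

```python
import xml.etree.ElementTree as ET

H = "{http://www.w3.org/1999/xhtml}"

def normalize_space(text):
    """Collapse runs of whitespace, like XPath's normalize-space()."""
    return " ".join(text.split())

def tables_to_route_options(search_results):
    """Map each schedule table after the first to a <route-option> element."""
    options = []
    for table in search_results.findall(H + "table")[1:]:
        option = ET.Element("route-option")
        for row in table.findall(H + "tr")[1:]:  # the first row is a header
            cells = ["".join(td.itertext()) for td in row.findall(H + "td")]
            train = ET.SubElement(option, "train")
            for name, index in (("depart", 0), ("board", 2), ("arrive", 3)):
                if index < len(cells):
                    ET.SubElement(train, name).text = normalize_space(cells[index])
        options.append(option)
    return options

# Invented sample shaped like the intermediate 'search-results' document.
sample = ET.fromstring(
    '<search-results xmlns:h="http://www.w3.org/1999/xhtml">'
    "<h:table><h:tr><h:td>Your Schedule</h:td></h:tr></h:table>"
    "<h:table>"
    "<h:tr><h:td>Depart</h:td><h:td/><h:td>Board</h:td><h:td>Arrive</h:td></h:tr>"
    "<h:tr><h:td>Downtown Berkeley\n at 4:55p</h:td><h:td/>"
    "<h:td>Millbrae train</h:td><h:td>Embarcadero at 5:17p</h:td></h:tr>"
    "</h:table>"
    "</search-results>"
)
options = tables_to_route_options(sample)
print(ET.tostring(options[0], encoding="unicode"))
```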