Lecture 1: Motivation, History, Basics of XML, Unicode, and Infosets

R. Alexander Milowski

milowski@sims.berkeley.edu

School of Information Management and Systems

#1

Motivation

[1] http://www.sgmlsource.com/history/AnnexA.htm

[2] B. Reid, "Scribe: A Document Specification Language and its Compiler," Ph.D. Dissertation, Carnegie Mellon University, Pittsburgh, PA (October, 1980).

#2

History of XML

[1] http://www.oasis-open.org/cover/yuriMemColl.html

[2] http://www.w3.org/TR/WD-xml-961114.html

[3] http://www.w3.org/TR/1998/REC-xml-19980210

[4] http://www.w3.org/TR/1998/REC-xml

[5] http://www.w3.org/TR/xml11/

#3

The W3C Process

  1. Consortium members make their interest in a particular topic known, either through a published note or as the result of a workshop.

  2. Either a Working Group is formed or the topic is assigned to an existing group.

  3. A Requirements Document is drafted and approved first by the working group and then by the consortium.

  4. All documents go through these stages: Initial Draft, Working Draft, Last Call Draft, Proposed Recommendation, and Recommendation.

#4

What is XML?

#5

What is a Document?

[1] Merriam-Webster Online: http://www.m-w.com/

#6

What is an XML Document?

#7

The Real Answer

Documents are instances of units of information*.

* With XML you get to define "instance", "unit", and "information".

#8

What XML Provides

#9

An Example


  <slide>
    <title>What XML Provides</title>
    <contents>
      <ul>
        <li><p>Internationalization via Unicode</p></li>
        <li><p>Validation of instances.</p></li>
        <li><p>Localization of names via namespaces 
              (e.g. My &#39;tomato&#39; isn&#39;t your &#39;tomato&#39;).</p>
        </li>
        <li><p>A &#34;human readable&#34; format.</p></li>
        <li><p>Hierarchical structure.</p></li>
        <li><p>A &#34;motif&#34; for extensibility.</p></li>
      </ul>
    </contents>
  </slide>
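
To see what an application actually receives from markup like this, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The variable names and the shortened copy of the slide are illustrative choices, not part of the lecture. It shows two things: the hierarchy can be navigated by element name, and character references such as `&#34;` arrive as ordinary characters.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the slide markup above (three of the six bullets).
SLIDE = """<slide>
  <title>What XML Provides</title>
  <contents>
    <ul>
      <li><p>Internationalization via Unicode</p></li>
      <li><p>A &#34;human readable&#34; format.</p></li>
      <li><p>Hierarchical structure.</p></li>
    </ul>
  </contents>
</slide>"""

slide = ET.fromstring(SLIDE)

# Navigate the hierarchy by element name.
title = slide.findtext("title")
bullets = [p.text for p in slide.iterfind("contents/ul/li/p")]

print(title)
for bullet in bullets:
    # The parser has already replaced &#34; with a literal double quote.
    print(" -", bullet)
```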

      

#10

A More Complicated Example

<c:pseudocode name="Adj">
   <args><arg>v</arg><arg>j</arg></args>
   <c:for>
      <c:varassign> 
         <c:var>i</c:var>
         <c:constant>1</c:constant>
      </c:varassign>
      <to><c:constant>j</c:constant></to>
      <c:do>
         <c:varassign>
            <c:var>r</c:var>
            <c:func name="find-simplex">
               <c:value><c:var>v</c:var></c:value>
               <c:value><c:var>i</c:var></c:value>
            </c:func>
         </c:varassign>
      </c:do>
   </c:for>
   <c:return><c:value><c:var>r</c:var></c:value></c:return>
</c:pseudocode>

Adj(v, j):
   for i ← 1 to j
      do r ← find-simplex(v, i)
   return r
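
The markup above uses the prefix `c:` without showing where it is bound. A namespace-aware parser needs that binding before it will accept the document. The sketch below uses Python's standard `xml.etree.ElementTree` on a small fragment of the example. The namespace name `urn:example:pseudocode` is a made-up placeholder, since the slide does not give the real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace name; the slide does not show the real binding for "c".
C_NS = "urn:example:pseudocode"

# The final statement of the example, with a slot for a namespace declaration.
FRAGMENT = "<c:return{decl}><c:value><c:var>r</c:var></c:value></c:return>"


def parse(decl):
    """Parse the fragment with the given attribute text on the root element."""
    return ET.fromstring(FRAGMENT.format(decl=decl))


# Without a declaration the prefix is unbound and the parser rejects the input.
try:
    parse("")
    unbound_accepted = True
except ET.ParseError:
    unbound_accepted = False

# With a declaration, each prefixed name becomes a (namespace, local name) pair.
root = parse(' xmlns:c="%s"' % C_NS)
var = root.find("{%s}value/{%s}var" % (C_NS, C_NS))

print(unbound_accepted)
print(root.tag)
print(var.text)
```

ElementTree writes an expanded name as `{namespace}local`, so the prefix itself carries no meaning once the document has been parsed.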

#11

XML Document Structure

  • XML Documents have three major parts:

    1. The prolog that contains "context setting" declarations.

    2. The document element--which is the root of the element tree.

    3. The epilog where additional constructs may be placed but no more elements may appear.
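
The three parts can be observed with a DOM parser. This is a minimal sketch using Python's standard `xml.dom.minidom`. The sample document, the `note` processing instruction, and the names `TEXT` and `LABELS` are invented for illustration. The children of the document node are listed in order, and a second top-level element is shown to be rejected.

```python
from xml.dom import minidom
from xml.dom import Node
from xml.parsers.expat import ExpatError

# Illustrative document: one comment in the prolog, one document element,
# then a comment and a processing instruction in the epilog.
TEXT = """<?xml version="1.0"?>
<!-- prolog: comments and PIs may precede the document element -->
<doc><p>content</p></doc>
<!-- epilog: comments and PIs may follow it -->
<?note hypothetical-pi?>"""

LABELS = {
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "pi",
}

document = minidom.parseString(TEXT)
kinds = [LABELS.get(node.nodeType, "other") for node in document.childNodes]
print(kinds)

# No more elements may appear after the document element.
try:
    minidom.parseString("<doc/><extra/>")
    second_root_rejected = False
except ExpatError:
    second_root_rejected = True
print(second_root_rejected)
```

Note that minidom does not expose the XML declaration as a child node, so it is absent from the list.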

#12

Reading EBNF

[22]    prolog      ::=    XMLDecl? Misc* (doctypedecl Misc*)?
[23]    XMLDecl     ::=    '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[24]    VersionInfo ::=    S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[25]    Eq          ::=    S? '=' S?
[26]    VersionNum  ::=    ([a-zA-Z0-9_.:] | '-')+
[27]    Misc        ::=    Comment | PI | S
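
One way to practice reading these productions is to transcribe them into regular expressions. Here is a hedged Python sketch covering only [24] through [26]. Production [3] `S` (white space) is not listed above and is taken from the XML 1.0 specification. The function name `is_version_info` is invented for this example.

```python
import re

# [3] S ::= (#x20 | #x9 | #xD | #xA)+   (from the XML 1.0 spec, not shown above)
S = r"[ \t\r\n]+"

# [25] Eq ::= S? '=' S?
EQ = "(?:" + S + ")?=(?:" + S + ")?"

# [26] VersionNum ::= ([a-zA-Z0-9_.:] | '-')+
VERSION_NUM = r"(?:[a-zA-Z0-9_.:]|-)+"

# [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
VERSION_INFO = (
    S + "version" + EQ
    + "(?:'" + VERSION_NUM + "'|\"" + VERSION_NUM + "\")"
)


def is_version_info(text):
    """Return True if text matches production [24] in its entirety."""
    return re.fullmatch(VERSION_INFO, text) is not None


print(is_version_info(' version="1.0"'))    # leading S, double quotes
print(is_version_info(" version = '1.0'"))  # Eq allows S around '='
print(is_version_info('version="1.0"'))     # missing the required leading S
print(is_version_info(' version="1.0\''))   # mismatched quotes
```

The mapping is mechanical: `?` and `+` carry over unchanged, `|` becomes alternation, and each nonterminal is replaced by the expression for its production.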

#13

Getting Started

#14

The XML Declaration

#15

The Prolog & Epilog

#16

Elements

#17

Empty Elements

#18

Attributes

#19

Text and Characters

#20

General Entities

[1] http://www.w3.org/TR/xinclude/

#21

Comments

#22

Processing Instructions

#23

CDATA Sections

#24

DTDs and Document Types

#25

Associating DTDs with Documents

#26

Eh?

#27

Characters and Unicode

#28

Unicode in XML

#29

Whitespace Handling

#30

Parsing and Applications

#31

Well Formed vs. Valid

#32

What is XML to an Application?

#33

Information Sets (infosets)

#34

The XML Infoset

#35

An Example Infoset Diagram

<?xml version="1.0"?>

<doc><title>My Document</title>

<body><p>something</p>
</body><citations/></doc>
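
As a rough approximation of the infoset for this document, the sketch below walks a DOM tree built by Python's standard `xml.dom.minidom` and labels each node by the kind of information item it corresponds to. The function name `describe` and the labels are illustrative. One simplification to keep in mind: the Infoset defines one character information item per character, while the DOM groups adjacent characters into a single text node.

```python
from xml.dom import minidom

# The example document, written compactly.
TEXT = """<?xml version="1.0"?>
<doc><title>My Document</title>
<body><p>something</p>
</body><citations/></doc>"""


def describe(node, depth=0):
    """Return indented lines naming the info item behind each DOM node."""
    pad = "  " * depth
    if node.nodeType == node.DOCUMENT_NODE:
        lines = [pad + "document"]
    elif node.nodeType == node.ELEMENT_NODE:
        lines = [pad + "element " + node.tagName]
    elif node.nodeType == node.TEXT_NODE:
        # One DOM text node stands in for a run of character info items.
        return [pad + "characters " + repr(node.data)]
    else:
        return [pad + "other"]
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


lines = describe(minidom.parseString(TEXT))
print("\n".join(lines))
```

The line breaks inside `doc` and `body` show up as character items too, which is worth remembering when counting an element's children.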

#36

Document Info Item

#37

Element Info Item

#38

Attribute Info Item

#39

Character Info Item

#40

There's More...

#41

Information Sets are Extensible

#42

Summary