XML                               Peter Komisar  version 5.2     ©     2005

references: 'XML & Web Services Unleashed', R. Schmelzer et al.; 'The XML Bible',
Elliotte Harold; 'The Birth of XML', Jon Bosak,
http://java.sun.com/xml/birth_of_xml.html; 'XML in a Nutshell', E.R. Harold &
W.S. Means; 'Mastering XML', Navarro, White & Burman; 'Professional XML Schemas',
J. Duckett et al.



GML, SGML & HTML

In 1969, IBM was seeking a method to simplify the handling of legal documents. IBM
wanted a technique that helped in creating, searching and storing these documents.
The project was headed up by Charles Goldfarb. He and co-workers Ed Mosher
and Ray Lorie recognized that IBM's different computer systems all used different
document formats. They recognized that a cross-platform mark-up standard was
needed.

// 1969 Goldfarb, Mosher & Lorie create GML to manage legal documents
 
In response to these needs, they invented Generalized Markup Language or GML.
The researchers cleverly applied their personal signature to the language in that
GML is also an abbreviation of their last names, Goldfarb, Mosher &
Lorie. A standardized version was later adopted by the International Organization
for Standardization, ISO. The standard version of GML became known as SGML.

SGML supplied a means whereby markup could be applied to documents and
then the marked-up document could be interpreted across different computer
platforms. SGML did not define a single fixed set of tags. Instead SGML was
designed to allow tags to be defined as needed. This was important as
it enabled different industries to structure tags as required to best describe
the information they processed.

// SGML allowed tags to be assigned custom values as needed


 "I learned the shape of the future by supervising the transition of Novell's NetWare
 documentation from print to online delivery. This transition, which took from 1990
 through 1994 to implement and perfect, was based on SGML. The decision to use
 SGML paid off in 1995 when I was able single-handedly to put 150,000 pages of
 Novell technical manuals on the web."    
                                             - Jon Bosak, SGML & XML pioneer


Unfortunately SGML became complex and difficult, suffered from industry
infighting and did not reach its full potential. Another thing that
thwarted its use was the success of one of its more limited but very popular
'offspring', HTML. Tim Berners-Lee wrote HTML using the SGML model, where
tags enclosed in angle brackets are interpreted to create hierarchical page
structures. HTML, though limited in what it could do, was adequate for the needs
of a huge base of users, and became a 'run-away best seller' as far as markup
notation goes. HTML with its links and the HTTP protocol eclipsed other popular
Internet services like FTP, Telnet and Gopher.

// HTML simplicity and utility thwarted SGML's growth. SGML was also stalled
//  because it was complex, difficult and a victim of industrial infighting

Despite the immense popularity of HTML it did not address all the issues that
SGML was created to solve. For instance, HTML does not supply customizable
tags, and its tags do not serve to describe the contents of the HTML page. There
was still a need for a language more capable than HTML and simpler to use than
SGML.

// Still HTML did not serve the needs that SGML was created to solve. HTML was
//  not customizable and had no mechanism to describe the data that it contained


The Advent of XML  
// a refinement of SGML 

The need to exchange data in an open manner had still not been met. SGML
was too complicated, HTML was not suited to representing structured data and
the rigid and expensive EDI was not easily adapted to the Internet.

Jon Bosak, Tim Bray, C. M. Sperberg-McQueen, Jean Paoli and James Clark,
many of whom were SGML pioneers, sought to filter SGML's best features and
port them to the web. While saving the best features the group also looked to
delete optional and complex features. (In a sense XML was a distillation of SGML
similar to how Java is a simplification of C++.)

XML was designed to be used over the existing protocols of the Internet. It was
also decided that it would be managed by the World Wide Web Consortium or
W3C, the same group that was managing the HTML standard. For a short time
the language they created was called 'Web SGML' but this was dropped in favor
of XML. Jon Bosak offers his own recollection of how XML came to be at
http://java.sun.com/xml/birth_of_xml.html

 


Hello World in XML




We can introduce the general look of XML in a quick Hello World version.

Example    <?xml version = "1.0" standalone="yes"?>
                  <Earth>
                  Hello World in XML!
                 </Earth>

Write or copy and paste this text into a simple text editor like Notepad and save
it to a name with an .xml ending, such as HelloWorld.xml. Once saved, it can be
opened into a browser. The result is not very exciting as there is no formatting
associated with the text.
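
Besides opening the file in a browser, the same document can be read by a
program. The following is only a minimal sketch, assuming Python and its standard
library module xml.etree.ElementTree (neither is part of these notes; they are
used purely for illustration). The document is held in a string rather than in
HelloWorld.xml to keep the sketch self-contained.

```python
import xml.etree.ElementTree as ET

# The Hello World document from above, held in a string for convenience.
hello_xml = (
    '<?xml version="1.0" standalone="yes"?>'
    "<Earth>Hello World in XML!</Earth>"
)

root = ET.fromstring(hello_xml)   # parse the text and return the root element
root_name = root.tag              # the element identifier
greeting = root.text.strip()      # the character content of the element

print(root_name, "->", greeting)
```

The parser hands back the element name and its content separately, which is the
markup and content split that the rest of these notes build on.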

If we look at the <Earth> element, we can see that XML uses 'tags', pairs of
angle brackets that surround identifiers. A 'start' tag and an 'end' tag are distinguishable.
The 'end' tag includes a forward slash ahead of the element identifier.

A named tag is called an element. Elements can also contain attributes.
In the following example the attribute called 'type' holds the value 'planet'.

Example  <Earth type="planet"></Earth>

Note in this reiteration of the element we left the content out. This creates an
'empty' element. XML supplies an abbreviated form for an 'empty' element as 
follows.

Example  <Earth type="planet" />

This is a recommended form as it reduces the risk of creating an 'orphaned'
end tag.

Speaking of form, XML defines what makes a 'well formed' XML document.




Well Formed and Valid XML



Rules Governing XML Structure

1) XML Elements must have closing tags. That means all tags.

Example <Break></Break> or <Break />
 

2)  XML Elements unlike HTML are case sensitive.

Example    <GO /> is not the same as <go />
 

3) XML tags must all be properly nested. In other words tags must be
closed in the reverse of the order in which they are opened. Below the tags open One,
Two, Three, and close Three, Two, One.

Example         <One> <!-- opens -->
                   <Two> <!-- opens -->
                      <Three> <!-- opens -->
                      </Three> <!-- closes -->
                   </Two>  <!-- closes -->
                </One>  <!-- closes -->

4) XML Documents must have a single root element. This implies all elements
of a document are nested inside the root element. The name of the root element is the
same as the name declared in the document type declaration if one is present. 
 

5) Attribute values must all be quoted. Either single or double quotes may be used;
by convention double quotes are preferred.

Example     number="1029383454738"
 

6) Attributes may only appear once in an element.

Example  <!-- can't have -->  <X x="y" x="z">
 

7) Attribute values cannot contain references to external entities. XML text
can reference XML external entities but not tag attributes. Attributes can
use internally defined and pre-defined entity references.

Example      <ANC nac="CNA&apos;S">

8) Entities must be declared before they are used. Predefined entities are
already defined so they are ready to go.  // entities can't be forward referenced
 
 

Well Formed XML      // defining what a well formed XML document is

In the first case, XML requires that a document be 'well formed'. To be well
formed a document must follow the above stated rules and in addition, the
document must not contain markup or characters that XML cannot process.
 

Formula For Well Formed XML

Adherence to Structural Rules + Correct Syntax  =  Well Formed XML
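
Several of the rules above can be checked mechanically by handing small
fragments to a parser. The following is a minimal sketch, assuming Python's
standard xml.etree.ElementTree module; the helper name is_well_formed and the
sample fragments are illustrative assumptions, not part of any XML tool. A
parser of this kind checks well-formedness only and says nothing about validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True when the parser accepts the text as well formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One fragment that follows the rules, and several that each break one rule.
samples = {
    "properly nested":          "<One><Two><Three></Three></Two></One>",
    "tags closed out of order": "<One><Two></One></Two>",
    "case mismatch":            "<GO></go>",
    "two root elements":        "<One/><Two/>",
    "unquoted attribute":       "<X x=y/>",
    "repeated attribute":       '<X x="y" x="z"/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label:26} {'well formed' if ok else 'NOT well formed'}")
```

Only the first fragment should be accepted; each of the others breaks one of the
numbered rules.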
 

Valid XML

An XML document is considered valid if it is first well-formed and in addition
it has a document type definition, a DTD or an XML Schema, that describes
constraints that the document is in compliance with.

A well formed document can be used without a schema. This will automatically
limit it from using certain advanced XML features that are available only through
some form of document type declaration.




Elements vs. Attributes


 

XML Elements - Elements are either matched tag pairs or self closing tags.

Example       <bird></bird>      or          <bird />

In the above example the two forms specified are functionally equivalent. The first
is technically a pair of empty tags which can be abbreviated to the second,
self-closing form. The self-closing form signals the processor that no matching
end tag will follow. Because the first form is more prone to error (the end tag can be
mistyped or left out), it is recommended that the self-closing form be used
in place of any empty tag pair.

  // self closing tags are recommended as they are less ambiguous and less prone to error
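
The equivalence of the two forms can be observed from a program. This is a
small sketch, assuming Python's standard xml.etree.ElementTree module; the
helper describe is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

pair = ET.fromstring("<bird></bird>")   # empty start tag and end tag pair
short = ET.fromstring("<bird/>")        # abbreviated self-closing form

def describe(element):
    """Summarize an element as (name, text content, number of children)."""
    return (element.tag, element.text, len(element))

same = describe(pair) == describe(short)
print(describe(pair), describe(short), same)
```

Once parsed, nothing in this summary distinguishes the two spellings, which is
why the shorter one can be preferred freely.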
 

XML Attributes - Attributes are named properties that can be applied to
an element to modify or enhance its description and functionality. Attributes are
assigned in name value pairs. The name identifies the particular attribute of the
element. The value is the quoted string that is assigned to the name. Following
is an example.

Example  <Student international ="true"  fulltime="true" resident="false" >
                Jampour MacEarthski
               </Student>
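
To make the distinction concrete, the sketch below reads the Student example
with Python's standard xml.etree.ElementTree module (an assumption made for
illustration only). Attributes arrive as name value pairs attached to the
element, while the student's name arrives as the element's content.

```python
import xml.etree.ElementTree as ET

student_xml = (
    '<Student international="true" fulltime="true" resident="false">'
    "  Jampour MacEarthski  "
    "</Student>"
)

student = ET.fromstring(student_xml)
name = student.text.strip()        # element content
flags = dict(student.attrib)       # attribute name -> attribute value

print(name)
for attr_name, attr_value in flags.items():
    print(f"  {attr_name} = {attr_value}")
```

Note that every attribute value is delivered as a string; "true" and "false"
only become booleans if a schema or the application says so.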
 

When to use Elements? When to Use Attributes?

A big advantage of attributes (over defining additional tags) is that they can be
controlled in a more granular way than regular XML tags. For instance an attribute
can be defined as required or optional. An attribute can be fixed or variable. In
addition, other sorts of constraints can be put on attributes. We will see later
how an attribute can also be restricted to a certain valid range of values.

//  attributes can be specified in a granular way using DTDs or schemas.

Another aspect of using attributes, which may be advantageous or disadvantageous
depending on the application is that attributes are naturally limited from occupying
a primary node in an XML document's tree structure. In XML, the attribute occupies a
subordinate branch of a node element. This might be disadvantageous from a search
perspective as in many sorts of searches only nodes are checked for information.
On the other hand, where an attribute represents a relatively unimportant detail, it is
probably better that such information is hidden leaving the representation of the XML
tree structure simple and essential and not encumbered by excessive detail.

//  It should be added that XPath is able to search for subparts of nodes such
// as attributes so this argument may not be that significant
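
As a rough illustration of the point about XPath, the sketch below selects
elements by an attribute value. It assumes Python's standard
xml.etree.ElementTree module, which supports only a limited subset of XPath;
the Class wrapper element and the second student are hypothetical data
invented for the example.

```python
import xml.etree.ElementTree as ET

classroom = ET.fromstring(
    "<Class>"
    '<Student international="true">Jampour MacEarthski</Student>'
    '<Student international="false">Pat Local</Student>'
    "</Class>"
)

# The predicate [@international='true'] filters on the attribute value.
international = [
    student.text
    for student in classroom.findall("Student[@international='true']")
]
print(international)
```

So information stored in attributes is still reachable by a search, provided
the search tool understands attribute predicates.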

Two guidelines have arisen that can help you decide when a subject should be treated
as an element or as an attribute. The first guideline is that an attribute needs to be unique.
An attribute can only appear once in an element. As a possible example, a shirt size,
once it has been specified, cannot be specified again.

// attributes need to be unique appearing only once in an element

The second guideline states that attributes should be simple. If you are familiar with
Java you can think of primitive values as being prime or fundamental. On the other
hand Java class objects are most often created from several parts. Java primitive
values are analogous to attributes and Java classes are somewhat similar to XML
tag elements.

// attributes should be simple or atomic

To summarize, there have not been hard and fast rules offered to tell how to best
choose whether an element should be an attribute or a sub-element. There are some
good guidelines though. If an element has to be represented many times it is best
represented as an element. If a property is descriptive in a simple, unique and singular
sense it is a candidate to be an attribute. If an object is naturally complex and will
likely decompose to further properties and attributes it is likely best represented as
an element.

// a property may be described as an attribute when it is a) unique or non-repeating b) atomic
// as in elemental or not complex



XML Syntax



XML adheres to the basic method used by markup languages where text is
surrounded by markup. The markup associates some additional information
used to describe the enclosed text in some way. Markup languages only have
these two layers or essential parts, markup and content.


Example

<markup>content </markup> 

<!-- The 'markup' identifier is meta-data, or data 'about' the data, which here is 'content' -->


An XML document is focused on supplying an extensible, hierarchical format that
provides tag labels to describe a document's contents. The labels that make data
self-describing are referred to as 'metadata'. Metadata is descriptive data that
describes (or is 'about') the data that represents the contents of a document.

XML Data Model

The XML Data Model is described as a generalized tree data structure. This is
an extension of the  'linked list' where each node, instead of pointing to a single
next link, may point to more than one node so that different non-linear data
structures may be formed called 'trees' and 'graphs'.
 

Diagram showing a Simplified  XML Tree Structure

               parent     // root or document element has no parent
          _____|____
         |                 |
    child_1        child_2
 

  //  other than root every child has exactly one parent


Trees are a limited form of a 'graph' characterized by having a unique starting point
called the 'root'. The tree is also a recursive structure where any node can
itself be defined as the root of what may be called a 'sub-tree'.  There are other
characteristics that we are used to by now, such as the property of a tree where
every node except root has a unique parent.

XML documents are hierarchical tree structures that can accommodate any depth
of nested nodes and any number of child elements within the practical limits of the
system being used. Each XML document that is described as 'well formed' has
one 'root element' that has no parent and forms the base or root of the document
tree. (The term 'well formed' is used in XML to describe a document that adheres
to a certain set of compositional rules.) The root element is also referred to as the
'document element'.

The Binary Tree

Before describing how tree data structures are traversed it is useful to describe the
simple 'Binary Tree' data structure. We can consider a constrained form of the tree
data structure called the 'binary tree' whose nodes may have zero, one, or at most,
two children.

// a binary tree is a data structure whose nodes have zero, one or at most two children

The binary tree has an accepted set of terms used to describe its parts. First there
is a node which is a point or link in the tree data structure. Each node in a binary tree
may have a 'left child' and may also have a 'right child'.  If a node has no children,
then the node is referred to as a 'leaf', and represents an end point on a tree's branch.
If the left and right child themselves have children, then the left and right child may
be referred to as the 'left subtree' and the 'right subtree'.

Binary Tree Traversals

It is typical to do searches on hierarchical systems like trees to extract information
from them. Binary tree traversals describe different systematic ways of visiting each
node of a tree. There are three common forms of binary tree traversals that can be
extended to more general tree forms.  These are based on the order in which the
component parts of a node are visited. Consider the following simple case of a
node with a left child and a right child.

Simple Node with a Left and Right Child 


   
                 Node
                  /          \
  Left Child               Right Child

This representation can be paraphrased as follows.


                       A
                     /    \
                  B       C

Three popular methods of traversal, 'preorder', 'inorder' and 'postorder' traversal,
are based on the relative sequence in which each of these nodes is visited.  In 'preorder
traversal' the parent node is visited first followed by the left child and the right child.
In inorder traversal, the left child is visited first, followed by the parent node, followed
by the right child. In postorder traversal the left child is visited first, followed by the right
child, followed lastly by the parent node.

Recall that trees are recursive structures. The different sorts of traversal methods
are also applied recursively.  We can expand our example to include subtrees.
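
The three traversals can be written as short recursive functions. The
following is a sketch in Python, assuming a binary tree represented as nested
tuples of the form (label, left subtree, right subtree); that representation
and the seven node tree with labels A to G are assumptions chosen for
illustration.

```python
# Each node is (label, left_subtree, right_subtree); None marks a missing child.
tree = (
    "A",
    ("B", ("D", None, None), ("E", None, None)),
    ("C", ("F", None, None), ("G", None, None)),
)

def preorder(node):
    """Parent first, then left subtree, then right subtree."""
    if node is None:
        return []
    label, left, right = node
    return [label] + preorder(left) + preorder(right)

def inorder(node):
    """Left subtree first, then parent, then right subtree."""
    if node is None:
        return []
    label, left, right = node
    return inorder(left) + [label] + inorder(right)

def postorder(node):
    """Left subtree first, then right subtree, then parent last."""
    if node is None:
        return []
    label, left, right = node
    return postorder(left) + postorder(right) + [label]

print("preorder :", " ".join(preorder(tree)))    # A B D E C F G
print("inorder  :", " ".join(inorder(tree)))     # D B E A F C G
print("postorder:", " ".join(postorder(tree)))   # D E B F G C A
```

The only difference between the three functions is where the parent's label is
placed relative to the two recursive calls.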

XML, and subsequently the XML dialects used in Web Services, use 'preorder traversal'.
Preorder traversal coincides with the way XML elements appear, top to bottom, in an
XML document. Since this is the native traversal method used in XML we will examine
this variety of traversal.


Preorder Traversal

In the simple scenario, shown below, the main 'Node' is visited, followed by the
'Left Child' followed by the 'Right Child'.

Simple Node with a Left and Right Child 


   
           Node
            /     \
  Left Child       Right Child

When nested nodes need to be considered, the tree has to be traversed recursively.



Preorder Traversal Example   // The numbers indicate the order nodes are visited in preorder traversal.

                  1.A
                /     \
            2.B         5.C
           /   \       /   \
        3.D   4.E   6.F   7.G


Using preorder traversal, the principal node is visited first, followed by its left child.
But the left child is itself a parent node so we need to recurse, following the same pattern.
This leads us to the left child of node B which is D followed by the right child E. From
here the B subtree has been processed so the traversal moves to the right child of the
main node A which is C. Here too C is a subtree that itself needs to be traversed
according to the pattern. 

The only difference in  traversal in XML documents is that they are general trees rather
than binary trees and not limited in the number of children each node may have. The
following example shows the convenient aspect of preorder traversal where the order
taken is the same top to bottom order that occurs naturally in XML instance documents.
For instance, the tree described above would appear in the same order in an XML instance.

XML Instance Example of the Above Tree

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="It.xsl"?>
<A>1.
   <B>2.
      <D>3.</D>
      <E>4.</E>
   </B>
   <C>5.
      <F>6.</F>
      <G>7.</G>  
   </C>
</A>
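
That document order and preorder coincide can be checked with a parser. The
sketch below assumes Python's standard xml.etree.ElementTree module, whose
iter() method walks an element and its descendants in document order. The XML
declaration and stylesheet instruction are left out of the string to keep the
sketch short.

```python
import xml.etree.ElementTree as ET

doc = """<A>1.
   <B>2.
      <D>3.</D>
      <E>4.</E>
   </B>
   <C>5.
      <F>6.</F>
      <G>7.</G>
   </C>
</A>"""

root = ET.fromstring(doc)

# iter() yields the element itself and then all descendants, top to bottom.
visit_order = [element.tag for element in root.iter()]
print(" ".join(visit_order))   # A B D E C F G
```

The visiting order matches the numbers in the diagram, so reading the
document from top to bottom is itself a preorder traversal.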

 


XML Delimiters



XML Delimiters

Markup text has to be isolated from the content that makes up the XML document.
The special characters used to do this are called delimiters. XML has four
delimiter characters, listed below.

XML Delimiters

<   _  the start of an XML Tag
>   _  the close of an XML Tag
&   _  the start of an XML Entity
;   _   the end of an XML Entity

We can use these delimiters to create an XML tag example that points out the
self-describing feature of XML. The tag names that describe the data are called metadata.

Example       <first_name> Bart  </first_name>
 


XML Identifiers, Whitespace, & Comments



XML  Identifiers

Stricter than HTML, XML identifiers are case sensitive. This is to say, xXx is
a different identifier than XxX. XML naming has more latitude than is usually
found in programming languages. XML names can start with a letter or an underscore
but not with a digit or other punctuation mark. Subsequent characters can be letters,
numbers, and some other non-letter characters.

For instance, XML names cannot include angle brackets, < >, or white space.
The colon is a legal character but is reserved for use with namespaces. As
such, only one colon can be used in a qualified name. XML names can include
characters like periods and hyphens. This information is summarized
in a rule further on.


XML Reserved Characters

The XML Recommendation reserves element and attribute names that begin with
the characters 'xml', in any combination of upper and lower case.

// checked in the browser and it didn't care. didn't try in a validation context


Rule for XML Identifiers

XML names may start with an underscore or a Unicode letter. Subsequent
characters may be underscore, Unicode letter, number, hyphen, period or a
colon. An identifier may not contain any whitespace.

The colon can be used legally in names but by convention is reserved for use
with namespaces where only one is allowed. We can create a rule to describe
these conditions as follows.

"XML Identifiers are defined as a letter or underscore followed by any number
of letters or numbers or underscores or hyphens or periods. A single colon may
optionally be included in conjunction with its use with namespaces."

With respect to the colon, it is really only used to separate namespace names
from other associated identifiers as is shown in the following example.

Example     tools:hammer   // describes a 'hammer' variable that is in the 'tools' namespace


Following, the rule is restated in Extended Backus Naur Form or EBNF. Notice XML
uses Unicode so a letter may be a non-English letter, such as a Greek letter, or an ideogram.

XML Identifier Rule

XML_Identifier ::= ( letter | underscore)  ( letter | number | underscore | hyphen | period  )* 
                          ( colon? ) 
                          ( letter | number | underscore | hyphen | period  )* 

// colon by convention is reserved for use in namespaces.

// Extended Backus Naur Form notation has supplied many of the symbols commonly
// used in pattern matching dialects like Perl. The above rule can be read 'XML_Identifier
// is defined as' a letter or underscore, followed by zero or more letters or numbers or underscores
// or hyphens or periods, optionally followed by a colon, followed by zero or more letters or
// numbers or underscores or hyphens or periods. 
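
Since the EBNF form is close to regular expression notation, the rule
translates almost directly into a pattern. The sketch below assumes Python's
standard re module, and the names NAME_PATTERN and is_xml_identifier are
hypothetical. It follows the simplified rule given in these notes and is only
an approximation of the full Name production in the XML Recommendation; in
particular Python's \w class is used as a stand-in for 'Unicode letter or
number or underscore'.

```python
import re

# [^\W\d]  : a word character that is not a digit, i.e. a letter or underscore
# [\w.\-]* : zero or more letters, numbers, underscores, periods or hyphens
# (?::...)?: an optional single colon followed by more name characters
NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*(?::[\w.\-]*)?")

def is_xml_identifier(name):
    """Check a name against the simplified identifier rule from the notes."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["first_name", "tools:hammer", "_x.y-z", "αβγ",
                  "1abc", "first name", "a:b:c", "-dash"]:
    verdict = "ok" if is_xml_identifier(candidate) else "rejected"
    print(f"{candidate:14} {verdict}")
```

The first four candidates satisfy the rule; the last four fail because of a
leading digit, white space, a second colon and a leading hyphen respectively.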


There may be processing advantages to having the facility to use non-typical characters
in XML identifiers. For instance XML would permit an element identifier to be the name of
a file with extension included via the period symbol. The content of the tag is then directly
associated with the file name. A process could then be used to reconstitute the file with
appropriate file extension.

// non-typical characters like the period, '.' might be useful for processing files

As a rule though, unless there is some specific reason to take advantage of XML's
liberal naming scheme, it is probably wise to stick with conservative identifiers. This
approach would allow the meta-data identifiers to be easily ported into other language
domains such as Java. This conservative policy might keep a body of data flexible for
future use and adaptation.

// conservative names that would be legal in other programming languages keeps applications flexible

Also, XML_names_can_be_longer_than_is_reasonable! However this could lead to
some unanticipated troubles with devices that have limited abilities so again it will
probably pay to keep names at a reasonable length. The final bit of advice is to keep
the names humanly readable to take advantage of XML's metadata scheme to the
fullest.

// keeping names short, simple and free from exotic characters may assist in future ports
 

Comments

XML uses the same comment form that HTML uses. Following is an example.

XML Comment Form        <!--     comment     -->

For the record, the opening comment delimiter consists of an opening angle bracket,
an exclamation mark and two hyphens. The closing comment delimiter is two hyphens
and a closing angle bracket.




Entity References



Entity References

In order to introduce special characters into documents, (such as a pair of angle
brackets, < > ) an escape technique is needed. This functionality is supplied by the
entity reference. The entity reference is also used to create variables that 'stand in'
for a long or complicated section of text.

The entity reference starts with an ampersand and ends with a semi-colon, with no
white space in between.
 

Form of the Entity Reference              &the_entity;

There are five predefined or built in entity references, listed in the following table.
 

Table Summarizing Built in Entity References

 Symbol      Description

 &lt;        less than,   <
 &gt;        greater than,    >
 &amp;       ampersand,     &
 &quot;      double quote,    "
 &apos;      apostrophe aka single quote,     '



In the following example angle brackets are escaped into the document using the
built-in entity references.

Example  <?xml version="1.0" ?>
                <HTML>
                &quot;&lt;&lt;&lt; S &apos; G &apos; M &apos; L &gt;&gt;&gt;&quot;
                </HTML>


Browser Output   
// Saved to xml extension, HTML tags are revealed rather than interpreted

<HTML>
"<<< S ' G ' M ' L >>>"
</HTML>
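
The escaping shown above does not have to be done by hand. As a sketch,
assuming Python's standard xml.sax.saxutils and xml.etree.ElementTree modules,
the raw text can be escaped with the five built in entity references and then
parsed back to confirm nothing was lost. The extra dictionary passed to
escape() is needed because that function handles only &, < and > by default.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "\"<<< S ' G ' M ' L >>>\""

# Escape &, < and > (the defaults) plus the two quote characters.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# A parser turns the entity references back into the original characters.
root = ET.fromstring("<HTML>" + escaped + "</HTML>")
round_trip = root.text
print(round_trip)
```

The escaped text is what sits in the file; the round trip text is what an
application or browser sees after the parser has resolved the references.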


Entities with custom names can also be defined by an XML author. For instance,
an author's name, e-mail and mailing address might be encapsulated in an entity
called &authorID; . This form is often used to substitute text that needs to
be repeated frequently in documents. Copyright notices are an example.
 

Internal Entity References

To create your own custom internal entities, the basic form is as follows.

Form of an Internal Entity

<!ENTITY entityName "entity textual content">

In the following example, 'pageMove' is the identifier for the custom internal entity
and the text that follows, " Notice etc. "  provides the content that will be substituted
into the document.

Example

<!ENTITY pageMove
"Notice as of 01/01/2008 this page will be moved to www. tee.vee.com "
>

Entities are really a part of the Document Type Definition discussion. It is there
where we take up a special declaration called the 'Document Type Declaration.' We
need to borrow it here as part of the formula to supply local samples of custom
internal entities. Suffice it to say that the Document Type Declaration is
characterized by the DOCTYPE keyword. We see a sample of this
declaration in the following example.

Notice the square brackets that can be supplied optionally in the DOCTYPE
declaration. It is inside the square brackets where custom internal entities can be
declared. The entity is then substituted into text using the same escape characters
that predefined entities use, the ampersand before and the semi-colon after.
Following is an example of this.


Example

<?xml version="1.0"?>
<!DOCTYPE Entities
[
<!ENTITY def "This is an internal entity definition">
<!ENTITY pageMove "Notice as of 01/01/2008 this page
will be moved to www. tee.vee.com " >
]>
    <Entities>
    Dereferencing the Entity Called 'def' : &def;
    Dereferencing the Entity Called 'pageMove' : &pageMove;    
    </Entities>
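
The substitution can be watched from a program as well as in a browser. This
is a minimal sketch, assuming Python's standard xml.etree.ElementTree module,
which expands entities declared in the internal part of the DOCTYPE. Only the
'def' entity from the example is used, to keep the string short.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE Entities
[
<!ENTITY def "This is an internal entity definition">
]>
<Entities>
Dereferencing the Entity Called 'def' : &def;
</Entities>"""

root = ET.fromstring(doc)
text = root.text.strip()   # the reference has already been replaced here
print(text)
```

By the time the application sees the content, the reference is gone and the
entity's replacement text stands in its place.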


External Entities

External entities are referenced using a URL. They may represent whole documents.
The referenced document may be XML or other sorts of documents. External
entities require special notation to declare the file type. We defer these examples
to a brief discussion of Document Type Definitions.

 


CData Sections & Processing Instructions  



CDATA Sections  

// escapes large sections of what would otherwise be illegal xml content

Documents may have large sections of character data that would be more
efficiently handled if the XML processor could just ignore them. This includes
data that might be riddled with text that needs to be escaped. For larger
and more complex groupings it is more effective to create CDATA sections
which allow long passages to be escaped. Inside a CDATA section no
characters have any special meaning. The general form used for CDATA
Sections is as follows.

General form of a CDATA Section   <![CDATA[ contents ]]>

Here is an example of a CDATA section that follows this form.
 

Example <![CDATA[ <><><><>""""&&&&""""<><><><> ]]>

All these special characters that normally need to be escaped are passed as
raw data. One thing that cannot be included in a CDATA section is the
CDATA section end delimiter itself, ]]>, as it would prematurely end
the CDATA section. Also be careful to keep white space from between any
of the symbols of the CDATA declaration.
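
To see that the characters really do arrive untouched, the CDATA example can
be wrapped in an element and parsed. This is a sketch assuming Python's
standard xml.etree.ElementTree module; the wrapper element name data is a
hypothetical name chosen for the illustration.

```python
import xml.etree.ElementTree as ET

payload = ' <><><><>""""&&&&""""<><><><> '

# Without the CDATA wrapper this content would not be well formed.
doc = "<data><![CDATA[" + payload + "]]></data>"

root = ET.fromstring(doc)
received = root.text
print(repr(received))
```

The parser delivers the section's contents as ordinary character data; the
CDATA delimiters themselves are not part of what the application receives.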
 

Example   <![CDATA[      is not the same as     <!  CDATA  [

The following example from W3Schools shows a more realistic sample where
a JavaScript function with plenty of characters that need escaping is passed in
its entirety using the CDATA mechanism.
 

Example 2     // from www. W3Schools.com

<![CDATA[
                 function matchwo(a,b) {
                   if (a < b && a < 0) {
                     return 1
                   }
                   else {
                     return 0
                   }
                 }
                 ]]>


Processing Instructions

Processing instructions are in one sense similar to comments as they create areas
or spaces in the document that are not interpreted explicitly as part of the XML
document. Instead of supplying a means of documenting the page for someone
who is viewing the source of the document, a processing instruction is intended to
carry special instructions for applications which will process the XML
text in some way. You may recall the XML declaration at the head of a typical
XML page. It has the same form as a processing instruction, as is shown in the next
example, although strictly speaking the XML Recommendation treats the declaration
as a construct of its own.

Example  <?xml version="1.0"?>


The general form of a processing instruction is as follows. 


Example 
    <?target  instructions options ?>


An opening angle bracket with a question mark, and a question mark with a closing
angle bracket, enclose the general form of the processing instruction.
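
How an application receives a processing instruction can be sketched as
follows, assuming Python's standard xml.dom.minidom module. The stylesheet
instruction and the file name It.xsl are borrowed from the tree example
earlier in these notes; nothing is loaded from that file.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="It.xsl"?>'
    "<A/>"
)

# Collect (target, instruction text) for each PI at the top of the document.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(pis)
```

The parser splits the instruction into its target and the text that follows
it, and leaves the interpretation of that text to the application.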

Processing Instructions have some of their own jargon. The processing instruction
is often referred to in abbreviated form as a 'PI' and the instruction name is called a
'target'. A rule states that the target of a processing instruction cannot be the
characters 'xml' in any combination of case. While XML processors are not required to
pass regular comments along with a document, the processing instructions do
accompany the page. The following interesting example
of a database access in the PHP script language is found in E. Harold & S. Means'
'XML in a Nutshell' from O'Reilly. In the following example, PHP is the target
and the script that follows it is the instruction content of the PI.

Example

<?PHP
    mysql_connect("database.unc.edu","clerk","password");
    $result = mysql("CYNW", "SELECT LastName, FirstName FROM Employees
    ORDER BY LastName, FirstName");
    $i=0;
    while($i<mysql_numrows($result) ){
         $fields = mysql_fetch_row($result);
         echo "<person> $fields[1] $fields[0] </person>\r\n";
         $i++;
         }
    mysql_close( );
 ?>


The XML Declaration



Anatomy of an XML Document

Following is a list of the basic parts of the document. We will treat XML content
first as it is very short. Also we are going to defer treatment of the Document Type
Declaration to when we cover DTDs to avoid redundancy. (You had a preview
of it when we declared an internal entity.)  

1) XML declaration
2) Document Type Declaration
3) Element data
4) Attribute data
5) XML content  // character data


Let us briefly state that Element data, Attribute data and XML content refers
to the Data and Meta Data that is carried in the XML document body within
the root element, the root element included. We can focus some attention
on the XML declaration and the Document Type Declaration.
  

The XML Declaration  

The first line of an XML document is the XML declaration. The XML declaration
establishes that the current page is an XML document. The declaration takes the
form of an XML processing instruction. (A processing instruction allows special
instructions to be passed along using angle brackets enclosing question marks.) A processing
instruction begins with an opening prefix, <? and is suffixed with the characters, ?>.
The special characters 'xml' follow the opening prefix and serve to identify the XML
declaration.

Example <?xml version="1.0"?>

An XML document can be composed without the declaration though it should
be included. If it is included it needs to start the document. This means the very
first character! This effectively means that you must be sure that there is no white
space in advance of the starting angle brace of the declaration. 

// A technical exception, the invisible Unicode byte order mark may precede the declaration.
// Recently it has been observed in one browser at least that white space is tolerated at
// the head of the document
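
The start-of-document rule can also be checked outside a browser. The following is a
minimal sketch that assumes Python's standard xml.etree.ElementTree module (a wrapper
around the Expat parser) as the test tool. The helper name is_well_formed and the
<note/> sample are made up for illustration, and other parsers may behave differently.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The declaration starts at the very first character.
print(is_well_formed('<?xml version="1.0"?><note/>'))

# A single space ahead of the declaration.
print(is_well_formed(' <?xml version="1.0"?><note/>'))

# A blank line ahead of the declaration.
print(is_well_formed('\n<?xml version="1.0"?><note/>'))
```

With this parser only the first document is accepted. White space ahead of the
declaration is reported as a parse error.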

 The version attribute

The XML declaration itself is optional, but when it is present the XML 1.0
recommendation requires the version attribute. Browser behaviour agrees: when testing
with Mozilla 1.5 or Internet Explorer 6.0, for instance, leaving out the 'version'
attribute results in an error. Having said this, we have to be careful not to treat
browser support as the only criterion for judging XML compliance and behaviour.
Dedicated standalone XML tools may show different results. In any case, there is a
strong reason to include the version attribute, as it is expected that, as time goes
on, different versions of XML will co-exist. The attribute takes the form, version="x.x".

Different browsers, including Mozilla1.x , Netscape 6.2 and IE 6 or Opera can be
used to test these tags.

Example  <?xml  version="1.0" ?>

We can combine this declaration with the first example to create a functional XML document.
 

Example 2   <?xml version="1.0" ?>
                <first_name> Bart </first_name>
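
As a cross-check with a non-browser tool, the sketch below feeds Example 2 to Python's
standard xml.etree.ElementTree module, and then feeds it the same document with the
version attribute left out of the declaration. The choice of Python and the variable
names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Example 2 from above: a declaration with a version, then one element.
root = ET.fromstring('<?xml version="1.0" ?>\n<first_name> Bart </first_name>')
print(root.tag, repr(root.text))

# The same document, but the declaration carries no version attribute.
try:
    ET.fromstring('<?xml encoding="UTF-8"?>\n<first_name> Bart </first_name>')
    version_required = False
except ET.ParseError as err:
    version_required = True
    print("rejected:", err)

print("version required by this parser:", version_required)
```

This parser, like the browsers mentioned above, refuses a declaration that has no
version attribute.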
 

The standalone attribute

Besides the version attribute the XML declaration may also contain the 'standalone'
attribute. This attribute states whether the document relies on markup declarations
held outside the document. Assigning the string literal value "yes" declares that it
does not: any declarations the document needs are in its internal DTD subset. If a "no"
value is assigned to the attribute, externally defined declarations may affect the
document, and any internally defined declarations remain optional. If the
standalone attribute is left out, the value 'no' is assumed. (For now we may
think of a DTD as an XML related script.)

Example     standalone="yes"     //  means only internal DTDs are used
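
To see how a parser reports the attributes of the declaration, here is a small sketch
that assumes Python's standard xml.parsers.expat module. Its XmlDeclHandler callback
receives the version, the encoding and the standalone setting. The function name
read_declaration and the <doc/> sample are hypothetical.

```python
import xml.parsers.expat

def read_declaration(xml_bytes):
    """Return the version, encoding and standalone values reported by Expat."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        # Expat reports standalone as 1 for "yes", 0 for "no", -1 if omitted.
        seen.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return seen

print(read_declaration(b'<?xml version="1.0" standalone="yes"?><doc/>'))
print(read_declaration(b'<?xml version="1.0" standalone="no"?><doc/>'))
print(read_declaration(b'<?xml version="1.0"?><doc/>'))
```

Note that the parser distinguishes an omitted attribute (-1) from an explicit "no" (0).
Treating the omitted case as "no" is left to the application.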


The encoding attribute
   // with reference to an explanation kindly provided by François Yergeau


The encoding attribute is used to specify which character encoding will be used by
the document. XML uses an auto-detecting system to determine what character set
is being used in a document file. It interprets the first four bytes of the document
which will have different values depending on which encoding set is used. (This is
why it is important that the <?xml characters of the XML declaration appear as the
very first characters in the document.)

// If the first four bytes are 3C 3F 78 & 6D  then the encoding is ISO 646 or an encoding
// compatible with it, e.g. UTF-8, ASCII, a part of ISO 8859 etc. The encoding declaration
// is then looked at to distinguish which one of these it is.
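
The byte inspection described above can be sketched in a few lines. The following
Python function is an illustration only: it covers a simplified subset of the
detection table in the XML recommendation (the UCS-4 and EBCDIC cases are left out),
and the function name and the wording of the returned labels are made up here.

```python
def sniff_encoding_family(data):
    """Guess the encoding family from the first bytes of an XML document.

    Simplified subset of the autodetection table in the XML recommendation;
    UCS-4 and EBCDIC cases are deliberately left out.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (byte order mark)"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16, big-endian (byte order mark)"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16, little-endian (byte order mark)"
    if data.startswith(b"\x3c\x3f\x78\x6d"):   # the characters <?xm
        return "ASCII-compatible; read the encoding declaration"
    if data.startswith(b"\x00\x3c\x00\x3f"):
        return "UTF-16, big-endian (no byte order mark)"
    if data.startswith(b"\x3c\x00\x3f\x00"):
        return "UTF-16, little-endian (no byte order mark)"
    return "no declaration detected; UTF-8 is assumed"

declaration = '<?xml version="1.0"?><doc/>'
print(sniff_encoding_family(declaration.encode("ascii")))
print(sniff_encoding_family(declaration.encode("utf-16-be")))
print(sniff_encoding_family(b"<doc/>"))
```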

There is no official default value specified for the 'encoding' attribute. The way the
auto-detection process works is, if an encoding is not explicitly specified either
by means of an external transport protocol, a Byte Order Mark (BOM) or an explicit
assignment to the XML declaration 'encoding' attribute, then 'UTF-8' is the only
encoding that will work and not result in a fatal error being thrown.

This makes UTF-8 the de facto equivalent of a default encoding for XML.

( UTF-8 is a variable length character encoding system. For instance it uses a single
  byte for ASCII characters but three or more bytes for Asian symbols. )
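
The variable length is easy to observe. This short Python sketch (the helper name
utf8_length is made up for illustration) counts the bytes UTF-8 needs for an ASCII
letter, an accented Latin letter and a CJK ideograph.

```python
def utf8_length(character):
    """Number of bytes UTF-8 needs to store one character."""
    return len(character.encode("utf-8"))

print(utf8_length("A"))        # ASCII letter
print(utf8_length("\u00e9"))   # Latin small letter e with acute accent
print(utf8_length("\u65e5"))   # a CJK ideograph
```

The three calls report 1, 2 and 3 bytes respectively.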


Example
  encoding="UTF-8"

The following table of character encoding values was adapted from a table in
'The XML Bible' by Elliotte Rusty Harold.
 

Table of Some Common Encodings

 Name               Language / Country

 US-ASCII           English
 UTF-8              Compressed Unicode
 UTF-16             Compressed UCS
 ISO-10646-UCS-2    Raw Unicode
 ISO-10646-UCS-4    Raw UCS
 ISO-8859-1         Latin-1, Western Europe
 ISO-8859-2         Latin-2, Eastern Europe
 ISO-8859-3         Latin-3, Southern Europe
 ISO-8859-4         Latin-4, Northern Europe
 ISO-8859-5         ASCII plus Cyrillic
 ISO-8859-6         ASCII plus Arabic
 ISO-8859-7         ASCII plus Greek
 ISO-8859-8         ASCII plus Hebrew
 ISO-8859-9         Latin-5, Turkish
 ISO-8859-10        Latin-6, ASCII plus Nordic
 ISO-8859-11        ASCII plus Thai
 ISO-8859-13        Latin-7, ASCII plus Baltic/Latvian
 ISO-8859-14        Latin-8, ASCII plus Gaelic/Welsh
 ISO-8859-15        Latin-9, Latin-0; Western Europe
 ISO-2022-JP        Japanese
 Shift_JIS          Japanese, Windows
 EUC-JP             Japanese, Unix
 Big5               Chinese, Taiwan
 GB2312             Chinese, mainland China
 KOI8-R             Russian
 ISO-2022-KR        Korean
 EUC-KR             Korean, Unix
 ISO-2022-CN        Chinese
 ISCII-1991         Indian


International Language Support

This is a good place to interject that XML supports the late versions of Unicode. While
Unicode is currently at version 4.0.1 (as of September 2004), XML recommendations show
support for version 3.2 as of circa February 2002 and may support higher version features
now. Version 3.2 supports virtually every spoken language on earth so, by extension, XML
may be thought of as a fully international language. (There are limitations that need to be
resolved. For instance, although all the characters of Unicode can be used as content, there
are still some limitations on what text can be used in tags.) If this area is of special
interest to you, you may wish to read the following W3C article.

'Unicode in XML and other Markup Languages, Unicode Technical Report #20',
 http://www.unicode.org/unicode/reports/tr20/tr20-6.html



The Document Type Declaration   // the DOCTYPE element 

The document type declaration is used to specify the document type definition.
This declaration is associated with the DOCTYPE element. Stated more simply,
the DOCTYPE element declares the DTD, whether internal, external or both.

SGML requires a DOCTYPE declaration but XML does not. This implies that
XML documents that are designated,  'well-formed' are not required to contain
a document type declaration.


Form of the DOCTYPE Element

<!DOCTYPE  name  ( SYSTEM "DTD_URL"  |  PUBLIC "PUBLIC_ID" "DTD_URL" )  [ Internal DTDs ] >

Where - <!  - the exclamation mark marks the beginning of the declaration.
      - DOCTYPE - keyword for the element, which abbreviates Document Type Declaration
      - name - the name of the root tag of the XML document
      - SYSTEM - used in conjunction with a URL describing an externally defined DTD
      - PUBLIC - used in conjunction with a public ID, which in XML must be followed by a URL
      - [  ] - square braces optionally house an internally defined DTD subset. The
               SYSTEM or PUBLIC part is optional as well.
 

The following example shows the common DOCTYPE declaration found in standard XHTML
pages. You will find it inserted by HTML editors ahead of the first HTML tag. You can see
that 'html' will be the root tag of the document. A PUBLIC ID form is used rather than a
SYSTEM ID which would typically be a URL. ( The PUBLIC ID encapsulates information
stating that XHTML is a non-ISO standard whose proprietor is the W3C and is described
as 'XHTML 1.0 Transitional' with the ISO language identifier for English.)  Following this
is the file name that supplies the Document Type Definition for XHTML.


The XHTML DOCTYPE Declaration

<!DOCTYPE html  PUBLIC  "-//W3C//DTD XHTML 1.0 Transitional//EN"  "DTD/xhtml1-transitional.dtd">

 

 Using PUBLIC and a Public ID  // just for reference


In the above declaration the "-//W3C . . . EN"  string is called a Public ID.
More precisely it is a Public ID in literal form; the W3C XML specification calls
it a 'PubidLiteral', and calls the whole PUBLIC or SYSTEM clause an 'External ID'.
The Public ID represents a format that is understood to be well known and public.
If the Public ID is not known to the application interpreting the document, the
program can fall back on the URL that follows it, which locates a DTD holding the
appropriate formatting information. Although nothing prevents a Public ID from
following a different pattern than the one described below, the general use of the
Public ID is as follows.

General Use Public ID Form

"Public ID Character // DTD_proprietor // DTD_description // ISO_Language_Identifier"

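The parts of a DOCTYPE declaration can be inspected programmatically. The sketch below
assumes Python's standard xml.parsers.expat module, whose StartDoctypeDeclHandler
callback receives the root name, the system ID, the public ID and a flag for the
internal subset. The function name read_doctype is hypothetical. The declaration used
is the XHTML 1.0 Transitional one, with the public ID as published by the W3C, and the
parser does not fetch the external DTD.

```python
import xml.parsers.expat

def read_doctype(xml_bytes):
    """Return the parts of the DOCTYPE declaration as reported by Expat."""
    seen = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        seen.update(name=name, system_id=system_id, public_id=public_id,
                    has_internal_subset=bool(has_internal_subset))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)
    return seen

document = (b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
            b'"DTD/xhtml1-transitional.dtd">\n<html/>')
doctype = read_doctype(document)
print(doctype["name"])
print(doctype["system_id"])

# The public ID splits on "//" into its four conventional fields.
print(doctype["public_id"].split("//"))
```

Splitting on the double slash yields the registration character, the proprietor, the
description and the language identifier, in that order.
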


A Brief Look at DTDs


The DOCTYPE element is used to supply DTDs or Document Type Definitions to documents.
DTDs were the original way envisioned to control typing for XML documents. DTDs are part
of the original XML 1.0 specification, and as such will always be with us. We can reconsider
our earlier example which we used to create custom entities. It uses the square brace section,
an optional area that can be used to supply local Document Type Definitions. This
area can also be used to supply local declarations for elements and attributes.

Example

<?xml version="1.0"?>
<!DOCTYPE Entities
[
<!ENTITY def "This is an internal entity definition">
<!ENTITY pageMove "Notice as of 01/01/2008 this page
will be moved to www. tee.vee.com " >
]>
    <Entities>
    Dereferencing the Entity Called 'def' : &def;
    Dereferencing the Entity Called 'pageMove' : &pageMove;    
    </Entities>
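
To confirm that a parser dereferences internal entities, the following sketch feeds a
cut-down version of this document to Python's standard xml.etree.ElementTree module.
The tool is an assumption chosen for illustration; a browser shows the same
substitution when it displays the page.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<!DOCTYPE Entities [
<!ENTITY def "This is an internal entity definition">
]>
<Entities>Dereferencing the Entity Called 'def' : &def;</Entities>"""

root = ET.fromstring(document)

# The entity reference has been replaced by the text declared in the DTD.
print(root.text)
```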


Notice in the following example, how three elements are declared.


Example Demonstrating the Declaration of an Optional Element

<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE options  [

<!ELEMENT options (name, description?) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT description (#PCDATA) >
]>
<options>
<name>chrome trim</name>
</options>



EBNF contributes the following cardinality controls that enable a DTD
to dictate to some extent the number of times an element may appear.
( Note these controls don't enable dictating that an element should appear
3 to 5 times, for instance.)

* _ the asterisk, to represent 'zero or more'
+ _ the plus symbol, to represent 'one or more'
? _ the question mark, to represent 'zero or one' or the optional state.

In the above example you will notice the question mark is used to make
the second element, 'description', optional.
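
The three cardinality symbols behave much like the quantifiers of regular expressions.
As an illustration only, the following Python sketch translates a flat DTD sequence
such as 'name, description?' into a regular expression and tests lists of child
element names against it. The function name matches_model is hypothetical, and a real
validating parser does far more (choice groups, nesting, mixed content).

```python
import re

def matches_model(model, children):
    """Check child element names against a flat DTD sequence such as
    'name, description?'. Choice groups and nesting are not handled."""
    pattern = ""
    for item in model.split(","):
        item = item.strip()
        if item[-1] in "*+?":
            name, quantifier = item[:-1], item[-1]
        else:
            name, quantifier = item, ""
        # Each child is written as its name followed by one space.
        pattern += "(?:" + re.escape(name) + " )" + quantifier
    sequence = "".join(child + " " for child in children)
    return re.fullmatch(pattern, sequence) is not None

print(matches_model("name, description?", ["name"]))
print(matches_model("name, description?", ["name", "description"]))
print(matches_model("name, description?", ["name", "description", "description"]))
print(matches_model("name, description+", ["name"]))
```

The first two child lists satisfy the optional model, the third has one description
too many, and the last fails because '+' demands at least one description.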

Now consider how attributes are added to elements in DTDs. In the following
example the air element is provided with four required attributes that are of
the CDATA or Character Data type.

In the subsequent XML document the air element supplies all four attributes.

Example of an ATTLIST Element Used with An Element

<?xml version="1.0"?>

<!DOCTYPE Requirements [
    <!ELEMENT Requirements (air,water,food,shelter) >
    <!ELEMENT  air  (#PCDATA)>
    <!ATTLIST  air   oxygen          CDATA  #REQUIRED
                     nitrogen        CDATA  #REQUIRED
                     carbon_dioxide  CDATA  #REQUIRED
                     noble_gases     CDATA  #REQUIRED
     >
    <!ELEMENT water   (#PCDATA)>
    <!ELEMENT food    (#PCDATA)>
    <!ELEMENT shelter (#PCDATA)>
    ]>

<!-- The implementation of the locally defined DTD -->

<Requirements>
  <air  oxygen = "19%"         nitrogen = "79%"
        carbon_dioxide = "1%"  noble_gases = "1%" >
    Present and Acceptable
      </air>
  <water>
    Present and Not Acceptable
      </water>
  <food>
    Present but Not Acceptable
      </food>
  <shelter>
    Not Present
      </shelter>

</Requirements>

If this document were validated by a validating program, the program
would check that one of each of the elements described in the DTD was
present and that the 'air' element carried the four attributes
described in the DTD.
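
One small part of that check can be sketched with Python's standard xml.parsers.expat
module. Expat is not a validating parser, but its AttlistDeclHandler callback reports
each declared attribute, so a program can compare the #REQUIRED attributes against
those actually present. The function name missing_required and the cut-down document
(the nitrogen attribute has been left out on purpose) are made up for illustration.

```python
import xml.parsers.expat

def missing_required(xml_bytes):
    """List (element, attribute) pairs declared #REQUIRED but not supplied."""
    required = {}   # element name -> set of required attribute names
    missing = []

    def on_attlist(elname, attname, att_type, default, is_required):
        if is_required:
            required.setdefault(elname, set()).add(attname)

    def on_start(name, attrs):
        for attname in sorted(required.get(name, set()) - set(attrs)):
            missing.append((name, attname))

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)
    return missing

document = b"""<?xml version="1.0"?>
<!DOCTYPE Requirements [
    <!ELEMENT Requirements (air)>
    <!ELEMENT air (#PCDATA)>
    <!ATTLIST air oxygen   CDATA #REQUIRED
                  nitrogen CDATA #REQUIRED>
]>
<Requirements>
  <air oxygen="19%">Present and Acceptable</air>
</Requirements>"""

print(missing_required(document))
```

The document is still well-formed, so the parse itself succeeds. The returned list
names the attribute a validating program would complain about.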

The following table shows types that attributes can be assigned to constrain
whether an attribute needs to be used and what value it will have.

Table Summarizing the Default Attribute Types

 #REQUIRED          attribute required
 #IMPLIED           attribute optional
 #FIXED             attribute constant & final
 Literal Default    describes specifying a default value




Schema Supplant DTDs in Web Services

If this were a dedicated XML course we would be obliged to cover a lot more
information regarding DTDs. However, because Web Services is predominantly
dedicated to using XML Schema which is the XML typing system that has
supplanted DTDs we will not look at DTDs in any more depth.

It was mentioned that the DOCTYPE element is optional. This is convenient
as it allows us to cleanly substitute Schema typing in place of DTDs. In fact we
will discover that XML Schema is largely embedded in Web Services Transport
Mechanisms, in particular in the SOAP 1.2 Specification.

DTDs were limited by the fact that they came before the advent of XML Namespaces.
XML Namespaces play an important role in both XML Schema and Web Services
so we will look at XML Namespace support now.


XML Namespaces



XML Namespaces provide a mechanism to allow several sets of tags from different
XML applications to use similar identifiers without conflicting with each other inside
the same XML document. The use of namespaces allows elements with the same
name to be distinguished from each other.

Consider two hypothetical containers.

Example

DentistKit{ drill, pliers, floss}
CarpenterKit{ drill, pliers, saw}

If we mixed references to identifiers from each set of information without in some way
qualifying that it is a Dentist's drill or a Carpenter's pliers, the intent of whatever
we were doing could be lost.

We could keep things separate though by qualifying 'local' names as follows.

Example

DentistKit:drill
CarpenterKit:drill


XML Namespaces uses a 'colonized' form as shown in the above example to keep
XML elements from different XML applications in separate namespaces.

//  'colonized' means with a  colon, ' : '



Namespaces are created  using a special attribute, 'xmlns' which is referred to as
a 'reserved attribute'. It can be used in two forms, with and without an additional
prefix. These identifiers are assigned URIs.

Example

xmlns = " someURI "

xmlns:prefix = "someURI"

Significance of Namespace URIs

It is tempting and logical to assume that the URI, in its capacity as a Resource
Identifier or locator, brings some important information over the network. This is
not the case (at least not at this time). Nothing need exist at the URL location on
the web that is important to the document. What is important is that the URI represents
a unique identifier in whatever context the local document is running in.


The Default NameSpace Form

Consider the first of the two simple forms we outlined earlier.

Example
xmlns = "anyURI"

This is the 'default' form and creates an 'implicit', invisible prefix for any un-prefixed
element in the scope of the element in which the namespace is declared. If this is
in the root element of the document, then the scope is 'global' to the document. In
the following example, all the elements that do not have a prefix, implicitly become
qualified and part of the default namespace defined by xmlns="http://www.OutThere.com". 
Because the top level element holds the namespace declaration the default namespace
is 'global' and applies also to the root element, 'Outer'.

Example

<Outer xmlns ="http://www.OutThere.com">
<x>African Coffee Table</x>
<y>80</y>
<z>120</z>
</Outer>
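
The effect of the default namespace can be made visible with a namespace-aware parser.
The sketch below assumes Python's standard xml.etree.ElementTree module, which reports
each element name in the form {URI}local-name. The choice of tool is an illustration
only.

```python
import xml.etree.ElementTree as ET

document = """<Outer xmlns="http://www.OutThere.com">
<x>African Coffee Table</x>
<y>80</y>
<z>120</z>
</Outer>"""

root = ET.fromstring(document)

# The root and every un-prefixed child carry the default namespace URI.
print(root.tag)
for child in root:
    print(child.tag)
```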



The No Namespace Default Form

If we use the reserved attribute, 'xmlns' and assign it an empty string we create a
'No Name' default namespace. This is in fact an explicit way of stating the implicit
default namespace condition of an XML document that has no namespace defined.
In other words, putting this form into a root element has the same effect as not
putting any namespace declaration into an XML document.

Example    xmlns = ""

If you removed the URI from the above example and assigned it the empty string, "",
then the elements x, y, z and Outer would belong to the default 'no Namespace'
namespace, in other words, to no namespace.

The Prefixed Form of Namespace Declarations

When a prefix is used with a local name, the prefix stands in for a longer
URI. In this sense, creating a prefix to represent a URI is a namespace
declaration. For instance, in the following declaration,

Example   xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"

xsl stands in for "http://www.w3.org/1999/XSL/Transform". Prefixes
are bonded to XML namespace URIs using the xmlns:prefix syntax.

URIs cannot be used directly as part of element names because characters that
are conventional in URLs, e.g. the characters  / , ~  and  %, are illegal in XML
names. XML prefixes work around this restriction, allowing prefixes to
represent URIs inside XML documents. Each prefix maps to a given URI.

Inside the document the prefix is used to associate element(s) with that
namespace.

Example   <xsl:template>


Following is a common namespace declaration we will see when using XML
Schema language.

Example  xmlns:xs = "http://www.w3.org/2001/XMLSchema"

The namespace declaration is normally placed on the 'schema' element, the root of a schema document.

As a result we will typically see this attribute appear inside the schema element.


Example  <xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema"> . . . </xs:schema>


Then inside the schema document, we will see the xs prefix bound to elements that
belong to the schema application.

Example <xs:complexType>



Following is a simple complete example that shows a prefix in use.

Example <?xml version= "1.0"?>

<g:rock xmlns:g="http://www.sentex.net/">
<g:type>
 sedimentary
 </g:type>
</g:rock>


Namespace Terminology

If we isolate the <g:type>  element from the complete example above, 'g' is the 'prefix', 
'type' is the 'local part' and together the full name, g:type is called  the 'qualified name'.



Nested or Local Namespaces

In the following example the namespaces are not defined globally. They are
instead created within nested elements. In this scenario, a namespace prefix only
applies to the element in which it is declared and to that element's descendants.


Example

<?xml version= "1.0"?>

<!--  This element carries information about a geological rock -->
<mix>

<g:rock xmlns:g="http://www.sentex.net/">
<g:type>
 sedimentary
 </g:type>
</g:rock>

<!--  This element carries information about rock music -->

<m:rock xmlns:m="http://www.sentex.net/~pkomisar">
<m:name> The Rolling Stones </m:name>
<m:type> R&amp;B </m:type>
</m:rock>

</mix>
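
The scoping of nested declarations can be checked with a namespace-aware parser. The
sketch below assumes Python's standard xml.etree.ElementTree module and a slightly
shortened version of the document above. The variable prefix_leaked is a made-up name
used to record whether a prefix is visible outside the element that declared it.

```python
import xml.etree.ElementTree as ET

document = """<mix>
<g:rock xmlns:g="http://www.sentex.net/">
<g:type>sedimentary</g:type>
</g:rock>
<m:rock xmlns:m="http://www.sentex.net/~pkomisar">
<m:type>R&amp;B</m:type>
</m:rock>
</mix>"""

root = ET.fromstring(document)
print(root.tag)                      # mix itself is in no namespace
for rock in root:
    print(rock.tag, "->", rock[0].text)

# The g prefix is declared on g:rock only, so a sibling cannot use it.
try:
    ET.fromstring('<mix><g:rock xmlns:g="http://www.sentex.net/"/><g:type/></mix>')
    prefix_leaked = True
except ET.ParseError:
    prefix_leaked = False
print("prefix visible outside its element:", prefix_leaked)
```

The two rock elements share a local name but are reported with different URIs, and the
second parse fails because the prefix is unbound outside g:rock.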



A Note on Attribute Namespaces

"Unless we explicitly state otherwise, the attributes that an element carries are
considered to be in no namespace ( or null namespace) although most applications
treat them as though they are in the same namespace as the element that carries
them. " 

                                            - quote from  'Professional XML Schemas', J. Duckett et. al.

This is important because we cannot assume attributes will be treated as belonging
to the namespace of their associated elements; attributes probably should be explicitly
qualified.

In XML an element cannot have two attributes with the same name. Under the Namespaces
recommendation this extends to prefixed attributes: an element cannot carry two attributes
with the same local name whose prefixes are bound to the same URI. However, there is a
variation that is possible where the default namespace is assigned the same URI as an
associated prefix, as is shown in the following example from 'Professional XML Schemas'.
In this case, although it would appear that the two height attributes are in for a clash
and might be in the same namespace, we need to recall the above rule, which states that
attributes, if not explicitly qualified, are in the 'no name' namespace. So in this case, the
default qualification does not apply to the 'height' attribute while the cm:height attribute
is in the 'cm' prefix namespace.

Example from Professional XML Schemas, J. Duckett et.al., Wrox Press


<Dimensions xmlns= "http://www.example.org/measurements"
                    xmlns:cm="http://www.example.org/measurements" >
  <Vertical height="24inches" cm:height="60cms" />
 </Dimensions>
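
A namespace-aware parser shows how the two height attributes are told apart. This
sketch assumes Python's standard xml.etree.ElementTree module, which reports qualified
names in the form {URI}local-name; the choice of tool is for illustration only.

```python
import xml.etree.ElementTree as ET

document = """<Dimensions xmlns="http://www.example.org/measurements"
            xmlns:cm="http://www.example.org/measurements">
  <Vertical height="24inches" cm:height="60cms" />
</Dimensions>"""

vertical = ET.fromstring(document)[0]

# The element picks up the default namespace ...
print(vertical.tag)

# ... but only the prefixed attribute is namespace qualified.
for name, value in sorted(vertical.attrib.items()):
    print(name, "=", value)
```

The un-prefixed height attribute is reported with no URI, while cm:height carries the
measurements URI, so the parser sees two different attribute names.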

 



Self Test



1) True or False? All XML elements except for the root element have a single
     parent. True / False
 

2) Which of the following XML identifiers may be thought of as legally correct
    even though it breaks naming convention?

a) _after.all
b) xmlns:volume:section
c) _---_9289187
d) bingo-nite

3) Which of the following is not an XML delimiter?

a) <
b) >
c) &
d) :
 

4) Which of the following is not an attribute of the xml declaration?
a)  version
b)  encoding
c)  type
d)  standalone
 

5) What key character is missing from the following example of an XML
CDATA section?

<[CDATA[ content ]]>                                                       
 

6) True or False? Processing Instructions and Comments serve the same
purpose in an XML document. True / False
 

7) a) All well formed documents are valid. True/False 
    b) All valid documents are well formed.True/False         
 

8) True or False. All elements inside an element that has a namespace declared
belong to that namespace. True / False   
 


Exercise 1



Do exercise 1 or 2. Optionally, if you have extra time, do both,
but it is certainly not required. The dual exercise accommodates
individuals who have taken the XML course. However, the two
exercises do investigate different aspects of the material
covered.


Note For Browser Testing in Exercises

Firefox is a very lightweight version of Mozilla. It is just a browser, and unlike
Mozilla doesn't carry an HTML editor, a mail agent and an address book. It
would be a very quick download and would not take up much space on your
G drive.  You may wish to use it as the alternative browser to IE for testing.


A single XML document can be used to demonstrate all the following.

1) Create the simplest form of this XML document, include an XML declaration
    and then remove the version attribute. Run in both IE and Netscape or Mozilla
    and report what happens.

2)  Create two identifiers that demonstrate how letters, numbers and special
    characters can be used legally in XML tag names. ( Refer to the EBNF
    formula.) Test in both IE and Netscape/Mozilla. Start one of these identifiers
    with an underscore. Use XML comments to summarize which sorts of different
    characters were used.

3) Insert a colon into an XML identifier. Try in both IE and Netscape or Mozilla.
    What is the result?

4) Create a sentence that demonstrates the use of the five predefined escape entities.
    Then substitute this content into the XML document created in question number one
    using the appropriate escape entities.

5) Insert a CDATA section into your XML document that uses a Character Data
    Section to escape a short but complete HTML page.

6) Insert a processing instruction that includes a complete HelloWorld program
    written in Java. If you are not familiar with Java put in any sort of technical
    data or syntax you wish. 

7) Use an internally defined escape entity to represent the phrase '-Webster's Dictionary'.
    Make up a definition or use one you know. Use the internal entity to represent the
    'Webster's Dictionary' phrase inside your document. Make up another internally
   defined entity and add it to your document.

8) Create two namespaces in the XML document with short fictitious URIs. Create
    three tags that have the same name and use two sets of these tags in the same
    document demonstrating how they are kept separate in terms of namespace by using
    the namespace prefix.
 


  Exercise 2



1)  Build the following document without referring to examples, or at least
refer to the examples first and then work with books closed. For those of you
for whom this is a review, it should all come back to you as you build this document.

a) Create an xml declaration with version assigned.

b) Create a root element called 'schema'

c) Inside it create a prefixed namespace using the prefix 'xs'
      and the URI "http://www.w3.org/2001/XMLSchema"

d) Associate the xs prefix with the schema root element

e) Nest inside the schema element an xs prefixed element
    called 'complexType'

f) Give it an un-prefixed attribute called 'name' assigned
    the quoted value "Client".

g) Nest inside the complexType element an 'xs' prefixed
     element called 'sequence'.

h)  Inside the 'sequence' element nest three elements called
      'element', each with an unqualified 'name' attribute holding
      the value "FirstName", "Initial" and "LastName" respectively.
      Include in each of these elements, a 'type' attribute assigned
      the qualified value, 'xs:string'.

i) Save to a .xsd file.

Don't forget the rules of well-formedness, where all assigned values
are quoted and tags are all nested and closed properly. View in a
browser to confirm your file is well-formed.



2) The XML Recommendation states that the characters 'xml' in any
combination of upper or lower case should not be used to begin XML
identifiers. Since this should be an issue of well-formedness, create
an XML page to test if this rule is enforced. Test in Explorer and
Firefox. Briefly report your results.

3) Again using browsers, test whether

a)  more than one attribute with the same name can appear in an element.

b) two namespaces can be assigned to the same prefix.

c) a default namespace and a prefix can share the same URI, as in the form
shown in the note. Test if an attribute belonging to the 'No Name' namespace
is allowed to co-exist with a same-named attribute that is namespace
prefixed.  // in other words, test the final example of the note taken
from 'Professional XML Schemas'


Report results briefly.