XML Peter Komisar version 5.2 © 2005
references:
'XML and Web Services Unleashed', R. Schmelzer et al.
'The XML Bible', Elliotte Rusty Harold
'The Birth of XML', Jon Bosak, http://java.sun.com/xml/birth_of_xml.html
'XML in a Nutshell', E. R. Harold & W. S. Means
'Mastering XML', Navarro, White & Burman
'Professional XML Schemas', J. Duckett et al.
GML, SGML & HTML
In 1969, IBM was seeking a method to simplify the handling of legal documents. IBM wanted a technique that helped in creating, searching and storing these documents. The project was headed by Charles Goldfarb. He and co-workers Ed Mosher and Ray Lorie recognized that IBM's different computer systems all used different document formats and that a cross-platform markup standard was needed.
// 1969 Goldfarb, Mosher & Lorie create GML to manage legal documents
In response to these needs, they invented the Generalized Markup Language or GML. The researchers cleverly applied their personal signature to the language in that GML is also an abbreviation of their last names, Goldfarb, Mosher & Lorie. A standardized version was later recognized by the International Organization for Standardization, ISO. The standard version of GML became known as SGML, the Standard Generalized Markup Language.
SGML supplied a means whereby markup could be applied to documents and the marked-up documents could then be interpreted across different computer platforms. SGML did not define a single fixed set of tags. Instead SGML was designed to allow tags to be defined as needed. This was important as it enabled different industries to structure tags as required to best describe the information they processed.
// SGML allowed tags to be assigned custom values as needed
Unfortunately SGML became complex and difficult to use, suffered from industry infighting and did not reach its full potential. Another thing that thwarted its use was the success of one of its more limited but very popular 'offspring', HTML. Tim Berners-Lee wrote HTML using the SGML model, where tags delimited by angle brackets were interpreted to create hierarchical page structures. HTML, though limited in what it could do, was adequate for the needs of a huge base of users and became a 'run-away best seller' as far as programming notation goes. HTML with its links and the HTTP protocol eclipsed other popular Internet services like FTP, Telnet and Gopher.
// HTML's simplicity and utility thwarted SGML's growth. SGML was also stalled
// because it was complex, difficult and a victim of industrial infighting
Despite the immense popularity of HTML it did not address all the issues that SGML was created to solve. For instance, HTML does not supply customizable tags and its tags do not serve to describe the contents of the HTML page. There was still a need for a language more capable than HTML and simpler to use than SGML.
// Still HTML did not serve the needs that SGML was created to solve. HTML was
// not customizable and had no mechanism to describe the data that it contained
The Advent of XML // a refinement of SGML
The need to exchange data in an open manner had still not been met. SGML was too complicated, HTML was not suited to representing structured data and the rigid and expensive EDI was not easily adapted to the Internet.
Jon Bosak, Tim Bray, C. M. Sperberg-McQueen, Jean Paoli and James Clark, many of whom were SGML pioneers, sought to filter out SGML's best features and port them to the web. While saving the best features the group also looked to delete optional and complex features. (In a sense XML was a distillation of SGML similar to how Java is a simplification of C++.)
XML was designed to be used over the existing protocols of the Internet. It was also decided that it would be managed by the World Wide Web Consortium or W3C, the same group that was managing the HTML standard. For a short time the language they created was called 'Web SGML' but this was dropped in favor of XML.
Jon Bosak offers his own recollection of how XML came to be at http://java.sun.com/xml/birth_of_xml.html
Hello World in XML
We can introduce the general look of XML in a quick Hello World version.
Example
<?xml version = "1.0" standalone="yes"?>
<Earth>
Hello World in XML!
</Earth>
Write or copy and paste this text into a simple text editor like Notepad and save it under a name with an .xml ending, such as HelloWorld.xml. Once saved, it can be opened in a browser. The result is not very exciting as there is no formatting associated with the text.
If we look at the <Earth> element, we can see that XML uses 'tags', sets of enclosing angle brackets that surround identifiers. A 'start' and an 'end' tag are distinguishable. The 'end' tag includes a forward slash ahead of the element identifier.
A start tag, its matching end tag and the content between them are together called an element. Elements can also contain attributes.
In the following example the attribute called 'type' holds the value 'planet'.
Example <Earth type="planet"></Earth>
Note in this reiteration of the element we left the content out. This creates an 'empty' element. XML supplies an abbreviated form for an 'empty' element as follows.
Example <Earth type="planet" />
This is a recommended form as it reduces the risk of creating an 'orphaned' end tag.
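As a rough check of the claim that the long and abbreviated forms describe the same empty element, the following Python sketch parses each one with the standard library's xml.etree.ElementTree module. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The long form and the abbreviated form of the same empty element.
long_form = ET.fromstring('<Earth type="planet"></Earth>')
short_form = ET.fromstring('<Earth type="planet" />')

# Both parse to an element with the same name, the same attribute and no content.
same_tag = long_form.tag == short_form.tag
same_attributes = long_form.attrib == short_form.attrib
both_empty = (
    long_form.text is None
    and short_form.text is None
    and len(long_form) == len(short_form) == 0
)
print(same_tag, same_attributes, both_empty)
```

To the parser the two spellings are interchangeable, so the choice between them is a matter of style and error avoidance.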
Speaking of form, XML defines what makes a 'well formed' XML document.
Well Formed and Valid XML
Rules Governing XML Structure
1) XML Elements must have closing tags. That means all tags.
Example
<Break></Break> or <Break />
2) XML Elements unlike HTML are case sensitive.
Example
<GO /> is not the same as <go />
3) XML tags must all be properly nested. In other words tags must be closed in the reverse of the order in which they were opened. Below the tags open One, Two, Three and close Three, Two, One.
Example
<One>            <!-- opens -->
  <Two>          <!-- opens -->
    <Three>      <!-- opens -->
    </Three>     <!-- closes -->
  </Two>         <!-- closes -->
</One>           <!-- closes -->
4) XML Documents must have a single root element. This implies all elements of a document are nested inside the root element. The root element's name is the same name that is declared in the document type declaration if one is present.
5) Attribute values must all be quoted. Single or double quotes may be used; by convention double quotes are preferred.
Example
number="1029383454738"
6) An attribute name may appear only once in an element.
Example
<!-- can't have -->
<X x="y" x="z">
7) Attribute values cannot contain references to external entities. XML text can reference external entities but attribute values cannot. Attributes can use internally defined and predefined entity references.
Example <ANC nac="CNA&apos;S">
8) Entities must be declared before they are used. Predefined entities are already defined so they are ready to go. // entities can't be forward referenced
Well Formed XML // defining what a well formed XML document is
In the first case, XML requires that a document be 'well formed'. To be well formed a document must follow the above stated rules and, in addition, the document must not contain markup or characters that XML cannot process.
Formula For Well Formed XML
Adherence to Structural Rules + Correct Syntax = Well Formed XML
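One way to see the structural rules in action is to hand small documents to a parser and note which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree module. The is_well_formed helper and the sample documents are assumptions made for illustration, and a parser's verdict is only as complete as the checks that particular parser performs.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents, one per rule being illustrated.
samples = {
    "properly nested": "<One><Two><Three></Three></Two></One>",
    "badly nested": "<One><Two></One></Two>",
    "case mismatch": "<GO></go>",
    "missing closing tag": "<Break>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<X x=y />",
    "duplicate attribute": '<X x="y" x="z" />',
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'NOT well formed'}")
```

Only the first sample passes. Note that this tests well-formedness only; it says nothing about validity against a DTD or schema.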
Valid XML
An XML document is considered valid if it is first well formed and in addition it has a schema, either a Document Type Definition (DTD) or an XML Schema, that describes constraints with which the document complies.
A well formed document can be used without a schema. This will automatically keep it from using certain advanced XML features that are available only through some form of document type declaration.
Elements vs. Attributes
XML Elements - Elements are either matched tag pairs or self-closing tags.
Example <bird></bird> or <bird />
In the above example the two forms specified are functionally equivalent. The first is technically a pair of empty tags which can be abbreviated to the second, self-closing form. The self-closing form signals the processor that no matching end tag will follow. Because the first form is more prone to error, it is recommended that the self-closing form be used in place of any empty tag pair.
// self-closing tags are recommended as they are less ambiguous and less prone to error
XML Attributes - Attributes are properties that can be applied to an element to modify or enhance its description and functionality. Attributes are assigned in name-value pairs. The name identifies the particular attribute of the element. The value is the quantity that is assigned to the name. Following is an example.
Example
<Student international="true" fulltime="true" resident="false">
Jampour MacEarthski
</Student>
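To make the name-value idea concrete, here is a minimal Python sketch that parses the Student element with the standard xml.etree.ElementTree module and reads back its attributes and content. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

student = ET.fromstring(
    '<Student international="true" fulltime="true" resident="false">'
    "Jampour MacEarthski"
    "</Student>"
)

# Attributes arrive as name-value pairs; the values are always strings.
attributes = dict(student.attrib)
# The element content is the text between the start and end tags.
name = student.text.strip()
print(name, attributes)
```

Note that the parser hands back "true" and "false" as plain strings. Interpreting them as booleans is left to the application or to a schema.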
When to use Elements? When to Use Attributes?
A big advantage of attributes (over defining additional tags) is that they can be controlled in a more granular way than regular XML tags. For instance an attribute can be defined as required or optional. An attribute can be fixed or variable. In addition, other sorts of constraints can be put on attributes. We will see later how an attribute can also be restricted to a certain range of valid values.
// attributes can be specified in a granular way using DTDs or schemas.
Another aspect of using attributes, which may be advantageous or disadvantageous depending on the application, is that attributes are naturally limited from occupying a primary node in an XML document's tree structure. In XML, the attribute occupies a subordinate branch of an element node. This might be disadvantageous from a search perspective as in many sorts of searches only nodes are checked for information. On the other hand, where an attribute represents a relatively unimportant detail, it is probably better that such information is hidden, leaving the representation of the XML tree structure simple and essential and not encumbered by excessive detail.
// It should be added that XPath is able to search for subparts of nodes such
// as attributes so this argument may not be that significant
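The note about XPath can be illustrated with the limited XPath subset that Python's standard xml.etree.ElementTree module supports. This is a sketch only: the Class wrapper element and the second student are hypothetical, and a full XPath engine offers far more than the two predicates shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical class list built around the Student element.
roster = ET.fromstring(
    "<Class>"
    '<Student international="true" resident="false">Jampour MacEarthski</Student>'
    '<Student international="false" resident="true">Ann Example</Student>'
    "</Class>"
)

# The [@name='value'] predicate selects elements by attribute value.
international = [
    s.text for s in roster.findall(".//Student[@international='true']")
]
# The [@name] predicate selects elements that carry the attribute at all.
with_resident = roster.findall(".//Student[@resident]")
print(international, len(with_resident))
```

Since attributes can be searched this directly, the choice between element and attribute rests more on the uniqueness and simplicity of the property than on searchability.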
Two guidelines have arisen that can help you decide when a subject should be treated as an element or as an attribute. The first guideline is that an attribute needs to be unique. An attribute can only appear once in an element. As a possible example, a shirt size once it has been specified cannot be specified again.
// attributes need to be unique, appearing only once in an element
The second guideline states that attributes should be simple. If you are familiar with Java you can think of primitive values as being prime or fundamental. On the other hand Java class objects are most often composed of several parts. Java primitive values are analogous to attributes and Java classes are somewhat similar to XML elements.
// attributes should be simple or atomic
To summarize, there are no hard and fast rules that tell how best to choose whether a property should be an attribute or a sub-element. There are some good guidelines though. If a property has to be represented many times it is best represented as an element. If a property is descriptive in a simple, unique and singular sense it is a candidate to be an attribute. If an object is naturally complex and will likely decompose into further properties and attributes it is likely best represented as an element.
// a property may be described as an attribute when it is a) unique or non-repeating b) atomic
// as in elemental or not complex
XML Syntax
XML adheres to the basic method used by markup languages where text is surrounded by markup. The markup associates some additional information used to describe the enclosed text in some way. Markup languages only have these two layers or essential parts, markup and content.
Example
<markup>content</markup>
<!-- The 'markup' identifier is metadata, that is, data 'about' the data, 'content' -->
An XML document is focused on supplying an extensible, hierarchical format that provides tag labels to describe a document's contents. The tag labels make the data self-describing and are referred to as 'metadata'. Metadata is descriptive data that describes (or is 'about') the data that represents the contents of a document.
XML Data Model
The XML Data Model is described as a generalized tree data structure. This is an extension of the 'linked list' where each node, instead of pointing to a single next link, may point to more than one node so that different non-linear data structures, called 'trees' and 'graphs', may be formed.
Diagram showing a Simplified XML Tree Structure
          parent            // root or document element has no parent
       _____|_____
      |           |
   child_1     child_2      // other than root every child has exactly one parent
Trees are a limited form of a 'graph' characterized by having a unique starting point called the 'root'. The tree is also a recursive structure where any node can itself be treated as the root of what may be called a 'sub-tree'. There are other characteristics that we are used to by now, such as the property of a tree where every node except the root has a unique parent.
XML documents are hierarchical tree structures that can accommodate any depth of nested nodes and any number of child elements within the practical limits of the system being used. Each XML document that is described as 'well formed' has one 'root element' that has no parent and forms the base or root of the document tree. (The term 'well formed' is used in XML to describe a document that adheres to a certain set of compositional rules.) The root element is also referred to as the 'document element'.
The Binary Tree
Before describing how tree data structures are traversed it is useful to describe the simple 'binary tree' data structure. The binary tree is a constrained form of the tree data structure whose nodes may have zero, one or at most two children.
// a binary tree is a data structure whose nodes have zero, one or at most two children
The binary tree has an accepted set of terms used to describe its parts. First there is the node, which is a point or link in the tree data structure. Each node in a binary tree may have a 'left child' and may also have a 'right child'. If a node has no children, then the node is referred to as a 'leaf' and represents an end point on a tree's branch. If the left and right child themselves have children, then they may be referred to as the 'left subtree' and the 'right subtree'.
Binary Tree Traversals
It is typical to do searches on hierarchical systems like trees to extract information from them. Binary tree traversals describe different systematic ways of visiting each node of a tree. There are three common forms of binary tree traversal that can be extended to more general tree forms. These are based on the order in which the component parts of a node are visited. Consider the following simple case of a node with a left child and a right child.
Simple Node with a Left and Right Child
          Node
         /    \
 Left Child   Right Child
This representation can be paraphrased as follows.
     A
    / \
   B   C
Three popular methods of traversal, 'preorder', 'inorder' and 'postorder' traversal, are based on the relative sequence in which these nodes are visited.
In preorder traversal the parent node is visited first, followed by the left child and the right child. In inorder traversal the left child is visited first, followed by the parent node, followed by the right child. In postorder traversal the left child is visited first, followed by the right child, followed lastly by the parent node.
Recall that trees are recursive structures. The different sorts of traversal methods are also applied recursively. We can expand our example to include subtrees.
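The three traversal orders can be sketched in a few lines of Python. The Node class and function names below are assumptions made for illustration, and the seven-node tree simply gives B and C two children each (labelled D to G) so that the recursion has something to do.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def preorder(node: Optional[Node]) -> List[str]:
    """Parent first, then the left subtree, then the right subtree."""
    if node is None:
        return []
    return [node.label] + preorder(node.left) + preorder(node.right)

def inorder(node: Optional[Node]) -> List[str]:
    """Left subtree first, then the parent, then the right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.label] + inorder(node.right)

def postorder(node: Optional[Node]) -> List[str]:
    """Left subtree first, then the right subtree, then the parent."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.label]

# The simple node with two children, and an expanded tree with subtrees.
simple = Node("A", Node("B"), Node("C"))
tree = Node("A", Node("B", Node("D"), Node("E")), Node("C", Node("F"), Node("G")))

print(preorder(tree))   # A B D E C F G
print(inorder(tree))    # D B E A F C G
print(postorder(tree))  # D E B F G C A
```

Each function differs only in where the parent's own label is placed relative to the two recursive calls, which is the whole distinction between the three orders.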
XML, and subsequently the XML dialects used in Web Services, use 'preorder traversal'. Preorder traversal coincides with the way XML elements appear, top to bottom, in an XML document. Since this is the native traversal method used in XML we will examine this variety of traversal.
Preorder Traversal
In the simple scenario shown below, the main 'Node' is visited, followed by the 'Left Child', followed by the 'Right Child'.
Simple Node with a Left and Right Child
          Node
         /    \
 Left Child   Right Child
When nested nodes need to be considered, the tree has to be traversed recursively.
Preorder Traversal Example // The numbers indicate the order nodes are visited in preorder traversal.
           1.A
          /    \
       2.B      5.C
      /   \    /   \
   3.D   4.E 6.F   7.G
Using preorder traversal, the principal node is visited first, followed by its left child. But the left child is itself a parent node so we need to recurse, following the same pattern. This leads us to the left child of node B, which is D, followed by the right child E. From here the B subtree has been processed so the traversal moves to the right child of the main node A, which is C. Here too C is the root of a subtree that itself needs to be traversed according to the pattern.
The only difference in traversal of XML documents is that they are general trees rather than binary trees and are not limited in the number of children each node may have. The following example shows the convenient aspect of preorder traversal where the order taken is the same top to bottom order that occurs naturally in XML instance documents. For instance, the tree described above would appear in the same order in an XML instance.
XML Instance Example of the Above Tree
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="It.xsl"?>
<A>1.
  <B>2.
    <D>3.</D>
    <E>4.</E>
  </B>
  <C>5.
    <F>6.</F>
    <G>7.</G>
  </C>
</A>
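The correspondence between document order and preorder traversal can be checked with a short Python sketch using the standard xml.etree.ElementTree module, whose Element.iter() method walks elements in document order. The XML declaration and stylesheet instruction are left out of the string here only to keep the sketch small.

```python
import xml.etree.ElementTree as ET

document = """
<A>1.
  <B>2.
    <D>3.</D>
    <E>4.</E>
  </B>
  <C>5.
    <F>6.</F>
    <G>7.</G>
  </C>
</A>
"""

root = ET.fromstring(document)

# iter() yields the element itself and then all descendants in document order.
visit_order = [element.tag for element in root.iter()]
# The text that opens each element carries the visit number from the diagram.
visit_numbers = [element.text.strip() for element in root.iter()]
print(visit_order)
print(visit_numbers)
```

The tags come back as A, B, D, E, C, F, G, matching the numbering 1 to 7 in the tree diagram.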
XML Delimiters
Markup text has to be isolated from the content that makes up the XML document. The special characters used to do this are called delimiters. XML has four delimiter characters, listed below.
XML Delimiters
< _ the start of an XML Tag
> _ the close of an XML Tag
& _ the start of an XML Entity Reference
; _ the end of an XML Entity Reference
We can use these delimiters to create an XML tag example that points out the self-describing feature of XML. The data that describes other data is called metadata.
Example
<first_name> Bart </first_name>
XML Identifiers, Whitespace, & Comments
XML Identifiers
Stricter than HTML, XML identifiers are case sensitive. This is to say, xXx is a different identifier than XxX. XML naming has more latitude than is usually found in programming languages. XML names can start with letters or an underscore but not with digits, hyphens or periods. Subsequent characters can be letters, numbers and some other non-letter characters. XML names cannot include angle brackets, < >, or white space. The colon is a legal character but is reserved for use with namespaces. As such, only one colon can be used in a qualified name. XML names can include characters like periods and hyphens. Following we can summarize this information in a rule.
XML Reserved Characters
The XML Recommendation reserves element and attribute names that begin with the characters 'xml', in any combination of upper and lower case, for standardization.
// checked in the browser and it didn't care. didn't try in a validation context
Rule for XML Identifiers
XML names may start with an underscore or a Unicode letter. Subsequent characters may be an underscore, Unicode letter, number, hyphen, period or colon. An identifier may not contain any whitespace.
The colon can be used legally in names but by convention is reserved for use with namespaces where only one is allowed. We can create a rule to describe these conditions as follows.
"XML Identifiers are defined as a letter or underscore followed by any number of letters or numbers or underscores or hyphens or periods. A single colon may optionally be included in conjunction with its use with namespaces."
With respect to the colon, it is really only used to separate namespace prefixes from other associated identifiers as is shown in the following example.
Example tools:hammer // describes a 'hammer' identifier that is in the 'tools' namespace
Following, the rule is restated in Extended Backus-Naur Form or EBNF. Notice XML uses Unicode so letters may be non-English letters and ideograms, such as Greek letters.
XML Identifier Rule
XML_Identifier ::= ( letter | underscore )
                   ( letter | number | underscore | hyphen | period )*
                   ( colon? )
                   ( letter | number | underscore | hyphen | period )*
// colon by convention is reserved for use in namespaces.
// Extended Backus-Naur Form notation has supplied many of the symbols commonly
// used in pattern matching dialects like Perl. The above rule can be read 'XML_Identifier
// 'is defined as' a letter or underscore, followed by zero or more letters or numbers or underscores
// or hyphens or periods, optionally followed by a colon, followed by zero or more letters or
// numbers or underscores or hyphens or periods.
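Because EBNF maps so directly onto regular expression syntax, the identifier rule can be sketched as a Python pattern. This is an approximation of the simplified rule given in these notes, not the full Name production of the XML Recommendation. It assumes Python's \w class (Unicode letters, digits and the underscore) is an acceptable stand-in for 'letter | number | underscore', and the function name is hypothetical.

```python
import re

# letter | number | underscore | hyphen | period
NAME_CHAR = r"[\w.\-]"

# (letter | underscore) NAME_CHAR* colon? NAME_CHAR*
# [^\W\d] means "a word character that is not a digit", i.e. a letter or underscore.
XML_IDENTIFIER = re.compile(rf"[^\W\d]{NAME_CHAR}*:?{NAME_CHAR}*")

def is_xml_identifier(name: str) -> bool:
    """Check a name against the simplified identifier rule."""
    return XML_IDENTIFIER.fullmatch(name) is not None

for candidate in ["first_name", "tools:hammer", "report.txt", "1st", "first name", "a:b:c"]:
    print(candidate, is_xml_identifier(candidate))
```

The first three candidates pass. The last three fail because of a leading digit, embedded whitespace and a second colon respectively.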
There may be processing advantages to having the facility to use non-typical characters in XML identifiers. For instance XML would permit an element identifier to be the name of a file, with the extension included via the period symbol. The content of the tag is then directly associated with the file name. A process could then be used to reconstitute the file with the appropriate file extension.
// non-typical characters like the period, '.' might be useful for processing files
As a rule though, unless there is some specific reason to take advantage of XML's liberal naming scheme, it is probably wise to stick with conservative identifiers. This approach would allow the metadata identifiers to be easily ported into other language domains such as Java. This conservative policy might keep a body of data flexible for future use and adaptation.
// conservative names that would be legal in other programming languages keep applications flexible
Also, XML_names_can_be_longer_than_is_reasonable! However this could lead to some unanticipated troubles with devices that have limited abilities so again it will probably pay to keep names at a reasonable length. The final bit of advice is to keep the names humanly readable to take advantage of XML's metadata scheme to the fullest.
// keeping names short, simple and free from exotic characters may assist in future ports
Comments
XML uses the same comment form that HTML uses. Following is an example.
XML Comment Form <!-- comment -->
For the record, the opening comment delimiter consists of an opening angle bracket followed by an exclamation mark and two hyphens. The closing comment delimiter is two hyphens and a closing angle bracket.
Entity References
In order to introduce special characters into documents (such as a pair of angle brackets, < >) an escape technique is needed. This functionality is supplied by the entity reference. The entity reference is also used to create variables that 'stand in' for a long or complicated section of text.
The entity reference starts with an ampersand and ends with a semi-colon.
Form of the Entity Reference &the_entity;
There are five predefined or built-in entity references, listed in the following table.
Table Summarizing Built-in Entity References
Symbol | Description |
&lt; | less than, < |
&gt; | greater than, > |
&amp; | ampersand, & |
&quot; | double quote, " |
&apos; | apostrophe aka single quote, ' |
In the following example angle brackets are escaped into the document using the built-in entity references.
Example
<?xml version="1.0" ?>
<HTML>
&quot;&lt;&lt;&lt; S &apos; G &apos; M &apos; L &gt;&gt;&gt;&quot;
</HTML>
Browser Output // Saved with an xml extension, HTML tags are revealed rather than interpreted
<HTML>
"<<< S ' G ' M ' L >>>"
</HTML>
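The escaping step and its reversal by the parser can be sketched in Python with the standard xml.sax.saxutils and xml.etree.ElementTree modules. The QUOTES mapping is an assumption of this sketch: escape() handles the ampersand and the two angle brackets on its own, and any further characters have to be passed in explicitly.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "\"<<< S ' G ' M ' L >>>\""

# escape() replaces &, < and > by default; the quote characters are supplied here.
QUOTES = {'"': "&quot;", "'": "&apos;"}
escaped = escape(raw, QUOTES)
print(escaped)

# A parser replaces each entity reference with the character it stands for.
element = ET.fromstring(f"<HTML>{escaped}</HTML>")
round_trip = element.text
print(round_trip)
```

The escaped string is what belongs in the XML file, while the round-tripped text is what an application (or a browser's tree view) sees after parsing.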
Entities with custom names can also be defined by an XML author. For instance, an author's name, e-mail and mailing address might be encapsulated in an entity called &authorID; . This form is often used to substitute text that needs to be repeated frequently in documents. Copyright notices are an example.
Internal Entity References
To create your own custom internal entities, the basic form is as follows.
Form of an Internal Entity
<!ENTITY entityName "entity textual content">
In the following example, 'pageMove' is the identifier for the custom internal entity and the text that follows, "Notice etc.", provides the content that will be substituted into the document.
Example
<!ENTITY pageMove "Notice as of 01/01/2008 this page will be moved to www. tee.vee.com ">
Entities are really a part of the Document Type Definition discussion. It is there where we take up a special construct called the 'Document Type Declaration'. We need to borrow it here as part of the formula to supply local samples of custom internal entities. Suffice it to say that the Document Type Declaration is characterized by the DOCTYPE keyword. We see a sample of this declaration in the following example.
Notice the square brackets that can be supplied optionally in the DOCTYPE declaration. It is inside the square brackets where custom internal entities can be declared. The entity is then substituted into text using the same escape characters that predefined entities use, the ampersand before and the semi-colon after. Following is an example of this.
Example
<?xml version="1.0"?>
<!DOCTYPE Entities [
  <!ENTITY def "This is an internal entity definition">
  <!ENTITY pageMove "Notice as of 01/01/2008 this page will be moved to www. tee.vee.com ">
]>
<Entities>
Dereferencing the Entity Called 'def' : &def;
Dereferencing the Entity Called 'pageMove' : &pageMove;
</Entities>
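To watch the substitution happen outside a browser, the following Python sketch parses a cut-down version of the document with the standard xml.etree.ElementTree module, whose underlying parser expands internal entities declared in the DOCTYPE. The second check, using an entity name that was never declared, is an illustration added here of the rule that entities must be declared before use.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<!DOCTYPE Entities [
  <!ENTITY def "This is an internal entity definition">
]>
<Entities>Dereferencing the Entity Called 'def' : &def;</Entities>"""

# The parser replaces &def; with the text given in the entity declaration.
root = ET.fromstring(document)
expanded = root.text
print(expanded)

# Referencing an entity that was never declared is an error.
try:
    ET.fromstring("<Entities>&undeclared;</Entities>")
    undeclared_rejected = False
except ET.ParseError:
    undeclared_rejected = True
print(undeclared_rejected)
```

Applications that accept XML from untrusted sources often restrict entity expansion, so behavior may differ in hardened parsers.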
External Entities
External entities are referenced using a URL. They may represent whole documents. The referenced document may be XML or another sort of document. External entities that are not XML require special notation to declare the file type. We defer these examples to a brief discussion of Document Type Definitions.
CData Sections & Processing Instructions
CDATA Sections
// escapes large sections of what would otherwise be illegal xml content
Documents may have large sections of character data that would be more efficiently handled if the XML processor could just pass them through without looking for markup. This includes data that might be riddled with characters that need to be escaped. For larger and more complex groupings it is more effective to create CDATA sections, which allow long passages to be escaped. Inside a CDATA section no characters have any special meaning. The general form used for CDATA sections is as follows.
General form of a CDATA Section <![CDATA[ contents ]]>
Here is an example of a
CDATA section that follows this form.
Example <![CDATA[ <><><><>""""&&&&""""<><><><> ]]>
All these special characters that normally need to be escaped are passed as raw data. One thing that cannot be included in a CDATA section is the CDATA section end delimiter itself, ]]>, as it would prematurely end the CDATA section. Also be careful to keep white space from between any of the symbols of the CDATA opening delimiter.
Example <![CDATA[ is not the same as <! CDATA [
The following example from W3Schools shows a more realistic sample where a JavaScript function with plenty of characters that need escaping is passed in its entirety using the CDATA mechanism.
Example 2 // from www. W3Schools.com
<![CDATA[
function matchwo(a,b) {
if (a < b && a < 0) then{
return 1
}
else{
return 0
}
}
]]>
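The effect of the CDATA wrapper can be seen by parsing the same script text with and without it. The sketch below uses Python's standard xml.etree.ElementTree module. The enclosing script element and the one-line script are assumptions for illustration, since a CDATA section has to sit inside some element.

```python
import xml.etree.ElementTree as ET

script = "if (a < b && a < 0) { return 1 }"

# Wrapped in a CDATA section, the raw text is accepted as is.
with_cdata = ET.fromstring(f"<script><![CDATA[{script}]]></script>")
print(with_cdata.text)

# Without the CDATA section the < and & characters are read as markup and rejected.
try:
    ET.fromstring(f"<script>{script}</script>")
    rejected_without_cdata = False
except ET.ParseError:
    rejected_without_cdata = True
print(rejected_without_cdata)
```

After parsing, the application sees ordinary text. The CDATA delimiters themselves are not part of the element's content.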
Processing Instructions
Processing instructions are in one sense similar to comments as they create areas in the document that are not interpreted as part of the XML document's content. Instead of supplying a means of documenting the page for someone who is viewing the source of the document, a processing instruction is intended to carry special instructions for applications which will process the XML text in some way. You may recall that the XML declaration at the head of a typical XML page has the same form as a processing instruction, as is shown in the next example. (Strictly speaking, the XML Recommendation treats the XML declaration as a separate construct rather than a processing instruction.)
Example <?xml version="1.0"?>
The general form of a processing instruction is as follows.
Example
<?target instructions options ?>
An opening angle bracket and question mark, and a closing question mark and angle bracket, are the delimiters of the processing instruction.
Processing instructions have some of their own jargon. The processing instruction is often referred to in abbreviated form as a 'PI' and the instruction name is called a 'target'. A rule states that the target name of a processing instruction cannot start with the characters 'xml'. While XML processors are not required to pass regular comments along with a document, processing instructions do accompany the page. The following interesting example of database access in the PHP script language is found in E. R. Harold & W. S. Means' 'XML in a Nutshell' from O'Reilly. In the following example, PHP is the target and the script that follows is the instruction passed to the PHP processor.
Example
<?PHP
mysql_connect("database.unc.edu","clerk","password");
$result = mysql("CYNW",
"SELECT LastName, FirstName FROM Employees
ORDER BY LastName, FirstName");
$i=0;
while($i<mysql_numrows($result) ){
$fields = mysql_fetch_row($result);
echo "<person> $fields[1] $fields[0] </person>\r\n";
$i++;
}
mysql_close( );
?>
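How an application receives a processing instruction can be sketched with Python's standard xml.dom.minidom module, which keeps PIs as nodes with a target and a data part. The sample document, reusing the xml-stylesheet instruction seen earlier in these notes, is chosen only for illustration.

```python
from xml.dom import minidom

document = '<?xml-stylesheet type="text/xsl" href="It.xsl"?><A>1.</A>'
dom = minidom.parseString(document)

# Collect every processing instruction that sits at the top level of the document.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The parser does not interpret the data part at all. It simply hands the target and the remaining text to whichever application cares about that target.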
The XML Declaration
Anatomy of an XML Document
Following is a list of the basic parts of the document. We will treat XML content first as it is very short. Also we are going to defer treatment of the Document Type Declaration to when we cover DTDs to avoid redundancy. (You had a preview of it when we declared an internal entity.)
1) XML declaration
2) Document Type Declaration
3) Element data
4) Attribute data
5) XML content // character data
Let us briefly state that Element data, Attribute data and XML content refer to the data and metadata that is carried in the XML document body within the root element, the root element included. We can focus some attention on the XML declaration and the Document Type Declaration.
The XML Declaration
The first line of an XML document is the XML declaration. The XML declaration establishes that the current page is an XML document. The declaration takes the form of an XML processing instruction. (A processing instruction allows special instructions to be embedded using angle brackets enclosing question marks.) A processing instruction begins with an opening prefix, <?, and is suffixed with the characters ?>. The special characters 'xml' follow the opening prefix and serve to identify the XML declaration.
Example <?xml version="1.0"?>
An XML document can be composed without the declaration though it should be included. If it is included it needs to start the document. This means the very first character! This effectively means that you must be sure that there is no white space in advance of the starting angle bracket of the declaration.
// A technical exception, the invisible Unicode byte order mark may precede the declaration.
// Recently it has been observed in one browser at least that white space is tolerated at
// the head of the document
The version attribute
The XML declaration as a whole is optional, but when it is present the XML 1.0 Recommendation requires it to carry the 'version' attribute. This agrees with what is seen in practice. When testing using Mozilla 1.5 or Internet Explorer 6.0 for instance, leaving out the 'version' attribute results in an error. Having said this, we have to be careful that we don't consider browser support as being the only criterion to judge XML compliance and behaviour. Dedicated standalone XML tools may be used that may show different results. It is expected that, as time goes on, different versions of XML will co-exist. The attribute takes the form, version="x.x".
Different browsers, including Mozilla 1.x, Netscape 6.2, IE 6 or Opera, can be used to test these declarations.
Example <?xml version="1.0" ?>
We can mix this declaration with the first example to create a functional XML document.
Example 2
<?xml version="1.0" ?>
<first_name> Bart </first_name>
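Since browsers differ in how strict they are, a dedicated parser gives a useful second opinion on the declaration rules. The following Python sketch uses the standard xml.etree.ElementTree module. The parses helper is hypothetical, and the results reflect the behavior of that module's underlying parser rather than of any browser.

```python
import xml.etree.ElementTree as ET

def parses(text: str) -> bool:
    """Return True if the parser accepts the text without a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

body = "<first_name> Bart </first_name>"

declaration_first = parses('<?xml version="1.0" ?>' + body)
no_declaration = parses(body)
space_before_declaration = parses(' <?xml version="1.0" ?>' + body)
version_missing = parses('<?xml standalone="yes" ?>' + body)

print(declaration_first, no_declaration, space_before_declaration, version_missing)
```

The first two documents are accepted. A declaration preceded by white space, or one that lacks the version attribute, is rejected.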
The standalone attribute
Besides the version attribute the XML declaration may also contain the 'standalone' attribute. This attribute indicates whether the document relies exclusively on internally supplied declarations. This is indicated by assigning the attribute the string literal value "yes". If a "no" value is assigned to the attribute, then the document may depend on an externally defined document type definition, and any internally defined DTD becomes optional. If the standalone attribute is left out, the default value supplied is 'no'. (For now we may think of a DTD as an XML related script.)
Example standalone="yes" // means only internal DTDs are used
The encoding attribute
// with reference to an explanation kindly provided by François Yergeau
The encoding attribute is used to specify which character encoding
will be used by the document. XML uses an auto-detecting system to
determine what character set is being used in a document file. It
interprets the first four bytes of the document, which will have
different values depending on which encoding is used. (This is why it
is important that the <?xml characters of the XML declaration appear
as the very first characters in the document.)
// If the first four bytes are 3C 3F 78 6D ( '<?xm' ) then the encoding
// is ISO 646 or another encoding in which the ASCII characters keep
// their normal values, i.e. UTF-8, ASCII, a part of ISO 8859 etc. The
// encoding declaration is then looked at to distinguish which one of
// these it is.
There is no official default value specified for the 'encoding'
attribute. The way the auto-detection process works is, if an encoding
is not explicitly specified either by means of an external transport
protocol, a Byte Order Mark (BOM) or an explicit assignment to the XML
declaration 'encoding' attribute, then 'UTF-8' is the only encoding
that will work and not result in a fatal error being thrown. This makes
UTF-8 the de facto equivalent of a default encoding for XML.
( UTF-8 is a variable length character encoding system. For instance
it uses a single byte for ASCII characters but three or more bytes for
most Asian symbols. )
Example
encoding="UTF-8"
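The auto-detection step described above can be sketched in a few lines. This is a simplified illustration only, not the procedure a real parser follows in full. The function name sniff_encoding_family and the returned labels are made up for this example, and the UCS-4 and EBCDIC cases are deliberately left out.

```python
def sniff_encoding_family(data):
    """Guess the encoding family of an XML document from its first bytes.

    Simplified: UCS-4 and EBCDIC are not handled, so a UCS-4 document
    would be misreported by this sketch.
    """
    head = data[:4]
    # A byte order mark, if present, comes ahead of the declaration.
    if head.startswith(b'\xef\xbb\xbf'):
        return 'UTF-8 (BOM)'
    if head.startswith(b'\xfe\xff'):
        return 'UTF-16 big-endian (BOM)'
    if head.startswith(b'\xff\xfe'):
        return 'UTF-16 little-endian (BOM)'
    # No BOM: look for the start of '<?xml' in different byte widths.
    if head == b'\x3c\x3f\x78\x6d':
        return 'ASCII-compatible, read the encoding attribute'
    if head == b'\x00\x3c\x00\x3f':
        return 'UTF-16 big-endian (no BOM)'
    if head == b'\x3c\x00\x3f\x00':
        return 'UTF-16 little-endian (no BOM)'
    # Nothing recognised and no declaration: UTF-8 is assumed.
    return 'UTF-8 assumed'

declaration = '<?xml version="1.0"?><a/>'
print(sniff_encoding_family(declaration.encode('ascii')))
print(sniff_encoding_family(declaration.encode('utf-16-be')))
print(sniff_encoding_family(b'<a/>'))
```

The byte values 3C 3F 78 6D are the ones quoted in the note above. When they are found, the parser knows enough to read the declaration and pick up the 'encoding' attribute.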
The following table of character encoding values was adapted from a
table in 'The XML Bible' by Elliotte Rusty Harold.
Table of Some Common Encodings
Name | Language / Country
US-ASCII | English
UTF-8 | Compressed Unicode
UTF-16 | Compressed UCS
ISO-10646-UCS-2 | Raw Unicode
ISO-10646-UCS-4 | Raw UCS
ISO-8859-1 | Latin-1, Western Europe
ISO-8859-2 | Latin-2, Eastern Europe
ISO-8859-3 | Latin-3, Southern Europe
ISO-8859-4 | Latin-4, Northern Europe
ISO-8859-5 | ASCII plus Cyrillic
ISO-8859-6 | ASCII plus Arabic
ISO-8859-7 | ASCII plus Greek
ISO-8859-8 | ASCII plus Hebrew
ISO-8859-9 | Latin-5, Turkish
ISO-8859-10 | Latin-6, ASCII plus Nordic
ISO-8859-11 | ASCII plus Thai
ISO-8859-13 | Latin-7, ASCII plus Baltic/Latvian
ISO-8859-14 | Latin-8, ASCII plus Gaelic/Welsh
ISO-8859-15 | Latin-9, Latin-0; Western Europe
ISO-2022-JP | Japanese
Shift_JIS | Japanese, Windows
EUC-JP | Japanese, Unix
Big5 | Chinese, Taiwan
GB2312 | Chinese, mainland China
KOI8-R | Russian
ISO-2022-KR | Korean
EUC-KR | Korean, Unix
ISO-2022-CN | Chinese
ISCII-1991 | Indian
International Language Support
This is a good place to interject that XML supports the later versions
of Unicode. While Unicode is currently at version 4.0.1 ( as of
September 2004 ), XML recommendations show support for version 3.2 as
of circa February 2002 and may support higher version features now.
Version 3.2 supports virtually every spoken language on earth so, by
extension, XML may be thought of as a fully international language.
( There are limitations that need to be resolved. For instance,
although all the characters of Unicode can be used as content there are
still some limitations on what text can be used in tags. ) If this area
is of special interest to you, you may wish to read the following W3C
article.
The Document Type Declaration // the DOCTYPE declaration
The document type declaration is used to specify the document type
definition. It is written using the DOCTYPE keyword. Stated more
simply, the DOCTYPE declaration declares the DTD, whether internal,
external or both.
SGML requires a DOCTYPE declaration but XML does not. This implies that
XML documents that are designated 'well-formed' are not required to
contain a document type declaration.
Form of the DOCTYPE Declaration
<!DOCTYPE name ( SYSTEM "DTD_URL" | PUBLIC "PUBLIC_ID" "DTD_URL" ) [ Internal DTD subset ] >
Where -
- <! - the exclamation mark marks the beginning of the declaration
- DOCTYPE - keyword which abbreviates Document Type Declaration
- name - the name of the root tag of the XML document
- SYSTEM - used in conjunction with a URL describing an externally
defined DTD
- PUBLIC - used in conjunction with a public id, which in XML is
followed by a URL as a backup
- [ ] - square brackets optionally house an internally defined DTD
subset.
Both the external identifier and the internal subset are optional.
The following example shows the common DOCTYPE declaration found in
standard XHTML pages. You will find it inserted by HTML editors ahead
of the first HTML tag. You can see that 'html' will be the root tag of
the document. A PUBLIC ID form is used rather than a SYSTEM ID, which
would typically be a URL. ( The PUBLIC ID encapsulates information
stating that XHTML is a non-ISO standard whose proprietor is the W3C
and is described as 'XHTML 1.0 Transitional' with the ISO language
identifier for English. ) Following this is the file name that supplies
the Document Type Definition for XHTML.
The XHTML DOCTYPE Declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
Using PUBLIC and a Public ID // just for reference
In the above declaration the "-//W3C . . . EN" string is called a
Public ID.
General Use Public ID Form
"Public ID Character // DTD_proprietor // DTD_description // ISO_Language_Identifier"
A Brief Look at DTDs
The DOCTYPE declaration is used to supply DTDs, or Document Type
Definitions, to documents. DTDs were the original way envisioned to
control typing for XML documents. DTDs are part of the original XML
1.0 specification, and as such will always be with us. We can
reconsider our earlier example which we used to create custom entities.
It uses the square bracket section, which is an optional area that can
be used to supply local Document Type Definitions. This area can also
be used to supply local declarations for elements and attributes.
Example
<?xml version="1.0"?>
<!DOCTYPE Entities [
<!ENTITY def "This is an internal entity definition">
<!ENTITY pageMove "Notice as of 01/01/2008 this page will be moved to www. tee.vee.com " >
]>
<Entities>
Dereferencing the Entity Called 'def' : &def;
Dereferencing the Entity Called 'pageMove' : &pageMove;
</Entities>
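To see the entity being dereferenced without a browser, the document can be fed to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a shortened copy of the example with only the 'def' entity.

```python
import xml.etree.ElementTree as ET

doc = b'''<?xml version="1.0"?>
<!DOCTYPE Entities [
<!ENTITY def "This is an internal entity definition">
]>
<Entities>Dereferencing the Entity Called 'def' : &def;</Entities>'''

root = ET.fromstring(doc)
# The parser replaces &def; with the text declared in the internal subset.
print(root.text)
```

The application never sees the reference itself, only the replacement text supplied by the local definition.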
Notice in the following example how three elements are declared.
Example Demonstrating the Declaration of an Optional Element
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE options [
<!ELEMENT options (name, description?) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT description (#PCDATA) >
]>
<options>
<name>chrome trim</name>
</options>
EBNF contributes the following cardinality controls that enable a DTD
to dictate, to some extent, the number of times an element may appear.
( Note these controls don't enable dictating that an element should
appear 3 to 5 times, for instance. )
* - the asterisk, to represent 'zero or more'
+ - the plus symbol, to represent 'one or more'
? - the question mark, to represent 'zero or one' or the optional state.
In the above example you will notice the question mark is used to make
the second element optional.
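The cardinality symbols behave much like the quantifiers of a regular expression. The following sketch is an analogy only and is not a DTD validator. It assumes Python's re and xml.etree.ElementTree modules, and the names OPTIONS_MODEL and children_match are made up for this illustration. The content model (name, description?) is written by hand as a regular expression over the child element names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex stand-in for the content model (name, description?).
OPTIONS_MODEL = re.compile(r'name( description)?')

def children_match(xml_text, model):
    """Check the sequence of child element names against a regex model."""
    root = ET.fromstring(xml_text)
    child_names = ' '.join(child.tag for child in root)
    return model.fullmatch(child_names) is not None

with_description = ('<options><name>chrome trim</name>'
                    '<description>shiny</description></options>')
without_description = '<options><name>chrome trim</name></options>'
two_descriptions = ('<options><name>chrome trim</name>'
                    '<description>a</description>'
                    '<description>b</description></options>')

print(children_match(with_description, OPTIONS_MODEL))
print(children_match(without_description, OPTIONS_MODEL))
print(children_match(two_descriptions, OPTIONS_MODEL))
```

The first two documents fit the model because the description is optional. The third does not, since the question mark allows at most one description.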
Now consider how attributes are added to elements in DTDs. In the
following example the 'air' element is provided with four required
attributes that are of the CDATA or Character Data type. In the
subsequent XML document the 'air' element supplies a value for each of
the four attributes.
Example of an ATTLIST Declaration Used with An Element
<?xml version="1.0"?>
<!DOCTYPE Requirements [
<!ELEMENT Requirements (air,water,food,shelter) >
<!ELEMENT air (#PCDATA)>
<!ATTLIST air oxygen CDATA #REQUIRED
nitrogen CDATA #REQUIRED
carbon_dioxide CDATA #REQUIRED
noble_gases CDATA #REQUIRED
>
<!ELEMENT water (#PCDATA)>
<!ELEMENT food (#PCDATA)>
<!ELEMENT shelter (#PCDATA)>
]>
<!-- The implementation of the locally defined DTD -->
<Requirements>
<air oxygen = "19%" nitrogen = "79%"
carbon_dioxide = "1%" noble_gases = "1%" >
Present and Acceptable
</air>
<water>
Present and Not Acceptable
</water>
<food>
Present but Not Acceptable
</food>
<shelter>
Not Present
</shelter>
</Requirements>
If this document were being validated using a validating program, the
program would check that one of each of the elements described in the
DTD was present, in the order given, and that the 'air' element
carried the four attributes described in the DTD.
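The checks a validating program would make can be imitated by hand. The sketch below is not a validator and does not read the DTD. It assumes Python's standard xml.etree.ElementTree module (which does not validate), and the names check_requirements, REQUIRED_CHILDREN and REQUIRED_AIR_ATTRIBUTES are made up to restate the DTD rules in code.

```python
import xml.etree.ElementTree as ET

# Hand-copied from the DTD: (air,water,food,shelter) and the ATTLIST for air.
REQUIRED_CHILDREN = ['air', 'water', 'food', 'shelter']
REQUIRED_AIR_ATTRIBUTES = {'oxygen', 'nitrogen', 'carbon_dioxide', 'noble_gases'}

def check_requirements(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    root = ET.fromstring(xml_text)
    problems = []
    if [child.tag for child in root] != REQUIRED_CHILDREN:
        problems.append('children are not air, water, food, shelter in order')
    air = root.find('air')
    if air is not None:
        missing = REQUIRED_AIR_ATTRIBUTES - set(air.attrib)
        for name in sorted(missing):
            problems.append('air is missing attribute ' + name)
    return problems

complete = ('<Requirements>'
            '<air oxygen="19%" nitrogen="79%" carbon_dioxide="1%" '
            'noble_gases="1%">Present and Acceptable</air>'
            '<water>Present</water><food>Present</food>'
            '<shelter>Not Present</shelter></Requirements>')
incomplete = ('<Requirements>'
              '<air oxygen="19%" nitrogen="79%" carbon_dioxide="1%">ok</air>'
              '<water/><food/></Requirements>')

print(check_requirements(complete))
print(check_requirements(incomplete))
```

A real validating parser derives these rules from the DTD itself, which is the point of supplying one.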
The following table shows types that attributes can be assigned to
constrain whether an attribute needs to be used and what value it will
have.
Table Summarizing the Default Attribute Types
#REQUIRED | attribute required
#IMPLIED | attribute optional
#FIXED | attribute constant & final
Literal Default | describes specifying a default value
Schema Supplants DTDs in Web Services
If this were a dedicated XML course we would be obliged to cover a lot
more information regarding DTDs. However, because Web Services
predominantly uses XML Schema, the XML typing system that has
supplanted DTDs, we will not look at DTDs in any more depth.
It was mentioned that the DOCTYPE declaration is optional. This is
convenient as it allows us to cleanly substitute Schema typing in place
of DTDs. In fact we will discover that XML Schema is largely embedded
in Web Services Transport Mechanisms, in particular in the SOAP 1.2
Specification.
DTDs were limited by the fact that they came before the advent of XML
Namespaces. XML Namespaces play an important role in both XML Schema
and Web Services so we will look at XML Namespace support now.
XML Namespaces
XML Namespaces provide a mechanism to allow several sets of tags from
different XML applications to use similar identifiers without
conflicting with each other inside the same XML document. The use of
namespaces allows elements with the same name to be distinguished from
each other. Consider two hypothetical containers.
Example
DentistKit { drill, pliers, floss }
CarpenterKit { drill, pliers, saw }
If we mixed references to identifiers from each set of information
without in some way qualifying that it is a Dentist's drill or that it
is a Carpenter's pliers, the intent of whatever we were doing could be
lost.
We could keep things separate though by qualifying 'local' names as in
the following.
Example
DentistKit:drill
CarpenterKit:drill
XML Namespaces uses a 'colonized' form as shown in the above example to
keep XML elements from different XML applications in separate
namespaces.
// 'colonized' means with a colon, ' : '
Namespaces are created using a special attribute, 'xmlns', which is
referred to as a 'reserved attribute'. It can be used in two forms,
with and without an additional prefix. These identifiers are assigned
URIs.
Example
xmlns = "someURI"
xmlns:prefix = "someURI"
Significance of Namespace URIs
It is tempting and logical to assume that the URI, in its capacity as
a Resource Identifier or locator, brings some important information
over the network. This is not the case (at least not at this time.)
There need not be anything at the URL location on the web that is
important to the document. What is important is that the URI represents
a unique identifier in whatever context the local document is running
in.
The Default Namespace Form
Consider the first of the two simple forms we outlined earlier.
Example
xmlns = "anyURI"
This is the 'default' form and creates an 'implicit', invisible prefix
for any un-prefixed element in the scope of the element in which the
namespace is declared. If this is in the root element of the document,
then the scope is 'global' to the document. In the following example,
all the elements that do not have a prefix implicitly become qualified
and part of the default namespace defined by
xmlns="http://www.OutThere.com". Because the top level element holds
the namespace declaration, the default namespace is 'global' and
applies also to the root element, 'Outer'.
Example
<Outer xmlns="http://www.OutThere.com">
<x>African Coffee Table</x>
<y>80</y>
<z>120</z>
</Outer>
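A namespace-aware parser makes the 'invisible prefix' visible. The sketch below assumes Python's standard xml.etree.ElementTree module, which reports each element name with its namespace URI in curly braces in front of the local name.

```python
import xml.etree.ElementTree as ET

doc = '''<Outer xmlns="http://www.OutThere.com">
<x>African Coffee Table</x>
<y>80</y>
<z>120</z>
</Outer>'''

root = ET.fromstring(doc)
# Collect the name of the root followed by the names of its children.
names = [root.tag] + [child.tag for child in root]
for name in names:
    print(name)
```

Every name printed, including that of the root element 'Outer', carries the URI of the default namespace even though no prefix appears in the document.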
The No Namespace Default Form
If we use the reserved attribute, 'xmlns', and assign it an empty
string we create a 'No Name' default namespace. This is in fact an
explicit way of stating the implicit default namespace condition of an
XML document that has no namespace defined. In other words, putting
this form into a root element has the same effect as not putting any
namespace declaration into an XML document.
Example
xmlns = ""
If you removed the URI from the 'Outer' example above and assigned it
the empty string, "", then the elements x, y, z and Outer would belong
to the default 'no Namespace' namespace, in other words, to no
namespace.
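The effect of the empty string can be confirmed with a parser. This sketch again assumes Python's standard xml.etree.ElementTree module. The third document is an extra case added for illustration, in which a nested element uses xmlns="" to step out of its parent's default namespace.

```python
import xml.etree.ElementTree as ET

explicit_none = ET.fromstring(
    '<Outer xmlns=""><x>African Coffee Table</x></Outer>')
undeclared = ET.fromstring(
    '<Outer><x>African Coffee Table</x></Outer>')
nested = ET.fromstring(
    '<Outer xmlns="http://www.OutThere.com">'
    '<x xmlns="">African Coffee Table</x></Outer>')

# Names with no namespace are reported without a {URI} part.
print(explicit_none.tag, explicit_none[0].tag)
print(undeclared.tag, undeclared[0].tag)
print(nested.tag, nested[0].tag)
```

The first two documents produce identical names, which matches the statement that xmlns="" in a root element is the same as declaring nothing.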
The Prefixed Form of Namespace Declarations
When a prefix is used with a local name, the prefix stands in for a
longer URI. In this sense, creating a prefix to represent a URI is a
namespace declaration. For instance, in the following declaration,
Example
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
xsl stands in for "http://www.w3.org/1999/XSL/Transform". Prefixes are
bound to XML namespace URIs using the xmlns:prefix syntax.
A URI cannot be used directly as part of an element name because some
conventional URL characters are illegal inside XML names, i.e. the
characters / , ~ and %. XML prefixes work around this restriction,
allowing prefixes to represent URIs inside XML documents. Each prefix
maps to a given URI.
Inside the document the prefix is used to associate element(s) with
that namespace.
Example
<xsl:template>
Following is a common namespace declaration we will see when using XML
Schema language.
Example
xmlns:xs = "http://www.w3.org/2001/XMLSchema"
The 'xmlns' attribute may appear on any element, but it is usually
declared on the root 'schema' element. As a result we will typically
see this attribute appear inside the schema element.
Example
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> . . . </xs:schema>
Then inside the schema document, we will see the xs prefix bound to
elements that belong to the schema application.
Example
<xs:complexType>
Following is a simple complete example that shows a prefix in use.
Example
<?xml version="1.0"?>
<g:rock xmlns:g="http://www.sentex.net/">
<g:type>
sedimentary
</g:type>
</g:rock>
Namespace Terminology
If we isolate the <g:type> element from the complete example above,
'g' is the 'prefix', 'type' is the 'local part' and together the full
name, g:type, is called the 'qualified name'.
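These three terms map directly onto properties exposed by a DOM parser. The sketch below assumes Python's standard xml.dom.minidom module and a compact copy of the 'rock' example.

```python
from xml.dom import minidom

doc = '''<?xml version="1.0"?>
<g:rock xmlns:g="http://www.sentex.net/">
<g:type>sedimentary</g:type>
</g:rock>'''

dom = minidom.parseString(doc)
type_element = dom.getElementsByTagName('g:type')[0]

print('prefix         :', type_element.prefix)
print('local part     :', type_element.localName)
print('qualified name :', type_element.tagName)
print('namespace URI  :', type_element.namespaceURI)
```

The namespace URI is shown as well, since it is what the prefix stands in for.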
Nested or Local Namespaces
In the following example the namespaces are not defined globally. They
are instead created within nested elements. In this scenario, the
namespace only applies to the element in which the namespace is
declared and to the elements nested inside it.
Example
<?xml version="1.0"?>
<!-- This XML document carries information in a table -->
<mix>
<g:rock xmlns:g="http://www.sentex.net/">
<g:type>
sedimentary
</g:type>
</g:rock>
<!-- This XML document carries information about a piece of furniture -->
<m:rock xmlns:m="http://www.sentex.net/~pkomisar">
<m:name> The Rolling Stones </m:name>
<m:type> R&amp;B </m:type>
</m:rock>
</mix>
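A parser can show that the two 'rock' elements really are different names. The sketch below assumes Python's standard xml.etree.ElementTree module and a compact copy of the example. The dictionary called prefixes is a lookup table for the search expressions only and need not use the same prefixes as the document.

```python
import xml.etree.ElementTree as ET

doc = '''<mix>
<g:rock xmlns:g="http://www.sentex.net/">
<g:type>sedimentary</g:type>
</g:rock>
<m:rock xmlns:m="http://www.sentex.net/~pkomisar">
<m:type>R&amp;B</m:type>
</m:rock>
</mix>'''

root = ET.fromstring(doc)
rocks = [child.tag for child in root]

# Prefix to URI table used only by the find() calls below.
prefixes = {'geo': 'http://www.sentex.net/',
            'music': 'http://www.sentex.net/~pkomisar'}
geology = root.find('geo:rock/geo:type', prefixes).text
music = root.find('music:rock/music:type', prefixes).text

print(rocks)
print(geology, '|', music)
```

The root element 'mix' sits outside both declarations and so belongs to no namespace.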
A Note on Attribute Namespaces
"Unless we explicitly state otherwise, the attributes that an element
carries are considered to be in no namespace ( or null namespace)
although most applications treat them as though they are in the same
namespace as the element that carries them. "
- quote from 'Professional XML Schemas', J. Duckett et. al.
This is important because we cannot assume attributes will be treated
as belonging to the namespace of their associated elements, and they
probably should be explicitly qualified where the namespace matters.
In XML an element cannot have two attributes with the same name. Nor
can it have two prefixed attributes with the same local name whose
prefixes are bound to the same URI. However there is a variation that
is possible where the default namespace can be assigned the same URI as
an associated prefix, as is shown in the following example from
'Professional XML Schemas'. In this case, although it would appear that
two height attributes are in for a clash and might be in the same
namespace, we need to recall the above rule, which states that
attributes, if not explicitly qualified, are in the 'no name'
namespace. So in this case, the default qualification does not apply to
the 'height' attribute while the cm:height attribute is in the 'cm'
prefix namespace.
Example from Professional XML Schemas, J. Duckett et.al., Wrox Press
<Dimensions xmlns="http://www.example.org/measurements"
xmlns:cm="http://www.example.org/measurements" >
<Vertical height="24inches" cm:height="60cms" />
</Dimensions>
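The rule can be observed by parsing the example. The sketch below assumes Python's standard xml.etree.ElementTree module, which stores attribute names in the same {URI}name form it uses for elements.

```python
import xml.etree.ElementTree as ET

doc = '''<Dimensions xmlns="http://www.example.org/measurements"
xmlns:cm="http://www.example.org/measurements" >
<Vertical height="24inches" cm:height="60cms" />
</Dimensions>'''

root = ET.fromstring(doc)
vertical = root[0]

print(vertical.tag)
for name, value in vertical.attrib.items():
    print(name, '=', value)
```

The element 'Vertical' picks up the default namespace, the plain 'height' attribute does not, and cm:height is reported with the URI. The two attributes therefore have different names and do not clash.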
Self Test
1) True or False? All XML elements except for the root element have a
single parent. True / False
2) Which of the following XML identifiers may be thought of as legally
correct although it breaks naming convention?
a) _after.all
b) xmlns:volume:section
c) _---_9289187
d) bingo-nite
3) Which of the following is not an XML delimiter?
a) <
b) >
c) &
d) :
4) Which of the following is not an attribute of the xml declaration?
a) version
b) encoding
c) type
d) standalone
5) What key character is missing from the following example of an XML
CDATA section?
<[CDATA[ content ]]>
6) True or False? Processing Instructions and Comments serve the same
purpose in an XML document. True / False
7) a) All well formed documents are valid. True / False
b) All valid documents are well formed. True / False
8) True or False? All elements inside an element that has a namespace
declared belong to that namespace. True / False
Exercise 1
Do exercise 1 or 2. Optionally, if you have extra time, do both, but it
is certainly not required. The dual exercise accommodates individuals
who have taken the XML course. However the two exercises do investigate
different aspects of the material covered.
Note For Browser Testing in Exercises
Firefox is a very lightweight version of Mozilla. It is just a browser
and, unlike Mozilla, doesn't carry an HTML editor, a mail agent and an
address book. It would be a very quick download and would not take up
much space on your G drive. You may wish to use it as the alternative
browser to IE for testing.
A single XML document can be used to demonstrate all the following.
1) Create the simplest form of this XML document, include an XML
declaration and then remove the version attribute. Run in both IE and
Netscape or Mozilla and report what happens.
2) Create two identifiers that demonstrate how letters, numbers and
special characters can be used legally in XML tag names. ( Refer to the
EBNF formula. ) Test in both IE and Netscape/Mozilla. Start one of
these identifiers with an underscore. Use XML comments to summarize
which sorts of different characters were used.
3) Insert a colon into an XML identifier. Try in both IE and Netscape
or Mozilla. What is the result?
4) Create a sentence that demonstrates the use of the five predefined
escape entities. Then substitute this content into the XML document
created in question number one using the appropriate escape entities.
5) Insert a CDATA section into your XML document that uses a Character
Data Section to escape a short but complete HTML page.
6) Insert a processing instruction that includes a complete HelloWorld
program written in Java. If you are not familiar with Java put in any
sort of technical data or syntax you wish.
7) Use an internally defined escape entity to represent the phrase
'-Webster's Dictionary'. Make up a definition or use one you know. Use
the internal entity to represent the 'Webster's Dictionary' phrase
inside your document. Make up another internally defined entity and add
it to your document.
8) Create two namespaces in the XML document with short fictitious
URIs. Create three tags that have the same name and use two sets of
these tags in the same document demonstrating how they are kept
separate in terms of namespace by using the namespace prefix.
Exercise 2
1) Build the following document without referring to examples, or at
least refer to the examples first and then work with books closed. For
those of you for whom this is a review it should all come back to you
as you build this document.
a) Create an xml declaration with version assigned.
b) Create a root element called 'schema'.
c) Inside it create a prefixed namespace using the prefix 'xs' and the
URI "http://www.w3.org/2001/XMLSchema".
d) Associate the xs prefix with the schema root element.
e) Nest inside the schema element an xs prefixed element called
'complexType'.
f) Give it an un-prefixed attribute called 'name' assigned the quoted
value "Client".
g) Nest inside the complexType element an 'xs' prefixed element called
'sequence'.
h) Inside the 'sequence' element nest three elements called 'element',
each with an unqualified 'name' attribute holding the value
"FirstName", "Initial" and "LastName" respectively. Include in each of
these elements a 'type' attribute assigned the qualified value,
'xs:string'.
i) Save to an .xsd file.
Don't forget the rules of well formedness, where all assigned values
are quoted and tags are all nested and closed properly. View in a
browser to confirm your file is well-formed.
2) The XML Recommendation states that the characters 'xml', in any
combination of upper or lower case, should not be used to begin XML
identifiers. Since this should be an issue of well-formedness, create
an XML page to test if this rule is enforced. Test in Explorer and
Firefox. Briefly report your results.
3) Again using browsers test whether
a) more than one same-name attribute can appear in an element.
b) namespaces can be assigned to the same prefix.
c) the form shown in the note works, where a default namespace and a
prefix can share the same URI. Test if an attribute belonging to the
'No Name' namespace is allowed to co-exist with a same named attribute
that is namespace prefixed.
// in other words, test the final example of the note taken from
// 'Professional XML Schemas'
Report results briefly.