eXtensible Markup Language
Jean-Philippe Martin
Typographical notes: Function names, markup language and example files are given in non-proportional type. External links are underlined, local ones are not. In general the style has been chosen so that the document will retain most of its usefulness when printed.
XML (Extensible Markup Language) is a markup language for documents containing structured information. It was developed by the World Wide Web Consortium (W3C) and the full specification is available online at http://www.w3.org/TR/REC-xml.
XML provides the means to represent data in a structured way using a user-defined vocabulary.
Here is a simple example of an XML document:
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>
Download CDAlbum, a slightly more involved example (discussed later)
An XML document contains text annotated with markers (also called "tags" or "elements"). The markers are delimited with "<" and ">" characters. Markers go in pairs: an opening and a closing one. The closing tag must have the same name as the opening one, except that it is preceded with the "/" character.
Unlike HTML, in XML the markers do not have any direct influence on how the text will look when displayed. The goal of the markers is to indicate semantic information, i.e. what kind of information the text represents. A possible application of this idea is to use the markers to transform the initial XML document into a number of different representations (HTML or PDF, for example), hence providing a way to keep these different documents in sync. XML was created so that richly structured documents could be used over the web.
XML describes a syntax for marking up text that makes it easy to describe complex structures. Those structures have to follow some strict rules (about nesting properly), but the rules are simple and the structures are incredibly flexible.
The real strength of XML is that these labeled structures can make a great foundation for all kinds of processing, from styled presentation for human readers to collection by agents seeking best bargains to interchange between databases or even businesses.
The whole point of XML is opening up information, and using commodity components to process that information.
One of the main benefits of using XML is that it lets you leverage the many existing processing tools that are XML-aware. We will mention Xerces and LotusXSL later in this document; see http://www.w3.org/XML/#software for additional references.
Elements can also be empty, in which case they are written <element/>. An empty element is equivalent to <element></element> and can be used when the element does not describe content but instead indicates some condition. For example, consider the hypothetical <horizontalRuler/>.
<div class="preface">
is a div element with the attribute class having the value preface. In XML (unlike HTML), all attribute values must be quoted.
For example, the lt entity inserts a literal < into a document. So the string <element> can be represented in an XML document as &lt;element>.
A special form of entity reference, called a character reference, can be used to insert arbitrary Unicode characters into your document. Character references take one of two forms: decimal references, &#8478;, and hexadecimal references, &#x211E;. Both of these refer to character number U+211E from Unicode.
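The behaviour of entity and character references can be checked with a few lines of Java. The following is a minimal sketch, assuming the DOM parser bundled with the JDK (javax.xml.parsers) rather than any tool discussed in this document; the class name CharRefDemo and the helper rootText are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: parses small XML strings with the JDK's DOM parser.
class CharRefDemo {

    // Returns the text content of the root element of the given XML string.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String decimal = rootText("<r>&#8478;</r>");   // decimal character reference
        String hex = rootText("<r>&#x211E;</r>");      // hexadecimal character reference
        String lt = rootText("<r>&lt;element></r>");   // predefined lt entity

        System.out.println(decimal.equals(hex));                // true
        System.out.println((int) decimal.charAt(0) == 0x211E);  // true
        System.out.println(lt);                                 // <element>
    }
}
```

By the time the application sees the text, the parser has already replaced the references, so both spellings of U+211E are indistinguishable.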
Comments are not part of the XML document information, and as such the XML parser is not required to pass them along to the application.
Processing Instructions (PI) have the following form: <?name pidata?>. The name, called the PI target, identifies the PI to the application. Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; it is intended for the application that recognizes the target. The names used in PIs may be declared as notations in order to formally identify them.
PI names beginning with xml are reserved for XML standardization.
CDATA sections look like this:
The only string that cannot occur in a CDATA section is ]]>.
More generally, declarations allow a document to communicate
meta-information to the parser about its content. Meta-information
includes the allowed sequence and nesting of tags, attribute values and
their types and defaults, the names of external files that may be
referenced and whether or not they contain XML, the formats of some
external (non-XML) data that may be referenced, and the entities that
may be encountered.
There are four kinds of declarations in XML: element type declarations,
attribute list declarations, entity declarations, and notation declarations.
<!ELEMENT CDAlbum (title, author+, image?)>
This special tag defines a new tag named CDAlbum. This tag will contain another tag called title, followed by one or more tags named author, followed by an optional tag named image.
Here is an example of a tag that can contain arbitrary text:
<!ELEMENT title (#PCDATA)*>
The special symbol #PCDATA means "any text", and the final star means that the title tag can be comprised of zero or more #PCDATA.
To indicate a choice, the pipe (|) symbol is used. Here is an example:
<!ELEMENT author (name|nickname)>
This says that the author tag can contain either a tag of type "name" or of type "nickname", but not both.
Here is an example of an XML document that includes element type declarations:
<?xml version="1.0"?>
<?xml version="1.0"?> is the XML declaration; it should appear at the very
top of every XML file. <!DOCTYPE CDAlbum [ ... ]> is boilerplate that must enclose
the element definitions.
<name pronunciation="JP" unusual-in-the-US="yes">Jean-Philippe</name>
The XML DTD that describes this will look like:
ATTLIST takes as arguments the name of the tag and a series of triplets (attribute-name,
attribute-type, default-value). Default-value can be set to #IMPLIED to indicate that
this attribute is optional or #REQUIRED to indicate that it is required.
The example given says that the 'name' tag has an optional
attribute 'pronunciation' that contains a string (CDATA) and another attribute called
'unusual-in-the-US' which can be either yes or no, but defaults to no.
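The effect of a default value can be observed from a program. The sketch below assumes the DOM parser bundled with the JDK and an internal DTD subset; the class name AttributeDefaultDemo, the constant DOCUMENT and the helper attribute are hypothetical. The document omits the unusual-in-the-US attribute, and the parser supplies the declared default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows a DTD default being applied by the parser.
class AttributeDefaultDemo {

    // A document whose root element leaves out the unusual-in-the-US attribute.
    static final String DOCUMENT =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE name [\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ATTLIST name\n"
          + "    pronunciation     CDATA        #IMPLIED\n"
          + "    unusual-in-the-US ( yes | no ) 'no'>\n"
          + "]>\n"
          + "<name pronunciation=\"JP\">Jean-Philippe</name>";

    // Returns the value of the named attribute on the root element.
    static String attribute(String xml, String attributeName) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute(DOCUMENT, "pronunciation"));      // JP
        System.out.println(attribute(DOCUMENT, "unusual-in-the-US"));  // no (the declared default)
    }
}
```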
There are six possible attribute types:
There are four possible default values:
CDATA Sections
CDATA sections are a way to include text that contains characters which would otherwise
be recognized as markup. CDATA is useful because
almost any character is allowed inside a CDATA section, including < and &.
<![CDATA[
*p = &q;
b = (i <= 3);
]]>
Between the start of the section, <![CDATA[ and the end of the section, ]]>,
all character data is passed directly to the application, without
interpretation. Elements, entity references, comments, and processing
instructions are all unrecognized and the characters that comprise them
are passed literally to the application.
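To see that the content of a CDATA section reaches the application untouched, here is a minimal sketch that assumes the JDK's built-in DOM parser; the class name CDataDemo and the helper rootText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads back the content of a CDATA section.
class CDataDemo {

    // Returns the text content of the root element of the given XML string.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The & and < characters would be illegal here outside a CDATA section.
        String xml = "<code><![CDATA[*p = &q; b = (i <= 3);]]></code>";
        System.out.println(rootText(xml));  // *p = &q; b = (i <= 3);
    }
}
```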
3. XML Grammar
3.1 Document Type Declarations
We saw earlier that XML tags indicate semantics instead of formatting. In order to specify this semantic information, part of the XML language lets you specify how the tags in your document can be used: the Document Type Declaration (DTD). Intuitively, this section describes the grammar of your document.
Element Type Declarations
Element type declarations identify the names of elements
and the nature of their content. A typical element type declaration looks like this:
<!DOCTYPE CDAlbum [
<!ELEMENT CDAlbum (title, author+, image?)>
<!ELEMENT title (#PCDATA)*>
<!ELEMENT author (name|nickname)>
<!ELEMENT name (#PCDATA)*>
<!ELEMENT nickname (#PCDATA)*>
<!ELEMENT image (#PCDATA)*>
]>
<CDAlbum>
<title>Meet Virginia</title>
<author><name>Train</name></author>
</CDAlbum>
Download this example at
Attribute List Declarations
Attribute list declarations identify which elements can have attributes and what
these attributes can be. They also indicate which attribute values are allowed
and provide defaults.
Suppose for example that a tag 'name' can have an attribute
'pronunciation' and a flag 'unusual-in-the-US' defaulting to 'no'. Here is some XML
that uses this new tag:
<!ATTLIST name
pronunciation
CDATA
#IMPLIED
unusual-in-the-US
( yes | no ) 'no'
>
The XML processor performs attribute value normalization [Section 3.3.3] on attribute values: character references are replaced by the referenced character, entity references are resolved (recursively), and whitespace is normalized.
<!ENTITY ut "The University of Texas at Austin">
Then an XML document that contains
&ut;
will come out as: The University of Texas at Austin.
This first type of entity declaration is known as an internal entity. A second type of entity, the external entity, contains a link to data (in the form of a URI reference, as defined in IETF RFC 2396 http://www.w3.org/TR/REC-xml#rfc2396). Just like the first type of entity, the referenced text will replace the entity. For example if the following is defined:
<!ENTITY gpl_header SYSTEM "/standard/GPL_Header.xml">
Then every instance of &gpl_header; in the text will be replaced by the contents of the file called /standard/GPL_Header.xml. It is the keyword "SYSTEM" that distinguishes internal from external entity declarations.
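Internal entity replacement can be tried out directly. The following sketch assumes the JDK's built-in DOM parser with its default settings (entity references are expanded); the class name EntityDemo, the note element and the helper rootText are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows an internal entity being expanded.
class EntityDemo {

    // A document that declares the ut entity and then references it.
    static final String DOCUMENT =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE note [\n"
          + "  <!ENTITY ut \"The University of Texas at Austin\">\n"
          + "]>\n"
          + "<note>Greetings from &ut;.</note>";

    // Returns the text content of the root element of the given XML string.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // prints: Greetings from The University of Texas at Austin.
        System.out.println(rootText(DOCUMENT));
    }
}
```

External entities work the same way from the application's point of view, but the parser has to fetch the referenced resource, which is why many parsers let you restrict or disable them.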
For example:
<?xml version="1.0"?>
<!DOCTYPE CDAlbum [
<!ELEMENT CDAlbum (title, author+, image?)>
(...)
]>
<CDAlbum>
<title>Meet Virginia</title>
<author><name>Train</name></author>
</CDAlbum>
Another, more common, possibility is to have the DTD as an external file (e.g. "cdalbum.dtd" here):
<?xml version="1.0" standalone="no"?>
<!DOCTYPE CDAlbum SYSTEM "cdalbum.dtd">
<CDAlbum>
<title>Meet Virginia</title>
<author><name>Train</name></author>
</CDAlbum>
Finally, it is also possible to mix internal and external declarations. In this case, the internal declaration takes precedence (XML standard, section 2.8). For example:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE CDAlbum SYSTEM "cdalbum.dtd" [
<!ELEMENT CDAlbum (title, author+, image?, price)>
<!ELEMENT price (#PCDATA)*>
]>
<CDAlbum>
<title>Meet Virginia</title>
<author><name>Train</name></author>
</CDAlbum>
In this case, the referenced DTD is modified to allow the CDAlbum element to also
include pricing information. You can download this last program at
Consider the following example:
<price>29.99</price>
<price>typo</price>
It is difficult to write a DTD rule that allows numerical inputs but rejects strings. What's more, it is impossible to write a rule that accepts only prices within a certain range.
XML Schemas, on the other hand, allow for a very straightforward expression of both these constraints.
Another feature of XML Schemas is that they are XML themselves, so they can be manipulated via XSL or any other XML-aware technology. Another argument in favor of this syntax is that it makes life easier for people writing parsers: otherwise all XML parsers have to embed a DTD parser as well.
<xsd:element name="CDAlbum">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="author">
<xsd:complexType>
<xsd:choice>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="nickname" type="xsd:string"/>
</xsd:choice>
</xsd:complexType>
</xsd:element>
<xsd:element name="image" type="xsd:string" minOccurs="0"/>
<xsd:element name="price" type="reasonablePrice" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:simpleType name="reasonablePrice">
<xsd:restriction base="xsd:integer">
<xsd:minInclusive value="0"/>
<xsd:maxInclusive value="100"/>
</xsd:restriction>
</xsd:simpleType>
The restriction mechanism is powerful, and can be applied to
strings as well. Here is an example (taken from the W3C's XML Schema primer):
<xsd:simpleType name="SKU">
<xsd:restriction base="xsd:string">
<xsd:pattern value="\d{3}-[A-Z]{2}"/>
</xsd:restriction>
</xsd:simpleType>
As you can see, the string is restricted to a regular expression. In
this case, three digits, a hyphen and two uppercase letters.
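The same regular expression can be tried outside of a schema. This is only a sketch using java.util.regex, not a schema validator: the class name SkuPatternDemo and the helper isValidSku are hypothetical, and Java's \d matches ASCII digits by default whereas the XML Schema \d also accepts other Unicode digits. Like a schema pattern, matches() requires the whole string to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: checks strings against the SKU pattern.
class SkuPatternDemo {

    // Three digits, a hyphen, two uppercase letters.
    private static final Pattern SKU = Pattern.compile("\\d{3}-[A-Z]{2}");

    // Returns true when the whole candidate string matches the SKU pattern.
    static boolean isValidSku(String candidate) {
        return SKU.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidSku("926-AA"));   // true
        System.out.println(isValidSku("926-aa"));   // false: lowercase letters
        System.out.println(isValidSku("26-AA"));    // false: only two digits
        System.out.println(isValidSku("926-AAA"));  // false: trailing extra letter
    }
}
```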
Lists are an example of the constructs that XML Schema allows but that are not directly possible in a DTD. Here is how a list of prices would be specified:
<xsd:simpleType name="PriceList">
<xsd:list itemType="reasonablePrice"/>
</xsd:simpleType>
A key difference between DTD and XML Schema (XSD) is that XSD differentiates between tags and types. In DTDs, tags are used to approximate typing: the type of data is the name of the tag around it. A list of numbers, for example, would be specified as a repetition of a certain tag, num. The resulting XML would look something like this:
<listofnumbers>
<num> 1 </num>
<num> 2 </num>
<num> 3 </num>
</listofnumbers>
XSD, in contrast, gives you more power and therefore the listofnumbers tag can be given a type such that only lists of numbers will be allowed inside it. The XML would then look like:
<listofnumbers> 1 2 3 </listofnumbers>
XSD has support for string, integer, real number, date, URI, language and many other types, each of which can be further refined through use of inheritance. So while DTDs are commonplace today, the extra power associated with XML Schemas and their XML syntax will make them attractive to developers who need more expressivity regarding their XML data.
For our example we are going to use the multi-platform Java version (a C++ version is also available). It requires JDK 1.1.8 or 1.2.2 (available from http://java.sun.com/products).
To try out a transformation, simply type:
java org.apache.xalan.xslt.Process -in xmlSource -xsl stylesheet -out outputfile
Where xmlSource is the name of an XML file (for example CDAlbum1.xml), stylesheet is the corresponding XSL file (for example CDAlbum.xsl) and outputfile is the name of the file in which you want the results to be output (for example CDAlbum.html).
It is also possible to output directly to the screen: simply omit the -out parameter. In the case of CDAlbum1, the result would be:
========= Parsing file:D:/Documents/JP/XFer/cdalbum.xsl ==========
Parse of file:D:/Documents/JP/XFer/cdalbum1.xsl took 390 milliseconds
========= Parsing cdalbum1.xml ==========
Parse of cdalbum11.xml took 60 milliseconds
=============================
Transforming...
<html>
<body>Train<p>
</p>
</body>
</html>
transform took 110 milliseconds
XSLProcessor: done
Xalan and lotusXSL have many more possibilities than just applying XSL transformations. We list them below, but we will postpone their exploration until the sections XML programming and XSL Programming.
<?xml version="1.0"?>
<top>
<mail>
<header>
<from>JP</from>
<to>Cedric</to>
<subject>Hi!</subject>
</header>
<body>
Howdy!
</body>
</mail>
<mail>
<header>
<from>Cedric</from>
<to>JP</to>
<subject>Re: Hi!</subject>
</header>
<body>
I'm good, thanks!
</body>
</mail>
</top>
Download the complete file.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
<xsl:template match="/">
<html><body>
<hr/>
<xsl:apply-templates/>
<p/>
Number of mails: <xsl:value-of select="count(/top/mail)"/>
<hr/>
</body></html>
</xsl:template>
<xsl:template match="mail">
<xsl:value-of select="header/from"/>
to <xsl:value-of select="header/to"/>:
<xsl:value-of select="header/subject"/>
<br/>
</xsl:template>
</xsl:stylesheet>
Download the complete file.
If we apply this stylesheet we get a page that lists the sender, destination and subject for each mail in the file:
Number of mails: 4
It is important to understand that XSL is a declarative language. It describes how each node of the XML parse tree should be transformed. In the above file for example, the node beginning with <xsl:template match="mail"> describes how the text between <mail> and </mail> should be handled. This functions in a way very similar to Java Server Pages: everything inside that node will be copied verbatim, except for the xsl:value-of tags. These tags have a select attribute which indicates which data should be inserted in their place, relative to the original mail node. Note that nested nodes are referred to as if they were a directory path. For example, to access the data item <header><from> inside of mail, the command is:
<xsl:value-of select="header/from"/>
The final "/" is necessary since all XSL stylesheets are also XML documents: empty tags are required to be closed by />.
The astute reader will have noticed that the HTML command for line break is written as <br/> instead of <br>. This is a side effect of the requirement that XSL documents must be well-formed XML documents as well: since the HTML code to be inserted is embedded in the document, it must follow the XML conventions, too. During processing, the XSL processor will transform it into the correct HTML form, <br>. This is the purpose of the xsl:output line with method="html" in the XSL header.
The template for the document root is <xsl:template match="/">. The "/" pattern matches the root of the document, that is, the parent of the top-level element ("top" in our example). This template is slightly different from the mail template. First, it delegates the handling of subnodes by calling <xsl:apply-templates/>. It then counts the number of mails in the mailbox with another value-of tag. But this time it applies a function, count, to the selected nodes. This function returns the number of nodes that the expression selects.
The select attribute is so powerful that the complete description of what it can do is presented in a separate W3C recommendation called XPath. This recommendation is available at http://www.w3.org/TR/xpath.
The features of XPath are:
We will demonstrate sorting and selecting with a new example.
By default, the template is applied to each element of the set in the order in which they appear in the xml file. For example, mbox1.html contains one line per mail, and the mails are listed in the same order as they are in mbox.xml. It is however possible to modify this order via sorting.
Here is an example code that sorts mails by priority and then by sender. The important part is indicated in color.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
<xsl:template match="/top">
<html><body>
<hr/>
<xsl:apply-templates select="mail[@priority!='zero']">
<xsl:sort select="@priority"/>
<xsl:sort select="header/from"/>
</xsl:apply-templates>
<p/>
Total number of mails: <xsl:value-of select="count(/top/mail)"/>
<br/>
Number of low-priority mails (not displayed): <xsl:value-of select="count(/top/mail[@priority='zero'])"/>
<hr/>
</body></html>
</xsl:template>
<xsl:template match="mail">
[<xsl:value-of select="position()"/>] -
<xsl:value-of select="header/from"/>
to <xsl:value-of select="header/to"/>:
<xsl:value-of select="header/subject"/>
(priority: <xsl:value-of select="@priority"/>)
<br/>
</xsl:template>
</xsl:stylesheet>
The resulting HTML is:
Total number of mails: 7
Let's first consider the sorting. Sorting is done by including the <xsl:sort select="..."> tag inside of apply-templates or for-each. This causes the nodes processed by the template or for-each loop (called the nodeset) to be sorted in lexicographical order by that key. If more than one sort command is included, the first is the primary key, the second will be the secondary, and so on. In this example, messages are sorted first by priority and then by sender.
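The ordering produced by two sort keys can be checked with a small program. This sketch assumes the standard javax.xml.transform API bundled with the JDK rather than the command-line tool used in this document; the class name SortDemo, the sample mailbox and its priority values are invented for the illustration, and the stylesheet emits plain text so the result is easy to read.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies a stylesheet with two xsl:sort keys.
class SortDemo {

    // Invented mailbox: three mails with made-up priorities and senders.
    static final String MAILBOX =
            "<top>"
          + "<mail priority=\"low\"><header><from>JP</from></header></mail>"
          + "<mail priority=\"high\"><header><from>Zoe</from></header></mail>"
          + "<mail priority=\"high\"><header><from>Cedric</from></header></mail>"
          + "</top>";

    // Lists the senders, sorted by priority first and by sender second.
    static final String STYLESHEET =
            "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/top\">"
          + "<xsl:for-each select=\"mail\">"
          + "<xsl:sort select=\"@priority\"/>"
          + "<xsl:sort select=\"header/from\"/>"
          + "<xsl:value-of select=\"header/from\"/><xsl:text>;</xsl:text>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML string and returns the output as a string.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // "high" sorts before "low"; within "high", Cedric sorts before Zoe.
        System.out.println(transform(MAILBOX, STYLESHEET));  // Cedric;Zoe;JP;
    }
}
```

Because the comparison is lexicographical, a priority scheme such as "high"/"low" happens to sort the way one would want, but numeric priorities would need data-type="number" on xsl:sort.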
Another feature that this example illustrates is selecting. Only the mails which have a priority different from 'zero' are selected for display. This is done using the XPath selection command [filter]. In this case, the filter is @priority!='zero'. Strings are enclosed by single quotes. @priority means "this node's priority attribute". You can notice that this expression is used again inside of the mail template when displaying the priority value.
Finally, we used the XPath function position() to number the mails in the listing. position() can be used in conjunction with select statements to display only a specific element (for example the 42nd).
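XPath expressions such as the filter above can also be evaluated on their own. The sketch below assumes the javax.xml.xpath API bundled with the JDK; the class name XPathDemo, its helpers and the three-mail sample document (with invented priority values) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: evaluates XPath expressions against a small mailbox.
class XPathDemo {

    // Invented mailbox with one mail of priority 'zero'.
    static final String MAILBOX =
            "<top>"
          + "<mail priority=\"high\"><header><from>JP</from></header></mail>"
          + "<mail priority=\"zero\"><header><from>Cedric</from></header></mail>"
          + "<mail priority=\"low\"><header><from>Cedric</from></header></mail>"
          + "</top>";

    // Evaluates an expression whose result is a number (for example count()).
    static double number(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(expression,
                    new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad expression: " + expression, e);
        }
    }

    // Evaluates an expression and returns its string value.
    static String string(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(number(MAILBOX, "count(/top/mail)"));                      // 3.0
        System.out.println(number(MAILBOX, "count(/top/mail[@priority!='zero'])"));   // 2.0
        System.out.println(string(MAILBOX, "/top/mail[position()=3]/@priority"));     // low
        System.out.println(string(MAILBOX, "/top/mail[position()=1]/header/from"));   // JP
    }
}
```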
However, one thing that was easy to do with a sequential program becomes more difficult in this framework: how do we transform the same node twice? Suppose for example that from the mailbox file of the previous examples we want to generate a single HTML file with a "table of contents" on top consisting of the subject line of each mail linked to the body of the mail further down the page. Should the template for the <mail> element transform it into summary form or complete form? Both. This is achieved through modes.
It is possible to define several templates for the same node by assigning them different modes. When using apply-templates, you can now give a mode attribute to indicate which template should be used. If you don't, the processor will look for a template without a mode.
We can use modes to enhance the mailbox example to include the full body of the mails below the summary. The XSL code looks like this:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
<xsl:template match="/top">
<html><body>
<xsl:apply-templates mode="short" select="mail[@priority!='zero']"/>
<p/>
Number of mails: <xsl:value-of select="count(/top/mail)"/>
<br/>
Number of low-priority mails (not displayed in summary):
<xsl:value-of select="count(/top/mail[@priority='zero'])"/>
<xsl:apply-templates mode="long"/>
<hr/>
</body></html>
</xsl:template>
<xsl:template match="mail" mode="short">
<xsl:value-of select="header/from"/>
to <xsl:value-of select="header/to"/>:
<xsl:value-of select="header/subject"/>
(<a>
<xsl:attribute name="href">#M<xsl:number value="position()"/></xsl:attribute>Jump</a>)
<br/>
</xsl:template>
<xsl:template match="mail" mode="long">
<hr/>
<a>
<xsl:attribute name="name">M<xsl:number value="position() div 2"/></xsl:attribute>
</a>
From: <xsl:value-of select="header/from"/> <br/>
To: <xsl:value-of select="header/to"/> <br/>
Subject: <xsl:value-of select="header/subject"/><p/><PRE>
<xsl:value-of select="body"/></PRE>
<br/>
</xsl:template>
</xsl:stylesheet>
Download the code at
The result of this xsl stylesheet is shown in this other page. You can see that the template for mail is defined twice: once for the short mode and once for the long mode. When the template is called from within the /top template, the mode is indicated: in the summary we use the short mode and then later on it is the long mode that is used.
This XSL document also uses filtering, which we discussed in a previous section, to remove the low priority mails from the summary, and functions to indicate the number of mails. The position() function is used again, this time inside xsl:number, to generate the anchor names:
<a><xsl:attribute name="href">#M<xsl:number value="position()"/></xsl:attribute>Jump</a>
There are two different ways of introducing conditionals in xsl: the if and the choose statement. The syntax of if is the following:
<xsl:if
test = boolean-expression >
<!-- content -->
</xsl:if>
For example, the following code fragment can be used to create a comma-separated list (a comma must be written after each element except the last one).
<xsl:template match="namelist/name"> <xsl:apply-templates/> <xsl:if test="not(position()=last())">, </xsl:if> </xsl:template>
Note that there is no "else" statement. This limitation is not present in the choose statement. Its syntax is:
<xsl:choose>
<!-- Content: (xsl:when+, xsl:otherwise?) -->
</xsl:choose>
<xsl:when
test = boolean-expression >
<!-- content -->
</xsl:when>
<xsl:otherwise>
<!-- content -->
</xsl:otherwise>
Here is the final mailbox example which incorporates the choose statement.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
<xsl:template match="/top">
<html><body>
<xsl:apply-templates mode="short" select="mail[@priority!='zero']"/>
<p/>
Number of mails: <xsl:value-of select="count(/top/mail)"/>
<br/>
Number of low-priority mails (not displayed in summary):
<xsl:value-of select="count(/top/mail[@priority='zero'])"/>
<xsl:apply-templates mode="long"/>
<hr/>
</body></html>
</xsl:template>
<xsl:template match="mail" mode="short">
<xsl:choose>
<xsl:when test="header/from='JP'">
<em>
<xsl:call-template name="short_inner"/>
</em>
</xsl:when>
<xsl:otherwise>
<xsl:call-template name="short_inner"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="short_inner">
<xsl:value-of select="header/from"/>
to <xsl:value-of select="header/to"/>:
<xsl:value-of select="header/subject"/>
(<a>
<xsl:attribute name="href">#M<xsl:number value="position()"/></xsl:attribute>Jump</a>)
<br/>
</xsl:template>
<xsl:template match="mail" mode="long">
<hr/>
<a>
<xsl:attribute name="name">M<xsl:number value="position() div 2"/></xsl:attribute>
</a>
From: <xsl:value-of select="header/from"/> <br/>
To: <xsl:value-of select="header/to"/> <br/>
Subject: <xsl:value-of select="header/subject"/><p/><PRE>
<xsl:value-of select="body"/></PRE>
<br/>
</xsl:template>
</xsl:stylesheet>
Download the complete file.
The result of applying this stylesheet can be seen on this page.
The colored section uses choose to identify the mails with JP as the sender. In this case we use it like an if-then-else but we could have indicated many when clauses instead of just one. Note that to avoid code duplication we had to introduce a new named template, short_inner. This template is not invoked with apply-templates but with call-template. The reason for this is that apply-templates selects a new set of nodes to process (by default the child nodes) while call-template runs the named template on the current node, which is what we want in this case.
Note also that the long template divides position() by two. The <xsl:apply-templates mode="long"/> call has no select attribute, so it processes all the children of top, including the whitespace text nodes that sit between the mail elements; the mails are therefore found at positions 2, 4, 6 and so on. Dividing by two yields the correct anchor names.
XSL is a language that makes it possible to transform XML into any format, but typically XML or HTML. It takes some time to get used to its declarative syntax but the language is quite powerful once assimilated. Just to illustrate this power, one can find XSL files on the web that perform object orientation (http://www.angelfire.com/tx4/cus/shapes/xsl.html) or even self-reproducing XSL stylesheets (http://www.informatik.hu-berlin.de/~obecker/XSLT/#quine)!
Currently browsers cannot directly display XML using an XSL template. Internet Explorer comes close, but its parser is not compliant. This will change, however, and will open up a lot of interesting possibilities for the web community because the XSL can be manipulated from a script inside the page. This means, for example, that a page could be written in which the user can select how the information should be sorted - entirely from the client side, without having to reload the page. As is often the case, the applications of XSL that will be most successful are probably the ones that we cannot yet imagine.
The parser included in the package we downloaded, Xerces, is a non-validating parser written in Java.
The DOM interface provides functions to manipulate the parsed tree. We will focus on the XML tree, but DOM applies to HTML as well. The DOM W3C standard was started to unify the efforts of both Netscape and Microsoft, which started expanding their scripting languages (JavaScript and VBScript, at the time) to allow them to manipulate the elements on the page. DOM level 1 is the result of these efforts; it came out in October 1998. The more recent DOM level 2 W3C Recommendation was released November 13, 2000. A Level 3 DOM is currently under review.
The DOM specification contains a "core" part that remains constant and a set of documents that describe how DOM applies to different types of parse trees. DOM level 2, for example, contains a views specification (for XML and HTML) , a style specification (for cascading style sheets), an events specification, and more.
DOM is not an application, but an interface. In order to allow as many languages as possible to access this API, DOM is entirely defined using the Object Management Group IDL (Interface Definition Language). To make things absolutely clear, the W3C also publishes the Java and JavaScript (ECMAScript, to be precise) bindings. Note that even though an IDL is used, it is not necessary to use any of the tools generally associated with IDL (IDL compilers, CORBA, COM) in order to use DOM. The IDL is used for the specification but the tools directly use the appropriate language binding.
A typical DOM program goes through the following steps:
For our examples we will use the Xerces parser that was installed at the beginning of the XSL section.
Here is a selection of the most commonly used methods.
Loading and saving XML documents are very common methods, but unfortunately they are not standardized and therefore the calls differ from implementation to implementation.
This example requires the xerces.jar file to be in the classpath. All it does is read some XML through the parser and spit it back out. The code is located at programs/DOM/ReadXML.java and you can run it with java ReadXML ../CDAlbum/CDAlbum1.xml (or any other XML file).
The Xerces parser is created via a org.apache.xerces.parsers.DOMParser object, and the useful methods are parse() and getDocument() which open a file and return the Document object, respectively.
Here is how the program initializes the XML parser:
org.apache.xerces.parsers.DOMParser parser = new org.apache.xerces.parsers.DOMParser();
ReadXML reader = new ReadXML();
long before = System.currentTimeMillis();
parser.setFeature( "http://apache.org/xml/features/dom/defer-node-expansion",
true );
parser.setFeature( "http://xml.org/sax/features/validation",
false );
parser.setFeature( "http://xml.org/sax/features/namespaces",
true );
parser.parse(uri);
reader.printDOMTree(parser.getDocument());
(...)
Download the code
The method where most of the work is done is printDOMTree: it takes a node as input and recursively calls itself on all children. The interesting part about it is how it uses getNodeType() to differentiate the various nodes.
public void printDOMTree(Node node) {
int type = node.getNodeType();
switch (type) {
// print the document element
case Node.DOCUMENT_NODE:
{
System.out.println("");
printDOMTree(((Document) node).getDocumentElement());
break;
}
// print element with attributes
case Node.ELEMENT_NODE:
{
System.out.print("<");
System.out.print(node.getNodeName());
NamedNodeMap attrs = node.getAttributes();
for (int i = 0; i < attrs.getLength(); i++) {
Node attr = attrs.item(i);
System.out.print(" " + attr.getNodeName() +
"=\"" + attr.getNodeValue() +
"\"");
}
System.out.println(">");
NodeList children = node.getChildNodes();
if (children != null) {
int len = children.getLength();
for (int i = 0; i < len; i++) {
printDOMTree(children.item(i));
}
}
break;
}
(...)
This example is not very exciting but presents all the necessary elements
to get started. The Xerces package is well documented, but I discovered
that this documentation was not present in the LotusXSL package. To start
working with XML I recommend that you download the Xerces package (which
includes full documentation and samples) at
http://xml.apache.org.
Here are the main SAX callbacks:
// Have the XSLTProcessorFactory obtain a interface to a
// new XSLTProcessor object.
XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
// Have the XSLTProcessor processor object transform "foo.xml" to
// System.out, using the XSLT instructions found in "foo.xsl".
processor.process(new XSLTInputSource("foo.xml"),
new XSLTInputSource("foo.xsl"),
new XSLTResultTarget(System.out));
This example applies the "foo.xsl" stylesheet to "foo.xml" and
outputs the results to the screen. Note the use of
XSLTInputSource and XSLTResultTarget. These are part
of a very powerful mechanism that allows LotusXSL to work with DOM
or SAX instead of files. For example, if we want to keep the
transformed tree in memory for later modification, all we have to
do is modify that last line to:
// Create a DOM Document node to attach the result nodes to.
Document out = new org.apache.xerces.dom.DocumentImpl();
// Transform to DOM
processor.process(new XSLTInputSource("foo.xml"),
new XSLTInputSource("foo.xsl"),
new XSLTResultTarget(out));
The out variable now contains the transformed tree and we can
process it using the techniques described in the "XML Programming" section.
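For readers without the LotusXSL classes at hand, the same idea (transforming straight into a DOM tree) can be sketched with the standard javax.xml.transform API that ships with the JDK. This is a swapped-in API, not the LotusXSL one shown above; the class name TransformToDomDemo, the sample mailbox and the summary element produced by the stylesheet are all invented for the illustration.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical demo class: keeps the transformation result in memory as a DOM tree.
class TransformToDomDemo {

    // Invented two-mail mailbox.
    static final String MAILBOX =
            "<top>"
          + "<mail><header><from>JP</from></header></mail>"
          + "<mail><header><from>Cedric</from></header></mail>"
          + "</top>";

    // Produces a single summary element containing the number of mails.
    static final String STYLESHEET =
            "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
          + "<xsl:template match=\"/\">"
          + "<summary><xsl:value-of select=\"count(/top/mail)\"/></summary>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet and returns the result as a DOM Document.
    static Document transformToDom(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            // With no node supplied, the transformer creates a new Document.
            DOMResult result = new DOMResult();
            transformer.transform(new StreamSource(new StringReader(xml)), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Document out = transformToDom(MAILBOX, STYLESHEET);
        System.out.println(out.getDocumentElement().getNodeName());     // summary
        System.out.println(out.getDocumentElement().getTextContent());  // 2
    }
}
```

The returned Document can then be walked or modified with the usual DOM calls, exactly as one would do with a parsed file.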
The processor includes several performance optimizations. I will just mention the XSLTProcessor.processStylesheet() method which "compiles" the stylesheet for faster processing. When applying several stylesheets, this method offers significant performance improvements.
Another feature is the ability to pass parameters to the stylesheet. This is done by calling setStylesheetParam(String key, String expression) on the XSLTProcessor object.
XSL makes extensive use of XPath expressions. They are extremely powerful and LotusXSL allows you to use XPath directly from your Java program. This feature is not complete yet, however. It is currently available in the samples directory but will be moved to the main API after customer feedback is gathered.
The parser also supports an extension mechanism. It allows you to define your own stylesheet tags and bind them to arbitrary Java or script code. The approach taken is elegant enough that Javascript code can be directly embedded into the XSL file: in this case it is not even necessary to use any special tool, calling lotusXSL from the command-line is sufficient.
The last feature I want to mention is applet wrapping. The idea is that the XSL parser can be embedded in an applet and sent to the client along with the XML and XSL code. This allows arbitrary processing to be done at the client side (sorting, filtering, ...) and can be used to deploy XML solutions today, even though the browsers are not quite ready yet.
To summarize, it is possible to apply XSL stylesheets programmatically but it is only meaningful if some other processing has to be done as well; otherwise the command-line utilities suffice.