This section is designed to give you a broad overview of the Extensible Markup Language (XML). It discusses XML's history, need, and rationale, together with a quick look at some basic XML constructs and applications.
XML actually is a subset of the Standard Generalized Markup Language (SGML). SGML is an internationally accepted standard for describing just about any type of information; however, it's way too complex for the relatively simple world of the web. And so, the World Wide Web Consortium (W3C) created a modified version of SGML specifically for the web, named it XML, and released it on an unsuspecting public in February 1998.
Over the past few years, XML has received more than its allotted fifteen minutes of fame, with technology pundits and business leaders alike singing its praises. It's been crowned the Next Big Thing, both on account of its ease of use and its potential for revolutionizing the way information is exchanged and used. Some of this is hype and some of it isn't; either way, it's quite clear that XML is going to be around for a while, and that, wisely used, it can indeed be a powerful tool for the management and effective exploitation of information.
XML works by "marking up" data with descriptive tags, in much the same way HTML does. The difference is that HTML was designed specifically to format data for web browsers and, as such, is limited to a predefined set of tags and functions. XML, by contrast, was designed as a web-friendly meta-data language and, therefore, merely lays down the rules for document markup (leaving it to the document author to define his or her own tags). As an example, consider the following block of text:
A Man and His Mouse
J. Gilbert Gumpfinch III, 23 Nov 2001
In a development many consider to be the first of its kind, the Hungarian scientist, Professor Haarbert Floopshot, today announced that he had succeeded in inventing "a better mouse." The new mouse, created using advanced genetic splicing techniques and "some good old-fashioned SuperGlue," can emit ultrasonic squeals to frighten off predators twice its size, leap tall mousetraps in a single bound, and comes equipped with a built-in CatDetector to detect approaching felines.
Sure, you can tell that it's a newspaper report, primarily because you read newspapers, can see the similarities, and can draw a conclusion based on those similarities. You can even break it up conceptually into the headline, the byline, and the body of the article. But a computer can't do those things; to a computer, the block of text above is simply a bunch of alphanumeric characters, with very little to distinguish the headline from the body.
That's where XML comes in. It can be used to transform the anonymous block of text above into something that even a computer can make sense of (see Listing 1.1).
By marking up the data with descriptive tags, XML makes it easy to distinguish between different types of information, even for a computer. In today's wired world, this capability is more valuable than you might surmise; with many of today's business decisions handled by computer, XML can significantly improve the accuracy of information processing, thereby increasing overall business efficiency, streamlining business processes, and (ultimately) fattening the bottom line, which probably also explains why the industry's so enthusiastic about it.
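To make the idea concrete, here is a minimal sketch of how the article above might be marked up, and how a program could then pick out its parts. The tag names (article, headline, byline, author, date, body) are illustrative assumptions rather than a fixed vocabulary, the body text is abbreviated, and the code uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the newspaper report; tag names are assumptions.
ARTICLE_XML = """
<article>
    <headline>A Man and His Mouse</headline>
    <byline>
        <author>J. Gilbert Gumpfinch III</author>
        <date>23 Nov 2001</date>
    </byline>
    <body>Professor Haarbert Floopshot today announced a better mouse.</body>
</article>
"""

# Parse the text into a tree of elements.
article = ET.fromstring(ARTICLE_XML.strip())

# Each piece of information can now be addressed by its tag name.
headline = article.findtext("headline")
author = article.findtext("byline/author")
date = article.findtext("byline/date")

print(headline)        # A Man and His Mouse
print(author, date)    # J. Gilbert Gumpfinch III 23 Nov 2001
```

Once the headline, byline, and body carry their own tags, the program no longer has to guess where one ends and the next begins.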
XML was designed to incorporate the following features:
Ease of use.
By virtue of the descriptive tags they contain, XML documents are easy to read and understand, even for users with little or no computer knowledge. They're also exceedingly simple to create: a non-technical user is typically able to create an XML document in far less time than it takes to create a corresponding HTML document.
This simplicity is perhaps XML's greatest selling point. Think about it: adopting a markup language that can be understood and used by all employees without any special training makes it easier and more cost-effective for businesses to exchange information between individuals, departments, and business units.
Although XML offers document authors tremendous flexibility in naming and using their own tags, it nonetheless does impose some formal structure on a document. Tags must be named and nested correctly; opening tags must have corresponding closing tags; and namespaces must be defined wherever they are required. These rules ensure that every XML document meets some minimum expectations of structure and syntax, and make it easier for applications and processors to deal with XML data.
This emphasis on structured markup is in stark contrast to the "anything-goes" approach of HTML, which frequently sacrifices conformance in the name of greater flexibility. An HTML document containing incorrectly nested tags, for example, would immediately generate conformance errors when passed through an XML parser; the same document would likely be displayed correctly, with no errors recorded, when viewed in any of today's web browsers (many of which incorporate their own error-correction routines for precisely this sort of situation).
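The difference is easy to see with a parser in hand. The following sketch assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration. It feeds correctly and incorrectly nested markup to an XML parser and records whether each sample is accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Correctly nested: <i> is closed before <b>.
nested_ok = is_well_formed("<p><b>bold <i>and italic</i></b></p>")

# Incorrectly nested: <b> is closed while <i> is still open.
nested_bad = is_well_formed("<p><b>bold <i>and italic</b></i></p>")

# Opening tag with no corresponding closing tag.
unclosed = is_well_formed("<p>no closing tag")

print(nested_ok, nested_bad, unclosed)  # True False False
```

A web browser would most likely render the second sample without complaint; an XML parser rejects it outright.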
XML was designed to be used on the Internet, where it plays two very important roles.
First, XML provides a toolkit that enables users to describe the huge amount of data floating around on the Internet; this immediately opens the door to better organization and classification of information on the web, more intelligent search engines, and new types of links between data.
Second, XML provides a standard mechanism for information exchange, encoding data in a format that is easily transmittable from one computer to another using existing Internet protocols and transport mechanisms. Because XML is essentially text, it can be used to transmit information over email, FTP, HTTP, or any other text-capable protocol. This feature makes it an ideal candidate for information sharing between businesses or individuals located in different geographical areas.
Examples of technologies that use XML to exchange data include Resource Description Framework (RDF), Channel Definition Format (CDF), Web Distributed Data Exchange (WDDX), and Simple Object Access Protocol (SOAP).
Wide application support.
Because XML is easy to use and easy to move around, it's not hard to write an application that uses XML-encoded data. A number of XML parsers are available online; XML editors, validators, and similar tools are gaining market share; and most popular web browsers now support XML.
As mentioned earlier in this chapter, XML is an offshoot of SGML, and XML documents comply with all the rules and constraints of SGML markup. That being said, XML is far easier to use than SGML because its focus is much narrower. By straddling both worlds, XML combines the power of SGML with the flexibility and ease of use we've come to expect from the Internet.
XML documents come in two flavors: well-formed and valid.
A well-formed document is one that adheres to the basic rules laid down in the XML specification; for example, all elements must be properly nested; attribute values must be enclosed within quotation marks; and the document must contain exactly one outermost (root) element.
A valid document is one that, in addition to being well-formed, meets the requirements and constraints laid down in a Document Type Definition (DTD) or XML schema. This DTD or schema is an additional ruleset an author can use to specify the element names and data types that are allowed in the document. This helps to reduce the risk of corrupted or invalid data. Listing 1.2 shows a well-formed XML document, which describes an invoice for materials purchased. It's a slightly contrived example, but it will serve to illustrate XML's most commonly used constructs.
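As you can see, an XML document is merely plain text, broken into separate sections by markup. This markup has several components, each with its own distinct role: the document prolog, elements, attributes, entities, notations, CDATA blocks, processing instructions, and comments.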
These components are explained in the sections that follow.
Every XML document begins with a special identifier called the document prolog, such as
This prolog appears at the top of an XML document and specifies things like the XML version and type of encoding used, the location of any DTD that may be used to validate the document, and one or more entity definitions.
In Listing 1.2, the prolog contained two entity definitions and one notation:
The document prolog is followed by one or more elements. Elements are the most basic units of XML data; they consist of attributes and content (or character data) surrounded by descriptive tags (or markup). Here are three examples:
To be well-formed, an XML document must contain exactly one outermost element. This element, usually referred to as the root element, serves as the container for the remainder of the document.
Elements can be empty, contain other elements nested within them, or enclose a combination of both character data and elements.
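These three cases can be sketched in a few lines. The example below assumes Python's standard xml.etree.ElementTree module and a made-up order document (the element names order, note, item, remark, and b are hypothetical). In ElementTree, character data that follows a child element is stored in that child's tail attribute.

```python
import xml.etree.ElementTree as ET

# A hypothetical document showing empty, nested, and mixed-content elements.
ORDER_XML = (
    "<order>"
    "<note/>"                                  # empty element
    "<item>Widget</item>"                      # element with character data
    "<remark>Ships <b>today</b> only</remark>" # character data mixed with an element
    "</order>"
)

order = ET.fromstring(ORDER_XML)

# The root element contains three nested elements.
child_tags = [child.tag for child in order]

# An empty element has no text and no children.
note = order.find("note")
note_is_empty = note.text is None and len(note) == 0

# Mixed content: text before the child, the child itself, and text after it.
remark = order.find("remark")
mixed_parts = (remark.text, remark[0].text, remark[0].tail)

print(child_tags)     # ['note', 'item', 'remark']
print(note_is_empty)  # True
print(mixed_parts)    # ('Ships ', 'today', ' only')
```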
Elements can be enhanced further by the addition of attributes, which are name-value pairs that can be used to attach any type of additional descriptive information to an element. Here are two examples:
In order to be well-formed, attribute values must be enclosed within quotation marks, and attribute names cannot be repeated within the same element.
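Both attribute rules can be checked with a parser. This sketch assumes Python's standard xml.etree.ElementTree module; the item element, its id and currency attributes, and the helper name parses are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Quoted, uniquely named attributes are available as name-value pairs.
item = ET.fromstring('<item id="A-101" currency="USD">Widget</item>')
attributes = dict(item.attrib)

# An attribute value without quotation marks is rejected.
unquoted_ok = parses("<item id=A-101>Widget</item>")

# An attribute name repeated within the same element is rejected.
duplicate_ok = parses('<item id="A-101" id="B-202">Widget</item>')

print(attributes)                 # {'id': 'A-101', 'currency': 'USD'}
print(unquoted_ok, duplicate_ok)  # False False
```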
Entities serve as placeholders for frequently used pieces of text within an XML document. They provide a convenient shortcut for document authors to store and easily update commonly used text snippets.
Entities consist of two components:
The entity definition, which links an entity name to the text block it represents:
The entity reference, which serves as a placeholder for the longer text block:
When an XML parser processes an XML document, entity references automatically are replaced with their actual values.
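The substitution can be observed directly. In this minimal sketch, the entity name company and its replacement text are invented for illustration, and the parser is Python's standard xml.etree.ElementTree, which expands entities defined in the document's own prolog.

```python
import xml.etree.ElementTree as ET

# The entity definition lives in the prolog; the reference (&company;)
# appears in the content. Both the name and the text are hypothetical.
MEMO_XML = """<!DOCTYPE memo [
    <!ENTITY company "Example Widgets, Inc.">
]>
<memo>&company; thanks you for your order.</memo>"""

memo = ET.fromstring(MEMO_XML)

# By the time the application sees the data, the reference has been replaced.
memo_text = memo.text

print(memo_text)  # Example Widgets, Inc. thanks you for your order.
```

Changing the definition in one place updates every reference to it, which is exactly the convenience entities are meant to provide.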
XML comes with five predefined entities. You might already be familiar with them if you have worked with HTML (see Table 1.1).
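The five predefined entities are &lt;, &gt;, &amp;, &apos;, and &quot;. As a quick sketch using Python's standard library (the expr element name and the sample strings are made up), the parser decodes them on the way in, and the escape helper in xml.sax.saxutils encodes the three characters that are unsafe in element content on the way out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# All five predefined entity references, separated by spaces.
snippet = ET.fromstring("<expr>&lt; &gt; &amp; &apos; &quot;</expr>")

# The parser hands back the characters the references stand for.
decoded = snippet.text

# Going the other way: escape() replaces &, <, and > with entity references.
escaped = escape("R&D <dept>")

print(decoded)  # < > & ' "
print(escaped)  # R&amp;D &lt;dept&gt;
```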
A variant of the regular entity just described is the unparsed entity, typically used to reference data that should not be processed by the XML parser. This is usually binary data: images, audio files, video streams, and the like. The preceding example demonstrates one such unparsed entity, which holds the path to the company :
Note the NDATA keyword, which tells the parser that it should look up the appropriate notation to find out how to handle this data (notations are discussed next).
In order for a document to be well-formed, entities cannot contain references to themselves; think infinite loop and you'll understand why.
A notation is an XML construct designed to help the parser identify non-XML data (for example, images or sound files) and typically goes hand in hand with unparsed entities. A notation is always enclosed within a notation declaration, which appears either within a DTD or the document prolog, and looks like this:
The notation name is a unique identifier used within unparsed entities, while the notation identifier is a string that tells the XML processor how to handle that particular entity. This string could be anything from a URL that identifies the data type to the location of a program that can decode the data. Here's an example:
CDATA blocks are "boxes" within an XML document, identified by special opening and closing delimiters. The text within these boxes is treated by the parser as character data, not markup, and can therefore contain special characters which would normally cause the parser to generate an error.
CDATA blocks begin with the special sequence <![CDATA[ and end with the sequence ]]>. For example,
The alternative to CDATA blocks is, of course, using the predefined entities discussed earlier to represent special characters like the less-than (<), greater-than (>), and ampersand (&) symbols. Because entities allow for reusability within the document, using entities is sometimes preferable to using CDATA blocks.
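The two approaches produce the same character data, as the following sketch suggests. It assumes Python's standard xml.etree.ElementTree module; the script element name and the code fragment inside it are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Text full of characters that would normally be read as markup.
source = "if (a < b && b > c) { swap(a, c); }"

# Option 1: wrap the text in a CDATA block and leave it untouched.
cdata_doc = "<script><![CDATA[" + source + "]]></script>"

# Option 2: replace each special character with a predefined entity.
entity_doc = "<script>if (a &lt; b &amp;&amp; b &gt; c) { swap(a, c); }</script>"

from_cdata = ET.fromstring(cdata_doc).text
from_entities = ET.fromstring(entity_doc).text

# Either way, the application receives the original text.
print(from_cdata == source, from_entities == source)  # True True
```

The CDATA version is easier to read when a block contains many special characters; the entity version works anywhere character data is allowed.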
Processing instructions (PIs) are special instructions embedded within an XML document. These PIs are not usually intended for human readers; rather, they provide special information or commands to the XML application responsible for parsing the document. Parsers that do not recognize these instructions will simply ignore them.
PIs are typically enclosed within <? ... ?> tags, as demonstrated here:
Notice that the very first line in an XML document uses the same syntax as a PI (the XML specification calls it the XML declaration) and indicates the version number and encoding to the parser:
The parser can use this information to make decisions on how to process the XML; for example, it can reject the document if the XML version is unsupported, or switch its internal character handling routines to use the encoding specified in the prolog.
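Here is a small sketch of both ideas, assuming Python's standard xml.dom.minidom module. The xml-stylesheet instruction, the style.xsl file name, and the name element are illustrative. The document is supplied as bytes so that the parser has to rely on the declared encoding to decode the accented character. Note that minidom does not expose the first line as a processing instruction node, so only the stylesheet instruction shows up in the list.

```python
from xml.dom import Node, minidom

# A byte string in ISO-8859-1: \xe9 is the letter e with an acute accent.
DOC = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="style.xsl"?>\n'
    b"<name>Caf\xe9 Floopshot</name>"
)

dom = minidom.parseString(DOC)

# Collect the processing instructions that sit at the top of the document.
instructions = [
    node for node in dom.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
pi = instructions[0]

# The parser used the declared encoding to decode the element content.
name = dom.documentElement.firstChild.data

print(pi.target)  # xml-stylesheet
print(pi.data)    # type="text/xsl" href="style.xsl"
print(name)       # Café Floopshot
```

An application that does not care about stylesheets can simply skip the instruction; one that does can read its target and data and act on them.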
Finally, comments provide a simple and convenient way for document authors to include human-readable notes within their XML markup. Comments must be placed within <!-- ... --> markers, and they are usually ignored by the parser.
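As a brief sketch (Python's standard xml.dom.minidom module; the invoice markup and the comment text are made up), a DOM-style parser keeps a comment as a node of its own, separate from the data, so an application can read the data without ever looking at the note.

```python
from xml.dom import Node, minidom

# A hypothetical fragment with a note for human readers.
INVOICE_XML = "<invoice><!-- prices in USD --><total>42.50</total></invoice>"

dom = minidom.parseString(INVOICE_XML)

# The comment is kept apart from the element data.
first = dom.documentElement.firstChild
is_comment = first.nodeType == Node.COMMENT_NODE
comment_text = first.data

# The data itself is unaffected by the comment.
total = dom.getElementsByTagName("total")[0].firstChild.data

print(is_comment, repr(comment_text))  # True ' prices in USD '
print(total)                           # 42.50
```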
As XML's popularity has grown, so has an understanding of its capabilities and potential; and this, in turn, has spawned a new generation of related technologies. Together, they are an oft-confusing morass of acronyms and buzzwords; individually, they each make an important contribution to the overall picture.
Here's a brief list of the better-known XML development efforts underway at the W3C:
XSL and XSLT.
The Extensible Stylesheet Language (XSL) is a language designed to handle the presentation of XML-encoded data. The language has three components: XSL Transformations (XSLT), which is responsible for restructuring XML data into something new and different; XML Path Language (XPath), which provides constructs to address specific parts of an XML document; and XSL Formatting Objects (XSL-FO), which is responsible for the formatting and presentation of the new-and-different result.
You can find out more about XSL at http://www.w3.org/Style/XSL/. Also, see Inside XML by Steven Holzner (ISBN: 0735710201, New Riders Publishing, 2001) and Inside XSLT by Steven Holzner (ISBN: 0735711364, New Riders Publishing, 2001).
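To get a feel for what "addressing specific parts of an XML document" means, consider this minimal sketch. It relies on Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath (not the full language, and not XSLT), and on a made-up invoice document.

```python
import xml.etree.ElementTree as ET

# A hypothetical invoice; element and attribute names are assumptions.
INVOICE_XML = """<invoice>
    <item id="1"><desc>Widget</desc><price>10.00</price></item>
    <item id="2"><desc>Gadget</desc><price>25.50</price></item>
</invoice>"""

invoice = ET.fromstring(INVOICE_XML)

# A path expression: every <desc> inside every <item> under the root.
descriptions = [desc.text for desc in invoice.findall("./item/desc")]

# A predicate: the <price> of the <item> whose id attribute equals 2.
second_price = invoice.findtext("./item[@id='2']/price")

print(descriptions)  # ['Widget', 'Gadget']
print(second_price)  # 25.50
```

XSLT stylesheets use expressions of this kind to select the parts of a document that a template should transform.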
XPointer is a mechanism for locating specific nodes within an XML document. XPointer "addresses" use both relative and absolute paths to find and identify particular nodes in an XML document and are sometimes used with XLink for more precise links between data fragments.
You can find out more about XPointer at http://www.w3.org/XML/Linking.
XLink makes it possible to link resources in much the same way as standard HTML hyperlinks. However, XLink takes things a step further, allowing for single links that have multiple destinations and offering link authors the ability to control the direction of link traversal.
You can find out more about XLink at http://www.w3.org/XML/Linking.
XHTML is a re-write of HTML 4.0 that makes it compliant with the XML 1.0 specification, combining standard XML rules with HTML markup for more efficient and structured web pages. By incorporating the best features of both versions, XHTML hopes to lay the groundwork for the next generation of web applications.
You can find out more about XHTML at http://www.w3.org/MarkUp/. Also, see XHTML by Chelsea Valentine and Chris Minnick (New Riders Publishing, 2001).
XML Query hopes to do for XML what SQL did for databases: provide a standard interface to query XML documents and extract specific subsets of the data contained within them. Possible applications of this technology include more efficient full-text searches, multilanguage search engines, and easier access to XML-encoded information and the relationships it embodies.
You can find out more about XML Query at http://www.w3.org/XML/Query.
Like DTDs, XML Schemas are rulesets specifying constraints on XML data. Unlike DTDs, they include support for namespaces, derived types and inheritance, and merged rulesets, and are expected to quickly supplant DTDs as the tool of choice for imposing conformance on XML documents.
You can find out more about XML Schema at http://www.w3.org/XML/Schema.
XML Signature offers a mechanism to digitally sign electronic (usually XML) content and verify this signature in an error-free manner at the other end.
You can find out more about XML Signature at http://www.w3.org/Signature/.
As the preceding list demonstrates, XML has the potential to change the way we deal with web-based content. Here are four of XML's most important applications:
Better search engines.
Because XML describes data, it can significantly improve the indexing techniques used by most popular search engines, thereby resulting in faster, more efficient searches and more relevant results. Today, given the search term "rock," search engines can't distinguish between rock, a mass of stone, and rock, the musical genre. Tomorrow, document authors will be able to use XML to ensure that the distinction between the two is clear.
Better analysis of data.
XSLT makes it possible to create different views of the same XML data, simplifying the task of understanding and acting on complex business information. Using XSLT, the same XML document can be sliced, diced, and served up in innumerable ways, making it easier to see hidden relationships between the data and to gain a better understanding of the big picture.
More efficient content management.
By providing document authors with standard ways to describe data, XML opens the door to more efficient content management and publishing solutions. Content publishers can use XML to create, classify, and publish data in a standard format; because this data now meets certain basic rules of structure and syntax, it can easily be shared with other XML-compliant organizations.
More efficient information exchange.
To businesses, the ease and flexibility of the web and the inherent power of XML make a powerful combination, one which enables a new generation of web applications. These new applications are capable of receiving XML data from different sources, integrating these data fragments to create a composite picture, and using this information to make crucial business decisions on purchases, inventory, and billing. This is good for the organization, for the employees, and (let's not forget!) for the bottom line.
Of course, this is just the tip of the iceberg. XML and its related technologies are still coming to full fruition, and new applications for this family of powerful technologies appear all the time.
If you'd like to learn more about XML, there are a number of very good books available to get you up to speed. The book's companion web site (