Chapter 22. XML - Perl Cd Bookshelf [Electronic resources] (text version)

Chapter 22. XML


Contents:

Introduction

Parsing XML into Data Structures

Parsing XML into a DOM Tree

Parsing XML into SAX Events

Making Simple Changes to Elements or Text

Validating XML

Finding Elements and Text Within an XML Document

Processing XML Stylesheet Transformations

Processing Files Larger Than Available Memory

Reading and Writing RSS Files

Writing XML


John Donne, Holy Sonnets

I am a little world made cunningly / Of elements, and an angelic sprite


22.0. Introduction


The Extensible
Markup Language (XML) standard was released in 1998. It quickly
became the standard way to represent and exchange almost every kind
of data, from books to genes to function calls.

XML succeeded where earlier "standard" data formats failed
(including XML's ancestor, SGML, the Standard Generalized Markup
Language). There are three reasons for XML's success: it is
text-based instead of binary, it is simple rather than complex, and
it has a superficial resemblance to HTML.


Text


Unix realized nearly 30 years before XML
that humans primarily interact with computers through text. Thus text
files are the only files any system is guaranteed to be able to read
and write. Because XML is text, programmers can easily make legacy
systems emit XML reports.


Simplicity


As we'll see, a lot of complexity has arisen around XML, but the XML
standard itself is very simple. There are very few things that can
appear in an XML document, but from those basic building blocks you
can build extremely complex systems.


HTML


XML is not HTML, but
XML and HTML share a common ancestor: SGML. The superficial
resemblance meant that the millions of programmers who had to learn
HTML to put data on the web were able to learn (and accept) XML more
easily.



22.0.1. Syntax


Example 22-1 shows a simple XML document.

Example 22-1. Simple XML document


<?xml version="1.0" encoding="UTF-8"?>
<books>
  <!-- Programming Perl 3ed -->
  <book id="1">
    <title>Programming Perl</title>
    <edition>3</edition>
    <authors>
      <author>
        <firstname>Larry</firstname>
        <lastname>Wall</lastname>
      </author>
      <author>
        <firstname>Tom</firstname>
        <lastname>Christiansen</lastname>
      </author>
      <author>
        <firstname>Jon</firstname>
        <lastname>Orwant</lastname>
      </author>
    </authors>
    <isbn>0-596-00027-8</isbn>
  </book>
  <!-- Perl & LWP -->
  <book id="2">
    <title>Perl &amp; LWP</title>
    <edition>1</edition>
    <authors>
      <author>
        <firstname>Sean</firstname>
        <lastname>Burke</lastname>
      </author>
    </authors>
    <isbn>0-596-00178-9</isbn>
  </book>
  <book id="3">
    <!-- Anonymous Perl -->
    <title>Anonymous Perl</title>
    <edition>1</edition>
    <authors />
    <isbn>0-555-00178-0</isbn>
  </book>
</books>

At first glance, XML looks a lot like HTML: there are elements (e.g.,
<book> </book>), entities (e.g.,
&amp; and &lt;), and
comments (e.g., <!-- Perl & LWP -->).
Unlike HTML, XML doesn't define a standard set of elements, and
defines only a minimum set of entities (for single quotes, double
quotes, less-than, greater-than, and ampersand). The XML standard
specifies only syntactic building blocks like the
< and > around elements.
It's up to you to create the vocabulary, that
is, the element and attribute names like books,
authors, etc., and how they nest.

XML's opening and closing tags are familiar from HTML:

<book>
</book>

XML adds a variation for empty elements (those with no text or other
elements between the opening and closing tags):

<author />

Elements may have attributes, as in:

<book id="1">

Unlike HTML, the case of XML elements, entities, and attributes
matters: <Book> and
<book> start two different elements. All
attribute values must be quoted, with either single or double quotes
(id='1' versus id="1"). Unicode
letters, underscores, hyphens, periods, and numbers are all
acceptable in element and attribute names, but the first character of
a name must be a letter or an underscore. Colons are allowed only in
namespaces (see Namespaces, later in this
chapter).

Whitespace is surprisingly tricky. The XML specification says
anything that's not a markup character is content. So (in theory) the
newlines and whitespace indents between tags in Example 22-1 are text
data. Most XML parsers offer the choice of retaining whitespace or
sensibly folding it (e.g., to ignore newlines and indents).
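As a rough illustration of that choice, the following sketch assumes the XML::LibXML module (covered later in this chapter) and its no_blanks parser option. The shortened document and the variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# A shortened, indented document in the style of Example 22-1 (illustrative only).
my $xml = <<'XML';
<books>
  <book id="1">
    <title>Programming Perl</title>
  </book>
</books>
XML

# Default parse: the newline-and-indent runs between tags become text nodes.
my $kept          = XML::LibXML->load_xml(string => $xml);
my @kept_children = $kept->documentElement->childNodes;

# With no_blanks, libxml2 drops text nodes it judges to be ignorable whitespace.
my $folded          = XML::LibXML->load_xml(string => $xml, no_blanks => 1);
my @folded_children = $folded->documentElement->childNodes;

printf "whitespace kept:   %d child nodes under <books>\n", scalar @kept_children;
printf "whitespace folded: %d child nodes under <books>\n", scalar @folded_children;
```

With whitespace kept, the single book element is surrounded by two whitespace-only text nodes, which is why code that walks child nodes often has to skip text it did not expect.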

22.0.2. XML Declaration





The first line of Example 22-1 is the XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>

This declaration is
optional—Version 1.0 of XML and UTF-8 encoded text are the
defaults. The encoding attribute specifies the
Unicode encoding of the document. Some XML parsers can cope with
arbitrary Unicode encodings, but others are limited to ASCII and
UTF-8. For maximum portability, create XML data as UTF-8.
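To see what a parser makes of the declaration, here is a small sketch that assumes XML::LibXML and uses its version and encoding document accessors. The two tiny documents are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# One document with an explicit declaration, one without (both illustrative).
my $with_decl = XML::LibXML->load_xml(
    string => qq{<?xml version="1.0" encoding="UTF-8"?>\n<books/>},
);
my $without_decl = XML::LibXML->load_xml(string => '<books/>');

my $version         = $with_decl->version;      # taken from the declaration
my $encoding        = $with_decl->encoding;     # taken from the declaration
my $default_version = $without_decl->version;   # parser falls back to the default

print "declared version:  $version\n";
print "declared encoding: $encoding\n";
print "default version:   $default_version\n";
```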

22.0.3. Processing Instructions


Similar to declarations are
processing instructions, which are instructions
for XML processors. For example:

<title><?pdf font Helvetica 18pt?>XML in Perl</title>

Processing instructions have the general structure:

<?target data ... ?>

When an XML processor encounters a processing instruction, it checks
the target. Processors should ignore
targets they don't recognize. This lets one XML file contain
instructions for many different processors. For example, the XML
source for this book might have separate instructions for programs
that convert to HTML and to PDF.
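A processor that cares only about its own target might pick out its instructions as in the sketch below. It assumes XML::LibXML and the XPath processing-instruction() node test; the "html" target is hypothetical and stands in for a target nobody wrote.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(
    string => '<title><?pdf font Helvetica 18pt?>XML in Perl</title>',
);

# Pretend we are the "pdf" processor: select only instructions aimed at us.
my @instructions = $doc->findnodes('//processing-instruction("pdf")');
my $pi     = $instructions[0];
my $target = $pi->nodeName;    # the target name
my $data   = $pi->nodeValue;   # everything after the target

# A different processor (hypothetical "html" target) finds nothing to do.
my @other = $doc->findnodes('//processing-instruction("html")');

print "target: $target\n";
print "data:   $data\n";
printf "instructions for html: %d\n", scalar @other;
```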

22.0.4. Comments



XML
comments have the same syntax as HTML comments:

<!-- ... -->

The comment text can't contain --, so comments
don't nest.

22.0.5. CDATA



Sometimes
you want to put text in an XML document without having to worry about
encoding entities. Such a literal block is called
CDATA in XML, written:

<![CDATA[literal text here]]>

The ugly syntax betrays XML's origins in SGML. Everything after the
initial <![CDATA[ and up to the
]]> is literal data in which XML markup
characters such as < and
& have no special meaning.

For example, you might put sample code that contains a lot of XML
markup characters in a CDATA block:

<para>The code to do this is as follows:</para>
<code><![CDATA[$x = $y << 8 & $z]]></code>
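From a program's point of view, a CDATA section is just another way of spelling text. The sketch below assumes XML::LibXML; it reads the literal text back out of a CDATA section, then writes the same text once as CDATA and once as ordinary (escaped) text. Variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

my $code_text = '$x = $y << 8 & $z';

# Reading: the parser hands back the literal text, markup characters and all.
my $doc = XML::LibXML->load_xml(
    string => '<code><![CDATA[$x = $y << 8 & $z]]></code>',
);
my $literal = $doc->documentElement->textContent;

# Writing: the same text as a CDATA section ...
my $out      = XML::LibXML::Document->new('1.0', 'UTF-8');
my $as_cdata = $out->createElement('code');
$as_cdata->appendChild($out->createCDATASection($code_text));

# ... and as plain text, where the serializer must escape < and &.
my $as_text = $out->createElement('code');
$as_text->appendText($code_text);

my $cdata_xml = $as_cdata->toString;
my $text_xml  = $as_text->toString;

print "read back: $literal\n";
print "as CDATA:  $cdata_xml\n";
print "as text:   $text_xml\n";
```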

22.0.6. Well-Formed XML


To ensure that all XML documents are
parsable, there are some minimum requirements expected of an XML
document. The following list is adapted from the list in
Perl & XML, by Erik T. Ray and Jason
McIntosh (O'Reilly):


  • The document must have one and only one top-level element (e.g.,
    books in Example 22-1).


  • Every element with content must have both a start and an end tag.


  • All attributes must have values, and those values must be quoted.


  • Elements must not overlap.


  • Markup characters (<, >, and &) must be used to indicate markup only.
    In other words, you can't have <title>Perl & XML</title> because the
    & can only indicate an entity reference. CDATA sections are the only
    exception to this rule.


If an XML document meets these rules, it's said to be "well-formed."
Any XML parser that conforms to the XML standard should be able to
parse a well-formed document.
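One practical way to test these rules is simply to hand the text to a parser and see whether it complains. The sketch below assumes XML::LibXML, which raises an exception on malformed input; is_well_formed is a helper name invented for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Returns 1 if the parser accepts the string, 0 if it throws.
sub is_well_formed {
    my ($xml) = @_;
    my $ok = eval { XML::LibXML->load_xml(string => $xml); 1 };
    return $ok ? 1 : 0;
}

# Each sample breaks (or obeys) one rule from the list above.
my @samples = (
    [ 'escaped ampersand'      => '<title>Perl &amp; XML</title>' ],
    [ 'bare ampersand'         => '<title>Perl & XML</title>'     ],
    [ 'overlapping elements'   => '<b><i>overlap</b></i>'         ],
    [ 'unquoted attribute'     => '<book id=1/>'                  ],
    [ 'two top-level elements' => '<book/><book/>'                ],
);

for my $sample (@samples) {
    my ($label, $xml) = @$sample;
    printf "%-24s %s\n", $label, is_well_formed($xml) ? 'well-formed' : 'NOT well-formed';
}
```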

22.0.6. Schemas



There are two parts to any program that processes an XML document: the
XML parser, which manipulates the XML markup, and the program's logic,
which identifies text, the elements, and their structure.
Well-formedness ensures that the XML parser can work with the document,
but it doesn't guarantee that the elements have the correct names and
are nested correctly.

For example, these two XML fragments encode the same information in
different ways:

<book>
  <title>Programming Perl</title>
  <edition>3</edition>
  <authors>
    <author>
      <firstname>Larry</firstname>
      <lastname>Wall</lastname>
    </author>
    <author>
      <firstname>Tom</firstname>
      <lastname>Christiansen</lastname>
    </author>
    <author>
      <firstname>Jon</firstname>
      <lastname>Orwant</lastname>
    </author>
  </authors>
</book>

<work>
  <writers>Larry Wall, Tom Christiansen, and Jon Orwant</writers>
  <name edition="3">Programming Perl</name>
</work>

The structure is different, and if you wrote code to extract the
title from one ("get the contents of the book element, then find the
contents of the title element within that") it would fail completely
on the other. For this reason, it is common to write a specification
for the elements, attributes, entities, and the ways to use them.
Such a specification lets you be confident that your program will
never be confronted with XML it cannot deal with. The two formats for
such specifications are DTDs and schemas.

DTDs are the older and more limited format, acquired by way of XML's
SGML past. DTDs are not written in XML, so you need a custom
(complex) parser to work with them. Additionally, they aren't
suitable for many uses: simply saying "the book element must contain
one each of the title, edition, author, and isbn elements in any
order" is remarkably difficult.

For these reasons, most modern
content specifications take the form of schemas. The World Wide Web
Consortium (W3C), the folks responsible for XML and a host of related
standards, have a standard called XML Schema (http://www.w3.org/TR/xmlschema-0/). This is
the most common schema language in use today, but it is complex and
problematic. An emerging rival for XML Schema is the OASIS group's
RelaxNG; see http://www.oasis-open.org/committees/relax-ng/spec-20011203l
for more information.

There are Perl modules for working
with schemas. The most important thing you do with a schema,
however, is to validate an XML document against
it. Recipe 22.5 shows how to use
XML::LibXML to do this. XML::Parser does not support validation.
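As a small taste of validation (separate from the recipe itself), the sketch below assumes XML::LibXML's XML::LibXML::RelaxNG class. The tiny RelaxNG grammar is our own invention: it says a book holds one title and one edition in any order, the kind of rule described above as awkward in a DTD. The conforms helper is likewise hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# A made-up RelaxNG grammar: <book> contains <title> and <edition>, in any order.
my $rng_source = <<'RNG';
<element name="book" xmlns="http://relaxng.org/ns/structure/1.0">
  <interleave>
    <element name="title"><text/></element>
    <element name="edition"><text/></element>
  </interleave>
</element>
RNG

my $schema = XML::LibXML::RelaxNG->new(string => $rng_source);

# validate() dies when the document does not match the schema.
sub conforms {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    return eval { $schema->validate($doc); 1 } ? 1 : 0;
}

my @samples = (
    '<book><title>Programming Perl</title><edition>3</edition></book>',
    '<book><edition>3</edition><title>Programming Perl</title></book>',
    '<book><title>Programming Perl</title></book>',
    '<work><name edition="3">Programming Perl</name></work>',
);

for my $xml (@samples) {
    printf "%-7s %s\n", conforms($xml) ? 'valid' : 'invalid', $xml;
}
```

All four samples are well-formed; only the first two match the vocabulary the program expects.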

22.0.7. Namespaces



One especially handy property of XML is nested elements. This lets one
document encapsulate another. For example, suppose you want to send a
purchase order document in a mail message. Here's how you'd do
that:

<mail>
<header>
<from>me@example.com</from>
<to>you@example.com</to>
<subject>PO for my trip</subject>
</header>
<body>
<purchaseorder>
<for>Airfare</for>
<bill_to>Editorial</bill_to>
<amount>349.50</amount>
</purchaseorder>
</body>
</mail>

This works, but we can easily run into problems. For example, if the
purchase order used <to> instead of
<bill_to> to indicate the department to be
charged, we'd have two elements named <to>.
The resulting document is sketched here:

<mail>
<header>
<to>you@example.com</to>
</header>
<body>
<to>Editorial</to>
</body>
</mail>

This document uses to for two different purposes.
This is similar to the problem in programming where a global variable
in one module has the same name as a global variable in another
module. Programmers can't be expected to avoid variable names from
other modules, because that would require them to know every module's
variables.

The solution to the XML problem is similar to the programming
problem's solution: namespaces. A namespace is a unique prefix for
the elements and attributes in an XML vocabulary, and is used to
avoid clashes with elements from other vocabularies. If you rewrote
your purchase-order email example with namespaces, it might look like
this:

<mail xmlns:email="http://example.com/dtds/mailspec/">
<email:from>me@example.com</email:from>
<email:to>you@example.com</email:to>
<email:subject>PO for my trip</email:subject>
<email:body>
<purchaseorder xmlns:po="http://example.com/dtd/purch/">
<po:for>Airfare</po:for>
<po:to>Editorial</po:to>
<po:amount>349.50</po:amount>
</purchaseorder>
</email:body>
</mail>

An attribute like
xmlns:prefix="URL"
identifies the namespace for the contents of the element that the
attribute is attached to. In this example, there are two namespaces:
email and po. The
email:to element is different from the
po:to element, and processing software can avoid
confusion.

Most of the XML parsers in Perl support namespaces, including
XML::Parser and XML::LibXML.
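To show how a program keeps the two to elements apart, here is a sketch that assumes XML::LibXML and its XML::LibXML::XPathContext class. The prefixes m and p are our own choice; only the namespace URLs have to match those in the document.

```perl
use strict;
use warnings;
use XML::LibXML;

# The namespaced purchase-order mail from the text.
my $xml = <<'XML';
<mail xmlns:email="http://example.com/dtds/mailspec/">
  <email:from>me@example.com</email:from>
  <email:to>you@example.com</email:to>
  <email:subject>PO for my trip</email:subject>
  <email:body>
    <purchaseorder xmlns:po="http://example.com/dtd/purch/">
      <po:for>Airfare</po:for>
      <po:to>Editorial</po:to>
      <po:amount>349.50</po:amount>
    </purchaseorder>
  </email:body>
</mail>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Bind our own prefixes to the namespace URLs before querying.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(m => 'http://example.com/dtds/mailspec/');
$xpc->registerNs(p => 'http://example.com/dtd/purch/');

my $mail_to = $xpc->findvalue('//m:to');   # the mail recipient
my $po_to   = $xpc->findvalue('//p:to');   # the department to charge

print "mail goes to:   $mail_to\n";
print "order bills to: $po_to\n";
```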

22.0.8. Transformations


One of the
favorite pastimes of XML hackers is turning XML into something else.
In the old days, this was accomplished with a program that knew a
specific XML vocabulary and could intelligently turn an XML file that
used that vocabulary into something else, like a different type of
XML, or an entirely different file format, such as HTML or PDF. This
was such a common task that people began to separate the
transformation engine from the specific transformation, resulting in
a new specification: Extensible Stylesheet Language Transformations
(XSLT).

Turning XML into something else with XSLT involves writing a
stylesheet. A stylesheet says "when you see this
in the input XML, emit that." You can encode
loops and branches, and identify elements (e.g., "when you see the
book element, print only the contents of the
enclosed title element").

Transformations in Perl are best accomplished through the
XML::LibXSLT module, although XML::Sablotron and XML::XSLT are
sometimes also used. We show how to use XML::LibXSLT in Recipe 22.7.
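For a feel of what a stylesheet looks like before the recipe proper, the sketch below assumes XML::LibXML and XML::LibXSLT are installed. The two-book input and the stylesheet are made up for the example; the stylesheet says "when you see a book element, print only the contents of its title."

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Illustrative input in the vocabulary of Example 22-1.
my $source = XML::LibXML->load_xml(string => <<'XML');
<books>
  <book id="1"><title>Programming Perl</title><edition>3</edition></book>
  <book id="2"><title>Perl &amp; LWP</title><edition>1</edition></book>
</books>
XML

# A made-up stylesheet: visit each book, emit its title and a newline.
my $style_doc = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:apply-templates select="books/book"/>
  </xsl:template>
  <xsl:template match="book">
    <xsl:value-of select="title"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
XSL

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform($source);
my $text       = $stylesheet->output_as_chars($result);

print $text;
```

The transformation engine (XML::LibXSLT) knows nothing about books; all of the vocabulary-specific knowledge lives in the stylesheet.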

22.0.9. Paths




Of the new vocabularies and tools for XML, possibly the most useful is
XPath. Think of it as regular expressions for XML structure: you
specify the elements you're looking for ("the title within a book"),
and the XPath processor returns a pointer to the matching elements.

An XPath expression looks like:

/books/book/title

Slashes separate tests. XPath has syntax for testing attributes,
elements, and text, and for identifying parents and siblings of
nodes.

The XML::LibXML module has strong support for XPath, and we show how
to use it in Recipe 22.6. XPath also crops up
in the XML::Twig module shown in Recipe 22.8.
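A few XPath expressions in action, as a sketch that assumes XML::LibXML's findnodes and findvalue methods. The input is a trimmed version of Example 22-1, and the particular expressions are our own illustrations of attribute tests, text tests, ancestors, and siblings.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed version of Example 22-1 (illustrative).
my $doc = XML::LibXML->load_xml(string => <<'XML');
<books>
  <book id="1">
    <title>Programming Perl</title>
    <edition>3</edition>
    <authors>
      <author><firstname>Larry</firstname><lastname>Wall</lastname></author>
    </authors>
  </book>
  <book id="2">
    <title>Perl &amp; LWP</title>
    <edition>1</edition>
    <authors>
      <author><firstname>Sean</firstname><lastname>Burke</lastname></author>
    </authors>
  </book>
</books>
XML

# Every title within a book within books.
my @titles = map { $_->textContent } $doc->findnodes('/books/book/title');

# Attribute test: the title of the book whose id is 2.
my $second_title = $doc->findvalue('/books/book[@id="2"]/title');

# Text test plus a step back up the tree: which book did Wall write?
my ($wall_book) = $doc->findnodes('//lastname[text()="Wall"]/ancestor::book');
my $wall_id     = $wall_book->getAttribute('id');

# Sibling step: the edition that sits next to a given title.
my $edition = $doc->findvalue(
    '/books/book/title[.="Programming Perl"]/following-sibling::edition'
);

print "titles: ", join(', ', @titles), "\n";
print "book 2: $second_title\n";
print "Wall wrote book id $wall_id, edition $edition\n";
```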

22.0.10. History of Perl and XML


Initially, Perl had only one way to parse
XML: regular expressions. This was prone to error and often failed to
deal with well-formed XML (e.g., CDATA sections). The first real XML
parser in Perl was XML::Parser, Larry Wall's Perl interface to James
Clark's expat C library. Most other languages
(notably Python and PHP) also had an expat
wrapper as their first correct XML parser.

XML::Parser was a prototype—the mechanism for passing
components of XML documents to Perl was experimental and intended to
evolve over the years. But because XML::Parser was the
only XML parser for Perl, people quickly wrote
applications using it, and it became impossible for the interface to
evolve. Because XML::Parser has a proprietary API, you shouldn't use
it directly.

XML::Parser is an
event-based parser. You register callbacks for events like "start of
an element," "text," and "end of an element." As XML::Parser parses
an XML file, it calls the callbacks to tell your code what it's
found. Event-based parsing is quite common in the XML world, but
XML::Parser has its own events and doesn't use the standard Simple
API for XML (SAX) events. This is why we recommend you don't use
XML::Parser directly.

The XML::SAX modules provide a SAX wrapper
around XML::Parser and several other XML parsers. XML::Parser parses
the document, but you write code to work with XML::SAX, and XML::SAX
translates between XML::Parser events and SAX events. XML::SAX also
includes a pure Perl parser, so a program for XML::SAX works on any
Perl system, even those that can't compile XS modules. XML::SAX
supports the full level 2 SAX API (where the backend parser supports
features such as namespaces).
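The event style can be sketched as follows, assuming the XML::SAX distribution (XML::SAX::Base and XML::SAX::ParserFactory). TitleCollector is a handler class invented for this example; it reacts to the "start of an element," "text," and "end of an element" events to gather book titles.

```perl
use strict;
use warnings;

# A hypothetical handler: remembers the text inside each <title> element.
package TitleCollector;
use base 'XML::SAX::Base';

sub start_document {
    my ($self) = @_;
    $self->{titles}   = [];
    $self->{in_title} = 0;
}

sub start_element {
    my ($self, $element) = @_;
    if ($element->{Name} eq 'title') {
        $self->{in_title} = 1;
        push @{ $self->{titles} }, '';
    }
}

# Text may arrive in several chunks, so append rather than assign.
sub characters {
    my ($self, $chars) = @_;
    $self->{titles}[-1] .= $chars->{Data} if $self->{in_title};
}

sub end_element {
    my ($self, $element) = @_;
    $self->{in_title} = 0 if $element->{Name} eq 'title';
}

package main;
use XML::SAX::ParserFactory;

my $xml = <<'XML';
<books>
  <book id="1"><title>Programming Perl</title></book>
  <book id="2"><title>Perl &amp; LWP</title></book>
</books>
XML

# The factory picks whichever SAX parser is installed; the handler doesn't care.
my $handler = TitleCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);

my @titles = @{ $handler->{titles} };
print "$_\n" for @titles;
```

Because the handler talks only to the SAX interface, the same class works whether the events come from XML::Parser, XML::LibXML, or the pure Perl parser.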

The other common way to parse XML is to build a tree data structure:
element A is a child of element B in the tree if element A is inside
element B in the XML document. There
is a standard API for working with such a tree data structure: the
Document Object Model (DOM). The XML::LibXML module uses the GNOME
project's libxml2 library to quickly and
efficiently build a DOM tree. It is fast, and it supports XPath and
validation. The XML::DOM module was an attempt to build a DOM tree
using XML::Parser as the backend, but most programmers prefer the
speed of XML::LibXML. In Recipe 22.2 we show
XML::LibXML, not XML::DOM.
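The parent and child relationships can be seen directly in a small sketch, assuming XML::LibXML's DOM methods (documentElement, firstChild, nextSibling, parentNode). The one-book document is illustrative and deliberately has no whitespace between tags, so every child is an element.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(
    string => '<book id="1"><title>Programming Perl</title><edition>3</edition></book>',
);

# The root of the tree is the outermost element.
my $book = $doc->documentElement;

# title is inside book in the document, so it is a child of book in the tree.
my $title       = $book->firstChild;
my $parent_name = $title->parentNode->nodeName;

# edition follows title inside the same parent, so the two are siblings.
my $sibling_name = $title->nextSibling->nodeName;

# All element children of book, in document order.
my @child_names = map { $_->nodeName } $book->findnodes('*');

print "root:            ", $book->nodeName, "\n";
print "children:        @child_names\n";
print "parent of title: $parent_name\n";
print "after title:     $sibling_name\n";
```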

So, in short: for events, use XML::SAX with XML::Parser or
XML::LibXML behind it; for DOM trees, use XML::LibXML; for
validation, use XML::LibXML.

22.0.11. Further Reading


While the XML specification itself is simple, the specifications for
namespaces, schemas, stylesheets, and so on are not. There are many
good books to help you learn and use these technologies:


  • For help with all of the nuances of XML, try Learning XML, by
    Erik T. Ray (O'Reilly), and XML in a Nutshell, Second Edition, by
    Elliotte Rusty Harold and W. Scott Means (O'Reilly).


  • For help with XML Schemas, try XML Schema, by
    Eric van der Vlist (O'Reilly).


  • For examples of stylesheets and transformations, and help with the
    many non-trivial aspects of XSLT, see XSLT, by
    Doug Tidwell (O'Reilly), and XSLT Cookbook, by
    Sal Mangano (O'Reilly).


  • For help with XPath, try XPath and XPointer, by
    John E. Simpson (O'Reilly).


If you're the type that relishes the pain of reading formal
specifications, the W3C web site, http://www.w3c.org, has the full text of all
of their standards and draft standards.
