PHP and SAX
PHP 4.0 comes with a very capable SAX parser based on the expat library. Created by James Clark, the expat library is a fast, robust SAX implementation that provides XML parsing capabilities to a number of open-source projects, including the Mozilla browser (see Appendix A, "Recompiling PHP to Add XML Support").
A Simple Example
You can do a number of complex things with a SAX parser; however, I'll begin with something simple to illustrate just how it all fits together. Let's go back to the previous XML document (see Listing 2.1), and write some PHP code to process this document and do something with the data inside it (see Listing 2.2).
I'll explain Listing 2.2 in detail:
The first order of business is to initialize the SAX parser. This is accomplished via PHP's aptly named xml_parser_create() function, which returns a handle for use in successive operations involving the parser.
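As a minimal sketch of this first step (the failure check and its message are my own additions, not part of the listing):

```php
<?php
// Create a new SAX parser; the returned handle is passed to every later xml_* call.
$parser = xml_parser_create();

// Assumption: treat a false return value as a failure to create the parser.
if (!$parser) {
    die("Could not create XML parser");
}
```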
With the parser created, it's time to let it know which events you would like it to monitor, and which user-defined functions (or callback functions) it should call when these events occur. For the moment, I'm going to restrict my activities to monitoring start tags, end tags, and the data embedded within them:
What have I done here? Very simple. I've told the parser to call the function startElementHandler() when it finds an opening tag, the function endElementHandler() when it finds a closing tag, and the function characterDataHandler() whenever it encounters character data within the document.
When the parser calls these functions, it will automatically pass them all relevant information as function arguments. Depending on the type of callback registered, this information could include the element name, element attributes, character data, processing instructions, or notation identifiers.
From Listing 2.2, you can see that I haven't defined these functions yet; I'll do that a little later, and you'll see how this works in practice. Until these functions have been defined, any attempt to run the code from Listing 2.2 as it is right now will fail.
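To make the registration step concrete, here is a rough sketch that pairs the two registration calls with placeholder handlers. The empty function bodies are stand-ins so the fragment runs on its own; only the parameter lists matter at this point.

```php
<?php
// Placeholder handlers: the bodies are intentionally empty.
// $name is the element name; $attributes is an associative array of attribute => value.
function startElementHandler($parser, $name, $attributes) {
}

// Called with the name of the element being closed.
function endElementHandler($parser, $name) {
}

// Called with a chunk of character data found between tags.
function characterDataHandler($parser, $data) {
}

$parser = xml_parser_create();

// Register the start-tag and end-tag callbacks in one call.
xml_set_element_handler($parser, "startElementHandler", "endElementHandler");

// Register the character data callback.
xml_set_character_data_handler($parser, "characterDataHandler");
```

Every callback receives the parser handle as its first argument, followed by the information specific to that event.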
Now that the callback functions have been registered, all that remains is to actually parse the XML document. This is a simple exercise. First, create a file handle for the document:
Then, read in chunks of data with fread(), and parse each chunk using the xml_parse() function:
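The read-and-parse loop might look something like the following sketch. The temporary file, its contents, the 4096-byte chunk size, and the $chunks counter are all assumptions made so that the fragment is self-contained; they do not come from Listing 2.1 or 2.2.

```php
<?php
// Hypothetical sample document, written to a temp file so the sketch runs on its own.
$file = tempnam(sys_get_temp_dir(), "sax");
file_put_contents($file, "<?xml version=\"1.0\"?><note><to>Reader</to></note>");

$parser = xml_parser_create();

// Create a file handle for the document.
$fp = fopen($file, "r");
if (!$fp) {
    die("Cannot open $file");
}

$chunks = 0;
// Read the file 4096 bytes at a time and hand each chunk to the parser.
// The third argument tells xml_parse() whether this is the final chunk.
while ($data = fread($fp, 4096)) {
    $chunks++;
    if (!xml_parse($parser, $data, feof($fp))) {
        die("XML parser error");
    }
}

fclose($fp);
unlink($file);
```

Parsing in chunks keeps memory use flat, because the whole document never has to be held in memory at once.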
In the event that errors are encountered while parsing the document, the script will automatically terminate via PHP's die() function. Detailed error information can be obtained via the xml_error_string() and xml_get_error_code() functions (for more information on how these work, see the "Handling Errors" section).
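As a quick illustration of those error functions, the sketch below feeds the parser a deliberately malformed string and builds a message from the result. The sample input and the message format are assumptions of mine.

```php
<?php
$parser = xml_parser_create();

// Deliberately malformed input: the <b> element is never closed.
$ok = xml_parse($parser, "<a><b></a>", true);

$message = "";
if (!$ok) {
    // Turn the numeric error code into readable text and add the line number.
    $message = sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

echo $message, "\n";
```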
After the complete file has been processed, it's good programming practice to clean up after yourself by destroying the XML parser you created:
That said, in the event that you forget, PHP will automatically destroy the parser for you when the script ends.
The preceding four steps make up a pretty standard process, and you'll find yourself using them over and over again when processing XML data with PHP's SAX parser. For this reason, you might find it more convenient to package them as a separate function, and call this function wherever required, a technique demonstrated in Listing 2.23.
With the generic XML processing code out of the way, let's move on to the callback functions defined near the top of the script. You'll remember that I registered the following three functions:
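One way such a wrapper could be put together is sketched below. The function name parseXmlFile(), its parameters, and the sample document are hypothetical, and this is not the code from Listing 2.23.

```php
<?php
// Hypothetical helper: create a parser, register handlers, parse a file, clean up.
// Returns true if the whole file parsed without errors, false otherwise.
function parseXmlFile($file, $startHandler, $endHandler, $dataHandler) {
    $fp = fopen($file, "r");
    if (!$fp) {
        return false;
    }

    $parser = xml_parser_create();
    xml_set_element_handler($parser, $startHandler, $endHandler);
    xml_set_character_data_handler($parser, $dataHandler);

    $ok = true;
    // Stop reading as soon as a chunk fails to parse.
    while ($ok && ($data = fread($fp, 4096))) {
        $ok = (bool) xml_parse($parser, $data, feof($fp));
    }
    fclose($fp);

    // Before PHP 8.0 the parser is a resource and is freed explicitly;
    // from 8.0 on it is an object and this call has no effect.
    if (PHP_VERSION_ID < 80000) {
        xml_parser_free($parser);
    }
    return $ok;
}

// Usage with a throwaway sample document and closures as handlers.
$file = tempnam(sys_get_temp_dir(), "sax");
file_put_contents($file, "<note><to>Reader</to></note>");

$seen = [];
$result = parseXmlFile(
    $file,
    function ($parser, $name, $attributes) use (&$seen) { $seen[] = $name; },
    function ($parser, $name) {},
    function ($parser, $data) {}
);
unlink($file);
```

Passing the handlers in as arguments keeps the wrapper generic, so the same function can serve scripts that process different documents.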
startElementHandler(): Executed when an opening tag is encountered
endElementHandler(): Executed when a closing tag is encountered
characterDataHandler(): Executed when character data is encountered
Listing 2.3 is the revised script with these handlers included.
Nothing too complex here. The tag handlers print the names of the tags they encounter, whereas the character data handler prints the data enclosed within the tags. Notice that the startElementHandler() function automatically receives the tag name and attributes as function arguments, whereas characterDataHandler() receives the character data found between the tags.
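Handlers along these lines could look roughly like the sketch below. The output wording and the inline sample document are my own assumptions, so the result will not match the book's listing or its output exactly.

```php
<?php
// Print the name of each opening tag; the attributes are received but unused here.
function startElementHandler($parser, $name, $attributes) {
    echo "Start tag: $name\n";
}

// Print the name of each closing tag.
function endElementHandler($parser, $name) {
    echo "End tag: $name\n";
}

// Print the character data found between tags.
function characterDataHandler($parser, $data) {
    echo "Data: $data\n";
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($parser, "characterDataHandler");

// Hypothetical one-line document, parsed from a string to keep the sketch short.
if (!xml_parse($parser, "<greeting>Hello</greeting>", true)) {
    die("XML parser error");
}
```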
And when you execute the script through a browser, here's what the end product looks like (and if you're wondering why all the element names are in uppercase, take a look at the "Controlling Parser Behavior" section):
Not all that impressive, certainly, but then again, we're just getting started!