Extending the Framework
As mentioned at the beginning of this chapter, an application that can extract data from an XML document using XPath expressions has many uses. The framework presented here is simple, but it is also powerful because it can easily be extended, as described in the next sections.
An XML Document Pruning Use Case for cppextract
Because XML is hierarchical and places no limit on the size of element content or the number of attributes, XML documents can be large, and in many cases the data of interest is buried several levels deep in the structure. While the extracting and splitting techniques presented can simplify some types of processing, document pruning lets you selectively remove nodes while maintaining a single DOM tree.
This can be accomplished using a variation of the code in cppextract within the printSubTree() template. Because the nodes returned by evaluating the XPath expression are references to the document nodes, not copies of them, you can perform DOM operations on them before printing the results. For example, you could produce separate book documents without the price by passing in "//book" and calling removeChild() on each returned node to remove its <Price> element. You can use similar techniques to replace content. The key point to remember is that you have full access to the DOM tree.
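The pruning idea can be illustrated with a minimal, self-contained sketch. It does not use the XDK DOM or XPath interfaces: the Node struct, selectByName() (a stand-in for evaluating "//book"), removeChildrenNamed() (a stand-in for removeChild()), and the sample titles and prices are all hypothetical names and values introduced only for this illustration. What it shows is the point made above: the selected nodes are pointers into the one tree, so editing them edits the document itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM element; this is NOT the XDK DOM interface.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> makeNode(std::string name, std::string text = "") {
    std::unique_ptr<Node> n(new Node);
    n->name = std::move(name);
    n->text = std::move(text);
    return n;
}

// Stand-in for evaluating "//name": collects pointers (not copies) to every
// descendant element with the given name.
void selectByName(Node& root, const std::string& name, std::vector<Node*>& out) {
    for (auto& child : root.children) {
        if (child->name == name) out.push_back(child.get());
        selectByName(*child, name, out);
    }
}

// Stand-in for removeChild(): drops the direct children of parent that have
// the given name and returns how many were removed.
std::size_t removeChildrenNamed(Node& parent, const std::string& childName) {
    auto& kids = parent.children;
    auto newEnd = std::remove_if(kids.begin(), kids.end(),
        [&](const std::unique_ptr<Node>& k) { return k->name == childName; });
    const std::size_t removed =
        static_cast<std::size_t>(std::distance(newEnd, kids.end()));
    kids.erase(newEnd, kids.end());
    return removed;
}

// Prunes dropName children from every matchName element in the document.
// Assumes matched elements are never nested inside a removed element.
std::size_t pruneMatches(Node& doc, const std::string& matchName,
                         const std::string& dropName) {
    std::vector<Node*> matches;
    selectByName(doc, matchName, matches);
    std::size_t removed = 0;
    for (Node* match : matches) {
        assert(match != nullptr);
        removed += removeChildrenNamed(*match, dropName);
    }
    return removed;
}

// Writes the tree back out as text so the effect of pruning is visible.
std::string serialize(const Node& n) {
    std::string out = "<" + n.name + ">" + n.text;
    for (const auto& c : n.children) out += serialize(*c);
    return out + "</" + n.name + ">";
}

// Builds a two-book catalog; the titles and prices are made-up sample data.
std::unique_ptr<Node> buildSampleCatalog() {
    auto catalog = makeNode("bookcatalog");
    for (int i = 1; i <= 2; ++i) {
        auto book = makeNode("book");
        book->children.push_back(makeNode("Title", "Sample Title " + std::to_string(i)));
        book->children.push_back(makeNode("Price", std::to_string(10 * i)));
        catalog->children.push_back(std::move(book));
    }
    return catalog;
}
```

Calling pruneMatches(*doc, "book", "Price") on the sample catalog and then serializing the root would print the same single tree with every Price element gone. In the real application, the equivalent removal would be done on the nodes handed to printSubTree() before they are printed.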
A Content-Management Use Case for cppextract
Because XPath is a rich language, and one that becomes richer still in version 2.0, you can encapsulate operations on the extracted nodes or values within the expression itself. For example, to find the number of books listed in booklist.xml, the following command line gives you the result in a single processing step:
cppextract booklist.xml "count(/bookcatalog/book)"
...
Numeric Value : 2.0
A complete content-management extraction system could be assembled along the lines of the system shown in Figure 24-1. This system uses an Extractor, which encapsulates the cppextract functionality, to extract the XML data and metadata and insert them into relational tables in an Oracle database. The extraction is driven by XPaths stored in an XPath table. Each set of XPaths is associated with a DTD or an XML schema and is retrieved by using the DTD's sysID and docID or the XML schema's location URL. When initializing the Extractor, the content-management application reads the DTD's sysID and docID or the XML schema location URL from the XML document and uses them to query the XPath table. After it gets the list of XPaths, the application registers them with the Extractor and specifies the callback functions that are to receive the retrieved data. The Extractor then retrieves the XML data for each XPath and passes it to the corresponding content handler, and the content handlers insert the extracted data into either the metadata tables or the data tables in the database.
Figure 24-1: cppextract content-management use case
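The register-and-dispatch flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the design behind Figure 24-1: the Extractor class shown here, its registerXPath() and extract() members, the Evaluator callback that stands in for the XPath engine, and the in-memory Table that stands in for a database table are all hypothetical names introduced for the sketch.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Content handler: a callback that receives the values extracted for one XPath.
using ContentHandler =
    std::function<void(const std::string& xpath, const std::vector<std::string>& values)>;

// Hypothetical stand-in for the XPath engine: returns the values for an XPath.
using Evaluator = std::function<std::vector<std::string>(const std::string& xpath)>;

// Hypothetical Extractor: holds (XPath, handler) registrations and dispatches.
class Extractor {
public:
    void registerXPath(std::string xpath, ContentHandler handler) {
        registrations_.emplace_back(std::move(xpath), std::move(handler));
    }

    // Evaluates every registered XPath, hands the result to its handler,
    // and returns the total number of values delivered.
    std::size_t extract(const Evaluator& evaluate) const {
        std::size_t delivered = 0;
        for (const auto& reg : registrations_) {
            const std::vector<std::string> values = evaluate(reg.first);
            reg.second(reg.first, values);
            delivered += values.size();
        }
        return delivered;
    }

private:
    std::vector<std::pair<std::string, ContentHandler>> registrations_;
};

// One row of the XPath table: the expression and which kind of table it feeds.
struct XPathEntry {
    std::string xpath;
    bool isMetadata;
};

// XPath table keyed by a document-type key, for example a schema location URL
// or a string built from the DTD identifiers (the key format is an assumption).
using XPathTable = std::map<std::string, std::vector<XPathEntry>>;

// In-memory stand-in for a database table: rows of (xpath, value).
using Table = std::vector<std::pair<std::string, std::string>>;

// Builds a handler that "inserts" each extracted value as a row in the table.
// The table must outlive the returned handler.
ContentHandler makeTableHandler(Table& table) {
    return [&table](const std::string& xpath, const std::vector<std::string>& values) {
        for (const auto& v : values) table.emplace_back(xpath, v);
    };
}

// Looks up the XPaths for the document type, registers each with a handler
// for the metadata or data table, and runs the extraction.
std::size_t loadDocument(const XPathTable& xpathTable, const std::string& docTypeKey,
                         const Evaluator& evaluate, Table& metadataTable,
                         Table& dataTable) {
    const auto it = xpathTable.find(docTypeKey);
    if (it == xpathTable.end()) return 0;  // no XPaths known for this type

    Extractor extractor;
    for (const auto& entry : it->second) {
        Table& target = entry.isMetadata ? metadataTable : dataTable;
        extractor.registerXPath(entry.xpath, makeTableHandler(target));
    }
    return extractor.extract(evaluate);
}
```

Keeping the XPath lookup, the evaluation, and the handlers separate means that a new document type needs only new rows in the XPath table, and a different storage target needs only a different handler. A real implementation would replace the Evaluator with calls into the XPath processor and the Table rows with database inserts.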