Chapter 4: Validating XML with DTDs and XML Schemas
Although XML lets users define their own markup languages to describe and encapsulate data in XML files, every XML document must conform to basic rules of syntax so that application developers can build software that processes any XML document reliably. Document type definitions (DTDs) and XML schemas (XSDs) go further: they help you ensure that your XML documents adhere to specified structures, constraints, and, in the case of XSDs, datatypes so that applications can use them. This chapter discusses both methods, comparing and contrasting how and when each should be used. It then discusses how they relate to database data, specifically the XML support in Oracle Database 10g.
Introducing the DTD
DTDs are inherited from SGML and are not written in XML syntax. They specify the structure of an XML document, including the hierarchical relationships between elements and the attributes those elements carry. A DTD can be associated with an XML document either by being included in the document itself or by being referenced from it as an external file. If the DTD is contained in an external file, it is referenced through a uniform resource locator (URL) of the form http://www.foobar.com/book.dtd. For example, the following booklist.xml file has its DTD embedded as a declaration within the XML file itself:
<?xml version = "1.0"?>
<!-- DTD bookcatalog may have a number of book entries -->
<!DOCTYPE bookcatalog [
<!ELEMENT bookcatalog (book)*>
<!-- Each book element has a title, 1 or more authors, etc. -->
<!ELEMENT book (title, author+, ISBN, publisher, publishyear, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT publishyear (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price type (US|CAN|UK|EURO) #REQUIRED>
]>
<bookcatalog>
<book>
<title>History of Interviews</title>
<author>
<firstname>Juan</firstname>
<lastname>Smith</lastname>
</author>
<ISBN>99999-99999</ISBN>
<publisher>Oracle Press</publisher>
<publishyear>2000</publishyear>
<price type="US">1.00</price>
</book>
</bookcatalog>
Following the DOCTYPE declaration is the <!ELEMENT> declaration of the root element, bookcatalog. An element consists of a start tag, its content (other elements or text), and an end tag. The <bookcatalog> element contains all of the other elements, attributes, and text within the document; such an element is called the root element. Only one root element may exist within an XML document. The root element marks the beginning of the document and is considered the parent of all the other elements, which are nested between its start tag and end tag. For an XML document to be considered “valid” with respect to this DTD, the root element <bookcatalog> must be the first element in the body of the document.
Each element declaration stipulates the child elements that must be nested within the declared element; this is the element's content model. The root element <bookcatalog> may contain any number of <book> elements. Note that all the child elements of <book> are explicitly called out in its element declaration, and author has a + as a suffix. This is an example of the Extended Backus-Naur Form (EBNF) notation that can be used for describing the content model. The allowed suffixes are:
? For 0 or 1 occurrence
* For 0 or more occurrences
+ For 1 or more occurrences
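As a small sketch of how the three suffixes can combine in one content model, consider the following document. The element names memo, subject, recipient, and note are hypothetical and are not part of the booklist example.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
<!-- subject? is optional, recipient+ needs at least one, note* allows any number -->
<!ELEMENT memo (subject?, recipient+, note*)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ELEMENT note (#PCDATA)>
]>
<memo>
<recipient>Juan Smith</recipient>
<recipient>Oracle Press</recipient>
</memo>
```

This document is valid even though it contains no subject and no note elements. Removing both recipient elements would make it invalid, because the + suffix requires at least one.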
Note also the use of #PCDATA to declare that an element's content must be text with no child elements, and that the allowed values of the price element’s required type attribute are explicitly enumerated. The difference between CDATA and PCDATA is that PCDATA (parsed character data) is scanned by the parser for markup and entity references, whereas the content of a CDATA section is passed through without being interpreted as markup; hence, it can be viewed as non-parsed character data.
A DTD in an external file can also be used. In this case only a reference to the DTD is embedded in the XML document, as in this other version of the booklist.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE bookcatalog SYSTEM "booklist.dtd">
<bookcatalog>
<book>
<title>History of Interviews</title>
<author>
<firstname>Juan</firstname>
<lastname>Smith</lastname>
</author>
<ISBN>99999-99999</ISBN>
<publisher>Oracle Press</publisher>
<publishyear>2000</publishyear>
<price type="US">1.00</price>
</book>
</bookcatalog>
Note that within the <!DOCTYPE> declaration, in place of the actual DTD content, is SYSTEM “booklist.dtd”, which refers to the external DTD. The DTD file then has the following form:
<!ELEMENT bookcatalog (book)*>
<!-- Each book element has a title, 1 or more authors, etc. -->
<!ELEMENT book (title, author+, ISBN, publisher, publishyear, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT publishyear (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price type (US|CAN|UK|EURO) #REQUIRED>
When it comes to validating XML documents, these two methods are functionally the same.
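To illustrate what validation against either form of this DTD would catch, the following sketch shows a document that is well formed but not valid. It reuses the sample book entry with two deliberate omissions; the exact error messages reported will depend on the parser used.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE bookcatalog SYSTEM "booklist.dtd">
<bookcatalog>
<book>
<title>History of Interviews</title>
<!-- Invalid: the content model requires at least one author here (author+) -->
<ISBN>99999-99999</ISBN>
<publisher>Oracle Press</publisher>
<publishyear>2000</publishyear>
<!-- Invalid: the required type attribute is missing from price -->
<price>1.00</price>
</book>
</bookcatalog>
```

A non-validating parser would accept this file because every tag is properly nested and closed. A validating parser would reject it on both counts.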