Word Hacks [Electronic resources]

Andrew Savikas

Hack 94 Transforming XML into a Word Document

With the right XSLT stylesheet, you can quickly transform an XML document into a Word document.

A potential killer app for WordprocessingML is the ability to publish Word documents from dynamic XML content. In this hack, we'll look at a simple XML document that vaguely resembles HTML. The code for this hack will transform the document into a full-fledged WordprocessingML document you can open in Word. Type the following in a standard text editor such as Notepad and save it as simpleDocument.xml:

<doc>
<h1>Hello, this is my document heading</h1>
<para>This is <emphasis>italic</emphasis>.</para>
<h2>This is a sub-heading</h2>
<para>This text is <strong>bold</strong>.</para>
<para>This text is <strong><emphasis>bold and italic</emphasis>
</strong>.</para>
<para><emphasis><strong>And so is this.</strong></emphasis>.</para>
<para>And <emphasis>this is italic and <strong>this is both
</strong></emphasis>.</para>
<para>Finally, <strong>this is bold and <emphasis>this is both
</emphasis> and back to just bold</strong>.</para>
</doc>

The file has a fairly flat structure, including a sequence of para, h1, and h2 elements inside the root doc element.

The code in this hack will show you how to transform the simpleDocument.xml file into a formatted Word document. Figure 10-9 shows the automatically generated WordprocessingML document after opening it in (any edition of) Word 2003. As you can see, the content of each of the different elements is formatted differently: the text from the h1 elements is rendered in a large font and is bold, the text from the emphasis elements is rendered in italic type, the text from the strong elements is rendered bold, and so on.

Figure 10-9. What the result of this transformation looks like when opened in any edition of Word 2003

10.6.1 The Code

The following code is the entire XSLT stylesheet used to render the document shown in Figure 10-9. Each xsl:template element represents a different template rule, which applies to particular kinds of nodes in the source document. The stylesheet matches the text of the document according to its context, then outputs the desired paragraphs (w:p elements), runs (w:r elements), and formatting properties.

Enter the following code in a standard text editor such as Notepad, save the file in the same folder as simpleDocument.xml, and name it createWordDocument.xsl:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<xsl:output indent="yes"/>
<xsl:template match="/">
<xsl:processing-instruction name="mso-application">
<xsl:text>progid="Word.Document"</xsl:text>
</xsl:processing-instruction>
<w:wordDocument>
<xsl:attribute name="xml:space">preserve</xsl:attribute>
<w:body>
<xsl:apply-templates select="/doc/*"/>
</w:body>
</w:wordDocument>
</xsl:template>
<xsl:template match="h1 | h2 | para">
<w:p>
<xsl:apply-templates/>
</w:p>
</xsl:template>
<xsl:template match="h1/text( )">
<w:r>
<w:rPr>
<w:sz w:val="32"/>
<w:b/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="h2/text( )">
<w:r>
<w:rPr>
<w:sz w:val="28"/>
<w:b/>
<w:i/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="para/text( )">
<w:r>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="emphasis/text( )">
<w:r>
<w:rPr>
<w:i/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="strong/text( )">
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="emphasis/strong/text( ) | strong/emphasis/text( )"
priority="1">
<w:r>
<w:rPr>
<w:i/>
<w:b/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
</xsl:stylesheet>

Let's pick out a couple of template rules, and I'll show what's going on in the code. The second template rule of the stylesheet matches three different kinds of elements: h1, h2, and para. When any of these elements is encountered (in the context of XSLT's automatic recursive descent), a w:p element is created, effectively turning each of these elements into a vanilla Word paragraph:

<xsl:template match="h1 | h2 | para">
<w:p>
<xsl:apply-templates/>
</w:p>
</xsl:template>

The xsl:apply-templates instruction causes the recursive descent of the source document to continue, allowing other template rules to fire when they match an input node. For example, this template rule matches text inside an emphasis element:

<xsl:template match="emphasis/text( )">
<w:r>
<w:rPr>
<w:i/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>

The most important element in this template rule is w:i. It causes this particular run of text to be rendered in italics. The w:t element, which stands for "text," functions as a container for the text in this run. The xsl:copy instruction copies the text node that's a child of the emphasis element in our source document straight to the result tree without modification.
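One detail in the listing worth a closer look is the priority="1" attribute on the last template rule. The excerpt below is a sketch of why it is there, based on the XSLT 1.0 conflict-resolution rules: multi-step patterns such as strong/text() and emphasis/strong/text() both receive a default priority of 0.5, so a text node inside nested emphasis and strong elements would otherwise match two rules of equal priority. The templates are the ones from the stylesheet above, trimmed to the two that compete; only the comments are new.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- General rule: any text node whose parent is strong becomes a bold run.
       As a multi-step pattern, it has a default priority of 0.5. -->
  <xsl:template match="strong/text()">
    <w:r>
      <w:rPr>
        <w:b/>
      </w:rPr>
      <w:t><xsl:copy/></w:t>
    </w:r>
  </xsl:template>

  <!-- Specific rule: text inside strong nested in emphasis, or the reverse.
       Each alternative would also default to 0.5, so the explicit
       priority="1" is what lets this rule win over the bold-only
       and italic-only rules. -->
  <xsl:template match="emphasis/strong/text() | strong/emphasis/text()"
                priority="1">
    <w:r>
      <w:rPr>
        <w:i/>
        <w:b/>
      </w:rPr>
      <w:t><xsl:copy/></w:t>
    </w:r>
  </xsl:template>

</xsl:stylesheet>
```

Without the explicit priority, the conflict is technically an error in XSLT 1.0. A processor may recover by using whichever matching rule comes last in the stylesheet, but relying on that makes the result depend on rule order.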

10.6.2 Running the Hack

To run this hack, enter the following at a DOS command prompt within the folder that holds the simpleDocument.xml and createWordDocument.xsl files:

> msxsl simpleDocument.xml createWordDocument.xsl -o output.xml

A new file, output.xml, is created. Double-click the new file from Windows Explorer, and voila! You'll see the document shown in Figure 10-9.
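If you open output.xml in a text editor rather than in Word, you can check the result against the template rules. As a rough illustration, the first para element in simpleDocument.xml should come out as three runs, one per text node, with only the middle run carrying the italic property. The fragment below is hand-derived from the stylesheet rather than copied from real output, so the exact whitespace and indentation that your processor writes will likely differ.

```xml
<!-- Expected shape of the output for:
       <para>This is <emphasis>italic</emphasis>.</para>
     The w prefix is declared on the root w:wordDocument element. -->
<w:p>
  <!-- para/text() rule: plain run -->
  <w:r>
    <w:t>This is </w:t>
  </w:r>
  <!-- emphasis/text() rule: italic run -->
  <w:r>
    <w:rPr>
      <w:i/>
    </w:rPr>
    <w:t>italic</w:t>
  </w:r>
  <!-- para/text() rule again, for the trailing period -->
  <w:r>
    <w:t>.</w:t>
  </w:r>
</w:p>
```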

10.6.3 Hacking the Hack

The stylesheet listed above creates a Word document with paragraphs that contain runs with direct formatting applied (bold and italic). The stylesheet listed below produces a document that looks identical to the one above, but it uses Word's styles instead. You must define the styles up front within your document, inside the w:styles element (naturally). The new parts of the stylesheet are the w:styles block and the w:pStyle and w:rStyle references that point to the styles it defines.

Enter the following into a standard text editor such as Notepad and save it in the same folder as the other files from this hack. Name it createStyledWordDoc.xsl.

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<xsl:output indent="yes"/>
<xsl:template match="/">
<xsl:processing-instruction name="mso-application">
<xsl:text>progid="Word.Document"</xsl:text>
</xsl:processing-instruction>
<w:wordDocument>
<xsl:attribute name="xml:space">preserve</xsl:attribute>
<w:styles>
<w:style w:styleId="h1" w:type="paragraph">
<w:name w:val="Heading 1"/>
<w:rPr>
<w:sz w:val="32"/>
<w:b/>
</w:rPr>
</w:style>
<w:style w:styleId="h2" w:type="paragraph">
<w:name w:val="Heading 2"/>
<w:rPr>
<w:sz w:val="28"/>
<w:b/>
<w:i/>
</w:rPr>
</w:style>
<w:style w:styleId="emphasis" w:type="character">
<w:name w:val="Italic"/>
<w:rPr>
<w:i/>
</w:rPr>
</w:style>
<w:style w:styleId="strong" w:type="character">
<w:name w:val="Bold"/>
<w:rPr>
<w:b/>
</w:rPr>
</w:style>
<w:style w:styleId="emphasisAndStrong" w:type="character">
<w:name w:val="Bold and Italic"/>
<w:rPr>
<w:b/>
<w:i/>
</w:rPr>
</w:style>
</w:styles>
<w:body>
<xsl:apply-templates select="/doc/*"/>
</w:body>
</w:wordDocument>
</xsl:template>
<xsl:template match="h1">
<w:p>
<w:pPr>
<w:pStyle w:val="h1"/>
</w:pPr>
<xsl:apply-templates/>
</w:p>
</xsl:template>
<xsl:template match="h2">
<w:p>
<w:pPr>
<w:pStyle w:val="h2"/>
</w:pPr>
<xsl:apply-templates/>
</w:p>
</xsl:template>
<xsl:template match="para">
<w:p>
<xsl:apply-templates/>
</w:p>
</xsl:template>
<xsl:template match="h1/text( ) | h2/text( ) | para/text( )">
<w:r>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="emphasis/text( )">
<w:r>
<w:rPr>
<w:rStyle w:val="emphasis"/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="strong/text( )">
<w:r>
<w:rPr>
<w:rStyle w:val="strong"/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
<xsl:template match="emphasis/strong/text( ) | strong/emphasis/text( )"
priority="1">
<w:r>
<w:rPr>
<w:rStyle w:val="emphasisAndStrong"/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>
</xsl:stylesheet>

An explanation of the full details of this stylesheet is beyond the scope of this book, but in the context of this hack the important thing to know is that the w:rPr ("run properties") element now contains a reference to a style:

<xsl:template match="emphasis/text( )">
<w:r>
<w:rPr>
<w:rStyle w:val="emphasis"/>
</w:rPr>
<w:t>
<xsl:copy/>
</w:t>
</w:r>
</xsl:template>

In this case, the referenced style ID is "emphasis," which we declared earlier in the document:

<w:style w:styleId="emphasis" w:type="character">
<w:name w:val="Italic"/>
<w:rPr>
<w:i/>
</w:rPr>
</w:style>

The formatting effect is the same: text inside emphasis elements shows up as italic in the result. The difference is that now it is by way of a character style named Italic, rather than via direct formatting.
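To make the difference concrete, here is roughly what the styled stylesheet should emit for the h1 element and for the emphasis text in the sample document. This is a hand-derived sketch based on the template rules in createStyledWordDoc.xsl, not captured output, and the whitespace your processor writes will probably differ.

```xml
<!-- From <h1>Hello, this is my document heading</h1>:
     the paragraph points at the paragraph style whose ID is h1,
     and the run itself carries no direct formatting. -->
<w:p>
  <w:pPr>
    <w:pStyle w:val="h1"/>
  </w:pPr>
  <w:r>
    <w:t>Hello, this is my document heading</w:t>
  </w:r>
</w:p>

<!-- From <emphasis>italic</emphasis> inside the first para:
     the run refers to the character style whose ID is emphasis
     instead of containing a w:i element. -->
<w:r>
  <w:rPr>
    <w:rStyle w:val="emphasis"/>
  </w:rPr>
  <w:t>italic</w:t>
</w:r>
```

Because the formatting now lives in the style definitions, changing how all emphasized text looks means editing one w:style element rather than every run.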

Evan Lenz