15.1. POIFS
At the base of the POI APIs is the "POI file system," or POIFS, which provides a Java implementation of the OLE 2 Compound Document Format. OLE 2 is an archive format, conceptually much like JAR. Both provide a way of combining hierarchical data into a single file, a concept that maps naturally to a file system.
POIFS presents a compound document as a series of DirectoryEntry and DocumentEntry objects. There is a special top-level or root directory, obtainable through the getRoot() method. The DirectoryEntry interface has a getEntries() method that returns an Iterator, which can be used to walk through all the children of a directory. DocumentEntry objects do not have children, but an InputStream can be obtained to read the contents of such an entry.
Listing 15.1 illustrates these principles by displaying some or all of the entries within an archive, or by displaying the contents of a particular document. The first argument should be the name of a Word or Excel file. The OLE 2 format is used for many other files besides Office documents, and Listing 15.1 will happily handle these as well. Although the rest of this chapter is concerned only with Office, it may be informative to run this program on various application files to discover which are secretly OLE 2 archives and, if so, what their contents are.
The second argument to Listing 15.1 is optional; if present, it should be the name of an entry in the archive. If this name refers to a DirectoryEntry, only elements within that directory will be displayed. If the name refers to a DocumentEntry, the contents of that document will be displayed.
Listing 15.1. A tool for examining POI filesystems
The main() method starts everything off by opening the specified file as an InputStream and creating a POIFSFileSystem from that stream. The root directory is obtained via the getRoot() method and then dumped.
The dump() method obtains the name of the current entry. The next two lines clean the output by substituting a question mark for control characters, which are common in entry names. foundHere indicates whether the current entry matches the second argument to the program, and found indicates whether the second argument has been found anywhere yet. If the target has been found, leading dots are printed to indicate the depth of the current entry within the archive.
If the current entry is a DirectoryEntry, additional formatting may be done, and then all the child entries are handled recursively by passing them in turn to dump().
If the current entry is a DocumentEntry, and if it is the one the user is looking for as indicated by foundHere, then a DocumentInputStream is obtained and used to display the contents in the usual way.
At present, a POIFS can only contain objects of type DirectoryEntry and DocumentEntry, but the documentation cautions that other types may be introduced in the future. This is why there is a final else in dump(); should such a future enhancement be encountered, at least it can be reported.
Running Listing 15.1 on a typical Word document produces the output shown after the listing.
package com.awl.toolbook.chapter15;

import java.io.FileInputStream;
import java.util.Iterator;
import org.apache.poi.poifs.filesystem.*;

public class POIDump {
    // True once the requested entry has been reached (or if none was requested)
    public static boolean found = false;
    // Name of the entry to look for; set from the optional second argument
    public static String name = ".";

    public static void main(String argv[])
        throws Exception
    {
        FileInputStream in =
            new FileInputStream(argv[0]);
        POIFSFileSystem fs =
            new POIFSFileSystem(in);
        DirectoryEntry root = fs.getRoot();
        if(argv.length > 1) {
            name = argv[1];
        } else {
            // No target given: display every entry
            found = true;
        }
        dump(root,0);
    }

    public static void dump(Entry e, int depth)
        throws Exception
    {
        String entryName = e.getName();
        // Replace the control characters that commonly start entry names
        entryName = entryName.replace((char) 1,'?');
        entryName = entryName.replace((char) 5,'?');
        boolean foundHere = name.equals(entryName);
        found = found || foundHere;
        if(found) {
            // One leading dot per level of nesting
            for(int i=0;i<depth;i++) {
                System.out.print('.');
            }
        }
        if(e instanceof DirectoryEntry) {
            if(found) {
                System.out.println(entryName + '/');
            }
            DirectoryEntry d = (DirectoryEntry) e;
            for(Iterator i=d.getEntries();i.hasNext();) {
                Entry entry = (Entry) i.next();
                dump(entry,depth+1);
            }
            if(foundHere) {
                // Finished with the requested directory
                found = false;
            }
        } else if (e instanceof DocumentEntry) {
            if(found) {
                if(!foundHere) {
                    System.out.println(entryName);
                } else {
                    // This is the requested document: print its contents
                    DocumentEntry doc = (DocumentEntry) e;
                    DocumentInputStream in =
                        new DocumentInputStream(doc);
                    byte data[] = new byte[2048];
                    int count = 0;
                    while((count = in.read(data)) > 0) {
                        System.out.print(
                            new String(data,0,count));
                    }
                    System.out.println("");
                    in.close();
                    found = false;
                }
            }
        } else {
            System.out.println("Unknown: " + e.getClass());
        }
    }
}
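The two replace() calls in dump() cover only the characters with codes 1 and 5. As a minimal sketch, and not part of the original listing, the same clean-up can be generalized to every control character with the standard Character.isISOControl() test. The EntryNames class and its sanitize() method are hypothetical names used only for illustration; they are not part of POI.

```java
// Hypothetical helper, not part of POI or of Listing 15.1.
final class EntryNames {
    private EntryNames() {}

    /** Returns a copy of name with every ISO control character replaced by '?'. */
    static String sanitize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            out.append(Character.isISOControl(c) ? '?' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The summary entry name begins with the character whose code is 5.
        System.out.println(sanitize("\u0005SummaryInformation")); // prints ?SummaryInformation
        // Names without control characters pass through unchanged.
        System.out.println(sanitize("WordDocument"));              // prints WordDocument
    }
}
```

A helper of this kind could stand in for the two replace() lines in dump(); whether the broader substitution is wanted depends on which characters actually occur in the archives being examined.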
For a typical Word document, the output is:
Root Entry/
.WordDocument
.?SummaryInformation
.?DocumentSummaryInformation
.1Table
.ObjectPool/
.?CompObj
Running Listing 15.1 on an Excel file produces
Root Entry/
.?SummaryInformation
.?DocumentSummaryInformation
.Workbook
Filesystems are recursive, and so recursive algorithms like that in Listing 15.1 are often natural. They may not always be the most efficient or easiest to program, however. Consequently, POIFS provides an alternative event-driven API that uses the Java event/listener pattern. Classes may implement the POIFSReaderListener interface and register themselves with a POIFSReader. When a class registers itself, it may also indicate the document in which it is interested. When a matching document is encountered by the POIFSReader, the processPOIFSReaderEvent() method of the POIFSReaderListener will be invoked with a POIFSReaderEvent object containing information about the document, as well as a means to get the DocumentInputStream. Note that there is no directory event and hence no way for a listener to be informed when a directory is encountered.
The disadvantage of the event-driven API is that the set of listeners must be specified before reading begins. This means there is no way for a program to decide that a certain document may be of interest based on the presence or absence of another document or directory. On the other hand, the event-driven API is more efficient because it avoids having to load the entire filesystem into memory. The only objects loaded will be those that have been registered as of interest.
Listing 15.2 shows the event API in action.
Listing 15.2. A tool for examining POI filesystems
The EventDemo class implements the POIFSReaderListener interface and provides the necessary processPOIFSReaderEvent() method. This method uses the dump flag to determine whether to display the document's name or dump the contents.
The main() method creates a POIFSReader and then checks the program arguments. If there is only one argument, it is assumed to be the name of a file. In that case, an instance of EventDemo with dump=false is created and configured to receive all events. This is done with the call to registerListener() with no arguments other than the EventDemo.
If there are multiple arguments, the second is taken to be a boolean flag indicating whether documents should be dumped, and an EventDemo is created appropriately.
The remaining arguments are taken to be either the names of documents in the root directory or complete document paths with directories separated by slashes (/). The EventDemo is configured to listen for simple file names via the call to registerListener() that takes the file name as the second argument. Arguments representing full paths are broken into components by the StringTokenizer and used to create a POIFSDocumentPath. This object is then used in the final call to registerListener().
Running this program on the same Word document as used in the preceding example produces the output shown after the listing.
package com.awl.toolbook.chapter15;

import java.util.StringTokenizer;
import java.io.FileInputStream;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.poifs.eventfilesystem.*;

public class EventDemo implements POIFSReaderListener {
    // When true, print document contents instead of document names
    boolean dump = false;

    public EventDemo() {}

    public EventDemo(boolean dump) {
        this.dump = dump;
    }

    public void processPOIFSReaderEvent(
        POIFSReaderEvent event)
    {
        if(!dump) {
            String name = event.getPath() +
                event.getName();
            // Replace the control characters that commonly start entry names
            name = name.replace((char) 1,'?');
            name = name.replace((char) 5,'?');
            System.out.println("Document event: " + name);
        } else {
            DocumentInputStream in = event.getStream();
            byte data[] = new byte[2048];
            int count = 0;
            try {
                while((count = in.read(data)) > 0) {
                    System.out.print(
                        new String(data,0,count));
                }
            } catch (Exception e) {
                // Report the failure rather than silently ignoring it
                System.err.println("Read failed: " + e);
            }
            System.out.println("");
        }
    }

    public static void main(String args[])
        throws Exception
    {
        POIFSReader reader = new POIFSReader();
        if(args.length == 1) {
            // Only a file name: listen for every document
            reader.registerListener(new EventDemo());
        } else {
            POIFSReaderListener listener =
                new EventDemo(args[1].equals("true"));
            for(int i=2;i<args.length;i++) {
                if(args[i].indexOf('/') == -1) {
                    // Simple name: a document in the root directory
                    reader.registerListener(
                        listener,args[i]);
                } else {
                    // Full path: all tokens but the last are directories
                    StringTokenizer st =
                        new StringTokenizer(args[i],"/");
                    int count = st.countTokens();
                    String path[] = new String[count-1];
                    for(int j=0;j<count-1;j++) {
                        path[j] = st.nextToken();
                    }
                    String name = st.nextToken();
                    reader.registerListener(
                        listener,
                        new POIFSDocumentPath(path),
                        name);
                }
            }
        }
        FileInputStream in =
            new FileInputStream(args[0]);
        reader.read(in);
        in.close();
    }
}
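The path handling in main() can be tried in isolation, without POI on the classpath. The sketch below is one possible variant: it uses String.split() in place of StringTokenizer and a hypothetical SplitPath holder to separate the directory components from the final document name. Neither the class nor its parse() method comes from POI. Unlike StringTokenizer, split() can return empty components (for example, for a path that begins with a slash), so the sketch filters them out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for a parsed path argument; not a POI class.
final class SplitPath {
    final String[] directories; // every component except the last
    final String name;          // the final component, taken as the document name

    private SplitPath(String[] directories, String name) {
        this.directories = directories;
        this.name = name;
    }

    /** Splits "dir1/dir2/doc" into {"dir1", "dir2"} and "doc", skipping empty components. */
    static SplitPath parse(String arg) {
        List<String> parts = new ArrayList<>();
        for (String piece : arg.split("/")) {
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
        }
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("No document name in: " + arg);
        }
        String name = parts.remove(parts.size() - 1);
        return new SplitPath(parts.toArray(new String[0]), name);
    }

    public static void main(String[] args) {
        // "Inner" and "Contents" are made-up entry names used for illustration.
        SplitPath p = parse("ObjectPool/Inner/Contents");
        System.out.println(Arrays.toString(p.directories) + " " + p.name);
        // prints [ObjectPool, Inner] Contents
    }
}
```

In Listing 15.2, the directories array corresponds to the argument of the POIFSDocumentPath constructor and name to the last argument of registerListener().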
For the same Word document and no additional arguments, the output is:
Document event: /?SummaryInformation
Document event: /?DocumentSummaryInformation
Document event: /WordDocument
Document event: /?CompObj
Document event: /1Table
Running the program with the additional arguments false WordDocument produces, as expected
Document event: /WordDocument
It is tempting to look at the outputs from these two programs and then attempt to run POIDump to retrieve the contents of the WordDocument entry in the hope of extracting the text contained within the document. Such curiosity is a good trait in a developer, and the facility to view a DocumentEntry has been added specifically to satisfy such curiosity.
While it is true that WordDocument is the DocumentEntry that contains the text, it also contains a great deal of formatting and other information. It may or may not be possible to pick out the text from the control codes, depending on a number of factors.
This is not a deficiency in POIFS. Just as the java.util.jar package provides the means to extract a JPEG image from a JAR archive but cannot itself display the image, POIFS can extract the elements of an OLE compound document, but POIFS itself has no knowledge of what these pieces are or what they mean. This level of interpretation must be handled by higher-level APIs, and these will be discussed shortly.
Before moving on to these higher-level APIs, there are additional features of POIFS that are worth discussing. First, APIs are provided to create new archives. These APIs work in a simple, logical way: directories are created with createDirectory() and documents with createDocument(), which uses an InputStream to obtain the data for the new document. The following code snippet shows how these APIs might be used:
POIFSFileSystem fs = new POIFSFileSystem();
DirectoryEntry root = fs.getRoot();
// A document directly under the root directory
InputStream data1 =
    new ByteArrayInputStream("This is data1".getBytes());
root.createDocument("data1",data1);
// A subdirectory of the root, holding a second document
DirectoryEntry subdir =
    root.createDirectory("subdirectory");
InputStream data2 =
    new ByteArrayInputStream("This is data2".getBytes());
subdir.createDocument("data2",data2);
// Write the whole file system out as a single OLE 2 file
FileOutputStream out =
    new FileOutputStream("sample.ole");
fs.writeFilesystem(out);
out.close();
Running POIDump on the result of this code would produce, as expected
Root Entry/
.data1
.subdirectory/
..data2
and running POIDump sample.ole data2 would produce
This is data2
Like reading a POIFS, the ability to write a new file system is not immediately useful on its own. If the goal is to combine multiple files into a single file for storage or transmission, then JAR is a far better choice, because JAR files can be handled by existing zip utilities as well as the built-in java.util.jar package. Simply creating an OLE file whose document entries match those of an Excel spreadsheet will not result in a file that is readable by Excel. Once again, a higher-level API is needed. However, it is worth understanding the creation APIs because they are used by the higher levels.
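For comparison with the POIFS creation snippet, the same two entries can be written to a JAR archive using only the standard java.util.jar package. This is a rough sketch, assuming an in-memory archive is acceptable for the purpose; the JarSample class and its build() and list() methods are made-up names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

// Hypothetical demonstration class; only the java.util.jar calls are real APIs.
final class JarSample {
    /** Builds an in-memory archive holding data1 and subdirectory/data2. */
    static byte[] build() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes)) {
            jar.putNextEntry(new JarEntry("data1"));
            jar.write("This is data1".getBytes(StandardCharsets.UTF_8));
            jar.closeEntry();
            // The directory is implied by the slash in the entry name.
            jar.putNextEntry(new JarEntry("subdirectory/data2"));
            jar.write("This is data2".getBytes(StandardCharsets.UTF_8));
            jar.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the entry names stored in the archive, in archive order. */
    static List<String> list(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (JarInputStream jar =
                 new JarInputStream(new ByteArrayInputStream(archive))) {
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(list(build())); // prints [data1, subdirectory/data2]
    }
}
```

Where POIFS models directories as DirectoryEntry objects, the JAR format here expresses the hierarchy purely through slashes in entry names.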