Java Network Programming (3rd ed) [Electronic resources] (text version)


Harold, Elliotte Rusty









8.3 Parsing HTML


Sometimes you want to read HTML,
looking for information without actually displaying it on the screen.
For instance, more than one author I know has written a
"book ticker" program to track the
hour-by-hour progress of their books in the Amazon.com bestseller
list. The hardest part of this program isn't
retrieving the HTML. It's reading through the HTML
to find the one line that contains the book's
ranking. As another example, consider a Web Whacker-style program
that downloads a web site or part thereof to a local PC with all
links intact. Downloading the files once you have the URLs is easy.
But reading through the document to find the URLs of the linked pages
is considerably more complex.

Both of these examples are parsing problems. While parsing a clearly
defined language that doesn't allow syntax errors,
such as Java or XML, is relatively straightforward, parsing a
flexible language that attempts to recover from errors, like HTML, is
extremely difficult. It's easier to write in HTML
than it is to write in a strict language like XML, but
it's much harder to read such a language. Ease of
use for the page author has been favored at the cost of ease of
development for the programmer.

Fortunately, the javax.swing.text.html and
javax.swing.text.html.parser packages include
classes that do most of the hard work for you.
They're primarily intended for the internal use of
the JEditorPane class discussed in the last
section. Consequently, they can be a little tricky to get at. The
constructors are often not public or hidden inside inner classes, and
the classes themselves aren't very well documented.
But once you've seen a few examples, they
aren't hard to use.


8.3.1 HTMLEditorKit.Parser


The main HTML parsing class is the inner
class javax.swing.text.html.HTMLEditorKit.Parser:

public abstract static class HTMLEditorKit.Parser extends Object

Since this is an abstract class, the actual parsing work is performed
by an instance of its concrete subclass
javax.swing.text.html.parser.ParserDelegator:

public class ParserDelegator extends HTMLEditorKit.Parser

An instance of this class reads an HTML document from a
Reader. It looks for five things in the document:
start-tags, end-tags, empty-element tags, text, and comments. That
covers all the important parts of a common HTML file. (Document type
declarations and processing instructions are omitted, but
they're rare and not very important in most HTML
files, even when they are included.) Every time the parser sees one
of these five items, it invokes the corresponding callback method in
a particular instance of the
javax.swing.text.html.HTMLEditorKit.ParserCallback
class. To parse an HTML file, you write a subclass of
HTMLEditorKit.ParserCallback that responds to text
and tags as you desire. Then you pass an instance of your subclass to
the HTMLEditorKit.Parser's
parse( ) method, along with the
Reader from which the HTML will be read:

public void parse(Reader in, HTMLEditorKit.ParserCallback callback,
 boolean ignoreCharSet) throws IOException

The third argument indicates whether the parser should ignore a
character set declared in a META tag in the HTML header. This will
normally be true. If it's false, the parser throws a
javax.swing.text.ChangedCharSetException when a
META tag in the HTML header changes the
character set. This gives you an opportunity to switch to a
different Reader that understands that character
set and reparse the document (this time, setting
ignoreCharSet to true since you already know the
character set).

parse( ) is the only public method in the
HTMLEditorKit.Parser class. All the work is
handled inside the callback methods in the
HTMLEditorKit.ParserCallback subclass. The
parse( ) method simply reads from the
Reader in until
it's read the entire document. Every time it sees a
tag, comment, or block of text, it invokes the corresponding callback
method in the HTMLEditorKit.ParserCallback
instance. If the Reader throws an
IOException, that exception is passed along. Since
neither the HTMLEditorKit.Parser nor the
HTMLEditorKit.ParserCallback instance is specific
to one reader, it can be used to parse multiple files simply by
invoking parse( ) multiple times. With the standard
ParserDelegator, the callbacks are invoked on the thread that calls
parse( ), and parse( ) returns only after the whole document has been
processed. Your HTMLEditorKit.ParserCallback class needs to be
thread-safe only if you share one instance between parses running in
different threads.

Before you can do any of this, however, you have to get your hands on
an instance of the HTMLEditorKit.Parser class, and
that's harder than it should be.
HTMLEditorKit.Parser is an abstract class, so it
can't be instantiated directly. Its subclass,
javax.swing.text.html.parser.ParserDelegator, is
concrete. However, before you can use it, you have to configure it
with a DTD, using the protected static methods
ParserDelegator.setDefaultDTD( ) and
ParserDelegator.createDTD( ):

protected static void setDefaultDTD( )
protected static DTD createDTD(DTD dtd, String name)

So to create a ParserDelegator, you first need to
have an instance of
javax.swing.text.html.parser.DTD. This class
represents a Standard Generalized Markup Language (SGML) document
type definition. The DTD class has a protected
constructor and many protected methods that subclasses can use to
build a DTD from scratch, but this is an API that only an SGML expert
could be expected to use. The normal way DTDs are created is by
reading the text form of a standard DTD published by someone like the
W3C. You should be able to get a DTD for HTML by using the
DTDParser class to parse the
W3C's published HTML DTD. Unfortunately, the
DTDParser class isn't included in
the published Swing API, so you can't. Thus,
you're going to need to go through the back door to
create an HTMLEditorKit.Parser instance. What
we'll do is use the
HTMLEditorKit.getParser( ) method instead,
which ultimately returns a ParserDelegator after
properly initializing the DTD for HTML 3.2:

protected HTMLEditorKit.Parser getParser( )

Since this method is protected, we'll simply
subclass HTMLEditorKit and override it with a
public version, as Example 8-6 demonstrates.


Example 8-6. This subclass just makes the getParser( ) method public


import javax.swing.text.html.*;

public class ParserGetter extends HTMLEditorKit {

  // purely to make this method public
  public HTMLEditorKit.Parser getParser( ) {
    return super.getParser( );
  }

}

Now that you've got a way to get a parser,
you're ready to parse some documents. This is
accomplished through the parse( ) method of
HTMLEditorKit.Parser:

public abstract void parse(Reader input, HTMLEditorKit.ParserCallback  
callback, boolean ignoreCharSet) throws IOException

The Reader is straightforward. Simply chain an
InputStreamReader to the stream reading the HTML
document, probably one returned by the openStream() method of java.net.URL. For the third
argument, you can pass true to ignore encoding issues (this generally
works only if you're pretty sure
you're dealing with ASCII text) or false if you want
to receive a ChangedCharSetException when the
document has a META tag indicating the character
set. The second argument is where the action is.
You're going to write a subclass of
HTMLEditorKit.ParserCallback that is notified of
every start-tag, end-tag, empty-element tag, text, comment, and error
that the parser encounters.
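
The false-then-true pattern for the third argument can be sketched as follows. This is a minimal illustration, not part of the original examples: the class name CharsetProbe, the detect( ) helper, and the ISO-8859-1 fallback are assumptions made for the sketch, and it parses an in-memory byte array rather than a network stream so it can run offline. It obtains the parser with the same trick as ParserGetter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;

public class CharsetProbe {

    /** Hypothetical fallback used when the document declares no character set. */
    static final String FALLBACK = "ISO-8859-1";

    // Same trick as ParserGetter: widen the protected getParser() to public.
    static HTMLEditorKit.Parser newParser() {
        return new HTMLEditorKit() {
            @Override
            public HTMLEditorKit.Parser getParser() {
                return super.getParser();
            }
        }.getParser();
    }

    /** Returns the charset announced by a META tag, or FALLBACK if none is reported. */
    public static String detect(byte[] html) {
        Reader r = new InputStreamReader(
            new ByteArrayInputStream(html), StandardCharsets.ISO_8859_1);
        try {
            // false: ask for a ChangedCharSetException when a META tag names a charset
            newParser().parse(r, new HTMLEditorKit.ParserCallback(), false);
            return FALLBACK;
        } catch (ChangedCharSetException ex) {
            String spec = ex.getCharSetSpec();  // for example "text/html; charset=UTF-8"
            int eq = spec.indexOf('=');
            return (eq >= 0 ? spec.substring(eq + 1) : spec).trim();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);  // not expected for in-memory input
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=UTF-8\"></head>"
            + "<body><p>Hello</p></body></html>";
        String charset = detect(page.getBytes(StandardCharsets.UTF_8));
        // A second pass would now use this charset and pass true as the third argument.
        System.out.println("Reparse using " + charset);
    }
}
```

The first pass exists only to trigger the exception, so it uses a do-nothing callback; the real callback is attached on the second pass, once the Reader has been rebuilt with the detected character set.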


8.3.2 HTMLEditorKit.ParserCallback


The ParserCallback
class is a public inner class inside
javax.swing.text.html.HTMLEditorKit:

public static class HTMLEditorKit.ParserCallback extends Object

It has a single, public noargs constructor:

public HTMLEditorKit.ParserCallback( )

However, you probably won't use this directly
because the standard implementation of this class does nothing. It
exists to be subclassed. It has six callback methods that do nothing.
You will override these methods to respond to specific items seen in
the input stream as the document is parsed:

public void handleText(char[] text, int position)
public void handleComment(char[] text, int position)
public void handleStartTag(HTML.Tag tag,
MutableAttributeSet attributes, int position)
public void handleEndTag(HTML.Tag tag, int position)
public void handleSimpleTag(HTML.Tag tag,
MutableAttributeSet attributes, int position)
public void handleError(String errorMessage, int position)

There's also a flush( ) method
you use to perform any final cleanup. The parser invokes this method
once after it's finished parsing the document:

public void flush( ) throws BadLocationException
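
Before overriding individual methods, it can help to see which callbacks fire and in what order. The following sketch records every event in a list. The class name EventTrace and the trace( ) helper are hypothetical names chosen for illustration, and the input is an in-memory string rather than a URL.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

public class EventTrace extends HTMLEditorKit.ParserCallback {

    private final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        events.add("START " + tag);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        events.add("END " + tag);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        events.add("SIMPLE " + tag);
    }

    @Override
    public void handleText(char[] text, int position) {
        events.add("TEXT " + new String(text));
    }

    @Override
    public void handleComment(char[] text, int position) {
        events.add("COMMENT " + new String(text));
    }

    @Override
    public void handleError(String errorMessage, int position) {
        events.add("ERROR " + errorMessage);
    }

    /** Parses the given HTML and returns the callback events in the order received. */
    public static List<String> trace(String html) {
        // Expose the protected getParser() method, as ParserGetter does.
        HTMLEditorKit.Parser parser = new HTMLEditorKit() {
            @Override
            public HTMLEditorKit.Parser getParser() {
                return super.getParser();
            }
        }.getParser();
        EventTrace callback = new EventTrace();
        try {
            parser.parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);  // not expected for a StringReader
        }
        return callback.events;
    }

    public static void main(String[] args) {
        for (String event : trace("<html><body><p>Hello<br>World</p></body></html>")) {
            System.out.println(event);
        }
    }
}
```

Running it on a small document is a quick way to check which tags arrive through handleSimpleTag( ) rather than handleStartTag( ), and whether the parser reports tags that were never written in the source.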

Let's begin with a simple example. Suppose you want
to write a program that strips out all the tags and comments from an
HTML document and leaves only the text. You would write a subclass of
HTMLEditorKit.ParserCallback that overrides the
handleText( ) method to write the text on a
Writer. You would leave the other methods alone.
Example 8-7 demonstrates.


Example 8-7. TagStripper


import javax.swing.text.html.*;
import java.io.*;

public class TagStripper extends HTMLEditorKit.ParserCallback {

  private Writer out;

  public TagStripper(Writer out) {
    this.out = out;
  }

  // the only callback that is overridden: copy the text to the Writer
  public void handleText(char[] text, int position) {
    try {
      out.write(text);
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

}

Now let's suppose you want to use this class to
actually strip the tags from a URL. You begin by retrieving a parser
using Example 8-6's
ParserGetter class:

ParserGetter kit = new ParserGetter( );
HTMLEditorKit.Parser parser = kit.getParser( );

Next, construct an instance of your callback class like this:

HTMLEditorKit.ParserCallback callback 
= new TagStripper(new OutputStreamWriter(System.out));

Then you get a stream you can read the HTML document from. For
example:

try {
  URL u = new URL("http://www.oreilly.com");
  InputStream in = new BufferedInputStream(u.openStream( ));
  InputStreamReader r = new InputStreamReader(in);

Finally, you pass the Reader and the
HTMLEditorKit.ParserCallback to the
HTMLEditorKit.Parser's
parse( ) method, like this:

  parser.parse(r, callback, false);
}
catch (IOException ex) {
System.err.println(ex);
}
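
The fragments above can be assembled into one self-contained sketch. To keep it runnable without a network connection, this version parses an in-memory string and collects the text in a StringWriter. The class name StripDemo, the nested TextCollector class, and the strip( ) helper are hypothetical names introduced for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.swing.text.html.HTMLEditorKit;

public class StripDemo {

    /** Same idea as TagStripper: only handleText( ) is overridden. */
    static class TextCollector extends HTMLEditorKit.ParserCallback {
        private final Writer out;

        TextCollector(Writer out) {
            this.out = out;
        }

        @Override
        public void handleText(char[] text, int position) {
            try {
                out.write(text);
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
    }

    /** Returns the text content of the given HTML with tags and comments removed. */
    public static String strip(String html) {
        // Step 1: get a parser (the ParserGetter trick, inlined).
        HTMLEditorKit.Parser parser = new HTMLEditorKit() {
            @Override
            public HTMLEditorKit.Parser getParser() {
                return super.getParser();
            }
        }.getParser();
        // Step 2: construct the callback.
        StringWriter out = new StringWriter();
        HTMLEditorKit.ParserCallback callback = new TextCollector(out);
        // Step 3: hand the Reader and the callback to parse( ).
        try {
            parser.parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);  // not expected for a StringReader
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            strip("<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"));
    }
}
```

Swapping the StringReader for an InputStreamReader chained to u.openStream( ) gives the URL-based version described in the text.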

There are a couple of details about the parsing process that are not
obvious. First, the callback methods are invoked on the thread that
calls parse( ), and with the standard ParserDelegator the whole
document has been processed by the time parse( ) returns. If
you're using the same
HTMLEditorKit.ParserCallback object for two
parses that run in different threads, you need to make all your
callback methods thread-safe.

Second, the parser actually skips some of the data in the input. In
particular, it normalizes and strips whitespace. If the input
document contains seven spaces in a row, the parser will convert that
to a single space. Carriage returns, linefeeds, and tabs are all
converted to a single space, so you lose line breaks. Furthermore,
most text elements are stripped of all leading
and trailing whitespace. Elements that contain nothing but space are
eliminated completely. Thus, suppose the input document contains this
content:

<H1> Here's   the   Title </H1>
<P> Here's the text </P>

What actually comes out of the tag stripper is:

Here's the TitleHere's the text

The single exception is the PRE element, which
maintains all whitespace in its contents unedited. Short of
implementing your own parser, I don't know of any
way to retain all the stripped space. But you can include the minimum
necessary line breaks and whitespace by looking at the tags as well
as the text. Generally, you expect a single break in HTML when you
see one of these tags:

<BR>
<LI>
<TR>

You expect a double break (paragraph break) when you see one of these
tags:

<P>
</H1> </H2> </H3> </H4> </H5> </H6>
<HR>
<DIV>
</UL> </OL> </DL>

To include line breaks in the output, you have to look at each tag as
it's processed and determine whether it falls in one
of these sets. This is straightforward because the first argument
passed to each of the tag callback methods is an
HTML.Tag object.


8.3.3 HTML.Tag


Tag is a public inner class in the
javax.swing.text.html.HTML class:

public static class HTML.Tag extends Object

It has these four methods:

public boolean isBlock( )
public boolean breaksFlow( )
public boolean isPreformatted( )
public String toString( )

The breaksFlow( ) method returns true if
the tag should cause a single line break. The isBlock() method returns true if the tag
should cause a double line break. The isPreformatted() method returns true if
the tag indicates that whitespace should be preserved. This makes it
easy to provide the necessary breaks in the output.
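
As a small sketch of how these three methods might be used, the following helper turns a tag into the number of line breaks to emit. The class name TagBreaks and the breaksFor( ) helper are hypothetical; the rule it encodes is the one described above (two breaks for block tags, one for tags that break the flow, none otherwise).

```java
import javax.swing.text.html.HTML;

public class TagBreaks {

    /** Number of line separators to write when this tag is seen. */
    public static int breaksFor(HTML.Tag tag) {
        if (tag.isBlock()) {
            return 2;   // paragraph-style break
        }
        if (tag.breaksFlow()) {
            return 1;   // single line break
        }
        return 0;       // inline tag: no break
    }

    public static void main(String[] args) {
        HTML.Tag[] samples = {HTML.Tag.P, HTML.Tag.BR, HTML.Tag.B, HTML.Tag.PRE};
        for (HTML.Tag tag : samples) {
            System.out.println(tag + ": breaks=" + breaksFor(tag)
                + ", preformatted=" + tag.isPreformatted());
        }
    }
}
```

The isBlock( ) test comes first because block tags also break the flow; checking breaksFlow( ) first would never reach the double-break case.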

Chances are you'll see more tags than
you'd expect when you parse a file. The parser
inserts missing closing tags. In other words, if a document contains
only a <P> tag, then the parser will report
both the <P> start-tag and the implied
</P> end-tag at the appropriate points in
the document. Example 8-8 is a program that does the
best job yet of converting HTML to pure text. It looks for the empty
and end-tags, explicit or implied, and, if the tag indicates that
line breaks are called for, inserts the necessary number of line
breaks.


Example 8-8. LineBreakingTagStripper


import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
import java.io.*;
import java.net.*;

public class LineBreakingTagStripper
 extends HTMLEditorKit.ParserCallback {

  private Writer out;
  private String lineSeparator;

  public LineBreakingTagStripper(Writer out) {
    this(out, System.getProperty("line.separator", "\r\n"));
  }

  public LineBreakingTagStripper(Writer out, String lineSeparator) {
    this.out = out;
    this.lineSeparator = lineSeparator;
  }

  public void handleText(char[] text, int position) {
    try {
      out.write(text);
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  // block end-tags get a double break; other flow-breaking tags a single one
  public void handleEndTag(HTML.Tag tag, int position) {
    try {
      if (tag.isBlock( )) {
        out.write(lineSeparator);
        out.write(lineSeparator);
      }
      else if (tag.breaksFlow( )) {
        out.write(lineSeparator);
      }
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  // empty-element tags that don't break the line are replaced by a space
  public void handleSimpleTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {
    try {
      if (tag.isBlock( )) {
        out.write(lineSeparator);
        out.write(lineSeparator);
      }
      else if (tag.breaksFlow( )) {
        out.write(lineSeparator);
      }
      else {
        out.write(' ');
      }
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

}

Most of the time, of course, you want to know considerably more than
whether a tag breaks a line. You want to know what tag it is, and
behave accordingly. For instance, if you were writing a full-blown
HTML-to-TeX or HTML-to-RTF converter, you'd want to
handle each tag differently. You test the type of tag by comparing it
against these 73 mnemonic constants from the
HTML.Tag
class:


HTML.Tag.A            HTML.Tag.FRAMESET    HTML.Tag.PARAM
HTML.Tag.ADDRESS      HTML.Tag.H1          HTML.Tag.PRE
HTML.Tag.APPLET       HTML.Tag.H2          HTML.Tag.SAMP
HTML.Tag.AREA         HTML.Tag.H3          HTML.Tag.SCRIPT
HTML.Tag.B            HTML.Tag.H4          HTML.Tag.SELECT
HTML.Tag.BASE         HTML.Tag.H5          HTML.Tag.SMALL
HTML.Tag.BASEFONT     HTML.Tag.H6          HTML.Tag.STRIKE
HTML.Tag.BIG          HTML.Tag.HEAD        HTML.Tag.S
HTML.Tag.BLOCKQUOTE   HTML.Tag.HR          HTML.Tag.STRONG
HTML.Tag.BODY         HTML.Tag.HTML        HTML.Tag.STYLE
HTML.Tag.BR           HTML.Tag.I           HTML.Tag.SUB
HTML.Tag.CAPTION      HTML.Tag.IMG         HTML.Tag.SUP
HTML.Tag.CENTER       HTML.Tag.INPUT       HTML.Tag.TABLE
HTML.Tag.CITE         HTML.Tag.ISINDEX     HTML.Tag.TD
HTML.Tag.CODE         HTML.Tag.KBD         HTML.Tag.TEXTAREA
HTML.Tag.DD           HTML.Tag.LI          HTML.Tag.TH
HTML.Tag.DFN          HTML.Tag.LINK        HTML.Tag.TR
HTML.Tag.DIR          HTML.Tag.MAP         HTML.Tag.TT
HTML.Tag.DIV          HTML.Tag.MENU        HTML.Tag.U
HTML.Tag.DL           HTML.Tag.META        HTML.Tag.UL
HTML.Tag.DT           HTML.Tag.NOFRAMES    HTML.Tag.VAR
HTML.Tag.EM           HTML.Tag.OBJECT      HTML.Tag.IMPLIED
HTML.Tag.FONT         HTML.Tag.OL          HTML.Tag.COMMENT
HTML.Tag.FORM         HTML.Tag.OPTION
HTML.Tag.FRAME        HTML.Tag.P

These are not int constants. They are
object constants to allow
compile-time type checking. You saw this trick once before in the
javax.swing.event.HyperlinkEvent class. All
HTML.Tag elements passed to your callback methods
by the HTMLEditorKit.Parser will be one of these
73 constants. They are not just the same as
these 73 objects; they are these 73 objects.
There are exactly 73 objects in this class; no more, no less. You can
test against them with == rather than
equals( ).
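
A minimal sketch of the identity comparison follows. The class name TagIdentity and the headingLevel( ) helper are hypothetical; HTML.getTag( ) is the lookup method of javax.swing.text.html.HTML that returns the shared constant for a well-known tag name.

```java
import javax.swing.text.html.HTML;

public class TagIdentity {

    /** Returns 1 through 6 for H1 through H6, or 0 for any other tag. */
    public static int headingLevel(HTML.Tag tag) {
        // == is enough: each tag is represented by exactly one shared object
        if (tag == HTML.Tag.H1) return 1;
        if (tag == HTML.Tag.H2) return 2;
        if (tag == HTML.Tag.H3) return 3;
        if (tag == HTML.Tag.H4) return 4;
        if (tag == HTML.Tag.H5) return 5;
        if (tag == HTML.Tag.H6) return 6;
        return 0;
    }

    public static void main(String[] args) {
        // Looking a tag up by name hands back the same object as the constant.
        HTML.Tag lookedUp = HTML.getTag("h2");
        System.out.println("same object: " + (lookedUp == HTML.Tag.H2));
        System.out.println("level: " + headingLevel(lookedUp));
    }
}
```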

For example, let's suppose you need a program that
outlines HTML pages by extracting their H1 through
H6 headings while ignoring the rest of the
document. It organizes the outline as nested lists in which each
H1 heading is at the top level, each
H2 heading is one level deep, and so on. You would
write an HTMLEditorKit.ParserCallback subclass
that extracted the contents of all H1,
H2, H3, H4,
H5, and H6 elements while
ignoring all others, as Example 8-9 demonstrates.


Example 8-9. Outliner


import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
import java.io.*;
import java.net.*;
import java.util.*;

public class Outliner extends HTMLEditorKit.ParserCallback {

  private Writer out;
  private int level = 0;
  private boolean inHeader = false;
  private static String lineSeparator
   = System.getProperty("line.separator", "\r\n");

  public Outliner(Writer out) {
    this.out = out;
  }

  public void handleStartTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {

    int newLevel = 0;
    if (tag == HTML.Tag.H1) newLevel = 1;
    else if (tag == HTML.Tag.H2) newLevel = 2;
    else if (tag == HTML.Tag.H3) newLevel = 3;
    else if (tag == HTML.Tag.H4) newLevel = 4;
    else if (tag == HTML.Tag.H5) newLevel = 5;
    else if (tag == HTML.Tag.H6) newLevel = 6;
    else return;

    this.inHeader = true;
    try {
      if (newLevel > this.level) {
        // going deeper: open one list per level
        for (int i = 0; i < newLevel-this.level; i++) {
          out.write("<ul>" + lineSeparator + "<li>");
        }
      }
      else if (newLevel < this.level) {
        // coming back up: close one list per level
        for (int i = 0; i < this.level-newLevel; i++) {
          out.write(lineSeparator + "</ul>" + lineSeparator);
        }
        out.write(lineSeparator + "<li>");
      }
      else {
        out.write(lineSeparator + "<li>");
      }
      this.level = newLevel;
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }

  }

  public void handleEndTag(HTML.Tag tag, int position) {

    if (tag == HTML.Tag.H1 || tag == HTML.Tag.H2
     || tag == HTML.Tag.H3 || tag == HTML.Tag.H4
     || tag == HTML.Tag.H5 || tag == HTML.Tag.H6) {
      inHeader = false;
    }

    // work around bug in the parser that fails to call flush
    if (tag == HTML.Tag.HTML) this.flush( );

  }

  public void handleText(char[] text, int position) {

    if (inHeader) {
      try {
        out.write(text);
        out.flush( );
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    }

  }

  // close any lists that are still open
  public void flush( ) {
    try {
      while (this.level-- > 0) {
        out.write(lineSeparator + "</ul>");
      }
      out.flush( );
    }
    catch (IOException e) {
      System.err.println(e);
    }
  }

  private static void parse(URL url, String encoding) throws IOException {
    ParserGetter kit = new ParserGetter( );
    HTMLEditorKit.Parser parser = kit.getParser( );
    InputStream in = url.openStream( );
    InputStreamReader r = new InputStreamReader(in, encoding);
    HTMLEditorKit.ParserCallback callback = new Outliner
     (new OutputStreamWriter(System.out));
    parser.parse(r, callback, true);
  }

  public static void main(String[] args) {

    ParserGetter kit = new ParserGetter( );
    HTMLEditorKit.Parser parser = kit.getParser( );

    String encoding = "ISO-8859-1";
    URL url = null;
    try {
      url = new URL(args[0]);
      InputStream in = url.openStream( );
      InputStreamReader r = new InputStreamReader(in, encoding);
      // parse once just to detect the encoding
      HTMLEditorKit.ParserCallback doNothing
       = new HTMLEditorKit.ParserCallback( );
      parser.parse(r, doNothing, false);
    }
    catch (MalformedURLException ex) {
      System.out.println("Usage: java Outliner url");
      return;
    }
    catch (ChangedCharSetException ex) {
      String mimeType = ex.getCharSetSpec( );
      encoding = mimeType.substring(mimeType.indexOf("=") + 1).trim( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
    catch (ArrayIndexOutOfBoundsException ex) {
      System.out.println("Usage: java Outliner url");
      return;
    }

    try {
      parse(url, encoding);
    }
    catch (IOException ex) {
      System.err.println(ex);
    }

  }

}

When a heading start-tag is encountered by the
handleStartTag( ) method, the necessary
number of <ul>,
</ul>, and <li>
tags are emitted. Furthermore, the inHeader flag
is set to true so that the handleText( ) method
will know to output the contents of the heading. All start-tags
except the six levels of headers are simply ignored. The
handleEndTag( ) method likewise compares the tag it receives
with the seven tags it's interested in: the six heading tags and
HTML. If it sees a heading
tag, it sets the inHeader flag to false again so
that body text won't be emitted by the
handleText( ) method. If it sees the end of the
document via an </html> tag, it flushes out
the document. Otherwise, it does nothing. The end result is a nicely
formatted group of nested, unordered lists that outlines the
document. For example, here's the output of running
it against http://www.cafeconleche.org:

% java Outliner http://www.cafeconleche.org/
<ul>
<li> Cafe con Leche XML News and Resources<ul>
<li>Quote of the Day
<li>Today's News
<li>Recommended Reading
<li>Recent News<ul>
<li>XML Overview
<li>Tutorials
<li>Projects
<li>Seminar Notes
<li>Random Notes
<li>Specifications
<li>Books
<li>XML Resources
<li>Development Tools<ul>
<li>Validating Parsers
<li>Non-validating Parsers
<li>Online Validators and Syntax Checkers
<li>Formatting Engines
<li>Browsers
<li>Class Libraries
<li>Editors
<li>XML Applications
<li>External Sites
</ul>
</ul>
</ul>
</ul>


8.3.4 Attributes


When processing an HTML file, you often
need to look at the attributes as well as the tags. The second
argument to the handleStartTag( ) and
handleSimpleTag( ) callback methods is an instance
of the javax.swing.text.MutableAttributeSet class.
This object allows you to see what attributes are attached to a
particular tag. MutableAttributeSet is a
subinterface of the
javax.swing.text.AttributeSet interface:

public abstract interface MutableAttributeSet extends AttributeSet

Both AttributeSet and
MutableAttributeSet represent a collection of
attributes on an HTML tag. The difference is that the
MutableAttributeSet interface also declares methods to
add attributes to and remove attributes from the set, whereas
AttributeSet only lets you inspect the attributes
in the set. The attributes themselves are represented as pairs of
java.lang.Object objects, one for the name of the
attribute and one for the value. The
AttributeSet
interface declares these methods:

public int getAttributeCount( )
public boolean isDefined(Object name)
public boolean containsAttribute(Object name, Object value)
public boolean containsAttributes(AttributeSet attributes)
public boolean isEqual(AttributeSet attributes)
public AttributeSet copyAttributes( )
public Enumeration getAttributeNames( )
public Object getAttribute(Object name)
public AttributeSet getResolveParent( )

Most of these methods are self-explanatory. The
getAttributeCount( ) method returns the number of
attributes in the set. The isDefined( ) method
returns true if an attribute with the specified name is in the set,
false otherwise. The containsAttribute(Object name, Object value)
method returns true if an attribute with the given
name and value is in the set. The
containsAttributes(AttributeSet attributes) method
returns true if all the attributes in the specified set are in this
set with the same values; in other words, if the argument is a subset
of the set on which this method is invoked. The isEqual() method returns true if the invoking
AttributeSet is the same as the argument. The
copyAttributes( ) method returns a clone of the
current AttributeSet. The
getAttributeNames( ) method returns a
java.util.Enumeration of all the names of the
attributes in the set. Once you know the name of one of the elements
of the set, the getAttribute( ) method returns its
value. Finally, the getResolveParent( ) method
returns the parent AttributeSet, which will be
searched for attributes that are not found in the current set. For
example, given an AttributeSet, this method prints
the attributes in name=value format:

private void listAttributes(AttributeSet attributes) {
Enumeration e = attributes.getAttributeNames( );
while (e.hasMoreElements( )) {
Object name = e.nextElement( );
Object value = attributes.getAttribute(name);
System.out.println(name + "=" + value);
}
}
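
To see the same loop at work inside a callback, here is a minimal sketch that collects the attributes of every IMG tag in a document. The class name AttributeDump and the imgAttributes( ) helper are hypothetical names, and a sorted map is used only so that the printed order is predictable.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

public class AttributeDump extends HTMLEditorKit.ParserCallback {

    private final Map<String, String> seen = new TreeMap<>();

    // IMG is an empty element, so it arrives through handleSimpleTag( )
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.IMG) {
            copy(attributes);
        }
    }

    // Same enumeration loop as listAttributes( ), storing instead of printing.
    private void copy(AttributeSet attributes) {
        Enumeration<?> e = attributes.getAttributeNames();
        while (e.hasMoreElements()) {
            Object name = e.nextElement();
            Object value = attributes.getAttribute(name);
            seen.put(name.toString(), String.valueOf(value));
        }
    }

    /** Returns the name/value pairs found on the IMG tags of the given HTML. */
    public static Map<String, String> imgAttributes(String html) {
        HTMLEditorKit.Parser parser = new HTMLEditorKit() {
            @Override
            public HTMLEditorKit.Parser getParser() {
                return super.getParser();
            }
        }.getParser();
        AttributeDump callback = new AttributeDump();
        try {
            parser.parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);  // not expected for a StringReader
        }
        return callback.seen;
    }

    public static void main(String[] args) {
        String html = "<html><body><p><img src=\"logo.png\" alt=\"Logo\"></p></body></html>";
        imgAttributes(html).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```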

Although the argument and return types of these methods are mostly
declared in terms of java.lang.Object, in
practice, all values are instances of
java.lang.String, while all names are instances of
the public inner class
javax.swing.text.html.HTML.Attribute. Just as the
HTML.Tag class predefines 73 HTML tags and uses a
private constructor to prevent the creation of others, so too does
the HTML.Attribute class predefine 80
standard HTML attributes (HTML.Attribute.ACTION,
HTML.Attribute.ALIGN,
HTML.Attribute.ALINK,
HTML.Attribute.ALT, etc.) and prohibits the
construction of others via a nonpublic constructor. Generally, this
isn't an issue, since you mostly use
getAttribute( ), containsAttribute(), and so forth only with names returned by
getAttributeNames( ). The 80 predefined attributes
are:


HTML.Attribute.ACTION        HTML.Attribute.DUMMY          HTML.Attribute.PROMPT
HTML.Attribute.ALIGN         HTML.Attribute.ENCTYPE        HTML.Attribute.REL
HTML.Attribute.ALINK         HTML.Attribute.ENDTAG         HTML.Attribute.REV
HTML.Attribute.ALT           HTML.Attribute.FACE           HTML.Attribute.ROWS
HTML.Attribute.ARCHIVE       HTML.Attribute.FRAMEBORDER    HTML.Attribute.ROWSPAN
HTML.Attribute.BACKGROUND    HTML.Attribute.HALIGN         HTML.Attribute.SCROLLING
HTML.Attribute.BGCOLOR       HTML.Attribute.HEIGHT         HTML.Attribute.SELECTED
HTML.Attribute.BORDER        HTML.Attribute.HREF           HTML.Attribute.SHAPE
HTML.Attribute.CELLPADDING   HTML.Attribute.HSPACE         HTML.Attribute.SHAPES
HTML.Attribute.CELLSPACING   HTML.Attribute.HTTPEQUIV      HTML.Attribute.SIZE
HTML.Attribute.CHECKED       HTML.Attribute.ID             HTML.Attribute.SRC
HTML.Attribute.CLASS         HTML.Attribute.ISMAP          HTML.Attribute.STANDBY
HTML.Attribute.CLASSID       HTML.Attribute.LANG           HTML.Attribute.START
HTML.Attribute.CLEAR         HTML.Attribute.LANGUAGE       HTML.Attribute.STYLE
HTML.Attribute.CODE          HTML.Attribute.LINK           HTML.Attribute.TARGET
HTML.Attribute.CODEBASE      HTML.Attribute.LOWSRC         HTML.Attribute.TEXT
HTML.Attribute.CODETYPE      HTML.Attribute.MARGINHEIGHT   HTML.Attribute.TITLE
HTML.Attribute.COLOR         HTML.Attribute.MARGINWIDTH    HTML.Attribute.TYPE
HTML.Attribute.COLS          HTML.Attribute.MAXLENGTH      HTML.Attribute.USEMAP
HTML.Attribute.COLSPAN       HTML.Attribute.METHOD         HTML.Attribute.VALIGN
HTML.Attribute.COMMENT       HTML.Attribute.MULTIPLE       HTML.Attribute.VALUE
HTML.Attribute.COMPACT       HTML.Attribute.N              HTML.Attribute.VALUETYPE
HTML.Attribute.CONTENT       HTML.Attribute.NAME           HTML.Attribute.VERSION
HTML.Attribute.COORDS        HTML.Attribute.NOHREF         HTML.Attribute.VLINK
HTML.Attribute.DATA          HTML.Attribute.NORESIZE       HTML.Attribute.VSPACE
HTML.Attribute.DECLARE       HTML.Attribute.NOSHADE        HTML.Attribute.WIDTH
HTML.Attribute.DIR           HTML.Attribute.NOWRAP

The
MutableAttributeSet interface adds six methods to add
attributes to and remove attributes from the set:

public void addAttribute(Object name, Object value)
public void addAttributes(AttributeSet attributes)
public void removeAttribute(Object name)
public void removeAttributes(Enumeration names)
public void removeAttributes(AttributeSet attributes)
public void setResolveParent(AttributeSet parent)

Again, the values are strings and the names are
HTML.Attribute objects.
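
As a small sketch of the mutating methods, the following helper rewrites link attributes in place. The class name AbsoluteLinks and the absolutize( ) helper are hypothetical, and the sketch resolves references with java.net.URI rather than java.net.URL, which is an assumption made to keep it short. It is exercised here with a javax.swing.text.SimpleAttributeSet, a standard MutableAttributeSet implementation, instead of a set delivered by the parser.

```java
import java.net.URI;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

public class AbsoluteLinks {

    // The attributes treated as links in this sketch.
    private static final HTML.Attribute[] LINK_ATTRIBUTES = {
        HTML.Attribute.HREF, HTML.Attribute.SRC,
        HTML.Attribute.LOWSRC, HTML.Attribute.CODEBASE
    };

    /** Replaces relative link values in the set with absolute ones. */
    public static void absolutize(MutableAttributeSet attributes, URI base) {
        for (HTML.Attribute name : LINK_ATTRIBUTES) {
            Object value = attributes.getAttribute(name);
            if (value instanceof String) {
                // addAttribute( ) overwrites the value already stored under this name.
                // URI.resolve throws IllegalArgumentException for malformed values.
                attributes.addAttribute(name, base.resolve((String) value).toString());
            }
        }
    }

    public static void main(String[] args) {
        MutableAttributeSet attributes = new SimpleAttributeSet();
        attributes.addAttribute(HTML.Attribute.HREF, "next.html");
        attributes.addAttribute(HTML.Attribute.TITLE, "Next page");
        absolutize(attributes, URI.create("http://www.example.com/dir/index.html"));
        System.out.println(attributes.getAttribute(HTML.Attribute.HREF));
        attributes.removeAttribute(HTML.Attribute.TITLE);
        System.out.println(attributes.getAttributeCount());
    }
}
```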

One possible use for all these methods is to modify documents before
saving or displaying them. For example, most web browsers let you
save a page on your hard drive as either HTML or text. However, both
these formats lose track of images and relative links. The problem is
that most pages are full of relative URLs, and these all break when
you move the page to your local machine. Example 8-10
is an application called PageSaver that downloads
a web page to a local hard drive while keeping all links intact by
rewriting all relative URLs as absolute URLs.

The PageSaver class reads a series of URLs from the
command line. It opens each one in turn and parses it. Every tag,
text block, comment, and attribute is copied into a local file.
However, all link attributes, such as SRC,
LOWSRC, CODEBASE, and
HREF, are remapped to an absolute URL. Note
particularly the extensive use to which the URL
and javax.swing.text classes were put;
PageSaver could be rewritten with string
replacements, but that would be considerably more complicated.


Example 8-10. PageSaver


import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
import java.io.*;
import java.net.*;
import java.util.*;

public class PageSaver extends HTMLEditorKit.ParserCallback {

  private Writer out;
  private URL base;

  public PageSaver(Writer out, URL base) {
    this.out = out;
    this.base = base;
  }

  public void handleStartTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {
    try {
      out.write("<" + tag);
      this.writeAttributes(attributes);
      // for the <APPLET> tag we may have to add a codebase attribute
      if (tag == HTML.Tag.APPLET
       && attributes.getAttribute(HTML.Attribute.CODEBASE) == null) {
        String codebase = base.toString( );
        if (codebase.endsWith(".htm") || codebase.endsWith(".html")) {
          codebase = codebase.substring(0, codebase.lastIndexOf('/'));
        }
        out.write(" codebase=\"" + codebase + "\"");
      }
      out.write(">");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
      ex.printStackTrace( );
    }
  }

  public void handleEndTag(HTML.Tag tag, int position) {
    try {
      out.write("</" + tag + ">");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  private void writeAttributes(AttributeSet attributes)
   throws IOException {

    Enumeration e = attributes.getAttributeNames( );
    while (e.hasMoreElements( )) {
      Object name = e.nextElement( );
      String value = (String) attributes.getAttribute(name);
      try {
        // link attributes are rewritten as absolute URLs
        if (name == HTML.Attribute.HREF || name == HTML.Attribute.SRC
         || name == HTML.Attribute.LOWSRC
         || name == HTML.Attribute.CODEBASE ) {
          URL u = new URL(base, value);
          out.write(" " + name + "=\"" + u + "\"");
        }
        else {
          out.write(" " + name + "=\"" + value + "\"");
        }
      }
      catch (MalformedURLException ex) {
        System.err.println(ex);
        System.err.println(base);
        System.err.println(value);
        ex.printStackTrace( );
      }
    }
  }

  public void handleComment(char[] text, int position) {
    try {
      out.write("<!-- ");
      out.write(text);
      out.write(" -->");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
    }
  }

  public void handleText(char[] text, int position) {
    try {
      out.write(text);
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
      ex.printStackTrace( );
    }
  }

  public void handleSimpleTag(HTML.Tag tag,
   MutableAttributeSet attributes, int position) {
    try {
      out.write("<" + tag);
      this.writeAttributes(attributes);
      out.write(">");
      out.flush( );
    }
    catch (IOException ex) {
      System.err.println(ex);
      ex.printStackTrace( );
    }
  }

  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {

      ParserGetter kit = new ParserGetter( );
      HTMLEditorKit.Parser parser = kit.getParser( );

      try {
        URL u = new URL(args[i]);
        InputStream in = u.openStream( );
        InputStreamReader r = new InputStreamReader(in);
        String remoteFileName = u.getFile( );
        if (remoteFileName.endsWith("/")) {
          remoteFileName += "index.html";
        }
        if (remoteFileName.startsWith("/")) {
          remoteFileName = remoteFileName.substring(1);
        }
        // mirror the remote directory structure under a directory
        // named after the host
        File localDirectory = new File(u.getHost( ));
        while (remoteFileName.indexOf('/') > -1) {
          String part = remoteFileName.substring(0,
           remoteFileName.indexOf('/'));
          remoteFileName =
           remoteFileName.substring(remoteFileName.indexOf('/')+1);
          localDirectory = new File(localDirectory, part);
        }
        // mkdirs( ) returns false if the directory already exists
        if (localDirectory.isDirectory( ) || localDirectory.mkdirs( )) {
          File output = new File(localDirectory, remoteFileName);
          FileWriter out = new FileWriter(output);
          HTMLEditorKit.ParserCallback callback = new PageSaver(out, u);
          parser.parse(r, callback, false);
        }
      }
      catch (IOException ex) {
        System.err.println(ex);
        ex.printStackTrace( );
      }

    }

  }

}

The handleEndTag( ),
handleText( ), and handleComment( )
methods simply copy their content from the input into the
output. The handleStartTag( ) and
handleSimpleTag( ) methods write their respective
tags onto the output but also invoke the private
writeAttributes( ) method. This method loops
through the attributes in the set and mostly just copies them onto
the output. However, for a few select attributes, such as
SRC and HREF, which typically
have URL values, it rewrites the values as absolute URLs. Finally,
the main( ) method reads URLs from the command
line, calculates reasonable names and directories for corresponding
local files, and starts a new PageSaver for each
URL.
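
The file-naming step in main( ) can be isolated as a small pure function, which makes it easier to check by hand. The following is a sketch under stated assumptions rather than the code PageSaver uses: the class name LocalPath and the localPathFor( ) helper are hypothetical, it works on java.net.URI instead of java.net.URL, and it ignores any query string.

```java
import java.net.URI;

public class LocalPath {

    /**
     * Maps a page address to a relative local path: the host name becomes the
     * top-level directory, and a path ending in "/" gets "index.html" appended.
     */
    public static String localPathFor(URI u) {
        String path = u.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";                  // bare host, no path component at all
        }
        if (path.endsWith("/")) {
            path += "index.html";        // directory-style address
        }
        return u.getHost() + path;       // path already starts with "/"
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com",
            "http://www.example.com/a/b/",
            "http://www.example.com/x/page.html"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + localPathFor(URI.create(s)));
        }
    }
}
```

The result uses forward slashes; passing it to new File( ) and calling getParentFile( ) yields the directory that has to exist before the page can be written.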

In the first edition of this book, I included a similar program that
downloaded the raw HTML using the URL class and
parsed it manually. That program was about a third longer than this
one and much less robust. For instance, it did not support frames or
the LOWSRC attributes of IMG
tags. It went to great effort to handle both quoted and unquoted
attribute values and still didn't recognize
attribute values enclosed in single quotes. By contrast, this program
needs only one extra line of code to support each additional
attribute. It is much more robust, much easier to understand (since
there's not a lot of detailed string manipulation),
and much easier to extend.

This is just one example of the various HTML filters that the
javax.swing.text.html package makes easy to write.
You could, for example, write a filter that pretty-prints the HTML by
indenting the different levels of tags. You could write a program to
convert HTML to TeX, XML, RTF, or many other formats. You could write
a program that spiders a web site, downloading all linked
pages, and this is just the beginning. All of these programs are
much easier to write because Swing provides a simple-to-use HTML
parser. All you have to do is respond to the individual elements and
attributes that the parser discovers in the HTML document. The more
difficult problem of parsing the document is removed.
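
As one illustration of how little code such a program needs, here is a sketch of the first step of a spider: collecting the absolute URLs of the links on a page. The class name LinkExtractor and the extract( ) helper are hypothetical, only A tags are considered, and the page is supplied as a string so that the sketch runs without a network connection.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

public class LinkExtractor extends HTMLEditorKit.ParserCallback {

    private final URI base;
    private final List<String> links = new ArrayList<>();

    public LinkExtractor(URI base) {
        this.base = base;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag != HTML.Tag.A) {
            return;
        }
        Object href = attributes.getAttribute(HTML.Attribute.HREF);
        if (href instanceof String) {
            try {
                links.add(base.resolve((String) href).toString());
            } catch (IllegalArgumentException ex) {
                // malformed link: skipped in this sketch
            }
        }
    }

    /** Returns the absolute targets of the A tags in html, in document order. */
    public static List<String> extract(String html, String baseUrl) {
        HTMLEditorKit.Parser parser = new HTMLEditorKit() {
            @Override
            public HTMLEditorKit.Parser getParser() {
                return super.getParser();
            }
        }.getParser();
        LinkExtractor callback = new LinkExtractor(URI.create(baseUrl));
        try {
            parser.parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);  // not expected for a StringReader
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p><a href=\"next.html\">next</a> "
            + "<a href=\"http://other.example.org/\">other</a></p></body></html>";
        for (String link : extract(html, "http://www.example.com/dir/index.html")) {
            System.out.println(link);
        }
    }
}
```

A full spider would feed each returned address back through the same parser and keep a set of pages already visited; that bookkeeping is left out here.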

