Java Network Programming (3rd ed)

15.3 Reading the Header

HTTP servers provide a substantial
amount of information in the header that precedes each response. For
example, here's a typical HTTP header returned by an
Apache web server:

HTTP/1.1 200 OK
Date: Mon, 18 Oct 1999 20:06:48 GMT
Server: Apache/1.3.4 (Unix) PHP/3.0.6 mod_perl/1.17
Last-Modified: Mon, 18 Oct 1999 12:58:21 GMT
ETag: "1e05f2-89bb-380b196d"
Accept-Ranges: bytes
Content-Length: 35259
Connection: close
Content-Type: text/html

There's a lot of information there. In general, an
HTTP header may include the content type of the requested document,
the length of the document in bytes, the character set in which the
content is encoded, the date and time, the date the content expires,
and the date the content was last modified. However, the information
depends on the server; some servers send all this information for
each request, others send some information, and a few
don't send anything. The methods of this section
allow you to query a
URLConnection to find out what metadata the server
has provided.

Aside from HTTP, very few protocols use MIME headers (and technically
speaking, even the HTTP header isn't actually a MIME
header; it just looks a lot like one). When writing your own subclass
of URLConnection, it is often necessary to
override these methods so that they return sensible values. The most
important piece of information you may be lacking is the MIME content
type. URLConnection provides some utility methods
that guess the data's content type based on its
filename or the first few bytes of the data itself.
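
As a rough illustration of those utility methods, the sketch below calls
the two static guessing methods of java.net.URLConnection,
guessContentTypeFromName( ) and guessContentTypeFromStream( ). The class
name ContentTypeGuesser, its guess( ) helper, and the sample filename and
bytes are made up for this example. The answer for a given filename
depends on the MIME table shipped with the Java runtime, so treat the
name-based result as likely rather than guaranteed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

// Hypothetical helper class, not part of the book's examples
public class ContentTypeGuesser {

  // Try the filename first, then fall back to sniffing the leading bytes.
  // Returns null if neither guess succeeds.
  public static String guess(String filename, byte[] firstBytes) {
    String type = null;
    if (filename != null) {
      type = URLConnection.guessContentTypeFromName(filename);
    }
    if (type == null && firstBytes != null) {
      // ByteArrayInputStream supports mark/reset, which the method requires
      try (InputStream in = new ByteArrayInputStream(firstBytes)) {
        type = URLConnection.guessContentTypeFromStream(in);
      }
      catch (IOException ex) {
        // not expected for an in-memory stream; report "unknown"
        type = null;
      }
    }
    return type;
  }

  public static void main(String[] args) {
    byte[] gifSignature = {'G', 'I', 'F', '8', '9', 'a', 0, 0};
    System.out.println(guess("index.html", null));
    System.out.println(guess(null, gifSignature));
  }
}
```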

15.3.1 Retrieving Specific Header Fields

The first six methods request specific, particularly common fields
from the header. These are:

Content-type

Content-length

Content-encoding

Date

Last-modified

Expires

15.3.1.1 public String getContentType( )

This method returns the MIME content
type of the data. It relies on the web server to send a valid content
type. (In a later section, we'll see how
recalcitrant servers are handled.) It throws no exceptions and
returns null if the content type
isn't available. text/html will
be the most common content type you'll encounter
when connecting to web servers. Other commonly used types include
text/plain, image/gif,
application/xml, and
image/jpeg.

If the content type is some form of text, then this header may also
contain a character set part identifying the
document's character encoding. For example:

Content-type: text/html; charset=UTF-8

Or:

Content-Type: text/xml; charset=iso-2022-jp

In this case, getContentType( ) returns the full
value of the Content-type field, including the character encoding. We
can use this to improve on Example 15-1 by using the
encoding specified in the HTTP header to decode the document, or
ISO-8859-1 (the HTTP default) if no such encoding is specified. If a
nontext type is encountered, an exception is thrown. Example 15-2 demonstrates:

Example 15-2. Download a web page with the correct character set

import java.net.*;
import java.io.*;

public class EncodingAwareSourceViewer {

  public static void main (String[] args) {
    for (int i = 0; i < args.length; i++) {
      try {
        // set default encoding (the HTTP default)
        String encoding = "ISO-8859-1";
        URL u = new URL(args[i]);
        URLConnection uc = u.openConnection( );
        String contentType = uc.getContentType( );
        // getContentType( ) returns null if no Content-type header was sent
        if (contentType == null || !contentType.startsWith("text/")) {
          throw new IOException(args[i] + " is not a text document");
        }
        int encodingStart = contentType.indexOf("charset=");
        if (encodingStart != -1) {
          encoding = contentType.substring(encodingStart+8);
        }
        InputStream in = new BufferedInputStream(uc.getInputStream( ));
        Reader r = new InputStreamReader(in, encoding);
        int c;
        while ((c = r.read( )) != -1) {
          System.out.print((char) c);
        }
        r.close( );
      }
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a parseable URL");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    } // end for
  } // end main

}  // end EncodingAwareSourceViewer

In practice, most servers don't include charset
information in their Content-type headers, so this is of limited use.
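
When a charset is present, the Content-type value may carry other
parameters after it, or wrap the charset name in double quotes, and a
plain substring( ) call would pick those up as part of the encoding name.
The following is a minimal sketch of a more defensive parser. The class
CharsetSniffer and its charsetOf( ) method are hypothetical names
introduced here, and the parser is deliberately simple: it assumes no
parameter value contains a semicolon.

```java
// Hypothetical helper, not part of the book's examples
public class CharsetSniffer {

  // Returns the charset parameter of a Content-type value, or fallback
  // if the value is null or carries no charset parameter.
  public static String charsetOf(String contentType, String fallback) {
    if (contentType == null) return fallback;
    // parameters follow the media type, separated by semicolons
    String[] parts = contentType.split(";");
    for (int i = 1; i < parts.length; i++) {
      String param = parts[i].trim();
      // compare the parameter name without regard to case
      if (param.regionMatches(true, 0, "charset=", 0, 8)) {
        String value = param.substring(8).trim();
        // strip optional double quotes around the value
        if (value.length() >= 2
         && value.startsWith("\"") && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        if (!value.isEmpty()) return value;
      }
    }
    return fallback;
  }

  public static void main(String[] args) {
    System.out.println(charsetOf("text/html; charset=UTF-8", "ISO-8859-1"));
    System.out.println(charsetOf("text/html", "ISO-8859-1"));
  }
}
```

In a program like Example 15-2, the result could be passed straight to
the InputStreamReader constructor, with ISO-8859-1 as the fallback.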

15.3.1.2 public int getContentLength( )

The getContentLength() method tells you
how many bytes there are in the content. Many servers send
Content-length headers only when they're
transferring a binary file, not when transferring a text file. If
there is no Content-length header, getContentLength() returns -1. The method throws no exceptions. It is used
when you need to know exactly how many bytes to read or when you need
to create a buffer large enough to hold the data in advance.

In Chapter 7, we discussed how to use the
openStream( ) method of the URL
class to download text files from an HTTP server. Although in theory
you should be able to use the same method to download a binary file,
such as a GIF image or a .class byte code file,
in practice this procedure presents a problem. HTTP servers
don't always close the connection exactly where the
data is finished; therefore, you don't know when to
stop reading. To download a binary file, it is more reliable to use a
URLConnection's
getContentLength( ) method to find the
file's length, then read exactly the number of bytes
indicated. Example 15-3 is a program that uses this
technique to save a binary file on a disk.

Example 15-3. Downloading a binary file from a web site and saving it to disk

import java.net.*;
import java.io.*;

public class BinarySaver {

  public static void main (String[] args) {
    for (int i = 0; i < args.length; i++) {
      try {
        URL root = new URL(args[i]);
        saveBinaryFile(root);
      }
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    } // end for
  } // end main

  public static void saveBinaryFile(URL u) throws IOException {
    URLConnection uc = u.openConnection( );
    String contentType = uc.getContentType( );
    int contentLength = uc.getContentLength( );
    // getContentType( ) returns null if no Content-type header was sent
    if (contentType == null || contentType.startsWith("text/")
     || contentLength == -1 ) {
      throw new IOException("This is not a binary file.");
    }
    InputStream raw = uc.getInputStream( );
    InputStream in  = new BufferedInputStream(raw);
    byte[] data = new byte[contentLength];
    int bytesRead = 0;
    int offset = 0;
    // a single read( ) may return fewer bytes than requested, so loop
    while (offset < contentLength) {
      bytesRead = in.read(data, offset, data.length-offset);
      if (bytesRead == -1) break;  // unexpected end of stream
      offset += bytesRead;
    }
    in.close( );
    if (offset != contentLength) {
      throw new IOException("Only read " + offset
       + " bytes; Expected " + contentLength + " bytes");
    }
    // use the last path segment of the URL as the local filename
    String filename = u.getFile( );
    filename = filename.substring(filename.lastIndexOf('/') + 1);
    FileOutputStream fout = new FileOutputStream(filename);
    fout.write(data);
    fout.flush( );
    fout.close( );
  }

} // end BinarySaver

As usual, the main( ) method loops over the URLs
entered on the command line, passing each URL to the
saveBinaryFile( ) method. saveBinaryFile() opens a URLConnection
uc to the URL. It puts the type
into the variable contentType and the content
length into the variable contentLength. Next, an
if statement checks whether the content type is
text or the Content-length field is missing or
invalid (contentLength == -1).
If either of these is true, an
IOException is thrown. If these assertions are
both false, we have a binary file of known length:
that's what we want.

Now that we have a genuine binary file on our hands, we prepare to
read it into an array of bytes called data.
data is initialized to the number of bytes
required to hold the binary object, contentLength.
Ideally, you would like to fill data with a single
call to read( ) but you probably
won't get all the bytes at once, so the read is
placed in a loop. The number of bytes read up to this point is
accumulated into the offset variable, which also
keeps track of the location in the data array at
which to start placing the data retrieved by the next call to
read( ). The loop continues until
offset equals or exceeds
contentLength; that is, the array has been filled
with the expected number of bytes. We also break out of the
while loop if read( ) returns
-1, indicating an unexpected end of stream. The
offset variable now contains the total number of
bytes read, which should be equal to the content length. If they are
not equal, an error has occurred, so saveBinaryFile() throws an IOException. This is the
general procedure for reading binary files from HTTP connections.
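
The read loop described above can be isolated into a small reusable
method and exercised without a network connection. This is only a
sketch: the class ExactReader, the method readExactly( ), and the
"trickling" test stream that returns at most two bytes per call are all
invented for illustration. Unlike saveBinaryFile( ), it reports a short
stream with an EOFException, a subclass of IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the book's examples
public class ExactReader {

  // Reads exactly length bytes from in, looping because a single
  // read() call may return fewer bytes than were requested.
  public static byte[] readExactly(InputStream in, int length)
   throws IOException {
    byte[] data = new byte[length];
    int offset = 0;
    while (offset < length) {
      int bytesRead = in.read(data, offset, length - offset);
      if (bytesRead == -1) {
        throw new EOFException("Only read " + offset
         + " bytes; expected " + length + " bytes");
      }
      offset += bytesRead;
    }
    return data;
  }

  public static void main(String[] args) throws IOException {
    byte[] source = {10, 20, 30, 40, 50};
    // A stream that hands out at most two bytes per read() call,
    // standing in for a slow network connection
    InputStream trickle = new ByteArrayInputStream(source) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 2));
      }
    };
    byte[] copy = readExactly(trickle, source.length);
    System.out.println(copy.length + " bytes read");
  }
}
```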

Now we are ready to save the data in a file. saveBinaryFile( ) gets
the filename from the URL using the
getFile( ) method and strips any path information
by calling
filename.substring(filename.lastIndexOf('/')
+ 1). A new
FileOutputStream fout is opened
into this file and the data is written in one large burst with
fout.write(data).
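
One detail worth checking when deriving a local filename this way is
that getFile( ) includes any query string, whereas getPath( ) does not.
The short sketch below shows the difference; the class LocalNames, the
method lastSegment( ), and the example.com URL are assumptions made for
the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of the book's examples
public class LocalNames {

  // Strips everything up to and including the last '/' from a path
  public static String lastSegment(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }

  public static void main(String[] args) throws MalformedURLException {
    URL u = new URL("http://www.example.com/images/logo.gif?size=small");
    // getFile() keeps the query string
    System.out.println(lastSegment(u.getFile()));
    // getPath() drops it
    System.out.println(lastSegment(u.getPath()));
  }
}
```

Note that a URL ending in a slash yields an empty name, which a real
program would have to handle before opening the output file.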

15.3.1.3 public String getContentEncoding( )

This method
returns a String that tells you how the content is
encoded. If the content is sent unencoded (as is commonly the case
with HTTP servers), this method returns null. It
throws no exceptions. The most commonly used content encoding on the
Web is probably x-gzip, which can be straightforwardly decoded using
a java.util.zip.GZIPInputStream.

The content encoding is not the same as the character encoding. The
character encoding is determined by the Content-type header or
information internal to the document, and specifies how characters
are represented as bytes. The content encoding specifies how those
bytes are in turn encoded as other bytes.

When subclassing URLConnection, override this
method if you expect to be dealing with encoded data, as might be the
case for an NNTP or SMTP protocol handler; in these applications,
many different encoding schemes, such as BinHex and uuencode, are
used to pass eight-bit binary data through a seven-bit ASCII
connection.
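
For the gzip case mentioned above, decoding might look something like
the following sketch. The class ContentDecoder and its methods are
hypothetical; roundTrip( ) exists only to exercise decode( ) in memory
and is not something a real client would need.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the book's examples
public class ContentDecoder {

  // Wraps raw in a GZIPInputStream if the content encoding names gzip;
  // otherwise returns raw unchanged. contentEncoding may be null.
  public static InputStream decode(InputStream raw, String contentEncoding)
   throws IOException {
    if ("gzip".equalsIgnoreCase(contentEncoding)
     || "x-gzip".equalsIgnoreCase(contentEncoding)) {
      return new GZIPInputStream(raw);
    }
    return raw;
  }

  // Demonstration only: compress a string in memory, then decode it
  public static String roundTrip(String text) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
        gz.write(text.getBytes("UTF-8"));
      }
      InputStream in = decode(
       new ByteArrayInputStream(buffer.toByteArray()), "x-gzip");
      ByteArrayOutputStream result = new ByteArrayOutputStream();
      int c;
      while ((c = in.read()) != -1) result.write(c);
      return result.toString("UTF-8");
    }
    catch (IOException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("Hello, gzip"));
  }
}
```

With a live connection uc, the equivalent call would be
decode(uc.getInputStream( ), uc.getContentEncoding( )).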

15.3.1.4 public long getDate( )

The getDate( ) method returns a
long that tells you when the document was sent, in
milliseconds since midnight, Greenwich Mean Time (GMT), January 1,
1970. You can convert it to a java.util.Date. For
example:

Date documentSent = new Date(uc.getDate( ));

This is the time the document was sent as seen from the server; it
may not agree with the time on your local machine. If the HTTP header
does not include a Date field, getDate( ) returns
0.

15.3.1.5 public long getExpiration( )

Some documents have
server-based expiration dates that indicate when the document should
be deleted from the cache and reloaded from the server.
getExpiration( ) is very similar to
getDate( ), differing only in how the return value
is interpreted. It returns a long indicating the
number of milliseconds after 12:00 A.M., GMT, January 1, 1970, at
which point the document expires. If the HTTP header does not include
an Expires field, getExpiration( ) returns 0,
which means 12:00 A.M., GMT, January 1, 1970. The only reasonable
interpretation of this date is that the document does not expire and
can remain in the cache indefinitely.
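
A cache that follows this interpretation could test freshness along the
lines of the sketch below. The class ExpirationCheck and the method
isExpired( ) are invented for illustration; the only assumption carried
over from the text is that a return value of 0 means "no expiration
date was sent."

```java
// Hypothetical helper, not part of the book's examples
public class ExpirationCheck {

  // expiration is the value returned by getExpiration(); now is the
  // current time. Both are in milliseconds since the epoch.
  public static boolean isExpired(long expiration, long now) {
    // 0 means the server sent no Expires header; following the text,
    // treat that as "never expires"
    if (expiration == 0) return false;
    return now >= expiration;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(isExpired(0L, now));
    System.out.println(isExpired(now - 1000L, now));
    System.out.println(isExpired(now + 60000L, now));
  }
}
```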

15.3.1.6 public long getLastModified( )

The final date
method, getLastModified( ), returns the date on
which the document was last modified. Again, the date is given as the
number of milliseconds since midnight, GMT, January 1, 1970. If the
HTTP header does not include a Last-modified field (and many
don't), this method returns 0.

Example 15-4 reads URLs from the command line and
uses these six methods to print their content type, content length,
content encoding, date of last modification, expiration date, and
current date.

Example 15-4. Return the header

import java.net.*;
import java.io.*;
import java.util.*;

public class HeaderViewer {

  public static void main(String args[]) {
    for (int i=0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        URLConnection uc = u.openConnection( );
        System.out.println("Content-type: " + uc.getContentType( ));
        System.out.println("Content-encoding: "
         + uc.getContentEncoding( ));
        System.out.println("Date: " + new Date(uc.getDate( )));
        System.out.println("Last modified: "
         + new Date(uc.getLastModified( )));
        System.out.println("Expiration date: "
         + new Date(uc.getExpiration( )));
        System.out.println("Content-length: " + uc.getContentLength( ));
      }  // end try
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
      System.out.println( );
    }  // end for
  }  // end main

}  // end HeaderViewer

Here's the result when used to look at http://www.oreilly.com:

% java HeaderViewer http://www.oreilly.com
Content-type: text/html
Content-encoding: null
Date: Mon Oct 18 13:54:52 PDT 1999
Last modified: Sat Oct 16 07:54:02 PDT 1999
Expiration date: Wed Dec 31 16:00:00 PST 1969
Content-length: -1

The content type of the file at http://www.oreilly.com is
text/html. No content encoding was used. The file
was sent on Monday, October 18, 1999 at 1:54 P.M., Pacific Daylight
Time. It was last modified on Saturday, October 16, 1999 at 7:54 A.M.
Pacific Daylight Time and it expires on Wednesday, December 31, 1969
at 4:00 P.M., Pacific Standard Time. Did this document really expire
31 years ago? No. Remember that what's being checked
here is whether the copy in your cache is more recent than 4:00 P.M.
PST, December 31, 1969. If it is, you don't need to
reload it. More to the point, after adjusting for time zone
differences, this date looks suspiciously like 12:00 A.M., Greenwich
Mean Time, January 1, 1970, which happens to be the default if the
server doesn't send an expiration date. (Most
don't.)

Finally, the content length of -1 means that there was no
Content-length header. Many servers don't bother to
provide a Content-length header for text files. However, a
Content-length header should always be sent for a binary file.
Here's the HTTP header you get when you request the
GIF image /image/library/english/10151_space.gif.
Now the server sends a Content-length header with a value of 57.

% java HeaderViewer /image/library/english/10151_space.gif
Content-type: image/gif
Content-encoding: null
Date: Mon Oct 18 14:00:07 PDT 1999
Last modified: Thu Jan 09 12:05:11 PST 1997
Expiration date: Wed Dec 31 16:00:00 PST 1969
Content-length: 57

15.3.2 Retrieving Arbitrary Header Fields

The last six methods requested specific
fields from the header, but there's no theoretical
limit to the number of header fields a message can contain. The next
five methods inspect arbitrary fields in a header. Indeed, the
methods of the last section are just thin wrappers over the methods
discussed here; you can use these methods to get header fields that
Java's designers did not plan for. If the requested
header is found, it is returned. Otherwise, the method returns
null.
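
To see the "thin wrapper" relationship in action, consider the sketch
below. StubConnection is a hypothetical URLConnection subclass written
for this illustration; it overrides only getHeaderField(String) and
serves three values copied from the Apache header at the start of this
section. The sketch assumes the convenience methods inherited from the
standard java.net.URLConnection base class, which delegate to
getHeaderField( ); a protocol-specific subclass could behave
differently.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical URLConnection subclass that serves canned header values
public class StubConnection extends URLConnection {

  private final Map<String, String> headers = new HashMap<String, String>();

  public StubConnection(URL url) {
    super(url);
    // values copied from the sample Apache header earlier in the section
    headers.put("date", "Mon, 18 Oct 1999 20:06:48 GMT");
    headers.put("content-length", "35259");
    headers.put("content-type", "text/html");
  }

  @Override
  public void connect() {
    connected = true;  // nothing to connect to
  }

  // The only header method overridden here; lookups ignore case
  @Override
  public String getHeaderField(String name) {
    return headers.get(name.toLowerCase(Locale.ENGLISH));
  }

  // Convenience factory so callers need not handle MalformedURLException
  public static StubConnection sample() {
    try {
      return new StubConnection(new URL("http://www.example.com/"));
    }
    catch (MalformedURLException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    URLConnection uc = sample();
    // none of these four methods is overridden in StubConnection
    System.out.println(uc.getContentType());
    System.out.println(uc.getContentLength());
    System.out.println(uc.getDate());
    System.out.println(uc.getContentEncoding());
  }
}
```

The first three calls pick up the canned values, and
getContentEncoding( ) returns null because the stub has no such header.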

15.3.2.1 public String getHeaderField(String name)

The getHeaderField() method returns the
value of a named header field. The name of the header is not
case-sensitive and does not include a closing colon. For example, to
get the value of the Content-type and Content-encoding header fields
of a URLConnection object uc,
you could write:

String contentType = uc.getHeaderField("content-type");
String contentEncoding = uc.getHeaderField("content-encoding");

To get the Date, Content-length, or Expires headers,
you'd do the same:

String date = uc.getHeaderField("date");
String expires = uc.getHeaderField("expires");
String contentLength = uc.getHeaderField("Content-length");

These methods all return String, not
int or long as the
getContentLength( ), getExpiration( ), getLastModified( ), and
getDate( ) methods of the last section did. If
you're interested in a numeric value, convert the
String to a long or an
int.

Do not assume the value returned by getHeaderField() is valid. You must check to make sure it is non-null.
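
One way to combine the null check with the numeric conversion is
sketched below. HeaderNumbers and toLong( ) are hypothetical names; the
fallback value is whatever the caller considers a sensible "not
available" marker, such as -1 for a content length.

```java
// Hypothetical helper, not part of the book's examples
public class HeaderNumbers {

  // Converts a header value to a long, returning fallback if the value
  // is null (header absent) or is not a decimal integer.
  public static long toLong(String headerValue, long fallback) {
    if (headerValue == null) return fallback;
    try {
      return Long.parseLong(headerValue.trim());
    }
    catch (NumberFormatException ex) {
      return fallback;
    }
  }

  public static void main(String[] args) {
    System.out.println(toLong("35259", -1));
    System.out.println(toLong(null, -1));
    System.out.println(toLong("unknown", -1));
  }
}
```

With a connection uc, this would be called as
toLong(uc.getHeaderField("content-length"), -1).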

15.3.2.2 public String getHeaderFieldKey(int n)

This method returns the key (that is, the field name: for example,
Content-length or Server) of
the nth header field.
In HTTP, header zero is typically the status line (for example,
HTTP/1.1 200 OK) rather than a real header field, and it has a null key.
The first header is one. For example, to get the sixth key of the header of the
URLConnection uc, you would
write:

String header6 = uc.getHeaderFieldKey(6);

15.3.2.3 public String getHeaderField(int n)

This method returns the value of the nth header
field. In HTTP, header field zero is typically the status line and the
first actual header is one. Example 15-5 uses this method
in conjunction with getHeaderFieldKey( ) to print
the entire HTTP header.

Example 15-5. Print the entire HTTP header

import java.net.*;
import java.io.*;

public class AllHeaders {

  public static void main(String args[]) {
    for (int i=0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        URLConnection uc = u.openConnection( );
        for (int j = 1; ; j++) {
          String header = uc.getHeaderField(j);
          if (header == null) break;
          System.out.println(uc.getHeaderFieldKey(j) + ": " + header);
        }  // end for
      }  // end try
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
      System.out.println( );
    }  // end for
  }  // end main

}  // end AllHeaders

For example, here's the output when this program is
run against http://www.oreilly.com:

% java AllHeaders http://www.oreilly.com
Server: WN/1.15.1
Date: Mon, 18 Oct 1999 21:20:26 GMT
Last-modified: Sat, 16 Oct 1999 14:54:02 GMT
Content-type: text/html
Title: www.oreilly.com -- Welcome to O'Reilly & Associates! 
-- computer  books, software, online publishing
Link: <mailto:webmaster@oreilly.com>; rev="Made"

Besides Date, Last-modified, and Content-type headers, this server
also provides Server, Title, and Link headers. Other servers may have
different sets of headers.

15.3.2.4 public long getHeaderFieldDate(String name, long Default)

This method first retrieves the header field specified by the
name argument and tries to convert the string to a
long that specifies the milliseconds since
midnight, January 1, 1970, GMT. getHeaderFieldDate( ) can be used to
retrieve a header field that represents a
date: for example, the Expires, Date, or Last-modified headers. To
convert the string to a long, getHeaderFieldDate( ) uses the static
parse( ) method of java.util.Date. The parse( ) method does a decent
job of understanding and converting
most common date formats, but it can be stumped; for instance,
if you ask for a header field that contains something other than a
date. If parse( ) doesn't
understand the date or if getHeaderFieldDate( ) is
unable to find the requested header field,
getHeaderFieldDate( ) returns the
default argument. For example:

Date expires = new Date(uc.getHeaderFieldDate("expires", 0));
long lastModified = uc.getHeaderFieldDate("last-modified", 0);
Date now = new Date(uc.getHeaderFieldDate("date", 0));

You can use the methods of the java.util.Date
class to convert the long to a
String.
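
The toString( ) method of java.util.Date formats the time in the local
time zone, which is why the HeaderViewer output earlier shows PDT and
PST. If you would rather print the value the way it appears in an HTTP
header, one option is java.text.SimpleDateFormat rather than Date
itself. The sketch below assumes that approach; the class HeaderDates
and the method toGmtString( ) are hypothetical names.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the book's examples
public class HeaderDates {

  // Formats milliseconds since the epoch in the style used by HTTP
  // date headers, always in GMT and with English day and month names
  public static String toGmtString(long millis) {
    SimpleDateFormat format
     = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
    format.setTimeZone(TimeZone.getTimeZone("GMT"));
    return format.format(new Date(millis));
  }

  public static void main(String[] args) {
    // 940277208000L corresponds to the Date field of the sample
    // Apache header at the start of this section
    System.out.println(toGmtString(940277208000L));
  }
}
```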

15.3.2.5 public int getHeaderFieldInt(String name, int Default)

This method retrieves the value of the header field
name and tries to convert it to an
int. If it fails, either because it
can't find the requested header field or because
that field does not contain a recognizable integer,
getHeaderFieldInt( ) returns the
default argument. This method is often used to
retrieve the Content-length field. For example, to
get the content length from a URLConnection
uc, you would write:

int contentLength = uc.getHeaderFieldInt("content-length", -1);

In this code fragment, getHeaderFieldInt( )
returns -1 if the Content-length header isn't
present.