Java Network Programming (3rd ed) [Electronic resources]

Harold, Elliotte Rusty

7.1 The URL Class

The java.net.URL class is an abstraction of a Uniform Resource Locator such as http://www.hamsterdance.com/. It is declared as a final class that extends Object:

public final class URL extends Object implements Serializable

Although storing a URL as a string would be trivial, it is helpful to think of URLs as objects with fields that include the scheme (a.k.a. the protocol), hostname, port, path, query string, and fragment identifier (a.k.a. the ref), each of which may be set independently. Indeed, this is almost exactly how the java.net.URL class is organized, though the details vary a little between different versions of Java.

The fields of java.net.URL are only visible to other members of the java.net package; classes that aren't in java.net can't access a URL's fields directly. However, you can set these fields using the URL constructors and retrieve their values using the various getter methods (getHost( ), getPort(), and so on). URLs are effectively immutable. After a URL object has been constructed, its fields do not change. This has the side effect of making them thread-safe.

7.1.1 Creating New URLs

Unlike the InetAddress objects in Chapter 6, you can construct instances of java.net.URL. There are six constructors, differing in the information they require. Which constructor you use depends on the information you have and the form it's in. All these constructors throw a MalformedURLException if you try to create a URL for an unsupported protocol and may throw a MalformedURLException if the URL is syntactically incorrect.

Exactly which protocols are supported is implementation-dependent. The only protocols that have been available in all major virtual machines are http and file, and the latter is notoriously flaky. Java 1.5 also requires virtual machines to support https, jar, and ftp; many virtual machines prior to Java 1.5 support these three as well. Most virtual machines also support mailto and gopher as well as some custom protocols like doc, netdoc, systemresource, and verbatim used internally by Java. The Netscape virtual machine supports the http, file, ftp, mailto, telnet, ldap, and gopher protocols. The Microsoft virtual machine supports http, file, ftp, https, mailto, gopher, doc, and systemresource, but not telnet, netdoc, jar, or verbatim. Of course, support for all these protocols is limited in applets by the security policy. For example, just because an untrusted applet can construct a URL object from a file URL does not mean that the applet can actually read the file the URL refers to. Just because an untrusted applet can construct a URL object from an HTTP URL that points to a third-party web site does not mean that the applet can connect to that site.

If the protocol you need isn't supported by a particular VM, you may be able to install a protocol handler for that scheme. This is subject to a number of security checks in applets and is really practical only for applications. Other than verifying that it recognizes the URL scheme, Java does not make any checks about the correctness of the URLs it constructs. The programmer is responsible for making sure that URLs created are valid. For instance, Java does not check that the hostname in an HTTP URL does not contain spaces or that the query string is x-www-form-URL-encoded. It does not check that a mailto URL actually contains an email address. Java does not check the URL to make sure that it points at an existing host or that it meets any other requirements for URLs. You can create URLs for hosts that don't exist and for hosts that do exist but that you won't be allowed to connect to.
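The point that construction checks little beyond the scheme can be sketched as follows. This is a minimal illustration, not code from the book: the class name SchemeCheckSketch, the helper constructs( ), the host.invalid host, and the bogusscheme scheme are all made up for the example, and it assumes a JDK in which the URL constructors are still available (recent JDKs mark them deprecated).

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SchemeCheckSketch {

  // Returns true if the URL constructor accepts the string and false if it
  // throws MalformedURLException. No network access takes place.
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static boolean constructs(String spec) {
    try {
      new URL(spec);
      return true;
    }
    catch (MalformedURLException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    String[] specs = {
      "http://host.invalid/",                     // a host that cannot exist
      "http://host.invalid/search?q=not encoded", // query string is not encoded
      "bogusscheme://host.invalid/"               // hypothetical scheme, no handler
    };
    for (String spec : specs) {
      System.out.println(spec + " -> " + constructs(spec));
    }
  }
}
```

Only the last string should be rejected, because only its scheme lacks a protocol handler; the other two are accepted even though neither could ever be retrieved.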

7.1.1.1 Constructing a URL from a string

The simplest URL constructor just takes an absolute URL in string form as its single argument:

public URL(String url) throws MalformedURLException

Like all constructors, this may only be called after the new operator, and like all URL constructors, it can throw a MalformedURLException. The following code constructs a URL object from a String, catching the exception that might be thrown:

try {
  URL u = new URL("http://www.audubon.org/");
}
catch (MalformedURLException ex)  {
  System.err.println(ex);
}

Example 7-1 is a simple program for determining which protocols a virtual machine supports. It attempts to construct a URL object for each of 16 protocols (9 standard protocols, 3 custom protocols for various Java APIs, and 4 undocumented protocols used internally by HotJava). If the constructor succeeds, you know the protocol is supported. Otherwise, a MalformedURLException is thrown and you know the protocol is not supported.

Example 7-1. ProtocolTester

/* Which protocols does a virtual machine support? */
import java.net.*;

public class ProtocolTester {

  public static void main(String[] args) {

    // hypertext transfer protocol
    testProtocol("http://www.adc.org");
    // secure http
    testProtocol("https://www.amazon.com/exec/obidos/order2/");
    // file transfer protocol
    testProtocol("ftp://metalab.unc.edu/pub/languages/java/javafaq/");
    // Simple Mail Transfer Protocol (placeholder address)
    testProtocol("mailto:user@example.com");
    // telnet
    testProtocol("telnet://dibner.poly.edu/");
    // local file access
    testProtocol("file:///etc/passwd");
    // gopher
    testProtocol("gopher://gopher.anc.org.za/");
    // Lightweight Directory Access Protocol
    testProtocol(
     "ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US?postalAddress");
    // JAR
    testProtocol(
     "jar:http://cafeaulait.org/books/javaio/ioexamples/javaio.jar!"
     + "/com/macfaq/io/StreamCopier.class");
    // NFS, Network File System
    testProtocol("nfs://utopia.poly.edu/usr/tmp/");
    // a custom protocol for JDBC
    testProtocol("jdbc:mysql://luna.metalab.unc.edu:3306/NEWS");
    // rmi, a custom protocol for remote method invocation
    testProtocol("rmi://metalab.unc.edu/RenderEngine");
    // custom protocols for HotJava
    testProtocol("doc:/UsersGuide/release.html");
    testProtocol("netdoc:/UsersGuide/release.html");
    testProtocol("systemresource://www.adc.org/+/index.html");
    testProtocol("verbatim:http://www.adc.org/");

  }

  private static void testProtocol(String url) {

    try {
      URL u = new URL(url);
      System.out.println(u.getProtocol( ) + " is supported");
    }
    catch (MalformedURLException ex) {
      // the scheme is everything before the first colon
      String protocol = url.substring(0, url.indexOf(':'));
      System.out.println(protocol + " is not supported");
    }

  }

}

The results of this program depend on which virtual machine runs it. Here are the results from Java 1.4.1 on Mac OS X 10.2, which turns out to support all the protocols except Telnet, LDAP, RMI, NFS, and JDBC:

% java ProtocolTester
http is supported
https is supported
ftp is supported
mailto is supported
telnet is not supported
file is supported
gopher is supported
ldap is not supported
jar is supported
nfs is not supported
jdbc is not supported
rmi is not supported
doc is supported
netdoc is supported
systemresource is supported
verbatim is supported

Results using Sun's Linux 1.4.2 virtual machine were identical. Other 1.4 virtual machines derived from the Sun code will show similar results. Java 1.2 and later are likely to be the same except for maybe HTTPS, which was only recently added to the standard distribution. VMs that are not derived from the Sun codebase may vary somewhat in which protocols they support. For example, here are the results of running ProtocolTester with the open source Kaffe VM 1.1.1:

% java ProtocolTester
http is supported
https is not supported
ftp is supported
mailto is not supported
telnet is not supported
file is supported
gopher is not supported
ldap is not supported
jar is supported
nfs is not supported
jdbc is not supported
rmi is not supported
doc is not supported
netdoc is not supported
systemresource is not supported
verbatim is not supported

The nonsupport of RMI and JDBC is actually a little deceptive; in fact, the JDK does support these protocols. However, that support is through various parts of the java.rmi and java.sql packages, respectively. These protocols are not accessible through the URL class like the other supported protocols (although I have no idea why Sun chose to wrap up RMI and JDBC parameters in URL clothing if it wasn't intending to interface with these via Java's quite sophisticated mechanism for handling URLs).

7.1.1.2 Constructing a URL from its component parts

The second constructor builds a URL from three strings specifying the protocol, the hostname, and the file:

public URL(String protocol, String hostname, String file)
  throws MalformedURLException

This constructor sets the port to -1 so the default port for the protocol will be used. The file argument should begin with a slash and include a path, a filename, and optionally a fragment identifier. Forgetting the initial slash is a common mistake, and one that is not easy to spot. Like all URL constructors, it can throw a MalformedURLException. For example:

try {
  URL u = new URL("http", "www.eff.org", "/blueribbon.html#intro");
}
catch (MalformedURLException ex)  {
  // All VMs should recognize http
}

This creates a URL object that points to http://www.eff.org/blueribbon.html#intro, using the default port for the HTTP protocol (port 80). The file specification includes a reference to a named anchor. The code catches the exception that would be thrown if the virtual machine did not support the HTTP protocol. However, this shouldn't happen in practice.

For the rare occasions when the default port isn't correct, the next constructor lets you specify the port explicitly as an int:

public URL(String protocol, String host, int port, String file)
  throws MalformedURLException

The other arguments are the same as for the URL(String protocol, String host, String file) constructor and carry the same caveats. For example:

try {
  URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
}
catch (MalformedURLException ex)  {
  System.err.println(ex);
}

This code creates a URL object that points to http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying port 8000 explicitly.
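The two component constructors can be compared side by side in a small sketch. The class name ComponentURLSketch and the fromParts( ) helpers are illustrative names introduced here, and the helpers convert the checked exception into an unchecked one purely to keep the example short.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ComponentURLSketch {

  // Builds a URL from protocol, host, and file; the port is left at -1.
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static URL fromParts(String protocol, String host, String file) {
    try {
      return new URL(protocol, host, file);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // Same as above, but with an explicit port number.
  @SuppressWarnings("deprecation")
  public static URL fromParts(String protocol, String host, int port, String file) {
    try {
      return new URL(protocol, host, port, file);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    URL u1 = fromParts("http", "www.eff.org", "/blueribbon.html#intro");
    System.out.println(u1 + " port=" + u1.getPort() + " ref=" + u1.getRef());

    URL u2 = fromParts("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
    System.out.println(u2 + " port=" + u2.getPort());
  }
}
```

Note that the fragment identifier passed as part of the file argument ends up in the ref, not in the path.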

Example 7-2 is an alternative protocol tester that can run as an applet, making it useful for testing support of browser virtual machines. It uses the three-argument constructor rather than the one-argument constructor in Example 7-1. It also stores the schemes to be tested in an array and uses the same host and file for each scheme. This produces seriously malformed URLs like mailto://www.peacefire.org/bypass/SurfWatch/, once again demonstrating that all Java checks for at object construction is whether it recognizes the scheme, not whether the URL is appropriate.

Example 7-2. A protocol tester applet

import java.net.*;
import java.applet.*;
import java.awt.*;

public class ProtocolTesterApplet extends Applet {

  TextArea results = new TextArea( );

  public void init( ) {
    this.setLayout(new BorderLayout( ));
    this.add("Center", results);
  }

  public void start( ) {

    String host = "www.peacefire.org";
    String file = "/bypass/SurfWatch/";

    String[] schemes = {"http",   "https",   "ftp",  "mailto",
                        "telnet", "file",    "ldap", "gopher",
                        "jdbc",   "rmi",     "jndi", "jar",
                        "doc",    "netdoc",  "nfs",  "verbatim",
                        "finger", "daytime", "systemresource"};

    for (int i = 0; i < schemes.length; i++) {
      try {
        // only the scheme decides whether construction succeeds
        URL u = new URL(schemes[i], host, file);
        results.append(schemes[i] + " is supported\r\n");
      }
      catch (MalformedURLException ex) {
        results.append(schemes[i] + " is not supported\r\n");
      }
    }

  }

}

Figure 7-1 shows the results of Example 7-2 in Mozilla 1.4 with Java 1.4 installed. This browser supports HTTP, HTTPS, FTP, mailto, file, gopher, doc, netdoc, verbatim, systemresource, and jar but not ldap, Telnet, jdbc, rmi, jndi, finger, or daytime.

Figure 7-1. The ProtocolTesterApplet running in Mozilla 1.4

7.1.1.3 Constructing relative URLs

This constructor builds an absolute URL from a relative URL and a base URL:

public URL(URL base, String relative) throws MalformedURLException

For instance, you may be parsing an HTML document at http://www.ibiblio.org/javafaq/index.html and encounter a link to a file called mailinglists.html with no further qualifying information. In this case, you use the URL to the document that contains the link to provide the missing information. The constructor computes the new URL as http://www.ibiblio.org/javafaq/mailinglists.html. For example:

try {
  URL u1 = new URL("http://www.ibiblio.org/javafaq/index.html");
  URL u2 = new URL (u1, "mailinglists.html");
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}

The filename is removed from the path of u1 and the new filename mailinglists.html is appended to make u2. This constructor is particularly useful when you want to loop through a list of files that are all in the same directory. You can create a URL for the first file and then use this initial URL to create URL objects for the other files by substituting their filenames. You also use this constructor when you want to create a URL relative to the applet's document base or code base, which you retrieve using the getDocumentBase( ) or getCodeBase( ) methods of the java.applet.Applet class. Example 7-3 is a very simple applet that uses getDocumentBase( ) to create a new URL object:

Example 7-3. A URL relative to the web page

import java.net.*;
import java.applet.*;
import java.awt.*;

public class RelativeURLTest extends Applet {

  public void init ( ) {

    try {
      URL base = this.getDocumentBase( );
      URL relative = new URL(base, "mailinglists.html");
      this.setLayout(new GridLayout(2,1));
      this.add(new Label(base.toString( )));
      this.add(new Label(relative.toString( )));
    }
    catch (MalformedURLException ex) {
      this.add(new Label("This shouldn't happen!"));
    }

  }

}

Of course, the output from this applet depends on the document base. In the run shown in Figure 7-2, the original URL (the document base) refers to the file RelativeURL.html; the constructor creates a new URL that points to the file mailinglists.html in the same directory.

Figure 7-2. A base and a relative URL

When using this constructor with getDocumentBase(), you frequently put the call to getDocumentBase( ) inside the constructor, like this:

URL relative = new URL(this.getDocumentBase( ), "mailinglists.html");
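Outside an applet, the same resolution rule can be tried against a fixed base URL. The following is a small sketch under that assumption; the class name RelativeURLSketch, the resolve( ) helper, and the relative references other than mailinglists.html are illustrative choices rather than examples from the text.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeURLSketch {

  // Resolves a relative reference against a base URL and returns the
  // resulting absolute URL in string form.
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static String resolve(String base, String relative) {
    try {
      return new URL(new URL(base), relative).toString();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String base = "http://www.ibiblio.org/javafaq/index.html";
    String[] relatives = {
      "mailinglists.html", // sibling file in the same directory
      "books/",            // subdirectory of the base's directory
      "../index.html",     // one directory up
      "/nywc/"             // path relative to the server root
    };
    for (String relative : relatives) {
      System.out.println(relative + " -> " + resolve(base, relative));
    }
  }
}
```

In each case the filename index.html is dropped from the base before the relative part is applied, which is why a sibling file lands in /javafaq/ rather than under /javafaq/index.html/.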

7.1.1.4 Specifying a URLStreamHandler // Java 1.2

Two constructors allow you to specify the protocol handler used for the URL. The first constructor builds a relative URL from a base URL and a relative part. The second builds the URL from its component pieces:

public URL(URL base, String relative, URLStreamHandler handler) // 1.2
  throws MalformedURLException
public URL(String protocol, String host, int port, String file, // 1.2
  URLStreamHandler handler) throws MalformedURLException

All URL objects have URLStreamHandler objects to do their work for them. These two constructors change from the default URLStreamHandler subclass for a particular protocol to one of your own choosing. This is useful for working with URLs whose schemes aren't supported in a particular virtual machine as well as for adding functionality that the default stream handler doesn't provide, such as asking the user for a username and password. For example:

URL u = new URL("finger", "utopia.poly.edu", 79, "/marcus", 
new com.macfaq.net.www.protocol.finger.Handler( ));

The com.macfaq.net.www.protocol.finger.Handler class used here will be developed in Chapter 16.
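Until that handler is developed, the mechanism can be illustrated with a toy handler that needs no network at all. This is only a sketch: the demo scheme, the DemoHandler class, the host.invalid host, and the fetch( ) helper are invented for the example and have nothing to do with the finger handler mentioned above. It assumes the code runs as an application with no security manager installed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

public class StreamHandlerSketch {

  // A toy protocol handler: every URL it serves "contains" a short greeting
  // that echoes the URL's path.
  static class DemoHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) {
      return new URLConnection(u) {
        @Override
        public void connect() {
          connected = true; // nothing to open; the data is generated locally
        }
        @Override
        public InputStream getInputStream() {
          byte[] data = ("hello from " + url.getPath())
              .getBytes(StandardCharsets.US_ASCII);
          return new ByteArrayInputStream(data);
        }
      };
    }
  }

  // Builds a URL for the made-up "demo" scheme with an explicit handler,
  // then reads the URL's content back as a string.
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static String fetch(String path) {
    try {
      URL u = new URL("demo", "host.invalid", -1, path, new DemoHandler());
      StringBuilder sb = new StringBuilder();
      try (InputStream in = u.openStream()) {
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
      }
      return sb.toString();
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(fetch("/greeting"));
  }
}
```

Without the handler argument, constructing a demo URL would fail with a MalformedURLException, because no virtual machine ships a handler for that scheme.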

While the other four constructors raise no security issues in and of themselves, these two do because class loader security is closely tied to the various URLStreamHandler classes. Consequently, untrusted applets are not allowed to specify a URLStreamHandler. Trusted applets can do so if they have the NetPermission specifyStreamHandler. However, for reasons that will become apparent in Chapter 16, this is a security hole big enough to drive the Microsoft money train through. Consequently, you should not request this permission or expect it to be granted if you do request it.

7.1.1.5 Other sources of URL objects

Besides the constructors discussed here, a number of other methods in the Java class library return URL objects. You've already seen getDocumentBase( ) from java.applet.Applet. The other common source is getCodeBase( ), also from java.applet.Applet. This works just like getDocumentBase( ), except it returns the URL of the applet itself instead of the URL of the page that contains the applet. Both getDocumentBase( ) and getCodeBase( ) delegate to the java.applet.AppletStub interface, which the environment hosting the applet implements. You're unlikely to implement this interface yourself unless you're building a web browser or applet viewer.

In Java 1.2 and later, the java.io.File class has a toURL( ) method that returns a file URL matching the given file. The static ClassLoader.getSystemResource(String name) method returns a URL from which a single resource can be read. The ClassLoader.getSystemResources(String name) method returns an Enumeration containing a list of URLs from which the named resource can be read. Finally, the instance method getResource(String name) searches the path used by the referenced class loader for a URL to the named resource. The URLs returned by these methods may be file URLs, HTTP URLs, or some other scheme. The name of the resource is a slash-separated list of Java identifiers, such as /com/macfaq/sounds/swale.au or com/macfaq/images/headshot.jpg. The Java virtual machine will attempt to find the requested resource in the class path (potentially including parts of the class path on the web server that an applet was loaded from) or inside a JAR archive.

Java 1.4 adds the URI class, which we'll discuss soon. URIs can be converted into URLs using the toURL( ) method, provided Java has the relevant protocol handler installed.

There are a few other methods that return URL objects here and there throughout the class library, but most are simple getter methods that return only a URL you probably already know because you used it to construct the object in the first place; for instance, the getPage( ) method of javax.swing.JEditorPane and the getURL( ) method of java.net.URLConnection.
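A few of these alternative sources can be exercised in a short sketch. The class and helper names are illustrative, and one substitution should be noted: the file conversion goes through File.toURI( ) followed by URI.toURL( ) rather than File.toURL( ), since later JDKs deprecate the direct method. The resource name passed to the class loader is just a class file that any JDK is assumed to be able to locate.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class URLSourcesSketch {

  // Converts a local filename to a file URL by way of java.net.URI.
  public static URL fromFile(String name) {
    try {
      return new File(name).toURI().toURL();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // Converts a URI to a URL; this works only if a protocol handler
  // for the URI's scheme is installed.
  public static URL fromURI(String uri) {
    try {
      return URI.create(uri).toURL();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(fromFile("swale.au"));
    System.out.println(fromURI("http://www.ibiblio.org/javafaq/"));
    // Ask the system class loader where it finds a core class file;
    // the scheme of the answer depends on the JDK in use.
    System.out.println(ClassLoader.getSystemResource("java/lang/Object.class"));
  }
}
```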

7.1.2 Splitting a URL into Pieces

URLs are composed of five pieces:

The scheme, also known as the protocol

The authority

The path

The fragment identifier, also known as the section or ref

The query string

For example, given the URL http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc, the scheme is http, the authority is www.ibiblio.org, the path is /javafaq/books/jnp/index.html, the fragment identifier is toc, and the query string is isbn=1565922069. However, not all URLs have all these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc2396.html has a scheme, an authority, and a path, but no fragment identifier or query string.

The authority may further be divided into the user info, the host, and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the authority is admin@www.blackstar.com:8080. This has the user info admin, the host www.blackstar.com, and the port 8080.

7.1.2.1 public String getProtocol( )

The getProtocol( ) method returns a String containing the scheme of the URL, e.g., "http", "https", or "file". For example:

URL page = this.getCodeBase( );
System.out.println("This applet was downloaded via " 
+ page.getProtocol( ));

7.1.2.2 public String getHost( )

The getHost( ) method returns a String containing the hostname of the URL. For example:

URL page = this.getCodeBase( );
System.out.println("This applet was downloaded from " + page.getHost( ));

The most recent virtual machines get this method right but some older ones, including Sun's JDK 1.3.0, may return a host string that is not necessarily a valid hostname or address. In particular, URLs that incorporate usernames sometimes include the user info in the host. For example, consider this code fragment:

URL u = new URL("ftp://anonymous:anonymous@wuarchive.wustl.edu/");
String host = u.getHost( );

Java 1.3 sets host to anonymous:anonymous@wuarchive.wustl.edu, not simply wuarchive.wustl.edu. Java 1.4 would return wuarchive.wustl.edu instead.

7.1.2.3 public int getPort( )

The getPort( ) method returns the port number specified in the URL as an int. If no port was specified in the URL, getPort( ) returns -1 to signify that the URL does not specify the port explicitly, and will use the default port for the protocol. For example, if the URL is http://www.userfriendly.org/, getPort( ) returns -1; if the URL is http://www.userfriendly.org:80/, getPort( ) returns 80. The following code prints -1 for the port number because it isn't specified in the URL:

URL u = new URL("http://www.ncsa.uiuc.edu/demoweb/html-primer.html");
System.out.println("The port part of " + u + " is " + u.getPort( ));

7.1.2.4 public int getDefaultPort( )

The getDefaultPort( ) method returns the default port used for this URL's protocol when none is specified in the URL. If no default port is defined for the protocol, getDefaultPort( ) returns -1. For example, if the URL is http://www.userfriendly.org/, getDefaultPort( ) returns 80.
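Taken together, getPort( ) and getDefaultPort( ) tell you which port a connection would actually use. A minimal sketch of that combination follows; the class name EffectivePortSketch and the effectivePort( ) helper are hypothetical names chosen for the illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class EffectivePortSketch {

  // Returns the explicit port if the URL has one; otherwise falls back to
  // the default port for the URL's protocol (which may itself be -1).
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static int effectivePort(String spec) {
    try {
      URL u = new URL(spec);
      int port = u.getPort();
      return port != -1 ? port : u.getDefaultPort();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(effectivePort("http://www.userfriendly.org/"));
    System.out.println(effectivePort("http://www.userfriendly.org:8080/"));
    System.out.println(effectivePort("https://www.userfriendly.org/"));
  }
}
```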

7.1.2.5 public String getFile( )

The getFile( ) method returns a String that contains the path portion of a URL; remember that Java does not break a URL into separate path and file parts. Everything from the first slash (/) after the hostname until the character preceding the # sign that begins a fragment identifier is considered to be part of the file. For example:

URL page = this.getDocumentBase( );
System.out.println("This page's path is " + page.getFile( ));

If the URL does not have a file part, Java 1.2 and earlier append a slash to the URL and return the slash as the filename. For example, if the URL is http://www.slashdot.org (rather than something like http://www.slashdot.org/), getFile( ) returns /. Java 1.3 and later simply set the file to the empty string.

7.1.2.6 public String getPath( ) // Java 1.3

The getPath( ) method, available only in Java 1.3 and later, is a near synonym for getFile( ); that is, it returns a String containing the path and file portion of a URL. However, unlike getFile( ), it does not include the query string in the String it returns, just the path.

Note that the getPath( ) method does not return only the directory path and getFile( ) does not return only the filename, as you might expect. Both getPath( ) and getFile( ) return the full path and filename. The only difference is that getFile( ) also returns the query string and getPath( ) does not.
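The difference is easiest to see on a URL that has both a query string and a fragment identifier. The sketch below is illustrative only; the class name FileVersusPathSketch and its two helpers are not part of the original text.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FileVersusPathSketch {

  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  private static URL parse(String spec) {
    try {
      return new URL(spec);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // The path plus the query string, but never the fragment identifier.
  public static String fileOf(String spec) {
    return parse(spec).getFile();
  }

  // The path alone, without query string or fragment identifier.
  public static String pathOf(String spec) {
    return parse(spec).getPath();
  }

  public static void main(String[] args) {
    String spec =
        "http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc";
    System.out.println("getFile(): " + fileOf(spec));
    System.out.println("getPath(): " + pathOf(spec));
  }
}
```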

7.1.2.7 public String getRef( )

The getRef( ) method returns the fragment identifier part of the URL. If the URL doesn't have a fragment identifier, the method returns null. In the following code, getRef( ) returns the string xtocid1902914:

URL u = new URL(
 "http://www.ibiblio.org/javafaq/javafaq.html#xtocid1902914");
System.out.println("The fragment ID of " + u + " is " + u.getRef( ));

7.1.2.8 public String getQuery( ) // Java 1.3

The getQuery( ) method returns the query string of the URL. If the URL doesn't have a query string, the method returns null. In the following code, getQuery() returns the string category=Piano:

URL u = new URL(
 "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");
System.out.println("The query string of " + u + " is " + u.getQuery( ));

In Java 1.2 and earlier, you need to extract the query string from the value returned by getFile( ) instead.
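One way to do that extraction is to split the getFile( ) value at the first question mark. The following is a sketch of that approach, not code from the book; the class name QueryFromFileSketch and the queryOf( ) helper are invented for the illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class QueryFromFileSketch {

  // Returns whatever follows the first '?' in getFile(), or null if the
  // file part contains no question mark, mirroring what getQuery() reports.
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static String queryOf(String spec) {
    try {
      String file = new URL(spec).getFile();
      int q = file.indexOf('?');
      return q == -1 ? null : file.substring(q + 1);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(queryOf(
        "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano"));
    System.out.println(queryOf("http://www.oreilly.com/"));
  }
}
```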

7.1.2.9 public String getUserInfo( ) // Java 1.3

Some URLs include usernames and occasionally even password information. This information comes after the scheme and before the host; an @ symbol delimits it. For instance, in the URL http://elharo@java.oreilly.com/, the user info is elharo. Some URLs also include passwords in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/, the user info is mp3:secret. However, most of the time including a password in a URL is a security risk. If the URL doesn't have any user info, getUserInfo( ) returns null. Mailto URLs may not behave like you expect. In a mailto URL, the address that follows the scheme is the path, not the user info and the host.

7.1.2.10 public String getAuthority( ) // Java 1.3

Between the scheme and the path of a URL, you'll find the authority. The term authority is taken from the Uniform Resource Identifier specification (RFC 2396), where this part of the URI indicates the authority that resolves the resource. In the most general case, the authority includes the user info, the host, and the port. However, not all URLs have all parts. For instance, in the URL http://conferences.oreilly.com/java/speakers/, the authority is simply the hostname conferences.oreilly.com. The getAuthority( ) method returns the authority as it exists in the URL, with or without the user info and port.
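The relationship between the authority and its parts can be checked with a short sketch. The class name AuthoritySketch and the describe( ) helper are hypothetical; the two URLs are the ones used as examples in this section.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class AuthoritySketch {

  // Summarizes the authority of a URL together with the three parts it
  // may contain: user info, host, and port.
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  public static String describe(String spec) {
    try {
      URL u = new URL(spec);
      return "authority=" + u.getAuthority()
          + " userInfo=" + u.getUserInfo()
          + " host=" + u.getHost()
          + " port=" + u.getPort();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(describe("http://admin@www.blackstar.com:8080/"));
    System.out.println(describe("http://conferences.oreilly.com/java/speakers/"));
  }
}
```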

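Example 7-4 uses seven of these methods (getProtocol( ), getUserInfo( ), getHost( ), getPort( ), getPath( ), getRef( ), and getQuery( )) to split URLs entered on the command line into their component parts. This program requires Java 1.3 or later.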

Example 7-4. The parts of a URL

import java.net.*;

public class URLSplitter {

  public static void main(String args[]) {

    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        System.out.println("The URL is " + u);
        System.out.println("The scheme is " + u.getProtocol( ));
        System.out.println("The user info is " + u.getUserInfo( ));

        String host = u.getHost( );
        if (host != null) {
          // older VMs may leave the user info in the host; strip it
          int atSign = host.indexOf('@');
          if (atSign != -1) host = host.substring(atSign+1);
          System.out.println("The host is " + host);
        }
        else {
          System.out.println("The host is null.");
        }

        System.out.println("The port is " + u.getPort( ));
        System.out.println("The path is " + u.getPath( ));
        System.out.println("The ref is " + u.getRef( ));
        System.out.println("The query string is " + u.getQuery( ));
      }  // end try
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      System.out.println( );
    }  // end for

  }  // end main

}  // end URLSplitter

Here's the result of running this against several of the URL examples in this chapter:

% java URLSplitter \
 http://www.ncsa.uiuc.edu/demoweb/html-primer.html#A1.3.3.3 \
 ftp://mp3:mp3@138.247.121.61:21000/c%3a/ \
 http://www.oreilly.com \
 http://www.ibiblio.org/nywc/compositions.phtml?category=Piano \
 http://admin@www.blackstar.com:8080/
The URL is http://www.ncsa.uiuc.edu/demoweb/html-primer.html#A1.3.3.3
The scheme is http
The user info is null
The host is www.ncsa.uiuc.edu
The port is -1
The path is /demoweb/html-primer.html
The ref is A1.3.3.3
The query string is null
The URL is ftp://mp3:mp3@138.247.121.61:21000/c%3a/
The scheme is ftp
The user info is mp3:mp3
The host is 138.247.121.61
The port is 21000
The path is /c%3a/
The ref is null
The query string is null
The URL is http://www.oreilly.com
The scheme is http
The user info is null
The host is www.oreilly.com
The port is -1
The path is 
The ref is null
The query string is null
The URL is http://www.ibiblio.org/nywc/compositions.phtml?category=Piano
The scheme is http
The user info is null
The host is www.ibiblio.org
The port is -1
The path is /nywc/compositions.phtml
The ref is null
The query string is category=Piano
The URL is http://admin@www.blackstar.com:8080/
The scheme is http
The user info is admin
The host is www.blackstar.com
The port is 8080
The path is /
The ref is null
The query string is null

7.1.3 Retrieving Data from a URL

Naked URLs aren't very exciting. What's interesting is the data contained in the documents they point to. The URL class has several methods that retrieve data from a URL:

public InputStream openStream( ) throws IOException
public URLConnection openConnection( ) throws IOException
public URLConnection openConnection(Proxy proxy) throws IOException // 1.5
public Object getContent( ) throws IOException
public Object getContent(Class[] classes)  throws IOException // 1.3

These methods differ in that they return the data at the URL as an instance of different classes.

7.1.3.1 public final InputStream openStream( ) throws IOException

The openStream( ) method connects to the resource referenced by the URL, performs any necessary handshaking between the client and the server, and returns an InputStream from which data can be read. The data you get from this InputStream is the raw (i.e., uninterpreted) contents of the file the URL references: ASCII if you're reading an ASCII text file, raw HTML if you're reading an HTML file, binary image data if you're reading an image file, and so forth. It does not include any of the HTTP headers or any other protocol-related information. You can read from this InputStream as you would read from any other InputStream. For example:

try {
  URL u  = new URL("http://www.hamsterdance.com");
  InputStream in = u.openStream( );
  int c;
  while ((c = in.read( )) != -1) System.out.write(c);
}
catch (IOException ex) {
  System.err.println(ex);
}

This code fragment catches an IOException, which also catches the MalformedURLException that the URL constructor can throw, since MalformedURLException subclasses IOException.
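The same reading loop can be tried without a network connection by pointing it at a file URL. The sketch below makes that substitution explicitly: it writes a temporary file and reads it back through openStream( ). The class name OpenStreamSketch and the helpers readAll( ) and roundTrip( ) are illustrative, and the bytes are treated as ASCII for simplicity.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OpenStreamSketch {

  // Reads every byte the URL's stream delivers and returns the bytes
  // as text, one char per byte (adequate for ASCII data only).
  public static String readAll(URL u) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (InputStream in = new BufferedInputStream(u.openStream())) {
      int c;
      while ((c = in.read()) != -1) sb.append((char) c);
    }
    return sb.toString();
  }

  // Writes text to a temporary file, then reads it back through a file URL.
  public static String roundTrip(String text) {
    try {
      Path tmp = Files.createTempFile("openstream", ".txt");
      try {
        Files.write(tmp, text.getBytes(StandardCharsets.US_ASCII));
        return readAll(tmp.toUri().toURL());
      }
      finally {
        Files.deleteIfExists(tmp);
      }
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("raw bytes, no headers"));
  }
}
```

The try-with-resources statement closes the stream when reading finishes, which the shorter fragment above leaves out.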

Example 7-5 reads a URL from the command line, opens an InputStream from that URL, chains the resulting InputStream to an InputStreamReader using the default encoding, and then uses InputStreamReader's read( ) method to read successive characters from the file, each of which is printed on System.out. That is, it prints the raw data located at the URL: if the URL references an HTML file, the program's output is raw HTML.

Example 7-5. Download a web page

import java.net.*;
import java.io.*;

public class SourceViewer {

  public static void main (String[] args) {

    if  (args.length > 0) {
      try {
        //Open the URL for reading
        URL u = new URL(args[0]);
        InputStream in = u.openStream( );
        // buffer the input to increase performance
        in = new BufferedInputStream(in);
        // chain the InputStream to a Reader
        Reader r = new InputStreamReader(in);
        int c;
        while ((c = r.read( )) != -1) {
          System.out.print((char) c);
        }
      }
      catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    } //  end if

  } // end main

}  // end SourceViewer

And here are the first few lines of output when SourceViewer downloads http://www.oreilly.com:

% java SourceViewer http://www.oreilly.com
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books, 
software conferences, online publishing</title>
<meta name="keywords" content="O'Reilly, oreilly, computer books, technical 
books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows 
NT, Security, Sys Admin, System Administration, Oracle, PL/SQL, online books,
books online, computer book online, e-books, ebooks, Perl Conference, Open Source
Conference, Java Conference, open source, free software, XML, Mac OS X, .Net, dot
net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP, 
bioinformatics, web services, p2p" />
<meta name="description" content="O'Reilly is a leader in technical and computer book 
documentation, online content, and conferences for UNIX, Perl, Java, Linux, Internet, 
Mac OS X, C, C++, Windows, Windows NT, Security, Sys Admin, System Administration, 
Oracle, Design and Graphics, Online Books, e-books, ebooks, Perl Conference, Java 
Conference, P2P Conference" />

There are quite a few more lines in that web page; if you want to see them, you can fire up your web browser.

The shakiest part of this program is that it blithely assumes that the remote URL is text, which is not necessarily true. It could well be a GIF or JPEG image, an MP3 sound file, or something else entirely. Even if it is text, the document encoding may not be the same as the default encoding of the client system. The remote host and local client may not have the same default character set. As a general rule, for pages that use a character set radically different from ASCII, the HTML will include a META tag in the header specifying the character set in use. For instance, this META tag specifies the Big-5 encoding for Chinese:

<meta http-equiv="Content-Type" content="text/html; charset=big5">

An XML document will likely have an XML declaration instead:

<?xml version="1.0" encoding="Big5"?>

In practice, there's no easy way to get at this information other than by parsing the file and looking for a header like this one, and even that approach is limited. Many HTML files hand-coded in Latin alphabets don't have such a META tag. Since Windows, the Mac, and most Unixes have somewhat different interpretations of the characters from 128 to 255, the extended characters in these documents do not translate correctly on platforms other than the one on which they were created.

And as if this isn't confusing enough, the HTTP header that precedes the actual document is likely to have its own encoding information, which may completely contradict what the document itself says. You can't read this header using the URL class, but you can with the URLConnection object returned by the openConnection( ) method. Encoding detection and declaration is one of the thornier parts of the architecture of the Web.
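As a rough illustration of reading the encoding from the HTTP header rather than from the document, here is a minimal sketch. The class name CharsetSniffer and its two methods are hypothetical helpers invented for this example, and the ISO-8859-1 fallback is an assumption, not something URLConnection chooses for you. The only library calls relied on are openConnection( ), getContentType( ), and getInputStream( ).

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper for illustration; not part of java.net.
class CharsetSniffer {

  // Returns the charset parameter of a Content-Type value such as
  // "text/html; charset=big5", or the fallback if none is declared.
  static String charsetOf(String contentType, String fallback) {
    if (contentType == null) return fallback;
    for (String param : contentType.split(";")) {
      String p = param.trim();
      // Case-insensitive match on the parameter name
      if (p.regionMatches(true, 0, "charset=", 0, 8)) {
        String value = p.substring(8).trim();
        // Strip optional surrounding double quotes
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? fallback : value;
      }
    }
    return fallback;
  }

  // Opens a Reader that decodes the body using the encoding named in the
  // server's Content-Type header. The fallback encoding is an assumption.
  static Reader open(URL u) throws IOException {
    URLConnection uc = u.openConnection();
    String encoding = charsetOf(uc.getContentType(), "ISO-8859-1");
    return new InputStreamReader(uc.getInputStream(), encoding);
  }
}
```

This only consults the HTTP header. A META tag or XML declaration inside the document may still disagree with it, so treat the result as a best guess.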

7.1.3.2 public URLConnection openConnection( ) throws IOException

The openConnection( ) method opens a socket to the specified URL and returns a URLConnection object. A URLConnection represents an open connection to a network resource. If the call fails, openConnection( ) throws an IOException. For example:

try {
  URL u = new URL("http://www.jennicam.org/");
  try {
    URLConnection uc = u.openConnection( );
    InputStream in = uc.getInputStream( );
    // read from the connection...
  } // end try
  catch (IOException ex) {
    System.err.println(ex);
  }
} // end try
catch (MalformedURLException ex) {
  System.err.println(ex);
}

Use this method when you want to communicate directly with the server. The URLConnection gives you access to everything sent by the server: in addition to the document itself in its raw form (e.g., HTML, plain text, binary image data), you can access all the metadata specified by the protocol. For example, if the scheme is HTTP, the URLConnection lets you access the HTTP headers as well as the raw HTML. The URLConnection class also lets you write data to a URL as well as read from it, for instance, to send email to a mailto URL or to post form data. The URLConnection class will be the primary subject of Chapter 15.

Java 1.5 adds one overloaded variant of this method that specifies the proxy server to pass the connection through:

public URLConnection openConnection(Proxy proxy) throws IOException

This overrides any proxy server set with the usual socksProxyHost, socksProxyPort, http.proxyHost, http.proxyPort, http.nonProxyHosts, and similar system properties. If the protocol handler does not support proxies, the argument is ignored and the connection is made directly if possible.
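A minimal sketch of passing an explicit proxy follows. The class ProxiedConnection and the proxy host proxy.example.com are placeholders invented for this example. Proxy, Proxy.Type.HTTP, and InetSocketAddress.createUnresolved( ) are standard java.net members in Java 1.5 and later.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper for illustration; not part of java.net.
class ProxiedConnection {

  // Describes an HTTP proxy. createUnresolved() avoids a DNS lookup
  // while the Proxy object is being built.
  static Proxy httpProxy(String host, int port) {
    return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
  }

  // Asks the protocol handler for a connection routed through the given proxy.
  static URLConnection open(URL u, Proxy proxy) throws IOException {
    return u.openConnection(proxy);
  }

  public static void main(String[] args) throws IOException {
    // proxy.example.com:8080 is a placeholder, not a real proxy server
    Proxy proxy = httpProxy("proxy.example.com", 8080);
    URLConnection uc = open(new URL("http://www.oreilly.com/"), proxy);
    System.out.println(uc.getClass().getName());
  }
}
```

Passing Proxy.NO_PROXY instead requests a direct connection regardless of the system properties.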

7.1.3.3 public final Object getContent( ) throws IOException

The getContent( ) method is the third way to download data referenced by a URL. The getContent( ) method retrieves the data referenced by the URL and tries to make it into some type of object. If the URL refers to some kind of text object such as an ASCII or HTML file, the object returned is usually some sort of InputStream. If the URL refers to an image such as a GIF or a JPEG file, getContent( ) usually returns a java.awt.ImageProducer (more specifically, an instance of a class that implements the ImageProducer interface). What unifies these two disparate classes is that they are not the thing itself but a means by which a program can construct the thing:

try {
  URL u = new URL("http://mesola.obspm.fr/");
  Object o = u.getContent( );
  // cast the Object to the appropriate type
  // work with the Object...
}
catch (Exception ex) {
  System.err.println(ex);
}

getContent( ) operates by looking at the Content-type field in the MIME header of the data it gets from the server. If the server does not use MIME headers or sends an unfamiliar Content-type, getContent( ) returns some sort of InputStream with which the data can be read. An IOException is thrown if the object can't be retrieved. Example 7-6 demonstrates this.

Example 7-6. Download an object

import java.net.*;
import java.io.*;
public class ContentGetter {
  public static void main (String[] args) {
    if (args.length > 0) {
      // Open the URL for reading
      try {
        URL u = new URL(args[0]);
        try {
          Object o = u.getContent( );
          System.out.println("I got a " + o.getClass( ).getName( ));
        } // end try
        catch (IOException ex) {
          System.err.println(ex);
        }
      } // end try
      catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      }
    } // end if
  } // end main
} // end ContentGetter

Here's the result of trying to get the content of http://www.oreilly.com:

% java ContentGetter http://www.oreilly.com/
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

The exact class may vary from one version of Java to the next (in earlier versions, it's been java.io.PushbackInputStream or sun.net.www.http.KeepAliveStream) but it should be some form of InputStream.

Here's what you get when you try to load a header image from that page:

% java ContentGetter http://www.oreilly.com/graphics_new/animation.gif
I got a sun.awt.image.URLImageSource

Here's what happens when you try to load a Java applet using getContent( ):

% java ContentGetter http://www.cafeaulait.org/RelativeURLTest.class
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

Here's what happens when you try to load an audio file using getContent( ):

% java ContentGetter http://www.cafeaulait.org/course/week9/spacemusic.au
I got a sun.applet.AppletAudioClip

The last result is the most unusual because it is as close as the Java core API gets to a class that represents a sound file. It's not just an interface through which you can load the sound data.

This example demonstrates the biggest problem with using getContent( ): it's hard to predict what kind of object you'll get. You could get some kind of InputStream or an ImageProducer or perhaps an AudioClip; it's easy to check using the instanceof operator. This information should be enough to let you read a text file or display an image.
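The instanceof check might look something like the following sketch. ContentKind and describe( ) are hypothetical names introduced only for this illustration. The sketch covers the stream and image cases and leaves out AudioClip, which lives in the java.applet package.

```java
import java.awt.image.ImageProducer;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of the Java API.
class ContentKind {

  // Classifies an object returned by URL.getContent() by testing its type.
  static String describe(Object o) {
    if (o instanceof InputStream) {
      return "stream";   // read it like any other input stream
    }
    if (o instanceof ImageProducer) {
      return "image";    // hand it to Component.createImage() to display it
    }
    // Anything else: report the class so the caller can decide what to do
    return "unknown: " + (o == null ? "null" : o.getClass().getName());
  }
}
```

A caller would pass the result of u.getContent( ) to describe( ) and branch on the answer before casting.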

7.1.3.4 public final Object getContent(Class[] classes) throws IOException // Java 1.3

Starting in Java 1.3, it is possible for a content handler to provide different views of an object. This overloaded variant of the getContent( ) method lets you choose what class you'd like the content to be returned as. The method attempts to return the URL's content as one of the requested classes, trying them in the order they appear in the array. For instance, if you prefer an HTML file to be returned as a String, but your second choice is a Reader and your third choice is an InputStream, write:

URL u = new URL("http://www.nwu.org");
Class[] types = new Class[3];
types[0] = String.class;
types[1] = Reader.class;
types[2] = InputStream.class;
Object o = u.getContent(types);

You then have to test for the type of the returned object using instanceof. For example:

if (o instanceof String) {
  System.out.println(o);
}
else if (o instanceof Reader) {
  int c;
  Reader r = (Reader) o;
  while ((c = r.read( )) != -1) System.out.print((char) c);
}
else if (o instanceof InputStream) {
  int c;
  InputStream in = (InputStream) o;
  while ((c = in.read( )) != -1) System.out.write(c);
}
else {
  System.out.println("Error: unexpected type " + o.getClass( ));
}

7.1.4 Utility Methods

The URL class contains a couple of utility methods that perform common operations on URLs. The sameFile( ) method determines whether two URLs point to the same document. The toExternalForm( ) method converts a URL object to a string that can be used in an HTML link or a web browser's Open URL dialog.

7.1.4.1 public boolean sameFile(URL other)

The sameFile( ) method tests whether two URL objects point to the same file. If they do, sameFile( ) returns true; otherwise, it returns false. The test that sameFile( ) performs is quite shallow; all it does is compare the corresponding fields for equality. It detects whether the two hostnames are really just aliases for each other. For instance, it can tell that http://www.ibiblio.org/ and http://metalab.unc.edu/ are the same file. However, it cannot tell that http://www.ibiblio.org:80/ and http://metalab.unc.edu/ are the same file or that http://www.cafeconleche.org/ and http://www.cafeconleche.org/index.html are the same file. sameFile( ) is smart enough to ignore the fragment identifier part of a URL, however. Here's a fragment of code that uses sameFile( ) to compare two URLs:

try {
  URL u1 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS");
  URL u2 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD");
  if (u1.sameFile(u2)) {
    System.out.println(u1 + " is the same file as \n" + u2);
  }
  else {
    System.out.println(u1 + " is not the same file as \n" + u2);
  }
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}

The output is:

http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS is the same file as 
http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD

The sameFile( ) method is similar to the equals( ) method of the URL class. The main difference between sameFile( ) and equals( ) is that equals( ) considers the fragment identifier (if any), whereas sameFile( ) does not. The two URLs shown here do not compare equal although they are the same file. Also, any object may be passed to equals( ); only URL objects can be passed to sameFile( ).
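The difference between the two methods can be seen side by side in a short sketch. FragmentComparison and its helper methods are hypothetical, and the host localhost is an assumption chosen so that the comparison does not depend on resolving an outside hostname.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration; not part of java.net.
class FragmentComparison {

  // Wraps the checked exception so the comparison helpers stay short.
  static URL url(String s) {
    try {
      return new URL(s);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // Ignores the fragment identifier
  static boolean sameFile(String a, String b) {
    return url(a).sameFile(url(b));
  }

  // Takes the fragment identifier into account
  static boolean equal(String a, String b) {
    return url(a).equals(url(b));
  }

  public static void main(String[] args) {
    // localhost is an assumed stand-in for a real server name
    String gs = "http://localhost/HTMLPrimer.html#GS";
    String hd = "http://localhost/HTMLPrimer.html#HD";
    System.out.println(sameFile(gs, hd)); // prints true
    System.out.println(equal(gs, hd));    // prints false
  }
}
```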

7.1.4.2 public String toExternalForm( )

The toExternalForm( ) method returns a human-readable String representing the URL. It is identical to the toString( ) method. In fact, all the toString( ) method does is return toExternalForm( ). Therefore, this method is currently redundant and rarely used.

7.1.4.3 public URI toURI( ) throws URISyntaxException // Java 1.5

Java 1.5 adds a toURI( ) method that converts a URL object to an equivalent URI object. We'll take up the URI class shortly. In the meantime, the main thing you need to know is that the URI class provides much more accurate, specification-conformant behavior than the URL class. For operations like absolutization and encoding, you should prefer the URI class where you have the option. In Java 1.4 and later, the URL class should be used primarily for the actual downloading of content from the remote server.
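As a small illustration of handing absolutization over to URI, here is a sketch that converts a URL with toURI( ) and then resolves a relative reference against it. The class name Absolutizer and the www.example.com addresses are invented for this example. toURI( ) and URI.resolve(String) are the only library calls it relies on.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper for illustration; not part of java.net.
class Absolutizer {

  // Converts the base URL to a URI, then resolves the relative reference against it.
  static String resolve(String base, String relative) {
    try {
      URI baseURI = new URL(base).toURI();
      return baseURI.resolve(relative).toString();
    }
    catch (MalformedURLException | URISyntaxException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    // prints http://www.example.com/c.html
    System.out.println(resolve("http://www.example.com/a/b.html", "../c.html"));
  }
}
```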

7.1.5 The Object Methods

URL inherits from java.lang.Object, so it has access to all the methods of the Object class. It overrides three to provide more specialized behavior: equals( ), hashCode( ), and toString( ).

7.1.5.1 public String toString( )

Like all good classes, java.net.URL has a toString( ) method. Example 7-1 through Example 7-5 implicitly called this method when URLs were passed to System.out.println( ). As those examples demonstrated, the String produced by toString( ) is always an absolute URL, such as http://www.cafeaulait.org/javatutorial.html.

It's uncommon to call toString( ) explicitly. Print statements call toString( ) implicitly. Outside of print statements, it's more proper to use toExternalForm( ) instead. If you do call toString( ), the syntax is simple:

URL codeBase = this.getCodeBase( );
String appletURL = codeBase.toString( );

7.1.5.2 public boolean equals(Object o)

An object is equal to a URL only if it is also a URL, both URLs point to the same file as determined by the sameFile( ) method, and both URLs have the same fragment identifier (or both URLs don't have fragment identifiers). Since equals( ) depends on sameFile( ), equals( ) has the same limitations as sameFile( ). For example, http://www.oreilly.com/ is not equal to http://www.oreilly.com/index.html, and http://www.oreilly.com:80/ is not equal to http://www.oreilly.com/. Whether this makes sense depends on whether you think of a URL as a string or as a reference to a particular Internet resource.

Example 7-7 creates URL objects for http://www.ibiblio.org/ and http://metalab.unc.edu/ and tells you if they're the same using the equals() method.

Example 7-7. Are http://www.ibiblio.org and http://metalab.unc.edu the same?

import java.net.*;
public class URLEquality {
  public static void main (String[] args) {
    try {
      URL ibiblio = new URL("http://www.ibiblio.org/");
      URL metalab = new URL("http://metalab.unc.edu/");
      if (ibiblio.equals(metalab)) {
        System.out.println(ibiblio + " is the same as " + metalab);
      }
      else {
        System.out.println(ibiblio + " is not the same as " + metalab);
      }
    }
    catch (MalformedURLException ex) {
      System.err.println(ex);
    }
  }
}

When you run this program, you discover:

% java URLEquality
http://www.ibiblio.org/ is the same as http://metalab.unc.edu/

7.1.5.3 public int hashCode( )

The hashCode( ) method returns an int that is used when URL objects are used as keys in hash tables. Thus, it is called by the various methods of java.util.Hashtable. You rarely need to call this method directly, if ever. Hash codes for two different URL objects are unlikely to be the same, but it is certainly possible; there are far more conceivable URLs than there are four-byte integers.
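To see hashCode( ) and equals( ) working together, consider this sketch, which stores URLs in a java.util.HashSet. UrlSetDemo is a hypothetical class, and localhost is an assumed host chosen so the example does not depend on outside name resolution. Two URLs that are equal collapse into one entry, while a URL that differs only in its fragment identifier stays distinct.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of java.net.
class UrlSetDemo {

  // Counts how many distinct URL objects the given strings produce,
  // as judged by URL.equals() and URL.hashCode().
  static int distinctCount(String... specs) {
    Set<URL> seen = new HashSet<URL>();
    for (String spec : specs) {
      try {
        seen.add(new URL(spec));
      }
      catch (MalformedURLException ex) {
        throw new IllegalArgumentException(ex);
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    // The first two are equal; the third differs only in its fragment identifier.
    System.out.println(distinctCount(
        "http://localhost/a", "http://localhost/a", "http://localhost/a#frag")); // prints 2
  }
}
```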

7.1.6 Methods for Protocol Handlers

The last method in the URL class I'll just mention briefly here for the sake of completeness: setURLStreamHandlerFactory( ). It's primarily used by protocol handlers that are responsible for new schemes, not by programmers who just want to retrieve data from a URL. We'll discuss it in more detail in Chapter 16.

7.1.6.1 public static synchronized void setURLStreamHandlerFactory(URLStreamHandlerFactory factory)

This method sets the URLStreamHandlerFactory for the application and throws a generic Error if the factory has already been set. A URLStreamHandler is responsible for parsing the URL and then constructing the appropriate URLConnection object to handle the connection to the server. Most of the time this happens behind the scenes.
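The following is only a rough sketch of how a factory plugs in, written under several assumptions: the "demo" scheme, the DemoHandlerFactory class, and its install( ) and fetch( ) helpers are all invented for this illustration, and the handler simply echoes the URL's path instead of talking to a server. Returning null for every other scheme leaves Java's built-in handlers in charge of them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Hypothetical factory for an invented "demo" scheme; illustration only.
class DemoHandlerFactory implements URLStreamHandlerFactory {

  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (!"demo".equals(protocol)) return null; // defer to the built-in handlers
    return new URLStreamHandler() {
      protected URLConnection openConnection(URL u) {
        return new URLConnection(u) {
          public void connect() { connected = true; }
          public InputStream getInputStream() {
            // No network involved: the "document" is just the URL's path
            return new ByteArrayInputStream(url.getPath().getBytes());
          }
        };
      }
    };
  }

  // Installs the factory. Returns false if a factory was already set,
  // which URL signals by throwing a generic Error.
  static boolean install() {
    try {
      URL.setURLStreamHandlerFactory(new DemoHandlerFactory());
      return true;
    }
    catch (Error alreadySet) {
      return false;
    }
  }

  // Reads the whole stream behind a URL into a String.
  static String fetch(String spec) {
    try {
      InputStream in = new URL(spec).openStream();
      StringBuilder sb = new StringBuilder();
      int c;
      while ((c = in.read()) != -1) sb.append((char) c);
      return sb.toString();
    }
    catch (IOException ex) {
      throw new IllegalStateException(ex);
    }
  }
}
```

Because the factory can be set only once per virtual machine, a second call to install( ) returns false rather than replacing the first factory.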