The java.net.URL class is an abstraction of a Uniform
Resource Locator such as http://www.hamsterdance.com/. It is declared as:
public final class URL extends Object implements Serializable
Although storing a URL as a string would be trivial, it is helpful to
think of URLs as objects with fields that include the scheme (a.k.a.
the protocol), hostname, port, path, query string, and fragment
identifier (a.k.a. the ref), each of which may be set independently.
Indeed, this is almost exactly how the
java.net.URL class is organized, though the
details vary a little between different versions of Java. The fields of java.net.URL are only visible to
other members of the java.net package; classes
that aren't in java.net
can't access a
URL's fields directly. However,
you can set these fields using the URL
constructors and retrieve their values using the various getter
methods (getHost( ), getPort(), and so on). URLs are effectively immutable. After a
URL object has been constructed, its fields do not
change. This has the side effect of making them thread-safe.
Unlike the InetAddress
objects in Chapter 6, you can construct
instances of java.net.URL. There are
six
constructors, differing in the information they require. Which
constructor you use depends on the information you have and the form
it's in. All these constructors throw a
MalformedURLException if you try to create a
URL for an unsupported protocol and may throw a
MalformedURLException if the URL is syntactically
incorrect. Exactly which protocols are
supported is implementation-dependent. The only protocols that have
been available in all major virtual machines are http and file, and
the latter is notoriously flaky. Java 1.5 also requires virtual
machines to support https, jar, and ftp; many virtual machines prior
to Java 1.5 support these three as well. Most virtual machines also
support ftp, mailto, and gopher as well as some custom protocols like
doc, netdoc, systemresource, and verbatim used internally by Java.
The Netscape virtual machine supports the http, file, ftp, mailto,
telnet, ldap, and gopher protocols. The Microsoft virtual machine
supports http, file, ftp, https, mailto, gopher, doc, and
systemresource, but not telnet, netdoc, jar, or verbatim. Of course,
support for all these protocols is limited in applets by the security
policy. For example, just because an untrusted applet can construct a
URL object from a file URL does not mean that the
applet can actually read the file the URL refers to. Just because an
untrusted applet can construct a URL object from
an HTTP URL that points to a third-party web site does not mean that
the applet can connect to that site. If the protocol you need isn't supported by a
particular VM, you may be able to install a protocol handler for that
scheme. This is subject to a number of security checks in applets and
is really practical only for applications. Other than verifying that
it recognizes the URL scheme, Java does not make any checks about the
correctness of the URLs it constructs. The programmer is responsible
for making sure that URLs created are valid. For instance, Java does
not check that the hostname in an HTTP URL does not contain spaces or
that the query string is x-www-form-URL-encoded. It does not check
that a mailto URL actually contains an email address. Java does not
check the URL to make sure that it points at an existing host or that
it meets any other requirements for URLs. You can create URLs for
hosts that don't exist and for hosts that do exist
but that you won't be allowed to connect to.
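The point about how little the constructor checks can be tried directly. The following is a minimal sketch rather than part of the chapter's examples; the class name UrlCheckSketch, the helper constructs( ), and both test strings are assumptions chosen for illustration. It relies only on the documented behavior that the constructor throws MalformedURLException for a scheme with no installed handler, and it never touches the network.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheckSketch {

  // Returns true if the URL constructor accepts the string and false
  // if it throws MalformedURLException. No connection is attempted.
  static boolean constructs(String spec) {
    try {
      new URL(spec);
      return true;
    }
    catch (MalformedURLException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    // A well-formed URL for a host that cannot exist (.invalid is reserved)
    System.out.println(constructs("http://no-such-host.invalid/index.html"));
    // A made-up scheme for which no protocol handler is installed
    System.out.println(constructs("nosuchscheme://www.example.com/"));
  }
}
```

The first call succeeds because only the scheme is examined; the second fails for the same reason. Recent JDKs mark the URL constructors as deprecated, so expect a compiler warning there.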
The simplest
URL constructor just takes an absolute URL in
string form as its single argument: Like all constructors, this may only be called after the
new operator, and like all URL
constructors, it can throw a
MalformedURLException. The following code
constructs a URL object from a
String, catching the exception that might be
thrown: Example 7-1 is a simple program for determining
which protocols a virtual machine supports. It attempts to construct
a URL object for each of 16 protocols (9 standard
protocols, 3 custom protocols for various Java APIs, and 4
undocumented protocols used internally by HotJava). If the
constructor succeeds, you know the protocol is supported. Otherwise,
a MalformedURLException is thrown and you know the
protocol is not supported.
The results of this program depend on which virtual machine runs it.
Here are the results from Java 1.4.1 on Mac OS X 10.2, which turns
out to support all the protocols except Telnet, LDAP, RMI, NFS, and
JDBC: Results using Sun's Linux 1.4.2 virtual machine were
identical. Other 1.4 virtual machines derived from the Sun code will
show similar results. Java 1.2 and later are likely to be the same
except for maybe HTTPS, which was only recently added to the standard
distribution. VMs that are not derived from the Sun codebase may vary
somewhat in which protocols they support. For example, here are the
results of running ProtocolTester with the open
source Kaffe VM 1.1.1: The nonsupport of RMI and JDBC is actually a little deceptive; in
fact, the JDK does support these protocols. However, that support is
through various parts of the java.rmi and
java.sql packages, respectively. These protocols
are not accessible through the
URL class like the other supported
protocols (although I have no idea why Sun chose to wrap up RMI and
JDBC parameters in URL clothing if it wasn't
intending to interface with these via Java's quite
sophisticated mechanism for handling URLs).
The second constructor builds
a URL from three strings specifying the protocol,
the hostname, and the file: This constructor sets the port to -1 so the default port for the
protocol will be used. The file argument should
begin with a slash and include a path, a filename, and optionally a
fragment identifier. Forgetting the initial slash is a common
mistake, and one that is not easy to spot. Like all
URL constructors, it can throw a
MalformedURLException. For example: This creates a URL object that points to
http://www.eff.org/blueribbon.html#intro,
using the default port for the HTTP protocol (port 80). The
file specification includes a reference to a named anchor. The code
catches the exception that would be thrown if the virtual machine did
not support the HTTP protocol. However, this
shouldn't happen in practice. For the rare occasions when the default port isn't
correct, the next constructor lets you specify the port explicitly as
an int: The other arguments are the same as for the
URL(String protocol,
String host,
String file) constructor and
carry the same caveats. For example: This code creates a URL object that points
to
http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying
port 8000 explicitly. Example 7-2 is an alternative protocol tester that
can run as an applet, making it useful for testing support of browser
virtual machines. It uses the three-argument constructor rather than
the one-argument constructor in Example 7-1. It also
stores the schemes to be tested in an array and uses the same host
and file for each scheme. This produces seriously malformed URLs like
mailto://www.peacefire.org/bypass/SurfWatch/,
once again demonstrating that all Java checks for at object
construction is whether it recognizes the scheme, not whether the URL
is appropriate.
Figure 7-1 shows the results of Example 7-2 in Mozilla 1.4 with Java 1.4 installed. This
browser supports HTTP, HTTPS, FTP, mailto, file, gopher, doc, netdoc,
verbatim, systemresource, and jar but not ldap, Telnet, jdbc,
rmi, jndi, finger, or daytime.
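To see how the three- and four-argument constructors treat the port and the fragment identifier, a short sketch may help. The class name ComponentUrlSketch, the two wrapper methods, and the www.example.com host and paths are all assumptions made for illustration; the wrappers only convert the checked exception so the calls read more compactly.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ComponentUrlSketch {

  // Wraps the three-argument constructor (protocol, host, file)
  static URL threeArg(String protocol, String host, String file) {
    try {
      return new URL(protocol, host, file);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // Wraps the four-argument constructor (protocol, host, port, file)
  static URL fourArg(String protocol, String host, int port, String file) {
    try {
      return new URL(protocol, host, port, file);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    // No port given, so getPort() reports -1; the part after # becomes the ref
    URL noPort = threeArg("http", "www.example.com", "/docs/page.html#intro");
    System.out.println(noPort + " port=" + noPort.getPort()
     + " ref=" + noPort.getRef());

    // Explicit port, which also shows up in the string form of the URL
    URL withPort = fourArg("http", "www.example.com", 8000, "/docs/");
    System.out.println(withPort + " port=" + withPort.getPort());
  }
}
```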
This constructor builds an
absolute URL from a relative
URL and a base URL: For instance, you may be parsing an HTML document at http://www.ibiblio.org/javafaq/index.html and
encounter a link to a file called
mailinglists.html with no further qualifying
information. In this case, you use the URL to the document that
contains the link to provide the missing information. The constructor
computes the new URL as http://www.ibiblio.org/javafaq/mailinglists.html.
For example: The filename is removed from the path of u1 and
the new filename mailinglists.html is appended
to make u2. This constructor is particularly
useful when you want to loop through a list of files that are all in
the same directory. You can create a URL for the first file and then
use this initial URL to create URL objects for the
other files by substituting their filenames. You also use this
constructor when you want to create a URL relative
to the applet's document base or code base, which
you retrieve using the getDocumentBase() or getCodeBase() methods of the java.applet.Applet
class. Example 7-3 is a very simple applet that uses
getDocumentBase( ) to create a new
URL object:
Of course, the output from this applet depends on the document base.
In the run shown in Figure 7-2, the original
URL (the document base) refers to the file
RelativeURL.html; the constructor creates a new
URL that points to the file
mailinglists.html in the same directory. When using this constructor with getDocumentBase(), you frequently put the call to getDocumentBase(
) inside the constructor, like this:
Two constructors allow you to specify
the protocol handler used for the URL. The first constructor builds a
relative URL from a base URL
and a relative part. The second builds the URL
from its component pieces: All URL objects have
URLStreamHandler objects to do their work for
them. These two constructors change from the default
URLStreamHandler subclass for a particular
protocol to one of your own choosing. This is useful for working with
URLs whose schemes aren't supported in a particular
virtual machine as well as for adding functionality that the default
stream handler doesn't provide, such as asking the
user for a username and password. For example: The com.macfaq.net.www.protocol.finger.Handler
class used here will be developed in Chapter 16. While the other four constructors raise no security issues in and of
themselves, these two do because class loader security is closely
tied to the various URLStreamHandler classes.
Consequently, untrusted applets are not allowed to specify a
URLStreamHandler. Trusted applets can do so if they
have the NetPermission
specifyStreamHandler. However, for reasons that
will become apparent in Chapter 16, this is a
security hole big enough to drive the Microsoft money train through.
Consequently, you should not request this permission or expect it to
be granted if you do request it.
Besides the constructors discussed here, a number of other methods in
the Java class library return URL objects.
You've already seen getDocumentBase(
) from
java.applet.Applet. The other common source is
getCodeBase( ), also from
java.applet.Applet. This works just like
getDocumentBase( ), except it returns the
URL of the applet itself instead of the URL of the
page that contains the applet. Both getDocumentBase(
) and getCodeBase( ) come from the
java.applet.AppletStub interface, which
java.applet.Applet implements.
You're unlikely to implement this interface yourself
unless you're building a web browser or applet
viewer. In Java 1.2 and later, the java.io.File class has
a toURL( ) method that returns a file URL for the given file. The static
ClassLoader.getSystemResource(String name) method returns a
URL from which a single resource can be read. The
ClassLoader.getSystemResources(String name) method
returns an Enumeration containing a list of
URLs from which the named resource can be read.
Finally, the instance method getResource(String
name) searches the path used by the referenced
class loader for a URL to the named resource. The URLs returned by
these methods may be file URLs, HTTP URLs, or some other scheme. The
name of the resource is a slash-separated list of Java identifiers,
such as /com/macfaq/sounds/swale.au or
com/macfaq/images/headshot.jpg. The Java virtual
machine will attempt to find the requested resource in the class
path (potentially including parts of the class path on the web
server that an applet was loaded from) or inside a JAR archive. Java 1.4 adds the URI class, which
we'll discuss soon. URIs can be converted into URLs
using the toURL( ) method, provided Java has the
relevant protocol handler installed. There are a few other methods that return URL
objects here and there throughout the class library, but most are
simple getter methods that return only a URL you probably already
know because you used it to construct the object in the first place;
for instance, the getPage( ) method of
javax.swing.JEditorPane and the getURL(
) method of
java.net.URLConnection.
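Two of these other sources are easy to try from a standalone program. The sketch below is an illustration under a few assumptions: the class name OtherUrlSources, the helper fromFile( ), and the filename notes.txt are invented here, and the file is converted by way of File.toURI( ) and URI.toURL( ) rather than File.toURL( ), since later JDKs deprecate the latter. The exact URL printed for the class loader resource depends on the virtual machine.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

public class OtherUrlSources {

  // Converts a local filename to a file URL by way of java.net.URI.
  // The file does not need to exist for the conversion to succeed.
  static URL fromFile(String name) {
    try {
      return new File(name).toURI().toURL();
    }
    catch (MalformedURLException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    URL fileUrl = fromFile("notes.txt");
    System.out.println(fileUrl.getProtocol() + " URL: " + fileUrl);

    // Ask the system class loader where it would read a resource from
    URL resource = ClassLoader.getSystemResource("java/lang/String.class");
    System.out.println("Resource URL: " + resource);
  }
}
```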
URLs
are composed of five pieces:
The scheme, also known as the protocol
The authority
The path
The fragment identifier, also known as the section or ref
The query string
For example, given the URL http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc,
the scheme is http, the authority is
www.ibiblio.org, the path is
/javafaq/books/jnp/index.html, the fragment
identifier is toc, and the query string is
isbn=1565922069. However, not all URLs have all
these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc2396.html has a
scheme, an authority, and a path, but no fragment identifier or query
string. The authority may further be divided into the user info, the host,
and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the
authority is admin@www.blackstar.com:8080.
The getProtocol( ) method returns a
String containing the scheme of the URL, e.g.,
"http",
"https", or
"file". For example:
The getHost( ) method returns a
String containing the hostname of the URL. For
example: The most recent virtual machines get this method right but some older
ones, including Sun's JDK 1.3.0, may return a host
string that is not necessarily a valid hostname or address. In
particular, URLs that incorporate usernames can trip up older VMs. Given URL u = new URL("ftp://anonymous:anonymous@wuarchive.wustl.edu/");
String host = u.getHost( ); Java 1.3 sets host to
anonymous:anonymous@wuarchive.wustl.edu, not
simply wuarchive.wustl.edu. Java 1.4 would return
wuarchive.wustl.edu instead.
The getPort( ) method
returns the port number specified in the URL as an
int. If no port was specified in the
URL, getPort( ) returns -1 to
signify that the URL does not specify the port explicitly, and will
use the default port for the protocol. For example, if the URL is
http://www.userfriendly.org/,
getPort( ) returns -1; if the URL is http://www.userfriendly.org:80/,
getPort( ) returns 80. The following code prints
-1 for the port number because it isn't specified in
the URL:
The getDefaultPort( ) method
returns the default port used for this
URL's protocol when none is
specified in the URL. If no default port is defined for the protocol,
getDefaultPort( ) returns -1. For example, if the
URL is http://www.userfriendly.org/,
getDefaultPort( ) returns 80.
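The difference between getPort( ) and getDefaultPort( ) is easiest to see side by side. This is a small sketch, not one of the chapter's numbered examples; the class name PortSketch, the helper ports( ), and the https://www.example.com/ URL are assumptions added for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PortSketch {

  // Returns {getPort(), getDefaultPort()} for the given URL string
  static int[] ports(String spec) {
    try {
      URL u = new URL(spec);
      return new int[] {u.getPort(), u.getDefaultPort()};
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String[] specs = {
      "http://www.userfriendly.org/",     // no explicit port
      "http://www.userfriendly.org:80/",  // explicit port
      "https://www.example.com/"          // a different default
    };
    for (int i = 0; i < specs.length; i++) {
      int[] p = ports(specs[i]);
      System.out.println(specs[i] + " port=" + p[0] + " default=" + p[1]);
    }
  }
}
```

A common idiom that follows from this is to fall back to getDefaultPort( ) whenever getPort( ) returns -1.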
The getFile( ) method returns a
String that contains the path portion of a URL;
remember that Java does not break a URL into separate path and
file parts. Everything from the first slash (/) after the hostname until
the character preceding the # sign that begins a fragment identifier
is considered to be part of the file. For example: If the URL does not have a file part, Java 1.2 and earlier append a
slash to the URL and return the slash as the filename. For example,
if the URL is http://www.slashdot.org (rather than
something like http://www.slashdot.org/), getFile() returns /. Java 1.3 and later simply
set the file to the empty string.
The getPath( ) method, available only in Java 1.3
and later, is a near synonym for getFile( ); that
is, it returns a String containing the path and
file portion of a URL. However, unlike getFile( ),
it does not include the query string in the String
it returns, just the path.
7.1.1 Creating New URLs
7.1.1.1 Constructing a URL from a string
public URL(String url) throws MalformedURLException
try {
URL u = new URL("http://www.audubon.org/");
}
catch (MalformedURLException ex) {
System.err.println(ex);
}
Example 7-1. ProtocolTester
/* Which protocols does a virtual machine support? */
import java.net.*;
public class ProtocolTester {
public static void main(String[] args) {
// hypertext transfer protocol
testProtocol("http://www.adc.org");
// secure http
testProtocol("https://www.amazon.com/exec/obidos/order2/");
// file transfer protocol
testProtocol("ftp://metalab.unc.edu/pub/languages/java/javafaq/");
// Simple Mail Transfer Protocol (the address is a placeholder)
testProtocol("mailto:user@example.com");
// telnet
testProtocol("telnet://dibner.poly.edu/");
// local file access
testProtocol("file:///etc/passwd");
// gopher
testProtocol("gopher://gopher.anc.org.za/");
// Lightweight Directory Access Protocol
testProtocol(
"ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US?postalAddress");
// JAR
testProtocol(
"jar:http://cafeaulait.org/books/javaio/ioexamples/javaio.jar!"
+"/com/macfaq/io/StreamCopier.class");
// NFS, Network File System
testProtocol("nfs://utopia.poly.edu/usr/tmp/");
// a custom protocol for JDBC
testProtocol("jdbc:mysql://luna.metalab.unc.edu:3306/NEWS");
// rmi, a custom protocol for remote method invocation
testProtocol("rmi://metalab.unc.edu/RenderEngine");
// custom protocols for HotJava
testProtocol("doc:/UsersGuide/releasel");
testProtocol("netdoc:/UsersGuide/releasel");
testProtocol("systemresource://www.adc.org/+/indexl");
testProtocol("verbatim:http://www.adc.org/");
}
private static void testProtocol(String url) {
try {
URL u = new URL(url);
System.out.println(u.getProtocol( ) + " is supported");
}
catch (MalformedURLException ex) {
String protocol = url.substring(0, url.indexOf(':'));
System.out.println(protocol + " is not supported");
}
}
}
% java ProtocolTester
http is supported
https is supported
ftp is supported
mailto is supported
telnet is not supported
file is supported
gopher is supported
ldap is not supported
jar is supported
nfs is not supported
jdbc is not supported
rmi is not supported
doc is supported
netdoc is supported
systemresource is supported
verbatim is supported
% java ProtocolTester
http is supported
https is not supported
ftp is supported
mailto is not supported
telnet is not supported
file is supported
gopher is not supported
ldap is not supported
jar is supported
nfs is not supported
jdbc is not supported
rmi is not supported
doc is not supported
netdoc is not supported
systemresource is not supported
verbatim is not supported
7.1.1.2 Constructing a URL from its component parts
public URL(String protocol, String hostname, String file)
throws MalformedURLException
try {
URL u = new URL("http", "www.eff.org", "/blueribbon.html#intro");
}
catch (MalformedURLException ex) {
// All VMs should recognize http
}
public URL(String protocol, String host, int port, String file)
throws MalformedURLException
try {
URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
}
catch (MalformedURLException ex) {
System.err.println(ex);
}
Example 7-2. A protocol tester applet
import java.net.*;
import java.applet.*;
import java.awt.*;
public class ProtocolTesterApplet extends Applet {
TextArea results = new TextArea( );
public void init( ) {
this.setLayout(new BorderLayout( ));
this.add("Center", results);
}
public void start( ) {
String host = "www.peacefire.org";
String file = "/bypass/SurfWatch/";
String[] schemes = {"http", "https", "ftp", "mailto",
"telnet", "file", "ldap", "gopher",
"jdbc", "rmi", "jndi", "jar",
"doc", "netdoc", "nfs", "verbatim",
"finger", "daytime", "systemresource"};
for (int i = 0; i < schemes.length; i++) {
try {
URL u = new URL(schemes[i], host, file);
results.append(schemes[i] + " is supported\r\n");
}
catch (MalformedURLException ex) {
results.append(schemes[i] + " is not supported\r\n");
}
}
}
}
Figure 7-1. The ProtocolTesterApplet running in Mozilla 1.4
7.1.1.3 Constructing relative URLs
public URL(URL base, String relative) throws MalformedURLException
try {
URL u1 = new URL("http://www.ibiblio.org/javafaq/index.html");
URL u2 = new URL(u1, "mailinglists.html");
}
catch (MalformedURLException ex) {
System.err.println(ex);
}
Example 7-3. A URL relative to the web page
import java.net.*;
import java.applet.*;
import java.awt.*;
public class RelativeURLTest extends Applet {
public void init ( ) {
try {
URL base = this.getDocumentBase( );
URL relative = new URL(base, "mailinglists.html");
this.setLayout(new GridLayout(2,1));
this.add(new Label(base.toString( )));
this.add(new Label(relative.toString( )));
}
catch (MalformedURLException ex) {
this.add(new Label("This shouldn't happen!"));
}
}
}
Figure 7-2. A base and a relative URL
URL relative = new URL(this.getDocumentBase( ), "mailinglists.html");
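The resolution rule this constructor applies can also be explored without an applet. The following sketch assumes a made-up base URL on www.example.com and a hypothetical helper named resolve( ); it shows a sibling file, a parent-relative path, and a root-relative path being resolved against the same base.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeUrlSketch {

  // Resolves a relative reference against a base URL and returns
  // the resulting absolute URL in string form
  static String resolve(String base, String relative) {
    try {
      return new URL(new URL(base), relative).toString();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/javafaq/index.html";
    // Sibling file: replaces the filename at the end of the base path
    System.out.println(resolve(base, "mailinglists.html"));
    // Parent directory reference
    System.out.println(resolve(base, "../books/"));
    // Leading slash: relative to the root of the same host
    System.out.println(resolve(base, "/index.html"));
  }
}
```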
7.1.1.4 Specifying a URLStreamHandler // Java 1.2
public URL(URL base, String relative, URLStreamHandler handler) // 1.2
throws MalformedURLException
public URL(String protocol, String host, int port, String file, // 1.2
URLStreamHandler handler) throws MalformedURLException
URL u = new URL("finger", "utopia.poly.edu", 79, "/marcus",
new com.macfaq.net.www.protocol.finger.Handler( ));
7.1.1.5 Other sources of URL objects
7.1.2 Splitting a URL into Pieces
7.1.2.1 public String getProtocol( )
URL page = this.getCodeBase( );
System.out.println("This applet was downloaded via "
+ page.getProtocol( ));
7.1.2.2 public String getHost( )
URL page = this.getCodeBase( );
System.out.println("This applet was downloaded from " + page.getHost( ));
7.1.2.3 public int getPort( )
URL u = new URL("http://www.ncsa.uiuc.edu/demowebl-primerl");
System.out.println("The port part of " + u + " is " + u.getPort( ));
7.1.2.4 public int getDefaultPort( )
7.1.2.5 public String getFile( )
URL page = this.getDocumentBase( );
System.out.println("This page's path is " + page.getFile( ));
7.1.2.6 public String getPath( ) // Java 1.3
The getRef( ) method returns the fragment identifier part of the URL. If the URL doesn't have a fragment identifier, the method returns null. In the following code, getRef( ) returns the string xtocid1902914:
URL u = new URL(
"http://www.ibiblio.org/javafaq/javafaql#xtocid1902914");
System.out.println("The fragment ID of " + u + " is " + u.getRef( ));
The getQuery( ) method returns the query string of the URL. If the URL doesn't have a query string, the method returns null. In the following code, getQuery() returns the string category=Piano:
URL u = new URL(
"http://www.ibiblio.org/nywc/compositions.l?category=Piano");
System.out.println("The query string of " + u + " is " + u.getQuery( ));
In Java 1.2 and earlier, you need to extract the query string from the value returned by getFile( ) instead.
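Since getFile( ), getPath( ), getQuery( ), and getRef( ) overlap, it can help to run all four against one URL. This is a minimal sketch; the class name FilePartsSketch, the helper parse( ), and the www.example.com URL are assumptions introduced for illustration, and the behavior shown is that of Java 1.3 and later.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FilePartsSketch {

  // Parses a URL string, converting the checked exception
  static URL parse(String spec) {
    try {
      return new URL(spec);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    URL u = parse("http://www.example.com/nywc/compositions.html?category=Piano#top");
    System.out.println("getFile:  " + u.getFile());   // path plus query string
    System.out.println("getPath:  " + u.getPath());   // path only
    System.out.println("getQuery: " + u.getQuery());  // text after the ?
    System.out.println("getRef:   " + u.getRef());    // text after the #
  }
}
```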
Some URLs include usernames and occasionally even password information. This information comes after the scheme and before the host; an @ symbol delimits it. For instance, in the URL http://elharo@java.oreilly.com/, the user info is elharo. Some URLs also include passwords in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/, the user info is mp3:secret. However, most of the time including a password in a URL is a security risk. If the URL doesn't have any user info, getUserInfo() returns null. Mailto URLs may not behave like you expect. In a URL like mailto:
Between the scheme and the path of a URL, you'll find the authority. The term authority is taken from the Uniform Resource Identifier specification (RFC 2396), where this part of the URI indicates the authority that resolves the resource. In the most general case, the authority includes the user info, the host, and the port. Not every URL has all three parts, however. For example, in the URL http://conferences.oreilly.com/java/speakers/, the authority is simply the hostname conferences.oreilly.com. The getAuthority( ) method returns the authority as it exists in the URL, with or without the user info and port.
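How the authority breaks down into user info, host, and port can be checked with a few calls. The sketch below reuses two URLs that appear in this chapter; the class name AuthoritySketch and the helper parse( ) are assumptions, and the results reflect Java 1.4 and later, where getHost( ) no longer includes the user info.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class AuthoritySketch {

  // Parses a URL string, converting the checked exception
  static URL parse(String spec) {
    try {
      return new URL(spec);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  static void show(URL u) {
    System.out.println("authority: " + u.getAuthority());
    System.out.println("user info: " + u.getUserInfo());
    System.out.println("host:      " + u.getHost());
    System.out.println("port:      " + u.getPort());
    System.out.println();
  }

  public static void main(String[] args) {
    // Authority with user info, host, and port
    show(parse("http://admin@www.blackstar.com:8080/"));
    // Authority that consists of the hostname alone
    show(parse("http://conferences.oreilly.com/java/speakers/"));
  }
}
```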
Example 7-4 uses all eight methods to split URLs entered on the command line into their component parts. This program requires Java 1.3 or later.
import java.net.*;
public class URLSplitter {
public static void main(String args[]) {
for (int i = 0; i < args.length; i++) {
try {
URL u = new URL(args[i]);
System.out.println("The URL is " + u);
System.out.println("The scheme is " + u.getProtocol( ));
System.out.println("The user info is " + u.getUserInfo( ));
String host = u.getHost( );
if (host != null) {
int atSign = host.indexOf('@');
if (atSign != -1) host = host.substring(atSign+1);
System.out.println("The host is " + host);
}
else {
System.out.println("The host is null.");
}
System.out.println("The port is " + u.getPort( ));
System.out.println("The path is " + u.getPath( ));
System.out.println("The ref is " + u.getRef( ));
System.out.println("The query string is " + u.getQuery( ));
} // end try
catch (MalformedURLException ex) {
System.err.println(args[i] + " is not a URL I understand.");
}
System.out.println( );
} // end for
} // end main
} // end URLSplitter
Here's the result of running this against several of the URL examples in this chapter:
% java URLSplitter \
http://www.ncsa.uiuc.edu/demowebl-primerl#A1.3.3.3 \
ftp://mp3:mp3@138.247.121.61:21000/c%3a/ \
http://www.oreilly.com \
http://www.ibiblio.org/nywc/compositions.l?category=Piano \
http://admin@www.blackstar.com:8080/
The URL is http://www.ncsa.uiuc.edu/demowebl-primerl#A1.3.3.3
The scheme is http
The user info is null
The host is www.ncsa.uiuc.edu
The port is -1
The path is /demowebl-primerl
The ref is A1.3.3.3
The query string is null
The URL is ftp://mp3:mp3@138.247.121.61:21000/c%3a/
The scheme is ftp
The user info is mp3:mp3
The host is 138.247.121.61
The port is 21000
The path is /c%3a/
The ref is null
The query string is null
The URL is http://www.oreilly.com
The scheme is http
The user info is null
The host is www.oreilly.com
The port is -1
The path is
The ref is null
The query string is null
The URL is http://www.ibiblio.org/nywc/compositions.l?category=Piano
The scheme is http
The user info is null
The host is www.ibiblio.org
The port is -1
The path is /nywc/compositions.l
The ref is null
The query string is category=Piano
The URL is http://admin@www.blackstar.com:8080/
The scheme is http
The user info is admin
The host is www.blackstar.com
The port is 8080
The path is /
The ref is null
The query string is null
Naked URLs aren't very exciting. What's interesting is the data contained in the documents they point to. The URL class has several methods that retrieve data from a URL:
public InputStream openStream( ) throws IOException
public URLConnection openConnection( ) throws IOException
public URLConnection openConnection(Proxy proxy) throws IOException // 1.5
public Object getContent( ) throws IOException
public Object getContent(Class[] classes) throws IOException // 1.3
These methods differ in that they return the data at the URL as an instance of different classes.
The openStream( ) method connects to the resource referenced by the URL, performs any necessary handshaking between the client and the server, and returns an InputStream from which data can be read. The data you get from this InputStream is the raw (i.e., uninterpreted) contents of the file the URL references: ASCII if you're reading an ASCII text file, raw HTML if you're reading an HTML file, binary image data if you're reading an image file, and so forth. It does not include any of the HTTP headers or any other protocol-related information. You can read from this InputStream as you would read from any other InputStream. For example:
try {
URL u = new URL("http://www.hamsterdance.com");
InputStream in = u.openStream( );
int c;
while ((c = in.read( )) != -1) System.out.write(c);
}
catch (IOException ex) {
System.err.println(ex);
}
This code fragment catches an IOException, which also catches the MalformedURLException that the URL constructor can throw, since MalformedURLException subclasses IOException.
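The same read loop works for any scheme the virtual machine supports, which makes it possible to try openStream( ) without a network connection. The following sketch is an assumption-laden variation, not the chapter's code: it writes a temporary file, reads it back through a file URL, and closes the stream with try-with-resources, which requires Java 7 or later. The class and method names are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OpenStreamSketch {

  // Reads every byte from the URL's stream, closing the stream afterward
  static byte[] readAll(URL u) throws IOException {
    try (InputStream in = u.openStream()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    }
  }

  // Writes text to a temporary file, then reads it back via a file URL
  static String roundTrip(String text) {
    try {
      Path temp = Files.createTempFile("openstream", ".txt");
      try {
        Files.write(temp, text.getBytes(StandardCharsets.UTF_8));
        URL u = temp.toUri().toURL();
        return new String(readAll(u), StandardCharsets.UTF_8);
      }
      finally {
        Files.deleteIfExists(temp);
      }
    }
    catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("Hello from a file URL"));
  }
}
```

Reading into a buffer rather than one byte at a time is a design choice made here for efficiency; the single-byte loop shown earlier returns the same data.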
Example 7-5 reads a URL from the command line, opens an InputStream from that URL, chains the resulting InputStream to an InputStreamReader using the default encoding, and then uses InputStreamReader's read( ) method to read successive characters from the file, each of which is printed on System.out. That is, it prints the raw data located at the URL: if the URL references an HTML file, the program's output is raw HTML.
import java.net.*;
import java.io.*;
public class SourceViewer {
public static void main (String[] args) {
if (args.length > 0) {
try {
//Open the URL for reading
URL u = new URL(args[0]);
InputStream in = u.openStream( );
// buffer the input to increase performance
in = new BufferedInputStream(in);
// chain the InputStream to a Reader
Reader r = new InputStreamReader(in);
int c;
while ((c = r.read( )) != -1) {
System.out.print((char) c);
}
}
catch (MalformedURLException ex) {
System.err.println(args[0] + " is not a parseable URL");
}
catch (IOException ex) {
System.err.println(ex);
}
} // end if
} // end main
} // end SourceViewer
And here are the first few lines of output when SourceViewer downloads http://www.oreilly.com:
% java SourceViewer http://www.oreilly.com
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books, software conferences, online publishing</title>
<meta name="keywords" content="O'Reilly, oreilly, computer books, technical books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows NT, Security, Sys Admin, System Administration, Oracle, PL/SQL, online books, books online, computer book online, e-books, ebooks, Perl Conference, Open Source Conference, Java Conference, open source, free software, XML, Mac OS X, .Net, dot net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP, bioinformatics, web services, p2p" />
<meta name="description" content="O'Reilly is a leader in technical and computer book documentation, online content, and conferences for UNIX, Perl, Java, Linux, Internet, Mac OS X, C, C++, Windows, Windows NT, Security, Sys Admin, System Administration, Oracle, Design and Graphics, Online Books, e-books, ebooks, Perl Conference, Java Conference, P2P Conference" />
There are quite a few more lines in that web page; if you want to see them, you can fire up your web browser.
The shakiest part of this program is that it blithely assumes that the remote URL is text, which is not necessarily true. It could well be a GIF or JPEG image, an MP3 sound file, or something else entirely. Even if it is text, the document encoding may not be the same as the default encoding of the client system. The remote host and local client may not have the same default character set. As a general rule, for pages that use a character set radically different from ASCII, the HTML will include a META tag in the header specifying the character set in use. For instance, this META tag specifies the Big-5 encoding for Chinese:
<meta http-equiv="Content-Type" content="text/html; charset=big5">
An XML document will likely have an XML declaration instead:
<?xml version="1.0" encoding="Big5"?>
In practice, there's no easy way to get at this information other than by parsing the file and looking for a header like this one, and even that approach is limited. Many HTML files hand-coded in Latin alphabets don't have such a META tag. Since Windows, the Mac, and most Unixes have somewhat different interpretations of the characters from 128 to 255, the extended characters in these documents do not translate correctly on platforms other than the one on which they were created.
And as if this isn't confusing enough, the HTTP header that precedes the actual document is likely to have its own encoding information, which may completely contradict what the document itself says. You can't read this header using the URL class, but you can with the URLConnection object returned by the openConnection( ) method. Encoding detection and declaration is one of the thornier parts of the architecture of the Web.
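One way to act on the HTTP header's encoding information is to pull the charset parameter out of the Content-type value and pass it to InputStreamReader. The sketch below covers only the parsing step. The class name CharsetSketch and the helper charsetOf( ) are hypothetical, the fallback encoding is an arbitrary choice, and the header string would in practice come from URLConnection's getContentType( ) method.

```java
import java.util.Locale;

public class CharsetSketch {

  // Extracts the charset parameter from a Content-type value such as
  // "text/html; charset=big5". Returns the fallback if the value is
  // null or carries no charset parameter.
  static String charsetOf(String contentType, String fallback) {
    if (contentType == null) return fallback;
    String[] parts = contentType.split(";");
    // parts[0] is the MIME type itself; parameters start at index 1
    for (int i = 1; i < parts.length; i++) {
      String part = parts[i].trim();
      if (part.toLowerCase(Locale.ROOT).startsWith("charset=")) {
        String value = part.substring("charset=".length()).trim();
        // Some servers quote the value
        if (value.length() >= 2 && value.startsWith("\"")
         && value.endsWith("\"")) {
          value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? fallback : value;
      }
    }
    return fallback;
  }

  public static void main(String[] args) {
    System.out.println(charsetOf("text/html; charset=big5", "ISO-8859-1"));
    System.out.println(charsetOf("text/html", "ISO-8859-1"));
  }
}
```

The returned name could then be handed to the two-argument InputStreamReader constructor, which throws UnsupportedEncodingException if the virtual machine does not know the encoding.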
The openConnection( ) method opens a socket to the specified URL and returns a URLConnection object. A URLConnection represents an open connection to a network resource. If the call fails, openConnection( ) throws an IOException. For example:
try {
URL u = new URL("http://www.jennicam.org/");
try {
URLConnection uc = u.openConnection( );
InputStream in = uc.getInputStream( );
// read from the connection...
} // end try
catch (IOException ex) {
System.err.println(ex);
}
} // end try
catch (MalformedURLException ex) {
System.err.println(ex);
}
Use this method when you want to communicate directly with the server. The URLConnection gives you access to everything sent by the server: in addition to the document itself in its raw form (e.g., HTML, plain text, binary image data), you can access all the metadata specified by the protocol. For example, if the scheme is HTTP, the URLConnection lets you access the HTTP headers as well as the raw HTML. The URLConnection class also lets you write data to as well as read from a URL, for instance, in order to send email to a mailto URL or post form data. The URLConnection class will be the primary subject of Chapter 15.
Java 1.5 adds one overloaded variant of this method that specifies the proxy server to pass the connection through:
public URLConnection openConnection(Proxy proxy) throws IOException
This overrides any proxy server set with the usual socksProxyHost, socksProxyPort, http.proxyHost, http.proxyPort, http.nonProxyHosts, and similar system properties. If the protocol handler does not support proxies, the argument is ignored and the connection is made directly if possible.
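A minimal sketch of the proxy variant follows. The class name ProxiedConnection, the method viaProxy( ), and the host proxy.example.com on port 8080 are placeholders chosen for illustration, not values from the text. Nothing is sent over the network here; the connection is only configured, and a real proxy would be needed before reading from it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public class ProxiedConnection {

  // Build a URLConnection that will be routed through an HTTP proxy.
  // (Hypothetical helper; the proxy host and port are supplied by the caller.)
  public static URLConnection viaProxy(String url, String proxyHost, int proxyPort)
      throws IOException {
    // createUnresolved( ) defers the DNS lookup of the proxy host
    InetSocketAddress address = InetSocketAddress.createUnresolved(proxyHost, proxyPort);
    Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
    URL u = new URL(url);
    return u.openConnection(proxy);
  }

  public static void main(String[] args) throws IOException {
    // proxy.example.com:8080 is a placeholder, not a real proxy server
    URLConnection uc = viaProxy("http://www.example.com/", "proxy.example.com", 8080);
    System.out.println(uc.getClass( ).getName( ));
  }
}
```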
The getContent( ) method is the third way to download data referenced by a URL. It retrieves the data referenced by the URL and tries to make it into some type of object. If the URL refers to some kind of text object such as an ASCII or HTML file, the object returned is usually some sort of InputStream. If the URL refers to an image such as a GIF or a JPEG file, getContent( ) usually returns a java.awt.image.ImageProducer (more specifically, an instance of a class that implements the ImageProducer interface). What unifies these two disparate classes is that they are not the thing itself but a means by which a program can construct the thing:
try {
  URL u = new URL("http://mesola.obspm.fr/");
  Object o = u.getContent( );
  // cast the Object to the appropriate type
  // work with the Object...
}
catch (Exception ex) {
  System.err.println(ex);
}
getContent( ) operates by looking at the Content-type field in the MIME header of the data it gets from the server. If the server does not use MIME headers or sends an unfamiliar Content-type, getContent( ) returns some sort of InputStream with which the data can be read. An IOException is thrown if the object can't be retrieved. Example 7-6 demonstrates this.
import java.net.*;
import java.io.*;

public class ContentGetter {

  public static void main (String[] args) {

    if (args.length > 0) {
      // Open the URL for reading
      try {
        URL u = new URL(args[0]);
        try {
          Object o = u.getContent( );
          System.out.println("I got a " + o.getClass( ).getName( ));
        } // end try
        catch (IOException ex) {
          System.err.println(ex);
        }
      } // end try
      catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      }
    } // end if
  } // end main
} // end ContentGetter
Here's the result of trying to get the content of http://www.oreilly.com:
% java ContentGetter http://www.oreilly.com/
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream
The exact class may vary from one version of Java to the next (in earlier versions, it's been java.io.PushbackInputStream or sun.net.www.http.KeepAliveStream) but it should be some form of InputStream.
Here's what you get when you try to load a header image from that page:
% java ContentGetter http://www.oreilly.com/graphics_new/animation.gif
I got a sun.awt.image.URLImageSource
Here's what happens when you try to load a Java applet using getContent( ):
% java ContentGetter http://www.cafeaulait.org/RelativeURLTest.class
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream
Here's what happens when you try to load an audio file using getContent( ):
% java ContentGetter http://www.cafeaulait.org/course/week9/spacemusic.au
I got a sun.applet.AppletAudioClip
The last result is the most unusual because it is as close as the Java core API gets to a class that represents a sound file. It's not just an interface through which you can load the sound data.
This example demonstrates the biggest problem with using getContent( ): it's hard to predict what kind of object you'll get. You could get some kind of InputStream or an ImageProducer or perhaps an AudioClip; it's easy to check using the instanceof operator. This information should be enough to let you read a text file or display an image.
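The instanceof check described above might be sketched as follows. The class name ContentInspector and the method describe( ) are made up for this illustration, and the object passed in main( ) is a stand-in created locally rather than something actually returned by getContent( ), so the example runs without a network connection.

```java
import java.awt.image.ImageProducer;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class ContentInspector {

  // Classify an object of the kind getContent( ) might return.
  // (Hypothetical helper; only two of the possible types are recognized.)
  public static String describe(Object o) {
    if (o == null) {
      return "nothing";
    }
    else if (o instanceof InputStream) {
      return "a stream of raw bytes";
    }
    else if (o instanceof ImageProducer) {
      return "an image producer";
    }
    else {
      return "an unexpected " + o.getClass( ).getName( );
    }
  }

  public static void main(String[] args) {
    // Stand-in for the result of u.getContent( )
    Object o = new ByteArrayInputStream(new byte[0]);
    System.out.println("I got " + describe(o));
  }
}
```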
Starting in Java 1.3, it is possible for a content handler to provide different views of an object. This overloaded variant of the getContent( ) method lets you choose what class you'd like the content to be returned as. The method attempts to return the URL's content in the order used in the array. For instance, if you prefer an HTML file to be returned as a String, but your second choice is a Reader and your third choice is an InputStream, write:
URL u = new URL("http://www.nwu.org");
Class[] types = new Class[3];
types[0] = String.class;
types[1] = Reader.class;
types[2] = InputStream.class;
Object o = u.getContent(types);
You then have to test for the type of the returned object using instanceof. For example:
if (o instanceof String) {
  System.out.println(o);
}
else if (o instanceof Reader) {
  int c;
  Reader r = (Reader) o;
  while ((c = r.read( )) != -1) System.out.print((char) c);
}
else if (o instanceof InputStream) {
  int c;
  InputStream in = (InputStream) o;
  while ((c = in.read( )) != -1) System.out.write(c);
}
else {
  System.out.println("Error: unexpected type " + o.getClass( ));
}
The URL class contains a couple of utility methods that perform common operations on URLs. The sameFile( ) method determines whether two URLs point to the same document. The toExternalForm( ) method converts a URL object to a string that can be used in an HTML link or a web browser's Open URL dialog.
The sameFile( ) method tests whether two URL objects point to the same file. If they do, sameFile( ) returns true; otherwise, it returns false. The test that sameFile( ) performs is quite shallow; all it does is compare the corresponding fields for equality. It does detect whether two hostnames are really just aliases for each other; for instance, it can tell that http://www.ibiblio.org/ and http://metalab.unc.edu/ are the same file. However, it cannot tell that http://www.ibiblio.org:80/ and http://metalab.unc.edu/ are the same file or that http://www.cafeconleche.org/ and http://www.cafeconleche.org/index.html are the same file. sameFile( ) is, however, smart enough to ignore the fragment identifier part of a URL. Here's a fragment of code that uses sameFile( ) to compare two URLs:
try {
  URL u1 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS");
  URL u2 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD");
  if (u1.sameFile(u2)) {
    System.out.println(u1 + " is the same as \n" + u2);
  }
  else {
    System.out.println(u1 + " is not the same as \n" + u2);
  }
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}
The output is:
http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS is the same as
http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD
The sameFile( ) method is similar to the equals( ) method of the URL class. The main difference between sameFile( ) and equals( ) is that equals( ) considers the fragment identifier (if any), whereas sameFile( ) does not. The two URLs shown here do not compare equal although they are the same file. Also, any object may be passed to equals( ); only URL objects can be passed to sameFile( ).
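To make the difference concrete, here is a small sketch that applies both methods to two URLs differing only in their fragment identifiers. The class name FragmentComparison, the method compare( ), and the www.example.com addresses are placeholders invented for this illustration. Because the comparison considers the host, running it may trigger a DNS lookup.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FragmentComparison {

  // Report what sameFile( ) and equals( ) say about two URL strings.
  // (Hypothetical helper written for this example.)
  public static String compare(String first, String second)
      throws MalformedURLException {
    URL u1 = new URL(first);
    URL u2 = new URL(second);
    return "sameFile=" + u1.sameFile(u2) + ", equals=" + u1.equals(u2);
  }

  public static void main(String[] args) throws MalformedURLException {
    // Placeholder URLs that differ only in the fragment identifier
    System.out.println(compare(
        "http://www.example.com/guide.html#intro",
        "http://www.example.com/guide.html#summary"));
  }
}
```

With these inputs, sameFile( ) reports true because it ignores the fragment, while equals( ) reports false because the fragments differ.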
The toExternalForm( ) method returns a human-readable String representing the URL. It is identical to the toString( ) method. In fact, all the toString( ) method does is return toExternalForm( ). Therefore, this method is currently redundant and rarely used.
Java 1.5 adds a toURI( ) method that converts a URL object to an equivalent URI object. We'll take up the URI class shortly. In the meantime, the main thing you need to know is that the URI class provides much more accurate, specification-conformant behavior than the URL class. For operations like absolutization and encoding, you should prefer the URI class where you have the option. In Java 1.4 and later, the URL class should be used primarily for the actual downloading of content from the remote server.
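As a brief sketch of handing work over to the URI class, the following converts a URL to a URI and then resolves a relative reference against it. The class name URLToURI, the method resolveAgainst( ), and the www.example.com addresses are assumptions made for this illustration only.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class URLToURI {

  // Convert the base URL to a URI and let URI do the absolutization.
  // (Hypothetical helper written for this example.)
  public static URI resolveAgainst(URL base, String relative)
      throws URISyntaxException {
    URI baseURI = base.toURI( );
    return baseURI.resolve(relative);
  }

  public static void main(String[] args)
      throws MalformedURLException, URISyntaxException {
    // Placeholder base URL and relative reference
    URL base = new URL("http://www.example.com/docs/index.html");
    System.out.println(resolveAgainst(base, "../images/logo.gif"));
  }
}
```

Note that toURI( ) throws a URISyntaxException if the URL cannot be expressed as a well-formed URI, so callers have to handle that case.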
URL inherits from java.lang.Object, so it has access to all the methods of the Object class. It overrides three to provide more specialized behavior: equals( ), hashCode( ), and toString( ).
Like all good classes, java.net.URL has a toString( ) method. Example 7-1 through Example 7-5 implicitly called this method when URLs were passed to System.out.println( ). As those examples demonstrated, the String produced by toString( ) is always an absolute URL, such as http://www.cafeaulait.org/javatutorial.html.
It's uncommon to call toString( ) explicitly. Print statements call toString( ) implicitly. Outside of print statements, it's more proper to use toExternalForm( ) instead. If you do call toString( ), the syntax is simple:
URL codeBase = this.getCodeBase( );
String appletURL = codeBase.toString( );
An object is equal to a URL only if it is also a URL, both URLs point to the same file as determined by the sameFile( ) method, and both URLs have the same fragment identifier (or both URLs don't have fragment identifiers). Since equals( ) depends on sameFile( ), equals( ) has the same limitations as sameFile( ). For example, http://www.oreilly.com/ is not equal to http://www.oreilly.com/index.html, and http://www.oreilly.com:80/ is not equal to http://www.oreilly.com/. Whether this makes sense depends on whether you think of a URL as a string or as a reference to a particular Internet resource.
Example 7-7 creates URL objects for http://www.ibiblio.org/ and http://metalab.unc.edu/ and tells you if they're the same using the equals() method.
import java.net.*;

public class URLEquality {

  public static void main (String[] args) {

    try {
      URL ibiblio = new URL("http://www.ibiblio.org/");
      URL metalab = new URL("http://metalab.unc.edu/");
      if (ibiblio.equals(metalab)) {
        System.out.println(ibiblio + " is the same as " + metalab);
      }
      else {
        System.out.println(ibiblio + " is not the same as " + metalab);
      }
    }
    catch (MalformedURLException ex) {
      System.err.println(ex);
    }
  }
}
When you run this program, you discover:
% java URLEquality
http://www.ibiblio.org/ is the same as http://metalab.unc.edu/
The hashCode( ) method returns an int that is used when URL objects are used as keys in hash tables. Thus, it is called by the various methods of java.util.Hashtable. You rarely need to call this method directly, if ever. Hash codes for two different URL objects are unlikely to be the same, but it is certainly possible; there are far more conceivable URLs than there are four-byte integers.
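A short sketch of URLs used as hash table keys follows. The class name URLKeys, the method lookup( ), and the www.example.com addresses are invented for this illustration. The Hashtable calls hashCode( ) and equals( ) on the keys behind the scenes, so two separately constructed but equal URL objects find the same entry. As with equals( ), hashing a URL may involve a DNS lookup of the host.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Hashtable;

public class URLKeys {

  // Store a value under one URL object and look it up with another.
  // (Hypothetical helper written for this example.)
  public static String lookup(String storedURL, String queryURL)
      throws MalformedURLException {
    Hashtable<URL, String> table = new Hashtable<URL, String>( );
    table.put(new URL(storedURL), "found");
    // get( ) calls hashCode( ) and then equals( ) on the key
    String result = table.get(new URL(queryURL));
    return result == null ? "missing" : result;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Placeholder URLs: the first pair is equal, the second pair is not
    System.out.println(lookup("http://www.example.com/index.html",
                              "http://www.example.com/index.html"));
    System.out.println(lookup("http://www.example.com/index.html",
                              "http://www.example.com/other.html"));
  }
}
```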
I'll mention the last method in the URL class, setURLStreamHandlerFactory( ), only briefly here for the sake of completeness. It's primarily used by protocol handlers that are responsible for new schemes, not by programmers who just want to retrieve data from a URL. We'll discuss it in more detail in Chapter 16.
This method sets the URLStreamHandlerFactory for the application and throws a generic Error if the factory has already been set. A URLStreamHandler is responsible for parsing the URL and then constructing the appropriate URLConnection object to handle the connection to the server. Most of the time this happens behind the scenes.
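The set-once behavior can be observed with a sketch like the one below. The class name FactoryInstaller and the method install( ) are made up for this example, and the factory used is a do-nothing placeholder that returns null for every protocol, which leaves the built-in handlers in charge. Catching Error this broadly is acceptable only in a demonstration.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

public class FactoryInstaller {

  // Try to install a factory and report whether the VM accepted it.
  // (Hypothetical helper written for this example.)
  public static boolean install(URLStreamHandlerFactory factory) {
    try {
      URL.setURLStreamHandlerFactory(factory);
      return true;
    }
    catch (Error err) {
      // thrown when a factory has already been set for this application
      return false;
    }
  }

  public static void main(String[] args) {
    // Placeholder factory: returning null defers to the default handlers
    URLStreamHandlerFactory passThrough = protocol -> null;
    System.out.println("first attempt:  " + install(passThrough));
    System.out.println("second attempt: " + install(passThrough));
  }
}
```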