7.1 The URL Class
The java.net.URL class is an abstraction of a Uniform
Resource Locator such as http://www.hamsterdance.com/. The class is declared as follows:
public final class URL extends Object implements Serializable
Although storing a URL as a string would be trivial, it is helpful to
think of URLs as objects with fields that include the scheme (a.k.a.
the protocol), hostname, port, path, query string, and fragment
identifier (a.k.a. the ref), each of which may be set independently.
Indeed, this is almost exactly how the java.net.URL class is organized,
though the details vary a little between different versions of Java.
The fields of java.net.URL are only visible to other members of the
java.net package; classes that aren't in java.net can't access a
URL's fields directly. However, you can set these fields using the URL
constructors and retrieve their values using the various getter
methods (getHost( ), getPort( ), and so on). URLs are effectively
immutable. After a URL object has been constructed, its fields do not
change. This has the side effect of making them thread-safe.
7.1.1 Creating New URLs
Unlike the InetAddress objects in Chapter 6, you can construct
instances of java.net.URL. There are six constructors, differing in
the information they require. Which constructor you use depends on the
information you have and the form it's in. All these constructors
throw a MalformedURLException if you try to create a URL for an
unsupported protocol and may throw a MalformedURLException if the URL
is syntactically incorrect.
Exactly which protocols are supported is implementation-dependent. The
only protocols that have been available in all major virtual machines
are http and file, and the latter is notoriously flaky. Java 1.5 also
requires virtual machines to support https, jar, and ftp; many virtual
machines prior to Java 1.5 support these three as well. Most virtual
machines also support mailto and gopher as well as some custom
protocols like doc, netdoc, systemresource, and verbatim used
internally by Java. The Netscape virtual machine supports the http,
file, ftp, mailto, telnet, ldap, and gopher protocols. The Microsoft
virtual machine supports http, file, ftp, https, mailto, gopher, doc,
and systemresource, but not telnet, netdoc, jar, or verbatim. Of
course, support for all these protocols is limited in applets by the
security policy. For example, just because an untrusted applet can
construct a URL object from a file URL does not mean that the applet
can actually read the file the URL refers to. Just because an
untrusted applet can construct a URL object from an HTTP URL that
points to a third-party web site does not mean that the applet can
connect to that site.
If the protocol you need isn't supported by a particular VM, you may
be able to install a protocol handler for that scheme. This is subject
to a number of security checks in applets and is really practical only
for applications.
Other than verifying that it recognizes the URL scheme, Java does not
make any checks about the correctness of the URLs it constructs. The
programmer is responsible for making sure that URLs created are valid.
For instance, Java does not check that the hostname in an HTTP URL
does not contain spaces or that the query string is
x-www-form-URL-encoded. It does not check that a mailto URL actually
contains an email address. Java does not check the URL to make sure
that it points at an existing host or that it meets any other
requirements for URLs. You can create URLs for hosts that don't exist
and for hosts that do exist but that you won't be allowed to connect
to.
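The point about limited checking can be tried out with a small sketch along the following lines. The class and method names (UrlConstructionCheck, isConstructible) and both test URLs are made up for illustration; the sketch only assumes that no protocol handler is installed for the invented scheme nosuchscheme.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlConstructionCheck {

  // Returns true if this VM can construct a URL object from the string,
  // that is, if it recognizes the scheme and can parse the string.
  // No lookup of the host takes place.
  public static boolean isConstructible(String spec) {
    try {
      new URL(spec);
      return true;
    }
    catch (MalformedURLException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The .invalid top-level domain is reserved and never resolves,
    // but the scheme is recognized
    System.out.println(isConstructible("http://no-such-host.invalid/"));
    // A made-up scheme for which no handler is expected to exist
    System.out.println(isConstructible("nosuchscheme://www.example.com/"));
  }
}
```

Constructing the first URL succeeds even though the host can never be reached, because the constructor never touches the network.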
7.1.1.1 Constructing a URL from a string
The simplest URL constructor just takes an absolute URL in
string form as its single argument:
public URL(String url) throws MalformedURLException
Like all constructors, this may only be called after the
new operator, and like all URL constructors, it can throw a
MalformedURLException. The following code constructs a URL object
from a String, catching the exception that might be thrown:
try {
  URL u = new URL("http://www.audubon.org/");
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}
Example 7-1 is a simple program for determining
which protocols a virtual machine supports. It attempts to construct
a URL object for each of the 16 protocols listed (standard
protocols, custom protocols for various Java APIs, and 4
undocumented protocols used internally by HotJava). If the
constructor succeeds, you know the protocol is supported. Otherwise,
a MalformedURLException is thrown and you know the
protocol is not supported.
Example 7-1. ProtocolTester
/* Which protocols does a virtual machine support? */
import java.net.*;

public class ProtocolTester {

  public static void main(String[] args) {
    // hypertext transfer protocol
    testProtocol("http://www.adc.org");
    // secure http
    testProtocol("https://www.amazon.com/exec/obidos/order2/");
    // file transfer protocol
    testProtocol("ftp://metalab.unc.edu/pub/languages/java/javafaq/");
    // Simple Mail Transfer Protocol (placeholder address)
    testProtocol("mailto:user@example.com");
    // telnet
    testProtocol("telnet://dibner.poly.edu/");
    // local file access
    testProtocol("file:///etc/passwd");
    // gopher
    testProtocol("gopher://gopher.anc.org.za/");
    // Lightweight Directory Access Protocol
    testProtocol(
     "ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US?postalAddress");
    // JAR
    testProtocol(
     "jar:http://cafeaulait.org/books/javaio/ioexamples/javaio.jar!"
     + "/com/macfaq/io/StreamCopier.class");
    // NFS, Network File System
    testProtocol("nfs://utopia.poly.edu/usr/tmp/");
    // a custom protocol for JDBC
    testProtocol("jdbc:mysql://luna.metalab.unc.edu:3306/NEWS");
    // rmi, a custom protocol for remote method invocation
    testProtocol("rmi://metalab.unc.edu/RenderEngine");
    // custom protocols for HotJava
    testProtocol("doc:/UsersGuide/releasel");
    testProtocol("netdoc:/UsersGuide/releasel");
    testProtocol("systemresource://www.adc.org/+/indexl");
    testProtocol("verbatim:http://www.adc.org/");
  }

  private static void testProtocol(String url) {
    try {
      URL u = new URL(url);
      System.out.println(u.getProtocol( ) + " is supported");
    }
    catch (MalformedURLException ex) {
      // everything before the first colon is the scheme
      String protocol = url.substring(0, url.indexOf(':'));
      System.out.println(protocol + " is not supported");
    }
  }
}
The results of this program depend on which virtual machine runs it.
Here are the results from Java 1.4.1 on Mac OS X 10.2, which turns
out to support all the protocols except Telnet, LDAP, RMI, NFS, and
JDBC:
% java ProtocolTester
http is supported
https is supported
ftp is supported
mailto is supported
telnet is not supported
file is supported
gopher is supported
ldap is not supported
jar is supported
nfs is not supported
jdbc is not supported
rmi is not supported
doc is supported
netdoc is supported
systemresource is supported
verbatim is supported
Results using Sun's Linux 1.4.2 virtual machine were
identical. Other 1.4 virtual machines derived from the Sun code will
show similar results. Java 1.2 and later are likely to be the same
except for maybe HTTPS, which was only recently added to the standard
distribution. VMs that are not derived from the Sun codebase may vary
somewhat in which protocols they support. For example, here are the
results of running ProtocolTester with the open
source Kaffe VM 1.1.1:
% java ProtocolTester
http is supported
https is not supported
ftp is supported
mailto is not supported
telnet is not supported
file is supported
gopher is not supported
ldap is not supported
jar is supported
nfs is not supported
jdbc is not supported
rmi is not supported
doc is not supported
netdoc is not supported
systemresource is not supported
verbatim is not supported
The nonsupport of RMI and JDBC is actually a little deceptive; in
fact, the JDK does support these protocols. However, that support is
through various parts of the java.rmi and
java.sql packages, respectively. These protocols
are not accessible through the
URL class like the other supported
protocols (although I have no idea why Sun chose to wrap up RMI and
JDBC parameters in URL clothing if it wasn't
intending to interface with these via Java's quite
sophisticated mechanism for handling URLs).
7.1.1.2 Constructing a URL from its component parts
The second constructor builds
a URL from three strings specifying the protocol,
the hostname, and the file:
public URL(String protocol, String hostname, String file)
  throws MalformedURLException
This constructor sets the port to -1 so the default port for the
protocol will be used. The file argument should
begin with a slash and include a path, a filename, and optionally a
fragment identifier. Forgetting the initial slash is a common
mistake, and one that is not easy to spot. Like all
URL constructors, it can throw a
MalformedURLException. For example:
try {
  URL u = new URL("http", "www.eff.org", "/blueribbonl#intro");
}
catch (MalformedURLException ex) {
  // All VMs should recognize http
}
This creates a URL object that points to
http://www.eff.org/blueribbonl#intro,
using the default port for the HTTP protocol (port 80). The file
specification includes a reference to a named anchor. The code
catches the exception that would be thrown if the virtual machine did
not support the HTTP protocol. However, this
shouldn't happen in practice.
For the rare occasions when the default port isn't
correct, the next constructor lets you specify the port explicitly as
an int:
public URL(String protocol, String host, int port, String file)
  throws MalformedURLException
The other arguments are the same as for the
URL(String protocol, String host, String file) constructor and
carry the same caveats. For example:
try {
  URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}
This code creates a URL object that points
to http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying
port 8000 explicitly.
Example 7-2 is an alternative protocol tester that
can run as an applet, making it useful for testing support of browser
virtual machines. It uses the three-argument constructor rather than
the one-argument constructor in Example 7-1. It also
stores the schemes to be tested in an array and uses the same host
and file for each scheme. This produces seriously malformed URLs like
mailto://www.peacefire.org/bypass/SurfWatch/,
once again demonstrating that all Java checks for at object
construction is whether it recognizes the scheme, not whether the URL
is appropriate.
Example 7-2. A protocol tester applet
import java.net.*;
import java.applet.*;
import java.awt.*;

public class ProtocolTesterApplet extends Applet {

  TextArea results = new TextArea( );

  public void init( ) {
    this.setLayout(new BorderLayout( ));
    this.add("Center", results);
  }

  public void start( ) {
    String host = "www.peacefire.org";
    String file = "/bypass/SurfWatch/";
    String[] schemes = {"http", "https", "ftp", "mailto",
                        "telnet", "file", "ldap", "gopher",
                        "jdbc", "rmi", "jndi", "jar",
                        "doc", "netdoc", "nfs", "verbatim",
                        "finger", "daytime", "systemresource"};
    for (int i = 0; i < schemes.length; i++) {
      try {
        // the same host and file are reused for every scheme
        URL u = new URL(schemes[i], host, file);
        results.append(schemes[i] + " is supported\r\n");
      }
      catch (MalformedURLException ex) {
        results.append(schemes[i] + " is not supported\r\n");
      }
    }
  }
}
Figure 7-1 shows the results of Example 7-2 in Mozilla 1.4 with Java 1.4 installed. This
browser supports HTTP, HTTPS, FTP, mailto, file, gopher, doc, netdoc,
verbatim, systemresource, and jar but not ldap, Telnet, jdbc,
rmi, jndi, finger, or daytime.
Figure 7-1. The ProtocolTesterApplet running in Mozilla 1.4
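The component constructors described in this section can be compared side by side with a sketch like the one below. The class name ComponentUrlSketch, the build helpers, and the host www.example.com are illustrative assumptions, not part of the examples above. The last call deliberately omits the leading slash to show how that mistake surfaces in the resulting URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ComponentUrlSketch {

  // Builds a URL from protocol, host, and file; the port is left at -1
  public static String build(String protocol, String host, String file) {
    try {
      return new URL(protocol, host, file).toString();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // Same as above, but with an explicit port
  public static String build(String protocol, String host, int port,
                             String file) {
    try {
      return new URL(protocol, host, port, file).toString();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(build("http", "www.example.com", "/docs/page#intro"));
    System.out.println(build("http", "www.example.com", 8000, "/docs/"));
    // Missing initial slash: no exception, but host and path run together
    System.out.println(build("http", "www.example.com", "docs/page"));
  }
}
```

No exception is thrown for the third call, which is why the missing slash is hard to spot until the URL is actually used.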

7.1.1.3 Constructing relative URLs
This constructor builds an
absolute URL from a relative
URL and a base URL:
public URL(URL base, String relative) throws MalformedURLException
For instance, you may be parsing an HTML document at
http://www.ibiblio.org/javafaq/indexl and
encounter a link to a file called
mailinglistsl with no further qualifying
information. In this case, you use the URL to the document that
contains the link to provide the missing information. The constructor
computes the new URL as http://www.ibiblio.org/javafaq/mailinglistsl.
For example:
try {
  URL u1 = new URL("http://www.ibiblio.org/javafaq/indexl");
  URL u2 = new URL(u1, "mailinglistsl");
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}
The filename is removed from the path of u1 and
the new filename mailinglistsl is appended
to make u2. This constructor is particularly
useful when you want to loop through a list of files that are all in
the same directory. You can create a URL for the first file and then
use this initial URL to create URL objects for the
other files by substituting their filenames. You also use this
constructor when you want to create a URL relative
to the applet's document base or code base, which
you retrieve using the getDocumentBase( ) or getCodeBase( ) methods
of the java.applet.Applet
class. Example 7-3 is a very simple applet that uses
getDocumentBase( ) to create a new
URL object:
Example 7-3. A URL relative to the web page
import java.net.*;
import java.applet.*;
import java.awt.*;

public class RelativeURLTest extends Applet {

  public void init ( ) {
    try {
      URL base = this.getDocumentBase( );
      // resolve the relative name against the page's URL
      URL relative = new URL(base, "mailinglistsl");
      this.setLayout(new GridLayout(2,1));
      this.add(new Label(base.toString( )));
      this.add(new Label(relative.toString( )));
    }
    catch (MalformedURLException ex) {
      this.add(new Label("This shouldn't happen!"));
    }
  }
}
Of course, the output from this applet depends on the document base.
In the run shown in Figure 7-2, the original
URL (the document base) refers to the file
RelativeURLl; the constructor creates a new
URL that points to the file
mailinglistsl in the same directory.
Figure 7-2. A base and a relative URL

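When using this constructor with getDocumentBase( ), you can also place
the call to getDocumentBase( ) inside the constructor, like this: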
URL relative = new URL(this.getDocumentBase( ), "mailinglistsl");
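Outside an applet, the same constructor can be used to loop through several files in one directory, as described earlier in this section. The following is a minimal sketch; the class name SiblingUrls, the resolveAll helper, and the second relative name (booksl) are assumptions made for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class SiblingUrls {

  // Resolves each relative name against the base URL and returns
  // the resulting absolute URLs in string form
  public static List<String> resolveAll(String base, String... relatives) {
    List<String> result = new ArrayList<String>();
    try {
      URL baseUrl = new URL(base);
      for (String relative : relatives) {
        result.add(new URL(baseUrl, relative).toString());
      }
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> urls = resolveAll("http://www.ibiblio.org/javafaq/indexl",
                                   "mailinglistsl", "booksl");
    for (String url : urls) {
      System.out.println(url);
    }
  }
}
```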
7.1.1.4 Specifying a URLStreamHandler // Java 1.2
Two constructors allow you to specify
the protocol handler used for the URL. The first constructor builds a
relative URL from a base URL
and a relative part. The second builds the URL
from its component pieces:
public URL(URL base, String relative, URLStreamHandler handler) // 1.2
  throws MalformedURLException
public URL(String protocol, String host, int port, String file, // 1.2
  URLStreamHandler handler) throws MalformedURLException
All URL objects have
URLStreamHandler objects to do their work for
them. These two constructors change from the default
URLStreamHandler subclass for a particular
protocol to one of your own choosing. This is useful for working with
URLs whose schemes aren't supported in a particular
virtual machine as well as for adding functionality that the default
stream handler doesn't provide, such as asking the
user for a username and password. For example:
URL u = new URL("finger", "utopia.poly.edu", 79, "/marcus",
  new com.macfaq.net.www.protocol.finger.Handler( ));
The com.macfaq.net.www.protocol.finger.Handler
class used here will be developed in Chapter 16.
While the other four constructors raise no security issues in and of
themselves, these two do because class loader security is closely
tied to the various URLStreamHandler classes.
Consequently, untrusted applets are not allowed to specify a
URLStreamHandler. Trusted applets can do so if they
have the NetPermission
specifyStreamHandler. However, for reasons that
will become apparent in Chapter 16, this is a
security hole big enough to drive the Microsoft money train through.
Consequently, you should not request this permission or expect it to
be granted if you do request it.
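Until a real handler is available, the effect of these constructors can be seen with a stub. The sketch below is only an assumption-laden stand-in: StubFingerHandler is a hypothetical class that recognizes the finger scheme and reports port 79 as its default but refuses to open connections. It is not the com.macfaq handler mentioned above.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class HandlerSketch {

  // Hypothetical stand-in for a real finger protocol handler
  static class StubFingerHandler extends URLStreamHandler {
    @Override
    protected int getDefaultPort() {
      return 79;
    }

    @Override
    protected URLConnection openConnection(URL u) throws IOException {
      throw new IOException("StubFingerHandler cannot open connections");
    }
  }

  // Builds a finger URL by supplying the handler explicitly
  public static URL fingerUrl(String host, String user) {
    try {
      return new URL("finger", host, 79, "/" + user, new StubFingerHandler());
    }
    catch (MalformedURLException ex) {
      throw new IllegalStateException(ex);
    }
  }

  // Checks whether the VM has its own handler for the scheme
  public static boolean supportedByDefault(String scheme) {
    try {
      new URL(scheme, "localhost", "/");
      return true;
    }
    catch (MalformedURLException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("finger supported by default: "
     + supportedByDefault("finger"));
    System.out.println(fingerUrl("utopia.poly.edu", "marcus"));
  }
}
```

Supplying the handler lets the constructor succeed for a scheme the VM would otherwise reject; actually reading from such a URL still requires a working openConnection( ) implementation.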
7.1.1.5 Other sources of URL objects
Besides the constructors discussed here, a number of other methods in
the Java class library return URL objects.
You've already seen getDocumentBase( ) from
java.applet.Applet. The other common source is
getCodeBase( ), also from
java.applet.Applet. This works just like
getDocumentBase( ), except it returns the
URL of the applet itself instead of the URL of the
page that contains the applet. Both getDocumentBase( )
and getCodeBase( ) are declared in the
java.applet.AppletStub interface, to which
java.applet.Applet delegates.
You're unlikely to implement this interface yourself
unless you're building a web browser or applet
viewer.
In Java 1.2 and later, the java.io.File class has
a toURL( ) method that returns a file URL for the file. The static
ClassLoader.getSystemResource(String name) method returns a
URL from which a single resource can be read. The
ClassLoader.getSystemResources(String name) method
returns an Enumeration containing a list of
URLs from which the named resource can be read.
Finally, the instance method getResource(String
name) searches the path used by the referenced
class loader for a URL to the named resource. The URLs returned by
these methods may be file URLs, HTTP URLs, or some other scheme. The
name of the resource is a slash-separated list of Java identifiers,
such as /com/macfaq/sounds/swale.au or
com/macfaq/images/headshot.jpg. The Java virtual
machine will attempt to find the requested resource in the class
path (potentially including parts of the class path on the web
server that an applet was loaded from) or inside a JAR archive.
Java 1.4 adds the URI class, which
we'll discuss soon. URIs can be converted into URLs
using the toURL( ) method, provided Java has the
relevant protocol handler installed.
There are a few other methods that return URL
objects here and there throughout the class library, but most are
simple getter methods that return only a URL you probably already
know because you used it to construct the object in the first place;
for instance, the getPage( ) method of
javax.swing.JEditorPane and the getURL( ) method of
java.net.URLConnection.
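A few of these sources can be exercised without an applet, as in the sketch below. The class name OtherUrlSources, the file name notes.txt, and the example.com address are assumptions for illustration. The sketch goes through File.toURI( ).toURL( ) rather than File.toURL( ) because the latter is deprecated in current Java releases; that is a substitution, not what the text above describes.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class OtherUrlSources {

  // Converts a local file into a file URL
  public static URL fromFile(File f) {
    try {
      return f.toURI().toURL();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  // Converts a URI string into a URL; this works only if a protocol
  // handler for the scheme is installed
  public static URL fromUri(String uri) {
    try {
      return URI.create(uri).toURL();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(fromFile(new File("notes.txt")));
    System.out.println(fromUri("http://www.example.com/index"));
    // The scheme of this URL depends on the VM; it may also be null
    System.out.println(ClassLoader.getSystemResource("java/lang/Object.class"));
  }
}
```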
7.1.2 Splitting a URL into Pieces
URLs are composed of five pieces:
The scheme, also known as the protocol
The authority
The path
The fragment identifier, also known as the section or ref
The query string
For example, given the URL http://www.ibiblio.org/javafaq/books/jnp/indexl?isbn=1565922069#toc,
the scheme is http, the authority is
www.ibiblio.org, the path is
/javafaq/books/jnp/indexl, the fragment
identifier is toc, and the query string is
isbn=1565922069. However, not all URLs have all
these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc2396l has a
scheme, an authority, and a path, but no fragment identifier or query
string.
The authority may further be divided into the user info, the host,
and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the
authority is admin@www.blackstar.com:8080, which comprises the user
info admin, the host www.blackstar.com, and the port 8080.
7.1.2.1 public String getProtocol( )
The getProtocol( ) method returns a
String containing the scheme of the URL, e.g.,
"http", "https", or "file". For example:
URL page = this.getCodeBase( );
System.out.println("This applet was downloaded via "
+ page.getProtocol( ));
7.1.2.2 public String getHost( )
The getHost( ) method returns a
String containing the hostname of the URL. For
example:
URL page = this.getCodeBase( );
System.out.println("This applet was downloaded from " + page.getHost( ));
The most recent virtual machines get this method right but some older
ones, including Sun's JDK 1.3.0, may return a host
string that is not necessarily a valid hostname or address. In
particular, URLs that incorporate usernames may include the user info
in the host. For example:
URL u = new URL("ftp://anonymous:anonymous@wuarchive.wustl.edu/");
String host = u.getHost( );
Java 1.3 sets host to
anonymous:anonymous@wuarchive.wustl.edu, not
simply wuarchive.wustl.edu. Java 1.4 would return
wuarchive.wustl.edu instead.
7.1.2.3 public int getPort( )
The getPort( ) method
returns the port number specified in the URL as an
int. If no port was specified in the
URL, getPort( ) returns -1 to
signify that the URL does not specify the port explicitly, and will
use the default port for the protocol. For example, if the URL is
http://www.userfriendly.org/,
getPort( ) returns -1; if the URL is http://www.userfriendly.org:80/,
getPort( ) returns 80. The following code prints
-1 for the port number because it isn't specified in
the URL:
URL u = new URL("http://www.ncsa.uiuc.edu/demoweb/html-primerl");
System.out.println("The port part of " + u + " is " + u.getPort( ));
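Because getPort( ) returns -1 when no port is written in the URL, code that needs the port actually used has to fall back to the protocol's default. A minimal sketch follows, assuming a VM recent enough to provide the URL class's getDefaultPort( ) method; the class name PortSketch and the effectivePort helper are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PortSketch {

  // Returns the explicit port if the URL has one; otherwise the
  // default port for the URL's protocol
  public static int effectivePort(String spec) {
    try {
      URL u = new URL(spec);
      int port = u.getPort();
      return port != -1 ? port : u.getDefaultPort();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(effectivePort("http://www.userfriendly.org/"));
    System.out.println(effectivePort("http://www.userfriendly.org:8080/"));
  }
}
```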
7.1.2.4 public int getDefaultPort( )
The getDefaultPort( ) method
returns the default port used for this
URL's protocol when none is
specified in the URL. If no default port is defined for the protocol,
getDefaultPort( ) returns -1. For example, if the
URL is http://www.userfriendly.org/,
getDefaultPort( ) returns 80.
7.1.2.5 public String getFile( )
The getFile( ) method returns a
String that contains the path portion of a URL;
remember that Java does not break a URL into separate path and
file parts. Everything from the first slash (/) after the hostname until
the character preceding the # sign that begins a fragment identifier
is considered to be part of the file. For example:
URL page = this.getDocumentBase( );
System.out.println("This page's path is " + page.getFile( ));
If the URL does not have a file part, Java 1.2 and earlier append a
slash to the URL and return the slash as the filename. For example,
if the URL is http://www.slashdot.org (rather than
something like http://www.slashdot.org/), getFile( ) returns /.
Java 1.3 and later simply set the file to the empty string.
7.1.2.6 public String getPath( ) // Java 1.3
The getPath( ) method, available only in Java 1.3
and later, is a near synonym for getFile( ); that
is, it returns a String containing the path and
file portion of a URL. However, unlike getFile( ),
it does not include the query string in the String
it returns, just the path.
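The difference between the two methods is easiest to see on a URL that has a query string. The sketch below is illustrative only; the class name FileVersusPath and the fileAndPath helper are assumptions.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FileVersusPath {

  // Returns a two-element array: { getFile() result, getPath() result }
  public static String[] fileAndPath(String spec) {
    try {
      URL u = new URL(spec);
      return new String[] { u.getFile(), u.getPath() };
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String[] parts = fileAndPath(
     "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");
    System.out.println("getFile( ): " + parts[0]); // includes the query string
    System.out.println("getPath( ): " + parts[1]); // path only
  }
}
```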
7.1.2.7 public String getRef( )
The getRef( ) method returns the fragment
identifier part of the URL. If the URL doesn't have
a fragment identifier, the method returns null. In
the following code, getRef( ) returns the string
xtocid1902914:
URL u = new URL(
"http://www.ibiblio.org/javafaq/javafaql#xtocid1902914");
System.out.println("The fragment ID of " + u + " is " + u.getRef( ));
7.1.2.8 public String getQuery( ) // Java 1.3
The getQuery( ) method
returns the query string of the URL. If the URL
doesn't have a query string, the method returns
null. In the following code, getQuery( ) returns the string category=Piano:
URL u = new URL(
  "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");
System.out.println("The query string of " + u + " is " + u.getQuery( ));
In Java 1.2 and earlier, you need to extract the query string from
the value returned by getFile( ) instead.
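One way to perform that extraction is sketched below. It relies on getFile( ) including the query string, as described earlier; the class name QueryFromFile and the queryOf helper are hypothetical, and the sketch mirrors getQuery( ) by returning null when there is no question mark.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class QueryFromFile {

  // Extracts the query string from the value returned by getFile(),
  // or returns null if the URL has no query string
  public static String queryOf(String spec) {
    try {
      String file = new URL(spec).getFile();
      int questionMark = file.indexOf('?');
      if (questionMark == -1) return null;
      return file.substring(questionMark + 1);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(queryOf(
     "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano"));
    System.out.println(queryOf("http://www.ibiblio.org/nywc/"));
  }
}
```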
7.1.2.9 public String getUserInfo( ) // Java 1.3
Some URLs include usernames
and occasionally even password information. This information comes
after the scheme and before the host; an @ symbol delimits it. For
instance, in the URL http://elharo@java.oreilly.com/, the user
info is elharo. Some URLs also include passwords
in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/,
the user info is mp3:secret. However, most of
the time including a password in a URL is a security risk. If the URL
doesn't have any user info, getUserInfo( ) returns null. Mailto URLs may not
behave like you expect: the email address that follows mailto: is
treated as the path, not as the user info and the host.
7.1.2.10 public String getAuthority( ) // Java 1.3
Between the scheme and the
path of a URL, you'll find the authority. The term
authority
is taken from the Uniform Resource Identifier specification (RFC
2396), where this part of the URI indicates the authority that
resolves the resource. In the most general case, the authority
includes the user info, the host, and the port. Not every URL has all
three parts, however. For example, in the
URL http://conferences.oreilly.com/java/speakers/,
the authority is simply the hostname
conferences.oreilly.com. The
getAuthority( ) method returns the authority as it
exists in the URL, with or without the user info and port.
Example 7-4 uses seven of these methods to split URLs
entered on the command line into their component parts. This program
requires Java 1.3 or later.
Example 7-4. The parts of a URL
import java.net.*;

public class URLSplitter {

  public static void main(String args[]) {
    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        System.out.println("The URL is " + u);
        System.out.println("The scheme is " + u.getProtocol( ));
        System.out.println("The user info is " + u.getUserInfo( ));
        String host = u.getHost( );
        if (host != null) {
          // some older VMs leave the user info in the host; strip it
          int atSign = host.indexOf('@');
          if (atSign != -1) host = host.substring(atSign+1);
          System.out.println("The host is " + host);
        }
        else {
          System.out.println("The host is null.");
        }
        System.out.println("The port is " + u.getPort( ));
        System.out.println("The path is " + u.getPath( ));
        System.out.println("The ref is " + u.getRef( ));
        System.out.println("The query string is " + u.getQuery( ));
      } // end try
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      System.out.println( );
    } // end for
  } // end main
} // end URLSplitter
Here's the result of running this against several of
the URL examples in this chapter:
% java URLSplitter \
http://www.ncsa.uiuc.edu/demoweb/html-primerl#A1.3.3.3 \
ftp://mp3:mp3@138.247.121.61:21000/c%3a/ \
http://www.oreilly.com \
http://www.ibiblio.org/nywc/compositions.phtml?category=Piano \
http://admin@www.blackstar.com:8080/
The URL is http://www.ncsa.uiuc.edu/demoweb/html-primerl#A1.3.3.3
The scheme is http
The user info is null
The host is www.ncsa.uiuc.edu
The port is -1
The path is /demoweb/html-primerl
The ref is A1.3.3.3
The query string is null
The URL is ftp://mp3:mp3@138.247.121.61:21000/c%3a/
The scheme is ftp
The user info is mp3:mp3
The host is 138.247.121.61
The port is 21000
The path is /c%3a/
The ref is null
The query string is null
The URL is http://www.oreilly.com
The scheme is http
The user info is null
The host is www.oreilly.com
The port is -1
The path is
The ref is null
The query string is null
The URL is http://www.ibiblio.org/nywc/compositions.phtml?category=Piano
The scheme is http
The user info is null
The host is www.ibiblio.org
The port is -1
The path is /nywc/compositions.phtml
The ref is null
The query string is category=Piano
The URL is http://admin@www.blackstar.com:8080/
The scheme is http
The user info is admin
The host is www.blackstar.com
The port is 8080
The path is /
The ref is null
The query string is null
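The relationship between the authority and its three possible parts can also be shown in isolation. The sketch below is a small assumption-based illustration; the class name AuthorityParts and the describe helper are not part of Example 7-4.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class AuthorityParts {

  // Summarizes the authority of a URL and the parts it is made of
  public static String describe(String spec) {
    try {
      URL u = new URL(spec);
      return "authority=" + u.getAuthority()
       + " userInfo=" + u.getUserInfo()
       + " host=" + u.getHost()
       + " port=" + u.getPort();
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(describe("http://admin@www.blackstar.com:8080/"));
    System.out.println(describe("http://conferences.oreilly.com/java/speakers/"));
  }
}
```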
7.1.3 Retrieving Data from a URL
Naked URLs aren't very
exciting. What's interesting is the data contained
in the documents they point to. The URL class has
several methods that retrieve data from a URL:
public InputStream openStream( ) throws IOException
public URLConnection openConnection( ) throws IOException
public URLConnection openConnection(Proxy proxy) throws IOException // 1.5
public Object getContent( ) throws IOException
public Object getContent(Class[] classes) throws IOException // 1.3
These methods differ in that they return the data at the URL as an
instance of different classes.
7.1.3.1 public final InputStream openStream( ) throws IOException
The openStream( ) method connects to the resource
referenced by the URL, performs any necessary
handshaking between the client and the server, and returns an
InputStream from which data can be read. The data
you get from this InputStream is the raw (i.e.,
uninterpreted) contents of the file the URL
references: ASCII if you're reading an ASCII text
file, raw HTML if you're reading an HTML file,
binary image data if you're reading an image file,
and so forth. It does not include any of the HTTP headers or any
other protocol-related information. You can read from this
InputStream as you would read from any other
InputStream. For example:
try {
  URL u = new URL("http://www.hamsterdance.com");
  InputStream in = u.openStream( );
  int c;
  while ((c = in.read( )) != -1) System.out.write(c);
}
catch (IOException ex) {
  System.err.println(ex);
}
This code fragment catches an IOException, which
also catches the MalformedURLException that the
URL constructor can throw, since
MalformedURLException subclasses
IOException.
Example 7-5 reads a URL from the command line, opens
an InputStream from that URL, chains the resulting
InputStream to an
InputStreamReader using the default encoding, and
then uses InputStreamReader's
read( ) method to read successive characters from
the file, each of which is printed on System.out.
That is, it prints the raw data located at the URL: if the URL
references an HTML file, the program's output is raw
HTML.
Example 7-5. Download a web page
import java.net.*;
import java.io.*;

public class SourceViewer {

  public static void main (String[] args) {
    if (args.length > 0) {
      try {
        // Open the URL for reading
        URL u = new URL(args[0]);
        InputStream in = u.openStream( );
        // buffer the input to increase performance
        in = new BufferedInputStream(in);
        // chain the InputStream to a Reader
        Reader r = new InputStreamReader(in);
        int c;
        while ((c = r.read( )) != -1) {
          System.out.print((char) c);
        }
      }
      catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    } // end if
  } // end main
} // end SourceViewer
And here are the first few lines of output when
SourceViewer downloads http://www.oreilly.com:
% java SourceViewer http://www.oreilly.com
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books,
software conferences, online publishing</title>
<meta name="keywords" content="O'Reilly, oreilly, computer books, technical
books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows
NT, Security, Sys Admin, System Administration, Oracle, PL/SQL, online books,
books online, computer book online, e-books, ebooks, Perl Conference, Open Source
Conference, Java Conference, open source, free software, XML, Mac OS X, .Net, dot
net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP,
bioinformatics, web services, p2p" />
<meta name="description" content="O'Reilly is a leader in technical and computer book
documentation, online content, and conferences for UNIX, Perl, Java, Linux, Internet,
Mac OS X, C, C++, Windows, Windows NT, Security, Sys Admin, System Administration,
Oracle, Design and Graphics, Online Books, e-books, ebooks, Perl Conference, Java
Conference, P2P Conference" />
There are quite a few more lines in that web page; if you want to see
them, you can fire up your web browser.
The shakiest
part of this program is that it blithely assumes that the remote URL
is text, which is not necessarily true. It could well be a GIF or
JPEG image, an MP3 sound file, or something else entirely. Even if it
is text, the document encoding may not be the same as the default
encoding of the client system. The remote host and local client may
not have the same default character set. As a general rule, for pages
that use a character set radically different from ASCII, the HTML
will include a META tag in the header specifying the character
set in use. For instance, this META tag specifies
the Big-5 encoding for Chinese:
<meta http-equiv="Content-Type" content="text/html; charset=big5">
An XML document will likely have an XML declaration instead:
<?xml version="1.0" encoding="Big5"?>
In practice, there's no easy way to get at this
information other than by parsing the file and looking for a header
like this one, and even that approach is limited. Many HTML files
hand-coded in Latin alphabets don't have such a
META tag. Since Windows, the Mac, and most Unixes
have somewhat different interpretations of the characters from 128 to
255, the extended characters in these documents do not translate
correctly on platforms other than the one on which they were created.
And as if this isn't confusing enough, the HTTP
header that precedes the actual document is likely to have its own
encoding information, which may completely contradict what the
document itself says. You can't read this header
using the URL class, but you can with the
URLConnection object returned by the
openConnection( ) method. Encoding detection and
declaration is one of the thornier parts of the architecture of the
Web.
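One partial remedy is to honor the charset parameter of the Content-Type header when the server sends one, and to fall back to a caller-supplied encoding otherwise. The sketch below illustrates that idea only; the class name CharsetAwareReader and its two methods are assumptions, and it does not look inside the document for META tags or XML declarations.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetAwareReader {

  // Picks the charset named in a Content-Type value such as
  // "text/html; charset=big5"; returns the fallback if none is
  // named or the name is not recognized by this VM
  public static Charset charsetFromContentType(String contentType,
                                               Charset fallback) {
    if (contentType == null) return fallback;
    for (String param : contentType.split(";")) {
      String p = param.trim();
      if (p.regionMatches(true, 0, "charset=", 0, 8)) {
        String name = p.substring(8).trim().replace("\"", "");
        try {
          return Charset.forName(name);
        }
        catch (IllegalArgumentException ex) {
          return fallback; // illegal or unsupported charset name
        }
      }
    }
    return fallback;
  }

  // Reads the whole document as text using the header's charset
  public static String read(URL u, Charset fallback) throws IOException {
    URLConnection uc = u.openConnection();
    Charset charset = charsetFromContentType(uc.getContentType(), fallback);
    StringBuilder text = new StringBuilder();
    try (Reader r = new BufferedReader(
         new InputStreamReader(uc.getInputStream(), charset))) {
      int c;
      while ((c = r.read()) != -1) text.append((char) c);
    }
    return text.toString();
  }

  public static void main(String[] args) throws IOException {
    if (args.length > 0) {
      System.out.print(read(new URL(args[0]), StandardCharsets.UTF_8));
    }
  }
}
```

The header is trusted over the fallback here purely for simplicity; as noted above, the header and the document may disagree, and neither is guaranteed to be right.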
7.1.3.2 public URLConnection openConnection( ) throws IOException
The openConnection( ) method
opens a socket to the specified URL and returns a
URLConnection object. A
URLConnection represents an open connection to a
network resource. If the call fails, openConnection( )
throws an IOException. For example:
try {
  URL u = new URL("http://www.jennicam.org/");
  try {
    URLConnection uc = u.openConnection( );
    InputStream in = uc.getInputStream( );
    // read from the connection...
  } // end try
  catch (IOException ex) {
    System.err.println(ex);
  }
} // end try
catch (MalformedURLException ex) {
  System.err.println(ex);
}
Use this method when you want to communicate directly with the
server. The URLConnection gives you access to
everything sent by the server: in addition to the document itself in
its raw form (e.g., HTML, plain text, binary image data), you can
access all the metadata specified by the protocol. For example, if
the scheme is HTTP, the URLConnection lets you
access the HTTP headers as well as the raw HTML. The
URLConnection class also lets you write data to as
well as read from a URL, for instance, in order to send email to
a mailto URL or post form data. The URLConnection
class will be the primary subject of Chapter 15.
Java 1.5 adds one overloaded variant of this method that specifies
the proxy server to pass the connection through:
public URLConnection openConnection(Proxy proxy) throws IOException
This overrides any proxy server set with the usual
socksProxyHost, socksProxyPort,
http.proxyHost, http.proxyPort,
http.nonProxyHosts, and similar system properties.
If the protocol handler does not support proxies, the argument is
ignored and the connection is made directly if possible.
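As a rough illustration of the Proxy variant, the sketch below prepares a connection that is routed through an HTTP proxy. The proxy host and port (proxy.example.com, 8080), the target URL, and the helper name openVia are placeholders invented for this example; nothing is read from the network here.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public class ProxiedConnection {

  // Hypothetical helper: prepares a connection to u that goes through
  // the given HTTP proxy instead of any proxy set in system properties.
  public static URLConnection openVia(URL u, String proxyHost, int proxyPort)
      throws IOException {
    // createUnresolved() defers the DNS lookup of the proxy host
    InetSocketAddress address = InetSocketAddress.createUnresolved(proxyHost, proxyPort);
    Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
    return u.openConnection(proxy);
  }

  public static void main(String[] args) throws IOException {
    URL u = new URL("http://www.example.com/");
    URLConnection uc = openVia(u, "proxy.example.com", 8080);
    System.out.println("Prepared " + uc.getClass().getName() + " for " + uc.getURL());
  }
}
```

Passing Proxy.NO_PROXY instead requests a direct connection regardless of the proxy system properties.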
7.1.3.3 public final Object getContent( ) throws IOException
The getContent( ) method
is the third way to download data referenced by a URL. The
getContent( ) method retrieves the data referenced
by the URL and tries to make it into some type of object. If the URL
refers to some kind of text object such as an ASCII or HTML file, the
object returned is usually some sort of
InputStream. If the URL refers to an image such as
a GIF or a JPEG file, getContent( ) usually
returns a java.awt.ImageProducer (more
specifically, an instance of a class that implements the
ImageProducer interface). What unifies these two
disparate classes is that they are not the thing itself but a means
by which a program can construct the thing:
try {
  URL u = new URL("http://mesola.obspm.fr/");
  Object o = u.getContent( );
  // cast the Object to the appropriate type
  // work with the Object...
}
catch (Exception ex) {
  System.err.println(ex);
}
getContent( ) operates by looking at the
Content-type field in the MIME header of the data
it gets from the server. If the server does not use MIME headers or
sends an unfamiliar Content-type,
getContent( ) returns some sort of
InputStream with which the data can be read. An
IOException is thrown if the object
can't be retrieved. Example 7-6
demonstrates this.
Example 7-6. Download an object
import java.net.*;
import java.io.*;

public class ContentGetter {

  public static void main (String[] args) {
    if (args.length > 0) {
      //Open the URL for reading
      try {
        URL u = new URL(args[0]);
        try {
          Object o = u.getContent( );
          System.out.println("I got a " + o.getClass( ).getName( ));
        } // end try
        catch (IOException ex) {
          System.err.println(ex);
        }
      } // end try
      catch (MalformedURLException ex) {
        System.err.println(args[0] + " is not a parseable URL");
      }
    } // end if
  } // end main

} // end ContentGetter
Here's the result of trying to get the content of
http://www.oreilly.com:
% java ContentGetter http://www.oreilly.com/
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream
The exact class may vary from one version of Java to the next (in
earlier versions, it's been
java.io.PushbackInputStream or
sun.net.www.http.KeepAliveStream) but it should be
some form of InputStream.
Here's what you get when you try to load a header
image from that page:
% java ContentGetter http://www.oreilly.com/graphics_new/animation.gif
I got a sun.awt.image.URLImageSource
Here's what happens when you try to load a Java
applet using getContent( ):
% java ContentGetter http://www.cafeaulait.org/RelativeURLTest.class
I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream
Here's what happens when you try to load an audio
file using getContent( ):
% java ContentGetter http://www.cafeaulait.org/course/week9/spacemusic.au
I got a sun.applet.AppletAudioClip
The last result is the most unusual because it is as close as the
Java core API gets to a class that represents a sound file.
It's not just an interface through which you can
load the sound data.
This example demonstrates the biggest problem with using
getContent( ): it's hard to
predict what kind of object you'll get. You could
get some kind of InputStream or an
ImageProducer or perhaps an
AudioClip; it's easy to check
using the instanceof operator. This information
should be enough to let you read a text file or display an image.
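The instanceof check can be sketched roughly as follows. The class name ContentInspector and its describe( ) method are invented for this illustration, and a ByteArrayInputStream stands in for the value a real call to getContent( ) would return, so the example runs without a network connection.

```java
import java.awt.image.ImageProducer;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class ContentInspector {

  // Hypothetical helper: names the broad category of an object
  // of the kind returned by URL.getContent()
  public static String describe(Object content) {
    if (content == null) return "nothing";
    if (content instanceof InputStream) return "a stream of bytes to read";
    if (content instanceof ImageProducer) return "an image producer";
    return "an unexpected " + content.getClass().getName();
  }

  public static void main(String[] args) {
    // Stand-in for: Object o = u.getContent();
    Object o = new ByteArrayInputStream(new byte[0]);
    System.out.println("I got " + describe(o));
  }
}
```

In a real program, each branch would cast the object and then read the stream or hand the ImageProducer to the AWT rather than return a description.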
7.1.3.4 public final Object getContent(Class[] classes) throws IOException // Java 1.3
Starting in Java 1.3, it is possible for a content handler to provide
different views of an object. This overloaded variant of the
getContent( ) method lets you choose what class
you'd like the content to be returned as. The method
attempts to return the URL's content as the first class in the
array that the content handler supports, trying them in array order.
For instance, if you prefer an HTML file to be
returned as a String, but your second choice is a
Reader and your third choice is an
InputStream, write:
URL u = new URL("http://www.nwu.org");
Class[] types = new Class[3];
types[0] = String.class;
types[1] = Reader.class;
types[2] = InputStream.class;
Object o = u.getContent(types);
You then have to test for the type of the returned object using
instanceof. For example:
if (o instanceof String) {
  System.out.println(o);
}
else if (o instanceof Reader) {
  int c;
  Reader r = (Reader) o;
  while ((c = r.read( )) != -1) System.out.print((char) c);
}
else if (o instanceof InputStream) {
  int c;
  InputStream in = (InputStream) o;
  while ((c = in.read( )) != -1) System.out.write(c);
}
else if (o == null) {
  // getContent(Class[]) returns null if none of the requested types is supported
  System.out.println("Error: none of the requested types is available");
}
else {
  System.out.println("Error: unexpected type " + o.getClass( ));
}
7.1.4 Utility Methods
The URL
class contains a couple of utility methods that perform common
operations on URLs. The sameFile( ) method
determines whether two URLs point to the same document. The
toExternalForm( ) method converts a
URL object to a string that can be used in an HTML
link or a web browser's Open URL dialog.
7.1.4.1 public boolean sameFile(URL other)
The sameFile( ) method
tests whether two URL objects point to the same
file. If they do, sameFile( ) returns
true; otherwise, it returns
false. The test that sameFile( )
performs is quite shallow; all it does is compare the
corresponding fields for equality, although it does detect whether the two
hostnames are really just aliases for each other. For instance, it
can tell that http://www.ibiblio.org/ and http://metalab.unc.edu/ are the same file.
However, it cannot tell that http://www.ibiblio.org:80/ and http://metalab.unc.edu/ are the same file or
that http://www.cafeconleche.org/
and http://www.cafeconleche.org/index.html are
the same file. sameFile( ) is smart enough to
ignore the fragment identifier part of a URL, however.
Here's a fragment of code that uses
sameFile( ) to compare two URLs:
try {
  URL u1 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS");
  URL u2 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD");
  if (u1.sameFile(u2)) {
    System.out.println(u1 + " is the same as \n" + u2);
  }
  else {
    System.out.println(u1 + " is not the same as \n" + u2);
  }
}
catch (MalformedURLException ex) {
  System.err.println(ex);
}
The output is:
http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS is the same as
http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD
The sameFile( ) method is similar to the
equals( ) method of the URL
class. The main difference between sameFile( ) and
equals( ) is that equals( )
considers the fragment identifier (if any), whereas
sameFile( ) does not. The two URLs shown here do
not compare equal although they are the same file. Also, any object
may be passed to equals( ); only
URL objects can be passed to sameFile( ).
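The difference between the two methods can be seen side by side in a small sketch like the one below. The class name, the report( ) helper, and the localhost URLs are assumptions made for this example; localhost is used because both methods may resolve hostnames, and a local name keeps the comparison independent of an external DNS lookup.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FragmentComparison {

  // Hypothetical helper: summarizes how the two comparison methods
  // judge a pair of URLs
  public static String report(URL u1, URL u2) {
    return "sameFile=" + u1.sameFile(u2) + ", equals=" + u1.equals(u2);
  }

  public static void main(String[] args) throws MalformedURLException {
    // Same host, port, and path; only the fragment identifier differs
    URL u1 = new URL("http://localhost/HTMLPrimer.html#GS");
    URL u2 = new URL("http://localhost/HTMLPrimer.html#HD");
    System.out.println(report(u1, u2)); // prints sameFile=true, equals=false
  }
}
```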
7.1.4.2 public String toExternalForm( )
The toExternalForm( ) method
returns a human-readable String representing the
URL. It is identical to the toString( ) method. In
fact, all the toString( ) method does is return
toExternalForm( ). Therefore, this method is
currently redundant and rarely used.
7.1.4.3 public URI toURI( ) throws URISyntaxException // Java 1.5
Java 1.5 adds a toURI( ) method that converts a
URL object to an equivalent URI
object. We'll take up the URI
class shortly. In the meantime, the main thing you need to know is
that the URI class provides much more accurate,
specification-conformant behavior than the URL
class. For operations like absolutization and encoding, you should
prefer the URI class where you have the option. In
Java 1.4 and later, the URL class should be used
primarily for the actual downloading of content from the remote
server.
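As one possible illustration of that division of labor, the sketch below converts a URL to a URI, lets the URI class resolve a relative reference, and converts the result back to a URL for downloading. The class name, the resolve( ) helper, and the www.example.com addresses are invented for this example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class ResolveWithURI {

  // Hypothetical helper: resolves a relative reference against a base
  // URL by delegating the absolutization to java.net.URI
  public static URL resolve(URL base, String relative)
      throws URISyntaxException, MalformedURLException {
    URI baseURI = base.toURI();
    URI resolved = baseURI.resolve(relative);
    return resolved.toURL();
  }

  public static void main(String[] args) throws Exception {
    URL base = new URL("http://www.example.com/docs/index.html");
    System.out.println(resolve(base, "images/logo.gif"));
    // prints http://www.example.com/docs/images/logo.gif
  }
}
```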
7.1.5 The Object Methods
URL
inherits from java.lang.Object, so it has access
to all the methods of the Object class. It overrides three to provide
more specialized behavior: equals( ),
hashCode( ), and toString( ).
7.1.5.1 public String toString( )
Like all good classes, java.net.URL has a
toString( ) method.
Example 7-1 through Example 7-5
implicitly called this method when URLs were
passed to System.out.println( ). As those examples
demonstrated, the String produced by
toString( ) is always an absolute URL, such as
http://www.cafeaulait.org/javatutorial.html.
It's uncommon to call toString( )
explicitly. Print statements call toString( )
implicitly. Outside of print statements, it's more
proper to use toExternalForm( ) instead. If you do
call toString( ), the syntax is simple:
URL codeBase = this.getCodeBase( );
String appletURL = codeBase.toString( );
7.1.5.2 public boolean equals(Object o)
An
object is equal to a URL only if it is also a
URL, both URLs point to the
same file as determined by the sameFile( ) method,
and both URLs have the same fragment identifier
(or both URLs don't have fragment
identifiers). Since equals( ) depends on
sameFile( ), equals( ) has the
same limitations as sameFile( ). For example,
http://www.oreilly.com/ is not
equal to http://www.oreilly.com/index.html, and
http://www.oreilly.com:80/ is not
equal to http://www.oreilly.com/.
Whether this makes sense depends on whether you think of a URL as a
string or as a reference to a particular Internet resource.
Example 7-7 creates URL objects
for http://www.ibiblio.org/ and
http://metalab.unc.edu/ and tells
you if they're the same using the equals( ) method.
Example 7-7. Are http://www.ibiblio.org and http://metalab.unc.edu the same?
import java.net.*;

public class URLEquality {

  public static void main (String[] args) {
    try {
      URL ibiblio = new URL ("http://www.ibiblio.org/");
      URL metalab = new URL("http://metalab.unc.edu/");
      if (ibiblio.equals(metalab)) {
        System.out.println(ibiblio + " is the same as " + metalab);
      }
      else {
        System.out.println(ibiblio + " is not the same as " + metalab);
      }
    }
    catch (MalformedURLException ex) {
      System.err.println(ex);
    }
  }

}
When you run this program, you discover:
% java URLEquality
http://www.ibiblio.org/ is the same as http://metalab.unc.edu/
7.1.5.3 public int hashCode( )
The hashCode( ) method
returns an int that is used when
URL objects are used as keys in hash tables. Thus,
it is called by the various methods of
java.util.Hashtable. You rarely need to call this
method directly, if ever. Hash codes for two different
URL objects are unlikely to be the same, but it is
certainly possible; there are far more conceivable URLs than there
are four-byte integers.
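A small sketch of URLs used as hash table keys follows. The class name, the lookup( ) helper, and the localhost URL are assumptions made for this example, and java.util.HashMap is used in place of Hashtable; both rely on hashCode( ) and equals( ) in the same way.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class URLKeys {

  // Hypothetical helper: stores a value under one URL object and
  // retrieves it with a second, equal URL object
  public static String lookup() throws MalformedURLException {
    Map<URL, String> titles = new HashMap<URL, String>();
    titles.put(new URL("http://localhost/index.html"), "Home page");
    // A distinct object that equals the key has the same hash code,
    // so the map finds the entry stored above
    URL key = new URL("http://localhost/index.html");
    return titles.get(key);
  }

  public static void main(String[] args) throws MalformedURLException {
    System.out.println(lookup()); // prints Home page
  }
}
```

Because equals( ) and hashCode( ) on a URL may involve resolving the hostname, keying a large table on URL objects can be slow; this example sidesteps that by using localhost.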
7.1.6 Methods for Protocol Handlers
The last method in the URL class I'll just
mention briefly here for the sake of completeness:
setURLStreamHandlerFactory( ).
It's primarily used by protocol handlers that are
responsible for new schemes, not by programmers who just want to
retrieve data from a URL. We'll discuss it in more
detail in Chapter 16.
7.1.6.1 public static synchronized void setURLStreamHandlerFactory(URLStreamHandlerFactory factory)
This method sets the
URLStreamHandlerFactory for the application
and throws a generic Error if the factory has
already been set. A URLStreamHandler is
responsible for parsing the URL and then constructing the appropriate
URLConnection object to handle the connection to
the server. Most of the time this happens behind the scenes.
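The set-once behavior can be observed with a sketch along these lines. The class name, the PASS_THROUGH factory, and the install( ) helper are invented for this illustration. The factory returns null for every protocol, which leaves the built-in handlers in charge, so it changes nothing about how URLs are handled.

```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class FactoryOnce {

  // A do-nothing factory for illustration: returning null tells URL to
  // fall back to its default protocol handlers
  static final URLStreamHandlerFactory PASS_THROUGH = new URLStreamHandlerFactory() {
    public URLStreamHandler createURLStreamHandler(String protocol) {
      return null;
    }
  };

  // Hypothetical helper: returns true if the factory was installed,
  // false if the attempt was rejected with an Error
  public static boolean install(URLStreamHandlerFactory factory) {
    try {
      URL.setURLStreamHandlerFactory(factory);
      return true;
    }
    catch (Error err) {
      // Catching Error this broadly is acceptable only in a demonstration
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("First call installed:  " + install(PASS_THROUGH));
    System.out.println("Second call installed: " + install(PASS_THROUGH));
  }
}
```

In a virtual machine where no factory has been set yet, the first call should succeed and the second should be rejected.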