3.1 URIs
A Uniform
Resource Identifier (URI) is a string of characters in a particular
syntax that identifies a resource. The resource identified may be a
file on a server, but it may also be an email address, a news
message, a book, a person's name, an Internet host,
the current stock price of Sun Microsystems, or something else. An
absolute URI is made up of a scheme for the URI and a scheme-specific
part, separated by a colon, like this:
scheme:scheme-specific-part

The syntax of the scheme-specific part depends on the scheme being
used. Current schemes
include:
data      Base64-encoded data included directly in a link; see RFC 2397
file      A file on a local disk
ftp       An FTP server
http      A World Wide Web server using the Hypertext Transfer Protocol
gopher    A Gopher server
mailto    An email address
news      A Usenet newsgroup
telnet    A connection to a Telnet-based service
urn       A Uniform Resource Name
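
The split between scheme and scheme-specific part can be seen with the java.net.URI class. The following is a minimal sketch rather than a definitive implementation; the class name SchemeSplitter and the mailto address are hypothetical and chosen only for illustration.

```java
import java.net.URI;

public class SchemeSplitter {

    // Returns {scheme, scheme-specific part} for an absolute URI string.
    public static String[] split(String text) {
        // URI.create throws IllegalArgumentException if the syntax is invalid
        URI uri = URI.create(text);
        return new String[] { uri.getScheme(), uri.getSchemeSpecificPart() };
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.ietf.org/rfc/rfc2396.txt",
            "urn:isbn:1565928709",
            "mailto:someone@example.com" // hypothetical address
        };
        for (String sample : samples) {
            String[] parts = split(sample);
            System.out.println(parts[0] + " | " + parts[1]);
        }
    }
}
```

Everything before the first colon is reported as the scheme, and everything after it is the scheme-specific part, whatever internal structure that part may have.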
In addition, Java makes heavy use of nonstandard custom schemes such
as rmi, jndi, and doc for various purposes. We'll
look at the mechanism behind this in Chapter 16,
when we discuss protocol handlers.

There is no specific syntax that applies to the scheme-specific parts
of all URIs. However, many have a hierarchical form, like this:
//authority/path?query

The authority part of the URI names the
authority responsible for resolving the rest of the URI. For
instance, the URI http://www.ietf.org/rfc/rfc2396.txt has the
scheme http and the authority www.ietf.org. This means the server at
www.ietf.org is responsible for
mapping the path /rfc/rfc2396.txt to a resource.
This URI does not have a query part. The URI http://www.powells.com/cgi-bin?
has the scheme http and the authority www.powells.com. Different authorities may
interpret the same path differently: a given path means one thing when the
authority is www.landoverbaptist.org and
something very different when the authority is www.churchofsatan.com. The path may be
hierarchical, in which case the individual parts are separated by
forward slashes, and the . and
.. operators are used to navigate the hierarchy.
These are derived from the pathname syntax on the Unix operating
systems where the Web and URLs were invented. They conveniently map
to a filesystem stored on a Unix web server. However, there is no
guarantee that the components of any particular path actually
correspond to files or directories on any particular filesystem. For
example, in the URI http://www.amazon.com/exec/obidos/ISBN%3D1565924851/cafeaulaitA/002-3777605-3043449,
all the pieces of the hierarchy are just used to pull information out
of a database that's never stored in a filesystem.
ISBN%3D1565924851 selects the particular book
from the database by its ISBN number,
cafeaulaitA specifies who gets the referral fee
if a purchase is made from this link, and
002-3777605-3043449 is a session key used to
track the visitor's path through the site.

Some URIs aren't at all hierarchical, at least in
the filesystem sense. For example,
snews://secnews.netscape.com/netscape.devs-java
has a path of /netscape.devs-java. Although
there's some hierarchy to the newsgroup names
indicated by the . between
netscape and devs-java, it's not
visible as part of the URI.

The scheme part is composed of lowercase letters, digits, and the
plus sign, period, and hyphen. The other three parts of a typical URI
(authority, path, and query) should each be composed of the ASCII
alphanumeric characters; that is, the letters A-Z, a-z, and the
digits 0-9. In addition, the punctuation characters - _ . ! ~ * '
may also be used. All other characters, including
non-ASCII alphanumerics such as á and π, should be escaped
by a percent sign (%) followed by the hexadecimal code for the
character. For instance, á would be encoded as %E1. A URL
so transformed is said to have been
"x-www-form-urlencoded".

This process assumes that the character set is Latin-1. The URI
and URL specifications don't actually say what
character set should be used, which means most software tends to use
the local default character set. Thus, URLs containing non-ASCII
characters aren't very interoperable across
different platforms and languages. With the Web becoming more
international and less English daily, this situation has become
increasingly problematic. Work is ongoing to define Internationalized
Resource Identifiers (IRIs) that can use the full range of Unicode.
At the time of this writing, the IRI draft specification indicates
that non-ASCII characters should be encoded by first converting them
to UTF-8, then percent-escaping each byte of the UTF-8, as specified
above. For instance, the Greek letter π is Unicode code point 3C0. In
UTF-8, this letter is encoded as the two bytes CF, 80. Thus in
a URL it would be encoded as %CF%80.

Punctuation characters such as / and @ must also be encoded with
percent escapes if they are used in any role other than
what's specified for them in the scheme-specific
part of a particular URL. For example, the forward slashes in the URI
http://www.cafeaulait.org/books/javaio/ do
not need to be encoded as %2F because they serve
to delimit the hierarchy as specified for the
http URI scheme. However, if a filename includes
a / character (for instance, if the last directory were named
Java I/O instead of javaio
to more closely match the name of the book), the URI would have
to be written as http://www.cafeaulait.org/books/Java%20I%2FO/.
This is not as farfetched as it might sound to Unix or Windows users.
Mac filenames frequently include a forward slash. Filenames on many
platforms often contain characters that need to be encoded, including
@, $, +, =, and many more.
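
The escaping rules above can be sketched in a few lines of Java. This is an illustrative sketch, not a library API: the class PercentEncoder and its encodeSegment method are hypothetical names. It assumes the UTF-8 approach described for IRIs (not Latin-1) and treats its input as a single path segment, so a literal / inside it is escaped.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncoder {

    // Punctuation that the text lists as usable without escaping
    private static final String SAFE_PUNCTUATION = "-_.!~*'";

    // Percent-escapes one path segment: ASCII letters, digits, and the safe
    // punctuation pass through; every other UTF-8 byte becomes %XX.
    public static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF; // treat the byte as unsigned
            char c = (char) value;
            boolean asciiAlphanumeric = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (asciiAlphanumeric || SAFE_PUNCTUATION.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("\u03C0"));   // the Greek letter pi
        System.out.println(encodeSegment("Java I/O")); // directory name containing a slash
    }
}
```

Because the sketch encodes UTF-8 bytes, á comes out as %C3%A1 rather than the Latin-1 form %E1, which is exactly the kind of interoperability difference the text describes.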
3.1.1 URNs
There are two types of URIs: Uniform Resource Locators (URLs) and
Uniform Resource Names (URNs). A
URL is a pointer to a particular resource on the Internet at a
particular location. For example, http://www.oreilly.com/catalog/javanp3/ is
one of several URLs for the book Java Network
Programming. A URN is a name for a particular resource but
without reference to a particular location. For instance,
urn:isbn:1565928709 is a URN referring to the
same book. As this example shows, URNs, unlike URLs, are not limited
to Internet resources.

The goal of URNs is to handle resources that are mirrored in many
different locations or that have moved from one site to another; they
identify the resource itself, not the place where the resource lives.
For instance, when given a URN for a particular piece of software, an
FTP program should get the file from the nearest mirror site. Given a
URN for a book, a browser might reserve the book at the local library
or order a copy from a bookstore.

A URN has the general form:
urn:namespace:resource_name

The namespace is the name of a collection
of certain kinds of resources maintained by some authority. The
resource_name is the name of a resource
within that collection. For instance, the URN
urn:ISBN:1565924851 identifies a resource in the
ISBN namespace with the identifier
1565924851. Of all the books published, this one
selects the first edition of Java I/O.

The exact syntax of resource names depends on the namespace. The ISBN
namespace expects to see strings composed of 10 or 13 characters, all
of which are digits, with the single exception that the last
character may be the letter X (either upper- or
lowercase) instead. Furthermore, ISBNs may contain hyphens that are
ignored when comparing. Other namespaces will use very different
syntaxes for resource names. The IANA is responsible for handing out
namespaces to different organizations, as described in RFC 3406.
Basically, you have to submit an Internet draft to the IETF and
publish an announcement on the urn-nid mailing list for public
comment and discussion before formal standardization.
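
As a rough illustration of the urn:namespace:resource_name form, the sketch below splits a URN into its two parts and compares ISBNs the way the text describes, ignoring hyphens and the case of a trailing X. The class UrnParts and its methods are hypothetical names, and the hyphenated ISBN in main is only an illustrative spelling of the same number.

```java
import java.util.Locale;

public class UrnParts {

    // Splits "urn:namespace:resource_name" into {namespace, resource_name}.
    public static String[] parse(String urn) {
        // Limit of 3 keeps any further colons inside the resource name
        String[] pieces = urn.split(":", 3);
        if (pieces.length != 3 || !pieces[0].equalsIgnoreCase("urn")) {
            throw new IllegalArgumentException("Not a URN: " + urn);
        }
        return new String[] { pieces[1], pieces[2] };
    }

    // Compares two ISBNs, ignoring hyphens and the case of a final X.
    public static boolean sameIsbn(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    private static String normalize(String isbn) {
        return isbn.replace("-", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] parts = parse("urn:ISBN:1565924851");
        System.out.println("namespace: " + parts[0]);
        System.out.println("resource name: " + parts[1]);
        System.out.println(sameIsbn("1-56592-485-1", parts[1]));
    }
}
```

Other namespaces would need their own comparison rules, since the syntax of the resource name is defined per namespace.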
3.1.2 URLs
A URL identifies the location of a
resource on the Internet. It specifies the protocol used to access a
server (e.g., FTP, HTTP), the name of the server, and the location of
a file on that server. A typical URL looks like http://www.ibiblio.org/javafaq/javatutorial.html.
This specifies that there is a file called
javatutorial.html in a directory called
javafaq on the server
www.ibiblio.org, and that this file can be
accessed via HTTP. The syntax of a URL is:
protocol://username@hostname:port/path/filename?query#fragment

Here the protocol
is another word for what was called the scheme of the URI.
(Scheme is the word used in the URI RFC.
Protocol is the word used in the Java
documentation.) In a URL, the protocol part can be
file, ftp,
http, https,
gopher, news,
telnet, wais, or various
other strings (though not urn).

The hostname part of a URL is the name of the
server that provides the resource you want, such as
www.oreilly.com or
utopia.poly.edu. It can also be the
server's IP address, such as 204.148.40.9 or
128.238.3.21. The username
is an optional username for the server. The port
number is also optional. It's not necessary if the
service is running on its default
port (port 80 for HTTP
servers).

The path
points to a particular directory on the specified server. The path is
relative to the document root of the server, not necessarily to the
root of the filesystem on the server. As a rule, servers that are
open to the public do not show their entire filesystem to clients.
Rather, they show only the contents of a specified directory. This
directory is called the document root, and all paths and filenames
are relative to it. Thus, on a Unix server, all files that are
available to the public might be in
/var/public/html, but to somebody connecting
from a remote machine, this directory looks like the root of the
filesystem.

The filename points to a particular file in the directory specified
by the path. It is often omitted, in which case it is left to
the server's discretion what file, if any, to send.
Many servers send an index file for that directory, often called
index.html or Welcome.html.
Some send a list of the files and folders in the directory, as shown
in Figure 3-1. Others may send a 403 Forbidden error message, as
shown in Figure 3-2.
Figure 3-1. A web server configured to send a directory list when no index file exists

Figure 3-2. A web server configured to send a 403 error when no index file exists

The query string provides additional arguments
for the server. It's commonly used only in
http URLs, where it contains form data for input
to programs running on the server.

Finally, the
fragment references a particular part of the
remote resource. If the remote resource is HTML, the fragment
identifier names an anchor in the HTML document. If the remote
resource is XML, the fragment identifier is an XPointer. Some
documents refer to the fragment part of the URL as a
"section"; Java documents rather
unaccountably refer to the fragment identifier as a
"Ref". A named anchor is created in
an HTML document with a tag, like this:
<A NAME="xtocid1902914">Comments</A>

This tag identifies a particular point in a document. To refer to
this point, a URL includes not only the document's
filename but the named anchor separated from the rest of the URL by a
#:
http://www.cafeaulait.org/#xtocid1902914
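
The pieces of a URL described in this section map onto getter methods of the java.net.URL class, including getRef() for the fragment. The sketch below is one way to inspect them; the class name UrlParts and the second URL (on www.example.com) are hypothetical and used only to show a URL with every part present.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class UrlParts {

    // Parses a string into a URL, rethrowing failures as unchecked exceptions.
    public static URL parse(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    private static void describe(URL u) {
        System.out.println("protocol: " + u.getProtocol());
        System.out.println("username: " + u.getUserInfo());   // null if absent
        System.out.println("hostname: " + u.getHost());
        System.out.println("port:     " + u.getPort());       // -1 if not specified
        System.out.println("default:  " + u.getDefaultPort());
        System.out.println("path:     " + u.getPath());
        System.out.println("query:    " + u.getQuery());      // null if absent
        System.out.println("fragment: " + u.getRef());        // Java calls this the "ref"
        System.out.println();
    }

    public static void main(String[] args) {
        describe(parse("http://www.cafeaulait.org/#xtocid1902914"));
        // Hypothetical URL that uses every optional part
        describe(parse("http://reader@www.example.com:8080/docs/index.html?topic=uri#top"));
    }
}
```

When the port is omitted, getPort() reports -1 and getDefaultPort() supplies the protocol's default, such as 80 for HTTP.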
3.1.3 Relative URLs
A URL tells the web browser a lot about a
document: the protocol used to retrieve the document, the name of the
host where the document lives, and the path to that document on the
host. Most of this information is likely to be the same for other
URLs that are referenced in the document. Therefore, rather than
requiring each URL to be specified in its entirety, a URL may inherit
the protocol, hostname, and path of its parent document (i.e., the
document in which it appears). URLs that aren't
complete but inherit pieces from their parent are called
relative URLs. In contrast, a completely
specified URL is called an absolute URL. In a
relative URL, any pieces that are missing are assumed to be the same
as the corresponding pieces from the URL of the document in which the
URL is found. For example, suppose that while browsing http://www.ibiblio.org/javafaq/javatutorial.html
you click on this hyperlink:
<a href=">

The browser cuts javatutorial.html off the end of http://www.ibiblio.org/javafaq/javatutorial.html
to get http://www.ibiblio.org/javafaq/. Then it
attaches the relative URL from the link onto the
end of http://www.ibiblio.org/javafaq/ to get
the absolute URL of the linked document.
Finally, it loads that document.

If the relative link begins with a /, then it is
relative to the document root instead of relative to the current
file. Thus, if you click on the following link while browsing
http://www.ibiblio.org/javafaq/javatutorial.html:
<a href=">

the browser would throw away /javafaq/javatutorial.html and attach
the path from the link to the
end of http://www.ibiblio.org to
get the absolute URL.

Relative URLs have a number of advantages. First, and least
important, they save a little typing. More importantly, relative
URLs allow a single document tree to be served by multiple protocols:
for instance, both FTP and HTTP. The HTTP might be used for direct
surfing, while the FTP could be used for mirroring the site. Most
importantly of all, relative URLs allow entire trees of documents to
be moved or copied from one site to another without breaking all the
internal links.
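
The resolution steps a browser performs can be approximated with the resolve() method of java.net.URI. This is a sketch under stated assumptions: the class name RelativeResolver is hypothetical, the base document is taken to be http://www.ibiblio.org/javafaq/javatutorial.html, and the relative links other.html and /elsewhere/ are made-up examples, not links from the actual page.

```java
import java.net.URI;

public class RelativeResolver {

    // Resolves a relative URL against the URL of the document that contains it.
    public static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ibiblio.org/javafaq/javatutorial.html";

        // Hypothetical link relative to the current file's directory
        System.out.println(resolve(base, "other.html"));

        // Hypothetical link beginning with /, relative to the document root
        System.out.println(resolve(base, "/elsewhere/"));
    }
}
```

The first call replaces only the filename at the end of the base URL, while the second replaces the entire path, matching the two cases described above.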
Java Network Programming, 3rd Edition
By Elliotte Rusty Harold
Publisher: O'Reilly
Pub Date: October 2004
ISBN: 0-596-00721-3
Pages: 706
Thoroughly revised to cover all the 100+ significant updates
to Java Developers Kit (JDK) 1.5, Java Network
Programming is a complete introduction to
developing network programs (both applets and applications)
using Java, covering everything from networking fundamentals
to remote method invocation (RMI). It includes chapters on
TCP and UDP sockets, multicasting protocol and content
handlers, servlets, and the new I/O API. This is the
essential resource for any serious Java developer.
