7.2 The URLEncoder and URLDecoder Classes
One of the challenges faced by the designers of the Web was dealing
with the differences between operating systems.
These differences can cause
problems with URLs: for example, some operating systems allow spaces
in filenames; some don't. Most operating systems
won't complain about a # sign in a filename; but in
a URL, a # sign indicates that the filename has ended, and a fragment
identifier follows. Other special characters, nonalphanumeric
characters, and so on, all of which may have a special meaning inside
a URL or on another operating system, present similar problems. To
solve these problems, characters used in URLs must come from a fixed
subset of ASCII, specifically:

The capital letters A-Z
The lowercase letters a-z
The digits 0-9
The punctuation characters - _ . ! ~ * ' ( ) and ,

The characters : / & ? @ # ; $ + = and % may also be used, but
only for their specified purposes. If these characters occur as part
of a filename, they and all other characters should be encoded.

The encoding is very simple. Any characters that are not ASCII
numerals, letters, or the punctuation marks specified earlier are
converted into bytes and each byte is written as a percent sign
followed by two hexadecimal digits. Spaces are a special case because
they're so common. Besides being encoded as %20,
they can be encoded as a plus sign (+). The plus sign itself is
encoded as %2B. The / # = & and ? characters should be encoded
when they are used as part of a name, and not as a separator between
parts of the URL.
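The rule just described can be sketched by hand. The following is a minimal illustration, not the algorithm URLEncoder itself uses: it assumes UTF-8 as the byte encoding, writes spaces as %20 rather than +, and treats its input as a single name, so separators such as / and & are escaped too. The class and method names (PercentEncodeSketch, isSafe, percentEncode) are hypothetical, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncodeSketch {

  // Punctuation the text lists as safe to leave unencoded.
  private static final String SAFE_PUNCTUATION = "-_.!~*'(),";

  // True for ASCII letters, ASCII digits, and the safe punctuation marks.
  static boolean isSafe(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9') || SAFE_PUNCTUATION.indexOf(c) >= 0;
  }

  // Hypothetical helper: converts every unsafe character to its UTF-8
  // bytes and writes each byte as % followed by two hexadecimal digits.
  static String percentEncode(String s) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < s.length()) {
      int codePoint = s.codePointAt(i);
      int length = Character.charCount(codePoint);
      if (length == 1 && isSafe((char) codePoint)) {
        out.append((char) codePoint);
      } else {
        byte[] bytes = s.substring(i, i + length).getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
          out.append('%');
          out.append(Character.toUpperCase(Character.forDigit((b >> 4) & 0xF, 16)));
          out.append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
        }
      }
      i += length;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(percentEncode("Java I/O")); // prints Java%20I%2FO
    System.out.println(percentEncode("a+b"));      // prints a%2Bb
  }
}
```

Because the sketch escapes every character outside the safe set, it is only suitable for one piece of a URL at a time, such as a filename or a query value.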
The URL class does not perform encoding or
decoding automatically. You can construct URL
objects that use illegal ASCII and non-ASCII characters and/or
percent escapes. Such characters and escapes are not automatically
encoded or decoded when output by methods such as getPath( ) and toExternalForm( ). You are
responsible for making sure all such characters are properly encoded
in the strings used to construct a URL object.

Luckily, Java provides a URLEncoder class to
encode strings in this format. Java 1.2 adds a
URLDecoder class that can decode strings in this
format. Neither of these classes needs to be instantiated; both
provide only static methods.
public class URLDecoder extends Object
public class URLEncoder extends Object
7.2.1 URLEncoder
In Java 1.3 and earlier, the
java.net.URLEncoder class contains a single static
method called encode( ) that encodes a String
according to these rules:
public static String encode(String s)

This method always uses the default encoding of the platform on which
it runs, so it will produce different results on different systems.
As a result, Java 1.4 deprecates this method and replaces it with a
method that requires you to specify the encoding:

public static String encode(String s, String encoding)
  throws UnsupportedEncodingException

Both variants change any nonalphanumeric
characters into % sequences (except the space, underscore, hyphen,
period, and asterisk characters). Both also encode all non-ASCII
characters. The space is converted into a plus sign. These methods
are a little over-aggressive; they also convert tildes, single
quotes, exclamation points, and parentheses to percent escapes, even
though they don't absolutely have to. However, this
change isn't forbidden by the URL specification, so
web browsers deal reasonably with these excessively encoded URLs.

Both variants return a new String, suitably
encoded. The Java 1.3 encode( ) method uses the
platform's default encoding to calculate percent
escapes. This encoding is typically ISO-8859-1 on U.S. Unix systems,
Cp1252 on U.S. Windows systems, MacRoman on U.S. Macs, and so on in
other locales. Because both encoding and decoding are platform- and
locale-specific, this method is annoyingly non-interoperable, which
is precisely why it has been deprecated in Java 1.4 in favor of the
variant that requires you to specify an encoding. However, if you
just pick the platform default encoding, your program will be as
platform- and locale-locked as the Java 1.3 version. Instead, you
should always pick UTF-8, never anything else. UTF-8 is compatible
with the new IRI specification, the URI class,
modern web browsers, and more other software than any other encoding
you could choose.

Example 7-8 is a program that uses
URLEncoder.encode( ) to print various encoded
strings. Java 1.4 or later is required to compile and run it.
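As a brief aside before the listing, the effect of the encoding argument can be seen in a small sketch. It encodes the same string twice, once as UTF-8 and once as ISO-8859-1, and the percent escapes differ. The class name EncodingComparison and the helper encodeWith are hypothetical names chosen for illustration; the non-ASCII character is written as a Unicode escape so the source file itself stays ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingComparison {

  // Hypothetical helper that wraps the checked exception for brevity.
  static String encodeWith(String s, String charsetName) {
    try {
      return URLEncoder.encode(s, charsetName);
    }
    catch (UnsupportedEncodingException ex) {
      throw new RuntimeException(charsetName + " is not supported", ex);
    }
  }

  public static void main(String[] args) {
    String s = "caf\u00E9"; // the word "cafe" with an acute accent on the e
    System.out.println(encodeWith(s, "UTF-8"));      // prints caf%C3%A9
    System.out.println(encodeWith(s, "ISO-8859-1")); // prints caf%E9
  }
}
```

A receiver that decodes with a different encoding than the sender used will not recover the original character, which is the interoperability problem the deprecated one-argument method invites.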
Example 7-8. x-www-form-urlencoded strings
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class EncoderTest {

  public static void main(String[] args) {

    try {
      System.out.println(URLEncoder.encode("This string has spaces",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This*string*has*asterisks",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This%string%has%percent%signs",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This+string+has+pluses",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This/string/has/slashes",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This\"string\"has\"quote\"marks",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This:string:has:colons",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This~string~has~tildes",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This(string)has(parentheses)",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This.string.has.periods",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This=string=has=equals=signs",
       "UTF-8"));
      System.out.println(URLEncoder.encode("This&string&has&ampersands",
       "UTF-8"));
      System.out.println(URLEncoder.encode(
       "Thiséstringéhasénon-ASCII characters", "UTF-8"));
    }
    catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }
}

Here is the output. Note that the code needs to be saved in something
other than ASCII, and the encoding chosen should be passed as an
argument to the compiler to account for the non-ASCII characters in
the source code.
% javac -encoding UTF8 EncoderTest.java
% java EncoderTest
This+string+has+spaces
This*string*has*asterisks
This%25string%25has%25percent%25signs
This%2Bstring%2Bhas%2Bpluses
This%2Fstring%2Fhas%2Fslashes
This%22string%22has%22quote%22marks
This%3Astring%3Ahas%3Acolons
This%7Estring%7Ehas%7Etildes
This%28string%29has%28parentheses%29
This.string.has.periods
This%3Dstring%3Dhas%3Dequals%3Dsigns
This%26string%26has%26ampersands
This%C3%A9string%C3%A9has%C3%A9non-ASCII+characters

Notice in particular that this method encodes the forward slash, the
ampersand, the equals sign, and the colon. It does not attempt to
determine how these characters are being used in a URL. Consequently,
you have to encode URLs piece by piece rather than
encoding an entire URL in one method call. This is an important
point, because the most common use of URLEncoder
is in preparing query strings for communicating with
server-side programs that use GET. For example,
suppose you want to encode this query string used for an AltaVista
search:
pg=q&kl=XX&stype=stext&q=+"Java+I/O"&search.x=38&search.y=3

This code fragment encodes it:

String query = URLEncoder.encode(
 "pg=q&kl=XX&stype=stext&q=+\"Java+I/O\"&search.x=38&search.y=3");
System.out.println(query);

Unfortunately, the output is:

pg%3Dq%26kl%3DXX%26stype%3Dstext%26q%3D%2B%22Java%2BI%2FO%22%26search.x%3D38%26search.y%3D3

The problem is that URLEncoder.encode( ) encodes
blindly. It can't distinguish between special
characters used as part of the URL or query string, like
& and = in the previous
string, and characters that need to be encoded. Consequently, URLs
need to be encoded a piece at a time like this:
String query = URLEncoder.encode("pg");
query += "=";
query += URLEncoder.encode("q");
query += "&";
query += URLEncoder.encode("kl");
query += "=";
query += URLEncoder.encode("XX");
query += "&";
query += URLEncoder.encode("stype");
query += "=";
query += URLEncoder.encode("stext");
query += "&";
query += URLEncoder.encode("q");
query += "=";
query += URLEncoder.encode("+\"Java I/O\"");
query += "&";
query += URLEncoder.encode("search.x");
query += "=";
query += URLEncoder.encode("38");
query += "&";
query += URLEncoder.encode("search.y");
query += "=";
query += URLEncoder.encode("3");
System.out.println(query);

The output of this is what you actually want:

pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3

Example 7-9 is a QueryString class that
uses the URLEncoder to encode successive name and
value pairs in a Java object, which will be used for sending data to
server-side programs. When you create a
QueryString, you can supply the first name-value
pair to the constructor as individual strings. To add further pairs,
call the add( ) method, which also takes two
strings as arguments and encodes them. The getQuery( )
method returns the accumulated list of encoded name-value pairs.
Example 7-9. The QueryString class
package com.macfaq.net;

import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class QueryString {

  private StringBuffer query = new StringBuffer( );

  public QueryString(String name, String value) {
    encode(name, value);
  }

  public synchronized void add(String name, String value) {
    query.append('&');
    encode(name, value);
  }

  private synchronized void encode(String name, String value) {
    try {
      query.append(URLEncoder.encode(name, "UTF-8"));
      query.append('=');
      query.append(URLEncoder.encode(value, "UTF-8"));
    }
    catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }

  public String getQuery( ) {
    return query.toString( );
  }

  public String toString( ) {
    return getQuery( );
  }
}

Using this class, we can now encode the previous example:
QueryString qs = new QueryString("pg", "q");
qs.add("kl", "XX");
qs.add("stype", "stext");
qs.add("q", "+\"Java I/O\"");
qs.add("search.x", "38");
qs.add("search.y", "3");
String url = "http://www.altavista.com/cgi-bin/query?" + qs;
System.out.println(url);
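On more recent Java versions the same idea can be written without the checked exception. The following is a minimal sketch, assuming Java 10 or later, where URLEncoder.encode( ) also accepts a Charset; it is an alternative to Example 7-9, not a replacement the text itself prescribes. The class name QueryStringSketch and the helper toQuery are hypothetical, and a LinkedHashMap is assumed so that the pairs keep their insertion order.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringSketch {

  // Hypothetical helper: encodes each name and value separately,
  // then joins the pairs with '=' and '&', which are left unencoded.
  static String toQuery(Map<String, String> params) {
    StringBuilder query = new StringBuilder();
    for (Map.Entry<String, String> pair : params.entrySet()) {
      if (query.length() > 0) {
        query.append('&');
      }
      // encode(String, Charset) requires Java 10 or later
      query.append(URLEncoder.encode(pair.getKey(), StandardCharsets.UTF_8));
      query.append('=');
      query.append(URLEncoder.encode(pair.getValue(), StandardCharsets.UTF_8));
    }
    return query.toString();
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("pg", "q");
    params.put("q", "+\"Java I/O\"");
    params.put("search.x", "38");
    System.out.println(toQuery(params));
    // prints pg=q&q=%2B%22Java+I%2FO%22&search.x=38
  }
}
```

As with QueryString, the separators are appended by the helper rather than passed through the encoder, so only the names and values are escaped.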
7.2.2 URLDecoder
The
corresponding URLDecoder class has two static
methods that decode strings encoded in x-www-form-urlencoded format.
That is, they convert all plus signs to spaces and all percent
escapes to their corresponding character:

public static String decode(String s) throws Exception
public static String decode(String s, String encoding) // Java 1.4
  throws UnsupportedEncodingException

The first variant is used in Java 1.3 and 1.2. The second variant is
used in Java 1.4 and later. If you have any doubt about which
encoding to use, pick UTF-8. It's more likely to be
correct than anything else.

An IllegalArgumentException may be thrown if the
string contains a percent sign that isn't followed
by two hexadecimal digits or decodes into an illegal sequence. Then
again it may not be. This is implementation-dependent, and what
happens when an illegal sequence is detected but no
IllegalArgumentException is thrown is undefined. In
Sun's JDK 1.4, no exception is thrown and extra
bytes with no apparent meaning are added to the undecodable string.
This is truly brain-damaged, and possibly a security hole.

Since this method does not touch non-escaped characters (other than
converting plus signs to spaces), you can pass
an entire URL to it rather than splitting it into pieces first. For
example:
String input = "http://www.altavista.com/cgi-bin/" +
"query?pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3";
try {
String output = URLDecoder.decode(input, "UTF-8");
System.out.println(output);
}