2.2. Lexical Structure
This section explains the
lexical structure of a Java program. It
starts with a discussion of the Unicode character set in which Java
programs are written. It then covers the tokens that comprise a Java
program, explaining comments, identifiers, reserved words, literals,
and so on.
2.2.1. The Unicode Character Set
Java programs are written using
Unicode.
You can use Unicode characters anywhere in a Java program, including
comments and identifiers such as variable names. Unlike the 7-bit
ASCII character set, which is useful only for English, and the 8-bit
ISO Latin-1 character set, which is
useful only for major Western European languages, the Unicode
character set can represent virtually every written language in
common use on the planet. 16-bit Unicode characters are typically
written to files using an encoding known as
UTF-8, which converts the 16-bit
characters into a stream of bytes. The format is designed so that
plain ASCII text (and the 7-bit characters of Latin-1) are valid
UTF-8 byte streams. Thus, you can simply write plain ASCII programs,
and they will work as valid Unicode.

If you do not use a Unicode-enabled
text editor, or if you do not want to force other programmers who
view or edit your code to use a Unicode-enabled editor, you can embed
Unicode characters into your Java programs using the special Unicode
escape sequence \uxxxx,
in other words, a backslash and a lowercase u, followed by four
hexadecimal characters. For example, \u0020 is the
space character, and \u03c0 is the character π.

Unicode 3.1 and above, used in Java 5.0 and later, includes
"supplementary
characters" that require 21 bits to represent.
16-bit encodings of Unicode characters represent these supplementary
characters using a surrogate
pair, which is a sequence of two 16-bit
characters taken from a special reserved range of the 16-bit encoding
space. If you ever need to include one of these (rarely used)
supplementary characters in Java source code, use two
\u sequences to represent the surrogate pair.
(Details of surrogate pair encoding are beyond the scope of this
book, however.)
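
As a minimal sketch of the escape sequences described above, the following class uses \u escapes in a character literal, in an identifier, and as a surrogate pair. The class and method names are illustrative assumptions, and U+1D538 is simply one arbitrarily chosen supplementary character.

```java
class UnicodeEscapeDemo {
    // In a character literal, \u03c0 is the Greek small letter pi.
    static char pi() {
        return '\u03c0';
    }

    // Escapes are also legal in identifiers: the local variable below is named pi
    // (the single Greek letter), written with its escape so the file stays ASCII.
    static double circleArea(double r) {
        double \u03c0 = Math.PI;
        return \u03c0 * r * r;
    }

    // U+1D538 does not fit in 16 bits, so it is written as two escapes
    // that together form a surrogate pair.
    static String supplementary() {
        return "\uD835\uDD38";
    }

    public static void main(String[] args) {
        String s = supplementary();
        System.out.println(s.length());                            // count of 16-bit chars
        System.out.println(s.codePointCount(0, s.length()));       // count of Unicode characters
        System.out.println(Integer.toHexString(s.codePointAt(0))); // the code point in hex
    }
}
```

The distinction between length( ) and codePointCount( ) is the practical consequence of surrogate pairs: one supplementary character occupies two char values in a String.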
2.2.2. Case-Sensitivity and Whitespace
Java is a
case-sensitive language. Its
keywords are written in lowercase and must
always be used that way. That is, While and
WHILE are not the same as the
while keyword. Similarly, if you declare a
variable named i in your program, you may not
refer to it as I.

Java ignores spaces, tabs, newlines, and other whitespace, except
when it appears within quoted characters and string literals.
Programmers typically use whitespace to format and
indent their code for easy
readability, and you will see common indentation conventions in the
code examples of this book.
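
The short sketch below illustrates both points; the class and method names are hypothetical. The two methods differ only in whitespace and so behave identically, i and I are distinct variables, and the spaces inside the string literal are preserved.

```java
class WhitespaceDemo {
    // Legal but hard to read: only the whitespace required to separate tokens is kept.
    static int compact(){int i=1;int I=2;return i+I;}

    // The same tokens, formatted conventionally. Note that i and I are
    // two different variables because Java is case-sensitive.
    static int spaced() {
        int i = 1;
        int I = 2;
        return i + I;
    }

    // Whitespace inside a string literal is significant and is not ignored.
    static String quoted() {
        return "a   b";
    }

    public static void main(String[] args) {
        System.out.println(compact() == spaced());
        System.out.println(quoted().length());
    }
}
```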
2.2.3. Comments
Comments
are
natural-language text intended for human readers of a program. They
are ignored by the Java compiler. Java supports three types of
comments. The first type is a single-line comment, which begins with
the characters // and continues until the end of
the current line. For example:
    int i = 0;   // Initialize the loop variable

The second kind of comment is a
multiline comment. It begins with the characters
/* and continues, over any number of lines, until
the characters */. Any text between the
/* and the */ is ignored by the
Java compiler. Although this style of comment is typically used for
multiline comments, it can also be used for single-line comments.
This type of comment cannot be nested (i.e., one /*
*/ comment cannot appear within another). When writing
multiline comments, programmers often use extra *
characters to make the comments stand out. Here is a typical
multiline comment:
    /*
     * First, establish a connection to the server.
     * If the connection attempt fails, quit right away.
     */

The third type of comment is a
special case of the second. If a comment begins with
/**, it is regarded as a special doc
comment. Like regular multiline comments, doc comments end
with */ and cannot be nested. When you write a
Java class you expect other programmers to use, use doc comments to
embed documentation about the class and each of its methods directly
into the source code. A program named javadoc
extracts these comments and processes them to create online
documentation for your class. A doc comment can contain HTML tags and
can use additional syntax understood by javadoc.
For example:
    /**
     * Upload a file to a web server.
     *
     * @param file The file to upload.
     * @return <tt>true</tt> on success,
     *         <tt>false</tt> on failure.
     * @author David Flanagan
     */

See Chapter 7 for more information on the doc
comment syntax and Chapter 8 for more
information on the javadoc program.

Comments may appear between any tokens of a Java program, but may not
appear within a token. In particular, comments may not appear within
double-quoted string literals. A comment within a string literal
simply becomes a literal part of that string.
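
To make the placement rules concrete, here is a small sketch (the class and method names are invented for illustration) that puts a comment between two tokens and also shows comment-like text surviving inside a string literal.

```java
class CommentDemo {
    /**
     * Returns a string whose contents merely look like comments.
     *
     * @return text containing comment-like character sequences
     */
    static String notComments() {
        int i = 0; // single-line comment: ignored by the compiler
        /* multiline-style comment used on one line: also ignored */
        i = i /* a comment between two tokens */ + 1;
        // Inside the quotes below, the comment markers are ordinary characters.
        return "/* still here */ // and here: " + i;
    }

    public static void main(String[] args) {
        System.out.println(notComments());
    }
}
```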
2.2.4. Reserved Words
The following words are reserved in Java:
they are part of the syntax of the language and may not be used to
name variables, classes, and so forth.
    abstract   const      final        int         public        throw
    assert     continue   finally      interface   return        throws
    boolean    default    float        long        short         transient
    break      do         for          native      static        true
    byte       double     goto         new         strictfp      try
    case       else       if           null        super         void
    catch      enum       implements   package     switch        volatile
    char       extends    import       private     synchronized  while
    class      false      instanceof   protected   this

We'll meet each of these reserved words again later
in this book. Some of them are the names of primitive types and
others are the names of Java statements, both of which are discussed
later in this chapter. Still others are used to define classes and
their members (see Chapter 3).

Note that const and goto are
reserved but aren't actually used in the language.
strictfp was added in Java 1.2,
assert was added in Java 1.4, and
enum was added in Java 5.0.
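
If you need to test programmatically whether a word is reserved, one option is the standard method javax.lang.model.SourceVersion.isKeyword( ), sketched below. Treat this as an illustration rather than a definition of the list above: the method answers for the latest source version known to the running JDK, so a newer JDK may report words that do not appear in this list. The wrapper class and method names are hypothetical.

```java
import javax.lang.model.SourceVersion;

class ReservedWordCheck {
    // Returns true if the word is a keyword or one of the literals
    // true, false, and null, according to the running JDK.
    static boolean isReserved(String word) {
        return SourceVersion.isKeyword(word);
    }

    public static void main(String[] args) {
        System.out.println(isReserved("while")); // reserved
        System.out.println(isReserved("While")); // not reserved: case matters
        System.out.println(isReserved("goto"));  // reserved even though unused
    }
}
```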
2.2.5. Identifiers
An identifier
is simply a
name given to some part of a Java program, such as a class, a method
within a class, or a variable declared within a method. Identifiers
may be of any length and may contain letters and digits drawn from
the entire Unicode character set. An identifier may not begin with a
digit, however, because the compiler would then think it was a
numeric literal rather than an identifier.

In general, identifiers may not contain
punctuation characters. Exceptions include
the ASCII underscore (_) and dollar sign
($) as well as other Unicode currency symbols such
as £ and ¥. Currency symbols are intended for
use in automatically generated source code, such as code produced by
parser generators. By avoiding the use of currency symbols in your
own identifiers you don't have to worry about
collisions with automatically generated identifiers. Formally, the
characters allowed at the beginning of and within an identifier are
defined by the methods isJavaIdentifierStart( ) and
isJavaIdentifierPart( ) of the class
java.lang.Character.

The following are examples of legal identifiers:

    i    x1    theCurrentTime    the_current_time
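
The two Character methods named above can be combined into a simple validity check. The sketch below is one possible approach, with a hypothetical class and method name; it tests only the character rules and does not reject reserved words such as while.

```java
class IdentifierCheck {
    // Returns true if every character of s is allowed at its position
    // in a Java identifier. Reserved words are NOT rejected here.
    static boolean hasIdentifierSyntax(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point so that supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasIdentifierSyntax("theCurrentTime"));
        System.out.println(hasIdentifierSyntax("1x")); // begins with a digit
        System.out.println(hasIdentifierSyntax("a-b")); // contains punctuation
    }
}
```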
2.2.6. Literals
Literals
are values that appear directly in Java source code. They include
integer and floating-point numbers, characters within single quotes,
strings of characters within double quotes, and the reserved words
true, false, and
null. For example, the following are all literals:
    1    1.0    '1'    "one"    true    false    null

The syntax for expressing numeric, character, and string literals is
detailed in Section 2.3 later in
this chapter.
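
As a quick illustration, the sketch below places the same seven literals in an array so that the type of each can be inspected. The class and method names are hypothetical, and the example relies on the autoboxing introduced in Java 5.0 to store primitive values in an Object array.

```java
class LiteralDemo {
    // Each element is written as a literal; primitive values are boxed
    // automatically into Integer, Double, Character, and Boolean.
    static Object[] samples() {
        return new Object[] { 1, 1.0, '1', "one", true, false, null };
    }

    public static void main(String[] args) {
        for (Object o : samples()) {
            // null has no class, so report it separately
            System.out.println(o == null ? "null" : o.getClass().getSimpleName());
        }
    }
}
```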
2.2.7. Punctuation
Java also uses
a number of
punctuation characters as
tokens. The Java Language Specification divides these characters
(somewhat arbitrarily) into two categories,
separators and operators. Separators are:

    (    )    {    }    [    ]
    <    >    :    ;
    ,    .    @

Operators are:

    +    -    *    /    %    &    |    ^    <<    >>    >>>
    +=   -=   *=   /=   %=   &=   |=   ^=   <<=   >>=   >>>=
    =    ==   !=   <    <=   >    >=
    !    ~    &&   ||   ++   --   ?    :

We'll see separators throughout the book, and will
cover each operator individually in Section 2.4 later in this
chapter.