Chapter 1. Strings
Contents:
Introduction
Accessing Substrings
Establishing a Default Value
Exchanging Values Without Using Temporary Variables
Converting Between Characters and Values
Using Named Unicode Characters
Processing a String One Character at a Time
Reversing a String by Word or Character
Treating Unicode Combined Characters as Single Characters
Canonicalizing Strings with Unicode Combined Characters
Treating a Unicode String as Octets
Expanding and Compressing Tabs
Expanding Variables in User Input
Controlling Case
Properly Capitalizing a Title or Headline
Interpolating Functions and Expressions Within Strings
Indenting Here Documents
Reformatting Paragraphs
Escaping Characters
Trimming Blanks from the Ends of a String
Parsing Comma-Separated Data
Constant Variables
Soundex Matching
Program: fixstyle
Program: psgrep
He multiplieth words without knowledge.
Job 35:16
1.0. Introduction
Many programming languages force you to work
at an uncomfortably low level. You think in lines, but your language
wants you to deal with pointers. You think in strings, but it wants
you to deal with bytes. Such a language can drive you to distraction.
Don't despair; Perl isn't a low-level language, so lines and strings
are easy to handle.

Perl was designed for easy but powerful text
manipulation. In fact, Perl can manipulate text in so many ways that
they can't all be described in one chapter. Check out other chapters
for recipes on text processing. In particular, see Chapter 6 and
Chapter 8, which discuss interesting techniques not covered here.

Perl's fundamental
unit for working with data is the scalar, that is, single values
stored in single (scalar) variables. Scalar variables hold strings,
numbers, and references. Array and hash variables hold lists or
associations of scalars, respectively. References are used for
referring to values indirectly, not unlike pointers in low-level
languages. Numbers are usually stored in your machine's
double-precision floating-point notation. Strings in Perl may be of
any length, within the limits of your machine's virtual memory, and
can hold any arbitrary data you care to put there, even binary
data containing null bytes.

A string in Perl is not an array of characters, nor of bytes,
for that matter. You cannot use array subscripting on a string to
address one of its characters; use substr for
that. Like all data types in Perl, strings grow on demand. Space is
reclaimed by Perl's garbage collection system when no longer used,
typically when the variables have gone out of scope or when the
expression in which they were used has been evaluated. In other
words, memory management is already taken care of, so you don't have
to worry about it.
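
As a small illustration of the points above, the following sketch uses substr to address characters by position and then lets a string grow on demand. The variable names and sample text are ours, chosen only for the example.

```perl
use strict;
use warnings;

my $string = "Perl Cookbook";

# A string is not an array: $string[0] would refer to an element of
# an unrelated array @string. Use substr to pick out characters.
my $first = substr($string, 0, 1);    # one character at offset 0
my $word  = substr($string, 5, 4);    # four characters at offset 5

print "first character: $first\n";
print "substring: $word\n";

# Strings grow on demand; no allocation step is needed.
$string .= ", Second Edition";
print "length is now ", length($string), "\n";
```

Recipe 1.1 covers substr in more detail.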
A scalar value
is either defined or undefined. If defined, it may hold a string,
number, or reference. The only undefined value is
undef. All other values are defined, even numeric 0
and the empty string. Definedness is not the same as Boolean truth,
though; to check whether a value is defined, use the
defined function. Boolean truth has a specialized
meaning, tested with operators such as &&
and || or in an if or
while block's test condition.
Two defined strings are false: the empty string ("")
and a string of length one containing the digit zero
("0"). All other defined values (e.g.,
"false", 15, and
\$x) are true. You might be surprised to learn
that "0" is false, but this is due to Perl's
on-demand conversion between strings and numbers. The values
0., 0.00, and
0.0000000 are all numbers and are therefore false
when unquoted, since the number zero in any of its guises is always
false. However, those three values ("0.",
"0.00", and "0.0000000") are
true when used as literal quoted strings in your
program code or when they're read from the command line, an
environment variable, or an input file.

This is seldom an issue, since conversion is automatic when the value
is used numerically. If it has never been used numerically, though,
and you just test whether it's true or false, you might get an
unexpected answer, because Boolean tests never force any sort of
conversion. Adding 0 to the variable makes Perl explicitly convert
the string to a number:
print "Gimme a number: ";
chomp($n = <STDIN>);    # user types 0.00000, so $n now holds "0.00000"
print "The value $n is ", $n ? "TRUE" : "FALSE", "\n";
# prints: The value 0.00000 is TRUE
$n += 0;                # numeric use converts the string to the number 0
print "The value $n is now ", $n ? "TRUE" : "FALSE", "\n";
# prints: The value 0 is now FALSE
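
To see the rules for definedness and truth side by side, here is a minimal sketch that runs a handful of values through both tests. The list of sample values and the %result hash are our own, picked to match the cases described above.

```perl
use strict;
use warnings;

# Each entry pairs a label with the value to test.
my @tests = (
    [ 'empty string ""' => ""      ],
    [ 'string "0"'      => "0"     ],
    [ 'string "0.00"'   => "0.00"  ],
    [ 'number 0.00'     => 0.00    ],
    [ 'string "false"'  => "false" ],
    [ 'undef'           => undef   ],
);

my %result;
for my $test (@tests) {
    my ($label, $value) = @$test;
    my $defined = defined($value) ? "defined" : "undefined";
    my $truth   = $value          ? "true"    : "false";   # no conversion forced
    $result{$label} = "$defined, $truth";
    print "$label is $defined and $truth\n";
}
```

Only the quoted "0.00" and "false" come out true; undef is the only entry reported as undefined.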
The undef value behaves like the empty string
("") when used as a string, 0
when used as a number, and the null reference when used as a
reference. But in all three possible cases, it's false. Using an
undefined value where Perl expects a defined value will trigger a
runtime warning message on STDERR if you've
enabled warnings. Merely asking whether something is true or false
demands no particular value, so this is exempt from warnings. Some
operations do not trigger warnings when used on variables holding
undefined values. These include the autoincrement and autodecrement
operators, ++ and --, and the
addition and concatenation assignment operators,
+= and .= ("plus-equals" and
"dot-equals").
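
The exemption can be observed directly. This sketch, under the assumption that warnings are enabled with use warnings, installs a __WARN__ handler to count warnings, applies the exempt operators to undefined variables, and then performs a plain addition for comparison. All variable names are ours.

```perl
use strict;
use warnings;

# Collect warnings in an array instead of sending them to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my ($count, $total, $text);      # all three start out undefined

$count++;                        # exempt: undef acts as 0
$total += 5;                     # exempt: undef acts as 0
$text  .= "abc";                 # exempt: undef acts as ""
my $quiet = scalar @warnings;    # warnings raised so far

my $unset;
my $sum  = $unset + 1;           # not exempt: uninitialized-value warning
my $loud = scalar(@warnings) - $quiet;

print "count=$count total=$total text=$text\n";
print "warnings from exempt operators: $quiet\n";
print "warnings from plain addition: $loud\n";
```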
Specify
strings in your program using single quotes, double quotes, the
quoting operators q// and qq//,
or here documents. No matter which notation you use, string literals
are one of two possible flavors: interpolated or uninterpolated.
Interpolation governs whether variable references and special
sequences are expanded. Most are interpolated by default, such as in
patterns (/regex/) and running commands ($x = `cmd`).

Where special characters are recognized, preceding any special
character with a backslash renders that character mundane; that is,
it becomes a literal. This is often referred to as "escaping" or
"backslash escaping."

Using single quotes is the canonical way to get an uninterpolated
string literal. Three special sequences are still recognized:
' to terminate the string, \'
to represent a single quote, and \\ to represent a
backslash in the string.
$string = '\n'; # two characters, \ and an n
$string = 'Jon \'Maddog\' Orwant'; # literal single quotes
Double quotes interpolate variables (but not function calls—see
Recipe 1.15 to find how to do this) and
expand backslash escapes. These include "\n"
(newline), "\033" (the character with octal value
33), "\cJ" (Ctrl-J), "\x1B"
(the character with hex value 0x1B), and so on. The full list of
these is given in the perlop(1) manpage and the
section on "Specific Characters" in Chapter 5 of
Programming Perl.
$string = "\n"; # a "newline" character
$string = "Jon \"Maddog\" Orwant"; # literal double quotes
If there are no backslash escapes or variables to expand within the
string, it makes no difference which flavor of quotes you use. When
choosing between writing 'this' and writing
"this", some Perl programmers prefer to use double
quotes so that the strings stand out. This also avoids the slight
risk of having single quotes mistaken for backquotes by readers of
your code. It makes no difference to Perl, and it might help readers.
The q// and qq// quoting operators
allow arbitrary delimiters on uninterpolated and interpolated
literals, respectively, corresponding to single- and double-quoted
strings. For an uninterpolated string literal that contains single
quotes, it's easier to use q// than to escape all
single quotes with backslashes:
$string = 'Jon \'Maddog\' Orwant'; # embedded single quotes
$string = q/Jon 'Maddog' Orwant/; # same thing, but more legible
Choose the same character for both delimiters, as we just did with
/, or pair any of the following four sets of
bracketing characters:
$string = q[Jon 'Maddog' Orwant]; # literal single quotes
$string = q{Jon 'Maddog' Orwant}; # literal single quotes
$string = q(Jon 'Maddog' Orwant); # literal single quotes
$string = q<Jon 'Maddog' Orwant>; # literal single quotes
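
The examples above all use q//. For comparison, here is a brief sketch of its interpolating counterpart, qq//, used on a string that contains double quotes. The $name variable is ours, introduced only for illustration.

```perl
use strict;
use warnings;

my $name = "Maddog";

# q() does not interpolate, like single quotes.
my $plain = q(Jon "$name" Orwant);

# qq() interpolates, like double quotes, yet the embedded double
# quotes need no backslashes.
my $filled = qq(Jon "$name" Orwant);

# Bracketing delimiters nest, so balanced pairs inside are fine.
my $nested = q(a (nested) pair);

print "$plain\n";
print "$filled\n";
print "$nested\n";
```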
Here documents are a notation
borrowed from the shell used to quote a large chunk of text. The text
can be interpreted as single-quoted, double-quoted, or even as
commands to be executed, depending on how you quote the terminating
identifier. Uninterpolated here documents do not expand the three
backslash sequences the way single-quoted literals normally do. Here
we double-quote two lines with a here document:
$a = <<"EOF";
This is a multiline here document
terminated by EOF on a line by itself
EOF
Notice there's no semicolon after the terminating
EOF. Here documents are covered in more detail in
Recipe 1.16.
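
To contrast the two most common flavors, this sketch puts the same line of text through a double-quoted and a single-quoted here document. The $user variable and the sample text are ours.

```perl
use strict;
use warnings;

my $user = "Nat";

# Double-quoted terminator: variables and backslash escapes expand.
my $expanded = <<"EOF";
Hello, $user\tand welcome
EOF

# Single-quoted terminator: nothing expands, not even a doubled backslash.
my $verbatim = <<'EOF';
Hello, $user\tand welcome \\
EOF

print $expanded;
print $verbatim;
```

The first string contains the value of $user and a real tab; the second keeps the dollar sign, the \t, and both backslashes exactly as typed.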
1.0.1. The Universal Character Code
As far as the computer is concerned, all
data is just a series of individual numbers, each a string of bits.
Even text strings are just sequences of numeric codes interpreted as
characters by programs like web browsers, mailers, printing programs,
and editors.

Back when memory sizes were far smaller and memory prices far more
dear, programmers would go to great lengths to save memory.
Strategies such as stuffing six characters into one 36-bit word or
jamming three characters into one 16-bit word were common. Even
today, the numeric codes used for characters usually aren't longer
than 7 or 8 bits, which are the lengths you find in ASCII and Latin1,
respectively.

That doesn't leave many bits per character, and thus not many
characters. Consider an image file with 8-bit color. You're limited
to 256 different colors in your palette. Similarly, with characters
stored as individual octets (an octet is an
8-bit byte), a document can usually have no more than 256 different
letters, punctuation marks, and symbols in
it. ASCII, being the
American Standard Code for Information
Interchange, was of limited utility outside the United States, since
it covered only the characters needed for a slightly stripped-down
dialect of American English. Consequently, many countries invented
their own incompatible 8-bit encodings built upon 7-bit ASCII.
Conflicting schemes for assigning numeric codes to characters sprang
up, all reusing the same limited range. That meant the same number
could mean a different character in different systems and that the
same character could have been assigned a different number in
different systems.

Locales were an early attempt to address this and other language- and
country-specific issues, but they didn't work out so well for
character set selection. They're still reasonable for purposes
unrelated to character sets, such as local preferences for monetary
units, date and time formatting, and even collating sequences. But
they are of far less utility for reusing the same 8-bit namespace for
different character sets.

That's because if you wanted to produce a document that used Latin,
Greek, and Cyrillic characters, you were in for big trouble, since
the same numeric code would be a different character under each
system. For example, character number 196 is a Latin capital A with a
diaeresis above it in ISO 8859-1 (Latin1); under ISO 8859-7, that
same numeric code represents a Greek capital delta. So a program
interpreting numeric character codes in the ISO 8859-1 locale would
see one character, but under the ISO 8859-7 locale, it would see
something totally different.

This makes it hard to combine different character sets in the same
document. Even if you did cobble something together, few programs
could work with that document's text. To know what characters you
had, you'd have to know what system they were in, and you couldn't
easily mix systems. If you guessed wrong, you'd get a jumbled mess on
your screen, or worse.
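
The clash over character number 196 can be reproduced with a short sketch. It assumes the standard Encode module (bundled with Perl since v5.8) and its decode function, neither of which this section has introduced; the U+ values it prints are Unicode code points, explained in the next section.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = chr(196);    # a single octet with numeric code 196 (0xC4)

# Interpret the same octet under two different 8-bit encodings.
my $latin = decode("iso-8859-1", $byte);
my $greek = decode("iso-8859-7", $byte);

printf "ISO 8859-1 reads byte 196 as U+%04X\n", ord($latin);
printf "ISO 8859-7 reads byte 196 as U+%04X\n", ord($greek);
```

The same input byte yields two different characters, which is exactly the ambiguity described above.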
1.0.2. Unicode Support in Perl
Enter Unicode.

Unicode attempts to unify all character sets in the entire world,
including many symbols and even fictional character sets. Under
Unicode, different characters have different numeric codes, called
code points.

Mixed-language documents are now easy, whereas before they weren't
even possible. You no longer have just 128 or 256 possible characters
per document. With Unicode you can have tens of thousands (and more)
of different characters all jumbled together in the same document
without confusion.

The problem of mixing, say, an Ä with a Δ
evaporates. The first character, formally named "LATIN CAPITAL LETTER
A WITH DIAERESIS" under Unicode, is assigned the code point U+00C4
(that's the Unicode preferred notation). The second, a "GREEK CAPITAL
LETTER DELTA", is now at code point U+0394. With different characters
always assigned different code points, there's no longer any
conflict.

Perl has supported Unicode since v5.6 or so, but it wasn't until the
v5.8 release that Unicode support was generally considered robust and
usable. This by no coincidence corresponded to the introduction of
I/O layers and their support for encodings into Perl. These are
discussed in more detail in Chapter 8.

All Perl's string functions and operators, including those used for
pattern matching, now operate on characters instead of octets. If you
ask for a string's length, Perl reports how many
characters are in that string, not how many bytes are in it. If you
extract the first three characters of a string using
substr, the result may or may not be three bytes.
You don't know, and you shouldn't care, either. One reason not to
care about the particular underlying bytewise representation is that
if you have to pay attention to it, you're probably looking too
closely. It shouldn't matter, really—but if it does, this might
mean that Perl's implementation still has a few bumps in it. We're
working on that.

Because characters with code points above 255 are supported, the
chr function is no longer restricted to arguments
under 256, nor is ord restricted to returning an
integer smaller than that. Ask for chr(0x394), for
example, and you'll get a Greek capital delta: Δ.
$char = chr(0x394);
$code = ord($char);
printf "char %s is code %d, %#04x\n", $char, $code, $code;
char Δ is code 916, 0x394
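
A quick way to see the difference between characters and bytes for this Δ is to compare the length of the string with the length of an explicitly encoded copy. This is only a sketch: it assumes the standard Encode module and picks UTF-8 as the target encoding, and the variable names are ours.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(0x394);                 # GREEK CAPITAL LETTER DELTA

my $chars  = length($char);            # counts characters
my $octets = encode("UTF-8", $char);   # an explicit byte encoding
my $bytes  = length($octets);          # counts the encoded bytes

print "characters: $chars\n";
print "bytes when encoded as UTF-8: $bytes\n";
```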
If you test the length of that string, it will say 1, because it's
just one character. Notice how we said character; we didn't say
anything about its length in bytes. Certainly the internal
representation requires more than just 8 bits for a numeric code that
big. But you the programmer are dealing with characters as
abstractions, not as physical octets. Low-level details like that are
best left up to Perl.

You shouldn't think of characters and bytes as the same. Programmers
who interchange bytes and characters are guilty of the same class of
sin as C programmers who blithely interchange integers and pointers.
Even though the underlying representations may happen to coincide on
some platforms, this is just a coincidence, and conflating abstract
interfaces with physical implementations will always come back to
haunt you, eventually.

You have several ways to put Unicode characters into Perl literals.
If you're lucky enough to have a text editor that lets you enter
Unicode directly into your Perl program, you can inform Perl you've
done this via the use utf8 pragma. Another way is
to use \x escapes in Perl interpolated strings to
indicate a character by its code point in hex, as in
\xC4. Characters with code points above 0xFF
require more than two hex digits, so these must be enclosed in
braces.
print "\xC4 and \x{0394} look different\n";
Ä and Δ look different
Recipe 1.5 describes how to use
charnames to put
\N{NAME} escapes in string
literals, such as \N{GREEK CAPITAL LETTER DELTA},
\N{greek:Delta}, or even just
\N{Delta} to indicate a Δ character.

That's enough to get started using Unicode in Perl alone, but getting
Perl to interact properly with other programs requires a bit more.

Using the old single-byte encodings like ASCII or ISO
8859-n, when you wrote out a character whose
numeric code was NN, a single byte with numeric
code NN would appear. What actually appeared
depended on which fonts were available, your current locale setting,
and quite a few other factors. But under Unicode, this exact
duplication of logical character numbers (code points) into physical
bytes emitted no longer applies. Instead, they must be encoded in any
of several available output formats.

Internally, Perl uses
a format called UTF-8, but many other encoding formats for Unicode
exist, and Perl can work with those, too. The use
encoding pragma tells Perl in which encoding your script
itself has been written, or which encoding the standard filehandles
should use. The use open pragma can set encoding
defaults for all handles. Special arguments to
open or to binmode specify the
encoding format for that particular handle. The -C command-line flag is a shortcut to set the
encoding on all (or just standard) handles, plus the program
arguments themselves. The environment variables
PERLIO, PERL_ENCODING, and
PERL_UNICODE all give Perl various sorts of hints
related to these matters.
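
As a sketch of the per-handle approach, the following writes one character through an :encoding(UTF-8) layer given to open, reads it back through the same layer, and sets a layer on STDOUT with binmode. The temporary file comes from the standard File::Temp module; the file, the variable names, and the choice of UTF-8 are our assumptions for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write one character, U+0394, through a UTF-8 encoding layer.
open(my $out, ">:encoding(UTF-8)", $path) or die "can't write $path: $!";
print $out "\x{394}";
close $out or die "can't close $path: $!";

my $size = -s $path;       # size on disk, in bytes

# Read it back through the same layer to get characters again.
open(my $in, "<:encoding(UTF-8)", $path) or die "can't read $path: $!";
my $text = <$in>;
close $in;

# Set the layer on an already open handle before printing to it.
binmode(STDOUT, ":encoding(UTF-8)");
printf "%d byte(s) on disk, %d character(s) in memory: %s\n",
    $size, length($text), $text;
```

The file holds two bytes, yet the program only ever sees one character. The layers do the translation in both directions.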