Introduction
Credit: Fred L. Drake, Jr., PythonLabs

Text-processing applications form a substantial part of the
application space for any scripting language, if only because
everyone can agree that text processing is useful. Everyone has bits
of text that need to be reformatted or transformed in various ways.
The catch, of course, is that every application is just a little bit
different from every other application, so it can be difficult to
find just the right reusable code to work with different file
formats, no matter how similar they are.
What Is Text?
Sounds like an easy question, doesn't it? After all, we
know it when we see it, don't we? Text is a sequence
of characters, and it is distinguished from binary data by that very
fact. Binary data, after all, is a sequence of bytes.

Unfortunately, all data enters our
applications as a sequence of bytes. There's no
library function we can call that will tell us whether a particular
sequence of bytes represents text, although we can create some useful
heuristics that tell us whether data can safely (not necessarily
correctly) be handled as text. Recipe 1.11
shows just such a heuristic.

Python strings are immutable sequences of bytes or characters. Most
of the ways we create and process strings treat them as sequences of
characters, but many are just as applicable to sequences of bytes.
Unicode strings are immutable sequences of Unicode characters:
transformations of Unicode strings into and from plain strings use
codecs (coder-decoders), objects that embody
knowledge about the many standard ways in which sequences of
characters can be represented by sequences of bytes (also known as
encodings and character
sets). Note that Unicode strings do
not serve double duty as sequences of bytes.
Recipe 1.20,
Recipe 1.21, and
Recipe 1.22 illustrate the fundamentals
of Unicode in Python.

Okay, let's assume that our application knows from
the context that it's looking at text.
That's usually the best approach because
that's where external input comes into play.
We're looking at a file either because it has a
well-known name and defined format (common in the
"Unix" world) or because it has a
well-known filename extension that indicates the format of the
contents (common on Windows). But now we have a problem: we had to
use the word format to make the previous
paragraph meaningful. Wasn't text supposed to be
simple?

Let's face it: there's no such
thing as "pure" text, and if there
were, we probably wouldn't care about it (with the
possible exception of applications in the field of computational
linguistics, where pure text may indeed sometimes be studied for its
own sake). What we want to deal with in our applications is
information contained in text. The text we care about may contain
configuration data, commands to control or define processes,
documents for human consumption, or even tabular data. Text that
contains configuration data or a series of commands usually can be
expected to conform to a fairly strict syntax that can be checked
before relying on the information in the text. Informing the user of
an error in the input text is typically sufficient to deal with
things that aren't what we were expecting.

Documents intended for humans tend to be simple, but they vary widely
in detail. Since they are usually written in a natural language,
their syntax and grammar can be difficult to check, at best.
Different texts may use different character sets or encodings, and it
can be difficult or even impossible to tell which character set or
encoding was used to create a text if that information is not
available in addition to the text itself. It is, however, necessary
to support proper representation of natural-language documents.
Natural-language text has structure as well, but the structures are
often less explicit in the text and require at least some
understanding of the language in which the text was written.
Characters make up words, which make up sentences, which make up
paragraphs, and still larger structures may be present as well.
Paragraphs alone can be particularly difficult to locate unless you
know what typographical conventions were used for a document: is each
line a paragraph, or can multiple lines make up a paragraph? If the
latter, how do we tell which lines are grouped together to make a
paragraph? Paragraphs may be separated by blank lines, indentation,
or some other special mark. See Recipe 19.10
for an example of reading a
text file as a sequence of paragraphs separated by blank lines.

Tabular data has many issues that are similar to the problems
associated with natural-language text, but it adds a second dimension
to the input format: the text is no longer linear; it is no
longer a sequence of characters, but rather a matrix of characters
from which individual blocks of text must be identified and
organized.
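The two ideas running through this section, bytes that become characters only once an encoding is chosen and tabular text that has to be reorganized into rows and cells, can be sketched together. This is only an illustration, written for Python 3 (where bytes and str are distinct types); the sample data, the UTF-8 and tab-delimiter defaults, and the name parse_table are assumptions made for the example, not something the text prescribes.

```python
import csv
import io

def parse_table(raw_bytes, encoding="utf-8", delimiter="\t"):
    """Decode raw bytes with the given codec, then split the text into rows of cells."""
    text = raw_bytes.decode(encoding)            # bytes -> characters
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if row]        # skip completely empty lines

# Hypothetical input: two tab-separated columns, encoded as UTF-8.
raw = "name\tcity\nZo\u00eb\tOslo\n".encode("utf-8")
table = parse_table(raw)
print(table[1][0])   # the cell in row 1, column 0
```

Decoding with the wrong codec would either raise UnicodeDecodeError or silently produce the wrong characters, which is why the encoding has to be known from context rather than guessed from the bytes.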
Basic Textual Operations
As with any other data format, we
need to do different things with text at different times. However,
there are still three basic operations:
- Parsing the data into a structure internal to our application
- Transforming the input into something similar in some way, but with
  changes of some kind
- Generating completely new data
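A minimal sketch of the three operations, applied to a made-up "key = value" format. The function names parse, transform, and generate and the sample text are hypothetical, chosen only to mirror the list above.

```python
def parse(text):
    """Parse 'key = value' lines into a dict, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def transform(text):
    """Return similar text with every key uppercased; other lines pass through."""
    out = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and not line.lstrip().startswith("#"):
            line = key.upper() + sep + value
        out.append(line)
    return "\n".join(out)

def generate(settings):
    """Generate brand-new text from an in-memory structure."""
    return "\n".join("%s = %s" % (k, settings[k]) for k in sorted(settings))

sample = "# demo\nhost = example.org\nport = 8080"
parsed = parse(sample)
```

Parsing yields a structure, transforming yields text that resembles its input, and generating starts from data with no input text at all.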
Parsing can be performed in a variety of ways, and many formats can be
suitably handled by ad hoc parsers that deal effectively with a very
constrained format. Examples of this approach include parsers for RFC
2822-style email headers (see the rfc822 module in
Python's standard library) and the configuration
files handled by the ConfigParser module. The
netrc module offers another example of a parser
for an application-specific file format, this one based on the
shlex module. shlex offers a
fairly typical tokenizer for basic languages, useful in creating
readable configuration files or allowing users to enter commands to
an interactive prompt. These sorts of ad hoc parsers are abundant in
Python's standard library, and recipes using them
can be found in Chapter 2 and Chapter 13. More formal parsing tools are also
available for Python; they depend on larger add-on packages and are
surveyed in the introduction to Chapter 16.

Transforming text from one format to another is more interesting when
viewed as text processing, which is what we usually think of first
when we talk about text. In this chapter, we'll take
a look at some ways to approach transformations that can be applied
for different purposes. Sometimes we'll work with
text stored in external files, and other times we'll
simply work with it as strings in memory.

The generation of textual data from application-specific data
structures is most easily performed using Python's
print statement or the write
method of a file or file-like object. This is often done using a
method of the application object or a function, which takes the
output file as a parameter. The function can then use statements such
as these:

print >>thefile, sometext
thefile.write(sometext)

which generate output to the appropriate file. However, this
isn't generally thought of as text processing, as
here there is no input text to be processed. Examples of using both
print and write can of course
be found throughout this book.
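As a rough sketch of this pattern, the hypothetical function write_report below takes the output file as a parameter and uses both print and write. It is written for Python 3, where print is a function and file=thefile plays the role of the >>thefile form shown above; io.StringIO stands in for a real file.

```python
import io
import sys

def write_report(records, thefile=sys.stdout):
    """Write one 'name: value' line per record to any object with a write method."""
    for name, value in records:
        # Python 3 spelling of the Python 2 statement `print >>thefile, ...`
        print("%s: %s" % (name, value), file=thefile)
    thefile.write("%d records\n" % len(records))

buffer = io.StringIO()    # in-memory file-like object standing in for a real file
write_report([("apples", 3), ("pears", 5)], buffer)
output = buffer.getvalue()
```

Because the function only relies on the write method, the same code can target a disk file, standard output, or an in-memory buffer.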
Sources of Text
Working with text stored as a string in
memory can be easy when the text is not too large. Operations that
search the text can operate over multiple lines very easily and
quickly, and there's no need to worry about
searching for something that might cross a buffer boundary. Being
able to keep the text in memory as a simple string makes it very easy
to take advantage of the built-in string operations available as
methods of the string object.
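For instance, a phrase that straddles a line break is easy to find when the whole text is one string. The sample text and variable names below are made up for the illustration.

```python
text = "The quick brown\nfox jumps over\nthe lazy dog.\n"

# A plain substring search treats the newline as just another character.
pos = text.find("brown\nfox")

# Collapsing every run of whitespace to one space lets a search ignore
# where the line breaks happen to fall.
normalized = " ".join(text.split())
found = "over the lazy" in normalized

# Counting newlines before a position gives a 1-based line number.
line_number = text.count("\n", 0, pos) + 1
```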
File-based transformations deserve special treatment, because there can be substantial overhead
related to I/O performance and the amount of data that must actually
be stored in memory. When working with data stored on disk, we often
want to avoid loading entire files into memory, due to the size of
the data: loading an 80 MB file into memory should not be done too
casually! When our application needs only part of the data at a time,
working on smaller segments of the data can yield substantial
performance improvements, simply because we've
allowed enough space for our program to run. If we are careful about
buffer management, we can still maintain the performance advantage of
using a small number of relatively large disk read and write
operations by working on large chunks of data at a time. File-related
recipes are found in Chapter 12.

Another interesting source for textual
data comes to light when we consider the network. Text is often
retrieved from the network using a socket. While we can always view a
socket as a file (using the makefile method of the
socket object), the data that is retrieved over a socket may come in
chunks, or we may have to wait for more data to arrive. The textual
data may not consist of all data until the end of the data stream, so
a file object created with makefile may not be
entirely appropriate to pass to text-processing code. When working
with text from a network connection, we often need to read the data
from the connection before passing it along for further processing.
If the data is large, it can be handled by saving it to a file as it
arrives and then using that file when performing text-processing
operations. More elaborate solutions can be built when the text
processing needs to be started before all the data is available.
Examples of parsers that are useful in such situations may be found
in the htmllib and HTMLParser
modules in the standard
library.
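One way such incremental processing might look is sketched below. It assumes Python 3, where the HTMLParser class lives in the html.parser module; the LinkCollector class, the parse_stream function, and the chunk size are hypothetical, and an io.StringIO object stands in for text arriving from a socket.

```python
import io
from html.parser import HTMLParser   # Python 3 location of the HTMLParser class

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags as data is fed in, chunk by chunk."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def parse_stream(stream, chunk_size=16):
    """Read fixed-size chunks from a file-like object and feed them to the parser."""
    parser = LinkCollector()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)    # the parser buffers any incomplete tag itself
    parser.close()
    return parser.links

# io.StringIO stands in for text arriving over a network connection.
page = io.StringIO('<p>See <a href="http://example.org/a">A</a> and '
                   '<a href="http://example.org/b">B</a>.</p>')
links = parse_stream(page)
```

The calling code never needs the whole document in memory, and a tag split across two chunks is handled by the parser rather than by the caller.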
String Basics
The main tool Python gives us to process text is strings, immutable
sequences of characters. There are actually two kinds of strings:
plain strings, which contain 8-bit (ASCII) characters; and Unicode
strings, which contain Unicode characters. We won't
deal much with Unicode strings here: their functionality is similar
to that of plain strings, except each character takes up 2 (or 4)
bytes, so that the number of different characters is in the tens of
thousands (or even billions), as opposed to the 256 different
characters that make up plain strings. Unicode strings are important
if you must deal with text in many different alphabets, particularly
Asian ideographs. Plain strings are sufficient to deal with English
or any of a limited set of non-Asian languages. For example, all
western European alphabets can be encoded in plain strings, typically
using the international standard encoding known as ISO-8859-1 (or
ISO-8859-15, if you need the Euro currency symbol as
well).

In Python, you express a literal string (curiously more often known
as a string literal) as:

'this is a literal string'
"this is another string"

String values can be enclosed in either single or double quotes. The
two different kinds of quotes work the same way, but having both
allows you to include one kind of quotes inside of a string specified
with the other kind of quotes, without needing to escape them with
the backslash character:

'isn\'t that grand'
"isn't that grand"

To have a string literal span multiple lines, you can use a backslash
as the last character on the line, which indicates that the next line
is a continuation:

big = "This is a long string\
that spans two lines."

You must embed newlines in the string if you want the string to
output on two lines:

big = "This is a long string\nthat prints on two lines."

Another approach is to enclose the string in a pair of matching
triple quotes (either single or double):

bigger = """
This is an even
bigger string that
spans three lines.
"""

Using triple quotes, you don't need to use the
continuation character, and line breaks in the string literal are
preserved as newline characters in the resulting Python string
object. You can also make a string literal a
"raw" string by preceding it with an r or R:

big = r"This is a long string\
with a backslash and a newline in it"

With a raw string, backslash escape sequences are left alone, rather
than being interpreted. Finally, you can precede a string literal
with a u or U to make it a
Unicode string:

hello = u'Hello\u0020World'

Strings are immutable, which means that no matter what operation you
do on a string, you will always produce a new string object, rather
than mutating the existing string. A string is a sequence of
characters, which means that you can access a single character by
indexing:

mystr = "my string"
mystr[0] # 'm'
mystr[-2] # 'n'

You can also access a portion of the string with a slice:

mystr[1:4] # 'y s'
mystr[3:] # 'string'
mystr[-3:] # 'ing'

Slices can be extended, that is, include a third parameter that is
known as the stride or step of the slice:

mystr[:3:-1] # 'gnirt'
mystr[1::2] # 'ysrn'

You can loop on a string's characters:

for c in mystr:

This binds c to each of the characters in
mystr in turn. You can form another sequence:

list(mystr) # returns ['m','y',' ','s','t','r','i','n','g']

You can concatenate strings by addition:

mystr+'oid' # 'my stringoid'

You can also repeat strings by multiplication:

'xo'*3 # 'xoxoxo'

In general, you can do anything to a string that you can do to any
other sequence, as long as it doesn't require
changing the sequence, since strings are immutable.
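A brief sketch of that point, using only built-in operations; the variable names are illustrative. Generic sequence operations work on a string unchanged, while item assignment raises TypeError.

```python
mystr = "my string"

length = len(mystr)                        # number of characters
has_ring = "ring" in mystr                 # membership test
largest = max(mystr)                       # highest character by code point
letters = sorted(mystr.replace(" ", ""))   # any sequence can be sorted into a list

# Item assignment would change the sequence in place, so strings refuse it.
try:
    mystr[0] = "M"
except TypeError:
    mutated = False
else:
    mutated = True

# The usual idiom is to build a new string from pieces of the old one.
capitalized = "M" + mystr[1:]
```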
String objects have many useful methods.
For example, you can test a string's contents with
s.isdigit(), which returns True if s is not empty
and all of the characters in s are digits
(otherwise, it returns False). You can produce a
new modified string with a method call such as s.upper(),
which returns a new string that is like
s, but with every letter changed into its
uppercase equivalent. You can search for a string inside another with
haystack.count('needle'), which returns the number
of times the substring 'needle' appears in the
string haystack. When you have a large
string that spans multiple lines, you can split it into a list of
single-line strings with splitlines:

list_of_lines = one_large_string.splitlines()

You can produce the single large string again with
join:

one_large_string = '\n'.join(list_of_lines)

The recipes in this chapter show off many methods of the string
object. You can find complete documentation in
Python's Library Reference
and Python in a Nutshell.

Strings in Python can also be
manipulated with regular expressions, via the re
module. Regular expressions are a powerful (but complicated) set of
tools that you may already be familiar with from another language
(such as Perl), or from the use of tools such as the
vi editor and text-mode commands such as
grep. You'll find a number of
uses of regular expressions in recipes in the second half of this
chapter. For complete documentation, see the Library
Reference and Python in a
Nutshell. J.E.F. Friedl, Mastering Regular
Expressions (O'Reilly) is also
recommended if you need to master this
subject; Python's regular expressions are
basically the same as Perl's, which Friedl covers
thoroughly.

Python's standard module
string offers much of the same functionality that
is available from string methods, packaged up as functions instead of
methods. The string module also offers a few
additional functions, such as the useful
string.maketrans function that is demonstrated in
a few recipes in this chapter; several helpful string constants
(string.digits, for example, is
'0123456789') and, in Python 2.4, the new class
Template, for simple yet flexible formatting of
strings with embedded variables, which, as you'll see,
features in one of this chapter's recipes. The
string-formatting operator, %, provides a handy
way to put strings together and to obtain precisely formatted strings
from such objects as floating-point numbers. Again,
you'll find recipes in this chapter that show how to
use % for your purposes. Python also has lots of
standard and extension modules that perform special processing on
strings of many kinds. This chapter doesn't cover
such specialized resources, but Chapter 12 is,
for example, entirely devoted to the important specialized subject of
processing XML.
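To give a flavor of the string module facilities and the % operator mentioned above, here is a small sketch. It assumes Python 3, where the translation-table helper is spelled str.maketrans rather than string.maketrans; the sample values are made up.

```python
import string

# string.digits is one of the module's constants.
only_digits = "".join(c for c in "tel. 555-0199" if c in string.digits)

# Python 3 spells the translation-table helper str.maketrans; in the Python 2
# releases this chapter describes, it is string.maketrans instead.
table = str.maketrans("abc", "xyz")
translated = "aabbcc".translate(table)

# The % operator gives precise control over floating-point formatting.
formatted = "%-6s|%8.3f" % ("pi", 3.14159265)

# string.Template substitutes $-prefixed names with supplied values.
greeting = string.Template("Hello, $name!").substitute(name="World")
```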