Chapter 8. File Contents
Contents:
Introduction
Reading Lines with Continuation Characters
Counting Lines (or Paragraphs or Records) in a File
Processing Every Word in a File
Reading a File Backward by Line or Paragraph
Trailing a Growing File
Picking a Random Line from a File
Randomizing All Lines
Reading a Particular Line in a File
Processing Variable-Length Text Fields
Removing the Last Line of a File
Processing Binary Files
Using Random-Access I/O
Updating a Random-Access File
Reading a String from a Binary File
Reading Fixed-Length Records
Reading Configuration Files
Testing a File for Trustworthiness
Treating a File as an Array
Setting the Default I/O Layers
Reading or Writing Unicode from a Filehandle
Converting Microsoft Text Files into Unicode
Comparing the Contents of Two Files
Pretending a String Is a File
Program: tailwtmp
Program: tctee
Program: laston
Program: Flat File Indexes
The most brilliant decision in all of Unix was the choice of a single
character for the newline sequence.
Mike O'Dell, only half jokingly
8.0. Introduction
Before the Unix Revolution, every kind
of data source and destination was inherently different. Getting two
programs merely to understand each other required heavy wizardry and
the occasional sacrifice of a virgin stack of punch cards to an
itinerant mainframe repairman. This computational Tower of Babel made
programmers dream of quitting the field to take up a less painful
hobby, like autoflagellation.
These days, such cruel and unusual programming is largely behind us.
Modern operating systems work hard to provide the illusion that I/O
devices, network connections, process control information, other
programs, the system console, and even users' terminals are all
abstract streams of bytes called files. This
lets you easily write programs that don't care where their input came
from or where their output goes.
Because programs read and write streams of simple text, every program
can communicate with every other program. It is difficult to
overstate the power and elegance of this approach. No longer
dependent upon troglodyte gnomes with secret tomes of JCL (or COM)
incantations, users can now create custom tools from smaller ones by
using simple command-line I/O redirection, pipelines, and backticks.
8.0.1. Basic Operations
Treating files as unstructured byte streams necessarily governs what
you can do with them. You can read and write sequential, fixed-size
blocks of data at any location in the file, increasing its size if
you write past the current end. Perl uses an I/O library that
emulates C's stdio(3) to implement reading and
writing of variable-length records like lines, paragraphs, and words.
What can't you do to an unstructured file? Because you can't insert
or delete bytes anywhere but at end-of-file, you can't easily change
the length of, insert, or delete records. An exception is the last
record, which you can delete by truncating the file to the end of the
previous record. For other modifications, you need to use a temporary
file or work with a copy of the file in memory. If you need to do
this a lot, a database system may be a better solution than a raw
file (see Chapter 14). Standard with Perl as of
v5.8 is the Tie::File module, which offers an array interface to
files of records. We use it in Recipe 8.4.
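The temporary-file approach mentioned above can be sketched as follows. This is a minimal illustration, not a complete recipe: the helper name `insert_line_after` is hypothetical, the temporary file is created in the same directory so that `rename` stays on one filesystem, and the new file gets File::Temp's default permissions rather than those of the original.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: insert $text as a new line after line number $n
# of the file at $path by copying everything to a temporary file.
sub insert_line_after {
    my ($path, $n, $text) = @_;
    open(my $in, "<", $path) or die "Couldn't open $path: $!\n";
    my ($out, $tmp) = tempfile(DIR => dirname($path));
    while (my $line = <$in>) {
        print $out $line;
        print $out "$text\n" if $. == $n;   # $. is the current line number of $in
    }
    close($in);
    close($out) or die "Couldn't close $tmp: $!\n";
    rename($tmp, $path) or die "Couldn't rename $tmp to $path: $!\n";
}
```

Calling `insert_line_after("notes.txt", 2, "a new third line")` would rewrite the (assumed) file notes.txt with one extra line. Every byte of the file is copied, which is why frequent edits of this kind are better served by a database.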
The most common files are text files,
and the most common operations on text files are reading and writing
lines. Use the line-input operator, <FH> (or
the internal function implementing it, readline),
to read lines, and use print to write them. These
functions can also read or write any record that has a specific
record separator. Lines are simply variable-length records that end
in "\n".
The <FH> operator returns
undef on error or when end of the file is reached,
so use it in loops like this:
while (defined ($line = <DATAFILE>)) {
    chomp $line;
    $size = length($line);
    print "$size\n";             # output size of line
}
Because this operation is extremely common in Perl programs that
process lines of text, and that's an awful lot to type, Perl
conveniently provides some shorter aliases for it. If all shortcuts
are taken, this notation might be too abstract for the uninitiated to
guess what it's really doing. But it's an idiom you'll see thousands
of times in Perl, so you'll soon get used to it. Here are
increasingly shortened forms, where the first line is the completely
spelled-out version:
while (defined ($line = <DATAFILE>)) { ... }
while ($line = <DATAFILE>) { ... }
while (<DATAFILE>) { ... }
In the second line, the explicit defined test
needed for detecting end-of-file is omitted. To make everyone's life
easier, you're safe to skip that defined test,
because when the Perl compiler detects this situation, it helpfully
puts one there for you to guarantee your program's correctness in odd
cases. This implicit addition of a defined occurs
on all while tests that do nothing but assign to
one scalar variable the result of calling
readline, readdir, or
readlink. As <FH> is just
shorthand for readline(FH), it also counts.
We're not quite done shortening up yet. As the third line shows, you
can also omit the variable assignment completely, leaving just the
line input operator in the while test. When you do
that here in a while test, it doesn't simply
discard the line it just read as it would anywhere else. Instead, it
reads lines into the special global variable $_.
Because so many other operations in Perl also default to
$_, this is more useful than it might initially
appear.
while (<DATAFILE>) {
    chomp;
    print length(), "\n";        # output size of line
}
In scalar context, <FH> reads just the next
line, but in list context, it reads all remaining lines:
@lines = <DATAFILE>;
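The difference between the two contexts can be seen in a small sketch. The sample data and the in-memory filehandle (opened on a reference to a string) are assumptions made so the example runs without an external file.

```perl
use strict;
use warnings;

# Assumed sample data, read through an in-memory filehandle.
my $data = "alpha\nbeta\ngamma\n";
open(my $fh, "<", \$data) or die "Couldn't open in-memory file: $!\n";

my $first = <$fh>;      # scalar context: just the next line
my @rest  = <$fh>;      # list context: all remaining lines
close($fh);

chomp($first, @rest);   # chomp accepts a list of variables
print "first line: $first\n";
print "remaining lines: @rest\n";
```

Reading in list context holds the entire remainder of the file in memory at once, so the `while` loop form is usually the better choice for large files.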
Each time <FH> reads a record from a
filehandle, it increments the special variable $.
(the "current input record number"). This variable is reset only when
close is called explicitly, which means that it's
not reset when you reopen an already opened filehandle. Another
special variable is $/, the input record
separator. It is set to "\n" by default. You can
set it to any string you like; for instance, "\0"
to read null-terminated records. Read entire paragraphs by setting
$/ to the empty string, "".
This is almost like setting $/ to
"\n\n", in that empty lines function as record
separators. However, "" treats two or more
consecutive empty lines as a single record separator, whereas
"\n\n" returns empty records when more than two
consecutive empty lines are read. Undefine $/ to
read the rest of the file as one scalar:
undef $/;
$whole_file = <FILE>; # "slurp" mode
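To make the difference between paragraph mode and a literal "\n\n" separator concrete, here is a small sketch. The helper name `records` and the sample text are assumptions; `local` is used so that the change to `$/` is undone when the subroutine returns.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from $text using separator $sep.
sub records {
    my ($text, $sep) = @_;
    open(my $fh, "<", \$text) or die "Couldn't open in-memory file: $!\n";
    local $/ = $sep;        # restored automatically on return
    my @recs = <$fh>;
    chomp(@recs);           # chomp removes whatever $/ currently specifies
    close($fh);
    return @recs;
}

# Two lines, then three empty lines, then one more line.
my $text = "a\nb\n\n\n\nc\n";

my @para   = records($text, "");       # paragraph mode
my @double = records($text, "\n\n");   # literal two-newline separator

printf "paragraph mode: %d records\n", scalar @para;
printf "two-newline separator: %d records\n", scalar @double;
```

With this input, paragraph mode sees the run of empty lines as one separator, while the "\n\n" setting produces an additional empty record in the middle.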
The -0
option to Perl lets you set $/ from the command
line:
% perl -040 -e '$word = <>; print "First word is $word\n";'
The digits after -0 are the octal
value of the single character to which $/ is to be
set. If you specify an illegal value (e.g., with -0777), Perl will set
$/ to undef. If you specify -00, Perl will set $/ to
"". The limit of a single octal value means you
can't set $/ to a multibyte string; for instance,
"%%\n" to read fortune files.
Instead, you must use a BEGIN block:
% perl -ne 'BEGIN { $/="%%\n" } chomp; print if /Unix/i' fortune.dat
Use print to write a line or any other data. The
print function writes its arguments one after
another and doesn't automatically add a line or record terminator by
default.
print HANDLE "One", "two", "three"; # "Onetwothree"
print "Baa baa black sheep.\n"; # Sent to default output handle
There is no comma between the filehandle and the data to print. If
you put a comma in there, Perl gives the error message
"No comma allowed after filehandle". The default output handle is
STDOUT. Change it with the
select function. (See the Introduction to Chapter 7.)
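As a brief sketch of how `select` changes the default output handle, the following captures output in a string through an in-memory filehandle. The variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $log = "";
open(my $logfh, ">", \$log) or die "Couldn't open in-memory file: $!\n";

my $old = select($logfh);    # $logfh becomes the default output handle
print "to the log\n";        # no handle given, so this goes to $logfh
select($old);                # restore the previous default (normally STDOUT)

# Explicit handle: note that no comma follows it.
print {$logfh} "One", "two", "three", "\n";
close($logfh);

print "captured: $log";
```

Saving the return value of `select` and restoring it afterward keeps the change from leaking into unrelated code that prints without naming a handle.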
8.0.2. Newlines
All systems use the virtual "\n" to represent a
line terminator, called a newline. There is no
such thing as a newline character; it is a platform-independent way
of saying "whatever your string library uses to represent a line
terminator." On Unix, VMS, and Windows, this line terminator in
strings is "\cJ" (the Ctrl-J character). Versions
of the old Macintosh operating system before Mac OS X used
"\cM". As a Unix variant, Mac OS X uses
"\cJ".
Operating systems also vary in how they store newlines in files. Unix
also uses "\cJ" for this. On Windows, though,
lines in a text file end in "\cM\cJ". If your I/O
library knows you are reading or writing a text file, it will
automatically translate between the string line terminator and the
file line terminator. So on Windows, you could read four bytes
("Hi\cM\cJ") from disk and end up with three in
memory ("Hi\cJ" where "\cJ" is
the physical representation of the newline character). This is never
a problem on Unix, as no translation needs to happen between the
disk's newline ("\cJ") and the string's newline
("\cJ").
Terminals, of course, are a different kettle of fish. Except when
you're in raw mode (as in system("stty raw")), the
Enter key generates a "\cM" (carriage return)
character. This is then translated by the terminal driver into a
"\n" for your program. When you print a line to a
terminal, the terminal driver notices the "\n"
newline character (whatever it might be on your platform) and turns
it into the "\cM\cJ" (carriage return, line feed)
sequence that moves the cursor to the start of the line and down one
line.
Even network protocols have their own expectations. Most protocols
prefer to receive and send "\cM\cJ" as the line
terminator, but many servers also accept merely a
"\cJ". This varies between protocols and servers,
so check the documentation closely!
The important notion here is that if the I/O library thinks you are
working with a text file, it may be translating sequences of bytes
for you. This is a problem in two situations: when your file is not
text (e.g., you're reading a JPEG file) and when your file is text
but not in a byte-oriented ASCII-like encoding (e.g., UTF-8 or any of
the other encodings the world uses to represent their characters). As
if this weren't bad enough, some systems (again, MS-DOS is an
example) use a particular byte sequence in a text file to indicate
end-of-file. An I/O library that knows about text files on such a
platform will indicate EOF when that byte sequence is read.
Recipe 8.11 shows how to disable any
translation that your I/O library might be doing.
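When input may arrive with either line terminator (for instance, from a network peer that sends "\cM\cJ" or only "\cJ"), one defensive approach is to strip both forms explicitly instead of relying on chomp. The helper name `strip_eol` below is hypothetical, and this is only a sketch of the idea.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line terminator,
# whether it is a bare "\cJ" or the "\cM\cJ" pair.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\cM?\cJ\z//;
    return $line;
}

for my $raw ("Hi\cM\cJ", "Hi\cJ", "Hi") {
    my $clean = strip_eol($raw);
    printf "%d bytes in, %d bytes out\n", length($raw), length($clean);
}
```

Spelling the terminators as "\cM" and "\cJ" avoids depending on what "\n" means on the platform running the program.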
8.0.3. I/O Layers
With v5.8, Perl I/O operations are no
longer simply wrappers on top of stdio. Perl now has a flexible
system (I/O layers) that transparently filters multiple encodings of
external data. In Chapter 7 we met the
:unix layer, which implements unbuffered I/O.
There are also layers for using your platform's stdio
(:stdio) and Perl's portable stdio implementation
(:perlio), both of which buffer input and output.
In this chapter, these implementation layers don't interest us as
much as the encoding layers built on top of them.
The
:crlf layer converts a carriage return and line
feed (CRLF, "\cM\cJ") to "\n"
when reading from a file, and converts "\n" to
CRLF when writing. The opposite of :crlf is
:raw, which makes it safe to read or write binary
data from the filehandle. You can specify that a filehandle contains
UTF-8 data with :utf8, or specify an encoding with
:encoding(...). You can even write your own filter
in Perl that processes data being read before your program gets it,
or processes data being written before it is sent to the
device.
It's worth
emphasizing: to disable :crlf, specify the
:raw layer. The :bytes layer is
sometimes misunderstood to be the opposite of
:crlf, but they do completely different things.
The former refers to the UTF-8ness of strings, and the latter to the
behind-the-scenes conversion of carriage returns and line feeds.
You may specify I/O layers when you open the file:
open($fh, "<:raw:utf8", $filename); # read UTF-8 from the file
open($fh, "<:encoding(shiftjis)", $filename); # shiftjis japanese encoding
open(FH, "+<:crlf", $filename); # convert between CRLF and \n
Or you may use
binmode to change the layers of an existing
handle:
binmode($fh, ":raw:utf8");
binmode($fh, ":raw:encoding(shiftjis)");
binmode(FH, ":raw:crlf");
Because binmode pushes onto the stack of I/O
layers, and the facility for removing layers is still evolving, you
should always specify a complete set of layers by making the first
layer be :raw as follows:
binmode(HANDLE, ":raw"); # binary-safe
binmode(HANDLE); # same as :raw
binmode(HANDLE, ":raw :utf8"); # read/write UTF-8
binmode(HANDLE, ":raw :encoding(shiftjis)"); # read/write shiftjis
Recipe 8.18, Recipe 8.19,
and Recipe 8.20 show how to manipulate I/O layers.
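The effect of the :crlf and encoding layers can be observed by writing a file through one layer and reading it back through :raw. The following is a sketch under the assumption that a scratch file from File::Temp is acceptable; the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmpfh, $path) = tempfile(UNLINK => 1);
close($tmpfh);

# Write through :crlf, so "\n" is stored as CRLF.
open(my $out, ">:crlf", $path) or die "Couldn't write $path: $!\n";
print $out "Hi\n";
close($out) or die "Couldn't close $path: $!\n";

# :raw shows the bytes exactly as they are stored.
open(my $raw, "<:raw", $path) or die "Couldn't read $path: $!\n";
my $on_disk = do { local $/; <$raw> };
close($raw);

# :crlf translates CRLF back to "\n" on input.
open(my $txt, "<:crlf", $path) or die "Couldn't read $path: $!\n";
my $in_memory = <$txt>;
close($txt);

printf "on disk: %d bytes, in memory: %d characters\n",
    length($on_disk), length($in_memory);

# One character above code point 255, written as UTF-8.
open(my $u, ">:raw:encoding(UTF-8)", $path) or die "Couldn't write $path: $!\n";
print $u "\x{263A}";
close($u) or die "Couldn't close $path: $!\n";
my $utf8_size = -s $path;
print "one character occupies $utf8_size bytes as UTF-8\n";
```

This mirrors the earlier "Hi" example from the discussion of newlines: the stored form and the in-memory form differ in length, which is why byte offsets and character counts cannot be used interchangeably.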
8.0.4. Advanced Operations
Use the read
function to read a fixed-length record. It takes three arguments: a
filehandle, a scalar variable, and the number of characters to read.
It returns undef if an error occurred; otherwise it
returns the number of characters read, which is 0 at end-of-file.
Test the result with defined so that end-of-file is not mistaken
for an error:
defined($rv = read(HANDLE, $buffer, 4096))
    or die "Couldn't read from HANDLE: $!\n";
# $rv is the number of characters read (0 at end-of-file),
# $buffer holds the data read
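A loop over fixed-length records has to tell three outcomes of `read` apart: undef (error), 0 (end-of-file), and a positive count. The sketch below shows one way to do that; the helper name `fixed_records`, the record size, and the sample data are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records of $size characters found on $fh.
sub fixed_records {
    my ($fh, $size) = @_;
    my @records;
    while (1) {
        my $rv = read($fh, my $buffer, $size);
        die "Couldn't read: $!\n" unless defined $rv;   # undef means error
        last if $rv == 0;                               # 0 means end-of-file
        push @records, $buffer;                         # final record may be short
    }
    return @records;
}

my $data = "AAAABBBBCC";     # assumed sample: two full records and a short one
open(my $fh, "<", \$data) or die "Couldn't open in-memory file: $!\n";
my @recs = fixed_records($fh, 4);
close($fh);

print join("|", @recs), "\n";
```

Whether a short final record is an error depends on the file format, so a real program would probably check `length($buffer)` against the expected size.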
To write a fixed-length record, just use print.
The truncate function changes the length (in bytes) of
a file, which can be specified as a filehandle or as a filename. It
returns true if the file was successfully truncated, false otherwise:
truncate(HANDLE, $length) or die "Couldn't truncate: $!\n";
truncate("/tmp/$$.pid", $length) or die "Couldn't truncate: $!\n";
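Combining truncate with tell gives the one deletion an unstructured file allows cheaply: removing the last record. The following sketch assumes line-oriented records; the helper name `drop_last_line` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the final line of the file at $path
# by truncating the file at the offset where that line begins.
sub drop_last_line {
    my ($path) = @_;
    open(my $fh, "+<", $path) or die "Couldn't open $path: $!\n";
    my $addr = 0;                            # where the last line starts
    while (defined(my $line = <$fh>)) {
        $addr = tell($fh) unless eof($fh);   # remember the start of the next line
    }
    truncate($fh, $addr) or die "Couldn't truncate $path: $!\n";
    close($fh) or die "Couldn't close $path: $!\n";
}
```

The file is read once from start to finish, but nothing is copied; only the final length changes.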
Each filehandle
keeps track of where it is in the file. Reads and writes occur from
this point, unless you've specified the O_APPEND
flag (see Recipe 7.1). Fetch the file
position for a filehandle with tell, and set it
with seek. Because the library rewrites data to
preserve the illusion that "\n" is the line
terminator, and also because you might be using characters with code
points above 255 and therefore requiring a multibyte encoding, you
cannot portably seek to offsets calculated simply
by counting characters. Unless you can guarantee your file uses one
byte per character, seek only to offsets returned
by tell.
$pos = tell(DATAFILE);
print "I'm $pos bytes from the start of DATAFILE.\n";
The seek function takes three arguments: the
filehandle, the offset (in bytes) to go to, and a numeric argument
indicating how to interpret the offset. 0 indicates an offset from
the start of the file (like the value returned by
tell); 1, an offset from the current location (a
negative number means move backward in the file, a positive number
means move forward); and 2, an offset from end-of-file.
seek(LOGFILE, 0, 2) or die "Couldn't seek to the end: $!\n";
seek(DATAFILE, $pos, 0) or die "Couldn't seek to $pos: $!\n";
seek(OUT, -20, 1) or die "Couldn't seek back 20 bytes: $!\n";
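The advice to seek only to offsets returned by tell can be followed by recording those offsets while reading. The sketch below builds a simple line index; the helper names `line_offsets` and `line_at` are hypothetical, and the Fcntl constants SEEK_SET, SEEK_CUR, and SEEK_END are used in place of the literal 0, 1, and 2.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);    # SEEK_SET, SEEK_CUR, SEEK_END

# Hypothetical helper: record the tell() offset at which each line starts.
sub line_offsets {
    my ($fh) = @_;
    seek($fh, 0, SEEK_SET) or die "Couldn't rewind: $!\n";
    my @offsets = (tell($fh));
    while (defined(my $line = <$fh>)) {
        push @offsets, tell($fh);
    }
    pop @offsets;       # the final entry is end-of-file, not a line start
    return @offsets;
}

# Hypothetical helper: fetch line $n (counting from 0) via a saved offset.
sub line_at {
    my ($fh, $offsets, $n) = @_;
    seek($fh, $offsets->[$n], SEEK_SET) or die "Couldn't seek: $!\n";
    my $line = <$fh>;
    return $line;
}

my $data = "one\ntwo\nthree\n";    # assumed sample data
open(my $fh, "<", \$data) or die "Couldn't open in-memory file: $!\n";
my @offsets = line_offsets($fh);
my $third   = line_at($fh, \@offsets, 2);
close($fh);

print "line offsets: @offsets\n";
print "third line: $third";
```

Because every offset came from tell, the index stays valid even when a translating or encoding layer makes character counts differ from byte counts.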
So
far we've been describing buffered I/O. That is,
readline or <FH>,
print, read,
seek, and tell are all
operations that use buffering for speed and efficiency. This is their
default behavior, although if you've specified an unbuffered I/O
layer for that handle, they won't be buffered. Perl also provides an
alternate set of I/O operations guaranteed to be unbuffered no matter
what I/O layer is associated with the handle. These are
sysread, syswrite, and
sysseek, all discussed in Chapter 7.
The
sysread and syswrite functions
are different in appearance from their <FH>
and print counterparts. Both take a filehandle to
act on, a scalar variable to either read into or write out from, and
the number of characters to transfer. (With binary data, this is the
number of bytes, not characters.) They also accept an optional fourth
argument, the offset from the start of the scalar variable at which
to start reading or writing:
$written = syswrite(DATAFILE, $mystring, length($mystring));
die "syswrite failed: $!\n" unless $written == length($mystring);
$read = sysread(INFILE, $block, 256, 5);
warn "only read $read bytes, not 256" if 256 != $read;
The syswrite call sends the contents of
$mystring to DATAFILE. The
sysread call reads up to 256 characters from
INFILE and stores them in
$block starting at offset 5, leaving the first 5
characters of $block intact. Both sysread and
syswrite return the number of characters
transferred, which could be different from the amount of data you
were attempting to transfer. Maybe the file didn't have as much data
as you thought, so you got a short read. Maybe the filesystem that
the file lives on filled up. Maybe your process was interrupted
partway through the write. Stdio takes care of finishing the transfer
in cases of interruption, but if you use raw
sysread and syswrite calls, you
must finish up yourself. See Recipe 9.3 for
an example.
The
sysseek function doubles as an unbuffered
replacement for both seek and
tell. It takes the same arguments as
seek, but it returns the new position on success
and undef on error. To find the current position
within the file:
$pos = sysseek(HANDLE, 0, 1); # don't change position
die "Couldn't sysseek: $!\n" unless defined $pos;
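Finishing a short or interrupted syswrite yourself, as described above, usually means looping until everything has been sent. The sketch below shows one possible loop together with sysseek and sysread; the helper name `write_all` is hypothetical, and a File::Temp scratch file is assumed because the sys* functions need a real file descriptor.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: call syswrite repeatedly until all of $data is written.
sub write_all {
    my ($fh, $data) = @_;
    my $len  = length $data;
    my $done = 0;
    while ($done < $len) {
        my $n = syswrite($fh, $data, $len - $done, $done);
        if (!defined $n) {
            next if $!{EINTR};          # interrupted by a signal: try again
            die "syswrite failed: $!\n";
        }
        $done += $n;                    # short write: loop for the remainder
    }
    return $done;
}

my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);                           # treat the data as plain bytes
write_all($fh, "0123456789");

my $pos = sysseek($fh, 0, 1);           # report position without moving
die "Couldn't sysseek: $!\n" unless defined $pos;
print "position after writing: $pos\n";

sysseek($fh, 4, 0) or die "Couldn't sysseek: $!\n";
my $block = "ab";
my $read  = sysread($fh, $block, 3, 2); # store the data at offset 2 of $block
die "sysread failed: $!\n" unless defined $read;
print "read $read bytes; block is now '$block'\n";
close($fh);
```

Only the sys* calls touch this handle; mixing them with buffered operations such as print or <FH> on the same handle is best avoided.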
These are the basic operations available to you. The art and craft of
programming lies in using these basic operations to solve complex
problems such as finding the number of lines in a file, reversing
lines in a file, randomly selecting a line from a file, building an
index for a file, and so on.