Chapter 7. File Access - Perl Cd Bookshelf [Electronic resources]

Chapter 7. File Access


Contents:

Introduction

Opening a File

Opening Files with Unusual Filenames

Expanding Tildes in Filenames

Making Perl Report Filenames in Error Messages

Storing Filehandles into Variables

Writing a Subroutine That Takes Filehandles as Built-ins Do

Caching Open Output Filehandles

Printing to Many Filehandles Simultaneously

Opening and Closing File Descriptors by Number

Copying Filehandles

Creating Temporary Files

Storing a File Inside Your Program Text

Storing Multiple Files in the DATA Area

Writing a Unix-Style Filter Program

Modifying a File in Place with a Temporary File

Modifying a File in Place with the -i Switch

Modifying a File in Place Without a Temporary File

Locking a File

Flushing Output

Doing Non-Blocking I/O

Determining the Number of Unread Bytes

Reading from Many Filehandles Without Blocking

Reading an Entire Line Without Blocking

Program: netlock

Program: lockarea


I the heir of all the ages, in the foremost files of time.

—Alfred, Lord Tennyson, Locksley Hall


7.0. Introduction


Nothing is more central to data
processing than the file. As with everything else in Perl, easy
things are easy and hard things are possible. Common tasks (opening
files, reading data, writing data) use simple I/O functions and
operators, whereas fancier functions do hard things like non-blocking
I/O and file locking.

This chapter deals with the mechanics of file
access: opening a file, telling subroutines
which files to work with, locking files, and so on. Chapter 8 deals with techniques for working with the
contents of a file: reading, writing, shuffling
lines, and other operations you can do once you have access to the
file.

Here's Perl code for printing all lines from the file
/acme/widgets/data that contain the word
"blue":

open(INPUT, "<", "/acme/widgets/data")
    or die "Couldn't open /acme/widgets/data for reading: $!\n";
while (<INPUT>) {
    print if /blue/;
}
close(INPUT);

7.0.1. Getting a Handle on the File



Central
to file access in Perl is the filehandle, like
INPUT in the previous code example. Filehandles
are symbols inside your Perl program that you associate with an
external file, usually using the open function.
Whenever your program performs an input or output operation, it
provides that operation with an internal filehandle, not an external
filename. It's the job of open to make that
association, and of close to break it. Actually,
any of several functions can be used to open files, and handles can
refer to entities beyond mere files on disk; see Recipe 7.1 for details.

While users think of open files in terms of those files' names, Perl
programs do so using their filehandles. But as far as the operating
system itself is concerned, an open file is nothing more than a
file descriptor, which is a small,
non-negative integer. The fileno function divulges
the system file descriptor of its filehandle argument. Filehandles
are enough for most file operations, but for when they aren't, Recipe 7.9 turns a system file descriptor into a
filehandle you can use from Perl.
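To see the descriptor lurking behind a handle, fileno can be applied directly. A minimal sketch (it reads the script's own source via $0 only so the open is guaranteed a file that exists):

```perl
use strict;
use warnings;

# The three standard handles occupy the first three descriptors.
printf "STDIN  is fd %d\n", fileno(STDIN);    # usually 0
printf "STDOUT is fd %d\n", fileno(STDOUT);   # usually 1
printf "STDERR is fd %d\n", fileno(STDERR);   # usually 2

# Any handle you open yourself gets the next free descriptor.
open(my $fh, "<", $0) or die "can't open $0: $!";
printf "new handle is fd %d\n", fileno($fh);
close($fh);
```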

Like the names for labels, subroutines, and packages, those for
filehandles are unadorned symbols like INPUT, not
variables like $input. However, with a few
syntactic restrictions, Perl also accepts in lieu of a named
filehandle a scalar expression that evaluates to a
filehandle—or to something that passes for a filehandle, such
as a typeglob, a reference to a typeglob, or an IO object. Typically,
this entails storing the filehandle's typeglob in a scalar variable
and then using that variable as an indirect filehandle. Code written
this way can be simpler than code using named filehandles, because
now that you're working with regular variables instead of names,
certain untidy and unobvious issues involving quoting, scoping, and
packages all become clearer.
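As a sketch of that technique, the typeglob of a named handle can be copied into an ordinary scalar, which then works anywhere a filehandle is expected (opening $0, the script itself, is purely for illustration):

```perl
use strict;
use warnings;

# A named filehandle is a bareword symbol...
open(INPUT, "<", $0) or die "can't open $0: $!";

# ...but its typeglob can be stored in a plain scalar,
# giving an indirect filehandle.
my $fh = *INPUT;          # or \*INPUT for a typeglob reference
my $first_line = <$fh>;   # reads through the indirect handle
print $first_line;
close($fh);
```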

As of the v5.6 release, Perl can be coaxed into implicitly
initializing variables used as indirect filehandles. If you supply a
function expecting to initialize a filehandle (like
open) with an undefined scalar, that function
automatically allocates an anonymous typeglob and stores its
reference into the previously undefined variable—a
tongue-twisting description normally abbreviated to something more
along the lines of, "Perl autovivifies filehandles passed to
open as undefined scalars."

my $input;        # new lexical starts out undef
open($input, "<", "/acme/widgets/data")
    or die "Couldn't open /acme/widgets/data for reading: $!\n";
while (<$input>) {
    print if /blue/;
}
close($input);    # also occurs when $input GC'd

For more about references and their autovivification, see Chapter 11. That chapter deals more with customary data
references, though, than it does with exotics like the typeglob
references seen here.

Having open autovivify a filehandle is only one of
several ways to get indirect filehandles. We show different ways of
loading up variables with named filehandles and several esoteric
equivalents for later use as indirect filehandles in Recipe 7.5.

Some recipes in this chapter use filehandles along with the standard
IO::Handle module, and sometimes with the IO::File module. Object
constructors from these classes return new objects for use as
indirect filehandles anywhere a regular handle would go, such as with
built-ins like print, readline,
close, <FH>, etc. You can
likewise invoke any IO::Handle method on your regular, unblessed
filehandles. This includes autovivified handles and even named ones
like INPUT or STDIN, although
none of these has been blessed as an object.

Method invocation syntax is visually noisier than the equivalent Perl
function call, and incurs some performance penalty compared with a
function call (where an equivalent function exists). We generally
restrict our method use to those providing functionality that would
otherwise be difficult or impossible to achieve in pure Perl without
resorting to modules.

For example,
the blocking method sets or disables blocking on a
filehandle, a pleasant alternative to the Fcntl wizardry that at
least one of the authors and probably most of the readership would
prefer not having to know. This forms the basis of Recipe 7.20.

Most
methods are in the IO::Handle class, which IO::File inherits from,
and can even be applied directly to filehandles that aren't objects.
They need only be something that Perl will accept as a filehandle.
For example:

STDIN->blocking(0);                 # invoke on named handle
open($fh, "<", $filename) or die;   # first autovivify handle, then...
$fh->blocking(0);                   # invoke on unblessed typeglob ref

Like most names in Perl, including those of subroutines and global
variables, named filehandles reside in packages. That way, two
packages can have filehandles of the same name. When unqualified by
package, a named filehandle has a full name that starts with the
current package. Writing INPUT is really
main::INPUT in the main
package, but it's SomeMod::INPUT if you're in a
hypothetical SomeMod package.

The built-in filehandles STDIN,
STDOUT, and STDERR are special.
If they are left unqualified, the main package
rather than the current one is used. This is the same exception to
normal rules for finding the full name that occurs with built-in
variables like @ARGV and %ENV,
a topic discussed in the Introduction to Chapter 12.

Unlike named filehandles, which are global symbols within the
package, autovivified filehandles implicitly allocated by Perl are
anonymous (i.e., nameless) and have no package of their own. More
interestingly, they are also like other references in being subject
to automatic garbage collection. When a variable holding them goes
out of scope and no other copies or references to that variable or
its value have been saved away somewhere more lasting, the garbage
collection system kicks in, and Perl implicitly closes the handle for
you (if you haven't yet done so yourself). This is important in large
or long-running programs, because the operating system imposes a
limit on how many underlying file descriptors any process can have
open—and usually also on how many descriptors can be open
across the entire system.

In other words, just as real system memory is a finite resource that
you can exhaust if you don't carefully clean up after yourself, the
same is true of system file descriptors. If you keep opening new
filehandles forever without ever closing them, you'll eventually run
out, at which point your program will die if you're lucky or careful,
and malfunction if you''re not. The implicit close
during garbage collection of autoallocated filehandles spares you the
headaches that can result from less than perfect bookkeeping.

For example, these two functions both autovivify filehandles into
distinct lexical variables of the same name:

sub versive {
    open(my $fh, "<", $SOURCE)
        or die "can't open $SOURCE: $!";
    return $fh;
}
sub apparent {
    open(my $fh, ">", $TARGET)
        or die "can't open $TARGET: $!";
    return $fh;
}
my($from, $to) = ( versive(), apparent() );

Normally, the handles in $fh would be closed
implicitly when each function returns. But since both functions
return those values, the handles will stay open a while longer. They
remain open until explicitly closed, or until the
$from and $to variables and any
copies you make all go out of scope—at which point Perl
dutifully tidies up by closing them if they've been left open.

For buffered handles whose internal buffers hold unwritten data,
a more valuable benefit shows up. Because a flush precedes a close,
this guarantees that all data finally makes it to where you thought
it was going in the first place.[11] For global filehandle names, this implicit flush and
close is deferred until final program exit, but it is not
forgotten.[12]

[11]Or at least tries
to; currently, no error is reported if the implicit write syscall
should fail at this stage, which might occur if, for example, the
filesystem the open file was on has run out of space.

[12]Unless you exit by way of an uncaught
signal, by exec'ing another program, or by
calling POSIX::_exit().


7.0.2. Standard Filehandles




Every program starts with three
standard filehandles already open: STDIN,
STDOUT, and STDERR.
STDIN, typically pronounced standard in,
represents the default source for data flowing into a program.
STDOUT, typically pronounced standard out,
represents the default destination for data flowing out from a
program. Unless otherwise redirected, standard input will be
read directly from your keyboard, and standard output will be written
directly to your screen.

One need not be so direct about matters, however. Here we tell the
shell to redirect your program's standard input to
datafile and its standard output to
resultsfile, all before your program even
starts:

% program < datafile > resultsfile

Suppose something goes wrong in your program that you need to report.
If your standard output has been redirected, the person running your
program probably wouldn't notice a message that appeared in this
output. These are the precise circumstances for which
STDERR, typically pronounced standard error,
was devised. Like STDOUT,
STDERR is initially directed to your screen, but
if you redirect STDOUT to a file or pipe,
STDERR's destination remains unchanged. That way
you always have a standard way to get warnings or errors through to
where they're likely to do some good.
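A minimal sketch of the distinction: if the program below is run as program > resultsfile, the first line lands in the file while the other two still reach the screen.

```perl
use strict;
use warnings;

print STDOUT "normal results\n";         # follows any > redirection
print STDERR "something went wrong\n";   # still reaches the screen
warn "warn also writes to STDERR\n";     # so does warn
```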

Unlike STDERR for STDOUT, there's no preopened
filehandle to fall back on for times
when STDIN has been redirected. That's because
this need arises much less frequently than does the need for a
coherent and reliable diagnostic stream. Rarely, your program may
need to ask something of whoever ran it and read their response, even
in the face of redirection. The more(1) and
less(1) programs do this, for example, because
their STDINs are often pipes from other programs
whose long output you want to see a page at a time. On Unix systems,
open the special file /dev/tty, which represents
the controlling device for this login session. The
open fails if the program has no controlling tty,
which is the system's way of reporting that there's no one for your
program to communicate with.
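A hedged sketch of that approach, falling back gracefully when no controlling terminal exists (the prompt text and the assume-yes fallback are illustrative choices, not part of the recipe):

```perl
use strict;
use warnings;

# Ask the user a question even if STDIN and STDOUT are redirected.
if (open(my $tty, "+<", "/dev/tty")) {
    print $tty "Continue? [y/n] ";
    chomp(my $answer = <$tty>);
    close($tty);
    die "aborted\n" unless $answer =~ /^y/i;
}
else {
    # open failed: no controlling tty, so there's no one to ask.
    warn "no controlling terminal: $!\n";
}
```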

This arrangement makes it easy to plug the output from one program
into the input of the next, and so on down the line.

% first | second | third

That means to apply the first program to the input of the second, and
the output of the second as the input of the third. You might not
realize it at first, but this is the same logic as seen when stacking
function calls like third(second(first( ))),
although the shell's pipeline is a bit easier to read because the
transformations proceed from left to right instead of from inside the
expression to outside.

Under the uniform I/O interface of standard input and output, each
program can be independently developed, tested, updated, and executed
without risk of one program interfering with another, but they will
still easily interoperate. They act as tools or parts used to build
larger constructs, or as separate stages in a larger manufacturing
process. Like having a huge stock of ready-made, interchangeable
parts on hand, they can be reliably assembled into larger sequences
of arbitrary length and complexity. If the larger sequences (call
them scripts) are given names by being placed into executable files
indistinguishable from the store-bought parts, they can then go on to
take part in still larger sequences as though they were basic tools
themselves.

An environment where every data-transformation program does one thing
well and where data flows from one program to the next through
redirectable standard input and output streams is one that strongly
encourages a level of power, flexibility, and reliability in software
design that could not be achieved otherwise. This, in a nutshell, is
the so-called tool-and-filter philosophy that underlies the design of
not only the Unix shell but the entire operating system. Although
problem domains do exist where this model breaks down—and Perl
owes its very existence to plugging one of several infelicities the
model forces on you—it is a model that has nevertheless
demonstrated its fundamental soundness and scalability for nearly 30
years.




7.0.3. I/O Operations



Perl's
most common operations for file interaction are
open, print,
<FH> to read a record, and
close. Perl's I/O functions are documented in
Chapter 29 of Programming Perl, and in the
perlfunc(1) and
perlopentut(1) manpages. The next chapter
details I/O operations like <FH>,
print, seek, and
tell. This chapter focuses on
open and how you access the data, rather than what
you do with the data.

Arguably the most important I/O function
is open. You typically pass it two or three
arguments: the filehandle, a string containing the access mode
indicating how to open the file (for reading, writing, appending,
etc.), and a string containing the filename. If two arguments are
passed, the second contains both the access mode and the filename
jammed together. We use this conflation of mode and path to good
effect in Recipe 7.14.

To open /tmp/log for writing and to associate it
with the filehandle LOGFILE, say:

open(LOGFILE, "> /tmp/log")     or die "Can't write /tmp/log: $!";

The three most common access modes are < for
reading, > for overwriting, and
>> for appending. The
open function is discussed in more detail in
Recipe 7.1. Access modes can also include
I/O layers like :raw and
:encoding(iso-8859-1). Later in this Introduction
we discuss I/O layers to control buffering, deferring until Chapter 8 the use of I/O layers to convert the contents
of files as they're read.

When opening a file or making virtually any other system
call,[13] checking the return value is indispensable. Not every
open succeeds; not every file is readable; not
every piece of data you print reaches its
destination. Most programmers check open,
seek, tell, and
close in robust programs. You might want to check
other functions, too.

[13]The term system call
denotes a call into your operating system kernel. It is unrelated to
the C and Perl function that's actually named
system. We'll therefore often call these
syscalls, after the C and Perl function of that
name.


If a function is documented to return an error under such and such
conditions, and you don't check for these conditions, then this will
certainly come back to haunt you someday. The Perl documentation
lists return values from all functions and operators. Pay special
attention to the glyph-like annotations in Chapter 29 of
Programming Perl that are listed on the
righthand side next to each function call entry—they tell you
at a glance which variables are set on error and which conditions
trigger exceptions.

Typically, a function that''s a true system call fails by returning
undef, except for wait,
waitpid, and syscall, which all
return -1 on failure. You can find the system
error message as a string and its corresponding numeric code in the
$! variable. This is often used in
die or warn messages.
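A small sketch of both failure conventions, the false return and the $! variable (the pathname is deliberately nonexistent):

```perl
use strict;
use warnings;

# Most syscall-backed functions return false and set $! on failure.
if (!open(my $fh, "<", "/no/such/file")) {
    printf "open failed: errno %d, message \"%s\"\n", $! + 0, $!;
}

# unlink reports how many files it removed; 0 means none.
my $removed = unlink("/no/such/file");
print "unlink failed: $!\n" if $removed == 0;
```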






The
most common input operation in Perl is <FH>,
the line input operator. Instead of sitting in
the middle of its operands the way infix
operators do, the line input operator surrounds
its filehandle operand, making it more of a
circumfix operator, like parentheses. It's also
known as the angle operator because of the left- and right-angle
brackets that compose it, or as the readline
function, since that's the underlying Perl core function that it
calls.

A
record is normally a line, but you can change the record terminator,
as detailed in Chapter 8. If FH
is omitted, it defaults to the special filehandle,
ARGV. When you read from this handle, Perl opens
and reads in succession data from those filenames listed in
@ARGV, or from STDIN if
@ARGV is empty. Customary and curious uses of this
are described in Recipe 7.14.
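That implicit ARGV behavior makes one-file filters nearly effortless. A minimal sketch of a cat-like line numberer, reading whatever files are named on the command line, or STDIN when there are none:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# <> reads each file in @ARGV in turn, or STDIN if @ARGV is empty.
# $. holds the current input line number.
while (<>) {
    printf "%6d  %s", $., $_;
}
```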

At one abstraction level, files are simply streams of octets; that
is, of eight-bit bytes. Of course, hardware may impose other
organizations, such as blocks and sectors for files on disk or
individual IP packets for a TCP connection on a network, but the
operating system thankfully hides such low-level details from you.

At a higher abstraction level, files are a stream of logical
characters independent of any particular underlying physical
representation. Because Perl programs most often deal with text
strings containing characters, this is the default set by
open when accessing filehandles. See the
Introduction to Chapter 8 or Recipe 8.11 for how and when to change that default.

Each filehandle has a numeric value
associated with it, typically called its seek offset, representing
the position at which the next I/O operation will occur. If you're
thinking of files as octet streams, it's how many octets you are from
the beginning of the file, with the starting offset represented by 0.
This position is implicitly updated whenever you read or write
non-zero-length data on a handle. It can also be updated explicitly
with the seek function.

Text files are a slightly higher level of abstraction than octet
streams. The number of octets need not be identical to the number of
characters. Unless you take special action, Perl's filehandles are
logical streams of characters, not physical streams of octets. The
only time those two numbers (characters and octets) are the same in
text files is when each character read or written fits comfortably in
one octet (because all code points are below 256), and when no
special processing for end of line (such as conversion between
"\cJ\cM" and "\n") occurs. Only
then do logical character position and physical byte position work
out to be the same.

This is the sort of file you have with ASCII or Latin1 text files
under Unix, where no fundamental distinction exists between text and
binary files, which significantly simplifies programming.
Unfortunately, 7-bit ASCII text is no longer prevalent, and even
8-bit encodings of ISO 8859-n are quickly giving
way to multibyte-encoded Unicode text.

In other words, because encoding layers such as
":utf8" and translation layers such as
":crlf" can change the number of bytes transferred
between your program and the outside world, you cannot sum up how
many characters you've transferred to infer your current file
position in bytes. As explained in Chapter 1,
characters are not bytes—at least, not necessarily and not
dependably. Instead, you must use the tell
function to retrieve your current file position. For the same reason,
only values returned from tell (and the number 0)
are guaranteed to be suitable for passing to seek.
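A minimal sketch of tell and seek cooperating on an update-mode handle (the filename is illustrative, and the scratch file is removed afterward):

```perl
use strict;
use warnings;

open(my $fh, "+>", "sample.dat") or die "can't create sample.dat: $!";
print $fh "Hello, world\n";

my $pos = tell($fh);            # remember this position...
print $fh "more data\n";

seek($fh, $pos, 0) or die "seek failed: $!";   # whence 0 means SEEK_SET
my $line = <$fh>;               # rereads "more data\n"
print $line;

close($fh) or die "close failed: $!";
unlink "sample.dat";
```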

In Recipe 7.17, we read the entire contents
of a file opened in update mode into memory, change our internal
copy, and then seek back to the beginning of that
file to write those modifications out again, thereby overwriting what
we started with.

When
you no longer have use for a filehandle, close it.
The close function takes a single filehandle and
returns true if the filehandle could be successfully flushed and
closed, and returns false otherwise. You don't need to explicitly
close every filehandle. When you open a filehandle that's already
open, Perl implicitly closes it first. When your program exits, any
open filehandles also get closed.

These implicit closes are for convenience, not stability, because
they don't tell you whether the syscall succeeded or failed. Not all
closes succeed, and even a close on a read-only
file can fail. For instance, you could lose access to the device
because of a network outage. It's even more important to check the
close if the file was opened for writing;
otherwise, you wouldn't notice if the filesystem filled up.

close(FH)           or die "FH didn't close: $!";

Closing filehandles as soon as
you''re done with them can also aid portability to non-Unix platforms,
because some have problems in areas such as reopening a file before
closing it and renaming or removing a file while it''s still open.
These operations pose no problem to POSIX systems, but others are
less accommodating.

The paranoid programmer even checks the close on the
standard output stream at the program's end, lest
STDOUT had been redirected from the command line
and the output filesystem filled up. Admittedly, your runtime system
should take care of this for you, but it doesn't.

Checking standard error, though, is more problematic. After all, if
STDERR fails to close, what are you planning to do
about it? Well, you could determine why the close failed to see
whether there's anything you might do to correct the situation. You
could even load up the Sys::Syslog module and call
syslog(), which is what system daemons do, since they don't
otherwise have access to a good STDERR stream.

STDOUT is the default filehandle used by
the print, printf, and
write functions if no filehandle argument is
passed. Change this default with select, which
takes the new default output filehandle and returns the previous one.
The new output filehandle must have already been opened before
calling select:

$old_fh = select(LOGFILE);          # switch to LOGFILE for output
print "Countdown initiated ...\n";
select($old_fh);                    # return to original output
print "You have 30 seconds to reach minimum safety distance.\n";

Some of
Perl's special variables change the behavior of the currently
selected output filehandle. Most important is $|,
which controls output buffering for each filehandle. Flushing output
buffers is explained in Recipe 7.19.
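A small sketch of both styles of enabling autoflush on a handle, the $|-via-select idiom just described and the IO::Handle method form (the log filename is illustrative):

```perl
use strict;
use warnings;
use IO::Handle;

open(my $log, ">", "tmp.log") or die "can't write tmp.log: $!";

# Old style: $| applies to the currently selected output handle,
# so select the handle, set $|, and restore the old default.
my $old_fh = select($log);
$| = 1;
select($old_fh);

# Method style does the same thing with less ceremony.
$log->autoflush(1);

print $log "flushed immediately\n";
close($log) or die "close failed: $!";
unlink "tmp.log";
```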



Perl
has functions for buffered and unbuffered I/O. Although there are
some exceptions (see the following table), you shouldn't mix calls to
buffered and unbuffered I/O functions. That's because buffered
functions may keep data in their buffers that the unbuffered
functions can't know about. The following table shows the two sets of
functions you should not mix. Functions on a particular row are only
loosely associated; for instance, sysread doesn't
have the same semantics as <FH>, but they
are on the same row because they both read from a filehandle.
Repositioning is addressed in Chapter 8, but we
also use it in Recipe 7.17.

Action          Buffered            Unbuffered
input           <FH>, readline      sysread
output          print               syswrite
repositioning   seek, tell          sysseek
As of Perl v5.8
there is a way to mix these functions: I/O layers. You can't turn on
buffering for the unbuffered functions, but you can turn off
buffering for the buffered ones. Perl now lets you select the
implementation of I/O you wish to use. One possible choice is
:unix, which makes Perl use unbuffered syscalls
rather than your stdio library or Perl's portable reimplementation of
stdio called perlio. Enable the unbuffered I/O layer when you open
the file with:


open(FH, "<:unix", $filename)  or die;

Having opened the handle with the unbuffered layer, you can now mix
calls to Perl''s buffered and unbuffered I/O functions with impunity
because with that I/O layer, in reality there are no buffered I/O
functions. When you print, Perl is then really
using the equivalent of syswrite. More information
can be found in Recipe 7.19.
