Python Cookbook, 2nd Edition

David Ascher, Alex Martelli, Anna Ravenscroft






Recipe 2.5. Counting Lines in a File


Credit: Luther Blissett


Problem


You need to compute the number of
lines in a file.


Solution


The simplest approach for reasonably sized files is to read the file
as a list of lines, so that the count of lines is the length of the
list. If the file's path is in a string bound to a
variable named thefilepath, all the code
you need to implement this approach is:

count = len(open(thefilepath, 'rU').readlines())
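
This one-liner leaves closing the file to garbage collection. If you
prefer to close the file explicitly, here is a minor variant of my own
(not part of the recipe) that spells the same approach out:

f = open(thefilepath, 'rU')
try:
    # readlines reads the whole file into a list, one item per line
    count = len(f.readlines())
finally:
    f.close()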

For a truly huge file, however, this simple approach may be very slow
or even fail to work. If you have to worry about humongous files, a
loop on the file always works:

count = -1
for count, line in enumerate(open(thefilepath, 'rU')):
    pass
count += 1

A tricky alternative, potentially faster for truly humongous files,
for when the line terminator is '\n' (or has
'\n' as a substring, as happens on Windows):

count = 0
thefile = open(thefilepath, 'rb')
while True:
    buffer = thefile.read(8192*1024)
    if not buffer:
        break
    count += buffer.count('\n')
thefile.close()

The 'rb' argument to open is necessary if you're after speed:
without that argument, this snippet might be very slow on Windows.
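
If you find the while/break structure heavy, the same chunked count
can be sketched with the two-argument form of the iter built-in; this
is just an equivalent restatement of mine, not a separate technique:

count = 0
thefile = open(thefilepath, 'rb')
# read returns the empty string at end-of-file, which is exactly
# the sentinel that stops this iteration
for buffer in iter(lambda: thefile.read(65536), ''):
    count += buffer.count('\n')
thefile.close()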


Discussion


When an external program counts a file's lines, such
as wc -l on Unix-like platforms, you can of course
choose to use that (e.g., via os.popen). However,
it's generally simpler, faster, and more portable to
do the line-counting in your own program. You can rely on almost all
text files having a reasonable size, so that reading the whole file
into memory at once is feasible. For all such normal files, the
len of the result of readlines
gives you the count of lines in the simplest way.
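
For completeness, here is a minimal sketch of the external-program
route on a Unix-like platform (the same idea as linecount_w in the
benchmark script shown later); thefilepath is assumed to hold the
file's path, as in the Solution, and the quoting is naive, so this
works only for simple paths:

import os
# ask wc to count the lines, then parse the first field of its output
count = int(os.popen('wc -l ' + thefilepath).read().split()[0])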

If the file is larger than available memory (say, a few hundred
megabytes on a typical PC today), the simplest solution can become
unacceptably slow, as the operating system struggles to fit the
file's contents into virtual memory. It may even
fail, when swap space is exhausted and virtual memory
can't help any more. On a typical PC, with 256MB RAM
and virtually unlimited disk space, you should still expect serious
problems when you try to read into memory files above, say, 1 or 2
GB, depending on your operating system. (Some operating systems are
much more fragile than others in handling virtual-memory issues under
such overly stressed load conditions.) In this case, looping on the
file object, as shown in this recipe's Solution, is
better. The enumerate built-in keeps the line
count without your code having to do it explicitly.
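
Note why the Solution initializes count to -1: for an empty file, the
loop body never executes, count keeps its initial value, and the final
increment correctly reports zero lines. For example, with a
hypothetical empty file empty.txt:

count = -1
for count, line in enumerate(open('empty.txt', 'rU')):
    pass
print count + 1   # prints 0: the loop never ran, so count stayed -1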

Counting line-termination characters while reading the file by bytes
in reasonably sized chunks is the key idea in the third approach.
It's probably the least immediately intuitive, and
it's not perfectly cross-platform, but you might
hope that it's fastest (e.g., when compared with
recipe 8.2 in the Perl Cookbook).

However, in most cases, performance doesn't really
matter all that much. When it does matter, the time-sink part of your
program might not be what your intuition tells you it is, so you
should never trust your intuition in this matter; instead,
always benchmark and measure. For example, consider a typical Unix
syslog file of middling size, a bit over 18 MB
of text in 230,000 lines:

[situ@tioni nuc]$ wc nuc
231581 2312730 18508908 nuc

And consider the following testing-and-benchmark framework script,
bench.py:

import time
def timeo(fun, n=10):
    start = time.clock()
    for i in xrange(n): fun()
    stend = time.clock()
    thetime = stend - start
    return fun.__name__, thetime
import os
def linecount_w():
    return int(os.popen('wc -l nuc').read().split()[0])
def linecount_1():
    return len(open('nuc').readlines())
def linecount_2():
    count = -1
    for count, line in enumerate(open('nuc')): pass
    return count + 1
def linecount_3():
    count = 0
    thefile = open('nuc', 'rb')
    while True:
        buffer = thefile.read(65536)
        if not buffer: break
        count += buffer.count('\n')
    return count
for f in linecount_w, linecount_1, linecount_2, linecount_3:
    print f.__name__, f()
for f in linecount_1, linecount_2, linecount_3:
    print "%s: %.2f" % timeo(f)

First, I print the line-counts obtained by all methods, thus ensuring
that no anomaly or error has occurred (counting tasks are notoriously
prone to off-by-one errors). Then, I run each alternative 10 times,
under the control of the timing function
timeo, and look at the results. Here they
are, on the old but reliable machine I measured them on:

[situ@tioni nuc]$ python -O bench.py
linecount_w 231581
linecount_1 231581
linecount_2 231581
linecount_3 231581
linecount_1: 4.84
linecount_2: 4.54
linecount_3: 5.02

As you can see, the performance differences hardly matter: your users
will never even notice a difference of 10% or so in one auxiliary
task. However, the fastest approach (for my
particular circumstances, on an old but reliable PC running a popular
Linux distribution, and for this specific benchmark) is the humble
loop-on-every-line technique, while the slowest
one is the fancy, ambitious technique that counts line terminators by
chunks. In practice, unless I had to worry about files of many
hundreds of megabytes, I'd always use the simplest
approach (i.e., the first one presented in this recipe).

Measuring the
exact performance of code snippets (rather than blindly using
complicated approaches in the hope that they'll be
faster) is very important; so important, indeed, that the Python
Standard Library includes a module, timeit,
specifically designed for such measurement tasks. I suggest you use
timeit, rather than coding your own little
benchmarks as I have done here. The benchmark I just showed you is
one I've had around for years, since well before
timeit appeared in the standard Python library, so
I think I can be forgiven for not using timeit in
this specific case!
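
For reference, here is a minimal sketch of what the same kind of
measurement might look like with timeit (my illustration, not code
from the recipe); linecount_1 and the sample file nuc are as in
bench.py above:

import timeit

def linecount_1():
    return len(open('nuc').readlines())

# time 10 calls per run, repeat 3 runs, and report the best run
timer = timeit.Timer('linecount_1()', 'from __main__ import linecount_1')
print '%.2f' % min(timer.repeat(repeat=3, number=10))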


See Also


The Library Reference and Python in a Nutshell sections on file
objects, the enumerate built-in, os.popen, and the time and timeit
modules; Perl Cookbook recipe 8.2.

