Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)

Recipe 19.10. Reading a Text File by Paragraphs

Credit: Alex Martelli, Magnus Lie Hetland, Terry Reedy

Problem

You need to read a text file (or any other iterable whose items are
lines of text) paragraph by paragraph, where a "paragraph" is defined
as a sequence of nonwhite lines (i.e., paragraphs are separated by lines
made up exclusively of whitespace).

Solution

A generator is quite suitable for bunching up lines this way:

def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:
                # a separator line ends the current paragraph
                yield joiner(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    # the last paragraph may not be followed by a separator line
    if paragraph:
        yield joiner(paragraph)

if __name__ == '__main__':
    text = 'a first\nparagraph\n\nand a\nsecond one\n\n'
    for p in paragraphs(text.splitlines(True)):
        print(repr(p))

Discussion

Python doesn't directly support paragraph-oriented
file reading, but it's not hard to add such
functionality. We define a
"paragraph" as the string formed by
joining a nonempty sequence of nonseparator lines, separated from any
adjoining paragraphs by nonempty sequences of separator lines. A
separator line is one that satisfies the predicate passed in as
argument is_separator. (A
predicate is a function whose result is taken
as a logical truth value, and we say a
predicate is satisfied
when the predicate returns a result that is true.) By default, a line
is a separator if it is made up entirely of whitespace characters
(e.g., space, tab, newline, etc.).
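
To illustrate the role of the is_separator predicate, here is a minimal sketch. It is written for Python 3 (an assumption; the recipe itself targets Python 2.3/2.4) and restates the Solution's generator so that it is self-contained. The predicate is_rule is hypothetical, introduced only for illustration: it treats lines of dashes, as well as blank or empty lines, as separators.

```python
def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    """Restatement of the recipe's generator, so this sketch is self-contained."""
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)


def is_rule(line):
    """Hypothetical predicate: blank/empty lines and lines made only of dashes."""
    stripped = line.strip()
    return stripped == '' or set(stripped) == {'-'}


text = 'one\ntwo\n---\nthree\n\n   \nfour\n'

# Default predicate: only whitespace-only lines separate paragraphs.
default_result = list(paragraphs(text.splitlines(True)))
# Custom predicate: the '---' line now also ends a paragraph.
custom_result = list(paragraphs(text.splitlines(True), is_rule))

# Caveat: ''.isspace() is False, so with line endings stripped the default
# predicate does not see an empty line as a separator.
stripped_lines = text.splitlines()
merged = list(paragraphs(stripped_lines, joiner=' '.join))
split_ok = list(paragraphs(stripped_lines, is_rule, ' '.join))
```

The last two calls suggest that, when the lines no longer carry their newline characters, a predicate based on strip() may be a safer choice than the default str.isspace.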

The recipe's code is quite straightforward. The
state of the generator during iteration is entirely held in local
variable paragraph, a list to which we append the
nonseparator lines that make up the current paragraph. Whenever we
meet a separator in the body of the for statement,
we test if paragraph to check whether the list is
currently empty. If the list is empty, we're already
skipping a run of separators and need do nothing special to handle
the current separator line. If the list is not empty,
we've just met a separator line that terminates the
current paragraph, so we must join up the list,
yield the resulting paragraph string, and then set
the list back to empty.
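
The behavior just described can be checked with a small sketch, again assuming Python 3 and restating the generator. An io.StringIO object stands in for a real text file here (an assumption made so the example needs no file on disk); the input begins with separator lines and contains a run of three separators between the two paragraphs.

```python
import io


def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    """Restatement of the recipe's generator, so this sketch is self-contained."""
    paragraph = []
    for line in lines:
        if is_separator(line):
            if paragraph:              # a separator right after a paragraph
                yield joiner(paragraph)
                paragraph = []
            # otherwise we are inside a run of separators: nothing to do
        else:
            paragraph.append(line)
    if paragraph:                      # input ended without a final separator
        yield joiner(paragraph)


# Iterating a file-like object yields lines with their newlines kept.
source = io.StringIO('\n\nfirst\nparagraph\n\n \t\n\nsecond\n')
result = list(paragraphs(source))
```

Neither the leading separators nor the run of three separators produces an empty paragraph, and the last paragraph is yielded even though no separator follows it.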

This recipe implements a special case of sequence adaptation by
bunching: an underlying iterable is "bunched
up" into another iterable with
"bigger" items.
Python's generators let you express sequence
adaptation tasks very directly and linearly. By passing as arguments,
with reasonable default values, the is_separator
predicate, and the joiner callable that determines
what happens to each "bigger item"
when we're done bunching it up, we achieve a
satisfactory amount of generality without any extra complexity. To
see this, consider a snippet such as:

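import operator
numbers = [1, 2, 3, 0, 0, 6, 5, 3, 0, 12]
bunch_up = paragraphs
# sums of the runs of nonzero numbers (zeros act as separators)
for s in bunch_up(numbers, operator.not_, sum):
    print('S %s' % s)
# lengths of the runs of zeros (nonzero numbers act as separators)
for l in bunch_up(numbers, bool, len):
    print('L %s' % l)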

In this snippet, we use the paragraphs generator
(under the name of bunch_up, which is clearer in
this context) to get the sums of "runs" of nonzero numbers separated
by runs of zeros, then the lengths of the runs of
zeros: applications that, at first sight, might appear to be
quite different from the recipe's stated purpose.
That's the magic of abstraction: when appropriately
and tastefully applied, it can easily turn the solution of a problem
into a family of solutions for many other apparently unrelated
problems.

An elementary issue, but a crucial one for getting good performance
in the "main" use case of this recipe, is that the paragraphs
generator builds up each resulting paragraph as a list of strings,
then concatenates all strings in the list with
''.join to obtain each result it
yields. An alternative approach, where a large
string is built up as a string, by repeated application of
+= or +, is never the right
approach in Python: it is both slow and clumsy. Good Pythonic style
absolutely demands that we use a list as the
intermediate accumulator, whenever we are building a long string by
concatenating a number of smaller ones. Python 2.4 has diminished the
performance penalty of the wrong approach. For example, to join a
list of 52 one-character strings into a 52-character string on my
machine, Python 2.3 takes 14.2 microseconds with the right approach,
73.6 with the wrong one; but Python 2.4 takes 12.7 microseconds with
the right approach, 41.6 with the wrong one, so the penalty in this
case has decreased from over five times to over three. Nevertheless,
there is no reason to choose to pay such a performance penalty
without any returns, even the lower penalty that Python 2.4 manages
to extract!
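
A comparison of this kind can be approximated with the standard timeit module. The following is only a sketch under assumptions: the helper names join_with_list and join_with_plus are hypothetical, the loop structure is a guess at what was measured, and the code is written for Python 3 rather than the Python 2.3 and 2.4 interpreters quoted above. Absolute timings depend on the machine and interpreter, so no output is shown.

```python
import timeit


def join_with_list(pieces):
    """Accumulate in a list, then join once (the approach the recipe recommends)."""
    accumulator = []
    for piece in pieces:
        accumulator.append(piece)
    return ''.join(accumulator)


def join_with_plus(pieces):
    """Build the string by repeated += (the approach the recipe warns against)."""
    result = ''
    for piece in pieces:
        result += piece
    return result


# 52 one-character strings, matching the size of the case discussed in the text.
pieces = [chr(ord('a') + i % 26) for i in range(52)]

if __name__ == '__main__':
    runs = 100000
    for func in (join_with_list, join_with_plus):
        seconds = timeit.timeit(lambda: func(pieces), number=runs)
        print('%s: %.2f microseconds per call'
              % (func.__name__, seconds / runs * 1e6))
```

Both helpers return the same 52-character string; only the way the intermediate result is accumulated differs.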

Python 2.4 offers a new itertools.groupby function
that is quite suitable for sequence-bunching tasks. Using it, we
could express the paragraphs
generator in a really tight and concise way:

from itertools import groupby
def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    for separator_group, lineiter in groupby(lines, key=is_separator):
        if not separator_group:
            yield joiner(lineiter)

itertools.groupby, like SQL's
GROUP BY clause, which inspired it, is not exactly
trivial to use, but it can be quite useful indeed for sequence-bunching
tasks once you have mastered it thoroughly.
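
To make the behavior of groupby more concrete, the sketch below (Python 3 assumed) lists the groups it produces for a small input when str.isspace is the key. It also shows a hypothetical variant, bunch_up, that wraps each group in list() before calling the joiner. That wrapping is an addition for illustration: groupby hands the joiner an iterator rather than a list, which suits ''.join and sum, but a joiner such as len needs a real sequence. The sketch also assumes the predicate returns plain True or False, because groupby starts a new group whenever the key value changes.

```python
from itertools import groupby

lines = 'a first\nparagraph\n\n\nand a\nsecond one\n'.splitlines(True)

# Each item from groupby is (key value, iterator over consecutive lines sharing it).
groups = [(is_sep, list(group))
          for is_sep, group in groupby(lines, key=str.isspace)]


def bunch_up(items, is_separator, joiner):
    """groupby-based variant; list() materializes each group for joiners like len."""
    for separator_group, group in groupby(items, key=is_separator):
        if not separator_group:
            yield joiner(list(group))


numbers = [1, 2, 3, 0, 0, 6, 5, 3, 0, 12]
sums_of_nonzero_runs = list(bunch_up(numbers, lambda n: not n, sum))
lengths_of_zero_runs = list(bunch_up(numbers, bool, len))
```

Here groups alternates between nonseparator and separator runs, and the two results for numbers match those of the list-based generator from the Solution.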
