Recipe 19.12. Iterating on a Stream of Data Blocks as a Stream of Lines
Credit: Scott David Daniels, Peter Cogolo
Problem
You want to loop over all lines of a stream, but the stream arrives
as a sequence of data blocks of arbitrary size (e.g., from a network
socket).
Solution
We need to code a generator that gets blocks and yields lines:
def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    tail = ''                              # text seen after the latest EOL
    for block in source_iterable:
        pieces = (tail+block).split(eol)
        tail = pieces.pop()                # '' if the text ended with eol
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail                         # last line had no EOL marker
if __name__ == '__main__':
    s = 'one\r\ntwo\r,\nthree,four,five\r\n,six,\r\nseven\r\nlast'.split(',')
    for line in ilines(s):
        print(repr(line))
When run as a main script, this code emits:
'one\n'
'two\n'
'threefourfive\n'
'six\n'
'seven\n'
'last'
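The buffering step can be examined in isolation. The following is a minimal sketch, not part of the recipe: split_block is a hypothetical helper that performs one pass of the loop in ilines and returns the complete lines together with the new tail.

```python
def split_block(tail, block, eol='\r\n'):
    """Hypothetical helper: one pass of the buffering loop in ilines."""
    pieces = (tail + block).split(eol)
    # The last piece is '' when the text ended with eol,
    # otherwise it is the unfinished line.
    new_tail = pieces.pop()
    return pieces, new_tail

# Block ends exactly on an EOL marker: nothing is carried over.
print(split_block('', 'one\r\ntwo\r\n'))   # (['one', 'two'], '')

# Block ends in the middle of a line: the partial line is carried over.
print(split_block('', 'one\r\ntw'))        # (['one'], 'tw')

# Block boundary falls inside the EOL marker: the stray '\r' stays in
# the tail, and the marker is completed when the next block arrives.
lines, tail = split_block('', 'one\r\ntwo\r')
print((lines, tail))                       # (['one'], 'two\r')
print(split_block(tail, '\nthree'))        # (['two'], 'three')
```

The third case corresponds to the 'two\r' and '\nthree' blocks in the sample data above.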
Discussion
Many data sources produce their data in fits and
starts: sockets, RSS feeds, the results of expanding compressed
text, and (at its heart) most I/O. The data often
doesn't arrive at convenient boundaries, but you
nevertheless want to consume it in logical units. For text, the
logical units are often lines.

This recipe shows generator ilines, a simple way to
consume a source_iterable, which yields blocks of
data, producing an iterator that yields lines of text instead.
ilines is vastly simplified by assuming that lines
are separated, on input, by a known end-of-line (EOL) string: by
default, '\r\n', which is the standard EOL marker in most Internet
protocols.
ilines' implementation is further
simplified by taking a high-level approach, relying on the
split method of Python's string
types to do most of the work. This basically leaves
ilines with the single task of
"buffering" data between successive
input blocks, on all occasions when a line starts in one block and
ends in a following one (including those occasions in which block
boundaries "split" an EOL marker).

ilines easily accomplishes its buffering task
through its local variable tail, which
starts empty and, on each pass through the loop, holds whatever followed
the latest EOL marker seen so far. When tail+block
ends with an EOL marker, the expression
(tail+block).split(eol) produces a list whose last
item is an empty string (''), exactly what we
need; otherwise, the last item of the list is whatever followed the
last EOL, which again is exactly what we need.

Python's built-in file objects
are even more powerful than ilines, since they
support a universal newlines reading mode
(mode 'U'), which is able to recognize and deal
with all common EOL markers (even when different markers are mixed
within the same stream!). However, ilines is more
flexible, since you may apply it in many situations where you have a
stream of arbitrary blocks of text and want to process it as a stream
of lines, with a known EOL marker.
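As one illustration of that flexibility, the sketch below feeds ilines from a file-like object read in fixed-size blocks, much as one might read from a socket. It is a sketch under stated assumptions rather than part of the recipe: blocks is a hypothetical helper, io.StringIO stands in for a real data source, and the code is written for Python 3. ilines is repeated from the Solution only to keep the example self-contained.

```python
import io

def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    # Same generator as in the Solution.
    tail = ''
    for block in source_iterable:
        pieces = (tail + block).split(eol)
        tail = pieces.pop()
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail

def blocks(fileobj, size):
    """Hypothetical helper: yield blocks of at most `size` characters."""
    while True:
        block = fileobj.read(size)
        if not block:          # an empty read means end of data
            break
        yield block

text = 'one\r\ntwo\r\nthree'
for size in (1, 2, 5, 100):
    # newline='' keeps the '\r\n' markers untranslated in the stand-in source.
    source = io.StringIO(text, newline='')
    print(size, list(ilines(blocks(source, size))))
```

Whatever the block size, including sizes that cut an EOL marker in two, each pass prints the same list: ['one\n', 'two\n', 'three'].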
See Also
Library Reference and Python in a
Nutshell docs about built-in file
objects; Chapter 2 for general issues about
handling files.