Recipe 16.7. Merging and Splitting Tokens
Credit: Peter Cogolo
Problem
You
need to tokenize an input language whose tokens are almost the same
as Python's, with a few exceptions that need token
merging and splitting.
Solution
Standard library module tokenize is very handy; we
need to wrap it with a generator to do the post-processing for a
little splitting and merging of tokens. The merging requires the
ability to "peek ahead" in an
iterator. We can get that ability by wrapping any iterator into a
small dedicated iterator class:
class peek_ahead(object):
    sentinel = object()                  # unique marker for "iterator exhausted"
    def __init__(self, it):
        self._nit = iter(it).next        # bound next method of the wrapped iterator
        self.preview = None
        self._step()                     # prime self.preview with the first item
    def __iter__(self):
        return self
    def next(self):
        result = self._step()
        if result is self.sentinel: raise StopIteration
        else: return result
    def _step(self):
        # return the current preview, advancing self.preview to the following item
        result = self.preview
        try: self.preview = self._nit()
        except StopIteration: self.preview = self.sentinel
        return result
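For instance, here is how this wrapper behaves on a small iterator (an illustrative session, not part of the original recipe's code):
>>> it = peek_ahead('abc')
>>> it.preview                 # the first item is visible before we consume it
'a'
>>> it.next()
'a'
>>> it.preview                 # preview always shows what next() will return
'b'
>>> list(it)
['b', 'c']
>>> it.preview is peek_ahead.sentinel    # the sentinel marks exhaustion
True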
Armed with this tool, we can easily split and merge tokens. Say, for
example, that by the rules of the language we're lexing, we must
consider each of ':=' and ':+' to be a single token, but a
floating-point token that is a '.' with digits on both sides, such as
'31.17', must be given as a sequence of three tokens ('31', '.', '17'
in this case). Here's how (using Python 2.4 code, with comments on how
to change it if you're stuck with version 2.3):
import tokenize, cStringIO
# in 2.3, also do 'from sets import Set as set'
mergers = {':': set('=+'), }

def tokens_of(x):
    it = peek_ahead(toktuple[1] for toktuple in
        tokenize.generate_tokens(cStringIO.StringIO(x).readline)
    )
    # in 2.3, you need to add brackets [ ] around the arg to peek_ahead
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            # merge with next token, as required
            yield tok + it.next()
        elif tok[:1].isdigit() and '.' in tok:
            # split if digits on BOTH sides of the '.'
            before, after = tok.split('.', 1)
            if after:
                # both sides -> yield as 3 separate tokens
                yield before
                yield '.'
                yield after
            else:
                # nope -> yield as one token
                yield tok
        else:
            # not a merge or split case, just yield the token
            yield tok
Discussion
Here's an example of use of this
recipe's code:
>>> x = 'p{z:=23, w:+7}: m :+ 23.4'
>>> print ' / '.join(tokens_of(x))
p / { / z / := / 23 / , / w / :+ / 7 / } / : / m / :+ / 23 / . / 4 /
In this recipe, I yield tokens only as substrings of the string I'm
lexing, rather than the whole tuples yielded by
tokenize.generate_tokens, which include such items as the token's
position within the overall string (by line and column). If your needs
are more sophisticated than mine, you should simply peek_ahead on whole
token tuples (while I'm simplifying things by picking up just the
substring, item 1, out of each token tuple, by passing a generator
expression to peek_ahead), and compute start and end positions
appropriately when splitting or merging. For example, if you're merging
two adjacent tokens, the overall token has the same start position as
the first, and the same end position as the second, of the two tokens
you're merging.
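For instance, such a tuple-preserving variant might look like this (a minimal sketch under the assumptions just stated; the name tokens_with_positions_of, the choice to keep the first token's type on a merge, and the token types assigned to the split pieces are my own illustrative choices, not part of the recipe):
def tokens_with_positions_of(x):
    # illustrative sketch: like tokens_of, but yields whole 5-tuples in
    # generate_tokens' format: (type, string, (srow, scol), (erow, ecol), line)
    it = peek_ahead(tokenize.generate_tokens(cStringIO.StringIO(x).readline))
    for tok in it:
        typ, s, start, end, line = tok
        # guard against the sentinel before subscripting the previewed tuple
        if it.preview is not it.sentinel and it.preview[1] in mergers.get(s, ()):
            # merged token: start position of the first, end position of the second
            ntyp, ns, nstart, nend, nline = it.next()
            yield typ, s + ns, start, nend, line
        elif s[:1].isdigit() and '.' in s:
            before, after = s.split('.', 1)
            if after:
                # split token: a number never spans rows, so just offset columns
                srow, scol = start
                cut = scol + len(before)
                yield tokenize.NUMBER, before, start, (srow, cut), line
                yield tokenize.OP, '.', (srow, cut), (srow, cut+1), line
                yield tokenize.NUMBER, after, (srow, cut+1), end, line
            else:
                yield tok
        else:
            yield tok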
The peek_ahead iterator wrapper class can often be useful in many
kinds of lexing and parsing tasks, exactly because
such tasks are well suited to operating on streams (which are well
represented by iterators) but often require a level of peek-ahead
and/or push-back ability. You can often get by with just one level;
if you need more than one level, consider having your wrapper hold a
container of peeked-ahead or pushed-back tokens. Python
2.4's collections.deque container
implements a double-ended queue, which is particularly well suited
for such tasks. For a more powerful look-ahead iterator wrapper, see
Recipe 19.18.
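For instance, a multilevel wrapper might hold its pending items in a deque, along these lines (an illustrative sketch only, assuming Python 2.4; the class name peek_ahead_many and the peek and push_back methods are hypothetical, and Recipe 19.18's actual wrapper differs):
import collections

class peek_ahead_many(object):
    # illustrative sketch: arbitrary levels of peek-ahead and push-back,
    # with pending items held in a double-ended queue
    def __init__(self, it):
        self._nit = iter(it).next
        self._pending = collections.deque()
    def __iter__(self):
        return self
    def next(self):
        if self._pending:
            return self._pending.popleft()
        return self._nit()        # may raise StopIteration, ending iteration
    def peek(self, n=0):
        # return the item n steps ahead, without consuming it; raises
        # StopIteration if fewer than n+1 items remain
        while len(self._pending) <= n:
            self._pending.append(self._nit())
        return self._pending[n]
    def push_back(self, item):
        # make item the very next one this iterator will yield
        self._pending.appendleft(item)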
See Also
Library Reference and Python in a
Nutshell sections on the Python Standard Library modules
tokenize and cStringIO; Recipe 19.18 for a more powerful
look-ahead iterator wrapper.