Recipe 16.7. Merging and Splitting Tokens
Credit: Peter Cogolo
Problem
You
need to tokenize an input language whose tokens are almost the same
as Python's, with a few exceptions that need token
merging and splitting.
Solution
Standard library module tokenize is very handy; we
need to wrap it with a generator to do the post-processing for a
little splitting and merging of tokens. The merging requires the
ability to "peek ahead" in an
iterator. We can get that ability by wrapping any iterator into a
small dedicated iterator class:
class peek_ahead(object):
    sentinel = object()                  # unique marker for "iterator exhausted"
    def __init__(self, it):
        self._nit = iter(it).next        # bound next method of the wrapped iterator
        self.preview = None
        self._step()                     # prime self.preview with the first item
    def __iter__(self):
        return self
    def next(self):
        result = self._step()
        if result is self.sentinel: raise StopIteration
        else: return result
    def _step(self):
        # return the current preview, advancing self.preview to the following item
        result = self.preview
        try: self.preview = self._nit()
        except StopIteration: self.preview = self.sentinel
        return result
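For instance, here is how this wrapper behaves on a small iterator (an illustrative session, not part of the original recipe's code):
>>> it = peek_ahead('abc')
>>> it.preview                 # the first item is visible before we consume it
'a'
>>> it.next()
'a'
>>> it.preview                 # preview always shows what next() will return
'b'
>>> list(it)
['b', 'c']
>>> it.preview is peek_ahead.sentinel    # the sentinel marks exhaustion
True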
Armed with this tool, we can easily split and merge tokens. Say, for
example, that by the rules of the language we're lexing, we must
consider each of ':=' and ':+' to be a single token, but a
floating-point token that is a '.' with digits on both sides, such as
'31.17', must be given as a sequence of three tokens ('31', '.', '17'
in this case). Here's how (using Python 2.4 code, with comments on how
to change it if you're stuck with version 2.3):
import tokenize, cStringIO
# in 2.3, also do 'from sets import Set as set'
mergers = {':': set('=+'), }

def tokens_of(x):
    it = peek_ahead(toktuple[1] for toktuple in
        tokenize.generate_tokens(cStringIO.StringIO(x).readline)
    )
    # in 2.3, you need to add brackets [ ] around the arg to peek_ahead
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            # merge with next token, as required
            yield tok + it.next()
        elif tok[:1].isdigit() and '.' in tok:
            # split if digits on BOTH sides of the '.'
            before, after = tok.split('.', 1)
            if after:
                # both sides -> yield as 3 separate tokens
                yield before
                yield '.'
                yield after
            else:
                # nope -> yield as one token
                yield tok
        else:
            # not a merge or split case, just yield the token
            yield tok
Discussion
Here's an example of use of this
recipe's code:
>>> x = 'p{z:=23, w:+7}: m :+ 23.4'
>>> print ' / '.join(tokens_of(x))
p / { / z / := / 23 / , / w / :+ / 7 / } / : / m / :+ / 23 / . / 4 /
In this recipe, I yield tokens only as substrings of the string I'm
lexing, rather than the whole tuples yielded by
tokenize.generate_tokens, which include such items as the token's
position within the overall string (by line and column). If your needs
are more sophisticated than mine, you should simply peek_ahead on whole
token tuples (while I'm simplifying things by picking up just the
substring, item 1, out of each token tuple, by passing a generator
expression to peek_ahead), and compute start and end positions
appropriately when splitting or merging. For example, if you're merging
two adjacent tokens, the overall token has the same start position as
the first, and the same end position as the second, of the two tokens
you're merging.
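For instance, such a tuple-preserving variant might look like this (a minimal sketch under the assumptions just stated; the name tokens_with_positions_of, the choice to keep the first token's type on a merge, and the token types assigned to the split pieces are my own illustrative choices, not part of the recipe):
def tokens_with_positions_of(x):
    # illustrative sketch: like tokens_of, but yields whole 5-tuples in
    # generate_tokens' format: (type, string, (srow, scol), (erow, ecol), line)
    it = peek_ahead(tokenize.generate_tokens(cStringIO.StringIO(x).readline))
    for tok in it:
        typ, s, start, end, line = tok
        # guard against the sentinel before subscripting the previewed tuple
        if it.preview is not it.sentinel and it.preview[1] in mergers.get(s, ()):
            # merged token: start position of the first, end position of the second
            ntyp, ns, nstart, nend, nline = it.next()
            yield typ, s + ns, start, nend, line
        elif s[:1].isdigit() and '.' in s:
            before, after = s.split('.', 1)
            if after:
                # split token: a number never spans rows, so just offset columns
                srow, scol = start
                cut = scol + len(before)
                yield tokenize.NUMBER, before, start, (srow, cut), line
                yield tokenize.OP, '.', (srow, cut), (srow, cut+1), line
                yield tokenize.NUMBER, after, (srow, cut+1), end, line
            else:
                yield tok
        else:
            yield tok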
The peek_ahead iterator wrapper class can often be useful in many
kinds of lexing and parsing tasks, exactly because
such tasks are well suited to operating on streams (which are well
represented by iterators) but often require a level of peek-ahead
and/or push-back ability. You can often get by with just one level;
if you need more than one level, consider having your wrapper hold a
container of peeked-ahead or pushed-back tokens. Python
2.4's collections.deque container
implements a double-ended queue, which is particularly well suited
for such tasks. For a more powerful look-ahead iterator wrapper, see
Recipe 19.18.
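For instance, a multilevel wrapper might hold its pending items in a deque, along these lines (an illustrative sketch only, assuming Python 2.4; the class name peek_ahead_many and the peek and push_back methods are hypothetical, and Recipe 19.18's actual wrapper differs):
import collections

class peek_ahead_many(object):
    # illustrative sketch: arbitrary levels of peek-ahead and push-back,
    # with pending items held in a double-ended queue
    def __init__(self, it):
        self._nit = iter(it).next
        self._pending = collections.deque()
    def __iter__(self):
        return self
    def next(self):
        if self._pending:
            return self._pending.popleft()
        return self._nit()        # may raise StopIteration, ending iteration
    def peek(self, n=0):
        # return the item n steps ahead, without consuming it; raises
        # StopIteration if fewer than n+1 items remain
        while len(self._pending) <= n:
            self._pending.append(self._nit())
        return self._pending[n]
    def push_back(self, item):
        # make item the very next one this iterator will yield
        self._pending.appendleft(item)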
See Also
Library Reference and Python in a
Nutshell sections on the Python Standard Library modules
tokenize and cStringIO; Recipe 19.18 for a more powerful
look-ahead iterator wrapper.