Recipe 16.6. Colorizing Python Source Using the Built-in Tokenizer
Credit: Jürgen Hermann, Mike Brown
Problem
You need to
convert Python source code into HTML markup, rendering comments,
keywords, operators, and numeric and string literals in different
colors.
Solution
tokenize.generate_tokens does most of the work. We
just need to loop over all tokens it finds, to output them with
appropriate colorization:
""" MoinMoin - Python Source Parser """
import cgi, sys, cStringIO
import keyword, token, tokenize
# Python Source Parser (does highlighting into HTML)
_KEYWORD = token.NT_OFFSET + 1
_TEXT = token.NT_OFFSET + 2
_colors = {
    token.NUMBER: '#0080C0',
    token.OP: '#0000C0',
    token.STRING: '#004080',
    tokenize.COMMENT: '#008000',
    token.NAME: '#000000',
    token.ERRORTOKEN: '#FF8080',
    _KEYWORD: '#C00000',
    _TEXT: '#000000',
}
class Parser(object):
    """ Send colorized Python source HTML to output file (normally stdout).
    """
    def __init__(self, raw, out=sys.stdout):
        """ Store the source text. """
        self.raw = raw.expandtabs().strip()
        self.out = out
    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while True:
            pos = self.raw.find('\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))
        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida, Courier New">')
        try:
            # The loop variable is named tok so that it does not shadow
            # the token module, which the loop body still needs
            for tok in tokenize.generate_tokens(text.readline):
                # unpack the components of each token
                toktype, toktext, (srow, scol), (erow, ecol), line = tok
                if False: # You may enable this for debugging purposes only
                    print "type", toktype, token.tok_name[toktype],
                    print "text", toktext,
                    print "start", srow,scol, "end", erow,ecol, "<br>"
                # Calculate new positions
                oldpos = self.pos
                newpos = self.lines[srow] + scol
                self.pos = newpos + len(toktext)
                # Handle newlines
                if toktype in (token.NEWLINE, tokenize.NL):
                    self.out.write('\n')
                    continue
                # Send the original whitespace, if needed
                if newpos > oldpos:
                    self.out.write(self.raw[oldpos:newpos])
                # Skip indenting tokens, since they're whitespace-only
                if toktype in (token.INDENT, token.DEDENT):
                    self.pos = newpos
                    continue
                # Map token type to a color group
                if token.LPAR <= toktype <= token.OP:
                    toktype = token.OP
                elif toktype == token.NAME and keyword.iskeyword(toktext):
                    toktype = _KEYWORD
                color = _colors.get(toktype, _colors[_TEXT])
                style = ''
                if toktype == token.ERRORTOKEN:
                    style = ' style="border: solid 1.5pt #FF0000;"'
                # Send text
                self.out.write('<font color="%s"%s>' % (color, style))
                self.out.write(cgi.escape(toktext))
                self.out.write('</font>')
        except tokenize.TokenError, ex:
            # Report the error, then dump the rest of the source uncolored
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')
if __name__ == "__main__":
    print "Formatting..."
    # Open own source
    source = open('python.py').read()
    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format()
    # Load HTML page into browser
    import webbrowser
    webbrowser.open("python.html")
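The listing above is written for the Python 2 of its day (cStringIO, cgi.escape, print statements). As a rough sketch only, assuming a current Python 3 interpreter, the same token loop might look like the following. The names colorize, COLORS, KEYWORD, and LAYOUT are choices made for this illustration and are not part of the MoinMoin code. The sketch also swaps the font elements for span elements, and it simplifies error handling: if the tokenizer gives up, the rest of the source is emitted as plain escaped text.

```python
import html
import io
import keyword
import token
import tokenize

KEYWORD = "keyword"  # illustrative pseudo-type for Python keywords
DEFAULT_COLOR = "#000000"
COLORS = {
    token.NUMBER: "#0080C0",
    token.OP: "#0000C0",
    token.STRING: "#004080",
    tokenize.COMMENT: "#008000",
    token.NAME: "#000000",
    token.ERRORTOKEN: "#FF8080",
    KEYWORD: "#C00000",
}
# Tokens that carry only layout; they are emitted without color markup.
LAYOUT = {token.NEWLINE, tokenize.NL, token.INDENT, token.DEDENT,
          token.ENDMARKER}


def colorize(source):
    """Return an HTML <pre> fragment for the given Python source text."""
    raw = source.expandtabs().strip()
    # line_starts[n] is the offset in raw where line n begins (rows are
    # 1-based, so index 0 is padding); the last entry is len(raw).
    line_starts = [0, 0]
    pos = 0
    while True:
        pos = raw.find("\n", pos) + 1
        if not pos:
            break
        line_starts.append(pos)
    line_starts.append(len(raw))

    parts = ["<pre>"]
    pos = 0  # offset in raw up to which text has been emitted
    try:
        for tok in tokenize.generate_tokens(io.StringIO(raw).readline):
            start = line_starts[tok.start[0]] + tok.start[1]
            # Emit any original whitespace between the previous token and this one.
            if start > pos:
                parts.append(html.escape(raw[pos:start]))
            pos = max(pos, start + len(tok.string))
            if tok.type in LAYOUT or not tok.string:
                parts.append(html.escape(tok.string))
                continue
            kind = tok.type
            # In Python 3, generate_tokens reports every operator as token.OP,
            # so only keywords need remapping to their own color group.
            if kind == token.NAME and keyword.iskeyword(tok.string):
                kind = KEYWORD
            color = COLORS.get(kind, DEFAULT_COLOR)
            parts.append('<span style="color: %s">%s</span>'
                         % (color, html.escape(tok.string)))
    except (tokenize.TokenError, SyntaxError):
        pass  # simplified: fall through and emit the remainder uncolored
    parts.append(html.escape(raw[pos:]))
    parts.append("</pre>")
    return "".join(parts)
```

Calling colorize on the text of a source file returns a string that can be written into any HTML page. The gap handling (raw[pos:start]) is what preserves the original spacing between tokens.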
Discussion
This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how
to use the built-in keyword,
token, and tokenize modules to
scan Python source code and re-emit it with appropriate color markup
but no changes to its original formatting ("no
changes" is the hard part!).

The Parser class' constructor saves
the multiline string that is the Python source to colorize, and the
file object, which is open for writing, where you want to output the
colorized results. Then, the format method prepares
a self.lines list that holds the offset (i.e., the
index into the source string, self.raw) of each
line's start.

The format method then loops over the tokens produced by the generator
tokenize.generate_tokens, unpacking each token tuple into
items specifying the token type and starting and ending positions in
the source (each expressed as line number and offset within the
line). The body of the loop reconstructs the exact position within
the original source code string self.raw, so it
can emit exactly the same whitespace that was present in the original
source. It then picks a color code from the _colors
dictionary (which uses HTML color coding), with help from the
keyword standard module to determine whether a
NAME token is actually a Python keyword (to be
output in a different color than that used for ordinary identifiers).

The test code at the bottom of the module formats the module itself
and launches a browser with the result, using the standard Python
library module webbrowser to enable you to see and
enjoy the result in your favorite browser.

If you put this recipe's code into a module, you can
then import the module and reuse its functionality in CGI scripts
(using the PATH_TRANSLATED CGI environment
variable to know what file to colorize), command-line tools (taking
filenames as arguments), filters that colorize anything they get from
standard input, and so on. See http://skew.org/~mike/colorize.py for
versions that support several of these possibilities.

With small changes, it's also easy to turn this
recipe into an Apache handler, so your Apache web site can serve
colorized .py files. Specifically, if you set up
this script as a handler in Apache, then the file is served up as
colorized HTML whenever a visitor to the site requests a
.py file.

To use this recipe as an Apache handler, you need
to save the script as colorize.cgi (not
.py, lest it confuse Apache), and add, to your
.htaccess or httpd.conf
Apache configuration files, the following lines:
AddHandler application/x-python .py
Action application/x-python /full/virtual/path/to/colorize.cgi

Also, make sure you have the Action module enabled
in your httpd.conf Apache configuration file.
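The CGI and Apache-handler uses described above both come down to reading the file named by the PATH_TRANSLATED environment variable and writing an HTML response. Here is a minimal sketch of such an entry point, assuming Python 3. The names handle_request and render_plain are hypothetical, and render_plain is only a stand-in that escapes the text; a real colorize.cgi would call the colorizer there instead.

```python
import html
import os
import sys


def render_plain(source):
    """Hypothetical stand-in for the colorizer: escape and wrap in <pre>."""
    return "<pre>%s</pre>" % html.escape(source)


def handle_request(environ, out, render=render_plain):
    """Write a CGI response for the file named by PATH_TRANSLATED.

    Returns True if a page was produced, False if no file was found.
    """
    path = environ.get("PATH_TRANSLATED")
    if not path or not os.path.isfile(path):
        out.write("Status: 404 Not Found\r\n")
        out.write("Content-Type: text/plain\r\n\r\n")
        out.write("No source file to colorize.\n")
        return False
    with open(path, encoding="utf-8", errors="replace") as f:
        source = f.read()
    # CGI responses start with headers, then a blank line, then the body.
    out.write("Content-Type: text/html; charset=utf-8\r\n\r\n")
    out.write(render(source))
    return True


if __name__ == "__main__":
    handle_request(os.environ, sys.stdout)
```

Passing the environment and the output stream as arguments keeps the function easy to exercise outside a web server, for instance with a plain dictionary and an in-memory stream.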
See Also
Documentation for the webbrowser,
token, tokenize, and
keyword modules in the Library
Reference and Python in a
Nutshell; the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer,
as part of MoinMoin (http://moin.sourceforge.net), and, in a
somewhat different variant, also at http://skew.org/~mike/colorize.py; the Apache
web server is available and documented at http://httpd.apache.org.