Recipe 1.25. Converting HTML Documents to Text on a Unix Terminal
Credit: Brent Burley, Mark Moraes
Problem
You need to visualize
HTML documents as text, with support for bold and underlined display
on your Unix terminal.
Solution
The simplest approach is to code a filter
script, taking HTML on standard input and emitting text and terminal
control sequences on standard output. Since this recipe only targets
Unix, we can get the needed terminal control sequences from the
"Unix" command tput, via the function popen of the Python Standard
Library module os:
#!/usr/bin/env python
import sys, os, htmllib, formatter
# use Unix tput to get the escape sequences for bold, underline, reset
set_bold = os.popen('tput bold').read()
set_underline = os.popen('tput smul').read()
perform_reset = os.popen('tput sgr0').read()
class TtyFormatter(formatter.AbstractFormatter):
    ''' a formatter that keeps track of bold and italic font states, and
        emits terminal control sequences accordingly.
    '''
    def __init__(self, writer):
        # first, as usual, initialize the superclass
        formatter.AbstractFormatter.__init__(self, writer)
        # start with neither bold nor italic, and no saved font state
        self.fontState = False, False
        self.fontStack = []
    def push_font(self, font):
        # the `font' tuple has four items, we only track the two flags
        # about whether italic and bold are active or not
        size, is_italic, is_bold, is_tt = font
        self.fontStack.append((is_italic, is_bold))
        self._updateFontState()
    def pop_font(self, *args):
        # go back to previous font state
        try:
            self.fontStack.pop()
        except IndexError:
            pass
        self._updateFontState()
    def _updateFontState(self):
        # emit appropriate terminal control sequences if the state of
        # bold and/or italic(==underline) has just changed
        try:
            newState = self.fontStack[-1]
        except IndexError:
            newState = False, False
        if self.fontState != newState:
            # relevant state change: reset terminal
            print perform_reset,
            # set underline and/or bold if needed
            if newState[0]:
                print set_underline,
            if newState[1]:
                print set_bold,
            # remember the two flags as our current font-state
            self.fontState = newState
# make writer, formatter and parser objects, connecting them as needed
myWriter = formatter.DumbWriter()
if sys.stdout.isatty():
    myFormatter = TtyFormatter(myWriter)
else:
    myFormatter = formatter.AbstractFormatter(myWriter)
myParser = htmllib.HTMLParser(myFormatter)
# feed all of standard input to the parser, then terminate operations
myParser.feed(sys.stdin.read())
myParser.close()
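The script above is written for Python 2 and reads the tput output through os.popen. As a minimal sketch only, the same lookup could be done in Python 3 with the subprocess module, falling back to an empty string when tput is missing or rejects the request. The helper name tput_sequence is hypothetical and is not part of the recipe.

```python
import subprocess

def tput_sequence(capability):
    """Return the escape sequence that tput reports for `capability`,
    or an empty string if tput is unavailable or reports an error.
    (Hypothetical helper, not part of the original recipe.)"""
    try:
        result = subprocess.run(
            ["tput", capability],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        # tput is not installed, or cannot be executed, on this system
        return ""
    if result.returncode != 0:
        # tput ran but rejected the capability or the terminal type
        return ""
    return result.stdout

# the same three capabilities the recipe asks for
set_bold = tput_sequence("bold")
set_underline = tput_sequence("smul")
perform_reset = tput_sequence("sgr0")
```

With empty-string fallbacks, code that writes these sequences still runs on a system without tput; it simply produces unstyled text.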
Discussion
The basic
formatter.AbstractFormatter class, offered by the
Python Standard Library, should work just about anywhere. On the
other hand, the refinements in the TtyFormatter
subclass that's the focus of this recipe depend on
using a Unix-like terminal, and more specifically on the availability
of the tput Unix command to obtain information on
the escape sequences used to get bold or underlined output and to
reset the terminal to its base state.

Many systems that do not have Unix certification, such as Linux and
Mac OS X, do have a perfectly workable tput
command and therefore can use this recipe's
TtyFormatter subclass just fine. In other words, you
can take the use of the word "Unix"
in this recipe just as loosely as you can take it in just about every
normal discussion: take it as meaning
"*ix," if you will.

If your "terminal" emulator
supports other escape sequences for controlling output appearance,
you should be able to adapt this TtyFormatter class
accordingly. For example, on Windows, a cmd.exe
command window should, I'm told, support standard
ANSI escape sequences, so you could choose to hard-code those
sequences if Windows is the platform on which you want to run your
version of this script.

In many cases, you may prefer to use other existing Unix commands,
such as lynx -dump -, to get richer formatting
than this recipe provides. However, this recipe comes in quite handy
when you find yourself on a system that has a Python installation but
lacks such other helpful commands as lynx.
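To illustrate the earlier suggestion of hard-coding escape sequences, here is a rough Python 3 sketch that keeps the recipe's font-stack idea but uses the standard html.parser module and the common ANSI sequences for bold, underline, and reset. It assumes the target terminal honors those sequences. The names AnsiTextParser and html_to_ansi are hypothetical, and the sketch does none of the line wrapping or block layout that DumbWriter performs. Unlike the recipe's push_font, it also lets nested tags combine their flags.

```python
from html.parser import HTMLParser

# Common ANSI sequences, hard-coded rather than queried from tput
# (assumption: the terminal in use honors them).
SET_BOLD = "\033[1m"
SET_UNDERLINE = "\033[4m"
RESET = "\033[0m"

class AnsiTextParser(HTMLParser):
    """Hypothetical parser that tracks (underline, bold) states on a
    stack and emits ANSI sequences when the active state changes."""

    # flags switched on by each tag, as (underline, bold);
    # italic is shown as underline, as in the recipe
    FONT_TAGS = {
        "b": (False, True), "strong": (False, True),
        "i": (True, False), "em": (True, False), "u": (True, False),
    }

    def __init__(self):
        super().__init__()
        self.font_state = (False, False)
        self.font_stack = []
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in self.FONT_TAGS:
            underline, bold = self.FONT_TAGS[tag]
            # nested tags add to whatever is already active
            prev = self.font_stack[-1] if self.font_stack else (False, False)
            self.font_stack.append((prev[0] or underline, prev[1] or bold))
            self._update_font_state()

    def handle_endtag(self, tag):
        if tag in self.FONT_TAGS:
            if self.font_stack:
                self.font_stack.pop()
            self._update_font_state()

    def handle_data(self, data):
        self.pieces.append(data)

    def _update_font_state(self):
        new_state = self.font_stack[-1] if self.font_stack else (False, False)
        if new_state != self.font_state:
            # reset first, then switch on only what the new state needs
            self.pieces.append(RESET)
            if new_state[0]:
                self.pieces.append(SET_UNDERLINE)
            if new_state[1]:
                self.pieces.append(SET_BOLD)
            self.font_state = new_state

def html_to_ansi(html):
    """Return the text of `html` with ANSI styling sequences inserted."""
    parser = AnsiTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)
```

Calling html_to_ansi on a string such as "a <b>bold</b> z" returns the text with the reset and bold sequences placed around the word "bold"; writing that result to a terminal that understands ANSI sequences should display the word in bold.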
See Also
Library Reference and Python in a
Nutshell docs on the formatter and
htmllib modules; man tput on a
Unix or Unix-like system for more information about the
tput command.