Recipe 13.9. Fixing Messages Parsed by Python 2.4 email.FeedParser
Credit: Matthew Cowles
Problem
You're using Python
2.4's new email.FeedParser
module, but sometimes, when dealing with badly malformed incoming
messages, that module produces message objects that are internally
inconsistent (e.g., a message has a content-type header that says the
message is multipart, but the body isn't), and you
need to fix those inconsistencies.
Solution
Python 2.4's new standard library module
email.FeedParser is very useful, but a little
post-processing on the messages it returns can heuristically fix some
inconsistencies and make it even better. Here's a
module containing a class and a few functions to help with this task:
import email, email.FeedParser
import re, sys, sgmllib

# what chars are non-text (control characters or non-ASCII), and
# what max fraction of them can be in a text part
kGuessBinaryThreshold = 0.2
kGuessBinaryRE = re.compile("[\\0000-\\0025\\0200-\\0377]")
# what max fraction of HTML tags can be in a text (non-HTML) part
kGuessHTMLThreshold = 0.05

class Cleaner(sgmllib.SGMLParser):
    entitydefs = {"nbsp": " "}    # I'll break if I want to
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []
    def do_p(self, *junk):
        self.result.append('\n')
    def do_br(self, *junk):
        self.result.append('\n')
    def handle_data(self, data):
        self.result.append(data)
    def cleaned_text(self):
        return ''.join(self.result)

def stripHTML(text):
    ''' return text, with HTML tags stripped '''
    c = Cleaner()
    try:
        c.feed(text)
    except sgmllib.SGMLParseError:
        return text
    else:
        return c.cleaned_text()

def guessIsBinary(text):
    ''' return whether we can heuristically guess 'text' is binary '''
    if not text: return False
    nMatches = float(len(kGuessBinaryRE.findall(text)))
    return nMatches/len(text) >= kGuessBinaryThreshold

def guessIsHTML(text):
    ''' return whether we can heuristically guess 'text' is HTML '''
    if not text: return False
    lt = len(text)
    textWithoutTags = stripHTML(text)
    tagsChars = float(lt-len(textWithoutTags))
    return tagsChars/lt >= kGuessHTMLThreshold

def getMungedMessage(openFile):
    ''' parse the message in openFile, fixing a bogus multipart type '''
    openFile.seek(0)
    p = email.FeedParser.FeedParser()
    p.feed(openFile.read())
    m = p.close()
    # Fix up multipart content-type when message isn't multi-part
    if m.get_main_type()=="multipart" and not m.is_multipart():
        t = m.get_payload(decode=True)
        if guessIsBinary(t):
            # Use generic "opaque" type
            m.set_type("application/octet-stream")
        elif guessIsHTML(t):
            m.set_type("text/html")
        else:
            m.set_type("text/plain")
    return m
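To see the kind of inconsistency that getMungedMessage targets, here is a minimal sketch. It assumes a current Python 3 interpreter, where the module is spelled email.feedparser rather than email.FeedParser and sgmllib is no longer available, so it does not reuse the recipe's heuristics. The sample message is invented for illustration, and falling back to text/plain unconditionally is a simplification of the recipe's three-way choice.

```python
from email.feedparser import FeedParser

# Invented sample: the header claims multipart, but the body is one flat
# block of text and no boundary parameter is given.
RAW = (
    "From: sender@example.com\n"
    "Subject: broken\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "Just one flat body, with no MIME boundaries at all.\n"
)

parser = FeedParser()
parser.feed(RAW)
msg = parser.close()          # no exception, despite the malformed input

# The header and the parsed structure disagree with each other.
declared_multipart = msg.get_content_maintype() == "multipart"
really_multipart = msg.is_multipart()

# The parser records what it noticed instead of raising.
defect_names = [type(d).__name__ for d in msg.defects]

# Simplified repair: the recipe would pick among octet-stream, HTML and
# plain text; this sketch always falls back to text/plain.
if declared_multipart and not really_multipart:
    msg.set_type("text/plain")

print(declared_multipart, really_multipart, msg.get_content_type())
print(defect_names)
```

The defects list gives a cheap first check for malformed input before applying any content-based guessing of the kind the recipe uses.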
Discussion
FeedParser is a
new module in the Python 2.4 Standard Library's
email package. The module's name
comes from the fact that it maintains a buffer, so that you
don't have to give it all the text at once. Possibly
more interesting is that the module doesn't raise an
error when called on malformed messages; instead, it tries to make
some sense of them and return a useful
email.Message object. That's
useful because so much mail is spam and so much spam is malformed.

The other side of the coin, given that the heroic feed parser works
on incorrect messages, is that you can get back from it an
email.Message object that's
internally inconsistent. This recipe tries to make sense of one kind
of inconsistency: a message with a content-type header that says that
the message is multipart, but the body isn't.

The heuristics that the recipe uses to guess at the correct
content-type are inevitably messy. Still, better to have such messy
heuristics in recipes, rather than embedded forever in the Python
Standard Library.
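The buffering mentioned at the start of this discussion can be tried out directly. The following is a minimal sketch, again assuming a current Python 3 interpreter (module email.feedparser). The helper name parse_in_chunks, the sample message, and the 7-character chunk size are all made up for illustration; the chunks deliberately split lines in the middle.

```python
from email.feedparser import FeedParser

def parse_in_chunks(chunks):
    """Feed text to a FeedParser piece by piece and return the message."""
    parser = FeedParser()
    for chunk in chunks:
        parser.feed(chunk)    # partial lines are buffered until complete
    return parser.close()     # flush the buffer and build the message

raw = "From: a@example.com\nSubject: hi\n\nbody text\n"

# Slice the text into 7-character pieces, ignoring line boundaries.
pieces = [raw[i:i + 7] for i in range(0, len(raw), 7)]
msg = parse_in_chunks(pieces)

print(msg["Subject"])
print(repr(msg.get_payload()))
```

The same pattern works when the pieces come from a socket or a pipe, since the parser never needs the whole text in a single call.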
See Also
Documentation for the standard library package
email in the Python 2.4 Library
Reference.