Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)

Recipe 13.9. Fixing Messages Parsed by Python 2.4 email.FeedParser

Credit: Matthew Cowles

Problem

You're using Python 2.4's new email.FeedParser module, but sometimes,
when dealing with badly malformed incoming messages, that module
produces message objects that are internally inconsistent (e.g., a
message has a Content-Type header that says the message is multipart,
but the body isn't), and you need to fix those inconsistencies.

Solution

Python 2.4's new standard library module
email.FeedParser is very useful, but a little
post-processing on the messages it returns can heuristically fix some
inconsistencies and make it even better. Here's a
module containing a class and a few functions to help with this task:

import email, email.FeedParser
import re, sys, sgmllib
# what chars are non-Ascii, what max fraction of them can be in a text part
kGuessBinaryThreshold = 0.2
kGuessBinaryRE = re.compile("[\\0000-\\0025\\0200-\\0377]")
# what max fraction of HTML tags can be in a text (non-HTML) part
kGuessHTMLThreshold = 0.05
class Cleaner(sgmllib.SGMLParser):
    entitydefs = {"nbsp": " "}  # I'll break if I want to
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []
    def do_p(self, *junk):
        self.result.append('\n')
    def do_br(self, *junk):
        self.result.append('\n')
    def handle_data(self, data):
        self.result.append(data)
    def cleaned_text(self):
        return ''.join(self.result)
def stripHTML(text):
    ''' return text, with HTML tags stripped '''
    c = Cleaner()
    try:
        c.feed(text)
    except sgmllib.SGMLParseError:
        return text
    else:
        return c.cleaned_text()
def guessIsBinary(text):
    ''' return whether we can heuristically guess 'text' is binary '''
    if not text: return False
    nMatches = float(len(kGuessBinaryRE.findall(text)))
    return nMatches/len(text) >= kGuessBinaryThreshold
def guessIsHTML(text):
    ''' return whether we can heuristically guess 'text' is HTML '''
    if not text: return False
    lt = len(text)
    textWithoutTags = stripHTML(text)
    tagsChars = float(lt-len(textWithoutTags))
    return tagsChars/lt >= kGuessHTMLThreshold
def getMungedMessage(openFile):
    openFile.seek(0)
    p = email.FeedParser.FeedParser()
    p.feed(openFile.read())
    m = p.close()
    # Fix up multipart content-type when message isn't multi-part
    if m.get_main_type()=="multipart" and not m.is_multipart():
        t = m.get_payload(decode=1)
        if guessIsBinary(t):
            # Use generic "opaque" type
            m.set_type("application/octet-stream")
        elif guessIsHTML(t):
            m.set_type("text/html")
        else:
            m.set_type("text/plain")
    return m

Discussion

FeedParser is a new module in the Python 2.4 Standard Library's email
package. The module's name comes from the fact that it maintains a
buffer, so that you don't have to give it all the text at once.
Possibly more interesting is that the module doesn't raise an error
when called on malformed messages; instead, it tries to make some
sense of them and return a useful email.Message object. That's useful
because so much mail is spam and so much spam is malformed.
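
The buffering behavior can be illustrated with a small sketch. It assumes a current Python 3 interpreter, where the module is spelled email.feedparser rather than email.FeedParser; the sample message text and the 7-character chunk size are made up for the illustration.

```python
from email.feedparser import FeedParser

# Hypothetical sample message, used only for this illustration.
raw = (
    "From: sender@example.com\n"
    "Subject: chunked delivery\n"
    "\n"
    "Body line one.\n"
    "Body line two.\n"
)

parser = FeedParser()
# Feed the text in arbitrary 7-character pieces; the parser buffers
# partial lines internally, so chunks need not end on line boundaries.
for start in range(0, len(raw), 7):
    parser.feed(raw[start:start + 7])

# close() signals end of input and returns the root message object.
msg = parser.close()

print(msg["Subject"])
print(msg.get_payload())
```

The chunks could just as well come from a socket or a file read in blocks; the parser only needs close() to be called once no more data will arrive.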

The other side of the coin, given that the heroic feed parser works
on incorrect messages, is that you can get back from it an
email.Message object that's
internally inconsistent. This recipe tries to make sense of one kind
of inconsistency: a message with a content-type header that says that
the message is multipart, but the body isn't.
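
A minimal sketch of how such an inconsistent message can arise, again assuming Python 3's email.feedparser spelling of the module. The message text is invented for the example: its header claims multipart/mixed, but it names no boundary and the body contains no MIME parts.

```python
from email.feedparser import FeedParser

# Hypothetical malformed message: multipart header, plain body.
raw = (
    "Subject: broken\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "There are no MIME parts here, just text.\n"
)

parser = FeedParser()
parser.feed(raw)
msg = parser.close()  # no exception is raised for the malformed input

# The header and the parsed structure disagree with each other.
claims_multipart = msg.get_content_maintype() == "multipart"
really_multipart = msg.is_multipart()

# Current versions of the email package also record what went wrong.
defect_names = [type(d).__name__ for d in msg.defects]

print(claims_multipart, really_multipart)
print(defect_names)
```

The test on the two flags is the same one the recipe's getMungedMessage uses to decide whether a message needs fixing.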

The heuristics that the recipe uses to guess at the correct
content-type are inevitably messy. Still, better to have such messy
heuristics in recipes, rather than embedded forever in the Python
Standard Library.
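
The recipe as written targets Python 2.4. As a rough sketch only, the same heuristics might be carried over to Python 3, where sgmllib no longer exists and the parser lives in email.feedparser. Everything named here (TextExtractor, strip_html, guess_is_binary, guess_is_html, munge_message) is illustrative rather than part of the recipe; html.parser.HTMLParser is swapped in for the sgmllib-based Cleaner, and the byte ranges in BINARY_RE are an assumed reading of the recipe's octal escapes.

```python
import re
from email.feedparser import BytesFeedParser
from html.parser import HTMLParser

BINARY_THRESHOLD = 0.2   # same thresholds as the recipe
HTML_THRESHOLD = 0.05
# Assumption: bytes 0-0o25 and 0o200-0o377 count as "binary".
BINARY_RE = re.compile(rb"[\x00-\x15\x80-\xff]")


class TextExtractor(HTMLParser):
    """Collect character data and drop tags (stand-in for Cleaner)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Like the recipe, turn paragraph and line breaks into newlines.
        if tag in ("p", "br"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Return text with HTML tags stripped."""
    extractor = TextExtractor()
    extractor.feed(text)
    extractor.close()
    return "".join(extractor.chunks)


def guess_is_binary(data):
    """Guess whether a bytes payload is binary."""
    if not data:
        return False
    return len(BINARY_RE.findall(data)) / len(data) >= BINARY_THRESHOLD


def guess_is_html(text):
    """Guess whether a str payload is HTML, by the share of tag characters."""
    if not text:
        return False
    tag_chars = len(text) - len(strip_html(text))
    return tag_chars / len(text) >= HTML_THRESHOLD


def munge_message(raw_bytes):
    """Parse raw message bytes and patch a bogus multipart Content-Type."""
    parser = BytesFeedParser()
    parser.feed(raw_bytes)
    msg = parser.close()
    if msg.get_content_maintype() == "multipart" and not msg.is_multipart():
        payload = msg.get_payload(decode=True) or b""
        if guess_is_binary(payload):
            msg.set_type("application/octet-stream")  # generic opaque type
        elif guess_is_html(payload.decode("latin-1")):
            msg.set_type("text/html")
        else:
            msg.set_type("text/plain")
    return msg


# Hypothetical input: claims to be multipart, but the body is bare HTML.
sample = (b"Content-Type: multipart/mixed\n\n"
          b"<html><body><p>Hello <b>world</b></p></body></html>\n")
print(munge_message(sample).get_content_type())
```

Decoding the payload as latin-1 before the HTML check is a design shortcut: it never fails on arbitrary bytes, and the heuristic only compares lengths, so the exact characters do not matter.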
