Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)

David Ascher, Alex Martelli, Anna Ravenscroft

Recipe 10.10. Blocking Duplicate Mails


Credit: Marina Pianu, Peter Cogolo


Problem



Many of the mails you receive are
duplicates. You need to block the duplicates with a fast, simple
filter before they reach a more time-consuming step, such as an
anti-spam filter, in your email pipeline.


Solution


Many mail systems, such as the popular procmail and KDE's KMail,
enable you to control your mail-reception pipeline. Specifically, you
can insert your own filter programs into the pipeline: each one gets
a message on standard input, may modify it, and emits it again on
standard output. Here is one such filter, which performs the task
described in the Problem: blocking messages that are duplicates
of other messages that you have received recently:

#!/usr/bin/python
import time, sys, os, email
now = time.time()
# get archive of previously-seen message-ids and times
kde_dir = os.path.expanduser('~/.kde')
if not os.path.isdir(kde_dir):
    os.mkdir(kde_dir)
arfile = os.path.join(kde_dir, 'duplicate_mails')
duplicates = {}
try:
    archive = open(arfile)
except IOError:
    # no archive yet: start with an empty memory
    pass
else:
    # each line holds "<time received> <message-id>"
    for line in archive:
        when, msgid = line[:-1].split(' ', 1)
        duplicates[msgid] = float(when)
    archive.close()
redo_archive = False
# suck message in from stdin and study it
msg = email.message_from_file(sys.stdin)
msgid = msg['Message-ID']
if msgid:
    if msgid in duplicates:
        # duplicate message: alter its subject
        subject = msg['Subject']
        if subject is None:
            msg['Subject'] = '**** DUP **** ' + msgid
        else:
            del msg['Subject']
            msg['Subject'] = '**** DUP **** ' + subject
    else:
        # non-duplicate message: redo the archive file
        redo_archive = True
        duplicates[msgid] = now
else:
    # invalid (missing message-id) message: alter its subject
    subject = msg['Subject']
    if subject is None:
        msg['Subject'] = '**** NID **** '
    else:
        del msg['Subject']
        msg['Subject'] = '**** NID **** ' + subject
# emit message back to stdout
print(msg)
if redo_archive:
    # redo archive file, keep only msgs from the last two hours
    keep_last = now - 2*60*60.0
    archive = open(arfile, 'w')
    for msgid, when in duplicates.items():
        if when > keep_last:
            archive.write('%9.2f %s\n' % (when, msgid))
    archive.close()
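
One detail of the script that is easy to miss is the del msg['Subject'] before each reassignment. With the email package's default behavior, assigning to a header appends a new header line instead of replacing the existing one. The following is a minimal sketch, not part of the original recipe, that shows the difference. The helper name retag and the sample message are made up for illustration, and unlike the recipe, the helper uses an empty string when the subject is missing.

```python
import email

# Sample message, invented for this illustration.
RAW = "Message-ID: <abc@example.com>\nSubject: hello\n\nbody\n"

def retag(msg, prefix):
    """Replace the Subject header with a prefixed copy (hypothetical helper)."""
    subject = msg['Subject']
    if subject is not None:
        del msg['Subject']          # removes every existing Subject header
    msg['Subject'] = prefix + (subject or '')
    return msg

# Without the del, the message ends up with two Subject headers.
naive = email.message_from_string(RAW)
naive['Subject'] = '**** DUP **** hello'
naive_subjects = naive.get_all('Subject')

# With the del, only the tagged Subject remains.
fixed = retag(email.message_from_string(RAW), '**** DUP **** ')
fixed_subjects = fixed.get_all('Subject')
```

Since a later pipeline stage will most likely look only at the first Subject header, leaving the original one in place would hide the tag.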


Discussion


Whether it is because of spammers' malice or incompetence, or
because of hiccups at my ISP (Internet service provider), at times
I get huge numbers of duplicate messages that can overload my
mail-reception pipeline, particularly the antispam filters.
Fortunately, like many other mail systems, KDE's KMail, the one I
use, lets me insert my own filters in the mail-reception pipeline.
In particular, I can diagnose duplicate messages, alter their
headers (I use "Subject" for clarity), and tell later stages in the
filters' pipeline to throw away messages with such subjects or to
shunt them aside into a dedicated mailbox for later perusal, without
passing them on to the antispam and other filters.
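
In KMail the later stages are configured in the mail client itself, but the same decision can be sketched in Python to show what those stages have to test for. This is only an illustration under stated assumptions: the function name route and the folder names are hypothetical, and only the two subject markers come from the recipe.

```python
import email

DUP_MARK = '**** DUP ****'   # marker the recipe puts on duplicates
NID_MARK = '**** NID ****'   # marker the recipe puts on messages with no Message-ID

def route(msg):
    """Return a hypothetical folder name for an already-tagged message."""
    subject = msg['Subject'] or ''
    if subject.startswith(DUP_MARK):
        return 'duplicates'
    if subject.startswith(NID_MARK):
        return 'no-message-id'
    return 'inbox'

tagged = email.message_from_string("Subject: **** DUP **** hello\n\nbody\n")
plain = email.message_from_string("Subject: hello\n\nbody\n")
```

Here route(tagged) gives 'duplicates' and route(plain) gives 'inbox', so only untagged mail would go on to the slower filters.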

The email module from the Python Standard Library performs all the
required parsing of the message and lets me access headers with
dictionary-like indexing syntax. I need some "memory" of recently
seen messages. Fortunately, I have noticed that all duplicates
happen within a few minutes of each other, so I don't have to keep
that memory for long: two hours are plenty. Therefore, I keep that
memory in a simple text file, which records the time when a message
was received and the message ID. I thought I might have to find a
more advanced way to keep this kind of FIFO (first-in, first-out)
archive, but I tried a simple approach first: a simple text file
that is entirely rewritten whenever a new nonduplicate message
arrives. This approach appears to perform quite adequately for my
needs (at most a couple hundred messages an hour), even on my
somewhat dated PC. "Do the simplest thing that could
possibly work" strikes again!
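
The archive format and the two-hour pruning rule can be tried out without touching any file. The sketch below works on lists of lines instead; the function names parse_archive and format_archive, the sample message IDs, and the sample timestamps are assumptions made for this illustration, while the line format and the two-hour window follow the recipe.

```python
def parse_archive(lines):
    """Build {message_id: timestamp} from 'timestamp message-id' lines."""
    seen = {}
    for line in lines:
        when, msgid = line.rstrip('\n').split(' ', 1)
        seen[msgid] = float(when)
    return seen

def format_archive(seen, now, max_age=2 * 60 * 60.0):
    """Return archive lines for the entries newer than max_age seconds."""
    keep_after = now - max_age
    return ['%9.2f %s\n' % (when, msgid)
            for msgid, when in sorted(seen.items())
            if when > keep_after]

# Invented timestamps: one entry three hours old, one an hour old.
now = 1100000000.0
seen = parse_archive(['1099989200.00 <old@example.com>\n',
                      '1099996400.00 <recent@example.com>\n'])
seen['<new@example.com>'] = now       # a new nonduplicate message arrives
lines = format_archive(seen, now)     # the three-hour-old entry is dropped
```

Because the whole file is rewritten on every new message, the cost grows with the number of entries kept, which the two-hour window keeps small at the message rates mentioned above.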


See Also


Documentation about the email package and the time, sys, and os
modules in the Library Reference and Python in a Nutshell.

