Recipe 13.8. Removing Attachments from an Email Message
Credit: Anthony Baxter
Problem
You're handling email
in Python and need to remove from email messages any attachments that
might be dangerous.
Solution
Regular expressions can help us identify dangerous content types and
file extensions, and thus code a function to remove any potentially
dangerous attachments:
ReplFormat = """
This message contained an attachment that was stripped out.
The filename was: %(filename)s,
The original type was: %(content_type)s
(and it had additional parameters of:
%(params)s)
"""
import re
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')
def sanitise(msg):
    ''' Strip out all potentially dangerous payloads from a message '''
    ct = msg.get_content_type( )
    fn = msg.get_filename( )
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # bad message-part, pull out info for reporting then destroy it
        # present the parameters to the content-type, list of key, value
        # pairs, as key=value forms joined by comma-space
        params = msg.get_params( )[1:]
        params = ', '.join([ '='.join(p) for p in params ])
        # put informative message text as new payload
        replace = ReplFormat % dict(content_type=ct, filename=fn, params=params)
        msg.set_payload(replace)
        # now remove parameters and set contents in content-type header
        for k, v in msg.get_params( )[1:]:
            msg.del_param(k)
        msg.set_type('text/plain')
        # Also remove headers that make no sense without content-type
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    else:
        # Now we check for any sub-parts to the message
        if msg.is_multipart( ):
            # Call sanitise recursively on any subparts
            payload = [ sanitise(x) for x in msg.get_payload( ) ]
            # Replace the payload with our list of sanitised parts
            msg.set_payload(payload)
    # Return the sanitised message
    return msg
# Add a simple driver/example to show how to use this function
if __name__ == '__main__':
    import email, sys
    m = email.message_from_file(open(sys.argv[1]))
    # print(...) with a single argument works in both Python 2 and Python 3
    print(sanitise(m))
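The driver above needs a message file on disk. As a quick check that needs no input file, the following sketch builds a small multipart message in memory and passes it through sanitise. It is written for Python 3, so the recipe's function is restated in condensed form to keep the snippet runnable on its own. The helper build_sample and the attachment name setup.exe are made up for illustration and are not part of the recipe.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

REPL_FORMAT = """
This message contained an attachment that was stripped out.
The filename was: %(filename)s,
The original type was: %(content_type)s
(and it had additional parameters of:
%(params)s)
"""
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')

def sanitise(msg):
    """Condensed Python 3 restatement of the recipe's function."""
    ct = msg.get_content_type()
    fn = msg.get_filename()
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # Skip the first entry: it is the content type itself, not a parameter
        params = msg.get_params()[1:]
        shown = ', '.join('='.join(p) for p in params)
        msg.set_payload(REPL_FORMAT % dict(content_type=ct, filename=fn,
                                           params=shown))
        for key, _value in params:
            msg.del_param(key)
        msg.set_type('text/plain')
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    elif msg.is_multipart():
        msg.set_payload([sanitise(part) for part in msg.get_payload()])
    return msg

def build_sample():
    """Hypothetical helper: one harmless text part, one suspect attachment."""
    outer = MIMEMultipart()
    outer.attach(MIMEText('Hello'))
    bad = MIMEApplication(b'not really a program', name='setup.exe')
    bad.add_header('Content-Disposition', 'attachment', filename='setup.exe')
    outer.attach(bad)
    return outer

if __name__ == '__main__':
    text_part, stripped_part = sanitise(build_sample()).get_payload()
    print(stripped_part.get_content_type())
    print(stripped_part.get_payload())
```

After the call, the first part is untouched, while the second part reports text/plain as its content type and carries the replacement notice instead of the attachment.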
Discussion
This issue has come up a few times on the newsgroup.
The email parser in
Python 2.4 has been completely rewritten to be robust first, correct
second. Prior to that version, the parser was written for correctness
first. But focusing on correctness was a problem because many
viruses, worms, and other malware routinely send email messages
that are broken and nonconformant, malformed to the point that
the old email parser chokes and dies. The new parser is designed to
never actually break when reading a message. Instead, it tries its
best to fix whatever it can fix in the message. (If you have a
message that causes the parser to crash, please let us, the core
Python developers, know. It's a bug, and
we'll fix it. Please include a copy of the message
that makes the parser crash, or else it's very
unlikely that we can reproduce your problem!)

The recipe's code itself is fairly well commented
and should be easy enough to follow. A mail message consists of one
or more parts; each of these parts can contain nested parts. We call
the sanitise function on the top-level
Message object, and it calls itself recursively on
the subobjects if and as needed.

The sanitise function first checks the
Content-Type of the part, and if
there's a filename, it also checks that
filename's extension against a known-to-be-bad list.
If the message part is bad, we replace its payload with a
short text description of the now-removed part and clean out
the relevant headers: we set the part's Content-Type to
'text/plain' and remove the other headers related to
the now-removed content.

Otherwise, we check whether the message is a multipart message. If so,
it means the message has subparts, so we recursively call the
sanitise function on each of them. We then replace
the payload with our list of sanitized subparts.

If you're interested in working further on this
recipe, the most important extra functionality, which is easy to add
with a small amount of work, might be to store the attached file in
some directory (instead of destroying all suspect attachments), and
give the user a link to that file. Also consider extending the check
in sanitise that filters dangerous attachments to
have it verify more than just the content type and file extension;
other headers may be able to carry known signs of worm or virus
messages.
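As one possible starting point for the first suggestion, the sketch below saves a suspect part's decoded bytes into a quarantine directory and returns the path, which could then be mentioned in the replacement text. It is written for Python 3. The function name quarantine, the quarantined- prefix, and the sample filename are assumptions made for illustration, not part of the recipe.

```python
import os
import tempfile
from email.mime.application import MIMEApplication

def quarantine(part, directory):
    """Save a suspect part's decoded bytes under directory; return the path.

    Hypothetical helper, not part of the original recipe.
    """
    # Never trust the sender's filename as a path: keep only its last
    # component, treating backslashes as separators too.
    raw_name = part.get_filename() or 'unnamed'
    name = os.path.basename(raw_name.replace('\\', '/'))
    # decode=True undoes base64 or quoted-printable; it yields None for
    # multipart containers, hence the fallback to empty bytes.
    data = part.get_payload(decode=True) or b''
    # mkstemp picks a unique name, so two attachments that share a
    # filename cannot overwrite each other.
    fd, path = tempfile.mkstemp(prefix='quarantined-', suffix='-' + name,
                                dir=directory)
    with os.fdopen(fd, 'wb') as out:
        out.write(data)
    return path

if __name__ == '__main__':
    part = MIMEApplication(b'not really a program')
    part.add_header('Content-Disposition', 'attachment',
                    filename='../../evil.exe')
    with tempfile.TemporaryDirectory() as directory:
        print(quarantine(part, directory))
```

One way to wire this in would be to call quarantine(msg, directory) in the bad-part branch of sanitise, before msg.set_payload replaces the payload, and to add the returned path to the replacement text so the user can find the saved file.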
See Also
Documentation for the standard library modules
email and re in the
Library Reference and Python in a
Nutshell.