Recipe 14.7. Handling Cookies While Fetching Web Pages
Credit: Mike Foord, Nikos Kouremenos
Problem
You need to
fetch web pages (or other resources from the web) that require you to
handle cookies (e.g., save cookies you receive and also reload and
send cookies you had previously received from the same site).
Solution
The Python 2.4 Standard Library provides a
cookielib module exactly for this task. For Python
2.3, a third-party ClientCookie module works
similarly. We can write our code to ensure usage of the best
available cookie-handling module, including none at all, in
which case our program will still run but without saving and
resending cookies. (In some cases, this might still be OK, just maybe
slower.) Here is a script to show how this concept works in practice:
import os.path, urllib2
from urllib2 import urlopen, Request
COOKIEFILE = 'cookies.lwp'          # "cookiejar" file for cookie saving/reloading
# first try getting the best possible solution, cookielib:
try:
    import cookielib
except ImportError:                 # no cookielib, try ClientCookie instead
    cookielib = None
    try:
        import ClientCookie
    except ImportError:             # nope, no cookies today
        cj = None                   # so, in particular, no cookie jar
    else:                           # using ClientCookie, prepare everything
        urlopen = ClientCookie.urlopen
        cj = ClientCookie.LWPCookieJar( )
        Request = ClientCookie.Request
else:                               # we do have cookielib, prepare the jar
    cj = cookielib.LWPCookieJar( )
# Now load the cookies, if any, and build+install an opener using them
if cj is not None:
    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
    if cookielib:
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)
    else:
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)
# for example, try a URL that sets a cookie
theurl = 'http://www.diy.co.uk'
txdata = None
# or, for POST instead of GET, txdata=urllib.urlencode(somedict)
# (which also requires an "import urllib" at the top of the script)
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
try:
    req = Request(theurl, txdata, txheaders)  # create a request object
    handle = urlopen(req)                     # and open it
except IOError, e:
    print 'Failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'Error code: %s.' % e.code
else:
    print 'Here are the headers of the page:'
    print handle.info( )
# you can also use handle.read( ) to get the page, handle.geturl( ) to get
# the true URL (could be different from `theurl' if there have been redirects)
if cj is None:
    print "Sorry, no cookie jar, can't show you any cookies today"
else:
    print 'Here are the cookies received so far:'
    for index, cookie in enumerate(cj):
        print index, ': ', cookie
    cj.save(COOKIEFILE)                       # save the cookies again
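The save-and-reload cycle that the script performs on COOKIEFILE can be tried without any network access. The following is only a minimal sketch, not part of the original recipe: the helper names make_cookie and roundtrip, the cookie's name and value, and the domain www.example.com are all made up for illustration, and the fallback import assumes that on Python 3 the same module is available as http.cookiejar.

```python
import os, tempfile, time

try:
    import cookielib                      # Python 2.4 and later 2.x releases
except ImportError:
    import http.cookiejar as cookielib    # assumption: the module's Python 3 name

def make_cookie(name, value, domain):
    """Build a persistent cookie by hand (hypothetical helper for this demo)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,  # one hour from now
        discard=False,                    # not a session cookie, so save() keeps it
        comment=None, comment_url=None, rest={})

def roundtrip(cookie_file):
    """Save one cookie to cookie_file, then reload it into a fresh jar."""
    jar = cookielib.LWPCookieJar()
    jar.set_cookie(make_cookie('session', 'abc123', 'www.example.com'))
    jar.save(cookie_file)
    reloaded = cookielib.LWPCookieJar()
    reloaded.load(cookie_file)
    return [(c.name, c.value, c.domain) for c in reloaded]

# use a throwaway directory so no existing cookie file is touched
cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies.lwp')
cookies = roundtrip(cookie_file)
```

Note that save, called with its default arguments as in the script, skips cookies marked as session-only (discard set to true); passing ignore_discard=True to both save and load keeps them, if that is what you want.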
Discussion
The third-party module ClientCookie, available for
download at http://wwwsearch.sourceforge.net/ClientCookie/,
was so successful that, in Python 2.4, its functionality has been
added to the Python Standard Library: specifically, the
cookie-handling parts in the new module cookielib,
the rest in the current version of urllib2.

So, you do need to be careful if you want your code to work just as
well on any 2.4 installation (using the latest and greatest
cookielib) or an installation of Python 2.3 with
ClientCookie on top. As long as
we're at it, we might as well handle running on a
2.3 installation that does not have
ClientCookie: run anyway, just
don't save and resend cookies when we lack library
code to do so. On some sites, the inability to handle cookies will
just be a bother and perhaps a performance hit due to the loss of
session continuity, but the site will still work. Other sites, of
course, will be completely unusable without cookies.

The recipe's code is an exercise in the careful
management of an idiom that's an essential part of
making your Python code portable among releases and installations,
while ensuring minimal graceful degradation when third-party modules
you'd like to use just aren't
there. The idiom is known as conditional
import and is expressed as follows:
try:
    import something
except ImportError:  # 'something' not available
    ...code to do without, degrading gracefully...
else:                # 'something' IS available, hooray!
    ...code to run only when something is there...
# and then, go on with the rest of your program
...code able to run with or w/o `something'...
The use of "conditional import" is
particularly delicate in this recipe because
ClientCookie and cookielib
aren't drop-in replacements for each
other; therefore, careful management is indeed necessary. But,
if you study this recipe, you will see that it is not rocket
science; it just requires attention.

One key technique is to make double use of a small number of names as
"flags", with value
None when the object to which they would normally
refer is not available. In this recipe, we do that for
cookielib (which refers to the module of that name
when there is one, and otherwise to None) and
cj (which refers to a cookie-jar
object when there is any, and otherwise to None).
Even better, when feasible, is to assign names appropriately to refer
to the best available object under the circumstances: the recipe does
that for variables urlopen and
Request. Note how crucial it is for this purpose
that Python treats all objects as first class:
urlopen is a function, Request is a
class, cookielib (if any) a module,
cj (if any) an instance object. The distinction,
however, doesn't matter in the least: the
name-object reference concept is exactly the same in every case, with
total uniformity, simplicity, and power.

When either cookielib or
ClientCookie is available, the cookies are saved
in a file in cookie jar format (a useful plain-text format that is
automatically handled by either module but can also be examined and
modified with text editors and other programs). If the file already
exists when the program runs, cookies are loaded from the file, ready
to be sent back to the appropriate sites.

I wrote this code because I'm
developing a cgi-proxy, approx.py (http://www.voidspace.org.uk/atlantibots/pythonutilsl#cgiproxy),
which needs to be able to handle cookies when feasible. To keep the
proxy usable on various versions of Python, and ensure it degrades
gracefully when no cookie-handling library is available, I needed to
develop the carefully managed conditional imports that are shown in
the recipe's Solution. I decided to share them in
this recipe since, besides the importance of cookie handling,
conditional imports are such a generally important Python idiom.
Particularly when installing your code on a server you
don't control, it is unfortunately quite common to
have little say in which version of Python is running, nor in which
third-party extensions are installed: exactly the kind of
situation that requires the conditional import technique to ensure
your code does the best it can under the circumstances.
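When several candidate modules offer a similar interface, the conditional-import idiom can be wrapped in a small helper. The following is only a sketch under stated assumptions, not code from the recipe: first_available is a hypothetical name, and the helper handles only simple (non-dotted) module names. It returns the first module that can be imported, or None as the "flag" value when none can.

```python
def first_available(*names):
    """Return (name, module) for the first importable module, or (None, None)."""
    for name in names:
        try:
            return name, __import__(name)  # simple (non-dotted) module names only
        except ImportError:
            continue                       # not installed: try the next candidate
    return None, None                      # nothing available: degrade gracefully

# Try the recipe's two candidates, best first.
cookie_module_name, cookie_module = first_available('cookielib', 'ClientCookie')

if cookie_module is None:
    cj = None                              # flag value: no cookie support today
else:
    cj = cookie_module.LWPCookieJar()      # both modules provide this class
```

This covers only the choice of module. Because ClientCookie and cookielib are not drop-in replacements for each other, the rest of the program must still check cookie_module_name to decide, for example, whose build_opener and install_opener to call, just as the Solution does with its cookielib flag.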
See Also
Documentation on the cookielib and
urllib2 standard library modules in the
Library Reference for Python 2.4;
ClientCookie is at http://wwwsearch.sourceforge.net/ClientCookie/.