Recipe 13.2. Grabbing a Document from the Web
Credit: Gisle Aas, Magnus Bodin
Problem
You need to grab
a document from a URL on the Web.
Solution
urllib.urlopen returns a file-like object, and you
can call the read method on that object to get all
of its contents:
from urllib import urlopen
# urlopen returns a file-like object; read() returns the whole document as one string
doc = urlopen("http://www.python.org").read()
print doc
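Because the object is file-like, it can be read all at once, as a list of lines, or one line at a time in a loop. The following is a minimal sketch under two assumptions that are not part of the recipe: it points urlopen at a throwaway local file through a file: URL, so that it runs without network access, and it falls back to urllib.request on newer Python versions, where urlopen lives. The helper names make_sample_url and fetch_three_ways are hypothetical.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2, as in this recipe
except ImportError:
    from urllib.request import urlopen, pathname2url  # assumption: newer Pythons


def make_sample_url():
    """Write a small local document; return (path, file: URL). Hypothetical helper."""
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "wb") as f:
        f.write(b"first line\nsecond line\nthird line\n")
    return path, "file:" + pathname2url(path)


def fetch_three_ways(url):
    """Return (whole_document, list_of_lines, lines_from_loop) for url."""
    doc = urlopen(url)
    try:
        whole = doc.read()               # everything as one big string
    finally:
        doc.close()

    doc = urlopen(url)
    try:
        lines = doc.readlines()          # a list with one entry per line
    finally:
        doc.close()

    doc = urlopen(url)
    try:
        looped = [line for line in doc]  # one line at a time, via a loop
    finally:
        doc.close()

    return whole, lines, looped


if __name__ == "__main__":
    path, url = make_sample_url()
    try:
        whole, lines, looped = fetch_three_ways(url)
        print(len(lines))                # number of lines in the sample document
        print(lines == looped)           # do both line-based styles agree?
    finally:
        os.remove(path)
```

The URL is opened three times because each style consumes the underlying stream; once read() has returned, nothing is left for readlines or a loop to see.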
Discussion
Once you obtain a file-like object from urlopen,
you can read it all at once into one big string by calling its
read method, as I do in this recipe.
Alternatively, you can read the object as a list of lines by calling
its readlines method, or, for special purposes,
just get one line at a time by looping over the object in a
for loop. In addition to these file-like
operations, the object that urlopen returns offers
a few other useful features. For example, the following snippet gives
you the headers of the document:
doc = urlopen("http://www.python.org")
print doc.info()
such as the Content-Type header (text/html in this case) that defines the MIME
type of the document. doc.info returns a
mimetools.Message instance, so you can access it
in various ways besides printing it or otherwise transforming it into
a string. For example, doc.info().getheader('Content-Type')
returns the 'text/html' string. The
maintype attribute of the
mimetools.Message object is the
'text' string, subtype is the
'html' string, and type is also
the 'text/html' string. If you need to perform
sophisticated analysis and processing, all the tools you need are
right there. At the same time, if your needs are simpler, you can
meet them in very simple ways, as this recipe shows.

If what you need to do with the document you grab from the Web is
specifically to save it to a local file,
urllib.urlretrieve is just what you need, as the
"Introduction" to this chapter
describes.

urllib implicitly supports the use of proxies (as
long as the proxies do not require authentication: the current
implementation of urllib does not support
authentication-requiring proxies). Just set the environment variable
HTTP_PROXY to a URL, such as
'http://proxy.domain.com:8080', to use the proxy
at that URL. If the environment variable
HTTP_PROXY is not set, urllib
may also look for the information in other platform-specific
locations, such as the Windows registry if you're
running under Windows.

If you have more advanced needs, such as using proxies that require
authentication, you may use the more sophisticated
urllib2 module of the Python Standard Library,
rather than the simpler urllib module. At http://pydoc.org/2.3/urllib2l, you can
find an example of how to use urllib2 for the
specific task of accessing the Internet through a proxy that does
require authentication.
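Two points from this Discussion, the Content-Type lookup and the environment-driven proxy setting, can be checked offline with a short sketch. It rests on several assumptions that are not part of the recipe: a local file: URL stands in for a web server, the main type and subtype are split out of the header string by hand (rather than read from the maintype and subtype attributes) so that the same code also runs where urlopen lives in urllib.request, and the proxy variable is set under the lowercase name http_proxy, which urllib also accepts as far as I know. The helper names make_sample_html_url, content_type_parts, and proxy_for are hypothetical.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url, getproxies          # Python 2, as in this recipe
except ImportError:
    from urllib.request import urlopen, pathname2url, getproxies  # assumption: newer Pythons


def make_sample_html_url():
    """Write a tiny local HTML file; return (path, file: URL). Hypothetical helper."""
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "wb") as f:
        f.write(b"<html><body>hello</body></html>\n")
    return path, "file:" + pathname2url(path)


def content_type_parts(url):
    """Return (type, maintype, subtype) taken from the Content-Type header of url."""
    doc = urlopen(url)
    try:
        # Header lookup by item access; the name is matched case-insensitively.
        header = doc.info()["Content-Type"]
    finally:
        doc.close()
    # Drop any parameters (such as a charset) before splitting the type.
    full_type = header.split(";")[0].strip().lower()
    maintype, subtype = full_type.split("/")
    return full_type, maintype, subtype


def proxy_for(scheme):
    """Return the proxy URL that urllib would use for scheme, or None."""
    return getproxies().get(scheme)


if __name__ == "__main__":
    path, url = make_sample_html_url()
    try:
        print(content_type_parts(url))
    finally:
        os.remove(path)

    # The proxy URL is the example value from the text, not a real server.
    os.environ["http_proxy"] = "http://proxy.domain.com:8080"
    print(proxy_for("http"))
```

For a file: URL the Content-Type is guessed from the file extension, so the sketch says nothing about what a real server would send; it only shows how the header is reached once the object is open.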
See Also
Documentation for the standard library modules
urllib, urllib2, and
mimetools in the Library
Reference and Python in a
Nutshell.