Credit: Gisle Aas, Magnus Bodin
You need to grab a document from a URL on the Web.
urllib.urlopen returns a file-like object, and you can call the read method on that object to get all of its contents:
from urllib import urlopen
doc = urlopen("http://www.python.org").read()
print doc
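The object that urlopen returns can be consumed in more than one way. The following is only a minimal sketch and not part of the original recipe: the helper names fetch_all, fetch_lines, and count_lines are hypothetical, and the try/except import is an assumption added so that the same code also runs on Python 3, where urlopen lives in urllib.request.

```python
try:
    from urllib.request import urlopen   # Python 3 location (assumption)
except ImportError:
    from urllib import urlopen           # Python 2, as used in this recipe


def fetch_all(url):
    """Read the whole document at once with read()."""
    f = urlopen(url)
    try:
        return f.read()
    finally:
        f.close()


def fetch_lines(url):
    """Read the document as a list of lines with readlines()."""
    f = urlopen(url)
    try:
        return f.readlines()
    finally:
        f.close()


def count_lines(url):
    """Loop over the object one line at a time, keeping only a counter."""
    f = urlopen(url)
    try:
        count = 0
        for _line in f:
            count += 1
        return count
    finally:
        f.close()

# Example call (needs network access):
#     lines = fetch_lines("http://www.python.org")
```

Looping line by line is mostly useful when the document is large and you do not want to hold all of it in memory at once. Note that on Python 3 the data comes back as bytes rather than as text strings.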
Once you obtain a file-like object from urlopen, you can read it all at once into one big string by calling its read method, as I do in this recipe. Alternatively, you can read the object as a list of lines by calling its readlines method, or, for special purposes, just get one line at a time by looping over the object in a for loop. In addition to these file-like operations, the object that urlopen returns offers a few other useful features. For example, the following snippet gives you the headers of the document:
doc = urlopen("http://www.python.org")
print doc.info()
The headers include, for example, the Content-Type header (text/html in this case), which defines the MIME type of the document. doc.info returns a mimetools.Message instance, so you can access it in various ways besides printing it or otherwise transforming it into a string. For example, doc.info().getheader('Content-Type') returns the 'text/html' string. The maintype attribute of the mimetools.Message object is the 'text' string, and subtype is the 'html' string.

If what you need to do with the document you grab from the Web is
specifically to save it to a local file, urllib.urlretrieve is just what you need, as the "Introduction" to this chapter describes.

urllib implicitly supports the use of proxies, as long as the proxies do not require authentication (the current implementation of urllib does not support authentication-requiring proxies). Just set the environment variable HTTP_PROXY to a URL, such as 'http://proxy.domain.com:8080', to use the proxy at that URL. If the environment variable HTTP_PROXY is not set, urllib may also look for the information in other platform-specific locations, such as the Windows registry if you're running under Windows.

If you have more advanced needs, such as using proxies that require
authentication, you may use the more sophisticated urllib2 module of the Python Standard Library, rather than the simpler urllib module. At http://pydoc.org/2.3/urllib2l, you can find an example of how to use urllib2 for the specific task of accessing the Internet through a proxy that does require authentication.

See Also
Documentation for the standard library modules urllib, urllib2, and mimetools in the Library Reference and Python in a Nutshell.