Credit: James Thiele, Rogier Steehouder
You want to check whether an HTTP URL corresponds to an existing web page.
httplib lets you check whether a page exists by requesting just its headers, without downloading the page itself. Here's a module implementing a function to perform this task:
""
httpExists.py
A quick and dirty way to check whether a web file is there.
Usage:
>>> import httpExists
>>> httpExists.httpExists('http://www.python.org/')
True
>>> httpExists.httpExists('http://www.python.org/PenguinOnTheTelly')
Status 404 Not Found : http://www.python.org/PenguinOnTheTelly
False
""
import httplib, urlparse

def httpExists(url):
    host, path = urlparse.urlsplit(url)[1:3]
    if ':' in host:
        # port specified, try to use it
        host, port = host.split(':', 1)
        try:
            port = int(port)
        except ValueError:
            print 'invalid port number %r' % (port,)
            return False
    else:
        # no port specified, use default port
        port = None
    try:
        connection = httplib.HTTPConnection(host, port=port)
        connection.request("HEAD", path)
        resp = connection.getresponse()
        if resp.status == 200:       # normal 'found' status
            found = True
        elif resp.status == 302:     # recurse on temporary redirect
            found = httpExists(urlparse.urljoin(url,
                                   resp.getheader('location', '')))
        else:                        # everything else -> not found
            print "Status %d %s : %s" % (resp.status, resp.reason, url)
            found = False
    except Exception, e:
        print e.__class__, e, url
        found = False
    return found

def _test():
    import doctest, httpExists
    return doctest.testmod(httpExists)

if __name__ == "__main__":
    _test()

While this recipe is very simple and runs quite fast (thanks to the ability to use the HTTP command HEAD to get just the headers, not the body, of the page), it may be too simplistic for your specific needs: the HTTP result codes you might need to deal with may go beyond the simple 200 success code and the 302 temporary redirect, to include permanent redirects, temporary inaccessibility, permission problems, and so on.
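For example, a broadened check might accept any 2xx code as success and follow permanent as well as temporary redirects. The following is a minimal sketch of that idea, not part of the original recipe; which codes you follow or report, the function name httpExistsTolerant, and the redirect limit are all assumptions to adjust for your own application:

import httplib, urlparse

_REDIRECT_CODES = (301, 302, 303, 307)    # permanent and temporary redirects

def httpExistsTolerant(url, max_redirects=5):
    # hypothetical variant of httpExists; httplib.HTTPConnection accepts a
    # 'host:port' string and parses the port itself, so the netloc can be
    # passed along unchanged
    host, path = urlparse.urlsplit(url)[1:3]
    try:
        connection = httplib.HTTPConnection(host)
        connection.request("HEAD", path or '/')
        resp = connection.getresponse()
    except Exception, e:
        print e.__class__, e, url
        return False
    if 200 <= resp.status < 300:          # any success code counts as found
        return True
    if resp.status in _REDIRECT_CODES and max_redirects > 0:
        # follow the redirect, but give up after max_redirects hops
        target = urlparse.urljoin(url, resp.getheader('location', ''))
        return httpExistsTolerant(target, max_redirects - 1)
    print "Status %d %s : %s" % (resp.status, resp.reason, url)
    return False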
In my case, I needed to check the correctness of a huge number of mutual links among pages of a site generated by a complex web application on an intranet, so I knew I had the privilege of relying on a simple check for "200 or bust." At any rate, you can use this simple recipe as a starting point to which to add any refinements you determine you actually need.
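For that kind of bulk checking, a small driver script along these lines collects the failures in one pass; the URL list here just reuses the two examples from the docstring and stands in for your own data:

import httpExists

urls_to_check = [
    'http://www.python.org/',
    'http://www.python.org/PenguinOnTheTelly',
]

# keep only the URLs that fail the existence check
broken = [url for url in urls_to_check if not httpExists.httpExists(url)]
for url in broken:
    print 'BROKEN:', url
print '%d broken out of %d checked' % (len(broken), len(urls_to_check))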
Documentation on the urlparse and httplib standard library modules in the Library Reference and Python in a Nutshell.