Recipe 10.4. Calculating Apache Hits per IP Address
Credit: Mark Nenadov, Ivo Woltring
Problem
You need to examine a log file from
Apache to count the number of hits recorded from each individual IP
address that accessed it.
Solution
Many of the chores of administering a web server have to do with
analyzing Apache logs, which Python makes easy:
    def calculateApacheIpHits(logfile_pathname):
        ''' return a dict mapping IP addresses to hit counts '''
        ipHitListing = {}
        contents = open(logfile_pathname, "r")
        # go through each line of the logfile
        for line in contents:
            # split the string to isolate the IP address
            ip = line.split(" ", 1)[0]
            # Ensure length of the IP address is proper (see discussion)
            if 6 < len(ip) <= 15:
                # Increase by 1 if IP exists; else set hit count = 1
                ipHitListing[ip] = ipHitListing.get(ip, 0) + 1
        return ipHitListing
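As a quick way to try the function, the following sketch writes a few made-up log lines to a temporary file, tallies them, and then ranks the non-local addresses by hit count. The sample lines, the names SAMPLE_LOG, top_remote_hits, and demo, and the choice of 127.0.0.1 as the only local address are assumptions made for illustration; the function itself is restated with a with statement only so that the block runs on its own.

```python
import os
import tempfile

def calculateApacheIpHits(logfile_pathname):
    ''' return a dict mapping IP addresses to hit counts '''
    ipHitListing = {}
    with open(logfile_pathname, "r") as contents:
        for line in contents:
            # the IP address is the first space-separated field
            ip = line.split(" ", 1)[0]
            # same modest length check as in the recipe
            if 6 < len(ip) <= 15:
                ipHitListing[ip] = ipHitListing.get(ip, 0) + 1
    return ipHitListing

# Made-up sample lines (hypothetical data, not taken from a real server).
SAMPLE_LOG = (
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326\n'
    '192.0.2.7 - - [10/Oct/2000:13:55:40 -0700] "GET /a HTTP/1.0" 200 512\n'
    '192.0.2.7 - - [10/Oct/2000:13:56:02 -0700] "GET /b HTTP/1.0" 404 209\n'
    '198.51.100.23 - - [10/Oct/2000:13:57:11 -0700] "GET / HTTP/1.0" 200 2326\n'
    '-\n'  # garbage line; its first field is too short to pass the check
)

def top_remote_hits(hits, local_ips=("127.0.0.1",)):
    ''' hypothetical helper: (ip, count) pairs for non-local IPs, busiest first '''
    remote = [(ip, n) for ip, n in hits.items() if ip not in local_ips]
    # sort by descending count, then by address text to break ties
    return sorted(remote, key=lambda pair: (-pair[1], pair[0]))

def demo():
    ''' hypothetical driver: write the sample log to a temp file and tally it '''
    fd, path = tempfile.mkstemp(suffix=".log")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(SAMPLE_LOG)
        return calculateApacheIpHits(path)
    finally:
        os.remove(path)
```

Calling top_remote_hits(demo()) gives the remote addresses ordered from most to least active, which is the kind of list one might feed to a chart.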
Discussion
This recipe supplies a function that returns a dictionary containing
the hit counts for each individual IP address that has accessed your
Apache web server, as recorded in an Apache log file. For example, a
typical use would be:
    HitsDictionary = calculateApacheIpHits(
        "/usr/local/nusphere/apache/logs/access_log")

This function has many quite useful applications. For example, I
often use it in my code to determine the number of hits that are
actually originating from locations other than my local host. This
function is also used to chart which IP addresses are most actively
viewing the pages that are served by a particular installation of
Apache.

This function performs a modest validation of each IP address, which
is really just a length check: an IP address cannot be longer than 15
characters (4 sets of triplets and 3 periods) nor shorter than 7 (4
sets of single digits and 3 periods). This validation is not
stringent, but it does reduce, at tiny runtime cost, the probability
of placing into the dictionary some data that is obviously garbage.
As a general technique, low-cost, highly approximate sanity checks
for data that is expected to be OK (but one never knows for sure) are
worth considering. However, if you want to be stricter, regular
expressions can help. Change the loop in this
recipe's function's body to:
    import re
    # an IP is: 4 strings, each of 1-3 digits, joined by periods
    ip_specs = r'\.'.join([r'\d{1,3}']*4)
    re_ip = re.compile(ip_specs)
    for line in contents:
        match = re_ip.match(line)
        if match:
            # Increase by 1 if IP exists; else set hit count = 1
            ip = match.group()
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1

In this variant, we use a regular expression to extract and validate
the IP at the same time. This approach enables us to avoid the split
operation as well as the length check, and thus amortizes most of the
runtime cost of matching the regular expression. This variant is only
a few percentage points slower than the recipe's
solution.

Of course, the pattern given here as ip_specs is not
entirely precise either, since it accepts, as components of an IP
quad, arbitrary strings of one to three digits, while the components
should be more constrained. But to ward off garbage lines, this level
of sanity check is sufficient.

Another alternative is to convert and check the address: extract
string ip just as we do in this
recipe's Solution, then:
    # Ensure the IP address is proper
    try:
        # build a real list, so that len( ) also works where map is lazy
        quad = [int(part) for part in ip.split('.')]
    except ValueError:
        pass
    else:
        if len(quad)==4 and min(quad)>=0 and max(quad)<=255:
            # Increase by 1 if IP exists; else set hit count = 1
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1

This approach is more work, but it does guarantee that only IP
addresses that are formally valid get counted at all.
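On Python 3, the standard library's ipaddress module offers another way to perform the convert-and-check step. The sketch below is not part of the original recipe: the names is_valid_ipv4 and count_valid_ip_hits are hypothetical, and the counting function takes any iterable of log lines (such as an open file) rather than a pathname, purely to keep the example short.

```python
import ipaddress

def is_valid_ipv4(text):
    ''' hypothetical helper: True if text is a well-formed dotted-quad IPv4 address '''
    try:
        # IPv4Address raises AddressValueError (a ValueError subclass)
        # when the string is not a valid address
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True

def count_valid_ip_hits(lines):
    ''' hypothetical variant of the recipe's loop, counting only valid IPv4 addresses '''
    ipHitListing = {}
    for line in lines:
        # extract the candidate address just as the recipe's Solution does
        ip = line.split(" ", 1)[0]
        if is_valid_ipv4(ip):
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1
    return ipHitListing
```

The trade-off is much the same as for the conversion-based check: stricter validation in exchange for somewhat more work per line.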
See Also
The Apache web server is available and documented at http://httpd.apache.org; regular expressions
are covered in the docs of the re module in the
Library Reference and Python in a
Nutshell.