Python Cookbook 2nd Edition Jun 1002005 [Electronic resources] (text version)


Recipe 10.4. Calculating Apache Hits per IP Address

Credit: Mark Nenadov, Ivo Woltring

Problem

You need to examine a log file from
Apache to count the number of hits recorded from each individual IP
address that accessed it.

Solution

Many of the chores of administering a web server have to do with
analyzing Apache logs, which Python makes easy:

def calculateApacheIpHits(logfile_pathname):
    ''' return a dict mapping IP addresses to hit counts '''
    ipHitListing = {}
    contents = open(logfile_pathname, "r")
    # go through each line of the logfile
    for line in contents:
        # split the string to isolate the IP address
        ip = line.split(" ", 1)[0]
        # Ensure length of the IP address is proper (see discussion)
        if 6 < len(ip) <= 15:
            # Increase by 1 if IP exists; else set hit count = 1
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1
    return ipHitListing

Discussion

This recipe supplies a function that returns a dictionary containing
the hit counts for each individual IP address that has accessed your
Apache web server, as recorded in an Apache log file. For example, a
typical use would be:

HitsDictionary = calculateApacheIpHits(
    "/usr/local/nusphere/apache/logs/access_log")

This function has many useful applications. For example, I often use
it in my code to count the hits that actually originate from
locations other than my local host. It is also handy for charting
which IP addresses most actively view the pages served by a
particular installation of Apache.
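As a minimal sketch of those two uses, the following assumes a dictionary shaped like the one calculateApacheIpHits returns. The helper names top_hitters and external_hits, the local_ips parameter, and the sample addresses are hypothetical and not part of the recipe.

```python
def top_hitters(hits, n=10):
    """Return up to n (ip, count) pairs, busiest addresses first."""
    # Sort by descending count; break ties by address so the order is stable
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

def external_hits(hits, local_ips=("127.0.0.1",)):
    """Return the total number of hits that did not come from local_ips."""
    return sum(count for ip, count in hits.items() if ip not in local_ips)

# Hypothetical sample data shaped like the recipe's return value
sample = {"127.0.0.1": 40, "192.0.2.7": 12, "198.51.100.3": 25}
print(top_hitters(sample, 2))
print(external_hits(sample))
```

Sorting on the negated count keeps the ranking in a single pass over the items, which is usually adequate for the number of distinct addresses found in one log file.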

This function performs a modest validation of each IP address, which
is really just a length check: an IP address cannot be longer than 15
characters (four groups of three digits and 3 periods) nor shorter
than 7 (four single digits and 3 periods). This validation is not
stringent, but it does reduce, at tiny runtime cost, the probability
of placing into the dictionary some data that is obviously garbage.
As a general technique, low-cost, highly approximate sanity checks
for data that is expected to be OK (but one never knows for sure) are
worth considering. However, if you want to be stricter, regular
expressions can help. Change the loop in the body of this recipe's
function to:

    import re
    # an IP is: 4 strings, each of 1-3 digits, joined by periods
    ip_specs = r'\.'.join([r'\d{1,3}']*4)
    re_ip = re.compile(ip_specs)
    for line in contents:
        match = re_ip.match(line)
        if match:
            # Increase by 1 if IP exists; else set hit count = 1
            ip = match.group()
            ipHitListing[ip] = ipHitListing.get(ip, 0) + 1

In this variant, we use a regular expression to extract and validate
the IP at the same time. This approach lets us avoid the split
operation as well as the length check, which offsets most of the
runtime cost of matching the regular expression. This variant is only
a few percent slower than the recipe's solution.

Of course, the pattern given here as ip_specs is not entirely precise
either, since it accepts, as components of an IP quad, arbitrary
strings of one to three digits, while the components should be more
constrained. But to ward off garbage lines, this level of sanity
check is sufficient.
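If you did want the pattern itself to constrain each component to the range 0-255, one possible sketch follows. The names octet, re_strict_ip, and extract_ip are assumptions introduced for illustration, not part of the recipe, and the pattern still tolerates a leading zero such as "01".

```python
import re

# One component: 250-255, 200-249, 100-199, or any 1-2 digit number
octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[0-9]{1,2})'
# Four components joined by periods; the lookahead refuses a match
# that would stop in the middle of a longer run of digits
re_strict_ip = re.compile(r'\.'.join([octet] * 4) + r'(?![0-9])')

def extract_ip(line):
    """Return the IP at the start of line, or None if it is not valid."""
    match = re_strict_ip.match(line)
    return match.group() if match else None
```

In the loop, `ip = extract_ip(line)` followed by `if ip is not None:` would then replace the match test of the looser variant. Whether the extra precision is worth the more complicated pattern depends on how dirty your logs are.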

Another alternative is to convert and check the address: extract the
string ip just as we do in this recipe's Solution, then:

        # Ensure the IP address is proper
        try:
            # build a list (not a map object) so len() also works on Python 3
            quad = [int(part) for part in ip.split('.')]
        except ValueError:
            pass
        else:
            if len(quad)==4 and min(quad)>=0 and max(quad)<=255:
                # Increase by 1 if IP exists; else set hit count = 1
                ipHitListing[ip] = ipHitListing.get(ip, 0) + 1

This approach is more work, but it does guarantee that only IP
addresses that are formally valid get counted at all.
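On Python 3.3 and later, the standard library's ipaddress module can do the same conversion and check. The sketch below is an alternative under that assumption, not the recipe's own method; the helper names is_valid_ipv4 and count_valid_hits are hypothetical, and the function takes an iterable of lines rather than a pathname to keep the example self-contained.

```python
import ipaddress

def is_valid_ipv4(ip):
    """Return True if ip parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(ip)
    except ValueError:
        # ipaddress.AddressValueError is a subclass of ValueError
        return False
    return True

def count_valid_hits(lines):
    """Count hits per formally valid IPv4 address in an iterable of lines."""
    hits = {}
    for line in lines:
        # isolate the first space-delimited field, as in the Solution
        ip = line.split(" ", 1)[0]
        if is_valid_ipv4(ip):
            hits[ip] = hits.get(ip, 0) + 1
    return hits
```

Because an open file is itself an iterable of lines, `count_valid_hits(open(logfile_pathname))` would apply this check to a real log.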
