20.12. Parsing a Web Server Log File
20.12.1. Problem
You want to extract selected information
from a web server log file.
20.12.2. Solution
Pull apart the log file as follows:
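    while (<LOGFILE>) {
        # The pattern must stay on one line: without /x, a line break inside
        # the regex would have to match literally and no entry would parse.
        my ($client, $identuser, $authuser, $date, $time, $tz, $method,
            $url, $protocol, $status, $bytes) =
            /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+)$/;
        # ...
    }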
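As a rough sketch of how the extracted fields might be used, the script below wraps the same pattern in a helper and tallies requests by status code. The helper name parse_clf_line, the %hits_by_status hash, and the two sample log lines are made up for illustration; they are not part of the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the recipe): returns a hash of the eleven
# Common Log Format fields, or an empty list if the line does not match.
sub parse_clf_line {
    my ($line) = @_;
    chomp $line;
    my @fields = $line =~
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+)$/
        or return;
    my %entry;
    @entry{qw(client identuser authuser date time tz method
              url protocol status bytes)} = @fields;
    return %entry;
}

# Invented sample entries; the second has an unescaped space in the URL.
my @sample = (
    '127.0.0.1 - frank [01/Mar/1997:12:55:36 -0700] "GET /~user/index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [01/Mar/1997:12:55:40 -0700] "GET /missing page.html HTTP/1.0" 404 -',
);

# Count how many requests ended with each status code.
my %hits_by_status;
for my $line (@sample) {
    my %entry = parse_clf_line($line) or next;   # skip lines that do not parse
    $hits_by_status{ $entry{status} }++;
}

for my $status (sort keys %hits_by_status) {
    print "$status: $hits_by_status{$status}\n";
}
```

Returning a hash keeps the field names next to the values, which may be easier to work with than eleven separate scalars when only a few fields are of interest.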
20.12.3. Discussion
This regular expression pulls apart
entries in Common Log Format, an informal standard that most web
servers adhere to. The fields are listed in Table 20-2.
Table 20-2. Common Log Format fields
Field | Meaning |
---|---|
client | IP address or hostname of browser's machine |
identuser | If IDENT (RFC 1413) was used, what it returned |
authuser | If username/password authentication was used, whom they logged in as |
date | Date of request (e.g., 01/Mar/1997) |
time | Time of request (e.g., 12:55:36) |
tz | Time zone (e.g., -0700) |
method | Method of request (e.g., GET, POST, or PUT) |
url | URL in request (e.g., /~user/index.html) |
protocol | HTTP/1.0 or HTTP/1.1 |
status | Returned status (200 is okay, 500 is server error) |
bytes | Number of bytes returned (could be "-" for errors, redirects, and other non-document transfers) |
The pattern needs only minor changes to work with other log file formats.
Beware that spaces in the URL field are not escaped, so we can't use
\S* to extract the URL. Using .* instead
would make the regex engine first match through to the end of the string
and then backtrack until it could satisfy the rest of the pattern. We use
.*? and anchor the pattern to the end of the
string with $ so that the regular expression
engine initially matches nothing and then adds characters until the
entire pattern is satisfied.
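The difference between the three quantifier choices can be seen on a small example. This is only a sketch: the request string and the variable names are invented, and the pattern is cut down to the quoted request plus the status and byte count.

```perl
use strict;
use warnings;

# Made-up tail of a log entry whose URL contains an unescaped space.
my $request = '"GET /my file.html HTTP/1.0" 200 512';

# \S+ cannot cross the space inside the URL, so the match fails.
my ($url_nonspace) = $request =~ /^"\S+ (\S+) \S+" \S+ \S+$/;

# .*? starts empty and grows until the rest of the pattern fits.
my ($url_lazy) = $request =~ /^"\S+ (.*?) \S+" \S+ \S+$/;

# .* grabs everything first and backtracks to the same answer.
my ($url_greedy) = $request =~ /^"\S+ (.*) \S+" \S+ \S+$/;

print "\\S+ : ", (defined $url_nonspace ? $url_nonspace : "no match"), "\n";
print ".*? : $url_lazy\n";
print ".*  : $url_greedy\n";
```

Because the pattern is anchored at both ends, the greedy and non-greedy forms capture the same URL here; they differ in how much work the engine does to get there.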
20.12.4. See Also
The CLF spec at http://www.w3.org/Daemon/User/Config/Logging.html