Advanced.Linux.Networking..Roderick.Smith [Electronic resources] نسخه متنی

Analyzing Server Log Files

Web server log files can be an important
source of information to help you manage your Web site. Log files may include
information on the clients that are visiting your site, which of your documents
are popular with those clients, when your files are being accessed, and so on. Unfortunately,
examining raw log files can be a tedious undertaking, so various tools exist to
help summarize the data in the log files. Two common tools are Analog
and Webalizer.

NOTE

This section describes the routine log
files created by Apache, as set by the CustomLog directive,
described earlier. Apache may also log errors, startup messages, and so on to
a separate log file.

The Apache Log File Format

There are actually several different Apache
log file formats, which you can set with the CustomLog directive, as
described in "Setting Common Configuration Options." This section describes
the combined format, which combines information into a single file. The
other options provide a subset of this information.

An entry in the combined log file looks
something like this:

192.168.1.1 - - [06/Nov/2002:16:45:49 -0500] "GET /indexl HTTP/1.0" 200 8597 "-" "Mozilla (X11; I; Linux 2.0.32 i586)"

This entry consists of several parts:

Client hostname or IP address The first field is the IP address or
hostname of the client that made a request.

User identification The next two fields (both dashes in this example) provide the
username of the individual who made the request. The dashes indicate that this
information isn't available. If they're present, the first field is the name as
identified by the identd server and the second is as identified by HTTP user authentication.

Date and time Apache logs the date and time of the transfer request. This
information is recorded in local time, but the log includes the time zone
offset (-0500 in this example, meaning five hours behind GMT).

HTTP request The HTTP request (GET /indexl HTTP/1.0 in this example) shows the
command that the client used (GET), the document requested (/indexl), and the
HTTP version used (1.0). You can use this information to discover which of
your pages are the most popular. This field also often contains clues to
attempted break-ins, because these often rely upon requests for strange
documents.

Response code Apache replies to the client, in part, with a response code that
indicates whether the server could fulfill the request. In this example, the
response code is 200, which means Apache could fulfill the request. Codes
beginning with 3 are redirections, client errors are indicated by codes
beginning with 4, and server errors turn up in codes beginning with 5.

Object size The 8597 entry in this example is the size of the document that Apache
returned, not counting HTTP overhead.

Referrer When a user clicks a link from another page, most browsers deliver
the URL of the referring page to the new page's Web server. Apache records this
information in its log file. In the preceding example, the referrer is "-",
indicating there was no referring page; the user most likely typed the URL
into the Web browser directly.

User agent The final field contains information that the browser sends to
Apache about itself, such as its name and the OS on which it runs. (Note that
Netscape reports itself as Mozilla.) This information isn't wholly reliable;
Web browsers can be programmed to lie, or proxy servers may change the
information.
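
The field layout described above can be sketched as a small parser. This is a minimal illustration in Python, not a tool the text itself uses; the regular expression and the parse_line name are assumptions made for this example, and the pattern does not handle requests that contain escaped quotation marks.

```python
import re
from datetime import datetime

# One line of the combined format: host, identd name, authenticated user,
# timestamp, request, response code, object size, referrer, user agent.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 06/Nov/2002:16:45:49 -0500 (local time plus offset).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A dash in the size field is treated here as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('192.168.1.1 - - [06/Nov/2002:16:45:49 -0500] '
          '"GET /indexl HTTP/1.0" 200 8597 "-" '
          '"Mozilla (X11; I; Linux 2.0.32 i586)"')
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["size"], entry["request"])
```

Lines that do not fit the pattern return None rather than raising an error, so a caller can count or skip malformed entries.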

Using this information, you can peruse your
Apache log files to determine something about the popularity of your Web pages,
when they're being accessed, who's accessing them, and so on. Examining these
files "raw" can be tedious, though. That's where log file analysis
tools come in handy.
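
As a rough idea of what such summarizing involves, the following Python sketch tallies requested documents and response-code classes. The summarize function and the sample entries (including the /missing request) are invented for illustration; real analysis tools do considerably more.

```python
from collections import Counter

def summarize(entries):
    """Tally requested documents and response-code classes.

    entries is an iterable of (request, status) pairs, for example
    ("GET /indexl HTTP/1.0", 200).
    """
    documents = Counter()
    classes = Counter()
    for request, status in entries:
        parts = request.split()
        # A well-formed request has a method, a document, and a protocol.
        if len(parts) == 3:
            documents[parts[1]] += 1
        # 200 falls in "2xx", 404 in "4xx", and so on.
        classes[f"{status // 100}xx"] += 1
    return documents, classes

# Hypothetical sample data, not taken from a real log.
entries = [
    ("GET /indexl HTTP/1.0", 200),
    ("GET /indexl HTTP/1.0", 200),
    ("GET /missing HTTP/1.0", 404),
]
documents, classes = summarize(entries)
print(documents.most_common(1))
print(classes)
```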

NOTE

Most Linux installations include cron jobs
to automatically rotate log files, including Apache log files. Check your
system's cron jobs (usually stored in /etc/cron.d, /etc/cron.interval, where
interval is a period such as daily or weekly, or a similar location) for such
a log file rotation system. If your Web server log files aren't being rotated,
you may want to add this feature to prevent the log files from growing to
consume available disk space.

Using Analog

Analog (http://www.analog.cx) claims to be the most popular Web log file
analyzer in the world. This package's output is heavily text-based, but
includes some bar and pie charts. You can see an example report at
http://www.statslab.cam.ac.uk/~sret1/stats/statsl. Analog ships with some
distributions, or you can obtain it from its Web page.

Setting Analog Options

Analog is controlled through its configuration
file, analog.cfg, which usually resides in /etc. This file contains
various options that help Analog combine data into useful chunks. For instance,
SEARCHENGINE specifies search engines that might appear as referrers, so that
Analog can summarize search engine links to your sites. Three options you're
likely to want to set immediately are the following:

LOGFILE /path/to/log/file
OUTFILE /path/to/output/file
HOSTNAME "Your Organization's Name"

The first two of these items are critically
important. If you don't specify them, Analog won't be able to locate your log
file, and it will dump its output file directly to standard output. Analog's
output is in the form of an HTML file with associated graphics, so you can read
it with a Web browser. (You specify only the name for the main file, such as
/home/httpd/html/analog/indexl; Analog creates its graphics files in the same
directory.) The HOSTNAME specification is purely cosmetic; Analog displays
this information at the top of its report.

Unfortunately, some Analog packages are not
fully functional out of the box because they make peculiar and contradictory
assumptions about the locations of files. These problems can be overcome by
creating a few symbolic links:

Configuration files Some Analog packages are built to assume that the
analog.cfg file will be in the same directory as the Analog executable
(usually /usr/bin), although the file actually resides in /etc. The /usr/bin
directory is a bizarre location for a configuration file, but you can type
ln -s /etc/analog.cfg /usr/bin to leave the file in /etc but still satisfy
Analog.

Language files Analog relies upon language files to operate properly. Some
packages place these in /var/lib/analog/lang, but Analog may look for them in
/usr/bin/lang. Typing ln -s /var/lib/analog/lang /usr/bin allows Analog to
function.

Support graphics Analog generates some graphics, such as pie charts, for each site
it summarizes, but it also relies on other fixed graphics files. Some packages
drop these files in /var/www/html/images by default, but the HTML that Analog
generates looks for them in the images directory under the Analog output
directory, so you may need to create another symbolic link. Change to the
directory that holds the file you specified with the OUTFILE option and type
ln -s /var/www/html/images to give Analog access to these graphics files.

Keep in mind that these adjustments may not
be required for all Analog packages. (I found them to be necessary with an
Analog package intended for Linux Mandrake, analog-5.01-1mdk.)
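
If you script this kind of fix-up rather than typing the ln -s commands by hand, it helps to create each link only when nothing already occupies its place. The following Python sketch shows the idea under stated assumptions: link_if_missing is a hypothetical helper, and the demonstration runs in a scratch directory whose etc and bin subdirectories merely stand in for /etc and /usr/bin.

```python
import os
import tempfile
from pathlib import Path

def link_if_missing(target, link_path):
    """Create link_path -> target unless something already exists there."""
    link_path = Path(link_path)
    # is_symlink() also catches a dangling link, which exists() reports as absent.
    if link_path.exists() or link_path.is_symlink():
        return False
    os.symlink(target, link_path)
    return True

# Scratch directories standing in for /etc and /usr/bin (assumption for the demo).
scratch = Path(tempfile.mkdtemp())
etc_dir = scratch / "etc"
bin_dir = scratch / "bin"
etc_dir.mkdir()
bin_dir.mkdir()
(etc_dir / "analog.cfg").write_text("LOGFILE /path/to/log/file\n")

created = link_if_missing(etc_dir / "analog.cfg", bin_dir / "analog.cfg")
again = link_if_missing(etc_dir / "analog.cfg", bin_dir / "analog.cfg")
print(created, again)
```

The second call leaves the existing link alone, so the script is safe to run more than once.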

Running Analog

Analog can be run by typing its name: analog. The
user who types this command must have read access to the Web access logs, as
well as write access to the Analog output directory. If you plan your
permissions appropriately, you do not need root access
for either task.

In most cases, you'll want to run Analog from
a cron job on a regular basis, such as once a month, once a week, or even once
a day. Keep in mind that Analog, although not a huge process, does consume some
system resources, so running it very frequently (such as once a minute) can
cause a performance hit, particularly on a busy server.

Interpreting Analog Output

The Analog output is broken into several
distinct reports, each of which provides information that's been processed and
summarized in a different way. The specific sections are as follows:

General summary This section provides general information that may be useful in
judging the overall health of your Web server, such as the average number of
requests it processes per day, the average number of successful and failed
requests per day, and the total and average daily data transfers.

Monthly report The monthly report summarizes the number of pages served on a
monthly basis. Increasing monthly use and decreasing perceived performance
could mean you need to upgrade the server or its network connections.

Daily summary This section provides information on the number of pages served by
the day of week (Monday, Tuesday, and so on).

Hourly summary This section is like the daily summary, but it summarizes server
use by the hour within a day (1:00, 2:00, and so on). If
you're experiencing slowdowns, you may want to check this summary and use it to
fine-tune your diagnostics; you might miss a problem if you look for it during
a less busy time of the day.

Domain report If your server handles multiple domains, you'll see a summary of
the amount of traffic each one processes.

Organizational report If you associate different organizations with different domains or
pages, this report breaks traffic down by organization.

Operating system report You can see which OSs your clients
report using if you use a combined Apache log format or another format that
provides this information. Note that because of proxies and other reporting
inaccuracies, this information may not be wholly reliable.

Status code report Analog provides a pie chart showing how many responses of
each status code the Web server issued. This can be useful in quickly spotting
problems if there are a lot of 4xx or 5xx responses.

File size report This section shows the number of files of various sizes that the
Web server delivers. This can be very useful in traffic management; if you see
your file sizes drifting up, you might want to take steps to check this trend,
such as using higher compression levels on your graphics files.

File type report You can see the types of files (JPEG files, HTML files, and so on)
delivered by your Web server. This may be useful in conjunction with the file
size report in controlling the expansion of your Web site.

Directory report Most Web sites are broken into multiple directories, and this
report tells you which of these are most popular, by bytes delivered.

Request report This report displays the popularity of the files in the root
directory on the Web site.
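
To make the daily and hourly summaries concrete, here is a minimal Python sketch that buckets log timestamps by day of week and by hour. It only illustrates the kind of tally involved and is not how Analog itself works; the function name and the sample timestamps are assumptions for this example.

```python
from collections import Counter
from datetime import datetime

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

def tally_by_hour_and_weekday(timestamps):
    """Count requests per hour of day and per day of week.

    Each timestamp is the bracketed field of a log entry, for example
    "06/Nov/2002:16:45:49 -0500". Hours are the local hours as logged.
    """
    hours = Counter()
    weekdays = Counter()
    for stamp in timestamps:
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hours[when.hour] += 1
        weekdays[DAYS[when.weekday()]] += 1
    return hours, weekdays

# Hypothetical timestamps; only the first matches the sample log entry.
stamps = [
    "06/Nov/2002:16:45:49 -0500",
    "06/Nov/2002:16:59:02 -0500",
    "07/Nov/2002:09:12:30 -0500",
]
hours, weekdays = tally_by_hour_and_weekday(stamps)
print(sorted(hours.items()))
print(weekdays.most_common())
```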

You can use these various reports to get a
good idea of how your Web site is being used. They can be even more valuable if
you maintain a record of Analog reports that spans some time, because then you
can examine multiple reports for changes over time. You can do this by creating
or editing a cron job to rotate the Apache log files. (Many distributions'
Apache packages include such a cron job script.) When the log file rotation
occurs, back up the existing Analog output directory to a subdirectory. You can
then create a master HTML page that links into these backup directories so that
you can peruse several weeks' or months' worth of Analog summaries.
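
The backup-and-index step might be scripted along these lines. This Python sketch is only one possible approach: the function names, the 2002-11 label, and the report.html filename are all hypothetical, and the demonstration works in a scratch directory rather than a real Web server area.

```python
import shutil
import tempfile
from html import escape
from pathlib import Path

def archive_report(report_dir, archive_root, label):
    """Copy the current report directory to archive_root/label."""
    destination = Path(archive_root) / label
    shutil.copytree(report_dir, destination)
    return destination

def write_master_index(archive_root, main_file):
    """Write archive_root/index.html with a link to each archived report."""
    archive_root = Path(archive_root)
    labels = sorted(p.name for p in archive_root.iterdir() if p.is_dir())
    items = "\n".join(
        f'<li><a href="{escape(name)}/{escape(main_file)}">{escape(name)}</a></li>'
        for name in labels
    )
    page = f"<html><body><h1>Archived reports</h1>\n<ul>\n{items}\n</ul></body></html>\n"
    index = archive_root / "index.html"
    index.write_text(page)
    return index

# Scratch layout standing in for the real output and archive directories.
scratch = Path(tempfile.mkdtemp())
report = scratch / "analog"
report.mkdir()
(report / "report.html").write_text("<html>placeholder</html>")
archive = scratch / "archive"
archive.mkdir()

archive_report(report, archive, "2002-11")
index = write_master_index(archive, "report.html")
print(index.read_text())
```

Labeling each copy by date keeps the archived reports in chronological order when the index is sorted by name.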

Although Analog is a useful tool that can produce
a wealth of data, sifting through that data can sometimes be almost as
intimidating as confronting the raw Apache log files. Various additional tools,
such as Report Magic (http://www.reportmagic.com), can further summarize
Analog's reports and present details in a more readable form.

Using the Webalizer

The Webalizer (http://www.webalizer.org) is
a major competitor to Analog in the Web page summary sphere. Like Analog, the
Webalizer reads configuration files and creates an output HTML file and
supporting graphics so that you can peruse your Web site's traffic patterns in
a convenient summary form. Webalizer ships with some distributions, or you can
obtain it from its Web page. You can view a sample report at
http://www.webalizer.org/sample/.

Setting the Webalizer Options

The Webalizer is controlled through its
configuration file, webalizer.conf, which is typically stored in /etc. As with
Analog, you must tell the Webalizer where to find your Web server log files
and where to store its output. You do this with options like the following:

LogFile /path/to/log/file
OutputDir /path/to/output/directory

One important difference between the Analog
and Webalizer settings for these options is that Analog requires you to specify
an output filename, but the Webalizer has you specify an output directory in
which it stores its files. If you set the output directory to a location within
your Web server area, you can browse the Webalizer output using a Web browser. If
you place the output elsewhere, you can still access it with your Web browser,
but only on the Web server computer itself by specifying a file:// URL. There
are a few other Webalizer configuration options you might want to adjust,
including:

Incremental If set to yes, this option causes the Webalizer to store its internal state
between runs so that you can process logs in chunks. For instance, you can run
the Webalizer once a day and it will remember the entries it's already
processed and adjust to rotated log files. This option defaults to no, which
causes the Webalizer to analyze the log file fresh each time it's run.

HostName You can set the hostname used in the report title (which is set
with the ReportTitle option).

GroupDomains When reporting hostnames, Webalizer normally analyzes by complete
hostname. You can group hostnames within a domain by specifying a non-0 value
for GroupDomains, though. The value is the number of elements, starting from the
rightmost hostname element, to use as a group. For instance, GroupDomains 2
causes gingko.pangaea.edu and birch.pangaea.edu to be grouped into pangaea.edu.
This option can help to unclutter some of the information that the Webalizer
produces.

GroupSite This is another grouping option, but it works on individual sites.
For instance, GroupSite *.abigisp.net causes all hostnames under abigisp.net to be grouped
together in reports.

HideSite This option hides the sites under a given domain, which is
specified as in the GroupSite option. The GroupSite and HideSite options are frequently used together to create a grouping with no
reporting of the individual sites.
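
The grouping rules described for GroupDomains and GroupSite can be illustrated with a short Python sketch. It mimics the behavior as described above and is not the Webalizer's own implementation; the function names and the dialup42 hostname are invented for this example.

```python
from fnmatch import fnmatch

def group_domain(hostname, levels):
    """Keep the rightmost `levels` elements of a hostname.

    A value of 0 leaves the hostname ungrouped, as described for
    GroupDomains.
    """
    if levels <= 0:
        return hostname
    parts = hostname.split(".")
    return ".".join(parts[-levels:])

def group_site(hostname, pattern):
    """Report the pattern itself for any hostname that matches it."""
    return pattern if fnmatch(hostname, pattern) else hostname

print(group_domain("gingko.pangaea.edu", 2))
print(group_domain("birch.pangaea.edu", 2))
print(group_site("dialup42.abigisp.net", "*.abigisp.net"))
```

Both pangaea.edu hosts collapse to the same group name, which is what lets a report show one line for the whole domain.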

Webalizer configuration files are often
longer and more complex than are Analog configuration files, and the preceding
list covers just a handful of Webalizer options. Most of the options are
documented by comments in the standard configuration file, so you can consult
it for more information.

Running the Webalizer

You can run the Webalizer by typing its name:
webalizer. Like Analog, Webalizer doesn't need to be run as root unless
read access to the Web server access logs or write access to the Webalizer
output directory is restricted to root. You may want to run the Webalizer in
a cron job with the Incremental option set to yes in order to have the program
automatically build a history of Web site access summaries.

TIP

Chances are your Apache installation
created a cron job to rotate the Apache log files; if it didn't, you'll want
to create such a configuration, as noted earlier. To ensure that Webalizer
catches as many Web hits as possible, run Webalizer just before the rotation
occurs, even if you also run Webalizer in its own cron job.

Interpreting the Webalizer Output

Webalizer presents a two-tiered report. The
first overview tier shows a summary of activity over the past year. (On a newly
installed system, most of those months will be empty.) This summary includes
information presented in both a table and a bar chart on the number of hits,
Web page downloads, kilobytes transferred, and so on for each month. You can
click on the month name in the summary table to get to the second tier of the
analysis, which breaks down the month's activity in more detail. This
page contains several subsections:

Monthly statistics The first area presents the same information as in the first-tier
analysis page, plus a bit more, such as the number of various response codes
returned to clients.

Daily statistics The second area shows a bar graph and table summarizing the Web
traffic for each day of the month. Summary statistics include the number of
pages, number of hits, number of files, and number of kilobytes transferred.

Hourly statistics This area presents information similar to the daily statistics
area, but broken down by hour of the day. You can use this to locate peak
traffic times for your site, which can be important information when planning
capacity or debugging capacity-related problems.

Top URLs The Webalizer presents two tables that summarize the number of
hits and kilobytes associated with specific URLs. (You can use grouping options
in the Webalizer configuration file to create groups of URLs to appear in this
list, if you like.) One table presents the top URLs by hits, the other the top
URLs by kilobytes.

Entry and Exit pages Two tables show the most popular entry and exit pages. An entry
page is the first page that a user viewed when visiting your site. An exit page
is the last page a user viewed when visiting your site.

Top sites The Webalizer summarizes the clients that accessed your site the
most, both by number of hits and by number of kilobytes. You can group sites
together in the Webalizer configuration file with options like GroupSite,
described earlier.

Top referrers If your Web log files include referrer information, the Webalizer
summarizes this information so you can see which sites produce the most links
to yours.

Top search strings Some Web search engines, when they produce links to your site,
include the search string as part of the referrer URL. The Webalizer can break
this information out and regenerate the search strings, which the Webalizer
then summarizes for you.

Top user agents The Webalizer summarizes the names of the Web browsers that most
frequently accessed your site.

Top countries The Webalizer's final section summarizes access by what it calls
countries. In reality, the Webalizer is summarizing access by top-level domain
(TLD) name, so your top "countries" may include US Commercial, Network,
and other domains that aren't restricted to particular countries.
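
The entry and exit page idea from the list above can be sketched in a few lines of Python. This is a deliberate simplification, not the Webalizer's method: it treats everything from one client as a single visit, whereas a real analyzer must also decide when one visit ends and the next begins. The function name and the sample hits are invented for illustration.

```python
from collections import Counter

def entry_and_exit_pages(hits):
    """Count entry and exit pages from (client, url) pairs in time order."""
    first = {}
    last = {}
    for client, url in hits:
        # The first page seen for a client is kept; later ones are ignored.
        first.setdefault(client, url)
        # The last page is overwritten until the client's final hit.
        last[client] = url
    return Counter(first.values()), Counter(last.values())

# Hypothetical hits, already sorted by time.
hits = [
    ("192.168.1.1", "/indexl"),
    ("192.168.1.2", "/indexl"),
    ("192.168.1.1", "/products"),
    ("192.168.1.2", "/contact"),
    ("192.168.1.1", "/contact"),
]
entries, exits = entry_and_exit_pages(hits)
print(entries.most_common(1))
print(exits.most_common(1))
```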

If you want to compare trends in your Web
server access, the overview tier can give you general trends, but you'll need
to compare the monthly reports (say, in side-by-side Web browser windows) to
see how specific access patterns change with time.
