
Chapter 20. Web Automation
Contents:
- Introduction
- Fetching a URL from a Perl Script
- Automating Form Submission
- Extracting URLs
- Converting ASCII to HTML
- Converting HTML to ASCII
- Extracting or Removing HTML Tags
- Finding Stale Links
- Finding Fresh Links
- Using Templates to Generate HTML
- Mirroring Web Pages
- Creating a Robot
- Parsing a Web Server Log File
- Processing Server Logs
- Using Cookies
- Fetching Password-Protected Pages
- Fetching https:// Web Pages
- Resuming an HTTP GET
- Parsing HTML
- Extracting Table Data
- Program: htmlsub
- Program: hrefsub

The web, then, or the pattern, a web at once sensuous and logical, an
elegant and pregnant texture: that is style, that is the foundation
of the art of literature.

Robert Louis Stevenson, On some Technical Elements of Style in Literature (1885)

20.0. Introduction
Chapter 19
concentrated on responding to browser requests and producing
documents using CGI. This chapter approaches the Web from the other
side: instead of responding to a browser, you pretend to be one,
generating requests and processing returned documents. We make
extensive use of modules to simplify this process because the
intricate network protocols and document formats are tricky to get
right. By letting existing modules handle the hard parts, you can
concentrate on the interesting part: your own program.

The relevant modules can all be found under the following URL:

http://search.cpan.org/modlist/World_Wide_Web

There you'll
find modules for computing credit card checksums, interacting with
Netscape or Apache server APIs, processing image maps, validating
HTML, and manipulating MIME. The largest and most important modules
for this chapter, though, are found in the libwww-perl suite of
modules, referred to collectively as LWP. Table 20-1 lists just a few modules included in LWP.

Table 20-1. LWP modules

Module name | Purpose |
---|---|
LWP::UserAgent | WWW user agent class |
LWP::RobotUA | Develop robot applications |
LWP::Protocol | Interface to various protocol schemes |
LWP::Authen::Basic | Handle 401 and 407 responses |
LWP::MediaTypes | MIME types configuration (text/html, etc.) |
LWP::Debug | Debug logging module |
LWP::Simple | Simple procedural interface for common functions |
HTTP::Headers | MIME/RFC 822-style headers |
HTTP::Message | HTTP-style message |
HTTP::Request | HTTP request |
HTTP::Response | HTTP response |
HTTP::Daemon | HTTP server class |
HTTP::Status | HTTP status code (200 OK, etc.) |
HTTP::Date | Date-parsing module for HTTP date formats |
HTTP::Negotiate | HTTP content negotiation calculation |
WWW::RobotRules | Parse robots.txt files |
File::Listing | Parse directory listings |
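
As a rough sketch of how a few of the modules in Table 20-1 fit together, the following fetches the same document twice. The first fetch uses the procedural LWP::Simple interface, which hands back only the body. The second uses LWP::UserAgent with an explicit HTTP::Request, which yields an HTTP::Response whose status and headers can be inspected. The `data:` URL is an assumption chosen so the example runs without a network connection; any other URL could be substituted.

```perl
use strict;
use warnings;
use LWP::Simple;        # exports get() among others
use LWP::UserAgent;
use HTTP::Request;

# Assumption: a data: URL stands in for a real web address so that
# the sketch runs offline.
my $url = 'data:text/plain,Hello%20from%20LWP';

# Procedural interface: returns the document body, or undef on failure.
my $body = get($url);
defined $body or die "Couldn't fetch $url\n";
print "LWP::Simple body: $body\n";

# Object interface: the response object exposes status, headers, and body.
my $ua       = LWP::UserAgent->new;
my $request  = HTTP::Request->new(GET => $url);
my $response = $ua->request($request);

if ($response->is_success) {
    print "Status:       ", $response->status_line, "\n";
    print "Content-Type: ", $response->header('Content-Type'), "\n";
    print "Body:         ", $response->content, "\n";
} else {
    die "Request failed: ", $response->status_line, "\n";
}
```

The procedural call is shorter, but only the object interface lets the program distinguish, say, a missing page from a server error by looking at the status line.
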

The LWP::Simple module offers an easy way to fetch a document. However,
the module can't access individual components of the HTTP response.
For these, use HTTP::Request, HTTP::Response, and LWP::UserAgent. We
show both sets of modules in Recipe 20.1,
Recipe 20.2, and Recipe 20.10.

Once distributed with LWP, but now in distributions of their own, are
the HTML:: modules. These parse HTML. They provide the basis for
Recipe 20.5, Recipe 20.4,
Recipe 20.6, Recipe 20.3, Recipe 20.7, and the programs htmlsub
and hrefsub.

Recipe 20.12 gives a regular expression to
decode fields in your web server's log files and shows how to
interpret the fields. We use this regular expression and the
Logfile::Apache module in Recipe 20.13 to show
two ways of summarizing data in web server log files.

For detailed guidance on the LWP modules, see Sean Burke's
Perl & LWP (O'Reilly). This book expands on
much of this chapter, picking up where recipes such as Recipe 20.5 on converting
HTML to ASCII, Recipe 20.14 on fetching pages that use cookies, and
Recipe 20.15 on fetching password-protected pages leave off.
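
To give a feel for the kind of log decoding that Recipe 20.12 covers, here is a minimal sketch that pulls the fields out of one line in the widely used Common Log Format. The pattern, the `parse_log_line` helper, the field names, and the sample line are all illustrative assumptions, not necessarily the regular expression that the recipe presents.

```perl
use strict;
use warnings;

# Hypothetical helper: split one Common Log Format line into named fields.
# Returns a hash reference, or undef if the line does not match.
sub parse_log_line {
    my ($line) = @_;
    my @fields = $line =~ m{
        ^(\S+)\s+                   # client host or IP address
        (\S+)\s+                    # identd answer, usually "-"
        (\S+)\s+                    # authenticated user, or "-"
        \[([^\]]+)\]\s+             # timestamp inside square brackets
        "(\S+)\s+(\S+)\s+(\S+)"\s+  # request: method, URL, protocol
        (\d{3})\s+                  # HTTP status code
        (\d+|-)\s*$                 # bytes sent, or "-" for none
    }x or return;

    my %rec;
    @rec{qw(client ident user time method url proto status bytes)} = @fields;
    return \%rec;
}

# Illustrative sample line (an assumption, not taken from a real log).
my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           . '"GET /apache_pb.gif HTTP/1.0" 200 2326';

my $rec = parse_log_line($sample) or die "unparseable line\n";
print "$rec->{client} requested $rec->{url} (status $rec->{status})\n";
```

Returning a hash reference keeps the field names attached to the values, which makes later summarizing (counting hits per URL or per status code, for instance) straightforward.
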

Copyright © 2003 O'Reilly & Associates. All rights reserved.