Fetching a URL from a Perl Script
Extracting or Removing HTML Tags
Using Templates to Generate HTML
Fetching Password-Protected Pages
The web, then, or the pattern, a web at once sensuous and logical, an elegant and pregnant texture: that is style, that is the foundation of the art of literature.
Robert Louis Stevenson, On some Technical Elements of Style in Literature (1885)
Chapter 19 concentrated on responding to browser requests and producing documents using CGI. This chapter approaches the Web from the other side: instead of responding to a browser, you pretend to be one, generating requests and processing returned documents. We make extensive use of modules to simplify this process because the intricate network protocols and document formats are tricky to get right. By letting existing modules handle the hard parts, you can concentrate on the interesting part—your own program.
The relevant modules can all be found under the following URL:
http://search.cpan.org/modlist/World_Wide_Web
There you'll find modules for computing credit card checksums, interacting with Netscape or Apache server APIs, processing image maps, validating HTML, and manipulating MIME. The largest and most important modules for this chapter, though, are found in the libwww-perl suite of modules, referred to collectively as LWP. Table 20-1 lists just a few modules included in LWP.
Module name | Purpose |
---|---|
LWP::UserAgent | WWW user agent class |
LWP::RobotUA | Develop robot applications |
LWP::Protocol | Interface to various protocol schemes |
LWP::Authen::Basic | Handle 401 and 407 responses |
LWP::MediaTypes | MIME types configuration (text/html, etc.) |
LWP::Debug | Debug logging module |
LWP::Simple | Simple procedural interface for common functions |
HTTP::Headers | MIME/RFC 822-style headers |
HTTP::Message | HTTP-style message |
HTTP::Request | HTTP request |
HTTP::Response | HTTP response |
HTTP::Daemon | HTTP server class |
HTTP::Status | HTTP status code (200 OK, etc.) |
HTTP::Date | Date-parsing module for HTTP date formats |
HTTP::Negotiate | HTTP content negotiation calculation |
WWW::RobotRules | Parse robots.txt files |
File::Listing | Parse directory listings |
The HTTP:: and LWP:: modules request documents from a server. The LWP::Simple module offers an easy way to fetch a document. However, the module can't access individual components of the HTTP response. For these, use HTTP::Request, HTTP::Response, and LWP::UserAgent. We show both sets of modules in Recipe 20.1, Recipe 20.2, and Recipe 20.10.
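The contrast between the two interfaces can be sketched roughly as follows. This is a minimal illustration, not code from the recipes: the helper names fetch_simple and fetch_detailed are hypothetical, and the 30-second timeout is an arbitrary choice. It assumes the libwww-perl distribution is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple ();      # import nothing; LWP::Simple::get is called by full name
use LWP::UserAgent;
use HTTP::Request;

# Procedural interface: returns the document body, or undef on any failure.
# The status code and headers are not available through this call.
# (fetch_simple is a hypothetical helper name.)
sub fetch_simple {
    my ($url) = @_;
    return LWP::Simple::get($url);
}

# Object interface: build a request, send it through a user agent, and
# pick the HTTP::Response apart. (fetch_detailed is a hypothetical helper name.)
sub fetch_detailed {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new(timeout => 30);   # timeout value is an assumption
    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);               # an HTTP::Response object
    return {
        ok      => $response->is_success ? 1 : 0,
        code    => $response->code,
        message => $response->message,
        type    => scalar $response->content_type,
        body    => $response->content,
    };
}

# Demonstration: runs only when a URL is supplied on the command line.
if (@ARGV) {
    my $url  = $ARGV[0];
    my $body = fetch_simple($url);
    print defined($body)
        ? "LWP::Simple fetched " . length($body) . " characters\n"
        : "LWP::Simple could not fetch $url\n";

    my $info = fetch_detailed($url);
    print "Status: $info->{code}\n";
    print "Content-Type: $info->{type}\n" if $info->{ok};
}
```

When the fetch fails, fetch_simple can only report undef, whereas fetch_detailed still returns the status code, which is usually what a script needs in order to decide whether to retry, follow up, or give up.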
Once distributed with LWP, but now in distributions of their own, are the HTML:: modules. These parse HTML. They provide the basis for Recipes 20.3 through 20.7 and for programs such as hrefsub.
Recipe 20.12 gives a regular expression to decode fields in your web server's log files and shows how to interpret the fields. We use this regular expression and the Logfile::Apache module in Recipe 20.13 to show two ways of summarizing data in web server log files.
For detailed guidance on the LWP modules, see Sean Burke's Perl & LWP (O'Reilly). This book expands on much of this chapter, picking up where recipes such as Recipe 20.5 on converting HTML to ASCII, Recipe 20.14 on fetching pages that use cookies, and Recipe 20.15 on fetching password-protected pages leave off.
Copyright © 2003 O'Reilly & Associates. All rights reserved.