20.7. Finding Stale Links
20.7.1. Problem
You want to check a document for invalid
links.
20.7.2. Solution
Use the technique outlined in
Recipe 20.3 to extract each link, and then
use LWP::Simple's head function to make sure that
the link exists.
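
As a rough sketch of the check on its own: in list context, LWP::Simple's head returns a list of header values (content type, document length, and so on) when the request succeeds, and an empty list when it fails. The helper name describe_head is hypothetical and is not part of the recipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Summarize the list that head() returned for $url.
# An empty list means the HEAD request failed.
# (describe_head is an illustrative name, not from the recipe.)
sub describe_head {
    my ($url, @info) = @_;
    return "$url: BAD" unless @info;
    my ($type, $length) = @info;
    $type = 'unknown type' unless defined $type;
    my $size = defined $length ? "$length bytes" : 'size unknown';
    return "$url: OK ($type, $size)";
}

# Check every URL named on the command line.
print describe_head($_, head($_)), "\n" for @ARGV;
```

Keeping the formatting in a separate subroutine means it can be exercised without touching the network.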
20.7.3. Discussion
Example 20-5 is an applied example of the
link-extraction technique. Instead of just printing the name of the
link, we call LWP::Simple's head function on it.
The HEAD method fetches the remote document's metainformation without
downloading the whole document. If it fails, the link is bad, so we
print an appropriate message.
Because this program uses the get function from
LWP::Simple, it is expecting a URL, not a filename. If you want to
supply either, use the URI::Heuristic module described in Recipe 20.1.
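
A minimal sketch of accepting either form, assuming the uf_urlstr function that URI::Heuristic exports on request. It guesses a full URL string from a partial one or from a filename. The loop over @ARGV is only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Heuristic qw(uf_urlstr);

# Print the URL that URI::Heuristic guesses for each argument.
# Absolute paths become file: URLs; names starting with "www."
# are given an http:// prefix.
for my $arg (@ARGV) {
    print "$arg -> ", uf_urlstr($arg), "\n";
}
```

In churl, the guessed string could be assigned to $base_url before the call to get.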
Example 20-5. churl
#!/usr/bin/perl -w
# churl - check urls
use HTML::LinkExtor;
use LWP::Simple;

$base_url = shift
    or die "usage: $0 <start_url>\n";
# passing $base_url makes the parser return absolute URI objects
$parser = HTML::LinkExtor->new(undef, $base_url);
$html = get($base_url);
die "Can't fetch $base_url" unless defined($html);
$parser->parse($html);
$parser->eof;                           # signal end of document
@links = $parser->links;
print "$base_url: \n";
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;      # element type, such as "a" or "img"
    while (@element) {
        # the rest of the list is attribute name/value pairs
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            print " $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}
Here's an example of a program run:
% churl http://www.wizards.com
http://www.wizards.com:
FrontPage/FP_Color.gif: OK
FrontPage/FP_BW.gif: BAD
#FP_Map: OK
Games_Library/Welcomel: OK
This program has the same limitation as the HTML::LinkExtor program
in Recipe 20.3.
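
The OK/BAD output does not say why a link failed. One possible variation, sketched here with the LWP::UserAgent and HTTP::Response modules listed in See Also, is to issue the HEAD request through a user agent and print the status line for failures. The subroutine name report and the 15-second timeout are assumptions made for illustration, not part of churl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Turn an HTTP::Response object into a one-line report.
# (report is an illustrative name, not from the recipe.)
sub report {
    my ($url, $response) = @_;
    my $verdict = $response->is_success
        ? "OK"
        : "BAD (" . $response->status_line . ")";
    return " $url: $verdict";
}

# The timeout value is an arbitrary choice for this sketch.
my $ua = LWP::UserAgent->new(timeout => 15);

# Send a HEAD request for each URL named on the command line.
for my $url (@ARGV) {
    print report($url, $ua->head($url)), "\n";
}
```

The same report subroutine could replace the head($attr_value) test inside churl's inner loop.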
20.7.4. See Also
The documentation for the CPAN modules HTML::LinkExtor, LWP::Simple,
LWP::UserAgent, and HTTP::Response; Recipe 20.8