20.8. Finding Fresh Links
20.8.1. Problem
Given a list of URLs, you want to determine which have been modified
most recently. For example, you want to sort your bookmarks so that
those most recently updated are at the top.
20.8.2. Solution
The program in Example 20-6 reads URLs from standard
input, rearranges them by date, and prints them to standard output
with those dates prepended.
Example 20-6. surl
#!/usr/bin/perl -w
# surl - sort URLs by their last modification date
use strict;
use LWP::UserAgent;
use HTTP::Request;
use URI::URL qw(url);

my %Date;                       # URL => last-modified time (epoch seconds)
my $ua = LWP::UserAgent->new();

# read one URL per line from the files named in @ARGV, or from STDIN
while ( my $url = url(scalar <>) ) {
    my $ans;
    next unless $url->scheme =~ /^(file|https?)$/;
    $ans = $ua->head($url);
    if ($ans->is_success) {
        $Date{$url} = $ans->last_modified || 0;  # unknown
    } else {
        warn("$url: Error [", $ans->code, "] ", $ans->message, "!\n");
    }
}

# newest first; URLs without a date sort last
foreach my $url ( sort { $Date{$b} <=> $Date{$a} } keys %Date ) {
    printf "%-25s %s\n", $Date{$url} ? (scalar localtime $Date{$url})
                                     : "<NONE SPECIFIED>", $url;
}
20.8.3. Discussion
The surl script works more like a traditional filter program. It reads
from standard input, one URL per line. (Actually, it reads through ARGV,
which defaults to STDIN when @ARGV is empty.) The last-modified date of
each URL is fetched with a HEAD request and stored in a hash with the
URL as key. Then a simple sort by value is run on the hash to reorder
the URLs by date. On output, the internal date is converted into
localtime format.
Here's an example that uses the xurl program from the earlier recipe to
extract the URLs, then feeds that program's output into surl.
% xurl http://use.perl.org/~gnat/journal | surl | head
Mon Jan 13 22:58:16 2003 http://www.nanowrimo.org/
Sun Jan 12 19:29:00 2003 http://www.costik.com/gamespekl
Sat Jan 11 20:57:03 2003 http://www.cpan.org/ports/1l
Sat Jan 11 09:46:19 2003 http://jakarta.apache.org/gump/
Tue Jan 7 20:27:30 2003 http://use.perl.org/images/menu_gox.gif
Tue Jan 7 20:27:30 2003 http://use.perl.org/images/menu_bgo.gif
Tue Jan 7 20:27:30 2003 http://use.perl.org/images/menu_gxg.gif
Tue Jan 7 20:27:30 2003 http://use.perl.org/images/menu_ggx.gif
Tue Jan 7 20:27:30 2003 http://use.perl.org/images/menu_gxx.gif
Tue Jan 7 20:27:30 2003 http://use.perl.org/images/menu_gxo.gif
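The sort-by-value and date-formatting steps that surl performs can be tried without touching the network. The following is a minimal sketch under stated assumptions: the URLs and epoch values in %Date are made-up sample data, and the helper names newest_first and format_date are hypothetical, introduced here only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: URL => last-modified time in epoch seconds.
# A value of 0 stands for "no Last-Modified date was reported".
my %Date = (
    'http://example.com/old'     => 1_000_000_000,
    'http://example.com/new'     => 1_042_000_000,
    'http://example.com/unknown' => 0,
);

# Return the keys of a hash, ordered by descending numeric value.
sub newest_first {
    my ($dates) = @_;
    my @sorted = sort { $dates->{$b} <=> $dates->{$a} } keys %$dates;
    return @sorted;
}

# Render an epoch time in localtime format, or a placeholder for 0.
sub format_date {
    my ($epoch) = @_;
    return $epoch ? scalar(localtime($epoch)) : '<NONE SPECIFIED>';
}

for my $url ( newest_first(\%Date) ) {
    printf "%-25s %s\n", format_date($Date{$url}), $url;
}
```

Because unknown dates are stored as 0, they sort after every real timestamp, which is why such URLs end up at the bottom of the listing.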
Having a variety of small programs that each do one thing and can be
combined into more powerful constructs is the hallmark of good
programming. You could even argue that xurl should work on files, and
that some other program should actually fetch the URL's contents over
the Web to feed into xurl, churl, or surl. That program would probably
be called gurl, except that program already exists: the LWP module
suite has a program called lwp-request with aliases HEAD, GET, and POST
to run those operations from shell scripts.
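In the same small-tools spirit, the HEAD step of surl can be pulled out into a program of its own. The following is only a sketch, assuming the LWP modules are installed. The helper name last_modified_of is hypothetical, and the code is not meant to show how lwp-request itself is implemented. It takes URLs as command-line arguments and prints each one with its last-modified date.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: issue a HEAD request for $url and return the
# Last-Modified time in epoch seconds, 0 if no date was reported,
# or undef if the request failed.
sub last_modified_of {
    my ($ua, $url) = @_;
    my $response = $ua->head($url);
    unless ($response->is_success) {
        warn "$url: ", $response->status_line, "\n";
        return;
    }
    return $response->last_modified || 0;
}

my $ua = LWP::UserAgent->new;
for my $url (@ARGV) {
    my $when = last_modified_of($ua, $url);
    next unless defined $when;       # request failed; already warned
    printf "%-25s %s\n",
        ($when ? scalar(localtime($when)) : '<NONE SPECIFIED>'), $url;
}
```

Run with no arguments, the script does nothing. Given one or more URLs, it prints one line per URL that answered successfully, in the same layout surl uses.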
20.8.4. See Also
The documentation for the CPAN modules LWP::UserAgent, HTTP::Request,
and URI::URL; Recipe 20.7