Perl CD Bookshelf [Electronic resources], text version

20.3. Extracting URLs


20.3.1. Problem


You want to extract all URLs from an
HTML file. For example, you have downloaded a page that lists the MP3
files downloadable from some site. You want to extract those MP3s'
URLs so you can filter the list and write a program to download the
ones you want.

20.3.2. Solution


Use the HTML::LinkExtor module from
CPAN:

use HTML::LinkExtor;

$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;    # element type
    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}

20.3.3. Discussion


You can use HTML::LinkExtor in two different ways: either by calling
links to get a list of all links in the document
once it is completely parsed, or by passing a code reference in the
first argument to new. The referenced function is
called on each link as the document is parsed.
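The callback form can be sketched as follows. This is a minimal illustration, not the recipe's own code: the inline $html string and the @found array are assumptions made for the example, and no base URL is passed, so attribute values arrive as plain strings.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical input: a small inline document instead of a file.
my $html = <<'END_HTML';
<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="/image/library/english/10159_big.gif"
     LOWSRC="/image/library/english/10159_big-lowres.gif">
END_HTML

my @found;    # filled in by the callback as the parser meets each link

# The callback receives the tag name followed by attribute/value pairs.
my $record_link = sub {
    my ($tag, %attrs) = @_;
    push @found, "$tag $_=$attrs{$_}" for sort keys %attrs;
};

my $parser = HTML::LinkExtor->new($record_link);
$parser->parse($html);
$parser->eof;

print "$_\n" for @found;
```

Because the callback sees each link as soon as its tag is parsed, this style suits large documents where you would rather not hold the whole link list in memory.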

The links method clears
the link list, so call it only once per parsed document. It returns a
list of array references. Each referenced array holds the element's
tag name (a lowercase string) at the front, followed by a
list of attribute name and attribute value pairs. For instance, the
HTML:

<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="/image/library/english/10159_big.gif"
     LOWSRC="/image/library/english/10159_big-lowres.gif">

would return a data structure like this:

[
  [ a,   href   => "http://www.perl.com/" ],
  [ img, src    => "/image/library/english/10159_big.gif",
         lowsrc => "/image/library/english/10159_big-lowres.gif" ]
]

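Here's an example of how to use $elt_type and
$attr_name to print out anchors and images. The call to scheme works
because a base URL was passed to new, which makes each attribute
value a URI object: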

if ($elt_type eq 'a' && $attr_name eq 'href') {
    print "ANCHOR: $attr_value\n"
        if $attr_value->scheme =~ /http|ftp/;
}
if ($elt_type eq 'img' && $attr_name eq 'src') {
    print "IMAGE: $attr_value\n";
}

To extract links only to MP3 files, you'd say:

foreach my $linkarray (@links) {
    my ($elt_type, %attrs) = @$linkarray;
    if ($elt_type eq 'a' && $attrs{'href'} =~ /\.mp3$/i) {
        # do something with $attrs{'href'}, the URL of the mp3 file
    }
}
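To see the MP3 filter run end to end, here is a small self-contained sketch. The base URL http://www.example.com/music/ and the inline page are hypothetical, made up for the illustration. It also guards against anchors that have no href attribute.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical base URL and page listing, used only for illustration.
my $base_url = 'http://www.example.com/music/';
my $html = <<'END_HTML';
<a href="one.mp3">One</a>
<a href="notes.txt">Notes</a>
<a href="/archive/TWO.MP3">Two</a>
<img src="cover.gif">
END_HTML

my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($html);
$parser->eof;

my @mp3_urls;
foreach my $linkarray ($parser->links) {
    my ($elt_type, %attrs) = @$linkarray;
    next unless $elt_type eq 'a' && defined $attrs{href};
    # Values are URI objects here; quoting stringifies them.
    push @mp3_urls, "$attrs{href}" if $attrs{href} =~ /\.mp3$/i;
}

print "$_\n" for @mp3_urls;
```

Since a base URL is supplied, the relative links come back as absolute URLs, which is the form a download program would need.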

Example 20-2 is a complete program that takes as its
argument a URL, such as file:///tmp/testing.html or http://www.ora.com/, and produces on standard
output an alphabetically sorted list of unique URLs linked from that
page.

Example 20-2. xurl


#!/usr/bin/perl -w
# xurl - extract unique, sorted list of links from URL
use HTML::LinkExtor;
use LWP::Simple;

$base_url = shift;
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url))->eof;
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}
for (sort keys %seen) { print $_, "\n" }

This program does have a limitation: if the get of
$base_url involves a redirection, links resolve
using the original URL instead of the URL after the redirection. To
fix this, fetch the document with LWP::UserAgent and examine the
response code to find out whether a redirection occurred. Once you
know the post-redirection URL (if any), construct the HTML::LinkExtor
object accordingly.
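One way to approach that fix is sketched below. It is an assumption about how you might structure it, not the book's code: the function name links_after_redirect is hypothetical, and instead of inspecting the response code it relies on the response object's base method, which reports the URL the document was finally fetched from (or a base the document itself declares).

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;

# Hypothetical helper: fetch $url and return the sorted, unique links
# on the page, resolved against the URL the response finally came from.
sub links_after_redirect {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die "Can't fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;

    # base() reflects the post-redirection request URL, unless the
    # response headers or the document declare a different base.
    my $parser = HTML::LinkExtor->new(undef, $response->base);
    $parser->parse($response->decoded_content);
    $parser->eof;

    my %seen;
    foreach my $linkarray ($parser->links) {
        my ($elt_type, %attrs) = @$linkarray;
        $seen{$_}++ for values %attrs;
    }
    return sort keys %seen;
}
```

Called in list context, for example as print "$_\n" for links_after_redirect($base_url), it would stand in for the body of xurl. If you need to know whether a redirection happened at all, the response object's redirects method returns the intermediate responses.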

Here's an example of the run:

% xurl http://www.perl.com/CPAN
ftp://ftp@ftp.perl.com/CPAN/CPANl
http://language.perl.com/misc/CPAN.cgi
http://language.perl.com/misc/cpan_module
http://language.perl.com/misc/getcpan
http://www.perl.com/1l
http://www.perl.com/gifs/lcb.xbm

In mail or Usenet messages, you may see URLs written
as:

<URL:http://www.perl.com>

This is supposed to make it easy to pick URLs from messages:

@URLs = ($message =~ /<URL:(.*?)>/g);
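As a quick check of this bracketed convention, the following sketch runs the match against a message. The message text is hypothetical, invented for the illustration. Note that the pattern needs the closing angle bracket: without it, the non-greedy .*? would match the empty string.

```perl
use strict;
use warnings;

# Hypothetical message text, invented for this illustration.
my $message = <<'END_MESSAGE';
The home page is at <URL:http://www.perl.com/> and the archive
lives at <URL:ftp://ftp.perl.com/CPAN/>.
END_MESSAGE

# Capture whatever sits between "<URL:" and the closing ">".
my @URLs = ($message =~ /<URL:(.*?)>/g);

print "$_\n" for @URLs;
```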

20.3.4. See Also


The documentation for the CPAN modules LWP::Simple, HTML::LinkExtor,
and HTML::Entities; Recipe 20.1
