Perl CD Bookshelf [Electronic resources]


20.19. Extracting Table Data


20.19.1. Problem


You have data in an HTML table, and you would like to turn it into a
Perl data structure. For example, you want to monitor changes to an
author's CPAN module list.

20.19.2. Solution


Use the HTML::TableContentParser module from
CPAN:

use HTML::TableContentParser;

$tcp    = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);

foreach $table (@$tables) {
    @headers = map { $_->{data} } @{ $table->{headers} };
    # attributes of table tag available as keys in hash
    $table_width = $table->{width};

    foreach $row (@{ $table->{rows} }) {
        # attributes of tr tag available as keys in hash
        foreach $col (@{ $row->{cells} }) {
            # attributes of td tag available as keys in hash
            $data = $col->{data};
        }
    }
}

20.19.3. Discussion


The HTML::TableContentParser module converts all tables in the HTML
document into a Perl data structure. As with HTML tables, there are
three layers of nesting in the data structure: the table, the row,
and the data in that row.

Each table, row, and data tag is represented as a hash reference. The
hash keys correspond to attributes of the tag that defined that
table, row, or cell. In addition, the value for a special key gives
the contents of the table, row, or cell. In a table, the value for
the rows key is a reference to an array of rows.
In a row, the cells key points to an array of
cells. In a cell, the data key holds the HTML
contents of the data tag.
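
To make the nesting concrete, here is a minimal sketch that walks a hand-built structure using the rows, cells, and data keys and flattens it into an array of rows of cell strings. No parsing is involved: the sample data and the helper name table_to_grid are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;

# Hand-built stand-in for one element of the list that parse() returns.
my $table = {
    width => '100%',
    rows  => [
        { cells => [ { data => 'Tom' },    { data => 'Boulder' } ] },
        { cells => [ { data => 'Nathan' }, { data => 'Fort Collins' } ] },
    ],
};

# table_to_grid is a hypothetical helper: it returns a reference to an
# array of rows, each row being a reference to an array of cell contents.
sub table_to_grid {
    my ($t) = @_;
    return [
        map { [ map { $_->{data} } @{ $_->{cells} || [] } ] }
            @{ $t->{rows} || [] }
    ];
}

my $grid = table_to_grid($table);
print join(' | ', @$_), "\n" for @$grid;
```

Flattening like this drops the tag attributes, so it suits cases where only the cell contents matter.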

For example, take the following table:

<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>

The parse method returns this data
structure:

[
  {
    'width'   => '100%',
    'bgcolor' => '#ffffff',
    'rows'    => [
      {
        'cells' => [
          { 'data' => 'Larry &amp; Gloria' },
          { 'data' => 'Mountain View' },
          { 'data' => 'California' },
        ],
        'data' => "\n "
      },
      {
        'cells' => [
          { 'data' => '<b>Tom</b>' },
          { 'data' => 'Boulder' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n "
      },
      {
        'cells' => [
          { 'data' => 'Nathan &amp; Jenine' },
          { 'data' => 'Fort Collins' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n "
      }
    ]
  }
]

The data tags still contain tags and entities. If you don't want the
tags and entities, remove them by hand using techniques from Recipe 20.6.
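
As a rough sketch of that cleanup, the following strips tags with a naive regex and decodes entities with HTML::Entities, the same module Example 20-11 loads. The helper name clean_cell is hypothetical, and the regex is an assumption that holds only for simple markup: it breaks on a > inside an attribute value or a comment.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# clean_cell is a hypothetical helper, not part of HTML::TableContentParser.
sub clean_cell {
    my ($html) = @_;
    $html =~ s{<[^>]*>}{}g;               # naive tag removal; simple markup only
    my $text = decode_entities($html);    # e.g. &amp; becomes &
    $text =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
    return $text;
}

print clean_cell('<b>Tom</b>'), "\n";
print clean_cell('Larry &amp; Gloria'), "\n";
```

Tags are removed before entities are decoded so that an encoded &lt; in the text is not mistaken for the start of a tag.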

Example 20-11 fetches a particular CPAN author's page
and displays in plain text the modules they own. You could use this
as part of a system that notifies you when your favorite CPAN authors
do something new.

Example 20-11. Dump modules for a particular CPAN author


#!/usr/bin/perl -w
# dump-cpan-modules-for-author - display modules a CPAN author owns
use LWP::Simple;
use URI;
use HTML::TableContentParser;
use HTML::Entities;
use strict;

our $URL = shift || 'http://search.cpan.org/author/TOMC/';

my $tables  = get_tables($URL);
my $modules = $tables->[4];    # 5th table holds module data

foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) =
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
}

# fetch the page and return a reference to the list of parsed tables
sub get_tables {
    my $URL  = shift;
    my $page = get($URL);
    die "Couldn't fetch $URL\n" unless defined $page;
    my $tcp = HTML::TableContentParser->new;
    return $tcp->parse($page);
}

# pull name, absolute link, status, and description out of one table row
sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);

    # extract cells
    $module_html = $row->{cells}[0]{data};    # link and name in HTML
    $status      = $row->{cells}[1]{data};    # status string and link
    $description = $row->{cells}[2]{data};    # description only
    $status =~ s{<.*?>}{ }g;    # naive link removal, works on this simple HTML

    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL);    # resolve relative links

    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);

    return ($module_name, $module_link, $status, $description);
}
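
For the notification idea mentioned earlier, one possible approach is to keep the module names from a previous run and compare them with the current list. The sketch below works on two plain lists of names; the helper name diff_modules and the sample names are illustrative assumptions, and saving the previous list between runs (for example, in a file) is left out.

```perl
use strict;
use warnings;

# diff_modules is a hypothetical helper: given references to the old and
# new lists of module names, it returns two array references, the names
# that were added and the names that were removed.
sub diff_modules {
    my ($old, $new) = @_;
    my %was = map { $_ => 1 } @$old;
    my %is  = map { $_ => 1 } @$new;
    my @added   = grep { !$was{$_} } @$new;
    my @removed = grep { !$is{$_} } @$old;
    return (\@added, \@removed);
}

my @previous = ('Example::Alpha', 'Example::Beta');    # made-up names
my @current  = ('Example::Beta',  'Example::Gamma');

my ($added, $removed) = diff_modules(\@previous, \@current);
print "added: @$added\n";
print "removed: @$removed\n";
```

Comparing names alone will not notice a changed status or description; those fields would have to be stored and compared as well.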

20.19.4. See Also


The documentation for the CPAN module HTML::TableContentParser;
http://search.cpan.org

