Perl Cd Bookshelf [Electronic resources]

20.19. Extracting Table Data

20.19.1. Problem

You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.

20.19.2. Solution

Use the HTML::TableContentParser module from CPAN:

use HTML::TableContentParser;

$tcp    = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);

foreach $table (@$tables) {
    @headers = map { $_->{data} } @{ $table->{headers} };
    # attributes of table tag available as keys in hash
    $table_width = $table->{width};

    foreach $row (@{ $table->{rows} }) {
        # attributes of tr tag available as keys in hash
        foreach $cell (@{ $row->{cells} }) {
            # attributes of td tag available as keys in hash
            $data = $cell->{data};
        }
    }
}

20.19.3. Discussion

The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.

Each table, row, and cell is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cells key points to an array of cells. In a cell, the data key holds the HTML contents of the td tag.

For example, take the following table:

<table width="100%" bgcolor="#ffffff">
<tr>
<td>Larry &amp; Gloria</td>
<td>Mountain View</td>
<td>California</td>
</tr>
<tr>
<td><b>Tom</b></td>
<td>Boulder</td>
<td>Colorado</td>
</tr>
<tr>
<td>Nathan &amp; Jenine</td>
<td>Fort Collins</td>
<td>Colorado</td>
</tr>
</table>

The parse method returns this data structure:

[
  {
    'width'   => '100%',
    'bgcolor' => '#ffffff',
    'rows'    => [
      {
        'cells' => [
          { 'data' => 'Larry &amp; Gloria' },
          { 'data' => 'Mountain View' },
          { 'data' => 'California' },
        ],
        'data' => "\n      "
      },
      {
        'cells' => [
          { 'data' => '<b>Tom</b>' },
          { 'data' => 'Boulder' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n      "
      },
      {
        'cells' => [
          { 'data' => 'Nathan &amp; Jenine' },
          { 'data' => 'Fort Collins' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n      "
      }
    ]
  }
]
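
To make the nesting concrete, the following minimal sketch walks a hand-built copy of the first two rows of that structure (so it runs without the parser or any HTML) and flattens each row into a list of cell strings. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Hand-built stand-in for the value returned by parse(); only the
# keys used below are included.
my $tables = [
    {
        width => '100%',
        rows  => [
            { cells => [ { data => 'Larry &amp; Gloria' },
                         { data => 'Mountain View' },
                         { data => 'California' } ] },
            { cells => [ { data => '<b>Tom</b>' },
                         { data => 'Boulder' },
                         { data => 'Colorado' } ] },
        ],
    },
];

# Flatten table -> rows -> cells into an array of arrays of strings.
my @grid;
foreach my $row (@{ $tables->[0]{rows} }) {
    push @grid, [ map { $_->{data} } @{ $row->{cells} } ];
}

# Print one line per row, cells separated by " | ".
print join(" | ", @$_), "\n" for @grid;
```

The cell strings come through exactly as stored, markup and entities included.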

The cell data still contains tags and entities. If you don't want them, remove them by hand using techniques from Recipe 20.6.
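
As a rough illustration of that cleanup, the sketch below strips tags with a naive regular expression and then decodes entities with decode_entities from HTML::Entities (the same module Example 20-11 loads). The helper name plain_text is hypothetical, and the regex is only adequate for simple, well-formed markup like the sample table; Recipe 20.6 covers more robust approaches.

```perl
use strict;
use warnings;
use HTML::Entities;    # CPAN module, part of the HTML-Parser distribution

# plain_text is a hypothetical helper, not part of HTML::TableContentParser.
sub plain_text {
    my $html = shift;
    $html =~ s{<[^>]*>}{}g;            # naive tag removal
    return decode_entities($html);     # &amp; becomes &, and so on
}

# Sample cell contents taken from the table above.
my @cells = ('Larry &amp; Gloria', '<b>Tom</b>', 'Fort Collins');
my @clean = map { plain_text($_) } @cells;
print "$_\n" for @clean;
```

Stripping tags before decoding entities matters: decoding first could turn an escaped &lt; into a literal < that the tag regex would then treat as markup.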

Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.

Example 20-11. Dump modules for a particular CPAN author

#!/usr/bin/perl -w
# dump-cpan-modules-for-author - display modules a CPAN author owns
use LWP::Simple;
use URI;
use HTML::TableContentParser;
use HTML::Entities;
use strict;

our $URL = shift || 'http://search.cpan.org/author/TOMC/';
my $tables  = get_tables($URL);
my $modules = $tables->[4];    # 5th table holds module data

foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) =
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
}

sub get_tables {
    my $URL  = shift;
    my $page = get($URL);
    die "Couldn't fetch $URL\n" unless defined $page;   # get returns undef on failure
    my $tcp  = HTML::TableContentParser->new;
    return $tcp->parse($page);
}

sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);

    # extract cells
    $module_html = $row->{cells}[0]{data};  # link and name in HTML
    $status      = $row->{cells}[1]{data};  # status string and link
    $description = $row->{cells}[2]{data};  # description only
    $status =~ s{<.*?>}{  }g; # naive link removal, works on this simple HTML

    # separate module link and name from HTML
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL); # resolve relative links

    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);

    return ($module_name, $module_link, $status, $description);
}
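
One way to turn this output into the notification system mentioned above is to compare the module names from the current run against names saved from an earlier run. The sketch below shows only that comparison, on hard-coded sample lists; the module names, the new_items helper, and the idea of storing the previous list are assumptions for illustration, not part of the original program.

```perl
use strict;
use warnings;

# new_items is a hypothetical helper: returns the elements of @$current
# that do not appear in @$previous, preserving their order.
sub new_items {
    my ($previous, $current) = @_;
    my %seen = map { $_ => 1 } @$previous;
    return grep { !$seen{$_} } @$current;
}

# Made-up sample data standing in for two runs of the program.
my @last_run = ('Alpha::One', 'Beta::Two');
my @this_run = ('Alpha::One', 'Beta::Two', 'Gamma::Three');

my @added = new_items(\@last_run, \@this_run);
print "New module: $_\n" for @added;
```

In practice the earlier list would have to be saved somewhere between runs, for example in a plain text file with one module name per line.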

20.19.4. See Also

The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org