20.19. Extracting Table Data
20.19.1. Problem
You have data in an HTML table, and you would like to turn that into
a Perl data structure. For example, you want to monitor changes to an
author's CPAN module list.
20.19.2. Solution
Use the HTML::TableContentParser module from CPAN:
use HTML::TableContentParser;

$tcp = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);

foreach $table (@$tables) {
    @headers = map { $_->{data} } @{ $table->{headers} };
    # attributes of table tag available as keys in hash
    $table_width = $table->{width};

    foreach $row (@{ $table->{rows} }) {
        # attributes of tr tag available as keys in hash
        foreach $col (@{ $row->{cells} }) {
            # attributes of td tag available as keys in hash
            $data = $col->{data};
        }
    }
}
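As a variation on the loop above, header text can be paired with each row's cell data to give one hash per row. The following is only a sketch: $table is a hand-built stand-in for one element of the list that parse returns, using the rows, cells, and data keys shown in the Discussion, and rows_as_records is a hypothetical helper name, not something the module provides.

```perl
use strict;
use warnings;

# Hand-built stand-in for one parsed table (an assumption for illustration;
# real input would come from $tcp->parse($HTML)).
my $table = {
    headers => [ { data => 'Name' }, { data => 'City' }, { data => 'State' } ],
    rows    => [
        { cells => [ { data => 'Tom' },    { data => 'Boulder' },      { data => 'Colorado' } ] },
        { cells => [ { data => 'Nathan' }, { data => 'Fort Collins' }, { data => 'Colorado' } ] },
    ],
};

# rows_as_records is a hypothetical helper: returns one hash ref per row,
# keyed by the header text.
sub rows_as_records {
    my ($table) = @_;
    my @headers = map { $_->{data} } @{ $table->{headers} || [] };
    my @records;
    foreach my $row (@{ $table->{rows} || [] }) {
        my @cells = map { $_->{data} } @{ $row->{cells} || [] };
        next unless @cells;            # skip rows that hold no data cells
        my %record;
        @record{@headers} = @cells;    # hash slice pairs headers with cells
        push @records, \%record;
    }
    return @records;
}

my @records = rows_as_records($table);
print "$_->{Name} lives in $_->{City}, $_->{State}\n" for @records;
```

Keying on header text only makes sense when the headers are unique and free of markup. Otherwise, positional access through the cells array is the safer choice.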
20.19.3. Discussion
The HTML::TableContentParser module converts all tables in the HTML
document into a Perl data structure. As with HTML tables, there are
three layers of nesting in the data structure: the table, the row,
and the data in that row.

Each table, row, and data tag is represented as a hash reference. The
hash keys correspond to attributes of the tag that defined that
table, row, or cell. In addition, the value for a special key gives
the contents of the table, row, or cell. In a table, the value for
the rows key is a reference to an array of rows. In a row, the cells
key points to an array of cells. In a cell, the data key holds the
HTML contents of the data tag.

For example, take the following table:
<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>

The parse method returns this data structure:

[
  {
    'width' => '100%',
    'bgcolor' => '#ffffff',
    'rows' => [
      {
        'cells' => [
          { 'data' => 'Larry &amp; Gloria' },
          { 'data' => 'Mountain View' },
          { 'data' => 'California' },
        ],
        'data' => "\n "
      },
      {
        'cells' => [
          { 'data' => '<b>Tom</b>' },
          { 'data' => 'Boulder' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n "
      },
      {
        'cells' => [
          { 'data' => 'Nathan &amp; Jenine' },
          { 'data' => 'Fort Collins' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n "
      }
    ]
  }
]
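One way to work with a structure of this shape is to flatten each table into a list of rows, each row being an array of cell strings. The sketch below builds part of the structure by hand so that it runs without fetching or parsing anything. table_to_grid is a hypothetical helper name, and the entity in the first cell is written as it would appear in the raw HTML.

```perl
use strict;
use warnings;

# Hand-built copy of part of the parsed structure (an assumption for
# illustration; real input would come from $tcp->parse($HTML)).
my $tables = [
    {
        width => '100%',
        rows  => [
            { cells => [ { data => 'Larry &amp; Gloria' },
                         { data => 'Mountain View' },
                         { data => 'California' } ] },
            { cells => [ { data => '<b>Tom</b>' },
                         { data => 'Boulder' },
                         { data => 'Colorado' } ] },
        ],
    },
];

# table_to_grid is a hypothetical helper: returns one array ref of cell
# strings per row, in document order.
sub table_to_grid {
    my ($table) = @_;
    my @grid;
    foreach my $row (@{ $table->{rows} || [] }) {
        push @grid, [ map { $_->{data} } @{ $row->{cells} || [] } ];
    }
    return @grid;
}

my @grid = table_to_grid($tables->[0]);
print join("\t", @$_), "\n" for @grid;           # one tab-separated line per row
print "Second row, first cell: $grid[1][0]\n";   # still contains the <b> tags
```

The grid keeps the cell HTML untouched, so any tag or entity cleanup remains a separate step.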
The data values still contain HTML tags and entities. If you don't
want them, remove them by hand using the techniques from Recipe 20.6.

Example 20-11 fetches a particular CPAN author's page and displays in
plain text the modules they own. You could use this as part of a
system that notifies you when your favorite CPAN authors do something
new.
Example 20-11. Dump modules for a particular CPAN author
#!/usr/bin/perl -w
# dump-cpan-modules-for-author - display modules a CPAN author owns
use LWP::Simple;
use URI;
use HTML::TableContentParser;
use HTML::Entities;
use strict;

our $URL = shift || 'http://search.cpan.org/author/TOMC/';
my $tables = get_tables($URL);
my $modules = $tables->[4];    # 5th table holds module data

foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) =
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
}

# fetch the page and parse every table on it
sub get_tables {
    my $URL = shift;
    my $page = get($URL);
    die "Couldn't fetch $URL\n" unless defined $page;
    my $tcp = HTML::TableContentParser->new;
    return $tcp->parse($page);
}

# pull the name, link, status, and description out of one table row
sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);

    # extract cells
    $module_html = $row->{cells}[0]{data};    # link and name in HTML
    $status      = $row->{cells}[1]{data};    # status string and link
    $description = $row->{cells}[2]{data};    # description only
    $status =~ s{<.*?>}{ }g;  # naive link removal, works on this simple HTML

    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL);  # resolve relative links

    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);

    return ($module_name, $module_link, $status, $description);
}
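The Discussion notes that cell data still contains tags and entities. A minimal sketch of cleaning a single cell is shown here, using decode_entities from HTML::Entities together with a deliberately naive tag-stripping regex. cell_text is a hypothetical helper name, and the regex is only an assumption suited to simple markup like the sample table. It will misfire on comments, scripts, or a ">" inside an attribute value.

```perl
use strict;
use warnings;
use HTML::Entities;    # provides decode_entities

# cell_text is a hypothetical helper: reduce a cell's HTML to plain text.
sub cell_text {
    my ($html) = @_;
    return '' unless defined $html;
    (my $text = $html) =~ s{<[^>]*>}{}g;   # naive: drop anything tag-shaped
    $text = decode_entities($text);         # &amp; becomes &, and so on
    $text =~ s/\s+/ /g;                     # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;                # trim both ends
    return $text;
}

print cell_text('Larry &amp; Gloria'), "\n";   # prints "Larry & Gloria"
print cell_text('<b>Tom</b>'), "\n";           # prints "Tom"
```

Tags are stripped before entities are decoded so that an encoded "&lt;" in the text is not mistaken for the start of a tag.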
20.19.4. See Also
The documentation for the CPAN module HTML::TableContentParser;
http://search.cpan.org