20.6. Extracting or Removing HTML Tags
20.6.1. Problem
You want to remove HTML tags from a string, leaving just plain text.
For example, you are indexing a document but don't want your index to
show "words" like <B> and <body>.
20.6.2. Solution
The following oft-cited solution is simple but wrong on all but the
most trivial HTML:
($plain_text = $html_text) =~ s/<[^>]*>//gs; # WRONG
A correct but slower and slightly more complicated way is to use the
technique from Recipe 20.5:
use HTML::FormatText 2;
$plain_text = HTML::FormatText->format_string($html_text);
20.6.3. Discussion
As with almost everything else in Perl, there is more than one way to
do it. Each solution attempts to strike a balance between speed and
flexibility. Occasionally you may find HTML that's simple enough that
a trivial command-line call works:
% perl -pe 's/<[^>]*>//g' file
However, this breaks with files whose tags cross line boundaries,
like this:
<IMG SRC = "foo.gif"
     ALT = "Flurp!">
So, you'll see people doing this instead:
% perl -0777 -pe 's/<[^>]*>//gs' file
or its scripted equivalent:
{
    local $/;               # temporary whole-file input mode
    $html = <FILE>;
    $html =~ s/<[^>]*>//gs;
}
But even that isn't good enough except for simplistic HTML without
any interesting bits in it. This approach fails for the following
examples of valid HTML (among many others):
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
If HTML comments include other tags, those solutions would also break
on text like this:
<!-- This section commented out.
    <B>You can't see me!</B>
-->
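To make the failure concrete, here is a small sketch that runs the naive substitution over a few of the problem inputs above. The helper name naive_strip and the sample strings are illustrative only (the image name is a stand-in); the point is what the regex leaves behind, not how you should strip tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The oft-cited substitution from the Solution, wrapped in a
# hypothetical helper so it can be applied to several inputs.
sub naive_strip {
    my ($html) = @_;
    (my $plain = $html) =~ s/<[^>]*>//gs;
    return $plain;
}

my @samples = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',   # leaves:  B">
    '<!-- <A comment> -->',                  # leaves:  -->
    "<!-- This section commented out.\n<B>You can't see me!</B>\n-->",
                                             # leaves the hidden text plus -->
);

for my $html (@samples) {
    print "[", naive_strip($html), "]\n";
}
```

In every case the match stops at the first > it meets, whether that character sits inside a quoted attribute value or inside a comment, so fragments of markup (or text that was supposed to be commented out) end up in the "plain" output.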
The only solution that works well here is to use the HTML parsing
routines from CPAN. The second code snippet shown in the Solution
demonstrates this better technique.
For more flexible parsing, subclass the HTML::Parser class and record
only the text elements you see:
package MyParser;
use HTML::Parser;
use HTML::Entities qw(decode_entities);
@ISA = qw(HTML::Parser);
sub text {
    my($self, $text) = @_;
    print decode_entities($text);
}
package main;
MyParser->new->parse_file(*F);
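If you would rather collect the text in a string than print it, one possible variation uses the handler-style interface of HTML::Parser version 3 instead of subclassing. This is a minimal sketch, assuming HTML::Parser 3.x is installed; the function name html_to_text and the sample markup are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' requests text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip these elements and everything inside them
    $p->ignore_elements(qw(script style));
    $p->parse($html);
    $p->eof;    # signal end of input so pending text is flushed
    return $text;
}

print html_to_text('<p>Fish &amp; <b>chips</b><!-- <i>hidden</i> --></p>'), "\n";
```

Because only a text handler is registered, tags and comments are parsed but produce no output, and the call to ignore_elements keeps JavaScript and stylesheet source out of the result.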
If you're only interested in simple tags
that don't contain others nested inside, you can often make do with
an approach like the following, which extracts the title from a
non-tricky HTML document:
($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
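As a quick check of where that pattern does and does not work, the sketch below wraps it in a helper and tries it on two inputs. The name title_of and both sample documents are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the title-matching pattern shown above.
sub title_of {
    my ($html) = @_;
    my ($title) = ($html =~ m#<TITLE>\s*(.*?)\s*</TITLE>#is);
    return $title;    # undef when the pattern does not match
}

# A plain title, split across lines with surrounding whitespace
my $plain  = "<html><head><title>\n  My Page\n</title></head></html>";
# The same title, but the opening tag carries an attribute
my $tricky = '<html><head><title lang="en">My Page</title></head></html>';

for my $doc ($plain, $tricky) {
    my $title = title_of($doc);
    print defined $title ? "title: $title\n" : "no title found\n";
}
```

The /i and /s modifiers cope with letter case and embedded newlines, and the \s* on each side trims the whitespace, but the literal <TITLE> cannot match an opening tag that has attributes, so the second document yields no title.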
Again, the regex approach has its flaws, so a more complete solution
using LWP to process the HTML is shown in Example 20-4.
Example 20-4. htitle
#!/usr/bin/perl
# htitle - get html title from URL
use LWP;
die "usage: $0 url ...\n" unless @ARGV;
foreach $url (@ARGV) {
    $ua = LWP::UserAgent->new();
    $res = $ua->get($url);
    print "$url: " if @ARGV > 1;
    if ($res->is_success) {
        print $res->title, "\n";
    } else {
        print $res->status_line, "\n";
    }
}
Here's an output example:
% htitle http://www.ora.com
www.oreilly.com -- Welcome to O'Reilly & Associates!
% htitle http://www.perl.com/ http://www.perl.com/nullvoid
http://www.perl.com/: The www.perl.com Home Page
http://www.perl.com/nullvoid: 404 File Not Found
20.6.4. See Also
The documentation for the CPAN modules HTML::TreeBuilder,
HTML::Parser, HTML::Entities, and LWP::UserAgent; Recipe 20.5