22.9. Reading and Writing RSS Files
22.9.1. Problem
You want to create a Rich Site Summary (RSS) file, or read one produced by
another application.
22.9.2. Solution
Use the CPAN module XML::RSS to read an existing RSS
file:
use XML::RSS;
my $rss = XML::RSS->new;
$rss->parsefile($RSS_FILENAME);
my @items = @{$rss->{items}};
foreach my $item (@items) {
    print "title: $item->{'title'}\n";
    print "link: $item->{'link'}\n\n";
}
To create an RSS file:
use XML::RSS;
my $rss = XML::RSS->new (version => $VERSION);
$rss->channel(title       => $CHANNEL_TITLE,
              link        => $CHANNEL_LINK,
              description => $CHANNEL_DESC);
$rss->add_item(title       => $ITEM_TITLE,
               link        => $ITEM_LINK,
               description => $ITEM_DESC,
               name        => $ITEM_NAME);
print $rss->as_string;
22.9.3. Discussion
There are at least four variations of RSS extant: 0.9, 0.91, 1.0, and
2.0. At the time of this writing, XML::RSS understood all but RSS
2.0. Each version has different capabilities, so methods and
parameters depend on which version of RSS you're using. For example,
RSS 1.0 supports RDF and uses the Dublin Core metadata (http://dublincore.org/). Consult the
documentation for what you can and cannot call.
XML::RSS uses XML::Parser to parse the RSS. Unfortunately, not all
RSS files are well-formed XML, let alone valid. The XML::RSSLite
module on CPAN offers a looser approach to parsing RSS: it uses
regular expressions and is much more forgiving of incorrect XML.
Example 22-13 uses XML::RSSLite and LWP::Simple to
download The Guardian's RSS feed and print out the items whose
titles contain the keywords we're interested in.
Example 22-13. rss-parser
#!/usr/bin/perl -w
# guardian-list -- list Guardian articles matching keyword
use XML::RSSLite;
use LWP::Simple;
use strict;
# list of keywords we want
my @keywords = qw(perl internet porn iraq bush);
# get the RSS
my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
my $content = get($URL);
die "Couldn't fetch $URL\n" unless defined $content;
# parse the RSS
my %result;
parseRSS(\%result, \$content);
# build the regex from keywords
my $re = join "|", @keywords;
$re = qr/\b(?:$re)\b/i;
# print report of matching items
foreach my $item (@{ $result{items} }) {
    my $title = $item->{title};
    # collapse runs of whitespace, then trim both ends
    $title =~ s{\s+}{ }g; $title =~ s{^\s+}{}; $title =~ s{\s+$}{};
    if ($title =~ /$re/) {
        print "$title\n\t$item->{link}\n\n";
    }
}
The following is sample output from Example 22-13:
UK troops to lead Iraq peace force
http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00l?=rss
Shia cleric challenges Bush plan for Iraq
http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00l?=rss
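The keyword matching in Example 22-13 can be tried without a network connection. The following is a minimal sketch using only core Perl. The helper names build_keyword_regex and normalize_title and the sample titles are made up for illustration and are not part of either RSS module. It also passes each keyword through quotemeta, which the example above does not do, on the assumption that keywords are meant as literal words rather than regex fragments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one case-insensitive regex that matches any
# of the keywords as a whole word. quotemeta keeps characters such as
# "+" or "." in a keyword from being treated as regex syntax.
sub build_keyword_regex {
    my @keywords = @_;
    my $alternation = join "|", map { quotemeta($_) } @keywords;
    return qr/\b(?:$alternation)\b/i;
}

# Hypothetical helper: collapse whitespace runs and trim both ends.
sub normalize_title {
    my ($title) = @_;
    $title =~ s/\s+/ /g;
    $title =~ s/^\s+//;
    $title =~ s/\s+$//;
    return $title;
}

# Made-up titles standing in for the items of a parsed feed.
my @titles = (
    "  UK troops to lead\n Iraq peace force ",
    "Bushfires spread in Australia",
    "Learning Perl in a weekend",
);

my $re = build_keyword_regex(qw(perl iraq bush));
for my $title (map { normalize_title($_) } @titles) {
    print "$title\n" if $title =~ $re;
}
```

Because of the \b anchors, "Bushfires" does not match the keyword "bush", so only the first and third titles are printed.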
We can combine the keyword filter from Example 22-13 with XML::RSS to generate a new RSS feed from the
filtered items. It would be easier, of course, to do it all with
XML::RSS, but this way you get to see both modules in action.
Example 22-14 shows the finished program.
Example 22-14. rss-filter
#!/usr/bin/perl -w
# guardian-filter -- filter the Guardian's RSS feed by keyword
use XML::RSSLite;
use XML::RSS;
use LWP::Simple;
use strict;
# list of keywords we want
my @keywords = qw(perl internet porn iraq bush);
# get the RSS
my $URL = 'http://www.guardian.co.uk/rss/1,,,00.xml';
my $content = get($URL);
die "Couldn't fetch $URL\n" unless defined $content;
# parse the RSS
my %result;
parseRSS(\%result, \$content);
# build the regex from keywords
my $re = join "|", @keywords;
$re = qr/\b(?:$re)\b/i;
# make new RSS feed
my $rss = XML::RSS->new(version => '0.91');
$rss->channel(title       => $result{title},
              link        => $result{link},
              description => $result{description});
foreach my $item (@{ $result{items} }) {
    my $title = $item->{title};
    # collapse runs of whitespace, then trim both ends
    $title =~ s{\s+}{ }g; $title =~ s{^\s+}{}; $title =~ s{\s+$}{};
    if ($title =~ /$re/) {
        $rss->add_item(title => $title, link => $item->{link});
    }
}
print $rss->as_string;
Here's an example of the RSS feed it produces:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
<channel>
<title>Guardian Unlimited</title>
<link>http://www.guardian.co.uk</link>
<description>Intelligent news and comment throughout the day from The Guardian
newspaper</description>
<item>
<title>UK troops to lead Iraq peace force</title>
<link>http://www.guardian.co.uk/Iraq/Story/0,2763,989318,00l?=rss</link>
</item>
<item>
<title>Shia cleric challenges Bush plan for Iraq</title>
<link>http://www.guardian.co.uk/Iraq/Story/0,2763,989364,00l?=rss</link>
</item>
</channel>
</rss>
22.9.4. See Also
The documentation for the modules XML::RSS and XML::RSSLite