Google Hacks 2Nd Edition [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Google Hacks 2Nd Edition [Electronic resources] - نسخه متنی

Tara Calishain

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید







Hack 54. Scrape Google News

Scrape Google News Search results to get at the
latest from thousands of aggregated news sources .

Google News, with its thousands of news sources worldwide, is a
veritable treasure trove for any news hound. However, because you
can't access Google News through the Google API
[Chapter 9], you'll have to
scrape your results from the HTML of a Google News results page. This
hack does just that, gathering up results into a comma-delimited file
suitable for loading into a spreadsheet or database. For each news
story, it extracts the title, URL, source (i.e., news agency),
publication date or age of the news item, and an excerpted
description.

Because Google's Terms of Service prohibit the
automated access of their search engines except through the Google
API, this hack does not actually connect to Google. Instead, it works
on a page of results that you've saved from a Google
News Search that you've run yourself. Simply save
the results page as HTML source using your browser's
FileSave As... command.

Make sure the results are listed by date instead of relevance. When
results are listed by relevance, some of the descriptions are missing
because similar stories are clumped together. You can sort results by
date by choosing the "Sort by date"
link on the results page or by adding
&scoring=d to the end of the results URL.
Also, make sure you're getting the maximum number of
results by adding &num=100 to the end of the
results URL. For example, Figure 4-3 shows results
of a query for election 2004, the latest on the
2004 United States Presidential Electionsomething of great
import at the time of this writing.


Figure 4-3. Google News results for election 2004, sorted by date



Bear in mind that scraping web pages is a brittle occupation. A
single change in the HTML code underlying the standard Google News
results page and you're more than likely to end up
with no results.

At the time of this writing, a typical Google News Search result
looks a little something like this:

<a href="http://www.townhall.com/columnists/phyllisschlafly/ps20041018.shtml" id=r-1>
Show Me State debate spotlights conservative issues</a><br><font size=-1>
<font color=#6f6f6f>Town Hall,&nbsp;DC&nbsp;-</font> <nobr>8
minutes ago</nobr></font><br><font size=-1><b>...</b>
The <b>2004</b> <b>election</b> presents a stark choice to voters
on many issues. The second debate highlighted those differences on taxes, sovereignty
<b>...</b> </font><br>

While for most of you this is utter gobbledygook, it is probably of
some use to those trying to understand the regular expression
matching in the following script.


4.8.1. The Code


Save this code to a file called news2csv.pl:

#!/usr/bin/perl
# news2csv.pl
# Google News Results exported to CSV suitable for import into Excel.
# Usage: perl news2csv.pl < newsl > news.csv
print qq{"title","link","source","date or age", "description"\n};
my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&',
'&quot;'=>'"', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
my $results = join '', <>;
$results =~ s/($unescape_re)/$unescape{$1}/migs; # unescape HTML
$results =~ s![\n\r]! !migs; # drop spurious newlines
while ( $results =~ m!<a href="]+)" ?>(.+?)</a>
<br>(.+?)<nobr>(.+?)</nobr>.*?<br>.+?)<br>migs {
my($url, $title, $source, $date_age, $description) =
($1||'',$2||'',$3||'',$4||'', $5||'');
$title =~ s!"!"!g; # double escape " marks
$description =~ s!"!"!g;
my $output =
qq{"$title","$url","$source","$date_age","$description"\n};
$output =~ s!<.+?>!!g; # drop all HTML tags
print $output;
}

4.8.2. Running the Script


Run the script from the command line ["Running the
Hacks" in the the Preface], specifying the Google
News results HTML filename and the name of the CSV file that you wish
to create or to which you wish to append additional results. For
example, using newsl as our input and
news.csv as our output:

$ perl news2csv.pl < newsl
> news.csv Leaving off the > and CSV filename sends the
results to the screen for your perusal.


4.8.3. The Results


The following are some of the 20,300 results returned by a Google
News Search for election 2004 and using the HTML
page of results shown in Figure 4-3:

$ perl news2csv.pl < newsl
"title","link","source","date or age", "description"
"Bush and Kerry on the razor's edge","http://www.dallasnews.com/sharedcontent/dws/dn/
opinion/columnists/mdavis/stories/101304dnedidavis.97af9l","Dallas Morning News
(subscription),<A0>TX<A0>- ","12 minutes ago","... But just as the election
is a referendum on Mr. Bush, debate analysis depends on ... Tonight history will write
whether the Bush debate grade for 2004 is generally ... "
"MINNESOTA: Notes and quotes from battleground Minnesota in ...",
"http://www.twincities.com
/mld/twincities/news/state/minnesota/9902651","Pioneer Press (subscription),<A0>
MN<A0>- ","16 minutes ago","... the state. Both sides had assumed Minnesota would
easily go Democratic, as it had for every presidential election since 1972. In ... "
"MINNESOTA: Notes and quotes from battleground Minnesota in ...","http://www.
duluthsuperior.com/mld/duluthsuperior/news/politics/9902651","Duluth News Tribune,<
A0>MN<A0>- ","19 minutes ago","... the state. Both sides had assumed Minnesota
would easily go Democratic, as it had for every presidential election since 1972. In ... "
...


Each listing actually occurs on its own line.


4.8.4. Hacking the Hack


You'll want to leave most of the
news2csv.pl script alone, since
it's been built to make sense out of the Google News
formatting. If you don't like the way the program
organizes the information that's taken out of the
results page, you can change it. Just rearrange the variables on the
following line, sorting them in any way that you choose. Be sure to
keep a comma between each one.

my $output =
qq{"$title","$url","$source","$date_age","$description"\n};

For example, perhaps you want only the URL and title. The line should
read:

my $output =
qq{"$url","$title"\n};

That \n specifies a new line, and the
$ characters specify that $url
and $title are variable names; keep them intact.

Of course, now your output won't match the header at
the top of the CSV file, by default:

print qq{"title","link","source","date or age", "description"\n};

As before, simply change this to match, as follows:

print qq{"url","title"\n};


/ 209