Google Hacks, 2nd Edition [Electronic resources] (text version)

Hack 41. Scrape Yahoo! Buzz for a Google Search

A proof-of-concept hack scrapes the buzziest items from Yahoo! Buzz
and submits them to a Google search.

No web site is an island. Billions
of hyperlinks link to billions of documents. Sometimes, however, you
want to take information from one site and apply it to another site.

Unless that site has a web service API like Google's, your best bet
is scraping. Scraping means using an automated program to extract
specific bits of information from a web page. People scrape stock
quotes, news headlines, prices, and so forth. You name it and
someone's probably scraped it.

There's some controversy about scraping. Some sites
don't mind it, while others can't
stand it. If you decide to scrape a site, do it gently; take the
minimum amount of information you need and, whatever you do,
don't hog the scrapee's bandwidth.
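
One way to go easy on the scrapee's bandwidth is to keep a local copy of the page and download it again only once the copy is stale. The following is a minimal sketch of that idea, not part of the original hack. The cached_fetch function, the cache filename, and the one-hour limit are all hypothetical, and a stub subroutine stands in for the real download so the example runs without network access:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return the cached copy if the cache file is younger than $max_age
# seconds; otherwise call $fetch (a code ref) and store what it returns.
sub cached_fetch {
    my ($cache_file, $max_age, $fetch) = @_;

    if (-e $cache_file && time() - (stat $cache_file)[9] < $max_age) {
        open my $in, '<', $cache_file or die "Can't read $cache_file: $!";
        local $/;                       # slurp the whole file
        return scalar <$in>;
    }

    my $content = $fetch->();
    return unless defined $content;     # fetch failed; cache nothing

    open my $out, '>', $cache_file or die "Can't write $cache_file: $!";
    print {$out} $content;
    close $out or die "Can't close $cache_file: $!";
    return $content;
}

# Demonstration with a stub in place of a real download.
my $dir   = tempdir(CLEANUP => 1);
my $calls = 0;
my $stub  = sub { $calls++; return "<html>buzz</html>" };

cached_fetch("$dir/buzz.html", 3600, $stub) for 1 .. 3;
print "Fetched $calls time(s) for 3 requests\n";
```

In a real script, the stub would be replaced by something like sub { get("http://buzz.yahoo.com/") } using LWP::Simple.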

So, what are we scraping?

Google has a query popularity page called Google Zeitgeist
(http://www.google.com/press/zeitgeist.html). Unfortunately, the
Zeitgeist is updated only once a week and contains only a limited
amount of scrapable data. That's where Yahoo! Buzz
(http://buzz.yahoo.com) comes in. The site is rich with constantly
updated information. Its Buzz Index keeps tabs on what's hot in
popular culture: celebs, games, movies, television shows, music, and
more.

This hack grabs the buzziest of the buzz, the top of the Leaderboard,
and searches Google for all it knows on the subject. And to keep
things current, only pages indexed by Google within the past few days
[Hack #16] are considered.
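
The date limit relies on Google's daterange: operator, which takes a pair of Julian day numbers. The sketch below shows how such a query string can be put together using only core Perl. It is an illustration under stated assumptions rather than the hack's own code: the julian_day_utc and daterange_query functions are hypothetical names, and the calculation works in UTC, whereas the script uses Time::JulianDay's local_julian_day, so the two can differ by a day around midnight:

```perl
use strict;
use warnings;

# Julian day number for the UTC date containing $epoch.
# 2440588 is the Julian day number of 1970-01-01, the Unix epoch date.
sub julian_day_utc {
    my ($epoch) = @_;
    return int($epoch / 86400) + 2440588;
}

# Build a quoted-phrase query limited to the last $days_back days.
sub daterange_query {
    my ($phrase, $days_back, $epoch) = @_;
    my $today = julian_day_utc($epoch);
    return sprintf '"%s" daterange:%d-%d',
        $phrase, $today - $days_back, $today;
}

print daterange_query('Maria Sharapova', 3, time()), "\n";
```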

This hack requires additional Perl modules: Time::JulianDay
(http://search.cpan.org/search?query=Time%3A%3AJulianDay) and
LWP::Simple (http://search.cpan.org/search?query=LWP%3A%3ASimple).
It won't run without them.

2.23.1. The Code

Save the following code to a plain text file named
buzzgle.pl:

#!/usr/local/bin/perl
# buzzgle.pl
# Pull the top item from the Yahoo! Buzz Index and query the last
# three days' worth of Google's index for it.
# Usage: perl buzzgle.pl

use strict;

use SOAP::Lite;
use LWP::Simple;
use Time::JulianDay;

# Your Google API developer's key.
my $google_key = 'insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wsdl = "./GoogleSearch.wsdl";

# Number of days back to go in the Google index.
my $days_back = 3;

# Scrape the top item from the Yahoo! Buzz Index.

# Grab a copy of http://buzz.yahoo.com.
# (LWP::Simple's get returns undef on failure and does not set $!.)
my $buzz_content = get("http://buzz.yahoo.com/")
  or die "Couldn't grab the Yahoo Buzz\n";

# Find the first item on the Buzz Index list.
my($buzziest) = $buzz_content =~ m!http://search.yahoo.com/search\?p=.+">(.+?)<\/a>!i;
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

# Figure out today's Julian date.
my $today = int local_julian_day(time);

# Build the Google query.
my $query = "\"$buzziest\" daterange:" . ($today - $days_back) . "-$today";

print
  "The buzziest item on Yahoo Buzz today is: $buzziest\n",
  "Querying Google for: $query\n",
  "Results:\n\n";

# Create a new SOAP::Lite instance, feeding it GoogleSearch.wsdl.
my $google_search = SOAP::Lite->service("file:$google_wsdl");

# Query Google.
my $results = $google_search->doGoogleSearch(
    $google_key, $query, 0, 10, "false", "", "false",
    "", "latin1", "latin1"
);

# No results?
@{$results->{resultElements}} or die "No results";

# Loop through the results.
foreach my $result (@{$results->{'resultElements'}}) {
  my $output =
    join "\n",
      $result->{title} || "no title",
      $result->{URL},
      $result->{snippet} || 'no snippet',
      "\n";
  $output =~ s!<.+?>!!g; # drop all HTML tags
  print $output;
}

2.23.2. Running the Hack

The script runs from the command line ["How to Run
the Hacks" in the Preface] without need of arguments
of any kind. Probably the best thing to do is to direct the output to
a pager (a command-line application that allows you to page through
long output, usually by hitting the spacebar), like so:

% perl buzzgle.pl | more

Or you can direct the output to a file for later perusal:

% perl buzzgle.pl > buzzgle.txt

As with all scraping applications, this code is fragile, subject to
breakage if (read: when) the HTML formatting of the Yahoo! Buzz page
changes. When that happens, you'll have to alter the regular
expression match to suit Yahoo!'s new formatting:

my($buzziest) = $buzz_content =~ m!http://search.yahoo.com/search\?p=.+">(.+?)<\/a>!i;
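
When adjusting the pattern, it can help to try it against a saved fragment of the page rather than the live site. The sketch below does that with a made-up fragment; the markup in $sample is hypothetical, as is the first_buzz_item function, and the real page will differ. This variant of the pattern uses [^"]+ in place of .+ so the match cannot run past the end of the first link's href, and it tolerates whitespace around the link text:

```perl
use strict;
use warnings;

# Hypothetical markup standing in for a saved copy of the Buzz page.
my $sample = <<'HTML';
<td><a href="http://search.yahoo.com/search?p=Maria+Sharapova&cs=bz">Maria Sharapova</a></td>
<td><a href="http://search.yahoo.com/search?p=Halloween&cs=bz"> Halloween </a></td>
HTML

# Return the text of the first Yahoo! search link, or undef if none.
sub first_buzz_item {
    my ($html) = @_;
    my ($item) = $html =~
        m!http://search\.yahoo\.com/search\?p=[^"]+">\s*(.+?)\s*</a>!i;
    return $item;
}

my $item = first_buzz_item($sample);
print defined $item
    ? "Top item: $item\n"
    : "No match; the pattern needs adjusting\n";
```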

Regular expressions and general HTML scraping are beyond the scope of
this book. For more information, I suggest you consult O'Reilly's
Perl and LWP (http://www.oreilly.com/catalog/perllwp) or Mastering
Regular Expressions (http://www.oreilly.com/catalog/regex).

2.23.3. The Results

At the time of this writing, Maria Sharapova, the Russian tennis
star, is all the rage:

% perl buzzgle.pl | less
The buzziest item on Yahoo Buzz today is: Maria Sharapova
Querying Google for: "Maria Sharapova" daterange:2453292-2453295
Results:

Maria Sharapova
http://www.mariaworld.net/
everything about Maria Sharapova: photos, interviews, articles, statistics, results and
much more! ... Maria Sharapova: 2004 Tokyo Champion! ...

Maria Sharapova
http://www.mariaworld.net/photos
everything about Maria Sharapova: photos, interviews, articles, statistics, results and
much more! HOME, BIOGRAPHY, PHOTOS, RESULTS, ...

Maria Sharapova Picture Page
http://milano.vinden.nl/
Maria Sharapova Picture Page. Country: Russia. Date of Birth: April 19, 1987. Place of
Birth: Nyagan, Russia. Residence: Bradenton, Florida USA. Height: 1.83 metres ...

2.23.4. Hacking the Hack

Here are some ideas for hacking the hack:

As it stands, the program returns 10 results. You could change that
to one result and immediately open that result instead of returning a
list. Bravo, you've just written
I'm Feeling Popular, as in Google's
I'm Feeling Lucky.

This version of the program searches the last three days of indexed
pages. Because there's a slight lag in indexing news stories, I would
search at least the last two days' worth of indexed pages, but you
could extend it to seven days or even a month. Simply alter the value
of the $days_back variable in the line my $days_back = 3;.

You could create a "Buzz Effect"
hack by running the Yahoo! Buzz query with and without the date range
limitation. How do the results change between a full search and a
search of the last few days?

Yahoo!'s Buzz has several different sections. This one looks at the
Buzz summary, but you could create other ones based on Yahoo!'s other
buzz charts (television, http://buzz.yahoo.com/television/, for
instance).
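
As a starting point for the "Buzz Effect" idea above, here is a minimal sketch of how two result lists might be compared. It assumes each list is shaped like the resultElements array the script loops over (hashes with a URL key); the buzz_effect function and the sample data are hypothetical, standing in for the results of two doGoogleSearch calls, one with and one without the daterange: limit:

```perl
use strict;
use warnings;

# Compare two result lists (array refs of hashes with a URL key).
# Returns the URLs found only in the date-limited search and the
# URLs found in both searches, in the order the recent search gave them.
sub buzz_effect {
    my ($full, $recent) = @_;
    my %in_full = map { $_->{URL} => 1 } @$full;

    my (@only_recent, @shared);
    for my $result (@$recent) {
        if ($in_full{ $result->{URL} }) {
            push @shared, $result->{URL};
        }
        else {
            push @only_recent, $result->{URL};
        }
    }
    return (\@only_recent, \@shared);
}

# Hypothetical data standing in for two sets of search results.
my @full   = ({ URL => 'http://example.com/bio' },
              { URL => 'http://example.com/photos' });
my @recent = ({ URL => 'http://example.com/news' },
              { URL => 'http://example.com/photos' });

my ($only_recent, $shared) = buzz_effect(\@full, \@recent);
print "Only in the recent search: @$only_recent\n";
print "In both searches: @$shared\n";
```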
