Google Hacks, 2nd Edition [Electronic resource] (text version)


Hack 45. Glean Weblog-Free Google Results

With so many weblogs being indexed by Google,
you might worry about too much emphasis on the hot topic of the
moment. In this hack, we'll show you how to remove
the weblog factor from your Google results.

Weblogs, those frequently updated, link-heavy personal
pages, are quite the fashionable thing these days. There are at
least 4,000,000 active weblogs across the Internet, covering almost
every possible subject and interest. For humans,
they're good reading, but for search engines,
they're heavenly bundles of fresh content and links
galore.

Some people think that the search engine's delight
in weblogs slants search results by placing too much emphasis on too
small a group of recent rather than evergreen content. As I write,
for example, I am the twelfth most important Ben on the Internet,
according to Google. This rank comes solely from my
weblog's popularity.

This hack searches Google, discarding any results coming from
weblogs. It uses the Google Web Services API (http://api.google.com) and the API of
Technorati
(http://www.technorati.com/members), an
excellent interface to David Sifry's weblog
data-tracking tool. Both APIs require keys, available from the URLs
mentioned.

Finally, you'll need a simple HTML page with a form
that passes a text query to the parameter q (the
query that will run on Google), something like this:

<form action="googletech.cgi" method="POST">
Your query: <input type="text" name="q">
<input type="submit" name="Search!" value="Search!">
</form>

Save the form as googletech.html.

2.27.1. The Code

Save the following code ["How to Run the
Hacks" in the Preface] to a file called
googletech.cgi.

You'll need the XML::Simple and
SOAP::Lite Perl modules to run this hack, along with CGI,
HTML::Entities, and LWP::Simple if they are not already installed.

#!/usr/bin/perl -w
# googletech.cgi
# Getting Google results
# without getting weblog results.
use strict;
use SOAP::Lite;
use XML::Simple;
use CGI qw(:standard);
use HTML::Entities ( );
use LWP::Simple qw(!head);
my $technoratikey = "insert technorati key here";
my $googlekey = "insert google key here";
# Set up the query term
# from the CGI input.
my $query = param("q");
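# Initialize the SOAP interface and run the Google search.
my $google_wdsl = "http://api.google.com/GoogleSearch.wsdl";
my $service = SOAP::Lite->service($google_wdsl);
# Arguments: key, query, start, maxResults (the 10), filter,
# restrict, safeSearch, lr, ie, oe.
my $result = $service->doGoogleSearch(
    $googlekey, $query, 0, 10, "false", "", "false", "", "latin1", "latin1"
);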
# Start returning the results page;
# do this now to prevent timeouts.
my $cgi = new CGI;
print $cgi->header( );
print $cgi->start_html(-title=>'Blog Free Google Results');
print $cgi->h1('Blog Free Results for '. "$query");
print $cgi->start_ul( );
# Go through each of the results.
foreach my $element (@{$result->{'resultElements'}}) {
    my $url = HTML::Entities::encode($element->{'URL'});
    # Request the Technorati information for each result.
    my $technorati_result = get("http://api.technorati.com/bloginfo?".
        "url=$url&key=$technoratikey");
    # Parse this information.
    my $parser = new XML::Simple;
    my $parsed_feed = $parser->XMLin($technorati_result);
    # If Technorati considers this site to be a weblog,
    # go onto the next result. If not, display it, and then go on.
    if ($parsed_feed->{document}{result}{weblog}{name}) { next; }
    else {
        # One list item for the linked title, one for the snippet.
        print $cgi->li('<a href="'.$url.'">'.$element->{title}.'</a>');
        print $cgi->li("$element->{snippet}");
    }
}
print $cgi -> end_ul( );
print $cgi->end_html;

Let's step through the meaningful bits of this code.
First comes running the query against Google. Notice the
10 in the doGoogleSearch call; this
is the number of search results requested from Google. You should try
to set this as high as Google will allow whenever you run the script;
otherwise, you might find that searching for terms that are extremely
popular in the weblogging world does not return any results at all,
having been rejected as originating from a blog.
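If one call's worth of results is not enough, the usual approach is to page through them. The Google Web Services API is documented as returning at most 10 results per call, with the third doGoogleSearch argument giving the starting offset. The following is a minimal sketch of working out those offsets only; page_offsets is a hypothetical helper that is not part of the hack, and the doGoogleSearch calls themselves are left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): list the "start"
# offsets needed to collect $wanted results, $page_size at a time.
sub page_offsets {
    my ($wanted, $page_size) = @_;
    die "page size must be at least 1\n" if $page_size < 1;
    my @offsets;
    for (my $start = 0; $start < $wanted; $start += $page_size) {
        push @offsets, $start;
    }
    return @offsets;
}

# Each offset would become the third argument of one doGoogleSearch call.
print join(", ", page_offsets(30, 10)), "\n";
```

Keep in mind that every extra page of Google results also means another batch of Technorati lookups.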

Since we're about to make a web services call for
every one of the returned results, which might take a while, we want
to start returning the results page now; this helps prevent
connection timeouts. As such, we spit out a header using the
CGI module, and then jump into our loop.

We then get to the final part of our code: actually looping through
the search results returned by Google and passing the HTML-encoded
URL to the Technorati API as a get request.
Technorati will then return its results as an XML document.
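The listing HTML-encodes the URL before placing it in the query string. A URL that itself contains characters such as & or ? is more safely percent-encoded, so that it arrives as a single url value. Here is a minimal sketch under that assumption; percent_encode and bloginfo_url are hypothetical helpers, and in a real script you would more likely use uri_escape from the URI::Escape module.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every character outside the
# RFC 3986 "unreserved" set. Assumes a plain ASCII/byte string.
sub percent_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Hypothetical helper: build the bloginfo request URL used in the hack.
sub bloginfo_url {
    my ($site_url, $key) = @_;
    return "http://api.technorati.com/bloginfo?url="
         . percent_encode($site_url)
         . "&key=" . percent_encode($key);
}

# The example URL and key are made up for illustration.
print bloginfo_url("http://example.com/a?b=1&c=2", "demo-key"), "\n";
```

The string returned by bloginfo_url could then be handed to get in place of the hand-built URL in the listing.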

Be careful that you do not run out of Technorati requests. As I write
this, Technorati is offering 500 free requests a day. Because this script
makes one Technorati request per Google result, at 10 results per search
that is around 50 searches. If you make this script available
to your web site audience, you will soon run out of Technorati
requests. One possible workaround is forcing the user to enter her
own Technorati key. You can get the user's key from
the same form that accepts the query. See the
"Hacking the Hack" section for a
means of doing this.
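The arithmetic behind that limit is easy to check. This is a small sketch using only the figures quoted above (500 requests a day, one request per Google result); searches_per_day is a hypothetical helper, not part of the hack.

```perl
use strict;
use warnings;

# Hypothetical helper: how many searches a daily Technorati allowance
# covers when every Google result costs one Technorati request.
sub searches_per_day {
    my ($daily_requests, $results_per_search) = @_;
    die "results per search must be at least 1\n" if $results_per_search < 1;
    return int($daily_requests / $results_per_search);
}

printf "%d searches a day at 10 results each\n", searches_per_day(500, 10);
```

Asking Google for more results per search shrinks this number in proportion.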

Parsing this result is a matter of passing it through
XML::Simple. Since Technorati returns an
XML construct containing name only when the site is
thought to be a weblog, we can use the presence of this construct as
a marker. If the program sees the construct, it skips to the next
result. If it doesn't, the site is not thought to be
a weblog by Technorati and we display a link to it, along with the
title and snippet (when available) returned by Google.
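The weblog test can also be pulled out into a function of its own. The sketch below assumes the parsed response has the shape the listing relies on (document, result, weblog, name); is_weblog is a hypothetical helper, and the two sample structures are hand-built stand-ins for what XMLin might return, not real Technorati output.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the parsed response carries a weblog name.
# Walks the structure one level at a time so a missing or oddly shaped
# branch yields "not a weblog" instead of an error under strict.
sub is_weblog {
    my ($parsed) = @_;
    my $node = $parsed;
    for my $key (qw(document result weblog)) {
        return 0 unless ref($node) eq 'HASH' && exists $node->{$key};
        $node = $node->{$key};
    }
    return 0 unless ref($node) eq 'HASH' && defined $node->{name};
    return $node->{name} ne '' ? 1 : 0;
}

# Made-up sample data for illustration only.
my $blog  = { document => { result => { weblog => { name => 'Example Blog' } } } };
my $plain = { document => { result => { url => 'http://example.com' } } };

print is_weblog($blog)  ? "skip\n" : "show\n";
print is_weblog($plain) ? "skip\n" : "show\n";
```

Checking each level before descending also covers the case where the Technorati request fails and there is nothing useful to parse.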

2.27.2. Running the Hack

Point your browser at the form googletech.html, type in a query, and click the Search! button.

2.27.3. Hacking the Hack

As mentioned previously, this script can burn through your Technorati
allowances rather quickly under heavy use. The simplest way of
solving this is to force the end user to supply his own Technorati
key. First, add a new input to your HTML form for the
user's key:

Your Technorati key: <input type="text" name="key">

Then, suck in the user's key as a replacement for
your own:

# Set up the query term
# from the CGI input.
my $query = param("q");
$technoratikey = param("key");
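If the visitor leaves the key field empty, the assignment above leaves the script with no usable key. One way to handle that is sketched below; visitor_key is a hypothetical helper, and the idea of stopping with a message rather than falling back to your own key is an assumption about what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: return the submitted key with surrounding
# whitespace removed, or undef when nothing usable was submitted.
sub visitor_key {
    my ($submitted) = @_;
    return undef unless defined $submitted;
    $submitted =~ s/^\s+|\s+$//g;    # trim stray whitespace
    return length($submitted) ? $submitted : undef;
}

# In the CGI script the argument would be param("key").
my $key = visitor_key("  abc123  ");
print defined $key ? "using key $key\n" : "please supply a Technorati key\n";
```

In googletech.cgi you would run this check before the Google search, so that a missing key costs neither a Google request nor a Technorati one.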

Ben Hammersley
