Hack 49. Search Yesterday's Index

Monitor a set of queries for new finds added to the Google index
yesterday.

[Hack #48] is a simple web form-driven CGI script for building date
range Google queries. A simple web-based interface is fine when you
want to search for only one or two items at a time. But what about
performing multiple searches over time and saving the results to your
computer for comparative analysis?

A better fit for this task is a client-side application that you run
from the comfort of your own computer's desktop.
This Perl script feeds specified queries to Google via the Google Web
API, limiting results to those indexed yesterday. New finds are
appended to a comma-delimited text file per query, suitable for
import into Excel or your average database application.
2.31.1. The Queries
First, you'll need to prepare a few queries to feed
the script. Try these out via the Google search interface itself
first to make sure you're receiving the kind of
results you're expecting. Your queries can be
anything that you'd be interested in tracking over
time: topics of long-lasting or current interest, searches for new
directories of information [Hack #1] coming
online, or unique quotes from articles or other sources that you want
to monitor for signs of plagiarism.

Use whatever special syntaxes you like except for
link:; as you might remember,
link: can't be used in concert
with any other special syntax such as daterange:,
upon which this hack relies. If you insist on trying anyway (e.g.,
link:www.yahoo.com daterange:2452421-2452521), Google will simply
treat link as yet another query word (e.g.,
link www.yahoo.com), yielding
some unexpected and useless results.

Put each query on its own line. A sample query file will look
something like this:

    "digital archives"
    intitle:"state library of"
    intitle:directory intitle:resources
    "now * * time for all good men * come * * aid * * party"

Save the text file somewhere memorable; alongside the script
you're about to write is as good a place as any.
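
Because link: quietly misbehaves alongside daterange:, it can be worth
screening a query file before feeding it to the script. What follows is
only a minimal sketch, not part of the hack itself: the usable_queries
routine and its sample input are hypothetical, and the check simply
looks for link: at the start of a query or after whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: return the usable queries from a list of lines,
# warning about any that use link:, which cannot be combined with
# daterange:.
sub usable_queries {
    my @lines = @_;
    my @queries;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;            # skip blank lines
        if ($line =~ /(?:^|\s)link:/i) {     # link: starts a query term
            warn "Skipping query that uses link: $line\n";
            next;
        }
        push @queries, $line;
    }
    return @queries;
}

# Example with in-memory lines; a real run would read them from the
# query file instead.
my @kept = usable_queries(
    qq{"digital archives"\n},
    qq{link:www.yahoo.com\n},
    qq{intitle:directory intitle:resources\n},
);
print "$_\n" for @kept;
```

Skipped queries are reported on standard error, so the list printed on
standard output contains only queries that are safe to combine with
daterange:.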
2.31.2. The Code
Save the following code as goonow.pl. Be sure to
replace insert key here with your Google
API key along the way.

    #!/usr/local/bin/perl -w
    # goonow.pl
    # Feeds queries specified in a text file to Google, querying
    # for recent additions to the Google index. The script appends
    # to CSV files, one per query, creating them if they don't exist.
    # usage: perl goonow.pl [query_filename]

    # My Google API developer's key.
    my $google_key='insert key here';

    # Location of the GoogleSearch WSDL file.
    my $google_wsdl = "./GoogleSearch.wsdl";

    use strict;
    use SOAP::Lite;
    use Time::JulianDay;

    $ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";

    # Julian date of the day before yesterday.
    my $julian_date = int local_julian_day(time) - 2;

    my $google_search = SOAP::Lite->service("file:$google_wsdl");

    open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!";

    while (my $query = <QUERIES>) {
      chomp $query;
      warn "Searching Google for $query\n";

      # Name the output file after the bare query, before the
      # daterange: clause is added, so each query keeps a single file.
      (my $outfile = $query) =~ s/\W/_/g;
      open (OUT, ">> $outfile.csv")
        or die "Couldn't open $outfile.csv: $!\n";

      $query .= " daterange:$julian_date-$julian_date";

      my $results = $google_search ->
        doGoogleSearch(
          $google_key, $query, 0, 10, "false", "", "false",
          "", "latin1", "latin1"
        );

      foreach (@{$results->{'resultElements'}}) {
        print OUT '"' . join('","', (
          map {
            s!\n!!g; # drop spurious newlines
            s!<.+?>!!g; # drop all HTML tags
            s!"!""!g; # double escape " marks
            $_;
          } @$_{'title','URL','snippet'}
        ) ) . "\"\n";
      }
    }

You'll notice that GooNow checks the day before yesterday's rather
than yesterday's additions (my $julian_date = int
local_julian_day(time) - 2;). Google reindexes some pages very
frequently; these show up in yesterday's additions and really bulk up
your search results. So if you search yesterday's index, you'll get a
lot of noise (pages that Google indexes every day) along with the
fresh content that you're after. Skipping back one more day is a nice
hack to get around the noise.
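
To see the date arithmetic on its own, here is a rough sketch that
builds the daterange: clause without Time::JulianDay. It assumes UTC
rather than local time (so it can differ from local_julian_day near
midnight) and relies on Unix time 0 falling on Julian day 2440588; the
julian_day_utc and daterange_clause names are made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Julian day number of the UTC calendar date containing $epoch.
# Assumption: Unix time 0 (1970-01-01) falls on Julian day 2440588.
sub julian_day_utc {
    my ($epoch) = @_;
    return floor($epoch / 86400) + 2440588;
}

# Build a daterange: clause covering the single day that lies
# $days_back days before the date containing $epoch.
sub daterange_clause {
    my ($epoch, $days_back) = @_;
    my $day = julian_day_utc($epoch) - $days_back;
    return "daterange:$day-$day";
}

# The day before yesterday, as in the hack.
print daterange_clause(time, 2), "\n";
```

Changing the second argument to 1 would search yesterday's additions
instead, noise and all.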
2.31.3. Running the Hack
This script is invoked on the command line ["Running
the Hacks" in Preface] like so:

    $ perl goonow.pl query_filename

where query_filename is the name of the
text file holding all the queries to be fed to the script. The file
can be located either in the local directory or elsewhere; if the
latter, be sure to include the entire path (e.g.,
/mydocu~1/hacks/queries.txt).

Bear in mind that all output is directed to CSV files, one per query,
so don't expect any fascinating output on the
screen.
2.31.4. The Results
Here's a quick look at one of the CSV output files
created, intitle_state_library_of_.csv:

    "State Library of Louisiana","http://www.state.lib.la.us/"," ... Click here if you have any questions or comments. Copyright © 1998-2001 State Library of Louisiana Last modified: August 07, 2002. "
    "STATE LIBRARY OF NEW SOUTH WALES, SYDNEY AUSTRALIA","http://www.slnsw.gov.au/", " ... State Library of New South Wales Macquarie St, Sydney NSW Australia 2000 Phone: +61 2 9273 1414 Fax: +61 2 9273 1255. Your comments You could win a prize! ... "
    "State Library of Victoria","http://www.slv.vic.gov.au/"," ... clicking on our logo. State Library of Victoria Logo with link to homepage State Library of Victoria. A world class cultural resource ... "
    ...
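
The quoting used in these records can be tried out in isolation. The
sketch below applies the same three cleanup steps as the script (drop
newlines, drop HTML tags, double any quote marks) to sample values; the
csv_row routine and the example.com data are invented for illustration
and are not real search results.

```perl
use strict;
use warnings;

# Hypothetical helper: format fields as one CSV record. Newlines and
# HTML tags are removed, embedded quote marks are doubled, and every
# field is wrapped in quotes.
sub csv_row {
    my @fields = @_;
    for (@fields) {
        $_ = '' unless defined;   # treat missing fields as empty
        s/\n//g;                  # drop newlines
        s/<.+?>//g;               # drop HTML tags
        s/"/""/g;                 # double embedded quote marks
    }
    return '"' . join('","', @fields) . '"';
}

# Made-up title, URL, and snippet.
print csv_row(
    'State <b>Library</b> of "Example"',
    'http://www.example.com/',
    "first line\nsecond line",
), "\n";
```

Doubling embedded quote marks is the convention that spreadsheet
programs generally expect when importing quoted CSV fields.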
2.31.5. Hacking the Hack
The script keeps appending new finds to the appropriate CSV output
file. If you wish to reset the CSV files associated with particular
queries, simply delete them, and the script will create them anew.

Or you can make one slight adjustment to have the script create the
CSV files anew each time, overwriting the previous version, like so:

    ...
    (my $outfile = $query) =~ s/\W/_/g;
    open (OUT, "> $outfile.csv")
      or die "Couldn't open $outfile.csv: $!\n";
    ...

Notice that the only change in the code is the removal of one of the
> characters when the output file is created, i.e., open (OUT,
"> $outfile.csv") instead of open (OUT, ">> $outfile.csv").
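
If you'd rather not edit the script each time you change your mind,
the choice between appending and overwriting can be made at run time.
This is only a sketch of one way to do it: the open_mode and open_csv
routines and the overwrite switch are hypothetical, and the
demonstration writes to a scratch directory rather than to real result
files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Map a hypothetical overwrite switch to an open() mode:
# '>' truncates the file, '>>' appends to it.
sub open_mode {
    my ($overwrite) = @_;
    return $overwrite ? '>' : '>>';
}

# Open a query's CSV file in the chosen mode, using three-argument
# open so the mode and the filename are kept apart.
sub open_csv {
    my ($outfile, $overwrite) = @_;
    open(my $fh, open_mode($overwrite), "$outfile.csv")
        or die "Couldn't open $outfile.csv: $!\n";
    return $fh;
}

# Demonstration in a scratch directory: two appending runs followed
# by one overwriting run.
my $dir = tempdir(CLEANUP => 1);
for my $overwrite (0, 0, 1) {
    my $fh = open_csv("$dir/demo", $overwrite);
    print {$fh} qq{"title","URL","snippet"\n};
    close $fh;
}
```

After the two appending runs the demo file holds two records; the final
overwriting run leaves it with just one.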