Hack 33. Summarize Results by Domain

Get an overview of the sorts of domains (educational, commercial, foreign, and so forth) found in the results
of a Google query.

You want to know about a topic, so you do a search. But what do you
have? A list of pages. You can't get a good idea of
the types of pages these are without taking a close look at the list
of sites. This hack is an attempt to get a snapshot of the
types of sites that result from a query. It does this by taking a
suffix census, a count of the different domains
that appear in search results.

This is ideal for running link: queries,
providing a good idea of what kinds of domains (commercial,
educational, military, foreign, etc.) are linking to a particular
page.

You could also run it to see where technical terms,
slang terms, and
unusual words are turning up. Which pages mention a particular singer
more often? Or a political figure? Does the word
"democrat" come up more often on
.com or .edu sites?

Of course, this snapshot doesn't provide a complete
inventory, but as overviews go, it's rather
interesting.
2.15.1. The Code
Save the code as suffixcensus.cgi, a CGI script
["How to Run the Hacks" in the
Preface] on your web server:

#!/usr/local/bin/perl
# suffixcensus.cgi
# Generates a snapshot of the kinds of sites responding to a
# query. The suffix is the .com, .net, or .uk part.
# suffixcensus.cgi is called as a CGI with form input.

# Your Google API developer's key.
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of times to loop, retrieving 10 results at a time.
my $loops = 10;

use SOAP::Lite;
use CGI qw/:standard *table/;

# Print the page header and the query form.
print
  header( ),
  start_html("SuffixCensus"),
  h1("SuffixCensus"),
  start_form(-method=>'GET'),
  'Query: ', textfield(-name=>'query'),
  ' ',
  submit(-name=>'submit', -value=>'Search'),
  end_form( ), p( );

if (param('query')) {
  my $google_search = SOAP::Lite->service("file:$google_wdsl");
  my %suffixes;

  for (my $offset = 0; $offset <= $loops*10; $offset += 10) {
    my $results = $google_search ->
      doGoogleSearch(
        $google_key, param('query'), $offset, 10, "false", "", "false",
        "", "latin1", "latin1"
      );
    last unless @{$results->{resultElements}};

    # Tally the suffix of each result's URL.
    map { $suffixes{ ($_->{URL} =~ m#://.+?\.(\w{2,4})/#)[0] }++ }
      @{$results->{resultElements}};
  }

  # Print one table row per suffix.
  print
    h2('Results: '), p( ),
    start_table({cellpadding => 5, cellspacing => 0, border => 1}),
    map( { Tr(td(uc $_),td($suffixes{$_})) } sort keys %suffixes ),
    end_table( );
}

print end_html( );

Be sure to replace insert key here with
your Google API key.
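
The suffix tally is the heart of the script, so it may help to try it away from the Google API. The following is a minimal sketch, not part of the original hack: it applies the same pattern to a handful of made-up URLs. The helper names suffix_of and census, the sample URLs, and the "(none)" bucket for URLs the pattern does not match are all assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the suffix of a URL using the same pattern as the script,
# or undef when the pattern does not match.
sub suffix_of {
    my ($url) = @_;
    my ($suffix) = $url =~ m#://.+?\.(\w{2,4})/#;
    return $suffix;
}

# Tally suffixes for a list of URLs. URLs the pattern cannot handle
# are counted under the made-up key "(none)".
sub census {
    my @urls = @_;
    my %suffixes;
    for my $url (@urls) {
        my $suffix = suffix_of($url);
        $suffixes{ defined $suffix ? $suffix : '(none)' }++;
    }
    return %suffixes;
}

# Hypothetical URLs, for illustration only.
my @sample = (
    'http://www.example.com/soda.html',
    'http://shop.example.com/',
    'http://www.example.co.za/colddrink/',
    'http://example.edu/~student/pop.html',
    'http://example.org',    # no slash after the host, so no match
);

my %count = census(@sample);
print "$_: $count{$_}\n" for sort keys %count;
```

Note that the pattern expects a slash after the host name and a suffix of two to four word characters, so a URL without the slash, or one with a longer suffix, is not counted under a suffix at all.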
2.15.2. Running the Hack
This hack runs as a CGI script. Point your browser at
suffixcensus.cgi to run it.
2.15.3. The Results
Searching for the prevalence of "soda
pop" by suffix finds, as one might expect, the
most mentions on .coms, as shown in Figure 2-9.
Figure 2-9. Prevalence of "soda pop" by suffix

2.15.4. Hacking the Hack
There are a couple of ways to hack this hack.
2.15.4.1 Going back for more
This script, by default, asks Google for 10 results at a time, gathering
roughly the top 100 (or fewer, if there aren't as many) results. Because the
loop runs while $offset <= $loops*10, it makes one request more than the
value of $loops, so up to 11 by default. To
increase or decrease the number of visits, simply change the value of
the $loops variable at the top of the script. Bear
in mind, however, that making $loops = 50 might
net you 500 results, but you're also eating quickly
into your daily allotment of 1,000 Google API queries.
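
To see how a given $loops value eats into the quota, here is a rough sketch that counts iterations of the same loop header the script uses, assuming every request returns at least one result so the loop never exits early. With the `<=` test as written, the count comes out one higher than $loops. The helper names requests_for and runs_per_day are made up for this example; the 1,000 figure is the daily allotment mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many requests the script's loop makes for a given $loops,
# assuming no request comes back empty.
sub requests_for {
    my ($loops) = @_;
    my $requests = 0;
    for (my $offset = 0; $offset <= $loops * 10; $offset += 10) {
        $requests++;
    }
    return $requests;
}

# How many complete runs of the script fit into a daily quota.
sub runs_per_day {
    my ($loops, $quota) = @_;
    return int($quota / requests_for($loops));
}

for my $loops (1, 10, 50) {
    printf "loops=%d: %d requests per run, %d runs per day\n",
        $loops, requests_for($loops), runs_per_day($loops, 1000);
}
```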
2.15.4.2 Returning comma-separated output
It's rather simple to adjust this script to run from
the command line and return a comma-separated
output suitable for Excel or your average database. Remove the
starting HTML, form, and ending HTML output, and alter the code that
prints out the results. In the end, you come to something like this:

#!/usr/local/bin/perl
# suffixcensus_csv.pl
# Generates a snapshot of the kinds of sites responding to a
# query. The suffix is the .com, .net, or .uk part.
# Usage: perl suffixcensus_csv.pl query="your query" > results.csv

# Your Google API developer's key.
my $google_key='insert key';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of times to loop, retrieving 10 results at a time.
my $loops = 1;

use SOAP::Lite;
use CGI qw/:standard/;

# CGI.pm also reads name=value pairs from the command line.
param('query')
  or die qq{usage: suffixcensus_csv.pl query="{query}" [> results.csv]\n};

# Print the CSV header row.
print qq{"suffix","count"\n};

my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %suffixes;

for (my $offset = 0; $offset <= $loops*10; $offset += 10) {
  my $results = $google_search ->
    doGoogleSearch(
      $google_key, param('query'), $offset, 10, "false", "", "false",
      "", "latin1", "latin1"
    );
  last unless @{$results->{resultElements}};

  # Tally the suffix of each result's URL.
  map { $suffixes{ ($_->{URL} =~ m#://.+?\.(\w{2,4})/#)[0] }++ }
    @{$results->{resultElements}};
}

# Print one CSV row per suffix.
print map { qq{"$_", "$suffixes{$_}"\n} } sort keys %suffixes;

Invoke the script from the command line like so:

$ perl suffixcensus_csv.pl query="query" > results.csv

Searching for mentions of
"colddrink," the South African
version of "soda pop," and sending the
output straight to the screen rather than a
results.csv file, looks like this:

$ perl suffixcensus_csv.pl query="colddrink"
"suffix","count"
"com", "12"
"info", "1"
"net", "1"
"za", "6"