Hack 46. Spot Trends with Geotargeting

Compare the relative popularity of a trend or
fashion in different locations, using only Google and Directi search
results.

One of the latest buzzwords on the Internet is geotargeting,
which is just a fancy name for the process of matching hostnames
(e.g., http://www.oreilly.com) to
addresses (e.g., 208.201.239.36) to country names (e.g., U.S.). The
whole thing works because there are people who compile such databases
and make them readily available. This information must be compiled by
hand or at least semiautomatically because the DNS, which
resolves hostnames to addresses, does not store it in its distributed
database.

While it is possible to add geographic location data to DNS records,
it is highly impractical to do so. However, since we know which
addresses have been assigned to which businesses, governments,
organizations, or educational establishments, we can assume with a
high probability that the geographic location of the institution
matches that of its hosts, at least for most of them. For example, if
the given address belongs to the range of addresses assigned to
British Telecom, then it is highly probable that it is used by a host
located within the territory of the United Kingdom.

Why go to such lengths when a simple DNS lookup (e.g.,
nslookup 208.201.239.36) gives the name of the
host, and in that name we can look up the top-level domain (e.g.,
.pl, .de, or .uk) to find out where this particular host is
located? There are four good reasons for this:

- Not all lookups on addresses return hostnames.
- A single address might serve more than one virtual host.
- Some country domains are registered by foreigners and hosted on
  servers on the other side of the globe.
- .com, .net, .org, .biz, or .info domains tell us nothing about the
  geographic location of the servers they are hosted on.

That's where geotargeting can help.
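
The matching of an address against assigned ranges can be pictured as a simple range lookup. The following is only a minimal sketch of that idea: the helper names ip_to_int and country_for are hypothetical, and the three ranges are reserved documentation networks with placeholder labels, not rows from any real IP-to-country database.

```perl
use strict;
use warnings;

# Convert a dotted-quad address to a 32-bit integer.
# Returns nothing if the string is not a valid IPv4 address.
sub ip_to_int {
    my ($address) = @_;
    my @octets = split /\./, $address, -1;
    return unless @octets == 4;
    for my $octet (@octets) {
        return unless $octet =~ /^\d{1,3}$/ && $octet <= 255;
    }
    return unpack("N", pack("C4", @octets));
}

# Placeholder ranges (assumption: documentation networks, made-up labels).
my @ranges = map {
    [ scalar ip_to_int($_->[0]), scalar ip_to_int($_->[1]), $_->[2] ]
} (
    [ "192.0.2.0",    "192.0.2.255",    "EXAMPLE-A" ],
    [ "198.51.100.0", "198.51.100.255", "EXAMPLE-B" ],
    [ "203.0.113.0",  "203.0.113.255",  "EXAMPLE-C" ],
);

# Return the label of the first range containing the address.
sub country_for {
    my ($address) = @_;
    my $n = ip_to_int($address);
    return "unknown" unless defined $n;
    for my $range (@ranges) {
        my ($from, $to, $label) = @$range;
        return $label if $n >= $from && $n <= $to;
    }
    return "unknown";
}

print country_for("198.51.100.7"), "\n";
```

Comparing integers rather than dotted strings is what makes the range test a pair of numeric comparisons; an address outside every range simply comes back as unknown.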
Geotargeting is by no means perfect. For example, if an international
organization such as AOL gets a large chunk of addresses that it uses
not only for servers in the U.S., but also in Europe, the European
hosts might be reported as being based in the U.S. Fortunately, such
aberrations do not constitute a large percentage of addresses.

The first users of geotargeting were advertisers, who thought it
would be a neat idea to serve local advertising. In other words, if a
user visits a New York Times site, the ads they
see depend on their physical location. Those in the U.S. might see
the ads for the latest Chrysler car, while those in Japan might see
ads for i-mode; users from Poland might see ads for
"Ekstradycja" (a cult Polish police
TV series), and those in India might see ads for the latest Bollywood
movie. While such use of geotargeting might maximize the
return on the invested dollar, it also goes against the idea behind
the Internet, which is a global network. (In other words, if you are
addressing a global audience, don't try to hide from
it by compartmentalizing it.) Another problem with geotargeted ads is
that they follow the viewer. Advertisers must love it, but it is
annoying to the user; how would you feel if you saw the same ads for
your local burger bar everywhere you went in the world?

Another application of geotargeting is to serve content in the local
language. The idea is really nice, but it's often
poorly implemented and takes a lot of clicking to get to the pages in
other languages. The local pages have a habit of returning out of
nowhere, especially after you upgrade your web browser. A much more
interesting application of geotargeting is the analysis of trends,
which is usually done in two ways: by analyzing server logs and by
analyzing the results of Google queries.

Server log analysis is used to determine the geographic location of
your visitors. For example, you might discover that your
company's site is being visited by a large number of
people from Japan. Perhaps that number is so significant that it
would justify the rollout of a Japanese version of your site. Or it
might be a signal that your company's products are
becoming popular in that country and you should spend more marketing
dollars there. But if you run a server for U.S. expatriates living in
Tokyo, the same information might mean that your site is growing in
popularity and you need to add more information in English. This
method is based on the list of addresses of hosts that connect to the
server, stored in your server's access log. You
could write a script that looks up their geographic location to find
out where your visitors come from. It is more accurate than looking
up top-level domains, although it's a little slower
due to the number of DNS lookups that need to be done.

Another interesting use of geotargeting is analysis of the spread of
trends. This can be done with a simple script that plugs into the
Google API and the IP-to-Country database provided by
Directi (http://ip-to-country.directi.com). The idea
behind trend analysis is simple: perform repetitive queries using the
same keywords, but change the language of results and top-level
domains for each query. Compare the number of results returned for
each language, and you will get a good idea of the spread of the
analyzed trend across cultures. Then, compare the number of results
returned for each top-level domain, and you will get a good idea of
the spread of the analyzed trend across the globe. Finally, look up
geographic locations of hosts to better approximate the geographic
spread of the analyzed trend.

You might discover some interesting things this way: it could turn
out that a particular .com domain that serves a
significant number of documents containing the given query in
Japanese is located in Germany. It might be a sign that there is a
large Japanese community in Germany that uses that particular
.com domain for their portal.
Shouldn't you be trying to get in touch with them?

The geospider.pl script shown in this hack is a
sample implementation of this idea. It queries Google and then
matches the names of hosts in returned URLs against the IP-to-Country
database.
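
Before turning to the spider itself, the per-domain comparison described above can be sketched on its own. This is only an illustration, assuming you already have a list of result URLs: tld_of and tally_tlds are hypothetical helper names, and the two URLs are taken from the sample run later in this hack.

```perl
use strict;
use warnings;

# Return the lowercased top-level domain of a URL's host, or nothing
# if the URL has no host, the host has no dot, or the host is a bare
# IPv4 address.
sub tld_of {
    my ($url) = @_;
    my ($host) = $url =~ m{^[a-z][a-z0-9+.-]*://(?:[^/@]*@)?([^/:?#]+)}i
        or return;
    return if $host =~ /^[\d.]+$/;
    my ($tld) = $host =~ /\.([a-z0-9-]+)\.?$/i or return;
    return lc $tld;
}

# Count how many URLs fall under each top-level domain.
sub tally_tlds {
    my %count;
    for my $url (@_) {
        my $tld = tld_of($url);
        $count{$tld}++ if defined $tld;
    }
    return \%count;
}

my $count = tally_tlds(
    "http://www.macupdate.com/info.php/id/9787",
    "http://allmacintosh.forthnet.gr/preview/214706l",
);
print "$_: $count->{$_}\n" for sort keys %$count;
```

Running the same tally over the results of several queries, one per language or domain, gives the raw numbers to compare; the address lookup then refines what the top-level domain alone cannot tell you.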
2.28.1. The Code
Save the following code ["How to Run the
Hacks" in the Preface] as
geospider.pl.
#!/usr/bin/perl -w
#
# geospider.pl
#
# Geotargeting spider -- queries Google through the Google API, extracts
# hostnames from returned URLs, looks up addresses of hosts, and matches
# addresses of hosts against the IP-to-Country database from Directi:
# ip-to-country.directi.com. For more information about this software:
# http://www.artymiak.com/software or contact jacek@artymiak.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
use strict;
use Getopt::Std;
use Net::Google;
use constant GOOGLEKEY => 'insert key here';
use Socket;
my $help = <<"EOH";
----------------------------------------------------------------------------
Geotargeting trend analysis spider
----------------------------------------------------------------------------
Options:
-h prints this help
-q query in utf8, e.g. 'Spidering Hacks'
-l language codes, e.g. 'en fr ja'
-d domains, e.g. '.com'
-s which result should be returned first (count starts from 0), e.g. 0
-n how many results should be returned, e.g. 700
----------------------------------------------------------------------------
EOH
# Define our arguments and show the
# help if asked, or if missing query.
my %args; getopts("hq:l:d:s:n:", \%args);
die $help if exists $args{h};
die $help unless $args{'q'};
# Create the Google object.
my $google = Net::Google->new(key=>GOOGLEKEY);
my $search = $google->search( );
# Language(s), defaulting to English.
$search->lr( $args{l} ? split(' ', $args{l}) : "en" );
# What search result to start at, defaulting to 0.
$search->starts_at($args{'s'} || 0);
# How many results, defaulting to 10.
$search->max_results($args{'n'} || 10);
# Input and output encoding.
$search->ie(qw(utf8)); $search->oe(qw(utf8));
# Our final string for searching, with
# domain specific searching if -d was given.
my $querystr = $args{'q'};
if ($args{d}) { $querystr .= " site:$args{d}"; }
# Load in our lookup list from
# http://ip-to-country.directi.com/.
my $file = "ip-to-country.csv";
print STDERR "Trying to open $file... \n";
open (FILE, "<$file") or die "[error] Couldn't open $file: $!\n";
# Now load the whole shebang into memory.
print STDERR "Database opened, loading... \n";
my (%ip_from, %ip_to, %code2, %code3, %country);
my $counter = 0;
while (<FILE>) {
    chomp; my $line = $_; $line =~ s/"//g; # strip all quotes.
    next unless $line =~ /\S/;             # skip blank lines.
    # Limit the split to five fields so that commas
    # inside a country name stay in the last field.
    my ($ip_from, $ip_to, $code2, $code3, $country) = split(/,/, $line, 5);
    # Remove leading zeros, keeping a lone 0.
    $ip_from =~ s/^0+(?=\d)//;
    $ip_to   =~ s/^0+(?=\d)//;
    # And assign to our permanents.
    $ip_from{$counter} = $ip_from;
    $ip_to{$counter}   = $ip_to;
    $code2{$counter}   = $code2;
    $code3{$counter}   = $code3;
    $country{$counter} = $country;
    $counter++; # move on to next line.
}
close(FILE);
$search->query(qq($querystr));
print STDERR "Querying Google with $querystr... \n";
print STDERR "Processing results from Google... \n";
# For each result from Google, display
# the geographic information we've found.
foreach my $result (@{$search->response( )}) {
    print "-" x 80 . "\n";
    print " Search time: " . $result->searchTime( ) . "s\n";
    print " Query: $querystr\n";
    print " Languages: " . ( $args{l} || "en" ) . "\n";
    print " Domain: " . ( $args{d} || "" ) . "\n";
    print " Start at: " . ( $args{'s'} || 0 ) . "\n";
    print "Return items: " . ( $args{n} || 10 ) . "\n";
    print "-" x 80 . "\n";
    foreach my $element (@{$result->resultElements( )}) {
        print "url: " . $element->URL( ) . "\n";
        # Look the host up once, dropping undefined entries.
        my @addresses = grep { defined } get_host($element->URL( ));
        if (@addresses) {
            match_ip(@addresses);
        } else {
            print "address: unknown\n";
            print "country: unknown\n";
            print "code3: unknown\n";
            print "code2: unknown\n";
        }
        print "-" x 50 . "\n";
    }
}
# Get the IPs for
# matching hostnames.
sub get_host {
    my ($url) = @_;
    # Chop the URL down to just the hostname, whether
    # or not a path or port follows it.
    my ($name) = $url =~ m{^\w+://([^/:]+)};
    return unless defined $name;
    print "host: $name\n";
    # And get the matching IPs; elements 4 and up of
    # gethostbyname's list are the packed addresses.
    my @addresses = gethostbyname($name);
    return unless @addresses > 4;
    return map { inet_ntoa($_) } @addresses[4 .. $#addresses];
}
# Check our IP in the
# Directi list in memory.
sub match_ip {
    # Accept a list of addresses or a space-separated string.
    my @addresses = map { split(' ', $_) } grep { defined } @_;
    foreach my $address (@addresses) {
        print "address: $address\n";
        # Convert the dotted quad to a 32-bit integer.
        my $p = unpack("N", inet_aton($address));
        my $found = 0;
        foreach my $i (keys %ip_from) {
            # The address must lie between both ends of the range.
            next unless $p >= $ip_from{$i} && $p <= $ip_to{$i};
            print "country: " . $country{$i} . "\n";
            print "code3: " . $code3{$i} . "\n";
            print "code2: " . $code2{$i} . "\n";
            $found = 1; last;
        }
        unless ($found) {
            print "country: unknown\n";
            print "code3: unknown\n";
            print "code2: unknown\n";
        }
    }
}

Be sure to replace insert key here with
your Google API key.
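
The match_ip routine walks the ranges one by one for every address. If the ranges are kept in an array sorted by their starting value and do not overlap (an assumption about how you load the data, not something the script above guarantees), a binary search finds the answer in far fewer steps. The sketch below uses a hypothetical find_range helper and small made-up integer ranges purely to show the technique.

```perl
use strict;
use warnings;

# Return the label of the range containing $n, or nothing if no range
# does. $ranges is an array ref of [from, to, label] rows sorted by
# "from", with no overlaps.
sub find_range {
    my ($ranges, $n) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($from, $to, $label) = @{ $ranges->[$mid] };
        if    ($n < $from) { $hi = $mid - 1 }   # look in the lower half
        elsif ($n > $to)   { $lo = $mid + 1 }   # look in the upper half
        else               { return $label }    # $from <= $n <= $to
    }
    return;
}

# Made-up ranges with a gap between 199 and 300.
my @ranges = ( [ 0, 99, "A" ], [ 100, 199, "B" ], [ 300, 399, "C" ] );
my $label = find_range(\@ranges, 150);
print defined $label ? $label : "unknown", "\n";
```

Each pass halves the part of the list still under consideration, so even a table with tens of thousands of rows needs only a handful of comparisons per address.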
2.28.2. Running the Hack
Here, we're querying to see how much worldwide
penetration AmphetaDesk, a popular news aggregator, has, according to
Google's top search results:

% perl geospider.pl -q "amphetadesk"
Trying to open ip-to-country.csv...
Database opened, loading...
Querying Google with amphetadesk...
Processing results from Google...
--------------------------------------------------------------
Search time: 0.081432s
Query: amphetadesk
Languages: en
Domain:
Start at: 0
Return items: 10
--------------------------------------------------------------
url: http://www.macupdate.com/info.php/id/9787
host: www.macupdate.com
host: www.macupdate.com
address: 64.5.48.152
country: UNITED STATES
code3: USA
code2: US
--------------------------------------------------
url: http://allmacintosh.forthnet.gr/preview/214706l
host: allmacintosh.forthnet.gr
host: allmacintosh.forthnet.gr
address: 193.92.150.100
country: GREECE
code3: GRC
code2: GR
--------------------------------------------------
...etc...
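
To turn a listing like this into a summary, the country lines can be tallied. The sketch below assumes the "country: NAME" line format shown in the sample run; count_countries and format_tally are hypothetical helper names, and the sample lines are copied from the output above.

```perl
use strict;
use warnings;

# Count how often each country name appears in geospider.pl output.
sub count_countries {
    my (@lines) = @_;
    my %seen;
    for my $line (@lines) {
        next unless $line =~ /^country:\s*(.+?)\s*$/;
        $seen{$1}++;
    }
    return \%seen;
}

# Format the tally, most frequent country first, ties in name order.
sub format_tally {
    my ($seen) = @_;
    return map { sprintf "%4d  %s", $seen->{$_}, $_ }
           sort { $seen->{$b} <=> $seen->{$a} || $a cmp $b } keys %$seen;
}

my @sample = (
    "url: http://www.macupdate.com/info.php/id/9787",
    "country: UNITED STATES",
    "url: http://allmacintosh.forthnet.gr/preview/214706l",
    "country: GREECE",
);
print "$_\n" for format_tally(count_countries(@sample));
```

To process a real run, the lines in @sample could be replaced with the script's saved output, read from a file or from standard input.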
2.28.3. Hacking the Hack
This script is only a simple tool. You will make it better, no doubt.
The first thing you could do is implement a more efficient way to
query the IP-to-Country database. Storing data from
ip-to-country.csv in a database would cut
script startup time by several seconds. Also, the answers to
address-to-country queries could be obtained much faster.

You might ask if it wouldn't be easier to write a
spider that doesn't use the Google API and instead
downloads page after page of results returned by Google at
http://www.google.com. Yes, it is
possible, and it is also the quickest way to get your script
blacklisted for breaching Google's user
agreement. Google is not only the best search engine, it is also one
of the best-monitored sites on the Internet.

Jacek Artymiak