Google Hacks, 2nd Edition [Electronic resources], text version

Tara Calishain

Hack 83. Scrape Google AdWords

Scrape the AdWords from a saved Google results
page into a form suitable for importing into a spreadsheet or
database.

Google's AdWords, the text ads that appear to the right of the regular
search results, are delivered on a cost-per-click basis, and purchasers
of the AdWords are allowed to set a ceiling on the amount of money
that they spend on their ad. This means that, even if you run a
search for the same query word multiple times, you
won't necessarily get the same set of ads each time.

If you're considering using Google AdWords to run
ads, you might want to gather up and save the ads that are running
for the query words that interest you. Google AdWords is not included
in the functionality provided by the Google API, so
you're left to a little scraping to get at that
data.


Be sure to read "A Note on Spidering and
Scraping" in Chapter 9 for
some understanding of what scraping means.

This hack will let you scrape the AdWords from a saved Google results
page and export them to a
comma-separated (CSV) file, which you can then import into Excel or
your favorite spreadsheet program.


This hack requires a Perl module called
HTML::TokeParser (http://search.cpan.org/search?query=htmL%3A%3Atokeparser&mode=all).
You'll need to install it before the hack will run.
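Before running the hack, it can help to confirm that the module actually loads. The following is a minimal sketch using only core Perl; the helper name module_available is hypothetical and is not part of the hack:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# module_available is a hypothetical helper, not part of adwords.pl.
# It returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Turn Some::Module into Some/Module.pm so require can take it as a path.
    ( my $path = "$module.pm" ) =~ s{::}{/}g;
    return eval { require $path; 1 } ? 1 : 0;
}

for my $module (qw( HTML::TokeParser )) {
    print "$module: ",
      ( module_available($module) ? "installed" : "missing" ), "\n";
}
```

If the script reports the module as missing, install it from CPAN before going any further.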


7.6.1. The Code


Save this code to a text file named adwords.pl:

#!/usr/bin/perl
# adwords.pl
# usage: perl adwords.pl results.html
#
use strict;
use HTML::TokeParser;

die "I need at least one file\n"
  unless @ARGV;

my @Ads;
for my $file (@ARGV) {
  # skip if the file doesn't exist
  # you could add more file testing here.
  # errors go to STDERR so they won't
  # pollute our csv file
  unless (-e $file) {
    warn "What??: $file -- $! \n-- skipping --\n";
    next;
  }
  # now parse the file
  my $p = HTML::TokeParser->new($file);
  while (my $token = $p->get_token) {
    # look for a start tag <a> whose id is "aw" followed by a digit
    next unless $token->[0] eq 'S'
      and $token->[1] eq 'a'
      and $token->[2]{id} =~ /^aw\d$/;
    my $link = $token->[2]{href};
    my $ad;
    if ($link =~ /pagead/) {
      # the ad's target is carried in the adurl= parameter
      my ($url) = $link =~ /adurl=([^&]+)/;
      $ad->{href} = $url;
    } elsif ($link =~ m{^/url\?}) {
      # the ad's target is carried in q=; undo a few URL escapes
      my ($url) = $link =~ /&q=([^&]+)/;
      $url =~ s/%3F/?/g;
      $url =~ s/%3D/=/g;
      $url =~ s/%25/%/g;
      $ad->{href} = $url;
    }
    # the headline is the text up to the closing </a> tag
    $ad->{adwords} = $p->get_trimmed_text('/a');
    # the description is the text up to the closing </font> tag
    $ad->{desc} = $p->get_trimmed_text('/font');
    # the displayed URL is the last word of the description
    ($ad->{url}) = $ad->{desc} =~ /(\S+)$/;
    push(@Ads, $ad);
  }
}

# Interest is listed in the header, but this version of the
# script does not extract it, so each row carries four fields.
print quoted( qw( AdWords HREF Description URL Interest ) );
for my $ad (@Ads) {
  print quoted( @$ad{qw( adwords href desc url )} );
}

# wrap each field in double quotes and join with commas
sub quoted {
  return join( ",", map { "\"$_\"" } @_ )."\n";
}
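The link-handling branches of the script can be tried on their own, without a saved results page. The sketch below mirrors the two cases (a "pagead" link carrying the target in adurl=, and a "/url?" link carrying it in q=). The helper names ad_target and percent_decode are hypothetical, the sample links are made up for illustration, and the decoder handles every %XX escape in one pass rather than only the three that the script rewrites by hand:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode every %XX escape in a single pass.
sub percent_decode {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Hypothetical helper mirroring the two branches in adwords.pl.
# Returns undef when the link matches neither pattern.
sub ad_target {
    my ($link) = @_;
    my $url;
    if ( $link =~ /pagead/ ) {
        ($url) = $link =~ /adurl=([^&]+)/;
    }
    elsif ( $link =~ m{^/url\?} ) {
        ($url) = $link =~ /&q=([^&]+)/;
        $url = percent_decode($url) if defined $url;
    }
    return $url;
}

# Made-up sample links, for illustration only.
print ad_target('/pagead/iclk?sa=l&ai=XYZ&adurl=http://www.example.com/'), "\n";
print ad_target('/url?sa=l&q=http://www.example.com/page%3Fid%3D7'), "\n";
```

Run as written, this prints http://www.example.com/ and then http://www.example.com/page?id=7.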

7.6.2. How It Works


Call this script on the command line ["How to Run
the Hacks" in the Preface], providing the name of
the saved Google results page and a file in which to put the CSV
results:

% perl adwords.pl input.html > output.csv

input.html is the name of the Google results
page that you've saved.
output.csv is the name of the comma-delimited
file to which you want to save your results. You can also provide
multiple input files on the command line if you'd
like:

% perl adwords.pl input.html input2.html > output.csv

7.6.3. The Results


The results will appear in a comma-delimited format that looks like
this:

"AdWords","HREF","Description","URL","Interest"
"Free Blogging Site","http://www.1sound.com/ix",
" The ultimate blog spot Start your journal now ","www.1sound.com/ix","40"
"New Webaga Blog","http://www.webaga.com/blog.php",
" Fully customizable. Fairly inexpensive. ","www.webaga.com","24"
"Blog this","http://edebates.e-thepeople.org/a-national/article/10245/view&",
" Will online diarists rule the Net strewn with failed dotcoms? ",
"e-thePeople.org","26"
"Ford - Ford Cars","http://quickquote.forddirect.com/FordDirect.jsp",
" Build a Ford online here and get a price quote from your local dealer! ",
"www.forddirect.com","40"
"See Ford Dealer's Invoice","http://buyingadvice.com/search/",
" Save $1,400 in hidden dealership profits on your next new car. ",
"buyingadvice.com","28"
"New Ford Dealer Prices","http://www.pricequotes.com/",
" Compare Low Price Quotes on a New Ford from Local Dealers and Save! ",
"www.pricequotes.com","25"


Each line is prematurely broken in this code listing for the purposes
of publication.
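One thing to watch in this format: the script's quoted routine wraps each field in double quotes but does not escape a double quote that appears inside the ad text, so such an ad would produce a malformed row. A minimal sketch of a stricter writer follows, assuming the common CSV convention of doubling embedded quotes; the name csv_row is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# csv_row is a hypothetical replacement for the hack's quoted() sub.
# It doubles any embedded double quote and treats undef as an empty field.
sub csv_row {
    my @fields = map { defined $_ ? $_ : '' } @_;
    s/"/""/g for @fields;
    return join( ',', map { '"' . $_ . '"' } @fields ) . "\n";
}

# Made-up sample values, for illustration only.
print csv_row( 'Blog this', 'Say "hello" to blogging', undef );
```

This prints "Blog this","Say ""hello"" to blogging","" on a single line, which Excel and most spreadsheet programs read back as three fields.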

You'll see that the hack returns the AdWords
headline, the link URL, the description in the ad, the URL on the ad
(this is the URL that appears in the ad text, while the HREF is what
the URL links to), and the Interest, which is the size of the
Interest bar on the text ad. The Interest bar gives an idea of how
many click-throughs an ad has had, showing how popular it is.
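Once the ads are in CSV form, they can be ranked by the Interest column without opening a spreadsheet. The sketch below assumes that every field is wrapped in double quotes and that each record sits on one line, as the script writes them; parse_row is a hypothetical helper, and the two records are taken from the sample results above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_row is a hypothetical helper. It assumes every field is wrapped in
# double quotes, with "" standing for an embedded quote.
sub parse_row {
    my ($line) = @_;
    my @fields = $line =~ /"((?:[^"]|"")*)"/g;
    s/""/"/g for @fields;
    return @fields;
}

# Two records from the sample results, each joined onto one line.
my @lines = (
    '"New Webaga Blog","http://www.webaga.com/blog.php"," Fully customizable. Fairly inexpensive. ","www.webaga.com","24"',
    '"Free Blogging Site","http://www.1sound.com/ix"," The ultimate blog spot Start your journal now ","www.1sound.com/ix","40"',
);

# Sort by the fifth column (Interest), highest first.
my @rows = sort { $b->[4] <=> $a->[4] } map { [ parse_row($_) ] } @lines;
printf "%3d  %s\n", $_->[4], $_->[0] for @rows;
```

With these two records, the "Free Blogging Site" ad (Interest 40) is listed ahead of "New Webaga Blog" (Interest 24).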

Tim Allwine and Tara Calishain
