Google Hacks, 2nd Edition [Electronic resources] (text version)


Tara Calishain


Hack 44. Yahoo! Directory Mindshare in Google

How does link popularity compare in
Yahoo!'s searchable subject index versus
Google's full-text index? Find out by calculating
mindshare!

Yahoo! and Google are two very different
animals. Yahoo! indexes only a site's main URL,
title, and description, while Google builds full-text indexes of
entire sites. Surely there's some interesting
cross-pollination when you combine results from the two.

This hack scrapes all the URLs in a specified subcategory of the
Yahoo! directory. It then takes each URL and gets its link count from
Google. Each link count provides a nice snapshot of how a particular
Yahoo! category and its listed sites stack up on the popularity
scale.


What's a link count?
It's simply the total number of pages in
Google's index that link to a specific URL.

There are a couple of ways you can use your knowledge of a
subcategory's link count. If you find a subcategory
whose URLs have only a few links each in Google, you may have found a
subcategory that isn't getting a lot of attention
from Yahoo!'s editors. Consider going elsewhere for
your research. If you're a webmaster and
you're considering paying to have Yahoo! add you to
their directory, run this hack on the category in which you want to
be listed. Are most of the links really popular? If they are, are you
sure your site will stand out and get clicks? Maybe you should choose
a different category.

We got this idea from a similar experiment Jon Udell (http://weblog.infoworld.com/udell/) did in
2001. He used AltaVista instead of Google; see http://udell.roninhouse.com/download/mindshare-script.txt.
We appreciate the inspiration, Jon!


2.26.1. The Code


You will need a Google API account (http://api.google.com), as well as the
SOAP::Lite (http://www.soaplite.com) and HTML::LinkExtor
(http://search.cpan.org/author/GAAS/HTML-Parser/lib/HTML/LinkExtor.pm)
Perl modules to run this hack.

Save the code as mindshare_calculator.pl, remembering to replace
"insert key here" with your Google API key:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = 'insert key here';
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML__".
                  "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# Download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# Create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# Extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);

sub mindshare { # for each link we find...
    my ($tag, %attr) = @_;

    # Continue on only if the tag was a link,
    # and the URL matches Yahoo!'s redirectory.
    return if $tag ne 'a';
    return unless $attr{href} =~ /rds.yahoo/;
    return unless $attr{href} =~ /\*http/;

    # Now get our real URL.
    $attr{href} =~ /\*(http.*)/; my $url = $1;
    $url =~ s/%3A/:/; # turn encoding into legits.

    # And process each URL through Google.
    my $results = $google_search->doGoogleSearch(
        $google_key, "link:$url", 0, 1,
        "true", "", "false", "", "", ""
    ); # wheee, that was easy, guvner.
    $urls{$url} = $results->{estimatedTotalResultsCount};
}

# Now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }
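
The redirect-unwrapping step inside mindshare() can be pulled out into a
small function and checked offline, without touching Yahoo! or Google. What
follows is only a minimal sketch, not part of the original script: the name
unwrap_redirect is hypothetical, the sample href is made up for
illustration, and it decodes every percent-escape rather than only the
first %3A as the listing does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): given the href of a
# directory link, return the real target URL, or nothing if the href does
# not look like a Yahoo! redirect of the form ...rds.yahoo...*http...
sub unwrap_redirect {
    my ($href) = @_;
    return unless defined $href;
    return unless $href =~ /rds\.yahoo/;
    return unless $href =~ /\*(http.*)/;
    my $url = $1;
    # Decode all percent-escapes (the listing only rewrites the first %3A).
    $url =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $url;
}

# Illustrative input only; the exact redirect format is an assumption.
my $sample = 'http://rds.yahoo.com/S=1/*http%3A//www.example.com/';
print unwrap_redirect($sample), "\n";   # prints http://www.example.com/
```

Keeping this step separate makes it easy to try the pattern against a few
saved hrefs before spending any Google API queries.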

2.26.2. Running the Hack


The hack takes its only configuration, the Yahoo! directory
you're interested in, as a single argument (in quotes) on the
command line ["How to Run the Scripts" in the Preface]. If you
don't pass one of your own, a default directory will
be used instead.

% perl mindshare_calculator.pl "/Entertainment/Humor/Procrastination/"

Your results show the URLs in that directory, sorted by total
Google links:

340: http://www.p45.net/
246: http://www.ishouldbeworking.com/
81: http://www.india.com/
33: http://www.jlc.net/~useless/
23: http://www.geocities.com/SouthBeach/1915/
18: http://www.eskimo.com/~spban/creedl
13: http://www.black-schaffer.org/scp/
3: http://www.angelfire.com/mi/psociety
2: http://www.geocities.com/wastingstatetime/
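
To make the "how popular is this category?" question a little more
concrete, the counts above can be boiled down to a couple of summary
numbers. This is only an illustrative sketch: the helper names median and
count_below are hypothetical, the counts are copied from the sample run
above, and the threshold of 20 links is an arbitrary assumption rather than
anything the hack prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Link counts copied from the sample run above.
my @counts = (340, 246, 81, 33, 23, 18, 13, 3, 2);

# Hypothetical helper: median of a list of numbers (undef for an empty list).
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return undef unless @sorted;
    my $mid = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid]
                       : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Hypothetical helper: how many values fall below $threshold.
sub count_below {
    my ($threshold, @values) = @_;
    return scalar grep { $_ < $threshold } @values;
}

# prints: sites: 9, median links: 23, under 20 links: 4
printf "sites: %d, median links: %s, under 20 links: %d\n",
    scalar @counts, median(@counts), count_below(20, @counts);
```

A low median with many sites under the threshold would suggest a quiet
category; a high median suggests stiffer competition for clicks.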

2.26.3. Hacking the Hack


Yahoo! isn't the only searchable subject index out
there, of course. There's also the
Open Directory Project (DMOZ, http://www.dmoz.org), which is the product of
thousands of volunteers busily cataloging and categorizing sites on
the Web (the web community's Yahoo!, if you
will). This hack works just as well on DMOZ as it does on Yahoo!;
they're very similar in structure.

Replace the default Yahoo! directory with its DMOZ equivalent:

my $dmoz_dir = shift || "/Reference/Libraries/Library_and_Information_".
               "Science/Technical_Services/Cataloguing/Metadata/RDF/Applications/RSS/".
               "News_Readers/";

You'll also need to change the download instructions:

# Download the Dmoz.org directory.
my $data = get("http://dmoz.org" . $dmoz_dir) or die $!;

Next, replace the lines that check whether a URL should be measured
for mindshare. When we were scraping Yahoo! in our original script,
every directory entry was prefixed with http://rds.yahoo.com/ and then the URL
itself. Thus, to ensure we received a proper URL, we skipped over the
link unless it matched those criteria:

return unless $attr{href} =~ /rds.yahoo/;
return unless $attr{href} =~ /\*http/;

Since DMOZ is an entirely different site, our checks for validity
have to change. DMOZ doesn't modify the outgoing
URL, so our previous Yahoo! checks have no relevance here. Instead,
we'll make sure it's a full-blooded
location (i.e., it starts with http://) and it
doesn't match any of DMOZ's
internal page links. Likewise, we'll ignore searches
on other engines:

return unless $attr{href} =~ /^http/;
return if $attr{href} =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;

Our last change is to modify the bit of code that gets the real URL
from Yahoo!'s modified version. Instead of
"finding the URL within the URL":

# Now get our real URL.
$attr{href} =~ /\*(http.*)/; my $url = $1;

we simply assign the URL that HTML::LinkExtor
has found:

# Now get our real URL.
my $url = $attr{href};
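
Taken together, the two DMOZ checks amount to a single yes/no test on each
href. The sketch below wraps them in one function so they can be tried on a
few strings before running the full script. The name is_dmoz_listing and
the sample hrefs are hypothetical; the two patterns are the ones shown
above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): true if an href found
# on a DMOZ category page looks like an external listing worth measuring.
sub is_dmoz_listing {
    my ($href) = @_;
    return 0 unless defined $href;
    # Must be an absolute URL...
    return 0 unless $href =~ /^http/;
    # ...and must not point back into DMOZ or out to another search engine.
    return 0 if $href =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;
    return 1;
}

# Made-up hrefs for illustration only.
for my $href ('http://www.example.com/reader/',
              'http://dmoz.org/about.html',
              '/Reference/Libraries/',
              'http://www.google.com/search?q=rss') {
    printf "%-4s %s\n", (is_dmoz_listing($href) ? 'keep' : 'skip'), $href;
}
```

Note that the second pattern is a plain substring match, so a legitimate
listing whose URL happens to contain one of those names would be skipped
as well.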

Can you go even further with this? Sure! You might want to search a
more specialized directory, such as the FishHoo! fishing search engine
(http://www.fishhoo.com).

You might want to return only the most linked-to URL from the
directory. That's quite easy: pipe the results
["How to Run the Hacks" in the
Preface] to a common Unix utility:

% perl mindshare_calculator.pl | head -n 1

Alternatively, you might want to go ahead and grab the top 10 Google
matches for the URL that has the most mindshare. To do so, add the
following code to the bottom of the script:

print "\nMost popular URLs for the strongest mindshare:\n";
my $most_popular = shift @sorted_urls;
my $results = $google_search->doGoogleSearch(
    $google_key, "$most_popular", 0, 10,
    "true", "", "false", "", "", ""
);
foreach my $element (@{$results->{resultElements}}) {
    next if $element->{URL} eq $most_popular;
    print " * $element->{URL}\n";
    print " \"$element->{title}\"\n\n";
}

Then, run the script as usual (the output here uses the default
hardcoded directory):

% perl mindshare_calculator.pl
27800: http://radio.userland.com/
6670: http://www.oreillynet.com/meerkat/
5460: http://www.newsisfree.com/
3280: http://ranchero.com/software/netnewswire/
1840: http://www.disobey.com/amphetadesk/
847: http://www.feedreader.com/
797: http://www.serence.com/site.php?page=prod_klipfolio
674: http://bitworking.org/Aggiel
492: http://www.newzcrawler.com/
387: http://www.sharpreader.net/
112: http://www.awasu.com/
102: http://www.bloglines.com/
67: http://www.blueelephantsoftware.com/
57: http://www.blogtrack.com/
50: http://www.proggle.com/novobot/
Most popular URLs for the strongest mindshare:
* http://groups.yahoo.com/group/radio-userland/
"Yahoo! Groups : radio-userland"
* http://groups.yahoo.com/group/radio-userland-francophone/message/76
"Yahoo! Groupes : radio-userland-francophone Messages : Message 76 ... "
* http://www.fuzzygroup.com/writing/radiouserland_faq
"Fuzzygroup :: Radio UserLand FAQ"
...

Kevin Hemenway and Tara Calishain
