Google Hacks, 2nd Edition
Tara Calishain


9.8. A Note on Spidering and Scraping


Some small share of the hacks in this book involves spidering, or meandering through sites and scraping data from their web pages to be used outside of its intended context. Given that we have the Google API at our disposal, why then do we resort at times to spidering and scraping?

The main reason is simply that you can't get at everything Google has to offer through the API. While it nicely serves the purpose of searching the Web programmatically, the API (at the time of this writing) doesn't go any further than Google's main web search index, and it's even limited in what you can pull from that index. You can't do a phonebook search, trawl Google News, leaf through Google Catalogs, or interact in any way with any of Google's other specialty search properties.

So, while the Google API provides a good start, more often than not you'll run into situations in which you can't get at the Google data you're most interested in. Nor does the API help when you want to combine what it does give you with data from other sites that offer no such convenience. That's where spidering and scraping come in.
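
To make the idea concrete, here is a minimal Python sketch of the scraping half of the job: fetch a page and pull out its links. The URL is a stand-in and the link-collecting parser is just one illustrative approach, not a recipe from this book; adjust both to whatever page you're actually after.

    # A minimal scraping sketch: fetch one page and collect its links.
    # The URL below is a stand-in; point it at the page you actually need.
    import urllib.request
    from html.parser import HTMLParser

    class LinkScraper(HTMLParser):
        """Collect the href attribute of every anchor tag on the page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    url = "http://www.example.com/"  # stand-in URL
    with urllib.request.urlopen(url) as response:
        page = response.read().decode("utf-8", errors="replace")

    scraper = LinkScraper()
    scraper.feed(page)
    for link in scraper.links:
        print(link)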

That said, there are a few things that you need to keep in mind when
resorting to scraping:

Scrapers are brittle
The shelf life of a scraper lasts only as long as the page it scrapes remains formatted in about the same manner. When the page changes, your scraper can, and most likely will, break.


Tread lightly
Take only as much as you need and no more. If all you need is the data from a page you already have open in your browser, save the source and scrape that local copy (the first sketch after this list shows one way).


Maximize your effectiveness
Make the most of every page you scrape. Rather than hitting Google again and again for the next 10 results and the next 10 after that, set your preferences ["Setting Preferences" in Chapter 1] so that you get all you can on a single page. For instance, set your preferred number of results to 100 rather than the default 10 (the second sketch after this list shows the equivalent query parameter).


Mind the terms of service
It might be tempting to go one step further and create programs that automate retrieving and scraping, but you're then more likely to tread on the toes of the site owner (Google or otherwise) and be asked to leave, or simply locked out.
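
Here is the first sketch: scraping a results page you've already saved from your browser, so the server is never hit a second time. The filename and the loose link-matching pattern are assumptions for illustration; inspect your own saved source and tune the pattern to it.

    # Scrape a page you've already saved from your browser
    # (File -> Save Page As...), so you never re-request it.
    # The filename and pattern below are assumptions; adjust
    # them to match your own saved source.
    import re

    with open("saved_results.html", encoding="utf-8", errors="replace") as f:
        source = f.read()

    # A deliberately loose pattern: grab each link's URL and anchor text.
    pattern = re.compile(r'<a href="(http[^"]+)"[^>]*>(.*?)</a>', re.S)
    for url, title in pattern.findall(source):
        title = re.sub(r"<[^>]+>", "", title)  # strip tags left in the text
        print(url, "-", title.strip())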


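And the second sketch: building a search URL that asks for 100 results at once, mirroring the preference described above. The num parameter reflects the long-standing form of Google's web-search query string; verify it against the current syntax before depending on it.

    # Build a query URL requesting 100 results per page instead of 10,
    # so one fetch covers what would otherwise take ten. The num
    # parameter mirrors the results-per-page preference; confirm it
    # against the site's current query syntax before relying on it.
    from urllib.parse import urlencode

    def results_url(query, per_page=100):
        """Return a Google web-search URL asking for per_page results."""
        return "http://www.google.com/search?" + urlencode(
            {"q": query, "num": per_page})

    print(results_url("google hacks"))
    # prints: http://www.google.com/search?q=google+hacks&num=100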

So use the API whenever you can, scrape only when you absolutely
must, and mind your p's and q's
when fiddling about with other people's data.

