Google Hacks, 2nd Edition
Tara Calishain


9.8. A Note on Spidering and Scraping


Some small share of the hacks in this book involves spidering, or meandering through sites and scraping data from their web pages to be used outside of its intended context. Given that we have the Google API at our disposal, why then do we resort at times to spidering and scraping?

The main reason is simply that you can't get at everything Google has to offer through the API. While it nicely serves the purpose of searching the Web programmatically, the API (at the time of this writing) doesn't go any further than Google's main web search index, and it's even limited in what you can pull from that index. You can't do a phonebook search, trawl Google News, leaf through Google Catalogs, or interact in any way with any of Google's other specialty search properties.

So, while the Google API provides a good start, more often than not you'll run into situations in which you can't get at the Google data you're most interested in. Nor does the API help when you want to combine what it does give you with data from other sites that offer no such convenience. That's where spidering and scraping come in.
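
To make the idea concrete, here is a minimal Python sketch of the scraping half of the job: fetch a page and pull out its links. The URL is a stand-in and the link-collecting parser is just one illustrative approach, not a recipe from this book; adjust both to whatever page you're actually after.

    # A minimal scraping sketch: fetch one page and collect its links.
    # The URL below is a stand-in; point it at the page you actually need.
    import urllib.request
    from html.parser import HTMLParser

    class LinkScraper(HTMLParser):
        """Collect the href attribute of every anchor tag on the page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    url = "http://www.example.com/"  # stand-in URL
    with urllib.request.urlopen(url) as response:
        page = response.read().decode("utf-8", errors="replace")

    scraper = LinkScraper()
    scraper.feed(page)
    for link in scraper.links:
        print(link)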

That said, there are a few things that you need to keep in mind when
resorting to scraping:

Scrapers are brittle
The shelf life of a scraper lasts only as long as the page it scrapes remains formatted in about the same manner. When the page changes, your scraper can, and most likely will, break.


Tread lightly
Take only as much as you need and no more. If all you need is the data from a page you already have open in your browser, save the source and scrape that local copy (the first sketch after this list shows one way).


Maximize your effectiveness
Make the most of every page you scrape. Rather than hitting Google again and again for the next 10 results and the next 10 after that, set your preferences ["Setting Preferences" in Chapter 1] so that you get all you can on a single page. For instance, set your preferred number of results to 100 rather than the default 10 (the second sketch after this list shows the equivalent query parameter).


Mind the terms of service
It might be tempting to go one step further and create programs that automate retrieving and scraping, but you're then more likely to tread on the toes of the site owner (Google or otherwise) and be asked to leave, or simply locked out.
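
Here is the first sketch: scraping a results page you've already saved from your browser, so the server is never hit a second time. The filename and the loose link-matching pattern are assumptions for illustration; inspect your own saved source and tune the pattern to it.

    # Scrape a page you've already saved from your browser
    # (File -> Save Page As...), so you never re-request it.
    # The filename and pattern below are assumptions; adjust
    # them to match your own saved source.
    import re

    with open("saved_results.html", encoding="utf-8", errors="replace") as f:
        source = f.read()

    # A deliberately loose pattern: grab each link's URL and anchor text.
    pattern = re.compile(r'<a href="(http[^"]+)"[^>]*>(.*?)</a>', re.S)
    for url, title in pattern.findall(source):
        title = re.sub(r"<[^>]+>", "", title)  # strip tags left in the text
        print(url, "-", title.strip())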


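And the second sketch: building a search URL that asks for 100 results at once, mirroring the preference described above. The num parameter reflects the long-standing form of Google's web-search query string; verify it against the current syntax before depending on it.

    # Build a query URL requesting 100 results per page instead of 10,
    # so one fetch covers what would otherwise take ten. The num
    # parameter mirrors the results-per-page preference; confirm it
    # against the site's current query syntax before relying on it.
    from urllib.parse import urlencode

    def results_url(query, per_page=100):
        """Return a Google web-search URL asking for per_page results."""
        return "http://www.google.com/search?" + urlencode(
            {"q": query, "num": per_page})

    print(results_url("google hacks"))
    # prints: http://www.google.com/search?q=google+hacks&num=100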

So use the API whenever you can, scrape only when you absolutely
must, and mind your p's and q's
when fiddling about with other people's data.

