Sounds easy, right? Searchers enter queries, the search engine looks up the search terms in its organic index, ranks the best matches first, and then displays the results. But how did all those pages get into the index in the first place? That is what Figure 2-5 shows, and the rest of this chapter explains. This information is critical to you, the search marketer, because if your pages are not in the index, no searcher can ever find them.
Search engines gather most pages with a program called a spider (sometimes called a crawler). Spiders start by examining Web pages in a seed list, because the spider needs to start somewhere. But after the spider gets started, it discovers sites on its own by following links.
A spider uses the same links you click in your Web browser. When the spider examines the page, it sees the Hypertext Markup Language (HTML) code that indicates a link to another page (see
crawling the Web. Later in this chapter, we explain what the spider does with the information it collects from all of those pages that it crawls.
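To make the idea concrete, here is a minimal sketch of a spider that starts from a seed list and discovers further pages by following links. It is an illustration only, not how any particular search engine works: the `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` dictionary (which stands in for real Web requests) are all hypothetical names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the destination of every <a href="..."> link on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def crawl(seed_list, fetch):
    """Visit the seed pages first, then every page reachable by links."""
    queue = deque(seed_list)
    seen = set(seed_list)
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "Web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    "http://example.com/products": '<a href="/products/widget">Widget</a>',
    "http://example.com/products/widget": "<p>No links here.</p>",
}

crawled = crawl(["http://example.com/"], FAKE_WEB.get)
print(crawled)
```

Only the home page is in the seed list, yet the sketch reaches all four pages because each one is linked from a page the spider has already seen. A page that no other page links to would never be found this way.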
It sounds easy. Spiders visit the pages and send them to the search index. Those little spiders keep crawling until they index the entire Web, right? Wrong. The truth is that the great majority of Web pages are not indexed in search engines. Chapter 10, "Get Your Site Indexed," shows you how to find out how many pages are indexed from your organization's site and some simple ways to get more of them indexed.
Following links is important because it is the best way for a spider to comprehensively crawl the Web. But it is important for another reason, too. Spiders must carefully catalog every link they find, checking which pages link to your page and checking the words displayed that describe the link (the anchor text). Earlier in this chapter, we discussed how search engines rank search results; they do so with this information. Figure 2-7 shows how spiders collect the link information that is so important to ranking the results.
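The cataloging step can be sketched in a few lines. The example below is an assumption-laden illustration rather than any search engine's actual method: `AnchorCollector`, `build_link_catalog`, and the sample `PAGES` are hypothetical names and data. For each page that is linked to, it records which pages link to it and the anchor text they use.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Records (target URL, anchor text) for every link on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None  # target of the link currently being read
        self._text = []    # pieces of anchor text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def build_link_catalog(pages):
    """Map each target URL to the (linking page, anchor text) pairs found."""
    catalog = defaultdict(list)
    for source_url, html in pages.items():
        collector = AnchorCollector(source_url)
        collector.feed(html)
        for target_url, anchor_text in collector.anchors:
            catalog[target_url].append((source_url, anchor_text))
    return catalog


# Hypothetical pages that both link to the same destination.
PAGES = {
    "http://a.example/": '<p>Try these <a href="http://c.example/">digital cameras</a>.</p>',
    "http://b.example/": '<a href="http://c.example/">camera <b>reviews</b></a>',
}

catalog = build_link_catalog(PAGES)
for source, text in catalog["http://c.example/"]:
    print(source, "->", text)
```

The catalog is keyed by the page being linked to, not the page doing the linking, because ranking needs to know who points at a page and what words they use to describe it.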
As you can imagine, Web crawling is not the most efficient way to keep up with changes to those billions of Web pages. New pages can be added, old pages removed, and existing pages changed at any time, and the spider will not immediately know that anything has changed. It can be days or weeks before the spider returns to see what happened. That is why a searcher sometimes gets a "page not found" message when clicking a search result. The spider found that page during its last crawl, but it has since been removed or given a new address. Chapter 3, "How Search Marketing Works," covers a service some search engines offer called paid inclusion that can help address this problem.
Even without paid inclusion, however, the best spiders try to keep their indexes "fresh" by varying how often they revisit sites. Spiders return more frequently to sites that change more quickly. Suppose a spider visits two pages on the same day and returns to both exactly a month later. If one page has changed and the other has not, the spider can decide to revisit the changed page in two weeks, but wait six weeks to return to the unchanged page. Over time, this technique can greatly vary the spider's return rate from page to page, raising the freshness of the index by revisiting volatile pages most frequently.
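One simple way to picture this adjustment is as a rule that shortens the wait after a change and lengthens it otherwise. The sketch below is hypothetical: the function names, the halving and one-and-a-half-times factors, and the one-day and ninety-day limits are assumptions chosen so that a four-week interval reproduces the two-week and six-week figures in the example above. Real spiders use their own, unpublished rules.

```python
import hashlib


def content_fingerprint(html):
    """Short digest of a page, used to tell whether it has changed."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def next_interval_days(current_days, changed,
                       shrink=0.5, grow=1.5, min_days=1, max_days=90):
    """Return the wait before the next visit (all factors are assumed).

    A changed page is revisited sooner; an unchanged page is left longer.
    The result is kept between min_days and max_days.
    """
    factor = shrink if changed else grow
    return max(min_days, min(max_days, current_days * factor))


# Two hypothetical pages, each last visited four weeks (28 days) ago.
old_news, new_news = "<p>Monday headlines</p>", "<p>Tuesday headlines</p>"
old_about, new_about = "<p>About us</p>", "<p>About us</p>"

news_changed = content_fingerprint(old_news) != content_fingerprint(new_news)
about_changed = content_fingerprint(old_about) != content_fingerprint(new_about)

news_wait = next_interval_days(28, news_changed)    # changed page
about_wait = next_interval_days(28, about_changed)  # unchanged page
print(news_wait, about_wait)
```

Applied repeatedly, a rule like this drives frequently changing pages toward the minimum interval and static pages toward the maximum, which is the widening spread in return rates that the text describes.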
Spiders also return more often to sites that have the highest-quality pages. Google, for example, tends to revisit pages with higher PageRank more frequently (perhaps once per week) than other pages. The Yahoo! spider, in general, does not return to sites as frequently as Google's, but it also pays more attention to well-linked pages.
By far, most pages in organic search engines are gathered by the search engine's spider, but crawling is not the only way to get your data into the search engine. Another way is a trusted feed; that is, your site sends pages to the search engine, which are processed and stored in the index as soon as they are received. In Chapter 10, we examine the use of trusted feeds as part of your search marketing program.