After you have determined that your site is indexed, and you have calculated how many pages you have indexed, you are certain to be greedy for more. The number of pages you can have indexed is limited only by the number of pages on your site. Many sites have millions of pages included, whereas some prominent Web sites have only their home page indexed.
You can take several steps to raise the inclusion ratio of your site, including the following:
Eliminate spider traps. Your Web site might actually prevent the spiders from indexing your pages. You will learn what the traps are and how you can spring the spider from each one.
Reduce ignored content. Spiders have certain rules they live by, and if your content breaks the rules, you lose. Find out what those rules are and how to reduce the amount of content spiders ignore on your site.
Create spider paths. You can coax spiders to index more of your site by creating site maps and other navigation that simplifies the link structure for all of your site's pages.
Use paid inclusion. One way to ensure inclusion is to pay your way in, for those search engines that allow it.
Many complain about the inability of spiders to index certain content. Although we are the first to agree that spiders can improve their crawling techniques, there are good reasons why spiders stay away from some of this content. You have a choice as to whether you wring your hands and complain about the spiders or set to work pleasing them so your pages are indexed. You can guess which path will be more successful.
If your site is suffering from a low inclusion ratio, you can take several steps, but eliminating spider traps is the most promising place to start.
As we have said before, spiders cannot index all pages. But we have yet to say what causes problems with the spiders. That's where we're going now: spider traps. Spider traps are barriers that prevent spiders from crawling a site, usually stemming from technical approaches to displaying Web pages that work fine for browsers but do not work for spiders. By eliminating these techniques from your site, you allow spiders to index more of your pages.
Unfortunately, many spider traps are the product of highly advanced technical approaches and highly creative user-experience designs, which were frightfully expensive to develop. No one wants to hear, after all the money was spent, that your site has been shut out of search. Yet that is the bad news that you might need to convey.
Luckily, spiders become more sophisticated every year. Designs that trapped spiders a few years ago are now okay. But you need to keep up with spider advances to employ some cutting-edge techniques.
So here they come! Here is how you eliminate the most popular spider traps.
Pretend that you are the Webmaster of your site, and you just learned that there is a software probe that has entered your Web site and appears to be examining every page on the site. And it seems to come back over and over again. Sounds like a security problem, doesn't it? Even if you could assure yourself that nothing nefarious is afoot, it is wasting the time of your servers.
Too often, that is how Webmasters view search spiders: a menace that needs to be controlled. And the robots.txt file is the way to control spiders.
It is a remarkably innocuous-looking file, a simple text file that is placed in the root directory of a Web server. Your robots.txt file tells the spider what files it is allowed to look at on that server. No technical reasons prevent spiders from looking at the disallowed files, but there is a gentleman's agreement that spiders will be polite and abide by the instructions.
A robots.txt file contains only two operative statements:
user-agent. The user agent statement defines which spiders the next disallow statement applies to. If you code an asterisk for the user agent, you are referring to all spiders, but you can also specify the name of just a particular spider using the list provided in Table 10-1.
disallow. The disallow statement specifies which files the spider is not permitted to crawl. You can specify a precise filename or any part of a name or directory name; the spider will treat that as a matching expression and disallow any file that matches that part of the name. So, specifying /e eliminates all files starting with e from the crawl, as well as all files in any directory that begins with e. Specifying / disallows all files.
Figure 10-5 shows a robots.txt file with explanations of what each line means.
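To see how a polite spider might interpret these two statements, here is a minimal Python sketch that uses the standard library's robots.txt parser. The robots.txt content, the spider name "ExampleSpider," and the example.com URLs are all made up for illustration; real spiders may apply the rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /e
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the named spider is permitted to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical pages on an imaginary site.
candidates = [
    "http://www.example.com/index.html",
    "http://www.example.com/cgi-bin/order.cgi",
    "http://www.example.com/events/launch.html",
    "http://www.example.com/about/events.html",
]

crawlable = allowed_urls(ROBOTS_TXT, "ExampleSpider", candidates)
for url in crawlable:
    print(url)
```

In this sketch, the /cgi-bin/ program and everything under the /events/ directory are excluded, because the disallow values are matched against the beginning of each path. The page at /about/events.html stays crawlable because its path does not begin with /e.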
Webmasters have a legitimate reason to keep spiders out of certain directories on their servers: server performance. Most Web servers have programs stored in the cgi-bin directory, so it is a good idea to have your robots.txt file say "disallow: /cgi-bin/" to save the server from having to send the spider all those program files the spider does not want to see anyway. The trouble comes when an unsuspecting Webmaster does not understand the implications of disallowing other files, or all files.
Although many Webmasters use the robots.txt file to deliberately exclude spiders, accidental exclusion is all too common. Imagine a case where this file was used on a beta site to hide it from spiders before the site was launched. Unfortunately, the exclusionary robots.txt file might be left in place after launch, causing the entire Web site to disappear from all search indexes.
In addition to the robots.txt file that controls spiders across your entire site, there is a way to instruct spiders on every page: the robots metatag. In the <head> section of the HTML of your page, a series of metatags are typically found in the form <meta name="type"> (where the "type" is the kind of metatag). One such metatag type is the robots tag (<meta name="robots">), which can control whether the page should be indexed and whether links from the page should be followed.
If the robots.txt file disallows a particular page, it does not matter what the robots metatag on that page says because the spider will not look at the page at all. If the page is allowed by the robots.txt instructions, however, the robots metatag is consulted by the spider as it looks at the page.
Figure 10-6 shows the variations available in the robots metatag for restricting indexing (placing the content in the index) and "link following" (using pages linked from this page as the next page to crawl). If the robots metatag is missing, the page is treated as if "index, follow" was specified.
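The following Python sketch shows one way to read a page's robots metatag and work out the two decisions it controls. The class and function names are our own invention, and the logic is deliberately simplified: it looks only for "noindex" and "nofollow," and treats a missing tag as "index,follow," as described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the index/follow decisions from a robots metatag, if any."""

    def __init__(self):
        super().__init__()
        # Defaults apply when the page has no robots metatag at all.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = [part.strip().lower() for part in content.split(",")]
        if "noindex" in directives:
            self.index = False
        if "nofollow" in directives:
            self.follow = False


def robots_directives(html):
    """Return (index, follow) as a pair of booleans for the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


# Hypothetical pages for illustration.
directory_page = '<html><head><meta name="robots" content="index,nofollow"></head></html>'
product_page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
plain_page = "<html><head><title>No robots tag</title></head></html>"

for name, page in [("directory", directory_page), ("product", product_page), ("plain", plain_page)]:
    print(name, robots_directives(page))
```

Running a check like this across your most important pages is a quick way to spot restrictive tags that were added for some long-forgotten reason.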
Although you would normally want your pages to be coded without robots metatags (or with robots metatags specified as "index,follow"), there are legitimate reasons to use a robots tag to suppress spiders. Some pages on your site should be viewed only from the beginning of the sequence, such as a visual tour or a presentation. Although there is no problem with allowing searchers to land in the middle of such sequences, some site owners might not want them to, so they could code a robots tag on the first page of the presentation that says "index,nofollow" and specify "noindex,nofollow" on all the other pages.
Recall from Chapter 7 that Snap's search landing page for the keyword phrase "best digital camera" was not in any of the indexes. Examining the pages showed that a number of Snap pages had restrictive robots tags. The product directory page was using the <meta name="robots" content="index,nofollow"> version of the tag. This caused the spider not to follow any of the directory page's links to the actual product pages. Moreover, even if this problem had not existed, each of the actual product pages had the <meta name="robots" content="noindex,nofollow"> version of the tag. The Web developers indicated it was done so that the commerce system would not be overloaded by search engine spiders. After the developers were educated about search marketing, the tags were removed and the pages were indexed.
Most Web users dislike pop-up windows, those annoying little ads that get in your face when you are trying to do something else. Pop-up ads are so universally reviled that pop-up blockers are in wide use. Many sites still use pop-ups, however, believing that drawing attention to the window is more important than what Web users want.
Many Web sites use pop-up windows for more than ads. So, if user hatred is not enough to cure you of pop-up windows, maybe this is: Spiders cannot see them. If your site uses pop-ups to display related content, that content will not get indexed. Even worse, if your site uses pop-ups to show menus of links to other pages, the spider cannot follow those links, and those pages cannot be reached by the spider.
If your site uses pop-ups to display complementary content, the only way to get that content indexed is to stop using pop-up windows. You must add that content to the pages that it complements, or you must create a standard Web page with a normal link to it. If you are having trouble convincing your extended search team to dump pop-ups, remind them that the rise of pop-up blockers means that many of your visitors are not seeing this content either.
If you are using pop-up windows for navigation menus, you can correct this spider trap in the same way, by adding the links to each page that requires them and removing the pop-up, but you have another choice, too. You can decide to leave your existing pop-up navigation in place, but provide alternative paths to your pages that the spiders can follow. We cover these so-called spider paths later in the chapter.
As with navigation displayed through pop-up windows, spiders are trapped by pull-down navigation shown with JavaScript coding.
Enter the <noscript> tag. Page designers can add this tag to provide alternative code for any browser that does not support JavaScript. Spiders will not execute JavaScript, so they process the <noscript> code instead. If you must use JavaScript navigation, you can place standard HTML link code in your <noscript> section. However, for search spiders to follow the links, they must contain the full path names (starting with http) for each linked page. To further ensure the spiders can find these pages, list these pages in your site map.
So-called dynamic pages are those whose HTML code is not stored permanently in files on your Web server. Instead, for a dynamic page, a program creates the HTML "on-the-fly" whenever a visitor requests to view that page, and the browser displays that HTML just as if it had been stored in a file. Pages whose HTML is stored in files are called static Web pages, to distinguish them from the dynamic pages possible today.
It did not take long to bump into the limitations of static pages: they contained the exact same information every time they were viewed. Soon the first technique for dynamic pages was defined, called the Common Gateway Interface (CGI), which allowed a Web server to run a program to dynamically create the page's HTML and return it to the visitor's Web browser. That way, there never needs to be a file containing the HTML; the program can generate the HTML the moment the page is requested for viewing.
You have probably noticed that some URLs look "different": they contain special characters that would not occur in the name of a directory or file. Figure 10-8 dissects a dynamic URL and shows what each part of it means.
The parameters in each dynamic URL (the words that start with the ampersand character [&]) are what cause complications for spiders. Because just about any value (the words that follow the equals sign character [=]) can be passed to the variable, search spiders have no way of knowing how many different variations of the same page can be shown. Sometimes different values passed to each parameter indicate a legitimate difference in the pages, such as in Figure 10-8, where each book has a different number. But other times, the values do not have anything to do with what content is displayed, such as so-called "track codes," in which the Web site is designed to log visitors coming from certain places for measurement purposes. A spider could look at the exact same page thousands of times because the tracking parameter in the URL is different each time. Not only does this waste the spider's time (when it could be looking at truly new pages from other sites), but sometimes it causes these pages to be stored in the index, resulting in massive duplication of content. Clearly spiders must be wary of how they crawl dynamic sites.
In the early days of dynamic pages, spiders had a simple solution for this dynamic site problem: they refused to crawl any page with one of the tell-tale characters (? or & or others) in its URL. But CGI programs were just the first of a long list of techniques allowing programs to generate Web pages dynamically. Over time, more and more Web pages have become dynamic, especially on corporate sites. Highly personalized sites consist of almost 100 percent dynamic pages. Most e-Commerce catalogs consist of dynamic pages.
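The duplication problem is easy to demonstrate. The Python sketch below strips tracking parameters from a handful of URLs and shows that they all collapse to a single page. The parameter names ("trackcode" and "source"), the catalog.cgi program, and the example.com URLs are hypothetical; your own site's tracking parameters will have different names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of parameters that never change what the page displays.
TRACKING_PARAMETERS = {"trackcode", "source"}


def canonical_url(url):
    """Return the URL with tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in TRACKING_PARAMETERS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Three URLs a spider might find for what is really one page.
crawled = [
    "http://www.example.com/cgi-bin/catalog.cgi?book=1234&trackcode=newsletter",
    "http://www.example.com/cgi-bin/catalog.cgi?book=1234&trackcode=banner17",
    "http://www.example.com/cgi-bin/catalog.cgi?book=1234&source=partner",
]

unique_pages = {canonical_url(url) for url in crawled}
print(len(crawled), "URLs crawled;", len(unique_pages), "distinct page")
```

A spider has no list of which parameters are safe to ignore on your site, which is why it must guess, and why it often guesses conservatively.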
Because so much important Web content has become dynamic, the search engines have tried to adjust. Search spiders now index dynamic pages under certain circumstances:
The URL has no more than two dynamic parameters. Okay, it's not really that simple. There are circumstances where even two dynamic parameters are too many (see the "rule" on session identifiers below), and there are other circumstances in which pages with URLs having more than two parameters are still indexed. If you must use more than two parameters in your URL, you might be able to use a technique known as URL rewrite, as explained below.
The URL has fewer than 1,000 characters. Ridiculously long URLs are ignored, but shorter ones seem okay. There is no reason to have URLs anywhere near 1,000 characters, so make them as short and as readable as possible.
The URL does not contain a session identifier. Session identifiers are parameters named "ID=" or "Session=" (or some other similar name) that are used to keep track of which visitor is looking at the page. Spiders hate this kind of parameter because the exact same content uses a different URL every time it is displayed; a spider could put thousands of copies of identical pages in its index because they all have different URLs. If your pages contain this parameter, have your programmers use an alternative approach, as we describe below, because spiders will not (and should not) index all of these duplicate pages.
Every valid URL is linked from category lists or site maps. Because some dynamic pages can use almost any value for their parameters, there is no way for the search spider to know every valid product number for your product catalog. You must ensure that there are spider paths to every valid dynamic page on your site. This technique reduces risks for the spiders in crawling the pages, so it encourages many to index your pages.
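The first three rules of thumb above can be checked mechanically, as this Python sketch suggests. Treat it as a rough screening aid, not a statement of how any search engine actually decides: the thresholds come from the rules above, and the list of session parameter names is our own assumption. The fourth rule (spider paths to every valid URL) cannot be judged from a URL alone.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_PARAMETERS = 2
MAX_URL_LENGTH = 1000
# Assumed names; check what your own application actually uses.
SESSION_PARAMETER_NAMES = {"id", "session", "sessionid", "sid"}


def crawl_warnings(url):
    """Return a list of reasons a spider might hesitate to index this URL."""
    warnings = []
    parameters = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if len(parameters) > MAX_PARAMETERS:
        warnings.append("more than two dynamic parameters")
    if len(url) >= MAX_URL_LENGTH:
        warnings.append("URL is 1,000 characters or longer")
    if any(name.lower() in SESSION_PARAMETER_NAMES for name, _ in parameters):
        warnings.append("possible session identifier parameter")
    return warnings


# Hypothetical URLs for illustration.
simple_url = "http://www.example.com/cgi-bin/catalog.cgi?book=1234"
risky_url = "http://www.example.com/cgi-bin/catalog.cgi?book=1234&lang=en&session=98765"

print(crawl_warnings(simple_url))
print(crawl_warnings(risky_url))
```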
If your site relies on passing more than two parameters in the URL, you might benefit from the URL rewrite technique, which causes your dynamic URL to be shown as a static URL in appearance. Snap Electronics used mod_rewrite (a URL-rewriting module for the Apache Web server) to convert the dynamic URLs for its e-Commerce catalog to appear to be static URLs. This allowed many of its product pages to be crawled that were missing from search indexes previously.
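To make the idea concrete, here is a minimal Python sketch of what a rewrite rule does behind the scenes. It is not mod_rewrite itself, and it is not how Snap Electronics configured its servers. The path layout, the parameter names, and the catalog.cgi program are all hypothetical.

```python
import re
from urllib.parse import urlencode

# Hypothetical static-looking layout: /catalog/<language>/<category>/<product>.html
STATIC_PATTERN = re.compile(
    r"^/catalog/(?P<lang>[a-z]{2})/(?P<category>[a-z0-9-]+)/(?P<product>\d+)\.html$"
)


def rewrite(path):
    """Map a static-looking path to the dynamic URL that really serves it.

    Returns None when the path does not match the rewrite pattern.
    """
    match = STATIC_PATTERN.match(path)
    if match is None:
        return None
    query = urlencode(
        {
            "lang": match.group("lang"),
            "category": match.group("category"),
            "product": match.group("product"),
        }
    )
    return "/cgi-bin/catalog.cgi?" + query


print(rewrite("/catalog/en/cameras/1234.html"))
print(rewrite("/about/index.html"))
```

The spider and the visitor see only the static-looking path. The Web server quietly translates it into the three-parameter URL that the catalog program expects.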
As noted above, pages with session identifiers cause problems for spiders; your pages will not be indexed unless you remove them from your URL parameters. You might be wondering, "Why did my Web developers use session identifiers in the first place?" It's not that complicated. As visitors move from page to page on your site, each program that displays a new page wants to "remember" what your visitor did on prior pages. So, for example, the order confirmation page wants to remember that your visitor signed in and provided credit card information on the checkout page. Simple enough, but where does the session identifier come in?
Your developer decided (correctly) that the best way to share information between these separate programs that display different pages was to store the information in a database that each program can read and change. So, the program that displays the checkout page can store the credit card and sign-in data in the database; then the program that displays the order confirmation page can read that data from the database. But how does the order confirmation page know which record in the database has the information for each person that views the page? That is where the session identifier comes in.
When the visitor reaches the checkout page, the checkout page program creates a session identifier (a unique number that no other visitor gets) that it will associate with that visitor for the rest of the session (that visit to the Web site). When it stores information in the database, that program stores it with a "key" of the session identifier, and any of the other programs can read that information if they know the key. Which brings us back to the original problem: the developer is passing the key to each program in the URL session identifier parameter.
Your developers can provide this function without using a session identifier parameter, however. If your programmers are using a sophisticated Web application environment, a "session layer" usually provides a mechanism for programs to pass information to one another; that is the best solution for the session identifier problem. If your Web infrastructure is not so sophisticated, you can use a cookie to hold the session information. If you go the cookie route, be careful not to trap the spider by forcing all visitors to have cookies enabled. We discuss why that is a problem in our next section.
Web sites are marvels of technology, but sometimes they are a bit too marvelous. Your Web developers might create such exciting pages that they require visitors' Web browsers to support the latest and greatest technology. Or have their privacy settings set a bit low. Or to reveal information. In short, your Web site might require visitors to take certain actions or to enable certain browser capabilities in order to operate. And although that is merely annoying for your visitors, it can be deadly for search spiders, because they might not be up to the task of viewing your Web site.
If you are as old as we are, you might remember the Pac-Man video game, in which a hungry yellow dot roamed the screen eating other dots, but changed course every time it hit a wall or another impediment. Search spiders are very similar. They will hungrily eat your spider food pages until they hit an impediment; then they will turn tail and go in a different direction. Let's look at some of the most popular technical dependencies:
Requiring cookies. "Cookies" are information stored on the visitor's computer that Web pages can use to remember things about the visitor. For example, if your site says "Welcome Jane" at the top of the page every time she returns to your site, the name Jane is probably stored in a cookie on Jane's computer. When Jane views your page, her browser reads the cookie and displays her name in the right spot. Normally this works just fine, but what if your Web page requires that Jane's browser use cookies for this function, or else it displays an error page? First, some of your site visitors turn off cookies (for privacy reasons) and would not be able to view your site. But search spiders cannot accept cookies either, so they are also blocked from your pages. The bottom line is that your site can use cookies all it wants, but it should not require them to view a page. If your site's design absolutely depends on all visitors accepting cookies (such as to pass a required session identifier), this is a legitimate reason to use the IP delivery technique we discussed earlier in this chapter. By detecting a spider's user agent name and IP address, your program could allow spiders to look at the page without accepting the cookie, while still forcing cookies on all Web browsers. Make sure that your developers are careful to deliver the same page content to the spider as to the visitor, so you are not accused of spamming.
Requiring software downloads. If your site requires certain technology to view it, such as Macromedia Flash, Java, or something else, your visitors must download the software before entering your site. In addition to being somewhat inconvenient for your visitors, it completely blocks spiders. Spiders are not Web browsers, so they cannot interact with your site to download the required software. In addition, spiders can only read document formats, such as HTML and PDF files (files that contain lots of text to index for search), so when they run into software download requirements, they go elsewhere. Your entire site might be blocked from indexing if you have this spider trap on your home page.
Requiring information. Frequently, sites are designed to be personalized in some way, which can be very good for visitors, but sometimes the designers go too far. Sites that require visitors to answer questions before viewing your pages are annoying to your visitors, and (you are getting the idea now) unusable by spiders, because all the spiders see is the HTML form that requests the input, and they cannot enter any words to get your site to show the actual pages. If visitors must enter their e-mail address before they download a case study or their country and language before seeing your product catalog, you are asking for something that spiders cannot do. So the spiders cannot enter the required information to see the case study or the product catalog, and they will mosey on down to your competitor's site. Similarly, if your pages require an ID and password to "sign in" before you show them, the spider is unable to sign in. The simplest way to think about this issue is that if your site prompts the visitor to do anything more than click a standard hypertext link, the spider will be at a loss and move on.
Requiring JavaScript. By far the most common dependency for Web pages is on JavaScript. JavaScript is a very useful programming language that allows your Web pages to be more interactive, responding to the visitor's cursor, for example, and JavaScript also allows your Web pages to use cookies, as discussed earlier. Used properly, JavaScript causes no problems for spiders, but frequently it is misused. In the next section, we discuss the pitfalls of JavaScript usage, but for now, just understand that your page should not require JavaScript in order to be displayed. Spiders cannot execute JavaScript, and some Web visitors also turn it off for security reasons. If your page tests for JavaScript before it allows itself to be displayed, it will not display itself to spiders, and none of its links to other pages can be followed.
To see these problems for yourself, turn off graphics, cookies, and JavaScript in your browser or use the text-only Lynx browser (lynx.browser.org); if you do not want to download Lynx, you can use the Lynx Viewer (http://www.delorie.com/web/lynxview.html). You will see which pages force the use of certain technologies, and you will get a good look at what a spider actually sees. Any time you need to do anything more complicated than clicking a link to continue, the spider is probably blocked.
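If you prefer a programmatic check, the Python sketch below approximates a spider's view of a page's links. It collects only standard hypertext links, ignores anything written out by JavaScript, and skips "javascript:" pseudo-links. The sample page and its example.com URLs are invented, and real spiders are more sophisticated than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every standard <a> link in the HTML.

    The parser treats the contents of <script> elements as plain text,
    so links written out by JavaScript are never seen, much as a
    spider that does not execute JavaScript would miss them.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


def spider_visible_links(html):
    """Return the links a non-JavaScript spider could follow."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page mixing spider-friendly and spider-hostile navigation.
sample_page = """
<html><body>
<a href="http://www.example.com/products.html">Products</a>
<script type="text/javascript">
document.write('<a href="http://www.example.com/hidden.html">Hidden</a>');
</script>
<a href="javascript:openMenu()">Menu</a>
<noscript><a href="http://www.example.com/support.html">Support</a></noscript>
</body></html>
"""

visible = spider_visible_links(sample_page)
print(visible)
```

The link inside the <noscript> section is found, but the link that JavaScript writes into the page is not.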
The inventor of the Web, Tim Berners-Lee, once observed that "URLs do not change, people change them." The best advice to search marketers is never to change your URLs, but at some point you will probably find it necessary to change the URL for one of your pages. Perhaps your Webmaster might want to host that page on a different server, which requires the URL to change. At other times the content of a page changes so that the old URL does not make sense anymore, such as when you change the brand name of your product and the old name is still in the URL. When a URL must change, you need a redirect: an instruction to Web browsers to display a different URL from the one the browser requested. Redirects allow old URLs to be "redirected" to the current URL, so that your visitors do not get a "page not found" message (known as an HTTP 404 error) when they use the old URL.
A visitor might be using an old URL for any number of reasons, but here are the most common ones:
Bookmarks. If a visitor bookmarked your old URL, that bookmark will yield a 404 error the first time the visitor tries to use it after you change that URL.
Links. Other pages on the Web (on your site and other sites) link to that old URL. All of those links will become broken if you change the URL with no redirect in place.
Search results. As you can imagine, search spiders found your page using the old URL and indexed the page using that URL. When searchers find your page, they are clicking the old URL that is stored in the search index, so they will get a 404 if no redirect is in place.
Now that you understand that URLs will often change for your pages, and that redirects are required so that visitors can continue to find those pages, you need to know a little bit about spiders and redirects. The kind of redirect that spiders handle well is the server-side redirect; you might hear it called a "301" redirect, from the HTTP status code returned to the spider. A 301 status code tells the spider that the page has permanently changed to a new URL, which causes the spider to do two vitally important things:
Crawl the page at the new URL. The spider will use the new URL provided in the 301 redirect instruction to go to that new location and crawl the page just as you want it to. It will index all the content on the page, and it will index it using the new URL, so all searches that bring up that page will lead searchers to the new URL, not the old one.
Transfer the value of all links to the old page. You have learned how important it is to have links to your page: the search engine ranks your page much higher when other pages (especially other important pages) link to your page. When the spider sees a 301 redirect, it updates all the linking information in its index; your page retains under its new URL all the link value that it had under its old URL.
Unfortunately, not all Webmasters use server-side redirects. There are several methods of redirecting pages, two of which are especially damaging to your search marketing efforts:
JavaScript redirects. One way of executing a redirect embeds the new URL in JavaScript code. So, your Web developer moves the page's real HTML to the new URL and codes a very simple page for the old URL that includes JavaScript code sending the browser to the new URL (such as <script language="JavaScript" type="text/javascript"> window.location="http://www.yourdomain.com/newURL"</script>).
Meta refresh redirects. A meta tag in the <head> section of your HTML can also redirect a page; it is commonly called a "meta refresh" redirect (such as <meta http-equiv="Refresh" content="5; URL=http://www.yourdomain.com/newURL" />). This tag flashes a screen (in this case for five seconds) before displaying the new URL.
Search spiders normally cannot follow JavaScript, and in any case, both of these techniques are commonly used by search spammers so they can get the search spider to index the content on the old URL page while taking visitors to the new URL page (which might have entirely different content). These kinds of redirects will not take the spider to your new URL and they will not get your new URL indexed, which is what you want. Make sure that your Webmaster uses 301 redirects for all page redirection, and make sure that your Web developers are not using JavaScript and "meta refresh" redirects.
How your Webmaster implements a 301 redirect depends on what kind of Web server displays the URL. For the most common Web server, Apache, the Webmaster might add a line to the .htaccess file, like so:
Redirect 301 /OldDirectory/OldName.html http://www.YourDomain.com/NewDirectory/NewName.html
You would obviously substitute your real directory and filenames. Understand, however, that some Apache servers are configured to ignore .htaccess files, and other kinds of Web servers have different means of setting up permanent redirects, so what your Webmaster does might vary. The point is that your Webmaster probably knows how to implement server-side redirects, and search spiders know how to follow them.
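For servers that are not configured through Apache directives, the same permanent redirect can be issued by the program that serves the page. Here is a minimal sketch written as a Python WSGI application. The paths and the example.com domain are hypothetical, and your Webmaster's actual server setup will almost certainly differ.

```python
# Hypothetical map of old paths to their permanent new URLs.
MOVED_PERMANENTLY = {
    "/old-products/camera.html": "http://www.example.com/products/camera.html",
}


def application(environ, start_response):
    """A tiny WSGI application that answers moved paths with a 301."""
    path = environ.get("PATH_INFO", "")
    new_url = MOVED_PERMANENTLY.get(path)
    if new_url is not None:
        # The Location header tells browsers and spiders where the page now lives.
        start_response(
            "301 Moved Permanently",
            [("Location", new_url), ("Content-Length", "0")],
        )
        return [b""]
    body = b"Page not found"
    start_response(
        "404 Not Found",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


# To try it locally, you could serve it with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, application).serve_forever()
```

What matters to the spider is the status line and the Location header, not the technology that produces them.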
Server-side redirects are also used for temporary URL changes using an HTTP 302 status code. A 302 temporary redirect can be followed by the spider just as easily as a 301. Webmasters have various reasons for implementing 302s, but one is important to search marketers: so-called vanity URLs. Sometimes it is nice to have a URL that is easy to remember, such as www.yourdomain.com/product that shows the home page for one of your products. You tell everyone linking to your product page to use that vanity URL. But behind the scenes, your Webmaster can move that page to a different server whenever needed for load balancing and other reasons. By using a 302 redirect, the spider uses your vanity URL in the search index but indexes the content on the page it redirects to.
Before implementing any 301 or 302 redirect, your Webmaster should take care not to add "hops" to the URL; in other words, not to add a redirect on top of a previous redirect. For example, if the vanity URL has been temporarily directed (302) to the current URL and now needs to be directed to a new URL, the existing 302 redirect should generally be changed to the new URL. If, instead, the Webmaster implements a permanent (301) redirect from the current URL to the new URL, you now have two "hops" from your vanity URL to the real page. Not only does this slow performance for your visitors, but spiders are known to abandon pages with too many hops (possibly as few as four). Use a free tool at www.searchengineworld.com/cgi-bin/servercheck.cgi to check how your URLs redirect.
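Counting hops is simple enough to sketch in a few lines of Python. This example works from a table of redirects that you would compile from your own server configuration; it does not contact any Web server. The URLs and the table itself are made up for illustration.

```python
def redirect_chain(redirects, start_url, max_hops=10):
    """Follow a table of redirects and return every URL visited, in order.

    `redirects` maps a URL to a (status code, target URL) pair. The walk
    stops at a URL with no redirect, at a loop, or after max_hops hops.
    """
    chain = [start_url]
    current = start_url
    while current in redirects and len(chain) <= max_hops:
        _status, current = redirects[current]
        if current in chain:
            # A redirect loop; record the repeated URL and stop.
            chain.append(current)
            break
        chain.append(current)
    return chain


# Hypothetical redirect table showing the two-hop mistake.
redirects = {
    "http://www.example.com/product": (302, "http://server1.example.com/product/home.html"),
    "http://server1.example.com/product/home.html": (301, "http://server2.example.com/product/home.html"),
}

chain = redirect_chain(redirects, "http://www.example.com/product")
hops = len(chain) - 1
print(hops, "hops:", " -> ".join(chain))
```

In this made-up table, the fix is to point the vanity URL's 302 directly at the page on server2, which brings the count back down to one hop.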
Make sure that your Webmaster is intimately familiar with search-safe methods of redirection, and confirm that the proper procedures are explained in your site standards so that all redirects are performed with care. Make sure that redirects are regularly reviewed and purged when no longer needed so that the path to your page is as direct as possible.
If it sounds basic, well, it is; however, it is a problem on all too many Web sites. When the spider comes to call, your Web server must be up. If your server is down, the spider receives no response from your Web site. At best, the spider moves along to a new server and leaves your pages in its search index (without seeing any page changes you have made, of course). At worst, the spider might conclude (after a few such incidents over several crawls) that your site no longer exists, and then delete all of the missing pages from the search index.
Don't let this happen to you. Your Webmaster obviously wants to keep your Web site available to serve your visitors anyway, but sometimes hardware problems and other crises cause long and frequent outages for a period of time, possibly causing your pages to be deleted from one or more search indexes.
A less-severe but related problem is slow page loading. Although your site is technically up, the pages might be displayed so slowly that the spider soon abandons the site. Few spiders will wait 10 seconds for a page. Spiders are in a hurry, so if good performance for your visitors is not enough of a motivation, speed up your site for the spider's sake.
After you have eliminated your spider traps and the spiders can crawl your pages, the next issue you might encounter is that they ignore some of your content. Spiders have refined tastes, and if your content is not the kind of food they like, they will move on to the next page or the next site. Let's see what you should do to make your spider food as tasty as possible.
Like most of us, spiders do not want to do any unnecessary work. If your HTML pages routinely consist of thousands and thousands of lines, spiders are less likely to index them all, or will index them less frequently. For the same time they spend crawling your bloated site, they could crawl two others.
In fact, every spider will stop crawling a page when it gets to a certain size. The Google and Yahoo! spiders seem to stop at about 100,000 characters, but every spider has a limit programmed into it. If you have very large pages, they might not be getting crawled or not crawled completely.
Once in a while, someone decides to put all 264 pages of the SnapShot DLR200 User's Guide on one Web page. Obviously, the 264-page manual belongs on dozens of separate Web pages with navigation from the table of contents. Breaking up a large page also helps improve keyword density by making the primary keywords stand out more in the sea of words. Not only is this better for search engines, your visitors will be happier, too.
The most frequent cause of fat pages, however, is embedded JavaScript code. No matter what the cause, there is no technical reason to have pages this large, and you should insist they be fixed. It is even easier to fix JavaScript bloat than large text pages: all you need to do is move the JavaScript from your Web page to an external file. The code works just as well, but the spider does not have to crawl through it.
We recently reviewed the home page of a large consulting company and found their home page source code equal to 21 printed pages of text. Ninety percent of that content was JavaScript, much of which could be placed in external files and called when the page is loaded. Doing so leaves the remaining 10 percent of real content, which becomes tasty spider food. If your Web pages suffer from this kind of bloating, cutting them down to size will improve the number of pages indexed (and often their search ranking).
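A rough way to spot this kind of bloat is to measure how much of a page's source sits inside inline script elements. The sketch below uses Python's standard `html.parser`. The class and function names are hypothetical, and the result is an estimate, not a precise audit.

```python
from html.parser import HTMLParser


class InlineScriptCounter(HTMLParser):
    """Count characters that sit inside inline <script> elements."""

    def __init__(self):
        super().__init__()
        self._in_inline_script = False
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        # A <script src="..."> element is already external; count only inline code.
        if tag == "script" and "src" not in dict(attrs):
            self._in_inline_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        if self._in_inline_script:
            self.script_chars += len(data)


def inline_script_share(html: str) -> float:
    """Return the fraction of the page source taken up by inline JavaScript."""
    if not html:
        return 0.0
    counter = InlineScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.script_chars / len(html)
```

A page with a share close to the 90 percent described above is a strong candidate for moving its scripts into external files.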
When you surf your Web site with your browser, you rarely see an error message. The Web pages load properly and they look okay to you. It is understandable for you to think that the HTML that presents each page on your site has no errors. But you would be wrong.
Here is why. Web browsers, especially Internet Explorer, are designed to make visitors' lives easier by overlooking HTML problems on your pages. Browsers are very tolerant of flaws in the HTML code, striving to always present the page as best as possible, even though there might be many coding errors. Unfortunately, spiders are not so tolerant. Spiders are sticklers for correct HTML code.
And most Web sites are rife with coding errors. Web developers are under pressure to make changes quickly, and the moment it looks correct in the browser, they declare victory and move on to the next task. Very few developers take the time to test that the code is valid.
You must get your developers to validate their HTML code. They must understand that coding errors provide the wrong information to the search spider. Consider something as seemingly minor as misspelling the <title> tag as <tilte> in your HTML. Browsers will not display your title in the title bar at the top of the window, but because the rest of the page looks fine, your developers and your visitors probably will not notice the error. The title tag, however, is an extremely important tag for the search engine: a missing title makes it much harder (sometimes impossible) for that page to be found by searchers. Validating the code catches this kind of error before it hurts your search marketing.
Sometimes the errors are more subtle than a broken <title> tag. Comments in your HTML code might not be ended properly, causing the spider to ignore real page text that you meant to be indexed because it takes that text as part of the malformed comment. In addition, browsers will sometimes correctly display pages with slight markup errors, such as missing tags to end tables, but sometimes search spiders might lose some of your text. So, the page might look okay, but not all of your words got indexed, so searchers cannot find your page when they use those words. Occasionally, HTML links (especially those using relative addresses, where the full URL of the link is not spelled out) work fine in a browser but trip up the spider.
It is easy for your developers to validate their code. Just send them to http://validator.w3.org/ and they can enter the URL of any page they want to test. There are several flavors of valid HTML, from the strictest compliance with the standards to looser compliance that uses some older tags. As long as your page states what flavor it adheres to in the <!DOCTYPE> declaration, it will be validated correctly, and search spiders can read any flavor of valid HTML code. Make sure that your everyday development process requires that each page's HTML be validated before promotion to your production Web site.
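Before a page ever reaches the validator, a quick automated pre-check can catch the specific mistakes described above. This is a minimal sketch and not a substitute for real validation: it looks only for an empty or missing title, an unterminated comment, and a missing doctype declaration, and all names in it are hypothetical.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text found inside a correctly spelled <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def quick_checks(html: str) -> list:
    """Return a list of problems found; an empty list means none of these three."""
    problems = []
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    if not finder.title.strip():
        problems.append("missing or empty <title>")
    # More comment openers than closers suggests a comment was never ended.
    if html.count("<!--") > html.count("-->"):
        problems.append("unterminated comment")
    if "<!doctype" not in html.lower():
        problems.append("no doctype declaration")
    return problems
```

A check like this can run as part of the publishing process, with the full validator reserved for pages that pass it.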
Macromedia is a very successful company that has brought a far richer user experience to the Web than drab old HTML, allowing animation and other interactive features that spice up visual tours and demonstrations. This technology, called Flash, is supported on 98 percent of all browsers and can make your Web site far more appealing. (There are other graphical user environments similar to Flash, but Flash content is the vast majority, so we will just refer to everything as Flash, which is not far off.)
But (and you knew there was a but coming) spiders cannot index Flash content. Because Flash content is a lot closer to a video than a document, it is not clear how to index that content even if the spider could read it. Clearly, there is a lot less printed information in Flash content than on the average HTML page. So does that mean that you should not use Flash on your site? No. But it does mean you should use it wisely.
Reserve your use of Flash for content that you are happy not to have indexed, such as that 3D interactive view of your product or the walking tour of your museum's latest exhibit. You can also use Flash for application development, such as your online ordering system, something that you would not want indexed anyway. Do not use Flash to jazz up your annual report, unless you accept that no one can search for any words in the report to find it. And do not make your home page a Flash experience, unless you are exceedingly careful to ensure that spiders have another way into your site besides walking through the Flash door: give the spiders a plain old HTML link to boring old HTML pages. (Remember, you cannot pop up a question asking whether visitors want Flash or non-Flash because spiders cannot answer that question either.)
When you do use Flash, make sure you always have an HTML landing page to kick off any Flash experience. That way you can have a short page that describes the great walking tour of your museum and allows visitors to click the Flash content. By using this technique, you will give the search engines a page to index that might be found by searchers looking for your walking tour: they will find the dowdy HTML page that leads to the exciting Flash tour.
If you have a Web site built entirely in Flash content and you absolutely cannot change it to HTML, you can legitimately use the IP delivery technique discussed earlier to get your content into the search index. Here's how. Your Webmaster must implement an IP detection program that runs whenever a page requiring Flash is to be displayed. That program uses the user agent name and IP address to recognize the difference between when a spider is calling and when a Web browser is calling. The Flash content is served up as usual for Web browsers (for your visitors), but spiders get a different meal: they are served an HTML page that has the same text on it as the Flash content. This use of IP delivery is entirely legitimate because you are serving the same text content to visitors and spiders. Be extremely careful, however, never to serve different text to visitors and spiders, because that would (rightly) be considered spamming. Ensure that your publishing process forces your Flash and your HTML content to be synchronized after every update so that you do not inadvertently violate spam guidelines.
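The decision at the heart of such a detection program can be sketched in a few lines. Everything specific here is an assumption for illustration: the agent tokens, the address range (a reserved documentation range, not a real spider network), the function names, and the choice to require both the user agent and the IP address to match.

```python
import ipaddress

# Hypothetical values; a real deployment would use the user agent names and
# address ranges that each search engine actually publishes.
SPIDER_AGENT_TOKENS = ("googlebot", "slurp")
SPIDER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # reserved documentation range


def is_spider(user_agent: str, remote_ip: str) -> bool:
    """Treat a request as a spider only if both the agent and the address match."""
    agent_matches = any(token in user_agent.lower() for token in SPIDER_AGENT_TOKENS)
    address = ipaddress.ip_address(remote_ip)
    ip_matches = any(address in network for network in SPIDER_NETWORKS)
    return agent_matches and ip_matches


def choose_variant(user_agent: str, remote_ip: str) -> str:
    """Return which version of the page to serve: 'html' for spiders, 'flash' otherwise."""
    return "html" if is_spider(user_agent, remote_ip) else "flash"
```

Requiring both signals means a browser that merely claims a spider's user agent name still receives the Flash version. Whichever version is served, the text in the two must stay identical.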
So remember, use Flash for things that are truly interactive and visual, not documents. Or if you must use Flash for documents, make sure there is an HTML version of the document as well for spiders.
If your site's design has not been updated in a while, you might still have pages that use frames. Frames are an old technique of HTML coding that can display multiple sources of content in separate scrollable windows in the same HTML page. Frames have many usability problems for visitors, and have been replaced with better ways of integrating content on the same page, using content management systems and dynamic pages. But some sites still have pages coded with frames.
If you are among the unlucky to have frame-based pages on your site, the best thing to do is to replace them. Your visitors will have a better experience, and you will improve search marketing, too, because spiders have a devil of a time interpreting frame-based pages. Typically spiders ignore everything in the "frameset" and look for an HTML tag called <noframes> that was designed for (ancient) browsers that do not support frames.
There are techniques that people use to try to load the pertinent content for search into the <noframes> tag, but it is a lot of work to create and maintain. Our advice is to ditch frames completely. Creating a new frame-free page will end up being a lot less work in the long run and will improve the usability of your site, too.
Now that you have learned all about removing spider traps, let's look at the opposite approach, too. Sometimes it is very difficult or costly to remove a spider trap. In those cases, your only option is to provide an alternative way for the spider to traverse your site, so it can go around your trap. That's where spider paths come in.
Spider paths are just easy-to-follow routes through your site, such as site maps, category maps, country maps, or even text links at the bottom of the key pages. Quite simply, spider paths are any means that allow the spider to get to all the pages on your site. The ultimate spider path is a well thought-out and easy-to-navigate Web site; if your Web site has no spider traps, you might already have a wonderful set of spider paths. With today's ever-more-complex sites full of Flash, dynamic pages, and other spider-blocking technology, however, you need to make accommodations for spiders trapped by your regular navigation.
Site maps are very important, especially for larger sites. Human visitors like them because they enable them to see the breadth of information available to them, and spiders love them for the same reason.
Not only do site maps make it easier for spiders to get access to your site's pages, they also serve as very powerful clues to the search engine as to the thematic content of the site. The words you use as the anchor text for links from your site map can sometimes carry a lot of weight. Site maps often use the generic name for a product, whereas the product page itself uses the brand name, so searchers for the generic name might be brought to your product page because the site map linked to it using that generic name. Work closely with your information architects to develop your site map, and you will reap large search dividends.
For a small site, your site map can have direct links to every page on your site. You can categorize each page under a certain subject, similar to the way Yahoo! categorizes Web sites in its directory, so that your site map lists a dozen or so topics with links to a few pages under each one. Your site map does not need to follow your folder structure; sometimes the site map can offer an alternative way of navigating the site that helps some visitors. This simple approach probably works until you have about 100 pages.
When your site reaches several hundred pages, you cannot fit that many links on one site map page. You should modify your site map to link to category hub pages (maybe corresponding to the same topics that you used for your original site map). Because you might have just 10 to 15 links on your page (1 for each category), you might want to add a descriptive paragraph for each category to augment the link. From each category hub page your visitor can link deeper into the site to see all other pages. This approach can work even for sites with 10,000 pages or more.
Very large Web sites (100,000 pages or more) frequently have multiple top-level hub pages that, taken together, form an overall site map, because they cannot fit all of their topics on one site map page. IBM's Web site (www.ibm.com) uses this approach, with its top three hub pages for "Products," "Services & Solutions," and "Support & Downloads," as you can see in Figure 10-9. These three pages are shown in a navigation bar at the top of every page on the site, including the home page, making it very easy for spiders. Each page lists a number of categories relevant to the page: the "Products" page lists all of IBM's product categories, with a similar list on the "Services & Solutions" page, and the "Support & Downloads" page provides links to the support centers for IBM products. Taken together, these pages form an extensive site map that spiders feast on, returning at least weekly to see whether any important links have been added to IBM's site. In addition, search engines consider these pages to be highly authoritative, with Google assigning them a PageRank of 9 or 10 at times.
Your site map might not do as well as that of a popular site such as ibm.com, but it shows the importance of a site map page as the key page on your site for spiders. If you have new content that you want the spiders to find quickly, add a link to it from your site map page. Remember, too, that because some search engine spiders limit the number of links that they index on a page, links within the site map should be ordered by level of importance. You should also try to include text on your site map page, rather than just a list of bare links. Adding text to this page provides the spiders with more valuable content to index and more clues as to what your content is about.
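The site map advice above (categories, a descriptive paragraph for each, and links ordered by importance) lends itself to generating the page from data. The following is a minimal sketch under assumed inputs: the dictionary layout, the numeric importance score, and the function name are all hypothetical.

```python
from html import escape


def build_site_map(categories: list) -> str:
    """Render a site map as HTML.

    Each category is a dict with 'name', 'description', and 'links', where
    every link is a tuple of (anchor_text, url, importance).
    """
    parts = ["<h1>Site Map</h1>"]
    for category in categories:
        parts.append(f"<h2>{escape(category['name'])}</h2>")
        # Descriptive text gives spiders more than a list of bare links.
        parts.append(f"<p>{escape(category['description'])}</p>")
        parts.append("<ul>")
        # Most important links first, in case a spider reads only some of them.
        ordered = sorted(category["links"], key=lambda link: link[2], reverse=True)
        for anchor_text, url, _importance in ordered:
            parts.append(
                f'  <li><a href="{escape(url, quote=True)}">{escape(anchor_text)}</a></li>'
            )
        parts.append("</ul>")
    return "\n".join(parts)
```

Because the anchor text is supplied separately from the URL, the site map can use the generic product name even when the product page itself uses only the brand name.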
As you have seen, different Web sites have different versions of site maps that might list product categories, services, or anything else that appears on your Web site. You can always categorize your pages in an organized manner and display them as a kind of site map. A particular kind of spider path that is a bit different from a site map is a country map.
Figure 10-11 shows a similar technique used by Castrol. While the country map looks different from that of Iams, it is just as easy to follow for a spider, and equally effective for search marketing.
Regardless of what kind of Web site you have, spider paths are an invaluable way to get more pages from your site indexed. Whether you use country maps, site maps, or a related technique, you will provide the spiders with easy access to every page on your site, leaving an escape hatch to avoid those pesky spider traps. Next up, we look at one last way of getting more of your pages in the organic search index: paying your way in.
As discussed in Chapter 3, paid inclusion is a technique you can use to get your pages added to the search index, and to get the index updated rapidly every time the content on your pages changes. Not many years ago, almost every major search engine (except Google) had paid inclusion programs, but today only Yahoo! has one (among worldwide search engines). MSN Search withdrew its paid inclusion program in 2004, although some observers believe that MSN will eventually reinstate its program. Despite that trend, most experts believe that paid inclusion will grow. JupiterResearch, for example, projects the current $110 million market will surpass $500 million by 2008.
There are two related types of paid inclusion programs:
Single URL submission. It is fast and simple. You enter your URL into a form, and the search spider comes every two days and indexes the page. The spider will not follow any links on the page; if you want more pages to be crawled every two days, you must pay for them individually and submit them URL by URL. Because a spider is visiting your page, you might have work to do. Although paid inclusion can overcome some technical problems with free crawling, it is not a panacea. (Why is it that nothing is ever a panacea? Not sure why we even need the word in the dictionary.) Although you do not need to fix spider traps that affect links (such as pop-up windows and JavaScript navigation), you do need to make sure that your page has valid HTML and is optimized for search (as discussed in Chapter 12).
Trusted feeds. To handle large volumes, trusted feeds make more sense. Rather than asking the spider to crawl each URL, you can send your pages directly to the search engine. Although trusted feeds are very efficient for large Web sites, you have some technical tasks to perform to make it all work, which we explain later. Some specialty search and almost all shopping search engines require the use of trusted feeds to load your data into their search indexes. Trusted feeds can also be sent whenever the data changes (so you do not have to wait for the spider to come back), and they can prove a godsend for a site riddled with spider traps.
Yahoo! offers both kinds of paid inclusion programs, called Site Match and Site Match Xchange. Site Match is a single URL submission program, designed for submitting fewer than 1,000 URLs. Site Match Xchange handles more, allowing you to provide either a trusted feed or a single URL from which the spider can crawl all of your pages. (Remember, if you opt to provide a single URL, such as a site map, to be crawled by the spider, you must be sure that your site is free of the spider traps listed earlier in the chapter, whereas trusted feeds avoid these spider problems.)
Both single URL submission and trusted feed programs have similar cost structures, although the actual prices might differ. You should expect the following costs for both kinds of programs:
Annual fee. For each URL you submit, there is a yearly charge, usually discounted by volume.
Per-click fee. Each time that a searcher clicks a page you paid to be included, the search engine charges a fee. (Many shopping search engines and Yahoo! charge per-click fees.)
Per-action fee. Each time that a searcher purchases a product you included, the search engine charges a fee. (Only some shopping search engines charge this fee.) Search engines charge either a per-click or a per-action fee, never both for the same campaign.
Taking the Yahoo! program as an example, we see that Site Match (the single URL submission program) charges an annual fee based on the number of URLs submitted, as shown in Table 10-3. Site Match subscribers also pay a fixed cost per click for each searcher choosing their page (with no cost per action). Most content categories are charged at 15¢ per click, although selected categories are priced at 30¢ each.
| URLs | Annual Fee |
|---|---|
| 1 | $49 |
| 2-10 | $29 |
| 11-999 | $10 |
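To budget for Site Match, you can combine the annual fees with the per-click charges. The sketch below rests on an assumption the table does not state outright: that the fees are per URL and tiered, so the first URL costs $49, URLs 2 through 10 cost $29 each, and URLs 11 through 999 cost $10 each. The function names are hypothetical.

```python
def site_match_annual_fee(url_count: int) -> int:
    """Annual fee in dollars under the assumed per-URL tier interpretation."""
    if not 1 <= url_count <= 999:
        raise ValueError("Site Match is described as covering 1 to 999 URLs")
    first_url = 49
    next_nine = 29 * min(url_count - 1, 9)      # URLs 2 through 10
    remainder = 10 * max(url_count - 10, 0)     # URLs 11 through 999
    return first_url + next_nine + remainder


def estimated_yearly_cost(url_count: int, clicks_per_year: int,
                          cents_per_click: int = 15) -> float:
    """Annual fee plus per-click charges (15 or 30 cents, by content category)."""
    return site_match_annual_fee(url_count) + clicks_per_year * cents_per_click / 100
```

Under these assumptions, 10 URLs that draw 2,000 clicks a year at 15¢ per click would cost $310 in annual fees plus $300 in click charges.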
Turning to trusted feed programs, we see that Site Match Xchange is open to search marketers submitting more than 1,000 URLs or spending more than $5,000 per month. The per-click fee is the same as for Site Match, and there is no annual fee.
Paid inclusion can improve your organic search marketing in several ways if you have the budget to pay for it, including the following:
It indexes more of your site. If you have some intractable spider traps making it impossible to get your site crawled, and it costs too much to fix them, paid inclusion can help. If you have the budget to pay for it, you can get most pages indexed in Yahoo! that spiders cannot process, but your content must also be optimized, and it is time-consuming to create compelling titles and descriptions for your pages. In addition, you can use trusted feeds to get your products included in shopping search engines, but remember that you must automate the transmission of your data to the shopping engines. It will cost you some time and money up front to write the software to send your data to the engines every day.
It is cheaper than paid placement. Per-click charges for paid inclusion are usually substantially lower than for paid placement, but you must be careful that you are not paying for clicks that you could get for free. If you can get your pages crawled by Yahoo! without paying for inclusion, that is obviously better. With shopping search engines, you typically must pay for inclusion to be in their index; there is no free lunch here.
It responds quickly to changes. If your site is highly volatile (content changes rapidly as inventory and prices change), then paid inclusion allows you to add new pages to the index and delete old ones much more rapidly than you can waiting for the spider. You can control what products appear in the search index, even rotating your offers if desired, and you can do that for both Yahoo! and for shopping search engines. Also, because paid inclusion metrics are up-to-the-minute, you can respond to drops in your search ranking or lower clickthroughs as they occur. You can also find new keywords that your pages should be optimized for.
It lets you test changes to your site quickly. Paid inclusion's 48-hour turnaround lets you make frequent changes to your site to see how Yahoo! changes your ranking. You can test many different combinations of content and check your corresponding rankings and traffic. Rather than waiting for weeks to see how a few different changes worked, you can test three different combinations in one week. When you find the best version of the page, you will likely find it was the best one for Google and other organic search results, too.
Signing up with Yahoo! for Site Match (the single URL submission program) is very simple, but implementing trusted feeds for Yahoo! and for shopping search engines takes quite a bit more work.
Site Match submission requires just one step: filling out the submission form shown in Figure 10-12. All you need to do is enter the URLs for each page that you want included, along with the subject category of your site; the category chosen determines whether you are charged 15¢ or 30¢ per click.
Reproduced with permission of Yahoo! Inc. © 2005 by Yahoo! Inc. YAHOO! and the YAHOO! logo are trademarks of Yahoo! Inc.
What data you must put in your feed depends on the search engine you are sending it to, because each engine has different data requirements. For example, Yahoo! requires the title, description, URL, and other text from the page. Shopping search engines typically expect the price, availability, and features of your products, in addition to the product's name and description. Most data feeds include some or all of these items:
Page URL. The actual URL for the Web page for this search result. It can be a static or a dynamic URL, but it must be working (no "page not found" messages) or the search engine will delete the page from its index.
Tracking URL. The URL that the searcher should go to when clicking this result. It can be the same as the page URL, but sometimes your Web metrics software needs a different URL to help measure clicks from search to your page.
Product name. All variants of your product's name, including acronyms and its full name. Pay special attention to what searchers might type in to find your product name and include them here.
Product description. A lengthy description of your product that should include multiple occurrences of the keywords you expect searchers to enter. (See Chapter 12 for more information on how to optimize your content for search.) Every search engine is different, but most allow 250 words for your product description.
Model number. The number you expect most searchers to enter to find this product. If a retailer and a manufacturer have different numbers for the same product, you can sometimes include both, depending on the search engine you are submitting the feed to.
Manufacturer. The complete name of the manufacturer of the product, with any short names or acronyms that searchers might use.
Product category. The type of product, according to a valid list of products maintained by the search engine. Each search engine has somewhat different product lines that they support, with different names. You need to use the exact name for your product's category that each search engine uses.
Price. A critical piece of information for shopping search engine feeds, which typically require tax and shipping costs, too. Be sure that your prices are accurate each time you submit your feed, because price is one of the main ways that shopping searchers find your product.
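The items listed above map naturally onto a record type that a feed-creation program can serialize. This is a minimal sketch of a generic XML feed: the element names are simply the field names chosen here for illustration, and they do not represent the format that any particular search engine requires.

```python
from dataclasses import dataclass, asdict
import xml.etree.ElementTree as ET


@dataclass
class FeedItem:
    """One product entry; field names are illustrative, not an engine's schema."""
    page_url: str
    tracking_url: str
    product_name: str
    product_description: str
    model_number: str
    manufacturer: str
    product_category: str
    price: str


def feed_to_xml(items: list) -> str:
    """Serialize feed items to a simple XML document, one <item> per product."""
    root = ET.Element("feed")
    for item in items:
        node = ET.SubElement(root, "item")
        for field_name, value in asdict(item).items():
            ET.SubElement(node, field_name).text = value
    return ET.tostring(root, encoding="unicode")
```

Keeping the record separate from the serializer means that supporting another engine's format requires only a new output function, not a change to how the data is gathered.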
Inktomi, now owned by Yahoo!, pioneered the concept of feeding large amounts of data from commerce Web sites directly into its search index. Inktomi defined a custom XML format for supplying documents named IDIF (Inktomi Document Interchange Format), which is still used by Yahoo! today and is depicted in Figure 10-13.
It's not all that complicated to get started with paid inclusion, especially if you start with a single URL submission program, but most medium-to-large sites probably need to use trusted feed programs. You also need to use trusted feeds to send your data to shopping search engines, because they cannot be fed any other way. Trusted feeds are a bit more complex to set up, as you have seen, so you want to make sure you get the most out of them and that you avoid any pitfalls along the way. Here are some tips to make your paid inclusion program a success:
Avoid off-limits subjects. Most search engines have off-limits subjects that they refuse to be associated with. Because of local laws in various countries, some search engines reject "adult" (pornographic) content, sites with controversial themes, gambling sites, and pharmacy and drug information. All content submitted undergoes initial and ongoing quality reviews. If pages are rejected, some search engines, including Yahoo!, do not refund your fees.
Take advantage of "on-the-fly" optimization. The better your source data and the better your feed-creation program, the better your feed can be. One advantage of trusted feeds over crawling is on-the-fly optimization: having your feed-creation program add relevant words to the feed that were not in your original source, such as keywords in your titles and descriptions. For example, Snap Electronics discovered that all of its product pages contain the words Snap and SnapShot, the model number, and a picture, but they do not all contain the words digital camera. Snap made sure that the program that produces its trusted feed optimized its data on-the-fly, by adding the generic product keywords digital camera to the titles and descriptions of the trusted feed. That way the search engine has that information even though Snap forgot to optimize the original page to contain digital camera.
Do not add keywords unrelated to your content. 
Make your feed-creation program flexible. Even if you start using trusted feeds for just a single search engine, be prepared to expand to work with others in the future. Each engine uses a slightly different format, so make sure your programmer is prepared to change the program to create feeds for other search engines when they are needed.
Seek out feed specialists if needed. If your programmers cannot do the job, several vendors will be happy to step in and do it for you, including Position Tech (www.positiontech.com), MarketLeap (www.marketleap.com), Global Strategies International (www.globalstrategies.com), and Business Research (www.bizresearch.com). Each of these companies is "certified" by Yahoo! and some shopping search engines to produce trusted feeds. Companies that are not certified typically have to partner with a certified company, so you are better off working with a certified company directly.
Stay on top of daily operations. In addition to the work of creating the program, you must ensure that your operations personnel run the program to send the data whenever it changes, or else the search engine will not have the most up-to-date information for your site. Don't do all the expensive upfront work and then fall down by not operating reliably.
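The on-the-fly optimization tip above comes down to a small transformation inside the feed-creation program. The following is a minimal sketch with a hypothetical function name. It appends only the keywords that are missing, and it is up to you to supply keywords that truly describe the product.

```python
def add_missing_keywords(text: str, keywords: list) -> str:
    """Append any keyword phrase that does not already appear in the text."""
    lowered = text.lower()
    missing = [keyword for keyword in keywords if keyword.lower() not in lowered]
    if not missing:
        return text
    return text + " " + " ".join(missing)
```

Applied to a title such as "Snap SnapShot DLR200" with the keyword list `["digital camera"]`, the function returns "Snap SnapShot DLR200 digital camera". A title that already contains the phrase is left unchanged.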
Paid inclusion, especially trusted feeds, can require some work upfront, but they can pay off handsomely when executed properly. If your site would benefit from sales from shopping search engines, or you need to boost your pages indexed in Yahoo!, paid inclusion could be the extra organic lottery ticket it makes sense to buy.