Chapter 82. Blocking Parts of Your Site from Search Engines
It may seem counterintuitive to want to restrict search engines from any part of your site. You probably take great pains to get your site listed on as many search engines as possible. But follow the logic here: Maybe you want to control exactly how a visitor finds your site. You would rather new visitors come in by the front page, say, instead of some page three levels deep. Or maybe you don't want visitors to come in on a page that is supposed to be a popup window, where you may not offer the full range of navigation choices. The more you think about it, the more restricting certain areas from search engines makes good sense.
TIP: Some Web designers who prefer cheap solutions instead of good ones think that the methods in this topic give them a sure-fire way to secure sensitive information on their sites. These Web designers would do well to skip this topic entirely. The best solution for security is not to keep sensitive information on your Web server, period. If you can't avoid it, you need to research and implement actual security and authorization protocols, such as password-protected directories.
GEEKSPEAK: A robot is a special piece of software that catalogs, or spiders, your site for a search engine.
To restrict robots, you create a plain text file called robots.txt. Each record in this file begins with a User-agent line that names a robot (an asterisk stands for all robots), followed by one or more Disallow lines that list the paths the robot should stay out of. For example, the following tells all robots to keep away from the popups directory, the popup file, and the images, js, and css directories:
User-agent: *
Disallow: /popups/
Disallow: /popup
Disallow: /images/
Disallow: /js/
Disallow: /css/
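Before uploading a robots.txt file, you can sanity-check it with Python's standard urllib.robotparser module, which parses the rules and answers yes/no fetch questions. A quick sketch, using the rules above (the tested paths are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# The example rules above, pasted in as a string.
rules = """\
User-agent: *
Disallow: /popups/
Disallow: /popup
Disallow: /images/
Disallow: /js/
Disallow: /css/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallow values are root-relative path prefixes, so these are blocked...
print(rp.can_fetch("*", "/images/logo.gif"))  # False
print(rp.can_fetch("*", "/popup.html"))       # False
# ...but the home page remains fair game.
print(rp.can_fetch("*", "/index.html"))       # True
```

Because Disallow values are matched as path prefixes, the /popup rule catches /popup.html as well as /popups/.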
TIP: Make sure you use a text editor to create your site's robots.txt file, and save the result with the extension .txt. Don't create an HTML file and then change the extension to .txt.
The values in the Disallow lines are root-relative paths, by the way, so if you want to hide a subfolder but not its top-level folder, make sure you give the entire path to the subfolder (Disallow: /images/subfolder/, for example). You can also write a separate record for each robot you want to restrict. In the following scenario, Google's googlebot can't look at the popups directory or the popup file, while Roverdog can't get at the images, js, or css directories in addition to the popups folder and the popup file:
User-agent: googlebot
Disallow: /popups/
Disallow: /popup
User-agent: Roverdog
Disallow: /popups/
Disallow: /popup
Disallow: /images/
Disallow: /js/
Disallow: /css/
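To see how each robot picks out the record that names it, you can run this two-record file through Python's standard urllib.robotparser module as well (a sketch; the tested paths are invented):

```python
from urllib.robotparser import RobotFileParser

# The googlebot/Roverdog rules above, pasted in as a string.
rules = """\
User-agent: googlebot
Disallow: /popups/
Disallow: /popup
User-agent: Roverdog
Disallow: /popups/
Disallow: /popup
Disallow: /images/
Disallow: /js/
Disallow: /css/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Each robot obeys only the record that names it.
print(rp.can_fetch("googlebot", "/images/logo.gif"))  # True: only Roverdog is barred from /images/
print(rp.can_fetch("Roverdog", "/images/logo.gif"))   # False
print(rp.can_fetch("Roverdog", "/popups/ad.html"))    # False
# A robot that no record names is unrestricted, since this file has no * record.
print(rp.can_fetch("SomeOtherBot", "/images/logo.gif"))  # True
```

The last line is worth noting: rules for googlebot and Roverdog say nothing about other robots, so anything you want hidden from all comers needs a User-agent: * record too.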
If you want to hide absolutely everything (in this case, from all robots), use:
User-agent: *
Disallow: /
TIP: The asterisk character in robots.txt is not a wildcard. For example, you can't disallow *.gif to bar search engines from all GIF image files; for that, you have to put all your GIFs in a folder and then disallow that folder. The asterisk only works in the User-agent line, and only then as shorthand for all robots.
If you want to bar one particular robot (in this case, googlebot) from the entire site, give it a record that disallows the root:
User-agent: googlebot
Disallow: /
And if you want to make everything on your site available to all robots, use an empty Disallow line, which puts nothing off-limits:
User-agent: *
Disallow:
TIP: For more information about robots.txt, and to look up the names of the various robots out there, see www.robotstxt.org/.
To let a single robot (here, googlebot) in while barring all the others, give it its own record with an empty Disallow line:
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /
Now go back to the example at the beginning of this topic, where you want to force new visitors to come in through the front page. Say your site has five top-level directories: products, services, aboutus, images, and apps, along with an HTML file called contact. Your robots.txt file looks like this:
User-agent: *
Disallow: /products/
Disallow: /services/
Disallow: /aboutus/
Disallow: /images/
Disallow: /apps/
Disallow: /contact
Put this file in the top-level directory of your remote site, and search engines will only index your home page (index).
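To confirm that this file really does funnel robots toward the front page, run it through Python's standard urllib.robotparser module one more time (a sketch; the tested paths are invented):

```python
from urllib.robotparser import RobotFileParser

# The front-page-only rules above, pasted in as a string.
rules = """\
User-agent: *
Disallow: /products/
Disallow: /services/
Disallow: /aboutus/
Disallow: /images/
Disallow: /apps/
Disallow: /contact
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Every top-level section is off-limits...
print(rp.can_fetch("*", "/products/widget.html"))  # False
print(rp.can_fetch("*", "/contact.html"))          # False
# ...but the home page stays indexable.
print(rp.can_fetch("*", "/"))                      # True
```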