When primitive robots were first created, some of them would crash servers by
requesting too many pages too quickly. The robots exclusion standard was crafted
to allow you to tell any robot (or all of them) that you do not want some of your
pages indexed or that you do not want your links followed. You can do this either
via a meta tag placed in the head section of a page or by creating a robots.txt
file that gets placed in the root of your website. The goal of either method is
to tell robots where NOT to go.
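For example, the following meta tag, placed in the head section of a page, tells all robots not to index that page and not to follow any of its links:

    <meta name="robots" content="noindex, nofollow">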
The official robots exclusion protocol is documented at www.robotstxt.org.
You do not need to use a robots.txt file; by default, search engines will index your
site. The robots.txt file goes in the root level of your domain, using robots.txt as the
file name.
This allows all robots to index everything:
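    User-agent: *
    Disallow: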
This blocks all robots from your entire site:
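    User-agent: *
    Disallow: /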
You also can disallow a folder or a single file in the robots.txt file. This disallows a folder:
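    User-agent: *
    # /folder/ is a placeholder for whatever directory you want to block
    Disallow: /folder/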
This disallows a file:
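    User-agent: *
    # /folder/file.html is a placeholder for whatever file you want to block
    Disallow: /folder/file.html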
One problem many dynamic sites have is sending search engines multiple URLs
with nearly identical content. If you have products in different sizes and colors, or
with other small differences, you can easily generate lots of near-duplicate
content, which can prevent search engines from fully indexing your site.
If you place your variables at the start of your URLs, then you can easily block all
of the sorting options using only a few disallow lines. For example, the following
would block search engines from indexing any URLs that start with ‘cart.php?size’ or ‘cart.php?color’.
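    User-agent: *
    Disallow: /cart.php?size
    Disallow: /cart.php?color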
Notice how there is no trailing slash at the end of the above disallow lines. That
means the engines will not index any URL that starts with either of those strings. If
there were a trailing slash, search engines would only block a specific folder.
If the sort options were at the end of the URL, you would either need to create an
exceptionally long robots.txt file or place robots noindex meta tags inside the
sort pages. You also can address a specific user agent, such as Googlebot, instead
of using the asterisk wildcard. Many bad bots will ignore your robots.txt file
and/or harvest the blocked information, so you do not want to rely on robots.txt to
block individuals from finding confidential information.
Googlebot also supports wildcards in robots.txt. The following would stop
Googlebot from reading any URL that includes the string ‘sort=’ no matter where
that string occurs in the URL:
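    User-agent: Googlebot
    Disallow: /*sort=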
In 2006 Yahoo! also added robots.txt wildcard support.