
Robots.txt File Setting

January 4, 2009

When primitive robots were first created, some of them would crash servers by requesting too many pages too quickly. The robots exclusion standard was crafted to let you tell any robot (or all of them) that you do not want certain pages indexed or certain links followed. You can do this via a robots meta tag in the page's HTML, or by creating a robots.txt file placed in the root of your website. The goal of either method is to tell robots where NOT to go.
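
For instance, a meta tag like the following in a page's head section tells compliant robots not to index that page or follow its links:

<meta name="robots" content="noindex, nofollow">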

    The official robots exclusion protocol document is located at the following URL.
    http://www.robotstxt.org/wc/exclusion.html

You do not need to use a robots.txt file; by default, search engines will index your site. The robots.txt file goes in the root level of your domain, using robots.txt as the file name.

    This allows all robots to index everything:
    User-agent: *
    Disallow:

This disallows all robots from your entire site:
    User-agent: *
    Disallow: /

You also can disallow a folder or a single file in the robots.txt file. This disallows a folder:
    User-agent: *
    Disallow: /projects/

    This disallows a file:
    User-agent: *
    Disallow: /cheese/please.html
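
If you want to sanity-check rules like these before deploying them, Python's standard urllib.robotparser module applies the same prefix-matching logic that compliant crawlers use. The sketch below is just an illustration, combining the two example rule sets above with made-up page paths:

from urllib.robotparser import RobotFileParser

# The folder and file rules from the examples above, parsed from a
# string so nothing needs to be fetched over the network.
rules = """\
User-agent: *
Disallow: /projects/
Disallow: /cheese/please.html
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/projects/secret.html"))  # False: inside the blocked folder
print(rp.can_fetch("*", "/cheese/please.html"))    # False: the blocked file
print(rp.can_fetch("*", "/cheese/other.html"))     # True: the rest of /cheese/ is allowed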

One problem many dynamic sites have is serving search engines multiple URLs with nearly identical content. If you have products in different sizes and colors, or with other small differences, it is likely that you could generate lots of near-duplicate content, which will prevent search engines from fully indexing your site.

    If you place your variables at the start of your URLs, then you can easily block all
    of the sorting options using only a few disallow lines. For example, the following
    would block search engines from indexing any URLs that start with ‘cart.php?size’ or ‘cart.php?color’.

    User-agent: *
    Disallow: /cart.php?size
    Disallow: /cart.php?color
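
The same urllib.robotparser check (again only a sketch, with made-up query values) confirms the prefix behavior: anything beginning with a disallowed prefix is blocked, while other cart URLs stay crawlable.

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cart.php?size
Disallow: /cart.php?color
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/cart.php?size=large"))  # False: starts with a blocked prefix
print(rp.can_fetch("*", "/cart.php?color=red"))   # False: starts with a blocked prefix
print(rp.can_fetch("*", "/cart.php?id=42"))       # True: no rule matches this prefix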

Notice that there is no trailing slash at the end of the above disallow lines. That means the engines will not index any URL that starts with those strings. If there were a trailing slash, search engines would block only that specific folder.

If the sort options were at the end of the URL, you would either need to create an exceptionally long robots.txt file or place the robots noindex meta tag inside the sort pages. You also can specify any particular user agent, such as Googlebot, instead of using the asterisk wildcard. Many bad bots will ignore your robots.txt file and/or harvest the blocked information, so you should not rely on robots.txt to keep individuals from finding confidential information.

Googlebot also supports wildcards in robots.txt. The following would stop Googlebot from reading any URL that includes the string ‘sort=’, no matter where that string occurs in the URL:
    User-agent: Googlebot
    Disallow: /*sort=
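
Standard library parsers generally do not implement this wildcard extension, but the matching behavior is easy to mimic. The regular-expression sketch below only illustrates the semantics described above; it is not Googlebot's actual implementation:

import re

# "Disallow: /*sort=" as a regex: the URL path must start with "/",
# and "sort=" may appear anywhere after that.
rule = re.compile(r"^/.*sort=")

for path in ("/cart.php?sort=price", "/list?page=2&sort=name", "/about.html"):
    print(path, "->", "blocked" if rule.match(path) else "allowed")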

    In 2006 Yahoo! also added robots.txt wildcard support.
