Putting GoogleBot on a leash

As I noted earlier, I get a lot of hits from the Google web indexer GoogleBot (252 visits last month); in fact, web crawlers of various kinds are the most frequent visitors to my site. Most webwranglers know about the robots.txt file and how to use it to control what robots do when they visit a site, but it is a bit of a blunt instrument, as it can only exclude URLs by path prefix, which in practice means entire subtrees.
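
To make the robots.txt behaviour concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the /trackback/ and /drafts/ paths and the example.com host are all invented for illustration; they are not taken from my actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: each Disallow line shuts off
# every URL whose path starts with the given prefix.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trackback/
Disallow: /drafts/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Everything under /trackback/ is excluded as a block...
blocked = parser.can_fetch("Googlebot", "http://example.com/trackback/42")
# ...while a page outside the listed prefixes is still allowed.
allowed = parser.can_fetch("Googlebot", "http://example.com/index.html")

print(blocked, allowed)
```

Note that the rule says nothing about what the robot should do with a page it is allowed to fetch, which is the gap the <meta> tag fills.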

There is a more fine-grained way of controlling how Google and other robots index your site: a <meta> tag that directs the robot. This is mentioned on the GoogleBot page linked to above, and the official specification is available here. The basic principle is very simple: you need to add a line of the form

<meta name="robots" content="noindex,follow" />

in the <head> section of your HTML documents. The content attribute takes just four possible combinations of values:

  • content="index,follow"
    Index the page itself, and follow all links from the page.
  • content="noindex,follow"
    Don't index the page itself, but follow all links from the page.
  • content="index,nofollow"
    Index the page itself, but don't follow any links from the page.
  • content="noindex,nofollow"
    Don't index the page itself, and don't follow any links from the page.
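
As a rough illustration of how a robot might read these values, the following sketch uses Python's standard html.parser module to pull the directive out of a page. The class and function names are my own invention, and the fallback to index,follow when no tag is present is an assumption about typical robot behaviour, not a description of how GoogleBot is actually implemented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the index/follow flags from a <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        # Assumed default when no robots <meta> tag is found.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in (attr.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "index":
                self.index = True
            elif token == "noindex":
                self.index = False
            elif token == "follow":
                self.follow = True
            elif token == "nofollow":
                self.follow = False


def robots_policy(html: str) -> tuple:
    """Return (index, follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
print(robots_policy(page))  # (False, True)
```

A robot following this logic would skip the page itself but still queue up the links it contains.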

Not all robots take notice of this directive, but Google certainly does, and you can use it to stop Google indexing rapidly changing and low-content pages such as your main index page and your TrackBack entries.
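
If your pages come out of a template, choosing the tag per page type might look something like the sketch below. The page kind names and the decision to keep follow switched on everywhere are assumptions for the sake of the example, not a recommendation from Google.

```python
def robots_meta_tag(index: bool, follow: bool) -> str:
    """Build the <meta> line for the <head> section."""
    content = "{},{}".format(
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    )
    return '<meta name="robots" content="{}" />'.format(content)


# Hypothetical page kinds that change rapidly or carry little content.
LOW_VALUE_KINDS = {"main_index", "trackback"}


def tag_for_page(kind: str) -> str:
    """Keep low-value pages out of the index but let robots follow their links."""
    return robots_meta_tag(index=kind not in LOW_VALUE_KINDS, follow=True)


print(tag_for_page("trackback"))
print(tag_for_page("entry"))
```

With this arrangement the individual entries are still discovered through the links on the main index page, even though that page itself stays out of the index.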

Categories : Web, Tech