<?xml version="1.0"?><rss version="2.0">
<channel>
  <title>Alan&#039;s Ramblings - google tag</title>
  <link>http://bleaklow.com:80/tags/google/</link>
  <description>My opinions may be incorrect, but they are my own</description>
  <language>en</language>
  <copyright>Alan Burlison</copyright>
  <lastBuildDate>Wed, 29 Feb 2012 20:50:00 GMT</lastBuildDate>
  <generator>Pebble (http://pebble.sourceforge.net)</generator>
  <docs>http://backend.userland.com/rss</docs>
  <image>
    <url>http://bleaklow.com/images/misc/logo.gif</url>
    <title>Alan&#039;s Ramblings</title>
    <link>http://bleaklow.com:80/</link>
  </image>
  <item>
    <title>Putting GoogleBot on a leash</title>
    <link>http://bleaklow.com:80/2003/11/07/putting_googlebot_on_a_leash.html</link>
    <description>
          &lt;p&gt;
As I noted earlier, I get a lot of hits from the &lt;a href=&#034;http://www.google.com&#034;&gt;Google&lt;/a&gt; web indexer &lt;a href=&#034;http://www.google.com/bot.html&#034;&gt;GoogleBot&lt;/a&gt; (252 visits last month); in fact, various web crawlers are the most frequent visitors to my site.  Whilst most webwranglers know about the &lt;a href=&#034;http://www.robotstxt.org/wc/exclusion.html#robotstxt&#034;&gt;robots.txt&lt;/a&gt; file and how to use it to control the activities of robots when they visit your site, it is a bit of a blunt instrument: its rules work on URL path prefixes, so in practice it is used to fence off whole subtrees, and it can only tell a robot to stay away from a page, not how to treat a page it does visit.
&lt;/p&gt;&lt;p&gt;
There is a more fine-grained way of controlling how Google and other robots index your site: a &lt;code&gt;&amp;lt;meta&amp;gt;&lt;/code&gt; tag that directs the robot page by page.  This is mentioned on the GoogleBot page linked to above, and the official specification is available &lt;a href=&#034;http://www.robotstxt.org/wc/exclusion.html#meta&#034;&gt;here&lt;/a&gt;.  The basic principle is very simple: you need to add a line of the form
&lt;/p&gt;
&lt;pre&gt;
&amp;lt;meta name=&#034;robots&#034; content=&#034;noindex,follow&#034; /&amp;gt;
&lt;/pre&gt;
&lt;p&gt;
in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; section of your HTML documents.  The &lt;code&gt;content&lt;/code&gt; attribute has just four possible combinations:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;content=&#034;index,follow&#034;&lt;/code&gt;&lt;br /&gt;
Index the page itself, and follow all links from the page.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;content=&#034;noindex,follow&#034;&lt;/code&gt;&lt;br /&gt;
Don&#039;t index the page itself, but follow all links from the page.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;content=&#034;index,nofollow&#034;&lt;/code&gt;&lt;br /&gt;
Index the page itself, but don&#039;t follow any links from the page.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;content=&#034;noindex,nofollow&#034;&lt;/code&gt;&lt;br /&gt;
Don&#039;t index the page itself, and don&#039;t follow any links from the page.&lt;/li&gt;
&lt;/ul&gt;
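&lt;p&gt;
To show where the tag sits, here is a minimal sketch of a page that asks robots not to index it but still to follow its links.  The title and body text are placeholders made up for illustration, not taken from any real page.
&lt;/p&gt;
&lt;pre&gt;
&amp;lt;html&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;Example index page&amp;lt;/title&amp;gt;
    &amp;lt;!-- placeholder page: keep it out of the index, but let robots follow its links --&amp;gt;
    &amp;lt;meta name=&#034;robots&#034; content=&#034;noindex,follow&#034; /&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;body&amp;gt;
    &amp;lt;p&amp;gt;Links to the individual entries go here.&amp;lt;/p&amp;gt;
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/pre&gt;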
&lt;p&gt;
Not all robots take notice of this directive, but Google certainly does, so you can use it to stop GoogleBot indexing rapidly-changing and low-content pages such as your main index page and your TrackBack entries.
&lt;/p&gt;</description>
    <category>Web</category>
    <category>Tech</category>
    <comments>http://bleaklow.com:80/2003/11/07/putting_googlebot_on_a_leash.html#comments</comments>
    <guid isPermaLink="true">http://bleaklow.com:80/2003/11/07/putting_googlebot_on_a_leash.html</guid>
    <pubDate>Fri, 07 Nov 2003 02:09:56 GMT</pubDate>
  </item>
  </channel>
</rss>

