While none of what I am about to discuss is groundbreaking or new, I have found that many sites, including those optimised by SEO specialists, make the common mistake of using the robots.txt file inappropriately to block search engine access to content.
Both the robots.txt file and the meta robots tag can be used to block search engines from accessing content on a site, but which approach is better for SEO?
The robots.txt file offers a convenient and easy approach to block content from web crawlers but does not allow for the flow of link equity from the blocked content to other pages. On the other hand, the meta robots tag offers more flexibility and allows for the flow of link equity to other pages but can be harder to implement.
Let's take a brief look at each approach before comparing the two.
The Robots.txt File
The Robots Exclusion Protocol (REP) was first defined in 1994 and extended in 1997.
The robots.txt file is a simple text file placed in the root directory of a site and includes instructions for web crawlers on which files or sections of the site NOT to crawl (read). A well-known example of a web crawler, or robot, is Googlebot.
Below is an example of commands you might find in a common robots.txt file. You can also specify a sitemap URL within the robots.txt file, which can be useful if your sitemap is not located within the root directory of the site or has an obscure name.
User-agent: *
Disallow: /javascript/
Disallow: /style/
Disallow: /private/
Disallow: /private-file.html
Sitemap: http://www.zansule.com/sitemap.xml
The Robots Meta Tag
The meta robots tag is supported by most search engines and can be used to instruct robots not to index or crawl a page. The meta robots tag supports other commands such as noarchive, nosnippet, noodp and noydir, but for the purpose of this discussion we will focus on four commands: noindex, index, follow and nofollow.
noindex:
The “noindex” command instructs web crawlers not to index the page and not to display the page in the search engine results.
nofollow:
The “nofollow” command instructs web crawlers not to follow the links embedded within the page.
index:
The “index” command instructs web crawlers to index the page and to include the page within the search engine results page.
follow:
The “follow” command instructs web crawlers to crawl the page and follow links within the page.
Below are examples of how the meta robots tag can be used.
<meta name="robots" content="noindex, nofollow" />
<meta name="robots" content="noindex, follow" />
<meta name="robots" content="index, follow" />
<meta name="robots" content="index, nofollow" />
Robots.txt vs Meta Robots
The convention adopted by most search engines in regard to the robots.txt “disallow” command, as noted in this video by Matt Cutts, is not to crawl the content of the page. In some cases, however, search engines might still include the page within their search results based on links to the page from external sources. This, of course, is for many not the anticipated result of using the disallow command.
On the other hand, the convention adopted by most search engines for the meta tag “noindex” command is not to index the page at all, and the page is excluded from the search engine results page regardless of the number of links pointing to the page.
When the “noindex” command is used in conjunction with the “follow” command, however, the page is crawled and the links embedded within the page are followed, but the page itself is still excluded from the search engine results page. This “noindex, follow” approach is best for SEO: not only does it ensure the exclusion of the page from the search engines as required, it also allows for the flow of link equity, including PageRank, to the other pages linked to from the page.
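As a minimal sketch (the page title is purely illustrative), a page whose head section contains the tag below would be kept out of the search results while still passing link equity to the pages it links to:
<head>
<title>Example page</title>
<meta name="robots" content="noindex, follow" />
</head>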
Another, less obvious, downside of using the robots.txt file is that it provides a central location for anyone to discover which pages on the site are not intended to be indexed by the search engines.
Merits of Robots.txt
The robots.txt file is not completely without merit. It provides an easy way to block whole sections of a site using a single command, as shown below.
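For example, the following blocks the entire /admin/ section for all crawlers:
User-agent: *
Disallow: /admin/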
The robots.txt disallow command can also be used to quickly block access to a site which is under development and for which the flow of PageRank is not important, as shown in the example below.
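A single disallow rule blocks the entire site from all compliant crawlers:
User-agent: *
Disallow: /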
The robots.txt file can also be used to block access for specific robots, such as Googlebot (see the example below), and, as we saw previously, to specify a sitemap URL for the search engines.
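For example, the following blocks Googlebot alone while leaving other crawlers unaffected:
User-agent: Googlebot
Disallow: /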
It is important to note that many web crawlers do not respect the robots exclusion commands, and different robots might interpret these commands differently. The robots exclusion commands also do not block direct access to the content by anyone.
Finally, you might find this video, in which Matt Cutts talks about how Google handles the robots.txt command, useful. This behaviour is similar across all the major search engines.
7 April 2011
Topics: News and Insights