Robots.txt and canonical tag

seoug_2005

The SEOmoz post at http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts says:

If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.

Does this mean that if a page is disallowed by robots.txt, spiders do not read its HTML at all?

seoug_2005

Thanks Ryan for explaining things very clearly.

RyanKent

What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.

seoug_2005

If spiders follow links to an article on my site, will they read the contents then? If the canonical tag is on the article page itself, will it be seen?

RyanKent

Daylan offered a great answer, but I would like to add one exception. When crawlers from the major search engines visit your site they will honor your robots.txt file, but sometimes they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and will index your page.

This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
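
To check that such a backup is actually present on a page, something like the following could be used. It is only a minimal sketch using Python's standard html.parser module; the class name IndexingHintParser and the sample markup are made up for illustration, and the sample contains both tags only so that both checks are exercised.

```python
from html.parser import HTMLParser


class IndexingHintParser(HTMLParser):
    """Collects the canonical URL and the robots noindex directive from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None   # href of <link rel="canonical">, if present
        self.noindex = False    # True if <meta name="robots"> contains "noindex"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.noindex = "noindex" in content


# Hypothetical page markup, used only for illustration.
SAMPLE_HTML = """
<html><head>
<link rel="canonical" href="http://www.example.com/article">
<meta name="robots" content="noindex, follow">
</head><body>Article text</body></html>
"""

hints = IndexingHintParser()
hints.feed(SAMPLE_HTML)
print(hints.canonical, hints.noindex)
```

A crawler can only act on these tags if it is allowed to download the page in the first place, which is why they work as a backup only on pages that are not disallowed in robots.txt.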

seoug_2005

Thanks Daylan for your quick response. I just wanted a second opinion that canonical tag will never be seen if a page is disallowed.

Daylan

That's correct in most cases.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks http://www.example.com/robots.txt and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
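
The check a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not how any particular search engine implements it: the rules are passed in as text so no network request is made, and "ExampleBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, supplied as text instead of being downloaded.
rules = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# "ExampleBot" is a hypothetical user agent; the "*" section applies to it.
url = "http://www.example.com/welcome.html"
allowed = parser.can_fetch("ExampleBot", url)
print(allowed)
```

Because can_fetch reports that the URL is off limits, a compliant robot never downloads the page, so it never gets to read the HTML or any canonical tag inside it.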

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.

More information is available here:

http://www.robotstxt.org/
