Crawl Errors Confusing Me

mjtaylor

The SEOMoz crawl tool is telling me that I have a slew of crawl errors on the blog of one domain. All are related to the MSNbot. And related to trackbacks (which we do want to block, right?) and attachments (makes sense to block those, too) ... any idea why these are crawl issues with MSNbot and not Google? My robots.txt is here: http://www.wevegotthekeys.com/robots.txt.

Thanks, MJ

Cyrus-Shepard

I'm a little late to the party, but I want to summarize what I see as the answer.

1. The "Search Engine Blocked by Robots.txt" is only a warning, and not an error. If you intend for these pages not to get crawled (and it does seem like you have a good reason for this), then there is nothing to worry about.

2. The warning appears for MSNbot and not Google because your robots.txt currently allows Google to crawl those files. As Daniel pointed out, to block Google as well you would need to give it the identical Disallow directives in your robots.txt file. Does that make sense? Or you could put all of these rules under the User-agent: * group so they apply to all robots.
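To make the per-bot grouping concrete, here is a rough Python sketch of how rules attach to the User-agent line above them. It is an illustration only, not how any particular crawler is implemented: the function names are made up for this example, the sample text is a trimmed-down stand-in for the real file, and bot names are matched exactly, which is simpler than what real crawlers do.

```python
def rules_by_agent(robots_txt):
    """Group (directive, value) pairs under each User-agent line."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        else:
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
    return groups


def disallowed_patterns(groups, bot):
    """Return Disallow patterns from the bot's own group, else the * group."""
    rules = groups.get(bot.lower(), groups.get("*", []))
    return [value for field, value in rules if field == "disallow" and value]


# Hypothetical, shortened sample in the shape of the file discussed here.
SAMPLE = """\
User-agent: Googlebot
Noindex: /key-west-blog/*trackback

User-agent: Msnbot
Disallow: /key-west-blog/*trackback
"""

groups = rules_by_agent(SAMPLE)
print(disallowed_patterns(groups, "Msnbot"))     # ['/key-west-blog/*trackback']
print(disallowed_patterns(groups, "Googlebot"))  # []
```

In this sketch the Msnbot group has a Disallow rule and the Googlebot group has none, which mirrors why the warning shows up for one bot and not the other.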

mjtaylor

Yes, I thought that's what you meant ... thanks!

DanDeceuster

I am saying this:

User-agent: Googlebot
Noindex: /key-west-blog/*?*
Noindex: /key-west-blog/*.rss
Noindex: /key-west-blog/*feed
Noindex: /key-west-blog/*trackback
Noindex: /key-west-blog/*wp-
Noindex: /key-west-blog/tag/
Noindex: /key-west-blog/search/
Noindex: /key-west-blog/archives/
Noindex: /key-west-blog/category/
Noindex: /key-west-blog/2009
Noindex: /key-west-blog/2010

and this:

User-agent: Googlebot-Mobile
Noindex: /key-west-blog/?
Noindex: /key-west-blog/*.rss
Noindex: /key-west-blog/*feed
Noindex: /key-west-blog/*trackback
Noindex: /key-west-blog/*wp-
Noindex: /key-west-blog/tag/
Noindex: /key-west-blog/search/
Noindex: /key-west-blog/archives/
Noindex: /key-west-blog/category/
Noindex: /key-west-blog/2009
Noindex: /key-west-blog/2010


Both groups use Noindex, which is a syntax I am unfamiliar with in robots.txt. You can check out http://www.robotstxt.org/robotstxt.html for more info on robots.txt and proper syntax. I would change Noindex: to Disallow: and that should fix the error in the robots.txt file.
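If it helps, a quick way to spot lines like these is to scan the file for directive names outside a short known list. The Python below is only a sketch: the function name and the KNOWN_DIRECTIVES set are assumptions for this example (Allow, Crawl-delay and Sitemap are common extensions that not every crawler honors), and the sample text is a shortened stand-in for the real file.

```python
# Assumed list of directive names to accept; adjust to taste.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap"}


def unknown_directives(robots_txt):
    """Return (line_number, directive) for each directive not in the known list."""
    findings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line or ":" not in line:
            continue
        field = line.split(":", 1)[0].strip()
        if field.lower() not in KNOWN_DIRECTIVES:
            findings.append((number, field))
    return findings


# Hypothetical, shortened sample in the shape of the Googlebot group above.
GOOGLEBOT_SAMPLE = """\
User-agent: Googlebot
Noindex: /key-west-blog/tag/
Disallow: /key-west-blog/search/
"""

print(unknown_directives(GOOGLEBOT_SAMPLE))  # [(2, 'Noindex')]
```

Any line it reports is a candidate for changing to Disallow: or removing.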

mjtaylor

The robots.txt file DOES contain

User-agent: Msnbot
Crawl-delay: 120
Disallow: /key-west-blog/*?*
Disallow: /key-west-blog/*.rss
Disallow: /key-west-blog/*feed
Disallow: /key-west-blog/*trackback
Disallow: /key-west-blog/*wp-
Disallow: /key-west-blog/*login.php
Disallow: /key-west-blog/tag/
Disallow: /key-west-blog/search/
Disallow: /key-west-blog/archives/
Disallow: /key-west-blog/category/
Disallow: /key-west-blog/2009
Disallow: /key-west-blog/2010

But you are saying I should remove the lines with noindex?

DanDeceuster

In your robots.txt file, you have the Disallow: directive under MSNbot and Noindex: under Googlebot. Noindex is not a standard robots.txt directive. Change Noindex: to Disallow: and those pages will be blocked for Googlebot as well as MSNbot. Not sure if that is what is causing the issue, but it would explain the discrepancy. If you want to noindex a page, you do it with a meta tag like this:

<meta name="robots" content="noindex, follow">

You can change follow to nofollow if you want; it really doesn't matter much.

ENSO

I have the same problem. It looks like MSNbot is disallowed from accessing WordPress content. My pages show up as ?page=111, so from what I understand so far, anything matching the rule below is blocked from MSNbot. I don't have a definite answer for you as to what to do, but I can tell you that you will need to allow MSNbot the same access that Googlebot has.

Disallow: /key-west-blog/*?*
