Spider Indexed Disallowed URLs

ooseoo

Hi there,

In order to reduce the huge amount of duplicate content and titles for a cliënt, we have disallowed all spiders for some areas of the site in August via the robots.txt-file. This was followed by a huge decrease in errors in our SEOmoz crawl report, which, of course, made us satisfied.

In the meanwhile, we haven't changed anything in the back-end, robots.txt-file, FTP, website or anything. But our crawl report came in this November and all of a sudden all the errors where back. We've checked the errors and noticed URLs that are definitly disallowed. The disallowment of these URLs is also verified by our Google Webmaster Tools, other robots.txt-checkers and when we search for a disallowed URL in Google, it says that it's blocked for spiders. Where did these errors came from? Was it the SEOmoz spider that broke our disallowment or something? You can see the drop and the increase in errors in the attached image.

Thanks in advance.

[](<a href=)" target="_blank">a> [](<a href=)" target="_blank">a> LAAFj.jpg

ooseoo

This was what I was looking for! The pages are indexed by Google, yes, but they aren't being crawled by the Googlebot (as my Webmaster Tool and the Matt Cutts Video is telling me), but they are occasionally being crawled by the Rogerbot probably (not monthly). Thank you very much!

ooseoo

Yes yes, canonicalization or meta noindex-tag would be better of course to pass the possible link juice, but we aren't worried about that. I was worried Google would still see the pages as duplicates. (couldn't really distile that out of the article, although it was useful!) Barry Smith answered that last issue in the answer below, but i do want to thank you for your insight.

StalkerB

The directives issued in a robots.txt file are just a suggestion to bots. One that Google does follow though.

Malicious bots will ignore them and occasionally even bots that follow the directives may mess up (probably what's happened here).

Google may also index pages that you've blocked as they've found them via a link as explained here - http://www.youtube.com/watch?v=KBdEwpRQRD0 - or for an overview of what Google does with robots.txt files you can read here - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

I'd suggest you look at other ways of fixing the problem than just blocking 1500 pages but I see you've considered what would be required to fix the issues without removing the pages from a crawl and decided the value isn't there.

If WMT is telling you the pages are blocked from being crawled I'd believe that.

Try searching for a url that should be blocked in Google and see if it's indexed or do site:http://yoursitehere.com and see if blocked pages come up.

josh-riley

The assumptions of what to expect from using robots.txt may not be in line with the realities. Crawling a page isn't the same thing as indexing the content to appear in SERPs and even with robots, your pages can be crawled.

http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions

ooseoo

Thanks mister Goyal. Of course we have been thinking about ways and figured out some options in doing so, but implementing these solutions would be disastreous from a time/financial perspective. The pages that we have blocked from the spiders aren't needed for visibility in the search engines and don't carry much link juice, they are only there for the visitors, so we decided we don't really need them for our SEO-efforts in a positive way. But when these pages do get crawled and the engines notice the huge amount of duplicates, i recogn this would have a negative influence on our site as a whole.

So, the problem we have is focused on the doubts we have on the legitimacy of the report. If SEOMoz can crawl it, the Googlebot could probably too, right, since we've used: User-agent: *

NakulGoyal

Mark

Are you blocking all your bots to spider these erroneous URLs ? Is there a way for you to fix these such that either they don't exist or they are not duplicate anymore.

I'd just recommend looking from that perspective as well. Not just the intent of making those errors disappear from the SEOMoz report.

I hope this helps.

Welcome to the Q&A Forum

Browse the forum for helpful insights and fresh discussions about all things SEO.

Spider Indexed Disallowed URLs

Browse Questions

Explore more categories

Related Questions

Google selecting incorrect URL as canonical: 'Duplicate, submitted URL not selected as canonical'

Canonical sitemap URL different to website URL architecture

Index subpages but not homepage

Having Problems to Index all URLs on Sitemap

Magento URL change

Should me URLs be uppercase or lowercase

URL Rewrite

HTML and no index, follow