Spider Indexed Disallowed URLs
-
Hi there,
In order to reduce the huge amount of duplicate content and titles for a client, we disallowed all spiders for some areas of the site in August via the robots.txt file. This was followed by a huge decrease in errors in our SEOmoz crawl report, which, of course, made us satisfied.
In the meantime, we haven't changed anything in the back end, the robots.txt file, FTP, or the website itself. But our crawl report came in this November and all of a sudden all the errors were back. We've checked the errors and noticed URLs that are definitely disallowed. That these URLs are disallowed is also confirmed by Google Webmaster Tools and other robots.txt checkers, and when we search for a disallowed URL in Google, it says it's blocked for spiders. Where did these errors come from? Was it the SEOmoz spider that ignored our disallow rules, or something else? You can see the drop and the increase in errors in the attached image.
Thanks in advance.
LAAFj.jpg
-
This was what I was looking for! The pages are indexed by Google, yes, but they aren't being crawled by Googlebot (as Webmaster Tools and the Matt Cutts video tell me). They are probably just occasionally being crawled by Rogerbot (not every month). Thank you very much!
-
Yes, a canonical tag or a meta noindex tag would of course be better to pass on the possible link juice, but we aren't worried about that. I was worried Google would still see the pages as duplicates (I couldn't really distill that from the article, although it was useful!). Barry Smith answered that last issue in the answer below, but I do want to thank you for your insight.
-
The directives issued in a robots.txt file are just a suggestion to bots, albeit one that Google does follow.
Malicious bots will ignore them, and occasionally even bots that follow the directives may mess up (which is probably what happened here).
Google may also index pages that you've blocked if it found them via a link, as explained here - http://www.youtube.com/watch?v=KBdEwpRQRD0 - or, for an overview of what Google does with robots.txt files, you can read here - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
I'd suggest looking at other ways of fixing the problem than just blocking 1,500 pages, but I see you've considered what would be required to fix the issues without removing the pages from the crawl and decided the value isn't there.
If WMT is telling you the pages are blocked from being crawled, I'd believe it.
Try searching Google for a URL that should be blocked and see if it's indexed, or do site:http://yoursitehere.com and see if blocked pages come up.
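You can also double-check the rules locally. Python's standard library ships a robots.txt parser; here's a minimal sketch (the domain and paths are placeholders, not the asker's real URLs):

```python
# Verify locally whether a URL would be blocked for a given user agent.
# "yoursitehere.com" and "/blocked-area/" are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Feed the rules directly instead of fetching them over the network:
rp.parse([
    "User-agent: *",
    "Disallow: /blocked-area/",
])

print(rp.can_fetch("*", "http://yoursitehere.com/blocked-area/page.html"))  # False
print(rp.can_fetch("*", "http://yoursitehere.com/public/page.html"))        # True
```

If `can_fetch` returns False for a URL, any well-behaved crawler respecting `User-agent: *` should skip it.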
-
The assumptions about what to expect from using robots.txt may not be in line with reality. Crawling a page isn't the same thing as indexing its content to appear in the SERPs, and even with robots.txt in place, your pages can be crawled.
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
-
Thanks, Mr. Goyal. Of course we have been thinking about ways to do this and figured out some options, but implementing those solutions would be disastrous from a time/financial perspective. The pages that we have blocked from the spiders aren't needed for visibility in the search engines and don't carry much link juice; they are only there for the visitors, so we decided we don't really need them for our SEO efforts in a positive way. But if these pages do get crawled and the engines notice the huge amount of duplicates, I reckon this would have a negative influence on our site as a whole.
So, the problem we have is focused on the doubts we have about the legitimacy of the report. If the SEOmoz crawler can crawl these pages, Googlebot probably could too, right? After all, we've used: User-agent: *
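For reference, a blanket block of that kind looks something like this in robots.txt (the paths here are hypothetical, since the real ones aren't given in the thread):

```text
# Applies to all user agents, including Googlebot and Rogerbot
User-agent: *
Disallow: /search/
Disallow: /filters/
```

Since `User-agent: *` covers every compliant crawler, SEOmoz's Rogerbot should honor these rules the same way Googlebot does.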
-
Mark
Are you blocking all bots from spidering these erroneous URLs? Is there a way for you to fix them so that either they don't exist or they are no longer duplicates?
I'd recommend looking at it from that perspective as well, not just with the intent of making those errors disappear from the SEOmoz report.
I hope this helps.
Related Questions
-
URL Indexing with Keyword
Hi, my webpage URL is indexed in Google but doesn't show up when searching for the main keyword. How can I get it indexed for the keyword? It should show in the SERPs when the keyword is searched. Any suggestions?
Technical SEO | green.h1
-
Why my website does not index?
I made some changes on my website, and after that I tried the Webmaster Tools "Fetch as Google" feature, but this is the 2nd day and my new pages are not indexed: www.astrologersktantrik.com
Technical SEO | ramansaab0
-
Which URL would you choose?
1. www.company.com/subfolder/subfolder/keyword-keyword-product (I'm able to keyword-match with this URL) or 2. www.company.com/subfolder/subfolder/product (no URL keyword match). Which would you choose: a URL which is short but still relevant, or a URL which is more descriptive, allowing a keyword match? It'd be great to get your feedback, guys. Many thanks, Gary
Technical SEO | GaryVictory0
-
Pages not being indexed
Hi Moz community! We have a client some of whose pages are not ranking at all, although they do seem to be indexed by Google. They are in the real estate sector, and this is an example of one: http://www.myhome.ie/residential/brochure/102-iveagh-gardens-crumlin-dublin-12/2289087 In the example above, if you search for "102 iveagh gardens crumlin" on Google, that exact URL does not rank - a similar one does. And this page has been live for quite some time. Anyone got any thoughts on what might be at play here? Kind regards, Gavin
Technical SEO | IrishTimes0
-
Update index date
If I update the content of a page without changing the original URL and Google crawls my new page, will the index date (that appears in the SERP) change to the latest update? If so, how many changes do I need to make for it to count as an update? Thanks.
Technical SEO | fabrico230
-
Will rel canonical tags remove previously indexed URLs?
Hello, 7 days ago we implemented canonical tags to resolve duplicate content issues that had been caused by URL parameters. This duplicate content had already been indexed. Now that the URLs have rel canonical tags in place, will Google automatically remove the other URLs with the URL parameters from its index? I ask because we have been tracking the approximate number of URLs indexed by doing a site: search in Google, and we have barely noticed a decrease in URLs indexed. Thanks.
Technical SEO | yacpro130
-
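For readers unfamiliar with the mechanism: a canonical tag is a single line in the `<head>` of the duplicate (parameterized) page pointing at the preferred URL. A sketch, using a placeholder domain and parameter rather than the asker's real site:

```html
<!-- Placed in the <head> of the parameterized duplicate,
     e.g. http://www.example.com/product?sort=price -->
<link rel="canonical" href="http://www.example.com/product" />
```

Canonicals are a hint rather than a directive, and consolidation happens only as each duplicate URL is recrawled, which is why the `site:` count tends to shrink slowly.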
Changing url structure
We are an ecommerce site established in 2005, and we currently have some great rankings. We are about to move away from our existing platform, Actinic, and on to Magento. This will change all our URLs. What steps should we ask our web developers to follow in order to minimize the consequences of moving? Thank you.
Technical SEO | LadyApollo0
-
De-indexing thin content & Panda--any advantage to immediate de-indexing?
We added the noindex, follow tag to our site about a week ago on several hundred URLs, and they are still in Google's index. I know de-indexing takes time, but I am wondering if having those URLs in the index will continue to "pandalize" the site. Would it be better to use the URL removal request? Or should we just wait for the noindex tags to remove the URLs from the index?
Technical SEO | nicole.healthline0
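For reference, the noindex, follow tag mentioned above is a meta element in each page's `<head>`; it tells compliant engines to drop the page from the index while still following its links:

```html
<!-- In the <head> of each thin-content page -->
<meta name="robots" content="noindex, follow">
```

Note that this only takes effect once the page is recrawled, so the tag must not be combined with a robots.txt block on the same URLs, or the crawler will never see it.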