Pages getting into Google Index, blocked by Robots.txt??

bjs2010

Hi all,

So yesterday we set out to remove URLs that had got into the Google index but were not supposed to be there, due to faceted navigation... We searched for the URLs by using these queries in Google Search:
site:www.sekretza.com inurl:price=
site:www.sekretza.com inurl:artists=

So it brings up a list of "duplicate" pages, and they have the usual: "A description for this result is not available because of this site's robots.txt – learn more."

So we removed them all, and Google removed them all, every single one.

This morning I did a check, and I find that more are creeping in. If I take one of the suspected dupes to the robots.txt tester, Google tells me it's blocked, and yet it's appearing in their index??

I'm confused as to why a path that is blocked is able to get into the index?? I'm thinking of lifting the robots.txt block so that Google can see that these pages also have a meta NOINDEX,FOLLOW tag, but surely that will waste my crawl budget on unnecessary pages?

Any ideas?

thanks.

Devanur-Rafi

Oh, OK. If that's the case, please don't worry about those in the index. You can get them removed using the Remove URLs feature in your Webmaster Tools account.

bjs2010

It doesn't show any result for the "blocked page" when I do that in Google.

Devanur-Rafi

Hi,

Please try this and let us know the results:

Suppose this is one of the pages in discussion:

http://www.yourdomain.com/blocked-page.html

Go to Google and type the following, including the double quotes. Replace the placeholder with the actual page:

"yourdomain.com/blocked-page.html" -site:yourdomain.com

AndersS

Hi!

From what I could tell, there weren't that many pages already in the index, so it could be worth trying to lift the block, at least for a short while, to see if it has an impact.

In addition, how about configuring how Googlebot should treat your URLs via the URL Parameters tool in Google Webmaster Tools? Here's what Google has to say about this: https://support.google.com/webmasters/answer/1235687

Best regards,
Anders

AndersS

Hi Devanur.

What I'm guessing is the problem here is that, as of now, Googlebot is restricted from accessing the pages (because of robots.txt), so it never fetches them and never updates its index with the "noindex, follow" declaration that seems to be in place on the pages.

One other thing that could be considered is to add rel="nofollow" to all the faceted navigation links on the left.

Fully agreeing with you on the "crawl budget" part.

Anders
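
To make the point above concrete, here is a minimal Python sketch of why the noindex never gets seen. It assumes the directive lives in a meta robots tag in the page HTML; the names `meta_robots_directives` and `can_see_noindex` are made up for illustration, and this only models the logic, it is not how Googlebot is actually implemented.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def meta_robots_directives(html: str) -> set:
    """Return the set of lower-cased meta robots directives found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def can_see_noindex(blocked_by_robots_txt: bool, html: str) -> bool:
    """Hypothetical helper: a crawler that obeys robots.txt never fetches a
    blocked page, so it cannot read a noindex placed inside that page."""
    if blocked_by_robots_txt:
        return False
    return "noindex" in meta_robots_directives(html)


sample = '<html><head><meta name="robots" content="NOINDEX,FOLLOW"></head></html>'
print(can_see_noindex(blocked_by_robots_txt=True, html=sample))   # False
print(can_see_noindex(blocked_by_robots_txt=False, html=sample))  # True
```

In other words, the tag only takes effect once the page is allowed to be fetched, which is the trade-off against crawl budget being discussed here.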

bjs2010

Hi guys,

Appreciate your replies, but as far as I checked last time, if a URL is blocked by the robots.txt file, Google cannot read the meta noindex, follow tag within the page.

There are no external references to these URLs, so Google is finding them within the site itself.

In essence, what you are recommending is that I lift the robots.txt block and let Google crawl these pages (which could be infinite, as it is faceted navigation).

This will waste my crawl budget.

Any other ideas?

Devanur-Rafi

AndersS has pointed to the right article. With robots.txt blocking, Googlebot will not crawl these pages (and so will not discover links) from within the website, but what if references to these blocked pages are found elsewhere, on third-party websites? That is the situation you are in. So to fully stop Google from discovering and indexing these blocked pages, you should use the page-level meta robots tag on them. Once this is in place, the issue will fade away.

This issue has been addressed many times here on Moz.

Coming to your concern about the crawl budget: there is nothing to worry about here, as Google will not crawl those pages while it is on your website, since they are already blocked by the robots.txt file.

Hope it helps, my friend.

Best regards,

Devanur Rafi

AndersS

Hi!

It could be that the pages had already been indexed before you added the directives to robots.txt.

I see that you have added the rel=canonical for the pages and that you now have noindex,follow. Is that recently added? If so, it could be wise to actually let GoogleBot access and crawl the pages again - and then they'll go away after a while. Then you could add the directive again later. See https://support.google.com/webmasters/answer/93710?hl=en&ref_topic=4598466 for more about this.

Hope this helps!
Anders

bjs2010

For example:
http://www.sekretza.com/eng/best-sellers-sekretza-products.html?price=1%2C1000

Is blocked by using:
Disallow: /*price=

.... ?
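
For anyone wanting to sanity-check a rule like this outside the robots.txt tester, below is a rough Python sketch of Google-style pattern matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names `rule_to_regex` and `is_disallowed` are hypothetical, and the sketch checks a single Disallow rule only; it ignores Allow rules, rule precedence and user-agent groups. As far as I know, Python's built-in `urllib.robotparser` does plain prefix matching and would not expand the `*`, which is why the matching is hand-rolled here.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one robots.txt path pattern into a regex.
    '*' matches any sequence of characters; a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(url: str, rule: str) -> bool:
    """Check the URL's path plus query string against a single Disallow rule."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return rule_to_regex(rule).match(target) is not None


example = "http://www.sekretza.com/eng/best-sellers-sekretza-products.html?price=1%2C1000"
print(is_disallowed(example, "/*price="))  # True
```

Under these assumptions the example URL does match `Disallow: /*price=`, which agrees with what the robots.txt tester reported. The rule only stops crawling, though; it does not by itself keep the URL out of the index.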
