Pages getting into the Google index despite being blocked by robots.txt?
-
Hi all,
Yesterday we set out to remove URLs that had got into the Google index but were not supposed to be there, due to faceted navigation. We found the URLs by running these searches in Google:
site:www.sekretza.com inurl:price=
site:www.sekretza.com inurl:artists=
This brings up a list of "duplicate" pages, and they show the usual: "A description for this result is not available because of this site's robots.txt – learn more."
We submitted them all for removal, and Google removed every single one.
This morning I ran a check and found that more are creeping in. If I take one of the suspected dupes to the robots.txt Tester, Google tells me it's blocked, and yet it's appearing in their index?
I'm confused as to why a path that is blocked is able to get into the index. I'm thinking of lifting the robots.txt block so that Google can see that these pages also have a meta NOINDEX,FOLLOW tag, but surely that will waste my crawl budget on unnecessary pages?
Any ideas?
thanks.
-
Oh, OK. If that's the case, please don't worry about the ones in the index. You can get them removed using the Remove URLs feature in your Webmaster Tools account.
-
It doesn't show any result for the "blocked page" when I do that in Google.
-
Hi,
Please try this and let us know the results:
Suppose this is one of the pages in discussion:
http://www.yourdomain.com/blocked-page.html
Go to Google and type the following, including the double quotes, replacing the example with the actual page:
"yourdomain.com/blocked-page.html" -site:yourdomain.com
-
Hi!
From what I could tell, there weren't that many pages already in the index, so it could be worth lifting the block, at least for a short while, to see whether it has an impact.
In addition, how about configuring how Googlebot should treat your URLs via the URL Parameters tool in Google Webmaster Tools? Here's what Google has to say about it: https://support.google.com/webmasters/answer/1235687
Best regards,
Anders
-
Hi Devanur.
My guess is that the problem is this: as of now, Googlebot is restricted from accessing the pages (because of robots.txt), so it never fetches them and never updates its index to reflect the "noindex, follow" declaration that seems to be in place in the <head>.
One other thing to consider is adding rel="nofollow" to all the faceted navigation links on the left.
I fully agree with you on the "crawl budget" part.
Anders
-
Hi guys,
I appreciate your replies, but as far as I know, if a URL is blocked by robots.txt, Google cannot read the meta noindex, follow tag within the page.
There are no external references to these URLs, so Google is finding them within the site itself.
In essence, what you are recommending is that I lift the robots.txt block and let Google crawl these pages (which could be infinite, as it is faceted navigation).
This will waste my crawl budget.
Any other ideas?
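
To make the point about blocked pages concrete, here is a rough Python sketch of the logic as I understand it. It is only a model of the behaviour discussed in this thread, not Google's actual implementation; the class and function names (`MetaRobotsParser`, `can_honor_noindex`) are made up for illustration.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def can_honor_noindex(html: str, blocked_by_robots: bool) -> bool:
    """Return True only if a crawler could both fetch the page and see noindex.

    Assumption: a crawler that obeys robots.txt never downloads a blocked
    page, so it never gets to read the meta tag inside it.
    """
    if blocked_by_robots:
        return False
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="NOINDEX,FOLLOW"></head></html>'
print(can_honor_noindex(page, blocked_by_robots=False))
print(can_honor_noindex(page, blocked_by_robots=True))
```

The same page gives a different answer depending only on whether the crawler is allowed to fetch it, which is the trade-off being discussed here.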
-
Anderss has pointed to the right article. With robots.txt blocking, Googlebot will not crawl these pages (link discovery) from within the website, but what if references to these blocked pages are found elsewhere, on third-party websites? That is the situation you have run into. So, to fully stop Google from discovering and indexing these blocked pages, you should use the page-level meta robots tag on them. Once this is in place, the issue will fade away.
This issue has been addressed many times here on Moz.
As for your concern about crawl budget: there is nothing to worry about, as Google will not crawl those pages while it is on your website, since they are already blocked by the robots.txt file.
Hope it helps, my friend.
Best regards,
Devanur Rafi
-
Hi!
It could be that the pages had already been indexed before you added the directives to robots.txt.
I see that you have added rel=canonical to the pages and that you now have noindex,follow. Was that added recently? If so, it could be wise to actually let Googlebot access and crawl the pages again, and then they'll go away after a while. Then you could add the directive again later. See https://support.google.com/webmasters/answer/93710?hl=en&ref_topic=4598466 for more about this.
Hope this helps!
Anders
-
For example:
http://www.sekretza.com/eng/best-sellers-sekretza-products.html?price=1%2C1000
Is blocked by using:
Disallow: /*price=.... ?
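
For anyone who wants to sanity-check a wildcard rule like this outside the robots.txt Tester, here is a minimal Python sketch. It assumes Google-style pattern matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names `rule_to_regex` and `is_disallowed` are hypothetical, and the rule tested is just the `/*price=` prefix of the line above. It is an approximation for quick checks, not a full robots.txt parser (no Allow rules, no user-agent groups, no precedence handling).

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow path pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(url: str, rule: str) -> bool:
    """Check a URL's path plus query string against a single Disallow rule."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return rule_to_regex(rule).match(target) is not None


faceted = "http://www.sekretza.com/eng/best-sellers-sekretza-products.html?price=1%2C1000"
clean = "http://www.sekretza.com/eng/best-sellers-sekretza-products.html"

print(is_disallowed(faceted, "/*price="))
print(is_disallowed(clean, "/*price="))
```

Under these assumptions the faceted URL matches the rule and the clean category URL does not, which agrees with what the Tester reports: the page is blocked from crawling, even though the bare URL can still show up in the index.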