What happens to crawled URLs subsequently blocked by robots.txt?
-
We have a very large store with 278,146 individual product pages. Since these are all various sizes and packaging quantities of fewer than 200 product categories, my feeling is that Google would be better off making sure our category pages are indexed.
I would like to block all product pages via robots.txt until we are sure all category pages are indexed, then unblock them. Our product pages rarely change and have no ratings or product reviews, so there is little reason for a search engine to revisit a product page.
The sales team is afraid that blocking a previously indexed product page will result in it being removed from the Google index, and would prefer to submit the categories by hand, 10 per day, via requested crawling.
Which is the better practice?
-
@aspenfasteners To my understanding, disallowing a page or folder in robots.txt does not remove pages from Google's index. It merely gives a directive not to crawl those pages/folders. In fact, when pages are accidentally indexed and one wants to remove them from the index, it is important NOT to disallow them in robots.txt, so that Google can crawl those pages and discover the meta NOINDEX tags on them. The meta NOINDEX tag is the directive to remove a page from the index, or to not index it in the first place. This is different from a robots.txt directive, which is intended to allow or disallow crawling. Crawling does not equal indexing.
So, you could keep the pages indexable and simply block them in your robots.txt file, if you want. If they've already been indexed, they should not disappear quickly (though they might, over time). BUT if they haven't been indexed yet, this would prevent them from being discovered.
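To make the crawl-versus-index distinction concrete, here is a minimal Python sketch. The robots.txt content, the `/products/` and `/categories/` paths, and the `www.example.com` domain are all assumptions for illustration, not the actual site structure. It only shows how a crawler would interpret a Disallow rule and how a meta robots value carries the separate noindex directive; it says nothing about how quickly Google acts on either.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block an assumed /products/ folder, leave the rest open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /products/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt allows user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_noindex(meta_robots_content: str) -> bool:
    """Return True if a meta robots content value contains the noindex directive."""
    directives = {d.strip().lower() for d in meta_robots_content.split(",")}
    return "noindex" in directives


if __name__ == "__main__":
    # Crawling is controlled by robots.txt ...
    print(is_crawlable(ROBOTS_TXT, "https://www.example.com/products/bolt-m8-100pk"))
    print(is_crawlable(ROBOTS_TXT, "https://www.example.com/categories/bolts"))
    # ... while indexing is controlled by the on-page directive, which a crawler
    # can only read if the page is NOT disallowed in robots.txt.
    print(has_noindex("noindex, follow"))
```

The point of keeping the two checks separate is that a page can be blocked from crawling yet still carry no noindex directive, which is the situation described above.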
All of that said, from reading your notes, I don't think any of this is warranted. The speed at which Google discovers pages on a website is very fast. And existing indexed pages shouldn't really get in the way of new discovery. In fact, they might help the category pages be discovered, if they contain links to the categories.
I would create a categories sitemap xml file, link to that in your robots.txt, and let that do the work of prioritizing the categories for crawling/discovery and indexation.
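As a rough sketch of that suggestion, the following Python builds a categories-only sitemap in the standard sitemaps.org format and the `Sitemap:` line that would go into robots.txt. The category URLs and the `sitemap-categories.xml` filename are made up for the example; a real list would come from the store's catalog.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_sitemap_line(sitemap_url):
    """Return the robots.txt line that points crawlers at the sitemap."""
    return f"Sitemap: {sitemap_url}"


if __name__ == "__main__":
    # Hypothetical category URLs, for illustration only.
    categories = [
        "https://www.example.com/categories/bolts",
        "https://www.example.com/categories/washers",
    ]
    print(build_sitemap(categories))
    print(robots_sitemap_line("https://www.example.com/sitemap-categories.xml"))
```

With fewer than 200 categories, a single file like this is well within the sitemap protocol's limit of 50,000 URLs per file.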
-
@aspenfasteners to answer your question: "do we KNOW that Google will immediately de-index URL's blocked by robots.txt?"
Google will not immediately de-index URLs that are blocked by robots.txt, based on my experience. I've dealt with a very similar situation, but at a much greater scale: around 8M automatically generated pages that got into Google's index. It may take a year or more to de-index these pages completely. Of course, every case is different, but based on my understanding, if you block these low-quality product pages, Google will slowly start re-evaluating them, and it will start with the ones that get some traffic.
Here is what happens when Google re-evaluates your individual product pages:
When deciding whether to keep a page in its index or not, Google takes into account multiple factors, and one of the most important is how many backlinks (both internal and external) lead to the page. Other factors include content quality, whether the page is similar to or a duplicate of another page, Core Web Vitals scores, and your crawl budget. (External backlinks are irrelevant in your case.)
If you are afraid of losing some traffic that comes to these product pages, or you have other concerns, just do a smaller experiment: take a sample of 1000-2000 pages, block them in robots.txt or by adding a meta robots "noindex, follow" directive, and observe Google's reaction over 1-6 weeks, depending on your crawl budget.
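One possible way to set up such an experiment is sketched below in Python. It assumes you already have a full list of product URLs (for example from a sitemap or a crawl export); the URL pattern and function names are hypothetical. The sample is drawn with a fixed seed so the same pages can be identified again when you check results later.

```python
import random
from urllib.parse import urlsplit


def pick_test_sample(product_urls, sample_size=1000, seed=42):
    """Return a reproducible random sample of product URLs for the experiment."""
    urls = sorted(set(product_urls))  # de-duplicate and fix the ordering
    rng = random.Random(seed)         # fixed seed so the sample can be re-created
    return rng.sample(urls, min(sample_size, len(urls)))


def disallow_rules(sample_urls):
    """Build robots.txt lines that block exactly the sampled paths."""
    lines = ["User-agent: *"]
    for url in sample_urls:
        # "$" anchors the end of the path (a pattern Google documents support for),
        # so that /products/1 does not also block /products/10.
        lines.append(f"Disallow: {urlsplit(url).path}$")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical product URLs, for illustration only.
    all_products = [f"https://www.example.com/products/{i}" for i in range(5000)]
    sample = pick_test_sample(all_products, sample_size=1000)
    print(disallow_rules(sample[:3]))
```

If you go the meta robots route instead, the same sample list can drive which pages get the "noindex, follow" tag, but then those pages must stay crawlable so the tag can be seen.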
Another thing to check:
If you use Screaming Frog, it has a nice feature to show internal pagerank and the number of internal incoming links that lead to every page. As a rule of thumb, if an individual product page has at least 10 internal incoming links from canonicalized pages, there is a high probability it will get indexed.
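If you export the internal link data from a crawler as a list of (source, target) pairs, the rule of thumb can be checked with a few lines of Python. This is only a sketch: the pair format and the function names are assumptions, not a Screaming Frog API, and it does not check whether the linking pages are canonicalized.

```python
from collections import Counter


def count_internal_inlinks(links):
    """Count unique internal inlinks per target from (source_url, target_url) pairs."""
    unique = {(src, dst) for src, dst in links if src != dst}  # drop self-links and repeats
    return Counter(dst for _, dst in unique)


def pages_below_threshold(pages, inlinks, threshold=10):
    """Return the pages that have fewer than `threshold` internal inlinks."""
    return [page for page in pages if inlinks.get(page, 0) < threshold]


if __name__ == "__main__":
    # Tiny made-up link graph, for illustration only.
    links = [
        ("/categories/bolts", "/products/bolt-m8"),
        ("/categories/bolts", "/products/bolt-m8"),   # duplicate, counted once
        ("/categories/washers", "/products/bolt-m8"),
        ("/products/bolt-m8", "/products/bolt-m8"),   # self-link, ignored
    ]
    inlinks = count_internal_inlinks(links)
    print(inlinks["/products/bolt-m8"])
    print(pages_below_threshold(["/products/bolt-m8", "/products/nut-m8"], inlinks))
```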
-
@terentyev - sorry, I can't edit my questions once submitted and have to wait for approval (why?). The statement should read: my question SHOULD be very specific, whereas my original question was much more general, and you answered that question very nicely. Sorry for any misunderstanding.
-
@terentyev thanks for the reply. We have no reason to believe these URLs are backlinked. These aren't consumer products that individuals are interested in; our site is a wholesale B2B store selling very narrow categories in bulk quantities, typically for manufacturing. Therefore, there is almost zero chance of backlinks anywhere for something as specific as a particular size/material/package quantity of a product.
We have already started a canonicalization project, but we are stuck between two concerns from sales: 1) we can't wait for canonicalization (which is complex), we need sales now, and 2) don't touch robots.txt because MAYBE the individual products are indexed.
So that is why my question is very specific - do we KNOW that Google will immediately de-index URL's blocked by robots.txt?
-
@aspenfasteners thanks for the interesting question.
To summarize my understanding:
- you have ~300K individual product pages, many of them duplicates; e.g. a single product can have multiple characteristics (e.g. size or quantity), but the pages are essentially the same.
- your goal is to index 200 product categories that contain a collection of these products, and remove the low-quality duplicate individual pages from Google index in the long run.
- my assumption is that these 300K product pages have been historically accumulating some backlinks, which is one of the reasons why they are indexed.
If I am right about the first two points, then you should not block these individual product pages, but rather add canonical URLs to them, which should point to the respective category page that you want to get indexed.
Once you have these canonicals implemented, you should wait a few months or more for Google to pass the link equity to your 200 product category pages. Once that is done, you are free to block the product pages from indexing via robots.txt plus a meta tag on the page itself, and maybe even an X-Robots-Tag header. How to block them is a different discussion. Let me know if you want to learn more about the best approach.
So, here is my checklist for this URL migration:
- add canonicals pointing from product pages to category pages.
- make sure that all category pages are well interlinked with each other, and that the individual product pages are linked to several category pages (e.g. product A should be linked to category A, and also to similar categories B & C). As a rule of thumb, make sure that each category page has at least 10 incoming links from other category pages.
- Make sure that all these category pages are linked from your homepage
- Make sure that sitemap contains only self-canonicalized pages.
- Make sure that these category pages have good Core Web Vitals metrics compared to your competitors on the SERP.
- In 2-3 months, when you see that Google indexes the category pages, crawling of product pages has been reduced significantly, and the ranks of the category pages have gone up, it is OK to block these 300K pages from crawling.
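The first and fourth checklist items can be sketched in Python as follows. The product-to-category mapping and all URLs are hypothetical; in practice the mapping would come from your catalog, and the tag would be rendered by your store's templates rather than by a standalone script.

```python
from html import escape


def canonical_tag(canonical_url):
    """Return the <link rel="canonical"> tag to place in a page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


def self_canonical_urls(canonical_map):
    """Given {page_url: canonical_url}, keep only pages that are their own canonical.

    These are the URLs that belong in the sitemap.
    """
    return sorted(url for url, canonical in canonical_map.items() if url == canonical)


if __name__ == "__main__":
    # Hypothetical mapping: product pages point at their category, categories at themselves.
    canonical_map = {
        "https://www.example.com/categories/bolts": "https://www.example.com/categories/bolts",
        "https://www.example.com/products/bolt-m8-100pk": "https://www.example.com/categories/bolts",
        "https://www.example.com/products/bolt-m8-500pk": "https://www.example.com/categories/bolts",
    }
    print(canonical_tag(canonical_map["https://www.example.com/products/bolt-m8-100pk"]))
    print(self_canonical_urls(canonical_map))
```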
As for submitting the categories by hand, I doubt it will help, especially if the product pages have a lot of backlinks. I've seen many cases where Google disregards the robots.txt directives if a page has good backlinks and traffic.