What happens to crawled URLs subsequently blocked by robots.txt?

AspenFasteners

We have a very large store with 278,146 individual product pages. Since these are all various sizes and packaging quantities of less than 200 product categories my feeling is that Google would be better off making sure our category pages are indexed.

I would like to block all product pages via robots.txt until we are sure all category pages are indexed, then unblock them. Our product pages rarely change, no ratings or product reviews so there is little reason for a search engine to revisit a product page.

The sales team is afraid blocking a previously indexed product page will result in in it being removed from the Google index and would prefer to submit the categories by hand, 10 per day via requested crawling.

Which is the better practice?

seoelevated

@aspenfasteners To my understanding, disallowing a page or folder in robots.txt does not remove pages from Google's index. It merely gives a directive to not crawl those pages/folders. In fact, when pages are accidentally indexed and one wants to remove them from the index, it is important to actually NOT disallow them in robots.txt, so that Google can crawl those pages and discover the meta NOINDEX tags on the pages. The meta NOINDEX tags are the directive to remove a page from the index, or to not index it in the first place. This is different than a robots.txt directive, whcih is intended to allow or disallow crawling. Crawling does not equal indexing.

So, you could keep the pages indexable, and simply block them in your robots.txt file, if you want. If they've already been indexed, they should not disappear quickly (they might, over time though). BUT if they haven't been indexed yet, this would prevent them from being discovered.

All of that said, from reading your notes, I don't think any of this is warranted. The speed at which Google discovers pages on a website is very fast. And existing indexed pages shouldn't really get in the way of new discovery. In fact, they might help the category pages be discovered, if they contain links to the categories.

I would create a categories sitemap xml file, link to that in your robots.txt, and let that do the work of prioritizing the categories for crawling/discovery and indexation.

terentyev

@aspenfasteners to answer your question: "do we KNOW that Google will immediately de-index URL's blocked by robots.txt?"

Google will not immediately de-index URLs that are blocked by robots.txt, based on my experience. I've dealt with very similar situation but with much greater scale - around 8M automatically generated pages that got into Google index. It may take a year or more to de-index these pages completely. Of course, every case is different, but based on my understanding, if you block these low-quality product pages, Google will slowly start re-evaluating these pages, and it will start with the ones that get some traffic.

Here is what happens when Google re-evaluates your individual product pages:

When deciding, whether to keep a page in its index or not, Google takes into account multiple factors, and one of the most important ones is how many backlinks (both internal and external) are leading to a page. Other factors - content quality, if the page is similar or duplicate to another page, Core Web Vitals score, amount of your crawl budget, and, of course, external backlinks (which is irrelevant for your case).

If you are afraid of loosing some traffic that comes to these product pages, or you have other concerns, just do a smaller experiment: take a sample of 1000-2000 pages, block them in robots.txt or by adding meta robots "noindex, follow" directive, and observe Google's reaction in 1-6 weeks, depending on your crawl budget.

Another thing to check:

If you use Screaming Frog, it has a nice feature to show internal pagerank and the number of internal incoming links that lead to every page. As a rule of thumb, if an individual product page has at least 10 internal incoming links from canonicalized pages, there is a high probability it will get indexed.

AspenFasteners

@terentyev - sorry, can't edit my questions once submitted and I wait for approval (why?) the statement should read my question SHOULD be very specific, whereas my original question was much more general - you answered that question very nicely. Sorry for any misunderstanding

AspenFasteners

@terentyev thanks for the reply. We have no reason to believe these URL's are backlinked. These aren't consumer products that individual are interested in, our site is a wholesale B2B selling very narrow categories in bulk quantities typically for manufacturing. Therefore, almost zero chance for backlinks anywhere for something as specific as a particular size/material/package quantity of a product.

We have already initiated a canonicalization project started but we are stuck between two concerns from sales, 1) we can't wait for canonicalization (which is complex) we need sales now and 2) don't touch robots.txt because MAYBE the individual products are indexed.

So that is why my question is very specific - do we KNOW that Google will immediately de-index URL's blocked by robots.txt?

terentyev

@aspenfasteners thanks for interesting question.
to summarize my understanding:

you have ~300K individual product pages, many of them are duplicates; eg. a single product can have multiple characteristics (eg. size or quantity) but the pages are essentially the same.
your goal is to index 200 product categories that contain a collection of these products, and remove the low-quality duplicate individual pages from Google index in the long run.
my assumption is that these 300K product pages have been historically accumulating some backlinks, which is one of the reasons why they are indexed.

If I am right about the 1 and 2, then you should not block these individual product pages, but rather add canonical URLs to them, which should point to the respective category page that you want to get indexed.

Once you have these canonicals implemented, you should wait for a few months or more for Google to pass the link equity to your 200 product category pages, and once it is done, you are free to block them from indexing on robots.txt + meta tag on the page itself, and maybe even x-robots-tag. The way how to block them - it is a different discussion. Let me know if you want to learn more on the best approach.

So, here is my checklist for this URL migration:

add canonicals pointing from product pages to category pages.
make sure that all category pages are well interlinked between each other, and the individual product pages are linked to several category pages (eg. a product A should be linked to category A, and also to similar categories B & C). As a rule of thumb, make sure that each category page has at least 10 incoming links from other category pages.
Make sure that all these category pages are linked from your homepage
Make sure that sitemap contains only self-canonicalized pages.
Make sure that these category pages have good core web vitals metrics, compared to your competitors on SERP.
In 2-3 months, when you see that Google indexes the category pages, and crawling of product pages have been reduced significantly, and the ranks of the category pages have gone up, it is ok to block these 300K pages from crawling.

As to manually submitting the categories by hand, I doubt it will help, especially if the product pages have a lot of backlinks. I've seen many cases when Google disregards the robots.txt directives if a page has good backlinks and traffic.

Welcome to the Q&A Forum

Browse the forum for helpful insights and fresh discussions about all things SEO.

What happens to crawled URLs subsequently blocked by robots.txt?

Browse Questions

Explore more categories

Related Questions

Robots.txt was set to disallow for 14 days

Migrating From Parameter-Driven URL's to 'SEO Friendly URL's (Slugs)

How to block search bots in crawling my site except for homepage?

Weird 404 URL Problem - domain name being placed at end of urls

Short Url vs Medium Urls ?

URL blocked

What content should I block in wodpress with robots.txt?

Robots.txt disallow subdomain