Can I Block HTTPS URLs Using the Host Directive in robots.txt?
-
Hello Moz Community,
Recently, I have found that Googlebot has started crawling the HTTPS URLs of my website, which is increasing the number of duplicate pages on our site.
Instead of creating a separate robots.txt file for the HTTPS version of my website, can I use the Host directive in robots.txt to tell Googlebot which version of the website is the original?
Host: http://www.example.com
I was wondering if this method will work and tell Googlebot that the HTTPS URLs are a mirror of this website.
Thanks for all of the great responses!
Regards,
Ramendra -
Hi Ramendra,
To my knowledge, you can only provide directives in a robots.txt file for the exact domain on which it lives. This goes for both the http/https and www/non-www versions of a domain, which is why it's important to handle your preferred domain format with redirects that point to your canonical version. So if you want http://www to be indexed, all other versions should redirect to it.
There might be a workaround of some sort, but honestly, redirecting to your preferred version as described above is the direction you should take. Then you can manage one robots.txt file, and your indexing will better align with what you want.
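For the www/non-www piece, here's a minimal .htaccess sketch of that kind of redirect, assuming an Apache server with mod_rewrite enabled and using www.example.com as a placeholder for your preferred host:

RewriteEngine On
# Redirect any request whose host is not the preferred www version
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

A single rule like this catches every non-www URL in one pass, so no one-to-one redirects are needed.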
-
Thanks Logan,
I have read somewhere that by using the Host directive in the robots.txt file, we can tell Googlebot which version of the website is the original when there are a number of mirror sites. So, I was wondering if we can prevent indexing/crawling of HTTPS URLs by using the Host directive in the robots.txt of the HTTP site.
We are using an ecommerce SaaS platform for our website, where we have only one robots.txt file, which we can use only for the HTTP site.
Is there any other way to prevent indexing/crawling of HTTPS URLs?
Regards,
Ramendra -
Hi Ramendra,
Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:
- Robots.txt cannot be used on an HTTP site to prevent indexing/crawling of HTTPS URLs
- Google crawls HTTPS by default, so if your site is fully secure, you need to redirect HTTP URLs over to their HTTPS twins. This can be done with a single rewrite rule in .htaccess (see the sketch after this list); you don't need one-to-one redirects
- In addition to your HTTP-to-HTTPS redirects, you should also use canonical tags (also shown below) to signal your preferred version to search engines
- Your HTTPS site should have its own robots.txt file
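For illustration, here's a minimal sketch of the redirect and canonical pieces, again assuming an Apache server with mod_rewrite and a placeholder domain; a hosted SaaS platform may expose these settings differently or not at all:

RewriteEngine On
# Send any request that arrives over plain HTTP to its HTTPS twin,
# preserving the host and path
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]

And in the head of each page, a canonical tag pointing at the preferred HTTPS URL, for example:

<link rel="canonical" href="https://www.example.com/your-page/" />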