Can I Block https URLs using Host directive in robots.txt?
-
Hello Moz Community,
Recently, I have found that Google bots has started crawling HTTPs urls of my website which is increasing the number of duplicate pages at our website.
Instead of creating a separate robots.txt file for https version of my website, can I use Host directive in the robots.txt to suggest Google bots which is the original version of the website.
Host: http://www.example.com
I was wondering if this method will work and suggest Google bots that HTTPs URLs are the mirror of this website.
Thanks for all of the great responses!
Regards,
Ramendra -
Hi Ramendra,
To my knowledge, you can only provide directives in the robots.txt file for the domain on which it lives. This goes for both http/https and www/non-www versions of domains. This is why it's important to handle all preferred domain formatting with redirects, that point to your canonicalized version. So if you want http://www to index, all other versions redirect to that.
There might be a work around of some sort, but honestly, what I described above with redirection towards preferred versions is the direction you should take. Then you can manage one robots.txt file and your indexing will align with what you want better.
-
Thanks Logan,
I have read somewhere that using Host directive in the robots.txt file we can suggest Google bots which is the original version of the website if there are number of mirror sites. So, I was wondering if we can prevent indexing/crawling of HTTPS URLs by using Host directive in robots.txt of HTTP site.
We are using an ecommerce SAAS platform for our website where we have only one robots.txt file that we can use for HTTP site.
Is there any other way to prevent indexing/crawling of HTTPS URLs?
Regards,
Ramendra -
Hi Ramendra,
Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:
- Robots.txt cannot be used on a HTTP site to prevent indexing/crawling of HTTPS URLs
- Google crawls HTTPS by default, so if your site is fully secure, then you need to redirect (this can be done with a redirect rule in HTACCESS, you don't need to do one-to-one redirects) HTTP URLs over to their HTTPS twin
- In addition to your HTTP>HTTPS redirects, you should also use canonical tags to push your preferred version to search engines
- Your HTTPS site should have its own robots.txt file
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Is 301 redirect the only way when using Vanity URLs?
We have been using vanity urls for some of our pages. Mostly the pages that have a vanity URL have a long URL length. But now the problem is, the vanity URL is getting displayed on the search engine when the particular keyword related to the page is entered. I checked the google search console, the vanity URL is indexed and the original URL remains unindexed. What should I do? Is adding 301 redirect to the vanity URLs are solution? Since some of vanity URLs are not redirecting to the original. Some of the original pages are not getting traffic. Also, can using canonical tag help?
Technical SEO | | tejasbansode0 -
We just can't figure out the right anchor text to use
We have been trying everything we can with anchor text. We have read here that we should try naturalistic language. Our competitors who are above us in Google search results don't do any of this. They only use their names or a single term like "austin web design". Is what we are doing hurting our listings? We don't have any black hat links. Here's what we are doing now. We are going crazy trying to figure this out. We are afraid to do anything in fear it will damage our position. Bob | pallasart web design | 31 | 1,730 |
Technical SEO | | pallasart
| website by pallasart a texas web design company in austin | 15 | 1,526 |
| website by the austin design company pallasart | 14 | 1,525 |
| created by pallasart a web design company in austin texas | 13 | 1,528 |
| created by an austin web design company pallasart | 12 | 1,499 |
| website by pallasart web design an austin web design company | 12 | 1,389 |
| website by pallasart an austin web design company | 11 | 1,463 |
| pallasart austin web design | 9 | 2,717 |
| website created by pallasart a web design company in austin texas | 9 | 1,369 |
| website by pallasart | 8 | 910 |
| austin web design | 5 | 63 |
| pallasart website design austin |0 -
Using the Google Remove URL Tool to remove https pages
I have found a way to get a list of 'some' of my 180,000+ garbage URLs now, and I'm going through the tedious task of using the URL removal tool to put them in one at a time. Between that and my robots.txt file and the URL Parameters, I'm hoping to see some change each week. I have noticed when I put URL's starting with https:// in to the removal tool, it adds the http:// main URL at the front. For example, I add to the removal tool:- https://www.mydomain.com/blah.html?search_garbage_url_addition On the confirmation page, the URL actually shows as:- http://www.mydomain.com/https://www.mydomain.com/blah.html?search_garbage_url_addition I don't want to accidentally remove my main URL or cause problems. Is this the right way this should look? AND PART 2 OF MY QUESTION If you see the search description in Google for a page you want removed that says the following in the SERP results, should I still go to the trouble of putting in the removal request? www.domain.com/url.html?xsearch_... A description for this result is not available because of this site's robots.txt – learn more.
Technical SEO | | sparrowdog1 -
The use of robots.txt
Could someone please confirm that if I do not want to block any pages from my URL, then I do not need a robots.txt file on my site? Thanks
Technical SEO | | ICON_Malta0 -
Robots.txt to disallow /index.php/ path
Hi SEOmoz, I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?. I don't use that extension, but would it cause me any problems from an SEO perspective? How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
Technical SEO | | Mikkehl0 -
Google (GWT) says my homepage and posts are blocked by Robots.txt
I guys.. I have a very annoying issue.. My Wordpress-blog over at www.Trovatten.com has some indexation-problems.. Google Webmaster Tools data:
Technical SEO | | FrederikTrovatten22
GWT says the following: "Sitemap contains urls which are blocked by robots.txt." and shows me my homepage and my blogposts.. This is my Robots.txt: http://www.trovatten.com/robots.txt
"User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/ Do you have any idea why it says that the URL's are being blocked by robots.txt when that looks how it should?
I've read a couple of places that it can be because of a Wordpress Plugin that is creating a virtuel robots.txt, but I can't validate it.. 1. I have set WP-Privacy to crawl my site
2. I have deactivated all WP-plugins and I still get same GWT-Warnings. Looking forward to hear if you have an idea that might work!0 -
Robots.txt Showing in SERP Results
Currently doing a technical audit for a website and when I search "Site:website.com -www" the only result is website.com/robots.txt I was wondering if anyone else has come across this before -- or what this may mean from a technical audit standpoint. Thank you!
Technical SEO | | vectormedia0 -
Can anyone recommend a good hosting company in the UK
Hi can anyone recommend some good hosting companies in the UK as my hosting company are now considering charging for technical advice. I use some standard packages as well as a dedicated server and would be grateful if you could give me some examples on prices as well as service which includes technical service.
Technical SEO | | ClaireH-1848860