Can I Block HTTPS URLs Using the Host Directive in robots.txt?
-
Hello Moz Community,
Recently, I have found that Googlebot has started crawling the HTTPS URLs of my website, which is increasing the number of duplicate pages on our site.
Instead of creating a separate robots.txt file for the HTTPS version of my website, can I use the Host directive in robots.txt to tell Googlebot which version of the website is the original?
Host: http://www.example.com
I was wondering if this method will work and tell Googlebot that the HTTPS URLs are a mirror of this website.
Thanks for all of the great responses!
Regards,
Ramendra -
Hi Ramendra,
To my knowledge, you can only provide directives in a robots.txt file for the exact domain on which it lives. This goes for both the http/https and www/non-www versions of a domain, which is why it's important to handle your preferred domain format with redirects that point to your canonical version. So if you want http://www to be indexed, all other versions should redirect to it.
There might be a workaround of some sort, but honestly, redirecting to your preferred version as described above is the direction you should take. Then you can manage one robots.txt file, and your indexing will better align with what you want.
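For the www/non-www piece, here's a minimal .htaccess sketch of that kind of redirect, assuming an Apache server with mod_rewrite enabled and using www.example.com as a placeholder for your preferred host:

RewriteEngine On
# Redirect any request whose host is not the preferred www version
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

A single rule like this catches every non-www URL in one pass, so no one-to-one redirects are needed.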
-
Thanks Logan,
I have read somewhere that by using the Host directive in the robots.txt file, we can tell Googlebot which version of the website is the original when there are a number of mirror sites. So, I was wondering if we can prevent indexing/crawling of HTTPS URLs by using the Host directive in the robots.txt of the HTTP site.
We are using an ecommerce SaaS platform for our website, where we have only one robots.txt file, which we can use only for the HTTP site.
Is there any other way to prevent indexing/crawling of HTTPS URLs?
Regards,
Ramendra -
Hi Ramendra,
Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:
- Robots.txt cannot be used on an HTTP site to prevent indexing/crawling of HTTPS URLs
- Google crawls HTTPS by default, so if your site is fully secure, you need to redirect HTTP URLs over to their HTTPS twins. This can be done with a single rewrite rule in .htaccess (see the sketch after this list); you don't need one-to-one redirects
- In addition to your HTTP-to-HTTPS redirects, you should also use canonical tags (also shown below) to signal your preferred version to search engines
- Your HTTPS site should have its own robots.txt file
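For illustration, here's a minimal sketch of the redirect and canonical pieces, again assuming an Apache server with mod_rewrite and a placeholder domain; a hosted SaaS platform may expose these settings differently or not at all:

RewriteEngine On
# Send any request that arrives over plain HTTP to its HTTPS twin,
# preserving the host and path
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]

And in the head of each page, a canonical tag pointing at the preferred HTTPS URL, for example:

<link rel="canonical" href="https://www.example.com/your-page/" />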