Disallow statement - is this tiny anomaly enough to render Disallow invalid?
-
Google site search (site:'hbn.hoovers.com') indicates 171,000 results for this subdomain. That is not a desired result - this site has 100% duplicate content. We don't want SEs spending any time here.
Robots.txt is set up mostly right to disallow all search engines from indexing this site. That asterisk at the end of the disallow statement looks pretty harmless - but could that be why the site has been indexed?
User-agent: * Disallow: /*
-
Interesting. I'd never heard that before.
We've never had GA or GWT on these mirror sites before, so it's hard to say what Google is doing these days.
But the goal is definitely to make them and their contents invisible to SEs. We'll get GWT on there and start removing URLs.
Thanks!
-
The additional asterisk shouldn't do you any harm, although standard practice seems to be just putting the "/".
Does it seem like Google is still crawling this subdomain when you look at webmasters crawl stats? While the disallow function in robots.txt will usually stop bots from crawling, it doesn't prevent them from indexing or keeping pages indexed that were before the disallow was put in place. If you want these pages removed from the index, you can request it through webmasters and also use meta robots noindex as opposed to the robots.txt file. Moz has a good article about it here: http://moz.com/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts
If you're just worried about bots crawling the subdomain, it's possible they've already stopped crawling it, but continue to index it due to history or additional indicators suggesting they should index it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Invalid Microdata - How much of an impact does invalid microdata have on SERPS
Invalid Microdata - How much of an impact does invalid microdata have on SERPS?? The Low down. We are located in Australia We run our business on the Bigcommerce platform. Problem is Google is crawling our bigcommerce in USD and displaying our micro data (price in USD instead of AUD) How much of a problem is this in terms of SEO issues? We have seen a steady decline or many of our top 3 rankings shift down a few pegs to mid-bottom of top 10. We're also getting google shopping microdata warnings too. Hi, I am just wondering how we fix invalid micro data (Price) is displaying USD where we are located in Australia so it should be AUD. Solutions: Does anyone have a solution for this they can help me out with to resolve this microdata issue on the bigcommerce platform (stencil cornerstone based template)? Are there any other technical elements at first glance you note on our website that may be a potential cause in the SERP decline from top 3's to top 10's? URL https://wwww.fishingtackleshop.com.au
Technical SEO | | oceanstorm0 -
Disallow wildcard match in Robots.txt
This is in my robots.txt file, does anyone know what this is supposed to accomplish, it doesn't appear to be blocking URLs with question marks Disallow: /?crawler=1
Technical SEO | | AmandaBridge
Disallow: /?mobile=1 Thank you0 -
Implications of Disallowing A LOT of Pages
Hey everyone, I just started working on a website and there are A LOT of pages that should not be crawled - probably in the thousands. Are there any SEO risks of disallowing them all at once, or should I go through systematically and take a few dozen down at a time?
Technical SEO | | rachelmeyer1 -
All other things equal, do server rendered websites rank higher than JavaScript web apps that follow the AJAX Crawling Spec?
I instinctively feel like server rendered websites should rank higher since Google doesn't truly know that the content its getting from an AJAX site is what the user is seeing and Google isn't exactly sure of the page load time (and thus user experience). I can't find any evidence that would prove this, however. A website like Monocle.io uses pushstate, loads fast, has good page titles, etc., but it is a JavaScript single page application. Does it make any difference?
Technical SEO | | jeffwhelpley0 -
Googlebot does not obey robots.txt disallow
Hi Mozzers! We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then robots.txt disallow all URLs containing that parameter. We implemented this late august and since that, the GWMT message "Googlebot found an extremely high number of URLs on your site", stopped coming. But today we received yet another. The weird thing is that Google gives many of our nowadays robots.txt disallowed URLs as examples of URLs that may cause us problems. What could be the reason? Best regards, Martin
Technical SEO | | TalkInThePark0 -
Is Noindex Enough To Solve My Duplicate Content Issue?
Hello SEO Gurus! I have a client who runs 7 web properties. 6 of them are satellite websites, and 7th is his company's main website. For a long while, my company has, among other things, blogged on a hosted blog at www.hismainwebsite.com/blog, and when we were optimizing for one of the other satellite websites, we would simply link to it in the article. Now, however, the client has gone ahead and set up separate blogs on every one of the satellite websites as well, and he has a nifty plug-in set up on the main website's blog that pipes in articles that we write to their corresponding satellite blog as well. My concern is duplicate content. In a sense, this is like autoblogging -- the only thing that doesn't make it heinous is that the client is autoblogging himself. He thinks that it will be a great feature for giving users to his satellite websites some great fresh content to read -- which I agree, as I think the combination of publishing and e-commerce is a thing of the future -- but I really want to avoid the duplicate content issue and a possible SEO/SERP hit. I am thinking that a noindexing of each of the satellite websites' blog pages might suffice. But I'd like to hear from all of you if you think that even this may not be a foolproof solution. Thanks in advance! Kind Regards, Mike
Technical SEO | | RCNOnlineMarketing0 -
Should I set up a disallow in the robots.txt for catalog search results?
When the crawl diagnostics came back for my site its showing around 3,000 pages of duplicate content. Almost all of them are of the catalog search results page. I also did a site search on Google and they have most of the results pages in their index too. I think I should just disallow the bots in the /catalogsearch/ sub folder, but I'm not sure if this will have any negative effect?
Technical SEO | | JordanJudson0 -
How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?
Today's sitemap webinar made me think about the disallow feature, seems opposite of sitemaps, but it also seems both are kind of ignored in varying ways by the engines. I don't need help semantically, I got that part. I just can't seem to find a contemporary answer about what should be blocked using the robots.txt file. For example, I have folders containing site comps for clients that I really don't want showing up in the SERPS. Is it better to not have these folders on the domain at all? There are also security issues I've heard of that make sense, simply look at a site's robots file to see what they are hiding. It makes it easier to hunt for files when they know the directory the files are contained in. Do I concern myself with this? Another example is a folder I have for my xml sitemap generator. I imagine google isn't going to try to index this or count it as content, so do I need to add folders like this to the disallow list?
Technical SEO | | SpringMountain0