Is robots met tag a more reliable than robots.txt at preventing indexing by Google?
-
What's your experience of using robots meta tag v robots.txt when it comes to a stand alone solution to prevent Google indexing?
I am pretty sure robots meta tag is more reliable - going on own experiences, I have never experience any probs with robots meta tags but plenty with robots.txt as a stand alone solution.
Thanks in advance, Luke
-
Hi there,
Regarding the X-Robots tag. We have had a couple of sites that were disallowed in the robots.txt have their PDF, Doc etc files get indexed. I understand the reasoning for this. I would like to remove the disallow in the robots.txt and use the X-robots tag to noindex all pages as well as PDF, Doc files etc. This is for a ngnix configuation. Does anyone know what the written x-robots tag would look like in this case?
-
Test for what works for your site.
Use tools below
- https://www.deepcrawl.com/ (will give you one free full crawl)
- https://www.screamingfrog.co.uk/seo-spider/ (free up to 500 URLs)
- http://urlprofiler.com/ (14 days free try)
- https://www.deepcrawl.com/blog/best-practice/noindex-disallow-nofollow/
- https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#robots-txt
- https://www.deepcrawl.com/blog/best-practice/noindex-and-google/
So much info
https://www.deepcrawl.com/blog/tag/robots-txt/
Thomas
-
Hi Luke,
In order to exclude individual pages from search engine indices, the noindex meta tag
is actually superior to robots.txt.
But X-Robots-Tag header tag is the best but much hader to use.
Block all web crawlers from all content
User-agent: * Disallow: /
Using the
robots.txt
file, you can tell a spider where it cannot go on your site. You can not tell a search engine which URLs it cannot show in the search results. This means that not allowing a search engine to crawl an URL – called “blocking” it – does not mean that URL will not show up in the search results. If the search engine finds enough links to that URL, it will include it; it will just not know what’s on that page.If you want to reliably block a page from showing up in the search results, you need to use a meta robots
noindex
tag. That means the search engine has to be able to index that page and find thenoindex
tag, so the page should not be blocked byrobots.txt
a
robots.txt
file does. In a nutshell, what it does is tell search engines to not crawl a particular page, file or directory of your website.Using this, helps both you and search engines such as Google. By not providing access to certain, unimportant areas of your website, you can save on your crawl budget and reduce load on your server.
Please note that using the
robots.txt
file to hide your entire website for search engines is definitely not recommended.see big photo: http://i.imgur.com/MM7hM4g.png
_(…)_ _(…)_
The robots meta tag in the above example instructs all search engines not to show the page in search results. The value of the
name
attribute (robots
) specifies that the directive applies to all crawlers. To address a specific crawler, replace therobots
value of thename
attribute with the name of the crawler that you are addressing. Specific crawlers are also known as user-agents (a crawler uses its user-agent to request a page.) Google's standard web crawler has the user-agent name.Googlebot
To prevent only Googlebot from crawling your page, update the tag as follows:This tag now instructs Google (but no other search engines) not to show this page in its web search results. Both the and
name
the attributescontent
are non-case sensitive.Search engines may have different crawlers for different properties or purposes. See the complete list of Google's crawlers. For example, to show a page in Google's web search results, but not in Google News, use the following meta tag:
If you need to specify multiple crawlers individually, it's okay to use multiple robots meta tags:
If competing directives are encountered by our crawlers we will use the most restrictive directive we find.
irective. This basically means that if you want to really hide something from the search engines, and thus from people using search,
robots.txt
won’t suffice.Indexer directives
Indexer directives are directives that are set on a per page and/or per element basis. Up until July 2007, there were two directives: the microformat rel=”nofollow”, which means that that link should not pass authority / PageRank, and the Meta Robots tag.
With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific file(types) are indexed.
Example uses of the X-Robots-Tag
Using the
X-Robots-Tag
HTTP headerThe
X-Robots-Tag
can be used as an element of the HTTP header response for a given URL. Any directive that can be used in an robots meta tag can also be specified as anX-Robots-Tag
. Here's an example of an HTTP response with anX-Robots-Tag
instructing crawlers not to index a page:HTTP/1.1 200 OK Date: Tue, 25 May 2010 21:42:43 GMT _(…)_ **X-Robots-Tag: noindex** _(…)_
Multiple
X-Robots-Tag
headers can be combined within the HTTP response, or you can specify a comma-separated list of directives. Here's an example of an HTTP header response which has anoarchive
X-Robots-Tag
combined with anunavailable_after
X-Robots-Tag
.HTTP/1.1 200 OK Date: Tue, 25 May 2010 21:42:43 GMT _(…)_ **X-Robots-Tag: noarchive X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST** _(…)_
The
X-Robots-Tag
may optionally specify a user-agent before the directives. For instance, the following set ofX-Robots-Tag
HTTP headers can be used to conditionally allow showing of a page in search results for different search engines:HTTP/1.1 200 OK Date: Tue, 25 May 2010 21:42:43 GMT _(…)_ **X-Robots-Tag: googlebot: nofollow X-Robots-Tag: otherbot: noindex, nofollow** _(…)_
Directives specified without a user-agent are valid for all crawlers. The section below demonstrates how to handle combined directives. Both the name and the specified values are not case sensitive.
- https://mza.seotoolninja.com/learn/seo/robotstxt
- https://yoast.com/ultimate-guide-robots-txt/
- https://mza.seotoolninja.com/blog/the-wonderful-world-of-seo-metatags
- https://yoast.com/x-robots-tag-play/
- https://www.searchenginejournal.com/x-robots-tag-simple-alternate-robots-txt-meta-tag/67138/
- https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
I hope this helps,
Tom
-
If you've recently added the "noindex" meta, get the page fetched in GWT. Google can't act if it doesn't see the tag.
-
Hi Luke,
It's a pretty common misconception that the robots.txt will prevent indexing. It's only purpose is actually to prevent crawling, anything disallowed in there is still up for indexing if it's linked to elsewhere. If you want something deindexed, your best bet is the robots meta tag, but make sure you allow crawling of the URLs to give search engine bots an opportunity to see the tag.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt wildcards - the devs had a disagreement - which is correct?
Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?” The second developer suggested that this wildcard would only block URLs featuring a ? that come immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 BUT argued that this robots.txt directive would not block URLS featuring a ? in sub directories - e.g. /shirts/blue?mprice=100&maxp=20 So which of the developers is correct? Beyond that, I assumed that the ? should feature a * on each side of it – for example - /? - to work as intended above? Am I correct in assuming that?
Intermediate & Advanced SEO | | McTaggart0 -
No images in Google index
No images are indexed on this site (client of ours): http://www.rubbermagazijn.nl/. We've tried everything (descriptive alt texts, image sitemaps, fetch&render, check robots) but a site:www.rubbermagazijn.nl shows 0 image results and the sitemap report in Search Console shows 0 images indexed. We're not sure how to proceed from here. Is there anyone with an idea what the problem could be?
Intermediate & Advanced SEO | | Adriaan.Multiply0 -
Google robots.txt test - not picking up syntax errors?
I just ran a robots.txt file through "Google robots.txt Tester" as there was some unusual syntax in the file that didn't make any sense to me... e.g. /url/?*
Intermediate & Advanced SEO | | McTaggart
/url/?
/url/* and so on. I would use ? and not ? for example and what is ? for! - etc. Yet "Google robots.txt Tester" did not highlight the issues... I then fed the sitemap through http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php and that tool actually picked up my concerns. Can anybody explain why Google didn't - or perhaps it isn't supposed to pick up such errors? Thanks, Luke0 -
Not sure how we're blocking homepage in robots.txt; meta description not shown
Hi folks! We had a question come in from a client who needs assistance with their robots.txt file. Metadata for their homepage and select other pages isn't appearing in SERPs. Instead they get the usual message "A description for this result is not available because of this site's robots.txt – learn more". At first glance, we're not seeing the homepage or these other pages as being blocked by their robots.txt file: http://www.t2tea.com/robots.txt. Does anyone see what we can't? Any thoughts are massively appreciated! P.S. They used wildcards to ensure the rules were applied for all locale subdirectories, e.g. /en/au/, /en/us/, etc.
Intermediate & Advanced SEO | | SearchDeploy0 -
Does Google index url with hashtags?
We are setting up some Jquery tabs in a page that will produce the same url with hashtags. For example: index.php#aboutus, index.php#ourguarantee, etc. We don't want that content to be crawled as we'd like to prevent duplicate content. Does Google normally crawl such urls or does it just ignore them? Thanks in advance.
Intermediate & Advanced SEO | | seoppc20120 -
Reciprocal Links and nofollow/noindex/robots.txt
Hypothetical Situations: You get a guest post on another blog and it offers a great link back to your website. You want to tell your readers about it, but linking the post will turn that link into a reciprocal link instead of a one way link, which presumably has more value. Should you nofollow your link to the guest post? My intuition here, and the answer that I expect, is that if it's good for users, the link belongs there, and as such there is no trouble with linking to the post. Is this the right way to think about it? Would grey hats agree? You're working for a small local business and you want to explore some reciprocal link opportunities with other companies in your niche using a "links" page you created on your domain. You decide to get sneaky and either noindex your links page, block the links page with robots.txt, or nofollow the links on the page. What is the best practice? My intuition here, and the answer that I expect, is that this would be a sneaky practice, and could lead to bad blood with the people you're exchanging links with. Would these tactics even be effective in turning a reciprocal link into a one-way link if you could overlook the potential immorality of the practice? Would grey hats agree?
Intermediate & Advanced SEO | | AnthonyMangia0 -
Need to duplicate the index for Google in a way that's correct
Usually duplicated content is a brief to fix. I find myself in a little predicament: I have a network of career oriented websites in several countries. the problem is that for each country we use a "master" site that aggregates all ads working as a portal. The smaller nisched sites have some of the same info as the "master" sites since it is relevant for that site. The "master" sites have naturally gained the index for the majority of these ads. So the main issue is how to maintain the ads on the master sites and still make the nische sites content become indexed in a way that doesn't break Google guide lines. I can of course fix this in various ways ranging from iframes(no index though) and bullet listing and small adjustments to the headers and titles on the content on the nisched sites, but it feels like I'm cheating if I'm going down that path. So the question is: Have someone else stumbled upon a similar problem? If so...? How did you fix it.
Intermediate & Advanced SEO | | Gustav-Northclick0 -
IP address being indexed by Google in addition to canonical domain.
Our site's IP address is being indexed in addition to the canonical www.example.com domain. As soon as it was flagged a 301 was implemented in the .htaccess file to redirect the IP address to the canonical. Does this usually occur? Is it detrimental to SEO? In my time in SEO I've never heard of this being an issue, or being part of a list of things to be checked. It sounds more like a server that wasn't configured correctly when hosting was set up? It didn't seem to be affecting the site at all, but is it more common and I've just never heard of it? 😛 Should it be something I'm usually looking for in future? Responses are greatly appreciated!
Intermediate & Advanced SEO | | mikeimrie0