Robots.txt Blocking - Best Practices
-
Hi All,
We have a web provider who's not willing to remove the wildcard line of code blocking all agents from crawling our client's site (user-agent: *, Disallow: /). They have other lines allowing certain bots to crawl the site but we're wondering if they're missing out on organic traffic by having this main blocking line. It's also a pain because we're unable to set up Moz Pro, potentially because of this first line.
We've researched and haven't found a ton of best practices regarding blocking all bots, then allowing certain ones. What do you think is a best practice for these files?
Thanks!
User-agent: * Disallow: / User-agent: Googlebot Disallow: Crawl-delay: 5 User-agent: Yahoo-slurp Disallow: User-agent: bingbot Disallow: User-agent: rogerbot Disallow: User-agent: * Crawl-delay: 5 Disallow: /new_vehicle_detail.asp Disallow: /new_vehicle_compare.asp Disallow: /news_article.asp Disallow: /new_model_detail_print.asp Disallow: /used_bikes/ Disallow: /default.asp?page=xCompareModels Disallow: /fiche_section_detail.asp
-
Thanks for taking the time to respond in depth, GreenStone. We appreciate the advice and have passed your response along to the web hosting company (along with a frustrated email) explaining they're not adhering to anyone's best practices. Hopefully this will convince them!
-
Thanks, Dmitrii for your response! From our research we've seen similar recommendations and it helps to have more evidence to back it up. Hopefully these guys will give in a bit!
-
Completely agree, I really wouldn't want to host my stuff with a company that can't figure out what really the best practices are ;-). This is very well layed out why you shouldn't want to set up your robots.txt like it is right now.
-
In general, I definitely wouldn't recommend the way the web-provider is handling this.
- Disallowing all while adding exceptions should never be the norm. Allowing all to crawl while adding exceptions for other crawlers aside from google would be best practice generally,
- It makes a lot more sense to just allow crawlers full access, and then add crawl delays for non google crawlers, in addition to disallowing those specific sub-folders: Disallow: /new_vehicle_detail.asp Disallow: /new_vehicle_compare.asp Disallow: /news_article.asp Disallow: /new_model_detail_print.asp Disallow: /used_bikes/ Disallow: /default.asp?page=xCompareModels Disallow: /fiche_section_detail.asp.
- Googlebot Disallow: Crawl-delay: 5, does not do you any good, as google does not obey these commands. Only Search Console can control this.
- You can test what is visible to googlebot within search console's "robots" subsection, in order to verify what they can access.
- Disallowing all while adding exceptions should never be the norm. Allowing all to crawl while adding exceptions for other crawlers aside from google would be best practice generally,
-
Here is another video from Matt - https://www.youtube.com/watch?v=I2giR-WKUfY
Lots of good points there too.
-
Hi.
Super weird client - that's for sure.
User-agent: * Disallow: /
Every bot will be blocked off! how in the world are they ranking?
watch that video, there are good ideas of bot and crawlers controlling. As well as you can consider that as best practices. And yes, what they have now is ridiculous.
https://mza.bundledseo.com/community/q/should-we-use-google-s-crawl-delay-setting
Here is a q/a about crawler delays. As far as I know Google ignores delays anyway, plus there is nothing good about it anyway.
Hope this helps.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
What are the Best SEO Website which you read daily
Hai Moz memebers, Can you pls suggest me some best seo websites that you people read articles everyday a part from MOZ
Intermediate & Advanced SEO | | SEO_GB1 -
Bad SEO Practice: in title tag?
Greetings, I just discovered that some of our content was produced with
Intermediate & Advanced SEO | | Eric_Lifescript
tags in the title tag. Example: <title>Diabetes Symptoms <br> In Women Over 40</title> My gut says this is bad for SEO, but I couldn't find a definitive answer on the web, so I thought I would ask the community of gurus here at Moz. 🙂 Thanks in advance for any reply. Kind regards, Eric0 -
What is the best way to take advantage of this keyword?
Hi SEO's! I've been checking out webmaster tools (screenshot attached) and noticed that we're getting loads of long tail searches around a search query 'arterial and venous leg ulcers' - on a side note we're a nursing organisation so excuse the content of the search!!! The trouble is that google is indexing a PDF page which we give out as a freebie:
Intermediate & Advanced SEO | | 9868john
http://www.nursesfornurses.com.au/admin/uploads/5DifferencesBetweenVenousAndArterialLegUlcers1.pdf This PDF is a couple of years old and needs updating but its got a few links pointing to it. Ok so down to the nitty gritty, we've just launched a blog:
http://news.nursesfornurses.com.au/Nursing-news/ We have a whole wound care category in which this content belongs, and i'm trying to find the best way to take advantage of the search, so I was thinking: Create an article of about 1000 words Update the PDF and re-upload it to the main domain (not the sub domain news.nursesfornurses.com.au) Attach the PDF to the article on the blog OR would it be better to host this on the blog, and setup a 301 redirect to this page? I just need some advice on how best to take advantage of this opportunity, our blog isn't getting much search traffic at the moment (despite having 300+ articles!!) and i'm looking into how we can change that. I look forward to your response and suggestions. Thanks! qtY64B10 -
Best practices for structuring an ecommerce site
I'm revamping my wife's ecommerce site. It is currently a very low traffic website that is not indexed very well in Google. So, my plan is to restructure it based upon the best practices that helps me avoid duplicate content penalties, and easier to index strategies. The store has about 7 types of products. Each product has approximately 30 different size variations that are sometimes specifically searched for. For example: 20x10x1 air filters, 20x10x2 air filters, 20x10x1 allergy reducing air filters, etc So, is it best for me to create 7 different products with 30 different size variations (size selector at the product level that changes the price) or is it better to create 210 different product pages, one for each style/size?
Intermediate & Advanced SEO | | pherbio0 -
Robots.txt & Duplicate Content
In reviewing my crawl results I have 5666 pages of duplicate content. I believe this is because many of the indexed pages are just different ways to get to the same content. There is one primary culprit. It's a series of URL's related to CatalogSearch - for example; http://www.careerbags.com/catalogsearch/result/index/?q=Mobile I have 10074 of those links indexed according to my MOZ crawl. Of those 5349 are tagged as duplicate content. Another 4725 are not. Here are some additional sample links: http://www.careerbags.com/catalogsearch/result/index/?dir=desc&order=relevance&p=2&q=Amy
Intermediate & Advanced SEO | | Careerbags
http://www.careerbags.com/catalogsearch/result/index/?color=28&q=bellemonde
http://www.careerbags.com/catalogsearch/result/index/?cat=9&color=241&dir=asc&order=relevance&q=baggallini All of these links are just different ways of searching through our product catalog. My question is should we disallow - catalogsearch via the robots file? Are these links doing more harm than good?0 -
Block Googlebot from submit button
Hi, I have a website where many searches are made by the googlebot on our internal engine. We can make noindex on result page, but we want to stop the bot to call the ajax search button - GET form (because it pass a request to an external API with associate fees). So, we want to stop crawling the form button, without noindex the search page itself. The "nofollow" tag don't seems to apply on button's submit. Any suggestion?
Intermediate & Advanced SEO | | Olivier_Lambert0 -
Best practice for site maps?
Is it necessary or good practice to list "static" site routes in the sitemap? I.e. /about, /faq, etc? Some large sites (e.g. Vimeo) only list the 'dynamic' URLs (in their case the actual videos). If there are urls NOT listed in a sitemap, will these continue to be indexed? What is the good practice for a sitemap index? When submitting a sitemap to e.g. Webmaster tools, can you just submit the index file (which links to secondary sitemaps)? Does it matter which order the individual sitemaps are listed in the index?
Intermediate & Advanced SEO | | shawn810 -
What's your best hidden SEO secret?
Don't take that question too serious but all answers are welcome 😉 Answer to all:
Intermediate & Advanced SEO | | petrakraft
"Gentlemen, I see you did you best - at least I hope so! But after all I suppose I am stuck here to go on reading the SEOmoz blog if I can't sqeeze more secrets from you!9