Robots.txt best practices & tips
-
Hey,
I was wondering if someone could give me some advice on whether I should block the robots.txt file from the average user (not from Googlebot, Yandex, etc.)?
If so, how would I go about doing this? With .htaccess, I'm guessing - but I'm not an expert.
What can people do with the information in the file? Maybe someone can give me some "best practices"? (I have a WordPress-based website.)
Thanks in advance!
-
Asking about the ideal configuration for a robots.txt file for WordPress is opening a huge can of worms. There's plenty of discussion and disagreement about exactly what's best, but a lot of it depends on the actual configuration and goals of your own website. That's too long a discussion to get into here, but below is what I can recommend as a pretty basic, failsafe version that should work for most sites:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/

Sitemap: http://www.yoursite.com/sitemap.xml
I always prefer to explicitly declare the location of my site map, even if it's in the default location.
There are other directives you can include, but they depend more on how you've handled other aspects of your website - e.g. trackbacks, comments and search results pages, as well as feeds. This is where the list gets greyer, since there are multiple ways to accomplish these things depending on how your site is optimised, but here's a representative example:
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?*
Disallow: /*?

Sorry I can't be more specific on the above example, but this is where things really come down to how you're managing your specific site, and it's a much bigger discussion. A web search for "best WordPress robots.txt file" will certainly show you the range of opinions on this.
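One caveat on the wildcard lines above: * pattern matching is an extension honoured by the major crawlers (Googlebot, Bingbot, Yandex) rather than part of the original robots exclusion standard, so older or niche bots may treat the character literally and ignore those rules.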
The key thing to remember with a robots.txt file is that it does not cause blocked URLs to be removed from the index; it only stops the crawlers from crawling those pages. It's designed to help the crawlers spend their time on the pages you have declared useful, instead of wasting it on pages that are more administrative in nature. A crawler has a limited amount of time to spend on your site, and you want it to spend that time looking at the valuable pages, not the backend.
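If you want to sanity-check which URLs your rules actually block, Python's standard library ships a robots.txt parser. Here's a minimal sketch - the domain and paths are placeholders, and note that urllib.robotparser implements the original prefix-matching rules, so it won't honour * wildcard lines like the ones above:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (replace with your own domain)
parser = RobotFileParser("http://www.yoursite.com/robots.txt")
parser.read()

# Ask the same question a crawler asks: may user agent X fetch URL Y?
for path in ["/a-normal-post/", "/wp-admin/options.php", "/feed/"]:
    print(path, "->", "allowed" if parser.can_fetch("*", path) else "blocked")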
Paul
-
Thanks for the detailed answer, Paul!
Do you think there is anything I should block for a WordPress website? I blocked /admin.
-
There is really no reason to block the robots.txt file from human users, Jazy. They'll never see it unless they actively go looking for it, and even if they do, it's just a set of directives for where you want the search crawlers to go and where you want them to stay away from.
The only thing a human user will learn from it is which sections of your site you consider nonessential to a search crawler. Even without the robots file, anyone who was really interested in that information could acquire it in other ways.
If you're trying to use your robots.txt file to hide pages you want to keep private or don't want anyone to know about, robots.txt is the wrong place for that anyway. (Real access control is done in .htaccess, which itself should be blocked from human readers - Apache's default configuration already denies access to .ht* files.)
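To illustrate, here's a minimal sketch of that kind of access control, assuming Apache 2.4 and a hypothetical /private/ directory - adjust names and paths to your own setup. Saved as /private/.htaccess, it shuts the directory off from visitors entirely instead of merely hinting at it in robots.txt:

# Hypothetical example: deny all visitors access to this directory (Apache 2.4 syntax)
Require all denied

# Or, to allow password-protected access instead, replace the line above
# with something like this (the .htpasswd path is a placeholder):
# AuthType Basic
# AuthName "Private area"
# AuthUserFile /full/path/to/.htpasswd
# Require valid-user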
There's enough complexity in managing a website already; there's no reason to add more by trying to block your robots file from human users.
Hope that helps?
Paul