How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?
-
Today's sitemap webinar made me think about the disallow feature. It seems like the opposite of a sitemap, but both also seem to be ignored to varying degrees by the engines.
I don't need help semantically, I got that part. I just can't seem to find a contemporary answer about what should be blocked using the robots.txt file.
For example, I have folders containing site comps for clients that I really don't want showing up in the SERPs. Is it better to not have these folders on the domain at all?
I've also heard of security concerns that make sense: anyone can simply look at a site's robots file to see what it is hiding, and knowing the directory makes it easier to hunt for the files it contains. Should I concern myself with this?
Another example is a folder I have for my XML sitemap generator. I imagine Google isn't going to try to index this or count it as content, so do I need to add folders like this to the disallow list?
-
Hi,
Using
User-agent: *
Disallow: /folder/subfolder
is fine. However, if you have information stored on your website that you certainly want crawled, make sure it is in your sitemap and use ...
User-agent: *
Allow: /folder/subfolder
Adding a nofollow attribute to all of your pages won't be practical: if a spam crawler ignores the robots.txt, it will ignore your nofollow attribute too. If anything new occurs with robots.txt, check large websites' robots.txt files, as they always update to new trends, i.e.
Hope this helps:)
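One way to sanity-check rules like these before uploading them is to run them through a parser. Below is a minimal sketch using Python's standard-library urllib.robotparser. The is_allowed helper and the test URLs are made up for illustration, and the folder name is just the placeholder from the answer above. Real crawlers may resolve rules differently from this parser, so treat the result as a rough check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, reusing the placeholder folder from the answer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder/subfolder
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URLs only; swap in paths from your own site.
    for url in (
        "http://www.yourdomain.com/folder/subfolder/page.html",
        "http://www.yourdomain.com/folder/other.html",
        "http://www.yourdomain.com/",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

Disallow values are matched as path prefixes, so the rule above also covers anything whose path merely starts with /folder/subfolder.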
-
Hi Jay,
There's actually a recent similar discussion at http://www.seomoz.org/q/what-reasons-exist-to-use-noindex-robots-txt regarding deciding what to block via robots.
For site comps for clients, you could also password-protect those to help hide them, or use a different domain that you have entirely excluded in robots.txt. I've also seen services like Basecamp used for posting comps. It all depends on how much you want to hide the comps.
You do want your sitemap itself to be crawled, but I'm presuming this is in the root directory so that shouldn't be a problem. Folders like your sitemap generator and other purely-framework folders can certainly be disallowed. Blocking the files that list the version of your website (if you're using a CMS) can help prevent people from searching for opportunities to hack that version and finding your site.
Also, just do a site:domain.com search on your domain, see what's indexed, see what content from there you don't want indexed, and use that as a starting point.
Are you running on a content management system, or a custom site? For a CMS, here are example robots.txt files for several popular CMSs. http://www.stayonsearch.com/robots-txt-guide
-
You may also want to think about slapping a robots noindex on the individual pages as well.
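If you go the noindex route, it can help to confirm the tag is actually present in the rendered HTML. Here is a small sketch, assuming the directive is delivered as a meta robots tag in the page markup. The RobotsMetaParser class and has_noindex function are hypothetical names, and the HTML in the usage lines is made up. Keep in mind that a crawler has to be able to fetch a page to see its noindex tag, so a page that is also disallowed in robots.txt may never have the tag read.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so normalise them.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Made-up page markup for illustration.
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(has_noindex(page))
```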
-
You can use the following syntax, placing the Disallow line after the User-agent line:
User-agent: *
Disallow: /foldername/subfoldername
Also, you can name your sitemaps in the robots.txt file.
They can be defined as
Sitemap: http://www.yourdomain.com/sitemap.xml
If you have multiple sitemaps, you can list each one on its own Sitemap line.
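To double-check that the Sitemap lines in a robots.txt file are being picked up, here is a short sketch using urllib.robotparser from the Python standard library (site_maps() needs Python 3.8 or later). The list_sitemaps helper is a hypothetical name, and the second sitemap URL is invented to illustrate multiple entries.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; sitemap-images.xml is an invented second sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foldername/subfoldername
Sitemap: http://www.yourdomain.com/sitemap.xml
Sitemap: http://www.yourdomain.com/sitemap-images.xml
"""


def list_sitemaps(robots_txt: str) -> list:
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for sitemap_url in list_sitemaps(ROBOTS_TXT):
        print(sitemap_url)
```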