Robots.txt Syntax
-
I have been having a hard time finding any decent information regarding the robots.txt syntax that has been written in the last few years and I just want to verify some things as a review for myself. I have many occasions where I need to block particular directories in the URL, parameters and parameter values. I just wanted to make sure that I am doing this in the most efficient ways possible and thought you guys could help.
So let's say I want to block a particular directory called "this" and this would be an example URL:
www.domain.com/folder1/folder2/this/file.html
or
www.domain.com/folder1/this/folder2/file.htmlIn order for me to block any URL that contains this folder anywhere in the URL I would use:
User-agent: *
Disallow: /this/Now lets say I have a parameter "that" I want to block and sometimes it is the first parameter and sometimes it isn't when it shows up in the URL. Would it look like this?
User-agent: *
Disallow: ?that=
Disallow: &that=What about if there is only one value I want to block for "that" and the value is "NotThisGuy":
User-agent: *
Disallow: ?that=NotThisGuy
Disallow: &that=NotThisGuyMy big questions here are what are the most efficient ways to block a particular parameter and block a particular parameter value. Is there a more efficient way to deal with ? and & for when the parameter and value are either first or later? Secondly is there a list somewhere that will tell me all of the syntax and meaning that can be used for a robots.txt file?
Thanks!
-
My advice is to go easy with robots.txt--it's a bit like dynamite, powerful, but can take your leg (or entire website) off.
I like this checker:
http://tool.motoricerca.info/robots-checker.phtml
If you look ok after running that checker, then use the built-in Google one.
Note that robots.txt syntax DOES NOT have wildcards. Apparently this doesn't stop a ton of people from using wildcards in them (to no effect, and clearly they didn't bother to test!).
Another reason to avoid disallow in robots.txt is that if you disallow the engines from looking at a page's contents, then you're ALSO stopping the link juice that might have flowed to other pages it links to.
So let's say you have 100 pages on your site that you're currently blocking with disallow in robots.txt. If instead, you put a meta robots "noindex,follow" in each of those pages, then every page linked to from those 100 pages (i.e. everything in your main menu) would get an extra 100 internal links worth of link juice.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Set Robots.txt file to crawl my website at specific times
Our website provider has stated that they can only 'lift' their block on our website in order for it to be crawled as specific times. Is there any way to amend a robots.txt to ensure that it crawls our website at a specific time of day/night in order to coincide with the block being lifted? Many Thanks, Charlene
Intermediate & Advanced SEO | | CharleneKennedy120 -
Syndicated content with meta robots 'noindex, nofollow': safe?
Hello, I manage, with a dedicated team, the development of a big news portal, with thousands of unique articles. To expand our audiences, we syndicate content to a number of partner websites. They can publish some of our articles, as long as (1) they put a rel=canonical in their duplicated article, pointing to our original article OR (2) they put a meta robots 'noindex, follow' in their duplicated article + a dofollow link to our original article. A new prospect, to partner with with us, wants to follow a different path: republish the articles with a meta robots 'noindex, nofollow' in each duplicated article + a dofollow link to our original article. This is because he doesn't want to pass pagerank/link authority to our website (as it is not explicitly included in the contract). In terms of visibility we'd have some advantages with this partnership (even without link authority to our site) so I would accept. My question is: considering that the partner website is much authoritative than ours, could this approach damage in some way the ranking of our articles? I know that the duplicated articles published on the partner website wouldn't be indexed (because of the meta robots noindex, nofollow). But Google crawler could still reach them. And, since they have no rel=canonical and the link to our original article wouldn't be followed, I don't know if this may cause confusion about the original source of the articles. In your opinion, is this approach safe from an SEO point of view? Do we have to take some measures to protect our content? Hope I explained myself well, any help would be very appreciated, Thank you,
Intermediate & Advanced SEO | | Fabio80
Fab0 -
Need help with Robots.txt
An eCommerce site built with Modx CMS. I found lots of auto generated duplicate page issue on that site. Now I need to disallow some pages from that category. Here is the actual product page url looks like
Intermediate & Advanced SEO | | Nahid
product_listing.php?cat=6857 And here is the auto generated url structure
product_listing.php?cat=6857&cPath=dropship&size=19 Can any one suggest how to disallow this specific category through robots.txt. I am not so familiar with Modx and this kind of link structure. Your help will be appreciated. Thanks1 -
Block subdomain directory in robots.txt
Instead of block an entire sub-domain (fr.sitegeek.com) with robots.txt, we like to block one directory (fr.sitegeek.com/blog).
Intermediate & Advanced SEO | | gamesecure
'fr.sitegeek.com/blog' and 'wwww.sitegeek.com/blog' contain the same articles in one language only labels are changed for 'fr' version and we suppose that duplicate content cause problem for SEO. We would like to crawl and index 'www.sitegee.com/blog' articles not 'fr.sitegeek.com/blog'. so, suggest us how to block single sub-domain directory (fr.sitegeek.com/blog) with robot.txt? This is only for blog directory of 'fr' version even all other directories or pages would be crawled and indexed for 'fr' version. Thanks,
Rajiv0 -
Avoiding Duplicate Content with Used Car Listings Database: Robots.txt vs Noindex vs Hash URLs (Help!)
Hi Guys, We have developed a plugin that allows us to display used vehicle listings from a centralized, third-party database. The functionality works similar to autotrader.com or cargurus.com, and there are two primary components: 1. Vehicle Listings Pages: this is the page where the user can use various filters to narrow the vehicle listings to find the vehicle they want.
Intermediate & Advanced SEO | | browndoginteractive
2. Vehicle Details Pages: this is the page where the user actually views the details about said vehicle. It is served up via Ajax, in a dialog box on the Vehicle Listings Pages. Example functionality: http://screencast.com/t/kArKm4tBo The Vehicle Listings pages (#1), we do want indexed and to rank. These pages have additional content besides the vehicle listings themselves, and those results are randomized or sliced/diced in different and unique ways. They're also updated twice per day. We do not want to index #2, the Vehicle Details pages, as these pages appear and disappear all of the time, based on dealer inventory, and don't have much value in the SERPs. Additionally, other sites such as autotrader.com, Yahoo Autos, and others draw from this same database, so we're worried about duplicate content. For instance, entering a snippet of dealer-provided content for one specific listing that Google indexed yielded 8,200+ results: Example Google query. We did not originally think that Google would even be able to index these pages, as they are served up via Ajax. However, it seems we were wrong, as Google has already begun indexing them. Not only is duplicate content an issue, but these pages are not meant for visitors to navigate to directly! If a user were to navigate to the url directly, from the SERPs, they would see a page that isn't styled right. Now we have to determine the right solution to keep these pages out of the index: robots.txt, noindex meta tags, or hash (#) internal links. Robots.txt Advantages: Super easy to implement Conserves crawl budget for large sites Ensures crawler doesn't get stuck. After all, if our website only has 500 pages that we really want indexed and ranked, and vehicle details pages constitute another 1,000,000,000 pages, it doesn't seem to make sense to make Googlebot crawl all of those pages. Robots.txt Disadvantages: Doesn't prevent pages from being indexed, as we've seen, probably because there are internal links to these pages. We could nofollow these internal links, thereby minimizing indexation, but this would lead to each 10-25 noindex internal links on each Vehicle Listings page (will Google think we're pagerank sculpting?) Noindex Advantages: Does prevent vehicle details pages from being indexed Allows ALL pages to be crawled (advantage?) Noindex Disadvantages: Difficult to implement (vehicle details pages are served using ajax, so they have no tag. Solution would have to involve X-Robots-Tag HTTP header and Apache, sending a noindex tag based on querystring variables, similar to this stackoverflow solution. This means the plugin functionality is no longer self-contained, and some hosts may not allow these types of Apache rewrites (as I understand it) Forces (or rather allows) Googlebot to crawl hundreds of thousands of noindex pages. I say "force" because of the crawl budget required. Crawler could get stuck/lost in so many pages, and my not like crawling a site with 1,000,000,000 pages, 99.9% of which are noindexed. Cannot be used in conjunction with robots.txt. After all, crawler never reads noindex meta tag if blocked by robots.txt Hash (#) URL Advantages: By using for links on Vehicle Listing pages to Vehicle Details pages (such as "Contact Seller" buttons), coupled with Javascript, crawler won't be able to follow/crawl these links. Best of both worlds: crawl budget isn't overtaxed by thousands of noindex pages, and internal links used to index robots.txt-disallowed pages are gone. Accomplishes same thing as "nofollowing" these links, but without looking like pagerank sculpting (?) Does not require complex Apache stuff Hash (#) URL Disdvantages: Is Google suspicious of sites with (some) internal links structured like this, since they can't crawl/follow them? Initially, we implemented robots.txt--the "sledgehammer solution." We figured that we'd have a happier crawler this way, as it wouldn't have to crawl zillions of partially duplicate vehicle details pages, and we wanted it to be like these pages didn't even exist. However, Google seems to be indexing many of these pages anyway, probably based on internal links pointing to them. We could nofollow the links pointing to these pages, but we don't want it to look like we're pagerank sculpting or something like that. If we implement noindex on these pages (and doing so is a difficult task itself), then we will be certain these pages aren't indexed. However, to do so we will have to remove the robots.txt disallowal, in order to let the crawler read the noindex tag on these pages. Intuitively, it doesn't make sense to me to make googlebot crawl zillions of vehicle details pages, all of which are noindexed, and it could easily get stuck/lost/etc. It seems like a waste of resources, and in some shadowy way bad for SEO. My developers are pushing for the third solution: using the hash URLs. This works on all hosts and keeps all functionality in the plugin self-contained (unlike noindex), and conserves crawl budget while keeping vehicle details page out of the index (unlike robots.txt). But I don't want Google to slap us 6-12 months from now because it doesn't like links like these (). Any thoughts or advice you guys have would be hugely appreciated, as I've been going in circles, circles, circles on this for a couple of days now. Also, I can provide a test site URL if you'd like to see the functionality in action.0 -
Does It Really Matter to Restrict Dynamic URLs by Robots.txt?
Today, I was checking Google webmaster tools and found that, there are 117 dynamic URLs are restrict by Robots.txt. I have added following syntax in my Robots.txt You can get more idea by following excel sheet. #Dynamic URLs Disallow: /?osCsidDisallow: /?q= Disallow: /?dir=Disallow: /?p= Disallow: /*?limit= Disallow: /*review-form I have concern for following kind of pages. Shorting by specification: http://www.vistastores.com/table-lamps?dir=asc&order=name Iterms per page: http://www.vistastores.com/table-lamps?dir=asc&limit=60&order=name Numbering page of products: http://www.vistastores.com/table-lamps?p=2 Will it create resistance in organic performance of my category pages?
Intermediate & Advanced SEO | | CommercePundit0 -
10,000 New Pages of New Content - Should I Block in Robots.txt?
I'm almost ready to launch a redesign of a client's website. The new site has over 10,000 new product pages, which contain unique product descriptions, but do feature some similar text to other products throughout the site. An example of the page similarities would be the following two products: Brown leather 2 seat sofa Brown leather 4 seat corner sofa Obviously, the products are different, but the pages feature very similar terms and phrases. I'm worried that the Panda update will mean that these pages are sand-boxed and/or penalised. Would you block the new pages? Add them gradually? What would you recommend in this situation?
Intermediate & Advanced SEO | | cmaddison0 -
Should we block urls like this - domainname/shop/leather-chairs.html?brand=244&cat=16&dir=ascℴ=price&price=1 within the robots.txt?
I've recently added a campaign within the SEOmoz interface and received an alarming number of errors ~9,000 on our eCommerce website. This site was built in Magento, and we are using search friendly url's however most of our errors were duplicate content / titles due to url's like: domainname/shop/leather-chairs.html?brand=244&cat=16&dir=asc&order=price&price=1 and domainname/shop/leather-chairs.html?brand=244&cat=16&dir=asc&order=price&price=4. Is this hurting us in the search engines? Is rogerbot too good? What can we do to cut off bots after the ".html?" ? Any help would be much appreciated 🙂
Intermediate & Advanced SEO | | MonsterWeb280