Robots.txt usage
-
Hey Guys,
I am about to make an important improvement to our site's robots.txt.
We have a large number of properties on our site, and we have different views for them: list, gallery and map view. By default the list view shows up, and the user can switch to the gallery view.
We do not want the gallery pages to get indexed, and we want to save our crawl budget for more important pages.
This is one example from our site:
http://www.holiday-rentals.co.uk/France/r31.htm
When you click on "gallery view", the URL in your address bar will stay the same, but when you mouse over the "gallery view" tab it will show you a URL with the parameter "view=g". There are a number of parameters: "view=g", "view=l" and "view=m".
http://www.holiday-rentals.co.uk/France/r31.htm?view=l
http://www.holiday-rentals.co.uk/France/r31.htm?view=g
http://www.holiday-rentals.co.uk/France/r31.htm?view=m
Now my question is:
If I restrict bots by adding "Disallow: ?view=" to our robots.txt, will it affect the list view too?
I will be very thankful if you look into this for us.
Many thanks
Hassan
I will test this on another site within our network first, to measure the impact, before putting it on the important one, but I will be waiting for your recommendations. Thanks
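For reference, this is the kind of rule set I have in mind — just a sketch, assuming Googlebot's wildcard support, with a leading /* because robots.txt rules match from the start of the URL path:

User-agent: *
Disallow: /*?view=g
Disallow: /*?view=m
# list view (?view=l, and the default URL with no parameter) stays crawlable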
-
Others are right, by the way: canonical may be better. But if you insist on a robots.txt restriction, you should add two patterns for each parameter:
Disallow: ?view=m
Disallow: ?view=m*
so that you block the URLs that have the parameter at the end, and also the ones that have it in the middle.
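One note: robots.txt rules are prefix matches, so the version with the trailing * should behave the same as the one without it. If you want to check before touching the live file, here is a minimal sketch of a Google-style matcher in Python (the standard library's urllib.robotparser only does plain prefix matching, so it is not a reliable test for wildcard rules; the rules and URLs below are just this thread's examples):

import re

def blocked(rule, path):
    # Translate a robots.txt rule into a regex:
    # "*" -> any run of characters, "$" -> end-of-URL anchor, rest literal.
    regex = "".join(".*" if c == "*" else "$" if c == "$" else re.escape(c) for c in rule)
    return re.match(regex, path) is not None

for rule in ["/*?view=g", "/*?view=m"]:
    for path in ["/France/r31.htm",
                 "/France/r31.htm?view=l",
                 "/France/r31.htm?view=g",
                 "/France/r31.htm?view=m"]:
        print(rule, path, "BLOCKED" if blocked(rule, path) else "allowed")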
-
I had a similar issue with my website: there were many ways of sorting a list of items (date, title, etc.), which ended up causing duplicate content. We solved the issue a couple of days ago by restricting the "sorted" pages using the robots.txt file. HOWEVER, this morning I found this text in the Google Webmaster Tools support section:
"Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. In cases where duplicate content leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools."
Source: http://www.google.com/support/webmasters/bin/answer.py?answer=66359
I haven't seen any negative effect on my site (yet), but I would agree with SuperlativB in the sense that YOU might be better off using "canonical" tags on these links:
http://www.holiday-rentals.co.uk/...?view=l
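For example, each view variant could point back to the default URL from its head section (a sketch using the URL above):

<link rel="canonical" href="http://www.holiday-rentals.co.uk/France/r31.htm" />

That way ?view=l, ?view=g and ?view=m all consolidate to the one page Google should index.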
-
If these parameters are not at the very end of the URL, you should add * after the letter of the parameter in the restriction as well.
You got my point, thanks for looking into this. Our search page loads with the list view by default, so the parameter is not in the URL, but "view=l" still represents the list view.
I want to disallow both parameters, "view=g" and "view=m", in any URL from bots.
If these parameters are sometimes in the middle and sometimes at the end of the URL, what workaround would you suggest for both cases?
Thanks for looking into this...
-
You can do the restriction you want, but if I get it right, m stands for map view, g stands for gallery view and l stands for list view. So if you want the list view to be indexed and the map and gallery views not to be indexed, you should add two lines of restriction:
Disallow: ?view=m
Disallow: ?view=g
If these parameters are not at the very end of the URL, you should add * after the letter of the parameter in the restriction as well.
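To illustrate the difference (a sketch, assuming Google's wildcard semantics, where * matches any run of characters and $ anchors the end of the URL):

Disallow: /*?view=m     # matches ?view=m at the end OR followed by more parameters
Disallow: /*?view=m$    # stricter: only matches URLs that end exactly in ?view=m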
-
Sounds like this is something canonical could solve for you. If you disallow ?view=* you would disallow every "?view" URL on your site; if you are unsure, you should go for an exact match rather than a catch-all.
Related Questions
-
Robots file set up
The robots file looks like it has been set up in a very messy way.
Technical SEO | mcwork
I understand the # will comment out a line — does this mean the sitemap would not be picked up?
Disallow: /js/ — should this be allowed like /*.js$ instead?
Disallow: /media/wysiwyg/ — this seems to be causing alerts in Webmaster Tools, as it cannot access the images within.
Can anyone help me clean this up please?
#Sitemap: https://examplesite.com/sitemap.xml
# Crawlers Setup
User-agent: *
Crawl-delay: 10
# Allowable Index
# Mind that Allow is not an official standard
Allow: /index.php/blog/
Allow: /catalog/seo_sitemap/category/
Allow: /catalogsearch/result/
Allow: /media/catalog/
# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /js/
Disallow: /lib/
Disallow: /magento/
Disallow: /media/
Disallow: /media/captcha/
Disallow: /media/catalog/
#Disallow: /media/css/
#Disallow: /media/css_secure/
Disallow: /media/customer/
Disallow: /media/dhl/
Disallow: /media/downloadable/
Disallow: /media/import/
#Disallow: /media/js/
Disallow: /media/pdf/
Disallow: /media/sales/
Disallow: /media/tmp/
Disallow: /media/wysiwyg/
Disallow: /media/xmlconnect/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
#Disallow: /skin/
Disallow: /stats/
Disallow: /var/
# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalog/product/gallery/
Disallow: */catalog/product/upload/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
Disallow: /get.php
# Magento 1.5+
# Paths (no clean URLs)
#Disallow: /*.js$
#Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?SID=
Disallow: /rss*
Disallow: /*PHPSESSID
Disallow: /:
Disallow: /:*
User-agent: Fatbot
Disallow: /
User-agent: TwengaBot-2.0
Disallow: /
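On the sitemap question above: yes, the leading # makes that line a comment, so crawlers ignore it. A sketch of the fix is simply to remove the # (the URL is the one already in the file):

Sitemap: https://examplesite.com/sitemap.xml

As for /js/, the usual advice now is to let Googlebot fetch the scripts and stylesheets it needs for rendering, for example with rules like Allow: /*.js$ and Allow: /*.css$ alongside the directory disallows.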
-
Adding directories to robots.txt disallow causes pages to have Blocked Resources
In order to eliminate duplicate/missing title tag errors for a directory (and sub-directories) under www that contains our third-party chat scripts, I added the parent directory to the robots.txt disallow list. We are now receiving a blocked resource error (in Webmaster Tools) on all of the pages that link to a JavaScript file (for live chat) in the parent directory. My host is suggesting that the warning is only a notice and we can leave things as is, without worrying about the pages being de-ranked/penalized. I am wondering if this is true, or if we should remove the directory that contains the JS from the robots.txt file and find another way to resolve the duplicate title tags?
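One possible middle ground — a sketch, where /chat/ is a hypothetical stand-in for the actual scripts directory — is to keep the directory's pages blocked but let Googlebot fetch the JavaScript itself, since blocked resources can affect how Google renders the pages that reference them:

User-agent: *
Allow: /chat/*.js$
Disallow: /chat/

Google generally resolves such conflicts in favour of the more specific rule, so the Allow line wins for the .js files.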
Technical SEO | miamiman100
-
3,511 Pages Indexed and 3,331 Pages Blocked by Robots
Morning! So I checked our site's index status on WMT, and I'm being told that Google is indexing 3,511 pages and the robots are blocking 3,331. This seems slightly odd, as we're only disallowing 24 pages in the robots.txt file. In light of this, I have the following queries: Do these figures mean that Google is indexing 3,511 pages and blocking 3,331 other pages? Or does it mean that it's blocking 3,331 pages of the 3,511 indexed? As there are only 24 URLs being disallowed in robots.txt, why are 3,331 pages being blocked? Will these be variations of the URLs we've submitted? Currently, we don't have a sitemap. I know, I know, it's pretty unforgivable, but the old one didn't really work and the developers are working on the new one. Once submitted, will this help? I think I know the answer to this, but is there any way to ascertain which pages are being blocked? Thanks in advance! Lewis
Technical SEO | PeaSoupDigital
-
Wordpress Robots.txt Sitemap submission?
Alright, my question comes directly from this article by SEOmoz: http://www.seomoz.org/learn-seo/r...
Yes, I have submitted the sitemap to Google's and Bing's webmaster tools, and I want to add the location of our site's sitemaps. Does that mean I erase everything in the robots.txt right now and replace it with this?
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml
I ask because WordPress comes with some default disallows like wp-admin, trackback and plugins. I have also read this, but was wondering if it is the correct way to add a sitemap to a WordPress robots.txt: http://www.seomoz.org/q/removing-robots-txt-on-wordpress-site-problem
I am using Multisite with the Yoast plugin, so I have more than one sitemap.xml to submit:
Sitemap: http://www.example.com/sitemap_index.xml
Sitemap: http://www.example.com/sub/sitemap_index.xml
Do I erase everything in robots.txt and replace it with what SEOmoz recommended? Hmm, that sounds not right.
Technical SEO | joony2008
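For what it's worth, a minimal sketch of the safer shape (an assumption on my part, not SEOmoz's exact recommendation): keep the WordPress defaults rather than erasing them, and append one Sitemap line per sitemap index:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/

Sitemap: http://www.example.com/sitemap_index.xml
Sitemap: http://www.example.com/sub/sitemap_index.xml

Sitemap lines are independent of any User-agent group, so they can be appended without touching the existing rules.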
-
Internal search: rel=canonical vs noindex vs robots.txt
Hi everyone, I have a website with a lot of internal search results pages indexed. I'm not asking if they should be indexed or not — I know they should not, according to Google's guidelines, and they make a bunch of duplicated pages, so I want to solve this problem. The thing is, if I noindex them, the site is going to lose a non-negligible chunk of traffic: nearly 13% according to Google Analytics! I thought of blocking them in robots.txt. This solution would not keep them out of the index, but the pages appearing in Google SERPs would then look empty (no title, no description), so their CTR would plummet and I would lose a bit of traffic too... The last idea I had was to use a rel=canonical tag pointing to the original search page (which is empty, without results), but it would probably have the same effect as noindexing them, wouldn't it? (I've never tried, so I'm not sure.) Of course I did some research on the subject, but each of my findings recommended only one of the 3 methods! One even recommended noindex + robots.txt block, which is pointless because the noindex would then never be seen... Is there somebody who can tell me which option is the best to keep this traffic? Thanks a million
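For reference, the noindex option being weighed here is a meta tag in the head of each search results page (a sketch — and note that robots.txt must not block these pages, or the tag will never be seen):

<meta name="robots" content="noindex, follow">

The follow value at least lets crawlers keep following the links on the results pages.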
Technical SEO | JohannCR
-
How to allow one directory in robots.txt
Hello, is there a way to allow a certain child directory in robots.txt but keep all others blocked? For instance, we've got external links pointing to /user/password/, but we're blocking everything under /user/, and there are too many /user/something/ directories to just block every one BUT /user/password/. I hope that makes sense... Thanks!
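A sketch of the standard pattern (assuming Google-style precedence, where the most specific matching rule wins):

User-agent: *
Allow: /user/password/
Disallow: /user/

The Allow line is more specific than the Disallow line, so /user/password/ stays crawlable while everything else under /user/ remains blocked.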
Technical SEO | poolguy
-
Un-Indexing a Page without robots.txt or access to HEAD
I am in a situation where a page was pushed live (it went live for an hour and then was taken down) before it was supposed to go live. Normally I would utilize robots.txt or a meta noindex tag, but I do not have access to either, and putting in a request will not suffice as it is against protocol with the CMS. So basically I cannot seem to find a nice way to work with the search engines to get this un-indexed. I know for this instance I could go to GWT and do it, but for clients that do not have GWT, and for all the other search engines, how could I do this? Here is the big question: what if I have a promotional page that I don't want indexed and am met with these same limitations? Is there anything to do here?
Technical SEO | DRSearchEngOpt
-
Robots.txt
Hi there, My question relates to the robots.txt file. This statement: /*/trackback — would this block domain.com/trackback and domain.com/fred/trackback? Peter
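Under Google-style matching, where * stands for any run of characters inside the path, a sketch of what that statement does and does not cover:

Disallow: /*/trackback
# matched:     /fred/trackback   (the * matches "fred")
# not matched: /trackback        (the rule requires a second "/" before "trackback")

Other crawlers may interpret wildcards differently, so it is worth verifying with a robots.txt tester.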
Technical SEO | PeterM22