Rogerbot Ignoring Robots.txt?

kellydallen

Hi guys,

We're trying to block Rogerbot from spending 8000-9000 of our 10000 pages per week for our site crawl on our zillions of PhotoGallery.asp pages. Unfortunately our e-commerce CMS isn't tremendously flexible so the only way we believe we can block rogerbot is in our robots.txt file.

Rogerbot keeps crawling all these PhotoGallery.asp pages so it's making our crawl diagnostics really useless.

I've contacted the SEOMoz support staff and they claim the problem is on our side. This is the robots.txt we are using:

User-agent: rogerbot

Disallow:/PhotoGallery.asp

Disallow:/pindex.asp

Disallow:/help.asp

Disallow:/kb.asp

Disallow:/ReviewNew.asp

User-agent: *

Disallow:/cgi-bin/

Disallow:/myaccount.asp

Disallow:/WishList.asp

Disallow:/CFreeDiamondSearch.asp

Disallow:/DiamondDetails.asp

Disallow:/ShoppingCart.asp

Disallow:/one-page-checkout.asp

Sitemap: http://store.jrdunn.com/sitemap.xml

For some reason the WYSIWYG editor is inserting extra blank lines, but in the actual file those lines are all single-spaced.

Any suggestions? The only other thing I thought of trying is something like "Disallow:/PhotoGallery.asp*" with a wildcard.
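On the wildcard idea: under the widely used robots.txt matching convention (the one described in RFC 9309), a rule already matches any path that starts with it, "*" matches any run of characters, and a trailing "$" anchors the end. The sketch below is only an illustration of that convention, not rogerbot's actual matcher, which the thread does not describe. `rule_matches` is a hypothetical helper, and the `ProductCode` query string is made up for the example.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed convention: '*' matches any run of characters, a trailing
    '$' anchors the end of the path, and otherwise the pattern only
    needs to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix matching.
    return re.match(regex, path) is not None


# Hypothetical gallery URL path; the query string is an assumption.
gallery = "/PhotoGallery.asp?ProductCode=123"

for rule in ("/PhotoGallery.asp", "/PhotoGallery.asp*"):
    print(rule, "->", rule_matches(rule, gallery))
```

If a crawler follows this convention, the trailing "*" is redundant: both forms of the rule cover the same set of URLs, so adding it would not be expected to change the crawl.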

Mihas07

I have just encountered an interesting thing about Moz Link Search and its bot: if you search for domains linking to Google.com, you find a list of about 900,000 domains, among which I was surprised to find webcache.googleusercontent.com.

See the proof in the attached screenshot below.

At the same time, the webcache.googleusercontent.com policy for robots is as shown in the second attachment.

In my opinion, there is only one possible explanation: Moz Bot does ignore robots.txt files...


kellydallen

Thanks Cyrus,

No, for some reason the editor double-spaced the file when I pasted. Other than that, it's the same though.

Yes, I actually tried ordering the exclusions both ways. Neither works.

The robots.txt checkers report no errors. I had actually checked them before posting.
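One way to double-check the file locally, and to see why the blank-line question matters, is to feed the rogerbot group to a parser in both forms. This is a minimal sketch using Python's standard-library `urllib.robotparser`, which is a stand-in and not the parser rogerbot uses. `can_fetch` and `double_space` are hypothetical helpers written for this example, and nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The rogerbot group from the post, one directive per line.
SINGLE_SPACED = [
    "User-agent: rogerbot",
    "Disallow:/PhotoGallery.asp",
    "Disallow:/pindex.asp",
    "Disallow:/help.asp",
    "Disallow:/kb.asp",
    "Disallow:/ReviewNew.asp",
]


def can_fetch(lines, agent, url):
    """Parse robots.txt lines and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(agent, url)


def double_space(lines):
    """Add an empty line after every directive, as the forum editor did."""
    spaced = []
    for line in lines:
        spaced.extend([line, ""])
    return spaced


URL = "http://store.jrdunn.com/PhotoGallery.asp"

single_result = can_fetch(SINGLE_SPACED, "rogerbot", URL)
double_result = can_fetch(double_space(SINGLE_SPACED), "rogerbot", URL)
print("single-spaced, rogerbot may fetch:", single_result)
print("double-spaced, rogerbot may fetch:", double_result)
```

With this parser the single-spaced group blocks the gallery URL for rogerbot, even without a space after "Disallow:". A blank line, however, is treated as the end of a group by this parser at the time of writing, so the double-spaced form may come back as allowed. If the live file really is single-spaced, that particular failure mode would not apply.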

Before I posted this, I was pretty convinced the problem wasn't in our robots.txt, but the SEOMoz support staff says essentially, "We don't think the problem is with Rogerbot, so it must be in your robots.txt file, but we can't look at that, so if by some chance your robots.txt file is fine, then there's nothing we can do for you because we're just going to assume the problem is on your side."

I figured, with everything I've already tried, and if the fabulous SEOMoz community can't come up with a solution, that'll be the best I can do.

Cyrus-Shepard

Hi Kelly,

Thanks for letting us know. Could be a couple of things right off the bat. Is this your exact robots.txt file? If so, it's missing some formatting, like proper spacing, to be perfectly compliant. You can run a check of your robots.txt file at several places.

http://tool.motoricerca.info/robots-checker.phtml

http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php

http://www.sxw.org.uk/computing/robots/check.html

Also, it's generally a good idea to put specific inclusions towards the bottom, so I might flip the order and put the rogerbot directives last and the User-agent: * first.
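Whether the order of the groups matters depends on the parser. As a rough illustration, the sketch below runs both orderings through Python's standard-library `urllib.robotparser`, which is an assumption standing in for a real crawler; how rogerbot resolves groups is not stated in this thread. `is_allowed` is a hypothetical helper, "otherbot" is a made-up user agent, and the groups are shortened to one rule each.

```python
from urllib.robotparser import RobotFileParser

# Shortened versions of the two groups from the original post.
ROGERBOT_GROUP = [
    "User-agent: rogerbot",
    "Disallow: /PhotoGallery.asp",
]
WILDCARD_GROUP = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]


def is_allowed(lines, agent, url):
    """Parse robots.txt lines and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(agent, url)


URL = "http://store.jrdunn.com/PhotoGallery.asp"

# One blank line separates the groups in each ordering.
specific_first = ROGERBOT_GROUP + [""] + WILDCARD_GROUP
generic_first = WILDCARD_GROUP + [""] + ROGERBOT_GROUP

for name, lines in (("rogerbot first", specific_first),
                    ("rogerbot last", generic_first)):
    print(name, "-> rogerbot may fetch:", is_allowed(lines, "rogerbot", URL))
    print(name, "-> otherbot may fetch:", is_allowed(lines, "otherbot", URL))
```

This parser picks the group that names the user agent and only falls back to "User-agent: *" when no named group applies, so both orderings give the same answer. A crawler that instead stops at the first group it matches could behave differently, which is why flipping the order is a cheap thing to try.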

Hope this helps. Let us know if any of this points in the right direction.

kellydallen

Thanks so much for the tip. Unfortunately still unsuccessful. (shrug)

Malarowski

Try

Disallow: /PhotoGallery.asp

I usually put wildcards all over just to be sure, and I've had no issues so far.
