Why doesn't Moz crawler follow robots.txt?

Tylerj

It is crawling the entire site, and there is stuff we do not want it to. Please advise.

Tylerj

Which I am ok with, but why am I getting duplicate content?

Andy-Halliday

Yes, it doesn't tell them which pages not to crawl - just not to index them

Tylerj

It has been used correctly. The site is a Magento site and they have it built in. There are a lot of filters for products so it uses rel=canonical to tell Google which to index.

Andy-Halliday

rel=canonical is not really a robots instruction - rel=canonical is there to help with duplicate copy, where you have the same or similar pages and you're telling search engines which page is the preferred one.

If you don't want pages crawled, you have to tell search engines in the robots.txt file.
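
To see which canonical URL a page actually declares, a small script can pull the tag out of the HTML. This is only a minimal sketch using Python's standard library; the sample markup and the example.com URL are made up for illustration and are not taken from the site in question.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html):
    """Return the list of canonical URLs declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Hypothetical filtered product listing that points at its unfiltered version.
SAMPLE_HTML = """
<html><head>
<link rel="canonical" href="https://www.example.com/shoes">
</head><body>Shoes, filtered by colour</body></html>
"""

print(find_canonicals(SAMPLE_HTML))
```

A page that returns an empty list, or more than one URL, is worth a closer look, since crawlers may not treat it the way you expect.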

Vijay-Gaur

Hi There,

Rel=canonical tags tell robots which page, out of many similar ones, should actually be indexed.

For SEOs, canonicalization refers to individual web pages that can be loaded from multiple URLs. This is a problem because when multiple pages have the same content but different URLs, links that are intended to go to the same page get split up among multiple URLs. This means that the popularity of the pages gets split up. Unfortunately for web developers, this happens far too often because the default settings for web servers create this problem.

https://mza.seotoolninja.com/learn/seo/canonicalization

I feel you have not used it correctly. Check the above article and see if it helps.

Thanks,

Vijay

Tylerj

So I made a mistake: it isn't the robots.txt that is the issue. I am getting hit with a ton of duplicate content penalties, so I figured that was it. The problem is that I have pages with rel=canonical tags that it is ignoring. Does Roger not read those?

Andy-Halliday

Hi

Have to agree with the above: Rogerbot does listen to the robots.txt file, unlike Bing - while they are getting better, Bing ignores the robots.txt file frequently.

I've analysed quite a few server logs over the years and Roger has always listened to the file - it's usually a mistake in the robots.txt file.

There is an option to test your robots.txt file in Google Search Console (GSC) - while this tests whether Google will crawl the page, Roger usually follows the same instructions as Google.

However, if you are still pretty certain that Roger is ignoring robots.txt, please DM me your server logs and your website and I will take a look and analyse it for you (free of course).

Thanks

Andy

Vijay-Gaur

All major search engines, including Moz's crawler Rogerbot and the Internet Archive, respect robots.txt, the standard “robots exclusion protocol” for communicating with web crawlers and web robots.

If you wish to exclude specific content from all search engines, you can use the following sample code as a reference for blocking specific directories.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

However, if you want to block only Moz's Rogerbot from crawling specific sections of your website, you can use the following reference code to block specific areas / directories from Rogerbot:

User-agent: Rogerbot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
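
If you want to check locally how rules like these are read, Python's standard urllib.robotparser can evaluate them. This is only a rough sketch: the example.com URLs are placeholders, the can_crawl helper is made up for illustration, and Python's parser is not guaranteed to match Rogerbot's own matching in every edge case.

```python
from urllib.robotparser import RobotFileParser

# The Rogerbot-specific rules from the sample above.
ROBOTS_TXT = """\
User-agent: Rogerbot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs: one inside a blocked directory, one outside.
for url in ("https://www.example.com/tmp/file.html",
            "https://www.example.com/products/shoes"):
    print(url, can_crawl(ROBOTS_TXT, "rogerbot", url))
```

Running the same check against your live robots.txt content and a URL that Roger crawled anyway is a quick way to spot a formatting mistake in the file.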

I hope this helps. If you have specific questions, please feel free to respond and I will be happy to answer them.

Regards,

Vijay

moz_support

Hi there! Moz's crawler, rogerbot, does follow robots.txt. When he's not following robots.txt, it's usually because the robots.txt file is formatted improperly. Learn more about formatting your file here: https://mza.seotoolninja.com/learn/seo/robotstxt

For more information on Roger, including how to block him, head here: https://mza.seotoolninja.com/help/guides/moz-procedures/what-is-rogerbot

And if you want to test your formatting, try the Robots Checker here: https://support.google.com/webmasters/answer/6062598

If you're still unable to determine why rogerbot is crawling your site, feel free to write in to [email protected]!

