Does Rogerbot respect the robots.txt file for wildcards?
-
Hi All,
Our robots.txt file has wildcards in it, which Googlebot recognizes. Can anyone tell me whether or not Rogerbot recognizes wildcards in the robots.txt file?
We've done a Rogerbot site crawl since updating the robots.txt file and the pages that are set to disallow using the wildcards are still showing.
BTW, Googlebot is not crawling these pages according to Webmaster Tools.
Thanks in advance,
Robert
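For reference, the wildcard syntax I mean looks like this ('*' matches any sequence of characters, '$' anchors the end of the URL) — the paths here are illustrative examples, not our actual file:

```
User-agent: *
Disallow: /*?sessionid=
Disallow: /private/*
Disallow: /*.pdf$
```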
-
Thanks! RogerBot is now working. Perhaps it had a cached copy of the old robots.txt file. All is well now.
Thank you!
-
Yes, rogerbot follows the robots exclusion protocol - http://www.seomoz.org/dp/rogerbot
-
Roger should obey wildcards. It sounds like he's not, so could you tattle on him to the help team and they'll see why he's not following directions? http://www.seomoz.org/help Thanks!
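If you want to sanity-check a wildcard rule offline, here's a rough sketch of the generally documented matching logic ('*' matches any run of characters, '$' anchors the end of the path) — an illustration of the rule, not Rogerbot's actual code:

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Check a Disallow pattern with '*' and '$' wildcards against a URL path."""
    # Escape regex metacharacters, then restore the '*' wildcard.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        # A trailing '$' anchors the match to the end of the path.
        regex = regex[:-2] + "$"
    # Robots rules are anchored at the start of the path, so re.match fits.
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*?print=", "/products/widget?print=1"))  # True
print(robots_pattern_matches("/*.pdf$", "/docs/manual.pdf"))            # True
print(robots_pattern_matches("/*.pdf$", "/docs/manual.pdf.html"))       # False
```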
Related Questions
-
Unsolved: Rogerbot blocked by Cloudflare and not displaying full user agent string
Hi, We're trying to get Moz to crawl our site, but when we use Create Your Campaign we get the error: "Oops. Our crawlers are unable to access that URL - please check to make sure it is correct. If the issue persists, check out this article for further help." robots.txt is fine, and we can actually see Cloudflare is blocking it with Bot Fight Mode. We've added some rules to allow rogerbot but these seem to be getting ignored. If we use a robots.txt testing tool (https://technicalseo.com/tools/robots-txt/) with rogerbot as the user agent, it gets through fine and we can see our rule has allowed it. When viewing the Cloudflare activity log (attached), it seems Create Your Campaign tries to crawl the site with the user agent set simply as "rogerbot 1.2", but the robots.txt testing tool uses the full user agent string "rogerbot/1.0 (http://moz.com/help/pro/what-is-rogerbot-, [email protected])", albeit version 1.0. So it seems Cloudflare doesn't like the simple user agent. Is it correct that when Moz crawls the site it now uses the simple string of just "rogerbot 1.2"? Thanks,
Ben
(Attached: Cloudflare activity log, showing differences in user agent strings)
Moz Pro | | BB_NPG
Rogerbot did not crawl my site! What might be the problem?
When I saw the new crawl for my site I wondered why there were no errors, no warnings and 0 notices anymore. Then I saw that only 1 page was crawled. There are no error messages, and Webmaster Tools also did not report any crawling problems. What might be the problem? Thanks for any tips!
Moz Pro | | inlinear
Holger
Broken CSV files?
Hi, I have been downloading files from SEOmoz for quite a while and I keep getting broken CSV files. On my Mac, I wrote a script that cleans them up, but I don't know how to do this on the Windows side. The problem is that some of the record lines get broken up, mainly in the "Anchor Text" field (maybe the anchor text in question contained a return). This of course messes up my file when I open it in Excel. Is there a way to get/open these files without having them messed up like this? P.S. I am also seeing problems with text encoding; I would like to see the anchor text that is being displayed on the referring site rather than its HTML/Unicode/UTF-8 code value. TIA, Michel
Moz Pro | | analysestc
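A likely culprit (my assumption, not a confirmed Moz answer): the anchor text field contains a line break inside a quoted value, and any script that splits the file on newlines will break the record. A proper CSV parser handles quoted newlines, and an HTML-entity decoder handles the encoding complaint; a minimal Python sketch with made-up sample data:

```python
import csv
import html
import io

# A record whose quoted "Anchor Text" field contains a line break and an HTML entity.
raw = 'URL,Anchor Text\nhttp://example.com,"Great\nSEO &amp; more"\n'

for row in csv.reader(io.StringIO(raw)):
    # csv.reader keeps the embedded newline inside the quoted field intact,
    # instead of splitting the record in two.
    decoded = [html.unescape(field) for field in row]  # &amp; -> &
    print(decoded)
```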
Does SEOmoz realize duplicated URLs are blocked in robots.txt?
Hi there: Just a newbie question... I found some duplicated URLs in the "SEOmoz Crawl Diagnostics reports" that should not be there. They are intended to be blocked by the robots.txt file. Here is an example URL (Joomla + VirtueMart structure): http://www.domain.com/component/users/?view=registration and here is the blocking content in the robots.txt file: User-agent: * Disallow: /components/ Question is: Will this kind of duplicated URL error be removed from the error list automatically in the future? Should I remember which errors should not really be in the error list? What is the best way to handle this kind of error? Thanks and best regards, Franky
Moz Pro | | Viada
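Worth noting (my reading of the rule, not an official answer): a Disallow line is a path-prefix match, and the example URL starts with `/component/` (no 's'), so `Disallow: /components/` does not actually block it. Python's standard-library parser illustrates the prefix behaviour:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /components/",
])

# The rule is a prefix match on the path, so the plural rule
# does not cover the singular '/component/...' URL.
print(rp.can_fetch("*", "http://www.domain.com/components/anything"))
print(rp.can_fetch("*", "http://www.domain.com/component/users/?view=registration"))
```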
Rogerbot getting cheeky?
Hi SEOmoz, From time to time my server crashes during Rogerbot's crawling escapades, even though I have a robots.txt file with a crawl-delay of 10, now just increased to 20. I looked at the Apache log and noticed Roger hitting me from 4 different addresses: 216.244.72.3, 72.11, 72.12 and 216.176.191.201. Most of the time, while each separate address was 10 seconds apart, ALL 4 addresses would hit 4 different pages simultaneously (example 2). At other times, it wasn't respecting robots.txt at all (see example 1 below). I wouldn't call this situation 'respecting the crawl-delay' entry in robots.txt, as other questions answered here by you have stated. 4 simultaneous page requests within 1 second from Rogerbot is not what should be happening, IMHO.
example 1
Moz Pro | | BM7
216.244.72.12 - - [05/Sep/2012:15:54:27 +1000] "GET /store/product-info.php?mypage1.html HTTP/1.1" 200 77813
216.244.72.12 - - [05/Sep/2012:15:54:27 +1000] "GET /store/product-info.php?mypage2.html HTTP/1.1" 200 74058
216.244.72.12 - - [05/Sep/2012:15:54:28 +1000] "GET /store/product-info.php?mypage3.html HTTP/1.1" 200 69772
216.244.72.12 - - [05/Sep/2012:15:54:37 +1000] "GET /store/product-info.php?mypage4.html HTTP/1.1" 200 82441
example 2
216.244.72.12 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage1.html HTTP/1.1" 200 70209
216.244.72.11 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage2.html HTTP/1.1" 200 82384
216.244.72.12 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage3.html HTTP/1.1" 200 83683
216.244.72.3 - - [05/Sep/2012:15:46:15 +1000] "GET /store/mypage4.html HTTP/1.1" 200 82431
216.244.72.3 - - [05/Sep/2012:15:46:16 +1000] "GET /store/mypage5.html HTTP/1.1" 200 82855
216.176.191.201 - - [05/Sep/2012:15:46:26 +1000] "GET /store/mypage6.html HTTP/1.1" 200 75659
Please advise.
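To quantify this kind of behaviour, you can parse the access log and measure the gap between consecutive requests, both per IP and across all of the crawler's IPs combined; a rough sketch (the log format and sample lines are modelled on the excerpt above):

```python
import re
from datetime import datetime

# IP at the start of the line, timestamp inside the first [...] bracket.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def request_times(lines):
    """Extract (ip, timestamp) pairs from Apache combined-log lines."""
    out = []
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
            out.append((m.group(1), ts))
    return out

sample = [
    '216.244.72.12 - - [05/Sep/2012:15:46:15 +1000] "GET /store/a.html HTTP/1.1" 200 1',
    '216.244.72.11 - - [05/Sep/2012:15:46:15 +1000] "GET /store/b.html HTTP/1.1" 200 1',
    '216.244.72.12 - - [05/Sep/2012:15:46:25 +1000] "GET /store/c.html HTTP/1.1" 200 1',
]

hits = request_times(sample)
# Gap between consecutive requests across *all* IPs combined: 0s then 10s here,
# even though each individual IP waited 10 seconds between its own requests.
gaps = [(b[1] - a[1]).total_seconds() for a, b in zip(hits, hits[1:])]
print(gaps)  # [0.0, 10.0]
```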
How to get rid of the message "Search Engine blocked by robots.txt"
During the Crawl Diagnostics of my website, I got the message "Search Engine blocked by robots.txt" under Most common errors & warnings. Please let me know the procedure by which the SEOmoz PRO crawler can completely crawl my website. Awaiting your reply at the earliest. Regards, Prashakth Kamath
Moz Pro | | 1prashakth
Does anyone know of a crawler similar to SEOmoz's RogerBot?
As you probably know, SEOmoz had some hosting and server issues recently, and this came at a terrible time for me... We are in the middle of battling some duplicate content and crawl errors and need a fresh crawl of some sites to test things out before we are hit with the big one. Before I get a million thumbs downs: I love and will continue to use SEOmoz, I just need something to get me through this week (or until Roger is back!)!
Moz Pro | | AaronSchinke
To block with robots.txt or canonicalize?
I'm working with an apartment community group with a large number of communities across the US. I'm running into duplicate content issues where each community has a page such as "amenities" or "community-programs", etc. that is nearly identical (if not exactly identical) across all communities. I'm wondering if there are any thoughts on the best way to tackle this. The two scenarios I have come up with so far are: Is it better for me to select the community page with the most authority and put a canonical on all other community pages pointing to that authoritative page? Or should I just remove the directory altogether via robots.txt to help keep the site lean and keep low-quality content from impacting the site from a Panda perspective? Is there an alternative I'm missing?
Moz Pro | | JonClark150
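For what it's worth, the usual trade-off: a robots.txt block stops crawling but leaves the duplicate URLs unable to pass their signals anywhere, while rel=canonical consolidates ranking signals onto one URL. A minimal sketch of the canonical approach (the URLs are hypothetical):

```html
<!-- Placed in the <head> of each duplicate community page,
     e.g. /community-b/amenities/, pointing at the chosen authoritative page -->
<link rel="canonical" href="https://www.example.com/community-a/amenities/" />
```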