Captcha wall to access content and cloaking sanction
-
Hello, to protect our website against scraping, visitors are redirected to a recaptcha page after visiting 2 pages.
But for SEO purposes, Googlebot is excluded from that restriction, so it could be seen as cloaking.
What is the best practice in SEO to avoid a cloaking penalty in that case?
I am thinking about adding paywall JSON schema markup (NewsArticle), but the content is accessible for free, so it's not a paywall but rather a captcha protection wall.
What do you recommend?
Thanks,
-
In general, Google cares only about cloaking in the sense of treating their crawler differently to human visitors - it's not a problem to treat them differently to other crawlers.
So: if you are tracking the "2 pages visited" using cookies (which I assume you must be, since there is no other reliable way to know the 2nd request is from the same user), then you can treat Googlebot exactly the same as human users. Googlebot's requests are stateless (it does not carry cookies from one request to the next), so it never reaches the threshold and will be able to crawl. You can then treat non-Googlebot scrapers more strictly, and rate limit / throttle / deny them as you wish.
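To make the cookie point concrete, here is a minimal sketch of a cookie-based page counter. It assumes the "2 pages visited" rule from the question; the cookie name, the constant and the function name are all hypothetical and not taken from the site in question.

```python
FREE_PAGE_VIEWS = 2          # threshold from the question (hypothetical constant name)
COOKIE_NAME = "page_views"   # hypothetical cookie name


def handle_request(cookies):
    """Decide whether one request is redirected to the captcha.

    `cookies` is a dict of the cookies sent with the request.
    Returns (show_captcha, cookies_to_set).
    """
    try:
        seen = int(cookies.get(COOKIE_NAME, "0"))
    except ValueError:
        # A malformed counter is treated like a first visit.
        seen = 0
    if seen >= FREE_PAGE_VIEWS:
        return True, {}
    # Serve the page and increment the counter stored in the cookie.
    return False, {COOKIE_NAME: str(seen + 1)}


# A browser that keeps cookies hits the captcha on its third page.
jar = {}
browser_results = []
for _ in range(3):
    show_captcha, to_set = handle_request(jar)
    jar.update(to_set)
    browser_results.append(show_captcha)

# A client that never sends cookies back always looks like a first visit.
stateless_results = [handle_request({})[0] for _ in range(3)]
```

With this logic the same rule applies to every client, and a crawler that sends no cookies is simply never counted. Note that a scraper that drops its cookies gets through in the same way, which is why the stricter rate limiting of other crawlers still has to be handled separately.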
I think that if real human users get at least one "free" visit, then you are probably OK - but you may want to consider not showing the recaptcha to real human users coming from google (but you could find yourself in an arms race with the scrapers pretending to be human visitors from google).
In general, I would expect that if it's a recaptcha ("prove you are human") step rather than a paywall / registration wall, you will likely be OK in the situation where:
- Googlebot is never shown the recaptcha
- Other scrapers are aggressively blocked
- Human visitors get at least one page without a recaptcha wall
- Human visitors can visit more pages after completing a recaptcha (but without paying / registering)
Hope that all helps. Good luck!
-
Well, I'm not saying that there's no risk in what you are doing, just that I perceive the risk to be lower than with the alternatives. I think such a fundamental change as pay-walling could be moderately to highly likely to have a high impact on results (maybe a 65% likelihood of a 50% impact). Being incorrectly accused of cloaking would be a much lower chance (IMO) but with potentially higher impact (maybe a 5% or less chance of an 85% impact). When weighing these two things up, I subjectively conclude that I'd rather make the cloaking less 'cloaky' in any way I could, and leave everything outside of a paywall. That's how I'd personally weigh it up.
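One simple way to put those two guesses side by side is to multiply each likelihood by its impact. This is only an illustration: the percentages are the rough estimates quoted above, and treating "likelihood times impact" as the comparison is an assumption of this sketch, not necessarily how the estimates were meant to be combined.

```python
def expected_impact(likelihood, impact):
    """Expected loss as a fraction of results: probability times severity."""
    return likelihood * impact


# Rough figures from the estimate above (subjective guesses, not measurements).
paywall_risk = expected_impact(0.65, 0.50)    # hard paywall hurts results
cloaking_risk = expected_impact(0.05, 0.85)   # wrongly flagged as cloaking

print(f"paywall:  {paywall_risk:.2%}")
print(f"cloaking: {cloaking_risk:.2%}")
```

On these numbers the paywall option carries an expected impact of roughly 32.5%, against roughly 4.25% for the cloaking risk, which matches the subjective conclusion.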
Personally I'd treat Google as a paid user. If you DID have a 'full' paywall, this would be really sketchy, but since it's only partial and the data can still be accessed for FREE via recaptcha entry, that's the one I'd go for.
Again, I'm not saying there is no risk, just that each set of dice you have at your disposal is ... not great? And this is the set of dice I'd personally choose to roll with.
The only thing to keep in mind is that the algorithms which Googlebot returns data to are pretty smart, but they're not human smart, and a quirk in an algo could cause a big problem. Really though, the chances of that IMO (if all you have said is accurate) are minimal. It's the lesser of two evils from my current perspective.
-
Yes, our DA is good and we have a lot of gouv, edu and media backlinks.
Paid users do not go through the recaptcha, so treating Google as a paid user could indeed be a good solution.
So you do not recommend using a paywall?
Today the recaptcha is only used for decision pages.
But we need those pages to be indexed for our business, because all of our paid users find us while searching for a justice decision on Google.
So we have 2 solutions:
- Change nothing and treat Google as a paid user
- Use a hard paywall and inform Google of it with JSON schema markup, but we risk seeing a lot of pages deindexed
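For reference, the second option would mean adding markup along these lines. This is a minimal sketch of the schema.org properties used to flag paywalled content (`isAccessibleForFree` plus a `hasPart` element with a `cssSelector`); the headline and the `.paywall` selector are placeholders, and the helper name is hypothetical.

```python
import json


def paywalled_article_jsonld(headline, paywalled_css_selector):
    """Build JSON-LD describing a NewsArticle whose body is behind a paywall."""
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "isAccessibleForFree": False,
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            # CSS selector of the element that wraps the restricted content.
            "cssSelector": paywalled_css_selector,
        },
    }


# Placeholder values, for illustration only.
markup = paywalled_article_jsonld("Example decision title", ".paywall")
print(json.dumps(markup, indent=2))
```

The printed JSON would go inside a `script` tag of type `application/ld+json` on each restricted page. Whether this markup is appropriate for a captcha wall rather than a true paywall is exactly the open question in this thread.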
In addition, we could go from 2 pages visited before the captcha to something less intrusive, like 6 pages before the captcha.
Also, on the captcha page there is a form to start a free trial, so visitors can either complete the captcha and keep navigating, or create a free account and get unlimited access for 7 days.
To conclude, if I understand your opinion correctly, we don't have to stress about being penalized for cloaking, because Gbot is smart and understands why we use a captcha, and our DA helps Gbot trust us. So I think the best solution is the first one: change nothing and treat Google as a paid user.
Thanks a lot for your time and your help!
It's a complicated subject and it's hard to find people able to answer my question, but you did it.
-
Well if you have a partnership with the Court of Justice I'd assume your trust and authority metrics would be pretty high with them linking to you on occasion. If that is true then I think in this instance Google would give you the benefit of the doubt, as you're not just some random tech start-up (maybe a start-up, but one which matters and is trusted)
It makes sense that in your scenario your data protection has to be iron-clad. Do paid users have to go through the recaptcha? If they don't, would there be a way to treat Google as a paid user rather than a free user?
Yeah putting down a hard paywall could have significant consequences for you. Some huge publishers manage to still get indexed (pay-walled news sites), but not many and their performance deteriorates over time IMO
Here's a question for you. So you have some pages you really want indexed, and you have a load of data you don't want scraped or taken / stolen, right? Is it possible to ONLY apply the recaptcha to the pages which contain the data that you don't want stolen, and never trigger the recaptcha (at all) in other areas? Just trying to think if there is a middle way, to make it obvious to Google you are doing all you possibly can to keep Google's view and the user view the same.
-
Hi effectdigital, thanks a lot for that answer. I agree with you that a captcha is not the best UX idea, but our content is sensitive: we are a legal tech company indexing French justice decisions. We have a unique partnership with the Court of Justice because we have a unique technology to anonymize data in justice decisions, so we don't want our competitors to scrape our data (and trust me, they try, every day...). This is why we use recaptcha protection. For Gbot we check both Google reverse DNS and the user agent, so even a good scraper can't bypass our security.
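For readers who want to see what a reverse DNS plus user agent check can look like, here is a rough sketch. It is not the poster's implementation: the function name and the injectable lookup parameters are hypothetical, the accepted hostname suffixes are an assumption of this sketch, and the default lookups use Python's standard `socket` module (IPv4 only for the forward lookup).

```python
import socket

# Hostname suffixes accepted as Google crawlers (an assumption of this sketch).
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, user_agent, reverse_lookup=None, forward_lookup=None):
    """Check the user agent, then confirm the IP with reverse and forward DNS."""
    if "Googlebot" not in user_agent:
        return False
    # Default to real DNS lookups; callers may inject fakes for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith(GOOGLE_CRAWLER_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, otherwise the
        # reverse record could have been set by anyone.
        return ip in forward_lookup(hostname)
    except OSError:
        # Lookup failures (no PTR record, DNS errors) count as unverified.
        return False
```

A request that only spoofs the Googlebot user agent fails the DNS step, so it would be treated like any other visitor. In production the DNS results would normally be cached, since two lookups per request are slow.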
Then we have a paid option: people can create an account and pay a monthly subscription for unlimited access to the content. This is why I am thinking about a paywall. We could replace the captcha page with a paywall page (with a free trial, of course), but I'm not sure Google will index millions of pages hidden behind a metered paywall.
As you said, I think there is no good answer...
And again, thanks a lot for taking the time to answer my question.
-
Unless you have previously experienced heavy scraping which you cannot solve any other way, this seems a little excessive. Most websites don't have such strong anti-spam measures and they cope just fine without them
My first thought was that it would be better to embed the recaptcha on the page and just block users from proceeding further (or accessing the content) until the recaptcha was completed. Unfortunately this would be a bad solution, as scrapers would still be able to scrape the page, so I guess redirecting to the captcha is your only option. Remember that if you are letting Googlebot through (probably with a user agent toggle), then as long as scrape-builders program their scripts to send the Googlebot UA, they can get past your recaptcha redirects without ever completing a captcha. Even users can alter their browser's UA to avoid the redirects.
There are a number of situations where Google don't consider redirect penetration to be cloaking. One big one is regional redirects, as Google needs to crawl a whole multilingual site instead of being redirected. I would think that in this situation Google wouldn't take too much of an issue with what you are doing, but you can never be certain (algorithms work in weird and wonderful ways)
I don't think any schema can really help you. Google will want to know that you are using technology that could annoy users so they can lower your UX score(s) accordingly, but unfortunately letting them see this will stop your site being properly crawled so I don't know what the right answer is. Surely there must be some less nuclear, obstructive technology you could integrate instead? Or just keep on top of your block lists (IP ranges, user agents) and monitor your site (don't make users suffer)
If you are already letting Googlebot through your redirects, why not just have a user-agent based allow list instead of a black list which is harder to manage? Find the UAs of most common mobile / desktop browsers (Chrome, Safari, Firefox, Edge, Opera, whatever) and allow those UAs plus Googlebot. Anyone who does penetrate for scraping, deal with them on a case-by-case basis
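A user-agent allow list along those lines could look roughly like this. The tokens are the product tokens that the browsers named above typically include in their user agent strings; the list and the function name are illustrative assumptions, not a complete or maintained list.

```python
# Product tokens typically found in the user agents of common browsers,
# plus Googlebot (illustrative list, would need upkeep in practice).
ALLOWED_UA_TOKENS = (
    "Chrome/",    # Chrome and other Chromium-based browsers
    "Safari/",    # Safari
    "Firefox/",   # Firefox
    "Edg/",       # Edge
    "OPR/",       # Opera
    "Googlebot",  # Google's crawler
)


def is_allowed_user_agent(user_agent):
    """Return True if the user agent contains any token on the allow list."""
    return any(token in user_agent for token in ALLOWED_UA_TOKENS)
```

Anything that does not match, such as the default user agents of common HTTP libraries and command-line tools, would be denied or sent to the captcha. As noted earlier in this answer, the user agent is trivial to spoof, so this only filters out lazy scrapers; the rest still have to be dealt with case by case.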