"Stealth" crawlers are always going to win the game.
There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still serve a captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
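To make the human-in-the-loop part concrete, here is a rough sketch of how captcha-blocked pages could be parked until someone is at a desk. Everything in it is my own assumption for illustration: the 9-to-5 weekday window and the `CaptchaQueue`, `defer`, `drain` and `is_staffed` names are hypothetical and not taken from any real tool.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, time

# Assumed staffing window; "normal business hours" is not defined more precisely.
OPEN, CLOSE = time(9, 0), time(17, 0)


def is_staffed(now: datetime) -> bool:
    """True on weekdays inside the assumed staffing window."""
    return now.weekday() < 5 and OPEN <= now.time() < CLOSE


@dataclass
class CaptchaQueue:
    """Hypothetical holding area for pages blocked by a captcha."""
    pending: deque = field(default_factory=deque)

    def defer(self, url: str) -> None:
        # The automated crawl hit a captcha; park the URL for a human.
        self.pending.append(url)

    def drain(self, now: datetime) -> list:
        """Return the URLs to hand to operators, or [] if nobody is staffed."""
        if not is_staffed(now):
            return []
        batch = list(self.pending)
        self.pending.clear()
        return batch
```

The crawler calls `defer(url)` whenever it hits a challenge and keeps going; a separate job calls `drain(datetime.now())` periodically and routes whatever comes back to whoever is on shift.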
Yes, because there's always the option for a camera pointed at the screen and a robot arm moving the mouse. AI is hoping to solve much harder problems.
What's stopping these companies from offloading the scraping onto their users?
"Either pay us $50/month or install our extension, and when prompted, solve any captchas or authenticate with your ID (as applicable) on the given website so we can train on the content."
I heard that it can be easily bypassed with a realistic 3D human game model with basic mouth-open and head-tilt animations; even gmod can do such a thing.
Almost no site of value will use remote attestation, because an alternative that works with all of your devices, operating systems, ad blockers and extensions will attract more users than your locked-down site.
> alternative that works with all of your devices, operating systems, ad blockers and extensions
When 99.9% of users are using the same few types of locked-down devices, operating systems, and browsers that all support remote attestation, the 0.1% doesn't matter. This is already the case on mobile devices; it's only a matter of time until computers become just as locked down.
But for the case of Perplexity-User, presumably the user is in the loop to provide their attestation.
This case (“go research this subject for me”) is the grey area here. It’s not the same as simple scraping or search indexing; it’s a new activity that is similar in some ways.
[0] https://www.w3.org/TR/webdriver2/
[1] https://chromedevtools.github.io/devtools-protocol/