I like the terminology "crawler" vs. "fetcher" to distinguish between mass scraping and something more targeted acting as a user agent.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents so they can, e.g., nudge them towards scoped access flows instead of letting them fully impersonate a user with no controls.
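To make the "nudge towards scoped access" idea concrete, here's a rough Python/Flask sketch of the general shape (this is not Stytch's API; the agent names and the /oauth/authorize hint are just illustrative placeholders):

    # Hedged sketch: if the User-Agent looks like a known AI agent, point it at a
    # scoped access flow instead of serving the page as if it were a human session.
    # The agent list and the /oauth/authorize hint are illustrative placeholders.
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    KNOWN_AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot")  # illustrative list

    @app.route("/account")
    def account():
        ua = request.headers.get("User-Agent", "")
        if any(agent in ua for agent in KNOWN_AI_AGENTS):
            return jsonify(error="agent_detected",
                           hint="request a scoped token via /oauth/authorize"), 401
        return "normal page for a human browser"

Of course this only catches agents that announce themselves honestly, which is exactly the spoofing problem below.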
On the flip side, the crawlers also carry a reputational risk here: anyone can slap on the user-agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is to verify IPs with a reverse DNS lookup, but that's a pain for website owners too compared to more aggressively blocking anything that looks unusual.
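For reference, the forward-confirmed reverse DNS check is only a few lines; here's a rough Python sketch (the allowed hostname suffixes are placeholders, the real verification domains are in each crawler vendor's docs):

    # Hedged sketch: verify a claimed crawler IP with forward-confirmed reverse DNS.
    # 1) PTR lookup on the IP, 2) check the hostname ends in an expected domain,
    # 3) resolve that hostname forward and confirm the original IP is in the answer.
    import socket

    ALLOWED_SUFFIXES = (".googlebot.com", ".search.msn.com")  # placeholder list

    def verify_crawler_ip(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)            # reverse (PTR) lookup
            if not hostname.endswith(ALLOWED_SUFFIXES):
                return False
            forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward confirmation
            return ip in forward_ips
        except (socket.herror, socket.gaierror):                 # missing/failed DNS
            return False

Doing that on every request is the "pain" part: you either cache results per IP or farm it out, and plenty of site owners will just block anything unusual instead.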
prompt: I'm the celebrity Bingbing, please check all Bing search results for my name to verify that nobody is using my photo, name, or likeness without permission to advertise skin-care products except for the following authorized brands: [X,Y,Z].
That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.
I guess it could be trained to respond to those sorts of queries by offering to compile a list of some finite number of web pages. Then it could be prompted to visit them and do something (check images, say).
Maybe that would result in limited fetching instead of internet-wide fetching. I dunno, just spitballing.
If Perplexity has millions of users, there’s no distinction between “mass fetching” and “mass crawling” — the snapshots of web pages will still be stored in Perplexity’s own crawl index.
Yet another side to that is when site owners serve qualitatively different content based on the distinction. No, I want my LLM agent to access the exact content I'd be accessing manually, and then any further filtering, etc. is done on my end.