
I like the terminology "crawler" vs. "fetcher" to distinguish between mass scraping and something more targeted as a user agent.

I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.

On the flip side, crawlers also face a reputational risk here: anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup on the requesting IP, but that's a pain for website owners too, compared with just aggressively blocking every unusual setup.
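
For anyone who hasn't done the reverse DNS check before, here's a rough sketch of the usual forward-confirmed variant: look up the PTR record for the IP, check that the hostname falls under a domain the crawler operator publishes, then resolve that hostname forward and confirm it maps back to the same IP. The function name, its parameters, and the injectable lookup hooks are my own illustration, not any vendor's API, and the default forward lookup uses socket.gethostbyname_ex, which only covers IPv4.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check (illustrative sketch, IPv4 only).

    allowed_suffixes: domains the crawler operator says its hosts live under
    (supplied by the caller; nothing here is looked up automatically).
    reverse_lookup / forward_lookup: hypothetical hooks so the resolver can
    be swapped out, e.g. for caching or for tests.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    # Step 1: PTR lookup. No PTR record means we cannot verify anything.
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False

    # Step 2: hostname must be the allowed domain or a subdomain of it.
    # Matching on "." + suffix avoids accepting e.g. "notgooglebot.com".
    suffixes = [s.lower().lstrip(".") for s in allowed_suffixes]
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False

    # Step 3: forward lookup must map back to the original IP, because
    # whoever controls the IP range can set the PTR record to anything.
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

You'd call it with something like verify_crawler_ip(request_ip, ["googlebot.com", "google.com"]), taking the suffix list from whatever the operator documents. The pain point is real: that's two DNS round trips per unknown IP, so in practice you'd want to cache the result rather than do it on every request.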



prompt: I'm the celebrity Bingbing, please check all Bing search results for my name to verify that nobody is using my photo, name, or likeness without permission to advertise skin-care products except for the following authorized brands: [X,Y,Z].

That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.


I guess it could be trained to respond to those sorts of queries by offering to compile a list of some finite number of web pages. Then it could be prompted to visit them and do something (check images, say).

Maybe that would result in limited fetching instead of internet wide fetching. I dunno, just spitballing.


If Perplexity has millions of users, there’s no distinction between “mass fetching” and “mass crawling” — the snapshots of web pages will still be stored in Perplexity’s own crawl index.


A/ i love this distinction.

B/ my brother used to use "fetcher" as a non-swear for "fucker"


Did you tell him to stop trying to make fetcher happen?


Very funny. Now let's hear Paul Allen's joke.


He picked up that habit in Balmora.


Fetcher? Damn near killed'er!


Yet another side to that is when site owners serve qualitatively different content based on the distinction. No, I want my LLM agent to access the exact content I'd be accessing manually, and then any further filtering, etc. is done on my end.



