There are also a gazillion pages that are not ad-riddled content. With search engines, the implicit contract was that they could crawl pages because they would drive traffic to the websites that are crawled.
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/finetine LLMs (as a side channel when people block crawling for training).
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/finetine LLMs (as a side channel when people block crawling for training).