Yes, this is the crux of the matter.

The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled at a reasonable rate, provided that the resulting index links back to their content. So when AltaVista/Yahoo/Google crawl, score and list your website, interspersing the results with a few ads, it's a sensible quid pro quo for everyone.

LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the sites the data came from. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who ultimately depend on additions, changes and corrections to that content: the LLM AI outfits themselves.

IMO: copyright law needs an update to mandate no training on content without explicit permission from the copyright holder. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for LLM ingestion purposes.

EDIT: Apparently llms.txt has been suggested, but from what I can tell this isn't about restricting access: https://llmstxt.org/



> LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content

Let's be real, Google et al. have been doing this for years with their quick answers and info boxes. AI chatbots are worse, but it's not like the big search engines were great before AI came along. Google had made itself the one-stop shop for a huge percentage of users. They paid billions to be the default search engine on Apple's platforms, not out of the goodness of their hearts, but to be the main destination for everyone on the web.


Anything but expanding copyright laws. Tbh, pay-per-citation with an opt-in database to add your info (think music-streaming-style monetization) would be reasonable to me. Not that I think it's a good scheme for music, but I think it's fitting for web crawling. Though it does inevitably lead to enshittification. Pick your poison, I guess.


The reason it works for music is that the people behind the databases have a team of lawyers that will come after you for violating copyright/performance legislation if you don’t pay your dues.

The argument LLM outfits are using is that they are just exercising “fair use” / education rights, which lets them do an end run around copyright law. Without strengthening the rules on that, I’m not sure I see how the database + team of lawyers approach would work.

But with that, sure, that’s an approach that seems to have legs in other contexts.



