
I don't think people have a problem with an LLM issuing GET website.com and then summarising that, each and every time it uses that information (or at least saving a citation to it and referring to that citation). The ad ecosystem is the exception; I'm ignoring it for now, please refer to the last paragraph.

The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.

So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.

Unfortunately, the top of an LLM's head is pretty big, and for most websites they are fitting almost the entire site's content in there.

The threshold beyond which content becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" standard which refers to this same concept), has now shifted many, many times higher.

Now, IMO:

So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That pretty much solves the ethical problem. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in their own time and with their own effort. I promise most people won't be waiting. Maybe Google can charge per LLM citation. Cost Per Citation, you even maintain the acronym :)



Yes, this is the crux of the matter.

The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.

LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originators of that data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content: the LLM AI outfits themselves.

IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for LLM digestion purposes.

EDIT: Apparently llms.txt has been suggested, but from what I can tell this isn't about restricting access: https://llmstxt.org/
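
For what it's worth, a per-crawler opt-out can already be written in plain robots.txt, and a well-behaved crawler can check it before fetching. Below is a rough sketch using Python's standard urllib.robotparser. The crawler name "ExampleLLMBot", the robots.txt contents and the may_fetch helper are all made up for illustration, and robots.txt is purely advisory: this only shows what honouring it would look like, not that any given outfit does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleLLMBot" is an invented crawler name,
# used only to illustrate a per-crawler opt-out.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"
    print("ExampleLLMBot allowed:", may_fetch("ExampleLLMBot", page))
    print("SomeSearchBot allowed:", may_fetch("SomeSearchBot", page))
```

Note that this covers only the crawling step. It says nothing about what happens to content that was fetched before the rule was added, which is the part the copyright discussion is really about.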


> LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content

Let's be real, Google et al have been doing this for years with their quick answer and info boxes. AI chatbots are worse but it's not like the big search engines were great before AI came along. Google had made itself the one-stop shop for a huge percentage of users. They paid billions to be the default search engine on Apple's platforms not out of the goodness of their hearts but to be the main destination for everyone on the web.


Anything but expanding copyright laws. Tbh, pay-per-citation with an opt-in database to add your info (think music-streaming-style monetization) would be reasonable to me. Not that I think it's a good scheme for music, but I think it's fitting for web crawling. Though it does inevitably lead to enshittification. Pick your poison, I guess.


The reason it works for music is because the people behind the databases have a team of lawyers that will come after you for violating copyright/performance legislation if you don’t pay your dues.

The argument that LLM outfits are using is that they are just exercising “fair use” / education rights to do an end run around copyright law. Without strengthening the rules on that I’m not sure I see how the database + team of lawyers approach would work.

But with that, sure, that’s an approach that seems to have legs in other contexts.


That’s why websites have no issues with googlebot and the search results. It’s a giant index and citation list. But stripping works of their context and presenting them as your own has been decried throughout history.


> LLMs should also use citations.

Mojeek LLM (https://www.mojeek.com) uses citations.



