Many site operators want people to be able to access their content while preventing AI companies from scraping it for training data. People who think that way built tools like Anubis, and it works.
I want to keep this distinction on the sites I own, too, and I use licenses to signal that this site is not suitable for AI training: it's licensed CC BY-NC-SA 2.0.
So, I license my content appropriately (non-commercial, attribution required, shareable only under the same license), and add technical countermeasures on top because companies don't respect these licenses (because monies) and circumvent those mechanisms too (because monies), and I'm the one who has to suck it up and shut up (because their monies)? Makes no sense whatsoever.
I don't want AI companies to scrape my sites (or use the files I wrote) for training data either, but that is not specifically what I am trying to stop (unless the files are supposed to be private and unpublished). I should not stop them from using the files for what they want, once they have them. (I also specifically do not want to block use of lynx, curl, Dillo, etc.)
What I want to stop is excessive crawling and scraping of my server. Once they have a file, they can do what they want with it. Another comment (44786237) mentions that robots.txt is only for restricting recursive access; I agree, and that is what should be blocked. They also should not fetch the same file several times in quick succession, just as they should not fetch every file on the site; both are unnecessary. (If someone wants to mirror the files, there may be better ways, e.g. an archive file that allows downloading many at once, possibly one the site operator prepared along with their own index. If it is a git repository, it can simply be cloned.)
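To make that concrete, here is a minimal robots.txt sketch in that spirit (GPTBot is one real AI-crawler user agent; the path is made up for illustration, and Crawl-delay is a non-standard directive that not every crawler honors):

```
# Human-driven fetches and light crawling are fine;
# slow down recursive crawlers and keep them out of bulk paths
User-agent: *
Crawl-delay: 10
Disallow: /archives/

# Opt a known AI training crawler out of recursive access entirely
User-agent: GPTBot
Disallow: /
```

Of course, this only declares intent: it does nothing against crawlers that choose to ignore it, which is the complaint running through this whole thread.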
Of course some people want that, and at the moment they can prevent it. But those methods may stop working. Would it then be alright to do it? Of course not - so why bother pointing out that they can prevent it now? Just give the justification.
Your license is probably not relevant. I can go to the cinema and watch a movie, then come to this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. The movie seller would probably prefer that I didn't tell anyone. Why should I care?
I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.
> I can go to the cinema and watch a movie, then come to this website and describe the whole plot. That isn't copyright infringement.
This is a false analogy. An accurate one would be going to 1000 movies and creating the 1001st by cropping scenes from those 1000 movies and assembling them into a new one - and that is copyright infringement. I don't think any of the studios would applaud and support you for your creativity.
> But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue.
Why does it always have to be about money? For me, personally, it's not. I just don't want my work to be abused and sold to people for the benefit of a third party, without my consent and against my will (and all my work is licensed appropriately for that).
> We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index.
This goes both ways. If big corporations can scrape my material without asking me and resell it as the output of a model, I can equally distill their models further and sell the result as my own. If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
But that would be copyright infringement, just because they have more money. What angers me is the "all is fair game because you're a small fish, and this is a capitalist marketplace" mentality.
If companies can paywall their content against humans who don't pay, I can paywall AI companies and demand money, or push them off my lawn just because I feel like it. The inverse is very unethical, but very capitalist, yes.
It's not always about money.
P.S.: Oh, try to claim that you can train a model with medical data without any clearance because it'd be unethical to have laws limiting this. It'll be fun. Believe me.
I think you are describing something much more like Stable Diffusion. This article is about Perplexity, which is much closer to "watch a movie and tell me the plot" than to "take these 1000 movies and make a collage". The copyright points are different - Stable Diffusion is on much shakier ground than Perplexity.
> Why does it always have to be about money?
Before I mentioned money, I said "because it hurts my feelings". I'm sorry I can't give a more charitable interpretation, but I really do see this kind of objection as "I don't want you to have access to this web page because I don't like LLMs". This is not a principled objection; it is just "I don't like you, go away". I don't think this is a good principle to build the web on.
Obviously you can make your website private if you want, and that would be a shame. But you can't have this kind of pick-and-choose "public when I feel like it" option. By the way, I didn't mention it before, but I am OK with people using Anubis and the like as a compromise while the situation remains unjust. But the justification is very important.
> If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
This is probably not a gambit you want to make. You literally can do this, and they would probably like it if you did. You don't want to do that, because the output of LLMs is usually not that good.
In fact, LLM companies should probably be taxed, and the taxes used to fund real, AI-free human creations. This will probably not happen, but I am used to disappointment.
> Many site operators want people to be able to access their content while preventing AI companies from scraping it for training data.
That is unfortunately not a distinction that is currently legally enforceable. Until that changes, all other "solutions" are pointless and only cause more harm.
> People who think that way built tools like Anubis, and it works.
It works to get real humans like myself to stop visiting your site while scrapers will have people whose entire job is to work around such "protections". Just like traditional DRM inconveniences honest customers and not pirates. And to be clear, what you are advocating for is DRM.
> I want to keep this distinction on the sites I own, too, and I use licenses to signal that this site is not suitable for AI training: it's licensed CC BY-NC-SA 2.0.
If AI crawlers cared about that, we wouldn't be talking about this issue. And a license can only give more permissions than there would be without one.
> It works to get real humans like myself to stop visiting your site
If we're talking about Anubis, it's pretty invisible. You wait a couple of seconds on your first visit, and then you aren't challenged again for at least a couple of weeks. With more tuning, some of the sites using Anubis work perfectly well, with users never seeing Anubis's wall, while still stopping AI crawlers.
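For context on what that first-visit wait is doing: Anubis-style walls make the browser solve a small proof-of-work puzzle once, then hand it a token that skips future challenges. A minimal sketch of the idea, with illustrative names and difficulty (not Anubis's actual code):

```python
import hashlib
import os

DIFFICULTY = 4  # leading hex zeros required; illustrative, tunable per site

def solve_challenge(seed: bytes) -> int:
    """Client side: grind nonces until the hash meets the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(seed + str(nonce).encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1

def verify(seed: bytes, nonce: int) -> bool:
    """Server side: a single cheap hash confirms the client did the work."""
    digest = hashlib.sha256(seed + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

seed = os.urandom(16)          # per-visitor challenge issued by the server
nonce = solve_challenge(seed)  # costs the visitor a moment of CPU, once
assert verify(seed, nonce)     # success earns a token valid for weeks
```

The cost is negligible for one human visit but multiplies across the millions of fetches a scraper makes, which is the asymmetry these tools bet on.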
> And to be clear, what you are advocating for is DRM.
Yes. It's pretty ironic that someone like me, who believes in open access, prefers a DRM solution to keep companies from abusing the small fish, but life is an interesting phenomenon, and these things happen.
> Until that changes, all other "solutions" are pointless and only cause more harm.
As an addendum to the paragraph above: I'm not happy that I have to insert draconian measures between the user and the information I want to share, but I need a way to signal to these faceless things that they can't have their way with my work. What do you propose? Taking my sites offline? Burning myself in front of one of their HQs?
> If AI crawlers cared about that, we wouldn't be talking about this issue. And a license can only give more permissions than there would be without one.
AI crawlers default to "public domain" when they find no license. Some of my lamest source code repositories made it into "The Stack" because I forgot to add a COPYING.md. A fork of a GPLv2 tool that I wrote some patches for also got into "The Stack", because COPYING.md was not in the root folder of the repository. I'd rather attach licenses I can accept to things than leave them as-is, because AI companies eagerly grab unlicensed things too.
All the licenses I use mandate at least attribution and continuation of the license, and my blog doesn't allow derivations of what I have written except under those same terms. So you can't ingest it into a model to be derived from and remixed with something else.
> If we're talking about Anubis, it's pretty invisible. You wait a couple of seconds on your first visit, and then you aren't challenged again for at least a couple of weeks. With more tuning, some of the sites using Anubis work perfectly well, with users never seeing Anubis's wall, while still stopping AI crawlers.
It's not invisible, the sites using it don't work perfectly well for all users, and it doesn't stop AI crawlers.
I guess that's a question the NYT vs. OpenAI lawsuit might answer, at least regarding the enforceability of copyright claims when you're a corporation like the NYT.
If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or look on the bright side of getting embedded in models, as if you're leaving your mark.