Yes. Asking for consent is built into the HTTP protocol. The issue at hand here is that Perplexity scrapers lie about who they are by providing a false user agent. Thus consent was obtained under false pretenses.
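For concreteness, here's a minimal sketch of what that spoofing amounts to at the HTTP level, assuming the Python `requests` library; the URL and the exact User-Agent strings are illustrative, not taken from any real crawler's traffic:

```python
import requests

URL = "https://example.com/article"  # illustrative

# An honest crawler declares what it is in the User-Agent header:
honest = requests.get(URL, headers={
    "User-Agent": "PerplexityBot/1.0 (+https://perplexity.ai/perplexitybot)"
})

# A spoofing scraper sends a stock browser string instead, so the server
# cannot distinguish it from an ordinary human visitor:
spoofed = requests.get(URL, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36"
})
```

The two requests are byte-for-byte identical apart from that one header, which is the whole dispute: the header is the only place a client can tell the truth.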
> Asking for consent is built into the HTTP protocol.
The HTTP protocol does not specify what is right and wrong. The fact that a protocol encodes or permits a particular kind of behaviour does not mean that every use of the protocol is ethically justified. I am sure you would agree with me that "black people can't visit this server" would be an unethical rule, even though HTTP permits you to enforce such a rule. So let's forget about the protocol for a minute.
Is it morally wrong to lie about your User Agent in order to visit a website? Well, that depends on whether it is legitimate for the server operator to discriminate according to the User Agent. If it is not legitimate, then lying about your User Agent to circumvent the restriction is morally justified.
So we are back at square one: is it legitimate for a server operator to discriminate based on what sort of client is used to visit them? Since the service is public, the person is allowed to visit it and to read the content. If the client misbehaves in some way (some LLM scrapers do), then that is a legitimate basis for distinction. But if this is controlled for, so that the LLM scraper can't easily be distinguished from a human doing the same thing, then the service is not harmed any more than it ordinarily would be. Therefore the discrimination is not legitimate.
If I were DoSing your blog, you'd ask me to stop. I run server ops for multiple online communities that are being severely impacted and DoSed by these AI scrapers, and we have very few ways to stop them.
That is a problem, but it is not related to my comment. The person I'm replying to is acting as if consent is a relevant aspect of the public web; I am saying it isn't. That is not the same as saying "you can do whatever you want to a public server". It is just that what you are allowed to do is not determined by the arbitrary whim of the server operator.
Consent is also expressed through technical conventions. I, the website owner, express my intention through - for example - robots.txt. If you write a bot that specifically ignores it, you are violating consent.
Likewise, I may prevent certain user agents from visiting my site. If you - say, an AI megacorp - are intentionally spoofing the user agent to appear as a regular user, you are also violating consent.
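As a sketch of how that convention works mechanically, using Python's standard-library robots.txt parser (the bot name, rules, and URLs here are made up): the rules are keyed on the User-Agent the client declares, which is exactly why the convention only functions if bots identify themselves honestly.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Rules are matched against the User-Agent string the bot declares.
# Suppose robots.txt contains:
#   User-agent: ExampleBot
#   Disallow: /private/
rp.can_fetch("ExampleBot", "https://example.com/private/")   # -> False
# A bot that lies and presents itself as a browser falls under the
# default rules instead and is waved through:
rp.can_fetch("Mozilla/5.0", "https://example.com/private/")  # -> True
```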
I don't know how to make this any clearer. You - website owner - your consent does not matter. You are publishing information on the internet. I do not think you have a right to decide who is allowed to read it or not, or how they use what they read. You have exactly two legal rights: the right not to be DoS'd/hacked, and the right not to have your copyright infringed. Neither of those rights has anything to do with what you "consent" to. It is a bad thing to connect consent - the arbitrary, capricious whim of a website operator - to access to a public resource. Consent is for people, for your body, for your relationships. It is not a magic spell to give you arbitrary control over people.
You can repeat it, but we fundamentally disagree; it's not a matter of understanding.
Fundamentally, it's not true that the moment I publish something on the internet, I lose control over who can consume my intellectual property. Licensing, for example, is a way we regulate how code or prose can be consumed even when it is public.
Also, expressing my consent is not in any way a means to control others; it is a way to control my ideas, my writing, my [whatever], and people are not automatically entitled to it just because it's published on the internet.
So overall I understand your position, but I strongly disagree with it.
Ok, that wasn't clear before since you just kept saying how you expressed your consent rather than why your consent should be taken into account.
Licensing is much more limited than you seem to think. For instance, you said explicitly that you want a way to control your ideas. The only thing this can mean is a way to control who gets to use your ideas, or what they get to use them for. So if I express a political idea in a novel way, or tell a funny joke, I should be able to dictate who gets to repeat it, or, in the case of LLMs, who gets to summarise and describe it.
This kind of control is antithetical to the spirit of the internet and would be frankly evil if people were actually able to assert it. Luckily, in most cases it's impossible: nobody can actually stop me from describing a movie to my friends or from reposting a meme. Copying and reposting what you wrote verbatim is something we can probably agree is wrong, but that isn't what's up for questioning here. The idea I was actually replying to in the first place was that you can decide somebody can't read your ideas, even if they're public, just because you don't like them or you don't like what they will do with them. It is hard to think of a more egregious kind of 1984-style censorship, really.
There is a place for regulation of LLM companies; they are doing a lot of harm that I wish governments would effectively rein in, and it would not be hard if the political will existed. But this idea of saying I should be able to "control my ideas" is way, way worse.
LLMs are not "someone"; LLMs are something. And they don't "read content": by definition, they acquire and reuse that content (for example, by summarizing it) as part of their product.
So here the consent is indeed about what can be done with the data.
In general, it's absolutely the norm that public (i.e., unauthenticated) websites restrict even who can access the data. The simplest example that comes to mind is geoblocking. I have every right to say that my website is not made available to anybody in the US, for example. Would you still call that website "public"? Would bypassing the block via a VPN be a violation of my consent? This is mostly a moral discussion, I suppose.
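For concreteness, a hedged sketch of what such a geoblock looks like in practice, assuming Flask plus the geoip2 library with a local MaxMind database (the file path and blocked list are illustrative, not anyone's actual setup). Note that it can only filter on the apparent client IP, which is exactly why a VPN bypasses it:

```python
from flask import Flask, request, abort
import geoip2.database
import geoip2.errors

app = Flask(__name__)
reader = geoip2.database.Reader("/path/to/GeoLite2-Country.mmdb")  # illustrative
BLOCKED = {"US"}  # "not made available to anybody in the US"

@app.before_request
def geoblock():
    try:
        country = reader.country(request.remote_addr).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return  # unknown origin; letting it through is itself a policy choice
    if country in BLOCKED:
        abort(451)  # "Unavailable For Legal Reasons"; a plain 403 works too
```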
But anyway, that's not what's happening here. LLMs access content for the sole purpose of doing something with that content: either training or providing the service to their customers. They are not humans, they are not consumers, and they don't simply fetch the content and present it to the user (a much more neutral action, which is what curl or a browser does). In the case of LLMs, it's impossible to separate the act of accessing from the act of using, so the distinction you make doesn't apply, in my opinion.
LLMs are indeed not "someone". They are programs, like web browsers, acting on user instruction. The user is a person. I am only talking about people - I never said that an LLM does anything of its own volition.
> The simplest example that comes to mind is geoblocking.
Do you think it is alright to geoblock people for arbitrary reasons? It is one thing when the GDPR imposes a legal obligation on you to serve content in a particular way. Note that geoblocking doesn't actually prevent anyone from seeing the content; it just prevents them from being served by that server. The distinction is important - circumventing a geoblock is something I think should be legally protected.
> They are not humans, they are not consumers, they don't simply fetch the content and present it to the users
They simply fetch the content, run it through a piece of software, and present it to the user. As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, they are not like browsers. A browser accesses my content in a transparent way. An LLM reuses the information and acts as an opaque intermediary which - maybe - will at most add a reference to my content.
> I never said that an LLM does anything of its own volition
It doesn't matter why it does what it does; it matters what it does. Your previous comment stressed the idea that it's possible to regulate _what can be done_ with my intellectual property (licensing), but not who can access it once it is made public. What I am saying is that this is exactly the case for LLMs, which _use_ my intellectual property; they are not a tool to _access_ it (like a browser).
> Do you think it is alright to geoblock people, for arbitrary reasons?
Yes. Why wouldn't it be? And if you believe it's not, where do you draw the line? Once you share a picture with your partner, does everyone have the right to see it? Or if you share it with your group of friends? Or if you share it on a private social media profile (where you have acquaintances)? When does the audience turn from "a restricted group" into "everyone"? And why would it be different with my blog? If I want my blog accessible only from my country, I can absolutely do that, and there is nothing wrong with it at all. Nobody is entitled to my intellectual property. Obviously I am playing devil's advocate, but this is to say that the fact that something is public doesn't mean it's unrestricted. And don't get me started on "the spirit of the internet". I can't imagine anything breaking that spirit more than LLMs acting as an interface between people and the other people on the internet. That spirit is gone; it belongs to a time when the internet was tiny. When OpenAI and company respect the "spirit of the internet", maybe I will think about doing the same.
> As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, as far as I am concerned, the program can take my information; summarize, change, distort, or misinterpret it; and then present it back to its user. This can happen with or without the user ever knowing that the information came from me. Considering this equal to the user accessing the information is something I simply will not concede, and it is a fundamental disagreement between us, from which many other disagreements stem.
You realize that consent in this case is just what we refer to as "authorization", right? And it is absolutely within any website operator's rights to only authorize their site for certain users and for certain purposes.
Websites are not "public resources"; site operators just mostly choose to allow the general public to access them. There's no legal requirement that they do so.
If you want anti-discrimination laws that apply to businesses to also cover bots, that is well outside of current law. A site operator can absolutely morally and legally decide they do not allow non-human visitors, just like a store can prohibit pets.
> And, yes, I would, because I'd be breaking the law otherwise.
No, you wouldn't be. Even if someone tells you not to visit their site, you have every legal right to continue visiting it, at least in the US.
Under the common interpretation of the CFAA, there needs to be a formal mechanism of authorized access. E.g., you could be charged if you hacked into a password-protected area of someone's site. But if you're merely told "hey bro, don't visit my site", that's not going to reach the required legal threshold.
Which is why crawlers aren't breaking the law. If you want to restrict authorization, you need to actually implement that as a mechanism by creating logins, restricting content to logged-in users, and not giving logins to crawlers.
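A minimal sketch of what "authorization as a mechanism" means here, assuming Flask; the route, token store, and header scheme are all hypothetical stand-ins for a real login system:

```python
from flask import Flask, request, abort

app = Flask(__name__)
# Hypothetical credential store: tokens the operator actually issued.
AUTHORIZED_TOKENS = {"token-issued-to-a-known-user"}

@app.route("/article")
def article():
    # Content is served only to clients presenting an issued credential.
    # A crawler that was never given a login gets a 401; bypassing this
    # gate, unlike ignoring a polite request, is a formal access control.
    token = request.headers.get("Authorization", "").removeprefix("Bearer ")
    if token not in AUTHORIZED_TOKENS:
        abort(401)
    return "The restricted content."
```

The point is that the boundary is enforced in code, not merely announced: there is an unambiguous credential the crawler does not have.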
Repeat after me: intentionally discriminating between computer programs and humans, in favor of humans, is a good and praiseworthy thing. We can and should make the execution of computer programs harder and harder, even disproportionately so, if that makes the lives of humans better and easier.
"if that makes lives of humans better" is doing a lot of heavy lifting, and remains to be explained.
Computer programs don't take actions; people do. If I use a web browser, or scrape some site to make an LLM, that's me doing it, not the program. And I have human rights.
If you think training LLMs should be illegal, just say that. If you think LLM companies are putting an undue strain on computer networks and they should be forced to pay for it, say that. But don't act like it's a virtue to try and capriciously gatekeep access to a public resource.
I was being unclear, but that's on me, since it was intentional. But to clarify my stance - I'm against the accidental or intentional equating of programs with individual humans, and against using that as a foundation for all kinds of negative (imo) corporate behavior later.
For example - humans can learn, programs can't. The "learning" cop-out from the LLM corpos shouldn't be accepted by anyone, let alone by the law. Humans have a fair-use carve-out from copyright law not because it's something axiomatic, but because some humans with empathy forced others to allow all humans some leeway in legally using others' IP works. Just because such a law exists for humans doesn't mean that random computer programs should fall under it. Scraping the web for LLMs should not be considered "fair use" because a) it clearly is not (it is commercialized later) and b) programs aren't humans and don't have equal rights.
And the list goes on. Now, I do get that the train has long left the station and we are all collectively living in the anecdote about stealing a bicycle and asking God for forgiveness. But that doesn't mean I agree with this state of affairs. I'm just shouting my displeasure at the passing train because I'm weird like that. It's like climate change: we are doing nothing that matters, no one discusses what really matters, and I have just accepted that nothing will really change. That doesn't mean I like the situation.
PS: tl;dr - LLMs clearly should be legal; it's just simple code, after all. LLM corporations that steal IP content without compensating the authors should be illegal, but of course they won't ever be.
PPS: there is a huge, gigantic gap between a single person scraping a few thousand pages for personal use, maybe even for some small local commercial use (though that's a grey area already), and a billion-dollar megacorp intent on destroying everything of value for humans on the internet, for profit.
god help us if they ever manage to build anything more than shitty chatbots