Your comment and the above comment of course show different cases.
An agent making a request on the explicit behalf of someone else is probably something most of us agree is reasonable. "What are the current stories on Hacker News?" -- the agent is just doing the same request to the same website that I would have done anyways.
But the sort of non-explicit just-in-case crawling that Perplexity might do for a general question where it crawls 4-6 sources isn't as easy to defend. "Are polar bears always white?" -- Now it's making requests I wouldn't have necessarily made, and it could even be seen as a sort of amplification attack.
That said, TFA's example is where they register secretexample.com and then ask Perplexity "what is secretexample.com about?" and Perplexity sends a request to answer the question, so that's an example of the first case, not the second.
As a person who has a couple of sites out there, and witnesses AI crawlers coming and fetching pages from these sites, I have a question:
What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feeding it to their next training cycle?
> What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feeding it to their next training cycle?
What prevents anyone else? robots.txt is a request, not an access policy.
This honor system mostly worked at scale because interests aligned, which no longer seems to be the case.
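To make that concrete: Python's standard library even ships a robots.txt parser, and all it does is answer the question "should I fetch this?" -- whether the client listens is entirely up to the client. A minimal sketch (the bot name, URL, and policy are made up):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows all crawling.
policy = """User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(policy.splitlines())

# The parser tells a *polite* client the answer is "no"...
allowed = parser.can_fetch("SomeBot", "https://example.com/page")
print(allowed)  # False

# ...but nothing enforces it: an impolite client can simply call
# urllib.request.urlopen("https://example.com/page") anyway,
# regardless of what the policy says.
```

The "protection" ends at the parser; the fetch itself is never blocked.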
Does information no longer want to be free? Maybe the internet, just like social media, was a social experiment in the end, albeit a successful one. Thanks, GenAI.
Can the Terms of Service of individual content creators leverage a "death of a thousand cuts" model to produce a legal honeypot which would require organizations like Perplexity to be bound up in 10s of thousands of conciliation court cases?
Big Tech has hidden behind ToS for years. Now, it seems as though it only works for them, but not against them. It seems as though this would be easy to orchestrate and prove, forcing these companies into a legal nightmare or risking insolvency due to the sheer volume of cases filed against them.
Why couldn't something like this be used to flip the table? A conciliation brigading, of sorts.
Because lawyers are expensive and big tech companies have lots of them. Because it takes a ton of time and effort to sue someone. Because you need to show standing, which means you need to be able to demonstrate you lost something of value by their actions. Because the power imbalance is heavily weighted towards a corporation. Because the way to deal with such things should be legislation and not court decisions. And lots more reasons...
That's exactly why I said conciliation court. None of what you've outlined is required nor is it expensive. But, for each case, the defendant is still required to show up.
I've successfully used conciliation court against large corporations in the past which is why I question it here.
And while this should be handled via legislation, it won't be. Beyond that, a workaround like this could force that to happen.
Sorry, I had never heard that term before. You would still have to show standing though. How would you try to prove that their violating your TOS cost you money?
Is it not viable to produce a work of art and say that this is free for humans, but not for bots and cannot be used for training and said violation cost X?
Again, I can't copy and distribute a game Microsoft rents to me. But if I do, I can be held accountable for a ridiculous amount of money. If it's my work of art, the terms can dictate who doesn't need to pay and who does. If an LLM is consuming my work of art and now distributing it within their user base, how is that not the same?
These are arguments you would tell the judge. And the judge would almost certainly tell you 'this is the wrong venue for that. You are in small claims. I need an itemized list of monetary damages you have suffered before I can make a judgement.'
I intentionally don't keep detailed analytics on my homepage server and my digital garden, because I respect my users and don't want to push unnecessary JavaScript on them. The blog platform I use (Mataroa) keeps rudimentary analytics (essentially page hit counters, nothing more) on the index, RSS feed, and per-post pages.
Both my blog homepage and posts see mostly human traffic. Sometimes bots crawl the site and they appear as spikes in the analytics.
Looks like my homepage, which doesn't have anything but links, is pretty popular with crawlers. My digital garden doesn't get much interest from them. All in all, human traffic on my sites is very much alive.
I don't believe in missing the bus on anything, actually, because I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open. I post links to both when it's appropriate, but they are not made to be popular. If people read them and learn something or solve one of their problems, that's enough for me.
This is why my software is GPLv3, Digital Garden is GFDL and blog is CC BY-NC-SA 2.0. This is why everything is running with absolutely minimum analytics and without any ads whatsoever.
Lastly, this is why I don't want AI crawlers on my site and my data in the models. This thing is made by a human for humans, absolutely for free. It's not OK for somebody to take something designed to be free, sell it, and make money off it.
> I intentionally don't keep detailed analytics on my homepage server and my digital garden, because I respect my users and don't want to push unnecessary JavaScript on them.
Absolutely, I'm in agreement here. I want to run a JS-free blog, just plain old static HTML. I plan to use GoAccess to parse the access logs but that's it. I think I would find it encouraging to see real human traffic.
> I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open.
> I want to run a JS-free blog, just plain old static HTML.
If you want to start fast until you find a template you want to work with, I can recommend Mataroa [0]. The blog has almost no JS (it binds a couple of keys for navigation, that's it), and it's $10/year. When you feel ready with your self-hosted solution, you can move over to it. It's all Markdown at the end of the day.
> I plan to use GoAccess to parse the access logs but that's it.
That's the only thing I use, too. Nothing else.
If you want to look at what I do, how I do, and reach out to me, the rabbit hole starts from my profile, here.
I wish you all the best; may you find bliss and joy you never dreamed of!
if you do analytics, it is not so hard, but then you need to store user data (if not directly, then worse, with a third party), which should be viewed as a liability. I see ~2/3 human traffic, ~1/3 bot traffic (I just parse user agent strings and count whitelisted browsers as human), but my main landing page is all dynamically populated WebGL. I just asked Gemini what it sees on the website, and it states "The page appears to be loading, with the text "Loading room data...".[1] There are also labels for "BG", "FG", and "CURSOR", and a background weather animation." -- so I can feel reasonably confident I don't need to worry about AI, for now; it needs a machine-friendly frontend.
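a rough sketch of that counting approach, for the curious -- the token whitelist and sample user agents below are illustrative, not my actual setup:

```python
# Classify hits as human or bot by matching the User-Agent string
# against a browser token whitelist. Tokens and UAs are made up
# for illustration; real log parsing would read access logs.
BROWSER_TOKENS = ("Firefox", "Chrome", "Safari", "Edg")
BOT_HINTS = ("bot", "crawler", "spider")

def classify(user_agent: str) -> str:
    ua = user_agent.lower()
    if any(h in ua for h in BOT_HINTS):
        return "bot"
    if any(t.lower() in ua for t in BROWSER_TOKENS):
        return "human"
    return "bot"  # default to bot when unsure (curl, scripts, etc.)

agents = [
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "curl/8.5.0",
]
counts = {"human": 0, "bot": 0}
for ua in agents:
    counts[classify(ua)] += 1
print(counts)  # {'human': 1, 'bot': 2}
```

obviously user agents are trivially spoofed, so this only measures traffic that bothers to identify itself honestly.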
you could go proper insanomode, too. remaking The Internet is trivial if you don't care about existing web standards -- replacing HTTP with your own TCP implementation, getting off html/js/css, etc. being greenfield, you can control the protocol, server, and client implementation, and put it in whatever language you want. I made a stateful Internet implementation in Python earlier for proof-of-concept, but I want to port it and expand on it in rust soon (just for fun; I don't do serious biznos). you'll very likely have 100% human traffic then, even if you're the only person curious and trusting enough to run your client.
it's not in a shareable state; is unsafe as-is. can share general idea and sample "webpage" files, though.
the server ("lodge") passes JSON to the client from what are called .branch files. the client receives JSON, parses it, then builds the UI and state representation from the JSON, then stores it in that client's memory (self.current_doc and self.page_state in the python client).
branches can invoke waterwheel (.ww) files hosted on the lodge. waterwheel files on the lodge contain scripts which define how patches (as JSON) are to be sent to the client. the client updates its state based on the JSON patch it receives. sample .branch and .ww from the python implementation (in a pastebin so everyone doesn't have to scroll through it): https://pastebin.com/A0DEZDmR
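the patch flow looks roughly like this -- field names here are made up for illustration, the real .branch/.ww samples are in the pastebin:

```python
import json

# Toy version of the client side: it holds a state dict (like
# self.page_state) and merges in a JSON patch sent by a waterwheel
# script on the lodge. Shallow dict update, nothing fancy.
page_state = {"title": "Lodge home", "visitors": 0}

patch_json = '{"visitors": 1, "status": "online"}'

def apply_patch(state: dict, raw: str) -> dict:
    """Merge a JSON patch into the client's state (shallow update)."""
    state.update(json.loads(raw))
    return state

apply_patch(page_state, patch_json)
print(page_state)  # {'title': 'Lodge home', 'visitors': 1, 'status': 'online'}
```

the real thing also rebuilds the UI from the updated state, but the state merge is the core of it.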
It's your server. You're free to do whatever you want. You can serve different versions of the page depending on the UserAgent (has been done many times before).
You can put up a paywall depending on UserAgent or OS (has been done).
In short, it's a 2-way street: the client on the other end of the TCP pipe makes a request, and your server fulfills the request as it sees fit.
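As a sketch of that idea -- the bot list, status codes, and page bodies below are hypothetical, just to show the shape of UA-dependent serving:

```python
# The server inspects the User-Agent header and decides what to
# return: a paywall response for known AI crawlers, the full page
# for everyone else. The bot tokens are illustrative examples.
KNOWN_AI_AGENTS = ("GPTBot", "PerplexityBot", "CCBot")

def respond(user_agent: str) -> tuple[int, str]:
    """Return (status, body) depending on who is asking."""
    if any(bot in user_agent for bot in KNOWN_AI_AGENTS):
        return (402, "Payment required for automated access.")
    return (200, "<html>Full page for human visitors.</html>")

status, body = respond("Mozilla/5.0 ... PerplexityBot/1.0")
print(status)  # 402
status, body = respond("Mozilla/5.0 (X11; Linux) Firefox/128.0")
print(status)  # 200
```

The catch, as the article shows, is that a crawler can simply send a browser User-Agent, so this only filters the honest ones.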
The way to prevent people from downloading your pages and using them is to take them off the public internet. There are laws to prevent people from violating your copyright or from preventing access to your service (by excessive traffic). But there is (thankfully) no magical right that stops people from reading your content and describing it.
Many site operators want people to access their content, but prevent AI companies from scraping their sites for training data. People who think like that made tools like Anubis, and it works.
I want to keep this distinction on the sites I own, too. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA 2.0.
So, I license my content appropriately (no derivatives, non-commercial, shareable under the same license with attribution) and add technical countermeasures on top, because companies don't respect these licenses (because monies) and circumvent these mechanisms (because monies), and I'm the one who has to suck it up and shut up (because their monies)?
I don't want AI companies to scrape my sites (or use the files I wrote) for training data either, but that is not specifically what I am trying to stop (unless the files are supposed to be private and unpublished). I should not stop them from using the files for what they want, once they have them. (I also specifically do not want to block use of lynx, curl, Dillo, etc.)
What I want to stop is excessive crawling and scraping of my server. Once they have the file they can do what they want with it. Another comment (44786237) mentions that robots.txt is only for restricting recursive access; I agree, and that is what should be blocked. They also should not access the same file several times quickly even though it should be unnecessary to do so, just as much as they should not access all of the files. (If someone wants to make a mirror of the files, there may be other ways, e.g. in case there is an archive file available to download many at once (possibly, in case the site operator made their own index and then did it this way). If it is a git repository, then it can be cloned.)
Of course some people want that. And at the moment they can prevent it. But those methods may stop working. Will it then be alright to do it? Of course not - so why bother mentioning that they are able to prevent it now? Just give a justification.
Your license is probably not relevant. I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. Probably the movie seller would prefer it if I didn't tell anyone. Why should I care?
I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.
> I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement.
This is a false analogy. A correct one would be going to 1000 movies and creating the 1001st movie with scenes cropped from these 1000 movies, assembled as a new movie, and that is copyright infringement. I don't think any of the studios would applaud and support you for your creativity.
> But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue.
Why does it have to be always about money? Personally it's not. I just don't want my work to be abused and sold to people to benefit a third party without my consent and will (and all my work is licensed appropriately for that).
> We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index.
This goes both ways. If big corporations can scrape my material without asking me and resell it as an output of a model, I can equally distill their models further and sell it as my own. If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
But that will be copyright infringement, just because they have more money. What angers me is "all is fair game because you're a small fish, and this is a capitalist marketplace" mentality.
If companies can paywall their content to humans that don't pay, I can paywall AI companies and demand money or push them out of my lawn, just because I feel like that. The inverse is very unethical, but very capitalist, yes.
It's not always about money.
P.S.: Oh, try to claim that you can train a model with medical data without any clearance because it'd be unethical to have laws limiting this. It'll be fun. Believe me.
I think you are describing something much more like Stable Diffusion. This article is about Perplexity, which is much closer to "watch a movie and tell me the plot" than to "take these 1000 movies and make a collage". The copyright points are different - Stable Diffusion is on much shakier ground than Perplexity.
> Why does it have to be always about money?
Before I mentioned money I said "because it hurts my feelings". I'm sorry I can't give a more charitable interpretation, but I really do see this kind of objection as "I don't want you to have access to this web page because I don't like LLMs". This is not a principled objection, it is just "I don't like you, go away". I don't think this is a good principle to build the web on.
Obviously you can make your website private, if you want, and that would be a shame. But you can't have this kind of pick-and-choose "public when you feel like it" option. By the way, I did not mention it before, but I am OK with people using Anubis and the like as a compromise while the situation remains unjust. But the justification is very important.
> If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
This is probably not a gambit you want to make. You literally can do this, and they would probably like it if you did. You don't want to do that, because the output of LLMs is usually not that good.
In fact, LLM companies should probably be taxed, and the taxes used to fund real human AI-free creations. This will probably not happen, but I am used to disappointment.
> P.S.: Oh, try to claim that you can train a model with medical data
> Many site operators want people to access their content, but prevent AI companies from scraping their sites for training data.
That is unfortunately not a distinction that is currently legally enforceable. Until that changes all other "solutions" are pointless and only cause more harm.
> People who think like that made tools like Anubis, and it works.
It works to get real humans like myself to stop visiting your site while scrapers will have people whose entire job is to work around such "protections". Just like traditional DRM inconveniences honest customers and not pirates. And to be clear, what you are advocating for is DRM.
> I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA-2.0.
If AI crawlers cared about that, we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
> It works to get real humans like myself to stop visiting your site
If we talk about Anubis, it's pretty invisible. You wait a couple of seconds on the first visit, and don't get challenged for a couple of weeks, at least. With more tuning, some of the sites using Anubis work perfectly well, stopping AI crawlers without users ever seeing Anubis' wall.
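For context, the core idea behind Anubis-style walls is a proof-of-work challenge: the browser burns a little CPU once, and a scraper has to burn it on every fresh session. A toy version (illustrative only -- Anubis' actual scheme and parameters differ):

```python
import hashlib

# The visitor must find a nonce whose SHA-256 hash of
# "challenge:nonce" falls below a difficulty target. Cheap to
# verify server-side, costly-ish to solve client-side.
def solve(challenge: str, difficulty: int = 12) -> int:
    target = 1 << (256 - difficulty)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = 12) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty))

nonce = solve("session-token-abc")  # hypothetical session token
print(verify("session-token-abc", nonce))  # True
```

Once solved, the site hands out a cookie valid for a while, which is why repeat visits don't see the wall.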
> And to be clear, what you are advocating for is DRM.
Yes. It's pretty ironic that someone like me who believes in open access prefers a DRM solution to keep companies abusing the small fish, but life is an interesting phenomenon, and these things happen.
> Until that changes all other "solutions" are pointless and only cause more harm.
As an addendum to the paragraph above: I'm not happy that I have to insert draconian measures between the user and the information I want to share, but I need a way to signal to these faceless things that I'm not having it. What do you propose? Taking my sites offline? Burning myself in front of one of their HQs?
> If AI crawlers cared about that, we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
AI crawlers default to "Public Domain" when they find no licenses. Some of my lamest source code repositories made it into "The Stack" because I forgot to add COPYING.md. A fork of a GPLv2 tool I wrote some patches for also got into "The Stack", because COPYING.md was not in the root folder of the repository. I'd rather add licenses (ones I can accept) to things than leave them as-is, because AI companies eagerly grab anything without a license.
All licenses I use mandate attribution and continuation of the license, at least, and my blog doesn't allow any derivations of what I have written. So you can't ingest it into a model to be derived and remixed with something else.
> If we talk about Anubis, it's pretty invisible. You wait a couple of seconds on the first visit, and don't get challenged for a couple of weeks, at least. With more tuning, some of the sites using Anubis work perfectly well, stopping AI crawlers without users ever seeing Anubis' wall.
It's not invisible, the sites using it don't work perfectly well for all users and it doesn't stop AI crawlers.
I guess that's a question that might be answered by the NYT vs OpenAI lawsuit at least on the enforceability of copyright claims if you're a corporation like NYT.
If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or see the bright side of getting embedded in models as if you're leaving your mark.
the fact that it would be discovered almost immediately.
If you give them a URL that does not appear in Google, ask them to visit that URL specifically, and then notice the content from that URL in the training data, it's proof that they're doing this, which would be quite damaging to them.
> […] it's proof that they're doing this, which would be quite damaging to them.
Is it? It's damning, but is it damaging at all?
I'm getting the impression that anyone's data being available for training, if some bot can get to it, is just how things are now rather than an unsettled point of contention. There's too much money invested in this thing for any other outcome, and with the present decline of the rule of law…
Hacker News wants you to visit the site, look at the main page, enter threads and participate in discussion.
When you swap in an AI and ask what the current stories are, the AI fetches the front page and every thread and feeds them back to you. You are less likely to participate in the discussion because you've already had the info summarized.
If most people quit spending money on Amazon then Amazon stops being worth running.
If most people stop discussing things on HN, and the discussion is indeed one of the major reasons it’s kept running, then HN stops being worth running.
Indeed. But that is a false equivalence - this is conflict of desires between small companies and creators and an AI-corp where the AI-corp wants to steal their content and give it to users with their shop branding.
Foo news wants you to visit the site, look at the main page, watch the ads, click on them and buy the products advertised by third parties which will give money to Foo news in exchange for this service.
And yet people install ad blockers and defend their freedom to not participate in this because they don't want to be annoyed by ads.
They claim that since they are free to not buy an advertised product, why should they be forced to see ads for it. But Foo news claims that they are also free to not waste bandwidth serving their free website to people who declare (by using an ad blocker or the modern alternative: AI summarizers) that they won't participate in the funding of the service.
It's not ads. We have ads in paper magazines and newspapers, and no one went around with scissors to remove them. It's obnoxious ads, designed to violently grab your attention, and trackers (malware). It's like a newspaper giving your address to a whole crew of salesmen who intrude on your property at 3am, look at you sleeping, and install cameras in your bathroom. All so that they can jump at you in the street to loudly claim they have the underwear you told your partner you like. If you're going to be that invasive about my person, then I'm going to be that forceful about restrictions.
This is one of the dumbest things about ad networks. Google has enough data about your watching habits on Youtube and their algorithm is basically as good as it gets in terms of showing you what you want to watch and getting you hooked on it, but the moment they show you ads, all that technical expertise appears to have vanished into thin air and all they show you is fake mobile ads?
People hate obnoxious ads because the money that pays for them is essentially a bribe to artificially elevate content above its deserved ranking. It feels like you're being manipulated into an unfavorable trade.
> their algorithm is basically as good as it gets in terms of showing you what you want to watch and getting you hooked on it
It is? Are we talking about the same YouTube? I get absolutely useless recommendations, I get un-hooked within a couple videos, and I even keep getting recommendations for the same videos I've literally watched yesterday. Who in the world gets hooked by this??
> And yet people install ad blockers and defend their freedom to not participate in this because they don't want to be annoyed by ads.
I think this is a pretty different scenario. Here the user and the news website are talking directly to each other, but then the user is making a choice around what to do with the content the news website send to them. With AI agents, there is a company inserting themselves between the user and the news website and acting as a middleman.
It seems reasonable to me that the news website might say they only want to deal with users and not middlemen.
Yes. Because they want to own your attention and that only works if they are interfacing directly to you.
I remember that Samsung was at one time offering to play non-skippable full-screen ads on their newest 8K OLED TVs, and their argument was precisely that these ads would reach those rich people who normally pay extra to avoid getting spammed with ads. Or, going with your executive assistant example, there are situations where it makes sense to bribe them to get access to you and/or your data. E.g. the "evil maid attack".
If people were forced to pay for websites by the http request people would demand that websites stop loading a ton of externally hosted JS, stop filling sites with ads, and would demand that websites actually have content worth the price.
There are so many links I click on these days that are such trash I'd be demanding refunds constantly.
>There are so many links I click on these days that are such trash
That is why AI "summarization" becomes a necessary intermediate layer. You'd see neither trash nor ads, and you'd pay for the summaries instead of being exposed to the ads. AI saves the Internet :)
It's not a development problem, it's an adoption problem. Publishers are desperate to sell us on a $20+/month subscription, they don't want to offer convenient affordable access to single articles.
$20/month would be nice if it weren't a tier with fewer ads. I want no ads, and full-text RSS feeds (because I want to use my own clients to read). It's like how Netflix refuses to build basic search and filtering, or Spotify refuses to build an actual library manager. They don't want you in control of your consumption.