This is why Perplexity is my preferred deep search engine. The no-crawl directives don't really make sense when I'm doing research and want my tool of choice to be able to pull from any relevant source. If a site doesn't want particular users to access their content, put it behind a login. The only way I - and eventually many others - will see it in the first place is when it pops up as a cited source in the LLM output and there's an actual need to visit said source.
Hi, website operator here. I don't want my content to be accessible to you through Perplexity.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it were your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
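For what it's worth, the simplest form of that tuning is a user-agent filter at the web server. A minimal sketch, assuming nginx and Perplexity's published crawler user agent (`PerplexityBot`); this is illustrative, not a complete defense:

```nginx
# Inside a server block: return 403 to requests identifying
# as Perplexity's crawler. User-agent strings are trivially
# spoofed, so this only stops well-behaved bots; stubborn
# scrapers require IP-range blocks or heuristic filtering.
if ($http_user_agent ~* "PerplexityBot") {
    return 403;
}
```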
By the way, I do not use my websites as a revenue stream. This isn't about money.
Another very important reason I prefer Perplexity is attribution. It actually does cite the sources its output is based on (unless it's something generic or calculated), so if I suspect something is off, or want to look deeper into some particular aspect, I can easily click through. And I've done enough click-throughs to be confident that Perplexity faithfully represents sourced content, and accurately pulls exactly the bits I'm interested in, maybe 98% or more of the time.
It is your prerogative to tune your servers as you see fit, but as LLM adoption increases you'll merely find that your site has fewer and fewer visits overall, so your content will only be utilized by you and a vanishingly small group of other persons. Perhaps you're OK with that, and that's also fine for the rest of us.
It's strange you mention theft, and then say it isn't about money. For me, and many others, it's about practicality and efficiency. We went from having to visit physical libraries to using search engines, and now we're entering the era of increasingly intelligent content fetch+preprocess tools.
> as LLM adoption increases you'll merely find that your site has fewer and fewer visits overall, so your content will only be utilized by you and a vanishingly small group of other persons.
So far, AI has had the opposite effect on my site. I've now been featured on both Hackaday and Adafruit's blog. Both features were clearly AI-generated. Both posts coincided with an influx of emails from folks interested in my work.
Perplexity is good at citing things when it decides to cite things and when you tell it to cite things. It can and does spit out plain expository text with no indication of the information's origin. I do appreciate that you have better-than-usual habits about validating sources.
I think you may have misinterpreted my remark about money. With the direction conversations around AI have been going lately, I was expecting a backhanded accusation that I was farming ad revenue.
"It's not about money" meant that I have nothing to lose financially by losing direct human traffic to my websites. Instead, I stand to lose those aforementioned email conversations.
> So far, AI has had the opposite effect on my site. I've now been featured on both Hackaday and Adafruit's blog. Both features were clearly AI-generated. Both posts coincided with an influx of emails from folks interested in my work.
This may be missing some context, but it seems as though you're saying that you made something with AI and it led to traction. That's great! Seems off the point that blocking LLM service will lead to less exposure over time though.
> Perplexity is good at citing things when it decides to cite things and when you tell it to cite things.
Maybe I'm just lucky, but a quick skim of my Perplexity history yielded only 2 instances of no citations, and they were for general coding queries. I've never had to ask it to cite anything, as that's built into the default prompt.
> lose those aforementioned email conversations.
I think those will remain a possibility as long as LLM users, or services, ensure citations are included in output.
> This may be missing some context, but it seems as though you're saying that you made something with AI and it led to traction. That's great! Seems off the point that blocking LLM service will lead to less exposure over time though.
Hah, I can see how you would have read it that way. Quite the opposite. I don't use AI tools for my writing. Hackaday and Adafruit have both featured my posts, and their posts were pretty clearly AI-generated.
That still sounds like a great deal. Less work for those post authors, and you benefit from being cited in some way (maybe they used Perplexity or similar and didn't even visit your site themselves).
@ryukoposting - i am the founder of hackaday, but do not run the site now, and i am also the managing director of adafruit and editor of the adafruit blog. the adafruit blog does not use generative text, etc. unless clearly indicated https://www.adafruit.com/editorialstandards ... appreciate a correction to your post, hard to combat misinformation with ai, but you can email and i can prove i am human if ya want... pt at adafruit dot com
> The no-crawl directives don't really make sense when I'm doing research and want my tool of choice to be able to pull from any relevant source.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
> I think it's a reasonable stance to not want other companies to profit from my hard work
For me, the dividing line is whether someone else's profit is at my expense. If I sell a book, and someone starts hawking cheaper photocopies of it, that takes away my future sales. It's at my expense, and I'm harmed.
But if someone takes my book's story and writes song lyrics derived from it, I might feel a little envy (perhaps I've always wanted to be a songwriter), but I don't think I'd harbor ill will. I might even hope for the song to be successful, as it would surely drive further sales of my book.
It's human nature to covet someone else's success, but the fact is there was nothing stopping me (except talent) from writing the song.
> but I think it's a reasonable stance to not want other companies to profit from my hard work
Imagine someone at another company reads your site, and it informs a strategic decision they make at the company to make money around the niche activity you're talking about. And they make lots of money they wouldn't have otherwise. That's totally legal and totally ethical as well.
The reality is, if you do hard work and make the results public, well you've made them public. People and corporations are free to profit off the facts you've made public, and they should be. There are certain limited copyright protections (they can't sell large swathes of your words verbatim), but that's all.
So the idea that you don't want companies to profit from your hard work is unreasonable, if you make it public. If you don't want that to happen, don't make anything public.
On a more human level, I think it's bleak that someone who makes a blog just to share stuff for fun is going to have most of his traffic be scrapers that distill, distort, and reheat whatever he's writing before serving it to potential readers.
If someone writes valuable stuff on a blog almost nobody finds, that's a tragedy.
If LLMs can process the information and provide it to people in conversations where it will be most helpful, where they never would have found it otherwise, then that's amazing!
If all you're trying to do is help people with the information you've discovered, why do you care if it's delivered via your own site or via LLM? You just want it out there helping people.
Because attribution, social recognition and prestige are among the many reasons why people put the information out there, and there is nothing wrong with any of them.
This is why I care whether my ideas are presented to others by an LLM (that maybe cites me in some percentage of cases) or read directly by a human. There is already a difference between a human visiting my space (acknowledging it as such) to read and learn, and my being a footnote reference that may or may not be opened, with no immediate sense of which information comes from me.
If you want attribution and prestige, then publish your stuff in an actual publication -- a journal, a magazine, whatever. Go on podcasts, speak at conferences, and so forth.
Publishing on a personal blog is not the path.
LLMs aren't taking away from your "prestige" or recognition any more than a podcaster referencing an idea of yours without mentioning you is, or anyone else in casual conversation.
I can't believe the hypocrisy of a guy with 76029 internet points (that's a big time investment, would be a shame if someone trained an LLM on it) pretending to not understand that people want recognition for what they say, regardless of where they say it.
Are there journals that discuss personal life and perspectives? Or a big publication about clever homelab configurations? Or about the millions of other topics people discuss and publish?
Publishing a website is a perfectly fine way to put your ideas out there and to expect acknowledgment from those who read them.
And yes, a podcaster talking about someone's idea without referencing it is unethical behavior.
In the grand scheme of things, I guess it's good to have an impact, even an indirect one, but come on, we're talking about human beings here.
Even if someone were to do it out of sheer passion without a care for financial gains, I'm sure they'd still appreciate basic validation and recognition. That's like the cheapest form of payment you could give for someone's work.
I don't understand why "actually, you're egotistical if you dare to desire recognition for stuff you put love and effort into" is such a common argument in these discussions. People are treated like machines that should swallow their pride and sense of self for the greater good, while on the other end there is a push (not saying YOU in particular did it) to humanize LLMs.
For me, the point is that the person who has put in the work then has some rights to decide how that information is accessed and re-used. I think it is a reasonable position for someone to hold that they want individuals to be able to freely use some content they produced, but not for a company to use and profit from that same content. Just saying "it's public now" lacks any nuance.
Ultimately these AI tools are useful because they have access to huge swaths of content, and the owners of those tools generate a lot of revenue by selling access to them. I think the internet will end up a much worse place if companies don't respect the clearly established wishes of the people creating the content: if companies stop respecting things like robots.txt, then people will just hide stuff behind logins, paywalls, and frustrating tools like Cloudflare, which use heuristics to block malicious traffic.
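For readers unfamiliar with it, robots.txt is just a plain-text file of per-crawler directives served at the site root; compliance is entirely voluntary on the crawler's part. A minimal example that disallows Perplexity's published crawler while leaving the site open to everyone else:

```
# /robots.txt
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

Which is exactly why the norm matters: the file has no enforcement mechanism, so it only works while crawler operators choose to honor it.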
> the person who has put in the work then has some rights to decide how that information is accessed and re-used
You do, but you give up those rights when you make the work public.
You think an author has any control over who their book gets lent to once somebody buys a copy? You think they get a share of profits when a CEO reads their book and they make a better decision? Of course not.
What you're asking for is unreasonable. It's not workable. Knowledge can't be owned. Once you put it out there, it's out there. We have copyright and patent protections in specific circumstances, but that's all. You don't own facts, no matter how much hard work and research they took to figure out.
Perhaps the better way forward here (for all) is some kind of central content archive that bots can pull from (Internet Archive?). But then there will be questions of how up to date the archive is compared to the source, and whether the source is showing the same thing it allows to be archived.
When I said "I think it's a reasonable stance" I meant as in "I think it's a reasonable stance for someone to take, though I don't personally hold that view".