Their test seems flawed:

> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

In this situation Perplexity should still be permitted to access information on the page they link to.

robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

https://www.robotstxt.org/faq/what.html

If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.

These are long-standing rules going back decades. You can verify this yourself by observing wget's behaviour. If you ask wget to fetch a single page, it doesn't look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then, if there are any links to follow, it will fetch robots.txt to determine whether it is permitted to fetch them.
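The same distinction is a few lines of Python with the standard library's urllib.robotparser. This is a rough sketch, not anyone's actual implementation: the bot name is a placeholder and link extraction is elided.

    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin

    def fetch(url):
        # A one-off fetch on a user's behalf: no robots.txt lookup at all.
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def crawl(start_url, user_agent="ExampleBot"):
        # Recursive fetching: now we are a robot, so consult robots.txt first.
        rp = urllib.robotparser.RobotFileParser(urljoin(start_url, "/robots.txt"))
        rp.read()
        seen, queue = set(), [start_url]
        while queue:
            url = queue.pop()
            if url in seen or not rp.can_fetch(user_agent, url):
                continue
            seen.add(url)
            page = fetch(url)
            # ... extract links from `page` and append them to `queue` ...
        return seen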

There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.

If Perplexity fetched only the page linked in the query, then Perplexity isn't doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. Cloudflare, however, don't clearly say that Perplexity used information beyond the first page. This is an important detail, because it determines whether Perplexity is following the robots.txt rules or not.



> > We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

Right, I'm confused why Cloudflare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access them. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".

Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.


> You asked the web-enabled AI to look at the domains.

Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.


> Perplexity crawled it anyway

No, they did not. Crawling = recursive fetching, which wasn't what was happening here.

But also, I don't think there is anything wrong with ignoring robots.txt. In fact, I believe it is discriminatory and people should ignore it. See: https://wiki.archiveteam.org/index.php/Robots.txt


> I don't think there is anything wrong with ignoring robots.txt

Neither do I, I just thought your reply was disingenuous.

> Crawling = recursive fetching

I do not find this convincing. I am OK with reserving the word "crawler" for recursive fetching only. But robots.txt is not only for excluding crawlers, and never has been. From the very beginning it was used to exclude specific automated clients, whether they fetch one page or many, and that is certainly how the vast majority of people think about it today.

Like I implied in my first comment, I have no problem with you saying you dislike robots.txt, but it is not reasonable to pretend the article is unclear in some way.


Yeah I'm not so sure about that.

If Perplexity are visiting that page on your behalf to give you some information, aren't doing anything else with it, and just throw the data away afterwards, then you may have a point. As a site owner, though, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.

But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.


It doesn’t matter what you do with it afterwards. Crawling is defined by recursively following links. If a user asks software about a specific page and it fetches it, then a human is operating that software, it’s not a crawler. You can’t just redefine “crawler” to mean “software that does things I don’t like”. It very specifically refers to software that recursively follows links.


Technically correct (the best kind of correct), but if I set a thousand users onto a website to each download a single page, and then feed the information they retrieve from that one page into my AI model, are those thousand users not performing the same function as a crawler, even though they are (technically) not one?

If it looks like a duck, quacks like a duck and surfs a website like a duck, then perhaps we should just consider it a duck...

Edit: I should also add that it does matter what you do with it afterwards, because it's not content that belongs to you; it belongs to someone else. The law in most jurisdictions quite rightly restricts what you can do with content you've come across. For personal, relatively ephemeral use, or fair quoting for news etc. - all good. For feeding to your AI - not all good.


> if I set a thousand users on to a website to each download a single page and then feed the information they retrieve from that one page into my AI model, then are those thousand users not performing the same function as a crawler, even though they are (technically) not one?

No.

robots.txt is designed to stop recursive fetching. It is not designed to stop AI companies from getting your content. Devising scenarios in which AI companies get your content without recursively fetching it is irrelevant to robots.txt because robots.txt is about recursively fetching.

If you try to use robots.txt to stop AI companies from accessing your content, then you will be disappointed because robots.txt is not designed to do that. It’s using the wrong tool for the job.


I don’t disagree with you about robots.txt… however, what _is_ the right tool for the job?


Auth. If you don't want content to be publicly accessible, don't make it public.
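As a rough sketch of what that looks like in practice, using only Python's standard library (the credentials are obviously placeholders): gate the content at the server instead of asking bots to stay away.

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Placeholder credentials, for illustration only.
    EXPECTED = "Basic " + base64.b64encode(b"user:secret").decode()

    class AuthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.headers.get("Authorization") != EXPECTED:
                # No valid credentials: the content is simply not served.
                self.send_response(401)
                self.send_header("WWW-Authenticate", 'Basic realm="private"')
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"only authenticated clients see this\n")

    HTTPServer(("", 8000), AuthHandler).serve_forever()

Unlike robots.txt, this doesn't depend on the client's good manners.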


Perplexity can then just ask the user to copy/paste the page content. That should be legal; it's what the user wants. The cases are equivalent.


I can’t copy/paste the content of a book or a movie or music; that’s piracy.

But when a trillion-dollar industry does it, it’s okay?


> But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.

If it is not recursive access, and is only one file, then it hopefully should be OK, especially if it is only used because you requested it. (There is a complication with HTML, where common browsers will usually also download CSS, JavaScript, WebAssembly, pictures, favicons (even if the web page does not declare any favicons), etc.; many "small web" formats deliberately avoid this.)

However, if they then use it to train their model without documenting that, it can be a problem, especially if the file being accessed is not intended to be public; but this is a different issue from the above.


Also relevant: Perplexity lies to the user when specifically asked about this. When the user asks if there is a robots.txt file for the domain, it falsely says there is not.

If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?


The article did not test whether the issue was specific to robots.txt or whether Perplexity simply fails to find other files as well.

There is a difference between summarizing data poorly and being unable to fetch the data to summarize in the first place.


> specific to robots.txt

> poor summarization of data

I'm not really addressing the issue raised in the article. I am noting that the LLM, when asked, is either lying to the user or making a statement that it does not know to be true (that there is no robots.txt). This is way beyond poor summarization.


I would say it's orthogonal to it. LLMs being unable to judge their capabilities is a separate issue to summarization quality.


I'm not critiquing its ability to judge its own capability, I am pointing out that it is providing false information to the user.


> If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it

That's not what Perplexity's own documentation[1] says, though:

"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity

Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."

[1] https://docs.perplexity.ai/guides/bots
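Per that page, a webmaster wanting to opt out of these user-triggered fetches would presumably write the usual stanza (shown for illustration):

    User-agent: Perplexity-User
    Disallow: /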


You left out the part that says Perplexity-User generally ignores robots.txt because it's used for user-requested actions.

> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.


Yes, it should stop recursive fetches. Furthermore, excessive unnecessary requests should also be stopped, although that is separate from robots.txt. At least, that is what I intended, and possibly you did too.


robots.txt isn't even designed to stop recursive fetches. It is designed to ask recursive fetchers nicely not to fetch. It comes from a time when site operators wanted their sites to be scraped by search engines, but not things like edit links and admin panels.
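The canonical early-web usage looked something like this (paths purely illustrative): let every crawler in, but keep it out of the machinery.

    User-agent: *
    Disallow: /admin/
    Disallow: /edit/
    Disallow: /cgi-bin/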



