> > We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
> You asked the web-enabled AI to look at the domains.
Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.
No, they did not. Crawling = recursive fetching, which wasn't what was happening here.
But also, I don't think there is anything wrong with ignoring robots.txt. In fact, I believe it is discriminatory and people should ignore it. See: https://wiki.archiveteam.org/index.php/Robots.txt
> I don't think there is anything wrong with ignoring robots.txt
Neither do I, I just thought your reply was disingenuous.
> Crawling = recursive fetching
I do not find this convincing. I am ok with using the word crawler for recursive fetching only. But robots.txt is not only for excluding crawlers and never has been. From the very beginning it was used to exclude specific automated clients, whether they only fetch one page or many, and that is certainly how the vast majority of people think about it today.
Like I implied in my first comment, I have no problem with you saying you dislike robots.txt, but it is not reasonable to pretend the article is unclear in some way.
Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.