> The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them, no luck. Cloudflare is the biggest enabler of this monopolistic behavior.

The Big Tech bots provide proven value to most sites. They have also, over the years, shown that they respect robots.txt, including its crawl speed directives.
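(For reference, the crawl speed knob is the non-standard Crawl-delay line in robots.txt; Googlebot ignores it and is throttled through Search Console instead, but several other large crawlers honor it. A minimal, purely illustrative sketch, with example bot names:)

    # Illustrative robots.txt only; Crawl-delay is non-standard and
    # not honored by every crawler (Googlebot ignores it, for example).
    User-agent: *
    Crawl-delay: 10    # ask polite bots to wait ~10 seconds between requests

    User-agent: GPTBot
    Disallow: /        # opt a specific crawler out entirely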

If you manage a site with millions of pages, and over the course of a couple of years you see dozens of new crawlers requesting at the same volume as Google, some of them crawling fast enough (and without any ramp-up period) to degrade services and wake your on-call engineers, and you can't identify any benefit to you from those crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?

Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.

Yes, this does make it much harder for non-incumbents, and I don't know what to do about that.



It's because those SEO bots keep crawling over and over, which Perplexity does not seem to do (given that the URLs are user-requested). Those are different cases, and robots.txt is only about the former. Cloudflare in this case is not doing "DDoS protection", because I presume Perplexity does not constantly refetch, crawl, or DDoS the website. (If Perplexity does do those things, then they are guilty.)

https://www.robotstxt.org/faq/what.html

I wonder if Cloudflare users explicitly have to allow Google, or if it's pre-allowed for them when setting up Cloudflare.

Despite what Cloudflare wants us to think here, the web was always meant to be an open information network, and spam protection should not fundamentally change that characteristic.


I believe that AI crawlers are the main thing currently blocked by default when you enroll a new site. No traditional crawlers are blocked; it's not that the big incumbents are allow-listed. And I think that clearly marked "user request" agents like ChatGPT-User are not blocked by default.

But at the end of the day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
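For instance, in nginx a rough sketch looks something like the following, placed inside the relevant server block (the bot tokens are just examples; you'd tune the regex to whatever user-agents show up in your logs):

    # Reject declared crawlers by User-Agent. This only works for bots
    # that identify themselves consistently; it does nothing against spoofers.
    if ($http_user_agent ~* "(GPTBot|CCBot|Bytespider)") {
        return 403;
    }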


> The Big Tech bots provide proven value to most sites.

They provide value to their own companies. If you get some value from them, it's just a side effect.


It goes without saying that they are profit-oriented. The point is that they historically offered a clear trade: let us crawl you, and we will refer traffic to you. An AI crawler does not provide clear value back. An AI user-request agent might or might not provide enough clear value back for sites to want to participate. (The same goes for the search incumbents if they go all-in on LLM search results and don't refer much traffic back.)



