
> it is built on trust.

This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks even for a single web request. The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.

That said, why does perplexity even need to crawl websites? I thought they used third-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.

Also the "perplexity bots" arent crawling websites, they fetch URLs that the users explicitly asked. This shouldnt count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page and basically a shortcut for copy/pasting the content



We're moving progressively in the direction of "pages can't be served for free anymore". Which I don't think is a problem; in fact, I think it's something we should have addressed a long time ago.

Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.

And the thing is... people already pay for internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.

I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.

It's also easy to rate-limit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 Mbps, the ISP will stall the request for 8 seconds and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
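
To make the accounting concrete, here's a minimal sketch of the arithmetic under that scheme (the function and units are just an illustration of the idea, not a real ISP mechanism):

    # Hypothetical sketch: the ISP treats a page's declared "price tag" as if it
    # were extra bytes on the user's link, and delays the request accordingly.
    def stall_seconds(price_tag_bytes: int, link_speed_mbps: float) -> float:
        """Seconds needed to 'transfer' the price tag at the user's link speed."""
        bits = price_tag_bytes * 8
        return bits / (link_speed_mbps * 1_000_000)

    # The example above: a 100 MB price tag on a 100 Mbps connection.
    print(stall_seconds(100 * 1_000_000, 100))  # -> 8.0 seconds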


> We're moving progressively in the direction of "pages can't be served for free anymore". Which I don't think is a problem; in fact, I think it's something we should have addressed a long time ago.

I agree, but your idea below that is overly complicated. You can't micro-transact the whole internet.

That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar, where you have to pay admission and sign liability waivers to even walk on the sidewalk outside. It's not a true solution.


> You can't micro-transact the whole internet.

I agree that end-users cannot handle micro transactions across the whole internet. That said, I would like to point out that most of the internet is blanketed in ads and ads involve tons of tiny quick auctions and micro transactions that occur on each page load.

It is totally possible for a system to evolve involving tons of tiny transactions across page loads.


You could argue that the suggested system is actually much simpler than the one we currently have for the sites that are "free", aka funded with ads.

The lengths Meta and the like go to in order to maximize clickthroughs...


Remember Flattr?


The proposed solution has invisible UX because it's layered into existing metered billing.

And the whole internet is already micro-transacted! Every page with ads is running a bidding war and spending money on your attention. The only person not allowed to bid is you!


> You can't micro-transact the whole internet.

Clearly you don't have the lobes for business /s


> We're moving progressively in the direction of "pages can't be served for free anymore". Which I don't think is a problem; in fact, I think it's something we should have addressed a long time ago.

But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.

It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I'd know what I'm getting into.


> They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.

I'm old enough to remember when that was grounds for getting your site removed from Google results - "cloaking" was against the rules. You couldn't return one result for Googlebot, and another for humans.

No idea when they stopped doing that, but they obviously have let go of that principle.


I remember that too, along with high-profile punishments for sites that were keyword stuffing (IIRC a couple of decades ago BMW were completely unlisted for a time for this reason).

I think it died largely because it became impossible to police with any reliability, and being strict about it would remove too much from Google's index, because many sites are not easily indexable without providing a “this is the version without all the extra round-trips for ad impressions and maybe a login needed” variant to common search engines.

Applying the rule strictly would mean that sites implementing PoW tricks like Anubis to reduce unwanted bot traffic would not be included in the index if they serve to Google without the PoW step.

I can't say I like that this has been legitimised even for the (arguably more common) deliberate bait & switch tricks, but (I think) I understand why the rule was allowed to slide.


A scary observation in light of another front page article right now: https://news.ycombinator.com/item?id=44783566

If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".


“Free” could have a number of meanings here. Free to the viewer, free to the hoster, free to the creator, etc…

That content can't be served entirely for free doesn't mean that all content will require payment, and so be subject to issues with payment processors; it just means that some things may gravitate back to a model where it costs a small amount to host something (i.e. pay for home internet and host bits off that, or have a VPS out there that runs tools and costs a few $ per year or month). I pay for resources to host my bits & bobs instead of relying on services provided in exchange for stalking the people looking at them; this is free for the viewer, as they aren't even paying indirectly.

Most things are paid for anyway, even if neither the person hosting them nor the person looking at them is paying directly: adtech arseholes give services to people hosting content in exchange for the ability to stalk us and attempt to divert our attention. Very few sites/apps, other than play/hobby ones like mine or those from more actively privacy-focused types, are free of that.


That's already a deep problem for all of society. If we don't want that to be an ongoing issue, we need to make sure money is a neutral infrastructure.

It doesn't just apply to the web, it applies to literally everything that we spend money on via a third party service. Which is... most everything these days.


My first reaction: This solution would basically kill what little remaining fun there is to be had browsing the Internet and all but assure no new sites/smaller players will ever see traffic.

Curious to hear other perspectives here. Maybe I’m overreacting/misunderstanding.


Depending on the implementation (a big if), it would help smaller websites, because it would make hosting much cheaper. ISPs don’t choose what sites users visit, only what they pay. As long as the ISP isn’t giving significant discounts for visiting big sites (just charging a fixed rate per byte downloaded and uploaded) and charging something reasonable, visiting a small site would be so cheap (a few cents at most, but more likely <1 cent) that users won’t weigh the cost at all.


But users still depend on major sites like Google [insert service] and will prioritize their usage accordingly, like limited minutes and texts back in the day, right?


Networking is so cheap, unless ISPs drastically inflate their price, users won’t care.

The average American allegedly* downloads 650-700 GB/month, or >20 GB/day. 10 MB is more than enough for a webpage (honestly, 1 MB is usually enough), so on average ISPs serve over 2,000 webpages' worth of data per day. And the average internet plan is allegedly** $73/month, or <$2.50/day. So $2.50 gets you over 2,000 indie sites.

That’s cheap enough, wrapped in a monthly bill, users won’t even pay attention to what sites they visit. The only people hurt by an ideal (granted, ideal) implementation are those who abuse fixed rates and download unreasonable amounts of data, like web crawlers who visit the same page seconds apart for many pages in parallel.

* https://www.astound.com/learn/internet/average-internet-data...

** https://www.nerdwallet.com/article/finance/how-much-is-inter...
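
Working through that arithmetic (rounded figures from the sources above; rough estimates only):

    # Back-of-the-envelope numbers from the figures cited above.
    monthly_data_gb = 650        # ~650-700 GB/month per household
    monthly_bill_usd = 73        # average internet plan
    page_size_mb = 10            # generous allowance for one webpage

    pages_per_month = monthly_data_gb * 1000 / page_size_mb  # ~65,000 pages (~2,000+/day)
    cost_per_page = monthly_bill_usd / pages_per_month        # ~$0.001 per page
    print(f"~{pages_per_month:.0f} pages/month, ~${cost_per_page:.4f} per page")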


Wait, so the ISPs go from taking $73/user home today to taking $0/user home tomorrow under this plan?


Yeah, same reaction here. There's no world in which ISPs would agree to this, and even if they did, I don't want to add them to my list of utilities I have to regularly fight with over claimed vs. actual usage, like I do with my power/water/gas companies.


If site operators can’t afford the costs of keeping sites up in the face of AI scraping, the new/smaller sites are gone anyway.


Maybe not but we are not realistically in an either/or scenario here.


Why would I pay for a page if I don't know if the content is what I asked for? How much are you going to pay? How much are you going to charge? This will end up in SEO hell, especially with AI-generated pages farming paid clicks.


Hah, I still remember the old “solving the internet with hate” idea from Zed Shaw in the glory days of Ruby on Rails.

https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...

I do believe we will end up there eventually; with emerging tech like Brazil’s and India’s payment architectures, it should be possible in the coming decades.


I think value is not proportional to bytes - an AI only needs to read a page once to add it to its model, and can then serve the effectively cached data many times.


402 Payment Required

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.

I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.

https://docs.lightning.engineering/the-lightning-network/l40...

https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
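
There's no standardized negotiation flow today, but a client-side sketch of what automatic 402 handling might look like (the X-Payment-* header names and the wallet stub are made up for illustration, not part of any spec):

    import requests

    def pay_invoice(invoice: str) -> str:
        # Stub: a real client would hand the invoice to a wallet (e.g. a Lightning
        # node) and get back a proof of payment such as a preimage or receipt.
        raise NotImplementedError("wire up a wallet here")

    def fetch_with_payment(url: str) -> requests.Response:
        """Hypothetical 402 flow: fetch, pay the quoted invoice, retry with proof."""
        resp = requests.get(url)
        if resp.status_code != 402:
            return resp
        invoice = resp.headers.get("X-Payment-Invoice")  # imaginary header
        proof = pay_invoice(invoice)
        return requests.get(url, headers={"X-Payment-Proof": proof})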


I get your thinking, but x.com is proof that simply making users pay (quite a lot) does not eliminate bots.

The amount of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.

As long as spamming and DDoSing pays more than whatever the request costs, it will keep existing.


Your theory does not match the practice of Cloudflare.

Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.

The so-called "threats" are identified in users that may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).

So certainly Cloudflare does not meter the consumed resources.

Moreover, Cloudflare preemptively annoys any user who accesses a site for the first time, having never consumed any resources, perhaps based on irrational profiling of the browser, operating system, and geographical location used.


As time passes I’m more certain in the belief that the internet will end up being a licensed system with insanely high barriers to entry which will stop your average dev from even being able to afford deploying a hobby project on it.

Your idea of micro-transacting web requests would play into it and probably end up with a system like Netflix, where your ISP has access to a set of content creators to whom they grant ‘unlimited’ access as part of the service fee.

I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.

Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime.


Wouldn't this lead to pirated page clones where the customer pays less for same-ish content, and then less again, all the way down to essentially free?

Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))

But it's an interesting idea and thought experiment.


That’s fine. The point for website owners isn’t to make money, it’s to not spend money hosting (or more specifically, to pay a small fixed rate for hosting). They want people to see the content; if someone makes the content more accessible, that’s a good thing.


You ignore the issue of motivation. Most web content exists because someone wants to make money on it. If the content creator can't do that, they will stop producing content.

These AI web crawlers (Google, Perplexity, etc) are self-cannibalizing robots. They eat the goose that laid the golden egg for breakfast, and lose money doing it most of the time.

If something isn't done to incentivize content creators again, eventually there will be only walled gardens and obsolete content left for the cannibals.


AFAIK, currently creators get money while not charging users because of ads.

While I don’t blame creators for using ads now, I don’t think they’re a long-term solution. Ads are already blocked when people visit the site with ad blockers, which are becoming more popular. Obvious sponsored content may be blocked with the ads, and non-obvious sponsored content turns these “creators” into “shills” who are inauthentic and untrustworthy. Even without Google summaries, ad revenue may decrease over time as advertisers realize they aren’t effective or want more profit; even if it doesn’t, it’s my personal opinion that society should decrease the overall amount of ads.

Not everyone creates only for money; the best create for just enough money to sustain themselves. A long-term solution is to expand art funding (e.g. creators apply for grants with their ideas and, if accepted, get paid a fixed rate to execute them) or UBI. Then media can be redistributed, remixed, etc. without impacting creators’ finances.


Pretty sure this "most" motivation means it's not a golden egg. It's SEO slop.

If only the one in ten thousand with something to share is left standing to share it, with no manufactured content, that's a fine thing.


Strongly agree with this armchair POV. Btw it doesn't cost much to host markdown.


Or, flip this, don't expect to get paid for pamphleteering?


The reason why that didn’t work was because regulations made micropayments too expensive, and the government wants it that way to keep control over the financial system.


Can't agree more; Cloudflare is destroying the internet. We've entered the equivalent of when having McAfee antivirus was worse than having an actual virus because it slowed down your computer too much. These user-hostile solutions have taken us back to dial-up-era page loading speeds for many sites; it's absurd that anyone thinks this is a service worth paying for.


So server owners are just supposed to bend over and take all the abuse they get from shitty bots and DDOS attacks and do nothing?

That seems pretty unreasonable.


What's unreasonable is using incompetent companies like Cloudflare, which are absolutely incapable of distinguishing between normal usage of a web site by humans and DDoS attacks or accesses done by bots.

Only this week I have witnessed several dozen cases where Cloudflare has blocked normal web page accesses without any possible correct reason, and this is besides the normal annoyance of slowing every single access to any page on their "protected" sites with a bot check popup window.


I don’t know, seems like it was working as intended to me.


Therefore "working as intended" for you means wasting the time of many people around the world, who cannot be considered as "threats" by any definition and who certainly do not waste any resources on the "protected" sites, because they are using the sites exactly for their intended purpose.

It is true that this has never happened before, but this week Cloudflare has frequently blocked my access to a site where I am a paid subscriber, and where there is no doubt that my access pattern matches exactly what that site must have been designed for, i.e. the site hosts a database and I make a few queries on it each day, less than a dozen, spread over the entire day, where each query takes a couple of seconds at most.

Whoever has implemented a "threat" detection algorithm that decides that such a usage is a "threat" and not normal usage, must be completely incompetent.


No, they're supposed to allow scraping and information aggregation. That's the essence of the web: it's all text, crawlable, machine-readable (sort of) and parseable. Feel free to block DDoSes.


Feel free to crawl paywalled sites and republish them with discoverable links.

Also after starting the crawl, you can read about Aaron Swartz while waiting.


There is a difference between blocking abusive behavior and blocking all bots. No one really cared about bot scraping to this degree before AI scraping for training purposes became a concern. This is fearmongering by Cloudflare for website maintainers who haven't figured out how to adapt to the AI era so they'll buy more Cloudflare.


> No one really cared about bot scraping to this degree before AI scraping for training purposes became a concern. This is fearmongering by Cloudflare for website maintainers who haven't figured out how to adapt to the AI era so they'll buy more Cloudflare.

I think this is an overly harsh take. I run a fairly niche website which collates some info which isn't available anywhere else on the internet. As it happens, I don't mind companies scraping the content, but I could totally understand if someone didn't want a company profiting from their work in that way. No one is under an obligation to provide a free service to AI companies.


No, they're supposed to rally together and fight for better laws and enforcement of those laws. Which is, arguably, exactly what they've done just in a way that you and I don't like.


What kind of laws and enforcement would stop a foreign actor from effectively DDoSing your site? What if the actor has (illegally) hacked tech-illiterate users so they have domestic residential IP addresses?


> What kind of laws and enforcement would stop a foreign actor from effectively DDoSing your site?

The kind of laws and enforcement that would block that entire country from the internet if it doesn't get its criminal act together.


Ethics-free organizations and individuals like Perplexity are why Cloudflare exists. If you have a better way to solve the problems that they solve, the marketplace would reward you handsomely.


Do you think users shouldn't get to have user agents or that "content farm ads scaffold" as a business model has a right to be viable? Forcing users to reward either stance seems unsustainable.


> Do you think users shouldn't get to have user agents or that "content farm ads scaffold" as a business model has a right to be viable?

Users should get to have authenticated, anonymous proxy user agents. Because companies like Perplexity just ignore `robots.txt`, maybe something like Private Access Tokens (PATs) with a new class for autonomous agents could be a solution for this.

By "content farm ads scaffold", I'm not sure if you had Perplexity and their ads business in mind, or those crappy little single-serving garbage sites. In any case, they shouldn't be treated differently. I have no problem with the business model, other than that the scam only works because it's currently trivial to parasitically strip-mine and monetize other people's IP.


While the existence of Perplexity may justify the existence of Cloudflare, it does not justify Cloudflare's incompetence: it is unable to distinguish accesses done by Perplexity and the like from normal accesses done by humans who use those sites exactly for the purpose they exist, and there can be no excuse for its failure to recognize this.


Cloudflare operates with biased logic like "why are we shooting at all men who have long beards?" Really, you want the terrorists to kill all your kids then???

"Why are we cutting down all the trees in the park?" Really, you want trees to fall on your kids and crush them to death?? What's wrong with saving kids??

"Why are we shutting off the water in the fountains in town?" Really, you want your kids to drown in the fountains or drink contaminated water??


In the previous years, I did not have many problems with Cloudflare.

However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives much more false positives than before.

For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".

The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.

Also, during the last few days I have been blocked from accessing ResearchGate. I may happen to hit a few times per day some page on the ResearchGate site, while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm is used by Cloudflare, that it declares that such normal usage is a "threat".

The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, i.e. Vivaldi or Chrome, I am not blocked.

I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.

In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.

I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.


I'm running into this as well (Firefox Debian). I suspect it may be Firefox's tracker blocking combined with the older extended support release.

Sometimes just refreshing the page seems to work too. Disabling the tracker blocking allows cross-site requests to Cloudflare endpoints which seems to be enough. Maybe worth allow-listing CF domains, but I didn't look into if that is possible yet.


Yes, exactly. Cloudflare is just bad tech where the remedy is worse than the disease. I am using a VPN and I get endless loops of "please verify you are not a robot, this may take a few seconds" (minutes, hours...)... so basically Cloudflare's tech must have this primitive code:

is_using_vpn? -> bad,abuse,ddos

Thanks, Cloudflare, for saving our internet by destroying it...


> when having McAfee antivirus was worse than having an actual virus because it slowed down your computer too much

This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.

You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.

I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.


> The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior

Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with them) and don't want your bot because they view you as, most likely, parasitic.


They didn't seem to mind when OpenAI et al. took all their content to train LLMs, when they were still parasites that didn't have a symbiotic relationship. This thinking is kind of too pro-monopolist for me.


Pretty sure they DID mind that. It's what the whole post is about.


That’s a good thing. You want an LLM to know about the product or service you are selling and promote it to its users. Getting into the training data is the new SEO.


> The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall

I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).

Each website is freely opting-into it. No one was forced. Why not ask yourself why that is?


Do you think that every well-meaning GET request should be treated the same way as a distributed attack? The latter is the reason people use CF, not the former.


The line can be extremely blurry (that's putting it mildly), and "the latter" is not the only reason people use CF (actually, I wouldn't be surprised at all if it wasn't even the biggest reason).


The reason people use Cloudflare is that they provide a free CDN, and we have at least 10 years of content marketing out there telling aspiring bloggers that if they use a CDN in front of their website, their shitty WordPress site hosted on shady shared hosting will become fast.


well they aren't wrong


How does one tell a "well-meaning" request from an attack?


By the volume, distribution, and parameters (get and post body) of the requests.
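
As a toy example of the volume part only (thresholds made up; real providers combine many more signals):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS = 50            # arbitrary threshold for the example

    recent = defaultdict(deque)  # ip -> timestamps of recent requests

    def looks_abusive(ip, now=None):
        """Toy heuristic: too many requests from one IP in a short window."""
        now = time.time() if now is None else now
        q = recent[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_REQUESTS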


> The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior

The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.

If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?

Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.

Yes this does make it much harder for non-incumbents, and I don't know what to do about that.


It's because those SEO bots keep crawling over and over, which Perplexity does not seem to do (considering that the URLs are user-requested). Those are different cases, and robots.txt is only about the former. Cloudflare in this case is not doing "DDoS protection", because I presume Perplexity does not constantly refetch or crawl or DDoS the website (if Perplexity does those things, then they are guilty).

https://www.robotstxt.org/faq/what.html

I wonder if Cloudflare users explicitly have to allow Google or if it's pre-allowed for them when setting up Cloudflare.

Despite what Cloudflare wants us to think here, the web was always meant to be an open information network, and spam protection should not fundamentally change that characteristic.


I believe that AI crawlers are the main thing that is currently blocked by default when you enroll a new site. No traditional crawlers are blocked, it's not that the big incumbents are allow-listed. And I think that clearly marked "user request" agents like ChatGPT-User are not blocked by default.

But at the end of the day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
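
For well-behaved bots with a consistent user-agent, the blocking really is trivial; a minimal sketch as Python WSGI middleware (the substrings are just example bot names):

    BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot", "PerplexityBot")  # example names

    class BlockBotsMiddleware:
        """Reject requests whose User-Agent contains a listed bot string."""
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bot in ua for bot in BLOCKED_AGENT_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Bots are not welcome here.\n"]
            return self.app(environ, start_response)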


> The Big Tech bots provide proven value to most sites.

They provide value for their companies. If you get some value from them, it's just a side effect.


It goes without saying that they are profit-oriented. The point is that they historically offered a clear trade: let us crawl you, and we will refer traffic to you. An AI crawler does not provide clear value back. An AI user request agent might or might not provide enough clear value back for sites to want to participate. (Same goes for the search incumbents if they go all-in on LLM search results and don't refer much traffic back).


Here's how perplexity works:

1) It takes your query, and given the complexity might expand it to several search queries using an LLM. ("rephrasing")

2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).

3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.

#3 is the problem, especially because SEO has really made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever they can in the game of sidestepping the sites' wishes. They are a bad actor.

EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
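
Putting steps 1-3 into a sketch (the helpers are stand-ins for LLM and search-API calls; this is how I understand the flow, not Perplexity's actual code):

    from typing import List

    def rephrase(query: str) -> List[str]:
        return [query]   # stub: an LLM would expand the query into subqueries

    def web_search(q: str) -> List[str]:
        return []        # stub: query a search index (Bing, Brave, in-house, ...)

    def fetch(url: str) -> str:
        return ""        # stub: download the full document for context

    def summarize(query: str, docs: List[str]) -> str:
        return ""        # stub: an LLM writes the answer from the fetched docs

    def answer(query: str) -> str:
        subqueries = rephrase(query)                                # 1) rephrasing
        urls = [u for q in subqueries for u in web_search(q)][:5]   # 2) pick sources
        docs = [fetch(u) for u in urls]                             # 3) full documents
        return summarize(query, docs)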


What’s wrong with it downloading documents when the user asks it to? My browser also downloads whole documents and sometimes even prefetches documents I haven’t even clicked on yet. Toss in an adblocker or reader mode and my browser also strips all the ads.

Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?


When Google sends people to a review website, 30% of users might have an adblocker, but 70% don't. And even those with adblockers might click an affiliate link if they found the review particularly helpful.

When ChatGPT reads a review website, though? Zero ad clicks, zero affiliate links.


So if enough people used adblockers that would make them bad too? It’s just an issue of numbers?

Brave blocks ads by default. Tools like Pocket and reader mode disable ads.

Why is it okay for some user agents but not others?


There’s nothing wrong with downloading documents. I do this in my personal search app. But if you are hammering a site that wants you to calm down, or bypassing robots.txt, that’s wrong.


robots.txt is for bots and I am not one though. As a user I can access anything regardless of it being blocked to bots. There are other mechanisms like status codes to rate limit or authenticate if that is an issue.


I'm talking about Perplexity's behavior. Perhaps there's a point of contention on Perplexity downloading a document on a person's behalf. The way I view it: if there is a service running that does it for multiple people, then it's a bot.


Perplexity makes requests on behalf of its users. I would argue that’s only illegitimate if the combined volume of the requests exceeds what the users would do by an order of magnitude or two. Maybe that’s what’s happening.

But “for multiple people” isn’t an argument IMO, since each of those people could run a separate service doing the same. Using the same service, on the contrary, provides an opportunity to reduce the request volume by caching.


> This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks even for a single web request.

Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.


You’re paying Cloudflare to not get DDoS-attacked or swamped by illegitimate requests. GP is implying that Cloudflare could do a better job of not blocking legitimate, benign requests.


Then we're all operating with very different definitions of legitimate or benign!

I've only ever seen a Cloudflare interstitial when viewing a page with my VPN on, for example -- something I'm happy about as a site owner and accept quite willingly as a VPN user knowing the kinds of abuse that occur over VPN.


> The internet we knew was open and not trusted ... monopolistic behavior

Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google... It's helping its paying customers, and those are the only services those customers want to let through.

Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti-patterns on purpose, and want to make it hard to unsubscribe, cancel, get a refund, or do a return.

AI that is useful to people will fail, for the same reason that no one has great public APIs any more: every public company's real customers are its stockholders, and the consumers are simply a source of revenue. One that is modeled, marketed to, and manipulated, all in the name of returns on investment.


I disagree about AI agents, at least those that work by automating a web browser that a human could also use. I suppose Google's proposal to add remote attestation to Chrome might make it a little harder, but that seems to be dead for now (and I hope forever).


As agents become more useful, the monetization model will shift to something... that we haven't thought of yet.


> why does perplexity even need to crawl websites?

I was recently working on a project where I needed to find the published date for a lot of article links, and this came in handy. Not sure if it's changed recently, but asking ChatGPT, Gemini, etc. didn't work; they said they don't have access to current websites. However, Perplexity fetched the website in real time and gave me the info I needed.

I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.


As a website owner I definitely want the capability to allow and block certain crawlers. If I say I don’t want crawlers from Perplexity, they should respect that. This sneaky evasion just highlights that the company is not to be trusted, and I would definitely pay any hosting provider that helps me enforce blocking parasitic companies like Perplexity.


Don't they need a search index?


> the "perplexity bots" arent crawling websites, they fetch URLs that the users explicitly asked. This shouldnt count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page and basically a shortcut for copy/pasting the content

You say "shouldn't" here, but why?

There seems to be a fundamental conflict between two groups who each assert they have "rights":

* Content consumers claim the right to use whatever software they want to consume content.

* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).

These two "rights" are in direct conflict.

The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.

But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?

Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?

(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)


The web browsers that the AI companies are about to ship will make requests that are indistinguishable from user requests. The ship has sailed on trying to save minimization.


We will be able to distinguish them.


"Creators" need to eat, OK, but there's no right to get paid to paste yesterday's recycled newspapers on my laptop screen. Making that unprofitable seems incredibly good for by and large everyone.

It'd likely be a fantastic good if "content creators" stopped being able to eat from the slop they shovel. In the meantime, the smarter the tools that let folks never encounter that form of "content", the more they will pay for them.

There remain legitimate information creation or information discovery activities that nobody used to call "content". One can tell which they are by whether they have names pre-existing SEO, like "research" or "journalism" or "creative writing".

Ad-scaffolding, what the word "content" came to mean, costs money to make, ideally less than the ads it provides a place for generate. This simple equation means the whole ecosystem, together with the technology attempting to perpetuate it as viable, is an ouroboros, eating its own effluvia.

It is, I would argue, undetermined that advertising-driven content as a business model has a "right" to exist in today's form, rather than any number of other business models that sufficed for millennia of information and artistry before.

Today LLMs serve both the generation of additional literally brain-less content, and the sifting of such from information worth using. Both sides are up in arms, but in the long run, it sure seems some other form of information origination and creativity is likely to serve everyone better.


I crawl 3000 RSS feeds once a week. Let me tell you, Cloudflare sucks. What business is it of theirs to block something that is meant to be accessed by everyone, like an RSS feed? FU Cloudflare.


That's not Cloudflare's fault, that's the website owner's fault.

If they want the RSS feeds to be accessible then they should configure it to allow those requests.


Websites, and any business really, have the right to impose terms of use and deny service.

Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.

"And those LLMs didn't ask anyones permission to crawl the entire 'net."

False: OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of money to Reddit.

You either side with the law or with criminals.


Is that also how e.g. Anthropic trained on LibGen?

You can't even say the same thing about OpenAI, because we don't know the corpus they train their models on.


Ironically, Cloudflare is also the reason OpenAI agent mode with web use isn’t very usable right now. Every second time I asked it to do a mundane task like checking me in for a flight, it couldn’t because of Cloudflare.


What's ironic about this?

We're seeing many posts about site owners getting hit by millions of requests because of LLMs; we can't blame Cloudflare for this, because it's literally a necessary evil.


Ask yourself why so many content hosting platforms utilize Cloudflare's services, and then contrast that perspective with your posted one. It might enlighten you a bit to think about that for a second.


I could not keep my website up without Cloudflare, given the level of bot and AI crawlers hammering things. I try to use challenges whenever possible, but sometimes I have to block entire AS blocks.


Spam and DDOS are serious problems, it's not fair to suggest Cloudflare is just doing this to gatekeep the Internet for its own sake.


It's definitely not a DDoS when it's a single HTTP request per year. I don't know if they do it on purpose, but the fact is none of the big tech crawlers are limited.


This is mostly attributable to the fact that traffic is essentially anonymous, so the source IP address is the best that a service can do if it's trying to protect an endpoint.


OVH does a good job with DDoS.


I'm sorry, but that's some crazy take.

Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust that Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.


Agree to disagree, but:

Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini has the advantage of this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.


> That said, why does perplexity even need to crawl websites?

So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.

They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.



