
Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

If you want to gatekeep your content, use authentication.

Robots.txt is not a technical solution, it's a social nicety.
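To make that concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration. The point is that the check runs on the client: a crawler has to choose to ask, and nothing on the server enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url.

    This is purely advisory: a crawler that never calls this function
    can still request the URL, and the server will still answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler calls something like `is_allowed` before every request; an impolite one just skips the call.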

Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.

On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
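One possible reading of "CRC mechanisms with offline caching", as a rough sketch only: the client keeps a local copy of each page along with its CRC32 and re-downloads the body only when the advertised checksum changes. The `ChecksumCache` class and the idea that a server would publish a checksum per URL are assumptions for illustration, not an existing protocol.

```python
import zlib
from typing import Dict, Optional, Tuple


class ChecksumCache:
    """Hypothetical client-side cache keyed by URL, validated by CRC32."""

    def __init__(self) -> None:
        # url -> (crc32 of body, body)
        self._store: Dict[str, Tuple[int, bytes]] = {}

    def put(self, url: str, body: bytes) -> int:
        """Store a body locally and return its CRC32."""
        crc = zlib.crc32(body)
        self._store[url] = (crc, body)
        return crc

    def get(self, url: str) -> Optional[bytes]:
        """Return the cached body, or None if the URL was never stored."""
        entry = self._store.get(url)
        return entry[1] if entry else None

    def needs_update(self, url: str, remote_crc: int) -> bool:
        """True if nothing is cached or the advertised checksum differs."""
        entry = self._store.get(url)
        return entry is None or entry[0] != remote_crc


if __name__ == "__main__":
    cache = ChecksumCache()
    crc = cache.put("https://example.com/article", b"version 1")
    print(cache.needs_update("https://example.com/article", crc))
    print(cache.needs_update("https://example.com/article",
                             zlib.crc32(b"version 2")))
```

With something like this, repeat visits cost one checksum comparison instead of a full page load, and the cached copy stays under the user's control.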

Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.



Would highly recommend listening to the latest Hard Fork podcast with Matthew Prince (CEO, Cloudflare): https://www.nytimes.com/2025/08/01/podcasts/hardfork-age-res...

I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.


I don't think criticizing the business practices of Cloudflare does the work of excusing Perplexity's disregard for norms.


> Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

> If you want to gatekeep your content, use authentication.

Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?


Yes, I believe that's basically what https://serpapi.com/ is doing.


There are many APIs that scrape Google, but I don't know of any search engine that scrapes and rebrands Google results. Kagi.com pays Google for search results. Either Kagi has a better deal than the SERP APIs (which I doubt) or this is not legal.


I tried to scrape Google results once using an automated process, and quickly got banned from all of Google. They banned my IP address completely. It kind of really sucked for a while, until my ISP assigned a new IP address. Funny enough, this was about 15 years ago and I was exploring developing something very similar to what LLMs are today.


I think OP based this on an old case about what you can do with data from Facebook vs. LinkedIn, depending on whether you need to be logged in to get it. I don't think that's relevant to the scraping in this case. Perplexity is clearly in the wrong here.


> Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.

How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?


They exist to optimize the internet for the platforms and big providers. Little people get screwed, with no legal recourse. They actively and explicitly degrade the internet, acting as censors and gatekeepers on behalf of bad-faith actors, without legal authority or oversight.

They allow the big platforms to pay for special access. If you want to run a scraper, however, you're not allowed, even though nothing in the internet standards and protocols, or in the laws governing network access and the free-communication responsibilities of ISPs and service providers, grants any party involved in Cloudflare's blocking the authority to do so.

It's equivalent to a private company deciding who, when, and how you can call from your phone, based on the interests and payments of people who profit from listening to your calls. What we have is not normal or good, unless you're exploiting the users of websites for profit and influence.


Most users of Cloudflare assume it's for spam control. They don't realize that they are blocking their content for everyone except the FAANGs.


Well if it continues like this, that's what will happen. And I dread that future.

No one will care to share anything for free anymore, because it's AI companies profiting off their hard work. And there's no way to prevent that from happening, because these crawlers don't identify themselves.


Cloudflare is growing more and more vile with each passing year. Half the tools they're building now should never have existed in the first place.


[flagged]


> Eat a dick.

Could you please stop breaking the HN guidelines? Your account has unfortunately done that repeatedly, and we've asked you several times to stop.

Your comment would be just fine without that bit.

https://news.ycombinator.com/newsguidelines.html


This is 100% incorrect.


I think Cloudflare is setting themselves up to get sued.

(IANAL) tortious interference



