
I think it's an issue of scale.

The next step in your progression here might be:

If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?



Maybe we should just institutionalize and explicitly legalize the Internet Archive and Archive Team. Then I could download a complete and reasonably current crawl of domain X from the IA, and that way no additional costs are incurred by domain X.

But of course, most website publishers would hate that, because they don't want people to access their content; they want people to look at the ads that pay them. That's why, to them, the IA crawling their website is akin to stealing: it takes away some of their ad impressions.


https://commoncrawl.org/

>Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.


The problem is that many websites and domains are missing from it.
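If you want to check coverage for a particular domain, the Common Crawl CDX index server can be queried per URL. A rough, untested sketch in Python - the crawl label is just an example, current ones are listed at https://index.commoncrawl.org/:

    # Ask the Common Crawl CDX index which captures exist for a domain.
    # The crawl label below is only an example; look up a current one.
    import json
    import urllib.error
    import urllib.parse
    import urllib.request

    def cc_captures(domain, crawl="CC-MAIN-2024-33"):
        params = urllib.parse.urlencode({"url": domain + "/*", "output": "json"})
        url = "https://index.commoncrawl.org/" + crawl + "-index?" + params
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                # one JSON record per line, one line per capture
                return [json.loads(line) for line in resp.read().splitlines()]
        except urllib.error.HTTPError as e:
            if e.code == 404:  # the index appears to return 404 when nothing matches
                return []
            raise

    print(len(cc_captures("example.com")))

An empty result is exactly the missing-domain case I mean.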


I have mixed feelings on this.

Many websites (especially the bigger ones) are just businesses. They pay people to produce content, hopefully make enough ad revenue to make a profit, and repeat. Anything that reproduces their content and steals their views has a direct effect on their income and their ability to stay in business.

Maybe IA should have a way for websites to register to collect payment for lost views or something. I think it's negligible now (there are likely no websites losing meaningful revenue from people using IA instead), but it might be a way to get better buy-in if it were institutionalized.


If magazines and newspapers could once be funded by native ads, so can websites. The spying industry doesn't want you to know this, but ads work without spying too - just look at all the IRL billboards still around.


Thanks for pointing this out! This is too often ignored!


I never said anything about spying.

Magazines and newspapers were able to be funded by native ads because you couldn't auto-remove ads from their printed media and nobody could clone their content and give it away for free.


Newspapers sell information. Information is now trivial to copy and send across the globe, when 50 years ago it wasn't. And you're wrong about "nobody could clone their content": they absolutely could; different editions were pressed throughout the day (morning, lunch, and evening papers) at the peak of print media. The barrier to entry used to be a printing press; now it's just an internet connection. Print media has a hard time accepting that.


You can't remove ads that are part of a site's native HTML either - well, not easily, not without an AI determining what is an ad based on the content itself. The few ads I see despite uBlock are like that: something the website author themselves included, rather than pulled in from a different domain.

And those ads don't spy. They tend to be a jpg that functions as a link. That's why I mentioned spying.


If ads were more respectful I wouldn’t have to remove them. Alas they can’t help themselves and so I do.

When ads were far less invasive, I had a lot more tolerance.

Now they want my data, they want to play audio and video, they hijack the content, the page, etc.

Advertising scum cannot be trusted; they will forever take more and more and more.


I also have ad-blockers for the same reason. However, if you don't support the people or companies producing the media you consume, then don't be surprised when they go out of business.


> don't be surprised when they go out of business.

I'm ok with this. I support the media I truly want to see, and that media offers alternatives that are not ads.

For instance, I pay for YouTube premium. That said, many will not pay.


Or websites can monetize their data via paid apis and downloadable archives. That's what makes Reddit the most valuable data trove for regular users.


I don't think Reddit pays the people who voluntarily write Reddit content. Valuable to Reddit, I guess.


Doesn't o3 sort of already do this? Whenever I ask it something, it makes it look like it simultaneously opens 3-8 pages (something a human can't do).

Seems like a reasonable stance would be something like "following the no-crawl directive in robots.txt is especially necessary when navigating websites faster than humans can."
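A minimal sketch of what that stance could look like in practice, using Python's stdlib robots.txt parser; the user agent string and the 5-second fallback delay are just placeholders I made up:

    # Honor robots.txt and pace requests at human-ish speed before fetching.
    import time
    import urllib.request
    import urllib.robotparser

    AGENT = "personal-research-bot"  # hypothetical user agent

    rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()

    def polite_fetch(url, fallback_delay=5.0):
        if not rp.can_fetch(AGENT, url):
            return None  # respect the no-crawl directive
        # wait out any Crawl-delay, instead of opening 3-8 pages at once
        time.sleep(rp.crawl_delay(AGENT) or fallback_delay)
        req = urllib.request.Request(url, headers={"User-Agent": AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()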

> What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?

To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click them.

But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.


You don't middle click a bunch of links when doing research? Of all the things to point to I wouldn't have thought "opens a bunch of tabs" to be one of the differentiating behaviors between browsing with Firefox and browsing with an LLM.


> simultaneously opens 3-8 pages (something a human can't do).

Can't you read?


>Doesn't o3 sort of already do this?

ChatGPT probably uses a cache, though. Theoretically, the average load on the original sites could be far lower than if users accessed them directly.
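Back-of-the-envelope sketch of that kind of cache (purely illustrative; no idea what OpenAI actually does):

    # Fetch each URL from the origin once, then serve every later request
    # from the local copy, so repeated questions add no origin load.
    import hashlib
    import pathlib
    import urllib.request

    CACHE = pathlib.Path("/tmp/page-cache")
    CACHE.mkdir(exist_ok=True)

    def cached_fetch(url):
        key = CACHE / hashlib.sha256(url.encode()).hexdigest()
        if key.exists():
            return key.read_bytes()  # cache hit: zero origin traffic
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        key.write_bytes(body)
        return body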


How do you propose we do anything about this? Any law you propose would have to be global.


I saw someone suggest in another post that if only one crawler visited and scraped, and everyone else reused that copy, most websites would be OK with it. The problem is every billionaire-backed startup draining your resources with something resembling a DoS attack.



