If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling regularly to gather information and stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
Maybe we should just institutionalize and explicitly legalize the Internet Archive and Archive Team. Then I could download a complete and halfway-current crawl of domain X from the IA, and that way no additional costs are incurred for domain X.
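For what it's worth, a minimal sketch of what that reuse could look like, using the IA's public Wayback "availability" endpoint (the endpoint and JSON fields are the public API as I understand it; the example.com URL is just a placeholder):

    # Rough sketch: ask the Wayback Machine for its most recent snapshot of a URL
    # and read that instead of hitting the origin site again. Only the public
    # archive.org/wayback/available endpoint is used here.
    import json
    import urllib.parse
    import urllib.request

    def fetch_via_archive(url):
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
            data = json.load(resp)
        snapshot = data.get("archived_snapshots", {}).get("closest")
        if snapshot and snapshot.get("available"):
            # Serve the archived copy; domain X sees no additional traffic.
            with urllib.request.urlopen(snapshot["url"]) as resp:
                return resp.read()
        return None  # no snapshot yet - only now would you consider crawling the origin

    page = fetch_via_archive("https://example.com/some-article")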
But of course, most website publishers would hate that. They don't just want people to access their content; they want them to look at the ads that pay for it. That's why, to them, the IA crawling their website is akin to stealing: it takes away some of their ad impressions.
Many websites (especially the bigger ones) are just businesses. They pay people to produce content, hopefully make enough ad revenue to make a profit, and repeat. Anything that reproduces their content and steals their views has a direct effect on their income and their ability to stay in business.
Maybe the IA should have a way for websites to register to collect payment for lost views or something. I think the loss is negligible now (there are likely no websites losing meaningful revenue from people using the IA instead of visiting directly), but it might be a way to get better buy-in if it were institutionalized.
If magazines and newspapers were once able to be funded by native ads, so can websites. The spying industry doesn't want you to know this, but ads work without spying too - just look at all the IRL billboards still around.
Magazines and newspapers were able to be funded by native ads because you couldn't auto-remove ads from their printed media and nobody could clone their content and give it away for free.
Newspapers sell information. Information is now trivial to copy and send across the globe, when 50 years ago it wasn't. And you're wrong about "nobody could clone their content": they absolutely could; different editions were pressed throughout the day (morning, lunch, evening newspapers) at the peak of print media. The barrier to entry used to be a printing press; now it's just an internet connection, and print media has a hard time accepting that.
You can't remove ads that are part of a site's native HTML either - well, not easily, not without an AI determining what is an ad based on the content itself. The few ads I see despite uBlock are like that: something the website author themselves included, rather than something pulled in from a different domain.
And those ads don't spy. They tend to be a jpg that functions as a link. That's why I mentioned spying.
I also have ad-blockers for the same reason. However, if you don't support the people or companies producing the media you consume then don't be surprised when they go out of business.
Doesn't o3 sort of already do this? Whenever I ask it something, it looks like it opens 3-8 pages simultaneously (something a human can't do).
Seems like a reasonable stance would be something like "Following the no-crawl directive is especially necessary when navigating websites faster than humans can."
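In practice that could just mean checking robots.txt and honoring Crawl-delay before every request. A minimal sketch with Python's standard library (the user-agent string and the 5-second fallback delay are made up for illustration):

    # Minimal sketch of a "polite" fetch: obey robots.txt disallow rules and
    # Crawl-delay, and never request faster than a human plausibly would.
    import time
    import urllib.parse
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "personal-research-bot"  # hypothetical UA string

    def polite_fetch(url, fallback_delay=5.0):
        parts = urllib.parse.urlsplit(url)
        robots = urllib.robotparser.RobotFileParser(parts.scheme + "://" + parts.netloc + "/robots.txt")
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            return None  # the site opted out; stop here
        # Honor the site's Crawl-delay if it declares one, otherwise wait a human-ish interval.
        time.sleep(robots.crawl_delay(USER_AGENT) or fallback_delay)
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as resp:
            return resp.read()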
> What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling regularly to gather information and stay up to date on things (from your machine)?
To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click them.
But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.
You don't middle click a bunch of links when doing research? Of all the things to point to I wouldn't have thought "opens a bunch of tabs" to be one of the differentiating behaviors between browsing with Firefox and browsing with an LLM.
I saw someone suggest in another post that if only one crawler did the visiting and scraping and everyone else reused that copy, most websites would be ok with it. But the problem is every billionaire-backed startup draining your resources with something similar to a DoS attack.
The next step in your progression here might be: