
In theory, retrieving a page on behalf of a user would be acceptable, but these are AI companies that have disregarded all norms surrounding copyright, etc. It would be stupid of them not to also save the contents of the page and use them for future AI training or further crawling.


If you allow Googlebot to crawl your website and train Gemini, but you don't allow smaller AI companies to do the same thing, then you're contributing to Google's hegemony. Given that AI is likely to be an increasingly important part of society in the future, that kind of discrimination is anti-social. I don't want a future where everything is run by Google even more than it currently is.

Crawling is legal. Training is presumably legal. Long may the little guys do both.


Googlebot respects robots.txt. And Google doesn't use the fetched data from users of Chrome to supplement their search index (as a2128 is speculating that Perplexity might do when they fetch pages on the user's behalf).


Yes, but there's no way to say "allow indexing for search, but not for AI use", right?


But there is: https://developers.google.com/search/docs/crawling-indexing/...

There is a user agent for search that you can control in robots.txt.

    user-agent: Googlebot

There is another user agent for AI training.

    user-agent: Google-Extended
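
As a rough sketch of how the two tokens could be combined, the snippet below builds a robots.txt that allows Googlebot and disallows Google-Extended, then checks it with Python's standard-library parser. The Allow/Disallow rules, the is_allowed helper, and the example.com URL are assumptions made for illustration, and Python's parser only approximates how Google itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow search indexing, opt out of AI training.
# The rule bodies are illustrative; only the two user-agent tokens come
# from the comment above.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"  # placeholder URL
    for agent in ("Googlebot", "Google-Extended"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Any agent not named in the file falls through to "allowed" here, since this sketch has no catch-all `User-agent: *` group.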


Wow, I had no idea this page existed, thanks for the reference!



