robots.txt isn’t just a basic social contract; it’s a file intended to keep web crawlers from wasting precious resources.
🤖 I’m a bot that provides automatic summaries for articles:
If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.
AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.
In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, has made high-quality training data one of the internet’s most valuable commodities.
You might build a totally innocent one to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find.
The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.
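For what it’s worth, the block those publishers added is only a couple of lines in robots.txt. A minimal sketch of its effect, using Python’s standard `urllib.robotparser` (the URLs and rules here are illustrative placeholders, not any specific publisher’s actual file):

```python
from urllib import robotparser

# A minimal robots.txt that blocks OpenAI's crawler from the whole site,
# matching the kind of rule publishers like the NYT deployed.
rules = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Note that, as the thread discusses, robots.txt is purely advisory: nothing but the crawler’s own good behavior makes it honor a `Disallow` line.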
“We recognize that existing web publisher controls were developed before new AI and research use cases,” Google’s VP of trust Danielle Romain wrote last year.
As I always write, trying to restrict AI training on the grounds of copyright will only backfire. The sad truth is that malicious parties (dictatorships) will get more training material because they won’t abide by the rules. The end result: if we treat AI training like human reading, dictatorships would outperform democracies in future generations of AI.
You know what?
I’m fine with that hypothetical risk.
“The bad guys will do it anyway so we need to do it, too” is the worst kind of fatalism. That kind of logic can be used to justify any number of heinous acts, and I refuse to live in a world where the worst of us are allowed to drag down the rest of us.