I feel like this move has nothing to do with investors and everything to do with setting the standard for big corps like Microsoft and Google that want to scrape Reddit’s massive amount of data to train next-gen AIs. Reddit knows it has a HUGE amount of data, both current and going back years and years. Content created by others, then sold for enormous profit.
I mean, AI companies are already taking art and images from all over the web without paying anything. They could literally just scrape and pay nothing. Web scraping isn’t illegal and they already do it, so why would they pay anyone? Unless the law catches up on the right to generate AI content from ill-gotten data, why would they pay for what they don’t have to?
The thing I worry about whenever someone mentions this angle: what about Lemmy content? As the community moves away from the commercial platforms in favor of Lemmy, Bluesky, Mastodon, etc., does that lower the legal barrier for AI companies to train on all this content for free? Is that shift in the legal vulnerability of public content something that users consider? Is it desirable to most users? Are people thinking about that?
I’m with you on that. AI is the future. Just because some big corp is doing AI training for its closed-source product doesn’t mean that open-source models won’t also benefit. If you post to a public space, you should expect it to be read.
If he thinks locking down the API is going to stop them, he’s bumped his head. These companies have more than enough manpower to write and maintain an HTML scraper for Reddit.
Creating a web scraper and actually maintaining one that is effective and keeps working are two different things. It’s very easy to fight web scraping if you know what you are doing.
Right, but these are big companies with lots of talented programmers on hand. If anyone can overcome such an obstacle, it’s them.
Also, Google and Microsoft already have a search index full of Reddit content to scrape.
You are right. You would need a team of skilled scraper developers and network engineers, though, who would know how to get around rate limiters with some kind of external load balancer or something along those lines.
Rate limiters typically key on the source IP. This is easily bypassed with a rotating proxy. There are even SaaS products that offer this. The trick is to avoid large subnets that can be blocked in one go. You have to use lots of scattered individual IPs (/32s) to be effective.
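To make the subnet point concrete, here is a rough toy model in Python. It is my own sketch, not how Reddit or any real site implements limiting: the `FixedWindowLimiter` class, the `count_allowed` helper, the limit of 10 requests per window, and all the addresses are made up for illustration. It only counts requests per key in memory, where the key is either the single address (`prefix=32`) or the containing /24.

```python
import ipaddress
from collections import defaultdict


class FixedWindowLimiter:
    """Toy limiter: at most `limit` requests per key in each `window` seconds.

    The key is the network that contains the client address, so
    prefix=32 means "per single IPv4 address" and prefix=24 means
    "per /24 subnet". Hypothetical design, for illustration only.
    """

    def __init__(self, limit, window=60.0, prefix=32):
        self.limit = limit
        self.window = window
        self.prefix = prefix
        self.counts = defaultdict(int)

    def key(self, ip):
        # strict=False masks off the host bits, e.g.
        # 203.0.113.7/24 becomes the network 203.0.113.0/24.
        return ipaddress.ip_network(f"{ip}/{self.prefix}", strict=False)

    def allow(self, ip, now):
        # One counter per (key, window index) pair.
        bucket = (self.key(ip), int(now // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


def count_allowed(limiter, ips, now=0.0):
    """Send one request per address in `ips` at time `now`; count the accepted ones."""
    return sum(limiter.allow(ip, now) for ip in ips)


if __name__ == "__main__":
    one_ip = ["203.0.113.7"] * 100                        # a single client
    same_subnet = [f"203.0.113.{i}" for i in range(100)]  # 100 IPs in one /24
    scattered = [f"10.{i}.0.1" for i in range(100)]       # 100 IPs in 100 different /24s

    print(count_allowed(FixedWindowLimiter(10), one_ip))
    print(count_allowed(FixedWindowLimiter(10), same_subnet))
    print(count_allowed(FixedWindowLimiter(10, prefix=24), same_subnet))
    print(count_allowed(FixedWindowLimiter(10, prefix=24), scattered))
```

In this model a per-IP limit stops the single client but not 100 addresses from one /24, keying on the /24 catches that whole block at once, and addresses spread across unrelated subnets get past both. Real anti-scraping setups usually look at more than the source IP, so treat this as the simplest possible picture.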