• 11 Posts
  • 1.31K Comments
Joined 1 year ago
Cake day: July 7th, 2023




  • You’re thinking of licensing as a person putting something online WITH a license.

    The terminology in this case is whether or not it was LICENSED by the commercial entity using it and selling its derivative. Unlicensed is the default. The burden is on the commercial entity to prove they were the original creator of said content. Otherwise it is plagiarism, and that is also the default.

    Here’s an example: I write a story and post it online, and it is specifically about a toothbrush and a toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. Then somebody prompts OpenAI for a similar story and it kicks out the exact same story with the same names. I would win that case, because it is clearly plagiarism beyond a doubt.

    Unless you as OpenAI can prove these are all completely random (which they aren’t, because it’s trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.

    Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies, etc.


  • EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they started selling the data and barely gave notice about it (see the exodus from Reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.

    Anyone who has their content on a platform like that, and got the rug pulled out from under them by silent amendments made to allow this, is unfortunately fucked.

    Any other platforms that didn’t explicitly state this was happening are not in scope for these training tools to just grab and train on. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this: personal blogs, Wikipedia, etc.









  • It won’t really do anything though. The model itself is whatever. The training tools, the data and the resulting generations of weights are where the meat is. Unless you can prove from those three pieces that they are using unlicensed data, open sourcing it is kind of moot.

    What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

    They pulled a very public, out-in-the-open data heist and got away with it. Stopping it from happening again and again is the only way to win here.