- cross-posted to:
- technology@beehaw.org
On Monday, court documents revealed that AI company Anthropic spent millions of dollars physically scanning print books to build Claude, an AI assistant similar to ChatGPT. In the process, the company cut millions of print books from their bindings, scanned them into digital files, and threw away the originals solely for the purpose of training AI—details buried in a copyright ruling whose broader fair use implications we reported on yesterday.
The 32-page legal decision tells the story of how, in February 2024, the company hired Tom Turvey, the former head of partnerships for the Google Books book-scanning project, and tasked him with obtaining “all the books in the world.” The strategic hire appears to have been designed to replicate Google’s legally successful book digitization approach—the same scanning operation that survived copyright challenges and established key fair use precedents.
While destructive scanning is a common practice among some book digitizing operations, Anthropic’s approach was somewhat unusual due to its documented massive scale. By contrast, the Google Books project largely used a patented non-destructive camera process to scan millions of books borrowed from libraries and later returned. For Anthropic, the faster speed and lower cost of the destructive process appears to have trumped any need for preserving the physical books themselves, hinting at the need for a cheap and easy solution in a highly competitive industry.
Ultimately, Judge William Alsup ruled that this destructive scanning operation qualified as fair use—but only because Anthropic had legally purchased the books first, destroyed each print copy after scanning, and kept the digital files internally rather than distributing them. The judge compared the process to “conserv[ing] space” through format conversion and found it transformative. Had Anthropic stuck to this approach from the beginning, it might have achieved the first legally sanctioned case of AI fair use. Instead, the company’s earlier piracy undermined its position.
But if you’re not intimately familiar with the AI industry and copyright, you might wonder: Why would a company spend millions of dollars on books to destroy them? Behind these odd legal maneuvers lies a more fundamental driver: the AI industry’s insatiable hunger for high-quality text.
When a bookstore goes out of business or just can’t sell a book, they don’t return it to the printer. They tear off the cover, return that, and are required to throw the rest of the book in the trash and destroy it. So books are already destroyed by the millions. When I was a kid, our hometown bookstore went out of business, and I watched them throw away two metal dumpsters full of coverless books. If they were destroying ancient texts or valuable copies, that would be something to get excited about. I doubt that they were doing that, though.
It’s no secret; it was their defence when they got sued for copyright infringement. Instead of downloading all the books from Anna’s Archive like Meta, they bought a copy, cut the binding, scanned it, then destroyed it. “We bought a copy for personal use then used the content for profit, so it’s not piracy.”
I assume “destructively scan” means cutting off the spine so the pages lie flat, and that one copy of each book gets scanned? Isn’t that a pretty normal way of doing it in cases where the prints aren’t rare?
Probably, yes. I think there’s a copyright reason behind destroying the book?
Not copyright so much as practicality: if the book isn’t precious, it’s easier to cut it apart, feed the loose pages into a sheet-fed scanner, and then buy an intact copy later if you want one. Compare that to the additional expense of building and programming a machine to carefully turn the pages and photograph what’s inside, or the time that would take.