Please do not use my book to train artificial intelligence!
Recently, there has been news about the Books3 dataset, which contains 183,000 published books. Earlier datasets, up through Books2, passed without much notice, but now that artificial intelligence is getting a lot of attention, this has become a problem. The Authors Guild has started a petition about it.
AI Training and Books: According to the Authors Guild, many authors have recently discovered that their books are being used to train AI through the Books3 dataset, which contains 183,000 books and was compiled from illegal sources.
Copyright Issues and Lawsuits: Authors say this practice raises concerns about copyright, compensation, and the future impact of AI. The Authors Guild is pursuing a class action lawsuit against OpenAI, Meta, Google, and others.
What authors can do: Authors can write letters to AI companies stating that the companies do not have the right to use their books. Authors can also sign the Authors Guild's open letter, which asks AI companies to obtain appropriate permissions and pay authors compensation.
Questions or insights to think about
Copyright Protection: As AI advances, will we need a new legal framework for copyright?
The Importance of Transparency: How transparent should AI companies be about the data they use?
Author Responsibilities and Rights: How can authors find out how their work is being used by AI?
The Danish anti-piracy group Rights Alliance asked the host The Eye to delete Books3, a dataset of about 200,000 books, and the dataset was taken down. Books3 was also used to train LLaMA, the large language model developed by Meta.
Anti-Piracy Group Takes Prominent AI Training Dataset 'Books3' Offline - TorrentFreak
https://torrentfreak.com/anti-piracy-group-takes-prominent-ai-training-dataset-books3-offline-230816/
Revealed: The Authors Whose Pirated Books Are Powering Generative AI - The Atlantic
https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/
Massive Books3 collection for training AI was taken down over copyright issues - Mashable
https://mashable.com/article/books3-ai-training-dmca-takedown
Anti-Piracy Group Takes AI Training Dataset 'Books3' Offline - Gizmodo
https://gizmodo.com/anti-piracy-group-takes-ai-training-dataset-books3-off-1850743763
Books3 was released as part of The Pile, the open-source AI training dataset provided by the non-profit AI research group EleutherAI. It contains 196,640 books, about 37 GB of data, for training AI models. Books3 was uploaded in 2020 by AI developer Shawn Presser and has since been hosted by the large-scale repository The Eye. Presser explained, "The development goal of Books3 was to allow anyone to create an AI model comparable to ChatGPT." He added, "It's important to be able to create your own ChatGPT-like AI model in case ChatGPT goes offline for some reason or faces a lawsuit."
Presser announced the dataset on Twitter:
Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data. Now you do. Now everyone does. Presenting 'books3', aka 'all of bibliotik' - 196,640 books - in plain.txt - reliable, direct download, for years: https://t.co/KKSrhEAnrD thread pic.twitter.com/m6bdpHfYJx
- Shawn Presser (@theshawwn), October 25, 2020
Books3 has also been used to train Meta's LLaMA and Bloomberg's BloombergGPT, and Meta researchers describe Books3 as "a public dataset for training large-scale language models." The Eye claims that "all datasets comply with the Digital Millennium Copyright Act," but suspicions of intellectual property and copyright infringement have been raised.
Amid growing concerns about copyright infringement by AI, the Rights Alliance sent The Eye a request to remove Books3 under the Digital Millennium Copyright Act. "It is very important to prevent AI from being trained with pirated and illegal content," said Maria Fredenslund, director of the Rights Alliance. "There is a significant challenge not only to detect and remove illegal AI training datasets, but also to deal with AI that has been trained on illegal content and is now prevalent on the internet."
The Eye removed the Books3 dataset following the removal request from the Rights Alliance. At the time the article was written, accessing Books3 returned a 404 error. However, although the download link published by The Eye was taken offline, the dataset has not been completely removed from the Internet. TorrentFreak reports that "files are still backed up on the Internet Archive's Wayback Machine, and alternative download links are also shared," adding, "Like traditional pirated books and movies, it's very difficult to remove once it's out."
In addition to asking The Eye to delete Books3, the Rights Alliance is asking Meta to respond regarding Books3. "It is unlikely that Meta will retrain LLaMA to eliminate concerns about copyright infringement," said the technology news outlet Gizmodo.
"AI developers and development companies need a framework to always share details such as the training data used to create the AI model," Fredenslund said.
Personal thoughts
In fact, there is almost no way to know what training data was used for a so-called LLM (large language model).
Unless there is an internal whistleblower, or the training dataset is released, or...
Also, how do we weed out books and documents that are already spread across the Internet?
This action by the Authors Guild seems like nothing more than a formality. It is similar to what has recently happened in the broadcasting and entertainment industries and in the press associations.
I think the key questions are how to catch something that cannot be caught, and whether catching it is really a good thing.
