Haebom's Archive

Please Don't Use My Book to Train AI!

Haebom

Sep 28, 2023

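Recently there has been talk about the Books3 dataset, which holds data from 183,000 published books. Up through Books2 the issue was more or less glossed over, but now that AI is drawing so much attention, this too seems to be turning into a problem. In response, the Authors Guild in the United States has launched a signature campaign.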

• AI training and books: According to the Authors Guild, many authors only recently learned that their books are included in the Books3 dataset and are being used to train AI. The dataset contains 183,000 books, which were downloaded from illegal sources.

• Copyright issues and lawsuits: For authors, this raises concerns about copyright, compensation, and the future impact of AI. The Authors Guild is pursuing class-action litigation against OpenAI, Meta, Google, and others.

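• Steps authors can take: Authors can send AI companies a letter stating that the companies have no right to use their books. Authors can also sign the Authors Guild's open letter, which calls on AI companies to obtain proper permission and to compensate writers.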

Questions and Insights to Consider

• Copyright protection: As AI advances rapidly, might we need a new legal framework for copyright?

• The importance of transparency: How transparent should AI companies be about the data they use?

• Authors' responsibilities and rights: How can authors find out how their works are being used by AI?

You Just Found Out Your Book Was Used to Train AI. Now What? - The Authors Guild

If you’re an author, you may have recently discovered that your published book was included in a dataset of books used to train artificial intelligence systems without your permission. (Search the dataset here.) This can be an unsettling revelation, raising concerns about […]

authorsguild.org

AI training dataset 'Books3', which was also used to train Meta's large language model 'LLaMA', has been taken down

The Danish anti-piracy group Rights Alliance asked the host The Eye to remove the Books3 dataset of about 200,000 books, and the dataset was taken down. Books3 was also used to train LLaMA, the large language model developed by Meta.

Anti-Piracy Group Takes Prominent AI Training Dataset 'Books3' Offline * TorrentFreak
https://torrentfreak.com/anti-piracy-group-takes-prominent-ai-training-dataset-books3-offline-230816/

Revealed: The Authors Whose Pirated Books Are Powering Generative AI - The Atlantic
https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/

Massive Books3 collection for training AI was taken down over copyright issues | Mashable
https://mashable.com/article/books3-ai-training-dmca-takedown

Anti-Piracy Group Takes AI Training Dataset 'Books3' Offline
https://gizmodo.com/anti-piracy-group-takes-ai-training-dataset-books3-off-1850743763

Books3 was released as part of The Pile, the open-source AI training dataset provided by the non-profit AI research group EleutherAI. It contained 196,640 books, about 37 GB of data, intended for training AI models. Books3 was uploaded in 2020 by AI developer Shawn Presser and had since been hosted by the large-scale repository The Eye. Presser said, "The development goal of Books3 was to allow anyone to create an AI model comparable to ChatGPT," and added, "It's important to be able to create your own ChatGPT-like AI model in case ChatGPT goes offline for some reason or faces a lawsuit."

Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data. Now you do. Now everyone does. Presenting 'books3', aka 'all of bibliotik' - 196,640 books - in plain.txt - reliable, direct download, for years: https://t.co/KKSrhEAnrD thread pic.twitter.com/m6bdpHfYJx
- Shawn Presser (@theshawwn), October 25, 2020

Books3 was also used to train Meta's LLaMA and Bloomberg's BloombergGPT, and Meta researchers described Books3 as "a public dataset for training large-scale language models" (PDF file).

The Eye claims that "all datasets comply with the Digital Millennium Copyright Act," but suspicions of intellectual property and copyright infringement have been raised. Amid growing concerns about copyright infringement by AI, the Rights Alliance asked The Eye to remove Books3 on the basis of the Digital Millennium Copyright Act. "It is very important to prevent AI from being trained with pirated and illegal content," said Maria Fredenslund, director of the Rights Alliance. "There is a significant challenge not only to detect and remove illegal AI training datasets, but also to deal with AI that has been trained on illegal content and is now prevalent on the internet."

The Eye removed the Books3 dataset following the removal request from the Rights Alliance. At the time of writing, accessing Books3 returns a 404 error.

However, although the Books3 download link published by The Eye was taken offline, it has been pointed out that the dataset was not completely removed from the internet. TorrentFreak reports that "files are still backed up on the Internet Archive's Wayback Machine, and alternative download links are also shared," adding that "like traditional pirated books and movies, it's very difficult to remove once it's out."

In addition to asking The Eye to delete Books3, the Rights Alliance is asking Meta to respond regarding Books3. "It is unlikely that Meta will retrain LLaMA to eliminate concerns about copyright infringement," said the technology news outlet Gizmodo.

"AI developers and development companies need a framework to always share details such as the training data used to create the AI model," Fredenslund said.

Personal Thoughts

• In practice, there is almost no way to know what training data went into the LLMs (large language models) we so often talk about.

  ◦ It takes either a whistleblower or a company publishing its training dataset, and neither is common.

  ◦ And how would anyone weed out the books and documents that are already circulating on the internet?

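• To me, the Authors Guild's action looks like little more than a formality. It follows the same pattern as what we have recently seen from the broadcast and entertainment industry and from journalists' associations.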

• The core question seems to be how you catch something that cannot be caught, and whether catching it would even be a good thing.
