Where does so much artificial intelligence learning data come from?
Haebom
Recently, as artificial intelligence (AI) technology has developed rapidly, it has come to light that the texts, photos, and videos we post online are being used without permission for AI training. In particular, there is major controversy over claims that prominent AI companies such as OpenAI and Anthropic are collecting content while ignoring the wishes of website owners. At the heart of this problem is a small file called 'robots.txt'. What exactly is the problem, and what impact will it have on us?
The robots.txt file: traffic lights on the web
The 'robots.txt' file is a kind of notice by which a website owner tells search engines and other 'bots' (automated programs), "You may look here" or "Stay out of there." Simply put, it sets the 'access rules' for the website. For example, "User-agent: * Disallow: /private/" tells all bots to stay out of the "/private/" folder. This small text file serves several important roles.
1. Reduce server load: Too many bots visiting at once can slow a website down.
2. Privacy protection: Owners can keep crawlers away from information they do not want made public.
3. Efficient information provision: Important pages can be surfaced to crawlers first.
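To illustrate how these access rules work in practice, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The bot name "MyBot" and the URLs are hypothetical, and the rules mirror the "User-agent: * Disallow: /private/" example above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, matching the example in the text.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved bot checks permission before fetching each URL.
print(parser.can_fetch("MyBot", "https://example.com/private/data.html"))  # False: blocked
print(parser.can_fetch("MyBot", "https://example.com/public/page.html"))   # True: allowed
```

The controversy described in this article is precisely about crawlers that skip this check, or perform it and then ignore the answer.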
Controversial actions by AI companies
According to reports from Business Insider and Reuters, AI companies such as OpenAI and Anthropic promise to follow these 'robots.txt' rules but in practice ignore them. This is like driving through a red light. Such behavior not only violates website owners' rights but also undermines trust in the Internet as a whole. In Korea, too, it was found that when Claude 3 was asked to write in the tone of DC Inside or another specific online community, it reproduced the style strikingly well, which drew considerable interest.
In this situation, a company called TollBit is attracting attention. TollBit, a content licensing brokerage, has closely examined the behavior of AI companies. It tracks which websites AI bots visit and how often, and calculates appropriate usage fees based on that data. Its role is comparable to cracking down on illegal parking. TollBit's activities can help create a fair relationship between AI companies and content creators.
Did TollBit deliberately stir up this controversy?
Since TollBit makes money as an intermediary, its claims may carry some business interest. But pursuing one's own interests is basic business, and if TollBit's claims are based on facts, they remain legitimate. TollBit aims to promote fair trade between AI companies and publishers, providing solutions that benefit both parties.
TollBit's argument appears valid in many respects. Ignoring the robots.txt file can violate a website owner's rights, which is both legally and ethically problematic. The claim is intended to stop AI companies from collecting data in unethical ways, and it is a reasonable stance for protecting publishers' rights.
Backlash and legal response from content platforms and media companies
Various content platforms and media companies are strongly opposing these actions by AI companies. They claim that unauthorized data collection by AI companies is threatening their business models. Media companies that produce news articles or professional content are expressing concerns about their works being learned and reproduced without permission by AI.
Various types of content platforms, including social media platforms, blog hosting services, and expert knowledge sharing sites, are also paying attention to this issue. They are working to prevent content produced by users of their platform from being used without permission to train AI.
Some large media companies and content platforms have already begun or are considering legal action. This requires legal judgment on AI companies' data collection practices, and could become an important precedent for future AI development and content use.
How can we stop this?
With advances in AI technology, we need to be aware that anything we post online can be used to train AI. This has major implications for our privacy and copyright. However, blindly mandating robots.txt compliance by law would impose significant restrictions on the Internet.
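For site owners who want to opt out today, the practical first step is still robots.txt itself. As an illustration only (and assuming the crawlers actually honor the file): OpenAI and Anthropic publicly document crawler user agents, "GPTBot" and "ClaudeBot" respectively, so a site can turn away AI training crawlers while leaving ordinary search bots alone:

```
# Hypothetical robots.txt for a site opting out of AI training crawls.
# GPTBot (OpenAI) and ClaudeBot (Anthropic) are publicly documented user agents.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other bots may crawl everything except the private area.
User-agent: *
Disallow: /private/
```

Of course, as the reporting above suggests, this is a request rather than an enforcement mechanism, which is exactly why the controversy exists.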
In fact, outright blocking is a blunt approach, and simply not posting information online is hardly an appropriate solution either. A better path would be to restructure things so that those who upload content also benefit, as Reddit or TollBit are attempting. Of course, AI model developers should also clearly disclose what data they collect for training.