China's first case of alleged AI large-model data theft, with a claim of only 1 yuan

The Bishen Composition app posted on its official WeChat account that data from its "Zuowen Library" had been stolen, saying the library "was crawled more than two million times in a weekend." It attributed the crawling to Xueersi, with which it has a long-term cooperative relationship. Bishen Composition calls this China's "first case of theft of AI large model data." Xueersi has publicly denied the accusation.

Xueersi responded on its official Weibo account: "First, MathGPT is a self-developed large model focusing on the field of mathematics, without any composition-related data; secondly, the 'Composition AI Assistant' is currently under development, has not yet been released, and does not use any data from Bishen Composition."

Bishen Composition wrote that "the team has also sent a lawyer's letter to Xueersi many times, but the other party has never responded substantively." It added, "There is no other way but to solve this problem through judicial procedures." "We just want Xueersi to pay 1 yuan in compensation, publicly apologize, and delete the crawled data." "We hope Xueersi can stop its wrong behavior and apologize in time."

 

The dispute between Bishen Composition and Xueersi has exposed a "hidden corner" of large models: is the data used to train AI large models obtained legally and in compliance with regulations? In fact, disputes over large-model datasets have arisen frequently both in China and abroad.

The data involved in these disputes can be roughly divided into two categories: one has clear intellectual property rights, such as original pictures, music, videos, and articles; the other is content posted on online platforms, such as post bars (Tieba-style forums).

Earlier this year, Stability AI was sued by Getty Images, a large commercial stock image provider in the United States, and by a group of cartoonists. The plaintiffs alleged that the data Stability AI used to train its AI image generation model, Stable Diffusion, "illegally copied and processed copyrighted images."

In addition, Twitter and Reddit (the "US version of Tieba") both announced in the first half of this year that they would charge for API access, and the prices are not cheap. Previously, companies such as Google and OpenAI could crawl the content of these platforms for free and use it as training data for large language models. Twitter CEO Musk said, "They (Microsoft) illegally used Twitter data for training, and it's time to sue them."

Regulators are also paying attention to the datasets used to train AI large models. The "Generative Artificial Intelligence Service Management Measures (Draft for Comment)," published by the Cyberspace Administration of China in April this year, states that the pre-training and optimization training data used for generative AI products must comply with laws and regulations such as the Cybersecurity Law. The data must not contain content that infringes intellectual property rights; data that contains personal information must satisfy the "notice and consent" principle; and the authenticity, accuracy, objectivity, and diversity of the data must be ensured.

What do you think? Please leave a comment to join the discussion.


Origin blog.csdn.net/haisendashuju/article/details/131547483