China's first large-model infringement case: six years of work crawled 2 million+ times, and the claim is only 1 yuan?

Compiled by | Zheng Liyuan

Produced by | CSDN (ID: CSDNnews)

Last month, Xueersi revealed that it is developing its own large math model, MathGPT, which is aimed at mathematics enthusiasts and research institutions worldwide and is built around problem-solving and lecture algorithms for mathematics.

At the time, many people felt that a ChatGPT for "science students" was finally on its way.

Unexpectedly, a "scandal" around MathGPT broke before it even launched: this Tuesday, the Bishen Composition app accused Xueersi of illegally accessing and caching data on its servers as many as 2.58 million times through "crawler" technology, in order to develop the new MathGPT product "Composition AI Assistant".

Six years of work, crawled more than 2 million times in one weekend

One of the parties to this incident, Bishen Composition, is a K12 (kindergarten through grade 12) composition education platform established in December 2017 and operated by Beijing Yiyilianghua Technology Co., Ltd.

At that time, the AI market was far less hot than it is now, but on the strength of its pitch of "using artificial intelligence technology to help writers improve their writing skills", Bishen Composition received several million yuan in seed funding from ZhenFund in January 2018 and completed a multi-million-yuan angel round in July 2019.

According to official information, Bishen Composition has been online for six years and receives more than 300,000 essay submissions and more than 400,000 likes and comments every month. It has accumulated millions of composition materials and corrects more than 30,000 essays per month.

After ChatGPT launched at the end of last year, Shiji Tianhong, one of Bishen's investors, said that Bishen and ChatGPT share the same underlying technology: both use the latest Transformer-based algorithms as the foundation of their AI models. Song Jiawei, the founder of Bishen Composition, added: "More than 60% of the Yiyilianghua team are technical R&D staff. Before founding the company, the team had already founded an NLP company and had worked in the field for many years."

Taken together, this means the algorithm models behind Bishen Composition were developed and trained in-house, and the platform's data comes from its own accumulation.

Because of its technical accumulation and track record in writing, Bishen Composition began cooperating with Xueersi three years ago: it signed a contract with Xueersi's learning tool app "Tipai Pai", mainly to provide a composition material query service.

As a partner, Bishen Composition stated this week: On April 13, something we never expected happened. Xueersi, our partner of many years, crawled the results our team has built up over the six years since the company was founded more than two million times in a single weekend!

Demands: 1 yuan in compensation, a public apology, and data deletion

Judging from Bishen Composition's official Weibo statement, it did not have a complete data security mechanism and had not put full safeguards in place against its "partner" Xueersi. This allowed Santi Yunlian (a Xueersi subsidiary) to take advantage of that trust: without authorization from the Bishen Composition app, from April 13 to April 17, 2023, it illegally accessed and cached data on the app's servers as many as 2.58 million times through "crawler" technology.

Bishen Composition claims that this behavior violates the terms of the contract between the two parties and also violates Article 32 of the Data Security Law, which states that "any organization or individual shall collect data by lawful and proper means, and shall not steal data or obtain it by other illegal means", and that it has seriously infringed the data rights and interests of the Bishen Composition app.

Afterwards, Bishen Composition asked Xueersi to verify this, and the other side directly admitted that its algorithm team had crawled the data for its own use. Bishen Composition then sent a lawyer's letter but received no substantive reply. Meanwhile, Xueersi's AI model MathGPT is about to launch a new product, "Composition AI Assistant".

"As a company much smaller than 'Xueersi', we have no choice but to protect our rights through legal channels." Bishen Composition noted that there is not yet a judgment precedent in China for [AI large model data theft], so it can only "take this first step bravely".

As for its demands, Bishen Composition is not actually asking for a large sum: it only wants Xueersi to pay 1 yuan in compensation, apologize publicly, and delete the crawled data.

In this regard, Bishen Composition explained: "Data is valuable, but our hard work is even more priceless. The claim for 1 yuan is because fairness and justice cannot be measured by money. We hope to tell the society that this behavior is wrong through litigation. The development of the artificial intelligence industry relies on co-creation rather than coveting and plagiarizing the achievements of others."

Indeed, as Bishen Composition itself said, it is a small company, so the statement did not attract much attention, but the few comments it did receive condemned Xueersi's behavior.

Xueersi's response: everything complies with the contract

After many media outlets reported on it, the incident gradually gained traction, and Xueersi's official Weibo account posted a response:

First, MathGPT is a self-developed large model focused on mathematics and does not contain any composition-related data. Second, "Composition AI Assistant" is still under development and has not been released, and the service does not use any data from Bishen Composition.

As for Bishen Composition's claim that its data was crawled more than 2 million times, Xueersi pointed out that the contract clearly states that "the number of calls included in the monthly guaranteed fee is on the order of millions", and that the interface it called "falls within the normal scope of cooperation agreed in the contract between the two parties".

At the end of its response, Xueersi emphasized that it "always respects intellectual property rights and attaches great importance to intellectual property protection" and that all of its actions strictly followed the contract, adding: "The public statement by Bishen Composition has already damaged Xueersi's brand reputation, and we reserve the right to pursue liability for reputational infringement."

Copyright issues of AI training data

Judging from the statements both sides have made so far, no final conclusion can yet be drawn on this dispute, but it does expose an important and easily overlooked blind spot in the increasingly heated large-model race: the copyright of AI training data.

In fact, Reddit, the "US version of Tieba" that has recently been making waves online, decided to push through API fees for exactly this reason.

In recent years, content posted on Reddit has become training material for companies such as Google, OpenAI, and Microsoft to build large AI models and generative AI products such as ChatGPT. As such AI tools grew popular, the founder and CEO of Reddit said: "Reddit's data corpus is very valuable, but we don't want to provide this content to some giant companies for free."

After Reddit took the lead in asking tech giants to pay for data usage, Stack Overflow, a well-known IT question-and-answer website, also announced plans to charge large AI developers for data access from the middle of this year, arguing that contributions made to large language model (LLM) development must also be compensated.

Beyond large sites such as Reddit and Stack Overflow, some programmers in the developer community have also announced that they are abandoning GitHub over Copilot's alleged infringement of code copyright.

Undoubtedly, massive training data is essential to making large AI models smarter, but for now even OpenAI, the hottest name in AI today, does not have a good solution to the copyright problems of training data.

However, as the AI boom advances, this problem will have to be resolved. As Chen Zhong, a professor at Peking University's School of Computer Science, said: "Perhaps in the early stages of research and development people don't care about where data comes from, but once you generate huge economic benefits, the traditional economic model and legal system will constrain your research and development behavior."

So what do you think about this issue?

Reference links:

https://weibo.com/combmobile

https://weibo.com/5308312222/4912235782345634?wm=3333_2001&from=10D6093010&sourcetype=weixin&s_trans=3830025800_4912235782345634&s_channel=4

https://www.36kr.com/p/1723938652161

Origin blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/131237998