GPT-4 is exhausting the data of the entire universe! As OpenAI faces lawsuit after lawsuit over data, a UC Berkeley professor issues a warning

 

Having already consumed nearly the entire web, generative AI will soon run out of data.


Recently, the renowned UC Berkeley computer scientist Stuart Russell said that training ChatGPT and other AI tools may quickly exhaust "the text of the whole universe."

In other words, training an AI like ChatGPT will be hampered by insufficient data.


This could affect how generative AI developers gather data and train AI in the coming years.

At the same time, Russell believes that artificial intelligence will replace humans in jobs that amount to "language in, language out."

When the data runs out, what can fill the gap?

Russell's recent predictions have caught everyone's attention.

OpenAI and other generative AI developers have been collecting data to train their large language models.

Yet the data collection practices integral to ChatGPT and other chatbots are facing increased scrutiny.


Among the critics are executives upset that individuals' ideas are being used without consent and that platform data is being harvested for free.

But Russell's insight points to another potential weakness: a shortage of text on which to train these models.

Last November, a study by researchers (including some from MIT) estimated that machine-learning datasets could exhaust all "high-quality language data" by 2026.


Paper address: https://arxiv.org/pdf/2211.04325.pdf

According to the study, "high-quality" language data comes from sources such as books, news articles, scientific papers, Wikipedia, and filtered web content.

GPT-4, the model behind the wildly popular ChatGPT, was likewise trained on a large amount of high-quality text.

That data comes from publicly available online sources, including digital news outlets and social media sites.

Such "data scraping" from social media sites is what led Musk to limit the number of tweets users can view each day.


Russell said that while many reports remain unconfirmed, they describe OpenAI purchasing text datasets from private sources. There may be other explanations for these purchases, but the natural inference is that not enough high-quality public data is left.

For a long time, OpenAI has not disclosed the training data behind GPT-4.

Now, OpenAI needs to supplement public language data with "private data" to build GPT-4, the company's most powerful and advanced AI model to date.

Evidently, high-quality data really is in short supply.

OpenAI did not immediately respond to a request for comment ahead of publication.

OpenAI caught in a data storm

OpenAI has been in big trouble lately, and it's all about data.

First, 16 anonymous plaintiffs sued OpenAI and Microsoft, filing a 157-page complaint alleging that the companies used sensitive data such as private conversations and medical records.


The claim amounts to as much as $3 billion. The lawsuit states that despite established protocols for the purchase and use of personal information, OpenAI and Microsoft systematically scraped 300 billion words from the Internet, including millions of pieces of personal information obtained without consent.

This includes account information, name, contact details, email, payment information, transaction history, browser data, social media, chat data, cookies and more.

This information is now embedded in ChatGPT, even though it reflects personal hobbies, opinions, work histories, and even family photos.

Clarkson, the law firm bringing the suit, has previously handled large-scale class actions on issues such as data breaches and false advertising.


Then, this week, several authors alleged that OpenAI used their novels to train ChatGPT without permission, which they say constitutes infringement.

How do they know their novels were used for training?

The evidence is that ChatGPT can generate accurate summaries of their books, which they argue is enough to show the books were used as training data.

Authors Paul Tremblay and Mona Awad stated that "ChatGPT copied data from thousands of books without permission, which violated the authors' copyright".


The complaint estimates that OpenAI's training data contains at least 300,000 books, many of them obtained from pirate websites.

For example, when GPT-3's training data was disclosed, it included two Internet book corpora (Books1 and Books2), together accounting for about 15% of the total.

The two authors believe these corpora came from free "shadow library" sites such as Z-Library and Sci-Hub.

In addition, in 2018 OpenAI disclosed that GPT-1's training data included more than 7,000 novels. The plaintiffs argue that those books, too, were used without their authors' approval.

Finding another way?

It has to be said that OpenAI's data sources have indeed drawn plenty of controversy.

In February, Wall Street Journal reporter Francesco Marconi said data from news outlets was also used to train ChatGPT.

Marconi asked ChatGPT for a list, and it named 20 media outlets.


In May of this year, Altman said in an interview that OpenAI had stopped using paying customers' data to train its large language models some time ago:

"The clients clearly didn't want us to train on their data, so we changed our plans and stopped doing it."


In fact, OpenAI quietly updated its terms of service in early March.

Altman also mentioned that the company is developing new techniques that can train models with less data.

Perhaps taking a cue from OpenAI, Google moved to close this loophole first.

On July 1, Google updated its privacy policy, which now makes clear that Google reserves the right to collect any publicly available data and use it to train its artificial intelligence models.


In other words, Google has told all users that anything accessible through public channels can be used to train Bard and future AI models.


Source: blog.csdn.net/weixin_74318097/article/details/132379689