Microsoft and OpenAI turn to a "data perpetual motion machine": is synthetic data a dawn or a twilight?

Companies such as Microsoft, OpenAI, and Cohere have begun testing synthetic data for training AI models. Aidan Gomez, CEO of Cohere, said synthetic data can be applied to many training scenarios, though it has not yet been widely adopted.

Existing general-purpose data sources seem to be approaching their performance limits, and developers believe the data freely available on the web is no longer sufficient to drive further gains in AI models. Gomez pointed out that the web is extremely noisy and chaotic: "It doesn't give you the data you really want. The web can't meet all our needs."

At an event in May this year, OpenAI CEO Sam Altman was asked whether he was concerned about regulators investigating ChatGPT for possible violations of user privacy. Altman was noncommittal, saying only that he was "very confident that soon all data will be synthetic."

▌Real human data is expensive

To substantially improve the performance of AI models and raise their competence in science, medicine, business, and other fields, models need "unique and complex" datasets. Such data must come from experts such as scientists, doctors, writers, actors, and engineers, or from the proprietary records of large enterprises such as pharmaceutical companies, banks, and retailers.

This points to another reason AI companies are turning to synthetic data: real data is too expensive.

Setting aside highly technical pharmaceutical and scientific data, even the prices Reddit and Twitter have quoted for access to their data struck Gomez as too high.

Against this backdrop, synthetic data naturally becomes a cost-effective alternative: it avoids the high price of real data and can also produce more complex data for training AI.

▌How to train with synthetic data?

  How to train AI large models with synthetic data? Gomez gave an example:

When training an advanced math model, Cohere might have two AI models hold a conversation, one playing a math teacher and the other a student. The two models then discuss mathematical topics such as trigonometric functions; as Gomez put it, "everything is 'imagined' by the model."

If a model says something wrong during this process, a human reviewing the conversation corrects it.
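The loop described above can be sketched in a few lines. This is a mocked, illustrative version: the `teacher_reply` and `student_reply` functions stand in for calls to two real language models, and the names and review logic are assumptions, not Cohere's actual pipeline.

```python
# Mocked teacher/student self-play loop for generating synthetic training
# pairs. Both "models" are placeholder functions; in practice they would be
# API calls to two large language models.

def teacher_reply(topic: str, turn: int) -> str:
    """Placeholder for the 'math teacher' model."""
    return f"Teacher explains step {turn} of {topic}."

def student_reply(topic: str, turn: int) -> str:
    """Placeholder for the 'student' model."""
    return f"Student attempts step {turn} of {topic}."

def generate_dialogue(topic: str, turns: int = 3) -> list[tuple[str, str]]:
    """Run a fixed number of teacher/student exchanges and collect them."""
    return [(teacher_reply(topic, t), student_reply(topic, t))
            for t in range(1, turns + 1)]

def human_review(dialogue, is_correct=lambda pair: True):
    """Keep only exchanges a human reviewer marks as correct."""
    return [pair for pair in dialogue if is_correct(pair)]

dialogue = generate_dialogue("trigonometric functions")
training_pairs = human_review(dialogue)
```

The human reviewer is modeled as a filter predicate, matching the article's point that people correct the models' mistakes after the fact rather than during generation.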

Two recent studies from Microsoft Research likewise show that synthetic data can be used to train AI models, in this case models much smaller and simpler than OpenAI's GPT-4 or Google's PaLM 2.

In one of the papers, GPT-4 generated a synthetic dataset of short stories called "TinyStories," using only words simple enough for a four-year-old to understand. The dataset was then used to train small language models that generate fluent, grammatically correct stories.
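The defining constraint of such a dataset is the restricted vocabulary. The sketch below shows one assumed way to enforce it as a post-processing filter: keep only generated stories whose words all come from a small, child-level word list. The tiny vocabulary here is an invented example, not the paper's actual list.

```python
import re

# Illustrative vocabulary filter in the spirit of TinyStories: discard any
# generated story containing a word outside a small allowed vocabulary.
ALLOWED = {"the", "cat", "dog", "ran", "to", "a", "park", "and", "was", "happy"}

def uses_simple_vocab(story: str, allowed: set[str]) -> bool:
    """True if every word in the story belongs to the allowed vocabulary."""
    words = re.findall(r"[a-z']+", story.lower())
    return all(w in allowed for w in words)

stories = [
    "The cat ran to the park and was happy.",
    "The quantum annealer minimized the Hamiltonian.",
]
kept = [s for s in stories if uses_simple_vocab(s, ALLOWED)]
# kept contains only the first story
```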

▌Dawn or twilight?

Where there is demand for synthetic data, suppliers naturally appear, among them startups such as Scale AI and Gretel.ai. Founded by former intelligence analysts from the NSA and CIA, Gretel.ai has partnered with Google, HSBC, Riot Games, Illumina, and other companies to augment existing data with synthetic data and help train AI models.

Ali Golshan, CEO of Gretel.ai, said the key to synthetic data is that it can protect the privacy of every individual in a dataset while preserving the dataset's statistical integrity.

Synthetic data can also be used to remove bias and imbalance from existing data.
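Golshan's "statistical integrity" claim can be illustrated with a toy example: replace real numeric records with draws from a distribution fitted to them, so aggregate statistics survive even though no real record is reused. This is only a sketch of the idea; real products such as Gretel.ai's use far more sophisticated generators and explicit privacy guarantees, neither of which this snippet provides.

```python
import numpy as np

# Toy sketch: fit a Gaussian to "real" records, then publish samples from
# the fit instead of the records themselves.
rng = np.random.default_rng(42)
real = rng.normal(loc=50_000, scale=12_000, size=10_000)  # e.g. salaries

mu, sigma = real.mean(), real.std()
synthetic = rng.normal(loc=mu, scale=sigma, size=10_000)

# Aggregate statistics are preserved, though individual records are not.
assert abs(synthetic.mean() - real.mean()) < 500
assert abs(synthetic.std() - real.std()) < 500
```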

However, some people are not optimistic about synthetic data.

Opponents argue that not all synthetic data is carefully calibrated to mirror, let alone improve upon, the real world.

Researchers from Oxford, Cambridge, Imperial College London, and other institutions have found that the negative effects of synthetic data can be comparable to poison: if large amounts of AI-generated content are used during training, the result is model collapse, leaving the model with irreversible defects.

The training data for each new generation of models is polluted by data generated by the previous generation, distorting the model's perception of reality. Over time, the model forgets portions of the true underlying distribution; even under nearly ideal long-term learning conditions, this cannot be avoided. The researchers describe it as the AI model suffering from "dementia."
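The mechanism can be caricatured numerically: repeatedly fit a simple model to a small sample drawn from the previous generation's model, then sample the next generation from the fit. The fitted spread drifts toward zero, i.e. the model progressively "forgets" the tails of the original distribution. This is an illustrative toy, not the experimental setup of the cited paper.

```python
import numpy as np

# Toy model-collapse simulation: each "generation" is a Gaussian fitted to
# only 10 samples from the previous generation. Estimation noise compounds,
# and the fitted standard deviation collapses over many generations.
rng = np.random.default_rng(0)

def one_generation(mu: float, sigma: float, n: int = 10):
    """Fit the next 'model' to n samples drawn from the current one."""
    sample = rng.normal(mu, sigma, size=n)
    return sample.mean(), sample.std()

mu, sigma = 0.0, 1.0
initial_sigma = sigma
for _ in range(300):
    mu, sigma = one_generation(mu, sigma)

# After many generations the fitted spread has collapsed toward zero.
assert sigma < 0.5 * initial_sigma
```

The collapse happens even though each individual fit is a reasonable estimate; the errors are small per generation but accumulate multiplicatively, which mirrors the paper's point that the defect is irreversible once generations of synthetic data are stacked.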


  Even Golshan, a synthetic data practitioner, concedes that training on poor-quality synthetic data can hinder progress.

"More and more content online is generated by AI. Over time, this can really lead to degradation, because the knowledge produced by these large models is duplicated without any new insight."


Origin blog.csdn.net/xyk2000114/article/details/131872732