Microsoft research team upends AI training: synthetic data opens a new era

The latest achievement of the Microsoft research team: they have begun using synthetic data to train AI models. In the future, this could ease copyright and training-data concerns in model training.

Paper: https://arxiv.org/abs/2401.00368
PDF: https://arxiv.org/pdf/2401.00368.pdf

The Microsoft research team's latest results show that synthetic data can be used successfully to train AI models, placing the team among those driving change in the field of artificial intelligence. Using large language models (LLMs) such as GPT-4, they generated synthetic text data for hundreds of thousands of text embedding tasks in nearly 100 languages and used it to train AI models. The method is presented as lowering training costs, improving efficiency, and reducing model bias.

Traditionally, teaching computers to understand and process human language has required large amounts of real training data. Microsoft's new method introduces "synthetic data" instead: rather than relying on real data, it prompts a language model to generate simulated text for a wide range of tasks. The process has three steps: using large language models to generate task definitions and prompts, generating synthetic data with attention to diversity and coverage, and cleaning and formatting the data. The claimed advantages are wide coverage, reduced bias, flexibility and scalability, cost efficiency, and rapid iteration and improvement.
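
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual pipeline: the prompt wording, the `user_query` / `positive_document` record schema, and every function name here are hypothetical, and `fake_llm` is a canned stand-in for a real model call so the example runs offline.

```python
import json
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to the model's text reply.
LLM = Callable[[str], str]

# Assumed record schema for one synthetic example (not taken from the article).
REQUIRED_KEYS = ("user_query", "positive_document")


def brainstorm_tasks(llm: LLM, n: int) -> List[str]:
    """Step 1: ask the model for a list of task definitions."""
    prompt = f"Brainstorm {n} text retrieval tasks. Return a JSON list of strings."
    tasks = json.loads(llm(prompt))
    return [t.strip() for t in tasks if isinstance(t, str) and t.strip()]


def generate_example(llm: LLM, task: str, language: str) -> str:
    """Step 2: ask the model for one synthetic example for a task and language."""
    prompt = (
        f"Task: {task}\nLanguage: {language}\n"
        'Return a JSON object with keys "user_query" and "positive_document".'
    )
    return llm(prompt)


def clean(raw_outputs: List[str]) -> List[Dict[str, str]]:
    """Step 3: drop malformed or duplicate outputs and normalise the rest."""
    seen, cleaned = set(), []
    for raw in raw_outputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the model did not return valid JSON
        if not isinstance(record, dict):
            continue
        if not all(isinstance(record.get(k), str) and record[k].strip()
                   for k in REQUIRED_KEYS):
            continue  # a required field is missing or empty
        row = {k: record[k].strip() for k in REQUIRED_KEYS}
        key = (row["user_query"], row["positive_document"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


def run_pipeline(llm: LLM, languages: List[str], n_tasks: int = 1) -> List[Dict[str, str]]:
    """Run all three steps: tasks, then one example per task and language, then cleaning."""
    tasks = brainstorm_tasks(llm, n_tasks)
    raw = [generate_example(llm, task, lang) for task in tasks for lang in languages]
    return clean(raw)


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs without network access."""
    if prompt.startswith("Brainstorm"):
        return json.dumps(["Find a recipe given a list of ingredients"])
    if "Language: English" in prompt:
        return json.dumps({"user_query": " eggs, flour ",
                           "positive_document": "Pancake recipe"})
    return "not valid json"  # simulates a malformed reply that cleaning must drop


if __name__ == "__main__":
    for row in run_pipeline(fake_llm, ["English", "French"]):
        print(row)
```

Splitting the work into a task-brainstorming call and a separate example-generation call is what lets the same set of tasks be reused across many languages. The cleaning step is kept separate because model output cannot be assumed to be well-formed.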

Experimental results show that the Microsoft research team generated approximately 500,000 synthetic data examples containing 150,000 unique instructions and covering 93 languages. On the multilingual MIRACL dataset, the model trained on synthetic data performed well, supporting the method's effectiveness in multilingual and multi-task scenarios. The result opens new possibilities for the field of AI and highlights the role synthetic data can play in advancing artificial intelligence technology.

Origin blog.csdn.net/heehelcom/article/details/135414391