Technology Cloud Report: The battle for data finally comes to AI large models

An original article from Technology Cloud Report.

At present, large models are in the early stage of industrialization, and high-quality data is a key ingredient of that process.

Recently, a study from the Epoch AI Research team revealed a harsh fact: models will keep growing, but the data to train them will not keep up.

The researchers projected the total amount of image and language data that will be available between 2022 and 2100, and on that basis estimated how the size of large-model training datasets can grow in the future.

The results show that the stock of high-quality language data will be exhausted by 2026, while the stocks of low-quality language data and image data will be exhausted between 2030 and 2050 and between 2030 and 2060, respectively.

This means that unless data efficiency improves significantly or new data sources become available, the growth in model size will slow by around 2040.

It is time to pay attention to building the data side.

High-quality data has become a "hot commodity"

With a new round of the AI boom sweeping the world, massive amounts of training data have become the "fuel" for the development and evolution of AI models.

The GPT experiments show that as the number of model parameters increases, model performance improves to varying degrees.

It is worth noting, however, that the InstructGPT model trained with reinforcement learning from human feedback (RLHF) outperforms an unsupervised GPT-3 model with 100 times as many parameters, which shows that supervised labeled data is one of the keys to applying large models successfully.

If these predictions are correct, data will undoubtedly become the main constraint on scaling models further, and AI progress will slow as data stocks are exhausted.

Dr. Zhao Deli, head of the foundational vision team at Alibaba's DAMO Academy, once said in an interview that building the data side will become a question every organization working on large models has to consider: how capable a large model is often depends on the data it is trained on.

According to Dr. Zhao, building a text-to-video model is far harder than building a text-to-image model, because the amount of available video data is much smaller than that of text and images, to say nothing of its quality. Correspondingly, the results of existing text-to-video models are not satisfactory.

Taken together with the research above, if current trends continue, humanity's existing data stock will be exhausted and high-quality data will grow ever scarcer.

Because of this, a scramble for data is kicking off.

Adobe, for instance, uses its database of hundreds of millions of stock photos to build Firefly, its own suite of artificial intelligence tools. Since its release in March, Firefly has been used to create more than 1 billion images, and Adobe's share price has risen 36 percent as a result.

Some startups are also flocking to this new field. In April this year, Weaviate, a database company focused on artificial intelligence, raised $50 million at a valuation of $200 million.

Just a week later, rival Pinecone raised $100 million at a $750 million valuation.

Earlier this month, another database startup, Neon, also raised $46 million.

In China, Baidu Smart Cloud recently upgraded its large-model data services and built the country's first professional large-model data labeling base. Baidu Smart Cloud says it has worked with local governments across the country to build more than 10 such bases.

Clearly, the scramble for data has only just begun.

Data annotation ushers in another boom

AI large models have brought enormous demand, and with it the rapid development of China's data labeling industry.

China Merchants Securities believes that, on the one hand, in the era of big data, the digitization and networking of everyday behavior generates massive amounts of data, yet only about 1% of the data generated is collected and stored, and 90% of what is collected is unstructured. On the other hand, the rise of artificial intelligence has created huge demand for the structured data used in model training, making data annotation increasingly important.

Some in the industry expect a wave of data demand from ChatGPT-like large models in China by October this year, and that demand will be massive. Judging from the leading domestic data labeling companies, current capacity is not enough to meet it.

Data from iResearch shows that the AI basic data service market, covering data collection, data processing (labeling), data storage, and data mining, will continue to grow in the coming years.

By 2025, the domestic AI basic data service market is expected to reach 10.11 billion yuan, with overall market growth of 31.8% from 2024 to 2025.

Also according to iResearch, China's data labeling market was worth 3.09 billion yuan in 2019 and is expected to exceed 10 billion yuan by 2025, a compound annual growth rate of 14.6%.

As data volumes grow and data structures keep changing, the data labeling industry is reaching into ever more fields, with especially large demand from autonomous driving and AIGC.

As the foundation for high-quality answers from AI large language models, the data annotation production process consists of four stages: design (structuring the training dataset), collection (acquiring raw data), processing (labeling the data), and quality inspection (checking data quality and processing quality at each stage).

Within that process, data annotation means taking raw data such as images, text, and video and adding one or more labels that specify context for a machine learning model, helping it make accurate predictions.
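As a concrete illustration, here is a minimal sketch of what annotated records might look like; the field names and schema are hypothetical, not any particular platform's format.

```python
# A minimal sketch of labeled records after annotation. Field names are
# illustrative assumptions, not a specific labeling platform's schema.
text_sample = {
    "id": "doc-00042",
    "data": "The new chip doubles inference throughput.",
    "labels": ["technology", "hardware"],  # tags added by an annotator
    "annotator": "labeler-17",
    "reviewed": True,                      # passed quality inspection
}

image_sample = {
    "id": "img-00007",
    "uri": "images/street_001.jpg",
    "boxes": [                             # bounding-box labels for detection
        {"label": "car", "xyxy": [34, 50, 210, 180]},
        {"label": "pedestrian", "xyxy": [220, 60, 260, 175]},
    ],
}
```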

At present, most labeling tasks still have to be completed manually, and different data types and application fields require professional labelers with the corresponding domain expertise.

As technology develops, the data labeling industry is becoming half machine, half human.

To control data quality for a large language model with tens of billions of parameters, each complex RLHF requirement has to be split, via the labeling platform, into many simple workflows: machines handle the preprocessing, while humans provide the feedback that requires deep understanding, so that human effort is spent not on simple questions but on labeling the professional ones, as the sketch below illustrates.
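A minimal sketch of that machine/human split, assuming a hypothetical pre-labeling model and confidence threshold (neither comes from the article):

```python
# Machine preprocessing plus routing: confident items are auto-labeled,
# the rest are queued for human, understanding-based feedback.
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    label: str | None = None
    confidence: float = 0.0

def toy_pre_labeler(text: str) -> tuple[str, float]:
    """Stand-in for a real pre-labeling model (hypothetical)."""
    if text.endswith("?"):
        return "question", 0.95
    return "statement", 0.55

def route(items: list[Item], threshold: float = 0.9):
    """Pre-label each item, then split by confidence."""
    auto, human = [], []
    for item in items:
        item.label, item.confidence = toy_pre_labeler(item.text)
        (auto if item.confidence >= threshold else human).append(item)
    return auto, human

auto, human = route([Item("Is the dosage correct?"), Item("Take twice daily")])
print(len(auto), "auto-labeled;", len(human), "sent to human annotators")
```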

The industry generally combines active and passive quality inspection: the former relies on humans, while the latter relies on algorithms to do some pre-identification.

However, the accuracy of current data labeling tools is uneven: some remain quite low, while others can reach 80% or 90%. The higher the machine's recognition rate, the less human labor is needed, and the more controllable cost, profit, speed, and quality become.
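A minimal sketch of what algorithmic pre-identification might look like; the check rules and record format are illustrative assumptions:

```python
# "Passive" quality inspection: algorithmic pre-checks flag suspect
# annotations so human ("active") inspectors can focus on them.
ALLOWED_LABELS = {"car", "pedestrian", "bicycle", "other"}

def pre_inspect(record: dict) -> list[str]:
    """Return a list of problems found in one annotation record."""
    issues = []
    if not record.get("labels"):
        issues.append("no labels attached")
    for label in record.get("labels", []):
        if label not in ALLOWED_LABELS:
            issues.append(f"label outside taxonomy: {label}")
    if record.get("annotator") is None:
        issues.append("missing annotator id")
    return issues

batch = [
    {"id": "a1", "labels": ["car"], "annotator": "labeler-03"},
    {"id": "a2", "labels": ["truck"], "annotator": None},
]
flagged = {}
for record in batch:
    issues = pre_inspect(record)
    if issues:
        flagged[record["id"]] = issues
print(flagged)  # only "a2" lands in the human inspection queue
```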

As technology continues to develop, the data labeling industry may reach a higher degree of automation, but different application fields will still need a certain number of human labelers.

Traditional data labeling urgently needs to be upgraded

It is worth noting that in today's hot wave of large-scale model training, the demand for traditional data labeling is likely to decline.

The key to making ChatGPT more "human", reinforcement learning from human feedback (RLHF), raises the requirements for data labeling to another level.

Analysis shows that in the RLHF stage, the model is first pre-trained on a large dataset and then interacts with professional AI trainers: professional labelers label, evaluate, and give feedback on the answers ChatGPT generates, assigning each answer a score or label.

This labeled data can then serve as a "reward function" in the reinforcement learning process, guiding ChatGPT's parameter updates and ultimately helping the model optimize itself through reinforcement learning.
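A minimal sketch of how such scores become a reward signal, using the pairwise-comparison setup commonly used to train RLHF reward models (a generic illustration, not OpenAI's actual implementation; requires PyTorch):

```python
import torch
import torch.nn.functional as F

class TinyRewardModel(torch.nn.Module):
    """Maps a pre-computed answer embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for embeddings of two answers to the same prompt, where a
# human labeler ranked `chosen` above `rejected`.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Pairwise loss: push reward(chosen) above reward(rejected).
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise ranking loss: {loss.item():.3f}")
```

Once trained, a reward model like this scores new answers automatically, standing in for the human labelers during the reinforcement learning phase.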

In other words, what makes ChatGPT "more human" is likely its ability to use the feedback from human annotation to keep optimizing its own model, so that its output better matches the logic of human thinking.

However, the traditional data labeling model struggles to meet the needs of RLHF.

In the past, data labeling companies' mainstream business centered on selling tool systems and labeling services. On the one hand, lacking data of their own, few of them sell curated datasets. On the other hand, talent upgrading is a systematic undertaking that poses a stiffer test for these companies.

Beyond that, RLHF training involves a great deal of fact judgment and value judgment. Value judgments concern widely recognized "public order and good morals", where aligning an AI's understanding is theoretically easier; fact judgments involve the know-how of specific industries.

The latter often requires industry professionals, not traditional data taggers who can only label parts of speech and image details.

In other words, to keep up with the new wave of AI, data labeling companies must upgrade not only their data but also their talent.

Indeed, some labeling companies have begun running internal "personnel improvement courses", training labelers to understand the upgraded labeling requirements and compliant ways of answering.

However, in fields such as medical care where professional barriers are very high, data annotation still faces a talent dilemma.

The head of operations at one data labeling company once said: "Especially in medicine, some tasks can be labeled by ordinary people after training, while others must be done by medical practitioners. You can imagine how hard it is to recruit such talent."

But even with all these difficulties, data labeling companies will not be reshuffled overnight: at the very least, among the several stages of large-model training, the semi-supervised learning of the initial stage still requires traditional data labeling.

Facing the opportunities of large models and RLHF, another round of large-scale investment seems inevitable.

Some in the industry believe that data labeling companies hoping to provide higher-level data services in vertical fields may need to build new product lines; founders with an AI R&D background may even make better-suited data labeling entrepreneurs.

In the face of the new generation of the AI wave, no one can lie back and earn easy money: that is the hidden "price tag" behind every technological iteration.

[About Science and Technology Cloud Report]

Technology Cloud Report focuses on original enterprise-level content. Founded in 2015, it is among the top 10 media outlets covering cutting-edge enterprise IT. It is recognized by the Ministry of Industry and Information Technology and Trusted Cloud, is one of the officially designated media for the Global Cloud Computing Conference, and publishes in-depth original reporting on cloud computing, big data, artificial intelligence, blockchain, and other fields.

Source: blog.csdn.net/weixin_43634380/article/details/132667772