Chinese County Workers Training Artificial Intelligence

Datawhale Insights

Latest: The state of the AI industry. Source: Blue Word Project

Author | Lin Shi

Source | Blue Word Project (NPO2020)

Full text | About 4,300 characters

The whole world is talking about the disruptive technological revolution that ChatGPT will bring, but Li Jie, an artificial intelligence trainer, is not at all excited.

To complete piecework paid at 4 cents apiece, Li Jie and dozens of others sit in a room that looks like a first-generation Internet cafe, clicking and dragging the mouse thousands of times a day.

His job is to prepare "feed" for training artificial intelligence models by labeling large amounts of text, audio, and images: "eyeballs", "Sichuan dialect", "green belt". Only labeled data can be recognized by an AI model and used to train its ability to discriminate.

What Li Jie does most is label road pictures, marking the names, colors, and other details of the objects in them. In the industry this is commonly known as "drawing frames".

On a productive day he can draw 2,000-3,000 frames. At 4 cents per frame, he earns about 3,000 yuan a month. For a young vocational-school graduate living in a northwestern county, that income is not bad.
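The piece-rate arithmetic above can be checked with a short sketch. The number of working days per month is an assumption (the article does not state it), and the function name is made up for illustration. Amounts are kept in fen (1 yuan = 100 fen) so the sums stay exact.

```python
# Piece-rate pay, computed in fen (1 yuan = 100 fen) to avoid rounding issues.
FEN_PER_FRAME = 4      # "4 cents" per frame, as quoted in the article
WORKING_DAYS = 26      # assumption: the article does not say how many days he works


def monthly_income_yuan(frames_per_day: int, working_days: int = WORKING_DAYS) -> float:
    """Return monthly piece-rate income in yuan for a given daily output."""
    total_fen = frames_per_day * FEN_PER_FRAME * working_days
    return total_fen / 100


if __name__ == "__main__":
    for frames in (2000, 3000):
        print(f"{frames} frames/day -> {monthly_income_yuan(frames):.0f} yuan/month")
```

Under these assumptions, 2,000 frames a day comes to 2,080 yuan a month and 3,000 frames a day to 3,120 yuan, so the "about 3,000 yuan" figure corresponds to consistently working near the top of his range.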

A data labeling factory

The same scene plays out in Kenya. In Nairobi, the capital, more than 30 workers became data labelers for ChatGPT. They work 9 hours a day, reading 150-200 passages of text and flagging content that contains sex, violence, or hate speech. Some have had nightmares for a week because of the volume of disturbing text they read every day.

These workers earn US$1.32 per hour after tax. If they complete their assigned tasks, the hourly wage can rise to US$1.44, plus a bonus of about US$70. That works out to 2,500-3,000 yuan a month, better than the typical local blue-collar job.

While artificial intelligence products make huge waves, a group of invisible "artificial intelligence trainers" remains below the surface, from Kenya and Uganda to India and China. In bare-bones working conditions and with the simplest of skills, they are connected to cutting-edge technology.

Serving artificial intelligence

To Li Jie, artificial intelligence means the voice assistant on a mobile phone, "just like Apple's Siri".

He studied e-commerce at a vocational school, and most of his classmates went on to work in customer service at e-commerce companies. He often hears them complain about their jobs. By contrast, data labeling is boring but straightforward: he only needs to complete the tasks step by step.

To "draw the frame" for a car, a single picture requires several similar operations

In the 2021 edition of the "National Occupational Skills Standards for Artificial Intelligence Trainers", the ability profile of this profession is described as "certain learning ability, expression ability, calculation ability; normal sense of space and color vision", and the required general education level is "graduation from junior high school". In other words, the barrier to entry is almost zero.

Guo Mei, who is over 50, used to work at a local coal mine in Shanxi. After leaving the mine she went a long time without finding work, and finally became an employee at a data labeling base, drawing more than 2,000 frames every day. "I never thought that I would have anything to do with driverless cars and artificial intelligence."

A data labeler drawing frames around cars

Besides "drawing frames", Li Jie also takes on voice annotation projects, usually recordings of speakers from different regions and groups collected by the client. He has to wear headphones and carefully work out the meaning of each utterance.

Over the course of a day, he listens to hundreds of strangers speaking in different situations. It may be a middle-aged man asking a question loudly on the road amid traffic noise and horns, or an auntie giving instructions into a microphone in Cantonese-accented Mandarin. Sometimes he even hears swear words.

Li Jie transcribes these voices one by one into accurate text, and sometimes tags them with finer-grained labels such as the speaker's gender and emotion. The data ultimately teaches artificial intelligence models to understand human language, and is used in products such as smart customer service, smart speakers, and map navigation.

The three cornerstones of artificial intelligence are data, computing power, and algorithms. The more data there is and the higher its quality, the "smarter" the models that can be trained.

The mainstream direction of artificial intelligence is deep learning. In the past, people told the machine what characteristics a cat has, and the machine judged whether an object was a cat based on those characteristics. Deep learning instead "feeds" the machine a large number of pictures of different cats and lets it work out the characteristics of a cat by itself. This requires a large number of manually labeled pictures. As the industry saying goes, however much intelligence there is, that much manual labor went into it.

The field of data annotation has its own legend: the ImageNet project. Its database holds more than 14 million annotated images covering more than 20,000 categories of objects, including 120 different breeds of dog.

The ImageNet image collection contains more than 14 million labeled images, of which more than 1 million have bounding boxes

The project was initiated by Fei-Fei Li, an artificial intelligence expert at Stanford University. In 2009, mainstream research focused on models and algorithms, so she took a different path and set out to improve data quality. Today, ImageNet is the world's largest image recognition database and is used in thousands of artificial intelligence research projects and experiments.

Behind the ImageNet project, there are 50,000 data labelers from 167 countries. It took them three years to complete the labeling of all the pictures.

Li Jie is a veteran of image annotation. The data packages sent to him usually contain hundreds of photos taken on the road, and he has to label vehicles, pedestrians, green belts, and other objects according to the client's requirements. Another common task is labeling the lane lines of the road.

This kind of labeling comes with strict requirements. "The frame can't be too big or too small, let alone miss a point. If there's a mistake and the work fails acceptance, it has to be drawn again." Most of this data goes to machine learning for autonomous driving. To ensure driving safety, training usually requires millions of labeled samples, and behind them are countless Li Jies clicking mice and tapping keyboards in front of computers.
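The article does not say how acceptance is actually checked. One common way to quantify whether a frame is "too big or too small" is intersection over union (IoU) against a reviewer's reference frame. The sketch below is only an illustration under that assumption: the `Box` type, the `passes_acceptance` rule, and the 0.9 threshold are all hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned frame in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    overlap = Box(max(a.x1, b.x1), max(a.y1, b.y1), min(a.x2, b.x2), min(a.y2, b.y2))
    inter = area(overlap)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def passes_acceptance(drawn: Box, reference: Box, threshold: float = 0.9) -> bool:
    """Hypothetical rule: accept a frame only if it matches the reference closely."""
    return iou(drawn, reference) >= threshold


if __name__ == "__main__":
    reference = Box(0, 0, 100, 100)
    print(passes_acceptance(Box(1, 1, 100, 100), reference))   # slightly tight frame
    print(passes_acceptance(Box(0, 0, 200, 100), reference))   # frame twice too wide
```

A single score like this penalizes frames that overshoot and frames that undershoot in the same way, which is why it is a natural fit for the "neither exceed nor fall short" rule the labelers describe.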

Internet Foxconn

Guiyang, the city of big data.

In the digital town of Bainiaohe, Huishui County, about 50 kilometers away from the center of Guiyang, there is a company called Mengdong Technology with more than 500 data labelers—half of them are students of the nearby Shenghua Vocational College.

Zheng Chengan, a third-year student, is an intern at Mengdong Technology. The company has only a dozen or so full-time employees, and its managers are also teachers at the school. "Class is work, and the teacher is the manager."

Shenghua Vocational College located in Bainiaohe Digital Town

He loves this job very much, and data labeling has given him another choice in life. He had never even touched a computer before entering vocational college, but now he earns more than 1,500 yuan a month working part-time in front of one.

Huishui County, where Zheng Chengan lives, ranks in the middle of the 88 counties in Guizhou. Its GDP in 2020 was 13.916 billion yuan, and the per capita disposable income of rural permanent residents was 12,924 yuan, or just over 1,000 yuan a month.

To earn more for living expenses, Zheng Chengan sometimes volunteers for overtime when an urgent project comes in. He knows full well that labeling is not a job that can last, so he has quietly set himself a goal: to become the person who manages the labelers.

Guiyang is not the only city of this kind in China.

The birth of the data labeling industry can be traced back to 2005. At that time, Zhu Songchun, a famous computer vision expert and artificial intelligence expert, returned from the United States to his hometown of Ezhou, Hubei, and founded the Lianhuashan Research Institute, which was said to be the earliest big data labeling team in the world at that time.

Once deep learning became the mainstream of artificial intelligence, the ever-growing body of Internet data became its best nutrient.

According to the data company IDC, the amount of data produced globally every year will soar from 16.1ZB in 2016 to 163ZB in 2025, of which 80%-90% is raw data. Only after being cleaned and labeled does it become data in a standardized format that artificial intelligence can understand.

Because data labeling is labor-intensive, its companies tend to settle in third- and fourth-tier cities. Whether the goal is poverty alleviation or catching a ride on the Internet economy, local governments and Internet companies quickly find common ground.

In 2018, the Shanxi Transformation and Comprehensive Reform Demonstration Zone in Taiyuan partnered with Baidu to create what is billed as "the single data labeling base with the largest workforce and output value in the country". The base covers more than 10,000 square meters and has brought in at least 35 data labeling companies and more than 2,000 data labelers.

Baidu Shanxi Data Labeling Base

In Hotan, Xinjiang, 4,000 people do data labeling work in the local digital economy industrial park, and the Hotan region has even announced the goal of becoming "the capital of the data labeling industry", with a data labeling employment base of 100,000 people.

In Henan, hundreds of data labeling companies have grown up from scratch. In Jinan, the first data labeling base in Shandong houses 1,500 "artificial intelligence trainers". Datatang, which is listed on the New Third Board, has also established bases in Baoding and Hefei where hundreds of data labelers can work at the same time.

Data labelers themselves get labeled "Internet migrant workers" on a "cyber assembly line". For the vast majority of them, though, an Internet version of Foxconn is already a rare opportunity.

"Teach the apprentice, starve the master"

When data labeling became a hot trend, the gold diggers followed.

In 2017, Zhou Hua happened to learn from a friend that data labeling could make money. His previous business had just failed, so he decided to take another gamble.

He calculated that one data labeler could generate 7,000 yuan of output a month. After deducting the 3,000 yuan salary, quality inspection, venue, equipment, and other expenses, he could clear 1,500 yuan per labeler. "If you recruit 100 people, you can earn 150,000 yuan a month."
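Zhou Hua's back-of-the-envelope numbers can be laid out as a short sketch. The article gives the output, salary, and profit per labeler but does not itemize the remaining costs, so `implied_overhead` below is only what the other three figures imply, not a reported number. All variable names are illustrative.

```python
OUTPUT_PER_LABELER = 7000    # yuan per month, Zhou Hua's estimate
SALARY = 3000                # yuan per month, from the article
PROFIT_PER_LABELER = 1500    # yuan per month, Zhou Hua's estimate

# Quality inspection, venue, equipment, and other costs are not itemized in
# the article; this is simply the remainder implied by the figures above.
implied_overhead = OUTPUT_PER_LABELER - SALARY - PROFIT_PER_LABELER


def monthly_profit(headcount: int) -> int:
    """Total monthly profit in yuan if every labeler hits the estimate."""
    return headcount * PROFIT_PER_LABELER


if __name__ == "__main__":
    print(f"implied overhead per labeler: {implied_overhead} yuan")
    print(f"profit with 100 labelers: {monthly_profit(100)} yuan")
```

The implied overhead is 2,500 yuan per labeler, which means the estimate depends on every seat staying fully booked with orders. That is the condition that later sank many of his peers.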

He found a partner, bought computers, and secured a venue. With no requirements on education or work experience, he quickly recruited a group of data labelers and began taking orders in full swing.

Data labeler at work

At the time, the data labeling industry was riding the wave of artificial intelligence entrepreneurship. According to Qianzhan Industry Research Institute, the number of data labeling companies grew steadily from 2014 and peaked in 2017, a year that saw 9 financing events related to data labeling. By April 2021, 18 companies had obtained financing across 39 financing events.

There are three types of companies in the data labeling industry. The first is the internal data labeling department of a large Internet company, which processes data in-house. The second is a data labeling company with its own base, such as Datatang, which can take on orders independently and even outsource them to third parties. The third and most numerous are small companies that operate as studios. They can usually only accept orders on crowdsourcing platforms, or orders subcontracted layer by layer through third-party intermediaries. On the platforms, they may be called "guilds" or "teams".

Zhou Hua's studio belongs to the last category. At the time, it relied mainly on orders from the Baidu Crowdtest (Baidu Zhongce) platform. The platform distributes various tasks, known in the industry as "free questions", including data collection, image annotation, and text annotation. According to data from Baidu Crowdtest, the platform has 25 million registered users.

But not every order on Baidu Crowdtest reaches Zhou Hua directly. Sometimes he has to take on second-hand or even third-hand orders, with the companies that hold the channels pocketing the difference.

Stardust Data, still a startup at the time, caught the same tailwind.

Zhang Lei, the founder of Stardust Data, spent 10 years on Wall Street and in Silicon Valley, and worked as a senior data scientist at the investment platform CircleUp. When he returned to China in 2017, he originally wanted to keep building a business in the investment field, and tried to create an investment research robot that would assist investors' decisions by studying large volumes of company annual reports, prospectuses, and other financial documents. At the time, domestic data labeling providers could usually only meet customer needs mechanically, and a "novel" labeling requirement like this one was hard to fulfill anywhere in the industry. Zhang Lei saw an opportunity.

Stardust Data, which he founded, bills itself as a provider of data labeling solutions tailored to customers. Based in Sanlitun, Beijing, the company completed a Pre-A round of RMB 10 million as early as January 2018, and its latest round of RMB 50 million in August last year. It also runs a "labeling platform" business: it bids for data labeling orders from large companies and then subcontracts them to small "data factories". Zhou Hua is one of its partners.

Haitian Ruisheng, established in 2005, has profited even more from the current wave of generative artificial intelligence. Well known in the industry for voice data labeling, the company listed on the Science and Technology Innovation Board in 2021. Since January this year, its stock price has soared from about 60 yuan per share to more than 200 yuan per share.

Haitian Ruisheng got its start with voice labeling projects

After all, for the many large domestic companies developing artificial intelligence, basic data labeling is a hard requirement, yet they cannot do it all in-house forever. So as long as the orders keep coming, studios like Zhou Hua's and big companies like Haitian Ruisheng and Stardust Data can all make a lot of money. But not every entrant has Zhou Hua's luck. He knows many peers whose companies folded early because of a shortage of orders and long settlement cycles.

With the successive arrival of GPT-4 and Wenxin Yiyan, artificial intelligence is being "upgraded", and the data labeling industry is changing along with it.

Artificial intelligence researchers have begun "feeding" machines a mix of unlabeled and partially labeled data, an approach known as "semi-supervised learning". Self-supervised learning and automated labeling, which do not rely on manual annotation, are also being put into practice in the industry.
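To make the idea concrete, here is a toy sketch of one semi-supervised technique, self-training with pseudo-labels, using a nearest-centroid classifier on one-dimensional data. It is not the method of any company named in the article. The function names, the `min_margin` confidence rule, and the sample data are all invented for illustration.

```python
from statistics import mean


def centroids(labeled):
    """Mean feature value per class, from (value, label) pairs."""
    by_label = {}
    for value, label in labeled:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(cents, value):
    """Return (label, margin): the nearest centroid and its lead over the runner-up.

    Assumes at least two classes are present.
    """
    ranked = sorted(cents, key=lambda label: abs(value - cents[label]))
    best, second = ranked[0], ranked[1]
    margin = abs(value - cents[second]) - abs(value - cents[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=2.0, rounds=5):
    """Pseudo-labeling: repeatedly adopt confident predictions as if they were labels."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        scored = [(v, predict(cents, v)) for v in pool]
        adopted = [(v, label) for v, (label, margin) in scored if margin >= min_margin]
        if not adopted:
            break  # nothing left that the model is confident about
        labeled.extend(adopted)
        adopted_values = {v for v, _ in adopted}
        pool = [v for v in pool if v not in adopted_values]
    return centroids(labeled), pool


if __name__ == "__main__":
    hand_labeled = [(1.0, "small"), (9.0, "large")]      # only two human labels
    unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 5.1]
    model, still_unlabeled = self_train(hand_labeled, unlabeled)
    print(model, still_unlabeled)
```

With only two hand-labeled points, the model labels six of the seven remaining samples on its own and leaves the ambiguous one (5.1) for a human. That is the economic point of the paragraph above: manual labeling shrinks to the seed set and the hard cases.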

At the end of June last year, at the Tesla office in San Mateo County, California, a number of employees were told in a meeting that they were being laid off. Most of the 200 people ultimately let go were data labelers. Dojo, the computer Tesla is developing, uses self-supervised learning to train artificial intelligence models, so the demand for manual data labeling keeps shrinking.

Data Labelers in Africa

Tencent, Alibaba, ByteDance, and many other major companies are also developing self-supervised learning algorithms, and at some data labeling companies 60% of the output already comes from automatic machine labeling.

Li Jie has heard it said that the data labeler is the "teacher of artificial intelligence". It is he and his colleagues who draw frames day after day and teach artificial intelligence to understand the human world.

But he never imagined that when the era of artificial intelligence truly arrived, the ones replacing him and his colleagues would be their own former students.


Origin blog.csdn.net/Datawhale/article/details/130143383