The battle for AI commercialization: high-quality NLP data becomes "highly sought-after"

Original article by Tech Cloud Report.

Nowadays, teasing Siri has become a regular pastime for netizens. Data shows that Apple's voice assistant Siri receives about 427,000 questions every day, and 80% of them are along the lines of "Can you speak the Northeastern / Sichuan / Hunan dialect?" or "Can you beatbox?"

It has to be said that Siri puts up with a lot from humans. In fact, AI chatbots like Siri were not created to be teased. They are virtual assistants meant to help users with everyday tasks such as checking the weather forecast, ordering meals, looking up news, and finding traffic routes.

Behind this lies the sustained effort of AI companies, along with many traditional enterprises undergoing intelligent transformation, using AI technology to optimize user experience and improve the efficiency of corporate collaboration.

However, the NLP (Natural Language Processing) technology behind AI chatbots is extremely difficult. Fully mastering NLP (in the martial-arts idiom, "opening the Ren and Du meridians") would be almost equivalent to possessing human cognitive intelligence, so to date no technology giant has dared to claim that its AI products have the same language and cognitive abilities as humans.

This is why AI chatbots such as Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana, despite tireless daily efforts to improve their NLP capabilities, still produce responses in conversations with humans that are laughable, confusing, or even worrying.

Recently, Amazon Alexa was in the news again for the wrong reasons. Some users said that when they asked Alexa about the cardiac cycle, it responded that "heartbeat is the worst process in the human body," that living leads to the rapid depletion of natural resources and to overpopulation, and advised the user to stab themselves.

This disturbing conversation recalls Tay, an AI chatbot launched by Microsoft in 2016. Within less than a day of going online, netizens had taught it to swear like a badly behaved "child," and its offensive posts led to Tay being taken offline within 24 hours.

If a "low-intelligence," "nonsense-spouting" AI were widely used in commercial products, the consequences are easy to imagine: not only would the quality of AI products be questioned, it could also cause real harm. Improving the cognitive intelligence of AI products, that is, the NLP technology behind them, has therefore become the key to commercial AI competition at this stage.

In fact, as an important branch of AI, NLP also relies on three factors: computing power, algorithms, and data. Computing power rests on the development of IT infrastructure, and NLP algorithms have made great progress in recent years thanks to breakthroughs in deep learning. But NLP data, the "nutrient" that allows NLP technology to be put into practice, has long remained in a relatively "rough" state.

From rough-and-ready to high standards

NLP data services enter the 4.0 era

In the era of artificial intelligence, the importance of data is self-evident. Many companies claiming to have massive amounts of data actually have unstructured or unlabeled data. Data labeling is an important part of transforming data into AI business value.

Data labeling means annotating data such as speech, images, and text by tagging, marking, coloring, or highlighting, so that differences, similarities, or categories in the target data are made explicit. AI algorithms can then train and learn from the labeled data. The higher the quality of the annotation, the more accurate the AI's learning and output, and the smarter the AI becomes.

For example, when booking a flight in everyday life, people use many different expressions: "Book a ticket"; "Are there any flights to Shanghai?"; "I have a business trip, check tickets for me"; "Check the flight, departing to Shanghai next Tuesday"... These expressions, in endless combinations, all represent the intent "book a ticket." When an AI hears them, how can it accurately understand that they all refer to booking a ticket?

Without data annotators labeling large numbers of sentences (extracting topics, marking entities, classifying intent and sentiment, and so on) to give the AI detailed, high-quality "textbooks," an AI cannot be trained into any kind of "intelligence," even with algorithms and computing power.
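To make the idea of an annotated "textbook" concrete, here is a minimal sketch of what labeled sentences for the ticket-booking example might look like. The class names, the `annotate` helper, and the `book_ticket`, `destination`, and `date` label names are all hypothetical choices for illustration, not a format described in the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    label: str  # hypothetical entity type, e.g. "destination"
    start: int  # index of the first character of the entity in the text
    end: int    # index one past the last character of the entity


@dataclass
class LabeledUtterance:
    text: str
    intent: str
    entities: List[Entity] = field(default_factory=list)

    def entity_text(self, entity: Entity) -> str:
        """Return the substring of the utterance covered by an entity."""
        return self.text[entity.start:entity.end]


def annotate(text: str, intent: str, spans: Dict[str, str]) -> LabeledUtterance:
    """Build a labeled utterance; `spans` maps an entity label to its surface text."""
    entities = []
    for label, surface in spans.items():
        start = text.find(surface)
        if start < 0:
            raise ValueError(f"{surface!r} does not occur in {text!r}")
        entities.append(Entity(label, start, start + len(surface)))
    return LabeledUtterance(text, intent, entities)


# Different wordings from the article, all mapped to one (hypothetical) intent label.
TRAINING_SET = [
    annotate("Book a ticket", "book_ticket", {}),
    annotate("Are there any flights to Shanghai", "book_ticket",
             {"destination": "Shanghai"}),
    annotate("Check the flight, departing to Shanghai next Tuesday", "book_ticket",
             {"destination": "Shanghai", "date": "next Tuesday"}),
]

for example in TRAINING_SET:
    found = {e.label: example.entity_text(e) for e in example.entities}
    print(example.intent, found, "<-", example.text)
```

Storing entities as character offsets rather than bare strings is one common design choice: it lets a reviewer check that each label points at the exact span the annotator intended.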

With the rise of deep learning in recent years, which depends on large amounts of labeled data to work, the industry's demand for data labeling has skyrocketed. Providing data labeling services has therefore become a hot business in the AI field.

On Amazon Mechanical Turk, a world-renowned crowdsourcing platform for data annotation, users only need to fill in simple personal information to start working, and task publishers can upload annotation tasks themselves. As of January 2011, the number of registered workers on Mechanical Turk had reached 500,000. In 2016, about 5% of Americans made money through Mechanical Turk, a number that exceeded that of Uber drivers.

In China, hundreds of companies nationwide are currently engaged in the data labeling business, with about 200,000 full-time and about 1 million part-time data labeling practitioners. The explosion in demand for data labeling has pressed the fast-forward button on the development of the entire data service industry.

According to a Zhiyan report, the market size of China's data labeling and auditing industry reached 5.255 billion yuan in 2018. The data labeling track has no shortage of Internet giants, but even more of the players are fast-growing startups. In a race to expand rapidly on cheap labor, sloppy and disorganized data and data reuse are not uncommon, and the whole industry has a rough, unregulated character.

However, is data labeling really as simple as imagined? Can the quality of the labeled data really meet the requirements of AI algorithm iteration?

In the early days of AI commercialization, AI algorithms did not demand high data accuracy. Routine AI training needed large volumes of data first, and requirements for labeling quality were relatively loose. However, as AI becomes more closely integrated with various industries, its commercialization has reached a new level, and companies are increasingly demanding about how AI performs in commercial use. To ensure the recognition accuracy of AI algorithms, the quality of data annotation has become crucial.

For example, in the finance and insurance industry, early requirements for AI customer service robots went no further than "after the user asks a question, extract its keywords and answer with a scripted response." Although many of the replies were beside the point, or failed to answer the user's question at all, this did not hinder the insurance business, because human customer service agents were still the main force answering user questions.

Today, however, with fierce competition in Internet finance, more and more users are used to handling business online, and AI customer service robots are replacing human agents on a large scale. The accuracy of AI question answering directly determines business efficiency and cost, affects user experience, and largely determines the competitiveness of financial institutions.

If early NLP labeled data could train the AI customer service robots of major financial institutions to roughly the same basic level of cognitive intelligence, then every step toward higher-level cognitive intelligence requires NLP annotation data that is higher in quality and tailored to specific needs.

Thus a new data service model, represented by Cloud Measurement Data, was born: customized, high-quality data collection and labeling services built around the specific needs of each enterprise.

Looking at the history of AI data services, from data deposited on the Internet in the Data 1.0 era, to general-purpose data products in the Data 2.0 era, to crowdsourced data services in the Data 3.0 era, today's high-quality data services have entered the Data 4.0 era.

Through more standardized organizational management and quality control, these services provide higher-quality, more reliable data for AI iteration, and thus high-quality data support for commercial AI competition at this stage.

"Highly sought-after" high-quality NLP data

"Scarce" data service providers

In fact, more and more companies have realized the importance of high-quality NLP data. As AI technology is deployed in industries such as finance, home furnishing, healthcare, education, automotive, and manufacturing, the AI products born of this commercialization, such as customer service robots, smart speakers, and intelligent medical consultation, all place higher demands on AI technology and NLP data.

Leading companies in particular, in order to keep their competitive advantage, must pursue higher-quality NLP data that fits their business needs even if it only raises the accuracy of their AI's cognitive intelligence 1%-2% above the industry average. As the AI industry booms and market competition intensifies, high-standard NLP data services that meet enterprise needs have become a must-have for industry leaders.

However, in the face of surging market demand, supply is short: few companies on the market can provide such high-standard services. The reason is that although the data collection and labeling industry has a low barrier to entry, its ceiling is high, and reaching the top is not easy. In this emerging field, Cloud Measurement Data, which specializes in customized, scenario-based, high-quality data services, is advancing rapidly and becoming a leading enterprise in AI data annotation in China.

Cloud Measurement Data uses its own data scenario laboratories and data annotation bases to provide data collection and annotation services for fields such as smart driving, smart home, smart city, smart finance, and retail. Among the many data annotation "sweatshops" built on cheap labor with little technical content, Cloud Measurement Data, with its focus on high-quality service, looks quite "unconventional."

First, to produce higher-quality data, Cloud Measurement Data has a set of standardized processes and methodologies.

Early in a project, the project manager communicates repeatedly with the customer to help them work out requirements that better fit the actual situation. Once agreement is reached, labeling and quality inspection personnel are gradually brought in. Daily face-to-face communication and training ensure that everyone understands and masters the relevant labeling techniques, and large-scale labeling begins only after an acceptance test is passed.

During the project, to ensure that annotators can make correct judgments, Cloud Measurement Data has dedicated trainers who teach the professional knowledge of each industry segment as well as labeling skills and business processes. Employees even joke that "annotators who have been trained for the finance and insurance industry could go straight into selling insurance."

After a labeling job is submitted, it goes through three layers of quality inspection at Cloud Measurement Data, and data that does not meet the accuracy requirements is relabeled. After the three layers of inspection there is also a random spot check to ensure high-quality data output.
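The flow described above (layered inspection, relabeling of failures, then a random spot check) can be sketched roughly as follows. This is only an illustration under stated assumptions: the article does not say what each inspection layer does, so the three automated checks, the `ALLOWED_LABELS` schema, and all function names here are hypothetical stand-ins for what are most likely human review steps.

```python
import random
from typing import Callable, Dict, List, Tuple

Item = Dict[str, str]           # hypothetical record: {"text": ..., "label": ...}
Check = Callable[[Item], bool]  # returns True when the item passes

ALLOWED_LABELS = {"book_ticket", "check_weather", "other"}  # hypothetical schema


def has_text(item: Item) -> bool:
    return bool(item.get("text", "").strip())


def has_label(item: Item) -> bool:
    return bool(item.get("label"))


def label_in_schema(item: Item) -> bool:
    return item.get("label") in ALLOWED_LABELS


def inspect(items: List[Item], layers: List[Check]) -> Tuple[List[Item], List[Item]]:
    """Run every item through the layers in order; failures go back for relabeling."""
    passed, to_relabel = [], []
    for item in items:
        # all() stops at the first failing layer, like a sequential review chain.
        if all(layer(item) for layer in layers):
            passed.append(item)
        else:
            to_relabel.append(item)
    return passed, to_relabel


def spot_check(items: List[Item], check: Check, sample_size: int, seed: int = 0) -> float:
    """Re-verify a random sample of already-passed items and return its pass rate."""
    if not items:
        raise ValueError("nothing to sample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(items, min(sample_size, len(items)))
    return sum(check(item) for item in sample) / len(sample)


batch = [
    {"text": "Book a ticket", "label": "book_ticket"},
    {"text": "Will it rain tomorrow", "label": "check_weather"},
    {"text": "Are there any flights to Shanghai", "label": ""},  # missing label
    {"text": "Check the flight", "label": "bok_ticket"},         # label not in schema
]

passed, to_relabel = inspect(batch, [has_text, has_label, label_in_schema])
rate = spot_check(passed, label_in_schema, sample_size=2)
print(len(passed), "passed,", len(to_relabel), "sent back for relabeling, spot check:", rate)
```

In a real workflow each layer would be a different reviewer or review stage, and relabeled items would re-enter the same chain before delivery.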

Second, in terms of the quality of personnel and operations, Cloud Measurement Data also breaks with the "rough-and-ready" character of the traditional data labeling industry and sets strict requirements for the professional capabilities of its data service team.

Take the intelligent customer service scenario as an example. When the agent asks whether the user wants to buy a product, different users give different answers: "I want to discuss it with my family"; "I'll think about it"; "It's not convenient right now, please contact me again later"; and so on. Many intentions may lie behind these answers: not buying for now, not considering it for now, refusing to buy, or being very interested. NLP data labeling then needs to label and classify the intent behind these dialogues.

At Cloud Measurement Data, intent annotation for the single scenario of intelligent customer service is divided into 10-20 categories and hundreds of subcategories, and may be subdivided further according to business needs.

In addition to judging and labeling the intent, domain, slots, and so on of NLP data, generalization from multiple angles is also essential. That is, whether the user speaks a dialect or Putonghua, misspeaks, or expresses the same meaning in different sentences, the AI should understand the sentence and give the correct answer. This requires NLP data annotators to generalize sentences, reorganizing or expanding sentences, tags, and so on into different phrasings to improve the accuracy of AI conversations.
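One simple way to picture this kind of generalization is template expansion: several phrasings of the same request are combined with different slot values, and every variant keeps the same intent and slot labels. The sketch below is an assumption about how such expansion could be mechanized, not the annotators' actual procedure; the `generalize` function, the templates, and the `city` and `day` slot names are hypothetical.

```python
from itertools import product
from typing import Dict, List


def generalize(templates: List[str], slot_values: Dict[str, List[str]],
               intent: str) -> List[dict]:
    """Expand each template with every combination of slot values.

    Every generated variant carries the same intent and records which
    slot values were filled in, so the labels stay aligned with the text.
    """
    slot_names = sorted(slot_values)
    variants = []
    for template in templates:
        for combo in product(*(slot_values[name] for name in slot_names)):
            slots = dict(zip(slot_names, combo))
            variants.append({
                "text": template.format(**slots),
                "intent": intent,
                "slots": slots,
            })
    return variants


# Hypothetical phrasings of one request; each template uses both slots.
templates = [
    "Are there any flights to {city} {day}?",
    "Check flights to {city} for {day}",
    "I need to fly to {city} {day}",
]
slot_values = {
    "city": ["Shanghai", "Beijing"],
    "day": ["tomorrow", "next Tuesday"],
}

variants = generalize(templates, slot_values, intent="book_ticket")
print(len(variants), "variants, e.g.:", variants[0]["text"])
```

Template expansion only covers rewordings that someone has already thought of. Dialect, slips of the tongue, and unusual sentence structures still need human annotators to write and label new variants.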

It is worth noting that collecting and labeling NLP data is more complicated than for data types such as images and video. According to Jia Yuhang, general manager of Cloud Measurement Data, image collection and labeling is highly regular, and working from standardized guidance documents is sufficient.

NLP data, however, reflects the richness of language and has to be understood and processed together with context and other background. The customer's requirements document only tells the data service personnel what the goal and meaning behind the task are. In the process, data service personnel need to break the requirements down, anticipate problems, and even offer recommendations in advance, communicating repeatedly with the customer to confirm agreement before the real work can begin.

This places high demands on data service personnel's professional competence, their ability to reconstruct business scenarios, and their ability to collaborate in operations. Especially in highly specialized fields such as healthcare, law, education, and intelligent driving, annotation cannot be done by just any ordinary person; annotators need real expertise to label and interpret the data correctly.

To ensure the professional competence of the whole data team, Cloud Measurement Data has a well-developed mechanism for selecting, training, assessing, and promoting talent, which also plays a very positive role in ensuring the quality of data output.

Third, at the technical level, Cloud Measurement Data's continued investment in hardware and software directly raises the industry's barrier to entry.

Cloud Measurement Data's self-developed data labeling platform iterates its features weekly or even faster based on feedback from actual use, combining more real-world scenarios with technology to keep improving the technical sophistication of its labeling tools. At the same time, Cloud Measurement Data is also committed to reducing repetitive labor in data annotation and improving business efficiency through engineering development.

Finally, regarding data security and privacy, which enterprise customers value most, Cloud Measurement Data also has its own principles and technical safeguards.

First, data is never reused; this is a core principle of Cloud Measurement Data. For customers' customized data needs, all data is deleted after delivery. Cloud Measurement Data neither keeps a copy for itself nor passes customized data on to other customers. It can be said that Cloud Measurement Data has been working hard to set a benchmark for data security and privacy and to serve customers responsibly.

In Jia Yuhang's view, data that an enterprise owns itself will become its core competitive barrier. Customers come to Cloud Measurement Data partly out of trust, and partly because Cloud Measurement Data can help them gain that competitiveness.

Second, to safeguard data security, Cloud Measurement Data signs data authorization agreements with all users from whom data is collected, ensuring that the data enterprises use for training is legal and compliant. Internally, Cloud Measurement Data has also put in place a series of data security processes and technologies, such as data isolation and quality assurance.

In the data service market, data quality is a hard metric. Enterprise customers verify the qualification rate and pass rate of collected and labeled data through both manual and algorithmic verification. Only those who withstand the test of the market have a chance to survive.
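As a rough illustration of how such an acceptance check might work, the sketch below compares delivered labels against a small reference sample that the customer has verified by hand. The function names, the item IDs, and the 0.95 threshold are assumptions made for the example; the article does not state what thresholds or procedures customers actually use.

```python
from typing import Dict


def pass_rate(delivered: Dict[str, str], gold: Dict[str, str]) -> float:
    """Share of reference items whose delivered label matches the verified label.

    Items missing from the delivery count as failures.
    """
    if not gold:
        raise ValueError("the reference sample is empty")
    correct = sum(delivered.get(item_id) == label for item_id, label in gold.items())
    return correct / len(gold)


def accept_batch(delivered: Dict[str, str], gold: Dict[str, str],
                 threshold: float = 0.95) -> bool:
    """Accept the delivery only if its pass rate reaches the (assumed) threshold."""
    return pass_rate(delivered, gold) >= threshold


# Hypothetical reference sample checked by hand, keyed by item ID.
gold = {"u1": "book_ticket", "u2": "refuse", "u3": "consider_later", "u4": "interested"}
delivered = {"u1": "book_ticket", "u2": "refuse", "u3": "refuse", "u4": "interested"}

print("pass rate:", pass_rate(delivered, gold))
print("accepted:", accept_batch(delivered, gold))
```

A manual reference sample like this covers the "manual verification" side. Algorithmic verification would typically mean training or evaluating a model on the delivered data and checking whether its accuracy meets expectations.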

According to Jia Yuhang, "We are responsible for the accuracy of labeling in the form of corporate services."

Among the hundreds of companies that Cloud Measurement Data serves are major AI companies and leading companies in various industries. In pursuing higher accuracy in AI cognitive intelligence, these companies have worked with various data service providers, eventually settled on Cloud Measurement Data for its very high annotation quality, and have maintained good long-term cooperation.

In fact, besides the quality and security of data collection and labeling, a data service provider's full range of service capabilities and its status as an independent third party are also important factors for enterprises choosing an AI partner. Service providers such as Cloud Measurement Data do not develop algorithms and do not get involved in the customer's business; they only provide professional data services, so enterprise customers feel more at ease working with them.

To some extent, such demanding requirements have further led to the scarcity of top data service providers.

High-standard data services are on the eve of a boom

Top service providers dominate the market

Today, the AI industry is developing rapidly thanks to the twin benefits of favorable policy and a blue-ocean market, and the NLP market has also entered the fast lane.

According to the "China Artificial Intelligence Development Report 2018", China's artificial intelligence market reached 23.7 billion yuan in 2017, of which natural language processing accounted for 4.997 billion yuan, or 21%. It is estimated that by 2020 China's artificial intelligence market will be close to 50 billion yuan, and natural language processing will be a market on the ten-billion-yuan scale.

It is not hard to predict that NLP data services, which supply the "nutrients" for the natural language processing market, are also on the eve of a boom. Natural language processing already has many commercial applications, such as machine translation, public opinion monitoring, automatic summarization, question-answering robots, customer service robots, telesales robots, and intelligent recommendation. Given the huge market size and demand, high-quality NLP data services will become an inevitable trend in the commercialization of AI.

It is worth noting that although demand for high-quality NLP data is exploding, high-quality data service providers such as Cloud Measurement Data will remain scarce, and the imbalance between supply and demand will be hard to resolve in the short term.

On the supply side, the competitive barriers for high-quality services are very high, and the soft power built from skilled talent, professional processes, and methodologies is hard to overtake in the short term. This seemingly asset-heavy business model actually creates a gap that Internet giants, who are good at traveling light and entering a track through platform effects, cannot close in the short term. As Zhang Ying, founding partner of Jingwei, said: "All light companies will become heavier in the future. Only then can they effectively resist the entry of giants, and only then can they grow bigger."

On the demand side, on the one hand, the requirements that AI commercialization places on NLP data keep rising, and the business operations of data services will become more and more complex. Whether in sample diversity, scenario diversity, labeling accuracy, or domain knowledge, data service providers face ever-growing business challenges. For latecomers without day-by-day accumulation of professional knowledge, technology, and industry experience, this competitive gap will only widen.

On the other hand, because AI algorithms need a continuous supply of high-quality labeled data, a good data service business is very sticky. Taking Cloud Measurement Data as an example, once cooperation is established on a project, it often leads to 2-3 years of continuous cooperation. This produces a Matthew effect in which the strong stay strong.

As for cooperation between the two sides, high-quality, customized data services are an emerging field, and the cooperation model between supply and demand is still being upgraded and explored. Enterprises that used to do everything in-house and build their own data collection and labeling teams are now gradually turning to professional data service providers.

In this process, the division of labor between the supply and demand sides will become clearer, and market competition will sift out the best service providers. This kind of cooperation will start with leading enterprises and top service providers in each industry, and gradually create a "demonstration effect" among small and medium-sized enterprises.

"Without good data, artificial intelligence has no future": this statement has become an industry consensus. Under the huge demand of AI commercialization, high-quality data has become the key to commercial AI competition, and the data services that result will be one of the most important trends of the future. It is foreseeable that the emerging market for high-standard data services is ready to take off. In the long run, it will develop from barrenness to prosperity and from chaos to order, and carry AI technology into its next, smarter stage.

[Origin of Tech Cloud Report]
WeChat public account: Tech Cloud Report


Origin blog.csdn.net/weixin_43634380/article/details/104676801