Hard knowledge that AI product managers must understand (1): Application areas

This article surveys the current status of AI applications across popular fields in four parts: computer vision, voice interaction, natural language processing, and typical AI scenarios.

Hello everyone, I am Ark. This begins a hard-core knowledge series of three articles, "Hard knowledge that AI product managers must understand", covering three aspects: application fields, common concepts and algorithms, and self-improvement. The series distills many of my notes. In this first article, let's talk about the current status of each mainstream application field. Some readers have said my articles are too "dry" and too long, and that they have to read them several times and make an outline, so here is the outline up front:

1. Computer Vision (CV)

2. Voice interaction

(1) Speech recognition (ASR)

(2) Speech synthesis (TTS)

3. Natural language processing (NLP)

4. Typical AI scenarios

(1) Intelligent robot

(2) Unmanned driving

(3) Face recognition (non-mobile terminal)

(4) Visual design (mobile terminal)

(5) Automatic text editing

1. Computer Vision (CV)

Computer vision is a science that studies how to make machines "see": using cameras and computers in place of human eyes to identify, track, and measure targets. It simulates biological vision with computers and related equipment, processing captured images or videos to recover three-dimensional information about the scene, so that the computer can perceive, abstract, and reason about the objects in the surrounding world.

The application value of computer vision in real scenarios lies mainly in letting computers recognize images and videos to replace part of the manual work, saving labor costs and improving efficiency. Traditional computer vision basically follows the pipeline of image preprocessing, feature extraction, modeling, and output. With deep learning, however, many problems can now be solved end-to-end, going from input to output in one step.

1. Research content

  1. The quality of images collected in practical applications is usually not as ideal as laboratory data; poor lighting and blurred images are common problems. Therefore it is first necessary to correct the photometric and geometric distortion introduced by the imaging system and to suppress and remove the noise introduced during imaging. These steps are collectively referred to as image restoration.
  2. Preprocess the input image. This stage uses many image processing techniques and algorithms, such as image filtering, image enhancement, and edge detection, in order to extract basic features of the scene from the image: corners, edges, lines, borders, colors, and so on. It also includes various image transformations (such as rectification), image texture detection, and image motion detection.
  3. Based on the extracted features, separate from the image the primitives that reflect the three-dimensional object, such as contours, lines, textures, edges, boundaries, and the object's surfaces, and establish the topological and geometric relationships among them. This is called primitive segmentation and relationship determination.
  4. Using prior-knowledge models stored in a database in advance, the computer recognizes the real-world entities represented by individual primitives or combinations of primitives (called model matching). Guided by this prior knowledge, it derives the meaning of the actual scene the image represents and produces an interpretation or description of the image.
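As a concrete taste of the preprocessing and feature-extraction stage (step 2 above), here is a minimal Sobel edge detector in plain NumPy. This is an illustrative sketch, not code from any production vision system, and the tiny test image is invented.

```python
import numpy as np

def sobel_edges(img):
    """Approximate gradient magnitude of a grayscale image (2-D float array)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient kernel
    ky = kx.T                                                          # vertical gradient kernel
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)  # edge strength per interior pixel

# A vertical step edge: left half dark, right half bright.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)
# The strongest responses sit on the columns where intensity jumps.
```

Real pipelines use optimized convolution (e.g. OpenCV) rather than explicit loops, but the principle of sliding a small kernel over the image is the same.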

2. Bottlenecks

  1. The data collected in practical applications is still far from ideal. Lighting conditions, surface gloss, and changes in camera and spatial position all affect data quality. Algorithms can compensate to a degree, but in many cases missing information simply cannot be recovered algorithmically.
  2. It is not easy to extract depth or surface-orientation information from one or more planar images; in particular, finding corresponding features across multiple images is difficult under grayscale distortion, geometric distortion, and interference. Beyond recovering three-dimensional information, objects in the real world occlude one another, and parts of an object occlude other parts of itself, which makes image segmentation even more complicated.
  3. Different prior-knowledge settings can also make the same image produce different recognition results. Prior knowledge plays a very important role in a vision system: the knowledge base stores models of the objects that may actually be encountered and the constraint relations between objects in real scenes. The computer's job is to use this prior knowledge as a guide, based on the primitives and their relationships in the analyzed image, and finally obtain a description of the image through matching, searching, and reasoning. Throughout the process, prior knowledge supplies templates and evidence, and each step's result is compared against it, so the prior-knowledge settings strongly affect the recognition result.

Since I specialize in products in the AI CV direction, future articles will contain a great deal of CV knowledge and real CV projects. In upcoming articles I will discuss visual recognition, especially its star application, face recognition, and break down the AI product implementation details involved: from imaging, preprocessing, and compute estimation to detection, multi-target tracking, segmentation, recognition, and algorithm accuracy testing. Once you understand this system, extending it to other visual projects such as vehicles or animals follows similar principles.

2. Voice interaction

Voice interaction is also one of the most popular directions. The full voice interaction process actually includes speech recognition, natural language processing, and speech synthesis. Natural language processing is usually studied as a separate field and is introduced on its own later in this article, so only speech recognition and speech synthesis are covered here.

The best application scenarios for voice interaction are those where the eyes are not free to look or the hands are not free to operate. The typical "inconvenient to look" scene is the smart car, and the typical "inconvenient to operate" scene is the smart speaker; these are also the two popular subdivision directions.

A complete voice interaction basically follows the process shown in the figure below:

Classic voice interaction use cases

1. Speech Recognition (ASR)

(1) Research content

The input of speech recognition is sound, an analog signal that a computer cannot process directly, so it must be converted into text information the computer can handle. The traditional recognition method encodes the sound into a digital signal and extracts features from it for processing.

Traditional acoustic models generally use Hidden Markov Models (HMM); the processing flow is speech input → encoding (feature extraction) → decoding → output.

There is also an "end-to-end" recognition method, which generally uses deep neural networks (DNNs). The acoustic model's input can usually be more primitive signal features (reducing the work of the encoding stage), and the output no longer has to pass through low-level units such as phonemes: it can be letters or Chinese characters directly.

When computing resources and training data are sufficient, the "end-to-end" approach often achieves better results, and current speech recognition is mainly realized with DNNs. The effect of speech recognition is generally measured by the "recognition rate": the ratio of recognized words that match the standard (reference) text to the total number of words in the reference. At present, the recognition rate of continuous general-purpose Mandarin speech can reach up to 97%.
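The "recognition rate" metric can be made concrete with a small sketch. The function below computes word error rate (WER) via word-level edit distance and treats 1 − WER as a rough proxy for the recognition rate described above; the example sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

wer = word_error_rate("turn on the living room light",
                      "turn on the living room lights")
recognition_rate = 1 - wer  # 5 of 6 reference words recovered
```

Production evaluations separate substitution, insertion, and deletion counts, but this edit-distance core is the standard starting point.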

(2) Derivative research content

  • Microphone array: in environments such as homes, conference rooms, outdoors, and shopping malls, speech recognition faces problems such as noise, reverberation, interfering voices, and echo. A microphone array can address these. A microphone array is composed of a certain number of acoustic sensors (usually microphones) and is used to sample and process the spatial characteristics of the sound field; it enables speech enhancement, sound source localization, de-reverberation, and sound source extraction/separation. Common configurations include 2-microphone, 4-microphone, 6-microphone, and 6+1-microphone arrays. As the number of microphones grows, pickup distance, noise suppression, localization angle, and price all change, so the configuration must be fitted to the actual application scenario to find the best solution.
  • Far-field speech recognition: To solve the problem of far-field speech recognition, it needs to be completed together with the front and back ends. The front-end uses microphone array hardware to solve the problems caused by noise, reverberation, echo, etc., while the back-end uses the different acoustic laws of the near-field and far-field to construct an acoustic model suitable for the far-field environment. The front-end and the back-end jointly solve the problem of far-field recognition.
  • Voice wake-up: wake the voice device with a keyword, usually one of three or more syllables, for example "Hey Siri" or Amazon Echo's "Alexa". Voice wake-up is basically performed locally: it must run on the device itself and cannot be offloaded to a cloud platform, because a device listening 24×7 must protect user privacy and therefore cannot stream audio to the internet for cloud processing. Voice wake-up has requirements on wake-up response time, power consumption, and wake-up accuracy.
  • Voice activation detection: Determine whether there is valid voice in the outside world, which is especially important in the far field with low signal-to-noise ratio.
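Voice activation detection can be illustrated with a crude energy-threshold sketch. Real VADs are far more sophisticated and must cope with the low signal-to-noise far-field conditions mentioned above, but the core idea of flagging high-energy frames looks like this; the signal, frame length, and threshold are invented for illustration.

```python
import numpy as np

def detect_speech_frames(samples, frame_len=160, threshold=0.01):
    """Flag frames whose mean energy exceeds a fixed threshold (a crude VAD)."""
    n_frames = len(samples) // frame_len
    flags = []
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        energy = float(np.mean(frame ** 2))  # average power of the frame
        flags.append(energy > threshold)
    return flags

rng = np.random.default_rng(0)
silence = rng.normal(0, 0.01, 160)  # low-level background noise
speech = rng.normal(0, 0.5, 160)    # much louder, speech-like segment
flags = detect_speech_frames(np.concatenate([silence, speech, silence]))
# -> [False, True, False]
```

Practical systems adapt the threshold to the noise floor and use spectral features rather than raw energy, which is why low-SNR far-field VAD remains hard.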

2. Speech synthesis (TTS)

(1) Research content

Speech synthesis is the process of converting text into voice (reading it aloud). There are currently two implementation approaches: the splicing (concatenative) method and the parametric method.

  • The splicing method chops large amounts of recorded speech into basic units and stores them, then selects and splices units as needed. The output voice quality is high, but the required database is very large.
  • The parametric method extracts parameters from speech and converts them into waveforms to output speech. Its database requirements are small, but the sound inevitably sounds mechanical.
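The splicing (concatenative) method can be sketched in a few lines: store pre-recorded unit waveforms and concatenate them on demand. The two-entry "unit database" below is a toy stand-in for the huge diphone/syllable inventories real systems use, with ramps standing in for recorded audio.

```python
import numpy as np

# Toy "unit database": each syllable maps to a pre-recorded waveform snippet.
# Real systems store enormous inventories of units; these ramps are stand-ins.
unit_db = {
    "ni": np.linspace(0.0, 1.0, 100),
    "hao": np.linspace(1.0, 0.0, 100),
}

def synthesize(units):
    """Concatenate stored unit waveforms in order (the 'splicing method')."""
    return np.concatenate([unit_db[u] for u in units])

wave = synthesize(["ni", "hao"])  # 200 samples spliced from two units
```

The hard parts a real system adds, unit selection among many candidates and smoothing the joins, are exactly why the database must be so large.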

DeepMind earlier released WaveNet, a machine learning speech generation model that directly generates raw audio waveforms. It can model any sound, does not rely on any pronunciation-theory model, and achieves excellent results in text-to-speech and general audio generation.

(2) Bottleneck

Personalized TTS requires large amounts of data, which is hard to satisfy when user expectations are relatively high. AI product managers therefore need to choose scenarios where user expectations are not harsh, or manage user expectations during design.

3. Natural language processing (NLP)

1. Research content

Natural language processing is a discipline that enables computers to understand, analyze, and generate natural language; in understanding and processing words it plays a role equivalent to the human brain. NLP is currently the core bottleneck of AI development. The field includes syntactic and semantic analysis, information extraction, text mining, machine translation, information retrieval, question answering systems, dialogue systems, and other categories.

The general research process of NLP is: develop a model that can express linguistic capability, propose methods to continuously improve that language model, design application systems based on the model, and keep refining the model through them. Both natural language understanding and natural language generation fall under the umbrella of natural language processing.

The natural language understanding (NLU) module focuses on semantic understanding of a single sentence. At the sentence level it classifies the user's utterance to determine intent (intent classification); at the word level it finds the key entities in the utterance (slot filling).

A simple example: when the user says "I want to eat ice cream", the NLU module recognizes that the user's intent is to "find a dessert shop or supermarket" and that the key entity is "ice cream". With the intent and key entities, the subsequent dialogue management module can query the back-end database, or, if information is missing, continue multi-turn dialogue to fill the remaining entity slots.

The natural language generation (NLG) module is the last mile of interaction between machine and user. At present most natural language generation still uses rule-based template filling, somewhat the reverse of entity-slot extraction: the final query result is embedded into a template to generate the response. Besides hand-written templates, deep learning generation models can also learn templates with entity slots from data.
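The NLU and NLG steps above can be sketched as a toy pipeline for the "I want to eat ice cream" example. Real systems use trained intent classifiers and sequence labelers; the keyword tables, entity list, and response template here are invented purely for illustration.

```python
# Toy keyword-based NLU/NLG pipeline. The tables below are illustrative stand-ins
# for trained intent classifiers and slot-filling (sequence labeling) models.
INTENT_KEYWORDS = {
    "find_dessert_shop": ["eat", "hungry"],
    "play_music": ["play", "listen"],
}
FOOD_ENTITIES = ["ice cream", "cake", "pizza"]

def understand(utterance):
    """NLU: classify intent and fill the food slot from surface keywords."""
    text = utterance.lower()
    intent = next((name for name, kws in INTENT_KEYWORDS.items()
                   if any(kw in text for kw in kws)), "unknown")
    slots = {"food": next((f for f in FOOD_ENTITIES if f in text), None)}
    return intent, slots

def generate(intent, slots):
    """NLG: fill a hand-written response template with the extracted slot."""
    if intent == "find_dessert_shop" and slots["food"]:
        return f"Here are shops near you that sell {slots['food']}."
    return "Sorry, could you rephrase that?"  # fallback when understanding fails

intent, slots = understand("I want to eat ice cream")
reply = generate(intent, slots)
# intent == "find_dessert_shop", slots["food"] == "ice cream"
```

When a slot comes back empty, a dialogue manager would instead generate a clarifying question, which is the multi-turn slot-filling loop described above.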

2. Application scenarios

Natural language processing is a very important part of CUI (Conversational User Interface); wherever a CUI is used, natural language processing must play a role. In addition, machine translation and text classification are important application areas of natural language processing. However, NLP applications also draw the most complaints; the classic one is that "smart customer service not only failed to increase efficiency, it reduced it." Compared with CV, NLP is indeed much less intuitive.

3. Bottlenecks

(1) Definition of word entity boundary

Natural language is multi-turn: a sentence cannot be viewed in isolation, since it has either surrounding context or preceding and following dialogue. Correctly dividing and defining different word entities is the basis for correctly understanding language. For current deep learning techniques, modeling multiple turns and context is far harder than one-input, one-output problems such as speech recognition and image recognition. Therefore, companies that do well in speech or image recognition may not do natural language processing well.

(2) Word sense disambiguation

Word sense disambiguation includes disambiguating polysemous words and resolving references. Polysemy is a very common phenomenon in natural language, while reference resolution means correctly understanding which person or thing a pronoun stands for, for example, who exactly "he" refers to in a complex conversation. Word sense disambiguation requires a correct understanding of the textual context, the conversational environment, and background information, none of which can currently be modeled cleanly.

(3) Personalized recognition

Natural language processing has to face the problem of individualization. Natural language often has ambiguous sentences, and different people may have different statements and different expressions when using the same sentence. This kind of individualization and diversification problem is very difficult to solve.

(4) NLP technology system

Here is also a summary of the entire technical system of natural language processing, as follows:

NLP technology system

(5) Product experience

Natural language recognition: Xunfei Input Method (PC software and mobile APP), Xunfei Yuji (mobile APP), Baidu Input Method (PC software and mobile APP)

Far-field speech recognition (smart speakers): Amazon Echo, Google Home, Apple HomePod

Machine translation: Google Translate

Multi-round dialogue robots: Apple Siri, Microsoft Xiaoice, Baidu Du Mi, Xiaoi, Little Yellow Chicken, Turing Robot

(6) Recommended reading materials

  • How to find academic materials in the field of natural language processing (NLP) for beginners: http://blog.sina.com.cn/s/blog_574a437f01019poo.html
  • Principles of Speech Recognition Technology: https://www.zhihu.com/question/20398418
  • Secrets of the new generation speech recognition system of iFlytek: http://news.imobile.com.cn/articles/2015/1231/163325.shtml
  • The basic principles and applications of natural language processing (NLP): http://blog.csdn.net/inter_peng/article/details/53440621
  • Detailed explanation of siri working principle and technical analysis of siri: http://www.infoq.com/cn/articles/zjl-siri/
  • CSDN natural language processing blog posts: http://so.csdn.net/so/search/s.do?q=%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86&t=blog&o=&s=&l=

4. Typical AI scenarios

As mentioned earlier, the three main areas of current AI research (computer vision, voice interaction, and natural language processing) correspond to artificial vision, hearing, and the brain. Finally, I will cover several scenarios that are currently very hot in the market. These segmented scenarios build on intersections of the three areas above and include intelligent robots, autonomous driving, face recognition, mobile visual design, and automatic text editing.

1. Intelligent Robot

Take sorting robots as an example: sorting robots are robots equipped with sensors, objective lenses, and electronic optical systems that can sort goods quickly. With the vigorous development of e-commerce platforms, automatic sorting robots have been widely deployed. Amazon, Alibaba, and JD.com all use intelligent sorting robots in goods sorting, which greatly saves labor costs; such a system is claimed to be able to sort 18,000 orders per hour. Further reading:

  • The realization of industrial robot sorting technology: https://wenku.baidu.com/view/a2da4ed17f1922791688e8cf.html
  • What are the key technologies for unmanned express sorting?: http://baijiahao.baidu.com/s?id=1572495116614945&wfr=spider&for=pc
  • The logistics robot market is developing rapidly, and the working principle of the sorting robot is introduced: http://www.xianjichina.com/news/details_45519.html

2. Autonomous driving

Autonomous vehicles (self-piloting automobiles), also known as driverless cars, computer-driven cars, or wheeled mobile robots, are smart cars that drive themselves via a computer system. They rely on artificial intelligence, visual computing, radar, monitoring devices, and global positioning systems working together so that a computer can operate a motor vehicle automatically and safely without any active human operation.

On July 6, 2017, live video from the Baidu AI Developers Conference of "Li Yanhong riding a driverless car on Beijing's Fifth Ring Road" went viral. More recently, news of an autonomous bus on the road in Shenzhen did the same: Alphabus, a self-driving passenger bus jointly created by Hailiang Technology, Shenzhen Bus Group, the Shenzhen Futian District Government, Ankai Bus, Dongfeng Xianglu, Sagitar Juchuang, ZTE Corporation, Southern University of Science and Technology, Beijing Institute of Technology, and Beijing Union University, officially began route data collection and trial operation on open roads in Shenzhen's Futian Free Trade Zone. This anxious world has gained another group of anxious people: bus drivers.

Volvo distinguishes four stages of unmanned driving according to the level of automation: driver assistance, partial automation, high automation, and full automation:

  1. Driving assistance system (DAS): aims to provide assistance to the driver, including providing important or useful driving-related information and issuing clear, concise warnings when the situation becomes critical, for example the Lane Departure Warning (LDW) system.
  2. Partially automated systems: systems that can automatically intervene when the driver receives a warning but fails to take appropriate actions in time, such as the "Automatic Emergency Braking" (AEB) system and the "Emergency Lane Assist" (ELA) system.
  3. Highly automated system: A system that can replace the driver to control the vehicle for a long or short period of time, but still requires the driver to monitor driving activities.
  4. Fully automated system: a system that can drive an unmanned vehicle and allow all occupants in the vehicle to engage in other activities without monitoring. This level of automation allows passengers to engage in computer work, rest and sleep, and other entertainment activities.

The best-known company abroad in this field is Tesla; in China, Baidu is the leading player. The Baidu self-driving car project started in 2013, led by the Baidu Research Institute. Its core technology is the "Baidu Automobile Brain", comprising four modules: high-precision maps, positioning, perception, and intelligent decision-making and control.

Among them, the high-precision map independently collected and produced by Baidu records complete three-dimensional road information and enables vehicle positioning with centimeter-level accuracy. Meanwhile, Baidu's unmanned vehicles rely on internationally leading traffic-scene object recognition and environment perception technology to achieve high-precision vehicle detection and recognition, tracking, distance and speed estimation, road segmentation, and lane line detection, providing the basis for intelligent driving decisions.

Tesla is an American electric vehicle and energy company that produces and sells electric cars, solar panels, and energy-storage equipment. Tesla's plan is to keep iterating on assisted-driving technology and finally upgrade it to full self-driving. While it remains in the assisted-driving phase, a driver is required: the driver has full control, can override or cancel assisted-driving behavior, and bears full responsibility for safety.

Google's unmanned driving aims to get there in one step: the basic premise is that no human intervention is required, someone without a driver's license can get in alone and sleep for the whole ride, and passengers bear no responsibility. In China, LeTV also bet heavily on cars. Unfortunately, although unmanned driving and smart travel are the trend, 2017 was not the breaking point, and the huge LeTV empire collapsed under the strain of funding its car business.

Further reading includes:

  • What technologies are involved in autonomous vehicles?: https://www.zhihu.com/question/24506695
  • What is autopilot, and how to understand its functions and principles in an easy-to-understand manner?: https://www.zhihu.com/question/54647152
  • In-depth: principle analysis of lidar technology and autonomous driving technology: http://www.21ic.com/app/auto/201705/721051.htm
  • Introduction to the principles of autonomous driving technology and future trends: http://www.elecfans.com/xinkeji/595666_2.html
  • Google driverless driving introduction Ted video, with Chinese subtitles: https://www.ted.com/talks/chris_urmson_how_a_driverless_car_sees_the_road
  • Huang Renxun interviewed Elon Musk and mentioned the principle of Tesla assisted driving https://youtu.be/uxFeUOstyKI
  • Application of artificial intelligence in autonomous driving technology: https://wenku.baidu.com/view/277ffb5cbb1aa8114431b90d6c85ec3a87c28baa.html

3. Face recognition technology (non-mobile terminal)

Face recognition is a biometric identification technology based on human facial features. A camera collects images or video streams containing faces, automatically detects and tracks the faces in them, and then applies a series of related techniques to the detected faces; it is also commonly called portrait recognition or facial recognition. In 2017 it came into full use for mobile phone unlocking. A face recognition system mainly comprises four components: face image acquisition and detection, face image preprocessing, face image feature extraction, and matching and recognition.
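The final matching-and-recognition component can be sketched as nearest-neighbor search over face embeddings. Real systems compare 128- or 512-dimensional vectors produced by a deep network; the 4-dimensional embeddings, names, and threshold below are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, gallery, threshold=0.8):
    """Return the enrolled identity whose embedding best matches the probe,
    or None if no match clears the similarity threshold."""
    best_name, best_score = None, threshold
    for name, emb in gallery.items():
        score = cosine_similarity(probe, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Invented 4-D embeddings standing in for real deep-network outputs.
gallery = {"alice": np.array([1.0, 0.0, 0.0, 0.1]),
           "bob": np.array([0.0, 1.0, 0.1, 0.0])}
probe = np.array([0.9, 0.05, 0.0, 0.12])
match = identify(probe, gallery)  # -> "alice"
```

The threshold is the product lever here: raising it trades false accepts for false rejects, which is exactly the accuracy-testing work mentioned in the CV section.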

Face recognition products have been widely used in finance, justice, the military, public security, border inspection, government, aerospace, electric power, factories, education, medical care, and many other enterprises and institutions. As the technology matures further and social acceptance grows, face recognition will be applied in more fields. A number of outstanding companies have emerged in this industry, such as Hunan Vision Weiye, Beijing Megvii Technology, and Beijing Shangtang Technology (SenseTime).

Further reading includes:

  • Principle of the face recognition system: http://blog.csdn.net/zergskj/article/details/43374003
  • Principle and development of face recognition system: https://wenku.baidu.com/view/0c56a7bf3186bceb19e8bbf9.html
  • The principle of the main algorithm of face recognition: http://blog.csdn.net/liulina603/article/details/7925170
  • Brief talk about artificial intelligence | Understanding the principle of face recognition in 2 minutes: http://baijiahao.baidu.com/s?id=1568919427558010&wfr=spider&for=pc
  • Top ten face recognition technology companies: http://www.elecfans.com/consume/571535.html?1509154910

4. Visual design (mobile terminal)

More and more selfie apps combine face recognition to add props such as ears, a nose, or a crown to a person's face or head, identifying and locking onto the face or body so that the props automatically follow the person's movement.

Instagram can automatically recognize the design elements in one picture and apply another picture as a filter, producing striking effects such as turning an ordinary landscape photo into a Van Gogh-style oil painting.
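This style-filter effect is usually traced to neural style transfer (see "A Neural Algorithm of Artistic Style" in the reading list below), whose core ingredient is matching Gram matrices of CNN feature maps. Here is a minimal sketch of the style loss; the random feature maps stand in for real CNN activations and are invented for illustration.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, height*width) feature map.
    In neural style transfer, matching Gram matrices transfers 'style'."""
    return features @ features.T

def style_loss(feats_a, feats_b):
    """Mean squared difference between the two Gram matrices."""
    g1, g2 = gram_matrix(feats_a), gram_matrix(feats_b)
    return float(np.mean((g1 - g2) ** 2))

# Invented 3-channel feature maps standing in for CNN activations.
rng = np.random.default_rng(1)
photo = rng.normal(size=(3, 16))
painting = rng.normal(size=(3, 16))
loss = style_loss(photo, painting)  # large when the "styles" differ
```

In the full algorithm this loss (plus a content loss) is minimized by gradient descent on the pixels of the output image, gradually painting the photo in the target style.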

Domestic apps with built-in visual-design AI are all over our phones: Meipai, SNOW camera, Faceu, B612, Xiaotu, IN, Meijia camera, LINE camera, and other mobile apps support automatic face recognition and let users pick cat ears, rabbit ears, fox ears, or pig ears.

Further reading includes:

  • A Neural Algorithm of Artistic Style:https://arxiv.org/abs/1508.06576
  • Build an ostagram yourself: https://zhuanlan.zhihu.com/p/22704865

5. Automatic text editing

Robot writing is nothing new. Two years ago there were dedicated news apps abroad whose content was entirely machine-captured and machine-summarized, mainly in sports and finance. Many overseas traditional media outlets have used robots to write. Because artificial intelligence can monitor hot words on the internet, robots are more sensitive and faster to trending topics than humans are.

Robots know what will become a hot topic and can deliver it to the audience first. In the media industry, AI writing is the future trend, especially for structured, standardized, data-based content such as financial reports, sports news, and stock market news, where manual processing is neither as accurate nor as efficient as AI.

Products recommended for trial here include Tencent’s Dreamwriter, Baidu’s Writing Brain, Xinhua News Agency’s "Kaibi Xiaoxin", and Toutiao’s "xiaomingbot".

Automatic text editing process taking Baidu products as an example

Further reading includes:

  • The "Director of New Media Operations" of the New York Times is a robot called Blossom: http://www.leiphone.com/news/201508/Ze9HOBijDnwIQIPE.html
  • EditorAI: Use artificial intelligence technology to assist journalists in editing and writing: http://news.91.com/mip/s5947c56e593b.html
  • Artificial intelligence helps you write a paper; there is always a suitable one for you!: http://www.sohu.com/a/119470301_107743

The above is a general summary of the current status of AI applications in each field, and it is fairly complete. Going forward, I will continue to share in depth around each technical point and product design, so stay tuned.


Origin blog.csdn.net/weixin_42137700/article/details/114072680