Embedded vision will become an epoch-making product

With the emergence of PCs, mobile phones, and interactive games, computer vision has entered consumer electronics and become familiar to the general public. Recent advances in AI and microprocessors have greatly accelerated its adoption across application fields; embedded vision in particular has drawn special attention from industry.

HAL 9000 is a fictional character in the British science-fiction novel and film "2001: A Space Odyssey". It is a Heuristically programmed ALgorithmic computer with artificial intelligence (AI) that observes the spacecraft crew through optical lenses and converses with them (often regarded as the ultimate application scenario for computer vision). Its responsibility is to keep the entire spacecraft system running normally, but it can also talk and play chess with the crew, recognize speech and facial expressions, interpret and express emotions, and even read lips. Even with today's science and technology, AI robots have realized only some of HAL's basic capabilities, which shows how difficult computer vision really is.

What is embedded vision?

Computer vision uses digital processing and intelligent algorithms to interpret the information contained in images or video and make decisions based on it. Although it has been used in academia and research laboratories for decades, it was long little known outside niche fields such as science-fiction movies and defense/aerospace before entering consumer electronics.

Embedded vision is the use of computer vision technology in embedded systems. For example, automotive ADAS integrates multiple cameras to sense the surrounding environment and provide semi-automated or even fully automated driving functions. The screen-equipped smart speakers recently released by Internet giants such as Google and Baidu are also concrete applications of embedded vision.

Market research firm Tractica predicts that from 2016 to 2025, the market for embedded vision hardware, software, and services will grow at a rate of 25%, reaching US$26.2 billion by 2025. Applications such as video security, automotive automated/assisted driving, AR/VR, and factory automation will be the fastest-growing and largest market segments for embedded vision (refer to Figure 1).

Figure 1 From 2016 to 2025, the hardware, software and service market for embedded vision applications will grow at a rate of 25%.

It is no exaggeration to say that any scenario involving human-machine interaction has room for embedded vision. Although it is not an epoch-making, standalone human-computer interaction product like the PC or the mobile phone, embedded vision plays an important role in the AI and IoT era and will be more ubiquitous than PCs and mobile phones, almost pervasive. In this sense, embedded vision will become an epoch-making product.

So, what is the driving force for the rapid development of embedded vision?

Three major components of embedded vision

An embedded system with computer vision capability mainly comprises cameras and sensors, a vision processor, software and algorithms, plus memory and network interfaces. In the past five years, great progress in microprocessor design (heterogeneous multi-core SoCs) and manufacturing (shrinking from 14nm to 7nm) has brought substantial improvements in computing performance; in particular, the emergence of dedicated processors provides the hardware foundation for processing massive volumes of image and video data.

At the same time, AI and neural networks are gradually replacing traditional vision algorithms, while the popularity of cloud computing and the emergence of various deep learning frameworks and software development tools are all improving the performance of neural networks on vision processing tasks. High-performance dedicated processors and deep learning algorithms are therefore the main driving forces behind the rapid development of embedded vision technology and applications. Of course, the development of high-resolution sensors, especially 3D sensors, also plays a role that cannot be ignored.

CPU, GPU, DSP, FPGA, or dedicated AI chip?

A typical computer vision processing flow is shown in Figure 2. The processor is generally optimized for the most computationally intensive part of the software workload. Although general-purpose CPUs are still the most widely used vision processors, pairing the CPU with a GPU as an auxiliary processor has become standard practice in embedded vision design. In addition to general-purpose GPUs, DSPs and FPGAs are also popular auxiliary processors among vision application design engineers.

Figure 2 Typical computer vision processing flow.
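The stages of a flow like Figure 2 can be sketched as a minimal pipeline: acquisition, preprocessing, segmentation, feature extraction, decision. The NumPy-only example below is purely illustrative (synthetic frame, toy threshold and features), not any vendor's implementation:

```python
import numpy as np

def preprocess(frame):
    # Convert RGB to grayscale (luminosity weights) and normalize to [0, 1]
    gray = frame @ np.array([0.299, 0.587, 0.114])
    return gray / 255.0

def segment(gray, thresh=0.5):
    # Simple global threshold to isolate bright foreground pixels
    return gray > thresh

def extract_features(mask):
    # Toy features: foreground area ratio and centroid
    area = mask.mean()
    ys, xs = np.nonzero(mask)
    centroid = (ys.mean(), xs.mean()) if area > 0 else (None, None)
    return area, centroid

def decide(area, min_area=0.01):
    # Final decision stage: is a sufficiently large object present?
    return "object present" if area >= min_area else "no object"

# Synthetic 64x64 RGB frame containing one bright 20x20 square
frame = np.zeros((64, 64, 3))
frame[20:40, 20:40] = 255

gray = preprocess(frame)
mask = segment(gray)
area, centroid = extract_features(mask)
print(decide(area))  # the 20x20 square covers ~9.8% of the frame
```

In a real system each stage would map onto the processor best suited to it (e.g. preprocessing on a DSP, the decision network on a dedicated accelerator), which is exactly the heterogeneous partitioning the following sections discuss.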

General-purpose CPUs, GPUs, DSPs, FPGAs, and emerging dedicated deep learning processors each have their own advantages and disadvantages for vision processing, and which one to choose depends on the specific application requirements. Although many computer vision algorithms run well on desktop PC processors, those processors struggle to meet the stringent power and size constraints of embedded systems. In terms of instruction set architecture, CPU options include the CISC x86 family, RISC architectures such as Arm and MIPS, and RISC-V, which has become popular in the past two years.

General-purpose CPUs are usually best suited to tasks such as heuristic algorithms, complex decision-making, network access, user interfaces, storage management, and overall system control. The CPU's hardware architecture targets a broad range of computing applications and is not specifically optimized to accelerate deep learning operations, so it needs to work with a dedicated processor to achieve better vision processing performance.

High-performance GPUs have extremely strong massively parallel computing capability and can process the pixel data in a computer vision pipeline quickly and in parallel. Although GPUs are mainly used in data centers and the high-performance computing side of cloud platforms, GPU cores increasingly appear in smartphone application processors. The GPU's memory architecture and parallel hardware are well suited to deep learning training and inference. In system designs requiring embedded vision and 3D processing, the GPU is indispensable and can assist the general-purpose CPU with many types of computer vision algorithms. However, in embedded applications, especially IoT terminals and edge computing devices that are sensitive to power, size, and cost, the GPU's high power consumption and high cost make it difficult to popularize in embedded vision.

DSPs are better at processing streaming data because their buses and internal structure are optimized for high-speed data throughput, which makes them an ideal solution for handling image and video streams from sensors. However, a DSP generally cannot be used alone because it is inefficient at general computing tasks, so DSPs are usually paired with RISC processors in a heterogeneous computing arrangement to accelerate video input and computer vision algorithms.

According to Yair Siegel, senior director of AI strategy at CEVA, purpose-designed vision DSPs and AI processors have higher performance and power efficiency (performance/mW) in executing vision algorithms, taking visual input signals and processing them. Since such vision engines are domain-specific, tools and libraries for vision and AI should be state-of-the-art to provide a good developer experience.

FPGAs provide a flexible, hardware-accelerated, programmable design option that lets software be migrated to hardware quickly. Unlike a CPU, where multi-threaded tasks compete for computing resources, an FPGA can accelerate multiple stages of the computer vision pipeline simultaneously. For accelerating deep learning, the FPGA is similar to the GPU: its advantages are flexibility, adaptability, and low power consumption, while its drawbacks are a tool chain that is less mature than the GPU's and a somewhat higher development threshold. In addition, for embedded applications, its cost needs to fall further before the mass market will accept it.

Deep learning dedicated processor

Jeff Bier, founder and president of the Embedded Vision Alliance, shared with EE Times the industry organization's latest 2019 computer vision developer survey. Among the processor types used for vision processing, CPUs remain the most used, and GPU usage continues to grow, while usage of auxiliary processors such as FPGAs and DSPs has declined. Particularly striking is that one-third of respondents are already using dedicated deep learning processors, which did not even exist a few years ago (see Figure 3).

Figure 3 The 2019 survey report of the Embedded Vision Alliance shows that 32% of developers use dedicated deep learning processors.

At present, various dedicated vision processors have appeared that allow vision and AI algorithms to be developed efficiently and easily. In automotive applications, for example, there are Mobileye's EyeQ3 and EyeQ4 and the EyeQ5 under development; Tesla's recently released Full Self-Driving (FSD) computer; Nvidia's DRIVE Xavier chip; and European AImotive's AiWare autonomous driving chip.

In video security, Shanghai Yitu, one of the four major AI algorithm companies in mainland China, took the lead in launching QuestCore, a chip specifically for video analysis and processing. There is also the neural network processor used in DJI drones for obstacle detection and collision avoidance. These specialized vision processors enable neural networks to perform basic AI tasks on end devices.

Future Trends and Challenges of Embedded Vision Hardware Platforms

Siegel believes that some seemingly conflicting trends can develop simultaneously in the AI processor market. On the one hand, some specialized processors are becoming more powerful; at the same time, many efficient, low-cost, low-power processors are also popular. Tesla and Nvidia are pushing high-performance processors exceeding 100TOPS to process more complex neural networks in parallel. On the other hand, consumer price pressure keeps driving down the cost of consumer electronics. For example, Google, Amazon, and Baidu all sell smart speakers priced at $200 or less, which means the AI processor inside such a device should cost $10 or less. These trends will coexist, and we will see processor chips of various sizes targeting different needs and application scenarios.

Performance/power ratio, size, and total cost of ownership will be ongoing challenges for embedded vision processors, but a clear trend is latency: more and more AI applications require low-latency or even real-time responses, said Chetan Khona, director of industrial/medical/scientific market development at Xilinx. Developers have been seeking flexible, compact, low-latency, low-power, high-performance platforms for their next-generation product designs. Xilinx SoCs offer good flexibility and adaptability, are well suited to embedded vision applications, and can provide developers with scalable embedded HW/SW solutions.

A developer responsible for the Zhouyi AI platform at Arm China believes that developing deep learning applications in embedded scenarios requires new hardware and software development tools, such as chips with deep learning accelerators, and compilers, frameworks, and operating systems adapted to deep learning scenarios. However, it will take several years for these new tools to mature. Until then, developers will have to build applications without them and face various problems: hardware computing power that does not meet the scenario's demands, high chip power consumption when it does, dual fragmentation of models and chips that makes porting and development inefficient, and a lack of reusable algorithm libraries that keeps development efficiency low.

He predicts that before 2020, the vast majority of embedded vision scenarios will still use traditional Arm chips for deep learning development. The main difficulties developers face are the lack of a lightweight, rapidly deployable platform for embedded front ends, deep exploitation of CPU/GPU computing power, and precision control when using quantized operations after deployment. The Zhouyi Tengine framework, HCL computing library, and quantization training tools jointly developed by Arm China and OPEN AI LAB target these three issues of deployment, computing power, and accuracy; together with traditional embedded chips based on Arm processors, they aim to become a deep learning deployment platform for embedded vision developers.

Sensors

New imaging applications are booming and driving the popularity of embedded vision across application fields. For designers and developers of embedded vision systems, reducing size, weight, power, and cost (SWaP-C) is an ongoing requirement for every component supplier. Since image sensors directly affect the performance and design of embedded systems, they play an important role in large-scale applications.

Marie-Charlotte Leclerc of Teledyne e2v pointed out in a technical article that to reduce the cost of an image sensor module, the first step is to reduce the sensor's pixel size, because the smaller the pixel, the more dies can be cut from a wafer. For example, Teledyne e2v's Emerald 5M sensor reduces the pixel size to 2.8μm, allowing S-mount (M12) lenses to be used on a 5-megapixel global-shutter sensor, which brings direct cost savings: an entry-level M12 lens costs about $10, while a larger C- or F-mount lens costs 10 to 20 times more.

In addition to optical optimization, the choice of sensor interface also indirectly affects the cost of a vision system. The MIPI CSI-2 interface, originally developed by the MIPI Alliance for the mobile industry, has been widely adopted by most system integrators and is becoming popular in the industrial market because it can be integrated directly with an SoC or System-on-Module (SoM). A CMOS image sensor with a MIPI CSI-2 interface can transmit data directly to the host SoC or SoM of an embedded system without any intermediate converter bridge, saving cost and PCB space. This advantage is even more prominent in multi-sensor embedded systems such as 360-degree panoramic systems.

CCD or CMOS?

Thanks to the popularity of smartphones and other consumer electronics, CMOS image sensors now account for more than 90% of the market. However, for some special applications, the pulsed-exposure principle of CCD sensors achieves characteristics that CMOS sensors cannot match. Li Jia, system application engineering manager at ADI, elaborated on this in detail.

CCD imaging allows smaller pixels: at the same resolution the optical target area is smaller, which facilitates lightweight lens design. For portable applications such as mobile phones, overall size is one of the key metrics. CCDs also have high sensitivity, performing especially well in the 940nm band, and can be used in strong outdoor light, which matters for outdoor face recognition, mobile payment, and similar scenarios.

The full-field (global) exposure of a CCD also makes it possible to measure fast-moving targets. CCD-based distance measurement can reach beyond 10 meters in strong light, which is critical for applications such as long-range obstacle avoidance for moving objects. The pulsed imaging principle of CCDs can compute depth information in a single frame, which is critical for depth-based applications such as human action recognition, gesture control, and object shape scanning. ADI's ToF solutions mainly use CCD sensors.

In addition, CCDs still dominate fields such as scientific research and aerospace. Teledyne e2v business development manager Tracy Da believes this is because CCD image sensors have decades of accumulated field verification behind them, and mass-produced CMOS image sensors cannot yet meet the strict flexibility and scanning requirements of aerospace flight modes. Although CMOS can technically realize most functions, and some of its characteristics even exceed CCD, it will still take a long time to catch up with the CCD's record of reliability and solution verification in these professional fields.

3D image sensor

2D image sensors play a key role in devices and cameras across many industries because they are cheaper and simpler to manufacture. However, a camera or machine vision system comparable to human vision should really be 3D. When considering 3D image sensors, one needs to weigh the technical advantages against the challenges of mass production. Technically, 3D sensors must cope with widely varying scenes (sunlight, darkness, short range, long range) and be robust (continuing to work without hardware recalibration after an impact). 3D images can be obtained in several ways: with multiple, more complex sensors; with a left/right sensor pair; or with more complex software running on a single sensor.

The most familiar existing 3D imaging solution is the face recognition of the iPhone X, which mainly uses structured light with 2D image sensors. Other 3D approaches include binocular (stereo) vision, ToF sensors, and range gating, which meet different accuracy requirements in different application scenarios. In consumer markets such as mobile phones, the size and power consumption of the overall solution are the biggest challenges, and back-end algorithm support is also a major problem.
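For binocular (stereo) vision, depth follows directly from the pinhole camera model: a point's distance Z equals the focal length f times the baseline B between the two cameras, divided by the pixel disparity d between the two views. A minimal sketch (the focal length, baseline, and disparity values below are made-up examples):

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Pinhole stereo model: Z = f * B / d.

    disparity_px: horizontal pixel shift of the same point between left/right views
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero disparity means infinite range")
    return focal_px * baseline_m / disparity_px

# Hypothetical example: 800 px focal length, 6 cm baseline, 100 px disparity
z = depth_from_disparity(100, focal_px=800, baseline_m=0.06)  # 0.48 m
```

The formula also shows why stereo accuracy degrades with range: disparity shrinks as 1/Z, so distant points need either a longer baseline or sub-pixel matching.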

Wang Dan, a field application engineer at ams, believes emerging 3D technologies have good prospects in both consumer and industrial fields, such as face recognition, scene reconstruction, and AR/VR in the consumer market, and SLAM and machine vision in industrial applications. Existing 3D cameras are relatively mature, but they still lack powerful back-end algorithm support, and there are no corresponding specifications for interfaces and standards.

Currently, ToF and structured light are the two solutions preferred by many 3D cameras, with many advantages and potential uses in almost all camera devices, and there are many domestic and foreign suppliers of both technologies. For example, ToF image sensors from ams, ADI, and Infineon have been mass-produced, and ST's structured light components have been used in iPhones in volume.

ToF image sensor technology has major advantages such as high depth resolution (VGA) and anti-interference performance under strong light, which is very suitable for the following applications:

Indoor and outdoor face recognition;

Gesture control and body posture recognition (such as gesture-based interactive control, 3D people counting, 3D camera automatic doors);

Logistics express cargo volume scanning;

Long-range obstacle avoidance for moving objects (such as ToF-based reversing collision-avoidance detection and electronic fences).
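The cargo-volume use case above reduces to simple geometry once a top-down ToF depth map is available: each pixel's height above the floor times its real-world footprint, summed over the object. A hedged NumPy sketch assuming a pinhole camera whose per-pixel footprint grows with the square of range; all parameter values are invented for illustration:

```python
import numpy as np

def estimate_box_volume(depth_map, floor_depth, px_area_at_1m):
    """Estimate cargo volume from a top-down ToF depth map.

    depth_map:     per-pixel distance (m) from camera to the nearest surface
    floor_depth:   distance (m) from camera to the empty floor
    px_area_at_1m: real-world area (m^2) one pixel covers at 1 m range
    """
    height = floor_depth - depth_map           # object height above the floor, per pixel
    obj = height > 0.01                        # ignore sensor noise near the floor plane
    # Pixel footprint scales with the square of range (pinhole model)
    px_area = px_area_at_1m * depth_map ** 2
    return float(np.sum(height[obj] * px_area[obj]))

# Synthetic scene: camera 2 m above the floor, a 0.5 m-tall box
# filling a 40x40-pixel region of a 100x100 depth map
depth = np.full((100, 100), 2.0)
depth[30:70, 30:70] = 1.5
vol = estimate_box_volume(depth, floor_depth=2.0, px_area_at_1m=1e-4)  # 0.18 m^3
```

A production system would add lens-distortion correction and multi-frame averaging, but the core computation is exactly this sum.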

Take the vivo NEX series, whose ToF 3D super-sensing technology was jointly developed by vivo and ADI: the ToF 3D stereo camera greatly improves the security of mobile payment and enriches beautification and other convenience features. Through the ToF infrared depth lens, a large number of three-dimensional features of the user's face can be identified more accurately, enabling not only unlocking but also "face-swiping" payment. Through 3D facial modeling, a detailed 3D face-sculpting effect is achieved, allowing fine adjustment of face shape. The ToF camera also supports a 3D measurement function for measuring object dimensions.

Software and Algorithms

Computer vision applications have always relied on highly specialized algorithms, developed and refined over long periods by experienced developers for a specific application field or use case. The high threshold and specialization of traditional algorithms hindered the development and popularization of computer vision, making new application design and development expensive and lengthy. Since 2012, the rise of AI and deep learning has lowered the threshold for algorithm modeling and development, greatly unleashing the potential of computer vision, attracting more developers and companies to embed vision functions into their system designs, and making computer vision algorithms, systems, and applications easier to develop and deploy at scale.

Traditional algorithms or AI algorithms?

In the Embedded Vision Alliance's annual developer survey, 87% of respondents said they use or plan to use deep neural networks in their product designs to implement computer vision functions. On the software side, for non-neural-network vision tasks the most popular language is C++ and the most widely used library is the open-source OpenCV, though MATLAB and CUDA also have many users. For deep learning vision tasks, Google's TensorFlow is the most popular framework for creating, training, evaluating, and deploying neural networks.

Although traditional algorithms have a high entry threshold, requiring a deep understanding of algorithmic details and repeated parameter tuning in real applications, they can be tuned precisely for well-defined features, and their high degree of specialization is itself a competitive advantage for developers. In addition, in some non-compute-intensive applications, or fields where AI is a poor fit, traditional algorithms will continue to dominate.

The entry threshold for deep learning is low, and transfer learning can start from classic networks, but it requires large, well-labeled data sets, because algorithm performance scales with the size of the training set. If a deep learning model needs further refinement, however, data scientists must be able to design, prune, and quantize networks, and because deep learning models are hard to interpret, the follow-up threshold is higher than for traditional algorithms. Deep learning training also takes a long time, and if the final application demands a high frame rate, optimizing the model or mastering the development tools for different hardware still takes considerable time.
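The quantization step mentioned above can be sketched in a few lines. This is a generic symmetric int8 scheme written in plain NumPy for illustration, not the implementation of any particular toolkit:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization of a float tensor to int8.

    One scale maps the largest-magnitude weight to 127; every weight is
    then rounded to the nearest representable step.
    """
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for accuracy checks
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = float(np.max(np.abs(w - w_hat)))  # bounded by half a quantization step
```

The 4x memory saving (int8 vs float32) is exactly why quantization matters on the power- and cost-constrained embedded devices this article discusses; the price is the rounding error `err`, which real toolchains reduce with per-channel scales and quantization-aware training.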

There seems to be a natural conflict and competition between traditional algorithms and AI algorithms, but in practice most algorithm developers mix and match traditional computer vision algorithms and neural networks to achieve the best combination of accuracy and efficiency.
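A common hybrid pattern is to let a cheap traditional stage generate candidate regions and reserve the expensive network for scoring only those candidates. The sketch below uses NumPy with a stand-in scoring function where a trained network would go; every name and threshold here is a made-up illustration:

```python
import numpy as np

def propose_regions(gray, thresh=0.5, win=16):
    """Traditional stage: cheap sliding-window proposals by mean brightness."""
    h, w = gray.shape
    proposals = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = gray[y:y + win, x:x + win]
            if patch.mean() > thresh:          # classic thresholding, no learning
                proposals.append((y, x, patch))
    return proposals

def nn_classifier(patch):
    """Stand-in for a trained network: returns a pseudo-confidence score.
    (Hypothetical rule: uniform bright patches score 1.0.)"""
    return float(patch.std() < 0.1)

# Synthetic 64x64 grayscale frame with one bright 16x16 block
gray = np.zeros((64, 64))
gray[16:32, 16:32] = 1.0

# Only proposals that survive the cheap filter reach the "network"
hits = [(y, x) for y, x, p in propose_regions(gray) if nn_classifier(p) > 0.5]
```

The traditional stage prunes 15 of the 16 windows here, so the heavy model runs on a fraction of the pixels, which is precisely the accuracy/efficiency trade the text describes.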

Hardware-optimized computer vision algorithms

An innovative trend in computer vision algorithms is replacing general-purpose algorithms with hardware-optimized ones. Given the wide variety of processors available for embedded vision, algorithm analysis focuses on maximizing pixel-level processing throughput within system constraints. Many programmable-component vendors have created optimized versions of standard computer vision libraries for this purpose. For example, Nvidia has collaborated with the OpenCV community to create algorithms accelerated by general-purpose GPUs through CUDA. MathWorks provides MATLAB functions/objects and Simulink blocks for many computer vision algorithms in its vision system toolbox, and also lets developers build their own function libraries for specific programmable architectures.

NI's graphical programming software LabVIEW is integrated with Python, the main process design language of deep learning, and supports direct calling of Python files. Users can easily use NI's data sets and Python deep learning algorithm libraries to conduct innovative explorations in engineering development. In addition, Xilinx also provides customers with optimized computer vision libraries in the form of plug-and-play IP cores to develop hardware-accelerated vision algorithms on FPGAs.

Future trends and challenges of algorithms and software platforms

The first difficulty embedded vision developers encounter is choosing an appropriate embedded platform from CPUs, GPUs, FPGAs, and other processors: different hardware platforms have completely different tool chains, each of which must be learned from scratch. Another difficulty is that, because deep learning models are hard to interpret, model tuning and optimization lack clear direction and require repeated trial and error.

Deep learning algorithms demand very high computing power, and that demand keeps growing and changing in character. Embedded vision platforms must meet it in low-cost, battery-powered devices. The challenge for developers is therefore not only to cram large amounts of computing into small memories and cheap processors, but also to let these features be updated quickly as algorithms evolve. A good software development platform should port new research networks to hardware quickly and maximize hardware utilization.

MathWorks senior technical expert Chen Jianping and his American colleagues listed the following main challenges faced by developers:

1. Deep learning optimization: ensure the code is optimized to fully exploit the processor hardware and obtain sufficient inference-engine performance. For example, developers writing CUDA code for Nvidia GPUs can boost performance through auto-tuning, layer fusion, and acceleration libraries such as Thrust and TensorRT.
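Layer fusion, one of the optimizations listed above, can be illustrated offline in NumPy: a BatchNorm layer is folded into the preceding convolution so only one fused layer runs at inference time. The sketch below simplifies the convolution to a 1x1 conv with bias (a matrix multiply) to keep the algebra visible; it is a generic illustration, not TensorRT's actual code path:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into a preceding 1x1 conv (w, b).

    w: (out_ch, in_ch) weights, b: (out_ch,) bias.
    Returns (w_f, b_f) such that w_f @ x + b_f == BN(w @ x + b).
    """
    std = np.sqrt(var + eps)
    w_f = w * (gamma / std)[:, None]          # scale each output channel's weights
    b_f = (b - mean) * gamma / std + beta     # fold mean and beta into the bias
    return w_f, b_f

rng = np.random.default_rng(0)
out_ch, in_ch = 8, 16
w, b = rng.normal(size=(out_ch, in_ch)), rng.normal(size=out_ch)
gamma, beta = rng.normal(size=out_ch), rng.normal(size=out_ch)
mean, var = rng.normal(size=out_ch), rng.uniform(0.5, 2.0, size=out_ch)
x = rng.normal(size=in_ch)

# Reference: run conv, then BatchNorm, as two separate layers
y_ref = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta

# Fused: a single layer with rewritten weights, numerically identical
w_f, b_f = fuse_conv_bn(w, b, gamma, beta, mean, var)
y_fused = w_f @ x + b_f
```

Because the fused layer is exact (no approximation), this optimization costs nothing in accuracy while removing a whole layer's worth of memory traffic per inference.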

2. Integrating multiple libraries and packages: for each project target, developers must manage libraries provided by chip vendors, open-source packages, and their own hand-written code, and deployments for different targets must be maintained, verified, and refined separately.

3. Algorithm and vendor lock-in: once developers commit to a particular vendor's chipset and algorithm library, they are essentially locked in, and the code they write is difficult to port to other platforms.

In response to these problems, MathWorks lets developers design their AI systems in the target-independent, high-level MATLAB language, then uses code generators to automatically create deployments for different processors and hardware platforms. In this way, whether the target is a CPU, GPU, FPGA, or Arm processor, deep learning inference and signal and image processing can be handled with ease.

Three major applications of embedded vision

Adding vision capabilities to embedded systems has reinvigorated the traditional industrial automation, automotive electronics, and security monitoring markets, and even consumer electronics such as gaming, AR/VR, and wearables are seeing new life.

Factory automation

Computer vision is often referred to as "machine vision" in the field of factory automation, and the industries involved mainly include vehicle manufacturing, chemical and pharmaceutical, packaging equipment manufacturing, robot manufacturing, and semiconductor and electronic product manufacturing and assembly. Machine vision products used in factory automation mainly include the following parts: smart sensors, smart cameras, machine vision cameras, frame grabbers, machine vision lighting and lenses, and machine vision software.

Guo Qiao, NI's Industrial IoT/AI industry manager, believes that most industrial scenarios requiring deep learning are vision applications, such as defect detection and character recognition on production lines. The harsh factory environment is usually unsuitable for fan-cooled embedded systems (such as GPUs), and because the objects to be inspected vary and algorithms must be repeatedly redeveloped, it is often difficult to find a suitable dedicated chip.

Vehicle electronics

Autonomous driving and ADAS are the most popular automotive applications of computer vision, but embedded vision can also be installed inside the car to monitor the driver's state. In scenarios where the vehicle interacts with its surroundings, computer vision chips and algorithms can realize functions such as lane departure warning, collision avoidance, and automatic reversing and parking.

The purpose of applying computer vision to cars is not to let cars replace people in driving autonomously, but to make driving safer. Siegel predicts that the automotive market will focus on smarter, more assistive vision devices, so algorithms will need to predict and understand human behavior, correlate context and environment, and use that knowledge to select notifications, reactions, and predictions accordingly. This will require more advanced network paradigms, such as recurrent neural networks (RNNs) that can understand actions: not just who is in front of the car, but where they are going and when they will enter the car's danger zone.

Guo Qiao predicts that in automotive applications, automatic recognition and decision-making based on the fusion of image and other vision sensors, rather than camera-only solutions, is the key to future development. He believes that in the next three to five years, the software tool chain will improve further, with more turnkey tools that reduce dependence on data scientists and make deep learning networks easier to design and optimize.

Security Monitoring

In the traditional physical-security market, the main vision products include network cameras, encoders, network video recorders, DVRs, and other smart monitoring equipment. In recent years, "low-level" analytics such as camera tamper detection and video motion detection have appeared in the video surveillance market and become standard features on many security terminal devices.

The security market in mainland China is quite mature. New and old players such as Huawei HiSilicon, Hikvision, and SenseTime are all fighting for this market from different angles, but a common feature is that everyone is playing the AI card: from AI cameras and AI chips to AI algorithms, both hardware and software have AI functions embedded in them. However, problems such as inconsistent standards and expensive algorithms remain to be solved. Dedicated vision processors and deep learning algorithms may help drive the healthy growth of this huge market.


Origin blog.csdn.net/m0_70911440/article/details/132356062