"Artificial Intelligence Development Report 2019"! Computer visual depth interpretation of the text attached link to download the full report

This excerpt "Artificial Intelligence Development Report 2019" Chapter III articles on computer vision, computer vision encompasses the concept of history, depth of interpretation of the current progress of the talent profile, reading papers, and computer vision.
The report is 393, detailed enough, you want to download the report, please poke link: https://yq.aliyun.com/download/3877

3.1 The Concept of Computer Vision

Computer vision (Computer Vision), as the name suggests, is the science of enabling computers to achieve, through machine intelligence, what human eyes do when they "see" [3]: understanding and recognizing the objectively existing three-dimensional world by means of an intelligent computer. More precisely, computer vision technology uses cameras and computers in place of the human eye, giving the computer the eye's abilities of segmentation, classification, recognition, tracking, and decision making. In short, a computer vision system creates an artificial intelligence system that can extract the needed "information" from the data of 2D planar images or 3D stereoscopic images.

Computer vision is a comprehensive discipline that encompasses computer science and engineering, neurophysiology, physics, signal processing, cognitive science, applied mathematics and statistics, and many other fields. Because computer vision systems are built on high-performance computers, they can acquire large amounts of data quickly, process that information rapidly with intelligent algorithms, and readily integrate with design and process-control information.

Computer vision itself includes many different research directions. The comparatively fundamental and popular ones include: object recognition and detection (Object Detection), semantic segmentation (Semantic Segmentation), motion and tracking (Motion & Tracking), and visual question answering (Visual Question & Answering) [4].

Object recognition and detection

Object detection has always been a fundamental and important research direction in computer vision. Most new deep learning algorithms or network architectures are applied to object detection first, e.g., VGG-Net, GoogLeNet, ResNet. On the ImageNet dataset, new algorithms emerge every year that break through the previous records, and these new algorithms or network architectures quickly become that year's hot topics and are then adapted to other applications in computer vision.

Object recognition and detection, by definition, means that given an input image, the algorithm automatically identifies the objects in it and outputs their categories and positions. There are also derivative tasks for fine-grained detection, such as face detection (Face Detection) and vehicle detection (Vehicle Detection).

Semantic segmentation

Semantic segmentation has been a very popular direction in recent years. In simple terms, it can be viewed as a special kind of classification: assigning a class to each pixel of the input image, so that a single picture can be described precisely. The distinction is clear: object detection and recognition generally frame the object in the original image, operating on the object at the "macro" level, whereas semantic segmentation classifies every pixel in the image, so that each pixel gets its own category label.
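The per-pixel view can be made concrete with a short sketch (the one-layer "network" and all shapes here are illustrative stand-ins, not a real model): a segmentation network maps an image to a score for every class at every pixel, and the argmax over classes yields a label map the same size as the image.

```python
# A minimal sketch of semantic segmentation as per-pixel classification.
import torch
import torch.nn as nn

num_classes = 21                     # e.g., the 21 classes of PASCAL VOC
net = nn.Conv2d(3, num_classes, 1)   # stand-in for a real segmentation network

image = torch.randn(1, 3, 224, 224)  # one RGB image, NCHW layout
logits = net(image)                  # per-pixel class scores: [1, 21, 224, 224]
labels = logits.argmax(dim=1)        # one class id per pixel:  [1, 224, 224]
```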

Motion and tracking

Tracking is also one of the fundamental problems in computer vision. In recent years methods have developed substantially, leaping from earlier non-deep algorithms to deep learning algorithms, with accuracy getting higher and higher. However, deep trackers that run in real time still struggle to improve their precision, while the highly accurate deep trackers run at only a fraction of real-time speed, so they are difficult to put to use in practical applications.

The standard academic setting for tracking is: given a video and the position and scale of the target object in its first frame, the tracking algorithm must locate the object in the subsequent frames of the video, adapting to illumination changes, motion blur, and appearance changes. But tracking is in fact an ill-posed problem (ill-posed problem). For example, when tracking a car starting from its rear: if the car's appearance changes greatly while moving, say it rotates 180 degrees so that a side view replaces the rear view, then existing tracking algorithms will very likely lose it, because most of them learn their model from the first frame. Although the model is updated during subsequent tracking, the training samples are too limited to yield a good tracking model, and when the tracked object's appearance changes greatly, the tracker struggles to adapt. So for now, tracking is not an especially hot research direction in computer vision, and many tracking algorithms are improvements of detection or recognition algorithms.

Visual question answering

Visual question answering, abbreviated VQA (Visual Question Answering), is another very popular direction in recent years. Its goal is that, given an input image and a question posed by the user, the algorithm automatically answers the question based on the image content. Besides question answering, there is also caption generation (Caption Generation), in which the computer automatically generates a textual description of an image, without any question being asked. Algorithms that cross two data forms (e.g., text and images) are sometimes said to address multimodal or cross-modality problems.

3.2 The Development History of Computer Vision

Although people hold different views on when computer vision began as a discipline, it is fair to say that the 1982 publication of David Marr's "Vision" (Marr, 1982) marked computer vision's emergence as an independent discipline. Computer vision research can be roughly divided into two parts: object vision (object vision) and spatial vision (spatial vision). Object vision concerns the fine classification and discrimination of objects, while spatial vision concerns determining objects' positions and shapes, in the service of "action (Action)". As the famous cognitive psychologist J. J. Gibson said, the main function of vision is to "adapt to the external environment and control one's own motion". Adapting to the environment and controlling one's own motion are what organisms need to survive, and these functions require object vision and spatial vision working in coordination.

Over its 40 years of development, computer vision has produced many theories and methods, but broadly speaking it has gone through three main stages: Marr's computational vision, multi-view geometry and layered 3D reconstruction, and learning-based vision. These three are briefly introduced below [5].

Marr's computational vision (Computational Vision)

Many of today's computer vision researchers, I am afraid, do not understand "Marr's computational vision", which is a regrettable thing to have to say. At present, using "deep networks" to raise the accuracy of object recognition seems to have become equated with "vision research". In fact, the computational theory of vision that Marr put forward was epoch-making, both in theory and in research methodology.

Marr divided computer vision into three levels: computational theory, representation and algorithm, and algorithm implementation. Because Marr believed that the implementation does not affect an algorithm's function and effect, his computational theory of vision focuses on the two parts "computational theory" and "representation and algorithm". Marr held that the brain's neural computation is no different from a computer's numerical computation, so he offered no discussion of "implementation". From the progress of neuroscience today, "neural computation" can in some cases differ essentially from numerical computation, as in the currently rising neuromorphic computing (Neuromorphic computing); but generally speaking, numerical computation can "simulate neural computation". At least for now, differences in implementation do not affect the essential character of Marr's computational theory of vision.

Multi-view geometry and layered three-dimensional reconstruction

In the early 1990s computer vision went from "depression" to renewed "prosperity", mainly thanks to two factors. First, the target applications shifted from "industrial applications", whose demands on precision and robustness are very high, to applications whose demands are not so high, especially those that merely need good "visual effects", such as video teleconferencing (teleconference), archaeology, virtual reality, and video surveillance. Second, it was found that layered 3D reconstruction based on multi-view geometry theory can effectively improve the robustness and accuracy of 3D reconstruction.

The foremost representative figures of multi-view geometry were O. Faugeras of INRIA in France, A. Zisserman of Oxford University, and R. Hartley of the GE research institute in the US. It is fair to say that the theory of multi-view geometry was basically complete by 2000. The book co-authored by Hartley and Zisserman in 2000 (Hartley & Zisserman, 2000) gives a systematic summary of this material; work in this area afterwards focused on how to improve "the efficiency of robust reconstruction from large data".

Big data requires fully automatic reconstruction, reconstruction requires repeated optimization, and repeated optimization consumes large amounts of computing resources. So how to perform fast 3D reconstruction of large scenes while guaranteeing robustness became the later focus. As a simple example, suppose we want a 3D reconstruction of Beijing's Zhongguancun area. To ensure the completeness of the reconstruction, we need to capture large numbers of ground-level and UAV images. If 10,000 high-resolution ground images (4000 × 3000) and 5,000 high-resolution UAV images (8000 × 7000) are captured (a typical scale at present), then the 3D reconstruction must select suitable sets of matching images from them, calibrate the cameras' positions, and reconstruct the scene's 3D structure. With such a large amount of data, manual intervention is impossible: the entire 3D reconstruction process must be fully automatic. This requires the reconstruction algorithms and systems to be extremely robust; otherwise fully automatic 3D reconstruction would be out of reach. And even with robustness guaranteed, the efficiency of 3D reconstruction is also a huge challenge. Therefore, the current research focus in this area is how to reconstruct large scenes quickly and robustly.

Learning-based vision

Learning-based vision refers to computer vision research that takes machine learning as its main technical means. The literature on learning-based vision generally falls into two stages: the subspace methods represented by manifold learning at the beginning of this century, and the deep learning methods currently used for visual representation.

The representation of objects is the core of object recognition: given images of an object, for example classifying and recognizing face images with different expressions. Representing an image directly by its pixels is an "over-expression" and not a good representation. Manifold learning theory holds that an object's images possess an "intrinsic manifold" (intrinsic manifold), and this intrinsic manifold is a good representation of the object. Manifold learning is therefore the process of learning this intrinsic manifold from the image representation, generally through nonlinear optimization. The success of deep learning is mainly due to the accumulation of data and the growth of computing power. Conceptually, deep networks had already been proposed in the 1980s; only because "deep networks" were then found to perform worse than "shallow networks" did they fail to see major development.

There seem to be few directions in computer vision left that are not potential applications of deep learning, as can be seen from papers recently published at the three major international computer vision conferences: the International Conference on Computer Vision (ICCV), the European Conference on Computer Vision (ECCV), and the Conference on Computer Vision and Pattern Recognition (CVPR). The basic situation at present is that people are using deep learning to "replace" traditional computer vision methods. "Researchers" have turned into "machine parameter tuners", which is truly an abnormal "mass movement". Newton's law of universal gravitation, Maxwell's electromagnetic equations, Einstein's mass-energy equation, Schrödinger's equation of quantum mechanics: that, it seems, is the kind of thing people ought to pursue.

3.3 Talent Overview

Global talent distribution

A scholar map describes the distribution of scholars in a specific field, and is particularly important for analyzing where research is conducted and for assessing regional competitiveness. The map below shows the global distribution of computer vision scholars:

image


Figure 3-1 Global distribution of computer vision scholars


The map is drawn according to the locations of the institutions where scholars currently work; the darker the color, the greater the concentration of scholars. As the map shows, the United States has an obvious advantage in talent numbers, concentrated mainly on its east and west coasts; Asia also has a considerable share, mainly in eastern China, South Korea, and Japan; Europe's talent is concentrated in central and western Europe; and scholars in regions such as Africa and South America are very rare. The distribution of scientific and technological talent in computer vision is broadly consistent with regional economic strength.

In addition, in terms of gender, male researchers account for 91.0% of computer vision scholars and female scholars for 9.0%; male researchers far outnumber female scholars.

The h-index distribution of computer vision scholars is shown below. Most scholars' h-indexes fall in the middle intervals: the 20-30 interval has the largest number of scholars, 706, accounting for 34.7%, while the "below 20" interval has the smallest number, with 81 people.

image

Talent in China

image


The distribution of China's experts in computer vision is shown in the figure. As it shows, Beijing and Tianjin have the largest number of experts in this field, followed by the Pearl River Delta and the Yangtze River Delta regions; by contrast, talent is much scarcer in inland areas. Such a distribution is not unrelated to factors like geography and the level of economic development. At the same time, looking at China's neighbors, especially in comparison with Japan, South Korea, Southeast Asia, and other Asian countries, China has a relatively large number of scholars in computer vision.

China's international collaboration

Collaboration between China and other countries in computer vision can be analyzed from data obtained on the AMiner platform: by taking statistics of paper authors' affiliations, mapping authors to countries, counting the number of papers co-authored by China and each other country, and sorting countries from high to low by the number of collaborative papers, as shown in the table below.

image


As the table shows, Sino-US collaboration leads in number of papers, citations, and number of scholars, indicating close cooperation between China and the US in computer vision. At the same time, China's collaborations span the world: the top 10 include partners in Europe, Asia, North America, and Oceania. And although the number of papers from China-Canada collaboration is not the largest, it has the highest average number of citations, indicating that China-Canada collaboration reaches a higher level of quality.

3.4 Paper Interpretations

This section mines papers from the top academic conferences in this field and interprets representative work from 2018-2019. The conferences include:

IEEE Conference on Computer Vision and Pattern Recognition
IEEE International Conference on Computer Vision
European Conference on Computer Vision

We analyzed the keywords of papers in this field, compiled the Top 20 keywords by frequency, and generated a word cloud of the field's research hotspots. Computer vision (computer vision), images (images), and videos (videos) are the hottest keywords in the field.

image

Paper title: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Chinese title: 具有空洞分离卷积的编码-解码器用于语义图像分割

Authors: Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam

Published in: Proceedings of the European Conference on Computer Vision (ECCV), 2018: 801-818.

Paper link: https://link.springer.com/chapter/10.1007%2F978-3-030-01234-2_49

Research problem:
Semantic segmentation is a fundamental and important task in computer vision: assigning a semantic label to each pixel of an image. Deep-learning approaches to semantic segmentation often use spatial pyramid pooling and encoder-decoder structures. Spatial pyramid pooling captures rich contextual information through pooled features at multiple resolutions, but strided pooling or convolution in the network loses detailed information related to object boundaries. This can be alleviated by extracting denser feature maps with atrous (dilated) convolution, but at a greatly increased cost in computational resources. Encoder-decoder structures, by contrast, can capture sharper object boundaries by gradually recovering spatial information. Combining the advantages of both approaches, the authors propose a new model: DeepLabv3+.

Research methods:
The figure below shows the DeepLabv3+ architecture. It extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. The encoder module (DeepLabv3) encodes multi-scale contextual information by applying atrous convolution at multiple scales; atrous convolution gives explicit control over the resolution of the features extracted by the deep convolutional network and adjusts the filters' receptive field to capture multi-scale information. The simple but effective decoder module then refines the segmentation results along object boundaries.

To further improve the model's performance and speed, depthwise separable convolution is applied to the ASPP (atrous spatial pyramid pooling) and decoder modules. Depthwise separable convolution factorizes a traditional convolution into a depthwise convolution and a 1×1 pointwise convolution, and atrous convolutions with different dilation rates are applied in the depthwise step to capture information at different scales.

image
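As a concrete illustration of the atrous separable convolution just described, here is a minimal PyTorch sketch: a depthwise convolution with a dilation ("atrous") rate, followed by a 1×1 pointwise convolution. Channel sizes and the dilation rate are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of an atrous separable convolution (one ASPP-style branch).
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # Depthwise: one dilated 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixing channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 256, 33, 33)                      # an encoder feature map
branch = AtrousSeparableConv(256, 256, dilation=12)  # illustrative rate
print(branch(x).shape)                               # torch.Size([1, 256, 33, 33])
```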

Results:
Using ImageNet-1k-pretrained ResNet-101 and a modified aligned Xception (more layers, strided depthwise separable convolutions in place of max pooling, extra batch normalization and ReLU) as backbone networks, dense features are extracted by atrous convolution. The effectiveness and state-of-the-art quality of DeepLabv3+ are demonstrated on the PASCAL VOC 2012 and Cityscapes datasets, reaching 89% and 82.1% test-set performance without any post-processing. Very similar objects (e.g., chairs and sofas), heavily occluded objects, and objects with a very small field of view remain hard to segment, however.

Paper title: MobileNetV2: Inverted Residuals and Linear Bottlenecks

Chinese title: MobileNetV2:反向残差和线性瓶颈

Authors: Sandler Mark, Howard Andrew, Zhu Menglong, Zhmoginov Andrey, Chen Liang-Chieh

Published in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)

Paper link: https://ieeexplore.ieee.org/document/8578572

Research problem:
Deep neural networks play an ever more important role across computer vision, but their excellent performance usually comes at the price of heavy computation, which greatly limits their use on mobile or embedded devices with severely constrained resources. Lightweight networks have therefore attracted a great deal of recent attention. This paper proposes a new lightweight mobile model, MobileNetV2, which significantly reduces the required operations and memory while maintaining the same accuracy; its key is an inverted residual module with linear bottlenecks. Applying this model to mobile object detection, the authors introduce an efficient method, SSDLite; in addition, a mobile semantic segmentation model, Mobile DeepLabv3, is built from a simplified DeepLabv3.

Research methods:
The key to MobileNetV2 is the inverted residual module with a linear bottleneck. The module takes a low-dimensional compressed representation as input, first expands it to a high dimension, then filters it with a lightweight depthwise convolution, and finally projects the features back to a low-dimensional representation with a linear convolution. It involves two main techniques: depthwise separable convolution and residual modules.

Depthwise separable convolution is a key component of many efficient network architectures. Its basic idea is to factorize a traditional convolution into two parts: the first layer, called a depthwise convolution, performs lightweight filtering by applying a single convolution filter to each input channel; the second layer is a 1×1 convolution, called a pointwise convolution, which builds new features by computing linear combinations of the input channels. Relative to traditional convolution, depthwise separable convolution reduces the amount of computation by roughly a factor of k² (k is the kernel size) with only a very small drop in performance.
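As a quick sanity check on that factor (standard multiply-add counting, not a formula quoted from the report): for a k×k convolution applied to an H×W feature map with $C_{\mathrm{in}}$ input and $C_{\mathrm{out}}$ output channels,

$$\frac{\text{separable}}{\text{standard}}
= \frac{HWC_{\mathrm{in}}k^2 + HWC_{\mathrm{in}}C_{\mathrm{out}}}{HWC_{\mathrm{in}}C_{\mathrm{out}}k^2}
= \frac{1}{C_{\mathrm{out}}} + \frac{1}{k^2},$$

so with k = 3 and a reasonably large $C_{\mathrm{out}}$, the cost drops by a factor of roughly 8-9, i.e., about k².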

We can regard the activations of any layer of a deep neural network as forming a "manifold of interest" that can be embedded in a low-dimensional subspace. In other words, the information encoded in the pixels across all individual channels of a deep convolutional layer actually lies on some manifold, and that manifold can be embedded in a low-dimensional subspace. The authors' analysis yields two properties:

(1) if the manifold of interest remains nonzero after the ReLU transformation, the transformation amounts to a linear transformation;
(2) ReLU can preserve the complete information of the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.

These two observations help the authors optimize existing network structures: assuming the manifold of interest is low-dimensional, it can be captured by inserting linear bottlenecks into the convolutional blocks; this is the paper's core inverted residual module with a linear bottleneck, whose structure is shown in the figure below. A pointwise convolution first expands the channel count, with ReLU activation; a depthwise convolution then extracts features, with ReLU activation; finally a pointwise convolution reduces the channel count, with linear activation; and a shortcut connection is used.

image
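A minimal PyTorch sketch of the inverted residual block just described (expand, filter depthwise, then project back linearly, with a shortcut between the bottlenecks). The expansion factor and channel counts are illustrative, not the paper's exact configuration.

```python
# Inverted residual with a linear bottleneck: the last 1x1 projection has
# no nonlinearity, which is the "linear bottleneck".
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, ch, expansion=6):
        super().__init__()
        hidden = ch * expansion
        self.block = nn.Sequential(
            # 1x1 pointwise expansion (ReLU6 as used in the MobileNet papers)
            nn.Conv2d(ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 pointwise projection, *linear* (no activation)
            nn.Conv2d(hidden, ch, 1, bias=False),
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)   # shortcut connection between bottlenecks

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32)(x).shape)   # torch.Size([1, 32, 56, 56])
```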

Results:
The researchers first validate the effectiveness of the inverted residual connection and the linear bottleneck experimentally, and then demonstrate the architecture's strength on three tasks: image classification, object detection, and semantic segmentation. On ImageNet classification, MobileNetV2 reaches a best Top-1 accuracy of 74.7, better than MobileNetV1, ShuffleNet, and NASNet-A. On object detection, MNetV2 + SSDLite achieves mAP close to MNetV1 + SSDLite while clearly reducing parameter count and computation time. On semantic segmentation it maintains good performance while cutting parameters and computational cost.

Paper title: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

Chinese title: 深度特征作为感知度量的有效性

Authors: Zhang Richard, Isola Phillip, Efros Alexei A., Shechtman Eli, Wang Oliver

Published in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)

Paper link: https://ieeexplore.ieee.org/document/8578166

Research problem:
For humans, judging the perceptual similarity of two images is almost effortless and fast, yet the underlying process is believed to be quite complex. Visual patterns are high-dimensional and highly correlated, and the notion of visual similarity is very subjective. In image compression, for example, images are compressed so that they look nearly unchanged to humans, with no concern for the potentially large differences in pixel values.

The most widely used traditional pixel-based metrics (e.g., L2 Euclidean distance, PSNR) and perceptual distance metrics (e.g., SSIM, MSSIM) are simple, shallow functions that fail to capture many subtleties of human perception. A classic example is blurring, which makes an image look very different perceptually while changing its L2 distance little. As the figure below shows, traditional metrics can be completely opposed to human perceptual judgments.

The deep learning community has recently found that deep features extracted by VGG networks trained on ImageNet classification are very useful as a training loss for image synthesis; such losses are generally called "perceptual losses" (perceptual losses). But how effective are these perceptual losses, and which elements are critical to their success? The researchers of this paper set out to explore these questions.


image

Research methods:
To study the effectiveness of deep features extracted by deep neural networks as a perceptual loss, the researchers constructed a new dataset of human perceptual similarity judgments: the Berkeley-Adobe Perceptual Patch Similarity dataset (BAPPS). The dataset contains 484k human judgments and covers a large set of traditional distortions such as contrast, saturation, and noise; distortions based on CNN models, such as those produced by autoencoding and denoising; and distortions from real algorithms, such as super-resolution reconstruction and deblurring.

Given a network, the paper computes the distance between a reference patch $x$ and a distorted patch $x_0$ as follows: extract features from several layers, unit-normalize the activations along the channel dimension, scale each channel with a vector $w_l$, take the squared ℓ2 distance, then average over the spatial dimensions and sum over layers. With $\hat{y}^l$ denoting the unit-normalized activations of layer $l$:

$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}_{hw}^{l} - \hat{y}_{0hw}^{l} \right) \right\|_2^2$$

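A minimal sketch of that distance (feature extraction is stubbed out with random tensors; in the paper the activations come from a pretrained network such as VGG, and the per-channel weights $w_l$ are learned):

```python
# Unit-normalize per channel, scale channels, squared L2, average over
# space, sum over layers -- the distance described above.
import torch

def lpips_like_distance(feats_x, feats_x0, weights):
    """feats_*: lists of [N, C_l, H_l, W_l] activations, one per layer.
    weights: list of per-channel scale vectors w_l with shape [C_l]."""
    total = 0.0
    for y, y0, w in zip(feats_x, feats_x0, weights):
        y = y / (y.norm(dim=1, keepdim=True) + 1e-10)     # unit-normalize channels
        y0 = y0 / (y0.norm(dim=1, keepdim=True) + 1e-10)
        diff = (w.view(1, -1, 1, 1) * (y - y0)) ** 2      # channel-wise scaling
        total = total + diff.sum(dim=1).mean(dim=(1, 2))  # sum over C, mean over H,W
    return total                                          # summed over layers

# Toy usage with random "features" from two 2-layer stacks:
fx  = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
fx0 = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
ws  = [torch.ones(64), torch.ones(128)]
print(lpips_like_distance(fx, fx0, ws))
```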

Results:
The authors ran a large set of experiments, systematically evaluating deep features across different network architectures and tasks and comparing them with classic metrics, and found that deep features are a very good perceptual metric. More surprisingly, this result is not restricted to deep features from ImageNet-trained VGG: it also holds across different deep architectures and different training regimes (supervised, self-supervised, and even unsupervised).

Paper title: Residual Dense Network for Image Super-Resolution

Chinese title: 基于残差密集网络的图像超分辨率重建

Authors: Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu

Published in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)

Paper link: https://ieeexplore.ieee.org/document/8578360

Research content:
Single image super-resolution (SISR) aims to generate a visually pleasing high-resolution (HR) image from its degraded low-resolution (LR) observation. Deep convolutional neural networks have recently achieved great success in image super-resolution: ever deeper networks provide rich hierarchical features, and since objects in images appear at different scales, viewpoints, and aspect ratios, hierarchical features from very deep networks supply more cues for reconstruction. However, most CNN-based deep super-resolution models do not make full use of the hierarchical features from the original low-resolution (LR) image, and so achieve relatively low performance. In this paper the researchers propose a novel residual dense network (RDN) to address this problem, allowing the model to fully exploit the hierarchical features extracted by all convolutional layers.

Research methods:
The figure below shows the residual dense network (RDN), which consists of four parts: a shallow feature extraction net (SFENet), residual dense blocks (RDBs), dense feature fusion (DFF), and an upsampling net (UPNet).

Directly extracting the output of every convolutional layer in LR space is difficult and impractical in a very deep network, so the residual dense block (RDB) is used as RDN's building module. An RDB consists of densely connected layers and local feature fusion (LFF) with local residual learning. RDBs also support contiguous memory between blocks: the output of one RDB has direct access to every layer of the next RDB, forming a contiguous state pass. Every convolutional layer in an RDB has access to all subsequent layers and passes on the information that needs to be preserved. Local feature fusion concatenates the state of the preceding RDB with all preceding layers of the current RDB, extracting locally dense features by adaptively preserving information; LFF also stabilizes the training of wider networks with higher growth rates. After multi-level local dense features are extracted, global feature fusion (GFF) adaptively preserves the hierarchical features in a global manner. Every convolutional layer in RDN uses 3×3 kernels, while local and global feature fusion use 1×1 kernels. The upsampling part uses ESPCNN to raise the image's resolution.

image
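A minimal PyTorch sketch of one residual dense block as just described: densely connected 3×3 convolutions, 1×1 local feature fusion, and a local residual connection. The layer count and growth rate are illustrative, not the paper's settings.

```python
# One residual dense block (RDB): each layer sees the block input plus all
# previous layers' outputs; LFF fuses everything; a residual closes the block.
import torch
import torch.nn as nn

class RDB(nn.Module):
    def __init__(self, ch=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
        # Local feature fusion: 1x1 conv over all concatenated states.
        self.lff = nn.Conv2d(ch + num_layers * growth, ch, 1)

    def forward(self, x):
        states = [x]
        for layer in self.layers:
            states.append(layer(torch.cat(states, dim=1)))  # dense connections
        return x + self.lff(torch.cat(states, dim=1))        # local residual learning

x = torch.randn(1, 64, 48, 48)
print(RDB()(x).shape)   # torch.Size([1, 64, 48, 48])
```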

Results:
The model is trained on all 800 training images of the DIV2K dataset and tested on five standard benchmark datasets: Set5, Set14, B100, Urban100, and Manga109. To fully demonstrate the method's effectiveness, three image degradation processes are simulated:
(1) bicubic downsampling (BI);
(2) blurring the HR image with a Gaussian kernel and then downsampling (BD);
(3) bicubic downsampling followed by additive Gaussian noise (DN).

The authors' extensive experiments show:
(1) The more RDBs, or the more convolutional layers per RDB, the better the model performs; larger growth rates also bring better performance. Even with few of these modules, RDN still outperforms SRCNN.
(2) Ablation experiments verify the effectiveness of contiguous memory, local residual learning, and global feature fusion in the proposed model.
(3) On the three degradation models, RDN is compared with six state-of-the-art models: SRCNN, LapSRN, DRRN, SRDenseNet, MemNet, and MDSR. Across scale factors, degradation models, and datasets, RDN shows comparable or better performance.

Paper title: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design

Chinese title: ShuffleNet V2:高效CNN网络结构设计实用指南

Authors: Ma Ningning, Zhang Xiangyu, Zheng Hai-Tao, Sun Jian

Published in: Computer Vision – ECCV 2018, 15th European Conference, Proceedings; Lecture Notes in Computer Science, vol. 11218, pp. 122-138, 2018

Paper link: https://link.springer.com/chapter/10.1007%2F978-3-030-01264-9_8

Research content:
Since AlexNet, ImageNet classification accuracy has been raised again and again by new network architectures such as ResNet and DenseNet. But besides accuracy, computational complexity is another important consideration for CNNs. Real-world tasks usually seek the best accuracy under a limited computational budget, and networks that are too complex are too slow for mobile and similar devices.

For this reason, researchers have proposed many lightweight CNNs, such as MobileNet and ShuffleNet, that strike a good balance between speed and accuracy. When accounting for computational complexity, however, previous mobile-oriented designs aimed directly at reducing the network's overall FLOPs, which does not reflect the speed and latency we actually care about: networks with similar FLOPs can differ in speed. Memory access cost (MAC), the computing platform, and other aspects must be considered as well. To meet practical needs, the researchers do not confine themselves to reducing theoretical FLOPs, but provide guidance for lightweight network design from a more direct perspective.

Research methods:

image

The authors suggest that effective network architecture design should follow two principles: first, use direct metrics (such as speed) rather than indirect metrics (such as FLOPs); second, evaluate such metrics on the target platform. From an analysis of two representative state-of-the-art networks, the authors derive four guidelines for efficient network design:

(1) a convolutional layer's MAC is minimized when its input and output channel counts are equal (see the sketch after this list);
(2) excessive group convolution increases MAC;
(3) network fragmentation reduces the degree of parallelism;
(4) element-wise operations add non-negligible time cost.
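For the first guideline, the underlying arithmetic can be sketched as follows (standard counting for a 1×1 convolution; notation mine): with $c_1$ input channels, $c_2$ output channels, and an $h \times w$ feature map, the FLOPs are $B = hwc_1c_2$, while the memory access cost satisfies

$$\mathrm{MAC} = hw(c_1 + c_2) + c_1 c_2 \;\ge\; 2\sqrt{hwB} + \frac{B}{hw}$$

by the AM-GM inequality, with equality exactly when $c_1 = c_2$. So for fixed FLOPs, MAC is minimized by equal input and output widths.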

Following these guidelines, the authors propose a more efficient architecture: ShuffleNet V2. The figure compares the building blocks of ShuffleNet V1 ((a) and (b) in the figure) and ShuffleNet V2 ((c) and (d)). Compared with (a) and (b), ShuffleNet V2 first applies a Channel Split operation that divides the input channels into two parts, one passed straight through and the other used for computation; it then abandons the 1×1 group convolution, moves the channel shuffle (Channel Shuffle) operation to the end, and replaces the earlier Add operation with Concat.
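A minimal PyTorch sketch of that basic unit (channel split, one pass-through branch, one conv branch, concat, then shuffle). Channel counts are illustrative, not the paper's exact configuration.

```python
# ShuffleNet V2-style unit: split -> (identity | 1x1 conv, 3x3 depthwise,
# 1x1 conv) -> concat -> channel shuffle.
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.shape
    # Regroup channels so information mixes across the two branches.
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                 # Channel Split
        out = torch.cat([a, self.branch(b)], 1)  # Concat instead of Add
        return channel_shuffle(out)              # Channel Shuffle moved to the end

x = torch.randn(1, 116, 28, 28)
print(ShuffleV2Unit(116)(x).shape)   # torch.Size([1, 116, 28, 28])
```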

Results:
The paper presents extensive experiments comparing ShuffleNet V2 in detail with MobileNet V1/V2, ShuffleNet V1, DenseNet, Xception, IGCV3-D, NASNet-A, and other models in speed, accuracy, and FLOPs. Many of the results corroborate the findings above, and ShuffleNet V2 achieves an excellent balance between accuracy and speed.

Paper title: A Theory of Fermat Paths for Non-Line-of-Sight Shape Reconstruction

Chinese title: 非视距形状重建的费马路径理论

Authors: Shumian Xin, Sotiris Nousias, Kiriakos N. Kutulakos, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan, Ioannis Gkioulekas

Published in: CVPR 2019: IEEE Conference on Computer Vision and Pattern Recognition

Paper link: https://www.ri.cmu.edu/wp-content/uploads/2019/05/cvpr2019.pdf

Research problem:
In many situations a camera cannot capture the entire scene or object: the back of an object facing the camera, an object around a corner, or an object seen through a diffuser. Non-line-of-sight (non-line-of-sight, NLOS) imaging is critical for many safety and security applications. Some traditional methods analyze the subtle umbra and penumbra of the shadows cast by the hidden scene to estimate coarse motion and structure, or use the coherence of light to localize hidden objects, but they can hardly reconstruct the 3D shape of an arbitrary hidden scene. Transient NLOS imaging based on active illumination mostly uses rapidly modulated light sources and time-resolved sensors, but existing SPAD intensity estimates are imperfect, and reconstruction assumes Lambertian reflectance of the NLOS object. The authors overcome these limitations by deriving geometric constraints, rather than intensity constraints, from NLOS transient measurements.

image

The figure above shows examples of non-line-of-sight imaging: surface reconstructions of objects occluded by an opaque panel (a) and by a diffuser (b), compared with line-of-sight scans (c).

Research methods:
The authors propose a new theory of Fermat paths (Fermat path) of light traveling between the known visible scene and an unknown object outside the transient camera's line of sight. These light paths either reflect specularly or reflect at the object's boundary, and in doing so they encode the shape of the hidden object. The authors prove that Fermat paths correspond to discontinuities in the transient measurements, and that the locations of the discontinuities are related only to the shape of the NLOS object, not to its reflectance (BRDF). They further derive a new constraint that relates the spatial derivatives of the path lengths at these discontinuities to the curvature of the surface.

Based on this theory, the authors propose an algorithm called Fermat Flow (Fermat Flow) to estimate the shape of the non-line-of-sight object. The key is that the spatial derivative of the Fermat path length uniquely determines the depth and normal of the hidden scene point; by fitting and estimating a smooth path-length function and combining the depths and normals, a smooth mesh is obtained, accurately recovering the shapes of complex objects, from diffuse to specular, hidden around corners or behind diffusers. Finally, the method is independent of the particular technology used for transient imaging.

Results:
The authors used everyday objects of varied BRDFs and concave/convex geometry, including translucent (a plastic jug), glossy (a bowl, a vase), rough specular (a kettle), and smooth specular (a sphere) objects. Two kinds of experiments were carried out: recovering millimeter-scale shape from picosecond-scale transients using a SPAD and an ultrafast laser, and recovering micrometer-scale shape from femtosecond-scale transients using interferometry. The results show that the reconstructed detail agrees very closely with the ground-truth shapes.

Paper title: Implicit 3D Orientation Learning for 6D Object Detection from RGB Images

Chinese title: 从RGB图像检测6维位姿的隐式三维朝向学习

Authors: Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Rudolph Triebel

Published in: ECCV 2018: European Conference on Computer Vision

Paper link: http://openaccess.thecvf.com/content_ECCV_2018/papers/Martin_Sundermeyer_Implicit_3D_Orientation_ECCV_2018_paper.pdf

Research problem:
For applications such as mobile robot control and augmented reality, one of the most important components of a modern computer vision system is a reliable and fast 6D object detection module. Yet there is still no universal, easy-to-apply, robust, and fast solution. The reasons are manifold: first, current solutions often cannot handle the typical challenges effectively; second, existing methods often require certain object properties; moreover, current methods are inefficient both in runtime and in the amount and variety of annotated training data they require. The authors propose to operate on a single RGB image, which largely removes the need for depth information and significantly increases usability.

Research methods:

image

The figure above shows the 6D object detection pipeline, with the homogeneous coordinate transformation Hcam2obj (top right) and the depth-refined result Hcam2obj(refined) (bottom right). The authors propose an RGB-based real-time pipeline for object detection and 6D pose estimation. First, SSD (Single Shot Multibox Detector) provides object bounding boxes and identifiers. On top of this, a novel 3D orientation estimation algorithm is applied, based on a generalized version of the earlier denoising autoencoder (Denoising Autoencoder): the augmented autoencoder (AAE). The AAE uses a novel domain randomization strategy, and what the model learns is not an explicit mapping from input images to object poses but an implicit representation of object orientation built in latent space from image samples. Training is thus independent of any concrete representation of object orientation (such as quaternions), avoiding one-to-many mappings from images to orientations, so the AAE can handle ambiguous poses caused by symmetric views. The learned representation specifically encodes 3D orientation while being robust to occlusion and cluttered backgrounds, and it generalizes to different environments and test sensors. Furthermore, the AAE does not need any real pose-annotated training data: it is trained to encode views of the 3D model in a self-supervised way, overcoming the need for large pose-annotated datasets. The figure below shows the AAE training process.

image
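In the paper, orientation at test time is obtained by comparing the encoder's latent code for a detected crop against a precomputed codebook of codes for rendered views with known rotations. A minimal sketch of that lookup (the encoder is a stand-in, and all names and sizes are mine):

```python
# AAE-style orientation lookup: encode the crop, then take the cosine-
# similarity nearest neighbor in a codebook of encoded rendered views.
import torch
import torch.nn.functional as F

latent_dim = 128
num_views = 2562                     # e.g., viewpoints sampled on a sphere

encoder = torch.nn.Sequential(       # stand-in for the trained AAE encoder
    torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, latent_dim))

# Offline: encode rendered views of the 3D model, store with their rotations.
codebook = torch.randn(num_views, latent_dim)   # placeholder encodings
rotations = torch.randn(num_views, 3, 3)        # placeholder rotation matrices

# Online: encode the detected RGB crop and pick the most similar view.
crop = torch.randn(1, 3, 64, 64)
z = encoder(crop)
sims = F.cosine_similarity(z, codebook)         # similarity to each stored view
best = sims.argmax()
estimated_rotation = rotations[best]            # 3D orientation estimate
```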

Results:
The authors evaluate the AAE and the entire 6D detection pipeline, comprising 2D detection, 3D orientation estimation, and projective distance estimation, on the T-LESS and LineMOD datasets. Compared with state-of-the-art deep learning approaches, the AAE is more accurate while also being more efficient. The authors also analyze some failure cases, which stem mainly from detection failures or strong occlusion.

Paper title: SinGAN: Learning a Generative Model from a Single Natural Image

Chinese title: SinGAN:从单张图像学习生成模型

Authors: Tamar Rott Shaham (Technion), Tali Dekel (Google Research), Tomer Michaeli (Technion)

Published in: ICCV 2019: IEEE International Conference on Computer Vision

Paper link: https://arxiv.org/pdf/1905.01164.pdf

Research problem:
Generative adversarial networks (Generative Adversarial Nets, GANs) have made a huge leap in modeling the high-dimensional distributions of visual data. In particular, unconditional GANs trained on class-specific datasets (e.g., faces, bedrooms) have achieved remarkable success in generating realistic, high-quality samples. But modeling highly diverse datasets with many classes (e.g., ImageNet) remains a major challenge, and it usually requires conditioning the generation on another input signal or training the model for a specific task. Modeling the internal distribution of patches within a single natural image has long been recognized as a useful prior for many computer vision tasks. The authors take GANs into a new domain: learning an unconditional generative model from a single natural image. A single natural image usually contains enough internal statistics to learn a powerful generative model without relying on a dataset of the same class. To this end the authors propose SinGAN, a new single-image generative model: a network able to handle ordinary natural images containing complex structures and textures.

image

Relative to the original image on the left, SinGAN generates new, realistic image samples that create new object configurations and structures while preserving the original patch distribution.

Research methods:
The authors' goal is to learn an unconditional generative model that captures the internal statistics of a single training image. The task is conceptually similar to the conventional GAN setting, except that the training samples are multi-scale patches of a single image rather than whole image samples from a dataset. To this end, SinGAN's generation framework consists of a hierarchy of patch-GANs (Markovian discriminators), each discriminator responsible for capturing the patch distribution at a different scale; this is the first network structure explored for internal learning from a single image. An image sample starts at the coarsest scale and passes sequentially through all the generators up to the finest scale, with noise injected at every scale. All generators and discriminators have the same receptive field, so finer and finer structures are captured as the generation process advances. Training uses the WGAN-GP adversarial loss to increase stability, together with a reconstruction loss designed to ensure that a specific set of noise maps can generate the original image.
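A loose sketch of that coarse-to-fine sampling loop (the per-scale generators are untrained stand-ins, and the scale sizes and the residual-refinement form are illustrative assumptions, not the paper's exact formulation):

```python
# SinGAN-style sampling: start from noise at the coarsest scale; at each
# finer scale, upsample the previous output, inject fresh noise, and refine.
import torch
import torch.nn.functional as F

scales = [(25, 33), (50, 66), (100, 132), (200, 264)]   # coarse -> fine (H, W)

def make_generator():
    # Stand-in for a trained per-scale fully convolutional generator.
    return torch.nn.Conv2d(3, 3, 3, padding=1)

generators = [make_generator() for _ in scales]

x = torch.zeros(1, 3, *scales[0])
for (h, w), g in zip(scales, generators):
    x = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
    z = torch.randn(1, 3, h, w)          # noise injected at every scale
    x = x + g(x + z)                     # residual refinement of the upsampled image
print(x.shape)                           # torch.Size([1, 3, 200, 264])
```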

Results:
The authors tested on datasets spanning a wide range of image scenes. Visually, SinGAN preserves the global structure and texture of targets well and synthesizes reflections and shadows very realistically. Quantitatively, they used an AMT real/fake user study and a single-image version of FID. The AMT tests show that the generated samples are quite realistic and preserve more detail, with a high human confusion rate; the FID results are consistent with AMT.

3.5 Progress in Computer Vision

In recent years, the continuing emergence of massive data and rapid increases in computing power have brought huge development opportunities and challenging problems to computer vision, which studies unstructured visual data. Computer vision has thus become a forward-looking research field recognized by both academia and industry, and some research results have already been put into practice, spawning highly visible commercial applications such as face recognition and intelligent video surveillance.

The research goal of computer vision is to give computers human-level visual ability: to understand image content and dynamic scenes, and to automatically extract the hierarchical semantic concepts contained in images, videos, and other visual data, together with the spatiotemporal relations among multiple semantic concepts. Exciting results keep emerging in the field; for example, performance on face recognition and on object recognition and classification approaches or even exceeds the human visual system. Based on the best papers and highly cited papers from the top computer vision conferences in the past two years, this section provides a comprehensive analysis of the field's technical state of the art and research frontiers.

In the past two years, most research has concentrated on deep learning, detection and classification, face/gesture/pose analysis, and 3D sensing. As computer vision research advances, researchers have begun to attack harder problems, such as image captioning, event reasoning, and scene understanding. More complex image-understanding tasks are hard to solve from images or video alone; an important trend is multi-disciplinary fusion, for example, drawing on natural language processing techniques for the image captioning task. Image captioning is a problem that fuses computer vision, natural language processing, and machine learning, with the goal of translating an image into a passage of descriptive text. The current mainstream framework is an encoder-decoder structure based on recurrent neural networks, whose core idea resembles machine translation in natural language processing.

However, because recurrent networks struggle to extract the spatial and hierarchical constraints between the input image and the text, hierarchical convolutional neural networks and attention mechanisms inspired by cognitive models have attracted attention. How to draw further knowledge from cognition and other disciplines to build multimodal, multi-level captioning models is the current focus of image captioning research.

Event reasoning aims to recognize event categories in complex videos and to reason about and predict their causal relations. Compared with general video analysis, the difficulty is that event videos are more complex and diverse, and the end goal is more challenging. Unlike large-scale image recognition tasks, event reasoning is limited by the scale of its training data, so end-to-end event reasoning systems cannot yet be built. Current approaches mainly use image deep networks as video feature extractors, exploit multimodal feature fusion models, and draw on the reasoning ability of memory networks to recognize events and reason about them. Current research originates from video recognition and detection, and its methods do not fully account for the complexity and diversity of event data. How to exploit the rich spatiotemporal relations in video data and the semantic correlations between events should be the focus going forward.

The goal of scene understanding is for a computer vision system, by analyzing the environmental sensing data collected by its sensors, to obtain the geometric/topological structure of the surrounding scene, its constituent elements (people, vehicles, objects, etc.) and their spatiotemporal changes, and to perform semantic reasoning, forming the temporal and spatial constraints for behavioral decisions and motion control. In recent years, scene understanding has gone from an initially unattainable goal to an important research direction in which nearly all advanced computer vision systems are seeking new breakthroughs.

Social LSTM (Social-LSTM) networks are used to model the interactions among multiple pedestrians and, combined with each pedestrian's motion history, to predict their future trajectories. In addition, neural network compression is another hot direction in current deep learning research; its main techniques include compression (pruning), distillation, network architecture search, and quantization.

In summary, the development of vision requires designing new models that can take both spatial and temporal information into account; if weakly supervised training can produce good results, the next step is self-supervised learning; high-quality datasets for human detection and video object detection are needed; so is cross-modal integration combining text and sound; and learning should happen through interaction with the world.


Try model training on the Industrial Visual Intelligence platform now:

https://www.aliyun.com/product/indvi?spm=5176.12825654.h2v3icoap.467.e9392c4a1KMEL9&aly_as=c7DQGDJ5

Alibaba Cloud's pre-trained models are built from industry best practices for each scenario; combined with the user's actual scene and customized and optimized with the user's training sample data, the model is made to fit the user's actual usage scenario.
