Crisscross | GoogLeNet (1)


Paper title: Going Deeper with Convolutions

This paper is Google's work, published at CVPR 2015.

Paper address: link

Abstract

The authors propose a deep convolutional neural network architecture called Inception, which set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). This architecture improves the utilization of the computing resources inside the network. Through careful design, the authors increase the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions are based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in the authors' ILSVRC14 submission is called GoogLeNet, a 22-layer deep network whose quality is assessed in the context of classification and detection.

1. Introduction

In the past three years, object classification and detection capabilities have improved dramatically thanks to advances in deep learning and convolutional networks. The encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets, and bigger models, but mainly a consequence of new ideas, algorithms, and improved network architectures. In the ILSVRC 2014 competition, for example, no new data sources were used beyond the competition's own classification dataset (for the detection task). The authors' GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years earlier, while being significantly more accurate. In object detection, the biggest gains have come not from the naive application of ever larger deep networks, but from the synergy of deep architectures and classical computer vision, such as the R-CNN algorithm of Girshick et al. [6].

Another factor worth noting is that, as mobile and embedded computing continues to evolve, the efficiency of algorithms, especially their power and memory use, becomes increasingly important. The design considerations behind the deep architecture presented in this paper take this factor into account, rather than fixating purely on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as purely academic exercises but can be deployed in the real world at reasonable cost, even on large datasets.

This paper focuses on an efficient deep neural network architecture for computer vision, named Inception, whose name derives from the Network in Network paper by Lin et al. [12] together with the famous "we need to go deeper" Internet meme [1]. In this paper, the word "deep" is used in two different senses: first, it refers to the introduction of a new level of organization in the form of the "Inception module"; second, in the more direct sense of increased network depth. In general, the Inception model can be viewed as a logical culmination of [12], while taking inspiration and guidance from the theoretical work of Arora et al. [2]. The benefits of this architecture are verified experimentally on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the then state of the art.

2. Related Work

Starting with LeNet-5, convolutional neural networks (CNNs) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max pooling) followed by one or more fully connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR, and, most notably, the ImageNet classification challenge. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers and the layer size, while using dropout to address overfitting.

The same convolutional network architecture as [9] has also been applied successfully to localization, object detection, and human pose estimation, despite concerns that max-pooling layers cause a loss of accurate spatial information.

Inspired by a neuroscientific model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. The authors of GoogLeNet use a similar multi-scale strategy. However, unlike the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times; in the case of GoogLeNet, this yields a 22-layer deep model.
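
To make the multi-scale idea concrete, here is a minimal PyTorch sketch of a block that applies learnable filters of several sizes in parallel and concatenates their outputs. This is only an illustration in the spirit of the Inception module, not the paper's exact configuration; the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Learnable filters of different sizes applied in parallel.

    Illustrative only: the real Inception module (Section 4 of the paper)
    additionally uses 1x1 reduction layers and a pooling branch.
    """
    def __init__(self, in_channels):
        super().__init__()
        # Padding is chosen so all branches keep the same spatial size.
        self.branch1 = nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, 16, kernel_size=5, padding=2)

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

x = torch.randn(1, 64, 28, 28)        # (batch, channels, height, width)
print(MultiScaleBlock(64)(x).shape)   # torch.Size([1, 48, 28, 28])
```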

Network-in-Network is an approach proposed by Lin et al. [12] to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to the network, increasing its depth. The authors use this approach many times in the architecture of this paper. In their setting, however, the 1×1 convolution has a dual purpose: it serves mainly as a dimensionality-reduction module to remove computational bottlenecks that would otherwise limit the size of the network, which allows not only increasing the depth but also the width of the network without a significant performance penalty.
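
As a concrete illustration of this dimensionality-reduction role, here is a small PyTorch sketch; the channel counts (256, 64, 192) are made up for illustration and are not taken from the paper.

```python
import torch.nn as nn

# Direct 3x3 convolution on 256 input channels.
direct = nn.Conv2d(256, 192, kernel_size=3, padding=1)

# A 1x1 convolution first reduces 256 channels to 64, then the 3x3
# convolution operates on the much thinner representation.
reduced = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # dimensionality reduction
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 192, kernel_size=3, padding=1),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(direct))   # 442560
print(n_params(reduced))  # 127232
```

For identical input and output shapes, the reduced version has roughly 3.5 times fewer weights, and per spatial position the multiply-add count shrinks by about the same factor.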

Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two sub-problems: first generating class-agnostic object location proposals from low-level cues such as color and texture, and then using a CNN classifier to identify object categories at those locations. This two-stage approach leverages the accuracy of bounding-box segmentation with low-level cues as well as the powerful classification ability of state-of-the-art CNNs. The authors adopt a similar pipeline in their detection submission, exploring enhancements in both stages, such as multi-box prediction for higher object bounding-box recall and ensemble approaches for better categorization of bounding-box proposals.
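
As a sketch of this two-stage pipeline (with hypothetical function names; this is not Girshick et al.'s actual code), the control flow looks roughly like:

```python
def detect(image, propose_regions, classify_crop):
    """Two-stage detection in the spirit of R-CNN [6].

    propose_regions: class-agnostic proposal generator driven by
                     low-level cues (e.g. selective search).
    classify_crop:   a strong CNN classifier applied to each region.
    """
    detections = []
    for box in propose_regions(image):            # stage 1: where could objects be?
        label, score = classify_crop(image, box)  # stage 2: what is in the box?
        if label != "background":
            detections.append((box, label, score))
    return detections
```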

3. Motivation and High Level Considerations

The most straightforward way to improve the performance of a deep neural network is to increase its size. This includes both increasing the depth (the number of network layers) and its width (the number of units per layer). This is an easy and safe way to train higher-quality models, especially given a large amount of labeled training data. However, this simple solution has two major drawbacks.

A larger size usually means more parameters, which makes the enlarged network more prone to overfitting, especially when the number of labeled examples in the training set is limited. This is a major bottleneck, because creating strongly labeled datasets is laborious and expensive, often requiring expert human raters to distinguish fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset), as illustrated in Figure 1.

Figure 1: Two distinct classes out of the 1000 classes of the ILSVRC 2014 classification challenge. Domain knowledge is required to distinguish between these classes.

The other disadvantage of uniformly increasing the size of the network is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase in computation. If the added capacity is used inefficiently (for example, if most weights end up close to zero), much of the computation is wasted. Since the computational budget is always finite, an efficient distribution of computing resources is preferable to an indiscriminate increase in size, even when the main objective is to improve performance.
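
A quick back-of-the-envelope check makes the quadratic growth concrete; the feature-map and filter counts below are illustrative, not from the paper. The multiply-add count of a convolutional layer scales with the product of its input and output channel counts, so uniformly doubling the filters of two chained layers quadruples the cost of the connection between them:

```python
# Multiply-adds of one conv layer: H * W * k * k * C_in * C_out
def conv_cost(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

# Cost of the connection between two chained 3x3 layers with 64 filters
# each, on a 28x28 feature map (illustrative numbers).
base = conv_cost(28, 28, 3, 64, 64)

# Uniformly doubling the filter count of both layers doubles both C_in
# and C_out of the second layer, so its cost grows quadratically.
doubled = conv_cost(28, 28, 3, 128, 128)

print(doubled / base)  # 4.0 -- twice the width, four times the computation
```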

A fundamental way of addressing both problems is to introduce sparsity, replacing fully connected layers with sparse ones, even inside the convolutions. Besides mimicking biological systems, this also has the advantage of a firmer theoretical underpinning due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the preceding layer and clustering neurons with highly correlated outputs. Although the rigorous mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle (neurons that fire together, wire together) suggests that the underlying idea is applicable in practice even under less strict conditions.

Unfortunately, today's computing infrastructure is very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations were reduced by a factor of 100, the overhead of lookups and cache misses would dominate: switching to sparse matrices is unlikely to pay off. The gap is widened even further by the use of steadily improving, highly tuned numerical libraries that allow extremely fast dense matrix multiplication by exploiting the minute details of the underlying CPU or GPU hardware. In addition, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems exploit sparsity in the spatial domain merely by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches of the previous layer. Since [11], ConvNets have traditionally used random and sparse connection tables in the feature dimension in order to break symmetry and improve learning, yet the trend changed back to full connections with [9] in order to better optimize parallel computing. Current state-of-the-art computer vision architectures have a uniform structure; the large number of filters and the greater batch size allow for the efficient utilization of dense computation.
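
The sketch below is a rough way to probe this gap on one's own machine (results vary widely with hardware and the underlying BLAS build; the sizes and density here are arbitrary). It multiplies a dense matrix and a 99%-sparse matrix of the same nominal size against a common dense operand:

```python
import time
import numpy as np
import scipy.sparse as sp

n = 2000
a_dense  = np.random.rand(n, n).astype(np.float32)
b        = np.random.rand(n, n).astype(np.float32)
a_sparse = sp.random(n, n, density=0.01, format="csr", dtype=np.float32)

t0 = time.perf_counter()
a_dense @ b                 # highly tuned dense BLAS kernel
t1 = time.perf_counter()
a_sparse @ b                # CSR product: 100x fewer multiplications,
t2 = time.perf_counter()    # but irregular indexing and cache misses

print(f"dense:  {t1 - t0:.4f} s")
print(f"sparse: {t2 - t1:.4f} s")
# Despite doing ~1% of the arithmetic, the sparse product is typically
# nowhere near 100x faster, which is the point made above.
```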

This raises the question of whether there is hope for a promising intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits current hardware by performing its computations on dense matrices. The vast literature on sparse matrix computation (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods will be used for the automated construction of non-uniform deep learning architectures in the near future.

The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network-topology construction algorithm that tries to approximate the sparse structure implied by [2] for vision networks, and for covering the hypothesized outcome with dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on compared to a reference network based on [12]. With a bit of tuning the gap widened, and Inception proved to be especially useful as a base network in the context of localization and object detection, as in [6] and [5]. Interestingly, while most of the original architectural choices were questioned thoroughly and tested in isolation, they turned out to be close to a local optimum. One must be cautious, though: although the Inception architecture has been a success in computer vision, it is still questionable whether this can be attributed to the guiding principles behind its construction. Making sure of this would require much more thorough analysis and verification.

References

[1] Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15.

[2] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.

[3] Ümit V. Çatalyürek, Cevdet Aykanat, and Bora Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, February 2010.

[6] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.

[15] Thomas Serre, Lior Wolf, Stanley M. Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
