A Review of a 94-Page Paper on Convolutional Neural Networks: From Basic Techniques to Research Prospects


Introduction: Convolutional neural networks (CNNs) have achieved unprecedented success in the field of computer vision, but we still lack a comprehensive understanding of why they are so effective. Recently, Isma Hadji and Richard P. Wildes of the Department of Electrical Engineering and Computer Science at York University published the paper "What Do We Understand About Convolutional Networks?", which surveys the technical foundations, building blocks, current state, and research prospects of convolutional networks, organizing and presenting our current understanding of CNNs.





Paper address: https://arxiv.org/abs/1803.08834 



01 Introduction


1.1 Motivation


In the past few years, computer vision research has mainly focused on convolutional neural networks (commonly abbreviated as ConvNets or CNNs). These works have achieved new state-of-the-art performance on a wide range of classification and regression tasks. Although the history of these methods goes back many years, the theoretical understanding of how these systems achieve their excellent results lags behind. In fact, many achievements in the current computer vision field treat the CNN as a black box. This approach is effective, but the reasons for its effectiveness remain obscure, which falls seriously short of the standards of scientific inquiry. Two complementary questions stand out in particular: (1) On the learning side, what exactly is being learned (e.g., by the convolution kernels)? (2) On the architecture-design side (e.g., number of layers, number of kernels, pooling strategy, choice of nonlinearity), why are some choices better than others? The answers to these questions will not only improve our scientific understanding of CNNs, but also their practical usefulness.


Furthermore, current approaches to implementing CNNs require large amounts of training data, and design decisions have a large impact on the resulting performance. A deeper theoretical understanding should alleviate the reliance on data-driven design. Although there have been empirical studies investigating how the implemented network operates, so far these results have largely been limited to the visualization of internal processing in order to understand what happens in different layers in a CNN.


1.2 Objectives


In response to the above, this report will outline the most prominent methods proposed by researchers using multi-layer convolutional architectures. It is important to note that this report will discuss the various components of a typical convolutional network by outlining different approaches, and will present the biological findings and/or sound theoretical underpinnings on which their design decisions are based. In addition, this report will outline different attempts to understand CNNs through visualization and empirical research. The ultimate goal of this report is to elucidate the role of each processing layer involved in a CNN architecture, bring together our current understanding of CNNs, and illustrate unanswered questions.


1.3 Outline of the report


This report is structured as follows: This chapter presents the motivation to review our understanding of convolutional networks. Chapter 2 will describe various multilayer networks and present the most successful architectures used in computer vision applications. Chapter 3 will focus more specifically on each of the building blocks of a typical convolutional network, and will discuss the design of the different components from both biological and theoretical perspectives. Finally, Chapter 4 will discuss current trends in CNN design and understanding the work of CNNs, and will also highlight some key shortcomings that remain.



02 Multilayer Networks


Overall, this chapter provides a brief overview of the most prominent multi-layer architectures used in the field of computer vision. It should be noted that although this chapter covers the most important contributions in the literature, it does not provide a comprehensive overview of these architectures, as such overviews already exist elsewhere (e.g., [17, 56, 90]). Rather, the purpose of this chapter is to lay the groundwork for the remainder of this report, so that we can present and discuss in detail the current understanding of convolutional networks for visual information processing.


2.1 Multilayer Architecture


Prior to the recent success of deep-learning-based networks, state-of-the-art computer vision systems for recognition relied on two separate but complementary steps. In the first step, the input data is transformed into a suitable form through a set of manually designed operations (such as convolutions with a basis set, or local or global encoding methods). Transforming the input often requires finding a compact and/or abstract representation of the input data, while also injecting certain invariances depending on the task at hand. The goal of this transformation is to change the data in a way that makes it more easily separable by a classifier. In the second step, the transformed data is used to train some type of classifier (such as a support vector machine) to recognize the content of the input signal. In general, the performance of any classifier is heavily influenced by the transformation method used.


Multi-layer learning architectures bring a different perspective to this problem, which proposes not only to learn a classifier, but also to learn the required transformation operations directly from the data. This form of learning is often referred to as "representation learning", or "deep learning" when applied to deep multi-layer architectures.


A multi-layer architecture can be defined as a computational model that allows useful information to be extracted from the input data at multiple levels of abstraction. In general, multi-layer architectures are designed to highlight important aspects of the input at higher layers while becoming increasingly robust to less significant variations. Most multi-layer architectures stack simple building blocks with alternating linear and nonlinear functions. Over the years, researchers have proposed many different types of multi-layer architectures, and this chapter covers the most prominent ones used in computer vision applications. Artificial neural networks are the focus of attention here, because this family of architectures performs particularly well. For simplicity, we will refer to such networks simply as "neural networks" in what follows.


2.1.1 Neural Network


A typical neural network consists of an input layer, an output layer, and multiple hidden layers, each of which contains multiple units.



Figure 2.1: Schematic diagram of a typical neural network architecture, from [17]


An autoencoder can be defined as a multi-layer neural network consisting of two main parts. The first part is the encoder, which transforms the input data into feature vectors; the second part is the decoder, which maps the generated feature vectors back to the input space.



Figure 2.2: The structure of a typical autoencoder network, from [17]
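
To make the encoder/decoder split concrete, here is a minimal autoencoder sketch in PyTorch; the layer sizes and the 784-dimensional input (a flattened 28x28 image) are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# A minimal fully connected autoencoder: the encoder maps the input to a
# lower-dimensional feature vector, and the decoder maps it back.
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        code = self.encoder(x)     # input space -> feature vector
        return self.decoder(code)  # feature vector -> input space

x = torch.randn(4, 784)            # e.g. 4 flattened 28x28 images
reconstruction = AutoEncoder()(x)
print(reconstruction.shape)        # torch.Size([4, 784])
```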


2.1.2 Recurrent Neural Network


When it comes to tasks that depend on sequential inputs, recurrent neural networks (RNNs) are one of the most successful multi-layer architectures. RNNs can be thought of as a special type of neural network where the input to each hidden unit is the data observed at its current time step and the state of its previous time step.



Figure 2.3: Schematic diagram of the operation of a standard recurrent neural network. The input to each RNN unit is the new input at the current time step together with the state of the previous time step; a new output is then calculated, which in turn can be fed to the next layer of a multi-layer RNN for processing.



Figure 2.4: Schematic diagram of a typical LSTM cell. The input to the cell is the current time step's input together with the state from the previous time step; it then computes an output that is passed on to the next time step. The final output of the LSTM cell is controlled by the input gate, the output gate, and the memory cell state. Figure from [33]
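
A minimal sketch of this recurrence, using PyTorch's built-in LSTM (the feature and state sizes are arbitrary assumptions for illustration):

```python
import torch
import torch.nn as nn

# An LSTM consumes a sequence one time step at a time, carrying a hidden
# state (and cell state) from each step to the next.
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(4, 7, 10)      # batch of 4 sequences, 7 time steps, 10 features
outputs, (h_n, c_n) = lstm(x)  # outputs: state at every step; (h_n, c_n): final states

print(outputs.shape)           # torch.Size([4, 7, 20])
print(h_n.shape, c_n.shape)    # torch.Size([1, 4, 20]) each
```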


2.1.3 Convolutional Networks


Convolutional networks (CNNs) are a class of neural networks that are particularly well suited to computer vision applications because of their ability to abstract representations hierarchically using local operations. Two key design ideas drive the success of convolutional architectures in computer vision. First, CNNs exploit the 2D structure of images and the fact that pixels within adjacent regions are often highly correlated. Therefore, instead of using one-to-one connections between all pixel units (as most neural networks do), CNNs can use grouped local connections. Second, the CNN architecture relies on weight sharing: each channel (i.e., output feature map) is generated by convolving the same filter over all locations of the input.
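
Both ideas can be seen in a single convolutional layer. The sketch below, in PyTorch with arbitrary illustrative sizes, shows 16 small shared filters sliding over all locations of an image, and how few parameters this requires:

```python
import torch
import torch.nn as nn

# Each output channel is produced by sliding one small filter over all
# spatial locations (weight sharing), and each output unit only sees a
# local 3x3 neighbourhood (local connectivity).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image
feature_maps = conv(image)
print(feature_maps.shape)          # torch.Size([1, 16, 32, 32])

# The layer reuses 16 small filters everywhere, so it needs far fewer
# parameters than a fully connected layer of the same output size.
print(sum(p.numel() for p in conv.parameters()))  # 16*3*3*3 + 16 = 448
```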



Figure 2.5: Schematic diagram of the structure of a standard convolutional network, from [93]



Figure 2.6: Schematic diagram of Neocognitron, from [49]


2.1.4 Generative Adversarial Networks


A typical generative adversarial network (GAN) consists of two competing modules or sub-networks: a generator network and a discriminator network.



Figure 2.7: Schematic diagram of the general structure of a generative adversarial network
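
The following minimal PyTorch sketch shows the two competing sub-networks; all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Minimal generator/discriminator pair (sizes are illustrative).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                  nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

z = torch.randn(8, 64)           # random noise vectors
fake = G(z)                      # generator: noise -> synthetic samples
score = D(fake)                  # discriminator: sample -> probability of being real
print(fake.shape, score.shape)   # torch.Size([8, 784]) torch.Size([8, 1])
```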


2.1.5 Training of Multilayer Networks


As discussed earlier, the success of the various multi-layer architectures depends largely on the success of their learning process. The training of such networks is usually based on the backpropagation of errors combined with gradient descent. Gradient descent is widely used for training multi-layer architectures because of its simplicity.
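
A minimal sketch of one such training step in PyTorch, combining error backpropagation with a gradient-descent update (the model and data here are toy placeholders):

```python
import torch
import torch.nn as nn

# One training step: forward pass, error computation, backpropagation of
# gradients, and a gradient-descent parameter update.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)  # forward pass and error
loss.backward()              # backpropagate the error through all layers
optimizer.step()             # gradient-descent update of every parameter
```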


2.1.6 A Brief Note on Transfer Learning


The applicability of features extracted with multi-layer architectures to a variety of different datasets and tasks can be attributed to their hierarchical nature, in which representations evolve from simple and local to abstract and global. Because the features extracted at the lower levels of the hierarchy tend to be shared across many different tasks, multi-layer architectures lend themselves readily to transfer learning.
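
A common way to exploit this in practice is to freeze the lower, generic layers of a pretrained network and retrain only a new task-specific head. A sketch in PyTorch, assuming a recent torchvision and a hypothetical 5-class target task:

```python
import torch.nn as nn
from torchvision import models

# Transfer-learning sketch: reuse the lower, generic layers of a pretrained
# CNN and retrain only a new task-specific classification head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False  # freeze the shared low-level features

# Replace the classifier with a new head for a (hypothetical) 5-class task;
# only this layer is then trained on the target dataset.
backbone.fc = nn.Linear(backbone.fc.in_features, 5)
```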


2.2 Spatial Convolutional Networks


In theory, convolutional networks can be applied to data of arbitrary dimensionality. Their two-dimensional instances are well suited to the structure of single images and have therefore received considerable attention in computer vision. With large-scale datasets and powerful machines to train on, the use of CNNs for a variety of different tasks has recently grown dramatically. This section introduces the most prominent 2D CNN architectures, each of which introduced relatively novel components beyond the original LeNet.


2.2.1 Key Architectures in the Recent Development of CNNs



Figure 2.8: AlexNet architecture. Note that although the diagram suggests a two-stream architecture, it is in fact a single-stream architecture; the diagram merely illustrates that AlexNet was trained in parallel on 2 different GPUs. Figure from [88]



Figure 2.9: GoogLeNet architecture. (a) A typical inception module showing operations performed sequentially and in parallel. (b) Schematic diagram of a typical inception architecture consisting of many inception modules stacked in layers. Figure from [138]



Figure 2.10: ResNet architecture. (a) Residual module. (b) Schematic diagram of a typical ResNet architecture consisting of many residual modules stacked in layers. Figure from [64]
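
The residual module in (a) can be written down compactly. A minimal PyTorch sketch follows; the channel counts and the BatchNorm/ReLU ordering reflect common practice and are illustrative:

```python
import torch
import torch.nn as nn

# A basic residual module: the block learns a residual F(x) that is added
# to the identity shortcut, so the output is F(x) + x.
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)           # torch.Size([1, 64, 56, 56])
```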



Figure 2.11: DenseNet architecture. (a) The dense module. (b) Schematic illustration of a typical DenseNet architecture consisting of many dense modules stacked in layers. Figure from [72]


2.2.2 Realizing Invariance in CNNs


One of the challenges in using CNNs is that very large datasets are required to learn all of the underlying parameters. Even large-scale datasets such as ImageNet, with more than one million images, are still considered too small for training certain deep architectures. One way to satisfy this demand for data is to artificially augment the dataset by randomly flipping, rotating, and jittering the images. A major advantage of these augmentation methods is that the resulting network remains more invariant under a variety of transformations.
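
A typical augmentation pipeline of this kind, sketched with torchvision transforms (the specific parameter values are illustrative assumptions):

```python
from torchvision import transforms

# Random flips, rotations, and colour jitter artificially enlarge the
# training set and encourage invariance to these transformations.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# Each epoch then sees a differently transformed copy of every training
# image, e.g.: augmented = augment(pil_image)
```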


2.2.3 Realizing Localization with CNNs


In addition to simple classification tasks such as recognizing objects, CNNs have recently performed well on tasks that require precise localization, such as semantic segmentation and object detection.


2.3 Spatiotemporal Convolutional Networks


The use of CNNs has brought significant performance improvements to a variety of image-based applications and has sparked interest in extending 2D spatial CNNs to 3D spatiotemporal CNNs for video analysis. In general, the various spatiotemporal architectures proposed in the literature are attempts to extend the 2D architectures of the spatial domain (x, y) into the spatiotemporal domain (x, y, t). Three distinct design decisions stand out among training-based spatiotemporal CNNs: LSTM-based CNNs, 3D CNNs, and two-stream CNNs.


2.3.1 LSTM-based spatiotemporal CNN


LSTM-based spatiotemporal CNNs were some early attempts to extend 2D networks to handle spatiotemporal data. Their operation can be summarized in three steps as shown in Figure 2.16. In the first step, each frame is processed using a 2D network and feature vectors are extracted from the last layer of these 2D networks. In the second step, these features from different time steps are used as input to the LSTM to obtain the temporal results. In the third step, these results are then averaged or linearly combined before being passed to a softmax classifier to get the final prediction.
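
A minimal PyTorch sketch of these three steps, with toy dimensions and a hypothetical 10-class action-recognition task:

```python
import torch
import torch.nn as nn

# Step 1: per-frame 2D CNN features; step 2: LSTM over time;
# step 3: average the per-step outputs and classify.
cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())  # frame -> 8-d vector
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
classifier = nn.Linear(16, 10)                              # 10 classes (assumed)

video = torch.randn(2, 5, 3, 32, 32)            # batch of 2 clips, 5 frames each
feats = cnn(video.flatten(0, 1)).view(2, 5, 8)  # step 1: per-frame features
temporal, _ = lstm(feats)                       # step 2: temporal modelling
logits = classifier(temporal.mean(dim=1))       # step 3: average, then classify
print(logits.shape)                             # torch.Size([2, 10])
```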


2.3.2 3D CNN


This prominent type of spatiotemporal network is the most direct generalization of 2D CNNs to the spatiotemporal domain of images. It processes temporal streams of RGB images directly, applying learned 3D convolutional filters to them.
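
A minimal sketch of such a 3D convolution in PyTorch; the clip size (16 frames of 112x112 RGB) is an illustrative assumption:

```python
import torch
import torch.nn as nn

# A 3D convolution slides a filter jointly over (t, y, x), directly mixing
# information across frames as well as across space.
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3), padding=1)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, 16 frames, H, W)
out = conv3d(clip)
print(out.shape)                        # torch.Size([1, 16, 16, 112, 112])
```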


2.3.3 Two-Stream CNN


This type of spatiotemporal architecture relies on a two-stream design. The standard two-stream architecture employs two parallel pathways, one for appearance and one for motion; this approach is similar to the two-stream hypothesis in the study of biological vision systems.
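
A minimal PyTorch sketch of this design, with tiny toy streams and late fusion by summing the class scores; the stacked-flow input of 5 (dx, dy) fields and the 10 output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Two parallel pathways: an appearance stream on RGB frames and a motion
# stream on stacked optical-flow fields; their predictions are fused.
appearance = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
motion = nn.Sequential(nn.Conv2d(10, 8, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

rgb = torch.randn(1, 3, 224, 224)          # single RGB frame
flow = torch.randn(1, 10, 224, 224)        # e.g. 5 stacked (dx, dy) flow fields
logits = appearance(rgb) + motion(flow)    # late fusion of the two streams
print(logits.shape)                        # torch.Size([1, 10])
```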


2.4 Overall discussion


It is important to point out that although these networks achieve very competitive results in many computer vision applications, their main drawbacks remain: our limited understanding of the exact nature of the learned representations, the reliance on large-scale training datasets, the lack of precise performance bounds, and the unclear choice of network hyperparameters.



03 Understanding the Building Blocks of CNNs


Given the large number of unanswered questions in the CNN field, this chapter will describe the role and significance of each processing layer in a typical convolutional network. To this end, this chapter will outline the most prominent efforts to address these issues. In particular, we will demonstrate how CNN components are modeled from both theoretical and biological perspectives. The introduction to each component is followed by a summary of our current level of understanding.


3.1 Convolutional layer


Convolutional layers are arguably among the most important steps in a CNN architecture. Basically, convolution is a linear, shift-invariant operation that consists of locally weighted combinations applied across the input signal. Depending on the chosen set of weights (i.e., the chosen point spread function), different properties of the input signal are revealed. In the frequency domain, the counterpart of the point spread function is the modulation function, which describes how the frequency components of the input are modified through scaling and phase shifting. Therefore, choosing the right kernel is crucial for capturing the most salient and important information contained in the input signal, allowing the model to make better inferences about the signal's content. This section discusses several different ways to realize this kernel-selection step.
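
As a concrete illustration, the sketch below applies a fixed Sobel-like kernel to a small input via PyTorch's functional convolution; choosing a different kernel (point spread function) would highlight different properties of the same signal:

```python
import torch
import torch.nn.functional as F

# Convolution as a linear, shift-invariant operation: the same local
# weighted combination (here a 3x3 edge-detecting kernel) is applied
# at every position of the input signal.
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)  # Sobel-like kernel

signal = torch.randn(1, 1, 8, 8)   # a small single-channel input
response = F.conv2d(signal, kernel, padding=1)
print(response.shape)              # torch.Size([1, 1, 8, 8])
```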


3.2 Rectification


Multilayer networks are usually highly nonlinear, and rectification is typically the first processing stage that introduces nonlinearity into the model. Rectification refers to applying a point-wise nonlinearity (also known as an activation function) to the output of a convolutional layer. The term is borrowed from signal processing, where rectification refers to converting alternating current into direct current. This processing step has both biological and theoretical justifications. Computational neuroscientists introduced the rectification step in the search for models that best explain current neuroscience data; machine learning researchers, on the other hand, use rectification to make models learn faster and better. Interestingly, researchers in both fields tend to agree on this point: not only is rectification necessary, but both fields have converged on the same form of rectification.



Figure 3.7: Nonlinear rectification functions used in the literature for multilayer networks
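
For reference, a short PyTorch sketch evaluating a few of the common rectification functions on the same sample points:

```python
import torch

# Common point-wise rectification functions applied to the output of a
# convolutional layer, evaluated on a few sample points.
x = torch.linspace(-2, 2, 5)                   # [-2, -1, 0, 1, 2]
print(torch.relu(x))                           # max(0, x)
print(torch.sigmoid(x))                        # 1 / (1 + exp(-x))
print(torch.tanh(x))                           # hyperbolic tangent
print(torch.nn.functional.leaky_relu(x, 0.1))  # small slope for x < 0
```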


3.3 Normalization


As mentioned earlier, multi-layer architectures are highly nonlinear due to the cascading nonlinear operations in these networks. In addition to the rectification nonlinearity discussed in the previous section, normalization is another nonlinear processing module that plays an important role in the CNN architecture. The most widely used form of normalization in CNNs is the so-called Divisive Normalization (DN, also known as Local Response Normalization). This section will introduce the role of normalization and describe the way it corrects the shortcomings of the first two processing modules (convolution and rectification). Again, we will discuss normalization from both biological and theoretical perspectives.
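
A minimal sketch using PyTorch's built-in Local Response Normalization layer; the hyperparameter values shown are the AlexNet-style defaults commonly used in practice, not values prescribed by the paper:

```python
import torch
import torch.nn as nn

# Local Response Normalization: each activation is divided by a term that
# aggregates the squared activations of neighbouring channels, i.e. the
# divisive normalization scheme described above.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

feature_maps = torch.randn(1, 16, 32, 32)
normalized = lrn(feature_maps)
print(normalized.shape)  # torch.Size([1, 16, 32, 32])
```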


3.4 Pooling


Whether biologically inspired, purely learning-based, or completely human-designed, almost all CNN models contain a pooling step. The goal of the pooling operation is to bring some degree of invariance to changes in position and size and to aggregate responses within and across feature maps. Similar to the three CNN modules discussed in the previous sections, pooling has support from both biological and theoretical studies. At this processing layer of the CNN network, the main point of contention is the choice of pooling function. The two most widely used pooling functions are average pooling and max pooling. This section will explore the advantages and disadvantages of various pooling functions described in the related literature.



Figure 3.10: Comparison of average pooling and max pooling on Gabor-filtered images. (a) The effect of average pooling at different scales: the upper row shows the result applied to the original grayscale image, and the lower row shows the result applied to the Gabor-filtered image. Average pooling produces a smoother version of the grayscale image, while the sparse Gabor-filtered image fades away. In contrast, (b) shows the effect of max pooling at different scales: the upper row shows the result applied to the original grayscale image, and the lower row shows the result applied to the Gabor-filtered image. Here, max pooling degrades the grayscale image, while the sparse edges in the Gabor-filtered image are enhanced. Figure from [131]
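
The difference is easy to verify on a tiny example with a few sparse responses, sketched here in PyTorch:

```python
import torch
import torch.nn as nn

# Average pooling smooths responses within each window, while max pooling
# keeps only the strongest (e.g. sparse edge) response.
x = torch.tensor([[[[1., 0., 0., 4.],
                    [0., 0., 0., 0.],
                    [0., 2., 0., 0.],
                    [0., 0., 0., 8.]]]])

print(nn.AvgPool2d(2)(x))  # [[0.25, 1.], [0.5, 2.]] -- smoothed
print(nn.MaxPool2d(2)(x))  # [[1., 4.], [2., 8.]]    -- dominant responses kept
```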



04 Current status


The discussion of the roles of the various components of the CNN architecture highlights the importance of the convolutional module, which is largely responsible for capturing the most abstract information in the network. Relatively speaking, it is also the processing module we understand least, since it requires the heaviest computation. This chapter presents current trends in attempting to understand what the different layers of a CNN learn, and highlights the issues that remain to be addressed in each of these trends.


4.1 Current trends


While the various CNN models continue to advance the state of the art in a variety of computer vision applications, progress in understanding how these systems work, and why they are so effective, has been limited. This question has attracted the interest of many researchers, and many methods for understanding CNNs have emerged. Broadly, these methods fall into three directions: visualization of the learned filters and the extracted feature maps; ablation studies, inspired by the biological approaches used to understand the visual cortex; and minimizing the learning process by introducing analytic principles into network design. This section provides a brief overview of each of these methods.


4.2 Problems Still to Be Resolved


Based on the above discussion, the following key research directions exist for visualization-based methods:


  • First and foremost, it is important to develop methods that make visual assessment more objective, by introducing metrics that evaluate the quality and/or meaning of the generated visualizations.

  • Also, although network-centric visualization methods seem more promising (since they do not rely on the data itself to generate the visualizations), it appears necessary to standardize their evaluation process. One possible solution is to use a benchmark that generates visualizations for networks trained under the same conditions. Such a standardized approach would in turn enable metric-based evaluation instead of the current interpretive analyses.

  • Another development is to visualize multiple units simultaneously to better understand the distributed aspects of the representation under study, even while following a controlled approach.


The following are potential research directions for ablation-study-based methods:


  • Use a common, systematically organized dataset that presents the challenges common in computer vision (such as viewpoint and lighting changes) together with categories of increasing complexity (e.g., at the level of textures, parts, and objects). In fact, such datasets have appeared recently [6]. Applying ablation studies on such a dataset, together with an analysis of the resulting confusion matrices, would make it possible to identify the patterns in which CNN architectures fail, leading to better understanding.

  • Furthermore, a systematic study of how ablating multiple units in combination affects model performance would be of great interest. Such research should extend our understanding of how individual units work together.


Finally, controlled methods are a promising direction for future research, since they allow us to gain a deeper understanding of the operations and representations of these systems than purely learning-based methods can. Interesting research directions include:


  • Fix network parameters step by step and analyze the effect on network behavior. For example, fix the convolution kernel parameters one layer at a time (based on current prior knowledge of the task) to analyze the suitability of the adopted kernels at each layer. This progressive approach promises to reveal the role of learning, and can also serve as an initialization method that minimizes training time.

  • Similarly, the design of the network architecture itself (such as the number of layers or the number of filters per layer) can be investigated by analyzing the properties of the input signal (such as the common content of the signal). This approach helps bring the architecture to a complexity appropriate to the application.

  • Finally, the use of controlled methods on network implementations allows for a systematic study of the role of other aspects of CNNs, which have received less attention due to the focus on learned parameters. For example, the role of various pooling strategies and residual connections can be investigated when most of the learned parameters are fixed.


Source: compiled by Heart of the Machine (ID: almosthuman2014)

Original paper: arXiv

