Comprehensive Design Guidelines for CNN Image Classification

Yesterday I saw this article on Hackernoon, recommended by Mr. Aikeke. It is a solid summary of practical experience. It matches my own experimental experience, but is more detailed than anything I had written down, so I have translated it for everyone. Original link:
https://hackernoon.com/a-comprehensive-design-guide-for-image-classification-cnns-46091260fb92

Translator's introduction

   This post is essentially a review. It covers some fundamental issues in CNN design. It is not aimed at complete beginners with no hands-on experience, but at junior practitioners who have run some experiments yet are unsure how to design a network. In the section on reducing storage consumption, only a few papers are cited without detailed explanation, so it is best to read the originals. After all, in deep learning, reading more papers and reimplementing them remains the best way to learn.


   Suppose you want to complete an image classification task but don't know where to start. Which pretrained network should you choose? How should you modify it to suit your needs? Should the network have 20 layers or 100? Which network is fastest, and which is most accurate? These questions all come up when trying to choose the best network model for an image classification task.

   When choosing a CNN for your own image classification task, there are three main metrics: accuracy, speed, and storage consumption. How a network performs on these metrics depends on which CNN you choose and what modifications you make. Different networks such as VGG, Inception, and ResNet make different trade-offs among them. You can also make modifications yourself, such as trimming some layers, adding more layers, using wider layers, or using different training techniques.

   This post is a design guide for building CNNs for your own task. We will look at the three main metrics of accuracy, speed, and storage consumption; examine how several CNN architectures compare on them; explore what modifications we can make to these base networks and how those changes affect the metrics; and finally see how to design a CNN that is optimal for a specific image classification task.

Network Type

[Figure: accuracy vs. inference-time benchmark for common CNN architectures]
   The figure clearly shows the trade-offs these networks make on the different metrics. As a first choice, prefer ResNet or Inception: both beat VGG and AlexNet on accuracy as well as speed. Justin Johnson of Stanford University provides a benchmark of these models.

   Between Inception and ResNet lies the real speed-versus-accuracy trade-off. Want accuracy? Use a very deep ResNet. Want speed? Use Inception.

Cut runtime memory consumption with clever convolutional design

   In recent years, several schemes have been developed to reduce the runtime memory consumption of CNNs without significant loss of accuracy. These techniques can be easily integrated into any CNN.
- MobileNets use depthwise separable convolutions to greatly reduce runtime computation and storage costs, sacrificing only 1% to 5% of accuracy; how much accuracy is lost depends on how much computation you want to save.
- XNOR-Net uses binary convolutions, i.e. convolutions whose weights take only two values (+1 and -1). With this design the network is highly compressible and therefore takes up little storage.
- ShuffleNet uses pointwise group convolutions and channel shuffle to greatly reduce computation while sacrificing very little accuracy, doing even better than MobileNets. In fact, it can reach the accuracy of earlier state-of-the-art networks at roughly 10x the speed.
- Network pruning removes parts of a trained CNN in order to reduce runtime and memory consumption without reducing accuracy. To maintain accuracy, the removed parts should have little effect on the final output. The linked article shows how easily this technique can be applied to ResNet.
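To see where the MobileNets savings come from, here is a back-of-the-envelope sketch (bias terms ignored; the 3x3, 256-channel layer is a made-up example): a standard k x k convolution costs k*k*C_in*C_out weights, while the depthwise separable version costs k*k*C_in for the depthwise step plus C_in*C_out for the 1x1 pointwise step.

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: every output channel mixes all input channels.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel,
    # followed by a 1x1 pointwise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

# Example: a 3x3 layer with 256 -> 256 channels.
standard = conv_params(3, 256, 256)                   # 589,824 weights
separable = depthwise_separable_params(3, 256, 256)   # 67,840 weights
print(f"ratio: {separable / standard:.3f}")           # roughly 1/c_out + 1/k^2
```

The ratio works out to about 0.115 here, i.e. an almost 9x reduction in weights (and, per position, in multiply-adds) for this layer.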

Network depth

   Adding more layers usually improves accuracy, at the cost of speed and storage. Note, however, that the more layers there are, the smaller the accuracy improvement each additional layer brings.

Translator's Note: The cost-effectiveness of increasing the number of layers decreases as the number of layers increases.

[Figures: accuracy and runtime as a function of network depth]

Activation function

   The choice of activation function is hotly debated, but a very good guideline is to start with ReLU. ReLU usually gives good results straight away; ELU, PReLU, or LeakyReLU can do slightly better, but they require tedious tuning. Once you have confirmed that a design works well with ReLU, consider the other activation functions and tune their parameters to squeeze out the last bit of accuracy.

Translator's note: in other words, swap the activation function only after everything else is settled; don't start with anything other than ReLU.
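As a minimal NumPy sketch of the candidates mentioned above (the alpha defaults are common conventions, not values from the article):

```python
import numpy as np

def relu(x):
    # Zero out negatives; the recommended starting point.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negatives are scaled by alpha, keeping a small gradient alive.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smoothly saturates toward -alpha for large negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```

The `alpha` parameters are exactly the knobs that make LeakyReLU and ELU "tedious to tune" relative to parameter-free ReLU.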

Convolution kernel size

   You might think that larger convolution kernels always give higher accuracy at the cost of speed and storage, but this is not the case. It has been observed repeatedly that large kernels make the network harder to train. A better option is to stack small kernels such as 3x3. Both ResNet and VGG explain and demonstrate this conclusion thoroughly. As in those two papers, we can also use 1x1 convolutions as bottleneck layers to reduce the number of feature maps.
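The VGG argument can be checked with simple arithmetic: stacked stride-1 3x3 convolutions reproduce the receptive field of one large kernel with fewer parameters (the 64-channel layer is an arbitrary example; biases ignored):

```python
def stacked_receptive_field(k, n):
    # n stacked stride-1 k x k convolutions see n*(k-1) + 1 input pixels.
    return n * (k - 1) + 1

def stacked_params(k, n, c):
    # n conv layers, each mapping c channels to c channels.
    return n * k * k * c * c

c = 64
# Three stacked 3x3 layers cover the same 7x7 receptive field...
assert stacked_receptive_field(3, 3) == stacked_receptive_field(7, 1) == 7
# ...with roughly 45% fewer parameters than a single 7x7 layer,
# plus two extra nonlinearities in between.
print(stacked_params(3, 3, c), "vs", stacked_params(7, 1, c))
```

The stacked version also inserts an activation after every 3x3 layer, which is part of why it tends to train better.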

Dilated Convolutions


   Dilated convolutions insert spaces between the kernel elements, so the convolution uses information farther from the center. This lets the receptive field of the network grow exponentially without increasing the number of parameters (storage consumption does not grow at all). The paper shows that dilated convolutions can improve network accuracy with only a small slowdown.
[Figure: dilated convolution kernels]
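The "exponential receptive field, constant parameters" claim can be sketched with the standard receptive-field arithmetic (stride-1 layers assumed; the dilation-doubling schedule follows the usual dilated-convolution construction):

```python
def effective_kernel(k, d):
    # A k x k kernel with dilation d covers d*(k-1) + 1 pixels per side,
    # using the same k*k weights regardless of d.
    return d * (k - 1) + 1

def receptive_field(kernels_and_dilations):
    # Stride-1 layers: each layer adds (effective_kernel - 1) to the field.
    rf = 1
    for k, d in kernels_and_dilations:
        rf += effective_kernel(k, d) - 1
    return rf

# Doubling the dilation at each 3x3 layer: the receptive field grows
# exponentially with depth while parameters grow only linearly.
for depth in range(1, 5):
    layers = [(3, 2 ** i) for i in range(depth)]
    print(depth, receptive_field(layers))  # 3, 7, 15, 31, ...
```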

Data augmentation

   Data augmentation is almost always worth doing. More data keeps improving performance, and adding data can help even when you already have a lot. With data augmentation we can generate more data cheaply. The type of augmentation depends on the application: we will never encounter upside-down trees, cars, or houses, so flipping images vertically makes no sense; but we do encounter lighting changes from weather and scene changes, so lighting variation and horizontal flips make sense. See this data augmentation library.
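A minimal, library-free sketch of the two augmentations recommended above (horizontal flip and lighting change), assuming images are float arrays in [0, 1] with shape (H, W, C):

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    # Mirror the width axis; vertical flips rarely make sense for
    # upright objects like trees, cars, and houses.
    return img[:, ::-1, :]

def jitter_brightness(img, low=0.8, high=1.2):
    # Simulate lighting changes by scaling pixel intensities.
    factor = rng.uniform(low, high)
    return np.clip(img * factor, 0.0, 1.0)

img = rng.random((32, 32, 3))  # a fake 32x32 RGB image in [0, 1]
augmented = jitter_brightness(horizontal_flip(img))
print(augmented.shape)         # same shape, new training sample
```

In practice a library handles this (and many more transforms) on the fly during training; the point here is only that each transform should produce images your application could actually encounter.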

Optimizer

  When training the network there are several optimization algorithms to choose from. Many people say SGD yields the best accuracy, and my experiments agree, but tuning it is challenging and tedious. On the other hand, adaptive-learning-rate methods such as Adam, Adagrad, or Adadelta are simple and fast to use, but may not reach the peak accuracy SGD can achieve.
  The best approach is the same strategy as for activation functions: start simple to check that the design works, then tune parameters and try more complex methods. I personally recommend starting with Adam. In my experience it is very easy to use: set a learning rate that is not too high, commonly the default 0.0001 (1e-4), and you will usually get good results. You can then either start over with SGD, or train with Adam first and fine-tune with SGD. In fact, this paper found that switching from Adam to SGD during training is the easiest way to achieve the best results.
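To make the Adam discussion concrete, here is a minimal NumPy implementation of the Adam update rule applied to a toy 1-D quadratic (the function and step count are illustrative, not from the article; the beta/eps defaults are the usual ones):

```python
import numpy as np

def adam_minimize(grad_fn, x0, lr=1e-1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    # Minimal Adam loop: adaptive per-parameter step sizes built from
    # bias-corrected running moments of the gradient.
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first moment (running mean of gradients)
    v = np.zeros_like(x)  # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the warm-up phase
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=[0.0])
print(x_min)  # close to the minimizer at 3
```

The per-parameter normalization by `sqrt(v_hat)` is what makes Adam forgiving about the learning rate; plain SGD applied to the same problem would need its step size chosen against the curvature by hand, which is exactly the tuning burden described above.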

Class balance

  Most of the time, especially in real applications, we deal with imbalanced data. A simple example: for safety reasons, you train a network to predict whether someone in a video is holding a deadly weapon, but your training data contains only 50 videos with a weapon and 1000 without. Trained on this data alone, the network will tend to predict that no one ever holds a weapon!

(Translator's note: this is because a network that simply always answers "no weapon" already achieves 95.2% accuracy on this data, while ever answering "weapon" risks a significant drop in accuracy.)

We have the following methods to compensate:
- Weighting: use different weights for the classes in the loss function. Classes with few samples get larger weights, so misclassifying them incurs a larger loss.
- Oversampling: repeatedly sample from classes with few samples. This works well when little data is available.
- Undersampling: discard some samples from classes with many samples. This works better when the amount of data is extremely large.
- Data augmentation: augment the classes with fewer samples.
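A minimal sketch of the weighting idea applied to the weapon-detection example above. The inverse-frequency formula N / (num_classes * count) is a common convention, not something the article specifies:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    # Weight each class by N / (num_classes * count): rare classes get
    # proportionally larger weights in the loss.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 1000 "no weapon" clips vs. 50 "weapon" clips, as in the example above.
labels = ["no_weapon"] * 1000 + ["weapon"] * 50
weights = inverse_frequency_weights(labels)
print(weights)  # the rare class is weighted 20x more heavily
```

These per-class weights are typically passed straight to the loss function (most deep-learning frameworks accept a class-weight argument), so each "weapon" mistake costs as much as twenty "no weapon" mistakes.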

Optimizing your transfer learning

  For most applications, transfer learning is more appropriate and faster than training from scratch. But which layers to keep and which to retrain is still a question, and the answer depends on your dataset. The more your data resembles the pretraining dataset (most networks are pretrained on ImageNet), the fewer layers need retraining, and vice versa. For example, suppose we train a network to detect grapes in images, using a set of images with and without grapes. Such images are very similar to those in ImageNet, so only the last few layers need retraining, perhaps only the final fully connected layer. Conversely, if we train a network to detect planets in outer-space images, the data is very different from ImageNet, and later convolutional layers also need retraining. In short, follow the diagram below:

[Figure: how much of the network to retrain, given dataset size and similarity to the pretraining data]

(Translator's note: the shallow layers of the network mostly do not need retraining, or need only light fine-tuning.)
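The decision described above can be sketched as a rule-of-thumb lookup over dataset size and similarity. The exact recommendations below are a common heuristic in the spirit of the article, not the article's own wording:

```python
def retrain_strategy(dataset_is_large, similar_to_pretraining):
    # Rule-of-thumb grid: similarity decides *which* layers to retrain,
    # dataset size decides *how many* you can afford to retrain safely.
    if similar_to_pretraining:
        return ("fine-tune the whole network" if dataset_is_large
                else "retrain only the final classifier layer(s)")
    return ("retrain most or all layers" if dataset_is_large
            else "retrain the later conv layers, keep early features frozen")

# The grape example: data similar to ImageNet, modest dataset.
print(retrain_strategy(dataset_is_large=False, similar_to_pretraining=True))
# The outer-space example: data very unlike ImageNet.
print(retrain_strategy(dataset_is_large=False, similar_to_pretraining=False))
```

In framework terms, "frozen" layers are those whose parameters are excluded from gradient updates, which also speeds up each training step.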

Conclusion

   That's it! This has been a comprehensive guide to designing CNNs. I hope you enjoyed the article and learned something useful.
