Dropout in Image Super-Resolution

Introducing Dropout into the SR task turns out to greatly improve the generalization performance of the model; it can even lift SRResNet above RRDB, even though RRDB has more than 10 times as many parameters!

Making the Dropout operation great again in single image super-resolution

Paper name: Reflash Dropout in Image Super-Resolution

Paper address:

https://arxiv.org/pdf/2112.12089.pdf

Image super-resolution (SR) is a classic low-level vision task that aims to recover high-resolution images from low-resolution inputs. Thanks to powerful convolutional neural networks (CNNs), SR networks can easily fit the training data and achieve impressive results. To further extend them to real-world images, researchers have designed Blind SR methods that can handle unknown downsampling kernels or degradations. However, research on SR training strategies is relatively scarce. When the network size increases significantly, overfitting becomes prominent, resulting in weak generalization ability.

Traditional SR tasks:

  • The Dropout operation was originally designed to alleviate the overfitting problem in High-level visual tasks, which seems to conflict with the nature of SR, a Low-level task.

  • Dropout randomly discards some neurons during training, generating many sub-networks so that each of them gets trained. But SR, as a classic regression problem, behaves differently from High-level tasks: it is very sensitive to dropout operations. If we randomly drop some features or pixels, the output quality may degrade severely.

  • Overfitting does not seem to be a very serious problem in traditional SR tasks.

These are the reasons why Dropout is not widely used in traditional SR tasks. But the situation has changed: overfitting has become a major problem in recent Blind SR tasks. Overfitting to a specific degradation causes the model to perform poorly in real-world scenarios, and simply increasing the data and network size cannot further improve generalization ability.

Therefore, the authors start by studying the effect of Dropout on the traditional SR task. The conclusion is that the performance of the SR model can be significantly improved by using the Dropout operation properly. Figure 1 below compares performance before and after using Dropout under different experimental settings: introducing Dropout can greatly improve the generalization performance of the model, and even lift SRResNet above RRDB, although RRDB has more than 10 times as many parameters as SRResNet! Best of all, adding Dropout only needs one line of code, which is truly painless!

Dropout operation

The key idea of the Dropout operation is to randomly drop some units (and their connections) from the neural network during training. In the training phase, Dropout therefore updates only a sub-network each time instead of the whole large network, and in the inference phase the results of all sub-networks are averaged.

During training, some elements of the input are randomly set to 0 with probability p according to a Bernoulli distribution, and the output is multiplied by 1/(1-p).
The dropout operation has been proven to be an effective regularization technique.
The input can have any dimensions, and the output has the same dimensions as the input.
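
As a reference, the training-time behavior described above can be written in a few lines. The following is only a minimal sketch of inverted dropout (the mechanism PyTorch uses), not the actual library implementation:

import torch

def dropout(x: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # each element is kept with probability 1 - p (Bernoulli mask)
    mask = (torch.rand_like(x) > p).float()
    # zero the dropped elements and rescale survivors by 1/(1-p),
    # so the expected value of the output equals the input
    return x * mask / (1 - p)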

Next, we will explain the three Dropout operations that PyTorch provides for us.

 torch.nn.Dropout(p=0.5, inplace=False)

Function: Randomly set some elements of the input to 0 with probability p, and multiply the output by 1/(1-p).
Input: (*) — any shape
Output: (*) — same shape as Input

import torch
import torch.nn as nn

m = nn.Dropout(p=0.5)  # elements are zeroed with probability p; survivors are scaled by 1/(1-p) = 2
input = torch.randn(3, 2)
output = m(input)
print(input)
print(output)

# Input:
tensor([[ 0.7843,  0.0706],
        [ 0.4554, -0.3986],
        [-0.5532,  0.2141]])

# Output: 
tensor([[ 1.5686,  0.0000],
        [ 0.9108, -0.7971],
        [-1.1065,  0.0000]])

 torch.nn.Dropout2d(p=0.5, inplace=False)

Function: Randomly set all elements of entire channels of the input to 0, then multiply by 1/(1-p).
Input: (N, C, H, W)
Output: (N, C, H, W)

m = nn.Dropout2d(p=0.5)  # zeros entire channels with probability p, then scales by 1/(1-p) = 2
input = torch.randn(1, 3, 2, 2)
output = m(input)
print(input)
print(output)

# Input:
tensor([[[[ 0.9778, -1.0291],
          [ 1.9370,  0.6675]],

         [[ 0.3541, -1.5406],
          [ 0.8875, -0.2548]],

         [[ 0.9533,  0.1804],
          [-2.1946, -1.9770]]]])

# Output: 
tensor([[[[0., -0.],
          [0., 0.]],

         [[0., -0.],
          [0., -0.]],

         [[0., 0.],
          [-0., -0.]]]])

 torch.nn.Dropout3d(p=0.5, inplace=False)

Function: Randomly set all elements of entire channels of the input to 0, then multiply by 1/(1-p).
Input: (N, C, D, H, W)
Output: (N, C, D, H, W)

m = nn.Dropout3d(p=0.5)
input = torch.randn(1, 3, 2, 2, 2)
output = m(input)
print(input)
print(output)

# Input:
tensor([[[[[-0.5544, -0.9302],
           [ 0.2269,  2.7334]],

          [[-1.3619,  0.5699],
           [ 0.0862,  0.9609]]],


         [[[-1.8406, -2.6052],
           [-0.0212,  0.1684]],

          [[ 0.7024, -0.2568],
           [ 0.3187, -0.7208]]],


         [[[ 1.0922,  0.5909],
           [-0.7926,  1.9536]],

          [[ 1.0438, -0.3441],
           [-0.5067, -0.0417]]]]])

# Output:
tensor([[[[[-1.1088, -1.8604],
           [ 0.4539,  5.4667]],

          [[-2.7237,  1.1397],
           [ 0.1724,  1.9217]]],


         [[[-3.6813, -5.2105],
           [-0.0424,  0.3367]],

          [[ 1.4049, -0.5136],
           [ 0.6374, -1.4417]]],


         [[[ 0.0000,  0.0000],
           [-0.0000,  0.0000]],

          [[ 0.0000, -0.0000],
           [-0.0000, -0.0000]]]]])

Interesting Super-Resolution Experimental Observations

Observation 1: Dropout that is bad for performance

The experiments in this subsection are performed under the regular SR setting, where the only degradation is bicubic downsampling. The Dropout strategy used is channel-wise Dropout (i.e., entire feature channels are randomly set to 0). As shown in Figure 2 below, performance drops sharply after the Dropout operation is added (Figure 2a). This result matches our intuition and shows that regression models differ from classification models: in regression, every element in the network contributes to the final output, which is a continuous RGB value rather than a discrete class label.

Observation 2: Dropout does not affect performance

The authors also found a special case (Figure 2b) that contradicts the conclusion above: when the Dropout operation is added only before the last convolutional layer, it has no effect on the performance of the model. This means the features of the last layer can be randomly discarded without affecting the regression result. What happened to these features? Does this mean that regression and classification networks have something in common?

Observation 3: Dropout is good for performance

The authors also found a situation contrary to the above conclusion (Fig. 2c, d): under multiple degradations, the Dropout operation is beneficial to super-resolution. This means that Dropout can improve the generalization performance of the super-resolution model to a certain extent.

The setting of this experiment is: the training data contains multiple degradations, i.e., Real-SRResNet. The authors added a Dropout operation before the last convolutional layer. The test data includes bicubic downsampling (included in the training data) and nearest-neighbor downsampling (not included in the training data). Figure 2(c)(d) shows that Dropout improves both the performance and the generalization ability of the model.

Using Dropout in super-resolution tasks

The experimental models in this section are SRResNet and RRDB. The conclusions can easily generalize to other CNN-based SR networks, since they share similar architectures. As a simple and flexible operation, Dropout can be applied in many ways. Generally speaking, the effect of Dropout depends mainly on two aspects: the location of the Dropout, and the Dropout strategy (the dimension used and the dropout probability p).

The location of the dropout

Figure 3 below is a schematic diagram of the different usage locations of the Dropout operation, which can be divided into three categories:

  • Dropout before the last-conv

  • Dropout at middle of network

  • Dropout in residual network

Dropout before the last-conv:  As shown in Figure 3(a), Hinton et al. first introduced Dropout in High-level tasks and used it before the final classifier. Similarly, the authors apply Dropout before the output convolutional layer (from 64 channels to 3 channels, called last-conv).
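
To make the "one line of code" concrete, here is a rough sketch of how channel-wise Dropout could be placed before the output convolution of an SRResNet-style network. The SimpleSRNet class, its layer names and the dropout rate are illustrative assumptions, not the authors' actual code:

import torch
import torch.nn as nn

class SimpleSRNet(nn.Module):
    # toy SRResNet-style network, used only to illustrate the dropout placement
    def __init__(self, n_feats=64, scale=4, p=0.1):
        super().__init__()
        self.head = nn.Conv2d(3, n_feats, 3, padding=1)
        self.body = nn.Sequential(*[nn.Conv2d(n_feats, n_feats, 3, padding=1) for _ in range(4)])
        self.upsample = nn.Sequential(
            nn.Conv2d(n_feats, n_feats * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.dropout = nn.Dropout2d(p)                        # the "one line": channel-wise dropout
        self.last_conv = nn.Conv2d(n_feats, 3, 3, padding=1)  # 64 -> 3 channels (last-conv)

    def forward(self, x):
        feat = self.upsample(self.body(self.head(x)))
        feat = self.dropout(feat)                             # active in training, identity in eval mode
        return self.last_conv(feat)

model = SimpleSRNet()
sr = model(torch.randn(1, 3, 32, 32))                         # -> (1, 3, 128, 128)

A real SRResNet also contains activations, batch normalization and residual connections, which are omitted here for brevity.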

Dropout at middle of network:  As shown in Figure 3(b), without loss of generality, the authors divide the 16 residual blocks of SRResNet into four groups of four blocks each. They choose B4, B8, B12 and B16 as representative positions, where the number indicates after which block the Dropout is added.

Dropout in residual network:  As shown in Figure 3(c), multiple Dropout operations are added inside a block. According to previous experiments, using this dropout block deep in the network can produce good results. The authors design three ways to use the Dropout block in the SR network, named all-part, half-part and quarter-part respectively.
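
For this third placement, the sketch below shows one plausible form of a residual block with channel-wise Dropout inserted after each convolution; the exact layout in the paper's Figure 3(c) may differ, so treat the structure and names as assumptions:

import torch.nn as nn

class DropoutResBlock(nn.Module):
    # residual block with channel-wise dropout after each conv (illustrative sketch)
    def __init__(self, n_feats=64, p=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.conv2 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.drop = nn.Dropout2d(p)

    def forward(self, x):
        res = self.drop(self.act(self.conv1(x)))
        res = self.drop(self.conv2(res))
        return x + res                                 # identity skip connection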

Dropout strategy (dimensions used and dropout probability p)

Dropout operations were initially used in fully connected layers, so there was no need to decide which dimension to drop. However, when used in convolutional layers, dropout behaves differently along different dimensions (element-wise dropout vs. channel-wise dropout).

The dropout probability p determines the ratio of dropped channels or elements. In general, too large a dropout probability p makes performance worse. In classification networks, a 50% probability does not affect the final result but improves generalization. However, this probability may be too large for SR networks, because they are far less robust than classification networks. To obtain possible performance gains without harming the model, the authors tried 3 probabilities: 10%, 20% and 30%.

To sum up: there are 8 different dropout locations, 2 dropout dimensions (channel-wise and element-wise), and 3 dropout probabilities (10%, 20%, 30%). The authors ran experiments to observe the results of each setting separately.

Single degradation experiment results

Experimental settings:  There are two commonly used settings for SR tasks, the single-degradation setting and the multi-degradation setting. Multiple degradations better simulate real-world degradation processes, and the performance of SR networks then depends mainly on their generalization ability. For the multi-degradation setting, the authors follow the high-order degradation modeling introduced in Real-ESRGAN.

Super-detailed interpretation of the underlying tasks (3): Only use pure synthetic data to train the real-world blind super-resolution model Real-ESRGAN

In this part of the experiment, the authors explore the use of Dropout under the bicubic degradation configuration. Figure 4 below compares experimental results across the location, probability and form of Dropout, from which we can see:

  1. Figure 4(d): different Dropout positions lead to different model performance. When Dropout is used only once, the closer it is to the output, the smaller the performance drop.

  2. Figure 4(a): when multiple Dropouts are used, increasing their number decreases performance.

  3. Figure 4(a)(b): element-wise Dropout degrades performance, while channel-wise Dropout usually performs better.

  4. Figure 4(b): the larger the dropout probability p, the worse the SR model performs, but using 10% dropout at last-conv brings a slight performance improvement. This shows that the combination of last-conv + channel-wise dropout leads to meaningful and robust results.

 

Multiple degradation experiment results

Multiple degradation effects better simulate real-world degradation processes. In the experiments in this section, the author followed the high-order degradation modeling introduced in Real-ESRGAN.

In the multi-degradation training setting, the SR network needs to learn to recover from multiple different degradations simultaneously. Asking the network to solve all degradations directly makes it perform worse on any single degradation. However, the authors found that in the multi-degradation setting, performance can be significantly improved by introducing Dropout. As shown in Figure 5 below, the authors tested performance on common degradations and on combinations of complex degradations.

Each row in Figure 5 lists, for 8 different degradations, the results with and without Dropout and the improvement brought by Dropout. The degradations are generated with bicubic downsampling, blur, noise and JPEG compression: Gaussian blur ('b'): kernel size = 21, standard deviation 2; Gaussian noise ('n'): standard deviation 20; JPEG compression ('j'): quality = 50.

Red font indicates that Real-SRResNet (w/ Dropout) outperforms Real-RRDB (w/o Dropout), and nearly half of the entries are red, even though Real-RRDB has more than ten times as many parameters as Real-SRResNet! Moreover, adding Dropout is just one line of code: "One line of code is worth a ten-fold increase in the model parameters" (from the original paper).

Figure 6 below compares the visual results of the model before and after using Dropout. The model with Dropout shows better content reconstruction, artifact removal and noise reduction.

Comparative experiment of dropout probability p

In the single degradation experiment, the optimal p is 10%. But in the multiple degradation experiments, the optimal p is larger. Figure 7 below shows the results of the Real-SRResNet ×4 task under different Dropout rates, where Set1 is Manga109 (noise) and Set2 is Urban100 (noise).

Explaining the Dropout experimental phenomena through the Channel Saliency Map

In the following two sections, the authors explore how Dropout improves the generalization ability of the SR network: the experimental phenomena are explained from the perspective of attribution maps via the Channel Saliency Map, and from the perspective of generalization performance via the Deep Degradation Representation.

The attribution map is a saliency map: brighter pixels in the attribution map indicate a greater impact on the SR result. As shown in Figure 8 below, masking different features yields different PSNR values; lower PSNR values correspond to brighter saliency maps, and brighter saliency maps mean a greater impact on the super-resolution result. Clearly, different features influence the final result to different degrees: when certain features are masked out, the PSNR drops severely, and the attribution maps of those features are also brighter.

Figure 10 shows that the performance of Real-SRResNet without Dropout drops sharply as more channels are set to 0, while the performance of Real-SRResNet with Dropout remains unchanged. This shows that for a model with Dropout, the PSNR no longer depends on a few specific channels; even 1/3 of the network's channels is enough to maintain the performance. Dropout thus helps the model prevent co-adaptation and brings better performance.
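
The channel-ablation test described above could be sketched as follows: zero out a growing number of channels in the feature map entering the last convolution and measure the resulting PSNR. The forward pre-hook, the last_conv attribute and the psnr helper below are assumptions about how such a test might be wired up, not the authors' evaluation code:

import torch
import torch.nn.functional as F

def psnr(sr, hr, max_val=1.0):
    # peak signal-to-noise ratio in dB
    mse = F.mse_loss(sr, hr)
    return 10 * torch.log10(max_val ** 2 / mse)

@torch.no_grad()
def psnr_with_masked_channels(model, lr, hr, n_masked):
    # zero the first n_masked channels of the features entering model.last_conv
    def mask_hook(module, inputs):
        feat = inputs[0].clone()
        feat[:, :n_masked] = 0
        return (feat,)

    handle = model.last_conv.register_forward_pre_hook(mask_hook)
    sr = model(lr)
    handle.remove()
    return psnr(sr.clamp(0, 1), hr)

Sweeping n_masked from 0 up to most of the 64 channels for the models with and without Dropout would reproduce the kind of curves shown in Figure 10.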

Explaining the Dropout experimental phenomena through the Deep Degradation Representation

Deep Degradation Representation represents "semantic information of super-resolution model", see below for interpretation.

Super-detailed interpretation of the underlying tasks (7): Exploring what the "semantic" information in the super-resolution model represents

Panels (a), (b) and (c) in Fig. 11 each show 500 points obtained from 128×128 input samples under 5 different degradations (100 points per degradation). The Deep Degradation Representation shows that a super-resolution model can cluster input images of different degradation types into different categories according to their "semantic" information. For example, in Fig. 11(a), points of different colors represent input images with different degradations.

That is, input images with similar degradations are clustered together. If the boundaries between clusters are clear, the network tends to deal only with specific degradation clusters and ignore the others, resulting in poor generalization. Conversely, if the clusters are mixed together, the network is able to handle all inputs well.

Comparing Fig. 11(a) and (b), the clusters of the original SRResNet are more clearly separated than those of Real-SRResNet, which indicates that the network whose training data contains various degradations has stronger generalization.

Comparing Fig. 11(b) and (c), after adding Dropout the clusters of Real-SRResNet become more mixed than without Dropout, which shows that the generalization performance of the model is stronger.

The Calinski-Harabasz Index (CHI) is used to measure the degree of semantic discrimination: the better the separation between different feature clusters and the tighter each cluster, the higher the CHI score. A higher CHI therefore indicates stronger semantic discrimination.
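
CHI is a standard clustering metric, available in scikit-learn as calinski_harabasz_score. A minimal sketch of how it could be computed on deep features is given below; the random features are placeholders for the (N, D) feature vectors extracted from the SR network, and labels marks the degradation type of each input:

import numpy as np
from sklearn.metrics import calinski_harabasz_score

feats = np.random.randn(500, 64)       # placeholder: 5 degradations x 100 images, 64-D features
labels = np.repeat(np.arange(5), 100)  # degradation type label of each input image

chi = calinski_harabasz_score(feats, labels)
print(f"CHI = {chi:.2f}")              # lower CHI -> more mixed clusters -> stronger generalization (per the paper)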

We can also find that as the Dropout rate p increases, the CHI gradually decreases, meaning the clustering becomes weaker and the generalization performance of the model becomes stronger.

Another interesting finding is that the cluster of noisy samples (green dots in Fig. 11) is always the most distinct. This is also why the performance on noisy data is much worse than on clean data.

Summary

The Dropout operation was originally designed to alleviate overfitting in High-level visual tasks, which seems to conflict with the nature of SR as a Low-level task. The authors start by studying the role of Dropout in the traditional SR task and conclude that the performance of the SR model can be significantly improved by using Dropout properly. Introducing Dropout can greatly improve the generalization performance of the model, and even lift SRResNet above RRDB, although RRDB has more than 10 times as many parameters! Importantly, adding Dropout is just one line of code. For the single degradation task, the optimal usage is last-conv + channel-wise Dropout with a rate of 10%. For multiple degradation tasks, Dropout still uses the last-conv + channel-wise placement, but the optimal rate is larger. Finally, the authors explain the experimental phenomena of Dropout from the perspective of attribution maps (Channel Saliency Map) and from the perspective of generalization performance (Deep Degradation Representation).
