Interpretation of Perceptual Loss & the paper "Perceptual Losses for Real-Time Style Transfer and Super-Resolution"

The traditional L1 and L2 losses are computed at the pixel level, and the L2 loss in particular does not match image quality as perceived by the human eye. As a result, images recovered with L1 or L2 loss alone often show poor detail in tasks such as super-resolution.

In current research, the L2 loss is gradually being replaced by a perceptual loss (also called human-eye perception loss). The difference from MSE (L2 loss), which measures differences directly on image pixels, is that the distance is no longer computed in image space.

Researchers often use a pretrained network such as VGG as the loss network. Let φ denote the loss network, let φj(x) be the activations of its j-th layer, and let Cj×Hj×Wj be the shape of the j-th feature map. The perceptual (feature reconstruction) loss is then defined as

ℓ_feat^{φ,j}(ŷ, y) = ‖φj(ŷ) − φj(y)‖₂² / (Cj·Hj·Wj)

It has the same form as the L2 loss, except that the distance is computed in feature space instead of pixel space.
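A minimal sketch of this formula in code (PyTorch), assuming phi_j_pred and phi_j_gt are the layer-j feature maps of the prediction and the target; the function name is hypothetical:

```python
import torch

def feature_reconstruction_loss(phi_j_pred, phi_j_gt):
    """Squared, normalized Euclidean distance between two feature maps.

    Both tensors are assumed to have shape (N, C_j, H_j, W_j);
    the result is a per-sample loss of shape (N,).
    """
    c, h, w = phi_j_pred.shape[1:]
    # sum of squared differences per sample, normalized by C_j * H_j * W_j
    return ((phi_j_pred - phi_j_gt) ** 2).sum(dim=(1, 2, 3)) / (c * h * w)
```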

 

This paper is the first to propose the perceptual loss; it runs experiments and comparisons on image style transfer and single-image super-resolution, and demonstrates the effectiveness of the perceptual loss.

The following explains single-image super-resolution.

The perceptual loss uses a fixed network (VGG16, VGG19, ...): the prediction and the ground truth are each fed through, e.g., VGG16, and the corresponding output features are obtained, pre-vgg and gt-vgg.

An L2 loss is then constructed between pre-vgg and gt-vgg, which keeps the deep (perceptual) information of the prediction close to that of the ground truth.

Compared with an ordinary L2 loss, the perceptual loss better enhances the fine detail in the output.

Steps for building the perceptual loss:

1. The pre-trained VGG network is used only for forward passes (its weights are never updated).

2. Feed the prediction and the ground truth forward through VGG separately, obtaining pre-vgg and gt-vgg.

3. Compute the L2 loss between pre-vgg and gt-vgg. Note: the perceptual loss is usually a regularization term and needs to be combined with other loss functions; its contribution can be tuned with a custom weighting coefficient. A sketch of these three steps is shown below.
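A minimal PyTorch sketch of these three steps, assuming a torchvision VGG16; the layer depth and the weight lambda_p in the usage comment are hypothetical choices, not values from the paper:

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Step 1: a frozen, pretrained VGG16 used only for forward passes."""
    def __init__(self, layer_index=7):  # hypothetical layer depth
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:layer_index + 1]
        for p in vgg.parameters():
            p.requires_grad = False       # VGG is never trained
        self.vgg = vgg.eval()

    def forward(self, pred, gt):
        # Step 2: forward the prediction and the ground truth through VGG
        pre_vgg = self.vgg(pred)
        gt_vgg = self.vgg(gt)
        # Step 3: L2 loss between the two feature maps
        return nn.functional.mse_loss(pre_vgg, gt_vgg)

# Usage: combine with the main (pixel) loss; lambda_p is a hypothetical weight
# controlling how strongly the perceptual term contributes.
# perc = PerceptualLoss()
# loss = nn.functional.l1_loss(pred, gt) + lambda_p * perc(pred, gt)
```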

When extracting features, we usually do not use just one layer, but combine shallow, middle, and deep features from the network. For example, with VGG16 one might take the outputs of layers 3, 5, and 7 and accumulate them into the loss; a sketch of this multi-layer accumulation follows.
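A sketch of that accumulation, assuming the torchvision VGG16; the indices (3, 5, 7) follow the example above and simply index into vgg16.features, which is an assumption rather than a prescription from the paper:

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiLayerPerceptualLoss(nn.Module):
    """Accumulates feature losses from several depths of a frozen VGG16."""
    def __init__(self, layer_indices=(3, 5, 7)):
        super().__init__()
        self.layer_indices = set(layer_indices)
        self.features = models.vgg16(pretrained=True).features[:max(layer_indices) + 1]
        for p in self.features.parameters():
            p.requires_grad = False
        self.features.eval()

    def extract(self, x):
        # collect the intermediate outputs at the chosen indices
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_indices:
                feats.append(x)
        return feats

    def forward(self, pred, gt):
        loss = 0.0
        for f_pred, f_gt in zip(self.extract(pred), self.extract(gt)):
            loss = loss + nn.functional.mse_loss(f_pred, f_gt)  # accumulate per layer
        return loss
```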

In the original paper, the problem of single-image super-resolution is addressed. The authors do not encourage the pixels of the output image ŷ = fW(x) to exactly match the pixels of the target image y, but rather encourage them to have similar feature representations as computed by the loss network φ.

Let φj(x) be the activations of the j-th layer of the network φ when processing image x; if j is a convolutional layer, then φj(x) is a feature map of shape Cj×Hj×Wj. The feature reconstruction loss is the (squared, normalized) Euclidean distance between feature representations:

ℓ_feat^{φ,j}(ŷ, y) = ‖φj(ŷ) − φj(y)‖₂² / (Cj·Hj·Wj)

As shown in the figure below, finding an image ŷ that minimizes the feature reconstruction loss of the early layers often produces an image that is visually indistinguishable from y. When reconstructing from higher layers, image content and overall spatial structure are preserved, but color, texture, and exact shape are not. Trained with a feature reconstruction loss, the image transformation network encourages the output image ŷ to be perceptually similar to the target image y without forcing them to match exactly.

 

In a nutshell: for single-image super-resolution, choose a shallow feature map (the paper uses relu2_2) for the loss calculation; for the style transfer task, choose deep features, or combine deep and shallow feature maps, to compute the loss.

The figure below shows the experimental results:

Code:

2023.2.16 

While using VGG for the perceptual loss in my model, I ran into a question. Shallow and deep layers are generally chosen as the loss-calculation layers; for example, here I take the 3rd convolutional layer and the 13th convolutional layer, but the 3rd convolutional layer has 64 channels and the 13th has 256 channels. How is the loss computed over so many channels? Is it averaged? Later I looked at other people's code, and that is indeed how it is handled. Here is the overall code; the key line is np.mean(), i.e. the result is averaged over the 64 channels.
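A minimal sketch of that averaging, assuming PyTorch feature tensors of shape (N, C, H, W); taking the mean of the squared difference averages over all channels and spatial positions at once, which is what the np.mean() call amounts to (torch.mean is used here so the loss stays differentiable):

```python
import torch

def mean_feature_l2(feat_pred, feat_gt):
    # feat_pred, feat_gt: feature maps of shape (N, C, H, W), e.g. C = 64 or 256
    # the mean is taken over the batch, all channels, and all spatial positions
    return torch.mean((feat_pred - feat_gt) ** 2)
```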

There is another point that needs special attention. Many shared codes define a ReLU layer as the loss-calculation layer. In my own tests, the feature visualization becomes black after ReLU, even completely black at the start of training; when the loss is computed between a completely black feature map and a half-black GT feature map, the loss can collapse to something like 0.0000000007. Therefore, when selecting layers, be sure to select convolutional layers instead of ReLU layers. Currently, with the convolutional layers at indices [0, 14, 24] selected, the feature visualizations and the loss values are normal. A quick check of which indices are convolutional layers is shown below.
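As a quick check (a minimal sketch, assuming the standard torchvision VGG16), printing the modules at those indices confirms whether each one is a Conv2d or a ReLU layer:

```python
from torchvision import models

features = models.vgg16(pretrained=True).features
for i in (0, 14, 24):
    # print the module at each chosen index to confirm it is a Conv2d, not a ReLU
    print(i, features[i])
```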

An update on the perceptual-loss results: after training is close to convergence, I printed the feature maps of the prediction and the GT. They are very similar, and the loss has dropped to about 0.000001. The feature maps are shown below (prediction on the left, GT on the right):

Testing showed no obvious increase in PSNR, but in theory the perceptual loss is not expected to raise PSNR significantly; its main role is to enhance image detail. Although this test did not show a clear detail enhancement, the likely reason is that the model has not fully converged; the results will be updated later.

Update:

The earlier claim that convolutional layers should be selected for the feature maps is my own conclusion from analysis; the experiments show it is feasible, but the original paper uses the relu2_2 layer. The original paper also uses an L2 loss as the perceptual loss, while other studies have found that L1 loss + L2 loss works better. As for which loss and which layers to choose, it is best to try for yourself. Below are some results from the first experiment.

The three pictures above, from top to bottom, are the GT, the result without the perceptual loss, and the result with the perceptual loss (accumulating convolutional layers 3, 14, and 27).

There is no obvious change in PSNR. As for details, looking carefully you can see the detail-enhancement effect of the perceptual loss (noise is reduced, and the details are closer to the GT).

2.28: continuing to update the perceptual loss. This time L1 is used to compute the perceptual loss, and different weighting ratios are applied to strengthen or suppress the perceptual loss relative to the original loss. The improvement in detail is significant. The effect is as follows:

 

The left image is the inference result without the perceptual loss, and the right image is the result with the perceptual loss added; the detail improvement is very obvious.
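For reference, a sketch of the weighted combination described above, with L1 used for the perceptual term; the coefficient lambda_p, its default value, and the function name are hypothetical:

```python
import torch.nn.functional as F

def total_loss(pred, gt, feat_pred, feat_gt, lambda_p=0.1):
    # lambda_p > 1 strengthens the perceptual term, lambda_p < 1 suppresses it
    pixel_loss = F.l1_loss(pred, gt)                 # the original loss
    perceptual_loss = F.l1_loss(feat_pred, feat_gt)  # L1 on VGG feature maps
    return pixel_loss + lambda_p * perceptual_loss
```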

 

 

Origin blog.csdn.net/qq_40962125/article/details/128630162