Computer Vision - Day 90: Salient Object Detection with Cascaded Convolutional Neural Networks and Adversarial Learning

I. INTRODUCTION

Salient object detection has received extensive attention and achieved great success in the past few years. However, it remains challenging to obtain clear boundaries and consistent saliency, which together can be regarded as the structural information of salient objects. A popular solution is to apply post-processing (e.g., a conditional random field (CRF)) to refine this structural information.

In this work, we propose a novel salient object detection method (CCAL) based on cascaded convolutional neural networks and adversarial learning.

In summary, the main contributions of this paper are as follows:

1) A novel salient object detection network framework is designed, which consists of two convolutional neural networks combined in a cascaded manner, focusing on global saliency estimation and local saliency refinement, respectively. Through this step-by-step refinement, the detection results are gradually improved.

2) The conditional GAN (CGAN) algorithm is adopted for salient object detection, and the performance is improved by introducing an adversarial loss to implicitly learn structural information (i.e., clear boundaries and consistent saliency).

3) We evaluate the proposed method on 8 benchmark datasets. Comprehensive experimental results show that the proposed method can generate high-quality saliency maps with clear boundaries and consistent saliency, significantly outperforming existing methods.

II. NETWORK ARCHITECTURE

[Figure 1. Overall architecture of the proposed model: generator G and discriminator D.]

The proposed salient object detection model contains two components, the generator G and the discriminator D, as shown in Figure 1.

A. Generator G based on Cascaded Convolutional Neural Networks

Global saliency estimator E

Salient object detection can be regarded as a pixel-labeling problem: assigning a large value (such as 1) to salient objects and a small value (such as 0) to non-salient regions. In this paper, drawing on the successful experience of encoder-decoder networks, we construct an encoder-decoder network (the global saliency estimator E) for initial saliency map estimation, which consists of two parts: an encoder and a decoder.

Specifically, we use 4 × 4 convolution kernels with a stride of 2 to replace the combination of 3 × 3 convolutions with stride 1 and 2 × 2 pooling with stride 2, a classic setup in VGGNet [30].

Here, our encoder has n1 = 8 convolutional layers, with 64, 128, 256, 512, 512, 512, 512, and 512 kernels, respectively.
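As an illustration, here is a minimal PyTorch sketch of such an encoder. The 4 × 4 kernels, stride 2, and channel widths come from the text above; the padding and the BatchNorm/LeakyReLU choices are assumptions in the style of common encoder-decoder GANs, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the global saliency estimator's encoder: n1 = 8
    stride-2 convolutions (4x4 kernels) replace VGG-style conv+pool.
    Normalization and activation choices here are assumptions."""
    def __init__(self, in_ch=3):
        super().__init__()
        widths = [64, 128, 256, 512, 512, 512, 512, 512]
        stages, prev = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.LeakyReLU(0.2, inplace=True),
            ))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []                    # stored for the decoder's skip connections
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return x, feats
```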

The decoder performs the reverse process of the encoder, expanding the size of the feature maps. The feature maps are upsampled using a deconvolution operation with a kernel size of 4 × 4 and a stride of 2. In addition, we use skip connections to combine high-level features from the decoder with low-level features from the encoder to facilitate feature learning.

The last layer is a tanh activation function.
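A matching decoder sketch follows. The 4 × 4 deconvolution (transposed convolution) kernels with stride 2, the skip connections, and the final tanh come from the text; the decoder channel widths and the use of concatenation for the skip connections are assumptions. Imports and the Encoder class are reused from the previous sketch.

```python
class Decoder(nn.Module):
    """Sketch of the decoder: each stage doubles the spatial size with
    a 4x4 stride-2 transposed convolution and merges the matching
    encoder feature map via a skip connection; the last layer applies
    tanh, as stated in the text."""
    def __init__(self, out_ch=1):
        super().__init__()
        enc = [64, 128, 256, 512, 512, 512, 512, 512]  # encoder widths
        dec = [512, 512, 512, 512, 256, 128, 64]       # assumed decoder widths
        ups, prev = [], enc[-1]
        for skip, w in zip(reversed(enc[:-1]), dec):
            ups.append(nn.Sequential(
                nn.ConvTranspose2d(prev, w, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            prev = w + skip          # channels after concatenating the skip feature
        self.ups = nn.ModuleList(ups)
        self.last = nn.ConvTranspose2d(prev, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x, feats):
        skips = list(reversed(feats[:-1]))    # deepest encoder feature first
        for up, skip in zip(self.ups, skips):
            x = torch.cat([up(x), skip], dim=1)
        return torch.tanh(self.last(x))
```

With a 256 × 256 input, the eight stride-2 encoder stages reduce the feature map to 1 × 1, and the eight decoder stages restore the original resolution.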

As shown in Figure 1, given an input image, the output of E is a probability map of the same size as the input image, regarded as the initial saliency map, in which salient objects are highlighted and the background is suppressed.

Local saliency refiner R

The initial saliency map inevitably contains some poor local estimates, and it is necessary to use the information it provides to correct them. Therefore, we design a deep residual network (called the local saliency refiner R) for local saliency refinement, whose input is the combination of the RGB image and the initial saliency map generated by the global saliency estimator E, and whose output is the refined saliency map used as the final result for performance evaluation.
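The paper specifies R only as a deep residual network whose input combines the RGB image with the initial saliency map; the block count, widths, and channel-wise concatenation in this sketch are assumptions.

```python
class ResidualBlock(nn.Module):
    """Plain residual block (design assumed; the paper only states
    that R is a deep residual network)."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class Refiner(nn.Module):
    """Local saliency refiner R: consumes the 4-channel combination of
    the RGB image and the initial saliency map, outputs a refined map."""
    def __init__(self, n_blocks=8, ch=64):
        super().__init__()
        self.head = nn.Conv2d(3 + 1, ch, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 1, kernel_size=3, padding=1)

    def forward(self, img, init_map):
        x = torch.relu(self.head(torch.cat([img, init_map], dim=1)))
        return torch.tanh(self.tail(self.blocks(x)))   # tanh assumed, mirroring E
```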

B. Discriminator D

As mentioned above, given an input image I, the generation process of its final saliency map X can be expressed as X = G(I) = R(I,E(I)).
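In code, this composition is direct. The sketch below reuses the hypothetical Encoder, Decoder, and Refiner classes defined above:

```python
class Generator(nn.Module):
    """G(I) = R(I, E(I)): E produces an initial saliency map,
    which R refines together with the original image."""
    def __init__(self):
        super().__init__()
        self.encoder, self.decoder = Encoder(), Decoder()
        self.refiner = Refiner()

    def forward(self, img):
        bottleneck, feats = self.encoder(img)
        init_map = self.decoder(bottleneck, feats)  # E(I)
        return self.refiner(img, init_map)          # X = R(I, E(I))
```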

The discriminator in a generative adversarial network (GAN) can be viewed as an attempt to learn a structured loss function.

Therefore, in order for the generator G to learn the structural information of saliency well, we design a discriminator D, whose role is to distinguish the fake saliency maps generated by the generator G from the real saliency maps (ground truth). We adopt CGAN, a conditional version of GAN.
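Below is a minimal sketch of a conditional discriminator and the adversarial objective. Following the usual CGAN convention for image-to-image tasks, D here observes the input image together with a real or generated saliency map; the layer layout, the auxiliary pixel-wise L1 term, and the weight lam are assumptions in the style of pix2pix, not values from the paper.

```python
class Discriminator(nn.Module):
    """Conditional discriminator sketch: the input is the image
    concatenated with a saliency map (ground truth or generated)."""
    def __init__(self):
        super().__init__()
        widths, layers, prev = [64, 128, 256, 512], [], 3 + 1
        for w in widths:
            layers += [nn.Conv2d(prev, w, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            prev = w
        layers.append(nn.Conv2d(prev, 1, kernel_size=4, padding=1))  # real/fake logits
        self.net = nn.Sequential(*layers)

    def forward(self, img, sal_map):
        return self.net(torch.cat([img, sal_map], dim=1))

bce = nn.BCEWithLogitsLoss()

def d_loss(D, img, real_map, fake_map):
    """Discriminator step: label real pairs 1 and generated pairs 0."""
    real = D(img, real_map)
    fake = D(img, fake_map.detach())    # do not backprop into G here
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def g_loss(D, img, real_map, fake_map, lam=100.0):
    """Generator step: fool D, plus an assumed pixel-wise L1 term."""
    fake = D(img, fake_map)
    adv = bce(fake, torch.ones_like(fake))
    return adv + lam * nn.functional.l1_loss(fake_map, real_map)
```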

[Figure 2. Salient object detection results under different model configurations.]

Figure 2 presents three examples of salient object detection results produced by different model configurations, visually verifying the advantages of our local saliency refiner R and discriminator D; column (e) shows the results of the full model proposed in this paper.

IV. EXPERIMENT

A. Datasets and Evaluation Criteria

We evaluate performance on eight standard benchmark datasets: SED1 [64], SED2 [64], ECSSD [4], PASCAL-S [65], HKU-IS [20], SOD [66], DUT-OMRON [67], and DUTS-TE [32].

B. Experimental results

[Figure: Visual comparison of different saliency detection methods with our method (CCAL) on various challenging scenarios.]

V. CONCLUSION

In this paper, we propose an end-to-end salient object detection model (CCAL) based on cascaded convolutional neural networks and adversarial learning. An encoder-decoder network and a deep residual network, combined as cascaded CNNs, are designed to perform global saliency estimation and local saliency refinement, respectively. Using this coarse-to-fine cascading approach, the performance of salient object detection can be gradually improved. As a structured loss function, the adversarial loss introduced by the discriminator helps CCAL better learn the structural information of salient objects, and the experimental results illustrate its importance for improving performance. The method produces accurate salient object detection results without any post-processing. Experiments show that CCAL not only achieves state-of-the-art performance on 8 benchmark datasets, but also runs at 17 frames per second on a GPU.

Source: blog.csdn.net/qq_43537420/article/details/130659115