RCF for Edge Detection

code: https://github.com/meteorshowers/RCF-pytorch
paper: https://arxiv.org/abs/1612.02103

Summary

Edge detection is a fundamental problem in computer vision. In recent years, convolutional neural networks (CNNs) have made significant progress in this field. Existing methods use CNN features from a single, fixed depth, which may fail to capture objects whose scale and aspect ratio vary. In this paper, we propose an accurate edge detector that uses richer convolutional features (RCF). RCF encapsulates all convolutional features into a more discriminative representation, making good use of the rich feature hierarchy, and is trainable end-to-end by backpropagation. RCF fully exploits the multi-scale and multi-level information of objects to perform image-to-image edge prediction holistically. Using the VGG16 network, we achieve state-of-the-art performance on several available datasets. When evaluated on the well-known BSDS500 benchmark, we achieve an ODS F-measure of 0.811 while maintaining a relatively fast speed (8 FPS). In addition, our fast version of RCF reaches an ODS F-measure of 0.806 at 30 FPS. By applying RCF edges to classic image segmentation, we also demonstrate the generality of the method.

Network Architecture

[Figure: feature maps produced by the conv3 and conv4 layers of VGG16]

A simple network built on VGG16 produces edge outputs from conv3_1, conv3_2, conv3_3, conv4_1, conv4_2 and conv4_3. It can be clearly seen that the convolutional features gradually become coarser with depth, and that the intermediate layers conv3_1, conv3_2, conv4_1 and conv4_2 contain many useful fine details that the other layers lack.
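
Below is a minimal sketch, assuming torchvision's pretrained VGG16, of how to tap those intermediate feature maps with forward hooks and confirm that they grow coarser with depth. The layer indices are assumptions mapped from the conv names (e.g. conv3_1 is index 10 of `vgg16().features`); this is illustrative, not the paper's code.

```python
import torch
from torchvision.models import vgg16

# Assumed mapping from VGG16 conv names to torchvision feature indices.
TAPS = {"conv3_1": 10, "conv3_3": 14, "conv4_1": 17, "conv4_3": 21}

model = vgg16(weights="DEFAULT").features.eval()

feats = {}
def save(name):
    def hook(module, inputs, output):
        feats[name] = output.detach()
    return hook

for name, idx in TAPS.items():
    model[idx].register_forward_hook(save(name))

x = torch.randn(1, 3, 320, 480)  # stand-in for a real image
with torch.no_grad():
    model(x)

for name, f in feats.items():
    print(name, tuple(f.shape))  # spatial size shrinks as the stage deepens
```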

[Figure: the RCF network architecture]
The RCF network is based on VGG16, modified as follows. Compared with the original VGG16:

  • RCF removes all of the original fully connected layers (the last three layers) and the final pooling layer. VGG16 was designed for image classification, while this network is designed for edge detection: the 1×1×4096 output of the fully connected layers discards the spatial information that edge detection needs, so those layers serve no purpose here and are deleted.
  • To extract edge information, the response at every pixel must be re-scored. Therefore, each convolutional layer of VGG16 is followed by a 1×1 convolution with 21 output channels; within each stage, these 21-channel maps are summed element-wise and then reduced to a single channel by a 1×1 convolution.
  • A sigmoid / cross-entropy loss layer is attached to the side output of each stage to compute the loss and update the parameters.
  • Each stage contains a deconvolution layer for upsampling, which maps the side output back to the original image size. In the fusion part, the upsampled outputs of all stages are concatenated and combined by a 1×1 convolution, so that the multiple channels are mixed into a single fused prediction (a condensed sketch follows this list).
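
The following is a condensed PyTorch sketch of the architecture these bullets describe. It simplifies the paper's design: RCF attaches a 1×1-21 convolution to every conv layer and sums them within a stage, while this sketch uses a single 1×1-21 per stage, and bilinear interpolation stands in for the fixed deconvolution kernels. The stage slicing of torchvision's VGG16 is an assumption; see the official RCF-pytorch repository for the faithful implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class RCFSketch(nn.Module):
    # (start, end) slices of vgg16().features for the five stages;
    # the final pooling layer (index 30) is dropped, as described above.
    STAGES = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
    CHANNELS = [64, 128, 256, 512, 512]  # output channels of each stage

    def __init__(self):
        super().__init__()
        backbone = list(vgg16(weights="DEFAULT").features.children())
        self.stages = nn.ModuleList(
            nn.Sequential(*backbone[a:b]) for a, b in self.STAGES)
        # 1x1-21 conv then 1x1-1 conv per stage (simplified from one
        # 1x1-21 per conv layer followed by an element-wise sum)
        self.down21 = nn.ModuleList(nn.Conv2d(c, 21, 1) for c in self.CHANNELS)
        self.score = nn.ModuleList(nn.Conv2d(21, 1, 1) for _ in self.CHANNELS)
        self.fuse = nn.Conv2d(5, 1, 1)  # combines the five side outputs

    def forward(self, x):
        h, w = x.shape[2:]
        sides = []
        for stage, d21, sc in zip(self.stages, self.down21, self.score):
            x = stage(x)
            s = sc(d21(x))
            # upsample the side output back to the input resolution
            s = nn.functional.interpolate(
                s, size=(h, w), mode="bilinear", align_corners=False)
            sides.append(s)
        fused = self.fuse(torch.cat(sides, dim=1))
        return sides + [fused]  # six logit maps, one loss each
```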

Loss Function

Edge-detection datasets are usually labeled by multiple annotators. Although each annotator's perception differs, their results are highly consistent. For each image, all annotations are averaged to generate an edge probability map: 0 means that no annotator considers the point an edge, and 1 means that every annotator does. A hyperparameter η is defined: if a point's edge probability is greater than η, the point is treated as an edge; if its probability is 0, it is treated as a non-edge; points with probability between 0 and η are disputed and are excluded from the loss function. The loss at each point can therefore be written as:
$$
l(X_i; W) =
\begin{cases}
-\,\alpha \cdot \log\bigl(1 - P(X_i; W)\bigr) & \text{if } y_i = 0 \\
0 & \text{if } 0 < y_i \le \eta \\
-\,\beta \cdot \log P(X_i; W) & \text{otherwise}
\end{cases}
$$
where:
$$
\alpha = \lambda \cdot \frac{|Y^+|}{|Y^+| + |Y^-|}, \qquad
\beta = \frac{|Y^-|}{|Y^+| + |Y^-|}
$$
$|Y^+|$ is the number of certain-edge points in the image, $|Y^-|$ is the number of certain-non-edge points, and $\lambda$ is a hyperparameter balancing the two. $X_i$ and $y_i$ denote the feature vector and the ground-truth edge label at pixel $i$, $P(\cdot)$ is the standard sigmoid function, and $W$ denotes all learnable parameters of the network. The loss over the whole image can then be written as:
$$
L(W) = \sum_{i=1}^{|I|} \left( \sum_{k=1}^{K} l\bigl(X_i^{(k)}; W\bigr) + l\bigl(X_i^{fuse}; W\bigr) \right)
$$

where $X_i^{(k)}$ is the feature vector at pixel $i$ in stage $k$, $X_i^{fuse}$ is the feature vector in the fusion stage, $|I|$ is the number of pixels in the image, and $K$ is the number of stages (5 here).
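
A minimal sketch of this loss for one side output, assuming `logits` is a raw (pre-sigmoid) prediction map and `consensus` is the averaged annotation in [0, 1]; the η and λ values here are illustrative, and the explicit minus signs make the log terms a proper negative log-likelihood to minimize:

```python
import torch

def rcf_loss(logits, consensus, eta=0.5, lam=1.1):
    pos = consensus > eta    # confident edges      (y_i = 1)
    neg = consensus == 0     # confident non-edges  (y_i = 0)
    # pixels with 0 < consensus <= eta are disputed and ignored

    n_pos = pos.sum().float()               # |Y+|
    n_neg = neg.sum().float()               # |Y-|
    alpha = lam * n_pos / (n_pos + n_neg)   # weights the negatives
    beta = n_neg / (n_pos + n_neg)          # weights the positives

    p = torch.sigmoid(logits)
    eps = 1e-6  # numerical safety for log()
    return -(beta * torch.log(p[pos] + eps)).sum() \
           - (alpha * torch.log(1.0 - p[neg] + eps)).sum()

# The image-level loss sums the per-point losses over the five stages
# plus the fusion output:
# total = sum(rcf_loss(s, consensus) for s in model(images))
```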

source: https://blog.csdn.net/weixin_42990464/article/details/112656043