[Paper Reading] CASENet: Deep Category-Aware Semantic Edge Detection

Paper address: http://arxiv.org/abs/1705.09759

Learning reference: [Reading Notes] CASENet: Deep Category-Aware Semantic Edge Detection

CASENet: Deep Category-Aware Semantic Edge Detection Paper Interpretation - My Blog has something - CSDN Blog

Learning process: first translate and understand the original text, then study classic blog posts, then write my own notes (covering the paper's main points and open reading questions). This must be repeated: learn to distill the key points, find what can genuinely be learned, and gradually deepen understanding of the formulas.

Contents

 Summary

 1. Problem description
   Multi-label loss function

 2. Network framework
   Basic network
   Basic structure
   Deep supervision framework
   CASENet framework

 Experiment


 Summary

This paper presents a category-aware semantic edge detection algorithm. Traditional edge detection is itself a challenging binary problem; category-aware semantic edge detection is a more challenging multi-label problem. Because edge pixels occur on contours or junctions belonging to two or more semantic classes, the paper models each edge pixel as potentially associated with more than one class and proposes a new end-to-end deep semantic edge learning architecture based on ResNet, along with a new skip structure in which category-wise edge activations at the top convolutional layer are shared with and fused with the same set of bottom-layer features. Finally, a multi-label loss function is proposed to supervise the fused activations.

Multi-label problem, skip structure, feature sharing, feature fusion, multi-label loss function

Figure 1. Edge detection and classification with our method.
For Street View images, our goal is to simultaneously detect boundaries and assign one or more semantic categories to each edge pixel.
(b) and (c) are color-coded in HSV, where hue and saturation together represent the composition and relative strength of the categories.

The image above is a street scene from the Cityscapes dataset, which contains several object classes, such as buildings, ground, sky, and cars. (Question: what do the various colors mean? Do they represent individual categories or typical category combinations? The legend in Figure 1 gives the meaning of the corresponding colors. Do green and blue indicate correctly detected and missed edge pixels? That is mentioned below; does that sentence only describe the colors in Figure 2, with no bearing on Figure 1?)

  • Edge pixels located on building and pole outlines can be associated with both categories. The boundaries are visualized in the image above, with colors listed for typical class combinations such as "Building + Pole" and "Road + Sidewalk".
  • Degree of association: In our problem, each edge pixel is represented by a vector, where each element of the vector represents the degree of association between that pixel and a different semantic category .
  • Multi-label learning problem: while most edge pixels are associated with only two classes, at junctions an edge pixel may be associated with three or more classes. There is therefore no limit on the number of classes a pixel can be associated with, which establishes that this paper studies a multi-label learning problem.

 Figure 2. An example of a test image and zoomed edge maps from the corresponding bounding-box region.
The visualized edge maps belong to the three categories person, car, and road, respectively.
Green and blue represent correctly detected and missed edge pixels, respectively.

In this paper, we propose CASENet, a deep network capable of detecting category-aware semantic edges. Given K semantic categories, the network generates K separate edge maps, each representing the edge probability of one category. (So an edge map, in mathematical terms, is a map of edge probabilities?) The figures above show the per-category edge maps of the test images.

The work in this paper adopts the same nested architecture as HED (described below), but extends the problem to the more difficult one of category-aware semantic edge detection.
( [Paper reading] (edge detection related) HED: Holistically-Nested Edge Detection_dujuancao11's blog-CSDN blog )
The main contributions are as follows:

1. To address edge classification, a multi-label learning framework is proposed that learns better edge features than traditional multi-class frameworks.

2. A new nested structure without deep supervision is proposed on top of ResNet, in which the bottom-layer features are used to enhance top-layer classification.

This paper also finds that deep supervision is not beneficial for this problem. (Deep supervision refers to the DSN-style structure, i.e., applying supervised losses to each side output.)

 1. Problem description:

  • Given an input image, the goal is to compute the corresponding semantic edge maps for a set of predefined categories. (End-to-end? Image to image?)
  • Quantity: K output maps per input image, one per semantic category (the number of output maps equals the number of categories, 1 to K).
  • Size: each edge map has the same size as the input image I (1-to-1).
  • The network outputs a conditional probability Y_k(p | I; W), the edge probability of the k-th semantic category at pixel p.

Multi-label loss function

Probably because of the multi-class nature of semantic segmentation, several related works on category-aware semantic edge detection study the problem from a multi-class learning perspective (each object belongs to exactly one label).
This paper argues that the problem should inherently allow a pixel to belong to multiple classes at the same time, and should be addressed with a multi-label learning framework.

The difference between multi-class learning and multi-label learning:

  • Multi-class learning: each object belongs to exactly one label.
  • Multi-label learning: a pixel is allowed to belong to multiple categories at the same time.
    (Part of a person's edge may also be part of a building's edge, in which case that edge carries two category labels,
    so one-hot encoding cannot be used for such labels. The official label
    code uses a uint32 to represent the label and reads a pixel's classes off its bits: if a pixel belongs to the 1st and 5th categories, its label is 1000...010001 in binary, 32 bits in total, with the highest bit indicating whether the loss is computed on this sample.)
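The bit-packed label scheme described above can be sketched in a few lines of Python. This is a hypothetical re-implementation of the idea (not the official label code), using 0-based class indices, with helper names of my own:

```python
# Bit k is set when the pixel lies on an edge of class k (0-based),
# and the highest bit (bit 31) flags whether the pixel is ignored in the loss.
IGNORE_BIT = 31

def encode_label(classes, ignore=False):
    """Pack a set of class indices (each < 31) into one uint32-style label."""
    label = 0
    for k in classes:
        assert 0 <= k < IGNORE_BIT
        label |= 1 << k
    if ignore:
        label |= 1 << IGNORE_BIT
    return label

def decode_label(label):
    """Recover the class indices and the ignore flag from a packed label."""
    classes = [k for k in range(IGNORE_BIT) if label & (1 << k)]
    return classes, bool(label & (1 << IGNORE_BIT))
```

For example, a pixel on the edges of the 1st and 5th categories encodes to binary 10001, matching the 1000...010001 pattern quoted above once the ignore bit is set.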

Since this is a multi-label classification problem, the cross-entropy of multi-class classification cannot be used as the loss; multi-label classification and multi-class classification are not the same concept. The multi-label loss is:

$$\mathcal{L}(\mathbf{W}) = \sum_{k}\sum_{p}\Big\{-\beta\,\bar{Y}_k(p)\log Y_k(p\,|\,I;\mathbf{W}) - (1-\beta)\big(1-\bar{Y}_k(p)\big)\log\big(1-Y_k(p\,|\,I;\mathbf{W})\big)\Big\}$$

where $\beta$ is the percentage of non-edge pixels in the image (the sample skewness), used to balance the heavily unbalanced edge/non-edge samples, $\bar{Y}_k(p)$ is the ground-truth label, and $p$ ranges over every pixel location.

The loss function is easy to interpret: it can be viewed as a binary cross-entropy for each of the K classes, with the K per-class results summed.
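A minimal sketch of this reweighted multi-label loss, written as an unvectorized per-class binary cross-entropy for clarity (the function name and list-of-lists layout are my own):

```python
import math

def multilabel_edge_loss(y_true, y_prob, beta):
    """Reweighted binary cross-entropy, summed over classes and pixels.

    y_true[k][p] in {0, 1}: ground-truth edge label of pixel p for class k.
    y_prob[k][p] in (0, 1): predicted edge probability.
    beta: fraction of non-edge pixels, used to up-weight the rarer edge pixels.
    """
    loss = 0.0
    for yk, pk in zip(y_true, y_prob):
        for y, p in zip(yk, pk):
            # beta weights the edge (positive) term, (1 - beta) the non-edge term
            loss -= beta * y * math.log(p) + (1 - beta) * (1 - y) * math.log(1 - p)
    return loss
```

With one edge and one non-edge pixel both predicted at probability 0.5 and beta = 0.5, the loss reduces to ln 2, the usual balanced cross-entropy value.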

 2. Network framework:

In designing the network, one cannot copy the design of semantic segmentation: the encoder-decoder scheme cannot be used, because edges are low-level features, so the network needs to be designed from low level to high level. Semantics, by contrast, are high-level features, so a decoder structure can be used to continuously refine the high-level semantic features.

This paper proposes CASENet, an end-to-end trainable Convolutional Neural Network (CNN) architecture (shown in Figure 3(c)) to address category-aware semantic edge detection. Before describing CASENet, we first present two alternative network architectures. Although these two architectures can also detect edges, they have some unsolvable problems. This paper addresses these problems by proposing the CASENet architecture.

The shortcomings of the two alternative network architectures are explained first, and then the advantages of the proposed network (which also conveys its design principles).

 Figure 3. The three CNN architectures designed in this paper are shown in (a)-(c).
Solid rectangles represent composite blocks of CNN layers . Any reduction in its width indicates a 2x drop in the spatial resolution of the block’s output feature map. The numbers next to the arrows indicate the number of channels for the block's output characteristics. The blue solid rectangle is the stack of ResNet blocks .
The purple solid rectangle is our classification module .
The red dotted line indicates that the output of the block is supervised by the loss function in Equation 1 .
The grey solid rectangle is our edge feature extraction module .
A dark green solid rectangle is where our fusion classification module performs K-grouped 1×1 convolution .
(d)-(h) describe more details of the individual modules used in (a)-(c), where the input and output feature maps are represented by rectangular boxes.

(It should be made clear why the new network is improved in this way, and what the detailed design of each module means.)

Basic network:

Some improvements based on the ResNet-101 framework

  1. This paper adopts the ResNet-101 framework, removing the original average pooling and fully connected layers and keeping the bottom convolutional blocks.
    To better preserve low-level edge information, the base network is further modified:
  2. The stride of the first and fifth convolutional blocks ("res1" and "res5" in the figure above) in ResNet-101 is changed from 2 to 1.
  3. A dilation factor is introduced into the subsequent convolutional layers to keep a receptive field of the same size as in the original ResNet.
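The receptive-field bookkeeping behind this stride-to-dilation swap can be checked with a small sketch (my own helper, using the standard recurrence: effective kernel d·(k-1)+1, receptive field grows by (k_eff - 1) times the accumulated stride):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    layers: list of (kernel_size, stride, dilation) tuples.
    rf tracks the receptive field at the current depth; jump tracks the
    input-pixel distance between neighboring outputs.
    """
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1   # dilation spreads the kernel taps apart
        rf += (k_eff - 1) * jump
        jump *= s
    return rf
```

A 3×3 stride-2 conv followed by a plain 3×3 conv has receptive field 7; replacing the stride-2 conv with stride 1 and dilating the following 3×3 conv by 2 gives the same receptive field 7, which is exactly the modification described above.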

 Basic structure:

 (d): Classification module:

  • A 1×1 convolutional layer followed by bilinear upsampling (implemented by K grouped deconvolutional layers) generates K activation maps, each the same size as the image. (What does the bilinear upsampling do? What is the role of the 1×1 convolution + bilinear upsampling combination?)
  • The probability that pixel p lies on an edge of class k is then computed by the sigmoid unit given in the multi-label loss formula.
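For intuition about the upsampling step, here is a minimal pure-Python sketch of bilinear interpolation with an align-corners style mapping (my own implementation; the paper realizes it with grouped deconvolution layers instead):

```python
def bilinear_upsample(fmap, out_h, out_w):
    """Bilinearly resize a 2D feature map (list of lists) to out_h x out_w."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # map the output pixel back into the source grid (align corners)
            y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            # weighted average of the four surrounding source values
            out[i][j] = ((1 - dy) * (1 - dx) * fmap[y0][x0]
                         + (1 - dy) * dx * fmap[y0][x1]
                         + dy * (1 - dx) * fmap[y1][x0]
                         + dy * dx * fmap[y1][x1])
    return out
```

Applied per channel after the 1×1 convolution, this brings each of the K activation maps back to the input-image resolution.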

Deep Supervision Framework:

The Holistically-Nested Edge Detection (HED) network is a deeply supervised nested architecture.

  • The basic idea is that, in addition to the loss at the top of the network, losses on the side outputs of the bottom convolutional layers are also computed (the innovation at the time: these losses are computed simultaneously).
  • Additionally, a fused edge map is obtained by a supervised linear combination of the side activations.

HED → CASENet

Extension: from binary edge detection to multi-class, multi-label detection. (The classification module is used to process the K-channel side outputs and the K-channel final output. Each activation map has K channels, and the K channels correspond to the K labels, right? The sliced concatenation layer fuses the 5 activation maps, resulting in a 5K-channel activation map.)

 (DSN) Sliced concatenation

The 5 activation maps are fused through the sliced concatenation layer (the colors in Figure 3(g) indicate the channel index) to obtain a 5K-channel activation map.

The connection here adopts sliced concatenation with grouped convolution for the following reason:
since the 5 side activations are supervised, each channel of a side activation is constrained to carry the information most relevant to its corresponding class. (Some understanding of the slicing operation?)

Through sliced concatenation and grouped convolution, the fused activation map is given by:

$$A_f \triangleq \{A_1^{(1)}, A_2^{(1)}, \ldots, A_5^{(1)}, A_1^{(2)}, \ldots, A_5^{(K)}\}$$

where $A_j^{(k)}$ is the class-$k$ channel of the $j$-th side activation; the $K$-grouped $1\times1$ convolution then combines, for each class, its five channels across scales.

This essentially integrates the activations of the corresponding class at different scales into the final fused activation; CASENet reuses this design.
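A toy sketch of sliced concatenation plus the K-grouped 1×1 convolution (helper names are my own; channels are plain Python lists of pixel values, and the test below uses 2 sides and 2 classes for brevity, whereas the paper uses 5 sides):

```python
def slice_concat(side_activations):
    """side_activations[j][k]: the class-k channel of side j.

    Returns channels ordered class by class (A1(1)..A5(1), A1(2)..), so that
    each run of consecutive channels belongs to one class across all sides.
    """
    num_sides = len(side_activations)
    num_classes = len(side_activations[0])
    return [side_activations[j][k]
            for k in range(num_classes)
            for j in range(num_sides)]

def grouped_fuse(channels, weights, num_sides):
    """K-grouped 1x1 conv: class k's output is a weighted sum of the
    num_sides consecutive channels belonging to class k."""
    num_classes = len(channels) // num_sides
    fused = []
    for k in range(num_classes):
        group = channels[k * num_sides:(k + 1) * num_sides]
        fused.append([sum(w * ch[p] for w, ch in zip(weights[k], group))
                      for p in range(len(group[0]))])
    return fused
```

The channel ordering is the whole point: because same-class channels are adjacent, a grouped convolution can fuse each class independently of the others.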

 CASENet framework:

After introducing the basic framework and DSN framework, it is found that there are several potential problems in the task of category-aware semantic edge detection :

  1. First, the receptive field of the bottom layers is limited, so it is unreasonable to ask the early layers of the network to produce semantic classifications, given that contextual information plays an important role in semantic classification.
    This paper argues that semantic classification should happen at the top, where features encode high-level information.
  2. Second, bottom-layer side features can help enhance top-level classification by suppressing non-edge pixels and providing detailed edge localization and structural information; therefore, they should be taken into account in edge detection.

The CASENet architecture proposed in this paper (Figure 3(c)) solves the above problems. The network adopts a nested architecture, somewhat similar to DSN, but includes several key improvements, summarized as follows:

  1. Replace the classification module at the bottom with a feature extraction module.
  2. Put the classification module on top of the network and supervise it.
  3. Perform shared concatenation (Figure 3(h)) instead of sliced concatenation. (What is the difference?)

(CASENet uses side feature extraction, not side classification, so the bottom sides output only single-channel feature maps $F^{(j)}$ instead of K-class activation maps. It finally uses shared concatenation, not sliced concatenation: the bottom features $F = \{F^{(1)}, F^{(2)}, F^{(3)}\}$ from sides 1-3 are replicated and attached to each of the K top activations, giving the fused map $A_f \triangleq \{F, A^{(1)}, F, A^{(2)}, \ldots, F, A^{(K)}\}$ with 4K channels. The specific meaning of the formula needs to be understood again.)
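The shared concatenation can be sketched as follows (a hypothetical helper; each "channel" is just a list, and the key point is that the same three bottom feature maps are replicated into every class group):

```python
def shared_concat(side_features, top_activations):
    """Shared concatenation: {F, A(1), F, A(2), ..., F, A(K)}.

    side_features: the 3 single-channel bottom maps F(1)..F(3).
    top_activations: the K class activation maps from the top module.
    Returns 4K channels; the bottom features are shared across all K groups.
    """
    fused = []
    for a_k in top_activations:
        fused.extend(side_features)  # the shared copy of F(1)..F(3)
        fused.append(a_k)            # the class-specific top activation
    return fused
```

A K-grouped 1×1 convolution over these 4K channels then fuses, for each class, one high-level activation with the three shared low-level features.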

 In general, CASENet can be thought of as a joint edge detection and classification network by involving low-level features and augmenting higher-level semantic classification with skip structures.

Reference study:  CASENet: Deep Category-Aware Semantic Edge Detection Paper Interpretation - My blog has something - CSDN Blog


 Interpretation:

  • The backbone network adopts ResNet-101; the features of the first three stages are sent to side feature extractors, then fused with the features of the last stage, and the fusion is supervised to compute the loss.
  • If the dataset has K classes, the final network output is a feature map of depth K, with a K-dimensional vector at each pixel location whose values describe the probability of belonging to each class. This is a multi-label classification problem.
  • The intermediate layers of the ResNet are not deeply supervised; instead, the low-level features are used to augment the high-level features for classification.
  • The fused classification uses grouped convolution.

The network has several characteristics:

  • Side feature extraction is applied to the features of the first three stages, with a single output channel (gray modules).

  • Classification is applied to the fifth stage, outputting K-channel features (purple module).

  • Both modules include an upsampling step to ensure the output resolution matches the original image.

The author argues that the receptive field of the low-level features is limited, so it is unreasonable to let the first few layers do semantic classification; the features of the first three stages are therefore fed into side feature extractors only as features. Semantic classification should instead be carried out at the high level, because high-level features carry strong semantic information.
At the same time, the low-level features help enhance the high-level classification task, because they carry more detailed edge localization and structural information. Hence this structure: the low-level features of the first three stages are merged into the high-level features of the fifth stage for the final supervision.

How can it be embodied that low-level features help high-level features for classification?

  • Note that each side feature output has 1 channel while the classification output has K channels, K being the number of categories. Shared concat prepends the outputs of the first three stages to each of the K classification feature maps, so the final feature map has 4K channels.
  • Moreover, the fused classification uses grouped convolution. Although the schematic shows one kernel convolving a feature map of depth 3, the actual network convolves groups of four maps, where in each group one map comes from the high-level features and the remaining three come from the low-level features.
  • Dilation factor

    Because the ResNet structure is modified and the strides are changed, the resolution of the final feature map is 1/8 of the original, so the receptive field also shrinks. To restore the receptive field, the convolutional layers after each layer whose stride was originally 2 all use dilated (atrous) convolutions. (A clever use of convolutions in network design.)

See the original text for the general idea:

In general, CASENet can be thought of as a joint edge detection and classification network that engages lower-level features and enhances higher-level semantic classification through a cross-layer architecture.

 Experiment

Preprocessing

Considering that manual annotation contains errors relative to the true edges, and that pixels near label boundaries carry ambiguous category information, the author generated thicker labels as the training GT for the network. These are obtained by finding pixels in the neighborhood of a labeled edge whose segmentation label differs from theirs, and classifying those pixels as edges. (If you ask me, morphological operations could be used directly.)
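A naive sketch of this thick-label generation (my own implementation, class-agnostic for simplicity: a pixel becomes an edge pixel if any pixel within a given Chebyshev radius carries a different segmentation label; the radius controls the label thickness):

```python
def seg_to_thick_edges(seg, radius=1):
    """seg: 2D list of segmentation class ids. Returns a binary edge map."""
    h, w = len(seg), len(seg[0])
    edges = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # scan the (2*radius+1)^2 neighborhood for a different label
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w and seg[ni][nj] != seg[i][j]:
                        edges[i][j] = 1
    return edges
```

Increasing `radius` is effectively the morphological dilation of the boundary mentioned above; the per-class GT would repeat this per category.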

In addition to the preprocessing of labels, there are a few details to note:

ResNet uses atrous convolution: apart from the first conv and the 1×1 convolutions, all other convolutions use a dilation rate of 2.
The network is initialized from a model pre-trained on MS COCO.
See below for hyperparameters and data augmentation.


Origin blog.csdn.net/dujuancao11/article/details/122918643