The first step is to pass the complete image through several convolutional layers and max pooling layers to obtain a feature map.
The second step is to use the selective search algorithm to extract object proposals, i.e. regions of interest (RoIs), from the complete image.
In the third step, each object proposal is projected onto the feature map according to the mapping between image coordinates and feature-map coordinates, which gives the feature-map region corresponding to that proposal.
In the fourth step, the feature map obtained in the third step is passed through the RoI pooling layer to obtain a fixed-size feature map (which becomes smaller).
In the fifth step, the pooled feature map passes through two fully connected (FC) layers to produce a fixed-size RoI feature vector.
In the sixth step, the feature vector is fed into two separate FC layers, which produce two output vectors: the first is the classification output, computed with softmax, and the second is the bounding box regression output for each class.
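Steps three and four can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: it handles a single-channel feature map, and the function name `roi_max_pool`, the stride of 16 and the 7x7 default output size are assumptions (they match a VGG16-style backbone, but other backbones differ).

```python
import math

def roi_max_pool(feature_map, roi, out_h=7, out_w=7, stride=16):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature_map: list of rows of numbers.
    roi: (x1, y1, x2, y2) in image pixels (inclusive corners).
    stride: assumed ratio between image size and feature-map size.
    """
    feat_h, feat_w = len(feature_map), len(feature_map[0])
    x1, y1, x2, y2 = roi

    # Step 3: project the RoI from image coordinates onto the feature map.
    fx1 = min(x1 // stride, feat_w - 1)
    fy1 = min(y1 // stride, feat_h - 1)
    fx2 = min(max(x2 // stride, fx1), feat_w - 1)
    fy2 = min(max(y2 // stride, fy1), feat_h - 1)
    roi_w = fx2 - fx1 + 1
    roi_h = fy2 - fy1 + 1

    # Step 4: split the region into an out_h x out_w grid of bins and
    # keep the maximum of each bin, so every RoI yields the same shape.
    pooled = []
    for i in range(out_h):
        r0 = fy1 + (i * roi_h) // out_h
        r1 = fy1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = fx1 + (j * roi_w) // out_w
            c1 = fx1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the input RoI, the result always has `out_h` rows and `out_w` columns, which is what allows the following FC layers to take a fixed-size input.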
6. multi-task loss
There are two losses. The classification loss (L_cls) is computed on an (N+1)-way softmax output, where N is the number of object categories and the extra 1 is the background class.
The regression loss (L_loc) is computed on a 4×N regressor output, which means that a separate regressor (four box offsets) is trained for each category.
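The total loss is L = L_cls + λ [k* ≥ 1] L_loc. λ = 1 in the paper and is used to adjust the balance between the two losses; [k* ≥ 1] is an indicator on the ground-truth class k*: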
When the RoI is background: k* = 0 → [k* ≥ 1] = 0
When the RoI is not background: k* ≥ 1 → [k* ≥ 1] = 1
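The way the indicator switches the regression term on and off can be sketched in a few lines of Python. This is a rough illustration for a single RoI, not the paper's code: the function name `multi_task_loss` and its parameters are hypothetical, and the regression loss is passed in as an already computed number.

```python
import math

def multi_task_loss(class_scores, true_class, loc_loss, lam=1.0):
    """Return L_cls + lam * [true_class >= 1] * L_loc for one RoI.

    class_scores: N+1 raw scores, index 0 is the background class.
    true_class: ground-truth class k* (0 = background).
    loc_loss: precomputed regression loss for the true class.
    """
    # Softmax over the N+1 scores, shifted by the max for stability.
    m = max(class_scores)
    exps = [math.exp(s - m) for s in class_scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Classification loss: negative log-probability of the true class.
    cls_loss = -math.log(probs[true_class])

    # Indicator [k* >= 1]: background RoIs contribute no box loss.
    indicator = 1.0 if true_class >= 1 else 0.0
    return cls_loss + lam * indicator * loc_loss
```

For a background RoI the result is just the classification loss, since a background region has no ground-truth box to regress toward.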
Here, the loss of the regressor is not L2 but smooth L1, which is less sensitive to outliers and so helps prevent exploding gradients, as follows:
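In the Fast R-CNN paper, smooth L1 of a difference x is 0.5·x² when |x| < 1 and |x| − 0.5 otherwise. A minimal Python sketch of that definition follows; the helper name `loc_loss` and the plain-list representation of the four box offsets are assumptions made for illustration.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for large |x|."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def loc_loss(pred_offsets, target_offsets):
    """Sum smooth L1 over the four box offsets of the true class."""
    return sum(smooth_l1(p - t)
               for p, t in zip(pred_offsets, target_offsets))
```

Because the linear branch has a constant slope of 1, a badly wrong box prediction produces a bounded gradient, whereas the gradient of an L2 loss would grow with the error.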