Foreword
Over the past few days I have been reading YOLOF (You Only Look One-level Feature), a paper accepted to CVPR 2021. The paper revisits the feature pyramid network (FPN) for one-stage detectors and points out that FPN succeeds because of its divide-and-conquer solution to the optimization problem in object detection, not because of multi-scale feature fusion. I had often seen feature pyramid structures before but never studied them in depth, so today I will use YOLOF as an occasion to briefly summarize the structural characteristics of FPN.
01
The feature pyramid is an important part of multi-scale object detection, but because of its compute and memory cost, deep learning detectors before FPN deliberately avoided pyramid representations. In the FPN paper, the authors exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks and use a top-down pathway with lateral connections to build high-level semantic feature maps at all scales. This is the classic feature pyramid structure.
The specific method is actually not difficult to understand:
High-level features, which have low resolution but rich semantic information, are merged from top to bottom with low-level features, which have high resolution but weak semantic information, so that the features at every scale carry rich semantic information.
02
Of course, FPN is not the only structure shown in the figure above. Here is a brief overview of the different ways to build a feature pyramid:
Featurized image pyramid
A fairly brute-force multi-scale method: the input image is rescaled by several ratios, and features are computed at each scale. This handles multiple scales, but it is equivalent to training multiple models (assuming the input size is fixed). Even if the input size is not fixed, storing images at several scales increases memory use.
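To get a feel for that cost, here is a minimal sketch that counts how many pixels an image pyramid has to process compared with the original image. The function names, the 640x480 input, and the scale factors are hypothetical values chosen for illustration; they do not come from any paper.

```python
def image_pyramid_sizes(width, height, scales):
    """Return the (width, height) of the input image at each scale factor."""
    return [(round(width * s), round(height * s)) for s in scales]


def relative_pixel_cost(width, height, scales):
    """Total pixels across all scaled copies, relative to the original image."""
    total = sum(w * h for w, h in image_pyramid_sizes(width, height, scales))
    return total / (width * height)


scales = [0.5, 1.0, 2.0]  # hypothetical scale factors, for illustration only
sizes = image_pyramid_sizes(640, 480, scales)
cost = relative_pixel_cost(640, 480, scales)
print(sizes)  # [(320, 240), (640, 480), (1280, 960)]
print(cost)   # 5.25
```

With these three scales the network sees 5.25 times as many pixels as it would for the single original image, before counting any intermediate feature maps.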
Single feature map
This is the early CNN approach: the convolutional layers progressively learn high-level semantic features of the image, and prediction uses only the final feature map.
Pyramidal feature hierarchy
SSD was an early attempt to use the pyramid-shaped feature hierarchy of a CNN. It reuses the multi-scale feature maps already computed in the forward pass, so this form costs no additional resources. However, to avoid using low-level features, SSD gives up the shallow feature maps, builds its pyramid starting from conv4_3, and adds several new layers on top. Those low-level, high-resolution feature maps are very important for detecting small objects.
Feature Pyramid Network
To make natural use of the pyramidal shape of a CNN's feature hierarchy while producing a feature pyramid with strong semantics at all scales, FPN introduces a top-down pathway and lateral connections. This structure fuses shallow, high-resolution features with deep, semantically rich features, so a feature pyramid with strong semantics at every scale can be built quickly from a single-scale input image at little extra cost.
03
So, how are the top-down pathway and the lateral connections implemented?
Top-down
def _upsample_add(self, x, y):
    # Upsample x to y's spatial size, then add the two maps element-wise.
    _, _, H, W = y.size()
    return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y
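Note that the FPN paper itself uses the simplest form of upsampling: nearest-neighbor upsampling by a factor of 2, which just copies values and involves no deconvolution. The implementation here instead uses bilinear interpolation to an explicit target size, so that the upsampled map always matches the lateral map, even when the input size is odd.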
Lateral connection
# init: lateral layers, which really just match the channel counts
self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
self.latlayer2 = nn.Conv2d( 512, 256, kernel_size=1, stride=1, padding=0)
self.latlayer3 = nn.Conv2d( 256, 256, kernel_size=1, stride=1, padding=0)
# forward
p4 = self._upsample_add(p5, self.latlayer1(c4))
p3 = self._upsample_add(p4, self.latlayer2(c3))
p2 = self._upsample_add(p3, self.latlayer3(c2))
Together with the figure above, this shows the core idea of the paper:
Through 2x upsampling, the high-level semantic features passed down from the layer above are brought to the same spatial size as the lower-level feature map that arrives through the lateral connection;
Through a 1x1 conv, the lower-level feature map is projected to the same number of channels as the high-level one, which solves the channel mismatch in the fusion (element-wise sum) step.
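The two steps above can be sketched on plain nested lists, without any deep learning framework. This is only a toy illustration under stated assumptions: the helper names `conv1x1`, `upsample_nearest`, and `top_down_merge` are hypothetical, the 1x1 conv has no bias, and the upsampling is nearest-neighbor (as in the FPN paper) rather than the bilinear mode used in this article's PyTorch code.

```python
def conv1x1(feature, weight):
    """Apply a 1x1 convolution: mix channels independently at every pixel.

    feature: list of C_in maps, each H x W (nested lists).
    weight:  list of C_out rows, each with C_in coefficients (no bias).
    """
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [[sum(w * feature[c][i][j] for c, w in enumerate(row))
          for j in range(width)]
         for i in range(height)]
        for row in weight
    ]


def upsample_nearest(feature, out_h, out_w):
    """Resize every channel to (out_h, out_w) by copying the nearest source pixel."""
    in_h, in_w = len(feature[0]), len(feature[0][0])
    return [
        [[channel[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
         for i in range(out_h)]
        for channel in feature
    ]


def top_down_merge(top, lateral, lateral_weight):
    """Upsample `top` to the lateral map's size and add the projected lateral map."""
    projected = conv1x1(lateral, lateral_weight)
    out_h, out_w = len(projected[0]), len(projected[0][0])
    upsampled = upsample_nearest(top, out_h, out_w)
    return [
        [[u + p for u, p in zip(u_row, p_row)]
         for u_row, p_row in zip(u_map, p_map)]
        for u_map, p_map in zip(upsampled, projected)
    ]


# Toy input: a 1-channel 1x2 top map and a 2-channel 2x4 lateral map.
top = [[[10.0, 20.0]]]
lateral = [
    [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
    [[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]],
]
lateral_weight = [[1.0, 0.5]]  # projects 2 channels down to 1

merged = top_down_merge(top, lateral, lateral_weight)
print(merged)  # [[[12.0, 12.0, 22.0, 22.0], [12.0, 12.0, 22.0, 22.0]]]
```

The 1x1 projection turns the two lateral channels into one channel with value 2.0 everywhere, the top map is copied out to 2x4, and the sum has the lateral map's resolution with the top map's content added in.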
04
How is the top-down structure of FPN implemented in code?
'''FPN in PyTorch.

See the paper "Feature Pyramid Networks for Object Detection" for more details.
'''
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    # Each block outputs expansion * planes channels.
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        # Identity shortcut, replaced by a 1x1 projection when the shape changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class FPN(nn.Module):
    def __init__(self, block, num_blocks):
        super(FPN, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # Bottom-up layers
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)

        # Top layer
        self.toplayer = nn.Conv2d(2048, 256, kernel_size=1, stride=1, padding=0)  # Reduce channels

        # Smooth layers
        self.smooth1 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

        # Lateral layers
        self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer2 = nn.Conv2d(512, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer3 = nn.Conv2d(256, 256, kernel_size=1, stride=1, padding=0)

    def _make_layer(self, block, planes, num_blocks, stride):
        # Only the first block of a layer may downsample; the rest use stride 1.
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def _upsample_add(self, x, y):
        '''Upsample and add two feature maps.

        Args:
          x: (Tensor) top feature map to be upsampled.
          y: (Tensor) lateral feature map.

        Returns:
          (Tensor) added feature map.

        Note in PyTorch, when the input size is odd, the upsampled feature map
        from `F.interpolate(..., scale_factor=2, mode='nearest')`
        may not be equal to the lateral feature map size.

        e.g.
        original input size: [N,_,15,15] ->
        conv2d feature map size: [N,_,8,8] ->
        upsampled feature map size: [N,_,16,16]

        So we pass the target size explicitly and use bilinear upsampling.
        '''
        _, _, H, W = y.size()
        # F.upsample is deprecated; F.interpolate is its replacement.
        return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y

    def forward(self, x):
        # Bottom-up
        c1 = F.relu(self.bn1(self.conv1(x)))
        c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1)
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        # Top-down
        p5 = self.toplayer(c5)
        p4 = self._upsample_add(p5, self.latlayer1(c4))
        p3 = self._upsample_add(p4, self.latlayer2(c3))
        p2 = self._upsample_add(p3, self.latlayer3(c2))
        # Smooth
        p4 = self.smooth1(p4)
        p3 = self.smooth2(p3)
        p2 = self.smooth3(p2)
        return p2, p3, p4, p5
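To see what this network produces, the feature map shapes can be traced by hand with the standard convolution output-size formula. The sketch below is pure arithmetic and does not run the model; `conv_out` and `fpn_shapes` are hypothetical helper names, and the 600x900 input is an assumed size chosen because it leads to an odd feature map.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def fpn_shapes(height, width):
    """Trace (channels, height, width) of c2..c5 for the backbone above."""
    # Stem: 7x7 conv with stride 2, then 3x3 max pool with stride 2.
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    shapes = {}
    # layer1 keeps the resolution; layer2..layer4 each halve it once
    # through the stride-2 3x3 conv in their first block.
    for name, planes, stride in [("c2", 64, 1), ("c3", 128, 2),
                                 ("c4", 256, 2), ("c5", 512, 2)]:
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        shapes[name] = (planes * 4, h, w)  # Bottleneck.expansion == 4
    return shapes


shapes = fpn_shapes(600, 900)  # assumed input size, for illustration
for name, shape in shapes.items():
    print(name, shape)
# c2 (256, 150, 225)
# c3 (512, 75, 113)
# c4 (1024, 38, 57)
# c5 (2048, 19, 29)

# Doubling c5's spatial size does not reproduce c4's size exactly.
doubled_c5 = (shapes["c5"][1] * 2, shapes["c5"][2] * 2)
print(doubled_c5, shapes["c4"][1:])  # (38, 58) (38, 57)
```

The outputs p2 to p5 have the same spatial sizes as c2 to c5 but 256 channels each. The last line shows why `_upsample_add` resizes to the lateral map's size instead of using a fixed scale factor of 2: a 19x29 map doubled is 38x58, while c4 is 38x57.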
05
In short, the main purpose of FPN is to pass high-level features downward to supplement the semantics of the low-level features. The high-resolution lower levels of the network then also carry strong semantic features, which helps the detection of small objects.