YOLOF Prequel: Feature Pyramid Network (FPN)

Foreword

Over the past few days I have been reading YOLOF (You Only Look One-level Feature), a paper accepted at CVPR 2021. The paper revisits the feature pyramid network (FPN) in single-stage detectors and argues that FPN succeeds because of its divide-and-conquer solution to the optimization problem in object detection, not because of multi-scale feature fusion. I had often seen feature pyramid structures before but never studied them in depth, so today I will use YOLOF as an occasion to briefly summarize the network structure of FPN.

01

The feature pyramid is an important component of multi-scale object detection, but because of its compute and memory cost, deep learning detectors before FPN deliberately avoided pyramid representations. The FPN paper exploits the inherent multi-scale, multi-level pyramidal hierarchy of deep convolutional networks: using a top-down pathway with lateral connections, it builds high-level semantic feature maps at all scales, which gives the classic feature pyramid structure.

The specific method is actually not difficult to understand:

High-level features, which have low resolution but strong semantics, are merged top-down with low-level features, which have high resolution but weak semantics, so that the features at every scale carry rich semantic information.

02

Of course, FPN is not the only way to build a feature pyramid. Here is a general overview of the common structures:

Featurized image pyramid

A brute-force multi-scale method: the input image is rescaled by several ratios and features are computed at each scale. This does handle multiple scales, but with a fixed input size it is equivalent to training multiple models, and even when the input size is not fixed it increases the memory needed to store the images at different scales.
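
To get a feel for the extra cost of an image pyramid, the pixel count can be tallied per scale. This is only a rough sketch: the scale set `(0.5, 1.0, 2.0)`, the 600x900 input size, and the helper names are assumptions chosen for illustration, and pixel count is only a proxy for memory and compute.

```python
def pyramid_sizes(height, width, scales):
    """Return the (height, width) of the input image at each scale."""
    return [(max(1, round(height * s)), max(1, round(width * s))) for s in scales]


def relative_pixel_cost(height, width, scales):
    """Total pixels over all scales, relative to the single original image."""
    total = sum(h * w for h, w in pyramid_sizes(height, width, scales))
    return total / (height * width)


scales = (0.5, 1.0, 2.0)  # hypothetical scale set, for illustration only
sizes = pyramid_sizes(600, 900, scales)
cost = relative_pixel_cost(600, 900, scales)

print(sizes)  # [(300, 450), (600, 900), (1200, 1800)]
print(cost)   # 5.25
```

Under these assumptions the pyramid stores and processes 5.25 times as many pixels as a single-scale input, which is why this approach is expensive.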

Single feature map

This is essentially the early CNN approach: successive convolutional layers learn increasingly high-level semantic features of the image, and only the final feature map is used for prediction.

Pyramidal feature hierarchy

SSD was one of the first attempts to use the pyramidal feature hierarchy of a CNN. It reuses the multi-scale feature maps already computed in the forward pass, so this form adds no extra cost. However, to avoid using low-level features, SSD gives up the shallow feature maps: it builds the pyramid starting from conv4_3 and adds several new layers on top. Yet those low-level, high-resolution feature maps are very important for detecting small objects.
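
A back-of-the-envelope calculation shows why resolution matters for small objects. The sketch below assumes strides of 4, 8, 16 and 32, which match c2 to c5 of the ResNet-style backbone used in the code later in this post; the 32-pixel object size and the helper name are made up for illustration.

```python
def cells_covered(object_px, strides):
    """Approximate width of an object, in feature-map cells, at each stride."""
    return {stride: object_px / stride for stride in strides}


strides = (4, 8, 16, 32)             # assumed strides of c2..c5
coverage = cells_covered(32, strides)  # hypothetical 32-pixel-wide object

print(coverage)  # {4: 8.0, 8: 4.0, 16: 2.0, 32: 1.0}
```

At stride 32 the object shrinks to a single cell, while at stride 4 it still spans 8 cells, so discarding the shallow maps discards most of the spatial detail available for small objects.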

Feature Pyramid Network

To make natural use of the pyramidal shape of a CNN's feature hierarchy while producing a feature pyramid with strong semantics at all scales, FPN introduces a top-down pathway and lateral connections. This structure fuses high-resolution shallow features with semantically rich deep features, so a feature pyramid with strong semantics at every scale can be built quickly from a single input image at a single scale, without significant extra cost.

03

So, how are the top-down pathway and the lateral connection implemented?

top-down

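def _upsample_add(self, x, y):
    # Resize the top-down map x to the spatial size of the lateral map y, then add them.
    _, _, H, W = y.size()
    # F.upsample is deprecated; F.interpolate is its replacement.
    return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y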

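In the FPN paper, this step uses the simplest form of upsampling: nearest-neighbor, which copies values directly rather than using linear interpolation or deconvolution. Note that the implementation shown here differs on this point: it upsamples bilinearly to the explicit size of the lateral feature map, so that the two maps always match in shape.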
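
The reason for resizing to an explicit target size can be reproduced in a few lines. This is a minimal sketch: the 15x15 input and the 8-channel toy convolution are arbitrary choices, and it relies on `F.interpolate` accepting either `scale_factor` or `size`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A stride-2 convolution standing in for one downsampling stage of the backbone.
conv = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)

low = torch.randn(1, 8, 15, 15)  # higher-resolution map with an odd size
top = conv(low)                  # downsampled map: 15 -> 8

# Doubling the size overshoots: 8 * 2 = 16, not 15.
up_by_factor = F.interpolate(top, scale_factor=2, mode='nearest')

# Resizing to the lateral map's own size always matches.
up_to_size = F.interpolate(top, size=low.shape[-2:], mode='nearest')

print(tuple(top.shape[-2:]))           # (8, 8)
print(tuple(up_by_factor.shape[-2:]))  # (16, 16)
print(tuple(up_to_size.shape[-2:]))    # (15, 15)
```

Adding `up_by_factor` to `low` would fail with a shape mismatch, whereas `up_to_size + low` works. In current PyTorch releases `size=` works with `mode='nearest'` as well as `mode='bilinear'`, so the choice of interpolation mode is separate from the choice to pass an explicit size.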

lateral connection

# Init lateral layers; their only job is to match channel counts
self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
self.latlayer2 = nn.Conv2d( 512, 256, kernel_size=1, stride=1, padding=0)
self.latlayer3 = nn.Conv2d( 256, 256, kernel_size=1, stride=1, padding=0)

# forward
p4 = self._upsample_add(p5, self.latlayer1(c4))
p3 = self._upsample_add(p4, self.latlayer2(c3))
p2 = self._upsample_add(p3, self.latlayer3(c2))

Putting the two pieces together, we can understand the core idea of the paper:

A 2x upsampling brings the high-level semantic features passed down from the layer above to the same spatial size as the low-level feature map coming in through the lateral connection;

A 1x1 convolution unifies the channel counts of the high-level and low-level features, which resolves the channel mismatch in the fusion (element-wise sum).
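
The two steps can be checked on dummy tensors. This is a small sketch, not the full network: the 8x8 and 16x16 spatial sizes are arbitrary, the channel counts (1024 and 256) follow the `latlayer1` definition above, and nearest-neighbor upsampling is used here as in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

p5 = torch.randn(1, 256, 8, 8)     # top-down map, already reduced to 256 channels
c4 = torch.randn(1, 1024, 16, 16)  # backbone map: twice the resolution, 1024 channels

# Step 1: a 1x1 conv on the lateral branch brings 1024 channels down to 256.
lateral = nn.Conv2d(1024, 256, kernel_size=1)
c4_reduced = lateral(c4)

# Step 2: upsample the top-down map to the lateral map's spatial size.
p5_up = F.interpolate(p5, size=c4_reduced.shape[-2:], mode='nearest')

# Step 3: both maps now share shape (1, 256, 16, 16), so they can be summed.
p4 = p5_up + c4_reduced

print(tuple(p4.shape))  # (1, 256, 16, 16)
```

Without the 1x1 convolution the sum would fail on the channel dimension (256 vs. 1024); without the upsampling it would fail on the spatial dimensions (8 vs. 16).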

04

How is the top-down network structure of FPN implemented in code?

'''FPN in PyTorch.

See the paper "Feature Pyramid Networks for Object Detection" for more details.
'''
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class FPN(nn.Module):
    def __init__(self, block, num_blocks):
        super(FPN, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # Bottom-up layers
        self.layer1 = self._make_layer(block,  64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)

        # Top layer
        self.toplayer = nn.Conv2d(2048, 256, kernel_size=1, stride=1, padding=0)  # Reduce channels

        # Smooth layers
        self.smooth1 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

        # Lateral layers
        self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer2 = nn.Conv2d( 512, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer3 = nn.Conv2d( 256, 256, kernel_size=1, stride=1, padding=0)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def _upsample_add(self, x, y):
        '''Upsample and add two feature maps.

        Args:
          x: (Tensor) top feature map to be upsampled.
          y: (Tensor) lateral feature map.

        Returns:
          (Tensor) added feature map.

        Note that in PyTorch, when the input size is odd, the feature map
        upsampled with `F.interpolate(..., scale_factor=2, mode='nearest')`
        may not match the lateral feature map size.

        e.g.
        original input size: [N,_,15,15] ->
        conv2d feature map size: [N,_,8,8] ->
        upsampled feature map size: [N,_,16,16]

        So we upsample bilinearly to the explicit size of the lateral map.
        '''
        _, _, H, W = y.size()
        # F.upsample is deprecated; F.interpolate is its replacement.
        return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y

    def forward(self, x):
        # Bottom-up
        c1 = F.relu(self.bn1(self.conv1(x)))
        c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1)
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        # Top-down
        p5 = self.toplayer(c5)
        p4 = self._upsample_add(p5, self.latlayer1(c4))
        p3 = self._upsample_add(p4, self.latlayer2(c3))
        p2 = self._upsample_add(p3, self.latlayer3(c2))
        # Smooth
        p4 = self.smooth1(p4)
        p3 = self.smooth2(p3)
        p2 = self.smooth3(p2)
        return p2, p3, p4, p5
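
To sanity-check the forward pass, the spatial size of each output level can be predicted with the standard convolution output-size formula. The sketch below is pure arithmetic and assumes the layer hyperparameters from the listing above (a 7x7 stride-2 stem, a 3x3 stride-2 max pool, then three stride-2 stages built on 3x3 convolutions with padding 1); the helper names and the 600x900 input are chosen only for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def fpn_level_sizes(height, width):
    """Predicted (H, W) of p2..p5 for the FPN listing above."""
    def stem(size):
        size = conv_out(size, kernel=7, stride=2, padding=3)   # conv1
        return conv_out(size, kernel=3, stride=2, padding=1)   # max pool

    sizes = [(stem(height), stem(width))]  # c2: layer1 has stride 1
    for _ in range(3):                     # layer2..layer4 each halve the size
        h, w = sizes[-1]
        sizes.append((conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)))
    # The lateral, top and smooth convs keep the spatial size, so p_i matches c_i.
    return dict(zip(("p2", "p3", "p4", "p5"), sizes))


levels = fpn_level_sizes(600, 900)
print(levels)
# {'p2': (150, 225), 'p3': (75, 113), 'p4': (38, 57), 'p5': (19, 29)}
```

Every level has 256 channels, so for a 1x3x600x900 input the network above, built for example as `FPN(Bottleneck, [2, 2, 2, 2])`, should return tensors of shape (1, 256, H, W) with these sizes. Note the odd sizes at p3 and p5 (113 and 29), which are exactly the cases where upsampling by a fixed factor of 2 would not line up.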

05

In short, the main purpose of FPN is to pass high-level features downward to supplement the semantics of the low-level features, so that strongly semantic features are available at the high-resolution lower levels of the network, which helps detect small objects.
