Optical Flow Estimation (2): Interpreting the FlowNet Series of Papers

        In the previous article we went through the basic concepts and operations of optical flow, but traditional optical flow estimation methods are computationally complex and expensive. In recent years, with the continuous development and maturation of convolutional neural networks (CNNs), deep learning has achieved great success in a wide range of computer vision tasks (mostly recognition-related ones). The FlowNet series of papers therefore combines optical flow estimation with CNN-based deep learning: for the first time, a CNN is applied to optical flow prediction, so that the network can predict the optical flow field directly from a pair of images at a rate of 5 to 10 frames per second, with accuracy on par with existing methods.

1. FlowNet

        FlowNet (or FlowNet 1.0) is the first optical flow estimation network of the FlowNet series, and also the most important and fundamental one. It comes from the paper "FlowNet: Learning Optical Flow with Convolutional Networks", published at the IEEE International Conference on Computer Vision (ICCV), 2015.

        This article is the first to propose an end-to-end neural network for learning optical flow estimation, built as an encoder/decoder structure: the information is first spatially compressed and features are extracted in the contracting part of the network, and the result is then refined in the expanding part. The input of the FlowNet network is an image pair (two adjacent frames) together with its corresponding ground-truth optical flow. In addition, since the existing optical flow datasets were not large enough to train such a network, the article also synthesizes a dataset of rendered chair images called Flying Chairs, which proved very effective during network training.

1. Network structure

        Depending on the structure of the encoding/contracting part, the article divides FlowNet into two variants, FlowNet-Simple (FlowNet-S) and FlowNet-Correlation (FlowNet-C), which share the same decoding/expanding part. We introduce them in turn below.

1.1 Contracting (encoding) part

(1) FlowNet-S

        FlowNetSimple is the simplest, most straightforward structure. The network directly stacks (concatenates) the two input frames along the channel dimension and then applies a series of convolutional layers to downsample and extract features, letting the network itself decide how to process the image pair and extract motion information from it. The network structure is shown in the figure above.
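
        To make "stacking along the channel dimension" concrete, here is a minimal sketch (the tensors are random stand-ins for two RGB frames):

import torch

# two adjacent RGB frames, shape (batch, 3, H, W)
img1 = torch.randn(1, 3, 384, 512)
img2 = torch.randn(1, 3, 384, 512)

# FlowNet-S simply concatenates them along the channel axis, producing a
# 6-channel input (batch, 6, H, W) for the first convolution layer
x = torch.cat([img1, img2], dim=1)
print(x.shape)  # torch.Size([1, 6, 384, 512])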

(2) FlowNet-C

        FlowNetCorrelation differs from FlowNet-S in that FlowNet-C first builds two separate but identical processing streams for the two input images, extracting a meaningful feature representation of each image through a series of convolutional layers. These feature representations are then combined at a higher level for the subsequent downsampling stages. The network structure is shown in the figure above.

        However, the combination of the high-level features of the two images is not a simple channel-wise stacking; instead, a correlation layer is introduced to assist the network's matching process. Recall that the ultimate goal of the network is to predict the optical flow between two images, and optical flow is essentially a matching relationship between corresponding positions in different images. To "help" the network compute this matching relationship faster and estimate it more accurately, a "correlation layer" is used to combine the features. The correlation operation (explained in detail later) computes the correlation between different features/image patches, and the resulting value reflects how well they match. Combining the high-level features through this explicit matching computation provides strong guidance for the network's subsequent learning of the matching relationship.

        In summary, FlowNet-C first extracts features from the two images and then compares these feature vectors, much like standard matching methods do, explicitly imitating the classical matching process.

1.2 Expanding (decoding) part

        The decoding/expanding stage (shown in the figure above) mainly consists of multiple upsampling operations that restore the spatial resolution and the lost detail. To better fuse semantic information from different layers, the input of each decoder layer contains not only the output of the previous decoder layer, but also the "optical flow" predicted from that previous output and the features from the corresponding encoder layer. In this way we preserve both the high-level information carried by the coarser feature maps and the fine local information provided by the lower-layer feature maps.
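
        As a distilled sketch of a single refinement step (channel sizes follow the FlowNet-C code in Section 4; the layer and variable names here are only illustrative):

import torch
import torch.nn as nn

deconv5       = nn.ConvTranspose2d(1024, 512, kernel_size=4, stride=2, padding=1)  # upsample decoder features
predict_flow6 = nn.Conv2d(1024, 2, kernel_size=3, padding=1)                       # coarse flow at 1/64 resolution
upsample_flow = nn.ConvTranspose2d(2, 2, kernel_size=4, stride=2, padding=1)       # upsample the coarse flow

out_conv6 = torch.randn(1, 1024, 6, 8)    # coarsest encoder output (1/64)
out_conv5 = torch.randn(1, 512, 12, 16)   # encoder skip connection (1/32)

flow6 = predict_flow6(out_conv6)                      # predict flow at the coarsest scale
concat5 = torch.cat([out_conv5,                       # fine local detail from the encoder
                     deconv5(out_conv6),              # upsampled decoder features
                     upsample_flow(flow6)], dim=1)    # upsampled coarse flow prediction
print(concat5.shape)  # torch.Size([1, 1026, 12, 16]) -> input of the next decoder step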

        Each upsampling doubles the resolution, and the process is repeated 4 times, so the predicted flow still has a resolution 4 times smaller than the input image; it can be restored to full size by bilinear interpolation afterwards. The article found that, compared with this computationally cheap bilinear upsampling to the full image resolution, continuing to refine with further network layers from this resolution does not improve the results much, so the prediction is simply bilinearly interpolated to obtain an optical flow map at the same resolution as the input.
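
        A minimal sketch of this final bilinear upsampling (the flow map here is a random stand-in at 1/4 of an assumed 384 x 512 input; the same operation appears in the realEPE helper in Section 4):

import torch
import torch.nn.functional as F

flow2 = torch.randn(1, 2, 96, 128)                  # network output at 1/4 resolution
full_flow = F.interpolate(flow2, size=(384, 512),   # restore the input resolution
                          mode='bilinear', align_corners=False)
print(full_flow.shape)  # torch.Size([1, 2, 384, 512])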

2. Detailed explanation of Correlation layer

        The correlation operation is a matching computation whose result indicates how well two image patches match. Unlike FlowNetS, FlowNetC does not simply stack the input images together: it explicitly gives the network guidance on how to match image details by combining the high-level features extracted from the two images, and for this purpose it introduces the correlation layer.

        The correlation operation is essentially a single convolution-like step, but instead of convolving the data with a learned filter as in a CNN, one piece of data (a patch of image1) is convolved with another piece of data (a patch of image2), so the operation has no trainable parameters. For a square patch of radius k (side length 2k+1) centered at x1 in image1 (of size w x h with c channels), and the patch of radius k centered at x2 in image2, a single correlation can be expressed as:

        c(x1, x2) = Σ_{o ∈ [-k, k] × [-k, k]} ⟨ f1(x1 + o), f2(x2 + o) ⟩

where f1 and f2 denote the two (multi-channel) maps being compared and ⟨·,·⟩ is the channel-wise inner product.

        In effect, a patch of the first image is "convolved" with a patch of the second image (which amounts to an inner product), and one correlation computation produces a single number indicating how well the two patches match. For the patch around a given pixel x1 in image1, it should in principle be matched against the patches around all pixels of image2 (w*h patches in total), so x1 yields a matching vector of length w*h; for the whole of image1 and image2, the correlation result is therefore four-dimensional (of size w x h x w x h).
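
        To make this definition concrete, here is a brute-force sketch for the k = 0 case (each "patch" is a single pixel, so one correlation is just a channel-wise inner product); the function name is illustrative and the code only demonstrates the 4-dimensional result, it is not meant to be efficient:

import torch

def naive_correlation(f1, f2):
    # f1, f2: feature maps of shape (C, H, W); with k = 0 a patch is a single pixel
    # inner product over channels between every position of f1 and every position of f2
    return torch.einsum('chw,cij->hwij', f1, f2)   # shape (H, W, H, W): the 4D cost volume

f1 = torch.randn(256, 24, 32)
f2 = torch.randn(256, 24, 32)
print(naive_correlation(f1, f2).shape)  # torch.Size([24, 32, 24, 32])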

        A single correlation involves c*K^2 multiplications (with K = 2k+1). In theory, the patch of every pixel in image1 has to be matched against every patch in image2, which requires w^2 * h^2 such correlation computations. The problem is that this amount of computation is enormous, so the process needs to be optimized.

        The article assumes that the displacement of corresponding pixels only lies within a fixed range. When actually computing the correlation, the model then only needs to maintain a fixed-size search window, and anything outside that window is ignored. Given a maximum search range d, for each position x1 we only compute the correlation c(x1, x2) for x2 inside a square search window of side D = 2d + 1. In addition, the article uses strides s1 and s2: x1 is sampled with stride s1 over the whole image, and x2 is sampled with stride s2 within the neighborhood centered on x1. After this optimization, the patch around a pixel x1 in image1 is only matched against the patches inside its search window in image2 (D*D patches when s2 = 1), so x1 produces a matching vector of that length, and for the whole image pair the correlation result can be stored as a tensor of shape (w, h, D^2). The original paper sets k = 0, d = 20, s1 = 1, s2 = 2; with s2 = 2 only every second displacement is sampled, which yields a 21 x 21 = 441-channel correlation output and, together with the 32 channels of conv_redir, explains the 473-channel input of conv3_1 in the code below. The correlation layer itself is implemented in C++/CUDA; some key source code follows:

template<typename scalar_t>
//forward pass of a single Correlation Operation
__global__ void correlation_forward(scalar_t* __restrict__ output, const int nOutputChannels,
                const int outputHeight, const int outputWidth, const scalar_t* __restrict__ rInput1,
                const int nInputChannels, const int inputHeight, const int inputWidth,
                const scalar_t* __restrict__ rInput2, const int pad_size, const int kernel_size,
                const int max_displacement, const int stride1, const int stride2) {

        int32_t pInputWidth = inputWidth + 2 * pad_size;
        int32_t pInputHeight = inputHeight + 2 * pad_size;

        int32_t kernel_rad = (kernel_size - 1) / 2;

        int32_t displacement_rad = max_displacement / stride2;

        int32_t displacement_size = 2 * displacement_rad + 1;

        int32_t n = blockIdx.x;
        int32_t y1 = blockIdx.y * stride1 + max_displacement;
        int32_t x1 = blockIdx.z * stride1 + max_displacement;
        int32_t c = threadIdx.x;

        int32_t pdimyxc = pInputHeight * pInputWidth * nInputChannels;

        int32_t pdimxc = pInputWidth * nInputChannels;

        int32_t pdimc = nInputChannels;

        int32_t tdimcyx = nOutputChannels * outputHeight * outputWidth;
        int32_t tdimyx = outputHeight * outputWidth;
        int32_t tdimx = outputWidth;

        int32_t nelems = kernel_size * kernel_size * pdimc;

        // element-wise product along channel axis
        for (int tj = -displacement_rad; tj <= displacement_rad; ++tj) {
                for (int ti = -displacement_rad; ti <= displacement_rad; ++ti) {
                        //get center x2,y2 in image2
                        int x2 = x1 + ti * stride2;
                        int y2 = y1 + tj * stride2;

                        float acc0 = 0.0f;

                        for (int j = -kernel_rad; j <= kernel_rad; ++j) {
                                for (int i = -kernel_rad; i <= kernel_rad; ++i) {
                                        // THREADS_PER_BLOCK
                                        #pragma unroll
                                        for (int ch = c; ch < pdimc; ch += blockDim.x) {

                                                int indx1 = n * pdimyxc + (y1 + j) * pdimxc
                                                                + (x1 + i) * pdimc + ch;
                                                int indx2 = n * pdimyxc + (y2 + j) * pdimxc
                                                                + (x2 + i) * pdimc + ch;
                                                acc0 += static_cast<float>(rInput1[indx1] * rInput2[indx2]);
                                        }
                                }
                        }

                        if (blockDim.x == warpSize) {
                            __syncwarp();
                            acc0 = warpReduceSum(acc0);
                        } else {
                            __syncthreads();
                            acc0 = blockReduceSum(acc0);
                        }

                        if (threadIdx.x == 0) {

                                int tc = (tj + displacement_rad) * displacement_size
                                                + (ti + displacement_rad);
                                const int tindx = n * tdimcyx + tc * tdimyx + blockIdx.y * tdimx
                                                + blockIdx.z;
                                output[tindx] = static_cast<scalar_t>(acc0 / nelems);
                        }
            }
        }
}

        The original C++/CUDA implementation is somewhat cumbersome to use directly, so there are also third-party Python packages for the correlation computation that can simply be imported, such as spatial_correlation_sampler (which provides the spatial_correlation_sample function used below).

(1) spatial_correlation_sampler / spatial_correlation_sample

input (B x C x H x W) -> output (B x PatchH x PatchW x oH x oW)

  • The patch size patch_size represents the full side length of the square patch, not just its radius.
  • stride1 is now stride, and stride2 is now dilation_patch, which behaves like a dilated convolution.
  • The equivalent max_displacement is dilation_patch * (patch_size - 1) / 2.
  • To get the parameters matching FlowNetC, you need to set
    kernel_size=1,
    patch_size=21,
    stride=1,
    padding=0,
    dilation_patch=2
import torch
import torch.nn.functional as F
from spatial_correlation_sampler import spatial_correlation_sample

def correlate(input1, input2):
    out_corr = spatial_correlation_sample(input1,
                                          input2,
                                          kernel_size=1,
                                          patch_size=21,
                                          stride=1,
                                          padding=0,
                                          dilation_patch=2)
    # collate dimensions 1 and 2 in order to be treated as a
    # regular 4D tensor
    b, ph, pw, h, w = out_corr.size()
    out_corr = out_corr.view(b, ph * pw, h, w)/input1.size(1)
    return F.leaky_relu_(out_corr, 0.1)
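
        For example, applied to the conv3 feature maps of the two images in FlowNet-C (shapes assumed for a 384 x 512 input, with the spatial_correlation_sampler package installed), the result is a 441-channel correlation volume:

# correlation of the two conv3 feature maps (1/8 resolution, 256 channels)
feat_a = torch.randn(1, 256, 48, 64)
feat_b = torch.randn(1, 256, 48, 64)
corr = correlate(feat_a, feat_b)
print(corr.shape)  # torch.Size([1, 441, 48, 64])  (441 = 21 * 21 displacement bins)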

3. More implementation details

        FlowNet-S and FlowNet-C have roughly the same overall structure: the contracting part consists of 9 convolutional layers, 6 of which use a stride of 2, and each convolution is followed by a ReLU nonlinearity. The kernel size is 7×7 in the first convolutional layer, 5×5 in the next two, and 3×3 from the fourth layer onwards. The number of feature-map channels roughly doubles after each layer with stride 2.
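
        As a sketch, the FlowNet-S contracting part described above can be written as the following stack (6 input channels because the two frames are stacked; channel counts follow the code in Section 4, which uses LeakyReLU(0.1) as the nonlinearity):

import torch.nn as nn

def conv_block(in_c, out_c, k=3, s=1):
    # convolution + nonlinearity, padding chosen to preserve the size before striding
    return nn.Sequential(nn.Conv2d(in_c, out_c, k, stride=s, padding=(k - 1) // 2),
                         nn.LeakyReLU(0.1, inplace=True))

flownet_s_encoder = nn.Sequential(
    conv_block(6,     64, k=7, s=2),   # conv1   : 7x7, stride 2
    conv_block(64,   128, k=5, s=2),   # conv2   : 5x5, stride 2
    conv_block(128,  256, k=5, s=2),   # conv3   : 5x5, stride 2
    conv_block(256,  256, k=3, s=1),   # conv3_1 : 3x3
    conv_block(256,  512, k=3, s=2),   # conv4   : 3x3, stride 2
    conv_block(512,  512, k=3, s=1),   # conv4_1 : 3x3
    conv_block(512,  512, k=3, s=2),   # conv5   : 3x3, stride 2
    conv_block(512,  512, k=3, s=1),   # conv5_1 : 3x3
    conv_block(512, 1024, k=3, s=2),   # conv6   : 3x3, stride 2 -> 1/64 resolution
)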

        The network is trained with the endpoint error (EPE), the standard error metric for optical flow estimation: the Euclidean distance between the predicted flow vector and the corresponding ground truth, averaged over all pixels.
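
        In formula form, for N pixels with predicted flow (u_i, v_i) and ground truth (u_i*, v_i*):

        EPE = (1/N) * Σ_i sqrt( (u_i - u_i*)^2 + (v_i - v_i*)^2 )

which is exactly what the EPE() helper in the training code below computes with torch.norm.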

        The article chooses Adam as the gradient-descent optimizer, with its parameters fixed to β1 = 0.9 and β2 = 0.999, and uses mini-batches of 8 image pairs. As for the learning rate, FlowNet-C starts training with a low learning rate of λ = 1e−6, which is slowly increased after 10k iterations until it reaches λ = 1e−4; after the first 300k iterations the learning rate is then divided by 2 every 100k iterations.
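
        The per-epoch MultiStepLR used in the training code below is a simplification; one way to approximate the schedule described here, stepping once per iteration, is a LambdaLR (the exact shape of the warm-up ramp is our assumption, the paper only says it increases "slowly"):

import torch

model = torch.nn.Conv2d(6, 2, 3, padding=1)   # stand-in for the flow network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

def lr_multiplier(it):
    if it < 10000:                                   # hold 1e-6 for the first 10k iterations
        return 0.01
    if it < 20000:                                   # then ramp up towards 1e-4 (assumed linear)
        return 0.01 + 0.99 * (it - 10000) / 10000
    if it < 300000:                                  # keep 1e-4 until 300k iterations
        return 1.0
    return 0.5 ** ((it - 300000) // 100000 + 1)      # halve every further 100k iterations

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
# call scheduler.step() once per training iteration, after optimizer.step()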

        The article also found that enlarging the input image at test time can improve performance: FlowNetS is not enlarged, while for FlowNetC the input is enlarged by a factor of 1.25. Since the datasets differ greatly in object types, motion statistics and so on, the standard recipe is to fine-tune the network and its parameters on the target dataset.
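
        As a hedged sketch of what that test-time enlargement could look like (the bookkeeping that maps the predicted flow back to the original resolution and units is ours, not code from the paper, and the enlarged size should stay a multiple of 64 so that the encoder/decoder feature maps still line up):

import torch
import torch.nn.functional as F

def predict_with_upscaling(model, image_pair, scale=1.25):
    # image_pair: (B, 6, H, W) tensor with the two frames stacked along the channel axis
    b, c, h, w = image_pair.shape
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    enlarged = F.interpolate(image_pair, size=(new_h, new_w),
                             mode='bilinear', align_corners=False)
    flow = model(enlarged)   # flow predicted on the enlarged input
    # resize the flow back to the original resolution and rescale the vectors accordingly
    flow = F.interpolate(flow, size=(h, w), mode='bilinear', align_corners=False) / scale
    return flow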

4. Key code implementation (take FlowNet-C as an example)

(1) Definition of network structure

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.init import kaiming_normal_, constant_
from spatial_correlation_sampler import spatial_correlation_sample

# downsampling convolution block
def conv(batchNorm, in_planes, out_planes, kernel_size=3, stride=1):
    if batchNorm:
        return nn.Sequential(
            nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.LeakyReLU(0.1,inplace=True)
        )
    else:
        return nn.Sequential(
            nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, bias=True),
            nn.LeakyReLU(0.1,inplace=True)
        )

# upsampling transposed-convolution block
def deconv(in_planes, out_planes):
    return nn.Sequential(
        nn.ConvTranspose2d(in_planes, out_planes, kernel_size=4, stride=2, padding=1, bias=False),
        nn.LeakyReLU(0.1,inplace=True)
    )

# output layer for optical flow prediction
def predict_flow(in_planes):
    return nn.Conv2d(in_planes,2,kernel_size=3,stride=1,padding=1,bias=False)

# correlation computation, using the third-party spatial_correlation_sampler package
def correlate(input1, input2):
    out_corr = spatial_correlation_sample(input1,
                                          input2,
                                          kernel_size=1,
                                          patch_size=21,
                                          stride=1,
                                          padding=0,
                                          dilation_patch=2)
    # collate dimensions 1 and 2 in order to be treated as a
    # regular 4D tensor
    b, ph, pw, h, w = out_corr.size()
    out_corr = out_corr.view(b, ph * pw, h, w)/input1.size(1)
    return F.leaky_relu_(out_corr, 0.1)

class FlowNetC(nn.Module):
    expansion = 1

    def __init__(self,batchNorm=True):
        super(FlowNetC,self).__init__()
        self.batchNorm = batchNorm
        # feature-extraction stream for each image (identical for both inputs)
        self.conv1      = conv(self.batchNorm,   3,   64, kernel_size=7, stride=2)
        self.conv2      = conv(self.batchNorm,  64,  128, kernel_size=5, stride=2)
        self.conv3      = conv(self.batchNorm, 128,  256, kernel_size=5, stride=2)
        self.conv_redir = conv(self.batchNorm, 256,   32, kernel_size=1, stride=1)
        # remaining downsampling stream of the contracting part (after the correlation)
        self.conv3_1 = conv(self.batchNorm, 473,  256)
        self.conv4   = conv(self.batchNorm, 256,  512, stride=2)
        self.conv4_1 = conv(self.batchNorm, 512,  512)
        self.conv5   = conv(self.batchNorm, 512,  512, stride=2)
        self.conv5_1 = conv(self.batchNorm, 512,  512)
        self.conv6   = conv(self.batchNorm, 512, 1024, stride=2)
        self.conv6_1 = conv(self.batchNorm,1024, 1024)
        # upsampling stream of the expanding part
        self.deconv5 = deconv(1024,512)
        self.deconv4 = deconv(1026,256)
        self.deconv3 = deconv(770,128)
        self.deconv2 = deconv(386,64)
        # flow-prediction layers of the expanding part
        self.predict_flow6 = predict_flow(1024)
        self.predict_flow5 = predict_flow(1026)
        self.predict_flow4 = predict_flow(770)
        self.predict_flow3 = predict_flow(386)
        self.predict_flow2 = predict_flow(194)
        # flow upsampling operations
        self.upsampled_flow6_to_5 = nn.ConvTranspose2d(2, 2, 4, 2, 1, bias=False)
        self.upsampled_flow5_to_4 = nn.ConvTranspose2d(2, 2, 4, 2, 1, bias=False)
        self.upsampled_flow4_to_3 = nn.ConvTranspose2d(2, 2, 4, 2, 1, bias=False)
        self.upsampled_flow3_to_2 = nn.ConvTranspose2d(2, 2, 4, 2, 1, bias=False)

        for m in self.modules():
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d):
                kaiming_normal_(m.weight, 0.1)
                if m.bias is not None:
                    constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                constant_(m.weight, 1)
                constant_(m.bias, 0)

    def forward(self, x):
        x1 = x[:,:3]
        x2 = x[:,3:]
        # 1. extract the high-level features of image1: (batch,3,h,w) -> (batch,256,h/8,w/8)
        out_conv1a = self.conv1(x1)
        out_conv2a = self.conv2(out_conv1a)
        out_conv3a = self.conv3(out_conv2a)
        # 2. extract the high-level features of image2: (batch,3,h,w) -> (batch,256,h/8,w/8)
        out_conv1b = self.conv1(x2)
        out_conv2b = self.conv2(out_conv1b)
        out_conv3b = self.conv3(out_conv2b)
        # 3. reduce the image1 features for fusion with the correlation output: (batch,256,h/8,w/8) -> (batch,32,h/8,w/8)
        out_conv_redir = self.conv_redir(out_conv3a)
        # 4. correlation matching: (batch,D*D,h/8,w/8) with D*D = 441
        out_correlation = correlate(out_conv3a,out_conv3b)
        # 5. concatenate the reduced features and the correlation along the channel axis: (batch,473,h/8,w/8)
        in_conv3_1 = torch.cat([out_conv_redir, out_correlation], dim=1)
        # 6. further downsampling: (batch,473,h/8,w/8) -> (batch,1024,h/64,w/64)
        out_conv3 = self.conv3_1(in_conv3_1)
        out_conv4 = self.conv4_1(self.conv4(out_conv3))
        out_conv5 = self.conv5_1(self.conv5(out_conv4))
        out_conv6 = self.conv6_1(self.conv6(out_conv5))
        # 7. refinement: upsampling / expanding part
        #   (1) upconv1: predict the flow at this scale (flow6) and upsample
        #       - flow6 (batch,2,h/64,w/64)
        #       - out_deconv5 (batch,512,h/32,w/32)
        flow6 = self.predict_flow6(out_conv6)
        flow6_up = self.upsampled_flow6_to_5(flow6)
        out_deconv5 = self.deconv5(out_conv6)
        concat5 = torch.cat((out_conv5, out_deconv5, flow6_up), 1) # input to the next stage = this stage's deconv output + encoder skip features + upsampled predicted flow
        #   (2) upconv2: predict the flow at this scale (flow5) and upsample
        #       - flow5 (batch,2,h/32,w/32)
        #       - out_deconv4 (batch,256,h/16,w/16)
        flow5 = self.predict_flow5(concat5)
        flow5_up = self.upsampled_flow5_to_4(flow5)
        out_deconv4 = self.deconv4(concat5)
        concat4 = torch.cat((out_conv4, out_deconv4, flow5_up), 1)
        #   (3) upconv3: predict the flow at this scale (flow4) and upsample
        #       - flow4 (batch,2,h/16,w/16)
        #       - out_deconv3 (batch,128,h/8,w/8)
        flow4 = self.predict_flow4(concat4)
        flow4_up = self.upsampled_flow4_to_3(flow4)
        out_deconv3 = self.deconv3(concat4)
        concat3 = torch.cat((out_conv3, out_deconv3, flow4_up), 1)
        #   (4) upconv4: predict the flow at this scale (flow3) and upsample
        #       - flow3 (batch,2,h/8,w/8)
        #       - out_deconv2 (batch,64,h/4,w/4)
        flow3 = self.predict_flow3(concat3)
        flow3_up = self.upsampled_flow3_to_2(flow3)
        out_deconv2 = self.deconv2(concat3)
        concat2 = torch.cat((out_conv2a, out_deconv2, flow3_up), 1)
        # final predicted flow flow2: (batch,2,h/4,w/4)
        flow2 = self.predict_flow2(concat2)

        if self.training:
            return flow2,flow3,flow4,flow5,flow6
        else:
            return flow2

    def weight_parameters(self):
        return [param for name, param in self.named_parameters() if 'weight' in name]

    def bias_parameters(self):
        return [param for name, param in self.named_parameters() if 'bias' in name]
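
        A quick smoke test of the class above (the spatial_correlation_sampler package must be installed, and input sizes must be multiples of 64 so that the skip connections line up; 512 x 384 is the Flying Chairs image size):

model = FlowNetC(batchNorm=True)
model.eval()

img_pair = torch.randn(1, 6, 384, 512)   # two RGB frames stacked along the channel axis
with torch.no_grad():
    flow2 = model(img_pair)
print(flow2.shape)  # torch.Size([1, 2, 96, 128]) -> 1/4 of the input resolution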

(2) Network training

import torch
import torch.nn.functional as F

# EPE Loss
def EPE(input_flow, target_flow):
    return torch.norm(target_flow-input_flow,p=2,dim=1).mean()

def realEPE(output, target):
    b, _, h, w = target.size()
    upsampled_output = F.interpolate(output, (h,w), mode='bilinear', align_corners=False)
    return EPE(upsampled_output, target)

# multi-scale training loss (the EPE of flow2~flow6 is summed with different weights)
def multiscaleEPE(network_output, target_flow, weights=None):
    def one_scale(output, target):
        b, _, h, w = output.size()
        # interpolate the target so that it matches the (smaller) spatial size of this output scale
        target_scaled = F.interpolate(target, (h, w), mode='area')
        return EPE(output, target_scaled)

    if weights is None:
        # default per-scale weights, from the finest scale (flow2) to the coarsest (flow6)
        weights = [0.005, 0.01, 0.02, 0.08, 0.32]
    loss = 0
    for output, weight in zip(network_output, weights):
        loss += weight * one_scale(output, target_flow)
    return loss

# main framework of network training
def run(train_loader,val_loader,model):
    best_EPE = -1
    # define the Adam optimizer
    param_groups = [{'params': model.bias_parameters(), 'weight_decay': args.bias_decay},
                    {'params': model.weight_parameters(), 'weight_decay': args.weight_decay}]
    optimizer = torch.optim.Adam(params=param_groups,lr=0.0001,betas=(0.9, 0.999))
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), amsgrad=False)
    # define the learning-rate schedule: lr_scheduler.MultiStepLR
    # milestones : epochs at which learning rate is divided by 2
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100,150,200], gamma=0.5)

    for epoch in range(args.start_epoch, args.epochs):
        # train for one epoch
        train_loss, train_EPE = train(train_loader, model, optimizer, epoch)
        # step the per-epoch learning-rate schedule (after the epoch's optimizer updates)
        scheduler.step()
        print(train_loss,train_EPE)
        # evaluate on validation set
        with torch.no_grad():
            val_EPE = validate(val_loader, model, epoch)

        if best_EPE < 0:
            best_EPE = val_EPE
        best_EPE = min(val_EPE, best_EPE)

# training of a single epoch
def train(train_loader, model, optimizer, epoch):
    global n_iter, args
    # training weight for each scale, from highest resolution (flow2) to lowest (flow6)
    multiscale_weights = [0.005, 0.01, 0.02, 0.08, 0.32]
    # value by which flow will be divided. Original value is 20 but 1 with batchNorm gives good results
    div_flow = 20.0
    losses = 0.0
    flow2_EPEs = 0.0

    epoch_size = len(train_loader) if args.epoch_size == 0 else min(len(train_loader), args.epoch_size)

    # switch to train mode
    model.train()

    for i, (input, target) in enumerate(train_loader):
        target = target.to(device)
        input = torch.cat(input,1).to(device)

        # compute output
        output = model(input)
        # compute loss
        loss = multiscaleEPE(output, target, weights=multiscale_weights) # multi-scale training loss
        flow2_EPE = div_flow * realEPE(output[0], target) # EPE of the final flow2 output alone
        # record loss and EPE
        losses += loss.item()
        flow2_EPEs += flow2_EPE.item()

        # compute gradient and do optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        n_iter += 1
        if i >= epoch_size:
            break

    return losses, flow2_EPEs

2. FlowNet 2.0 and its follow-up

        Compared with FlowNet, FlowNet 2.0 is characterized by stacking several FlowNetC/FlowNetS sub-networks into a larger architecture that gradually refines the estimated flow and obtains better results; the network structure is shown in the figure below. Later, as deep learning for optical flow developed further, a large number of new networks and optimization ideas emerged, such as PWC-Net, which we will explain and analyze in subsequent articles.

Origin blog.csdn.net/qq_40772692/article/details/128752525