[youcans Hands-On Models] Object Detection: The OverFeat Model

Welcome to the "youcans Hands-On Models" series.
The content and resources of this column are synced to GitHub/youcans.


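This article introduces the OverFeat model for object detection and gives a PyTorch implementation.

1. The OverFeat convolutional neural network model

OverFeat is a convolutional-network-based image feature extractor and classifier. It took first place in the localization task of the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013).

Pierre Sermanet, Yann LeCun, et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

[Paper]: https://arxiv.org/abs/1312.6229
[GitHub]: [OverFeat_pytorch] https://github.com/BeopGyu/OverFeat_pytorch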


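1.1 Paper abstract

OverFeat is a feature extractor.

The paper presents an integrated framework for classification, localization and detection with convolutional networks: a single network is shared across these different vision tasks.

The authors show how a multi-scale sliding-window approach can be implemented efficiently within a convolutional network. Localization is done by learning to predict object boundaries, and bounding boxes are accumulated rather than suppressed in order to increase detection confidence.

The integrated framework won the localization task of the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013) and obtained very competitive results on the detection and classification tasks.

1.2 Technical background

Three basic tasks in computer vision:

(1) Classification: given an image, assign it a label that identifies its category (top-5 predictions).

(2) Localization: given an image, identify its category (top-5 predictions) and also regress a bounding box (bbox) for the object.

(3) Detection: given an image containing one or more objects (possibly none), find the bounding boxes of all objects and identify their categories, while also accounting for false positives (FP).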


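The main advantage of convolutional networks for image classification is that they are trained end to end on the images, with no need for hand-designed feature extractors such as SIFT or HOG. The main drawback is that training requires a large labeled dataset.

ImageNet images mostly contain a roughly centered object that fills much of the image, but object size and position still vary considerably between images, and small objects near the borders or corners do occur. There are several ways to address this:

(1) Slide several fixed-size windows over the image and classify each window with a CNN. The drawback is that a window may contain only part of the object (for example, a dog's head) rather than the whole object. This gives good classification but poor localization and detection.

(2) Train a convolutional network that not only classifies the image but also predicts the bounding box of the object.

(3) Accumulate the confidence of each class at every location and scale.

1.3 Basic method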


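Model design

The basic structure of OverFeat is similar to AlexNet. The layer-by-layer configuration is listed in the model summaries in Section 2.

OverFeat comes in two variants: a fast model and an accurate model.

Fast model
Layers 1-5 are convolutional layers that form the feature extractor; layers 6-8 are fully connected layers that form the classifier.
Accurate model
Layers 1-6 are convolutional layers that form the feature extractor; layers 7-9 are fully connected layers that form the classifier.

Multi-scale classification

OverFeat is trained in much the same way as AlexNet, but at test time it uses test images at 6 different scales and combines the predictions by multi-scale, multi-view voting to improve performance.

The approach in [15] averages over a fixed set of 10 views (the four corners and the center, plus their horizontal flips). However, this ignores many regions of the image and is computationally redundant where the views overlap. It is also applied at a single scale only, which may not be the scale at which the network responds with the highest confidence.

OverFeat instead explores the whole image by running the network densely at every location and at multiple scales. A sliding-window approach is computationally expensive for some types of model, but it is very efficient for convolutional networks. This yields many more views for voting and improves robustness while remaining efficient.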
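
The voting step can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: it assumes each scale yields a grid of per-class scores, takes the spatial maximum of each class at each scale, averages those vectors over the scales, and returns the top-k classes. The function name `multiscale_vote` and the toy scores are hypothetical.

```python
from typing import List

# One score map per scale: score_map[y][x] is a list of per-class scores
ScoreMap = List[List[List[float]]]


def multiscale_vote(score_maps: List[ScoreMap], top_k: int = 5) -> List[int]:
    """Combine dense per-class scores from several scales into top-k class ids."""
    num_classes = len(score_maps[0][0][0])
    averaged = [0.0] * num_classes
    for score_map in score_maps:
        for c in range(num_classes):
            # Spatial max: the best response of class c anywhere at this scale
            best = max(cell[c] for row in score_map for cell in row)
            averaged[c] += best / len(score_maps)
    # Rank classes by averaged score, highest first
    ranked = sorted(range(num_classes), key=lambda c: averaged[c], reverse=True)
    return ranked[:top_k]


# Toy example: 3 classes, a 1x1 map at the small scale and a 1x2 map at the large one
small = [[[0.1, 0.7, 0.2]]]
large = [[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]]
print(multiscale_vote([small, large], top_k=2))  # [1, 0]
```

Larger inputs produce larger score maps, so they contribute more candidate views to the spatial maximum without any change to the code.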


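Sliding window

Convolutional networks are naturally suited to efficient sliding-window computation.

Unlike sliding-window approaches that run an entire pipeline for every input window, a convolutional network is inherently efficient when applied in sliding-window fashion, because the computations common to overlapping regions are shared.

The last layers of the network are fully connected linear layers. At test time they are replaced by convolutions: the first fully connected layer becomes a convolution whose kernel covers the whole feature map it was trained on (6*6 for the fast model and 5*5 for the accurate model in the code of Section 2), and the remaining ones become 1*1 pointwise convolutions. The whole network is then a fully convolutional network (FCN) that consists only of convolution, pooling and thresholding (ReLU) operations.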
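
The equivalence between a fully connected layer and a convolution can be checked numerically. The sketch below uses deliberately small, made-up sizes (8 channels, a 6x6 feature map, 16 hidden units) in place of the real 1024, 6 and 3072; it copies the weights of an `nn.Linear` into an `nn.Conv2d` and compares the outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

channels, size, hidden = 8, 6, 16  # small stand-ins for 1024, 6 and 3072

# Fully connected layer acting on a flattened (channels x size x size) feature map
fc = nn.Linear(channels * size * size, hidden)

# Convolution with a size x size kernel that reuses the same weights and bias
conv = nn.Conv2d(channels, hidden, kernel_size=size)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(hidden, channels, size, size))
    conv.bias.copy_(fc.bias)

# On a training-sized feature map the two layers agree
feat = torch.randn(2, channels, size, size)
out_fc = fc(feat.flatten(1))          # shape (2, hidden)
out_conv = conv(feat).flatten(1)      # (2, hidden, 1, 1) flattened to (2, hidden)
same = torch.allclose(out_fc, out_conv, atol=1e-4)

# On a larger feature map only the convolution applies, and it returns a spatial map
big = torch.randn(2, channels, size + 2, size + 3)
out_big = conv(big)                   # shape (2, hidden, 3, 4)
print(same, tuple(out_big.shape))
```

The linear layer would reject the larger feature map because its input length is fixed, while the convolution simply slides over it. That is what makes dense evaluation possible at test time.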

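During training, the network produces only a single spatial output, i.e. a 1*1 output map.

At test time, when the network is applied to a larger image, the convolutional layers are run once over the whole image and produce a spatial map of output predictions, with one spatial location for each input "window" (field of view). This is far more efficient than sliding a window over the image and extracting features window by window. The figure below illustrates the convolutional implementation of the sliding window.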

[Figure: sliding-window evaluation implemented with convolutions]
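
The size of the output map can be worked out with plain integer arithmetic. The sketch below follows the layer settings of the OverFeat_fast class defined in Section 2 (kernel sizes, strides and padding are read from that code); the helper names `conv_out` and `fast_output_size` are made up for this example.

```python
def conv_out(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def fast_output_size(n: int) -> int:
    """Side length of the OverFeat_fast output map for an n x n input."""
    n = conv_out(n, 11, stride=4)   # conv1
    n = conv_out(n, 2, stride=2)    # pool1
    n = conv_out(n, 5)              # conv2
    n = conv_out(n, 2, stride=2)    # pool2
    # conv3, conv4 and conv5 use 3x3 kernels with padding 1 and keep the size
    n = conv_out(n, 2, stride=2)    # pool5
    n = conv_out(n, 6)              # layer 6 as a 6x6 convolution
    return n                        # layers 7 and 8 are 1x1 convolutions


for side in (231, 263, 295):
    print(side, "->", fast_output_size(side))
# 231 -> 1
# 263 -> 2
# 295 -> 3
```

The strides multiply to 4 * 2 * 2 * 2 = 32, so every extra 32 input pixels add one output location. Each location corresponds to a 231x231 window of the input, shifted by 32 pixels from its neighbor.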

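Localization

A convolutional network built and trained for image classification consists of two parts: a feature extractor and a classifier. The features that the extractor has learned from the image dataset can be reused. By changing only the last few layers (the classifier), the network can be applied to a different task without retraining all of its parameters from scratch. This is, in essence, transfer learning.

Starting from a model pre-trained on ImageNet, the feature extractor is kept and the classifier is replaced by a regression network. The new model is trained to predict an object bounding box at every spatial location and scale.

(1) Generating predictions

To generate bounding-box predictions, the classifier and regressor networks are run simultaneously at all locations and scales. Because they share the same feature extraction layers, only the final regression layers need to be recomputed after the classification network has been evaluated. The output of the final softmax layer for class c at each location gives a confidence score that an object of class c is present (though not necessarily fully contained) in the corresponding window. A confidence can therefore be assigned to every bounding box.

(2) Training the regressor

The regression network takes the feature maps produced by the feature extractor as input. It has two fully connected hidden layers with 4096 and 1024 channels respectively, and a final output layer with 4 units that specify the coordinates of the bounding-box edges.

The feature extraction layers from the classification network (both structure and parameters) are kept fixed, and the regressor is trained with the L2 norm between the predicted and the ground-truth bounding boxes as the loss.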
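
A minimal sketch of this training setup is given below. Everything in it is a stand-in chosen to keep the example tiny: the feature extractor is a single hypothetical conv layer rather than a pre-trained OverFeat network, the hidden widths are 32 and 16 instead of 4096 and 1024, the data is random, and the loss is the squared L2 distance between predicted and target box coordinates.

```python
import torch
import torch.nn as nn


def make_regressor(in_ch: int, h1: int, h2: int, k: int) -> nn.Sequential:
    """Regression head written as convolutions: two hidden layers, 4 outputs."""
    return nn.Sequential(
        nn.Conv2d(in_ch, h1, kernel_size=k), nn.ReLU(),
        nn.Conv2d(h1, h2, kernel_size=1), nn.ReLU(),
        nn.Conv2d(h2, 4, kernel_size=1),   # coordinates of the box edges
    )


torch.manual_seed(0)
# Stand-in for a pre-trained feature extractor (hypothetical, tiny on purpose)
features = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU())
for p in features.parameters():
    p.requires_grad_(False)              # keep the extractor fixed

# The text gives 4096 and 1024 hidden channels; small widths are used here
regressor = make_regressor(in_ch=8, h1=32, h2=16, k=7)
optimizer = torch.optim.SGD(regressor.parameters(), lr=0.01)

images = torch.randn(4, 3, 15, 15)       # fake image batch
target_boxes = torch.rand(4, 4)          # fake ground-truth boxes

for _ in range(5):
    with torch.no_grad():
        feats = features(images)         # (4, 8, 7, 7), no gradients needed
    pred = regressor(feats).flatten(1)   # (4, 4, 1, 1) flattened to (4, 4)
    loss = ((pred - target_boxes) ** 2).sum(dim=1).mean()  # squared L2 distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Only the regressor's parameters are handed to the optimizer, so the shared feature extractor stays exactly as the classification training left it.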

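Training the regressor in a multi-scale manner is important for combining predictions across scales. Multi-scale training makes the predictions match correctly across scales and exponentially increases the confidence of the merged predictions.

(3) Combining predictions

The individual predictions are combined by a greedy merge strategy applied to the regressor's bounding boxes, which merges and accumulates them into a small number of objects.

The many initially overlapping boxes converge to a single location and scale. The final score is computed by accumulating the detection class outputs associated with the input windows from which each bounding box was predicted.

After merging, a large number of boxes are fused into a single box with very high confidence. False positives fall below the detection threshold and are discarded because they lack bounding-box coherence and confidence. Because it rewards bounding-box coherence, this approach is more robust than traditional non-maximum suppression.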
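
The idea of accumulating instead of suppressing can be illustrated with a simplified greedy merge. This is a sketch under assumptions, not the paper's exact procedure: the match score here is just the distance between box centers, a merged box averages the coordinates and adds the confidences of its two parents, and the threshold values are arbitrary. All names are hypothetical.

```python
from itertools import combinations
from math import hypot
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, confidence


def center_distance(a: Box, b: Box) -> float:
    """Simplified match score: distance between box centers (an assumption)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return hypot(ax - bx, ay - by)


def merge_pair(a: Box, b: Box) -> Box:
    """Average the coordinates and accumulate (add) the confidences."""
    coords = tuple((p + q) / 2 for p, q in zip(a[:4], b[:4]))
    return coords + (a[4] + b[4],)


def greedy_merge(boxes: List[Box], threshold: float) -> List[Box]:
    boxes = list(boxes)
    while len(boxes) > 1:
        # Pick the best-matching pair among all remaining boxes
        i, j = min(combinations(range(len(boxes)), 2),
                   key=lambda ij: center_distance(boxes[ij[0]], boxes[ij[1]]))
        if center_distance(boxes[i], boxes[j]) > threshold:
            break  # nothing left that is close enough to merge
        merged = merge_pair(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes


candidates = [
    (10, 10, 50, 50, 0.6),
    (12, 12, 52, 52, 0.7),
    (14, 10, 54, 50, 0.5),
    (200, 200, 240, 240, 0.3),   # isolated box, likely a false positive
]
result = greedy_merge(candidates, threshold=10.0)
# Keep only boxes whose accumulated confidence passes a detection threshold
detections = [b for b in result if b[4] >= 1.0]
print(detections)
```

The three consistent boxes fuse into one box with an accumulated confidence of about 1.8, while the isolated box keeps its low score of 0.3 and is dropped by the final threshold. Non-maximum suppression would instead keep the single best box at its original score of 0.7.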


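Object detection

The detection task is to find one or more objects in an image, i.e. to both classify and localize them.

The main difference from the localization task is the need to predict a background class when no object is present. Traditionally, negative examples are drawn at random for training.

OverFeat instead performs negative training on the fly, selecting a few negative examples per image, such as random ones or the most offending ones. This is computationally more expensive but makes the procedure simpler. Since the feature extraction layers come from the classification network, detection fine-tuning is not very slow anyway.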
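
Selecting negatives on the fly might look like the sketch below. It assumes the model has already scored every background window of one image, and it picks the windows the model gets most wrong plus a few random ones. The function name, the score list and the counts are all hypothetical.

```python
import random
from typing import List, Sequence


def select_negatives(object_scores: Sequence[float], num_hard: int,
                     num_random: int, rng: random.Random) -> List[int]:
    """Pick indices of background windows to train on for one image.

    object_scores[i] is the model's current confidence that background
    window i contains an object, so a high score means a bad mistake.
    """
    order = sorted(range(len(object_scores)),
                   key=lambda i: object_scores[i], reverse=True)
    hard = order[:num_hard]                       # most offending negatives
    rest = order[num_hard:]
    extra = rng.sample(rest, min(num_random, len(rest)))  # random negatives
    return hard + extra


scores = [0.05, 0.90, 0.10, 0.75, 0.20, 0.02]
picked = select_negatives(scores, num_hard=2, num_random=1,
                          rng=random.Random(0))
print(picked[:2])  # [1, 3]
```

Because the scores come from the current model, the selection has to be repeated for every image during training. That is the extra computational cost mentioned above.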

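1.4 Summary

OverFeat is a feature extractor. The paper uses the feature extraction ability of convolutional networks to apply the features learned on the image classification task to other vision tasks such as localization and detection.

The paper provides a unified convolutional-network framework for classification, localization and detection, and shows how a multi-scale sliding-window approach can be implemented with a convolutional network. Localization is done by learning to predict object boundaries, and bounding boxes are accumulated rather than suppressed in order to increase detection confidence.

In summary:

(1) A single convolutional network handles three tasks: image classification, localization and detection.

(2) A multi-scale sliding-window approach is implemented efficiently with a convolutional network.

(3) Bounding boxes are obtained by accumulating predictions.

2. Defining the OverFeat model classes in PyTorch

2.1 The OverFeat_fast model class

Fast model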

import torch
import torch.nn as nn


class OverFeat_fast(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # training setup (from the OverFeat paper):
        # 5 random crops of size 221x221 and their horizontal flips
        # (note: this fast model is sized for 231x231 inputs, see the shape comments below)
        # mini-batches of size 128
        # weights initialized randomly with mu=0, sigma=1x10^-2
        # SGD, momentum=0.6, L2 weight decay of 1x10^-5
        # learning rate 5x10^-2, decayed by 0.5 after (30, 50, 60, 70, 80) epochs
        # dropout (p=0.5) is applied before classifier conv layers 6 and 7

        self.feature_extractor = nn.Sequential(
            # no contrast normalization is used
            # max pooling with non-overlapping regions
            # note: conv1 uses stride 4 here; the accurate model uses stride 2

            # 1st
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),  # (b x 96 x 56 x 56)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 96 x 28 x 28)
            # 2nd
            nn.Conv2d(96, 256, 5, stride= 1),  # (b x 256 x 24 x 24)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 256 x 12 x 12)
            # 3rd
            nn.Conv2d(256, 512, 3, padding=1),  # (b x 512 x 12 x 12)
            nn.ReLU(),
            # 4th
            nn.Conv2d(512, 1024, 3, padding=1),  # (b x 1024 x 12 x 12)
            nn.ReLU(),
            # 5th
            nn.Conv2d(1024, 1024, 3, padding=1),  # (b x 1024 x 12 x 12)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 1024 x 6 x 6)
        )

        # fully connected layers implemented as convolution layers
        self.classifier = nn.Sequential(
            # 6th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(in_channels=1024, out_channels=3072, kernel_size=6),
            nn.ReLU(),
            # 7th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(3072, 4096, 1),
            nn.ReLU(),
            # 8th
            nn.Conv2d(4096, num_classes, 1)
        )

        self.init_weight()  # initialize weight

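    def init_weight(self):
        # initialize the weights of every conv layer (feature extractor and
        # classifier) from N(0, 0.01), matching the training notes in __init__
        for layer in self.modules():
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)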

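    def forward(self, x):
        """
        Pass the input through the net.
        Args:
            x (Tensor): input batch of shape (b, 3, H, W)
        Returns:
            output (Tensor): class scores of shape (b, num_classes) for a
                231x231 input, or a (b, num_classes, h, w) score map for
                larger inputs
        """
        x = self.feature_extractor(x)
        x = self.classifier(x)
        # flatten only a 1x1 spatial map, so a batch of size 1 keeps its batch dim
        if x.shape[2:] == (1, 1):
            x = x.flatten(1)
        return x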

The structure of the OverFeat_fast model is shown below.

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 96, 56, 56]          34,944
              ReLU-2           [-1, 96, 56, 56]               0
         MaxPool2d-3           [-1, 96, 28, 28]               0
            Conv2d-4          [-1, 256, 24, 24]         614,656
              ReLU-5          [-1, 256, 24, 24]               0
         MaxPool2d-6          [-1, 256, 12, 12]               0
            Conv2d-7          [-1, 512, 12, 12]       1,180,160
              ReLU-8          [-1, 512, 12, 12]               0
            Conv2d-9         [-1, 1024, 12, 12]       4,719,616
             ReLU-10         [-1, 1024, 12, 12]               0
           Conv2d-11         [-1, 1024, 12, 12]       9,438,208
             ReLU-12         [-1, 1024, 12, 12]               0
        MaxPool2d-13           [-1, 1024, 6, 6]               0
          Dropout-14           [-1, 1024, 6, 6]               0
           Conv2d-15           [-1, 3072, 1, 1]     113,249,280
             ReLU-16           [-1, 3072, 1, 1]               0
          Dropout-17           [-1, 3072, 1, 1]               0
           Conv2d-18           [-1, 4096, 1, 1]      12,587,008
             ReLU-19           [-1, 4096, 1, 1]               0
           Conv2d-20           [-1, 1000, 1, 1]       4,097,000
================================================================
Total params: 145,920,872
Trainable params: 145,920,872
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.61
Forward/backward pass size (MB): 14.03
Params size (MB): 556.64
Estimated Total Size (MB): 571.28
----------------------------------------------------------------

2.2 The OverFeat_accurate model class

Accurate model

class OverFeat_accurate(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # training setup (from the OverFeat paper):
        # 5 random crops of size 221x221 and their horizontal flips
        # mini-batches of size 128
        # weights initialized randomly with mu=0, sigma=1x10^-2
        # SGD, momentum=0.6, L2 weight decay of 1x10^-5
        # learning rate 5x10^-2, decayed by 0.5 after (30, 50, 60, 70, 80) epochs
        # dropout (p=0.5) is applied before classifier conv layers 7 and 8

        self.feature_extractor = nn.Sequential(
            # no contrast normalization is used
            # max pooling with non-overlapping regions
            # 1st layer uses stride 2 instead of 4

            # 1st
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2),  # (b x 96 x 108 x 108)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),  # (b x 96 x 36 x 36)
            # 2nd
            nn.Conv2d(96, 256, 7, stride= 1),  # (b x 256 x 30 x 30)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 256 x 15 x 15)
            # 3rd
            nn.Conv2d(256, 512, 3, padding=1),  # (b x 512 x 15 x 15)
            nn.ReLU(),
            # 4th
            nn.Conv2d(512, 512, 3, padding=1),  # (b x 512 x 15 x 15)
            nn.ReLU(),
            # 5th
            nn.Conv2d(512, 1024, 3, padding=1),  # (b x 1024 x 15 x 15)
            nn.ReLU(),
            # 6th
            nn.Conv2d(1024, 1024, 3, padding=1),  # (b x 1024 x 15 x 15)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),  # (b x 1024 x 5 x 5)
        )

        # fully connected layers implemented as convolution layers
        self.classifier = nn.Sequential(
            # 7th
            # inplace=False (as in the fast model) avoids overwriting
            # activations that autograd may still need during training
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(in_channels=1024, out_channels=4096, kernel_size=5),
            nn.ReLU(),
            # 8th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(4096, 4096, 1),
            nn.ReLU(),
            # 9th
            nn.Conv2d(4096, num_classes, 1)
        )

        self.init_weight()  # initialize weight

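    def init_weight(self):
        # initialize the weights of every conv layer (feature extractor and
        # classifier) from N(0, 0.01), matching the training notes in __init__
        for layer in self.modules():
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)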

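    def forward(self, x):
        """
        Pass the input through the net.
        Args:
            x (Tensor): input batch of shape (b, 3, H, W)
        Returns:
            output (Tensor): class scores of shape (b, num_classes) for a
                221x221 input, or a (b, num_classes, h, w) score map for
                larger inputs
        """
        x = self.feature_extractor(x)
        x = self.classifier(x)
        # flatten only a 1x1 spatial map, so a batch of size 1 keeps its batch dim
        if x.shape[2:] == (1, 1):
            x = x.flatten(1)
        return x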

The structure of the OverFeat_accurate model is shown below.

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 96, 108, 108]          14,208
              ReLU-2         [-1, 96, 108, 108]               0
         MaxPool2d-3           [-1, 96, 36, 36]               0
            Conv2d-4          [-1, 256, 30, 30]       1,204,480
              ReLU-5          [-1, 256, 30, 30]               0
         MaxPool2d-6          [-1, 256, 15, 15]               0
            Conv2d-7          [-1, 512, 15, 15]       1,180,160
              ReLU-8          [-1, 512, 15, 15]               0
            Conv2d-9          [-1, 512, 15, 15]       2,359,808
             ReLU-10          [-1, 512, 15, 15]               0
           Conv2d-11         [-1, 1024, 15, 15]       4,719,616
             ReLU-12         [-1, 1024, 15, 15]               0
           Conv2d-13         [-1, 1024, 15, 15]       9,438,208
             ReLU-14         [-1, 1024, 15, 15]               0
        MaxPool2d-15           [-1, 1024, 5, 5]               0
          Dropout-16           [-1, 1024, 5, 5]               0
           Conv2d-17           [-1, 4096, 1, 1]     104,861,696
             ReLU-18           [-1, 4096, 1, 1]               0
          Dropout-19           [-1, 4096, 1, 1]               0
           Conv2d-20           [-1, 4096, 1, 1]      16,781,312
             ReLU-21           [-1, 4096, 1, 1]               0
           Conv2d-22           [-1, 1000, 1, 1]       4,097,000
================================================================
Total params: 144,656,488
Trainable params: 144,656,488
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.56
Forward/backward pass size (MB): 33.09
Params size (MB): 551.82
Estimated Total Size (MB): 585.47
----------------------------------------------------------------
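
The parameter counts in the two summaries can be cross-checked by hand: a Conv2d layer with a k x k kernel has (in_channels * k * k + 1) * out_channels parameters, where the +1 is the bias. The short script below applies this formula to the layer lists of both classes. The names `conv_params`, `FAST` and `ACCURATE` are introduced only for this check.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus biases of a square-kernel Conv2d layer."""
    return (in_ch * kernel * kernel + 1) * out_ch


# (in_channels, out_channels, kernel_size) for every Conv2d, in order
FAST = [(3, 96, 11), (96, 256, 5), (256, 512, 3), (512, 1024, 3),
        (1024, 1024, 3), (1024, 3072, 6), (3072, 4096, 1), (4096, 1000, 1)]
ACCURATE = [(3, 96, 7), (96, 256, 7), (256, 512, 3), (512, 512, 3),
            (512, 1024, 3), (1024, 1024, 3), (1024, 4096, 5),
            (4096, 4096, 1), (4096, 1000, 1)]

fast_total = sum(conv_params(*layer) for layer in FAST)
accurate_total = sum(conv_params(*layer) for layer in ACCURATE)
print(f"{fast_total:,}")      # 145,920,872
print(f"{accurate_total:,}")  # 144,656,488
```

In both models the first classifier layer dominates: the 6x6 convolution of the fast model alone holds about 113 million of the roughly 146 million parameters.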

3. Training the OverFeat model

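if __name__ == '__main__':
    # torchsummary is a third-party package (pip install torchsummary)
    from torchsummary import summary

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)

    use_fast = True  # set to False to build the accurate model

    if use_fast:
        # the fast model is sized for 231x231 inputs
        model_fast = OverFeat_fast(num_classes=1000).to(device)
        summary(model_fast, (3, 231, 231))
    else:
        # the accurate model is sized for 221x221 inputs
        model_accurate = OverFeat_accurate(num_classes=1000).to(device)
        summary(model_accurate, (3, 221, 221))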
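
The step schedule noted in the model comments (learning rate 5x10^-2, multiplied by 0.5 after epochs 30, 50, 60, 70 and 80) can be written out in plain Python as a sanity check. This is a sketch only: it assumes 0-based epoch indices and that the decay takes effect from the milestone epoch onward, and the names below are hypothetical.

```python
from bisect import bisect_right

BASE_LR = 5e-2
MILESTONES = (30, 50, 60, 70, 80)
GAMMA = 0.5


def lr_at_epoch(epoch: int) -> float:
    """Learning rate for a 0-based epoch index under the step schedule."""
    # number of milestones already reached at this epoch
    passed = bisect_right(MILESTONES, epoch)
    return BASE_LR * GAMMA ** passed


for epoch in (0, 29, 30, 55, 80):
    print(epoch, lr_at_epoch(epoch))
```

In PyTorch the same setup would typically be expressed with `torch.optim.SGD(model.parameters(), lr=5e-2, momentum=0.6, weight_decay=1e-5)` together with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 50, 60, 70, 80], gamma=0.5)`.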

[To be continued...]
3.2 Model training
3.3 Model inference

References:

Pierre Sermanet, Yann LeCun, et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
https://github.com/BeopGyu/OverFeat_pytorch/blob/master/230117.ipynb

[End of this section]

Copyright notice:
Please cite the original link when reposting:
[youcans Hands-On Models] Object Detection: The OverFeat Model
Copyright 2023 youcans, XUPT
Created: 2023-07-14