[youcans hands-on model] OverFeat model for object detection

Welcome to the "youcans hands-on model" series
The content and resources of this column are synchronized to GitHub/youcans



This article introduces the OverFeat model for object detection and gives a PyTorch implementation.


1. OverFeat convolutional neural network model

OverFeat is an image feature extractor and classifier based on convolutional networks; it won first place in the localization task of the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013).

Pierre Sermanet, Yann LeCun, et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

[Download address]: https://arxiv.org/abs/1312.6229
[GitHub address]: [OverFeat_pytorch] https://github.com/BeopGyu/OverFeat_pytorch



1.1 Abstract of the paper

OverFeat is a feature extraction operator (feature extractor).

This paper proposes an integrated framework for classification, localization, and detection based on convolutional neural networks, applying the same convolutional network to these different vision tasks.

We efficiently implement multi-scale and sliding-window methods within a convolutional network. Localization is performed by learning to predict object boundaries; bounding boxes are then accumulated rather than suppressed in order to increase detection confidence.

The integrated framework won the localization task of the 2013 ImageNet Visual Recognition Challenge (ILSVRC2013) and achieved highly competitive results on the detection and classification tasks.


1.2 Technical Background

Three basic tasks of computer vision:

(1) Classification: given an image, assign it a label identifying the category of its main object (evaluated by Top-5 error).

(2) Localization: given an image, identify its category (Top-5) and also return the bounding box (bbox) of the object.

(3) Detection: given an image containing any number of objects (possibly zero), find the bounding boxes of all objects and identify their categories; false positives (FP) are penalized as well.


The main advantage of convolutional neural networks for image classification is that they are trained end to end on images, without manually designed feature extractors (such as SIFT or HOG); the disadvantage is that they require a large labeled sample set for training.

Although ImageNet images contain a central object that roughly fills the frame, the size and location of that object still vary considerably across images, and small objects may sit in a corner of the image. There are several ideas to address this problem:

(1) Use multiple fixed-size sliding windows and run a CNN classifier on each scanned window. The drawback is that a window may contain only part of an object (say, a dog's head) rather than the whole object, which yields good classification performance but poor localization and detection.

(2) Train a convolutional network that not only classifies the image but also predicts the bounding box of the object.

(3) Accumulate the class confidences at every position and scale.


1.3 Basic method



Model design

The basic structure of the OverFeat model is similar to AlexNet (the per-layer details can be seen in the torchsummary printouts in Section 2 below).

The OverFeat model includes a Fast model and an Accurate model.

  • Fast model
    Layers 1-5 are the feature extractor, composed of convolutional layers; layers 6-8 are the classifier, composed of fully connected layers.

  • Accurate model
    Layers 1-6 are the feature extractor, composed of convolutional layers; layers 7-9 are the classifier, composed of fully connected layers.

Multi-scale classification

OverFeat is trained in much the same way as AlexNet, but at test time it evaluates images at six different scales and performs multi-scale, multi-view voting to improve performance.

Krizhevsky et al. [15] average over a fixed set of 10 views (the 4 corners and the center, plus their horizontal flips). However, this approach ignores many regions of the image and is computationally redundant where views overlap. Moreover, it is applied at only a single scale, which may not be the scale at which the ConvNet responds with the best confidence.

Instead, this paper explores the entire image by running the network densely at every location and at multiple scales. While the sliding-window approach can be computationally expensive for some types of models, it is inherently efficient for convolutional networks. It produces significantly more views for voting, which increases robustness while remaining efficient.
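
A minimal sketch of this dense multi-scale evaluation (my own illustration, not the paper's exact pipeline; the square scales are illustrative, and `model` is assumed to be a fully convolutional network returning a class-score map of shape (1, num_classes, H', W')):

import torch
import torch.nn.functional as F

def multiscale_classify(model, image, scales=(245, 281, 317, 389, 461, 569)):
    """Run a fully convolutional model over several rescaled copies of the
    image and average the spatially pooled class scores (illustrative scales)."""
    votes = []
    for s in scales:
        x = F.interpolate(image, size=(s, s), mode="bilinear", align_corners=False)
        scores = model(x)                      # spatial map: one score per window
        votes.append(scores.mean(dim=(2, 3)))  # average over spatial positions
    return torch.stack(votes).mean(dim=0)      # average the votes across scales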



Sliding window

Convolutional networks are naturally suited to efficient sliding-window computation.

Unlike sliding-window approaches that compute an entire pipeline for each window of the input, convolutional networks are inherently efficient when run in a sliding-window fashion, because they naturally share the computation common to overlapping regions.

The last few layers of the network are fully connected linear layers. At test time, these linear layers are replaced by convolutions (the first FC layer becomes a convolution whose kernel covers its entire input feature map, and the subsequent FC layers become 1x1 pointwise convolutions). The whole network then becomes a fully convolutional network (FCN) containing only convolution, pooling, and thresholding (ReLU) operations.

During training, the network produces a single spatial output at full scale.

At test time, when the network is applied to a larger image, we only need to run the convolutional layers once over the entire image to produce a full spatial map of output predictions. Each output corresponds to one input "window" (field of view) at a distinct spatial location. This is far more efficient than sliding a window across the image and extracting features window by window. The sketch below illustrates the effect of implementing a sliding window with convolutions.
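
To make the equivalence concrete, here is a minimal sketch (my own, not the authors' code; the 1024x6x6 input and 3072 outputs match the fast model's 6th layer) showing that an FC layer and a convolution with kernel size equal to its input share the same weights, and that the convolutional form extends to larger inputs:

import torch
import torch.nn as nn

# An FC layer over a flattened 1024x6x6 feature map with 3072 outputs ...
fc = nn.Linear(1024 * 6 * 6, 3072)

# ... is equivalent to a 6x6 convolution with 3072 output channels.
conv = nn.Conv2d(1024, 3072, kernel_size=6)
conv.weight.data = fc.weight.data.view(3072, 1024, 6, 6)
conv.bias.data = fc.bias.data

x = torch.randn(1, 1024, 6, 6)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))  # True

# On a larger feature map, the convolutional form produces a spatial grid of
# predictions (one per sliding-window position) in a single pass.
y = torch.randn(1, 1024, 8, 8)
print(conv(y).shape)  # torch.Size([1, 3072, 3, 3])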



Localization

A convolutional neural network built and trained for the image classification task consists of two parts: a feature extractor and an image classifier. Reusing the features that the extractor computes from the image samples, different tasks can be realized simply by replacing the image classifier in the last few layers of the network, without retraining the parameters of the whole network from scratch. In essence, this is transfer learning.

Starting from the model pre-trained on the ImageNet dataset, the feature extractor of the pre-trained model is retained, while its classifier is replaced with a regression network; the resulting model is trained to predict the object bounding box at each spatial position and scale.

(1) Generate predictions

To generate object bounding box predictions, we run the classifier and regressor networks simultaneously at all locations and scales. Since they share the same feature extraction layers, only the final regression layers need to be recomputed after the classification network has run. The output of the final softmax layer for a class c at each location provides a confidence score that an object of class c is present (though not necessarily fully contained) in the corresponding window. We can therefore assign a confidence to each bounding box.
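
As an illustrative fragment (my own, with dummy tensors), the per-window confidence for a class c can be read directly off the spatial softmax output:

import torch
import torch.nn.functional as F

class_maps = torch.randn(1, 1000, 3, 3)  # dummy spatial class scores (1000 classes)
probs = F.softmax(class_maps, dim=1)     # softmax over classes at each location
conf_c = probs[0, 42]                    # 3x3 confidence map for class c = 42:
# each (y, x) entry is the confidence that class 42 appears in that window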

(2) Training the regressor

The regression network takes the feature maps produced by the feature extractor as input. It has two fully connected hidden layers with 4096 and 1024 channels, respectively, and a final output layer with 4 nodes that specify the coordinates of the bounding box edges, as sketched below.

The feature extraction layers (both structure and parameters) taken from the classification network are kept fixed, and the L2 norm between the predicted bounding box and the ground-truth bounding box is used as the loss function.

Training the regressor in a multi-scale manner is important for combining predictions across scales: it makes the predictions match correctly across scales and exponentially increases the confidence of the merged predictions.
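
A hedged sketch of such a regression head (the 4096/1024/4 layer sizes follow the paper; the module itself and the 5x5 input resolution are my own illustration):

import torch
import torch.nn as nn

class BBoxRegressor(nn.Module):
    """Regression head on top of the frozen feature extractor: two hidden
    layers with 4096 and 1024 channels (implemented as convolutions) and a
    final layer with 4 outputs for the bounding-box edge coordinates."""
    def __init__(self, in_channels=1024, spatial=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 4096, kernel_size=spatial),  # 1st hidden layer
            nn.ReLU(),
            nn.Conv2d(4096, 1024, kernel_size=1),               # 2nd hidden layer
            nn.ReLU(),
            nn.Conv2d(1024, 4, kernel_size=1),                  # 4 bbox coordinates
        )

    def forward(self, features):
        return self.head(features)

regressor = BBoxRegressor()
criterion = nn.MSELoss()                   # L2 loss against the ground-truth box
features = torch.randn(1, 1024, 5, 5)      # dummy output of the frozen extractor
pred = regressor(features).flatten(1)      # shape (1, 4)
target = torch.tensor([[0.1, 0.2, 0.8, 0.9]])
loss = criterion(pred, target)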

(3) Combining the predictions

The predictions are combined by applying a greedy merging strategy to the regressor's bounding boxes, which are merged and accumulated into a small number of objects, as sketched below.

The many overlapping and offset bounding boxes produced initially converge to a single position and scale; each box's score is computed by accumulating the detection-class outputs associated with the input window that predicted it.

After merging, the large number of bounding boxes fuses into a single box of very high confidence. False positives fall below the detection threshold and are rejected, since they lack bounding-box consistency and confidence. Compared with traditional non-maximum suppression, this approach is more robust because it rewards the consistency of bounding boxes.
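
A minimal sketch of this accumulate-and-merge idea (a simplification of my own: plain IoU stands in for the paper's match score, which combines center distance and intersection area):

import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes, scores, match_thresh=0.5):
    """Greedily merge overlapping boxes, accumulating (not suppressing)
    their confidences; returns the merged boxes and summed scores."""
    boxes = [np.asarray(b, dtype=float) for b in boxes]
    scores = list(scores)
    merged = True
    while merged and len(boxes) > 1:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if iou(boxes[i], boxes[j]) > match_thresh:
                    w = scores[i] + scores[j]          # accumulate confidence
                    boxes[i] = (scores[i] * boxes[i] + scores[j] * boxes[j]) / w
                    scores[i] = w                      # merged box keeps the sum
                    del boxes[j], scores[j]
                    merged = True
                    break
            if merged:
                break
    return boxes, scores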



Object detection

The detection task is to find (classify and localize) one or more target objects in an image.

The main difference from the localization task is the need to predict a background class when no object is present. Traditionally, negative examples are sampled at random for training.

Instead, negative training is done on the fly, selecting a number of negative examples (for instance at random, or the most offending ones) for each image, as sketched below. This approach is computationally more expensive, but it makes the procedure simpler. And because the feature extraction layers are taken from the classification network, detection fine-tuning is not slow.
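
A rough sketch of such dynamic negative selection (entirely illustrative; `scores` holds foreground confidences for all candidate windows of one image, and `labels` marks background windows with 0):

import torch

def select_hard_negatives(scores, labels, k=8):
    """Pick the k background windows with the highest foreground scores
    (the 'most offending' negatives) as negative training examples."""
    bg_scores = scores[labels == 0]        # scores of background windows only
    k = min(k, bg_scores.numel())
    hard, idx = torch.topk(bg_scores, k)   # worst offenders among the negatives
    return hard, idx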


1.4 Summary

OverFeat is a feature extraction operator. The paper exploits the feature extraction capability of convolutional neural networks to reuse the features learned for image classification in other vision tasks such as localization and detection.

A convolutional neural network built and trained for image classification consists of two parts: a feature extractor and an image classifier. Reusing the features the extractor computes from the image sample set, different tasks can be achieved simply by replacing the classifier in the last few layers of the network, without training the whole network's parameters from scratch. This is essentially transfer learning.

The paper uses convolutional neural networks to provide a unified framework for the classification, localization, and detection tasks, and shows how to implement a multi-scale sliding-window method efficiently within a convolutional network. Localization is performed by learning to predict object boundaries, and bounding boxes are accumulated rather than suppressed in order to increase detection confidence.

Summarized as follows:

(1) A single convolutional neural network handles all three tasks: image classification, localization, and detection.

(2) A multi-scale sliding-window approach is implemented efficiently with convolutional networks.

(3) Bounding boxes are found by accumulating predictions rather than suppressing them.


2. Define the OverFeat model class in PyTorch

2.1 OverFeat_fast model class

Fast model

import torch
import torch.nn as nn


class OverFeat_fast(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # Training setup described in the paper:
        # - 221x221 random crops (5 per image) and their horizontal flips
        # - mini-batches of size 128
        # - weights initialized randomly with mu=0, sigma=1e-2
        # - SGD, momentum 0.6, L2 weight decay 1e-5
        # - learning rate 5e-2, decayed by 0.5 after epochs (30, 50, 60, 70, 80)
        # - dropout applied before the classifier's conv layers

        self.feature_extractor = nn.Sequential(
            # no contrast normalization is used
            # max pooling is non-overlapping

            # 1st
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),  # (b x 96 x 56 x 56)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 96 x 28 x 28)
            # 2nd
            nn.Conv2d(96, 256, 5, stride=1),  # (b x 256 x 24 x 24)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 256 x 12 x 12)
            # 3rd
            nn.Conv2d(256, 512, 3, padding=1),  # (b x 512 x 12 x 12)
            nn.ReLU(),
            # 4th
            nn.Conv2d(512, 1024, 3, padding=1),  # (b x 1024 x 12 x 12)
            nn.ReLU(),
            # 5th
            nn.Conv2d(1024, 1024, 3, padding=1),  # (b x 1024 x 12 x 12)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 1024 x 6 x 6)
        )

        # fully connected layers, implemented as convolution layers
        self.classifier = nn.Sequential(
            # 6th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(in_channels=1024, out_channels=3072, kernel_size=6),
            nn.ReLU(),
            # 7th
            nn.Dropout(p=0.5, inplace=False),
            nn.Conv2d(3072, 4096, 1),
            nn.ReLU(),
            # 8th
            nn.Conv2d(4096, num_classes, 1)
        )

        self.init_weight()  # initialize weight

    def init_weight(self):
        # initialize all conv layers (feature extractor and classifier) with a
        # Gaussian of mean 0 and std 0.01, as described in the paper; biases
        # are zeroed here (a common choice; the paper does not specify)
        for layer in [*self.feature_extractor, *self.classifier]:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        """
        Pass the input through the net.
        Args:
            x (Tensor): input tensor
        Returns:
            output (Tensor): output tensor
        """
        x = self.feature_extractor(x)
        x = self.classifier(x)
        return x.squeeze()

The structure of the OverFeat_fast model is as follows.

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 96, 56, 56]          34,944
              ReLU-2           [-1, 96, 56, 56]               0
         MaxPool2d-3           [-1, 96, 28, 28]               0
            Conv2d-4          [-1, 256, 24, 24]         614,656
              ReLU-5          [-1, 256, 24, 24]               0
         MaxPool2d-6          [-1, 256, 12, 12]               0
            Conv2d-7          [-1, 512, 12, 12]       1,180,160
              ReLU-8          [-1, 512, 12, 12]               0
            Conv2d-9         [-1, 1024, 12, 12]       4,719,616
             ReLU-10         [-1, 1024, 12, 12]               0
           Conv2d-11         [-1, 1024, 12, 12]       9,438,208
             ReLU-12         [-1, 1024, 12, 12]               0
        MaxPool2d-13           [-1, 1024, 6, 6]               0
          Dropout-14           [-1, 1024, 6, 6]               0
           Conv2d-15           [-1, 3072, 1, 1]     113,249,280
             ReLU-16           [-1, 3072, 1, 1]               0
          Dropout-17           [-1, 3072, 1, 1]               0
           Conv2d-18           [-1, 4096, 1, 1]      12,587,008
             ReLU-19           [-1, 4096, 1, 1]               0
           Conv2d-20           [-1, 1000, 1, 1]       4,097,000
================================================================
Total params: 145,920,872
Trainable params: 145,920,872
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.61
Forward/backward pass size (MB): 14.03
Params size (MB): 556.64
Estimated Total Size (MB): 571.28
----------------------------------------------------------------

2.2 OverFeat_accurate model class

Accurate model

class OverFeat_accurate(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # Training setup described in the paper:
        # - 221x221 random crops (5 per image) and their horizontal flips
        # - mini-batches of size 128
        # - weights initialized randomly with mu=0, sigma=1e-2
        # - SGD, momentum 0.6, L2 weight decay 1e-5
        # - learning rate 5e-2, decayed by 0.5 after epochs (30, 50, 60, 70, 80)
        # - dropout applied before the classifier's conv layers

        self.feature_extractor = nn.Sequential(
            # no contrast normalization is used
            # max pooling is non-overlapping
            # first layer uses stride 2 instead of the fast model's stride 4

            # 1st
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2),  # (b x 96 x 108 x 108)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),  # (b x 96 x 36 x 36)
            # 2nd
            nn.Conv2d(96, 256, 7, stride=1),  # (b x 256 x 30 x 30)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (b x 256 x 15 x 15)
            # 3rd
            nn.Conv2d(256, 512, 3, padding=1),  # (b x 512 x 15 x 15)
            nn.ReLU(),
            # 4th
            nn.Conv2d(512, 512, 3, padding=1),  # (b x 512 x 15 x 15)
            nn.ReLU(),
            # 5th
            nn.Conv2d(512, 1024, 3, padding=1),  # (b x 1024 x 15 x 15)
            nn.ReLU(),
            # 6th
            nn.Conv2d(1024, 1024, 3, padding=1),  # (b x 1024 x 15 x 15)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),  # (b x 1024 x 5 x 5)
        )

        # fully connected layers, implemented as convolution layers
        self.classifier = nn.Sequential(
            # 7th
            nn.Dropout(p=0.5, inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=4096, kernel_size=5),
            nn.ReLU(),
            # 8th
            nn.Dropout(p=0.5, inplace=True),
            nn.Conv2d(4096, 4096, 1),
            nn.ReLU(),
            # 9th
            nn.Conv2d(4096, num_classes, 1)
        )

        self.init_weight()  # initialize weight

    def init_weight(self):
        # initialize all conv layers (feature extractor and classifier) with a
        # Gaussian of mean 0 and std 0.01, as described in the paper; biases
        # are zeroed here (a common choice; the paper does not specify)
        for layer in [*self.feature_extractor, *self.classifier]:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        """
        Pass the input through the net.
        Args:
            x (Tensor): input tensor
        Returns:
            output (Tensor): output tensor
        """
        x = self.feature_extractor(x)
        x = self.classifier(x)
        return x.squeeze()

The structure of the OverFeat_accurate model is as follows.

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 96, 108, 108]          14,208
              ReLU-2         [-1, 96, 108, 108]               0
         MaxPool2d-3           [-1, 96, 36, 36]               0
            Conv2d-4          [-1, 256, 30, 30]       1,204,480
              ReLU-5          [-1, 256, 30, 30]               0
         MaxPool2d-6          [-1, 256, 15, 15]               0
            Conv2d-7          [-1, 512, 15, 15]       1,180,160
              ReLU-8          [-1, 512, 15, 15]               0
            Conv2d-9          [-1, 512, 15, 15]       2,359,808
             ReLU-10          [-1, 512, 15, 15]               0
           Conv2d-11         [-1, 1024, 15, 15]       4,719,616
             ReLU-12         [-1, 1024, 15, 15]               0
           Conv2d-13         [-1, 1024, 15, 15]       9,438,208
             ReLU-14         [-1, 1024, 15, 15]               0
        MaxPool2d-15           [-1, 1024, 5, 5]               0
          Dropout-16           [-1, 1024, 5, 5]               0
           Conv2d-17           [-1, 4096, 1, 1]     104,861,696
             ReLU-18           [-1, 4096, 1, 1]               0
          Dropout-19           [-1, 4096, 1, 1]               0
           Conv2d-20           [-1, 4096, 1, 1]      16,781,312
             ReLU-21           [-1, 4096, 1, 1]               0
           Conv2d-22           [-1, 1000, 1, 1]       4,097,000
================================================================
Total params: 144,656,488
Trainable params: 144,656,488
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.56
Forward/backward pass size (MB): 33.09
Params size (MB): 551.82
Estimated Total Size (MB): 585.47
----------------------------------------------------------------

3. Training the OverFeat model

import torch
from torchsummary import summary

if __name__ == '__main__':
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)

    Fast = True
    # Fast = False

    if Fast:
        model_fast = OverFeat_fast(num_classes=1000).to(device)
        summary(model_fast, (3, 231, 231))       # fast model takes 231x231 input
    else:
        model_accurate = OverFeat_accurate(num_classes=1000).to(device)
        summary(model_accurate, (3, 221, 221))   # accurate model takes 221x221 input
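
Model training and inference will be covered in later sections; as a hedged sketch, the hyperparameters noted in the code comments above (SGD with momentum 0.6, weight decay 1e-5, initial learning rate 5e-2, halved after epochs 30, 50, 60, 70, 80) would translate to roughly the following setup (the epoch count and the omitted data loop are placeholders):

import torch

model = OverFeat_fast(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-2,
                            momentum=0.6, weight_decay=1e-5)
# halve the learning rate after epochs 30, 50, 60, 70 and 80
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 50, 60, 70, 80], gamma=0.5)

for epoch in range(90):              # the number of epochs is illustrative
    # ... one pass over the training set, with loss.backward() and
    # optimizer.step() per mini-batch, goes here ...
    scheduler.step()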


【To be continued ...】
3.2 Model training
3.3 Model inference


References:

  1. Pierre Sermanet, Yann LeCun, et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, arXiv:1312.6229
  2. https://github.com/BeopGyu/OverFeat_pytorch/blob/master/230117.ipynb

【End of this section】


Copyright statement:
Welcome to the "youcans hands-on model" series.
For reposting, please cite the original link:
[youcans hands-on model] OverFeat model for object detection
Copyright 2023 youcans, XUPT
Created: 2023-07-14

