Paper Reading-MLPD: Multi-Label Pedestrian Detector in Multispectral Domain (Hikvision Research Institute Internship Program)

The source code of the project can be viewed here: MLPD source code
The original paper can be viewed here: paper original address

Paper title: Multi-Label Pedestrian Detector in Multispectral Domain
(In short: the authors design a network and train it with multi-modal data; the dataset is divided into RGB images and thermal images, but the RGB and thermal images are not fully registered.)

In the following, the sections of the paper are introduced one by one.

Abstract

  Multispectral pedestrian detection (using the two modalities of RGB and thermal imagery) has been actively researched as a promising multi-modal solution for severe weather and lighting conditions. However, most multi-modal methods assume that all inputs are fully overlapping; due to the complexity of sensor configurations, such data pairs are not common in practical applications.
  In this paper, the authors therefore address multispectral pedestrian detection where the input data are not paired. To this end, they propose a new single-stage detection framework that leverages multi-label learning to learn input-state-aware features, by assigning a separate label to each state of an input image pair.
  The authors also propose a novel augmentation strategy that applies geometric transformations to synthesize unpaired multispectral images. In extensive experiments, they demonstrate the effectiveness of the proposed method under different real-world conditions, such as fully overlapping images and partially overlapping images from stereo vision.

Introduction

  This part mainly introduces the importance of pedestrian detection and notes that multi-modal pedestrian detection stays robust around the clock, day and night.
  The KAIST dataset is then introduced: this dataset provides fully overlapping RGB and thermal image pairs. Although most fusion methods preferentially use such fully overlapping datasets, they are difficult to rely on in real-world applications, because special equipment is required to acquire two images that overlap completely.
[Figure 1: detection results under three RGB-Thermal configurations]
  The authors then present a results figure comparing the proposed method with existing methods; the proposed method performs best for both paired and unpaired multispectral inputs. The three configurations in Figure 1 are:
(a) fully paired RGB-Thermal;
(b) partially paired RGB-Thermal (stereo camera);
(c) partially paired RGB-Thermal (EO/IR configuration).
  From a practical point of view, stereo cameras are used as an alternative, as shown in Figure 1-(b) and Figure 1-(c). Unlike the sensor system in Figure 1-(a), such a setup allows a certain distance between the two sensors. However, two issues arise from this that affect the fusion method and the detection performance. The first is that the image contains non-overlapping regions with information from only one sensor. The second is pixel-level misalignment caused by parallax.
[Figure: example image pair captured by a stereo camera]

  From this point of view, most existing multispectral pedestrian detection methods only handle the case of fully overlapping multispectral images. In this paper, the term "paired images" is defined as fully overlapping image pairs, while "unpaired images" means partially overlapping image pairs that contain both overlapping and non-overlapping regions.
  Because unpaired datasets are difficult to obtain, the authors use fully overlapping multispectral datasets to handle the overlapping and non-overlapping regions in the images. (In plain terms: train on paired datasets, and the resulting model can then be applied to images that are not fully paired.)
  To this end, the authors introduce new methods and training strategies, called multi-label learning, to learn more discriminative features, and propose semi-unpaired augmentation to randomly generate unpaired inputs. By applying the proposed method to an SSD-based baseline, they show significant improvements under both paired and unpaired conditions, with fast inference time.
  The authors' contributions are as follows:
1) We address the constraints of previous fusion methods that hinder their use in real-world applications, and introduce a new perspective on multispectral pedestrian detection under unpaired conditions;
2) For both ideal and practical image conditions, we propose a generalized multispectral pedestrian detection framework based on multi-label learning and a new augmentation strategy;
3) We test the proposed method in various unpaired situations, where it achieves results comparable to or better than state-of-the-art methods.

I won’t cover the RELATED WORKS section here and will skip straight to the next part.

METHODS

The authors propose a generalized multispectral pedestrian detection framework consisting of three new contributions: shared multi-fusion layers, multi-label learning, and a semi-unpaired augmentation scheme. In this section, the details of each contribution are explained.

Network structure

First of all, the network designed by the authors is based on the SSD architecture.

Here is a diagram of the SSD network first:
[Figure: SSD network architecture]

  • SSD first performs feature extraction through successive convolutions. On each feature map used for detection, an output is obtained directly through a 3×3 convolution; the number of output channels is determined by the number of anchors and the number of categories, specifically anchors × (categories + 4), where the 4 accounts for the bounding-box offsets (see the sketch after this list).
  • Compared with the YOLO series of detectors, the difference is that SSD obtains the final bounding boxes through convolution, while YOLO (v1) uses a fully connected layer to produce a one-dimensional output vector, which is then unpacked to obtain the final detection boxes.
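A minimal sketch of this head arithmetic in PyTorch (layer sizes and names here are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

# Illustrative SSD-style head: for each cell of a feature map, a 3x3 conv
# predicts (num_classes + 4 box offsets) for every anchor.
num_anchors = 6      # anchors per cell (illustrative value)
num_classes = 2      # e.g. pedestrian vs. background
in_channels = 512    # channels of the incoming feature map

head = nn.Conv2d(in_channels, num_anchors * (num_classes + 4),
                 kernel_size=3, padding=1)

feat = torch.randn(1, in_channels, 38, 38)  # one multi-scale feature map
out = head(feat)
print(out.shape)  # torch.Size([1, 36, 38, 38]) -> 6 * (2 + 4) = 36 channels
```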

Below is a schematic diagram of the network in this paper.
[Figure 2: MLPD network architecture]
  The network proposed by the authors is an SSD-like network consisting of two independent branches (RGB and Thermal). Separate convolutional layers are used up to the fifth convolution block, Conv5; the remaining convolutions are shared from there to the end.
  In the multi-modal module, the features of each modality are concatenated along the channel dimension, followed by additional convolutional layers to reduce the number of channels. The output is then sent to the detection head.
The framework in this paper differs from the SSD framework in three points:
  1) the architecture is used for multi-modal fusion;
  2) multi-label learning is used for training;
  3) a scoring function is used for the final prediction.
As shown in the figure above, the model consists of modality-specific parts, a modality-shared part, and a detection head. In general, the feature maps from the modality-specific parts are fused and then fed into the modality-shared part; the input features of the detection head are generated as follows:
$$\phi_F = f^{shr}\big(f^{spc}_R(I_R) \oplus f^{spc}_T(I_T)\big)$$

where $\phi_F$ is the fused feature map; $f^{spc}_R$, $f^{spc}_T$, and $f^{shr}$ denote the modality-specific parts for the given RGB and thermal input images and the modality-shared part, respectively. $I_R$ and $I_T$ refer to the images of the RGB and thermal domains, and $\oplus$ denotes concatenation.

We observe that the input features of the detection head usually lose modality-specific information: given the merged features as input, the modality-shared part does not preserve per-modality information. Therefore, a reparameterization technique for the fusion layers is introduced. Instead of feeding the concatenated feature maps to the shared part, the feature maps of each modality tower (a Gaussian pyramid? FPN?) are fed separately to the shared part and merged before being passed to the detection head.
In this way, the input features of the detection head preserve modality-specific information at the cost of a few extra fusion layers. The reparameterization is given by the following formula, which swaps the order of sharing and fusion relative to the formula above:
$$\phi_F = F\big(f^{shr}(f^{spc}_R(I_R)) \oplus f^{shr}(f^{spc}_T(I_T))\big)$$

where $F$ is the fusion layer. Before the fused features are fed into the detection head, the fusion layer is designed to be as light as possible for real-time application. As shown in the network structure diagram, the fusion layer is based on a single convolutional layer with an activation function.
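Below is a minimal sketch of this reparameterized fusion, under my own assumptions about module sizes (the authors' code may differ): the shared part is applied to each modality separately, and a single conv + activation fuses the two streams just before the head.

```python
import torch
import torch.nn as nn

class MultiFusion(nn.Module):
    """Sketch: phi_F = F( f_shr(f_spc_R(I_R)) (+) f_shr(f_spc_T(I_T)) )."""
    def __init__(self, c_spc=128, c_shr=256):
        super().__init__()
        # stand-ins for the modality-specific towers (Conv1..Conv5 per branch)
        self.f_spc_rgb = nn.Conv2d(3, c_spc, 3, padding=1)
        self.f_spc_thr = nn.Conv2d(1, c_spc, 3, padding=1)
        # modality-shared part, applied to EACH modality separately
        self.f_shr = nn.Conv2d(c_spc, c_shr, 3, padding=1)
        # light fusion layer F: one convolution with an activation function
        self.fusion = nn.Sequential(nn.Conv2d(2 * c_shr, c_shr, 1),
                                    nn.ReLU(inplace=True))

    def forward(self, img_rgb, img_thr):
        feat_r = self.f_shr(self.f_spc_rgb(img_rgb))
        feat_t = self.f_shr(self.f_spc_thr(img_thr))
        return self.fusion(torch.cat([feat_r, feat_t], dim=1))  # head input

phi = MultiFusion()(torch.randn(1, 3, 64, 80), torch.randn(1, 1, 64, 80))
print(phi.shape)  # torch.Size([1, 256, 64, 80])
```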

Multi-label input

Multi-label learning assigns more detailed class labels to encourage the model to learn more discriminative features. Most previous fusion methods use fully overlapping image pairs as input, so that every object can be localized and seen in both the RGB and thermal modalities. However, these methods fail to detect objects when one of the inputs has a problem, such as sensor failure, blackout, or saturation.
To this end, we introduce a multi-label learning strategy in our multispectral pedestrian detection framework.

Let $\mathbf{y} = [y_R, y_T]$ denote the RGB and thermal label vector of a bounding box. After applying semi-unpaired augmentation, the labels $y_R$ and $y_T$ depend on the input state. More specifically, to assign multi-label vectors representing the states of the input pair, three cases of label vectors are defined:
1) $y^1 = [1, 0]$
2) $y^2 = [0, 1]$
3) $y^3 = [1, 1]$
Basically, the label vector is assigned as $y^1$ or $y^2$ when the pedestrian is visible in only one modality, which happens when semi-unpaired augmentation is applied to either modality. Likewise, when a pedestrian is visible in both modalities, it is labeled $y^3$. Note that these label vectors encode the input states during training. With this strategy, the model can adaptively generate feature maps according to the state of the input pair, and thus detect objects robustly in both paired and unpaired situations.
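A tiny sketch of this assignment rule (function and argument names are mine, not the paper's):

```python
def assign_multilabel(visible_in_rgb: bool, visible_in_thermal: bool):
    """Return the [y_R, y_T] label vector for one ground-truth box.

    A box becomes invisible in one modality when semi-unpaired augmentation
    (a flip or crop applied to that modality alone) moves it out of view.
    """
    return [int(visible_in_rgb), int(visible_in_thermal)]

assert assign_multilabel(True, False) == [1, 0]   # y1: visible in RGB only
assert assign_multilabel(False, True) == [0, 1]   # y2: visible in thermal only
assert assign_multilabel(True, True)  == [1, 1]   # y3: visible in both
```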

Semi-Unpaired Augmentation

Obtaining realistic unpaired datasets is a challenge. The paper therefore proposes a simple yet effective way to deal with this: a lightweight data augmentation strategy called semi-unpaired augmentation.
As mentioned earlier, the main goal of this paper is the generality of the detection framework in both paired and unpaired situations, i.e., the model should distinguish in which modality a pedestrian is visible. To this end, unpaired images are generated from paired multispectral images. To prevent distortion of the augmented images, only geometric transformations are used, such as horizontal flipping and random resized cropping. (Image rotation is also commonly applied for augmentation in deep learning. It generally comes in two kinds: the first keeps the image size but loses part of the image information; the second builds a new image whose size changes with the rotation angle, which preserves the completeness of the image information.) More specifically, horizontal flipping is applied to each modality independently with a probability of 0.5.

Similarly, random resized cropping is applied with a probability of 0.5. Taken together, the augmentation breaks the pairing with a probability of 0.75. Note that the technique is applied independently to the two modalities, so all boxes transformed by the geometric augmentations are used as the ground truth for the multi-label scheme defined earlier.
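A sketch of how such an augmentation could look (assuming CHW image tensors; the crop window chosen here is my own illustrative value):

```python
import random
import torchvision.transforms.functional as TF

def semi_unpaired_augment(img_rgb, img_thr, p_flip=0.5, p_crop=0.5):
    """Apply geometric transforms to each modality INDEPENDENTLY,
    so that the pair is broken with some probability (0.75 per the paper)."""
    out = []
    for img in (img_rgb, img_thr):
        if random.random() < p_flip:      # independent horizontal flip
            img = TF.hflip(img)
        if random.random() < p_crop:      # independent resized crop
            h, w = img.shape[-2:]
            top, left = h // 8, w // 8    # illustrative crop margins
            img = TF.resized_crop(img, top, left,
                                  h - 2 * top, w - 2 * left, [h, w])
        out.append(img)
    return out  # ground-truth boxes must then be re-derived per modality
```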

Optimization

As mentioned earlier, $\phi_i$ is a fused feature map fed to the detection head. The detection head takes fused feature maps of several different resolutions as input, in order to detect pedestrians of different sizes. The concatenated feature map $\phi^*$ is defined as follows:
$$\phi^* = \phi_1 \oplus \phi_2 \oplus \cdots \oplus \phi_n$$
Then $\hat{y}_R$ and $\hat{y}_T$, the confidence score vectors corresponding to the same bounding box in each modality, are defined as follows:
$$[\hat{y}_R, \hat{y}_T] = \sigma\big(f^{cls}(\phi^*)\big)$$
where $f^{cls}$ and $\sigma$ are the classification layer and the sigmoid function, respectively. The prediction score is computed as the mean of the RGB and thermal confidence scores corresponding to the same bounding box. For multi-label classification, the network is optimized end-to-end by minimizing the binary cross-entropy (BCE) loss. The BCE formula is as follows:
$$\mathrm{BCE}(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big]$$
and the classification loss sums this term over the two modality labels:
$$L_{cls} = \mathrm{BCE}(y_R, \hat{y}_R) + \mathrm{BCE}(y_T, \hat{y}_T)$$

The localization loss is the same as in SSD. The final loss is the sum of the two loss terms:
$$L = L_{loc} + \lambda\, L_{cls}$$

where $\lambda$ is a weighting factor that balances the two loss terms, and $L_{loc}$ and $L_{cls}$ denote the localization and classification losses, respectively. For simplicity, $\lambda$ is set to 1 in the experiments, so the weighting does not change the result.
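Under these definitions, the overall objective could be sketched as follows (a simplified per-box view; the real implementation performs SSD-style anchor matching):

```python
import torch
import torch.nn.functional as F

def mlpd_loss(loc_pred, loc_gt, y_hat, y, lam=1.0):
    """L = L_loc + lambda * L_cls with multi-label BCE over [y_R, y_T].

    loc_pred, loc_gt: (N, 4) box regression tensors (SSD-style smooth L1)
    y_hat: (N, 2) sigmoid scores [y_hat_R, y_hat_T] per box
    y:     (N, 2) float multi-labels, e.g. [1, 0], [0, 1], [1, 1]
    """
    l_loc = F.smooth_l1_loss(loc_pred, loc_gt)
    l_cls = F.binary_cross_entropy(y_hat, y)  # BCE on each modality label
    return l_loc + lam * l_cls

def detection_score(y_hat):
    """Final prediction score: mean of the RGB and thermal confidences."""
    return y_hat.mean(dim=1)
```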

Experiments

Experimental setting
baseline: SSD implemented in PyTorch
backbone: SSD uses VGG16 for feature extraction. (Later SSD variants appear to use VGG19 for feature extraction.)
Because most pedestrians can be enclosed by a vertical bounding box, the aspect ratios of the anchor boxes are set to 1:1 and 1:2.
VGG16 with batch normalization, pretrained on ImageNet, is used from Conv1 to Conv5; the remaining convolution layers are initialized with values drawn from a normal distribution (std = 0.01). The model is trained with stochastic gradient descent (SGD) with an initial learning rate, momentum, and weight decay of 0.0001, 0.9, and 0.0005, respectively. The mini-batch size is set to 6 and the input images are resized to 512 (H) × 640 (W). Additional hyperparameters are provided in the authors' implementation.
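Under the stated hyperparameters, the optimizer setup would look roughly like this (a sketch; the `model` below is just a stand-in module):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the actual detector

# lr = 0.0001, momentum = 0.9, weight decay = 0.0005, as reported above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=5e-4)

batch = torch.randn(6, 3, 512, 640)  # mini-batch of 6, inputs 512 (H) x 640 (W)
```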

Datasets used in the experiments

KAIST Dataset: a multispectral pedestrian dataset consisting of 95,328 fully overlapping RGB-thermal image pairs captured in urban environments. The provided ground truth contains 103,128 pedestrian bounding boxes covering 1,182 unique pedestrians.

In the experiments, the standard train-02 protocol is followed, sampling 1 frame out of every 2, for a total of 25,076 training frames. For evaluation, the standard test-20 criterion is likewise followed, sampling 1 frame out of every 20, so all results are evaluated on 2,252 frames, of which 1,455 were captured during the day and 797 at night. Note that the paired annotations are used for training and the processed annotations for evaluation. (My understanding: training uses the paired annotations, i.e., the fully overlapping RGB-thermal image pairs, while evaluation uses the processed, i.e. non-paired, annotations.)

CVC-14 dataset:
The CVC-14 dataset is a multispectral pedestrian dataset captured with a stereo camera configuration.
It consists of 7,085 gray-thermal frame pairs for training and 1,433 for testing, with separate annotations in each modality. Unlike the KAIST dataset, where the two sensors are mechanically aligned, this dataset originally provides multispectral image pairs with non-overlapping regions and overlapping regions that contain some misalignment. However, the dataset authors publish cropped image pairs that do not contain the non-overlapping regions. Therefore, it is treated as a fully overlapping (paired) dataset, although it still suffers from pixel-level misalignment. Besides, there are some other problems, such as inaccurate ground-truth boxes, incorrect extrinsic parameters, and an out-of-sync capture system. Nonetheless, this dataset has been used in many papers, as it is one of the few practical datasets captured with stereo cameras.

Synthetic Datasets of Unpaired Images:
Realistic synthetic datasets are introduced to demonstrate robustness to unpaired inputs containing both overlapping and non-overlapping regions. As shown in the figure below, non-overlapping regions are defined as locations visible in only a single modality. Such regions arise naturally and vary according to the relative position of the RGB and thermal sensors.
[Figure 3: four synthetic mispairing cases]
The most common cases of non-overlapping regions are defined as shown in Fig. 3. Given a KAIST multispectral image pair, four mispairing cases are generated:
(a) RGB blackout;
(b) thermal blackout;
(c) side blackout;
(d) surrounding blackout.
"Blackout" here means missing image information.
The first two cases, (a) and (b), represent sensor failure, where one of the sensors does not work at all. For example, RGB sensors have poor visibility at night, and thermal sensors sometimes experience thermal crossover. To generate these cases, the RGB or thermal image is filled entirely with zeros. Cases (c) and (d) simulate the stereo camera setup and the EO/IR configuration, respectively.

Case (c) can be generated by vertically splitting the original image into three equal-sized segments. Finally, to generate case (d), a fully aligned RGB and thermal image pair is selected, one image is cropped to a smaller size, and the cropped image is inserted into a zeroed-out copy of the original, leaving a blacked-out surrounding region. The crop margin is 96 pixels at the top and bottom, and 120 pixels on the left and right. The sketch below illustrates all four cases.
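The four cases could be synthesized roughly as below; which regions are blacked out in which modality for cases (c) and (d) is my reading of the description, so treat this as a sketch:

```python
import torch

def make_unpaired(rgb, thr, case):
    """Generate one of the four synthetic mispairing cases for a (C, H, W) pair."""
    rgb, thr = rgb.clone(), thr.clone()
    _, h, w = rgb.shape
    if case == "rgb_blackout":        # (a) RGB sensor failure: all zeros
        rgb.zero_()
    elif case == "thermal_blackout":  # (b) thermal sensor failure: all zeros
        thr.zero_()
    elif case == "side_blackout":     # (c) stereo setup: three vertical strips,
        rgb[..., : w // 3] = 0        #     outer strips seen by one sensor only
        thr[..., 2 * w // 3 :] = 0
    elif case == "surround_blackout": # (d) EO/IR: crop 96 px top/bottom and
        inner = thr[:, 96 : h - 96, 120 : w - 120]   # 120 px left/right, then
        thr = torch.zeros_like(thr)                  # paste into a zero canvas
        thr[:, 96 : h - 96, 120 : w - 120] = inner
    return rgb, thr
```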

These synthetic images are useful for validating the robustness of a multispectral fusion model under unpaired conditions. They differ only slightly from real-world unpaired images, because all parameters for generating the synthetic cases are carefully chosen based on real-world sensor configurations.

Evaluation metrics: the standard log-average miss rate (LAMR), computed by sampling FPPI (false positives per image) in the range [0.01, 1], is used as the representative score; it is the most commonly used metric in pedestrian detection. This metric focuses on the high-precision regime rather than the low-precision one, which makes it more suitable for commercial solutions.
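For reference, LAMR is conventionally computed (following the Caltech pedestrian benchmark convention; the post does not spell this out) as the geometric mean of the miss rate at nine FPPI values evenly spaced in log-space:

$$\mathrm{LAMR} = \exp\!\left( \frac{1}{9} \sum_{i=1}^{9} \ln \mathrm{MR}(f_i) \right), \qquad f_i \in \{10^{-2}, 10^{-1.75}, \ldots, 10^{0}\}$$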

Evaluation of KAIST dataset and CVC-14 dataset

Since the method proposed in this paper aims to improve the generality of pedestrian detection in both paired and unpaired cases, it is important to first demonstrate its superiority in the paired case. The results are reported in the result tables attached at the end of this section.

Evaluation of Unpaired Datasets

The robustness and generality of the proposed method are demonstrated on the synthetic datasets. This experiment is meaningful because most previous fusion methods cannot handle the case where the two input images are unpaired (containing both overlapping and non-overlapping regions).

Ablation experiment

While the proposed method shows significant improvements, it is worth understanding the role of each component and how their combination works. A series of ablation experiments was performed; the results are shown in Table V.
The baseline network is an SSD-like halfway-fusion model with a miss rate of 11.77%. Using only semi-unpaired augmentation improves this to 9.51%. Adding the multi-fusion method further improves the figure to 8.49%, thanks to its ability to preserve modality information up to the last layer. Finally, adopting the multi-label learning strategy greatly improves the performance again. From this, the authors conclude that the proposed method encourages the model to learn more generalized and discriminative features for detecting pedestrians.
[Table V: ablation results]
Attached are a few experimental results:

1. Experimental results on the KAIST dataset
[result figure]
2. Experimental results on the CVC-14 dataset
[result figure]
3. Experimental results on KAIST under sensor failure
[result figure]
4. Experimental results on KAIST when the two cameras are not paired
[result figure]

Supplementary knowledge points

1. The evaluation metric Miss Rate in pedestrian detection

  1. TP (True Positive): predicted as a positive sample and the prediction is correct. The closer this number is to the number of annotated pedestrians in the validation set, the higher the detection rate of the detector.
  2. FP (False Positive): predicted as a positive sample but the prediction is wrong. This reflects the false-detection rate; the lower, the better.
  3. FN (False Negative): predicted as a negative sample but the prediction is wrong, i.e., a sample that should have been detected was missed. This reflects the missed-detection rate; the smaller, the better.
  4. Precision describes the proportion of TP in the detection results: Precision = TP / (TP + FP). The larger this value, the higher the detection accuracy.
  5. Recall describes the detection rate over the annotated pedestrians: Recall = TP / (TP + FN) = TP / GT.
  6. FPPI (False Positives Per Image) describes the average number of false positives per image. Given N images with FP false detections in the results, FPPI = FP / N.
  7. MR (Miss Rate) describes the missed-detection rate in the results: MR = FN / GT = 1 - Recall; the smaller, the better. The MR-FPPI curve is analogous to the Precision-Recall curve used in object detection: the two quantities trade off against each other, and improving one typically degrades the other, so the curve reflects the overall performance of the detector. Since in pedestrian detection the upper bound of FPPI per image is related to the pedestrian density, the MR-FPPI curve is more reasonable than the Precision-Recall curve in this field.
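Putting these definitions together, a minimal sketch of the per-threshold computation (function and variable names are mine):

```python
def miss_rate_and_fppi(tp, fp, fn, num_images):
    """Per-threshold detector statistics, using the definitions above."""
    gt = tp + fn              # total annotated pedestrians
    mr = fn / gt              # MR = FN / GT = 1 - Recall
    fppi = fp / num_images    # FPPI = FP / N
    return mr, fppi

# Example: 900 TP, 50 FP, 100 FN over 500 images
mr, fppi = miss_rate_and_fppi(900, 50, 100, 500)
print(f"MR = {mr:.3f}, FPPI = {fppi:.3f}")  # MR = 0.100, FPPI = 0.100
```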

Introduction to SSD:

SSD is a single-stage detection method that combines the anchor mechanism of the R-CNN family with the regression idea of YOLO.
SSD introduces multi-scale detection, performing detection on the feature map extracted at each scale.
In the beginning, single-stage detection methods lagged behind two-stage methods in accuracy, because class imbalance during training put the single-stage methods at a disadvantage. Focal Loss was later proposed to replace the traditional cross-entropy: it down-weights the easy background samples so that the model is biased towards the hard-to-detect target samples during training.

What is the difference between one-stage and two-stage object detection?

Two-stage object detection algorithms: first generate region proposals (RP), i.e., pre-selected boxes that may contain the objects to be detected, and then classify the samples with a convolutional neural network. Accuracy is higher, but speed is slower.

Main logic: feature extraction -> RP generation -> classification / localization regression.

Common two-stage object detection algorithms include the Faster R-CNN series and R-FCN.

One-stage object detection algorithms: without RP generation, features extracted by the network are used directly to predict object class and location. Speed is faster, and accuracy is slightly lower than that of two-stage algorithms.

Main logic: feature extraction -> classification / localization regression.

Common one-stage object detection algorithms include the YOLO series, SSD, and RetinaNet.

On the small-object problem

Multi-scale methods for small-object detection:
SSD was among the first to introduce the multi-scale idea, predicting on the feature map extracted at each scale; its detection of small objects improves on the YOLO algorithm.

Later work combined the FPN idea with SSD to further improve SSD's detection of small objects.
