【Paper Notes】JAFN

Joint Attentive Spatial-Temporal Feature Aggregation for Video-Based Person Re-Identification

Summary

This paper proposes a joint attentive spatial-temporal feature aggregation network (JAFN) for video-based person re-identification, which simultaneously learns a quality-aware model and a frame-aware model to achieve attention-based spatial-temporal feature aggregation.
Specifically:

  • A CNN is used to learn spatial features, while an LSTM is introduced to learn temporal features separately. For feature aggregation, two attention mechanisms are introduced to generate quality scores and frame scores: the quality score measures image quality to guide spatial feature aggregation, and the frame score measures the saliency of each frame for temporal feature aggregation.
  • Attention-weighted pooling is then used to aggregate the quality-aware spatial features and the frame-aware temporal features, and residual learning is introduced between the LSTM and the CNN for adaptive spatial-temporal feature fusion (see the sketch after this list).
  • Data balancing is used to alleviate the identity imbalance problem in video-based Re-ID datasets.
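A minimal PyTorch-style sketch of how these pieces might fit together. Layer sizes, module names, and the use of softmax-normalized scores are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class JAFNSketch(nn.Module):
    """Sketch of the JAFN idea: quality-weighted spatial pooling,
    frame-weighted temporal pooling, and residual fusion (hypothetical)."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, feat_dim, 3, 2, 1),
                                 nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten())            # per-frame spatial feature
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.quality_head = nn.Linear(feat_dim, 1)        # quality score per frame
        self.frame_head = nn.Linear(feat_dim, 1)          # frame-saliency score

    def forward(self, clip):                              # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        f = self.cnn(clip.flatten(0, 1)).view(B, T, -1)   # (B, T, D) spatial features
        q = torch.softmax(self.quality_head(f), dim=1)    # quality attention over frames
        spatial = (q * f).sum(dim=1)                      # quality-aware spatial feature
        h, _ = self.lstm(f)                               # temporal feature per step
        a = torch.softmax(self.frame_head(h), dim=1)      # frame-aware attention
        temporal = (a * h).sum(dim=1)                     # frame-aware temporal feature
        return spatial + temporal                         # residual (element-wise) fusion
```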

Introduction

Recently, more and more research has focused on video-based Re-ID. Some methods fuse frame features directly through max pooling or average pooling, but because some frames are blurred or cluttered, this often fails to produce effective representations. To address this, several studies focus on selecting the most discriminative frames of a person; the work of [10] uses a quality-aware network that pays more attention to high-quality images.

However, these methods consider only spatial features, which are susceptible to changes in camera viewpoint. The works of [11] and [12] use deep recurrent networks (RNNs) for video-based person re-identification to extract temporal features, but the temporal features are simply averaged over the per-frame features, ignoring the different saliency of frames for temporal feature learning.

The work in [14] uses an attention model to focus on more important regions and frames, making the features learned by the RNN more effective. However, an RNN cannot fully integrate the information of all frames in a sequence, and its output easily loses important information from the early frames; the resulting temporal feature lacks sufficient appearance information, which limits performance. How to jointly aggregate spatial and temporal features thus remains a promising and open problem.

(The core idea of the paper is here!!! It is almost the same as the abstract.)

To solve the above problems, we propose a joint attentive spatial-temporal feature aggregation network (JAFN) for video-based person re-identification.
JAFN combines spatial and temporal features to obtain more discriminative features, thereby improving the performance of video-based Re-ID.
[Figure 1 (framework overview) not preserved]

As shown in Figure 1, we propose to learn quality-aware and frame-aware models to obtain attention-based spatial-temporal feature aggregation. Specifically, a CNN is used to learn spatial features, and an LSTM is introduced to learn temporal features separately. For feature aggregation, two attention mechanisms generate quality scores and frame scores: the quality score measures the image quality used for spatial feature aggregation, and the frame score measures the saliency of each frame's contribution to the temporal feature.
On this basis, attention-weighted pooling is used to aggregate the quality-aware spatial features and the frame-aware temporal features. For adaptive fusion of the two features, we introduce residual learning between the LSTM and the CNN: the extracted temporal features are added element-wise to the reference spatial features to obtain a more discriminative fused feature.
We also propose data balancing to alleviate the data imbalance problem in video-based Re-ID datasets.

The contributions of this work are summarized as follows:
(1) A joint attentive feature aggregation mechanism is proposed, which combines spatial and temporal features for video-based person re-identification;
(2) A residual learning mechanism is proposed to automatically learn a more discriminative spatial-temporal feature fusion;
(3) A comprehensive comparison and discussion on several representative datasets is carried out, analyzing the effectiveness and generalization of the method.

Related work

He et al. [31] proposed a residual learning framework in which a deeper block's output is added element-wise to its higher-level input reference; by learning residuals, the deeper layers can indirectly and more easily fit the ideal optimal mapping. It is easier to learn the residual than to learn the desired layer output directly [31]. The performance improvement of this classification framework demonstrates the effectiveness of residual learning. Using residual learning to improve existing deep person re-identification architectures is therefore potentially helpful for achieving adaptive spatial-temporal feature fusion.
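As a reminder of the core idea (the standard formulation from [31], not specific to JAFN): a residual block learns the residual mapping and adds the input back through an identity shortcut.

```latex
% Residual learning: instead of fitting the desired mapping H(x) directly,
% the block fits the residual F(x) = H(x) - x and recovers H(x) = F(x) + x.
y = \mathcal{F}(x, \{W_i\}) + x
```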

Method

[Figure 2 (detailed JAFN architecture) not preserved]
JAFN has two branches, used to generate scores and features. The score-generation branch produces quality scores and frame scores so that the system focuses on more meaningful features; the feature-generation branch produces spatial and temporal features respectively. JAFN therefore consists of three main parts: quality-aware attention for spatial feature aggregation, frame-aware attention for temporal feature aggregation, and residual learning for spatial-temporal fusion.

In addition, data balancing is adopted to further improve JAFN.

1. QUALITY-AWARE ATTENTION

As shown in the figure above, the image sequence is passed through two fully convolutional networks (FCN1 and FCN2) to generate quality scores and feature representations, respectively. The design of the quality-aware attention module is inspired by [10]; its purpose is to measure how useful each input image is for spatial feature aggregation. Intuitively, high-quality images are easier to recognize, while low-quality images are usually less helpful to the aggregated representation. Therefore, if an image is sharp and has little clutter, its quality score should in theory be higher, and its features are given more attention.
The input image vector s is fed into FCN1 to obtain a 3-dimensional score vector, which then passes through a sigmoid function and normalization to obtain the quality score.
The specific parameters of the FCN1 layer are as follows:
[Table image not preserved: FCN1 layer parameters]
Meanwhile, the per-frame spatial features obtained from the FC2 layer are aggregated, weighted by the quality scores, to form the final spatial feature.
The formula is as follows:
[Equation images not preserved: quality-score computation and quality-weighted spatial aggregation]
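The exact formula images are not preserved. A plausible form, assuming the quality-weighted sum that is standard in quality-aware aggregation (e.g. [10]), would be:

```latex
% Assumed form: sigmoid quality scores, normalized over the sequence,
% weight the per-frame spatial features in the aggregation.
\mu_t = \frac{\sigma(\mathrm{FCN1}(s_t))}{\sum_{k=1}^{T} \sigma(\mathrm{FCN1}(s_k))}, \qquad
f_{\mathrm{spatial}} = \sum_{t=1}^{T} \mu_t \, f_t
```

where $\sigma$ is the sigmoid function, $s_t$ is the input representation of frame $t$, and $f_t$ is its FC2 spatial feature.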

2. FRAME-AWARE ATTENTION

Spatial features are often challenged by changes in viewpoint. In this section, we propose to focus on more reliable temporal features to help video-based Re-ID.
We introduce a recurrent neural network (LSTM) to learn the temporal features of the image sequence separately. At the same time, different frames in a sequence contribute differently to temporal feature extraction.
As shown in Figure 1, because the temporal features mainly contain periodic information such as gait, images without clutter around the legs or hands can, in theory, provide more stable temporal information, so these frames should be given more attention. Motivated by these observations, we propose a frame-aware attention module to obtain attention-based temporal features.
The schematic of frame-aware attention is also shown in Figure 2. The LSTM in JAFN receives the feature vectors output by the CNN and accumulates features over the preceding frames of the video sequence.
The input to the LSTM is the feature vector produced by the CNN; the LSTM learns long-term dependencies in the person sequence and memorizes information over long spans. It can be expressed by the following formulas:
[Equation images not preserved: LSTM update formulas]
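The formula images are not preserved; the standard LSTM update equations, which this description presumably follows, are:

```latex
% Standard LSTM cell: input, forget and output gates, cell state, hidden state.
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

where $x_t$ is the CNN feature of frame $t$ and $h_t$ is the temporal feature at step $t$.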

3. RESIDUAL LEARNING MECHANISM

For training, JAFN is optimized with a Siamese network, a triplet loss, and a softmax loss, making full use of the label information to pull positive samples together and push negative samples apart.
For the Siamese network and the triplet loss, the images are grouped and labeled as positive or negative. In our case, each training sample contains three sequences, an "anchor", a "positive", and a "negative": the "anchor" and "positive" come from the same person under different cameras, and the "negative" comes from a different, randomly chosen person.
The formula is as follows:

[Equation image not preserved: triplet loss formula]
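The formula image is not preserved; the standard triplet loss, which this description matches, has the form:

```latex
% Standard triplet loss with margin m over anchor a, positive p, negative n.
L_{\mathrm{triplet}} = \max\left(0,\; \|f_a - f_p\|_2^2 - \|f_a - f_n\|_2^2 + m\right)
```

where $f_a$, $f_p$, $f_n$ are the fused features of the anchor, positive, and negative sequences and $m$ is the margin.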

4. DATA BALANCE

In order to further improve the performance of the JAFN model, we propose to perform data balancing to alleviate the problem of data imbalance between identities.
In the person Re-ID task, most images are concentrated in a few identities, while the remaining identities have only a few images. This creates difficulties for learning algorithms because they become biased towards the majority groups. To alleviate this inconsistency, we propose to balance the identity distribution based on the original images.
We augment the original dataset from itself to balance the data distribution, so that each identity contains the same number of images. Specifically, for a dataset D containing N identities, where person i has p_i images, we take the largest p_i as the target number, and then duplicate original images to make up the shortfall for identities with too few sequences.
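A minimal sketch of this replication-based balancing; the dataset layout and function names are assumptions:

```python
import random
from collections import defaultdict

def balance_identities(samples):
    """Oversample by duplicating images so every identity reaches the
    size of the largest identity. `samples` is a list of (image_path, pid)."""
    by_pid = defaultdict(list)
    for path, pid in samples:
        by_pid[pid].append(path)

    target = max(len(paths) for paths in by_pid.values())   # largest p_i

    balanced = []
    for pid, paths in by_pid.items():
        balanced.extend((p, pid) for p in paths)             # keep originals
        shortfall = target - len(paths)
        extra = random.choices(paths, k=shortfall)           # duplicate to fill the gap
        balanced.extend((p, pid) for p in extra)
    return balanced
```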

Results

[Results table image not preserved]

Origin blog.csdn.net/qq_37747189/article/details/110109638