Reading "GaitEdge: Beyond Plain End-to-end Gait Recognition for Better Practicality"

Summary

Problem 1:
Previous methods rely on intermediate modalities such as silhouettes, while some end-to-end methods that learn directly from RGB perform better.
Problem 2:
End-to-end methods absorb gait-irrelevant noise, i.e., low-level texture and color information.

This paper therefore focuses on cross-domain evaluation.

Introduction

Traditional methods take two steps: 1. extract intermediate modalities from RGB, such as silhouette masks or skeleton keypoints; 2. feed them into the downstream gait recognition network. This pipeline is insufficient in terms of both efficiency and effectiveness.

GaitEdge's intermediate modality is a novel synthetic silhouette: its edges consist of a trainable float mask, while the remaining regions are classical binary silhouettes.
The noise in RGB information is mainly distributed in the non-edge regions, i.e., the body interior and the background. Treating these regions as binary silhouettes therefore effectively prevents the leakage of gait-irrelevant noise.
Edge regions, meanwhile, play a crucial role in describing the shape of the human body.
Size-normalized alignment [9] is necessary for silhouette preprocessing, since the aspect ratio of the body must be preserved. But this operation is performed offline, so it is not differentiable and cannot be applied directly to align the synthetic silhouettes. To solve this problem, inspired by RoIAlign [6], the GaitAlign module is proposed to complete the GaitEdge framework. It can be viewed as a differentiable version of the alignment method proposed in [9].

The main contributions of this paper:

  1. The problem of gait-irrelevant noise mixing into the final gait representation is pointed out, and a cross-domain test is introduced to verify the leakage of RGB noise. Furthermore, since few gait datasets provide RGB videos, this paper collects Ten Thousand Gaits (TTG-200), which is roughly equal in size to the popular CASIA-B [26].
  2. GaitEdge, an end-to-end gait recognition framework, is proposed. Experiments on CASIA-B and TTG-200 prove that irrelevant RGB noise can be effectively blocked.
  3. A module called GaitAlign is proposed for silhouette-based end-to-end gait recognition, which can be considered a differentiable version of size normalization [9].

Related Work

2.1 Gait Recognition

Gait [23] is defined as a person's reproducible and characteristic walking pattern. A similar task, person re-identification [31], aims to find the same person in another place through another camera.
Despite their similarities, the two are fundamentally different: the first task focuses on walking patterns, while the second mainly relies on clothing for recognition. Therefore, the gait recognition network must be kept from acquiring information other than gait patterns, such as texture and color from RGB. (ReID cues interfere with gait, so the two tasks must be separated both in definition and in implementation.)

Mainstream Methods

  1. Model-based methods [15, 2, 14, 22] usually first extract the underlying structure of the human body, such as 2D or 3D skeleton keypoints, and then model the person's walking patterns. In general, these methods better mitigate the effects of clothing and more accurately describe body pose. Nevertheless, all of them struggle to model human structure in real surveillance scenarios due to low video quality. (Similar to the uncertainty of minutiae in finger-vein recognition.)
  2. Appearance-based gait recognition methods [24, 28, 4, 29, 5, 7, 16]. GaitSet [4] takes a sequence of silhouettes as input and achieved great progress. Subsequently, Fan et al. [5] proposed a focal convolution layer to learn part-level features and utilized a micro-motion capture module to model short-range temporal patterns. In addition, Lin et al. [16] proposed a 3D-CNN-based global and local feature extractor to obtain discriminative global and local representations from frames, significantly outperforming other methods.

2.2 End-to-end

The end-to-end concept [30, 14, 20] has been applied to gait recognition. Zhang et al. [30] proposed an autoencoder to disentangle appearance and gait information without explicit supervision of appearance and gait labels (i.e., self-supervised). Li et al. [14, 13] used a newly developed 3D human mesh model [17] as the intermediate modality, making the silhouettes generated by a neural 3D mesh renderer [12] consistent with those segmented from RGB images. Since 3D mesh models provide more useful information than silhouettes, this method achieves state-of-the-art results (a GAN-like training idea). However, using a 3D mesh model requires higher-resolution input RGB images, which is not feasible in practical surveillance scenarios (still dependent on the clarity of the raw pixels, the same limitation as minutiae-based methods). Different from the former two, Song et al. [20] proposed another type of end-to-end gait recognition framework, formed by directly connecting pedestrian segmentation and gait recognition networks, supervised by a joint loss, namely segmentation loss plus recognition loss (end-to-end in the true sense). This approach seems more applicable, but it can also let gait-irrelevant noise infiltrate the recognition network, since there is no explicit constraint. GaitEdge in this paper mainly raises and solves two key problems: cross-domain evaluation and silhouette misalignment.

3 Cross-Domain Evaluation

Existing end-to-end methods [30, 20, 14, 13] have greatly improved accuracy, and the introduction of RGB information is suspected to be the reason. To test this conjecture, two gait recognition paradigms are introduced and compared experimentally.

First, one of the best-performing two-step gait recognition methods, GaitGL [16], is used as the baseline. A simple and straightforward end-to-end model named GaitGL-E2E is then introduced for a fair comparison. As shown in Fig. 2(a) and (b), both methods use the same modules, except that GaitGL-E2E replaces binary masks with float-encoded silhouettes produced by a trainable segmentation network, namely U-Net [18]. The experiments define single-domain evaluation as training and testing on CASIA-B* [26]. Correspondingly, cross-domain evaluation is defined as training on another dataset, TTG-200, but testing the trained model on CASIA-B*. More implementation details are given in Section 5.

As shown in the single-domain part of Fig. 2(d), GaitGL-E2E easily outperforms GaitGL, since it has more trainable parameters and float masks carry more information than binary masks. However, the float values flowing into the recognition network inevitably carry the texture and color of the RGB image, making the recognition network learn gait-irrelevant information and hurting cross-domain performance. Indeed, the cross-domain part of Fig. 2(d) shows that GaitGL-E2E loses its single-domain advantage, and in the most challenging case, CL (changing clothes), is even much lower than GaitGL. This suggests that end-to-end models are more likely to learn easily recognizable, coarse-grained RGB information rather than fine-grained, imperceptible gait patterns. Together, the two experiments show that GaitGL-E2E does absorb RGB noise, so it is no longer reliable for gait recognition under practical cross-domain requirements. GaitEdge therefore consists of a Gait Synthesis module and a differentiable GaitAlign module, as shown in Fig. 2(c). The most important difference between GaitEdge and GaitGL-E2E is that the transmission of RGB information is controlled by hand-crafted silhouette synthesis.

4 Methods

4.1 Gait Synthesis

Edges (the outlines of silhouettes) contain the most discriminative information in silhouette images [25]. The interior of the silhouette can be regarded as low-frequency content carrying less information, yet if the interior were removed entirely, the remaining information would be too limited to train the recognition network. Therefore, the designed module, named "Gait Synthesis", combines trainable edges with a fixed interior through mask operations: only the edge part of the silhouette is trainable, while the regions outside the edge are taken from the frozen segmentation output.
M_s = M_e × P + M_i, where P is the float mask predicted by the segmentation network; keeping the interior M_i binary shields the noise.

Preprocessing

A non-trainable preprocessing operation is designed to obtain M_e and M_i. First, the input RGB image is segmented with the trained segmentation model to obtain the silhouette M. Second, dilated and eroded silhouettes are obtained using classical morphological operations with a 3×3 flat structuring element. Finally, M_e is obtained by element-wise exclusive-or (⊻):
M_i = erode(M)
M_e = M_i ⊻ dilate(M)
Limiting the adjustable area in this way preserves the most valuable silhouette features while removing most of the low-level RGB noise. It is worth mentioning that, thanks to its simple design, Gait Synthesis can be detachably integrated into previous silhouette-based end-to-end methods.
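
As a minimal PyTorch sketch of the two steps above (assuming the common max-pooling implementation of binary morphology; the function names are illustrative, not from the paper's code):

import torch
import torch.nn.functional as F

def dilate(m: torch.Tensor) -> torch.Tensor:
    # 3x3 dilation of a binary mask of shape (n, 1, h, w) via max pooling
    return F.max_pool2d(m, kernel_size=3, stride=1, padding=1)

def erode(m: torch.Tensor) -> torch.Tensor:
    # erosion is dilation of the complement
    return 1 - dilate(1 - m)

def gait_synthesis(p: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    # p: float mask from the trainable segmentation network
    # m: binary silhouette M from the frozen segmentation model
    m_i = erode(m)          # fixed interior M_i
    m_e = dilate(m) - m_i   # edge ring M_e (equals the XOR, since M_i lies inside dilate(M))
    return m_e * p + m_i    # M_s = M_e x P + M_i

Gradients reach the segmentation network only through the m_e * p term, so the interior and background stay binary and block texture leakage.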

4.2 Gait Alignment

Alignment is critical for all silhouette-based gait recognition methods. Since size normalization of silhouettes was first used in the OU-ISIR gait database [9], almost all silhouette-based methods have preprocessed silhouette input by size normalization, which removes noise and benefits recognition. However, the previous end-to-end method GaitNet [20], which feeds the segmented silhouette directly into the recognition network, struggles with such misalignment. Therefore, a differentiable gait alignment module named GaitAlign is proposed, which centers the body in the image and makes it fill the image vertically. First, recall size normalization [9], of which GaitAlign can be viewed as a differentiable version: the top, bottom, and horizontal center of the body are located, the body is scaled to the target height while preserving its aspect ratio, and it is then zero-padded along the x-axis to reach the target width.
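
For reference, a minimal NumPy/OpenCV sketch of this classical, offline (non-differentiable) size normalization, assuming the 64×44 target used later in Section 5:

import numpy as np
import cv2

def size_normalize(sil: np.ndarray, H: int = 64, W: int = 44) -> np.ndarray:
    # sil: binary silhouette (h, w), uint8; assumes a non-empty mask
    ys, _ = np.nonzero(sil)
    body = sil[ys.min():ys.max() + 1]        # crop from top to bottom of the body
    scale = H / body.shape[0]                # scale to target height, keep aspect ratio
    body = cv2.resize(body, (max(1, int(body.shape[1] * scale)), H))
    padded = np.pad(body, ((0, 0), (W, W)))  # zero-pad so the crop below stays in bounds
    cx = int(np.nonzero(padded)[1].mean())   # horizontal center of the body
    return padded[:, cx - W // 2 : cx - W // 2 + W]

Because of the rounding, cropping, and integer resizing, none of this is differentiable, which is exactly what GaitAlign replaces.
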
The pseudocode in Algorithm 1 describes the GaitAlign procedure. First, zeros of half the width are padded on the left and right sides, ensuring that the cropping operation cannot exceed the boundary. Then, according to the aspect ratio and the target size, the precise positions of four regular sampling points are calculated. Finally, RoIAlign [6] is applied at the positions given in the previous step. The result is a standard-size, image-filling silhouette whose aspect ratio remains unchanged (see the output of GaitAlign in Figure 3). Notably, the GaitAlign module is still differentiable, keeping end-to-end training feasible.

Algorithm 1: GaitAlign pseudocode.
# s_in : silhouettes from the segmentation output, (n,1,h,w)
# size : target size, (H,W)
# r    : aspect ratio of the human body, (n)
# s_out: aligned silhouettes, (n,1,H,W)
# pad along the x-axis so the crop cannot exceed the boundary
s_in = ZeroPad2d((w/2, w/2, 0, 0))(s_in)  # (n,1,h,2w)
binary_mask = round(s_in)  # binary silhouettes
# compute coordinates that restore the aspect ratio r
left, top, right, bottom = bbox(binary_mask, r, size)
# obtain the new silhouettes via the differentiable roi_align
s_out = roi_align(s_in, (left, top, right, bottom), size)
bbox: obtains the four positions of the bounding box while keeping the aspect ratio.
roi_align: crops and resizes the region of interest without losing spatial alignment.
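
A runnable sketch of the same idea with torchvision's roi_align; the box computation below is my reading of bbox (span the body vertically, center it horizontally, and size the box so the body keeps its aspect ratio after resizing to (H, W)):

import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def gait_align(s_in: torch.Tensor, size=(64, 44)) -> torch.Tensor:
    # s_in: float silhouettes (n, 1, h, w); assumes each mask is non-empty
    n, _, h, w = s_in.shape
    H, W = size
    s_pad = F.pad(s_in, (w // 2, w // 2, 0, 0))  # pad the x-axis, (n, 1, h, 2w)
    mask = (s_pad > 0.5).float()                 # binary_mask = round(s_in)
    boxes = []
    for i in range(n):
        ys, xs = mask[i, 0].nonzero(as_tuple=True)
        top, bottom = ys.min().float(), ys.max().float()
        cx = xs.float().mean()                   # horizontal center of the body
        bw = (bottom - top) * W / H              # box width preserving the aspect ratio
        boxes.append(torch.stack(
            [torch.full_like(cx, i), cx - bw / 2, top, cx + bw / 2, bottom]))
    # gradients flow into the silhouette values through bilinear sampling
    return roi_align(s_pad, torch.stack(boxes), output_size=(H, W), aligned=True)

Only the rounded mask is used to locate the box, matching the pseudocode; the crop-and-resize itself stays differentiable with respect to s_in.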

5 Experiments

5.1 Datasets

An ideal gait dataset possesses several important properties: available RGB video, rich camera views, and diverse walking conditions.

There are several datasets available for gait recognition, e.g., CASIA-B [26], OUMVLP [21], Outdoor-Gait [20], FVG [30], GREW [32], etc. However, not all of them suit end-to-end gait recognition methods. For example, this work cannot use two of the world's largest gait datasets, OUMVLP [21] and GREW [32], since neither provides RGB videos.

CASIA-B [26]

It contains 124 subjects walking indoors and is probably the most popular gait dataset, including 11 views (0°–180°) and three walking conditions: normal walking (NM#01-06), walking with a bag (BG#01-02), and walking with changed clothes (CL#01-02). Strictly following previous studies, the first 74 subjects form the training set and the remaining subjects the test set. For the testing phase, the first 4 sequences (NM#01-04) serve as the gallery set, while the remaining 6 sequences are grouped into 3 probe subsets: NM#05-06, BG#01-02, and CL#01-02. In addition, since the silhouettes of CASIA-B were obtained by outdated background subtraction, they contain much noise caused by the background and the subjects' clothes. The silhouettes of CASIA-B were therefore re-annotated, and the result is named CASIA-B*. All experiments are performed on the basis of this new annotation.

TTG-200

This dataset contains 200 subjects walking in the wild, each walking under 6 different conditions (e.g., carrying items, different clothing, answering the phone). During each walk, the subject is filmed by 12 cameras (unlabeled) placed at different viewpoints, which means each subject ideally has 6 × 12 = 72 gait sequences. The experiments take the first 120 subjects for training and the last 80 for testing. Furthermore, the first sequence (#1) serves as the gallery set, and the remaining 5 sequences (#2-6) as the probe set.
Compared with CASIA-B, TTG-200 has three main differences: (1) the background of TTG-200 is more complex and diverse (collected in multiple different outdoor scenes); (2) the data of TTG-200 are mostly captured from a high, bird's-eye viewpoint, while CASIA-B is mostly captured horizontally; (3) the image quality of TTG-200 is better. The two datasets can thus be considered distinct domains.

Implementation

Data Preprocessing

ByteTrack [27] is first adopted to detect and track pedestrians in the raw RGB videos of CASIA-B [26] and TTG-200, followed by body segmentation and silhouette alignment [9] to extract gait sequences. The obtained silhouettes are resized to 64×44 and serve either as input to the two-step gait recognition methods or as ground truth for the pedestrian segmentation network in the end-to-end methods.

Pedestrian Segmentation

The popular U-Net [18] is used as the segmentation network, supervised by the binary cross-entropy loss [10] L_seg. The input size is set to 128×128×3, the U-Net channels to {3, 16, 32, 64, 128, 64, 32, 16, 1}, and training uses SGD [19] (batch size = 960, momentum = 0.9, initial learning rate = 0.1, weight decay = 5×10⁻⁴). For each dataset, the learning rate is multiplied by 1/10 every 10,000 iterations, and the network is trained until convergence.
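
A minimal sketch of this optimization setup in PyTorch (the one-layer conv stands in for the real U-Net, and the logits-based loss is an assumption):

import torch
from torch import nn, optim

seg_net = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the U-Net above
l_seg = nn.BCEWithLogitsLoss()                       # binary cross-entropy on the mask
opt = optim.SGD(seg_net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# multiply the learning rate by 1/10 every 10,000 iterations
sched = optim.lr_scheduler.StepLR(opt, step_size=10_000, gamma=0.1)

Each training step then computes l_seg(seg_net(rgb), gt_mask), backpropagates, and calls opt.step() and sched.step().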

Gait recognition

The recent GaitGL [16] is used as the recognition network, strictly following the settings of the original paper.

Joint Training Details

In this step, the sampler and batch size of the training data are similar to those of the gait recognition network. The segmentation and recognition networks are jointly fine-tuned with a joint loss L_joint = λ·L_seg + L_rec, where L_rec is the loss of the recognition network and λ, the weight of the segmentation loss, is set to 10. To make joint training converge faster, the end-to-end model is initialized with the parameters of the pretrained segmentation and recognition networks, whose initial learning rates are set to 10⁻⁵ and 10⁻⁴, respectively. Furthermore, the first half of the segmentation network (U-Net) is frozen to keep the segmentation result human-shaped. The end-to-end network is jointly trained for 20,000 iterations, and the learning rate is reduced by 1/10 at the 10,000th iteration.
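
A sketch of these parameter groups and the joint loss (both networks are placeholders, and "first half" is read here as the encoder layers):

import torch
from torch import nn, optim

seg_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),   # "first half" (encoder), frozen
                        nn.Conv2d(8, 1, 3, padding=1))   # "second half", fine-tuned
rec_net = nn.Conv2d(1, 16, 3, padding=1)                 # stand-in for GaitGL

for p in seg_net[0].parameters():
    p.requires_grad = False                              # freeze the first half

opt = optim.SGD([
    {"params": [p for p in seg_net.parameters() if p.requires_grad], "lr": 1e-5},
    {"params": rec_net.parameters(), "lr": 1e-4},
], momentum=0.9)

lam = 10.0
l_seg, l_rec = torch.zeros(()), torch.zeros(())  # per-batch losses, computed elsewhere
loss = lam * l_seg + l_rec                       # L_joint = λ·L_seg + L_rec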

5.2 Performance Comparison

Single-domain evaluation

The performance of traditional two-step gait recognition methods is much lower than that of the two end-to-end methods.
The accuracy of GaitEdge is slightly lower than that of GaitGL-E2E,
but GaitGL-E2E has a higher risk of overfitting to gait-irrelevant noise, because it directly uses the float mask generated by the segmentation network as input to the recognition network. Further cross-domain evaluations are therefore carried out to support this view.

Cross-domain evaluation

If some irrelevant noise, i.e., texture and color, dominates the gait representation used for human recognition, then under cross-domain settings the recognition accuracy will drop sharply, because the extracted features cannot represent relatively robust gait patterns.
All methods suffer a significant performance drop compared to single-domain. Although GaitGL-E2E has the highest accuracy in the single domain, it achieves the worst performance across domains, from CASIA-B* to TTG-200. In contrast, GaitEdge achieves better performance than any other published method in cross-domain evaluation, although it is about 2% lower than GaitGL-E2E in the single domain. This cross-domain evaluation therefore not only shows that GaitEdge is far more robust than GaitGL-E2E, but also indicates that GaitEdge is a practical, state-of-the-art framework for end-to-end gait recognition.

Comparison with other end-to-end methods

GaitEdge is compared with three previous end-to-end gait recognition methods under different views on CASIA-B*. Table 3 shows that GaitEdge achieves nearly the highest accuracy under all walking conditions, especially CL (+5.7% over MvModelGait), which shows that GaitEdge is clearly robust to color and texture (clothing changes).

5.3 Ablation Studies

Impact of edges

Edges are extracted with structuring elements of several sizes; the larger the structuring element, the larger the edge area. According to the results in Table 4, as the structuring element grows, single-domain performance increases correspondingly, but cross-domain performance decreases almost simultaneously. This shows that the area the float mask occupies in the intermediate synthetic silhouette is negatively correlated with GaitEdge's cross-domain performance. The reason GaitGL-E2E fails in cross-domain evaluation is that it is equivalent to GaitEdge with an infinitely large structuring element. Furthermore, the non-edge regions, i.e., the body interior and the background, are not suited to float encoding in end-to-end gait recognition frameworks.

Impact of GaitAlign

Pedestrian detection in natural scenes (the upstream task) tends to be much worse than in controlled environments (i.e., CASIA-B* and TTG-200). To simulate this complex situation, object detection is first performed on the CASIA-B* videos, and the detected boxes are then shifted by a random pixel offset, applied with probability 0.5 in both the vertical and horizontal directions. As shown in Figure 6(a), the bottom image is perturbed to simulate the natural situation. Figure 6(b) shows that the average accuracy after alignment is significantly improved. The accuracy of normal walking (NM) drops slightly, i.e., -0.38%, but this is because NM accuracy is already close to its upper limit.
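
A sketch of that perturbation (the offset range max_offset is an assumption; the text only fixes the probability at 0.5):

import random

def jitter_box(left: int, top: int, right: int, bottom: int, max_offset: int = 8):
    # with probability 0.5, shift the detected box by a random pixel
    # offset in both the vertical and horizontal directions
    if random.random() < 0.5:
        dx = random.randint(-max_offset, max_offset)
        dy = random.randint(-max_offset, max_offset)
        left, right = left + dx, right + dx
        top, bottom = top + dy, bottom + dy
    return left, top, right, bottom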

5.4 Visualization

To better understand the performance degradation of GaitGL-E2E and the effectiveness of GaitEdge, the intermediate results produced by GaitGL-E2E and GaitEdge, together with the ground truth of the same frames, are visualized. Specifically, for GaitGL-E2E, the intermediate results in (a), (b), (c), and (d) capture more background and texture information, and some body parts are eliminated, such as the legs in (e) and (f). For GaitEdge, the intermediate results are more stable and reasonable, making it more robust.

6 Conclusion

This paper proposes a novel end-to-end gait recognition framework, GaitEdge, which addresses the performance degradation in cross-domain situations. Specifically, a Gait Synthesis module is designed to mask the fixed body with adjustable edges obtained through morphological operations. In addition, a differentiable module named GaitAlign is proposed to address the body-position jitter caused by the upstream pedestrian detection task. Extensive and comprehensive experiments are performed on two datasets, CASIA-B* and the newly collected TTG-200. The results show that GaitEdge significantly outperforms previous methods, suggesting that GaitEdge is a more practical end-to-end paradigm that can effectively block RGB noise. Furthermore, this work exposes the cross-domain problem neglected by previous studies, providing a new perspective for future research.
