[Paper notes] A review of gait recognition based on deep learning "Gait Recognition Based on Deep Learning: A Survey"

"Gait Recognition Based on Deep Learning: A Survey"
was published in the journal "ACM Computing Surveys" in January 2022.

Original link .

1 Introduction

As an important branch of biometrics, gait recognition focuses on identifying people by measuring characteristics of the human body and its structure, such as the dimensions of the torso and limbs, together with the spatio-temporal information that characterizes an individual's manner of moving.
Gait recognition methods are effective in surveillance systems or in ambiguous detection environments where uniquely identifying biological traits such as fingerprints, faces, or irises are hard to capture. In addition, compared with other biometric modalities, gait recognition is difficult for attackers to spoof (the recognition system offers high security) and gait information is easy to collect.
However, identifying an individual by the way they walk and move is not easy. Standard gait recognition pipelines (including data preprocessing, feature extraction, etc.) are often limited and challenged by the complexity of the environment and the task. The emergence of deep learning has provided innovative ideas for further research on gait recognition.
There are three main purposes of writing this article:

  1. Systematically introduce the latest and most notable research results;
  2. Provide a substantive and illustrative theoretical background on gait recognition, explore its roots in biometrics, and review popular tools for gait feature extraction and architectures that address its limitations;
  3. List a catalogue of public datasets available for gait recognition tasks.

2. Theoretical background

2.1 Biometrics

Long before the advent of computers, identifying people was already a considerable challenge. At that time, experts manually verified restricted information (such as bank transactions) by analyzing and comparing documents, signatures, and other traits. As society became increasingly digital, the importance of improving recognition accuracy grew, and a large number of related techniques appeared in the literature.
A standard biometric sample needs to have the following characteristics:

  • Universality: everyone should have the characteristic;
  • Particularity: the characteristic differs between any two persons;
  • Permanence: the characteristic remains constant over a certain period of time;
  • Collectability: the characteristic can be measured quantitatively.

In addition, in an actual biometric system, there are some problems that cannot be ignored:

  • Performance: factors such as dataset resources, operation, and environment may affect recognition accuracy and speed;
  • Acceptability: the extent to which people are willing to accept a certain biometric identifier in their daily lives;
  • Evasion: the ease with which a recognition system can be "fooled" by fraudulent methods.

Under the above constraints, fingerprint, iris, face and gait recognition technologies perform particularly well, and these biometric methods will be briefly reviewed below.

2.1.1 Fingerprint identification

Fingerprint identification, formally known as dactyloscopy, is widely used because the ridge pattern on the surface of the fingers gives each person a unique intrinsic characteristic. In addition, fingerprint features are stable and degrade extremely little over time, which makes digital fingerprint image databases highly reliable.
The first fingerprinting models were devised in the late 1960s, based on a system created by Francis Galton in the 19th century, known as Galton points. Since then, many works have addressed the problem of fingerprint recognition from different perspectives, such as: digital image processing, generative adversarial networks, filter representations, etc.
Fingerprint identification systems are considered to be among the most accurate and reliable biometric systems, but challenges remain, such as unsatisfactory accuracy under non-ideal conditions and security issues such as spoofing attacks.

2.1.2 Iris recognition

The iris refers to a thin, colored, circular structure in the human eye that controls the diameter of the pupil and thus the amount of light entering the retina. This structure has great advantages in the field of human recognition, because it is basically unchanged as the environment changes and time passes. In addition, iris recognition is one of the most accurate, low-cost, and convenient identification methods because it is performed through images and does not require human contact.
Most commercial iris recognition models use an integro-differential operator to identify the boundaries of the iris. The operator acts as a circular edge detector under the assumption that the pupil is circular. Subsequent studies introduced additional mathematical operations (e.g., parabola fitting, normalization), making the approach more flexible and robust.
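For reference, the integro-differential operator mentioned above is usually written as follows (this is the standard formulation from the iris-recognition literature, not an equation reproduced from the survey):

$$
\max_{(r, x_0, y_0)} \left| G_\sigma(r) \ast \frac{\partial}{\partial r} \oint_{r, x_0, y_0} \frac{I(x, y)}{2\pi r} \, ds \right|
$$

where $I(x, y)$ is the eye image, $G_\sigma(r)$ is a Gaussian smoothing kernel of scale $\sigma$, and the search is over candidate circle centres $(x_0, y_0)$ and radii $r$ for the circular boundary with the largest smoothed radial derivative of the normalized contour integral.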

2.1.3 Face recognition

Face recognition has been widely used in identification and authentication systems across diverse fields such as banking, military services, and public security. The technology became popular after the Eigenfaces method was proposed by Turk and Pentland in the early 1990s. Over the following decade, several holistic recognition methods emerged based on low-dimensional representations such as linear subspaces and sparse representations. However, holistic methods are not robust to changes in face alignment, which led to local recognition methods such as Local Binary Patterns (LBP) and Gabor-feature-based classification, whose goal is to train filters for image feature extraction such that differences between images of the same person are minimized.
A few years ago, deep learning models gained extra attention in biometrics, including face recognition, after the AlexNet neural network won the ImageNet competition. A special "family" of networks known as convolutional neural networks is capable of approaching human cognition.
Although face recognition has advanced to an unprecedented level in the past few years, the field is still affected by factors such as expression, pose, illumination, aging, and facial occlusion. Tiong et al. used multimodal biometrics to address these challenges.

2.1.4 Multimodal biometrics

One of the most significant advantages of the human brain over computer-based methods in person recognition is its ability to evaluate multiple modalities of descriptive information (face, gait, hair and eye colour, etc.). To mimic this capability, a new biometric mode, "multimodal biometrics", has been proposed. The approach combines different biometric methods and auxiliary information, thereby increasing performance and reliability beyond any single technique.

2.1.5 Other methods

The traits discussed so far may not be detected well in some environments, either because the models are not accurate enough or because the special equipment required for correct recognition is unavailable, which has led to a number of less common recognition methods:

  • Ear-based biometrics
  • Identification based on smartphone apps
  • Biometrics based on heartbeat frequency
  • Gesture recognition based on the Internet environment
  • Recognition based on eye-movement patterns
  • Nose-based biometrics
  • Biometrics based on vein configuration

2.2 Gait recognition by deep learning method

Although traditional machine learning strategies for gait recognition achieved satisfactory results in the past, these methods are usually limited by manually extracted features and the inherent patterns of the training data, which restricts their recognition ability. Deep learning methods offer a good solution in this situation and show excellent performance when handling the temporal features of images or videos. The network architectures popular for gait recognition are briefly introduced below.

2.2.1 Convolutional Neural Networks (CNN)

Since convolutional neural networks gained popularity in the early 2010s, they have become a key technology for image processing problems. The basic module of a CNN is composed of convolutional neurons; the convolution kernel size is usually 3×3 or 5×5. The convolution operation outputs a new set of matrices, which is then fed to the subsequent layer of the model. A CNN can therefore be understood as a stack of convolution kernels in which each layer's output becomes the next layer's input.
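As a concrete illustration of this "stack of convolution kernels" view (an example written for these notes, with arbitrary layer sizes, not an architecture from the survey), a minimal PyTorch feature extractor for gait silhouettes could look like this:

```python
import torch
import torch.nn as nn

class GaitCNN(nn.Module):
    """Minimal CNN: each convolution's output becomes the next layer's input."""
    def __init__(self, num_classes=124):                    # e.g. 124 subjects, as in CASIA-B
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),      # 5x5 kernels on a 1-channel silhouette
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),     # 3x3 kernels
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # assumes 64x64 inputs

    def forward(self, x):
        x = self.features(x)                 # stacked convolutions produce successive layer inputs
        return self.classifier(x.flatten(1))

# usage: a batch of 8 binary silhouettes of size 64x64
logits = GaitCNN()(torch.randn(8, 1, 64, 64))
```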

2.2.2 Capsule Networks (CapsNet)

Although CNNs work well for image features, they easily confuse complex spatial relationships. A popular example: if a trained CNN finds a dog's body, face, tail, and other parts, it will happily report "dog" even when those parts are combined in the wrong order, whereas CapsNet recognizes the parts hierarchically. The model consists of two parts: the first is a convolutional encoder, which detects low-level features; the second is a fully connected decoder that uses the routing-by-agreement algorithm to route low-level features to the correct higher-level capsules. CapsNet is therefore more robust to object pose and better able to recognize multiple or overlapping objects in a scene.
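The routing-by-agreement step mentioned above can be sketched as follows. This is a simplified illustration in the spirit of the original CapsNet formulation, not code from the survey; `u_hat` holds the lower-level capsules' predictions for every higher-level capsule:

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Non-linearity that keeps capsule vector lengths in [0, 1)."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def routing_by_agreement(u_hat, num_iters=3):
    """u_hat: (batch, n_lower, n_upper, dim_upper) predictions from lower-level capsules."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)      # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                                 # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)                # weighted sum over lower capsules
        v = squash(s)                                           # (batch, n_upper, dim_upper)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)            # raise logits where predictions agree
    return v

# usage: 32 lower-level capsules predicting 10 higher-level 16-D capsules
v = routing_by_agreement(torch.randn(4, 32, 10, 16))
```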

2.2.3 Recurrent Neural Networks (RNN)

Much of the literature treats gait recognition as a problem over sequences of images of an individual's movement; recurrent networks process such sequences by computing each neuron's output from the current input together with the outputs of other neurons at previous steps. This architecture can also be combined with a CNN to extract more information from the input images for recognition.
Since describing a person's gait usually requires a considerable number of sequential features, a specific family of RNNs, gated RNNs, is particularly suitable for this task thanks to its ability to handle long sequences. In this context, two main architectures stand out: the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).

  • Long Short-Term Memory (LSTM)
    LSTM was first proposed in 1997 by Hochreiter and Schmidhuber to improve results on long sequences. It works like a traditional RNN in that the output of a given neuron depends recursively on the results of previous steps; the main difference is that the LSTM unit has a more elaborate internal structure, built from three gates (the standard update equations are written out after this list):
    Forget gate: defines how much information should be retained. The previous hidden state and the current input are passed through a sigmoid function whose output lies between 0 and 1; the closer to 1, the more information is kept.
    Input gate: computes a new value with which to update the cell state. It considers two quantities: a sigmoid over the previous hidden state and the current input, which weighs their importance, and a tanh over the same values, which squashes a candidate value to between -1 and 1. The product of the two defines the contribution added to the cell state.
    Output gate: defines the output of the unit. The contributions of the forget and input gates are summed to update the cell state, which is passed through a tanh function; a sigmoid over the previous hidden state and the current input produces the gate value; multiplying the two gives the unit's output (the new hidden state).
  • Gated Recurrent Unit (GRU)
    The GRU is a recurrent architecture originally devised to improve neural machine translation. Like the LSTM, the GRU has internal gates that control the flow of information; the main difference is the number of gates in each model. The GRU merges the forget and input gates into a single update gate, adds a reset gate, and drops the separate cell state and output gate. Studies have shown that although the GRU uses fewer gates than the LSTM, it can achieve similarly accurate results, with the advantage of a lower computational burden and faster execution.
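For reference, the standard LSTM update equations behind the gate descriptions above are the following (the textbook formulation, not notation taken from the survey itself), where $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, $x_t$ the input, $h_t$ the hidden state, and $c_t$ the cell state:

$$
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i), \qquad \tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c) && \text{(input gate, candidate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t) && \text{(output gate, hidden state)}
\end{aligned}
$$

The GRU replaces the three gates with an update gate and a reset gate and merges the cell state into the hidden state, which is what makes it cheaper to compute.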

2.2.4 Auto Encoders (AE)

An autoencoder is a generative neural network commonly used for data restoration and image denoising. The model consists of two important parts:
Encoder: responsible for encoding the input information into a (usually smaller) feature space.
Decoder: performs unsupervised reconstruction of the encoded data.
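A minimal illustrative encoder/decoder pair in PyTorch, assuming flattened 64×64 silhouettes as input (the layer sizes are arbitrary choices, not taken from the survey):

```python
import torch
import torch.nn as nn

class GaitAutoencoder(nn.Module):
    """Encoder compresses the input into a small feature space; decoder reconstructs it."""
    def __init__(self, input_dim=64 * 64, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)              # compact gait representation
        return self.decoder(z), z

# training objective: unsupervised reconstruction (e.g. MSE between input and output)
model = GaitAutoencoder()
x = torch.rand(8, 64 * 64)
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)
```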

2.2.5 Deep Belief Networks (DBN)

A Deep Belief Network is a stochastic neural network ideally suited to generative tasks. It is a hybrid generative model composed of Restricted Boltzmann Machines (RBM) and a sigmoid belief network (SBN).
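To make the RBM building block concrete, here is a minimal binary RBM trained with one step of contrastive divergence (CD-1). This is a generic illustration written for these notes, not the survey's implementation:

```python
import torch

class RBM:
    """Binary restricted Boltzmann machine, the building block stacked to form a DBN."""
    def __init__(self, n_visible, n_hidden, lr=0.01):
        self.W = torch.randn(n_visible, n_hidden) * 0.01
        self.b_v = torch.zeros(n_visible)
        self.b_h = torch.zeros(n_hidden)
        self.lr = lr

    def sample_h(self, v):
        p = torch.sigmoid(v @ self.W + self.b_h)
        return p, torch.bernoulli(p)

    def sample_v(self, h):
        p = torch.sigmoid(h @ self.W.t() + self.b_v)
        return p, torch.bernoulli(p)

    def cd1_step(self, v0):
        """One contrastive-divergence update from a batch of binary visible vectors v0."""
        p_h0, h0 = self.sample_h(v0)
        p_v1, v1 = self.sample_v(h0)
        p_h1, _ = self.sample_h(v1)
        batch = v0.shape[0]
        self.W += self.lr * (v0.t() @ p_h0 - v1.t() @ p_h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(dim=0)
        self.b_h += self.lr * (p_h0 - p_h1).mean(dim=0)

# usage: train one RBM on binarised 64x64 silhouettes, then stack RBMs to form the DBN
rbm = RBM(n_visible=64 * 64, n_hidden=256)
rbm.cd1_step(torch.bernoulli(torch.rand(16, 64 * 64)))
```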

2.2.6 Generative Adversarial Networks (GAN)

Generative adversarial networks have become popular in the past few years due to their remarkable ability to generate realistic synthetic images. The model consists of two different networks:
Generator: learns the distribution of the data and generates synthetic samples.
Discriminator: decides whether a particular instance is real or was produced by the generator.
The two compete in an adversarial manner: the generator tries to produce samples realistic enough to fool the discriminator, while the discriminator keeps improving at recognizing generated samples.
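The adversarial training described above can be sketched as the following alternating loop. This is a generic GAN training skeleton (not the survey's code); it assumes `G` maps a latent vector to a synthetic gait image and `D` ends in a sigmoid that outputs one probability per sample:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, latent_dim=100):
    """One alternating update: train the discriminator, then the generator."""
    batch = real.size(0)
    z = torch.randn(batch, latent_dim)

    # Discriminator: push real samples towards label 1 and generated samples towards label 0.
    fake = G(z).detach()                      # detach so generator gradients are not computed here
    loss_D = F.binary_cross_entropy(D(real), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: try to make the discriminator label its samples as real.
    loss_G = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```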

2.2.7 Summary

The following table summarizes the main tasks of the above methods in the field of gait recognition:

| Technique | Task |
| --- | --- |
| Convolutional Neural Network (CNN) | Extracts features from images or video frames |
| Recurrent Neural Network (RNN) | Including LSTM and GRU; several gate units control the flow of information, used for temporal problems |
| Autoencoder (AE) | Achieves recognition by compressing and then reconstructing the input features |
| Capsule Network (CapsNet) | Improves the semantic quality of the features output by a CNN |
| Deep Belief Network (DBN) | Compresses the encoded feature images |
| Generative Adversarial Network (GAN) | Trained by repeatedly discriminating between data produced by a model (e.g., a CNN) and generated data |

2.3 Gait recognition

So far, several biometric identification methods have been proposed, and although the reliability and security of these methods have been proven in banking and public governance systems, there are still two main obstacles:

  1. They depend on active cooperation from the subject (the information cannot be captured passively)
  2. They rely on professional equipment

Fortunately, models that process gait information can effectively deal with both obstacles, because in the vast majority of cases the biometric data is acquired with an ordinary camera or a non-invasive sensor, which allows passive collection (where this is legally permitted).
In addition, the current gait recognition technology can be divided into two categories: template-based and non-template:

  • Template-based methods
    aim to capture torso or leg motion, which can be segmented by canonical correlation analysis, joint sparse models, and group lasso of motions. Commonly used templates include the Walking Path Image (WPI), the Gait Information Image (GII), and the Gait Energy Image (GEI).
  • Non-template-based methods
    This approach considers shapes and their properties to be closely related, and therefore performs recognition by measuring the shape.

3. Gait recognition based on deep learning

3.1 Convolutional Neural Network

Convolutional neural networks are based on the concept of neurons in the mammalian visual cortex. They were originally applied to digit classification tasks and have been widely used in classification, reconstruction and object detection at this stage.
Many experts and scholars have proposed gait recognition models based on convolutional neural networks, and this blog will not list them one by one (if necessary, please refer to the original text of the paper).
One of these works, for example, compared three types of input-data arrangements (see the sketch after this list):

  • Local Bottom: the two input sequences are combined at the bottom of the network, which then directly judges whether they come from the same person (or from different people).
  • Mid-level Top: the network first extracts features from each of the two inputs and only then determines whether they belong to the same person.
  • Global Top: similar to the previous arrangement, but with an additional block of convolutions and perceptrons, so that the combination of features happens in the penultimate layer.
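As an illustration of the mid-level arrangement (a sketch written for these notes, not code from the compared works; the layer and feature sizes are arbitrary): a shared CNN first extracts features from each input, and only then are the features combined to decide whether the two inputs show the same person.

```python
import torch
import torch.nn as nn

class MidLevelSiamese(nn.Module):
    """Shared CNN extracts features per input; combined features feed a same/different head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(2 * 32 * 16 * 16, 128), nn.ReLU(),
                                  nn.Linear(128, 1))      # logit for "same person"

    def forward(self, a, b):
        fa, fb = self.backbone(a), self.backbone(b)       # mid-level features per input
        return self.head(torch.cat([fa, fb], dim=1))      # combination happens after feature extraction

# usage: two batches of 64x64 silhouettes
score = MidLevelSiamese()(torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64))
```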

3.2 Capsule Network

Another well-known type of deep architecture for gait recognition is the capsule neural network. The network was developed for image classification by modeling the hierarchical relationship between objects (i.e. capsules) in a scene.

3.3~3.6 Recurrent Neural Networks, Autoencoders, Deep Belief Networks, Generative Adversarial Networks

Deep belief networks are stochastic neural networks that use restricted Boltzmann machines as building blocks. This model has become very popular due to its ability to multitask.

3.7 Summary

3.7.1 Solutions

In general, convolutional neural networks are the most popular choice for deep-learning-based gait recognition, especially for image/video-based problems, because CNNs achieve excellent results in a wide variety of applications and have delivered strong accuracy in recent years. It is worth mentioning that the other architectures also make valuable contributions to gait recognition and perform better on some specific tasks.
For example, capsule networks can extract partial gait representations in a hierarchical manner, providing better results when there is overlap in the scene; recurrent neural networks can handle continuous data (such as videos) well.
While most deep-learning-based gait recognition methods work in the image/video domain, other data sources (e.g., accelerometers, gyroscopes, and other sensors) have also produced impressive results in quite a few papers. Most of these papers rely on unsupervised deep learning methods such as autoencoders and deep belief networks, which are often more expressive for these data types: they can capture the distribution of the data and project it, usually into a lower-dimensional space, to provide more representative features for gait recognition.
Finally, Generative Adversarial Networks describe a special case where gait systems can learn a wider range of features (e.g. orientation, clothing, number of individuals in a scene, etc.) because they can generate synthetic data for training the model. Furthermore, GANs can also be used to evaluate deception in gait-based systems.

3.7.2 Representation method

Regarding the most common way of representing gait image data, the gait energy image (GEI) condenses a gait cycle into a single image by taking a weighted average of the aligned binary silhouettes of the walking sequence. GEI therefore preserves both the static and the dynamic characteristics of human walking while greatly reducing the computational cost of image processing. A closer analysis of the method reveals several characteristics (a minimal computation sketch follows the list below):

  • GEI is insensitive to silhouette noise in individual frames.
  • It focuses on a concrete representation of human walking without smoothing away the background of the silhouette images.
  • It represents human motion in a single image while preserving temporal information.
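Computing a GEI boils down to averaging the aligned binary silhouettes of one gait cycle. Below is a minimal NumPy sketch (an illustration, not the survey's code), assuming the silhouettes have already been size-normalised and centred:

```python
import numpy as np

def gait_energy_image(silhouettes):
    """silhouettes: array of shape (T, H, W) with binary values {0, 1} covering one gait cycle."""
    sils = np.asarray(silhouettes, dtype=np.float32)
    # each pixel of the GEI is the fraction of the cycle during which that pixel is foreground
    return sils.mean(axis=0)

# usage: 30 frames of 64x64 binary silhouettes -> one 64x64 grey-level GEI
gei = gait_energy_image(np.random.randint(0, 2, size=(30, 64, 64)))
```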

Likewise, cross-view gait recognition is another popular way of handling different viewing angles. This type of input requires multiple cameras and different environments, which limits its applicability in real-world scenarios. Furthermore, the gait features are visually normalized before any combination, which enables the model to learn the relationships between the viewpoints of the motion in the scene.

The main challenge for deep-learning-based gait recognition is the complexity of gait data, which stems from the interaction of many factors (e.g., occlusion, camera viewpoint, individual appearance, sequence order, the movement of body parts, or the light sources present in the data).
Related fields (e.g., face recognition, emotion recognition, and pose estimation) have explored representations that disentangle such explanatory factors. However, most deep-learning-based gait recognition methods have not explored this direction, which makes it difficult to cleanly separate the underlying structure of gait data into independent factors of variation. Despite recent progress in handling cluttered backgrounds in some gait recognition methods, there is still plenty of room for improvement.

The following table lists, for each method introduced in this section, the paper, publication year, base model, input type, dataset, and evaluation metric and result:

| Ref. | Year | Model | Input Type | Dataset | Measure | Result |
| --- | --- | --- | --- | --- | --- | --- |
| GEINET: View-invariant gait recognition using a convolutional neural network | 2016 | CNN | GEI | OU-ISIR | Identification rate | 94.6% |
| A comprehensive study on cross-view gait based human identification with deep CNNs | 2017 | CNN | Cross-view | CASIA-B | Accuracy | 90.8% |
| DeepGait: A learning deep convolutional representation for view-invariant gait recognition using joint Bayesian | 2017 | CNN + Joint Bayesian | Sensors | OU-ISIR | Identification rate | 97.6% |
| Gait recognition based on convolutional neural networks | 2017 | CNN | Optical flow | TUM-GAID and CASIA-B | Accuracy | 97.52% |
| On input/output architectures for convolutional neural network-based cross-view gait recognition | 2017 | CNN + Siamese networks | Cross-view | OU-ISIR | Accuracy | 98.8% |
| Pose-based deep gait recognition | 2017 | CNN + Nearest Neighbor | Optical flow | TUM-GAID, CASIA-B, and OU-ISIR | Identification rate | 99.8% |
| Invariant feature extraction for gait recognition using only one uniform model | 2017 | Autoencoders + PCA | GEI | CASIA-B and SZU RGB-D | Identification rate | 97.58% |
| Deep learning based gait recognition using smartphones in the wild | 2018 | CNN + LSTM | Accelerometer and Gyroscope | whuGAIT and OU-ISIR | Accuracy | 99.75% |
| Multi-task GANs for view-specific feature learning in gait recognition | 2018 | GAN | PEI | OU-ISIR, CASIA-B and USF | Accuracy | 94.7% |
| Artificial neural networks classification of patients with Parkinsonism based on gait | 2018 | DBN | Sensors | Private datasets | Accuracy | 93% |
| Gait recognition based on capsule network | 2019 | Capsule | LBC and MMF | OU-ISIR | Accuracy | 74.4% |
| Nonstandard periodic gait energy image for gait recognition and data augmentation | 2019 | CNN | GEI + Data Augmentation | CASIA-B | Accuracy | 98% |
| Gait recognition via disentangled representation learning | 2019 | Autoencoders + LSTM | Cross-view and Frontal-View Gait (FVG) | CASIA-B, USF and FVG | Accuracy | 99.1% |
| Person identification from partial gait cycle using fully convolutional neural networks | 2019 | Autoencoders + PCA | GEI | OU-ISIR and CASIA-B | Accuracy | 96.15% |
| Deep learning based gait abnormality detection using wearable sensor system | 2019 | LSTM | Sensors | Private datasets | Prediction error | 0.02 |
| Attacking gait recognition systems via silhouette guided GANs | 2019 | GAN | GEI | CASIA-A and CASIA-B | Recognition result | 82% |
| Human gait recognition based on frame-by-frame gait energy images and convolutional long short-term memory | 2020 | LSTM | GEI | CASIA-B and OU-ISIR | Recognition result | 99.1% |
| Cross-view gait recognition using pairwise spatial transformer networks | 2020 | CNN | Cross-view | OU-MVLP, OU-LP, and CASIA-B | Identification rate | 98.93% |
| Robust cross-view gait recognition with evidence: A discriminant gait GAN (DiGGAN) approach | 2020 | GAN | Cross-view | OU-MVLP and CASIA-B | Identification rate | 93.2% |
| Gait recognition using multiscale partial representation transformation with capsules | 2020 | Capsule | Multi-scale representations | CASIA-B and OU-MVLP | Identification rate | 84.5% |
| Continuous human gait tracking using sEMG signals | 2020 | DBN | Sensors | Private datasets | RMSE | 2.61 |
| Multi-model long short-term memory network for gait recognition using window-based data segment | 2021 | LSTM | IMU | whuGAIT and OU-ISIR | Accuracy | 94.15% |
| Associated spatio-temporal capsule network for gait recognition | 2021 | Capsule | Sensors | Several | Accuracy | 99.69% |

The table in the original paper contains a few errors, all of which have been corrected in this blog post.
(row, column) below refers to the position in the table above.

| Position (row, column) | Before correction | After correction |
| --- | --- | --- |
| (3, 5), (5, 5), (7, 5), (8, 5), (10, 5), (13, 5), (14, 5), (15, 5), (17, 5), (18, 5), (19, 5), (20, 5) | CASIA B | CASIA-B |
| (17, 5) | CASIA A | CASIA-A |
| (14, 3) | Autoencoders + LSTN | Autoencoders + LSTM |
| (14, 4) | Cross- and Frontal-view | Cross-view and Frontal-View Gait (FVG) |

Because the authors used the hyphenated form "CASIA-B" at (21, 5), and the hyphenated spelling (apparently) is the more common one, all occurrences were unified to the hyphenated form.

4. Datasets

The training and evaluation of a machine learning model, whatever the paradigm (supervised, unsupervised, or any other), depend on a dataset that covers the subject of the task. Moreover, using a particular dataset makes it possible to judge how effective a method is at solving a specific problem and makes comparisons easier.
For gait recognition, given concerns such as public and personal privacy, there are two prominent difficulties in acquiring and creating datasets:

  1. Gait biometrics require a reasonable number of motion recordings per subject, which means several videos must be recorded and generated for every person; these videos are usually inherently high-dimensional (inflating the dataset size) and therefore require large storage capacity.
  2. Extracting and publicly releasing biometric data requires the consent of every participant.

4.1 CMU MoBo

The CMU MoBo database contains a relatively small amount of data: several videos for gait recognition captured from 20 individuals. The dataset also provides a set of silhouette masks and bounding boxes, which eases the segmentation step.
The CMU MoBo dataset can be downloaded for free, without reserving or signing any agreement; it only requires connecting to the Calgary University file transfer protocol (FTP) server.

4.2 TUM GAID

The TUM GAID (Gait from Audio, Image, and Depth) dataset combines RGB images, audio, and depth. It was initially recorded from 305 people under three different variations; later, 32 of them were recorded again, bringing the dataset to a total of 3,370 recordings.
A strong advantage of TUM GAID is its well-defined evaluation protocol. Access to the dataset requires signing a request form.
An example of the gait data is shown on the dataset's website.

4.3 HID-UMD

The HID (Human Identification at a Distance)-UMD dataset contains videos of people walking, captured from four different viewpoints, together with the corresponding binary masks for foreground segmentation. Its main purpose is to help researchers develop new gait- and face-based biometric recognition methods. The dataset is in fact a collection of two datasets:

  • Dataset 1: 25 people walking in four different configurations:
    frontal view / walking towards the camera;
    frontal view / walking away from the camera;
    fronto-parallel view / walking to the left;
    fronto-parallel view / walking to the right.

  • Dataset 2: 55 people walking through a T-shaped corridor.
    The sequences were captured by two cameras whose lines of sight are orthogonal to each other.

More details about this dataset can be found on the HID-UMD Dataset website, where access can also be requested through the application page.

4.4 CASIA

The Institute of Automation of the Chinese Academy of Sciences provides the CASIA gait database, which consists of four datasets mainly used for gait recognition, described as follows:

  • CASIA-A: created in December 2001 and formerly known as the NLPR gait database, it includes 20 individuals with 12 videos each (4 videos for each of three directions: parallel to the image plane, 45 degrees, and 90 degrees). The duration of each image sequence varies with the person's walking speed. The total size of the dataset is about 2.2 GB. (Figure: example frames from different viewpoints of CASIA-A.)
  • CASIA-B: created in 2005, it includes 124 people captured from 11 different angles. Each sequence was repeated three times (with variations in clothing, walking speed, etc.). The dataset also provides silhouettes for all sequences, which can be used for foreground segmentation. (Figures: examples under different angles and different clothing conditions.)
  • CASIA-C: created in 2005, it contains 153 subjects captured by an infrared (thermal) camera. All images were taken at night, with four different variations: normal walking, slow walking, fast walking, and normal walking with a backpack. (Figure: example CASIA-C frames.)
  • CASIA-D: contains images and accumulated foot-pressure information: 3,496 gait posture images and 2,658 accumulated foot-pressure records.

The CASIA website provides descriptions of the datasets and the application address.

4.5 OU-ISIR Biometric

The OU (Osaka University)-ISIR (Institute of Scientific and Industrial Research) database has been the world's largest gait recognition database since 2007 and comprises 8 datasets:

  • Treadmill Dataset

This dataset consists of sequences of people walking on an electric treadmill, surrounded by 25 cameras recording at 60 frames per second with a resolution of 640×480. It contains four subsets:

  1. A - Speed variation: 34 subjects viewed laterally, with the speed varying between 2 and 10 km/h in steps of 1 km/h.
  2. B - Clothing variation: 68 subjects viewed from the side, with 32 clothing variations.
  3. C - View variation: 168 people, aged from 4 to 75, captured from 25 different viewpoints.
  4. D - Gait fluctuation: gait silhouette sequences of 185 subjects observed from a lateral view, subdivided by high and low speed fluctuation into two groups of 100 subjects each (15 subjects appear in both).
  • Large Population Dataset (OU-LP)

Built in 2009 through outreach events, the large population dataset consists of 4,016 subjects, each captured twice from four camera angles at 30 FPS and a resolution of 640×480 pixels.

  • Speed Transition Dataset

This dataset includes two subsets:

  1. Dataset A: contains 179 scenes of people walking at a constant speed of 4 km/h on a treadmill or on the ground. The background has been removed from this subset.
  2. Dataset B: includes sequences of 25 people walking on a treadmill at speeds varying between 1 and 5 km/h. Each person was captured twice, accelerating and decelerating within three seconds, and the middle one-second sequence was extracted from each of the two segments.
  • Multi-view Large Population Dataset (OU-MVLP)

This dataset consists of 10,307 subjects (5,114 male and 5,193 female) aged between 2 and 87, and was developed for cross-view gait recognition methods. The images were captured from 14 different angles at 25 frames per second with a resolution of 1,280 × 980. The capture devices were placed at a lateral distance of 8 meters and a height of 5 meters.

  • Large Population Dataset with Bag (OU-Bag)

This dataset focuses on gait recognition of people carrying objects: the goal is to rely not only on the biometric information but also to identify where on the body the carried item (if any) is located. It includes 62,528 people aged between 2 and 95, captured by a camera at a distance of about 8 meters and a height of 5 meters. The sequences were recorded at 25 frames per second with a resolution of 1,280 × 980 pixels. Each person was captured three times: the first capture, A1, with or without a carried object, and the second and third without carrying anything. For the carrying case, four body regions are labeled: lower side, upper side, front side, and back side. All videos come with binary masks for background removal.

  • Large Population Dataset with Age (OU-Age)

The age-based large population dataset was created to study gait recognition with respect to people's age and gender. It includes 62,846 people walking along a predefined path, captured by a camera at 30 frames per second and a resolution of 640×480 pixels. The subjects are aged between 2 and 90, and all videos come with binary masks for background removal.

  • Inertial Sensor Dataset

The inertial sensor dataset is designated for studying and evaluating person identification methods based on motion sensors and accelerometers. It is the largest inertial-sensor-based gait database, consisting of data collected from 744 subjects (389 male and 355 female) aged between 2 and 78.

  • Similar Actions Inertial Dataset

The similar actions inertial dataset includes 460 participants aged between 8 and 78 with a roughly equal gender distribution. The dataset also covers six different floor conditions: invalid, flat, stair ascent, stair descent, slope ascent, and slope descent.

4.6 USF

The USF (University of South Florida) dataset includes 1,870 sequences from 122 subjects wearing two different types of shoes. The dataset also considers individuals with or without a briefcase, different surface conditions (grass and concrete), and different camera viewpoints (left/right). The videos were captured outdoors at two different points in time.

4.7 SOTON

The SOTON (Southampton Human ID at a Distance) database was created at the University of Southampton and consists of three main parts:

  • SOTON Small database: 12 subjects wearing different shoes and clothes, carrying or not carrying bags, walking at different speeds on an indoor track.
  • SOTON Large database: 114 subjects walking outdoors, on an indoor laboratory track, and on an indoor treadmill. The images were captured from six different viewpoints, for a total of more than 5,000 sequences.
  • SOTON Temporal: collected with a multi-biometric tunnel containing 12 synchronized cameras that capture people's gait over a period of time. It features dynamic environments, including different backgrounds, illumination, walking surfaces, and camera positions, and includes 25 subjects (17 male and 8 female) aged from 20 to 55.

Note that the subjects were filmed barefoot.

4.8 AVAMVG

The AVAMVG (AVA Multi-view Dataset for Gait Recognition) dataset was designed specifically for 3D-based gait recognition algorithms. It contains gait images of 20 actors walking along different trajectories. The sequences were captured with cameras calibrated specifically for this task and then post-processed with a 3D reconstruction algorithm. In addition, each sequence comes with its binary silhouettes for segmentation. In total, the database contains 200 six-channel multi-view videos, which can also be used as 1,200 single-view videos (6 × 200).

4.9 KY4D

The KY4D (Kyushu University 4D Gait Database) dataset consists of sequential 3D models and image sequences of 42 subjects walking along four straight and two curved paths. The videos were recorded by 16 cameras at a resolution of 1,032 × 776 pixels and are divided into three subsets:

  • Dataset A (straight paths): sequential 3D models and image sequences of people walking along straight lines.

  • Dataset B (curved paths): image sequences of people walking along curved trajectories.

  • KY infrared (IR) shadow gait database: time-series shadow images of 54 subjects.

4.10 WhuGAIT

Wuhan University released the whuGAIT dataset in 2018, together with public source code and pre-trained models. Unlike other datasets, whuGAIT consists of 3D accelerometer and 3-axis gyroscope data collected from 118 people, 20 of whom were recorded over three days and 98 over a single day. Depending on the intended task, the dataset is divided into six subsets.

  • Dataset #1: 33,104 samples from 118 people for training and 3,740 for testing, segmented into two-step windows.
  • Dataset #2: similar to Dataset #1, a two-step-segmented subset with 49,275 training samples and 4,936 test samples, extracted from the three-day recordings of the 20 people.
  • Dataset #3: this subset is split into fixed time windows of 2.56 seconds per sample. It includes 26,283 training instances and 2,991 test instances.
  • Dataset #4: like Dataset #3, split into 2.56-second windows, but built from the three-day recordings of the 20 people. It includes 35,373 training samples and 3,941 test samples.
  • Dataset #5: this subset is used for verification (authentication). It consists of 74,142 instances from 118 people; the data of 98 people are used for training and the remaining 20 for verification. The authentication protocol pairs samples drawn from one subject or from two different subjects, and each instance contains two steps of accelerometer and gyroscope data.
  • Dataset #6: this subset has the same structure as Dataset #5, but the samples are aligned vertically instead of horizontally.

4.11 Summary

The original post contains a table indicating, for each dataset (CMU MoBo, TUM GAID, HID-UMD, CASIA, OU-ISIR, USF, SOTON, AVAMVG, KY4D, WhuGAIT), whether it covers the following aspects:

  • Viewpoint (fixed viewpoint)
  • Pace
  • Object (carried objects)
  • Shoe
  • Clothing
  • Time
  • Surface (the blogger was not sure what this refers to)
  • Silhouette (gait silhouettes)
  • Gait fluctuation
  • Treadmill walking
  • Overground walking
  • Foot pressure

Having read a few of the most recent results in gait recognition (admittedly not many papers), the blogger's impression is that the CASIA and OU-ISIR datasets see the most use and that their evaluation metrics enjoy the widest recognition in the community. There are also private datasets built by some authors; in a paper, those cannot be used on their own for performance evaluation and must be combined with public datasets. The remaining public datasets are not used very often, so beginners probably only need a passing familiarity with them.

5. Summary and Outlook

Regarding datasets built from video, two points deserve attention:

  • In the sequences used for testing or training, exactly one person appears at a time.
  • The background (everything other than the person) stays unchanged, which implies a strong assumption of an unchanging background environment.

Consequently, gait recognition models built on such datasets are prone to errors when applied to the real world (e.g., detecting multiple pedestrians walking on a public street).
Regarding future work, the authors propose the following four directions:

  1. Attention mechanisms: this approach has not yet been studied in depth in the gait recognition field. [Attention is all you need]
  2. Gender and age recognition: recognition of gender and age is addressed in [A deep learning approach on gender and age recognition using a single inertial sensor] and is gradually receiving more attention.
  3. Hazardous environment monitoring: Gait recognition is very similar to hazardous environment monitoring, which is covered in the paper [A novel siamese-based approach for scene change detection with applications to obstructed routes in hazardous environments] .
  4. Recognition of multi-person scenes: Most gait recognition work focuses on a single individual in a scene in a controlled environment, but real-life problems often require robust solutions to uncontrolled environments with multiple people in the scene.

In addition, the authors observe an increasing need for gait recognition of individuals wearing different clothes or carrying objects, as well as for hybrid approaches that combine gait with supplementary biometric features such as faces, ears, etc.

This review of the gait recognition field is mainly intended to give a first understanding of the research topic, the basic methods involved, the datasets, and the latest research results.
Research content: it conveys an intuitive sense of what "gait recognition" does.
Basic methods: the method families are introduced in reasonable detail, but their specific working principles have to be looked up elsewhere (after all, a survey will not walk through every workflow one by one); there is still plenty of new material to learn.
Datasets: (as of the paper's release date) the coverage of public datasets is very thorough and generally explains how to obtain them.
Latest research results: because results in gait recognition have been iterating quickly in recent years, the works cited in the paper may no longer be the state of the art, but their modeling ideas still have high reference value (the paper cites 140 references in total; go pick through them).
The blogger translated and read this article according to his own understanding; if there are any mistakes, please point them out.

Source: blog.csdn.net/weixin_45074807/article/details/123432516