GaitGAN: Invariant Gait Feature Extraction Using Generative Adversarial Networks (Paper Translation and Notes)

Copyright notice: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/xiongchengluo1129/article/details/83351155



2. Proposed method

To reduce the effect of variations, a GAN is employed as a regressor to generate invariant gait images, namely side-view gait images with normal clothing and without carried objects. Gait images at arbitrary views can be converted to side-view images, since the side-view data contains more dynamic information. While this is intuitively appealing, a key challenge that must be addressed is to preserve the human identification information in the generated gait images.


The GaitGAN model is trained on data from the training set to generate gait images with normal clothing and without carried objects at the side view. In the test phase, gait images are fed into the GAN model, and invariant gait images containing human identification information are generated. The difference between the proposed method and most other GAN-related methods is that the generated images here help to improve the discriminant capability, rather than merely looking realistic. The most challenging part of the proposed method is to preserve human identification while generating realistic gait images.

Note: the supervision information is reinforced in the loss function.

2.1. Gait energy image

The gait energy image [6] is a popular gait feature, which is produced by averaging the silhouettes over one gait cycle of a gait sequence, as illustrated in Figure 1. GEI is well known for its robustness to noise and its efficient computation. The pixel values in a GEI can be interpreted as the probability that the corresponding pixel positions are occupied by the human body over one gait cycle. Given the success of GEI in gait recognition, we take GEI as the input and target image of our method. The silhouettes and energy images used in the experiments are produced in the same way as those described in [22].

[Figure 1: gait energy images (GEI)]
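As a minimal sketch of this averaging step (assuming the silhouettes are already extracted, aligned, and cropped to a common size; names are illustrative):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average aligned binary silhouettes over one gait cycle to form a GEI.

    silhouettes: array of shape (T, H, W) with values in {0, 1},
                 one frame per time step of a single gait cycle.
    Returns an (H, W) image whose pixel values estimate the probability
    that each position is occupied by the body during the cycle.
    """
    silhouettes = np.asarray(silhouettes, dtype=np.float32)
    return silhouettes.mean(axis=0)
```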

2.2. Generative adversarial networks for pixel-level domain transfer

Generative adversarial networks (GAN) [4] are a branch of unsupervised machine learning, implemented as a system of two neural networks competing against each other in a zero-sum game framework. A generative model G captures the data distribution. A discriminative model D then takes either real data from the training set or a fake image generated by model G and estimates the probability that its input came from the training data set rather than from the generator. In a GAN for image data, the eventual goal of the generator is to map a low-dimensional space z to a pixel-level image space, so that the generator can produce a realistic image given an input random vector z. Both G and D can be non-linear mapping functions. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation.

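As an illustrative sketch of the two competing networks described above (the layer sizes and the use of simple multilayer perceptrons are assumptions for brevity, not the paper's architecture):

```python
import torch
import torch.nn as nn

Z_DIM, IMG_DIM = 100, 64 * 64  # illustrative sizes, not from the paper

# Generator G: maps a low-dimensional noise vector z to a pixel-level image.
G = nn.Sequential(
    nn.Linear(Z_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),
)

# Discriminator D: outputs the probability that its input is a real image.
D = nn.Sequential(
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(8, Z_DIM)   # batch of random input vectors
fake_images = G(z)          # G maps z into image space
p_real = D(fake_images)     # D estimates P(input came from real data)
```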

The input of the generative model can be an image instead of a noise vector. A GAN can then realize pixel-level domain transfer between an input image and a target image, as in PixelDTGAN proposed by Yoo et al. [20]. PixelDTGAN can transfer a visual input into a different form, which can then be visualized through the generated pixel-level image. In this way, it simulates the creation of mental images from visual scenes and objects perceived by the human eyes. In that work, the authors defined two domains, a source domain and a target domain, connected by a semantic meaning. For instance, the source domain is an image of a dressed person with variations in pose, and the target domain is an image of that person's shirt. PixelDTGAN can thus transfer an image from the source domain, a photo of a dressed person, to a pixel-level target image of the shirt. Meanwhile, the transferred image should look realistic while preserving the semantic meaning. The framework consists of three important parts, as illustrated in Figure 2. While the real/fake discriminator ensures that the generated images are realistic, the domain discriminator ensures that the generated images contain the semantic information.

Note: once again, the identity information acts as strong supervision. Here a single generator is followed by two branches, each branch being a discriminator.
[Figure 2: the PixelDTGAN framework]

The first important component is a pixel-level converter, which is composed of an encoder for semantic embedding of a source image and a decoder to produce a target image. The encoder and decoder are implemented by convolutional neural networks. However, training the converter is not straightforward because the target is not deterministic. Consequently, some strategy, such as a loss function on top of the converter, is needed to constrain the produced target image. Therefore, Yoo et al. connected a separate network, named the domain discriminator, on top of the converter. The domain discriminator takes a pair of a source image and a target image as input, and is trained to produce a scalar probability of whether the input pair is associated or not. The loss function $L_{D}^{A}$ in [20] for the domain discriminator $D_{A}$ is defined as:

$$L_{D}^{A}(I_{S}, I) = -t \cdot \log\left[D_{A}(I_{S}, I)\right] + (t-1) \cdot \log\left[1 - D_{A}(I_{S}, I)\right], \quad \text{s.t. } t = \begin{cases} 1 & \text{if } I = I_{T} \\ 0 & \text{if } I = \hat{I}_{T} \text{ or } I = I_{T}^{-} \end{cases}$$
where $I_{S}$ is the source image, $I_{T}$ is the ground-truth target, $I_{T}^{-}$ is an irrelevant target, and $\hat{I}_{T}$ is the image generated by the converter.
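In code, this pairwise loss is an instance of binary cross entropy; a hedged PyTorch sketch, with `D_A` as an assumed network that scores a (source, target) pair:

```python
import torch
import torch.nn.functional as F

def domain_discriminator_loss(D_A, I_S, I, t):
    """Binary cross entropy for the domain discriminator of [20].

    D_A : network scoring a (source, target) pair with a probability in (0, 1)
    I_S : batch of source images
    I   : batch of candidate targets (ground truth, generated, or irrelevant)
    t   : batch of labels, 1.0 if I is the associated ground-truth target,
          0.0 if I is generated by the converter or is an irrelevant target
    """
    p = D_A(I_S, I).squeeze()
    # -t*log(p) + (t-1)*log(1-p) is exactly the binary cross entropy
    return F.binary_cross_entropy(p, t)
```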

Another component is the real/fake discriminator, which is similar to that of a traditional GAN in that it is supervised by real/fake labels, so that the entire network produces realistic images. Here, the discriminator produces a scalar probability indicating whether the input image is real or not. The discriminator's loss function $L_{D}^{R}$, according to [20], takes the form of binary cross entropy:


$$L_{D}^{R}(I) = -t \cdot \log\left[D_{R}(I)\right] - (1-t) \cdot \log\left[1 - D_{R}(I)\right], \quad \text{s.t. } t = \begin{cases} 1 & \text{if } I \in \{I^{i}\} \\ 0 & \text{if } I \in \{\hat{I}^{i}\} \end{cases}$$
where $\{I^{i}\}$ contains real training images and $\{\hat{I}^{i}\}$ contains fake images produced by the generator. Labels are given to the two discriminators, and they supervise the converter to produce images that are realistic while keeping the semantic meaning.

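A minimal PyTorch sketch of this real/fake loss, together with one plausible converter objective in which the converter is trained to fool both discriminators (our reading of [20]; `D_R`, `D_A`, and the batches are assumed inputs):

```python
import torch
import torch.nn.functional as F

def real_fake_loss(D_R, real_batch, fake_batch):
    """Binary cross entropy: label 1 for real images, 0 for generated ones."""
    p_real = D_R(real_batch).squeeze()
    p_fake = D_R(fake_batch).squeeze()
    return (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
            F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))

def converter_loss(D_R, D_A, I_S, I_hat):
    """The converter is rewarded when its output I_hat fools both
    discriminators, i.e. is judged realistic and associated with the
    source (an assumed formulation for illustration)."""
    p_real = D_R(I_hat).squeeze()
    p_assoc = D_A(I_S, I_hat).squeeze()
    return (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
            F.binary_cross_entropy(p_assoc, torch.ones_like(p_assoc)))
```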
Question: are these three parts trained together, or is the generator trained first and the discriminators afterwards?

2.3. GaitGAN: GAN for gait recognition

Inspired by the pixel-level domain transfer in PixelDTGAN, we propose GaitGAN to transform gait data from any view, clothing, and carrying condition to an invariant view, namely the side view with normal clothing and without carried objects. Additionally, identification information is preserved. We set the GEIs at all viewpoints, with clothing and carrying variations, as the source, and the GEIs of normal walking at $90^{\circ}$ (side view) as the target, as shown in Figure 3. The converter contains an encoder and a decoder, as shown in Figure 4.

[Figure 3: source GEIs (all views, clothing and carrying variations) and target GEIs (normal walking at $90^{\circ}$)]

[Figure 4: structure of the converter (encoder and decoder)]
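A sketch of how such source/target training pairs could be assembled (the dictionary layout and condition naming are illustrative assumptions, not CASIA-B's on-disk format):

```python
import random

def build_training_pairs(geis):
    """Pair every source GEI with the same subject's side-view target GEI.

    geis: dict mapping (subject_id, condition, view) -> GEI array, where
          condition is e.g. 'nm-01'..'nm-06', 'bg-01', 'cl-01' and view is
          an angle in degrees (layout assumed for illustration).
    Returns (source, target, subject_id) triples: sources cover all views
    and conditions, targets are normal-walking GEIs at 90 degrees.
    """
    pairs = []
    for (sid, cond, view), gei in geis.items():
        # pick one of the subject's normal-walking side-view GEIs as target
        targets = [g for (s, c, v), g in geis.items()
                   if s == sid and c.startswith('nm') and v == 90]
        if targets:
            pairs.append((gei, random.choice(targets), sid))
    return pairs
```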
There are two discriminators. The first one is a real/fake discriminator, which is trained to predict whether an image is real. If the input GEI comes from real gait data at the $90^{\circ}$ view in normal walking, the discriminator will output 1. Otherwise, it will output 0. The structure of the real/fake discriminator is shown in Figure 5:

[Figure 5: structure of the real/fake discriminator]

With the real/fake discriminator alone, we can only generate side-view GEIs that look good; the identification information of the subjects may be lost. To preserve the identification information, another discriminator, named the identification discriminator, which is similar to the domain discriminator in [20], is introduced. The identification discriminator takes a source image and a target image as input, and is trained to produce a scalar probability of whether the input pair belongs to the same person. If the two input source images are from the same subject, the output should be 1. If they are source images belonging to two different subjects, the output should be 0. Likewise, if the input is a source image and the target is generated by the converter, the discriminator should output 0. The structure of the identification discriminator is shown in Figure 6.


[Figure 6: structure of the identification discriminator]
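The three input cases can be summarized in a small pair-sampling sketch (the data layout and the `converter` callable are assumptions for illustration; each subject is assumed to have at least two source GEIs):

```python
import random

def identification_pairs(geis_by_subject, converter):
    """Assemble labelled pairs for the identification discriminator.

    geis_by_subject: dict subject_id -> list of source GEIs (assumed layout).
    Yields ((image_a, image_b), label), with label 1 only for same-subject
    real pairs; different subjects and converter outputs get label 0.
    """
    subjects = list(geis_by_subject)
    for sid in subjects:
        a, b = random.sample(geis_by_subject[sid], 2)
        yield (a, b), 1.0                                      # same subject
        other = random.choice([s for s in subjects if s != sid])
        yield (a, random.choice(geis_by_subject[other])), 0.0  # different subject
        yield (a, converter(a)), 0.0                           # generated target
```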

3. Experiments and analysis

3.1. Dataset

The CASIA-B gait dataset [22] is one of the largest public gait databases. It was created by the Institute of Automation, Chinese Academy of Sciences in January 2005. It consists of 124 subjects (31 females and 93 males) captured from 11 views. The view range is from $0^{\circ}$ to $180^{\circ}$, with an $18^{\circ}$ interval between two nearest views. There are 10 sequences for each subject: 6 sequences of normal walking ("nm"), 2 sequences of walking with a bag ("bg"), and 2 sequences of walking in a coat ("cl").


3.2. Experimental design

In our experiments, all three types of gait data, "nm", "bg", and "cl", are involved. We put the six normal walking sequences, the two sequences with a coat, and the two sequences of walking with a bag of the first 62 subjects into the training set, and the remaining 62 subjects into the test set. In the test set, the first 4 normal walking sequences of each subject are put into the gallery set and the others into the probe sets, as shown in Table 1. There are four probe sets to evaluate different kinds of variations.


[Table 1: partition of the gallery and probe sets]
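A sketch of this partition protocol under an assumed dictionary layout (sequence names like `nm-01` are illustrative):

```python
def split_casia_b(geis):
    """Split CASIA-B GEIs into training, gallery, and probe sets.

    geis: dict (subject_id, seq_name, view) -> GEI, with subject_id 1..124
          and seq_name like 'nm-01'..'nm-06', 'bg-01', 'cl-01' (assumed).
    """
    train, gallery, probe = {}, {}, {}
    for key, gei in geis.items():
        sid, seq, view = key
        if sid <= 62:                                        # first 62 subjects
            train[key] = gei
        elif seq in ('nm-01', 'nm-02', 'nm-03', 'nm-04'):    # first 4 nm sequences
            gallery[key] = gei
        else:                                                # nm-05/06, bg, cl
            probe[key] = gei
    return train, gallery, probe
```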

3.3. Model parameters

In the experiments, we used a setup similar to that of [20], which is shown in Figure 4. The converter is a unified network that is end-to-end trainable, but we can divide it into two parts, an encoder and a decoder. The encoder part is composed of four convolutional layers that abstract the source into another space, which should capture the personal attributes of the source as well as possible. The resulting feature z is then fed into the decoder to construct the relevant target through four decoding layers. Each decoding layer conducts fractional-stride convolutions, in which the convolution operates in the opposite direction. The details of the encoder and decoder structures are shown in Tables 2 and 3. The structures of the real/fake discriminator and the identification discriminator are similar to the encoder's four convolutional layers. The layers of the discriminators are all convolutional layers.


[Table 2: structure of the encoder]

[Table 3: structure of the decoder]
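As a hedged sketch of the converter described above: a four-layer convolutional encoder followed by a four-layer fractional-stride (transposed convolution) decoder; the channel counts, kernel sizes, and $64 \times 64$ input size are assumptions, not the exact values in Tables 2 and 3:

```python
import torch
import torch.nn as nn

# Encoder: four stride-2 convolutions abstracting a 1x64x64 GEI into a feature z.
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # -> 32x32
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # -> 16x16
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # -> 8x8
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2), # -> 4x4
)

# Decoder: four fractional-stride (transposed) convolutions mapping z back
# to a side-view GEI of the same size as the input.
decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # -> 8x8
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # -> 16x16
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # -> 32x32
    nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # -> 64x64
)

gei = torch.rand(8, 1, 64, 64)   # batch of source GEIs
z = encoder(gei)                 # semantic embedding
target = decoder(z)              # generated side-view GEI
```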

Normally, to achieve good performance with deep-learning-related methods, a large number of training iterations is needed. From Figure 7, we can see that more iterations do lead to a higher recognition rate, but the rate peaks at around 450 epochs. So in our experiments, the training was stopped after 450 epochs.

[Figure 7: recognition rate versus training epochs]
