Paper translation: "Interspecies Knowledge Transfer for Facial Keypoint Detection"

Table of Contents

Paper link: Interspecies Knowledge Transfer for Facial Keypoint Detection

Code: https://github.com/menoRashid/animal_human_kp

Summary:

1. Introduction

2. Related work

3. Approach

4. Experiments

5. Conclusion


Authors and organizations:

          

Summary:

       We present a method for localizing facial keypoints on animals by transferring knowledge gained from human faces. Rather than directly finetuning a network trained to detect human facial keypoints on animal faces (which is sub-optimal, since animal and human faces can look very different), we propose to adapt the animal images to the pre-trained human keypoint detection model by correcting for the difference between animal and human face shape. First, we use an unsupervised shape-matching method to find the most similar human face images for each input animal image. We use these matches to train a warping network that warps each input animal face to look more human-like. The warping network is then jointly finetuned with a pre-trained human facial keypoint detection network on animal data. We present state-of-the-art results on horse and sheep facial keypoint detection, with significant improvement over simple finetuning, especially when training data is scarce. In addition, we present a new dataset of 3,717 horse face images annotated with facial keypoints.

1. Introduction

      Facial keypoint detection is a crucial prerequisite for face alignment and registration, and it informs facial expression analysis, face tracking, and graphics methods for face manipulation and morphing. While human facial keypoint detection is a relatively mature field of study, animal facial keypoint detection is relatively unexplored. For example, veterinary studies have shown that horses, mice, sheep, and cats display distinct facial expressions when in pain (so facial keypoint detection can help detect animal pain). In this paper, we focus on facial keypoint detection for horses and sheep. Convolutional neural networks (CNNs) perform well on human facial keypoint detection, so a CNN is a natural choice for animal keypoint detection. Unfortunately, training a neural network from scratch requires a large amount of labeled data, which is time-consuming and costly to collect. When training data is insufficient, a CNN can instead be finetuned from a pre-trained network. However, the generalization ability of a pre-trained network is limited by the amount of data available for finetuning and by the relatedness of the two tasks. For example, prior work has shown that a network trained on man-made objects has limited ability to adapt to natural objects, and that additional pre-training data is useful only when it is related to the target task. We have large amounts of annotated human facial keypoint data, but not enough annotated animal keypoint data to train a network from scratch. At the same time, because human and animal facial structures differ, direct finetuning may not produce good results. In this paper, we solve this problem (for keypoint detection) by warping between human and animal facial data. How can we achieve this with CNNs?
      Our main idea is to adapt the new dataset to the pre-trained network for better finetuning, rather than adapting the pre-trained network to the new dataset. By mapping the new data to match the pre-training data, we can take a human facial keypoint detection network and then finetune it to detect animal faces. Specifically, the idea is to warp each animal image to make it look more human-like, and then use the resulting warped images to finetune a network pre-trained to detect human facial keypoints.

                                                                  

      Intuitively, by making an animal's face look more human, we correct for the difference in shape up front, so that during finetuning the network only needs to adapt to the difference in appearance. For example, the distance between the corners of a horse's mouth is generally much smaller than the distance between its eyes, whereas for humans these distances are roughly similar (a shape difference). In addition, horses have fur and humans do not (an appearance difference). Our warping network will stretch the horse's mouth to correct the shape difference, and during finetuning the keypoint detection network will learn to adjust for the appearance difference.

Contributions:
1. A new method for animal facial keypoint detection that transfers knowledge from the loosely-related domain of human facial keypoint detection.
2. A new annotated horse facial keypoint dataset containing 3,717 images.
3. State-of-the-art results on horse and sheep facial keypoint detection. By warping animal images to look more human-like, we obtain significant improvements in keypoint detection accuracy over simple finetuning. Importantly, as the amount of training data shrinks, the gap between our approach and simple finetuning grows, which indicates the practical applicability of our approach to small datasets.
 

2. Related work

      Facial keypoint detection and alignment are well-studied problems in computer vision. Classical algorithms include .... (omitted here)
 

3. Approach

 
      Our goal is to detect animal facial keypoints in the absence of large annotated animal datasets. To this end, we propose to use a pre-trained human facial keypoint detector, while accounting for the differences between the two domains. For training, we assume access to animal facial keypoint annotations, human facial keypoint annotations, and the corresponding pre-trained human keypoint detector. For testing, we assume access to an animal face detector (i.e., we focus only on facial keypoint detection rather than face detection). Our approach consists of three main steps: for each animal face, find nearest-neighbor human faces with similar pose; use these nearest neighbors to train an animal-to-human warping network; and use the warped (human-like) animal images to finetune a keypoint detector pre-trained on human faces so that it detects animal facial keypoints.
 
3.1. Nearest neighbors with pose matching
 
      To make a (loosely-related) human facial keypoint detector amenable to finetuning on animals, our idea is to first warp animal faces into more human-like shapes, so that the pre-trained human detector can adapt to the animal data more easily. One challenge is that an arbitrary animal/human pair may exhibit very different poses (for example, a right-facing horse and a left-facing person), which makes warping extremely challenging, if not impossible. To alleviate this problem, we first find animal and human faces in similar poses.
      If we had pose classifiers or pose annotations for animal and human faces, we could simply use them to pair animal and human faces. In this work, however, we assume no access to pose classifiers or pose annotations. Instead, we find faces with similar poses given their keypoint annotations. More specifically, we compute the difference in angles between corresponding pairs of human and animal keypoints, and then select the most similar human faces for each animal instance.
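As a rough illustration, the pose-matching step can be sketched as follows. The pairwise-angle representation and the summed angular-difference score are our own simplifications, not the paper's exact formulation:

```python
import numpy as np

def keypoint_angles(kps):
    """Angles between every ordered pair of keypoints.

    kps: (N, 2) array of (x, y) keypoint coordinates.
    Returns a flat array of pairwise angles in radians; the
    representation is translation- and scale-invariant.
    """
    angles = []
    for i in range(len(kps)):
        for j in range(len(kps)):
            if i != j:
                dx, dy = kps[j] - kps[i]
                angles.append(np.arctan2(dy, dx))
    return np.array(angles)

def nearest_human_faces(animal_kps, human_kps_list, k=5):
    """Indices of the k human faces whose keypoint angles best match
    the animal face (smallest summed angular difference)."""
    a = keypoint_angles(animal_kps)
    scores = []
    for h in human_kps_list:
        d = np.abs(keypoint_angles(h) - a)
        d = np.minimum(d, 2 * np.pi - d)  # wrap-around angular distance
        scores.append(d.sum())
    return np.argsort(scores)[:k]
```

A face with the same keypoint layout at a different image location scores a distance of zero, which is the desired behavior for pose (rather than position) matching.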
                                                       
 
                                                      
 
3.2. Interspecies face warping network
 
      Now that we have each animal face and its most human-like matching faces, we use these matches to train an animal-to-human warping network. This network makes animal faces look more human, so finetuning the pre-trained human facial keypoint detection network on warped animal data is easier than finetuning it on the raw data directly.
     To this end, we train a CNN that takes an animal image as input and warps it via a thin plate spline (TPS) transformation [4]. Our warping network is a spatial transformer [19]; the main difference is that our warps are directly supervised, similar to [6]. Our network architecture resembles the localization network of [38]: it is identical to AlexNet [24] up to the fifth convolutional layer, followed by a 1×1 convolutional layer that halves the number of filters, two fully-connected layers, and batch normalization before the fifth layer. During training, the first five layers are pre-trained on ImageNet. We found that with this choice of layers/filters the network predicts TPS transformations without overfitting.
      For each animal/human training pair, we first compute the ground-truth TPS transformation from their corresponding keypoint pairs, and then apply that transformation to produce the ground-truth warped animal image. We then use the warping network to compute the predicted warped animal image. To train the network, we regress the difference between the pixel-position offsets of the ground-truth and predicted warped images, similar to [21]. Specifically, we train the network with a squared-difference loss on these offsets.
                       
 
     Our warping network requires no extra annotation to train, since we use only the animal/human keypoint annotations to find the matches (and these annotations are already available for training the corresponding keypoint detectors). In addition, since each animal instance has multiple (K = 5) human matches, the warping network is trained to treat multiple transformations as potentially correct. This is a form of data augmentation and helps make the network less sensitive to outlier matches.
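A minimal sketch of the warp supervision described above, assuming the warps are represented as dense per-pixel offset fields (the shapes here are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def warp_loss(pred_offsets, gt_offsets):
    """Mean squared difference between the predicted and ground-truth
    per-pixel offset fields, each of shape (H, W, 2): the x/y
    displacement that the TPS warp applies at every pixel."""
    return np.mean((pred_offsets - gt_offsets) ** 2)
```

With K = 5 human matches per animal image, the same input image appears in five training pairs with five different `gt_offsets` targets, which is the data-augmentation effect described above.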
 
 
3.3. Animal keypoint detection network
 
      The warping network described in the previous section makes animal data more similar to human face data, so we can leverage large annotated human facial keypoint datasets for animal keypoints. The final step is to finetune a human facial keypoint detection network to detect keypoints on our warped animal faces.
      Our keypoint detection network is a variant of the Vanilla CNN. The network has four convolutional layers and two fully-connected layers (with tanh activations), with max-pooling applied after the last three convolutional layers. We adapt it to larger images by adding a convolutional layer and max-pooling — we use 224×224 images rather than 40×40. In addition, we add a batch normalization layer after every layer, because the tanh units in the original network are prone to overfitting.
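A sketch of this kind of architecture in PyTorch. The filter counts and kernel sizes are our guesses for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """Vanilla-CNN-style keypoint detector adapted to 224x224 inputs:
    four conv layers with tanh activations, max-pooling after the last
    three convs, batch norm after every layer, and two fully-connected
    layers regressing (x, y) for each keypoint."""

    def __init__(self, num_keypoints=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.BatchNorm2d(16), nn.Tanh(),
            nn.Conv2d(16, 48, 3), nn.BatchNorm2d(48), nn.Tanh(), nn.MaxPool2d(2),
            nn.Conv2d(48, 64, 3), nn.BatchNorm2d(64), nn.Tanh(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 2), nn.BatchNorm2d(64), nn.Tanh(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.Tanh(),   # hidden size is an assumption
            nn.Linear(100, 2 * num_keypoints),
        )

    def forward(self, x):
        return self.fc(self.features(x))
```

The output is a flat vector of `2 * num_keypoints` coordinates, matching the five-keypoint annotation used throughout the paper.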
                                           
      Keypoint detection loss: smooth-L1.
 
                                            
      We set the loss for predicted keypoints that have no corresponding annotation (due to occlusion) to zero.
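A sketch of the masked smooth-L1 keypoint loss described above. The boolean-mask convention and the delta=1 Huber threshold are our assumptions:

```python
import numpy as np

def masked_smooth_l1(pred, gt, visible):
    """Smooth-L1 (Huber, delta=1) over keypoint coordinates, with the
    loss zeroed for keypoints that have no annotation (occluded).

    pred, gt: (N, 2) keypoint coordinates; visible: (N,) boolean mask.
    """
    diff = np.abs(pred - gt)
    # quadratic near zero, linear for large errors
    per_coord = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    per_kp = per_coord.sum(axis=1) * visible  # occluded keypoints -> 0
    return per_kp.sum() / max(visible.sum(), 1)
```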
 
3.4. Final architecture
      In our final model, we prepend the warping network to the pre-trained human facial keypoint detection network, and finetune them jointly with two losses. The keypoint detection loss L_keypoint is backpropagated through both the keypoint detection network and the warping network. In addition, the warping loss L_warp is backpropagated through the warping network, and the gradients are computed before the weights of both networks are updated.
      At test time, our keypoint detection network predicts all five facial keypoints on every image. In our experiments, predictions for keypoints that are not visible in an image are not penalized; we evaluate only the predicted keypoints that have corresponding ground-truth annotations. For evaluation, the keypoints predicted on the warped image are transformed back to the original image using the TPS warp parameters.
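The joint finetuning step can be sketched as follows; the tiny linear stand-ins for the warping and keypoint networks are purely illustrative, but the loss wiring mirrors the description above (L_keypoint flows through both networks, L_warp through the warping network only, with one weight update after both backward passes):

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the warping and keypoint networks.
warp_net = torch.nn.Linear(8, 8)
kp_net = torch.nn.Linear(8, 10)
opt = torch.optim.Adam(
    list(warp_net.parameters()) + list(kp_net.parameters()), lr=1e-3)

x = torch.randn(4, 8)           # animal "images"
gt_offsets = torch.randn(4, 8)  # ground-truth warp targets
gt_kps = torch.randn(4, 10)     # ground-truth keypoints

offsets = warp_net(x)           # predicted warp
kps = kp_net(offsets)           # keypoints on the warped input
l_warp = F.mse_loss(offsets, gt_offsets)  # supervises the warp net
l_kp = F.smooth_l1_loss(kps, gt_kps)      # backprops through both nets
opt.zero_grad()
(l_kp + l_warp).backward()      # gradients accumulated jointly
opt.step()                      # then both networks are updated
```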
 
3.5. Horse Facial Keypoint dataset
      As part of this work, we created a new horse dataset for training and evaluating facial keypoint detection algorithms. It contains 3,717 images in total: 3,531 for training and 186 for testing. We annotated each image with a face bounding box and five keypoints: left eye, right eye, nose, left mouth corner, and right mouth corner.
 

4. Experiments

      In this section, we analyze the accuracy of our keypoint detection model and study the contribution of each module. In addition, we evaluate the model on training sets of different sizes, and estimate an upper bound for the case where the warping network performs well.
 
Baselines:
      We compare against the algorithm proposed in [51], which uses triplet interpolated features (TIF) in a cascaded shape regression framework for animal keypoint detection. We also construct our own baselines. The first baseline is our full model without the warping network; it simply finetunes the pre-trained human facial keypoint network on the animal dataset ("BL FT"). The second baseline is our full model without the warping loss; that is, it finetunes the pre-trained human keypoint network and the warping network using only the keypoint detection loss. This baseline is equivalent to the spatial transformer setting proposed in [19]. We show this result with TPS warps ("BL TPS"). The third baseline trains the keypoint detection network from scratch, i.e., without any human facial keypoint pre-training and without the warping network ("Scratch").
 
Datasets:
      We pre-train the human facial keypoint detection network on the AFLW [23] dataset together with the training data used in [40] (31,524 images in total). This dataset is also used for animal-to-human nearest-neighbor retrieval. We evaluate keypoint detection on two animals: horses and sheep. For the horse experiments, we use our Horse Facial Keypoint dataset, which contains 3,531 training images and 186 test images. For the sheep experiments, we manually annotated a subset of the dataset provided by [51] so that the same five keypoints as in the human dataset are present. This dataset contains 432 training images and 99 test images.
 
Evaluation metric:

      We evaluate with the same metric as [51]: a prediction is counted as a failure if the Euclidean distance between the predicted and annotated keypoint is greater than 10% of the face (bounding box) size. We then compute the average failure rate as the percentage of test keypoints that fail.
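The metric is a direct reading of the definition above and can be sketched as:

```python
import numpy as np

def average_failure_rate(pred, gt, box_size, thresh=0.10):
    """Fraction of keypoints whose predicted-to-annotated Euclidean
    distance exceeds `thresh` times the face bounding-box size.

    pred, gt: (N, 2) arrays of keypoint coordinates.
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    return np.mean(dists > thresh * box_size)
```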

 
Training and implementation details:

      We found that pre-training the warping network before joint training yields better performance. To train the warping and keypoint networks, we use K = 5 human images for each animal image. These matches are also used in the "GT Warp" network described in Section 4.4.

      For the TPS warping network, we use a 5×5 grid of control points. We use the Adam [22] optimizer. The base learning rate for training the warping network is 0.001, with a 10× lower rate for the pre-trained layers. It is trained for 50 epochs, with the learning rate decreased by 10× after 25 epochs. During full-system training, the warping network keeps the same learning rates, while the keypoint detection network uses a learning rate of 0.01. We train the full network for 150 epochs, decreasing the learning rates after 50 and 100 epochs. Finally, we augment the data with horizontal flips and rotations from -10° to 10° in 5° increments.

 
4.1. Comparison with our baselines
      We first compare our full model against all of our baselines. Figure 5 shows the results on the horse and sheep datasets, respectively. On both datasets, our model performs best, with average keypoint failure rates of 8.36% and 0.87%, respectively.
  
                                                        
      
      Overall, the error rates for sheep are lower than for horses, because the pose distribution of the sheep face data is more human-like than that of the horse face data. Frontal poses (5 keypoints) account for 72% and 84% of the human and sheep images, respectively, whereas only 29% of horse faces are frontal; most horse faces are in profile (3 keypoints). Sheep faces are therefore better suited to a model pre-trained on human faces. Nevertheless, our method outperforms the alternatives on both datasets, which demonstrates that it generalizes across different datasets.
      These results also show the importance of each component of our system: finetuning a model pre-trained on human faces is better than training from scratch, and adding the warping network further improves performance.
             
                                    
 
 
                                                       
4.2. Comparison with Yang et al
      We next compare our method with the Triplet Interpolated Features (TIF) approach, the state-of-the-art animal keypoint detector, which requires all training data to be annotated. We selected subsets of horse and sheep images annotated with five keypoints: 345/100 (train/test) for sheep and 982/100 (train/test) for horses.
                               
 
                                                                
                                                                 
 
 
 
 
4.3. Effect of training data size
      In this section, we evaluate how network performance changes with the amount of training data. To this end, we train and test multiple versions of our model and the baselines, each time using between 500 and 3,531 training images from the Horse dataset, in increments of 500 images.
                                                                 
 
 
4.4. Effect of warping accuracy
      We next analyze the effect of the warping network on keypoint detection. To this end, we first measure the performance of a keypoint detection network finetuned on ground-truth warped images ("GT Warp"), produced by warping with the annotated keypoint correspondences between human and horse faces. In a sense, this represents an upper bound on our system's performance.
      The table below shows the results on our Horse dataset. First, the GT Warp upper bound yields a lower error rate than our method, which confirms the effectiveness of correcting shape differences by warping. At the same time, the error rate of GT Warp is non-negligible, which also hints at the limitations of our warping network's training data and our pose-matching strategy. Better training data, a different nearest-neighbor matching algorithm, or more annotated keypoints could lead to a better upper bound, and could likewise improve our method.
 
                                                        
 
4.5. Evaluation of Nearest Neighbors
      Finally, we evaluate the nearest-neighbor method used in our system. During training on the horse dataset, we vary the number of nearest neighbors K from 1 to 15 (in increments of 5). The results are shown in the figure below.
 
                                                                        
 

5. Conclusion

      We have presented a new method for localizing animal facial keypoints. Traditional deep learning typically requires large amounts of annotated data, and producing such datasets is time-consuming and laborious. Rather than building a large annotated animal dataset, we warp animal face shapes toward human face shapes. In this way, we can leverage existing human facial keypoint datasets for the animal facial keypoint detection task. We compared our method experimentally against several baselines and demonstrated state-of-the-art results on horse and sheep facial keypoint detection. Finally, we created the Horse Facial Keypoint dataset, which we hope will benefit the field of animal facial keypoint detection.

 
 


Origin blog.csdn.net/qq_42013574/article/details/104131484