Joint Face Detection and Alignment Using Multi-task Cascaded Convolutional Networks

Authors: Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Senior Member, IEEE, and Yu Qiao, Senior Member, IEEE

Abstract—Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost up their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmark for face detection, and AFLW benchmark for face alignment, while keeps real time performance.

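Abstract: (Note: face alignment means that, given a face, you locate the feature points you need, such as the left side of the nose, the underside of the nostrils, the pupils, and the lower edge of the upper lip. If that is still unclear, see the figure below.)
[Figure: a face image with a red bounding box and white landmark points]
In the figure, the red box is the result of detection, and the white dots are the result of alignment.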
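To make the distinction concrete, here is a minimal sketch of what each task returns for one face: detection gives a box, alignment gives a set of points. The class name `FaceAnnotation`, the five landmark names, and all coordinates are hypothetical values chosen for illustration; this section of the paper does not specify a landmark set.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]


@dataclass
class FaceAnnotation:
    # Detection output: an axis-aligned box (x1, y1, x2, y2) in pixels.
    box: Tuple[float, float, float, float]
    # Alignment output: landmark name -> (x, y) position in pixels.
    landmarks: Dict[str, Point]

    def landmarks_inside_box(self) -> bool:
        """Check that every landmark lies within the detected box."""
        x1, y1, x2, y2 = self.box
        return all(
            x1 <= x <= x2 and y1 <= y <= y2
            for x, y in self.landmarks.values()
        )


# Hypothetical values for a single face (the "red box" and "white dots").
face = FaceAnnotation(
    box=(40.0, 30.0, 140.0, 150.0),
    landmarks={
        "left_eye": (70.0, 75.0),
        "right_eye": (110.0, 75.0),
        "nose": (90.0, 100.0),
        "mouth_left": (75.0, 125.0),
        "mouth_right": (105.0, 125.0),
    },
)
print(face.landmarks_inside_box())
```

Keeping the box and the landmarks in one record reflects the idea that both describe the same face, which is the correlation the paper sets out to exploit.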

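Face detection and alignment in unconstrained environments are very challenging because of the wide variety of poses, illumination conditions, and occlusions.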

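Most current face recognition datasets are collected in unconstrained settings, for example LFW. A constrained setting means a specific, controlled environment: a dataset of ID photos is constrained because every image is captured under the same conditions (similar background, similar lighting). An unconstrained setting is the opposite; the "wild" in LFW (Labeled Faces in the Wild) means precisely that the images are not restricted to any particular scene.
Author: 弓長知行
Link: https://www.jianshu.com/p/506b7ef10b40
Source: Jianshu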
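Recent studies show that deep learning methods perform very well on both face detection and face alignment. In this paper, we propose a deep cascaded multi-task framework that exploits the inherent correlation between the two tasks to improve performance on both. Specifically, the framework uses a cascade of three stages of carefully designed deep convolutional networks that predict face locations and landmark locations in a coarse-to-fine manner. In addition, for training we propose a new online hard sample mining strategy that improves performance automatically, without manual sample selection. Our method is more accurate than the state-of-the-art techniques on the FDDB and WIDER FACE benchmarks for face detection and on the AFLW benchmark for face alignment, while maintaining real-time performance.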

Index Terms—Face detection, face alignment, cascaded convolutional neural network (keywords)

I. INTRODUCTION
FACE detection and alignment are essential to many face
applications, such as face recognition and facial expression
analysis. However, the large visual variations of faces, such as
occlusions, large pose variations and extreme lightings, impose
great challenges for these tasks in real world applications.

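Face detection and alignment are essential to many face-based applications, such as face recognition and facial expression analysis. However, large visual variations of faces, such as occlusions, large pose variations, and extreme lighting, make these tasks very difficult in real-world applications.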

The cascade face detector proposed by Viola and Jones [2] utilizes Haar-Like features and AdaBoost to train cascaded classifiers, which achieve good performance with real-time efficiency. However, quite a few works [1, 3, 4] indicate that this detector may degrade significantly in real-world applications with larger visual variations of human faces even with more advanced features and classifiers. Besides the cascade structure, [5, 6, 7] introduce deformable part models (DPM) for face detection and achieve remarkable performance. However, they need high computational expense and may usually require expensive annotation in the training stage. Recently, convolutional neural networks (CNNs) achieve remarkable progresses in a variety of computer vision tasks, such as image classification [9] and face recognition [10].

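The cascade face detector proposed by Viola and Jones [2] uses Haar-like features and AdaBoost to train cascaded classifiers, which perform well at real-time speed. However, quite a few works [1, 3, 4] indicate that this detector can degrade significantly in real-world applications where faces show larger visual variations, even when more advanced features and classifiers are used. Besides the cascade structure, [5, 6, 7] introduce deformable part models (DPM) for face detection and achieve remarkable performance. However, these models are computationally expensive and usually require expensive annotation in the training stage. Recently, convolutional neural networks (CNNs) have made remarkable progress in a variety of computer vision tasks, such as image classification [9] and face recognition [10].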

Inspired by the good performance of CNNs in computer vision tasks, some of the CNNs based face detection approaches have been proposed in recent
years. Yang et al. [11] train deep convolution neural networks
for facial attribute recognition to obtain high response in face
regions which further yield candidate windows of faces.
However, due to its complex CNN structure, this approach is
time costly in practice. Li et al. [19] use cascaded CNNs for
face detection, but it requires bounding box calibration from
face detection with extra computational expense and ignores
the inherent correlation between facial landmarks localization
and bounding box regression.

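Inspired by the strong performance of CNNs in computer vision tasks, a number of CNN-based face detection approaches have been proposed in recent years. Yang et al. [11] train deep convolutional neural networks for facial attribute recognition to obtain high responses in face regions, which in turn yield candidate face windows. However, because of its complex CNN structure, this approach is time-consuming in practice. Li et al. [19] use cascaded CNNs for face detection, but their method requires bounding box calibration, which adds computational cost, and it ignores the inherent correlation between facial landmark localization and bounding box regression.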

Face alignment also attracts extensive interests. Regression-based methods [12, 13, 16] and template fitting approaches [14, 15, 7] are two popular categories. Recently, Zhang et al. [22] proposed to use facial attribute recognition as an auxiliary task to enhance face alignment performance using deep convolutional neural network.

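Face alignment has also attracted extensive research interest. Regression-based methods [12, 13, 16] and template fitting approaches [14, 15, 7] are the two popular categories. Recently, Zhang et al. [22] proposed using facial attribute recognition as an auxiliary task to improve face alignment performance with a deep convolutional neural network.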

However, most of the available face detection and face alignment methods ignore the inherent correlation between these two tasks. Though there exist several works attempt to jointly solve them, there are still limitations in these works. For example, Chen et al. [18] jointly conduct alignment and detection with random forest using features of pixel value difference. But, the handcraft features used limits its performance. Zhang et al. [20] use multi-task CNN to improve the accuracy of multi-view face detection, but the detection accuracy is limited by the initial detection windows produced by a weak face detector.

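However, most existing face detection and face alignment methods ignore the inherent correlation between these two tasks. Several works do attempt to solve them jointly, but they still have limitations. For example, Chen et al. [18] perform alignment and detection jointly with a random forest using pixel value difference features. (Note: in machine learning, a random forest is a classifier made up of many decision trees, whose output class is the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the random forest algorithm, and "Random Forests" is their trademark. The term derives from the random decision forests proposed in 1995 by Tin Kam Ho of Bell Labs. The method combines Breiman's "bootstrap aggregating" idea with Ho's "random subspace method" to build a collection of decision trees.) However, the handcrafted features limit the method's performance. Zhang et al. [20] use a multi-task CNN to improve the accuracy of multi-view face detection, but the detection accuracy is limited by the initial detection windows produced by a weak face detector.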
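The random forest note above says the forest's output is the mode of the individual trees' outputs. A minimal sketch of that voting step follows; the function name `forest_predict` and the vote labels are hypothetical, and the trees themselves are not modelled here.

```python
from collections import Counter
from typing import List


def forest_predict(tree_votes: List[str]) -> str:
    """Return the most common class among the per-tree predictions."""
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    # most_common(1) returns [(label, count)] for the top label.
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs of five decision trees for one image window.
votes = ["face", "non-face", "face", "face", "non-face"]
prediction = forest_predict(votes)
print(prediction)
```

This only illustrates the aggregation rule; it is not the joint detection and alignment method of Chen et al. [18].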

On the other hand, in the training process, mining hard samples in training is critical to strengthen the power of detector. However, traditional hard sample mining usually performs an offline manner, which significantly increases the manual operations. It is desirable to design an online hard sample mining method for face detection and alignment, which is adaptive to the current training process automatically.

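On the other hand, mining hard samples during training is critical to strengthening the detector. However, traditional hard sample mining is usually performed offline, which significantly increases the amount of manual work. It is therefore desirable to design an online hard sample mining method for face detection and alignment that adapts automatically to the current state of training.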
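One common way to do hard sample mining online is to rank the samples of each mini-batch by their current loss and train only on the hardest ones. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, and the `keep_ratio` of 0.7 is an assumed value, since this section does not give the paper's exact rule.

```python
from typing import List


def select_hard_samples(losses: List[float], keep_ratio: float = 0.7) -> List[int]:
    """Return the indices of the highest-loss samples in a mini-batch."""
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(losses) * keep_ratio))
    # Rank sample indices from largest loss to smallest.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:n_keep])


def hard_mined_loss(losses: List[float], keep_ratio: float = 0.7) -> float:
    """Average the loss over the selected hard samples only."""
    hard = select_hard_samples(losses, keep_ratio)
    return sum(losses[i] for i in hard) / len(hard)


# Hypothetical per-sample losses from one forward pass.
batch_losses = [0.05, 1.20, 0.40, 0.01, 0.90, 0.30, 0.02, 0.75, 0.10, 0.60]
hard_indices = select_hard_samples(batch_losses)
print(hard_indices)
print(hard_mined_loss(batch_losses))
```

Because the selection is recomputed from the losses of every batch, it follows the model as it improves, with no offline relabelling pass.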

In this paper, we propose a new framework to integrate these two tasks using unified cascaded CNNs by multi-task learning. The proposed CNNs consist of three stages. In the first stage, it produces candidate windows quickly through a shallow CNN. Then, it refines the windows to reject a large number of non-faces windows through a more complex CNN. Finally, it uses a more powerful CNN to refine the result and output facial landmarks positions. Thanks to this multi-task learning framework, the performance of the algorithm can be notably improved. The major contributions of this paper are summarized as follows: (1) We propose a new cascaded CNNs based framework for joint face detection and alignment, and carefully design lightweight CNN architecture for real time performance. (2) We propose an effective method to conduct online hard sample mining to improve the performance. (3) Extensive experiments are conducted on challenging benchmarks, to show the significant performance improvement of the proposed approach compared to the state-of-the-art techniques in both face detection and face alignment tasks.

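In this paper, we propose a new framework that integrates these two tasks using unified cascaded CNNs trained by multi-task learning. The proposed CNNs consist of three stages. In the first stage, a shallow CNN quickly produces candidate windows. In the second stage, a more complex CNN refines the windows by rejecting a large number of non-face windows. Finally, a more powerful CNN refines the result and outputs the facial landmark positions. Thanks to this multi-task learning framework, the performance of the algorithm is notably improved. The major contributions of this paper are summarized as follows:
1. We propose a new cascaded-CNN-based framework for joint face detection and alignment, and carefully design a lightweight CNN architecture for real-time performance.
2. We propose an effective online hard sample mining method to improve performance.
3. We conduct extensive experiments on challenging benchmarks to show that the proposed approach significantly outperforms state-of-the-art techniques in both face detection and face alignment.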
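The three-stage, coarse-to-fine idea can be sketched as a chain of filters, where each stage scores the windows that survived the previous one. This is only an illustration of the control flow: the three scorer functions are toy stand-ins based on box geometry, not CNNs, and every name, threshold, and coordinate below is hypothetical.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Scorer = Callable[[Box], float]   # returns a face score in [0, 1]


def run_cascade(candidates: List[Box], stages: List[Tuple[Scorer, float]]) -> List[Box]:
    """Pass candidate windows through each stage, dropping low scorers."""
    survivors = list(candidates)
    for scorer, threshold in stages:
        survivors = [box for box in survivors if scorer(box) >= threshold]
    return survivors


def coarse_score(box: Box) -> float:
    # Stage 1 stand-in: cheap test that rejects very small windows.
    x1, y1, x2, y2 = box
    return 1.0 if min(x2 - x1, y2 - y1) >= 12 else 0.0


def refine_score(box: Box) -> float:
    # Stage 2 stand-in: prefers roughly square windows.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return min(w, h) / max(w, h)


def final_score(box: Box) -> float:
    # Stage 3 stand-in: accepts windows covering a made-up face centre.
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= 90 <= x2 and y1 <= 90 <= y2 else 0.0


candidates = [(0, 0, 8, 8), (10, 10, 110, 40), (40, 30, 140, 150), (200, 200, 260, 270)]
stages = [(coarse_score, 0.5), (refine_score, 0.6), (final_score, 0.5)]
faces = run_cascade(candidates, stages)
print(faces)
```

The design point is that the cheapest test runs on every window while the most expensive one runs only on the few that remain, which is what makes a cascade suitable for real-time use.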

Posted by 大西瓜不甜
