Paper Notes 1.3: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

III. EXPERIMENTS

In this section, we first evaluate the effectiveness of the proposed hard sample mining strategy. Then we compare our face detector and alignment against the state-of-the-art methods on the Face Detection Data Set and Benchmark (FDDB) [25], WIDER FACE [24], and the Annotated Facial Landmarks in the Wild (AFLW) benchmark [8]. The FDDB dataset contains annotations for 5,171 faces in a set of 2,845 images. The WIDER FACE dataset consists of 393,703 labeled face bounding boxes in 32,203 images, where 50% of the images are used for testing (divided into three subsets according to image difficulty), 40% for training, and the remaining for validation. AFLW contains facial landmark annotations for 24,386 faces, and we use the same test subset as [22]. Finally, we evaluate the computational efficiency of our face detector.
 
In this part, we first evaluate the effectiveness of the proposed hard sample mining strategy. We then compare our face detector and landmark alignment against the current best methods on the Face Detection Data Set and Benchmark (FDDB) [25], WIDER FACE [24], and the Annotated Facial Landmarks in the Wild (AFLW) benchmark [8]. FDDB contains annotations for 5,171 faces in 2,845 images. WIDER FACE contains 393,703 labeled face bounding boxes in 32,203 images; 50% of the images are split into three difficulty levels for testing, 40% are used for training, and the rest are reserved for validation. AFLW contains facial landmark annotations for 24,386 faces, and we test on the same subset as [22]. Finally, we evaluate the computational efficiency of our face detector.
 
A. Training Data
Since we jointly perform face detection and alignment, here we use four different kinds of data annotation in our training process: (i) Negatives: regions whose Intersection-over-Union (IoU) ratio with any ground-truth face is less than 0.3; (ii) Positives: IoU above 0.65 with a ground-truth face; (iii) Part faces: IoU between 0.4 and 0.65 with a ground-truth face; and (iv) Landmark faces: faces with the positions of five landmarks labeled. Negatives and positives are used for the face classification task, positives and part faces are used for bounding box regression, and landmark faces are used for facial landmark localization. The training data for each network is described as follows:
 
A. Training Data
Since we want to handle face detection and landmark alignment jointly, we use four different kinds of data annotation in the training process: (i) Negatives: regions whose IoU ratio with every ground-truth annotation is below 0.3. (IoU, Intersection over Union, is the overlap ratio between the window produced by the model and the originally labeled window; simply put, the intersection of the detection result and the ground truth divided by their union.) (ii) Positives: IoU above 0.65. (iii) Part faces: IoU between 0.4 and 0.65. (iv) Landmark faces: faces with the positions of five key facial points labeled. Negatives and positives are used to train the face classification task, positives and part faces to train bounding box regression, and landmark faces to train facial landmark localization.
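For readers new to the metric, here is a minimal Python sketch of IoU for axis-aligned boxes; this is my own illustration, not the authors' code, and the (x1, y1, x2, y2) coordinate convention is an assumption the paper does not specify:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```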
The training data for each network is collected as follows:
 
1) P-Net: We randomly crop several patches from WIDER FACE [24] to collect positives, negatives, and part faces. Then, we crop faces from CelebA [23] as landmark faces.
 
We randomly crop the images in WIDER FACE into patches to collect the positive, negative, and part-face samples, and then use faces cropped from CelebA [23] as the fourth group, the landmark faces.
 
2) R-Net: We use the first stage of our framework to detect faces from WIDER FACE [24] to collect positives, negatives, and part faces, while landmark faces are detected from CelebA [23].
 
We use the first stage of the framework to collect the positive, negative, and part-face samples from WIDER FACE; the fourth group is obtained from CelebA as above.
 
3) O-Net: Similar to R-Net in collecting data, but we use the first two stages of our framework to detect faces.
 
Similar to R-Net, except that the first two stages of the framework are used to detect faces.
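Putting the subsections above together, the crop-labeling rule can be sketched as follows. This is my own illustration of the thresholds quoted in A, not the authors' released code; it reuses the `iou` helper above, and discarding crops in the 0.3 to 0.4 IoU gap is an assumption, since the paper does not say what happens to them.

```python
def label_crop(crop_box, gt_boxes):
    """Assign a training label to a random crop based on its best IoU
    with the ground-truth face boxes (thresholds from the paper)."""
    best = max((iou(crop_box, gt) for gt in gt_boxes), default=0.0)
    if best < 0.3:
        return "negative"   # used for face classification
    if best > 0.65:
        return "positive"   # used for classification + bbox regression
    if best >= 0.4:
        return "part_face"  # used for bbox regression
    return "discard"        # 0.3 <= IoU < 0.4: unused (my assumption)
```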
 
B. The effectiveness of online hard sample mining
To evaluate the contribution of the proposed online hard sample mining strategy, we train two O-Nets (with and without online hard sample mining) and compare their loss curves. To make the comparison more direct, we only train the O-Nets for the face classification task. All training parameters, including the network initialization, are the same in these two O-Nets. To make them easier to compare, we use a fixed learning rate. Fig. 3 (a) shows the loss curves of the two different training schemes. It is very clear that hard sample mining is beneficial to performance.
 
B. The effect of online hard sample mining

To see how well the proposed online mining strategy works, we train two O-Nets (with and without online hard sample mining) and compare their loss curves. To make the comparison more intuitive, we train the O-Nets only on the face classification task. All training parameters, including the network initialization, are identical in the two experiments, and we use a fixed learning rate so they are easier to compare. Fig. 3 (a) shows the loss curves of the two training schemes; clearly, online mining of hard samples is very helpful for improving the results.
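Concretely, the paper's strategy sorts the per-sample losses in each mini-batch and backpropagates only the top 70% of them (the "hard" samples); the 70% keep ratio is the figure quoted in the paper's method section. A minimal PyTorch-style sketch of that selection step, my own illustration rather than the authors' MATLAB code:

```python
import torch

def hard_sample_loss(per_sample_loss: torch.Tensor,
                     keep_ratio: float = 0.7) -> torch.Tensor:
    """Average the loss over only the hardest samples in a mini-batch.

    per_sample_loss: 1-D tensor of unreduced per-sample losses.
    Easy samples are dropped, so they contribute no gradient.
    """
    num_keep = max(1, int(keep_ratio * per_sample_loss.numel()))
    # Keep the largest losses; equivalent to sorting and taking the top 70%.
    hard_losses, _ = torch.topk(per_sample_loss, num_keep)
    return hard_losses.mean()

# Usage with an unreduced classification loss:
# loss = hard_sample_loss(
#     torch.nn.functional.cross_entropy(logits, labels, reduction="none"))
```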
 
C. The effectiveness of joint detection and alignment
To evaluate the contribution of joint detection and alignment, we evaluate the performance of two different O-Nets on FDDB, one trained jointly with the facial landmark regression task and one without (with the same P-Net and R-Net for a fair comparison). We also compare the performance of bounding box regression in these two O-Nets. Fig. 3 (b) suggests that jointly learning the landmark localization task is beneficial for both the face classification and bounding box regression tasks.
 
C. The effect of combining face detection with landmark alignment
We evaluate two different O-Nets on the FDDB dataset (using the same P-Net and R-Net) and also compare their bounding box regression performance. Fig. 3 (b) shows that making the two tasks work together benefits both face classification and bounding box regression.
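For context, "joint" here refers to the multi-task objective from Section II of the paper, which weights the detection, bounding box, and landmark losses per sample:

$$\min \sum_{i=1}^{N} \sum_{j \in \{\mathrm{det},\,\mathrm{box},\,\mathrm{landmark}\}} \alpha_j\, \beta_i^{j}\, L_i^{j}$$

where $\beta_i^{j} \in \{0, 1\}$ indicates whether task $j$ applies to sample $i$ (e.g., the bounding box loss is not computed on negatives), and the paper sets the task weights to $\alpha_{det}=1$, $\alpha_{box}=0.5$, $\alpha_{landmark}=0.5$ for P-Net and R-Net, and $\alpha_{det}=1$, $\alpha_{box}=0.5$, $\alpha_{landmark}=1$ for O-Net.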
 
 
 
D. Evaluation on face detection
To evaluate the performance of our face detection method, we compare our method against the state-of-the-art methods [1, 5, 6, 11, 18, 19, 26, 27, 28, 29] on FDDB, and the state-of-the-art methods [1, 24, 11] on WIDER FACE. Fig. 4 (a)-(d) shows that our method consistently outperforms all the previous approaches by a large margin on both benchmarks. We also evaluate our approach on some challenging photos (examples are shown at http://kpzhang93.github.io/SPL/index.html).
 
D. Evaluation of face detection performance
We compare our method against the current best algorithms [1, 5, 6, 11, 18, 19, 26, 27, 28, 29] on the FDDB dataset, and against the best-performing methods [1, 24, 11] on WIDER FACE. As Fig. 4 (a)-(d) shows, our algorithm consistently outperforms all previous approaches by a large margin on both datasets. We also evaluate our algorithm on some very challenging photos (examples are shown at http://kpzhang93.github.io/SPL/index.html).
 
E. Evaluation on face alignment
In this part, we compare the face alignment performance of our method against the following methods: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22]. In the testing phase, there are 13 images in which our method fails to detect a face. So we crop the central region of these 13 images and treat it as the input for O-Net. The mean error is measured by the distances between the estimated landmarks and the ground truths, normalized with respect to the inter-ocular distance. Fig. 4 (e) shows that our method outperforms all the state-of-the-art methods by a margin.

E. Evaluation of landmark alignment performance

In this part, we compare against the following algorithms: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22].

In the testing phase, there are 13 images on which our method fails to detect a face, so we crop the central region of each of these 13 images and use it as the input to O-Net. The mean error is computed from the distances between the estimated landmark positions and the ground-truth annotations, normalized by the inter-ocular distance. Fig. 4 (e) shows that our method outperforms the current best methods by a margin.
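A minimal sketch of this normalized mean error, assuming the 5-point landmarks are stored as (num_faces, 5, 2) arrays with the two eye centers at indices 0 and 1 (the index convention is my assumption, not the paper's):

```python
import numpy as np

def normalized_mean_error(pred, gt):
    """Mean landmark error normalized by the inter-ocular distance.

    pred, gt: arrays of shape (num_faces, 5, 2); landmark index 0 is
    the left eye and index 1 the right eye (assumed convention).
    """
    # Mean per-face Euclidean distance between prediction and ground truth.
    errors = np.linalg.norm(pred - gt, axis=2).mean(axis=1)     # (num_faces,)
    # Eye-to-eye distance of each ground-truth face, used as the normalizer.
    inter_ocular = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=1)  # (num_faces,)
    return float((errors / inter_ocular).mean())
```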
 
F. Runtime efficiency
Given the cascade structure, our method can achieve very fast speed in joint face detection and alignment: it runs at 16 fps on a 2.60 GHz CPU and 99 fps on a GPU (Nvidia Titan Black). Our implementation is currently based on unoptimized MATLAB code.
 
F. Runtime efficiency
Thanks to the cascaded structure, our method can run joint face detection and landmark alignment very fast: it reaches 16 fps on a 2.60 GHz CPU and 99 fps on an Nvidia Titan Black GPU. The current experiments are based on unoptimized MATLAB code.
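As a sanity check on these numbers: 16 fps corresponds to about 1000/16 ≈ 62.5 ms per image on the CPU, and 99 fps to roughly 10 ms per image on the GPU, which is comfortably real-time for video.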
 
IV. CONCLUSION
In this paper, we have proposed a multi-task cascaded CNNs based framework for joint face detection and alignment. Experimental results demonstrate that our methods consistently outperform the state-of-the-art methods across several challenging benchmarks (including the FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment) while keeping real-time performance. In the future, we will exploit the inherent correlation between face detection and other face analysis tasks to further improve the performance.
 
IV. Conclusion

In this paper, we proposed a multi-task cascaded CNN framework that combines face detection with landmark alignment. Experimental results show that our method consistently outperforms the current state-of-the-art methods while maintaining real-time speed. In the future, we will combine face detection with other face analysis tasks to further improve performance.
 
REFERENCES
 
[1] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Aggregate channel features for multi-view face detection," in IEEE International Joint Conference on Biometrics, 2014, pp. 1-8.
[Fig. 3. (a) Validation loss of O-Net with and without hard sample mining. (b) "JA" denotes joint face alignment learning; "No JA" denotes not learning it jointly; "No JA in BBR" denotes not learning it jointly while training the CNN for bounding box regression.]
[Fig. 4. (a) Evaluation on FDDB. (b-d) Evaluation on the three subsets of WIDER FACE; the number following each method indicates the average accuracy. (e) Evaluation on AFLW for face alignment.]
[2] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[3] M. T. Pham, Y. Gao, V. D. D. Hoang, and T. J. Cham, "Fast polygonal integration and its application in extending haar-like features to improve object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 942-949.
[4] Q. Zhu, M. C. Yeh, K. T. Cheng, and S. Avidan, "Fast human detection using a cascade of histograms of oriented gradients," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1491-1498.
[5] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, "Face detection without bells and whistles," in European Conference on Computer Vision, 2014, pp. 720-735.
[6] J. Yan, Z. Lei, L. Wen, and S. Li, "The fastest deformable part model for object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2497-2504.
[7] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879-2886.
[8] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in IEEE International Conference on Computer Vision Workshops, 2011, pp. 2144-2151.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Advances in Neural Information Processing Systems, 2014, pp. 1988-1996.
[11] S. Yang, P. Luo, C. C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in IEEE International Conference on Computer Vision, 2015, pp. 3676-3684.
[12] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, "Robust face landmark estimation under occlusion," in IEEE International Conference on Computer Vision, 2013, pp. 1513-1520.
[13] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," International Journal of Computer Vision, vol. 107, no. 2, pp. 177-190, 2012.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, 2001.
[15] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, "Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model," in IEEE International Conference on Computer Vision, 2013, pp. 1944-1951.
[16] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment," in European Conference on Computer Vision, 2014, pp. 1-16.
[17] Luxand Incorporated: Luxand face SDK, http://www.luxand.com/
[18] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, "Joint cascade face detection and alignment," in European Conference on Computer Vision, 2014, pp. 109-122.
[19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325-5334.
[20] C. Zhang and Z. Zhang, "Improving multiview face detection with multi-task deep convolutional neural networks," in IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 1036-1041.
[21] X. Xiong and F. Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532-539.
[22] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in European Conference on Computer Vision, 2014, pp. 94-108.
[23] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in IEEE International Conference on Computer Vision, 2015, pp. 3730-3738.
[24] S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," arXiv preprint arXiv:1511.06523.
[25] V. Jain and E. G. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[26] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Convolutional channel features," in IEEE International Conference on Computer Vision, 2015, pp. 82-90.
[27] R. Ranjan, V. M. Patel, and R. Chellappa, "A deep pyramid deformable part model for face detection," in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1-8.
[28] G. Ghiasi and C. C. Fowlkes, "Occlusion coherence: Detecting and localizing occluded faces," arXiv preprint arXiv:1506.08347.
[29] S. S. Farfade, M. J. Saberian, and L. J. Li, "Multi-view face detection using deep convolutional neural networks," in ACM International Conference on Multimedia Retrieval, 2015, pp. 643-650.
 
 
 
 
 
 
 