Paper Notes 1.3: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

III. EXPERIMENTS

In this section, we first evaluate the effectiveness of the proposed hard sample mining strategy. Then we compare our face detector and alignment against the state-of-the-art methods on the Face Detection Data Set and Benchmark (FDDB) [25], WIDER FACE [24], and the Annotated Facial Landmarks in the Wild (AFLW) benchmark [8]. The FDDB dataset contains annotations for 5,171 faces in a set of 2,845 images. The WIDER FACE dataset consists of 393,703 labeled face bounding boxes in 32,203 images, where 50% of the images are used for testing (divided into three subsets according to image difficulty), 40% for training, and the remaining for validation. AFLW contains facial landmark annotations for 24,386 faces, and we use the same test subset as [22]. Finally, we evaluate the computational efficiency of our face detector.
 
In this part, we first evaluate the effectiveness of the proposed hard sample mining strategy. We then compare our face detector and landmark alignment against the current best methods on the Face Detection Data Set and Benchmark (FDDB) [25], WIDER FACE [24], and the Annotated Facial Landmarks in the Wild (AFLW) benchmark [8]. FDDB contains annotations for 5,171 faces in 2,845 images. WIDER FACE contains 393,703 labeled face bounding boxes in 32,203 images; 50% of the images are split into three difficulty levels for testing, 40% are used for training, and the rest are reserved for validation. AFLW contains facial landmark annotations for 24,386 faces, and we test on the same subset as [22]. Finally, we evaluate the computational efficiency of our face detector.
 
A. Training Data
Since we jointly perform face detection and alignment, here we use four different kinds of data annotation in our training process: (i) Negatives: regions whose Intersection-over-Union (IoU) ratio with any ground-truth face is less than 0.3; (ii) Positives: IoU above 0.65 with a ground-truth face; (iii) Part faces: IoU between 0.4 and 0.65 with a ground-truth face; and (iv) Landmark faces: faces with the positions of five landmarks labeled. Negatives and positives are used for the face classification task, positives and part faces are used for bounding box regression, and landmark faces are used for facial landmark localization. The training data for each network is described as follows:
 
A. Training Data
Since we want to handle face detection and landmark alignment jointly, we use four different kinds of data annotation in the training process: (i) Negatives: regions whose IoU ratio with every ground-truth annotation is below 0.3. (IoU, Intersection over Union, is the overlap ratio between the window produced by the model and the originally labeled window; simply put, the intersection of the detection result and the ground truth divided by their union.) (ii) Positives: IoU above 0.65. (iii) Part faces: IoU between 0.4 and 0.65. (iv) Landmark faces: faces with the positions of five key facial points labeled. Negatives and positives are used to train the face classification task, positives and part faces to train bounding box regression, and landmark faces to train facial landmark localization.
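For readers new to the metric, here is a minimal Python sketch of IoU for axis-aligned boxes; this is my own illustration, not the authors' code, and the (x1, y1, x2, y2) coordinate convention is an assumption the paper does not specify:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```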
The training data for each network is collected as follows:
 
1) P-Net: We randomly crop several patches from WIDER FACE [24] to collect positives, negatives, and part faces. Then, we crop faces from CelebA [23] as landmark faces.
 
We randomly crop the images in WIDER FACE into patches to collect the positive, negative, and part-face samples, and then use faces cropped from CelebA [23] as the fourth group, the landmark faces.
 
2) R-Net: We use the first stage of our framework to detect faces from WIDER FACE [24] to collect positives, negatives, and part faces, while landmark faces are detected from CelebA [23].
 
We use the first stage of the framework to collect the positive, negative, and part-face samples from WIDER FACE; the fourth group is obtained from CelebA as above.
 
3) O-Net: Similar to R-Net in collecting data, but we use the first two stages of our framework to detect faces.
 
Similar to R-Net, except that the first two stages of the framework are used to detect faces.
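Putting the subsections above together, the crop-labeling rule can be sketched as follows. This is my own illustration of the thresholds quoted in A, not the authors' released code; it reuses the `iou` helper above, and discarding crops in the 0.3 to 0.4 IoU gap is an assumption, since the paper does not say what happens to them.

```python
def label_crop(crop_box, gt_boxes):
    """Assign a training label to a random crop based on its best IoU
    with the ground-truth face boxes (thresholds from the paper)."""
    best = max((iou(crop_box, gt) for gt in gt_boxes), default=0.0)
    if best < 0.3:
        return "negative"   # used for face classification
    if best > 0.65:
        return "positive"   # used for classification + bbox regression
    if best >= 0.4:
        return "part_face"  # used for bbox regression
    return "discard"        # 0.3 <= IoU < 0.4: unused (my assumption)
```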
 
B. The effectiveness of online hard sample mining
To evaluate the contribution of the proposed online hard sample mining strategy, we train two O-Nets (with and without online hard sample mining) and compare their loss curves. To make the comparison more direct, we only train the O-Nets for the face classification task. All training parameters, including the network initialization, are the same in these two O-Nets. To make them easier to compare, we use a fixed learning rate. Fig. 3 (a) shows the loss curves of the two different training schemes. It is very clear that hard sample mining is beneficial to performance.
 
B. The effect of online hard sample mining

To see how well the proposed online mining strategy works, we train two O-Nets (with and without online hard sample mining) and compare their loss curves. To make the comparison more intuitive, we train the O-Nets only on the face classification task. All training parameters, including the network initialization, are identical in the two experiments, and we use a fixed learning rate so they are easier to compare. Fig. 3 (a) shows the loss curves of the two training schemes; clearly, online mining of hard samples is very helpful for improving the results.
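Concretely, the paper's strategy sorts the per-sample losses in each mini-batch and backpropagates only the top 70% of them (the "hard" samples); the 70% keep ratio is the figure quoted in the paper's method section. A minimal PyTorch-style sketch of that selection step, my own illustration rather than the authors' MATLAB code:

```python
import torch

def hard_sample_loss(per_sample_loss: torch.Tensor,
                     keep_ratio: float = 0.7) -> torch.Tensor:
    """Average the loss over only the hardest samples in a mini-batch.

    per_sample_loss: 1-D tensor of unreduced per-sample losses.
    Easy samples are dropped, so they contribute no gradient.
    """
    num_keep = max(1, int(keep_ratio * per_sample_loss.numel()))
    # Keep the largest losses; equivalent to sorting and taking the top 70%.
    hard_losses, _ = torch.topk(per_sample_loss, num_keep)
    return hard_losses.mean()

# Usage with an unreduced classification loss:
# loss = hard_sample_loss(
#     torch.nn.functional.cross_entropy(logits, labels, reduction="none"))
```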
 
C. The effectiveness of joint detection and alignment
To evaluate the contribution of joint detection and alignment, we evaluate the performance of two different O-Nets on FDDB, one trained jointly with the facial landmark regression task and one without (with the same P-Net and R-Net for a fair comparison). We also compare the performance of bounding box regression in these two O-Nets. Fig. 3 (b) suggests that jointly learning the landmark localization task is beneficial for both the face classification and bounding box regression tasks.
 
C. The effect of combining face detection with landmark alignment
We evaluate two different O-Nets on the FDDB dataset (using the same P-Net and R-Net) and also compare their bounding box regression performance. Fig. 3 (b) shows that making the two tasks work together benefits both face classification and bounding box regression.
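For context, "joint" here refers to the multi-task objective from Section II of the paper, which weights the detection, bounding box, and landmark losses per sample:

$$\min \sum_{i=1}^{N} \sum_{j \in \{\mathrm{det},\,\mathrm{box},\,\mathrm{landmark}\}} \alpha_j\, \beta_i^{j}\, L_i^{j}$$

where $\beta_i^{j} \in \{0, 1\}$ indicates whether task $j$ applies to sample $i$ (e.g., the bounding box loss is not computed on negatives), and the paper sets the task weights to $\alpha_{det}=1$, $\alpha_{box}=0.5$, $\alpha_{landmark}=0.5$ for P-Net and R-Net, and $\alpha_{det}=1$, $\alpha_{box}=0.5$, $\alpha_{landmark}=1$ for O-Net.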
 
 
 
D. Evaluation on face detection
To evaluate the performance of our face detection method, we compare our method against the state-of-the-art methods [1, 5, 6, 11, 18, 19, 26, 27, 28, 29] on FDDB, and the state-of-the-art methods [1, 24, 11] on WIDER FACE. Fig. 4 (a)-(d) shows that our method consistently outperforms all the previous approaches by a large margin on both benchmarks. We also evaluate our approach on some challenging photos (examples are shown at http://kpzhang93.github.io/SPL/index.html).
 
D. Evaluation of face detection performance
We compare our method against the current best algorithms [1, 5, 6, 11, 18, 19, 26, 27, 28, 29] on the FDDB dataset, and against the best-performing methods [1, 24, 11] on WIDER FACE. As Fig. 4 (a)-(d) shows, our algorithm consistently outperforms all previous approaches by a large margin on both datasets. We also evaluate our algorithm on some very challenging photos (examples are shown at http://kpzhang93.github.io/SPL/index.html).
 
E. Evaluation on face alignment
In this part, we compare the face alignment performance of our method against the following methods: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22]. In the testing phase, there are 13 images in which our method fails to detect a face. So we crop the central region of these 13 images and treat it as the input for O-Net. The mean error is measured by the distances between the estimated landmarks and the ground truths, normalized with respect to the inter-ocular distance. Fig. 4 (e) shows that our method outperforms all the state-of-the-art methods by a margin.

E. Evaluation of landmark alignment performance

In this part, we compare against the following algorithms: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22].

In the testing phase, there are 13 images on which our method fails to detect a face, so we crop the central region of each of these 13 images and use it as the input to O-Net. The mean error is computed from the distances between the estimated landmark positions and the ground-truth annotations, normalized by the inter-ocular distance. Fig. 4 (e) shows that our method outperforms the current best methods by a margin.
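A minimal sketch of this normalized mean error, assuming the 5-point landmarks are stored as (num_faces, 5, 2) arrays with the two eye centers at indices 0 and 1 (the index convention is my assumption, not the paper's):

```python
import numpy as np

def normalized_mean_error(pred, gt):
    """Mean landmark error normalized by the inter-ocular distance.

    pred, gt: arrays of shape (num_faces, 5, 2); landmark index 0 is
    the left eye and index 1 the right eye (assumed convention).
    """
    # Mean per-face Euclidean distance between prediction and ground truth.
    errors = np.linalg.norm(pred - gt, axis=2).mean(axis=1)     # (num_faces,)
    # Eye-to-eye distance of each ground-truth face, used as the normalizer.
    inter_ocular = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=1)  # (num_faces,)
    return float((errors / inter_ocular).mean())
```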
 
F. Runtime efficiency
Given the cascade structure, our method can achieve very fast speed in joint face detection and alignment: it runs at 16 fps on a 2.60 GHz CPU and 99 fps on a GPU (Nvidia Titan Black). Our implementation is currently based on unoptimized MATLAB code.
 
F. Runtime efficiency
Thanks to the cascaded structure, our method can run joint face detection and landmark alignment very fast: it reaches 16 fps on a 2.60 GHz CPU and 99 fps on an Nvidia Titan Black GPU. The current experiments are based on unoptimized MATLAB code.
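As a sanity check on these numbers: 16 fps corresponds to about 1000/16 ≈ 62.5 ms per image on the CPU, and 99 fps to roughly 10 ms per image on the GPU, which is comfortably real-time for video.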
 
IV. CONCLUSION
In this paper, we have proposed a multi-task cascaded CNNs based framework for joint face detection and alignment. Experimental results demonstrate that our methods consistently outperform the state-of-the-art methods across several challenging benchmarks (including the FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment) while keeping real-time performance. In the future, we will exploit the inherent correlation between face detection and other face analysis tasks to further improve the performance.
 
IV. Conclusion

In this paper, we proposed a multi-task cascaded CNN framework that combines face detection with landmark alignment. Experimental results show that our method consistently outperforms the current state-of-the-art methods while maintaining real-time speed. In the future, we will combine face detection with other face analysis tasks to further improve performance.
 
REFERENCES
 
[1] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Aggregate channel features for multi-view face detection," in IEEE International Joint Conference on Biometrics, 2014, pp. 1-8.
[Fig. 3. (a) Validation loss of O-Net with and without hard sample mining. (b) "JA" denotes joint face alignment learning; "No JA" denotes not learning it jointly; "No JA in BBR" denotes not learning it jointly while training the CNN for bounding box regression.]
[Fig. 4. (a) Evaluation on FDDB. (b-d) Evaluation on the three subsets of WIDER FACE; the number following each method indicates the average accuracy. (e) Evaluation on AFLW for face alignment.]
[2] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[3] M. T. Pham, Y. Gao, V. D. D. Hoang, and T. J. Cham, "Fast polygonal integration and its application in extending haar-like features to improve object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 942-949.
[4] Q. Zhu, M. C. Yeh, K. T. Cheng, and S. Avidan, "Fast human detection using a cascade of histograms of oriented gradients," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1491-1498.
[5] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, "Face detection without bells and whistles," in European Conference on Computer Vision, 2014, pp. 720-735.
[6] J. Yan, Z. Lei, L. Wen, and S. Li, "The fastest deformable part model for object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2497-2504.
[7] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879-2886.
[8] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in IEEE International Conference on Computer Vision Workshops, 2011, pp. 2144-2151.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Advances in Neural Information Processing Systems, 2014, pp. 1988-1996.
[11] S. Yang, P. Luo, C. C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in IEEE International Conference on Computer Vision, 2015, pp. 3676-3684.
[12] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, "Robust face landmark estimation under occlusion," in IEEE International Conference on Computer Vision, 2013, pp. 1513-1520.
[13] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," International Journal of Computer Vision, vol. 107, no. 2, pp. 177-190, 2012.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, 2001.
[15] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, "Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model," in IEEE International Conference on Computer Vision, 2013, pp. 1944-1951.
[16] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment," in European Conference on Computer Vision, 2014, pp. 1-16.
[17] Luxand Incorporated: Luxand face SDK, http://www.luxand.com/
[18] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, "Joint cascade face detection and alignment," in European Conference on Computer Vision, 2014, pp. 109-122.
[19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325-5334.
[20] C. Zhang and Z. Zhang, "Improving multiview face detection with multi-task deep convolutional neural networks," in IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 1036-1041.
[21] X. Xiong and F. Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532-539.
[22] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in European Conference on Computer Vision, 2014, pp. 94-108.
[23] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in IEEE International Conference on Computer Vision, 2015, pp. 3730-3738.
[24] S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," arXiv preprint arXiv:1511.06523.
[25] V. Jain and E. G. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[26] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Convolutional channel features," in IEEE International Conference on Computer Vision, 2015, pp. 82-90.
[27] R. Ranjan, V. M. Patel, and R. Chellappa, "A deep pyramid deformable part model for face detection," in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1-8.
[28] G. Ghiasi and C. C. Fowlkes, "Occlusion coherence: Detecting and localizing occluded faces," arXiv preprint arXiv:1506.08347.
[29] S. S. Farfade, M. J. Saberian, and L. J. Li, "Multi-view face detection using deep convolutional neural networks," in ACM International Conference on Multimedia Retrieval, 2015, pp. 643-650.
 
 
 
 
 
 
 