PFLD: A Practical Facial Landmark Detector

Xiaojie Guo¹, Siyuan Li¹, Jinke Yu¹, Jiawan Zhang¹, Jiayi Ma², Lin Ma³, Wei Liu³, and Haibin Ling⁴

¹Tianjin University  ²Wuhan University  ³Tencent AI Lab  ⁴Temple University

Abstract

Being accurate, efficient, and compact is essential to a facial landmark detector for practical use. To simultaneously consider the three concerns, this paper investigates a neat model with promising detection accuracy under wild environments (e.g., unconstrained pose, expression, lighting, and occlusion conditions) and super real-time speed on a mobile device. More concretely, we customize an end-to-end single stage network associated with acceleration techniques. During the training phase, for each sample, rotation information is estimated for geometrically regularizing landmark localization, which is then NOT involved in the testing phase. A novel loss is designed to, besides considering the geometrical regularization, mitigate the issue of data imbalance by adjusting weights of samples to different states, such as large pose, extreme lighting, and occlusion, in the training set. Extensive experiments are conducted to demonstrate the efficacy of our design and reveal its superior performance over state-of-the-art alternatives on widely-adopted challenging benchmarks, i.e., 300W (including iBUG, LFPW, AFW, HELEN, and XM2VTS) and AFLW. Our model can be merely 2.1Mb of size and reach over 140 fps per face on a mobile phone (Qualcomm ARM 845 processor) with high precision, making it attractive for large-scale or real-time applications. We have made our practical system based on PFLD 0.25X model publicly available at http://sites.google.com/view/xjguo/fld for encouraging comparisons and improvements from the community.

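Accuracy, efficiency, and compactness all matter for a facial landmark detector in practical use. To address all three at once, this paper studies a neat model that keeps detection accuracy high in the wild (unconstrained pose, expression, lighting, and occlusion) and runs at super real-time speed on a mobile device. More concretely, we customize an end-to-end, single-stage network and apply acceleration techniques. During training, rotation information is estimated for each sample and used to regularize landmark localization geometrically; this step is not involved at test time. We design a new loss that, besides the geometric regularization, mitigates data imbalance by adjusting the weights of samples in different states, such as large pose, extreme lighting, and occlusion, in the training set. Extensive experiments demonstrate the efficacy of the design and show that it outperforms state-of-the-art alternatives on widely adopted, challenging benchmarks: 300W (including iBUG, LFPW, AFW, HELEN, and XM2VTS) and AFLW. The model can be as small as 2.1 MB and reaches over 140 fps per face on a mobile phone (Qualcomm ARM 845 processor) with high precision, which makes it attractive for large-scale or real-time applications. Our practical system, based on the PFLD 0.25X model, is publicly available at http://sites.google.com/view/xjguo/fld to encourage comparisons and improvements from the community.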
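To put the two headline figures in perspective, the small calculation below converts 140 fps into a per-face time budget and 2.1 MB into a rough weight count. It is only a back-of-the-envelope sketch: it assumes the size means megabytes (10^6 bytes) and that every weight is a 4-byte float with no file overhead, neither of which the abstract states. The function names are made up for illustration.

```python
def per_face_budget_ms(fps: float) -> float:
    """Time available per face, in milliseconds, at a given throughput."""
    return 1000.0 / fps


def approx_param_count(model_bytes: float, bytes_per_weight: int = 4) -> int:
    """Rough weight count if the file held nothing but fixed-width weights.

    Assumption: 4 bytes per weight (float32) and no container overhead.
    """
    return int(model_bytes // bytes_per_weight)


budget = per_face_budget_ms(140)       # about 7.14 ms per face
params = approx_param_count(2.1e6)     # 525,000 weights under the assumptions above
print(f"{budget:.2f} ms per face, ~{params:,} weights")
```

At this throughput the whole forward pass has to finish in roughly 7 ms, which is why the later sections care so much about a lightweight backbone.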

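End-to-end:

"End-to-end" is defined in contrast to non-end-to-end. In a non-end-to-end pipeline, the input data is first processed by hand (features are extracted manually) and only then fed to the model for training. The hand-crafted features may be inaccurate or badly biased, in which case even a very good algorithm cannot produce satisfactory results.

An example: classify a group of people by gender. If the extracted feature is hair color, the result will be poor whatever the classifier. Hair length is a much better feature but still produces errors. A very strong feature, such as chromosome data, makes the classification almost always correct.

End-to-end means feeding in the raw data directly, letting the model learn the features itself, and reading off the final output, with no manual step in between. It is like a factory: corn goes in, popcorn comes out, and we take no part in anything that happens in between.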
———————————————
Excerpted from: https://blog.csdn.net/qq_41621362/article/details/91881130
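The gender example can be made concrete with a tiny, hand-made dataset. The sketch below is purely illustrative: the eight records are invented, and `best_lookup_accuracy` is a hypothetical helper that measures how well the best possible classifier could do if it saw only one hand-picked feature. It shows the point of the note, namely that a poor feature caps the accuracy no matter which classifier sits on top.

```python
from collections import Counter, defaultdict

# Invented toy records: (hair_color, hair_length, chromosomes, gender)
PEOPLE = [
    ("black", "long", "XX", "F"),
    ("black", "short", "XY", "M"),
    ("brown", "long", "XX", "F"),
    ("brown", "short", "XY", "M"),
    ("black", "short", "XX", "F"),
    ("black", "long", "XY", "M"),
    ("brown", "long", "XX", "F"),
    ("brown", "short", "XY", "M"),
]


def best_lookup_accuracy(records, feature_index):
    """Accuracy ceiling for any classifier that sees only one feature.

    For each feature value the best such classifier predicts the majority
    label, so we count the majority label in every group.
    """
    labels_by_value = defaultdict(Counter)
    for record in records:
        labels_by_value[record[feature_index]][record[-1]] += 1
    correct = sum(max(counter.values()) for counter in labels_by_value.values())
    return correct / len(records)


for name, index in [("hair color", 0), ("hair length", 1), ("chromosomes", 2)]:
    print(f"{name}: {best_lookup_accuracy(PEOPLE, index):.2f}")
# hair color: 0.50
# hair length: 0.75
# chromosomes: 1.00
```

An end-to-end model avoids this manual choice by learning its own features from the raw input.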

1. Introduction

Facial landmark detection a.k.a. face alignment aims to automatically localize a group of pre-defined fiducial points (e.g., eye corners, mouth corners, etc.) on human faces. As a fundamental component in a variety of face applications, such as face recognition [21, 49] and verification [27], as well as face morphing [11] and editing [28], this problem has been drawing much attention from the vision community with a great progress made over the past years. However, developing a practical facial landmark detector remains challenging, as the detection accuracy, processing speed, and model size should all be concerned.

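Facial landmark detection, a.k.a. (also known as) face alignment, automatically localizes a set of pre-defined fiducial points on a face, such as the eye corners and mouth corners. It is a key component of many face applications, including face recognition [21, 49], verification [27], morphing [11], and editing [28], so the problem has drawn wide attention in the computer vision community and has seen great progress in recent years. Building a practical landmark detector is still challenging, however, because detection accuracy, running speed, and model size all have to be taken into account.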

Acquiring perfect faces is barely the case in real-world situations. In other words, human faces are often exposed in under-controlled or even unconstrained environments. The appearance has large variations of poses, expressions and shapes under various lighting conditions, sometimes with partial occlusions. Figure 1 provides several such examples. Besides, sufficient training data for data-driven approaches is also key to model performance. It may be viable to capture several persons’ faces under different conditions with balanced consideration though, this collecting manner becomes impractical especially when large-scale data is required to train (deep) models. Under the circumstances, one often comes across an imbalanced data distribution. The following summarizes issues regarding the landmark detection accuracy into three challenges.

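Perfect faces are almost never available in the real world. In other words, faces are usually captured in poorly controlled or even unconstrained environments. They show large variations in pose, expression, and shape under many different lighting conditions, and are sometimes partially occluded; Figure 1 gives several examples. Sufficient training data is also key to the performance of data-driven methods. Capturing a few people's faces under different conditions in a balanced way may be feasible, but this way of collecting data becomes impractical when large-scale data is needed to train deep models. In that situation the data distribution is often imbalanced. The issues affecting landmark detection accuracy are summarized below as three challenges.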

Challenge #1 - Local Variation. Expression, local extreme lighting (e.g., highlight and shading), and occlusion bring partial changes/interferences onto face images. Landmarks of some regions may deviate from their normal positions or even disappear.

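Challenge 1: local variation. Expression, local extreme lighting (for example highlights and shading), and occlusion cause partial changes to, and interference with, the face image. Landmarks in the affected regions may deviate from their normal positions or disappear altogether.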

Challenge #2 - Global Variation. Pose and imaging quality are two main factors globally affecting the appearance of faces in images, which would result in poor localization of a (large) fraction of landmarks when the global structure of faces is mis-estimated.

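Challenge 2: global variation. Pose and image quality are the two main factors that affect the appearance of a face globally. When the global structure of the face is mis-estimated, a large fraction of the landmarks end up poorly localized.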

Challenge #3 - Data Imbalance. It is not uncommon that, in both shallow learning and deep learning, an available dataset exhibits an unequal distribution between its classes/attributes. The imbalance highly likely makes an algorithm/model fail to properly represent the characteristics of the data, thus offering unsatisfactory accuracies across different attributes.

The above challenges considerably increase the difficulty of accurate detection, demanding the detector to be robust.

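Challenge 3: data imbalance. In both shallow and deep learning, it is not unusual for an available dataset to be unevenly distributed across its classes or attributes. The imbalance is very likely to keep an algorithm or model from representing the characteristics of the data properly, so accuracy is unsatisfactory on some attributes.

Together these challenges make accurate detection considerably harder and call for a robust detector.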
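One common way to quantify the data-imbalance challenge is to count how often each attribute occurs and weight samples by the reciprocal of that fraction. The sketch below is an assumption-laden illustration rather than the paper's procedure: the attribute names and the 8-to-2 split are invented, and `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(attribute_labels):
    """Weight each attribute by 1 / (fraction of samples carrying it)."""
    counts = Counter(attribute_labels)
    total = len(attribute_labels)
    return {attribute: total / count for attribute, count in counts.items()}


# Invented toy training set: mostly frontal faces, few large-pose faces.
labels = ["frontal"] * 8 + ["large_pose"] * 2
weights = inverse_frequency_weights(labels)
print(weights)  # {'frontal': 1.25, 'large_pose': 5.0}
```

With weights like these, an error on a rare large-pose face counts four times as much as the same error on a frontal face, which counteracts the skew in the data.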

With the emergence of portable devices, more and more people prefer to deal with their business or get entertained anytime and anywhere. Therefore, the challenge below, aside from pursuing high accuracy of detection, should be taken into account.

Challenge #4 - Model Efficiency. Another two constraints on applicability are model size and computing requirement. Tasks like robotics, augmented reality, and video chat are expected to be executed in a timely fashion on a platform equipped with limited computation and memory resources, e.g., smart phones or embedded products.

This point particularly requires the detector to be of small model size and fast processing speed. Undoubtedly, it is desired to build accurate, efficient, and compact systems for practical landmark detection.

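With the arrival of portable devices, more and more people like to work and be entertained anytime and anywhere. The following challenge therefore has to be considered in addition to high detection accuracy.

Challenge 4: model efficiency. Two further constraints on applicability are model size and computing requirements. Tasks such as robotics, augmented reality, and video chat have to run in a timely fashion on devices with limited computing power and memory, for example smartphones and embedded products.

This calls for a detector with a small model and fast processing. Without doubt, the goal is an accurate, efficient, and compact system for practical landmark detection.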

1.1. Previous Arts

Over last decades, a number of classic methods have been proposed in the literature for facial landmark detection. Parameterized appearance models, with active appearance models (AAMs) [6] and constrained local models (CLMs) [7] as representatives, accomplish the job through maximizing the confidence of part locations in an image. Specifically, AAMs and its follow-ups [23, 17, 20] attempt to jointly model holistic appearance and shape, while CLMs and variants [2, 31] instead learn a group of local experts via imposing various shape constraints. In addition, the tree structure part model (TSPM) [50] utilizes a deformable part-based model for simultaneous detection, pose estimation, and landmark localization. The methods including explicit shape regression (ESR) [5] and supervised descent method (SDM) [38] try to address the problem in a regression manner. The main limitations of these methods are the inferior robustness against difficult cases, expensive computation, and/or high model complexity. A more elaborated review for the classic approaches can be found in [32].

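Over the past decades, many classic methods for facial landmark detection have been proposed. Parameterized appearance models, represented by active appearance models (AAMs) [6] and constrained local models (CLMs) [7], work by maximizing the confidence of part locations in an image. Specifically, AAMs and their follow-ups [23, 17, 20] try to model holistic appearance and shape jointly, whereas CLMs and their variants [2, 31] learn a group of local experts under various shape constraints. The tree structure part model (TSPM) [50] uses a deformable part-based model to perform detection, pose estimation, and landmark localization at the same time. Methods such as explicit shape regression (ESR) [5] and the supervised descent method (SDM) [38] treat the problem as regression. The main drawbacks of these methods are poor robustness on difficult cases, expensive computation, and high model complexity. A more detailed review of the classic methods is given in [32].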
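The "regression manner" of methods such as ESR and SDM can be pictured as a cascade: start from an initial shape and let each stage add a correction computed from features sampled at the current estimate. The sketch below only illustrates that update pattern. Real systems learn a regression matrix and bias per stage from image features; here each stage is a hypothetical scalar `(gain, bias)` pair, and `toy_features` is an invented stand-in that simply reports the remaining offset to a known target.

```python
def cascaded_regression(initial_shape, stages, extract_features):
    """Refine a shape stage by stage: shape <- shape + gain * features + bias."""
    shape = list(initial_shape)
    for gain, bias in stages:
        features = extract_features(shape)  # features depend on the current estimate
        shape = [s + gain * f + bias for s, f in zip(shape, features)]
    return shape


# Invented toy setup: the "features" are the offsets to a known target shape.
TARGET = [8.0, 16.0]


def toy_features(shape):
    return [t - s for t, s in zip(TARGET, shape)]


stages = [(0.5, 0.0)] * 3  # three stages, each closing half of the remaining gap
print(cascaded_regression([0.0, 0.0], stages, toy_features))  # [7.0, 14.0]
```

Each stage halves the remaining error in this toy, which mirrors the coarse-to-fine behaviour of cascaded regressors; the cost is that several stages must run in sequence at test time.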

Recently, deep learning based strategies have dominated state-of-the-art performances on this task. In what follows, we briefly introduce representative works in this category. Zhang et al. [45] built up a multi-task learning network, called TCDCN, for jointly learning landmark locations and pose attributes. TCDCN, due to its multi-task nature, is difficult to train in practice. An end-to-end recurrent convolutional model for face alignment from coarse to fine was proposed by Trigeorgis et al., termed as MDM [29]. Lv et al. [22] proposed a deep regression architecture with the two-stage re-initialization scheme, namely TSR, which divides a whole face into several parts to boost the detection accuracy. Using pose angles including pitch, yaw, and roll as attributes, [39] constructs a network to directly estimate these three angles for helping landmark detection. But the cascaded nature of [39] makes it suboptimal in the following landmark detection. Pose-invariant face alignment (PIFA for short) proposed by Jourabloo et al. [14] estimates the projection matrix from 3D to 2D via deep cascade regressors, which is followed by the work PIFA-CNN [15] using a single convolutional neural network (CNN). The work in [48] first models the face depth in a Z-buffer and then fits a 3D model for 2D images.

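Recently, deep learning based methods have delivered the best performance on facial landmark detection. Below we briefly introduce representative work of this kind. Zhang et al. [45] built a multi-task learning network, TCDCN, that learns landmark locations and pose attributes jointly; because of its multi-task nature, TCDCN is hard to train in practice. Trigeorgis et al. proposed MDM [29], an end-to-end recurrent convolutional model that aligns faces from coarse to fine. Lv et al. [22] proposed TSR, a deep regression architecture with a two-stage re-initialization scheme that divides the whole face into several parts to boost detection accuracy. Using the pose angles pitch, yaw, and roll as attributes, [39] builds a network that estimates these three angles directly to help landmark detection.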

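Pitch is rotation about the X axis (the head nods up or down).

Yaw is rotation about the Y axis (the head turns left or right).

Roll is rotation about the Z axis (the head tilts sideways in the image plane).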
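The three angles can be turned into a single rotation matrix by multiplying one elementary rotation per axis. The sketch below uses only the standard library. The axis assignment follows the note above (pitch about X, yaw about Y, roll about Z); the composition order, pitch first and roll last, is an assumption, since conventions differ between libraries and the text does not fix one. All function names are illustrative.

```python
import math


def rot_x(angle):
    """Rotation about the X axis (pitch), angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(angle):
    """Rotation about the Y axis (yaw), angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(angle):
    """Rotation about the Z axis (roll), angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotate(matrix, point):
    """Apply a 3x3 matrix to a 3D point."""
    return [sum(matrix[i][k] * point[k] for k in range(3)) for i in range(3)]


def euler_to_matrix(pitch, yaw, roll):
    """Assumed order: apply pitch, then yaw, then roll (R = Rz * Ry * Rx)."""
    return matmul(rot_z(roll), matmul(rot_y(yaw), rot_x(pitch)))


# A 90-degree roll moves the X axis onto the Y axis (up to rounding error).
print(rotate(euler_to_matrix(0.0, 0.0, math.pi / 2), [1.0, 0.0, 0.0]))
```

Because rotations do not commute, a different composition order gives a different matrix for the same three angles, so the order should always be stated alongside the angles.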

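However, the cascaded nature of [39] makes it suboptimal for the landmark detection that follows. Pose-invariant face alignment (PIFA) [14], proposed by Jourabloo et al., estimates the 3D-to-2D projection matrix with deep cascaded regressors; its successor PIFA-CNN [15] uses a single convolutional neural network. The work in [48] first models face depth in a Z-buffer and then fits a 3D model to the 2D image.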

Most recently, Kumar and Chellappa designed a single dendritic CNN, named as pose conditioned dendritic convolution neural network (PCD-CNN) [19], which combines a classification network with a second and modular classification network, for improving the detection accuracy. Honari et al. designed a network, called sequential multi-tasking (SeqMT) net, with an equivariant landmark transformation (ELT) loss term [12]. In [30], the authors presented a facial landmark regression method based on a coarse-to-fine ensemble of regression trees (ERT) [16]. To make the facial landmark detector robust against the intrinsic variance of image styles, Dong et al. developed a style-aggregated network (SAN) [9], which accompanies the original face images with style-aggregated ones to train the landmark detector. By considering boundary information as the geometric structure of human faces, Wu et al. presented a boundary-aware face alignment algorithm, i.e. LAB, to improve the detection accuracy. LAB derives face landmarks from boundary lines. By doing so, the ambiguities in the landmark definition can be largely avoided. Other face alignment techniques include [33, 42, 47, 10, 37, 36]. Though the existing deep learning strategies have made great strides for the task, huge space still exists for improvement especially jointly taking into account the accuracy, efficiency, and model compactness of detectors for practical use.

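Most recently, Kumar and Chellappa designed a single dendritic CNN, the pose conditioned dendritic convolution neural network (PCD-CNN) [19], which combines a classification network with a second, modular classification network to improve detection accuracy. Honari et al. designed the sequential multi-tasking (SeqMT) net, which uses an equivariant landmark transformation (ELT) loss term [12]. The authors of [30] presented a facial landmark regression method based on a coarse-to-fine ensemble of regression trees (ERT) [16]. To make the detector robust to the intrinsic variance of image styles, Dong et al. developed the style-aggregated network (SAN) [9], which trains the landmark detector on the original face images together with style-aggregated versions of them. Treating boundary information as the geometric structure of the face, Wu et al. presented a boundary-aware face alignment algorithm, LAB, which derives the landmarks from boundary lines and so largely avoids ambiguities in the landmark definition. Other face alignment techniques include [33, 42, 47, 10, 37, 36]. Although existing deep learning methods have made great strides on this task, there is still plenty of room for improvement, especially when accuracy, efficiency, and model compactness are considered together for practical use.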

1.2. Our Contributions

The main intention of this work is to show that a good design can save a lot resources with the state-of-the-art performance on the target task. This work develops a practical facial landmark detector, denoted as PFLD, with high accuracy against complex situations including unconstrained poses, expressions, lightings, and occlusions. Compared with the local variation, the global one deserves more efforts, as it can greatly influence the whole set of landmarks. To boost the robustness, we employ a branch of network to estimate the geometric information for each face sample, and subsequently regularize the landmark localization. Besides, in deep learning, the data imbalance issue often limits the performance in accurate detection. For instance, a training set may contain plenty of frontal faces while lacking those with large poses. This would degrade the accuracy when dealing with large pose cases. To address this issue, we advocate to penalize more on errors corresponding to rare training samples than on those to rich ones. Considering the above two concerns, say the geometric constraint and the data imbalance, a novel loss is designed. To enlarge the receptive field and better catch the global structure on faces, a multi-scale fully-connected (MS-FC) layer is added for precisely localizing landmarks in images. As for the processing speed and model compactness, we build the backbone network of our PFLD using MobileNet blocks [13, 26]. In experiments, we evaluate the efficacy of our design, and demonstrate its superior performance over other state-of-the-art alternatives on two widely-adopted challenging datasets including 300W [25] and AFLW [18]. Our model can be adjusted to merely 2.1Mb of size and achieve over 140 fps per face on a mobile phone. All the above merits make our PFLD attractive for practical use. We have released our practical system based on PFLD 0.25X model at http://sites.google.com/view/xjguo/fld for encouraging comparisons and improvements from the community.

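The main aim of this work is to show that a good design can save a great deal of resources while still delivering state-of-the-art performance on the target task. We develop a practical facial landmark detector, called PFLD, that stays highly accurate in complex situations including unconstrained pose, expression, lighting, and occlusion. Global variation deserves more attention than local variation, because it can strongly affect the whole set of landmarks. To make the detector more robust, we use a branch of the network to estimate geometric information for each face sample and then use it to regularize landmark localization.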

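In addition, data imbalance often limits detection accuracy in deep learning. For example, a training set may contain plenty of frontal faces but few faces with large poses, which lowers accuracy on large-pose inputs. To address this, we penalize errors on rare training samples more heavily than errors on abundant ones. Taking these two concerns together, namely the geometric constraint and the data imbalance, we design a novel loss. To enlarge the receptive field and capture the global structure of the face better, we add a multi-scale fully-connected (MS-FC) layer for precise landmark localization.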
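The two ingredients of the loss described here, a geometric term and a rarity weight, can be combined in a few lines. The sketch below is one plausible reading of that description, not the paper's formula, which is defined in a later section that these notes do not cover. It assumes the geometric term is the sum of `1 - cos(angle error)` over pitch, yaw, and roll, that the rarity weight is a single number per sample, and that both simply multiply the squared landmark error; every name in it is illustrative.

```python
import math


def sample_loss(pred, target, angle_errors, class_weight):
    """Squared landmark error scaled by a pose term and a rarity weight.

    pred, target:  lists of (x, y) landmark coordinates
    angle_errors:  (pitch, yaw, roll) deviations in radians between the
                   estimated and reference pose (assumed available in training)
    class_weight:  larger for rare sample types (assumed precomputed)
    """
    geometric = sum(1.0 - math.cos(a) for a in angle_errors)
    landmark = sum((px - tx) ** 2 + (py - ty) ** 2
                   for (px, py), (tx, ty) in zip(pred, target))
    return class_weight * geometric * landmark


def batch_loss(samples):
    """Mean of the per-sample losses; each sample is an argument tuple."""
    return sum(sample_loss(*sample) for sample in samples) / len(samples)


# Same landmark error and pose error, but the second sample is four times rarer.
frontal = ([(1.0, 0.0)], [(0.0, 0.0)], (0.0, math.pi / 2, 0.0), 1.25)
large_pose = ([(1.0, 0.0)], [(0.0, 0.0)], (0.0, math.pi / 2, 0.0), 5.0)
print(sample_loss(*frontal), sample_loss(*large_pose))
```

In this purely multiplicative form the pose term acts as a per-sample weight, so it only needs to be computed during training and can be dropped at test time, in line with the abstract's remark that the rotation estimate is not involved in the testing phase.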

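As for processing speed and model compactness, we build the backbone network of PFLD from MobileNet blocks [13, 26]. In the experiments we evaluate the efficacy of the design and show that it outperforms other state-of-the-art methods on two widely adopted, challenging datasets, 300W [25] and AFLW [18]. The model can be as small as 2.1 MB and runs at over 140 fps per face on a mobile phone. All of these merits make PFLD attractive for practical use. We have released our practical system, based on the PFLD 0.25X model, at http://sites.google.com/view/xjguo/fld to encourage comparisons and improvements from the community.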
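Part of why MobileNet-style blocks keep a model small is the depthwise separable convolution, which splits a standard convolution into a per-channel spatial filter followed by a 1x1 channel mixer. The sketch below compares weight counts for the two designs. It is a generic illustration with biases ignored; the layer sizes (3x3 kernel, 64 input and 128 output channels) are made up and are not taken from PFLD.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k filter per (in, out) pair."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (k x k per channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


std = standard_conv_params(3, 64, 128)        # 73,728 weights
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8,192 = 8,768 weights
print(std, sep, round(sep / std, 3))          # 73728 8768 0.119
```

The ratio works out to 1/c_out + 1/k², so for 3x3 kernels the separable form needs roughly a ninth of the weights once the channel count is large.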

大西瓜不甜
