Paper Notes 2.2 - PFLD: A Practical Facial Landmark Detector

2. Methodology

  

Against the aforementioned challenges, effective measures need to be taken. In this section, we first focus on the design of loss function, which simultaneously takes care of Challenges #1, #2, and #3. Then, we detail our architecture. The whole deep network consists of a backbone subnet for predicting landmark coordinates, which specifically considers Challenge #4, as well as an auxiliary one for estimating geometric information.
 

To cope with the challenges above, effective measures have to be taken. In this part, we first focus on the design of the loss function, which handles Challenges #1, #2, and #3 at the same time, and then describe the details of our architecture. The whole deep network consists of a backbone subnet for predicting landmark coordinates, which specifically accounts for Challenge #4, plus an auxiliary subnet for estimating geometric information.

2.1 Loss Function

The quality of training greatly depends on the design of loss function, especially when the scale of training data is not sufficiently large. For penalizing errors between ground-truth landmarks $X := [x_1, \dots, x_N] \in \mathbb{R}^{2 \times N}$ and predicted ones $Y := [y_1, \dots, y_N] \in \mathbb{R}^{2 \times N}$, the simplest losses arguably go to $\ell_2$ and $\ell_1$ losses. However, equally measuring the differences of landmark pairs is not so wise, without considering geometric/structural information. For instance, given a pair of $x_i$ and $y_i$ with their deviation $d_i := x_i - y_i$ in the image space, if two projections (poses with respect to a camera) are applied from 3D real face to 2D image, the intrinsic distances on the real face could be significantly different. Hence, integrating geometric information into penalization is helpful to mitigating this issue. For face images, the global geometric status, i.e. the 3D pose, is sufficient to determine the manner of projection. Formally, let $X$ denote the concerned location of 2D landmarks, which is a projection of 3D face landmarks $U \in \mathbb{R}^{4 \times N}$, each column of which corresponds to a 3D location $[u_i, v_i, z_i, 1]^T$. By assuming a weak perspective model as [14], a $2 \times 4$ projection matrix $P$ can connect $U$ and $X$ via $X = PU$. This projection matrix has six degrees of freedom including yaw, roll, pitch, scale, and 2D translation. In this work, the faces are supposed to be well detected, centralized, and normalized. And local variation like expression barely affects the projection. This is to say, three degrees of freedom including scale and 2D translation can be reduced, and thus only three Euler angles (yaw, roll, and pitch) are needed to be estimated.
 
Especially when the training data is not sufficiently large, the quality of training depends heavily on the design of the loss function. To penalize the gap between the ground-truth coordinates $X := [x_1, \dots, x_N]$ and the predicted coordinates $Y := [y_1, \dots, y_N]$, the simplest choices are arguably the $\ell_2$ and $\ell_1$ losses ($\ell_2$ penalizes the squared Euclidean distance between a predicted point and its ground truth, $\ell_1$ the absolute difference). However, measuring every landmark pair the same way, without taking geometric and structural information into account, is not very wise. For example, given a pair $x_i$ and $y_i$ with deviation $d_i := x_i - y_i$ in the image space, if two different projections (two poses with respect to the camera) map the 3D real face to the 2D image, the intrinsic distances on the real face can be significantly different; in other words, the same 2D deviation can correspond to very different actual distances on the face. Therefore, integrating geometric information into the penalty is very helpful for mitigating this problem.
       For face images, the global geometric status, the 3D pose, is already enough to determine how the projection behaves. Formally, let $X$ denote the 2D coordinates projected from the 3D face landmarks, where each 3D landmark is a four-dimensional homogeneous vector $[u, v, z, 1]^T$ in $U$. Assuming the weak perspective projection model mentioned in [14], a $2 \times 4$ projection matrix $P$ connects the two via $X = PU$. This projection matrix has six degrees of freedom: yaw, roll, pitch, scale, and 2D translation. In this work the faces are assumed to be well detected, centralized, and normalized, and local variations such as expression barely affect the projection. That is to say, the three degrees of freedom of scale and 2D translation can be dropped, leaving only the three Euler angles (yaw, roll, pitch) to be estimated.
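To make the weak perspective model concrete, here is a minimal numpy sketch of the relation $X = PU$. The Euler-angle convention, scale, and translation values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose a 3x3 rotation matrix from Euler angles in radians
    (one common convention; conventions vary across libraries)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])  # roll
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])  # yaw
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])  # pitch
    return Rz @ Ry @ Rx

# U: 4 x N homogeneous 3D landmarks, columns [u, v, z, 1]^T (toy values).
N = 5
U = np.vstack([np.random.randn(3, N), np.ones((1, N))])

# Weak perspective: scale times the first two rows of R, plus 2D translation.
scale, t = 1.2, np.array([[0.3], [-0.1]])
R = euler_to_rotation(yaw=0.2, pitch=0.1, roll=0.05)
P = np.hstack([scale * R[:2, :], t])  # 2 x 4 projection matrix
X = P @ U                             # 2 x N projected 2D landmarks
```

Note how the six degrees of freedom show up: three angles inside $R$, one scale, and two translation components. Fixing scale and translation leaves only the three angles to estimate, as argued above.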
 
Moreover, in deep learning, data imbalance is another issue often limiting the performance in accurate detection. For example, a training set may contain a large number of frontal faces while lacking those with large poses. Without extra tricks, it is almost sure that the model trained by such a training set is unable to handle large pose cases well. Under the circumstances, "equally" penalizing each sample makes it unequal instead. To address this issue, we advocate to penalize more on errors corresponding to rare training samples than on those to rich ones. Mathematically, the loss can be written in the following general form:

$$\mathcal{L} := \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} \gamma_n \, \|\mathbf{d}_n^m\| \quad (1)$$
Furthermore, in deep learning, data imbalance is another issue that limits detection accuracy. For example, a training set may contain many frontal face images but lack faces with extreme poses. Without extra tricks, a model trained on such a set will almost certainly fail to handle those extreme-pose faces well. Seen this way, penalizing every sample "equally" is in fact unequal. We therefore advocate assigning heavier weights to the errors of rare samples than to those of abundant ones.
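The paper does not spell out how the weights for rare samples are computed; inverse class frequency is one natural choice. A minimal sketch, assuming hypothetical attribute class ids per sample:

```python
import numpy as np

# Hypothetical attribute class ids per training sample; the paper's
# classes are profile, frontal, head up, head down, expression, occlusion.
labels = np.array([1, 1, 1, 1, 0, 2, 1, 1, 3, 1])

counts = np.bincount(labels)
freq = counts / counts.sum()     # fraction of samples per class
sample_w = (1.0 / freq)[labels]  # rare classes get larger penalties
```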
 
In mathematical form, the loss is written as in Eq. (1) above.
 

where $\|\cdot\|$ designates a certain metric to measure the distance/error of the $n$-th landmark of the $m$-th input. $N$ is the pre-defined number of landmarks per face to detect. $M$ denotes the number of training images in each process. Given the metric used (e.g., $\ell_2$ in this work), the weight $\gamma_n$ plays a key role. Consolidating the aforementioned concerns, say the geometric constraint and the data imbalance, a novel loss is designed as follows:
 
Here $\|\cdot\|$ measures the error of the $n$-th landmark of the $m$-th input. $N$ is the pre-defined number of landmarks to detect per face. $M$ is the number of training images in each pass. Given the metric used (e.g., $\ell_2$ in this work), the weight $\gamma_n$ is the key variable in this expression. To cover the concerns raised earlier, namely the geometric constraint and data imbalance, a new loss is designed as follows:

$$\mathcal{L} := \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} \left( \sum_{c=1}^{C} \omega_n^c \sum_{k=1}^{K} \bigl(1 - \cos\theta_n^k\bigr) \right) \|\mathbf{d}_n^m\|_2^2 \quad (2)$$
The factor in parentheses is easy to compute, and it plays the same role as $\gamma_n$ in Eq. (1). Looking at it closely, $\theta^1$, $\theta^2$, $\theta^3$ ($K = 3$) represent the angle deviations between the ground-truth and estimated yaw, pitch, and roll; plainly put, as the deviation angles grow, i.e. as the three $\theta$ values increase, the penalization goes up. In addition, the samples are divided into $C$ attribute categories, including profile face, frontal face, head up, head down, expression, and occlusion. The weight $\omega_n^c$ is set according to the fraction of samples belonging to class $c$. For instance, if neither the geometric constraint nor data balancing is taken into account, the loss simply reduces to the $\ell_2$ loss. Whether or not 3D pose and data imbalance come into play in training, our loss can still handle local variations through its distance measurement.
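Here is a minimal numpy sketch of Eq. (2) as reconstructed above; the tensor shapes and the precomputation of $\omega$ are my assumptions for illustration, not the paper's code:

```python
import numpy as np

def pfld_loss(d, theta, omega):
    """Sketch of Eq. (2). Assumed shapes:
    d:     (M, N, 2) deviations between predicted and ground-truth
           landmarks, produced by the backbone net
    theta: (M, K)    deviations of the K = 3 Euler angles, in radians
           (yaw, pitch, roll), produced by the auxiliary net
    omega: (M, N)    per-landmark weights, already summed over the
           attribute classes each sample belongs to
    """
    angle_term = (1.0 - np.cos(theta)).sum(axis=1, keepdims=True)  # (M, 1)
    dist_term = (d ** 2).sum(axis=-1)           # (M, N), squared l2 per landmark
    return (omega * angle_term * dist_term).sum(axis=1).mean()  # mean over M
```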

Although, in the literature, several works have considered the 3D pose information to improve the performance, our loss has following merits: 1) it plays in a coupled way between 3D pose estimation and 2D distance measurement, which is much more reasonable than simply adding two concerns [14, 15]; 2) it is intuitive and easy to be computed both forward and backward, comparing with [19]; and 3) it makes the network work in a single-stage manner instead of cascaded [39, 14], which improves the optimality. We here notice that the variable $d_n^m$ comes from the backbone net, while $\theta_n^k$ from the auxiliary one, which are coupled/connected by the loss in Eq. (2). In the next two subsections, we detail our network, which is schematically illustrated in Fig. 2.
 
Although some works in the literature have already taken 3D pose into account to improve performance, our loss has the following additional merits:
1) It couples 3D pose estimation with 2D distance measurement, which is much more reasonable than simply adding the two terms as in [14, 15].
2) Compared with [19], it is intuitive and easy to compute in both the forward and backward passes.
3) Compared with [39, 14], it lets the network work in a single stage instead of a cascade, which improves optimality.
Note that the variable $d_n^m$ comes from the backbone network and $\theta_n^k$ from the auxiliary network; the two are coupled through the loss in Eq. (2). The next two subsections detail our network, which is sketched in Fig. 2.

2.2 Backbone Network

Similar to other CNN based models, we employ several convolutional layers to extract features and predict landmarks. Considering that human faces are of strong global structures, like symmetry and spacial relationships among eyes, mouth, nose, etc., such global structures could help localize landmarks more precisely. Therefore, instead of single scale feature maps, we extend them into multi-scale maps. The extension is finished via executing convolution operations with strides, which enlarges the receptive field. Then we perform the final prediction through fully connecting the multi-scale feature maps. The detailed configuration of the backbone subnet is summarized in Table 1. From the perspective of architecture, the backbone net is simple. Our primary intention is to verify that, associated with our novel loss and the auxiliary subnet (discussed in the next subsection), even a very simple architecture can achieve state-of-the-art performance.
 
 
Like other CNN-based models, we use several convolutional layers to extract features and predict landmark coordinates. Considering that a human face has a strong global structure, such as symmetry and the spatial relationships among the eyes, mouth, nose, and so on, such global structure can help localize landmarks more precisely. We therefore use multi-scale feature maps instead of a single scale. The multiple scales are obtained by applying convolutions with strides, which enlarges the receptive field. The final prediction is then made by fully connecting the multi-scale feature maps. The detailed structure of the backbone is given in Table 1. Architecturally the backbone is simple; our main intention is to verify that, combined with our loss and the auxiliary subnet, even a very simple architecture can reach state-of-the-art performance.
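A minimal Keras sketch of this idea: strided convolutions yield several scales, all of which are flattened and fully connected for the prediction. The channel counts and depths here are placeholders, not the configuration of Table 1:

```python
from tensorflow.keras import layers, Model, Input

def tiny_backbone(num_landmarks=68):
    """Sketch of a multi-scale backbone; layer sizes are illustrative."""
    inp = Input(shape=(112, 112, 3))
    x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(inp)
    x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
    # Three scales obtained with strided convolutions (growing receptive field).
    s1 = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)
    s2 = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(s1)
    s3 = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(s2)
    # Flatten each scale and fully connect them all for the final prediction.
    multi = layers.concatenate([layers.Flatten()(s) for s in (s1, s2, s3)])
    out = layers.Dense(2 * num_landmarks)(multi)  # (x, y) per landmark
    return Model(inp, out)
```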
 
The backbone network is the bottleneck in terms of processing speed and model size, as in the testing only this branch is involved. Thus, it is critical to make it fast and compact. Over the last years, several strategies including ShuffleNet [44], Binarization [3], and MobileNet [13] have been investigated to speed up networks. Due to the satisfactory performance of MobileNet techniques (depthwise separable convolutions, linear bottlenecks, and inverted residuals) [13, 26], we replace the traditional convolution operations with the MobileNet blocks. By doing so, the computational load of our backbone network is significantly reduced and the speed is thus accelerated. In addition, our network can be compressed by adjusting the width parameter of MobileNets according to demand from users, for making the model smaller and faster. This operation is based on the observation and assumption that a large amount of individual feature channels of a deep convolutional layer could lie in a lower-dimensional manifold. Thus, it is highly possible to reduce the number of feature maps without (obvious) accuracy degradation. We will show in experiments, losing 80% of the model size can still provide promising accuracy of detection. This again corroborates that a well-designed simple/small architecture can perform sufficiently well on the task of facial landmark detection. It is worth to mention that the quantization techniques are totally compatible with ShuffleNet and MobileNet, which means the size of our model can be further reduced by quantization.
 
The backbone network is the bottleneck in processing speed and model size, because only this branch is involved at test time, so making it fast and lightweight is essential. In recent years several methods, including ShuffleNet [44], binarization [3], and MobileNet [13], have been used to speed up networks. Because of the good performance of the MobileNet techniques (depthwise separable convolutions, linear bottlenecks, and inverted residuals [13, 26]), we replace the traditional convolutions with MobileNet blocks.

(Aside: a depthwise separable convolution is a refinement of the standard convolution that decouples the spatial and channel (depth) dimensions; it reduces the number of parameters the convolution needs, and several studies have shown it improves the efficiency of the kernel parameters.)

By doing so, the computation is greatly reduced and the network is accordingly faster. In addition, our network can be compressed, according to user demand, by adjusting the width parameter of MobileNets, making the model smaller and faster. This operation rests on the observation and assumption that many individual feature channels of a deep convolutional layer may lie on a lower-dimensional manifold, so it is quite feasible to reduce the number of feature maps without obviously degrading accuracy. As shown in the experiments, even dropping 80% of the model size still yields promising detection accuracy. This again confirms that a well-designed simple architecture can perform well enough on facial landmark detection. Moreover, quantization techniques are fully compatible with ShuffleNet and MobileNet, which means the model size can be reduced even further by quantization.
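Below is a minimal Keras sketch of the MobileNetV2-style block the paragraph refers to (1x1 expansion, 3x3 depthwise convolution, linear 1x1 bottleneck, shortcut when shapes match), with a width multiplier for compression; the expansion factor and channel rounding are illustrative choices, not the paper's exact settings:

```python
from tensorflow.keras import layers

def inverted_residual(x, out_ch, expansion=2, stride=1, width_mult=1.0):
    """Sketch of an inverted-residual block with a width multiplier."""
    out_ch = max(8, int(out_ch * width_mult))  # width multiplier shrinks channels
    in_ch = x.shape[-1]
    y = layers.Conv2D(in_ch * expansion, 1, use_bias=False)(x)  # 1x1 expand
    y = layers.ReLU(6.0)(y)
    y = layers.DepthwiseConv2D(3, strides=stride, padding='same',
                               use_bias=False)(y)               # 3x3 depthwise
    y = layers.ReLU(6.0)(y)
    y = layers.Conv2D(out_ch, 1, use_bias=False)(y)             # linear bottleneck
    if stride == 1 and in_ch == out_ch:
        y = layers.add([x, y])                                  # residual shortcut
    return y
```

Shrinking `width_mult` (e.g. to 0.25) is what the paragraph means by compressing the network through the width parameter.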
 

 2.3 Auxiliary Network

It has been verified by previous works [48, 14, 19, 34] that a proper auxiliary constraint is beneficial to making the landmark localization stable and robust. Our auxiliary network plays this role. Different from the previous methods, like [14] learning the 3D to 2D projection matrix, [19] discovering the dendritic structure of parts, and [34] employing boundary lines, our intention is to estimate the 3D rotation information including yaw, pitch, and roll angles. Having these three Euler angles, the pose of head can be determined.
 
Previous works [48, 14, 19, 34] have verified that a proper auxiliary constraint makes landmark localization more stable and robust, and our auxiliary network plays this role. Unlike earlier methods, such as [14] learning the 3D-to-2D projection matrix, [19] discovering the dendritic structure of facial parts, and [34] employing boundary lines, our aim is to estimate the 3D rotation information, namely the yaw, pitch, and roll angles.
With these three Euler angles, the head pose is determined.
 
One may wonder that given predicted and ground-truth landmarks, why not directly compute the Euler angles from them? Technically, it is feasible. However, the landmark prediction may be too inaccurate especially at the beginning of training, which consequently results in a low-quality estimation of the angles. This could drag the training into dilemmas, like over-penalization and slow convergence. To decouple the estimation of rotation information from landmark localization, we bring the auxiliary subnet.
 
One might ask: given the predicted and ground-truth landmarks, why not compute the Euler angles directly from them? Technically this is feasible. However, the landmark predictions can be very inaccurate, especially early in training, which would produce poor angle estimates. This could drag the training into trouble, such as over-penalization and slow convergence. To decouple the estimation of rotation information from landmark localization, we introduce the auxiliary subnet.
 
It is worth mentioning that DeTone et al. [8] proposed a deep network for estimating the homography between two related images. The yaw, roll, and pitch angles can be calculated from the estimated homography matrix. But for our task, we do not have a frontal face with respect to each training sample. Intriguingly, our auxiliary net can output the target angles without a requirement of frontal faces as input. The reason is that our task is specific to human faces that are of strong regularity and structure from the frontal view. In addition, the factors such as expressions and lightings barely affect the pose. Thus, an identical average frontal face can be considered available for different persons. In other words, there is NO extra annotation used for computing the Euler angles. The following is our way to calculate them: 1) predefine ONE standard face (averaged over a bunch of frontal faces) and fix 11 landmarks on the dominant face plane as references for ALL of training faces; 2) use the corresponding 11 landmarks of each face and the reference ones to estimate the rotation matrix; and 3) compute the Euler angles from the rotation matrix. For accuracy, the angles may not be exact for each face, as the averaged face is used for all the faces. Even though, they are sufficiently accurate for our task as verified later in experiments. Table 2 provides the configuration of our proposed auxiliary network. Please notice that the input of the auxiliary net is from the 4-th block of the backbone net (see Table 1).
 

DeTone et al. [8] proposed a deep network to estimate the homography between two related images; the yaw, roll, and pitch angles can then be computed from the estimated homography matrix. For our task, however, we do not have a frontal face for every training sample. Intriguingly, our auxiliary network can output the three angles without requiring frontal faces as input. This is because our task is specific to human faces, which have strong regularity and structure in the frontal view. Moreover, factors such as expression and lighting barely affect the pose. Therefore one identical average frontal face can serve as the frontal face for different people. In other words, NO extra annotation is needed for computing the Euler angles.

Our way of computing them is as follows (a code sketch follows the list):

1) Predefine ONE standard face (averaged over a set of frontal faces) and fix 11 landmarks on the dominant face plane as references for ALL training faces.

2) Use the corresponding 11 landmarks of each face and the reference ones to estimate the rotation matrix.

3) Compute the Euler angles from the rotation matrix.

As for accuracy, the angles may not be exact for every face, since the averaged face is used for all faces. Even so, the later experiments verify that they are sufficiently accurate for our task. Table 2 shows the configuration of the auxiliary network we use. Note that the input of the auxiliary net is taken from the 4th block of the backbone net (see Table 1).
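The paper does not name the solver used in step 2, so this sketch uses OpenCV's solvePnP as one common way to recover a rotation from 11 point correspondences; the reference coordinates, camera intrinsics, and angle convention are all assumptions for illustration:

```python
import numpy as np
import cv2

# Hypothetical 3D landmarks of the ONE averaged standard face (11 points
# on the dominant face plane) and the matching 2D landmarks of a training
# face; real values would come from the dataset.
ref_3d = np.random.randn(11, 3)
face_2d = np.random.randn(11, 2)

# Rough pinhole intrinsics for a 112x112 crop (an assumption; the paper
# does not specify camera parameters).
K = np.array([[112.0,   0.0, 56.0],
              [  0.0, 112.0, 56.0],
              [  0.0,   0.0,  1.0]])

ok, rvec, tvec = cv2.solvePnP(ref_3d, face_2d, K, None)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix

# Euler angles (degrees) from R, using one common ZYX convention
# (conventions differ across libraries; adjust signs/order as needed).
yaw = np.degrees(np.arcsin(-R[2, 0]))
pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
```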

2.4 Implementation Details

During training, all faces are cropped and resized into $112 \times 112$ according to given bounding boxes for pre-processing. We implement the network via the Keras framework, using the batch size of 256, and employ the Adam technique for optimization with the weight decay of $10^{-6}$ and momentum of 0.9. The maximum number of iterations is 64K, and the learning rate is fixed to $10^{-4}$ throughout the training. The entire network is trained on an Nvidia GTX 1080Ti GPU. For 300W, we augment the training data by flipping each sample and rotating them from -30° to 30° with a 5° interval. Further, each sample has a region of 20% face size randomly occluded. While for AFLW, we feed the original training set into the network without any data augmentation. In the testing, only the backbone network is involved, which is efficient.
 
During training, all face images are cropped and resized to 112x112 according to the given bounding boxes as pre-processing. We implement the network with the Keras framework, use a batch size of 256, and optimize with Adam, with a weight decay of 1e-6 and momentum of 0.9. The maximum number of iterations is 64K, and the learning rate is fixed at 1e-4 throughout training. The whole network is trained on an Nvidia GTX 1080Ti GPU. For the 300W dataset, we augment the training data by flipping each sample and rotating it from -30° to 30° in 5° steps; in addition, a region of 20% of the face size is randomly occluded in each sample. For AFLW, we feed the original training set into the network without any augmentation. At test time only the backbone network is involved, which makes inference efficient.
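A minimal sketch of the 300W augmentation just described (flip, rotate in 5° steps over [-30°, 30°], occlude a random region); I assume a black square whose side is 20% of the crop size, and omit the matching transform of the landmark coordinates:

```python
import numpy as np
import cv2

def augment(img, rng=np.random):
    """Mirror the sample, rotate in 5-degree steps within [-30, 30],
    and zero out a random square of ~20% of the crop size."""
    out = []
    h, w = img.shape[:2]
    for flip in (False, True):
        base = cv2.flip(img, 1) if flip else img
        for angle in range(-30, 31, 5):
            M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
            aug = cv2.warpAffine(base, M, (w, h))
            s = int(0.2 * min(h, w))  # occluder side length
            x0, y0 = rng.randint(0, w - s), rng.randint(0, h - s)
            aug[y0:y0 + s, x0:x0 + s] = 0  # random occlusion
            out.append(aug)
    return out
```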


Reposted from blog.csdn.net/mid_Faker/article/details/104590957