2018 CVPR-Attention-Aware Compositional Network for Person Re-identification

Paper link

Motivation

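Many existing methods use human pose estimation to handle pose variation in re-id and improve performance to some extent, but is the pose information fully exploited?
Occlusion is common in re-id scenes. How can occlusion of the person by their own limbs (a weak feature) be distinguished from occlusion by carried objects such as bags (a strong feature)?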

Contribution

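Proposes a unified framework for the misalignment and occlusion problems in re-id: Attention-Aware Compositional Network (AACN)
Uses Pose-guided Part Attention to obtain finer body regions and reduce interference from noise
Uses a visibility score to measure how occluded each body region is
Extensive experiments show that the method outperforms prior work on existing datasets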

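Thoughts

Using attention to improve local features is a direction well worth studying
How to let the network treat strong and weak features more flexibly is an interesting question
The final results on telling apart similar-looking people of different identities are interesting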

1.Introduction

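Definition and significance of re-id
Challenges
Many works use human pose estimation to address the misalignment caused by viewpoint and pose changes, extracting features from patches, stripes, or pose-guided regions of interest (RoI):
- Rectangular boxes often include background noise, as in the figure below
- Poses under different viewpoints come with inconsistent background and noise, so matching them directly hurts accuracy
- ==> How can finer contours be produced to make full use of the pose information?
Motivation and contributions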

2.1.Person Re-identification

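Feature learning and metric learning
Pedestrian alignment:
- Global features
- Local features
- Low accuracy ==> pose estimation:
  - Spindle Net, PDC, and PIE rely only on rigid body parts (limited accuracy)
  - ==> This paper: obtains non-rigid parts from the connectivity between keypoints (more accurate)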

2.2.Human Parsing

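Human parsing can also localize body parts precisely, but existing models generalize poorly to surveillance data
Pose is easier to annotate than parsing, and the many available pose datasets make it easier to train models that generalize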

2.3.Attention based Image Analysis

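Attention mechanisms are used in many tasks
The attention maps in this paper are learned under the guidance of pose estimation and are more precise than RoIs
The intensity of a part attention map also indicates how visible that part is, which helps with occlusion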

3.Attention-Aware Compositional Network

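The figure below shows the overall AACN architecture, which has two components:
- Pose-guided Part Attention (PPA): estimates an attention map and a visibility score for each predefined body part;
- Attention-aware Feature Composition (AFC): part feature alignment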

3.1. Pose-guided Part Attention

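Part attentions: obtained by normalizing the part confidence maps
Body part categories:
- rigid parts: head-shoulder, upper torso, lower torso
- non-rigid parts: upper arms, lower arms, upper legs, lower legs (prone to drastic pose variation)
A two-stage network learns the part attention:
- Stage 1: three independent prediction networks predict the non-rigid part attentions $\mathbf{N}$, the rigid part attentions $\mathbf{R}$, and the keypoint confidence maps $\mathbf{K}$
$\mathbf{N}^1 = \rho^1(\mathbf{F}^{ppa}), \mathbf{R}^1 = \phi^1(\mathbf{F}^{ppa}), \mathbf{K}^1 = \psi^1(\mathbf{F}^{ppa})$
- $\mathbf{F}^{ppa}$ is the feature map from the tenth layer of VGG-19
- Stage 2: refines the attention maps by conditioning on all the stage-1 predictions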

$\begin{aligned} \mathbf{N}^2 &= \rho^2(\mathbf{F}^{ppa}|\mathbf{N}^1, \mathbf{R}^1, \mathbf{K}^1) \\ \mathbf{R}^2 &= \phi^2(\mathbf{F}^{ppa}|\mathbf{N}^1, \mathbf{R}^1, \mathbf{K}^1) \\ \mathbf{K}^2 &= \psi^2(\mathbf{F}^{ppa}|\mathbf{N}^1, \mathbf{R}^1, \mathbf{K}^1) \end{aligned}$

Overall loss:

$L^{ppa}(\rho,\phi,\psi)=\sum_{t=1,2}\left[L^k(\mathbf{K}^t) + \mu_1L^n(\mathbf{N}^t)+\mu_2L^r(\mathbf{R}^t)\right]$
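The overall PPA loss can be sketched as a weighted sum over the two stages. This is a minimal illustration only: the function name `ppa_loss`, the per-stage loss values, and the choices of $\mu_1$ and $\mu_2$ are all made up here, since the notes do not give the actual weights.

```python
def ppa_loss(stage_losses, mu1=1.0, mu2=1.0):
    """Sum L^k + mu1 * L^n + mu2 * L^r over all stages.

    stage_losses: list of (l_k, l_n, l_r) tuples, one tuple per stage t.
    mu1, mu2: balancing weights (values here are assumptions, not from the paper).
    """
    return sum(l_k + mu1 * l_n + mu2 * l_r for l_k, l_n, l_r in stage_losses)


# Hypothetical per-stage loss values for stages t = 1 and t = 2.
total = ppa_loss([(0.5, 0.2, 0.1), (0.3, 0.1, 0.05)], mu1=2.0, mu2=4.0)
```

Both stages contribute to the total, so the stage-1 predictions receive direct supervision as well as the refined stage-2 ones.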

Loss for Keypoint Confidence Map $L^k(\mathbf{K})$

$L^k(\mathbf{K}) = \frac{1}{C^k}\sum_{i=1}^{C^k} \| \mathbf{K}^*_i -\mathbf{K}_i\|$

$\mathbf{K}^*_i$ is generated by placing a Gaussian kernel at the ground-truth keypoint location; $C^k=14$ is the number of keypoints
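A minimal sketch of how such a target map and the keypoint loss could be computed, using plain nested lists. The kernel width `sigma`, the map size, and the helper names are assumptions for illustration; the notes do not state them.

```python
import math


def gaussian_map(height, width, center, sigma=1.0):
    """Target confidence map: a Gaussian peaked at center = (row, col).

    sigma is an assumed kernel width, not a value taken from the paper.
    """
    cy, cx = center
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def l2_distance(a, b):
    """L2 norm of the element-wise difference between two equally sized maps."""
    return math.sqrt(sum((u - v) ** 2
                         for row_a, row_b in zip(a, b)
                         for u, v in zip(row_a, row_b)))


def keypoint_loss(gt_maps, pred_maps):
    """L^k(K) = (1 / C^k) * sum_i ||K*_i - K_i||, averaged over the keypoints."""
    return sum(l2_distance(g, p) for g, p in zip(gt_maps, pred_maps)) / len(gt_maps)


# Two toy keypoints on a 5 x 5 grid (the paper uses C^k = 14).
gt = [gaussian_map(5, 5, (2, 2)), gaussian_map(5, 5, (0, 4))]
blank = [[[0.0] * 5 for _ in range(5)] for _ in range(2)]
perfect_loss = keypoint_loss(gt, gt)
blank_loss = keypoint_loss(gt, blank)
```

A prediction identical to the target gives zero loss, while an all-zero prediction is penalized by the norm of the target itself.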

Loss for Non-Rigid Part Attention $L^n(\mathbf{N})$

Borrowing from Part Affinity Fields (PAF), the ground-truth non-rigid parts are defined by the region connecting two keypoints

$\mathbf{N}^*_p(\mathbf{x})=\left\{ \begin{aligned} 1, & & \text{if } \mathbf{x} \in \mathcal{R}^n_p, \\ 0, & & \text{otherwise} \end{aligned} \right.$

Non-rigid part attention loss:
$L^n(\mathbf{N}) = \frac{1}{C^n}\sum_{p=1}^{C^n}\|\mathbf{N}_p^*-\mathbf{N}_p\|^2$
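The ground-truth region and the loss for non-rigid parts can be sketched as follows. The region is taken here to be a band of fixed half-width around the segment joining two keypoints, in the spirit of PAF; the `half_width` parameter, the `(x, y)` coordinate convention, and the function names are assumptions of this sketch.

```python
import math


def limb_mask(height, width, kp_a, kp_b, half_width=1.0):
    """Binary map: 1 for pixels inside the band joining kp_a and kp_b, else 0.

    Keypoints are (x, y) pairs and must be distinct. half_width is an assumed
    limb thickness; the notes do not say how wide the region is.
    """
    (ax, ay), (bx, by) = kp_a, kp_b
    dx, dy = bx - ax, by - ay
    seg_len = math.hypot(dx, dy)
    if seg_len == 0:
        raise ValueError("keypoints must be distinct")
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Distance travelled along the limb and distance away from it.
            along = ((x - ax) * dx + (y - ay) * dy) / seg_len
            perp = abs((x - ax) * dy - (y - ay) * dx) / seg_len
            row.append(1 if 0 <= along <= seg_len and perp <= half_width else 0)
        mask.append(row)
    return mask


def part_attention_loss(gt_maps, pred_maps):
    """(1 / C) * sum_p ||M*_p - M_p||^2, with a squared L2 norm per part."""
    total = 0.0
    for gt, pred in zip(gt_maps, pred_maps):
        total += sum((g - p) ** 2
                     for row_g, row_p in zip(gt, pred)
                     for g, p in zip(row_g, row_p))
    return total / len(gt_maps)


# A horizontal limb across the middle row of a 5 x 5 grid.
mask = limb_mask(5, 5, (0, 2), (4, 2), half_width=0.5)
empty = [[0] * 5 for _ in range(5)]
loss_vs_empty = part_attention_loss([mask], [empty])
```

Because the mask follows the limb direction rather than an axis-aligned box, it excludes most of the background that a rectangular RoI would include.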

Loss for Rigid Attention $L^r(\mathbf{R})$

A rectangle $\mathcal{R}^r_p$ defines each rigid part, built from the keypoint index sets $S_1=\{0,1,2\}$, $S_2=\{1,3,4,7\}$, $S_3=\{4,5,7,8\}$

$\mathbf{R}^*_p(\mathbf{x}) =\left\{ \begin{aligned} 1, && \text{if } \mathbf{x} \in \mathcal{R}^r_p, \\ 0, && \text{otherwise.} \end{aligned} \right.$

Rigid part attention loss:
$L^r(\mathbf{R},\mathbf{N}) = \frac{1}{C^r}\sum_{p=1}^{C^r}\|\mathbf{R}^*_p-\hat{\mathbf{R}}_p\|^2$
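A sketch of the rigid-part ground truth, assuming the region $\mathcal{R}^r_p$ is the tight axis-aligned bounding box of the keypoints indexed by $S_p$. That reading, the toy keypoint coordinates, and the function name are assumptions of this example, not details confirmed by the notes.

```python
def rigid_part_mask(height, width, keypoints, index_set):
    """Binary map: 1 inside the bounding box of the selected keypoints, else 0.

    keypoints: list of (x, y) pairs; index_set: indices belonging to one part.
    """
    xs = [keypoints[i][0] for i in index_set]
    ys = [keypoints[i][1] for i in index_set]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(width)]
            for y in range(height)]


# Toy coordinates for keypoints 0, 1, 2 (the set S_1 for head-shoulder).
toy_keypoints = [(2, 0), (1, 1), (3, 1)]
head_shoulder = rigid_part_mask(4, 5, toy_keypoints, {0, 1, 2})
```

The resulting binary map plays the role of $\mathbf{R}^*_p$ in the squared-error loss above.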

Part Visibility Score.

The intensity values of an attention map indicate how visible each part is. From them, a global visibility score is defined to measure the importance of each body part:
$v_p = \sum_{x,y}|\mathbf{R}_p(x,y)| \ \text{or} \ v_p=\sum_{x,y}|\mathbf{N}_p(x, y)|$
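The visibility score is just the sum of absolute attention values over all positions. A minimal sketch with made-up 2 x 2 attention maps:

```python
def visibility_score(attention_map):
    """v_p = sum over (x, y) of |M_p(x, y)| for one part attention map."""
    return sum(abs(value) for row in attention_map for value in row)


# Illustrative values only: a clearly visible part and a mostly occluded one.
visible_part = [[0.9, 0.8], [0.7, 0.9]]
occluded_part = [[0.1, 0.0], [0.0, 0.1]]
v_visible = visibility_score(visible_part)
v_occluded = visibility_score(occluded_part)
```

An occluded part produces weak attention responses, so its score is low and it can be down-weighted later.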

3.2. Attention-aware Feature Composition

Three stages:
- Global Context Network (GCN) ==> global features
- Attention-Aware Feature Alignment ==> part-attention-aware features
- Weighted Feature Composition ==> re-weights the aligned features using the visibility scores

Stage 1: Global Context Network (GCN)

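A base network (GoogleNet) extracts the features
To reduce computation, an additional 256-channel 3x3 convolutional layer is appended after "inception_5b/output", and the resulting 256-channel feature map is passed to the next stage
The input size is changed to 448 x 192, so the feature map output by the last convolutional layer is 14 x 6
GCN is first trained on its own and then fine-tuned jointly with the other stages
GCN is initialized from ImageNet weights and the appended convolutional layers are randomly initialized ==> GAP ==> identification loss + verification loss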
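The 14 x 6 output size follows from the input size if the backbone downsamples by a total factor of 32, which is the case for GoogleNet up to `inception_5b` (a 224 x 224 input gives a 7 x 7 map there). The check below also assumes the appended convolution preserves spatial size.

```python
def output_size(height, width, stride=32):
    """Spatial size of the final feature map for a given total stride."""
    return height // stride, width // stride


feature_hw = output_size(448, 192)
```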

Stage 2: Attention-Aware Feature Alignment.

Global context features are prone to body part misalignment
Attention-aware feature maps are produced by taking the Hadamard product of the global feature maps with each human part attention map; GAP is applied to each resulting feature map, and the results are concatenated into the aligned feature vector
$\mathbf{f}^a = Concat(\{\mathbf{f}_p\}^P_{p=1}), \ \mathbf{f}_p=\sigma_{gap}(\mathbf{F}^a_p), \ \mathbf{F}^a_p = \mathbf{F} \circ\hat{\mathbf{M}}_p, \ \hat{\mathbf{M}}_p = \frac{\mathbf{M}_p}{\max(\mathbf{M}_p)}$
where $\mathbf{M}_p \in\{\mathbf{N}_p,\mathbf{R}_p\}$ is the attention map of body part $p$
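The alignment step can be sketched with nested lists: normalize each attention map by its maximum, multiply it element-wise into every channel of the global feature maps, average-pool, and concatenate across parts. The function names and the handling of an all-zero attention map are assumptions of this sketch.

```python
def normalize_attention(att):
    """M_hat = M / max(M). An all-zero map is returned unchanged (an assumption)."""
    peak = max(max(row) for row in att)
    if peak <= 0:
        return [[0.0 for _ in row] for row in att]
    return [[value / peak for value in row] for row in att]


def attention_aware_feature(feature_maps, att):
    """f_p: Hadamard product of each channel with M_hat, then global average pooling.

    feature_maps: C x H x W nested lists; att: H x W nested list.
    """
    att_hat = normalize_attention(att)
    h, w = len(att), len(att[0])
    return [sum(fmap[y][x] * att_hat[y][x] for y in range(h) for x in range(w)) / (h * w)
            for fmap in feature_maps]


def aligned_feature(feature_maps, part_attentions):
    """f^a: concatenation of f_p over all P parts."""
    f_a = []
    for att in part_attentions:
        f_a.extend(attention_aware_feature(feature_maps, att))
    return f_a


# Toy input: C = 2 channels on a 2 x 2 grid, P = 2 parts.
features = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
attentions = [[[0.5, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
f_a = aligned_feature(features, attentions)
```

Each part contributes a fixed slot of length C in the output, so the same positions in two feature vectors always describe the same body part regardless of pose.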

Stage 3: Weighted Feature Composition.

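Adaptively adjusts how body parts are matched to cope with pose variation and occlusion
The weight vector $\mathbf{w}$ is estimated by considering both part visibility and feature salience
The visibility scores are concatenated with the attention-aware aligned feature vector and fed into an FC layer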

$\hat{\mathbf{f}}^a=Conv(Concat(\{\mathbf{w}_p\cdot\mathbf{f}_p\}^P_{p=1}))$
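A rough sketch of the composition step: each part feature is scaled by its weight, the results are concatenated, and a linear map fuses them. On a pooled vector a 1x1 convolution reduces to a matrix multiplication, which is how it is modeled here. The weights and the fusion matrix are made-up values, the bias term is omitted, and the estimation of $\mathbf{w}$ itself is not shown.

```python
def weighted_composition(part_features, part_weights, fusion_matrix):
    """f_hat = Conv(Concat({w_p * f_p})), with the 1x1 conv written as a matmul.

    part_features: P lists of length C; part_weights: P scalars;
    fusion_matrix: D rows of length P * C (hypothetical fusion parameters).
    """
    weighted = [w * value
                for w, feature in zip(part_weights, part_features)
                for value in feature]
    return [sum(m * v for m, v in zip(row, weighted)) for row in fusion_matrix]


# Toy input: P = 2 parts with C = 2 channels, fused down to D = 2 dimensions.
parts = [[1.0, 2.0], [3.0, 4.0]]
weights = [0.5, 2.0]
fusion = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
f_hat = weighted_composition(parts, weights, fusion)
```

A part with a small weight, for example an occluded one, contributes little to the fused descriptor.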

3.3. Implementation Details

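In AFC, GoogleNet extracts the global features, and two additional 1x1 convolutional layers handle part weight estimation and final feature fusion
AACN is trained progressively:
- Train PPA (part attention and pose estimation loss) and GCN (re-id loss) independently
- Freeze PPA and GCN, then train the feature weighting and composition parameters in AFC
- Train all modules jointly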
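The progressive schedule can be written down as a small table of which modules are updated and which are frozen in each step. The step labels and the module name `AFC_weighting` are shorthand invented for this sketch, not identifiers from the paper or its code.

```python
# Each entry lists the modules updated and the modules kept fixed in one step.
SCHEDULE = [
    {"step": "1a", "trainable": {"PPA"}, "frozen": set()},
    {"step": "1b", "trainable": {"GCN"}, "frozen": set()},
    {"step": "2", "trainable": {"AFC_weighting"}, "frozen": {"PPA", "GCN"}},
    {"step": "3", "trainable": {"PPA", "GCN", "AFC_weighting"}, "frozen": set()},
]


def requires_grad(step, module):
    """Return True if `module` receives gradient updates in the given step."""
    for phase in SCHEDULE:
        if phase["step"] == step:
            return module in phase["trainable"]
    raise KeyError(step)
```

In a deep learning framework, the `frozen` set would typically be applied by disabling gradient updates for those modules' parameters during that step.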