Paper reading notes (2) [IJCAI 2016]: Video-Based Person Re-Identification by Simultaneously Learning Intra-Video and Inter-Video Distance Metrics

Summary

(1) Method:

To handle the variations both within each pedestrian video and across videos of the same pedestrian, the paper proposes simultaneously learning an intra-video and an inter-video distance metric (SI²DL).

 

(2) Model:

Intra-video distance metric: makes each video internally more compact;

Inter-video distance metric: makes each truly matching video pair closer than any mismatched pair.

Video triplets are designed to improve the discriminability of the learned metrics.

 

(3) Datasets:

iLIDS-VID and PRID 2011 image sequence datasets

 

Introduction

(1) Most existing methods are image-based person re-identification, and fall into two categories: feature learning and distance (metric) learning.

Feature learning: extracting features from pedestrian images, including salience features, mid-level features, and salient color features.

Distance learning: learning an effective distance metric that maximizes matching accuracy, e.g., LMNN (large margin nearest neighbor), KISSME (keep it simple and straightforward metric), RDC (relative distance comparison).
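These metric-learning methods all learn a Mahalanobis-form distance d_M(x, y) = sqrt((x − y)^T M (x − y)) with M positive semi-definite. A minimal sketch of this form (my own illustration, not code from any of these papers):

```python
import numpy as np

def mahalanobis_distance(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)); M = I gives the Euclidean distance."""
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))

# Metric learning methods typically parameterize M = L^T L,
# which keeps M positive semi-definite by construction.
p = 4
L = np.random.randn(p, p)
M = L.T @ L
x, y = np.random.randn(p), np.random.randn(p)
print(mahalanobis_distance(x, y, M))
```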

 

(2) Two recently proposed video-based Re-id methods:

Both extract spatial-temporal features to give a detailed representation of each pedestrian video:

The video is first segmented into several fragments (walking cycles), spatial-temporal features are extracted from each fragment, and the extracted features represent the video.

Video-based Re-id can therefore also be viewed as a set-to-set matching problem.

 

(3) Difficulties:

Pose, viewpoint, illumination, and occlusion cause variations not only across the multiple videos of a pedestrian, but also across different frames within a single video.

None of the above methods handles intra-video variations and inter-video variations simultaneously.

 

(4) Methods that reduce variations between sets: set-based distance learning

Existing methods include MDA (manifold discriminant analysis), SBDR (set-based discriminative ranking), CDL (covariance discriminative learning), SSDML (set-to-set distance metric learning), and LMKML (localized multi-kernel metric learning).

 

(5) Motivation:

① Most existing Re-id algorithms are image-based;

② Video-based Re-id can be viewed as matching sets of images, but existing set-based distance learning methods were not designed to solve video-based Re-id.

 

(6) Contributions:

① Proposed a video-based Re-id method called SI²DL;

② Designed a new set-based distance learning model;

③ Designed a new model of inter-video relationships (the video triplet);

④ Evaluated on the iLIDS-VID and PRID 2011 datasets.

 

SI²DL

(1) Problem Definition:

① Training set: X = [X_1, ..., X_i, ..., X_K]

Each pedestrian video X_i is a p * n_i matrix, i.e., the i-th video contains n_i samples, where p is the dimension of each sample; the j-th sample of the i-th video is denoted x_ij.
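For concreteness, a sketch of this data layout (shapes are hypothetical; numpy is used only for illustration):

```python
import numpy as np

p = 100           # dimension of each sample (frame feature)
n = [30, 45, 60]  # n_i: number of samples in the i-th video (K = 3 videos)

# X[i] is the pedestrian video X_i: a p x n_i matrix whose
# j-th column is the sample x_ij.
X = [np.random.randn(p, n_i) for n_i in n]

x_ij = X[0][:, 5]  # one sample of the first video, shape (p,)
```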

② Intuitively, if each video is made internally more compact, the separability between videos becomes significantly better. This motivates the intra-video distance metric and the inter-video distance metric.

③ Definition of J(V, W):

V: intra-video distance metric, size: p * K1

W: inter-video distance metric, size: K1 * K2

V_i: the i-th column of matrix V, size: p * 1

W_i: the i-th column of matrix W, size: K1 * 1

f(V, X): intra-video cohesion (congregating term)

g(W, V, X): inter-video discrimination (discriminant term)

μ: balancing weight

The SI²DL framework trains V and W to minimize the two terms jointly: J(V, W) = f(V, X) + μ·g(W, V, X).

④ Computing f(V, X):

Each video is represented by the mean of all its samples, i.e., the mean of the i-th video X_i is m_i = (1/n_i) Σ_j x_ij.

The cohesion term is f(V, X) = (1/N) Σ_i Σ_j ||V^T(x_ij − m_i)||², where N is the total number of frames in the dataset.

Understanding the formula:

Dimensions of the matrix operation V^T(x_ij − m_i): (K1 * p) * (p * 1) = K1 * 1

V^T changes the lengths of the vectors, i.e., it changes the distance metric, so that the samples of each video are pulled toward their center.
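A minimal numpy sketch of the congregating term as written above (V is assumed to be p * K1; all names are mine):

```python
import numpy as np

def congregating_term(X, V):
    """f(V, X) = (1/N) * sum_i sum_j ||V^T (x_ij - m_i)||^2,
    where m_i is the mean of video X[i] (a p x n_i matrix) and
    N is the total number of frames across all videos."""
    N = sum(Xi.shape[1] for Xi in X)
    total = 0.0
    for Xi in X:
        m_i = Xi.mean(axis=1, keepdims=True)  # p x 1 video center
        proj = V.T @ (Xi - m_i)               # K1 x n_i projected deviations
        total += np.sum(proj ** 2)
    return total / N
```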

⑤ Definition of the video triplet:

Given videos X_i, X_j, X_k with corresponding means m_i, m_j, m_k,

where X_j is a correct match of X_i and X_k is a mismatch of X_i,

and the triplet condition stated in the paper is satisfied,

then X_i, X_j, X_k form a video triplet, denoted <i, j, k>.
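A sketch of how the triplet set D could be enumerated from identity labels; the paper's extra triplet condition (not reproduced in these notes) is omitted, so this is only the naive enumeration:

```python
def build_triplets(labels):
    """Enumerate video triplets <i, j, k>: video j shares the identity of
    video i (correct match), video k does not (mismatch). The paper's
    additional selection condition is omitted in this sketch."""
    D = []
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if j != i and labels[j] == labels[i]:
                D.extend((i, j, k) for k in range(n) if labels[k] != labels[i])
    return D

# Videos 0 and 1 belong to person 7, video 2 to person 9:
print(build_triplets([7, 7, 9]))  # [(0, 1, 2), (1, 0, 2)]
```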

⑥ Computing g(W, V, X): |D| denotes the number of triplets.

The discriminant term is computed as g(W, V, X) = (1/|D|) Σ_{<i,j,k>∈D} (a − ρ·b), where a is the squared inter-video distance of the correct pair (X_i, X_j) and b is that of the mismatched pair (X_i, X_k).

Here ρ is a penalty term: ρ = exp(−b/a).

The difference between the two norms can be understood as the correct-match distance minus the mismatch distance; the desired outcome is that the correct-match distance is smaller and the mismatch distance is larger, i.e., this difference is as small as possible.

Why add the penalty term? My understanding: to guarantee that the discriminant term is always positive.

Writing ρ = exp(−b/a), we have ρ < 1. If a is much smaller than b, then ρ is very small, b is heavily damped, and (a − ρ·b) comes out positive; if a is much larger than b, then ρ is close to 1, and (a − ρ·b) is again positive.
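Putting ⑥ together, a sketch of the discriminant term with the ρ penalty. Based on the sizes listed earlier (V: p * K1, W: K1 * K2), I assume W acts on the V-projected means; this reading of a and b is my reconstruction, not the paper's verbatim formula:

```python
import numpy as np

def discriminant_term(means, V, W, D):
    """g(W, V, X) = (1/|D|) * sum over <i,j,k> in D of (a - rho * b), with
    a = ||W^T V^T (m_i - m_j)||^2   (squared correct-match distance)
    b = ||W^T V^T (m_i - m_k)||^2   (squared mismatch distance)
    rho = exp(-b / a), which keeps every summand positive."""
    total = 0.0
    for i, j, k in D:
        a = np.sum((W.T @ V.T @ (means[i] - means[j])) ** 2)
        b = np.sum((W.T @ V.T @ (means[i] - means[k])) ** 2)
        rho = np.exp(-b / max(a, 1e-12))  # guard against a == 0
        total += a - rho * b
    return total / len(D)
```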

⑦ Objective function: minimize J(V, W) = f(V, X) + μ·g(W, V, X) over V and W.

 

(2) Optimization of SI²DL:

① Since the above formulation is not convex, the problem must be transformed:

where the elements of the matrices M1 and M2 are built from the triplet terms, with <i, j, k> ∈ D.

(Why? This is probably a convex optimization matter that I have not studied yet; I do not understand this transformation of the formula either.)

[Note] The Frobenius norm is computed as ||A||_F = sqrt(Σ_i Σ_j a_ij²) = sqrt(trace(A^T A)).

② Fix V and W to update A and B:

Initialize V:

Constructing the Lagrangian function and differentiating it gives:

where:

My own derivation [not necessarily accurate]:

The problem thus becomes an eigen-decomposition; K1 eigenvectors are selected as the initialization of V.
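A sketch of this initialization step. The exact matrix being eigen-decomposed comes from the Lagrangian above (whose image is missing from these notes), so a generic symmetric matrix C stands in as a placeholder:

```python
import numpy as np

def init_projection(C, k):
    """Initialize a projection matrix by eigen-decomposing the symmetric
    matrix C and keeping k eigenvectors as columns. Whether the largest
    or smallest eigenvalues are kept depends on the objective; the
    largest are used here."""
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]

# V0 = init_projection(C_V, K1)   # C_V, C_W: placeholder matrices
# W0 = init_projection(C_W, K2)   # derived from the Lagrangian
```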

Initialize W:

Using the same method, K2 eigenvectors are selected as the initialization of W.

Once V and W are fixed, A and B are obtained by optimizing the following formula:

 

③ Fix A, B, and W to update V:

With A, B, and W fixed, the optimization problem becomes:

where:

The ADMM algorithm is used to transform the above formula further:

First, an auxiliary variable S is introduced:

The ADMM iterations:

(I have not understood this ADMM step yet; I still need to consult references.)
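As general background (this is the standard scaled-form ADMM, not the paper's specific derivation): ADMM solves min f(x) + g(z) subject to Ax + Bz = c by alternating

x(t+1) = argmin_x f(x) + (ρ/2)·||Ax + Bz(t) − c + u(t)||²
z(t+1) = argmin_z g(z) + (ρ/2)·||Ax(t+1) + Bz − c + u(t)||²
u(t+1) = u(t) + Ax(t+1) + Bz(t+1) − c

where u is the scaled dual variable and ρ > 0 is a penalty parameter. The auxiliary variable S introduced above plays the role of z: it splits the objective so that each subproblem becomes easier to solve.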

④ Fix A, B, and V to update W:

With A, B, and V fixed, the optimization problem becomes:

ADMM is again used to optimize the problem further and solve for W.

⑤ Summary of the SI²DL algorithm:

 

(3) Prediction using the learned V and W:

Gallery videos: Y = [Y_1, ..., Y_i, ..., Y_n]

The i-th video Y_i has size p * l_i, where l_i is the number of samples in Y_i.

A probe video Z_i has size p * n_i, where n_i is the number of samples in Z_i.

The j-th sample of Y_i / Z_i is denoted y_ij / z_ij.

Recognition procedure:

① Compute the first-order representations (sample means) of Z_i and Y_i;

② Compute the distance between the two under the learned metrics;

③ Select the gallery video with the smallest distance as the matching result for Z_i.
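A sketch of this recognition procedure, assuming the first-order representation is the sample mean and the distance is computed through the learned V and W (this exact distance form is my assumption):

```python
import numpy as np

def video_mean(M):
    """First-order representation: the column mean of a p x n sample matrix."""
    return M.mean(axis=1)

def match_probe(Z_i, gallery, V, W):
    """Return the index of the gallery video closest to probe Z_i, with the
    distance assumed to be ||W^T V^T (z_mean - y_mean)||."""
    z = W.T @ (V.T @ video_mean(Z_i))
    dists = [np.linalg.norm(z - W.T @ (V.T @ video_mean(Y_i))) for Y_i in gallery]
    return int(np.argmin(dists))
```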

 

Experimental Results

(1) Experimental setup:

① Compared methods:

discriminative video fragments selection and ranking (DVR)

Improved variants: Salience+DVR, MSColour&LBP+DVR

spatial-temporal fisher vector representation (STFV3D)

Improved variant: STFV3D+KISSME

② Parameter settings:

For the iLIDS-VID dataset: (K1, K2) = (2200, 80), μ = 0.00005, τ1 = 0.2, τ2 = 0.2;

For the PRID dataset: (K1, K2) = (2500, 100), μ = 0.00005, τ1 = 0.1, τ2 = 0.1.

③ Evaluation protocol:

50% of each dataset is used for training and 50% for testing.

In the test set, sequences from the first camera form the probe set and sequences from the second camera form the gallery.

Evaluation uses the CMC (Cumulative Matching Characteristic) curve.
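For reference, a minimal sketch of how a CMC curve is computed from ranked matches (the standard definition, not specific to this paper):

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """cmc[r] = fraction of probes whose correct identity appears among
    the top (r + 1) ranked gallery videos; dist[i, j] is the distance
    from probe i to gallery video j."""
    gallery_ids = np.asarray(gallery_ids)
    cmc = np.zeros(max_rank)
    for i in range(dist.shape[0]):
        ranking = np.argsort(dist[i])              # closest first
        hits = gallery_ids[ranking] == probe_ids[i]
        if hits.any():
            first_hit = int(np.argmax(hits))       # rank of first correct match
            if first_hit < max_rank:
                cmc[first_hit:] += 1
    return cmc / dist.shape[0]
```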

 

(2) Results on the iLIDS-VID dataset:

The dataset contains 600 image sequences of 300 pedestrians; each pedestrian has an image sequence from each of two cameras.

Each image sequence contains 22-192 frames, with 71 frames on average.

 

(3) Results on the PRID 2011 dataset:

Cam-A contains image sequences of 385 pedestrians, and Cam-B contains image sequences of 749 pedestrians.

Each sequence contains 5-675 frames, with 84 frames on average. (Sequences with fewer than 20 frames are ignored.)


Source: www.cnblogs.com/orangecyh/p/11896658.html