Notes on the person re-identification survey "Deep Learning for Person Re-identification: A Survey and Outlook"

Table of contents

1 Overview

1.1 Query and Gallery

1.2 Difficulties

1.3 Overall steps

2 Closed-world Re-ID

2.1 Representation Learning

2.2 Metric Learning

2.3 Ranking Optimization

2.4 Datasets and Metrics

3 Open-world Re-ID

3.1 Complex Re-ID

3.2 End-to-end Re-ID

3.3 Semi-supervised and unsupervised Re-ID

3.4 Re-ID that is more robust to noise

4 Outlook


This article only briefly outlines the main research directions in pedestrian re-identification; individual algorithms are either left out or summarized in passing.

1 Overview

        Pedestrian re-identification (hereinafter "reid") is the task of retrieving a target pedestrian from images taken by multiple cameras whose fields of view do not overlap.

        Current reid problems fall into two broad categories: closed-world and open-world. Put plainly, closed-world is research-oriented: the target pedestrian is retrieved from a large set of already-cropped pedestrian bounding-box images. Open-world is deployment-oriented: the target pedestrian is retrieved directly from video, or the model leans toward unsupervised and weakly supervised learning. The specific differences between the two settings are described below.

1.1 Query and Gallery

        Before looking at reid itself, we need to know what the Query and the Gallery are. The Query is the target pedestrian (the person of interest), and the Gallery is the retrieval library, that is, a large collection of pedestrian photos or videos.

        Broadly speaking, the Query and the Gallery can take several forms. The Query can be one or more bounding boxes (photos) of a pedestrian, or a video, but in either case it must contain exactly one person. For example: (picture taken from Market-1501)

The Gallery can consist of the bounding boxes of all pedestrians cropped from whole images, or of videos. For example: (picture taken from Market-1501)

1.2 Difficulties

The main difficulties of reid are:

  • The same pedestrian is photographed from different viewpoints across the Gallery
  • Lighting conditions differ
  • The pedestrian may occupy only a small part of the photo, so the bounding box has low resolution
  • Pedestrians' poses vary
  • There may be occlusion
  • ...

Real-world deployment adds further difficulties:

  • The number of cameras may keep growing, and the captured scenes become more complex
  • The Gallery is huge
  • Labels may not be available during training (so unsupervised or weakly supervised learning is required)
  • The network needs very strong generalization ability (cross-domain)
  • The test scenario is unknown in advance
  • Pedestrians may have changed clothes
  • ...

1.3 Overall steps

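See the picture for the overall pipeline. In the survey it consists of five steps: raw data collection, bounding box generation, training data annotation, model training, and pedestrian retrieval.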

2 Closed-world Re-ID

2.1 Representation Learning

        Representation learning studies how to extract a pedestrian's features. The main approaches are:

  • Global representation learning
  • Local representation learning
  • Auxiliary representation learning
  • Video-based representation learning
  • ...

        An intuitive comparison can be seen in the following picture:

Global representation learning: feed the whole pedestrian image directly into a convolutional neural network to extract features, which places high demands on the backbone network. The survey also highlights the role of attention mechanisms here.

Local representation learning: divide the pedestrian image into parts, extract features from each part with the network, and finally combine all the local features.
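
To make the difference between a global and a local feature concrete, here is a minimal sketch in plain Python. It assumes the backbone has already produced one feature vector per image row, and it assumes the "parts" are equal-height horizontal stripes, which is only one possible way of dividing the image. The names mean_vector and part_features and the toy numbers are made up for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors (average pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def part_features(row_feats, num_parts):
    """Split the rows into horizontal stripes, pool each stripe, and concatenate."""
    if len(row_feats) % num_parts != 0:
        raise ValueError("number of rows must be divisible by num_parts")
    size = len(row_feats) // num_parts
    parts = [mean_vector(row_feats[i * size:(i + 1) * size]) for i in range(num_parts)]
    return [value for part in parts for value in part]


# Hypothetical 4-row, 2-channel feature map (one vector per image row, top to bottom).
rows = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
global_feat = mean_vector(rows)      # [1.0, 1.5]
local_feat = part_features(rows, 2)  # [2.0, 0.0, 0.0, 3.0]
```

The global feature averages everything into one vector, while the local feature keeps the upper and lower halves of the body apart, at the price of a longer vector.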

Auxiliary representation learning: add auxiliary information to the network, for example text describing the pedestrian's appearance, domain information, or images generated by a GAN. Doing so improves the accuracy of the network.

Video-based representation learning: feed a sequence of frames to the network, extract features from each frame, and finally aggregate them into a single overall feature.
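
The aggregation step can be as simple as pooling over time. The sketch below assumes the per-frame features are already extracted, and shows average and max pooling only as two simple choices; the survey covers more elaborate temporal models. The function name aggregate_frames and the toy tracklet are hypothetical.

```python
def aggregate_frames(frame_feats, mode="avg"):
    """Fuse per-frame feature vectors into one sequence-level vector."""
    columns = list(zip(*frame_feats))
    if mode == "avg":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError("mode must be 'avg' or 'max'")


# Three frames of the same pedestrian, two feature dimensions each (toy values).
tracklet = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
avg_feat = aggregate_frames(tracklet, "avg")  # [2.0, 2.0]
max_feat = aggregate_frames(tracklet, "max")  # [3.0, 4.0]
```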

In addition, the survey stresses the importance of network architecture design.

2.2 Metric Learning

        Metric learning at this stage is mainly about designing loss functions and designing the strategy for training the network.

        The main loss functions are identity loss, verification loss, triplet loss, and OIM loss. Schematic diagrams of the first three are as follows:
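
As a rough companion to the diagrams, the first three losses can be written for a single sample, pair, or triplet in a few lines of plain Python. This is a sketch of the usual textbook forms, not code from the survey: the function names are made up, the margin of 0.3 is an assumed default, and OIM loss is left out.

```python
import math


def identity_loss(logits, label):
    """Softmax cross-entropy: each identity is treated as one class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def verification_loss(p_same, is_same):
    """Binary cross-entropy on the predicted probability that a pair shares an identity."""
    return -math.log(p_same) if is_same else -math.log(1.0 - p_same)


def triplet_loss(d_ap, d_an, margin=0.3):
    """Hinge on distances: the positive must be closer than the negative by `margin`."""
    return max(d_ap - d_an + margin, 0.0)


print(identity_loss([2.0, 0.5, 0.1], 0))   # small when the true ID has the largest logit
print(verification_loss(0.9, True))        # small when a same-ID pair is scored as "same"
print(triplet_loss(d_ap=0.5, d_an=1.0))    # 0.0, the margin is already satisfied
```

Identity loss looks at one image, verification loss at a pair, and triplet loss at an anchor with one positive and one negative.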

        Training strategies mainly address the following problems:

  • The number of pedestrians (IDs) is very large, so each training batch should cover as many IDs as possible.
  • For each ID, there are far fewer positive samples than negative samples.
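
One commonly used way to deal with both points is identity-balanced batch sampling: draw a fixed number of IDs per batch and a fixed number of images per ID, so every image in the batch has positives. The sketch below is only an illustration under that assumption; pk_batch, its parameters, and the toy dataset are hypothetical names, not something the survey prescribes.

```python
import random


def pk_batch(images_by_id, num_ids, num_per_id, rng):
    """Pick num_ids identities, then num_per_id images of each identity.

    Identities with too few images are sampled with replacement.
    """
    ids = rng.sample(sorted(images_by_id), num_ids)
    batch = []
    for pid in ids:
        imgs = images_by_id[pid]
        if len(imgs) >= num_per_id:
            chosen = rng.sample(imgs, num_per_id)
        else:
            chosen = [rng.choice(imgs) for _ in range(num_per_id)]
        batch.extend((pid, img) for img in chosen)
    return batch


# Toy dataset: identity -> list of image names (made-up values).
dataset = {
    "id_a": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "id_b": ["b1.jpg"],
    "id_c": ["c1.jpg", "c2.jpg"],
    "id_d": ["d1.jpg", "d2.jpg", "d3.jpg", "d4.jpg"],
}
batch = pk_batch(dataset, num_ids=3, num_per_id=2, rng=random.Random(0))
print(len(batch))  # 6 samples: 3 identities x 2 images each
```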

2.3 Ranking Optimization

        First, what is a rank? At prediction time, the images in the Gallery are sorted by their similarity to the query: the more similar an image is, the higher it ranks. Ranking optimization, as the name suggests, optimizes this sorting stage.
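
The basic ranking step can be sketched in a few lines. The example assumes the features are already extracted and uses Euclidean distance as the similarity measure, which is an assumption here (cosine distance is another common choice). The function names and the 2-D toy features are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar to the query."""
    dists = [euclidean(query_feat, g) for g in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: dists[i])


# Toy 2-D features (hypothetical values, not from any dataset).
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(rank_gallery(query, gallery))  # [1, 0, 2]
```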

        The main optimization methods are re-ranking, rank fusion, and so on.
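
To give a feel for re-ranking, here is a heavily simplified sketch built on the k-reciprocal neighbor idea: a gallery image is promoted if it is among the query's k nearest neighbors and the query is also among that image's k nearest neighbors. Published re-ranking methods go considerably further than this, so treat it as an illustration only. All names and the 1-D toy features are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(idx, feats, k):
    """Indices of the k samples nearest to feats[idx], excluding idx itself."""
    others = [j for j in range(len(feats)) if j != idx]
    others.sort(key=lambda j: dist(feats[idx], feats[j]))
    return others[:k]


def rerank_k_reciprocal(query_feat, gallery_feats, k=2):
    """Move k-reciprocal neighbors of the query to the front of the ranking."""
    feats = [query_feat] + list(gallery_feats)      # index 0 is the query
    initial = top_k(0, feats, len(gallery_feats))   # plain distance ranking
    q_neighbors = set(initial[:k])
    reciprocal = [j for j in initial
                  if j in q_neighbors and 0 in top_k(j, feats, k)]
    rest = [j for j in initial if j not in reciprocal]
    return [j - 1 for j in reciprocal + rest]       # back to gallery indices


# Gallery image 0 is closest to the query but sits in a tight cluster with
# images 2 and 3; image 1 is farther away but has the query as its nearest neighbor.
query = [0.0]
gallery = [[1.0], [-1.2], [1.5], [1.6]]
print(rerank_k_reciprocal(query, gallery, k=2))  # [1, 0, 2, 3]
```

With plain distance ranking the order would be [0, 1, 2, 3]; the reciprocal check moves image 1 ahead of image 0.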

2.4 Datasets and Metrics

  • The datasets are shown in the figure

  • The main evaluation metrics are CMC and mAP; the survey proposes an additional metric, mINP
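
The three metrics can all be computed from the ranked match list of each query. The sketch below follows the usual definitions: CMC at rank k is the fraction of queries whose first correct match appears within the top k, AP averages the precision at each correct match, and INP is the number of correct matches divided by the rank of the last (hardest) correct match. Real benchmark protocols also filter some gallery images per query (for example same-camera images); that step is omitted here, and the function names are made up.

```python
def evaluate_query(ranked_matches):
    """Per-query statistics from a ranked gallery.

    ranked_matches[i] is True if the gallery image at rank i + 1 shows the
    query identity. Returns (rank of first match, AP, INP).
    """
    hits = [rank for rank, is_match in enumerate(ranked_matches, start=1) if is_match]
    if not hits:
        raise ValueError("the gallery contains no match for this query")
    ap = sum(n / rank for n, rank in enumerate(hits, start=1)) / len(hits)
    inp = len(hits) / hits[-1]
    return hits[0], ap, inp


def summarize(all_ranked_matches, k=1):
    """Return (CMC at rank k, mAP, mINP) over all queries."""
    stats = [evaluate_query(r) for r in all_ranked_matches]
    n = len(stats)
    cmc_k = sum(1 for first, _, _ in stats if first <= k) / n
    m_ap = sum(ap for _, ap, _ in stats) / n
    m_inp = sum(inp for _, _, inp in stats) / n
    return cmc_k, m_ap, m_inp


# Two toy queries: matches at ranks 1 and 3, and a single match at rank 2.
results = [[True, False, True, False], [False, True, False, False]]
print(summarize(results, k=1))  # rank-1 = 0.5, mAP = 2/3, mINP = 7/12
```

mINP punishes a model that leaves even one correct match far down the list, which rank-1 accuracy cannot show.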

3 Open-world Re-ID

3.1 Complex Re-ID

        This part covers reid in more complex settings, mainly:

  • Reid between depth images and ordinary RGB images
  • Text-based reid: given a textual description of a pedestrian, retrieve that pedestrian
  • Infrared-based reid
  • Cross-resolution reid

3.2 End-to-end Re-ID

        End-to-end means that reid is performed directly on the raw video, and the position of the target ID in the video is returned directly. This is also closer to how reid is used in practice.

3.3 Semi-supervised and unsupervised Re-ID

        This part is mainly about how to cluster unlabeled data.
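
As an illustration of the clustering idea, the sketch below groups unlabeled features with a bare-bones k-means and uses the cluster index as a pseudo-label (a stand-in identity). It assumes the number of clusters k is known and initializes the centers from the first k samples; both are simplifications made only for this example. The function names and the toy features are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(feats, k, iters=10):
    """Assign each feature vector a cluster index to use as a pseudo-label."""
    centers = [list(f) for f in feats[:k]]  # naive init: first k samples
    labels = [0] * len(feats)
    for _ in range(iters):
        # Assignment step: nearest center for every sample.
        labels = [min(range(k), key=lambda c: sq_dist(f, centers[c])) for f in feats]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [f for f, label in zip(feats, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Unlabeled toy features forming two obvious groups.
feats = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9], [0.1, 0.3]]
print(kmeans_pseudo_labels(feats, k=2))  # [0, 1, 0, 1, 0]
```

The pseudo-labels can then stand in for identity labels when training the network.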

3.4 Re-ID that is more robust to noise

        Noise mainly comes from the following aspects:

  • Physical occlusion
  • Sampling noise in the dataset, for example a bounding box that contains no pedestrian or only part of one
  • Labeling noise in the dataset, for example a person who is actually A but is labeled as B

Reid datasets keep growing in number and size, so many of them cannot be labeled entirely by hand, and the problems above arise easily. The survey discusses how to address them.

4 Outlook

        In this part the authors propose a new metric, mINP, for measuring model quality, and a new baseline that can be used for both single-modality and cross-modality reid.

        This part also discusses some current research hot spots, such as domain adaptation and deployment.

Origin blog.csdn.net/fuss1207/article/details/123500362