Pedestrian Re-Identification (ReID): Representation Learning

Preface

Before getting into today's content, I would like to recommend Luo Hao's survey of pedestrian re-identification. It is very helpful for understanding the field, and the next few posts will also be based on this survey.
Pedestrian Re-identification: Luo Hao's Zhihu column.
By the way, a few open source codes for pedestrian re-identification are attached:
https://github.com/zhunzhong07/IDE-baseline-Market-1501
https://github.com/KaiyangZhou/deep-person-reid
https://github.com/huanghoujing/person-reid-triplet-loss-baseline

The concept of representation learning

In the previous post, we mentioned that training losses can be divided into representation learning and metric learning. This post focuses on representation learning.
Representation learning is a very common approach to pedestrian re-identification. Its distinguishing feature is that, although the ultimate goal of ReID is to learn the similarity between two images, representation-learning methods do not consider image similarity directly during training; instead, they treat the ReID task as a classification problem or a verification problem.
Specifically, the classification formulation uses the pedestrian's ID or attributes as training labels, so only one image needs to be input at a time; the verification formulation takes a pair of (two) images of pedestrians and lets the network learn whether the two images belong to the same pedestrian. [Figure: classification vs. verification networks] The classification network uses a classification loss. The blue lines indicate that two images belong to the same ID: they activate the same output neuron, which also means they have similar features.
The verification network takes a pair of images at a time; red indicates different pedestrians, blue indicates the same ID.
A characteristic of this family of methods is that the final fully connected (FC) layer of the network does not output the feature vector that is ultimately used for retrieval; instead, its output passes through a Softmax activation to compute the representation-learning loss, while the preceding FC layer serves as the feature-vector layer.

Classification loss

When we treat each pedestrian as a class in a classification problem and use pedestrian IDs as training labels to train a CNN, the resulting loss is called the ID loss, and a network trained with only the ID loss is called an ID Embedding (IDE) network. [Figure: IDE network] As shown in the figure, the number of pedestrian IDs in the training set is the number of classes; a classification FC layer follows the feature layer, and the cross-entropy loss is computed through a Softmax activation. At test time, however, we retrieve with the feature vector of the penultimate layer and discard the classification FC layer, because the training and test sets contain completely disjoint sets of pedestrians, which means the test identities correspond to different classes, so the FC layer cannot be reused.
Later, researchers found that pedestrian ID information alone was not enough to learn a model with sufficient generalization ability (the model overfit), so additional attribute information was added, such as hair color, gender, and clothing. The trained network must then predict not only the pedestrian ID but also the corresponding attributes, which introduces an attribute loss. The network structure is as follows:
[Figure: ID + attribute multi-task network]

The total loss of the network is then composed of the ID loss and M attribute losses:
[Figure: combined ID and attribute loss]
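The image above held the loss formula. A plausible reconstruction from the surrounding description (here λ is an assumed balancing weight; the exact weighting scheme differs between papers) is:

```latex
L = L_{ID} + \lambda \sum_{i=1}^{M} L_{att}^{(i)}
```

where $L_{ID}$ is the identity cross-entropy loss and $L_{att}^{(i)}$ is the cross-entropy loss of the $i$-th attribute classifier.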

Verification loss

Unlike the classification network, the verification network takes a pair of (two) images each time and feeds both through the same Siamese (twin) network, which addresses the one-shot problem by outputting the similarity of two given images. The two extracted feature vectors are then fused and passed to an FC layer with only two neurons, on which a binary classification (verification) loss is computed. At test time, two images can then be input directly to compute their similarity.
However, the verification loss alone is not very effective, so it is usually combined with the ID loss described above during training.
[Figure: combined classification + verification network] The pedestrian re-identification network shown above:

  • Input: pairs of pedestrian images
  • Network
    • Classification subnet
    • Verification subnet
  • Loss
    • The total loss is L = L_id + L_v (the specific cross-entropy forms are given in the survey and are not repeated here)
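The two-subnet structure above can be sketched as follows. This is a toy illustration, not the paper's architecture: the embedding is a single linear layer for brevity, the class name `VerifNet` is made up, and the square-difference fusion is one common choice among several.

```python
import torch
import torch.nn as nn

class VerifNet(nn.Module):
    """Toy classification + verification network (hypothetical layout).

    A shared (Siamese) embedding maps each image of a pair to a feature;
    per-image logits give the ID loss, and the fused pair feature feeds
    a 2-way FC (the verification subnet).  Total loss: L = L_id + L_v.
    """
    def __init__(self, num_ids: int, feat_dim: int = 128):
        super().__init__()
        self.embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 128, feat_dim))
        self.id_fc = nn.Linear(feat_dim, num_ids)  # classification subnet
        self.verif_fc = nn.Linear(feat_dim, 2)     # verification subnet (2 neurons)

    def forward(self, img_a, img_b):
        fa, fb = self.embed(img_a), self.embed(img_b)  # shared weights = Siamese
        fused = (fa - fb).pow(2)                       # square-difference fusion
        return self.id_fc(fa), self.id_fc(fb), self.verif_fc(fused)

net = VerifNet(num_ids=751)
a = torch.randn(2, 3, 256, 128)                 # toy image pairs
b = torch.randn(2, 3, 256, 128)
ids_a = torch.tensor([5, 9])
ids_b = torch.tensor([5, 7])
same = (ids_a == ids_b).long()                  # 1 = same pedestrian, 0 = different
la, lb, lv = net(a, b)
ce = nn.functional.cross_entropy
total = ce(la, ids_a) + ce(lb, ids_b) + ce(lv, same)  # L = L_id + L_v
```

At test time, only the shared embedding is needed: a single image is passed through `embed` and its feature is compared against the gallery.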

After training on enough data, only a single image needs to be input at test time; the network automatically extracts its features for the re-identification task.

Summary

Robust ReID features are obtained directly by training the network; the similarity between images is not learned directly.

  • An additional FC layer is usually needed to supervise feature learning, and this FC layer is discarded at test time
  • The FC layer for the ID loss has as many outputs as there are IDs; when the training set is very large, the network becomes huge and is hard to train to convergence
  • The verification loss requires a pair of images at test time, so retrieval efficiency is very low
  • Representation learning is generally more stable to train, and its results are easy to reproduce
  • Distributed training of representation-learning models is usually more mature

Original post: blog.csdn.net/qq_37747189/article/details/109551551