Pedestrian Re-identification (ReID): Principles and Methods

Pedestrian re-identification: a short-term task

  • Across cameras, intra-class differences increase while inter-class differences decrease (the same person looks different under different cameras, and different people can look alike)

Application: pedestrian tracking

  1. Single camera, single target
  2. Single camera, multiple targets
  3. Multiple cameras, multiple targets

Pedestrian re-identification system

  1. Feature extraction

    Learn features that are robust to the appearance changes a pedestrian undergoes across different cameras

  2. Metric learning

    Map the learned features into a new space where images of the same person are closer and those of different people are farther apart

  3. Image retrieval

    Rank gallery images by their feature distance to the query and return the search results (a ranking sketch follows below)
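
A minimal sketch of the retrieval step, assuming features are already extracted as NumPy arrays (the function name and shapes are illustrative, not from any specific paper):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery images by Euclidean distance to the query feature.

    query_feat:    (D,) feature vector of the query image
    gallery_feats: (N, D) feature matrix of the gallery
    Returns gallery indices sorted from nearest to farthest.
    """
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists)
```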

Evaluation

  • Single query vs. multi query

    Single query means each person in the probe set has one image (N = 1); multi query means each person has N > 1 images, whose features are fused (max pooling or average pooling) into the final query feature. At the same Rank-k, the recognition rate is generally higher for larger N (see the fusion sketch below).
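
A sketch of multi-query fusion under the assumptions above (N probe features of one identity, fused by average or max pooling; the function name is illustrative):

```python
import numpy as np

def multi_query_feature(probe_feats, mode="avg"):
    """Fuse the N probe features of one identity into a single query feature.

    probe_feats: (N, D) array, one row per probe image of the same person.
    """
    if mode == "avg":
        return probe_feats.mean(axis=0)   # average pooling
    return probe_feats.max(axis=0)        # max pooling
```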

Features

global features

A single feature is extracted from the global information of each image; global features carry no spatial information.

  • Noisy regions can severely interfere with global features
  • Pose misalignment also causes global features to mismatch

local features

Features are extracted from certain regions of the image, and finally multiple local features are fused into the final feature

Horizontal slice ★★★

  • The image is divided into equal horizontal stripes, and each stripe is pooled horizontally into one local feature

  • Gated Siamese and AlignedReID fuse all local features through designed rules to compute the distance

  • PCB, ICNN, and SCPNet attach a ReID loss to each local feature and directly concatenate the local features

  • Combining local and global features usually gives better results (a stripe-pooling sketch follows below)
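
A sketch of the horizontal-stripe pooling these methods share, assuming a PyTorch feature map whose height is divisible by the number of stripes (shapes and names are illustrative):

```python
import torch

def horizontal_stripe_features(feature_map, num_stripes=6):
    """Split a CNN feature map into horizontal stripes and average-pool each.

    feature_map: (C, H, W) tensor with H divisible by num_stripes.
    Returns a (num_stripes, C) tensor of local features.
    """
    stripes = feature_map.chunk(num_stripes, dim=1)            # split along height
    return torch.stack([s.mean(dim=(1, 2)) for s in stripes])  # pool each stripe
```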

Gated Siamese

  • Each stripe is encoded into a feature by the CNN; the local features are fed into an LSTM in order and automatically aggregated into the final image feature
  • The network is trained with a contrastive loss (sketched below)
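
A sketch of the standard contrastive loss referenced here (PyTorch; the margin value is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, same_id, margin=1.0):
    """Contrastive loss on feature pairs: same_id = 1 pulls a pair together,
    same_id = 0 pushes it apart until the distance exceeds the margin."""
    d = F.pairwise_distance(f1, f2)
    return (same_id * d.pow(2) +
            (1 - same_id) * F.relu(margin - d).pow(2)).mean()
```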

AlignedReID

  • Mainly addresses the pose-misalignment problem

  • The backbone network is ResNet50

    Dynamically Matching Local Information (DMLI)
    • For a 256×128 input image, the output feature map is 8×4×2048
    • Horizontal pooling yields 8 local features, from which an 8×8 distance matrix is computed
    • The alignment must preserve the top-to-bottom order of the stripes (no crossing matches)
    • The optimal alignment is found as the shortest path through the distance matrix (see the sketch below)
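
A sketch of the shortest-path alignment over the 8×8 local-distance matrix; allowing only right/down moves preserves the top-to-bottom stripe order (the paper additionally normalizes the local distances, which is omitted here):

```python
import numpy as np

def shortest_path_distance(dist):
    """Shortest path from dist[0, 0] to dist[-1, -1] through a local-distance
    matrix, allowing only right/down moves (order-preserving alignment)."""
    h, w = dist.shape
    dp = np.zeros_like(dist, dtype=float)
    dp[0, 0] = dist[0, 0]
    for i in range(1, h):
        dp[i, 0] = dp[i - 1, 0] + dist[i, 0]
    for j in range(1, w):
        dp[0, j] = dp[0, j - 1] + dist[0, j]
    for i in range(1, h):
        for j in range(1, w):
            dp[i, j] = dist[i, j] + min(dp[i - 1, j], dp[i, j - 1])
    return dp[-1, -1]
```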

PCB

  • Input image 384×128, divided into 6 horizontal stripes
  • ResNet50 extracts features; the final feature map is 24×8
  • Each stripe yields one local feature with its own ReID loss
  • At test time, the 6 local features are concatenated (see the head sketch below)
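
A PCB-style head sketch under the shapes above (2048-channel ResNet50 map, 6 stripes, one classifier per stripe; class names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PCBHead(nn.Module):
    """One ID classifier per horizontal stripe, as in the PCB description."""
    def __init__(self, num_ids, num_stripes=6, in_dim=2048):
        super().__init__()
        self.num_stripes = num_stripes
        self.classifiers = nn.ModuleList(
            nn.Linear(in_dim, num_ids) for _ in range(num_stripes))

    def forward(self, fmap):                          # fmap: (B, C, 24, 8)
        parts = fmap.chunk(self.num_stripes, dim=2)   # 6 stripes of height 4
        feats = [p.mean(dim=(2, 3)) for p in parts]   # (B, C) per stripe
        logits = [clf(f) for clf, f in zip(self.classifiers, feats)]
        return torch.cat(feats, dim=1), logits        # concat features at test time
```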

ICNN

ICNN ≈ PCB + a global branch with a triplet loss

SCPNet

The spatial part features supervise the channel-group features, passing local information into the global feature

Pose information★★★

  • Use a pose estimation model to obtain the (14) key pose points of a pedestrian
  • Obtain part regions with semantic information from the pose points
  • Extract a local feature for each part region
  • Combining local and global features usually gives better results
  • Pose estimation models: Hourglass, OpenPose, CPM, AlphaPose
  • Part: rectangular regions set manually through certain rules (see the sketch after this list)
  • Attention: more important, arbitrarily shaped regions learned automatically by the network
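
A sketch of the "manual rule" variant: deriving head / upper-body / lower-body boxes from the vertical extent of the pose points. The split fractions and the (x, y) keypoint layout are assumptions for illustration, not the rule of any specific paper:

```python
import numpy as np

def part_boxes_from_keypoints(kpts, img_w):
    """kpts: (14, 2) array of (x, y) pose points for one pedestrian.
    Returns (x0, y0, x1, y1) boxes for three coarse body parts."""
    ys = kpts[:, 1]
    top, bottom = ys.min(), ys.max()
    shoulder_y = top + 0.2 * (bottom - top)   # assumed split fractions
    hip_y = top + 0.55 * (bottom - top)
    return {
        "head":  (0, int(top),        img_w, int(shoulder_y)),
        "upper": (0, int(shoulder_y), img_w, int(hip_y)),
        "lower": (0, int(hip_y),      img_w, int(bottom)),
    }
```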

PIE

  • CPM extracts the pose points
  • The image is divided into several parts, which are aligned by affine transformation
  • The features of the original image and the affine-transformed image are fused
  • The network is trained with an ID loss

Spindle Net

  • The FEN network extracts features, and the FFN network fuses them hierarchically

PDC

  • Pose-point information is used to divide the body into six parts
  • The STN is improved into a PTN, which learns affine transformation parameters to produce corrected part images
  • Global and local features are fused
  • Three ReID losses are computed
  • Shallow layers are shared; high-level layers are independent

GLAD

  • Divided into three parts: head, upper body, and lower body
  • The global feature is fused with the features of the three parts

PABP

  • A ReID network extracts the appearance feature map A
  • OpenPose extracts the pose feature map P
  • At each pixel position, the vectors of A and P are outer-producted and the result is vectorized
  • Appearance features at the corresponding locations are thereby activated (see the pooling sketch below)
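
A sketch of this bilinear pooling step: summing the per-location outer products of A and P equals the matrix product of their flattened maps (shapes are illustrative):

```python
import torch

def bilinear_pool(A, P):
    """A: (C_a, H, W) appearance map; P: (C_p, H, W) pose map.
    Returns the (C_a * C_p,) pooled bilinear feature."""
    a = A.flatten(1)               # (C_a, H*W)
    p = P.flatten(1)               # (C_p, H*W)
    return (a @ p.t()).flatten()   # sum of per-location outer products
```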

Segmentation information★★

  • Image semantic segmentation provides extremely fine, pixel-level part information
  • Segmentation divides into coarse-grained pedestrian foreground segmentation and fine-grained body semantic segmentation
  • The segmentation result is usually multiplied in, either as a preprocessing mask on the image or as attention on the feature map (see the sketch after this list)
  • Segmentation-based methods have not yet achieved particularly wide application
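
A sketch of the "multiply as attention" use: resizing a foreground mask to the feature map and multiplying it in (PyTorch; a soft mask in [0, 1] is assumed):

```python
import torch

def apply_foreground_mask(feature_map, mask):
    """feature_map: (C, h, w); mask: (H, W) soft foreground mask in [0, 1].
    The mask is resized to the feature map and multiplied element-wise,
    suppressing background responses."""
    m = torch.nn.functional.interpolate(
        mask[None, None], size=feature_map.shape[-2:],
        mode="bilinear", align_corners=False)[0]
    return feature_map * m
```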

Foreground and background extraction

SPReID

Grid Features★

  • Grid features are relatively fine-grained, physically aligned regional features
  • Early work extended grid features into part features to compute the differences between the feature maps of two images
  • Recently, grid features have been used for partial ReID
  • In general, grid features are not commonly used

EAT

  • The backbone is a Siamese network that computes the differences between the 5×5 grid features of the two images
  • "Subject and object" are swapped to compute the difference maps K and K′ respectively
  • A binary-classification verification loss is computed

PersonNet

DSR

  • All grid features of an image are treated as one feature set
  • The distance between two feature sets is obtained by sparse reconstruction (see the sketch below)
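
A simplified sketch of the set distance: each grid feature of one image is reconstructed from the other image's feature set, here by plain least squares (DSR itself uses a sparsity-constrained solver), and the residuals are averaged:

```python
import numpy as np

def set_distance(X, Y):
    """X: (Nx, D) grid features of one image; Y: (Ny, D) of the other."""
    resid = 0.0
    for x in X:
        w, *_ = np.linalg.lstsq(Y.T, x, rcond=None)  # x ≈ Y.T @ w
        resid += np.linalg.norm(x - Y.T @ w)
    return resid / len(X)
```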

sequence re-identification

Characteristics

  • Rich pose variations
  • Occlusion is common
  • There are always some frames of good quality and some of poor quality
  • The information of the individual frames must be fused somehow

single frame → sequence

  • Extract a ReID feature from each frame of the sequence
  • Obtain the final ReID feature directly by average pooling or max pooling
  • Relatively simple; performance depends on the single-frame ReID model

CNN+LSTM

  • As in action recognition, a CNN extracts per-frame features and an LSTM then extracts temporal features (sketched below)
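
A minimal sketch of this pattern, assuming the per-frame CNN features are already computed upstream (class name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TemporalLSTMHead(nn.Module):
    """Aggregates per-frame CNN features into one sequence feature."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, frame_feats):        # (B, T, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]                     # (B, hidden) sequence feature
```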

Difficulties

  • How to perform feature fusion on multi-frame features?
  • How to judge the quality of each frame image?
  • How to extract motion features of sequence images?
  • How to solve the problem of inconsistent sequence frame number?
  • How to improve the computational efficiency of sequence ReID?

Academic attempts

AMOC

  • Motion (gait) features exist between frames and are also useful for the ReID task
  • Contains a spatial subnetwork and a motion subnetwork
    • The spatial subnetwork extracts the content features of a single frame
    • The motion subnetwork extracts the motion features between two adjacent frames
    • Content features and motion features are fused as the final feature of the frame
  • An RNN fuses the feature information of all frames
  • A contrastive loss judges whether two sequences belong to the same pedestrian ID

DFGP

  • The pedestrian feature of each frame is the traditional LOMO descriptor
  1. A PCN network extracts the features of each frame; average pooling yields the sequence feature, and the most stable frame (MSVP) is found
  2. The LOMO feature of the MSVP is extracted, its distance to each frame of the sequence is computed, and softmin normalization over the distances gives each frame's weight (see the weighting sketch after this list)
  3. Features × weights, followed by max pooling
  4. The pooled sequence feature is fused with the feature of the most stable frame as the final feature
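
A sketch of the softmin weighting in step 2, assuming per-frame features and the most-stable-frame feature are given (names are illustrative):

```python
import numpy as np

def frame_weights(frame_feats, stable_feat):
    """frame_feats: (T, D); stable_feat: (D,) feature of the most stable frame.
    Frames close to the stable frame get larger weights (softmin over distances)."""
    d = np.linalg.norm(frame_feats - stable_feat, axis=1)
    w = np.exp(-d)
    return w / w.sum()
```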

RQEN

  • Occlusion is very common in sequence re-identification and leads to uneven feature quality across frames
  1. Extract 14 key pose points for each frame of the pedestrian and divide them into 3 semantic parts

  2. When a pose point is occluded, the response value in the pose map is very low

  • The global branch extracts global features
  • The local branch extracts local features
  • The pose branch judges the quality (occlusion) of each frame

GAN-based approach

Pain points

  • Insufficient data → generate images
    • Governments limit the collection of surveillance data
    • Manually collecting and labeling data is expensive
    • Some extremely difficult corner-case samples are lacking
  • Biased data → reduce the bias
    • Bias between poses
    • Bias between cameras
    • Bias between regions (domains)

Composition

  • Generator: random noise → generated sample
  • Discriminator: judges whether a sample is real or generated (see the sketch below)
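
A minimal generator/discriminator sketch (PyTorch; the layer sizes are arbitrary placeholders, not any ReID paper's architecture):

```python
import torch
import torch.nn as nn

generator = nn.Sequential(             # random noise -> generated sample
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh())

discriminator = nn.Sequential(         # sample -> probability it is real
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, 128)               # batch of random numbers
fake = generator(z)                    # generated samples
p_real = discriminator(fake)           # discriminator's judgement
```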

Representative methods

GAN+LSRO

A GAN randomly generates pedestrian images, LSRO smooths their ID labels, and the network is trained with a cross-entropy loss

  • Since the images are randomly generated, their ID information is unreliable (see the loss sketch below)
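
A sketch of the LSRO idea: real images use normal cross-entropy on their ID label, while generated images are trained against a uniform distribution over all K identities (PyTorch; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def lsro_loss(logits, labels, is_generated):
    """logits: (B, K); labels: (B,) ID labels; is_generated: (B,) bool mask."""
    log_p = F.log_softmax(logits, dim=1)
    ce_real = F.nll_loss(log_p, labels, reduction="none")
    ce_fake = -log_p.mean(dim=1)      # cross-entropy against the uniform 1/K label
    return torch.where(is_generated, ce_fake, ce_real).mean()
```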

CamStyle

CycleGAN is used to perform style transfer between any two cameras

  • Original samples are trained with the ID loss; generated samples are trained with label-smoothed cross-entropy

PTGAN

Data collected in different scenes show obvious bias

  • PSPNet segments the pedestrian foreground mask
  • Image style is transferred following the CycleGAN idea
  • A generation loss over the mask area keeps the pedestrian foreground as unchanged as possible
  • Style loss and generation loss are trained jointly

SPGAN

Similar to PTGAN: source-domain data are translated into the target domain to reduce the obvious bias between data collected in different scenes

PNGAN

A GAN generates samples with fixed poses

  • The GAN generates samples in the target pose
  • The original image and the generated image each pass through their own ReID network
  • The features of the original and generated images are fused by max pooling as the final feature

Comparison

Algorithm | GAN+LSRO          | CamStyle        | PTGAN                   | SPGAN             | PNGAN
Base      | GAN               | CycleGAN        | CycleGAN                | CycleGAN          | InfoGAN
Addition  | label smoothing   | label smoothing | foreground segmentation | Siamese network   | pose estimation
Target    | data augmentation | camera bias     | cross-domain bias       | cross-domain bias | pose bias
