Pedestrian re-identification: short-term
- Challenge: intra-class differences are large while inter-class differences are small
Application - Pedestrian Tracking
- Single camera single target
- Single camera with multiple targets
- Multiple cameras and multiple targets
Pedestrian re-identification system
- Feature extraction: learn features that are robust to the appearance changes of pedestrians across different cameras
- Metric learning: map the learned features into a new space where the same person is closer and different people are farther apart
- Image retrieval: sort gallery images by their feature distance to the query and return the search results
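The retrieval step is essentially a nearest-neighbour sort over feature distances. A minimal NumPy sketch (Euclidean distance; the function name is hypothetical):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by Euclidean distance to the query.

    query_feat: (d,) query feature vector.
    gallery_feats: (n, d) matrix of gallery features.
    """
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists)  # nearest gallery image first

# Toy example: 3 gallery features; the second one is closest to the query.
q = np.array([1.0, 0.0])
g = np.array([[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
order = rank_gallery(q, g)
```

Rank-k accuracy then just asks whether the correct ID appears among the first k entries of `order`.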
evaluation
- single query vs. multi query: in single query, each person in the probe set has one image (N=1); in multi query, each person has N>1 images whose features are fused (max pooling or average pooling) into the final feature. At the same Rank-k, the larger N is, the higher the recognition rate generally is.
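The multi-query feature fusion described above can be sketched as follows (NumPy; the helper name is hypothetical):

```python
import numpy as np

def fuse_multi_query(feats, mode="avg"):
    """Fuse the N feature vectors of one probe identity into one feature.

    feats: (N, d) array of per-image features.
    mode: "avg" (average pooling) or "max" (max pooling).
    """
    if mode == "avg":
        return feats.mean(axis=0)
    return feats.max(axis=0)

# Two probe images of the same person, 2-D toy features.
f = np.array([[1.0, 4.0], [3.0, 2.0]])
avg_feat = fuse_multi_query(f, "avg")  # element-wise mean over the N images
max_feat = fuse_multi_query(f, "max")  # element-wise max over the N images
```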
feature
global features
A single feature is extracted from the entire image; global features retain no spatial information.
- Noisy regions can severely interfere with global features
- Pose misalignment also causes global features to mismatch
local features
Features are extracted from specific regions of the image, and the multiple local features are finally fused into the final feature
Horizontal slice ★★★
- The image is divided into equal horizontal stripes, and each stripe is turned into a feature by horizontal pooling
- Gate Siamese and AlignedReID fuse all local features and compute distances with designed rules
- PCB, ICNN, and SCPNet compute a ReID loss for each local feature and directly concatenate the local features
- Combining local features with global features usually gives better results
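A minimal sketch of horizontal slicing with average pooling per stripe (NumPy; some methods use max pooling instead, and the function name is hypothetical):

```python
import numpy as np

def horizontal_stripe_features(fmap, k):
    """Split a (h, w, c) feature map into k equal horizontal stripes and
    average-pool each stripe into a (c,) local feature. Returns (k, c)."""
    h = fmap.shape[0]
    assert h % k == 0, "height must divide evenly into k stripes"
    stripes = fmap.reshape(k, h // k, fmap.shape[1], fmap.shape[2])
    return stripes.mean(axis=(1, 2))  # pool over each stripe's h and w

# Toy 8x4x2 feature map, sliced into 8 stripes (one per row).
fmap = np.arange(8 * 4 * 2, dtype=float).reshape(8, 4, 2)
local_feats = horizontal_stripe_features(fmap, 8)  # shape (8, 2)
```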
Gate Siamese
- Each stripe is passed through a CNN to get local features, which are fed into an LSTM in order and automatically aggregated into the image's final feature
- The network is trained with a contrastive loss
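The contrastive loss mentioned above, in a minimal NumPy form (the margin value here is an illustrative assumption):

```python
import numpy as np

def contrastive_loss(f1, f2, same_id, margin=1.0):
    """Contrastive loss on a pair of image/sequence features: pulls
    same-ID pairs together and pushes different-ID pairs at least
    `margin` apart."""
    d = np.linalg.norm(f1 - f2)
    if same_id:
        return d ** 2
    return max(0.0, margin - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.6, 0.8])                       # distance ≈ 1.0
pos = contrastive_loss(a, b, same_id=True)     # penalized: same ID, far apart
neg = contrastive_loss(a, b, same_id=False)    # ≈ 0: already beyond the margin
```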
AlignedReID
- Mainly addresses the pose misalignment problem
- The backbone network is ResNet50
Dynamic Alignment (DMLI)
- If the input image is 256×128, the output feature map size is 8×4×2048
- Horizontal pooling yields 8 local features, and an 8×8 distance matrix is computed between the two images' local features
- Local alignment must preserve the top-to-bottom order; matches cannot cross each other
- The shortest path through the distance matrix gives the optimal dynamic alignment
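The shortest-path alignment can be sketched as a small dynamic program over the local distance matrix (NumPy; note that AlignedReID additionally normalizes the local distances before this step, which is omitted here):

```python
import numpy as np

def shortest_path_distance(D):
    """Given an (m, n) matrix of local distances between the horizontal
    stripes of two images, return the cost of the shortest top-left ->
    bottom-right path that moves only right or down, so stripes are
    matched in top-to-bottom order without crossing."""
    m, n = D.shape
    S = np.zeros((m, n))
    S[0, 0] = D[0, 0]
    for j in range(1, n):                      # first row: only moves right
        S[0, j] = S[0, j - 1] + D[0, j]
    for i in range(1, m):
        S[i, 0] = S[i - 1, 0] + D[i, 0]        # first column: only moves down
        for j in range(1, n):
            S[i, j] = min(S[i - 1, j], S[i, j - 1]) + D[i, j]
    return S[-1, -1]

# Toy 2x2 distance matrix: diagonal stripes match well (distance 0).
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])
dist = shortest_path_distance(D)
```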
PCB
- Input image 384×128, divided into 6 horizontal stripes
- ResNet50 extracts features; the final feature map is 24×8
- Each stripe yields a local feature with its own ReID loss
- At test time, the 6 local features are concatenated
ICNN
ICNN ≈ PCB + a global branch trained with a triplet loss
SCPNet
Use the spatial part features to supervise the channel-group features, passing local information into the global feature
Pose information ★★★
- Use a pose estimation model to obtain the (14) pose keypoints of a pedestrian
- Obtain part regions with semantic information from the keypoints
- Extract a local feature for each part region
- Combining local features with global features usually gives better results
- Pose estimation models: Hourglass, OpenPose, CPM, AlphaPose
- Part: rectangular box regions set manually according to certain rules
- Attention: arbitrarily shaped salient regions learned automatically by the network
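One simple way to turn keypoints into rectangular part regions, as the "Part" bullet describes (pure-Python sketch; the padding rule is a made-up example of a manual rule):

```python
def part_box(keypoints, pad=10):
    """Given a list of (x, y) pose keypoints belonging to one body part,
    return a padded bounding rectangle (x1, y1, x2, y2) around them.
    `pad` is a hypothetical margin in pixels."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)

# Toy "upper body" keypoints: shoulders and hips.
box = part_box([(40, 60), (80, 60), (45, 120), (75, 120)])
```

The crop defined by `box` is then fed to a local-feature branch.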
PIE
- CPM extracts the pose keypoints
- The body is divided into several parts, which are aligned by affine transformations
- The features of the original image and the affine-transformed image are fused
- The network is trained with an ID loss
Spindle Net
- The FEN network extracts features, and the FFN network fuses the features hierarchically
PDC
- Use pose keypoint information to divide the body into six parts
- Improve the STN into a PTN that learns affine transformation parameters to produce corrected part images
- Fuse global and local features
- Compute three ReID losses
- Shallow layers are shared; high-level layers are independent
GLAD
- Divided into three parts: head, upper body and lower body
- Fusion of global features and features of three parts
PABP
- Extract feature map A using ReID network
- Use OpenPose to extract the pose feature map P
- At each pixel position, the corresponding vectors of A and P are combined by an outer product and vectorized
- Appearance features at the corresponding body locations are activated
Segmentation information ★★
- Semantic segmentation provides extremely fine, pixel-level part information
- Segmentation splits into coarse-grained pedestrian foreground segmentation and fine-grained body-part semantic segmentation
- The segmentation result is usually multiplied in, either as a preprocessing mask on the image or as attention on the feature map
- Current segmentation-based methods have not achieved particularly wide application
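Using a segmentation mask as attention on a feature map is just an element-wise multiply (NumPy sketch; the function name is hypothetical):

```python
import numpy as np

def apply_mask(fmap, mask):
    """Use a foreground mask as soft attention: multiply each channel of a
    (h, w, c) feature map by a (h, w) segmentation mask in [0, 1]."""
    return fmap * mask[:, :, None]  # broadcast mask across channels

fmap = np.ones((2, 2, 3))           # toy 2x2 map with 3 channels
mask = np.array([[1.0, 0.0],        # 0 suppresses background positions,
                 [0.5, 1.0]])       # fractional values give soft attention
out = apply_mask(fmap, mask)
```

The same multiply applied to the raw image pixels gives the "preprocessing mask" variant.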
Foreground and background extraction
SPReID
Grid Features★
- Grid features are relatively fine-grained local region features
- Early work extended grid features to part features by computing differences between the feature maps of two images
- More recent work uses grid features to solve partial ReID
- In general, grid features are not widely used
EAT
- The backbone is a Siamese network that computes the differences between the 5×5 grid features of the two images
- Swapping "subject and object" gives K and K′ respectively
- A binary-classification verification loss is computed
PersonNet
DSR
- Take all the grid features of an image as a feature set
- Sparsely reconstruct one feature set from the other to obtain a set-to-set distance
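A simplified sketch of the set-to-set reconstruction distance (NumPy; the original DSR uses sparse coding, while plain least squares is used here to keep the sketch short):

```python
import numpy as np

def reconstruction_distance(X, Y):
    """Reconstruct each grid feature of the probe (columns of X, shape
    (d, m)) from the gallery feature set (columns of Y, shape (d, n)) by
    least squares, and return the mean residual norm as the set distance."""
    C, *_ = np.linalg.lstsq(Y, X, rcond=None)     # coefficients: X ≈ Y @ C
    residual = X - Y @ C
    return float(np.mean(np.linalg.norm(residual, axis=0)))

Y = np.array([[1.0, 0.0],   # gallery set spans the x-y plane
              [0.0, 1.0],
              [0.0, 0.0]])
X = np.array([[2.0],        # probe feature lies in that plane,
              [3.0],        # so it is fully reconstructable
              [0.0]])
d = reconstruction_distance(X, Y)
```

A probe that the gallery set cannot reconstruct gets a large residual, hence a large distance.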
sequence re-identification
- Rich pose variation
- Occlusion is common
- There are always a few frames of good quality and a few frames of poor quality
- Need to consider how to fuse the information of each frame
single frame → sequence
- Extract a ReID feature from each frame
- Obtain the final ReID feature directly via average or max pooling
- Relatively simple; performance depends on the single-frame ReID model
CNN+LSTM
- Similar to action recognition: a CNN extracts per-frame features, then an LSTM extracts temporal features
difficulties
- How to perform feature fusion on multi-frame features?
- How to judge the quality of each frame image?
- How to extract motion features of sequence images?
- How to solve the problem of inconsistent sequence frame number?
- How to improve the computational efficiency of sequence ReID?
academic attempts
AMOC
- Motion (gait) features exist between frames and also benefit the ReID task
- Contains a spatial sub-network and a motion sub-network
- The spatial sub-network extracts content features from each single frame
- The motion sub-network extracts motion features from two adjacent frames
- Content and motion features are fused as the frame's final feature
- An RNN fuses the feature information of all frames
- A contrastive loss judges whether two sequences belong to the same pedestrian ID
DFGP
- Per-frame pedestrian features are extracted with the traditional hand-crafted LOMO feature
- The PCN network also extracts per-frame features, which are average-pooled into a sequence feature used to find the most stable frame (MSVP)
- The LOMO feature of the MSVP is compared against each frame of the sequence, and the distances are softmin-normalized into per-frame weights
- Features are multiplied by their weights and then max-pooled
- The pooled sequence feature is fused with the most stable frame's feature as the final feature
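The softmin weighting step can be sketched as follows (NumPy; the function name is hypothetical):

```python
import numpy as np

def softmin_weights(dists):
    """Turn per-frame distances to the most stable frame into weights:
    closer frames get larger weights via a softmin over distances."""
    z = np.exp(-np.asarray(dists, dtype=float))
    return z / z.sum()  # weights are positive and sum to 1

# Frame 0 is closest to the most stable frame, frame 2 is farthest.
w = softmin_weights([0.0, 1.0, 2.0])
```

Multiplying each frame's feature by its weight before pooling down-weights low-quality frames.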
RQEN
- Occlusion is very common in sequence re-identification and causes uneven feature quality across frames
- Extract 14 pose keypoints for each frame and divide them into 3 semantic parts
- When a keypoint is occluded, its response in the pose map is very low
- The global branch extracts global features
- The local branch extracts local features
- The pose branch judges the quality (occlusion) of each frame
GAN-based approach
Pain points
- Insufficient data → generate images
- Regulations limit the collection of surveillance data
- Manually collecting and labeling data is expensive
- Extremely hard samples are lacking
- The data is biased → reduce the bias
- Pose-to-pose bias
- Camera-to-camera bias
- Region-to-region bias
components
- Generator: random noise → generated sample
- Discriminator: judges whether a sample is real or generated
representative methods
GAN+LSRO
A GAN randomly generates pedestrian images; LSRO smooths their ID labels, and the network is trained with a cross-entropy loss
- The images are randomly generated, so their ID information is unreliable
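A minimal sketch of the LSRO idea: real images keep their one-hot ID label, while generated images, whose ID is unreliable, get a uniform 1/K label (NumPy; the logits are toy values):

```python
import numpy as np

def lsro_cross_entropy(logits, num_classes, real_label=None):
    """Cross-entropy with LSRO-style targets: a one-hot label for real
    images, a uniform 1/K label for GAN-generated images."""
    p = np.exp(logits - logits.max())       # stable softmax
    p /= p.sum()
    if real_label is not None:
        target = np.eye(num_classes)[real_label]
    else:
        target = np.full(num_classes, 1.0 / num_classes)
    return float(-(target * np.log(p)).sum())

logits = np.array([2.0, 0.5, 0.1])          # toy 3-ID classifier output
real_loss = lsro_cross_entropy(logits, 3, real_label=0)
fake_loss = lsro_cross_entropy(logits, 3)   # uniform target for a generated image
```

The uniform target stops the network from committing a generated image to any single ID while still using it as training data.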
CamStyle
Using CycleGAN to achieve style transfer between any two cameras
- Original samples are trained with the ID loss; generated samples use smoothed labels with the cross-entropy loss
PTGAN
Data collected in different scenes show obvious bias
- PSPNet segments the pedestrian foreground mask
- Image style transfer follows the idea of CycleGAN
- A generation loss is computed on the mask region to keep the pedestrian foreground as unchanged as possible
- The style loss and generation loss are trained jointly
SPGAN
Similar to PTGAN: source-domain data are translated into the target domain to address the obvious bias between data collected in different scenes
PNGAN
Use a GAN to generate samples with fixed poses
- A GAN generates samples in the target poses
- The original image and the generated image enter two separate ReID networks
- The features of the original and generated images are fused by max pooling as the final feature
comparison

algorithm | GAN+LSRO | CamStyle | PTGAN | SPGAN | PNGAN
---|---|---|---|---|---
base | DCGAN | CycleGAN | CycleGAN | CycleGAN | InfoGAN
additional | label smoothing | label smoothing | foreground segmentation | Siamese network | pose estimation
target | data augmentation | camera bias | data domain bias | data domain bias | pose bias