MDNet, SiamFC, ADNet, CFNet, LSTM (RNN)... Have you mastered them all? This article summarizes the classic models essential for target tracking (Part 1).


Computer Vision Research Institute column

Author: Edison_G

This article is serialized in two parts, introducing a total of 10 classic models that have achieved SOTA on target tracking tasks.

  • Part 1: MDNet, SiamFC, ADNet, CFNet, LSTM (RNN)

  • Part 2: SiamRPN, SiamMask, UpdateNet, SiamAttn, SiamGAT

You are reading Part 1.

Reposted from "Heart of the Machine"

[The MDNet and SiamFC sections and the opening of the ADNet section are embedded as images in the original post and are not reproduced here.]

3、ADNet

Figure 3. The concept of visual tracking controlled by sequential actions proposed in the paper. The first column shows the initial position of the target, and the second and third columns show the iterative flow of actions that finds the target bounding box in each frame.

The complete network architecture of ADNet is as follows:

Figure 4. Network structure. Dashed lines indicate state transitions. In this example, the "move right" action is selected to capture the target object. This action decision process is repeated until the final position of the target in each frame is determined.

First, the reinforcement learning part.
(1) State. The state s_t consists of two parts, p_t and d_t: p_t is the image patch inside the bounding box currently being tracked, and d_t is an 11×10 = 110-dimensional vector storing the previous 10 actions, each encoded as an 11-dimensional one-hot vector (one dimension per action).
(2) Action. There are 11 actions in 3 categories. The first category is movement: left, right, up, down, and their fast versions; the second is scaling: scale up and scale down; the third is stop, which terminates the action sequence.
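As a concrete illustration, the sketch below (plain NumPy, not the authors' code) encodes the 11-action vocabulary and the 110-dimensional action-history vector d_t described above; the action names are assumptions chosen for readability.

```python
import numpy as np

# Hypothetical names for the 11 ADNet actions (8 moves, 2 scale changes, stop).
ACTIONS = [
    "left", "right", "up", "down",
    "fast_left", "fast_right", "fast_up", "fast_down",
    "scale_up", "scale_down", "stop",
]
NUM_ACTIONS = len(ACTIONS)   # 11
HISTORY_LEN = 10             # number of past actions kept in the state

def encode_action_history(past_actions):
    """Return the 110-d dynamics vector d_t: the last 10 actions, one-hot encoded."""
    d = np.zeros((HISTORY_LEN, NUM_ACTIONS), dtype=np.float32)
    for slot, a in enumerate(past_actions[-HISTORY_LEN:]):
        d[slot, ACTIONS.index(a)] = 1.0
    return d.reshape(-1)     # shape (110,)

# Example: the state is the pair (image patch p_t, dynamics vector d_t).
d_t = encode_action_history(["left", "left", "scale_up"])
print(d_t.shape)             # (110,)
```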


(3) State transition. Each action updates the tracked box b = [x, y, w, h] by a translation unit that is proportional to the current box size (Δx ∝ w, Δy ∝ h): the regular move actions shift the box by one unit in the corresponding direction, the fast move actions shift it by a larger step, and the scale actions enlarge or shrink the box around its center.
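A minimal sketch of these transitions, assuming Δx = α·w and Δy = α·h as in the paper; the concrete value of α and the doubled step for the fast moves are treated as configurable constants here, not as the definitive settings.

```python
def apply_action(box, action, alpha=0.03, fast_factor=2.0):
    """Apply one ADNet-style action to box = [x, y, w, h] (top-left corner + size).

    alpha and fast_factor are assumed values; the step is proportional to the
    current box size, e.g. dx = alpha * w, dy = alpha * h.
    """
    x, y, w, h = box
    dx, dy = alpha * w, alpha * h
    if action == "left":         x -= dx
    elif action == "right":      x += dx
    elif action == "up":         y -= dy
    elif action == "down":       y += dy
    elif action == "fast_left":  x -= fast_factor * dx
    elif action == "fast_right": x += fast_factor * dx
    elif action == "fast_up":    y -= fast_factor * dy
    elif action == "fast_down":  y += fast_factor * dy
    elif action == "scale_up":   # enlarge around the center
        x, y, w, h = x - dx / 2, y - dy / 2, w + dx, h + dy
    elif action == "scale_down": # shrink around the center
        x, y, w, h = x + dx / 2, y + dy / 2, w - dx, h - dy
    # "stop" leaves the box unchanged and terminates the sequence.
    return [x, y, w, h]
```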

(4) Reward function. Assuming the action sequence has length T, the reward is zero for intermediate steps and is given only at termination: +1 if the final box overlaps the ground truth sufficiently (IoU above a threshold, 0.7 in the paper) and -1 otherwise. An action sequence terminates in two situations: ① the stop action is selected; ② the actions begin to oscillate (e.g., {left, right, left}).
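A hedged sketch of the terminal reward and the oscillation check. The IoU threshold follows the value mentioned above; the oscillation test simply looks for an action, its inverse, and then the action again (as in {left, right, left}), which is one possible way to implement the rule and an assumption here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def terminal_reward(final_box, gt_box, thr=0.7):
    """Reward given only at termination: +1 if the final box overlaps the GT enough, else -1."""
    return 1.0 if iou(final_box, gt_box) > thr else -1.0

def is_oscillating(actions):
    """Detect patterns like {left, right, left}: an action, its inverse, then the action again."""
    inverse = {"left": "right", "right": "left", "up": "down", "down": "up",
               "scale_up": "scale_down", "scale_down": "scale_up"}
    if len(actions) < 3:
        return False
    a, b, c = actions[-3], actions[-2], actions[-1]
    return a == c and inverse.get(a) == b
```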

Then the training part.
(1) Supervised learning stage. This stage trains {w1, w2, ..., w7}. The action label of a training sample is the action that, once applied to the sample box, gives the largest overlap (IoU) with the ground truth. The class (confidence) label is positive when the sample itself overlaps the ground truth sufficiently (IoU above 0.7 in the paper) and negative otherwise. The loss is the sum of the cross-entropy of the predicted action and the cross-entropy of the predicted class.
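Under those assumptions, a minimal sketch of label generation and the two-headed cross-entropy loss could look like this (PyTorch; `apply_action` and `iou` are the helpers sketched earlier, and the threshold values are the ones quoted above, not verified constants).

```python
import torch
import torch.nn.functional as F

ACTION_SET = ("left", "right", "up", "down",
              "fast_left", "fast_right", "fast_up", "fast_down",
              "scale_up", "scale_down", "stop")

def action_label(sample_box, gt_box, actions=ACTION_SET):
    """The action label is the action whose resulting box best overlaps the ground truth."""
    scores = [iou(apply_action(list(sample_box), a), gt_box) for a in actions]
    return int(max(range(len(actions)), key=lambda i: scores[i]))

def class_label(sample_box, gt_box, thr=0.7):
    """Positive (1) if the sample itself overlaps the ground truth enough, else negative (0)."""
    return 1 if iou(sample_box, gt_box) > thr else 0

def supervised_loss(action_logits, class_logits, action_targets, class_targets):
    """Sum of the two cross-entropy terms (action head + confidence head)."""
    return (F.cross_entropy(action_logits, action_targets)
            + F.cross_entropy(class_logits, class_targets))
```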

(2) Reinforcement learning stage. This stage maximizes the expected sum of tracking rewards with stochastic gradient ascent: the log-probabilities of the actions taken during tracking simulations are weighted by the obtained tracking scores (a REINFORCE-style policy gradient).
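A generic sketch of such a policy-gradient update (not the authors' code; the `policy_net(patches, dynamics)` interface is an assumption): maximizing the expected reward is implemented as minimizing the negative of the score-weighted log-probabilities.

```python
import torch

def reinforce_update(policy_net, optimizer, episodes):
    """One SGD step over simulated tracking episodes.

    episodes: list of (patch_batch, dynamics_batch, action_indices, tracking_score),
    where action_indices is a LongTensor of the actions taken and tracking_score is +1/-1.
    """
    optimizer.zero_grad()
    loss = 0.0
    for patches, dynamics, actions, z in episodes:
        logits = policy_net(patches, dynamics)                # (T, 11) action scores
        log_probs = torch.log_softmax(logits, dim=-1)
        taken = log_probs[torch.arange(len(actions)), actions]
        loss = loss - z * taken.sum()                         # gradient ascent on z * log pi(a|s)
    loss.backward()
    optimizer.step()
```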

This framework can train ADNet even when the ground truth {G_l} is only partially given, i.e., the semi-supervised setting shown in Figure 5. A purely supervised framework cannot learn from unlabeled frames, whereas reinforcement learning can exploit them in a semi-supervised manner. To train ADNet with RL, the tracking scores {z_{t,l}} must be determined, but they cannot be computed directly on unlabeled frames. Instead, tracking scores are assigned from the rewards obtained in tracking simulations: if a tracking simulation through an unlabeled stretch is evaluated as successful at the next labeled frame, the tracking score for the unlabeled frames is set to z_{t,l} = +1; if unsuccessful, it is set to -1, as shown in Figure 5.

Figure 5. Illustration of a tracking simulation on the Walking2 sequence in the semi-supervised setting. The red and blue boxes denote the ground-truth and predicted target locations, respectively. In this example, only frames #160, #190 and #220 are annotated. With consecutive actions, the agent receives a +1 reward at frame #190 and a -1 reward at frame #220; therefore, the tracking score from frame #161 to #190 is +1, and the tracking score from #191 to #220 is -1.
(3) Online adaptation. During online updates, only {w1, w2, ..., w7} are updated. The network is fine-tuned using samples from previous frames whose confidence score is greater than 0.5. If the confidence score of the current estimate is below -0.5, the target is considered lost and re-detection is performed.
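A small sketch of this online-adaptation logic: keep samples from confident frames, and trigger re-detection followed by fine-tuning when the confidence drops too low. The buffer structure and the callback names are assumptions, and the real tracker also fine-tunes periodically rather than only on failure.

```python
def online_step(confidence, frame_samples, sample_buffer,
                finetune_fn, redetect_fn,
                keep_thr=0.5, lost_thr=-0.5):
    """One frame of online adaptation for an ADNet-style tracker.

    confidence   : confidence score of the current target estimate
    frame_samples: training samples gathered around the current estimate
    sample_buffer: samples kept from previous confident frames
    """
    if confidence > keep_thr:
        # Confident frame: remember its samples for later fine-tuning.
        sample_buffer.extend(frame_samples)
    if confidence < lost_thr:
        # Target considered lost: re-detect, then fine-tune on the stored samples.
        redetect_fn()
        finetune_fn(sample_buffer)
    return sample_buffer
```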

4、CFNet

The scene (domain) differences between datasets are large and the disparity distributions are unbalanced, which greatly limits the application of existing deep stereo matching models in practice. To improve the robustness of the stereo matching network, this paper proposes CFNet, a cost volume network based on cascading and fusion. Specifically, the authors introduce a CF (Correlation Filter) layer into a SiamFC-style structure and train the network end to end, showing that the number of convolutional layers can be reduced without sacrificing accuracy. The overall structure of CFNet is shown in Figure 6:

Figure 6. The overall structure of CFNet. The network consists of three parts: a pyramid feature extraction network, a fused cost volume, and cascaded cost volumes.

The CFNet network consists of three parts: the pyramid feature extraction network, the fused cost volume, and the cascaded cost volume.

Pyramid feature extraction network. This is an encoder-decoder structure with skip connections, built from 5 residual blocks, that extracts multi-scale image features. It is followed by an SPP (Spatial Pyramid Pooling) module to better incorporate the contextual information of the multi-scale features; the SPP module pools the features at several sizes and then fuses the results.
Fused cost volume. The paper proposes fusing several low-resolution dense cost volumes (cost volumes at less than 1/4 of the input resolution: 1/8, 1/16, and 1/32 in the code) to reduce domain shifts between datasets in the initial disparity estimation. Many works have recognized the importance of multi-scale cost volumes, but they generally assume that low-resolution cost volumes contain too little feature information to produce accurate disparity maps and therefore discard them. This paper instead argues that low-resolution cost volumes at different scales can be fused to extract a global, structured representation, from which a more accurate (robust) initial disparity map can be generated. Specifically, a low-resolution cost volume is built at each scale using both feature concatenation and group-wise correlation, and these volumes are combined into the fused cost volume.
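As a rough sketch of how such a combined concatenation / group-wise-correlation cost volume can be built (a simplified stand-in, not the CFNet code; the tensor layout and group count are assumptions):

```python
import torch

def build_cost_volume(feat_l, feat_r, max_disp, num_groups=8):
    """Build a cost volume from left/right features of shape (B, C, H, W).

    For each disparity d, the right features are shifted by d and compared with the
    left features via group-wise correlation; the raw features are also concatenated
    so that both kinds of evidence are available to the later 3D convolutions.
    Assumes C is divisible by num_groups.
    """
    B, C, H, W = feat_l.shape
    ch_per_group = C // num_groups
    gwc = feat_l.new_zeros(B, num_groups, max_disp, H, W)
    cat = feat_l.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        l = feat_l[..., d:] if d > 0 else feat_l
        r = feat_r[..., :-d] if d > 0 else feat_r
        # Group-wise correlation: average inner product within each channel group.
        corr = (l * r).view(B, num_groups, ch_per_group, H, W - d).mean(dim=2)
        gwc[:, :, d, :, d:] = corr
        # Concatenation volume: stack left and shifted right features.
        cat[:, :C, d, :, d:] = l
        cat[:, C:, d, :, d:] = r
    return torch.cat([gwc, cat], dim=1)   # (B, num_groups + 2C, max_disp, H, W)
```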

Next, the cost volumes are fused. As shown in Figure 7, each cost volume first passes through four 3D convolutional layers with skip connections (the first four blue blocks of each branch). A 3D convolution with stride 2 reduces the resolution of scale 3 from 1/8 to 1/16; the downsampled scale-3 volume is then concatenated with the scale-4 volume, and an additional 3D convolution rescales the feature channels. Similar operations gradually downsample the scale-3 cost volume to 1/32 of the input resolution and fuse it with scale 5. Finally, 3D transposed convolutions upsample the cost volume, with feature information used for refinement at each step. The initial disparity map is obtained by disparity regression (the soft argmin operation) on the refined cost volume, i.e., the expected disparity under the softmax of the negative costs.

Figure 7. Structure of the cost volume fusion module. Three low-resolution cost volumes (i ∈ {3, 4, 5}) are fused to generate the initial disparity map.
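A minimal sketch of the soft argmin (disparity regression) step, assuming a cost volume of shape (B, D, H, W) where a lower cost means a better match:

```python
import torch

def soft_argmin(cost_volume):
    """Disparity regression: expected disparity under the softmax of the negative costs.

    cost_volume: tensor of shape (B, D, H, W); returns disparities of shape (B, H, W).
    """
    B, D, H, W = cost_volume.shape
    prob = torch.softmax(-cost_volume, dim=1)                  # disparity probability distribution
    disp_values = torch.arange(D, dtype=prob.dtype, device=prob.device).view(1, D, 1, 1)
    return (prob * disp_values).sum(dim=1)                     # per-pixel expected disparity
```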

Cascaded cost volumes. Given the initial disparity, the next step is to build higher-resolution cost volumes and refine the disparity map. Ideally the disparity probability distribution is unimodal: the probability of one disparity is very high and the probabilities of all other disparities are very low. In practice, however, the distribution is often multimodal, i.e., the disparity at a position is uncertain, which typically happens in occluded and textureless regions. The paper therefore defines an uncertainty estimate that quantifies how strongly the disparity probability distribution tends toward being multimodal. The disparity search range of the next stage is computed from the uncertainty of the current stage, and the discrete disparity hypotheses of the next stage are then obtained by uniformly sampling within that range.
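A hedged sketch of that idea: widen the per-pixel search interval around the current disparity in proportion to the uncertainty, then sample it uniformly. The exact CFNet formulas are not reproduced here, and the scaling constants below are assumptions for illustration only.

```python
import torch

def next_stage_hypotheses(disparity, uncertainty, num_samples=12, scale=4.0, offset=1.0):
    """Per-pixel uniform sampling of disparity hypotheses for the next stage.

    disparity, uncertainty: tensors of shape (B, H, W).
    Returns hypotheses of shape (B, num_samples, H, W) covering, at each pixel,
    the interval [d - (scale*u + offset), d + (scale*u + offset)].
    """
    half_range = scale * uncertainty + offset                  # wider range where uncertainty is high
    d_min = disparity - half_range
    d_max = disparity + half_range
    steps = torch.linspace(0, 1, num_samples, device=disparity.device).view(1, num_samples, 1, 1)
    return d_min.unsqueeze(1) + steps * (d_max - d_min).unsqueeze(1)
```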


5、LSTM (RNN)

This paper presents an online tracking method that encodes long-term dependencies across multiple cues. To address the difficulty of distinguishing and tracking objects that are occluded or surrounded by similar-looking objects, the paper proposes an RNN-based architecture that combines multiple cues over a temporal window. With this method, data-association errors can be corrected and the target can be recovered after occlusion. The paper demonstrates the robustness of this data-driven tracking algorithm using three cues: object appearance, motion, and interaction.


Figure 8. The proposed RNN-based structure (each RNN is depicted as a trapezoid), which learns to encode long-term temporal dependencies across multiple cues (appearance, motion, and interaction). The learned representation is used to compute the similarity scores of the tracking-by-detection algorithm.

The paper introduces a new way to compute the similarity. Three feature extraction modules feed their features into three RNNs, (A), (M), and (I), which produce the corresponding feature vectors (ϕA, ϕM, ϕI); these vectors are then fed into a fourth RNN (O) that combines the multiple cues into the final feature vector ϕ(t, d), which is used to compute the similarity between target t and detection d.

First, the appearance model (A). The appearance model mainly addresses the re-identification problem, and it must also cope with occlusion and other visual difficulties. It is an RNN built from a CNN and an LSTM. The target images of a trajectory in different frames are passed through the CNN to obtain 500-dimensional feature vectors, and the sequence of these vectors is passed through the LSTM to obtain an H-dimensional feature vector. The current detection is also passed through the CNN to obtain an H-dimensional feature vector; the two H-dimensional vectors are concatenated and passed through an FC layer to obtain the k-dimensional feature used to discriminate appearance. The final ϕA feature thus encodes whether, based on the long-term appearance of target i and the appearance of detection j, the two belong to the same object. The appearance model is shown in Figure 9:

Figure 9. Appearance model. The inputs are the bounding boxes of target i from time 1 to t and the detection j at time t+1 that we wish to compare. The output is a feature vector ϕA that encodes whether the bounding box at time t+1 corresponds to the specific object i observed at times 1, 2, ..., t. A CNN is used as the appearance feature extractor.
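A simplified sketch of an appearance module of this kind (per-frame CNN features → LSTM → concatenate with the detection's feature → FC). The 500-dimensional CNN feature follows the description above; the hidden size, output size, and the extra projection of the detection feature are assumptions, and this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AppearanceModel(nn.Module):
    """CNN-per-frame features -> LSTM summary -> compare with the detection's feature."""

    def __init__(self, cnn, feat_dim=500, hidden_dim=128, out_dim=64):
        super().__init__()
        self.cnn = cnn                                   # e.g. a truncated VGG-16 with a 500-d FC head
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.det_fc = nn.Linear(feat_dim, hidden_dim)    # map the detection feature to H dims
        self.out_fc = nn.Linear(2 * hidden_dim, out_dim) # k-dimensional appearance feature phi_A

    def forward(self, track_images, det_image):
        # track_images: (B, T, 3, h, w) crops of target i; det_image: (B, 3, h, w) detection j.
        B, T = track_images.shape[:2]
        frame_feats = self.cnn(track_images.flatten(0, 1)).view(B, T, -1)   # (B, T, 500)
        _, (h_n, _) = self.lstm(frame_feats)
        track_feat = h_n[-1]                             # (B, hidden_dim) trajectory summary
        det_feat = self.det_fc(self.cnn(det_image))      # (B, hidden_dim) detection feature
        return self.out_fc(torch.cat([track_feat, det_feat], dim=1))        # phi_A, (B, out_dim)
```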

Next, the motion model (M). The motion model is mainly used to judge whether the target is occluded or otherwise disturbed; its weakness is that it degrades when confronted with distracting detections, which is why an LSTM is used to handle such cases. Apart from dropping the CNN, the motion model has the same structure as the appearance model: the input changes from an image to a motion feature, mainly the velocity in the x and y directions, while the output dimensions and the pre-training procedure stay the same (Figure 10). The 2-dimensional velocity feature v_i^t produced by the motion feature extractor is the displacement of the target's bounding-box center between consecutive frames.

Figure 10. Motion model. The input is the 2D velocity of the target (on the image plane). The output is a feature vector ϕM encoding whether the velocity v_j^{t+1} is consistent with the true trajectory v_i^1, v_i^2, ..., v_i^t.
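A trivial sketch of the velocity feature under the assumption above (bounding-box centers in image coordinates):

```python
def velocity_feature(prev_box, cur_box):
    """2D velocity as the frame-to-frame displacement of the box center; boxes are [x, y, w, h]."""
    prev_cx, prev_cy = prev_box[0] + prev_box[2] / 2, prev_box[1] + prev_box[3] / 2
    cur_cx, cur_cy = cur_box[0] + cur_box[2] / 2, cur_box[1] + cur_box[3] / 2
    return (cur_cx - prev_cx, cur_cy - prev_cy)
```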


Finally, the interaction model (I). The interaction model handles the relationships between the target and the objects around it. Since the number of other targets near a given target changes over time, to keep the network input size fixed the paper models each target's neighborhood as a fixed-size occupancy grid (a binary occupancy grid map). The structure is the same as that of the motion model; only the input changes from velocities to the occupancy grid map, and everything else stays the same. The interaction feature extractor builds a grid centered on the target and flattens it into a vector: if the bounding-box center of a nearby object falls in cell (m, n), that cell is set to 1; unoccupied cells are 0. The network structure is shown in Figure 11.

Figure 11. Interaction model. The input is the occupancy grid (on the image plane) over time. The output is a feature vector ϕI encoding whether the occupancy grid at time t+1 is consistent with the true trajectory of occupancy grids at times 1, 2, ..., t.
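A small sketch of building such an occupancy grid around a target; the grid size and cell size are assumed values.

```python
import numpy as np

def occupancy_grid(target_box, neighbor_boxes, grid_size=7, cell_size=15):
    """Binary occupancy grid centered on the target; boxes are [x, y, w, h].

    A cell is set to 1 if the center of a neighboring box falls inside it.
    Returns a flattened vector of length grid_size * grid_size.
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    tcx, tcy = target_box[0] + target_box[2] / 2, target_box[1] + target_box[3] / 2
    half = grid_size // 2
    for box in neighbor_boxes:
        cx, cy = box[0] + box[2] / 2, box[1] + box[3] / 2
        m = int(round((cx - tcx) / cell_size)) + half   # column index relative to the target
        n = int(round((cy - tcy) / cell_size)) + half   # row index relative to the target
        if 0 <= m < grid_size and 0 <= n < grid_size:
            grid[n, m] = 1.0
    return grid.reshape(-1)
```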

After the k-dimensional feature vectors are extracted by the appearance, motion, and interaction models, they are concatenated and used as the input of the target RNN (O). The training process has two steps:
First, independently pre-train the three sub-module RNNs (A/M/I) and the appearance feature extractor CNN. The CNN starts from pre-trained VGG-16 weights, with the last fully connected layer removed and a 500-dimensional fully connected layer added; this backbone is used to build a Siamese network trained on a re-identification dataset. The trained CNN is then used on its own for feature extraction, producing highly discriminative 500-dimensional appearance features. Each of the three sub-module RNNs is pre-trained with a softmax classifier for binary (0/1) classification: a softmax layer is added on top of the RNN's k-dimensional output to predict the probability of the positive/negative class, where the positive class means that target i and detection j belong to the same object and the negative class means the opposite.
Second, jointly train the target RNN (O) together with the three sub-module RNNs, updating their parameters simultaneously while keeping the CNN frozen. This is an end-to-end training process: the target RNN outputs the similarity between a detection and a target and is trained with a softmax classifier and a cross-entropy loss.
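A condensed sketch of the combination stage described above: the three k-dimensional cue features are concatenated, passed through the target RNN (O), and scored with a two-class softmax trained by cross-entropy. The layer sizes and the use of a single-step LSTM here are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetRNN(nn.Module):
    """Combines phi_A, phi_M, phi_I into a similarity score between a target and a detection."""

    def __init__(self, k=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(3 * k, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)   # same object vs. different object

    def forward(self, phi_a, phi_m, phi_i):
        # Each phi_* has shape (B, k); treat the concatenation as a length-1 sequence.
        combined = torch.cat([phi_a, phi_m, phi_i], dim=1).unsqueeze(1)
        _, (h_n, _) = self.lstm(combined)
        logits = self.classifier(h_n[-1])
        return torch.softmax(logits, dim=1)[:, 1]    # probability that target and detection match

def joint_training_step(model, optimizer, phi_a, phi_m, phi_i, labels):
    """One joint-training step (CNN frozen): cross-entropy on the match/non-match label."""
    optimizer.zero_grad()
    score = model(phi_a, phi_m, phi_i)
    loss = F.binary_cross_entropy(score, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```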

© THE END 

For reprinting, please contact this official account for authorization


Origin blog.csdn.net/gzq0723/article/details/130939236