[Paper Reading Notes 16] BoT-SORT Paper Formula Derivation and Code Interpretation


Paper address: BoT-SORT: Robust Associations Multi-Pedestrian Tracking. Like OC-SORT (see the OC-SORT notes for an interpretation), it is an improvement on the Kalman filter. OC-SORT targets the problem that the Kalman prediction variance grows when the observation (the detector) is unreliable, and smooths the trajectory. BoT-SORT targets camera motion and adds camera motion compensation: in addition to using the Kalman filter to predict the target's new position, it uses sparse optical flow (extracting key points from the image regions outside the targets to estimate the global motion between two frames) to compensate the Kalman result.


1. Kalman filtering

The Kalman filter alternates between a prediction step and an update step, as follows:

Prediction step:

$$\hat{x}_{t|t-1}=F_t\hat{x}_{t-1|t-1}+B_tu_t$$

$$P_{t|t-1}=F_tP_{t-1|t-1}F_t^T+Q$$

Update step:

$$\hat{x}_{t|t}=\hat{x}_{t|t-1}+K_t(z_t-H_t\hat{x}_{t|t-1})$$

$$K_t=P_{t|t-1}H_t^T\left(H_tP_{t|t-1}H_t^T+R_t\right)^{-1}$$

$$P_{t|t}=P_{t|t-1}-K_tH_tP_{t|t-1}$$

where $Q$ and $R$ are the covariance matrices of the process (state) noise and the observation noise, respectively.
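As a minimal NumPy sketch of these two steps (my own illustrative helper functions, not the BoT-SORT implementation):

```python
import numpy as np

def kf_predict(x, P, F, Q, B=None, u=None):
    """Prediction step: project the state and covariance forward."""
    x_pred = F @ x if B is None else F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update step: correct the prediction with observation z."""
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new
```

For a 1-D constant-velocity model, for example, `F = [[1, 1], [0, 1]]` and `H = [[1, 0]]`; the update then pulls the predicted position toward the measurement by an amount controlled by the gain.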

2. Selection of state variables

In SORT, the state variable is [center point, aspect ratio, area, and their rates of change], i.e. $x=[x_c, y_c, a, s, \dot{x}_c, \dot{y}_c, \dot{s}]$. Since DeepSORT, the state variable has generally been $x=[x_c, y_c, a, h, \dot{x}_c, \dot{y}_c, \dot{a}, \dot{h}]$.

In BoT-SORT, the author simply changes the state variable to xywh and its derivatives, namely $x=[x_c, y_c, w, h, \dot{x}_c, \dot{y}_c, \dot{w}, \dot{h}]$; correspondingly, the observation is $z=[z_{x_c}, z_{y_c}, z_w, z_h]$.
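This choice of state fixes the transition and observation matrices. A sketch of what they look like for a constant-velocity model with a unit time step (variable names are mine):

```python
import numpy as np

ndim, dt = 4, 1.0

# Constant-velocity model for x = [xc, yc, w, h, dxc, dyc, dw, dh]:
# each of the first four components is advanced by its derivative.
F = np.eye(2 * ndim)
for i in range(ndim):
    F[i, ndim + i] = dt  # position += velocity * dt

# The detector observes z = [z_xc, z_yc, z_w, z_h], i.e. the first
# four components of the state.
H = np.eye(ndim, 2 * ndim)
```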

These are equations (1) and (2) in the paper.

Equations (3) and (4) in the paper define the noise covariance matrices; I do not yet fully understand these definitions and will add notes once I do.

From the ablation results, changing the state variable this way leaves MOTA roughly unchanged while improving the IDF1 and HOTA metrics by about 0.1 to 0.2.

3. Camera Motion Compensation

This is the core idea proposed by BoT-SORT.

Camera motion can make the association inaccurate. However, we usually lack prior information about the camera motion (e.g. from navigation sensors); therefore, image registration between two adjacent frames is used to approximate the projection of the camera motion onto the 2D image plane.

This can be understood as follows: camera motion amounts to a change of coordinate system, so the target's position, direction of motion, etc. must be re-projected into the new coordinate system. Hence we need to solve for the coordinate transformation matrix, as shown in the figure below:

[Figure: coordinate-system change caused by camera motion]
Therefore, the RANSAC algorithm is used to estimate the affine transform of the image plane. Denote the 2D rotation/scale part by $M\in\mathbb{R}^{2\times 2}$ and the translation vector by $T\in\mathbb{R}^{2}$, and define the affine matrix

$$A = [M \quad T]\in\mathbb{R}^{2\times 3}$$

This is equation (5) in the paper.

Suppose the Kalman prediction is $\hat{x}_{k|k-1}=[x_c, y_c, w, h, \dot{x}_c, \dot{y}_c, \dot{w}, \dot{h}]^T$. We need to apply the affine transform to the center point as well as to the width and height. To realize this with a single matrix multiplication, construct:

$$M'_{k|k-1}=\mathrm{diag}\{M, M, M, M\}\in\mathbb{R}^{8\times 8}$$

$$T'_{k|k-1}=[T^T, 0, 0, 0, 0, 0, 0]^T\in\mathbb{R}^{8}$$

(the translation affects only the center point, so the last six entries of $T'$ are zero).

therefore:

$$\hat{x}'_{k|k-1}=M'_{k|k-1}\hat{x}_{k|k-1}+T'_{k|k-1}$$

The covariance matrix of $\hat{x}_{k|k-1}$ becomes (a linear transform of a random variable scales the covariance quadratically in the coefficients):

$$P'_{k|k-1}=M'_{k|k-1}P_{k|k-1}M'^{T}_{k|k-1}$$
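This compensation can be verified numerically. The sketch below builds $M'$ with a Kronecker product (the same trick the implementation uses later); the values of $M$ and $T$ here are made up for illustration:

```python
import numpy as np

# Example affine parts (made-up values for illustration):
M = np.array([[0.99, -0.01],
              [0.01,  0.99]])  # 2x2 rotation/scale block
T = np.array([3.0, -2.0])      # translation of the image plane

# M' = diag{M, M, M, M} built via a Kronecker product
M8 = np.kron(np.eye(4), M)

x_pred = np.array([100.0, 50.0, 20.0, 40.0, 1.0, 0.5, 0.0, 0.0])
P_pred = np.eye(8)

# Compensated mean: rotate all components, translate only the center
x_comp = M8 @ x_pred
x_comp[:2] += T

# Compensated covariance: P' = M' P M'^T
P_comp = M8 @ P_pred @ M8.T
```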

These are equations (5)–(8) in the paper.

The corrected $\hat{x}'_{k|k-1}$ and $P'_{k|k-1}$ are then used in the update step.

This is equation (9) in the paper.

In the implementation, the detected targets are masked out of the frame, and key points are detected in the remaining region to compute the affine matrix (tracker\gmc.py, line 120):

```python
# find the keypoints (restrict to the central region of the frame)
mask = np.zeros_like(frame)
# mask[int(0.05 * height): int(0.95 * height), int(0.05 * width): int(0.95 * width)] = 255
mask[int(0.02 * height): int(0.98 * height), int(0.02 * width): int(0.98 * width)] = 255
if detections is not None:
    for det in detections:
        # zero out each detection box so targets do not contribute key points
        tlbr = (det[:4] / self.downscale).astype(np.int_)
        mask[tlbr[1]:tlbr[3], tlbr[0]:tlbr[2]] = 0

keypoints = self.detector.detect(frame, mask)

# compute the descriptors
keypoints, descriptors = self.extractor.compute(frame, keypoints)
```

Then match the key points between two adjacent frames:

```python
# Match descriptors: for each previous-frame descriptor, find the 2 nearest
knnMatches = self.matcher.knnMatch(self.prevDescriptors, descriptors, 2)
```

If there are enough matched key points, compute the affine matrix:

```python
# Find rigid matrix (the original code compares prevPoints to itself
# and misspells "inliers"; fixed here)
if (np.size(prevPoints, 0) > 4) and (np.size(prevPoints, 0) == np.size(currPoints, 0)):
    H, inliers = cv2.estimateAffinePartial2D(prevPoints, currPoints, method=cv2.RANSAC)
```

The part above relies on the cv2 (OpenCV) library. After obtaining the affine matrix, correct the Kalman result (tracker\bot_sort.py, line 68):

```python
def multi_gmc(stracks, H=np.eye(2, 3)):
    if len(stracks) > 0:
        multi_mean = np.asarray([st.mean.copy() for st in stracks])
        multi_covariance = np.asarray([st.covariance for st in stracks])

        R = H[:2, :2]                              # rotation/scale part M
        R8x8 = np.kron(np.eye(4, dtype=float), R)  # diag{M, M, M, M}
        t = H[:2, 2]                               # translation part T

        for i, (mean, cov) in enumerate(zip(multi_mean, multi_covariance)):
            mean = R8x8.dot(mean)
            mean[:2] += t                           # translate only the center
            cov = R8x8.dot(cov).dot(R8x8.transpose())

            stracks[i].mean = mean
            stracks[i].covariance = cov
```

4. Matching strategy

During matching, an exponential moving average (EMA) balances past and current appearance features, and the motion cost and appearance cost are then linearly combined into the final cost matrix, as shown in the following two formulas:
[Formula: EMA update of the appearance embedding]

[Formula: fusion of the motion (IoU) cost and appearance cost]
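The EMA step can be sketched as follows (a minimal illustration, not the paper's code; the smoothing factor of 0.9 is the value commonly used in such trackers):

```python
import numpy as np

def ema_update(e_prev, f_new, alpha=0.9):
    """Blend the track's running appearance embedding with the new
    detection's embedding, then re-normalise. Alpha close to 1 keeps
    the historical appearance dominant."""
    e = alpha * np.asarray(e_prev) + (1 - alpha) * np.asarray(f_new)
    return e / np.linalg.norm(e)
```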

For pairs whose appearance is similar and whose IoU distance is small, a discounted (smaller) appearance cost is kept; otherwise the appearance cost is saturated to 1, and the entries of the cost matrix C are updated accordingly:

[Formula: thresholded appearance cost]
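This gating can be sketched in NumPy as below (the threshold values are illustrative, not the paper's exact hyperparameters):

```python
import numpy as np

def fuse_costs(d_iou, d_cos, theta_iou=0.5, theta_emb=0.25):
    # Keep a discounted appearance cost only where both the appearance
    # distance and the IoU distance are small; otherwise saturate to 1.
    d_hat = np.where((d_cos < theta_emb) & (d_iou < theta_iou),
                     0.5 * d_cos, 1.0)
    # Final cost: element-wise minimum of motion and gated appearance cost.
    return np.minimum(d_iou, d_hat)
```

A pair with small appearance distance but large IoU distance thus falls back to its IoU cost rather than being matched on appearance alone.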
The final result is shown in the table below:

[Table: BoT-SORT benchmark results]

Origin blog.csdn.net/wjpwjpwjp0831/article/details/125946573