Implementation of video multi-target tracking based on deep learning

1 Introduction

Hi everyone, this is Senior Dancheng. Today I would like to introduce:

Implementation of video multi-target tracking based on deep learning

The workload of this project is sufficient for a graduation (capstone) design.

2 Results first

[Figure: demo of the tracking results]

3 Two methods of multi-target tracking

3.1 Method 1

Tracking based on an initialization frame: you select the target in the first frame of the video and then hand it over to the tracking algorithm. This approach can essentially only track the target selected in the first frame; if a new target appears in a later frame, the algorithm cannot track it. The advantage of this approach is that it is relatively fast; the obvious disadvantage is that it cannot track newly appearing targets.
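
For reference, here is a minimal sketch of Method 1 using one of OpenCV's built-in single-object trackers. It assumes `opencv-contrib-python` is installed; the file name `video.mp4` and the choice of the CSRT tracker are only examples, not this project's actual code:

    import cv2

    cap = cv2.VideoCapture('video.mp4')    # hypothetical input video
    ok, frame = cap.read()

    # Manually select the target in the first frame; the tracker can only
    # follow this one box and will miss any object that appears later.
    box = cv2.selectROI('select target', frame, showCrosshair=False)
    tracker = cv2.TrackerCSRT_create()     # requires opencv-contrib-python
    tracker.init(frame, box)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, box = tracker.update(frame)
        if found:
            x, y, w, h = map(int, box)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('tracking', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()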

3.2 Method 2

Tracking based on target detection: all target objects of interest are detected in every frame of the video and then associated with the targets detected in the previous frame to achieve tracking. The advantage of this method is that it can track new targets that appear at any point in the video; of course, it requires a good target detection algorithm.

This post mainly covers the implementation principle of Method 2, i.e. the Tracking By Detecting approach.

4 Tracking By Detecting tracking process

**Step1:** Use a target detection algorithm to detect the targets of interest in each frame and obtain, for each one, its (position coordinates, class, confidence); assume the number of detected targets is M.

**Step2:** Associate the detection results from Step1 with the targets detected in the previous frame (assume the previous frame had N detected targets) in some way. In other words, find the best-matching pairs among the M×N candidate pairs.

For the "some way" in Step2, there are actually many ways to achieve the association of targets, such as the common calculation of the Euclidean distance between two targets in two frames (the linear distance between two points in the plane), The shortest distance is considered to be the same target, and then the most matching pair is found through the Hungarian algorithm. Of course, you can also add other judgment conditions, such as the IOU I used to calculate the intersection and union ratio of two target boxes (position and size boxes). The closer the value is to 1, it means the same target. There are other things such as judging whether the appearance of two objects is similar, which requires the use of an appearance model for comparison, which may take longer.

In the process of association, three situations will occur:

1) A target detected in this frame is matched to one of the N targets from the previous frame, meaning it is being tracked normally;

2) A target detected in this frame cannot be matched to any of the N targets from the previous frame, meaning it is new in this frame, so we record it for association in subsequent frames;

3) A target from the previous frame has no associated target in this frame, so it may have left the field of view and should be removed. (Note the word "may": the detector might simply have missed the target in this frame.)
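
The three situations above map naturally onto a small bookkeeping step per frame. Here is a rough sketch; the `Track` class, the `matches` format from the association step above, and the `max_missed` value are all illustrative, not this project's actual code:

    class Track:
        def __init__(self, track_id, box):
            self.id = track_id
            self.box = box
            self.missed = 0          # consecutive frames without a match

    def update_tracks(tracks, detections, matches, max_missed=5):
        matched_tracks = {t for t, _ in matches}
        matched_dets = {d for _, d in matches}

        # Case 1: matched pair, the target is tracked normally; refresh its box.
        for t, d in matches:
            tracks[t].box = detections[d]
            tracks[t].missed = 0

        # Case 3: track with no detection; it may have left the view, or the
        # detector may simply have missed it, so drop it only after several misses.
        for i, track in enumerate(tracks):
            if i not in matched_tracks:
                track.missed += 1
        tracks = [t for t in tracks if t.missed <= max_missed]

        # Case 2: detection with no track; a new target, start following it.
        next_id = max((t.id for t in tracks), default=-1) + 1
        for d, det in enumerate(detections):
            if d not in matched_dets:
                tracks.append(Track(next_id, det))
                next_id += 1

        return tracks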

[Figure: illustration of the association process across frames]

4.1 Existing problems

The tracking method described above works well under normal circumstances, but if a target in the video moves quickly, so that the same target ends up far apart in two consecutive frames, this tracking method runs into problems.

[Figure: solid-line boxes are target positions in frame 1; dashed-line boxes are positions in frame 2]
As shown in the figure above, the solid-line boxes indicate the positions of the targets in the first frame, and the dashed-line boxes indicate their positions in the second frame. When the targets move relatively slowly, (A, A') and (B, B') can be associated correctly with the method above. But when the targets move very fast (or detection is run only every other frame), A may move to where B was in the first frame while B moves somewhere else; in that case, the association method above produces wrong results.

So how can we track more accurately?

4.2 Tracking method based on trajectory prediction

Since comparing the positions in the second frame directly with the positions in the first frame can produce errors, we can instead predict where each target will appear in the next frame before doing the comparison, and then associate the detections with the predicted positions. As long as the prediction is accurate enough, the speed-related errors described above almost disappear.

[Figure: the predicted next-frame positions of A and B are associated with the actual detections]

As shown in the figure above, before comparing and associating, we first predict the positions of A and B in the next frame, and then associate the actual detections with the predicted positions, which solves the problem described above. In theory, no matter how fast the target moves, it can still be associated. So the question becomes: how do we predict the position of a target in the next frame?

There are many methods. You can use a Kalman filter to predict the next-frame position from the target's trajectory over the previous frames, or you can fit a function to the trajectory and use it to predict the next position. In practice, I use a fitted function to predict where the target will be in the next frame.
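
For the Kalman-filter option mentioned above, here is a minimal constant-velocity sketch using OpenCV's `cv2.KalmanFilter`. The noise covariances are placeholder values, not tuned, and this is only an illustration rather than the code used in this project:

    import numpy as np
    import cv2

    # Constant-velocity model: state = (x, y, vx, vy), measurement = (x, y).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2      # placeholder
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1  # placeholder
    kf.errorCovPost = np.eye(4, dtype=np.float32)
    # Before use, initialize kf.statePost from the first detected center, e.g.
    # kf.statePost = np.array([[x0], [y0], [0], [0]], dtype=np.float32)

    def track_step(kf, measured_center):
        # Each frame: predict where the target should be (used for association),
        # then correct the filter with the actually detected center.
        predicted = kf.predict()
        kf.correct(np.array(measured_center, dtype=np.float32).reshape(2, 1))
        return float(predicted[0, 0]), float(predicted[1, 0])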

[Figure: fitting a T->XY curve from the previous 6 frames to predict the position at frame T+1]
As shown in the figure above, from the positions in the previous 6 frames I can fit a (T->XY) curve (note that it is not a straight line in the figure) and then predict the target's position at frame T+1. The implementation is straightforward; NumPy provides functions for this (e.g. numpy.polyfit).
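
As an illustration, here is a minimal sketch of such a fitted-curve prediction with NumPy's `polyfit`/`polyval`. The 6-frame window matches the description above; the quadratic degree and the sample history values are only examples:

    import numpy as np

    def predict_next_position(history, degree=2):
        # history: list of (x, y) centers for the previous frames (e.g. the last 6).
        # Fit separate T->X and T->Y curves, then evaluate them at T+1.
        t = np.arange(len(history))
        xs = np.array([p[0] for p in history])
        ys = np.array([p[1] for p in history])
        fx = np.polyfit(t, xs, degree)     # coefficients of the T->X curve
        fy = np.polyfit(t, ys, degree)     # coefficients of the T->Y curve
        next_t = len(history)
        return float(np.polyval(fx, next_t)), float(np.polyval(fy, next_t))

    # e.g. centers of the same target in the previous 6 frames
    history = [(100, 50), (108, 53), (117, 57), (127, 62), (138, 68), (150, 75)]
    print(predict_next_position(history))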

5 Training code

The training code is recorded here and will be updated in the future.

    # Note: this snippet assumes the rest of the training script (model,
    # optimizer, the per-output `loss` functions, train_dataset, val_dataset
    # and the absl FLAGS) has been set up earlier, in the style of a
    # yolov3-tf2 train.py.
    import numpy as np
    import tensorflow as tf
    from absl import logging   # the standard `logging` module also works here

    if FLAGS.mode == 'eager_tf':
        # Eager mode is great for debugging
        # Non eager graph mode is recommended for real training
        avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
        avg_val_loss = tf.keras.metrics.Mean('val_loss', dtype=tf.float32)

        for epoch in range(1, FLAGS.epochs + 1):
            for batch, (images, labels) in enumerate(train_dataset):
                with tf.GradientTape() as tape:
                    outputs = model(images, training=True)
                    regularization_loss = tf.reduce_sum(model.losses)
                    pred_loss = []
                    for output, label, loss_fn in zip(outputs, labels, loss):
                        pred_loss.append(loss_fn(label, output))
                    total_loss = tf.reduce_sum(pred_loss) + regularization_loss

                grads = tape.gradient(total_loss, model.trainable_variables)
                optimizer.apply_gradients(
                    zip(grads, model.trainable_variables))

                logging.info("{}_train_{}, {}, {}".format(
                    epoch, batch, total_loss.numpy(),
                    list(map(lambda x: np.sum(x.numpy()), pred_loss))))
                avg_loss.update_state(total_loss)

            for batch, (images, labels) in enumerate(val_dataset):
                outputs = model(images)
                regularization_loss = tf.reduce_sum(model.losses)
                pred_loss = []
                for output, label, loss_fn in zip(outputs, labels, loss):
                    pred_loss.append(loss_fn(label, output))
                total_loss = tf.reduce_sum(pred_loss) + regularization_loss

                logging.info("{}_val_{}, {}, {}".format(
                    epoch, batch, total_loss.numpy(),
                    list(map(lambda x: np.sum(x.numpy()), pred_loss))))
                avg_val_loss.update_state(total_loss)

            logging.info("{}, train: {}, val: {}".format(
                epoch,
                avg_loss.result().numpy(),
                avg_val_loss.result().numpy()))

            avg_loss.reset_states()
            avg_val_loss.reset_states()
            model.save_weights(
                'checkpoints/yolov3_train_{}.tf'.format(epoch))

6 Last
