Target Tracking Column (1) Basic tasks and common methods

Introduction: Visual object tracking is an important problem in computer vision. Although it has been studied extensively in recent years, it remains somewhat less popular than basic visual tasks such as object detection and semantic segmentation, because the problem is difficult and high-quality data is scarce. Deep learning and growing computing power have rapidly improved the performance of vision algorithms, yet methods based on deep neural networks have only begun to appear in object tracking in recent years, so the area still has considerable room to develop.
This column therefore first introduces the basic tasks and common methods of target tracking, and then covers its basic concepts and commonly used data sets from the perspectives of single-target and multi-target tracking. Together with readings of classic papers in the field, it aims to give readers a systematic understanding of target tracking. If you are interested in visual object tracking or new to the field, this column should give you a preliminary understanding of the problem and its classic methods. If you are a researcher who already knows the area, you are welcome to discuss and offer advice.

Welcome to pay attention to the public account CV technical guide , focusing on computer vision technical summary, latest technology tracking, interpretation of classic papers, CV recruitment information.


Reproduction of this tutorial is prohibited. At the same time, this tutorial comes from Knowledge Planet [CV Technical Guide] More technical tutorials, you can join the planet to learn, and you can get a limited-time coupon at the end of the article.

This column will cover the following:

  1. Basic tasks and common methods of target tracking

  2. Basic concepts and common data sets of single-target tracking and multi-target tracking

  3. Interpretation of classic papers on single-target tracking

  4. Interpretation of classic papers on multi-target tracking

  5. Future prospects of target tracking

  6. Summary

(1) Basic overview

Target tracking is a hot topic in computer vision. It uses the context information of a video or image sequence to model the appearance and motion of the target, so as to predict the target's motion state and locate its position. From the perspective of model building, tracking algorithms can be divided into generative models and discriminative models; by the number of tracked targets, they can be divided into single object tracking (SOT) and multiple object tracking (MOT). Target tracking draws on theories and algorithms from image processing, machine learning, optimization, and other fields, and is the basis for higher-level image understanding tasks such as target behavior recognition.

(2) Basic tasks

The basic task of target tracking is this: given the initial position of the target in a video sequence, continue to track and locate the target in every subsequent frame. During this process no prior knowledge about the color, shape, or size of the target is provided; that is, the tracking algorithm can only learn the target from the first frame. Generally speaking, the technical difficulties in the tracking process mainly include the following aspects:

  1. Occlusion and disappearance: Occlusion is one of the most common challenges in target tracking. It is divided into partial occlusion and full occlusion. There are usually two ways to handle partial occlusion: (1) use a detection mechanism to judge whether the target is occluded, and decide accordingly whether to update the template, so that the template stays robust to occlusion; (2) divide the target into multiple blocks and track using the unoccluded blocks. For full occlusion, there is currently no method that solves the problem completely.

  1. Morphology change: Deformation is also a major problem in target tracking. Continuous change in the target's appearance usually leads to tracking drift. A common remedy is to update the appearance model of the target so that it adapts to the changes, which makes the update method the key. When to update and how often to update are the issues that need attention.

  1. Scale variation: Scale variation is the change in the target's apparent size as it moves from far to near or from near to far. Predicting the size of the target box is therefore also a challenge, and how quickly and accurately the scale change is predicted directly affects tracking accuracy. The usual practice is either to generate a large number of candidate boxes of different scales when the motion model generates candidate samples, or to run tracking at several scales, producing multiple predictions, and select the best one as the final predicted target.

  1. Complex background: Background clutter means that objects very similar to the target appear around it and interfere with tracking. Common remedies are to use the target's motion information to predict its approximate trajectory, so that the tracker does not switch to a similar object, or to use a large number of sample boxes around the target to update the classifier and improve its ability to distinguish the target from the background.

  1. Accuracy and Timeliness

(3) Common methods

traditional method

Traditional object tracking algorithms are the earliest algorithms in the field of object tracking. Although these algorithms have certain limitations today, we cannot ignore the foundation they laid for the vigorous development of the field of object tracking. The classic algorithms include optical flow method, Kalman filter, particle filter, mean shift, etc.

optical flow

The concept of optical flow was first proposed by Gibson in 1950. It is the apparent motion of image content between two consecutive frames caused by the movement of the target, the scene, or the camera. More precisely, it is the instantaneous velocity of the pixels of a moving object on the imaging plane. Optical flow methods use the change of pixels in the image sequence over time and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and from it compute the motion of objects between adjacent frames. Generally speaking, optical flow arises from the movement of foreground objects in the scene, the movement of the camera, or both. Its calculation methods can be divided into three categories:

  1. Region-based or feature-based matching methods

  1. Frequency domain based methods

  1. gradient-based method

Simply put, optical flow is the "instantaneous velocity" of the pixel motion of a spatially moving object on the observation imaging plane. The study of optical flow exploits the temporal variation and correlation of pixel intensity data in an image sequence to determine the "motion" of the respective pixel locations. The purpose of studying the optical flow field is to approximate the motion field that cannot be obtained directly from the image sequence (the motion field is actually the movement of objects in the three-dimensional real world; the optical flow field is the projection of the motion field on the two-dimensional image plane).

Premise assumptions of optical flow method:

  1. Constant brightness between adjacent frames

  1. The frame time of adjacent video frames is continuous, or the motion of objects between adjacent frames is relatively "tiny"

  1. Maintain spatial consistency; that is, pixels of the same subimage have the same motion

The principle of optical flow method for target tracking:

  1. Process a sequence of continuous video frames

  1. For each video sequence, use a certain target detection method to detect possible foreground targets

  1. If a foreground target appears in a certain frame, find its representative key feature points (can be randomly generated, or use corner points as feature points)

  1. For any two subsequent adjacent video frames, find the best position of the key feature points that appeared in the previous frame in the current frame, so as to obtain the position coordinates of the foreground target in the current frame

  1. This iterative process can realize the tracking of the target.
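The displacement search in step 4 can be sketched with a gradient-based (Lucas-Kanade style) estimate. This is a minimal illustration under the three assumptions listed above, not a full tracker: the synthetic blob frames and the function names `gaussian_blob` and `lucas_kanade_window` are made up for the example, and a single displacement is estimated for the whole window rather than per feature point.

```python
import numpy as np

def gaussian_blob(size, cx, cy, sigma=6.0):
    """Synthetic frame: one smooth bright blob centred at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

def lucas_kanade_window(prev, curr):
    """Estimate one (u, v) displacement for the whole window.

    Solves Ix*u + Iy*v = -It in the least-squares sense, which relies on
    brightness constancy, small motion and spatial consistency.
    """
    iy, ix = np.gradient(prev)            # spatial gradients (axis 0 = y, axis 1 = x)
    it = curr - prev                      # temporal gradient
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    (u, v), *_ = np.linalg.lstsq(a, b, rcond=None)
    return u, v

prev_frame = gaussian_blob(48, 24.0, 24.0)
curr_frame = gaussian_blob(48, 24.4, 23.7)    # true motion: +0.4 px in x, -0.3 px in y
u, v = lucas_kanade_window(prev_frame, curr_frame)
print(f"estimated flow: u={u:.2f}, v={v:.2f}")
```

In a real tracker the same least-squares problem would be solved in a small window around each key feature point, usually on an image pyramid so that larger motions still satisfy the small-motion assumption.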

Kalman filter

The filter is named after Rudolf Emil Kalman, a Hungarian-born mathematician. What is the Kalman filter for? A typical example, following the Wikipedia explanation, is to predict the position and velocity of an object from a finite sequence of noisy (and possibly biased) observations of its position. In radar, for example, the goal is to track a target, but the measured position, velocity, and acceleration of the target are noisy at every moment. Kalman filtering uses the dynamics of the target to suppress the influence of the noise and obtain a good estimate of the target position. This can be an estimate of the current position (filtering), of a future position (prediction), or of a past position (interpolation or smoothing).
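As a rough sketch of the filtering case, the following assumes a constant-velocity motion model and noisy position measurements in one dimension. The function name `kalman_track`, the noise variances, and the simulated trajectory are all assumptions chosen for illustration.

```python
import numpy as np

def kalman_track(measurements, dt=1.0, meas_var=1.0, accel_var=1e-3):
    """Constant-velocity Kalman filter over noisy 1-D position measurements."""
    f = np.array([[1.0, dt], [0.0, 1.0]])            # state transition
    h = np.array([[1.0, 0.0]])                       # only position is observed
    q = accel_var * np.array([[dt ** 4 / 4, dt ** 3 / 2],
                              [dt ** 3 / 2, dt ** 2]])   # process noise
    r = np.array([[meas_var]])                       # measurement noise
    x = np.zeros(2)                                  # state: [position, velocity]
    p = np.eye(2) * 100.0                            # large initial uncertainty
    estimates = []
    for z in measurements:
        # Update: blend the prediction with the new measurement
        s = h @ p @ h.T + r
        k = p @ h.T @ np.linalg.inv(s)               # Kalman gain
        x = x + k @ (np.array([z]) - h @ x)
        p = (np.eye(2) - k @ h) @ p
        estimates.append(x.copy())
        # Predict: propagate the state to the next time step
        x = f @ x
        p = f @ p @ f.T + q
    return np.array(estimates)

rng = np.random.default_rng(0)
true_pos = 1.0 * np.arange(50)                       # target moves 1 unit per step
noisy = true_pos + rng.normal(0.0, 1.0, size=50)
est = kalman_track(noisy)
print("final position estimate:", est[-1, 0], "velocity estimate:", est[-1, 1])
```

Velocity is never measured directly here; the filter infers it from the sequence of positions, which is what lets it predict the next position before the measurement arrives.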

particle filter

The particle filter (PF) is based on Bayesian inference and importance sampling. Its Bayesian inference step is similar to that of the Kalman filter, but the Kalman filter assumes a linear Gaussian model. For nonlinear, non-Gaussian models the particle filter instead uses the Monte Carlo method, that is, it approximates the probability of an event by the frequency with which it occurs among random samples. In this sense the particle filter can be seen, to a certain extent, as an extension of the Kalman filter.

Importance sampling is to add different weights according to the degree of trust in the particles. For particles with a high degree of trust, add a larger weight, otherwise add a smaller weight. According to the distribution of weights, the similarity with the target can be obtained.

The particle filter is built on the Monte Carlo idea: it uses a set of particles to represent a probability distribution, and it can be applied to any form of state-space model. In 1998, Isard and Blake successfully applied the particle filter to target tracking (the CONDENSATION algorithm). In the initialization phase the target features are extracted. In the search phase, particles are sampled over the search area of the image according to a uniform or Gaussian distribution, the similarity between each sampled particle and the target is computed, and the position with the highest similarity is taken as the predicted target position. In subsequent frames, particles are resampled according to their importance around the target position predicted in the previous frame. The traditional particle filter tracker models the target only with the color histogram of the image, its computational cost grows with the number of particles, and it often fails when the target color is similar to the background.

Specific process:

The core idea of the particle filter is loosely similar to optimization with a reward-and-punishment mechanism, reminiscent of reinforcement learning: particles that agree with the observations are rewarded and the others are penalized. First, the position of each particle is updated according to the state transition equation. This update is only dead reckoning, which is relative positioning; the absolute positioning information has not yet been incorporated. Is the new state given by the state transition equation plausible, and how likely is it? That depends on the result of absolute positioning, that is, on the output (observation) equation.

Substituting the state given by the state transition equation into the output equation yields a predicted output. The absolute positioning measurement provides the corresponding observed value, so there is now a difference between the two. The smaller the difference, the more credible the new state; the larger the difference, the less credible it is.

This difference is used as an evaluation function (like the evaluation function in optimization algorithms such as GA and PSO) to correct the estimated probability of each state. Simply put, at the beginning a large number of particles is spread evenly over the whole map (improved preprocessing can concentrate them near the correct point in advance to reduce computation). Each particle produces a predicted value, an actual observation is obtained, and the particles whose predictions differ little from the observation are kept. How many particles are kept depends on the system model, and adaptive algorithms can adjust this number automatically. The same procedure is repeated at the next step (this process is called resampling), until only particles with very high confidence remain. These usually give the final, correct value.
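The predict, weight, and resample loop described above can be sketched in one dimension. This is an illustrative bootstrap particle filter under assumed settings: a map of width 100, a target that moves one unit per step, Gaussian observation noise, and the made-up function name `particle_filter`.

```python
import numpy as np

def particle_filter(observations, n_particles=1000, motion=1.0,
                    process_std=0.5, obs_std=1.0, seed=0):
    """1-D bootstrap particle filter; returns one position estimate per step."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0.0, 100.0, size=n_particles)   # spread over the whole map
    estimates = []
    for z in observations:
        # 1. State transition: dead reckoning plus process noise
        particles = particles + motion + rng.normal(0.0, process_std, n_particles)
        # 2. Weight each particle by how well it agrees with the observation
        diff = z - particles
        weights = np.exp(-0.5 * (diff / obs_std) ** 2) + 1e-300
        weights /= weights.sum()
        estimates.append(float(np.sum(weights * particles)))
        # 3. Resample: credible particles are duplicated, the rest die out
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return estimates

rng = np.random.default_rng(1)
true_positions = 30.0 + np.arange(1, 21)                    # starts near 30, +1 per step
observations = true_positions + rng.normal(0.0, 1.0, size=20)
estimates = particle_filter(observations)
print("true final position:", true_positions[-1], "estimate:", estimates[-1])
```

Here the number of particles stays fixed and resampling happens at every step; adaptive schemes that change the particle count, as mentioned above, are not shown.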

Mean shift (mean-shift)

The mean shift algorithm was proposed by Fukunaga and Hostetler in 1975. Its basic idea is to climb the gradient of the probability density to find a local optimum. In 1995, Yizong Cheng defined a family of kernel functions, based on the idea that sample points closer to x are more informative about the statistics around x, and assigned each sample point a weight according to its importance, which broadened the range of applications of mean shift.

The principle of the meanshift algorithm is very simple. Suppose you have a bunch of point sets and a small window. This window may be circular. Now you may want to move this window to the area with the highest density of point sets. As shown below:

The first window is the blue circle, named C1. The center of the circle is marked with a blue rectangle, named C1_o. The centroid of all points inside the window is the blue round point C1_r; clearly the center of the circle and the centroid do not coincide. So move the blue window until its center coincides with the centroid just obtained. In the newly moved circle, find the centroid of the enclosed points again, and move again. Usually the center and the centroid still do not coincide, so the moving process continues until they roughly coincide. In this way the final circular window lands where the point density is highest, that is, the green circle in the figure, named C2.
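The window-moving procedure can be sketched on a 2-D point set as follows. The point cloud, the starting window, and the function name `mean_shift_window` are assumptions made for illustration; a flat (uniform) kernel is used, so every point inside the circle counts equally.

```python
import numpy as np

def mean_shift_window(points, center, radius, max_iter=50, eps=1e-3):
    """Move a circular window to the centroid of the points it contains
    until the window centre and the centroid (almost) coincide."""
    center = np.asarray(center, dtype=float)
    for _ in range(max_iter):
        inside = np.linalg.norm(points - center, axis=1) <= radius
        if not inside.any():
            break                                   # empty window: nothing to follow
        centroid = points[inside].mean(axis=0)
        shift = np.linalg.norm(centroid - center)
        center = centroid
        if shift < eps:
            break
    return center

rng = np.random.default_rng(0)
cluster = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(200, 2))   # dense region
background = rng.uniform(0.0, 10.0, size=(50, 2))                # sparse clutter
points = np.vstack([cluster, background])
final_center = mean_shift_window(points, center=(3.5, 3.5), radius=2.5)
print("window converged near:", final_center)
```

The window only climbs to the nearest dense region, so a starting position with no overlap with the cluster would leave it stuck on the background points.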

In addition to being used in video tracking, the meanshift algorithm has important applications in various occasions involving data and unsupervised learning such as clustering and smoothing. It is a widely used algorithm. An image is a matrix of information. How to use the meanshift algorithm to track a moving object in a video? The general process is as follows:

  1. First select a target area on the image

  1. Calculate the histogram distribution of the selected area, generally the histogram of the HSV color space

  1. Also calculate the histogram distribution for the next frame image b

  1. Calculate the area in image b that is most similar to the histogram distribution of the selected area, and use the meanshift algorithm to move the selected area along the most similar part until the most similar area is found, and the target tracking in image b is completed

  1. Repeat the process from 3 to 4 to complete the entire video target tracking

In practice we usually pass in the image obtained by histogram back projection together with the starting position of the target in the first frame. As the target moves, its movement shows up in the back-projection image, and the mean shift algorithm moves the window to the area of the back-projection image with the highest gray-level density.

The process of histogram backprojection is:

Suppose we have a 100x100 input image and a 10x10 template image, the search process is as follows:

  1. Starting from the upper left corner (0,0) of the input image, cut a temporary image from (0,0) to (10,10)

  1. Generate a histogram of the temporary image

  1. Compare the histogram of the temporary image with the histogram of the template image, and record the comparison result as c

  1. The histogram comparison result c is the pixel value at (0,0) of the result image

  1. Cut the temporary image of the input image from (0,1) to (10,11), compare the histogram, and record to the result image

  1. Repeat steps 1 to 5 until the lower right corner of the input image, forming the back projection of the histogram
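The search steps above can be sketched directly in NumPy. The example uses a 20x20 image and a 5x5 template instead of 100x100 and 10x10 to keep the loop short, and it assumes histogram intersection as the comparison measure, since the text does not say which measure is used. The function names are hypothetical.

```python
import numpy as np

def patch_histogram(patch, bins=8):
    """Normalised grey-level histogram of an image patch."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / hist.sum()

def back_project(image, template, bins=8):
    """Slide a template-sized window over the image and store the
    histogram similarity (here: histogram intersection) at each position."""
    th, tw = template.shape
    t_hist = patch_histogram(template, bins)
    rows = image.shape[0] - th + 1
    cols = image.shape[1] - tw + 1
    result = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            w_hist = patch_histogram(image[r:r + th, c:c + tw], bins)
            result[r, c] = np.minimum(t_hist, w_hist).sum()
    return result

template = (np.arange(25).reshape(5, 5) * 8 + 50).astype(np.uint8)   # values 50..242
image = np.zeros((20, 20), dtype=np.uint8)
image[8:13, 10:15] = template                      # paste the target at row 8, column 10
result = back_project(image, template)
peak = np.unravel_index(np.argmax(result), result.shape)
print("best match at (row, col):", peak)
```

The result image is smaller than the input by the template size minus one in each direction, because the window must fit entirely inside the input.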

Mean shift video tracking implementation

The API for mean shift in OpenCV is: cv.meanShift(probImage, window, criteria)

Parameters:

probImage: the back projection of the target histogram, computed on the frame being searched.

window: the initial search window, that is, the rectangle that defines the ROI.

criteria: the criterion for stopping the window search, mainly the number of iterations reaching a set maximum or the shift of the window center falling below a set threshold.

The main process of implementing mean shift tracking is:

  1. Read video file: cv.VideoCapture()

  1. Region of interest setting: get the first frame of the video and set the target area, that is, the region of interest

  1. Calculate histogram: Calculate the HSV histogram of the region of interest and normalize it

  1. Target tracking: set the window search stop condition, back-project the histogram, perform target tracking, and draw a rectangular frame at the target position.
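The tracking loop in the last step can be sketched without a video file. The code below does not call OpenCV: `mean_shift` is a simplified NumPy stand-in that mimics the (probImage, window, criteria) interface described above, and `synthetic_back_projection` fabricates back-projection frames containing a bright square that moves a few pixels per frame. Both names and all numbers are assumptions for illustration.

```python
import numpy as np

def mean_shift(prob_image, window, max_iter=10, eps=1.0):
    """Shift an (x, y, w, h) window towards the centroid of prob_image
    inside it; stop after max_iter steps or when the shift drops below eps."""
    x, y, w, h = window
    rows, cols = prob_image.shape
    for _ in range(max_iter):
        roi = prob_image[y:y + h, x:x + w].astype(float)
        total = roi.sum()
        if total == 0:
            break                                    # no target evidence in the window
        ys, xs = np.mgrid[0:roi.shape[0], 0:roi.shape[1]]
        cx = (xs * roi).sum() / total                # centroid inside the window
        cy = (ys * roi).sum() / total
        dx = int(round(cx - (w - 1) / 2.0))          # offset from the window centre
        dy = int(round(cy - (h - 1) / 2.0))
        x = min(max(x + dx, 0), cols - w)
        y = min(max(y + dy, 0), rows - h)
        if abs(dx) < eps and abs(dy) < eps:
            break
    return x, y, w, h

def synthetic_back_projection(step, size=20):
    """Stand-in for a back-projection frame: a bright square that moves each frame."""
    frame = np.zeros((120, 160), dtype=np.uint8)
    x0, y0 = 20 + 3 * step, 30 + 2 * step
    frame[y0:y0 + size, x0:x0 + size] = 255
    return frame, (x0, y0)

window = (20, 30, 20, 20)                            # region of interest in frame 0
for step in range(1, 21):
    prob, true_xy = synthetic_back_projection(step)
    window = mean_shift(prob, window)
print("tracked:", window[:2], "true:", true_xy)
```

With real video, each frame would be converted to HSV and back-projected against the normalized histogram of the region of interest before the mean shift step, and the returned window would be drawn on the frame.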

Due to the fast calculation speed of the mean shift and the certain robustness to the target deformation and occlusion, the mean shift algorithm has been widely valued. However, the extracted color histogram feature has limited ability to describe the target and lacks spatial information, so the mean shift algorithm can only be used when the target and the background can be distinguished in color, which has great limitations.

Target Tracking Method Based on Correlation Filtering

Target tracking based on correlation filters is a major research focus in the field and has brought great changes to it. Correlation filtering comes from signal processing. Correlation measures the similarity between two signals: the closer the two signals are, the higher the correlation response. A correlation filter tracker builds a correlation filter and takes the position with the highest response as the target position. In 2010, Bolme et al. were the first to combine correlation filtering with target tracking and proposed the MOSSE tracking algorithm. Its high speed and good tracking performance greatly promoted the development of the field. The following briefly introduces several representative tracking algorithms from the development of the field.

Minimum Output Sum of Squared Error (MOSSE) Filter

Introducing the relevant operations in the field of signal processing to the field of target tracking, the tracking problem can be described as finding the region most similar to the initial target in the video sequence, that is, calculating the correlation between the filter h and the input image f to obtain the response graph g. The calculation process is shown in the figure below:

In order to reduce the amount of calculation and speed up the calculation, the Fourier transform is introduced. The filter trained on an image can accurately fit the image, but when tracking, the appearance of the target will be affected by factors such as fast movement, scale change, and occlusion. When the previously trained filter is applied to a new image, it often cannot adapt to the change of the target, resulting in tracking failure. Therefore, the filter needs to be able to update adaptively as the video sequence progresses.
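A minimal sketch of the Fourier-domain computation, assuming a single training image and the closed-form MOSSE-style solution H* = (G . conj(F)) / (F . conj(F) + eps), where F and G are the Fourier transforms of the input f and the desired response g. The random "patch", the Gaussian response, the regularization value, and the function names are illustrative assumptions. No online update is shown.

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired response g: a Gaussian peak at the target centre."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((y - center[0]) ** 2 + (x - center[1]) ** 2) / (2 * sigma ** 2))

def train_filter(f, g, eps=1e-3):
    """Closed-form filter (conjugate form) from one image in the Fourier domain."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    return (G * np.conj(F)) / (F * np.conj(F) + eps)

def respond(h_conj, f):
    """Correlate a new frame with the filter and return the response map."""
    return np.real(np.fft.ifft2(np.fft.fft2(f) * h_conj))

rng = np.random.default_rng(0)
frame0 = rng.random((32, 32))                          # stand-in for a textured target patch
h_conj = train_filter(frame0, gaussian_response(frame0.shape, (16, 16)))
frame1 = np.roll(frame0, shift=(3, -5), axis=(0, 1))   # target moved 3 rows down, 5 columns left
response = respond(h_conj, frame1)
peak = np.unravel_index(np.argmax(response), response.shape)
print("response peak:", peak, "(training peak was at (16, 16))")
```

The displacement of the response peak from its training position gives the target's motion between the two frames. The shift here is circular, which is the idealized case for an FFT-based filter.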

Kernel Correlation Filter Tracking Algorithm

In 2014, Henriques et al. proposed Kernelized Correlation Filters (KCF), which builds on the CSK algorithm (which used grayscale features) and combines kernel functions, circulant matrices, and HOG features to further improve tracking performance. The key steps of the KCF tracking algorithm are introduced below.

  1. Build a training sample set

One of the difficulties of tracking algorithms is that the number of samples is too small. Traditional tracking algorithms usually randomly sample around the target position to obtain samples, or obtain samples by rotating and scaling the target. However, the number of samples obtained by these two methods is very limited and the redundancy is high, and the target information cannot be fully learned, which will cause a large deviation in the follow-up tracking of the tracker and affect the tracking performance.

The KCF tracking algorithm introduces the circulant matrix. On the one hand, a large number of training samples is obtained by cyclically shifting the target sample. On the other hand, ridge regression is solved in the frequency domain of the target features, which makes learning and detection fast.

  1. train classifier

Training the target classifier in the KCF framework is a ridge regression problem, also known as regularized least squares. The goal is to find, from the training sample set, the classification function that minimizes the squared error between the predicted and true values of the samples plus a regularization term. The minimizer of this loss has a closed-form least-squares solution.

For the linear regression problem, the solution can be computed through the above steps, which greatly improves the calculation speed. In most cases, however, the problem to be solved is nonlinear. The KCF algorithm therefore introduces a kernel function, which maps the training samples into a high-dimensional space where the problem becomes linearly separable, so that ridge regression can be used in that space to find the classifier.
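For the linear case, the speed-up from cyclic shifts can be checked numerically. The sketch below is one-dimensional, uses a random base sample, and compares the explicit ridge solution on the circulant data matrix with an element-wise solution in the Fourier domain. The placement of the complex conjugate depends on the shift and DFT conventions; the form used here matches `np.roll` and `np.fft`, and may look different from the formula printed in the paper. Function names are hypothetical.

```python
import numpy as np

def circulant(x):
    """Data matrix whose rows are all cyclic shifts of the base sample x."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

def ridge_direct(x, y, lam):
    """Ridge regression w = (X^T X + lam I)^-1 X^T y on the explicit matrix."""
    X = circulant(x)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(x)), X.T @ y)

def ridge_fourier(x, y, lam):
    """Same solution computed element-wise in the Fourier domain."""
    x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
    w_hat = x_hat * y_hat / (x_hat * np.conj(x_hat) + lam)
    return np.real(np.fft.ifft(w_hat))

rng = np.random.default_rng(0)
x = rng.normal(size=16)                              # base sample (1-D for brevity)
k = np.arange(16)
y = np.exp(-0.5 * (np.minimum(k, 16 - k) / 2.0) ** 2)   # labels peak at zero shift
w_direct = ridge_direct(x, y, lam=0.1)
w_fourier = ridge_fourier(x, y, lam=0.1)
print("max difference:", np.max(np.abs(w_direct - w_fourier)))
```

The direct route builds and inverts an n x n matrix, while the Fourier route needs only a few FFTs of length n, which is where the speed advantage comes from. The kernelized version is not shown.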

  1. Target Rapid Detection

In the detection stage, a tracker usually computes the response values of the candidate regions one by one and selects the sample with the highest response as the target. In the KCF framework, the predicted region and its cyclic shifts form the candidate sample set, so the responses of all candidate samples can be obtained at once.

The largest response indicates the candidate sample that is closest to the training sample, and its position is the predicted target position. In addition, the circulant matrix can be diagonalized by the discrete Fourier transform, which improves computational efficiency.

  1. model update

For the classifier to stay discriminative and to cope with the various disturbances that occur during tracking, the model needs to be updated in time. The KCF framework follows the MOSSE update strategy, updating the appearance model and the classifier parameters in each frame to improve the robustness of the algorithm.

Object Tracking Method Based on Deep Learning

Deep learning has become popular in recent years and appears in many fields. With its strong feature modeling capability, it has also been widely used in target tracking. Deep-learning-based trackers can be roughly divided into two categories. The first uses a convolutional neural network to extract target features and then combines them with other tracking methods. The second trains an end-to-end neural network that implements all steps of tracking. The following briefly introduces the basic structure of neural networks, the typical convolutional neural network model, the working principle of the Siamese neural network, and several representative deep learning tracking algorithms.

Neural Networks

Neural network is a mathematical model inspired by the biological nervous system. As the underlying model of artificial intelligence, neural network has many complex applications. A basic neural network is shown below:

The neural network consists of three parts: input layer, hidden layer, and output layer, and adjacent layers are connected to each other. Among them, the input layer is responsible for data input, the output layer is responsible for data output, and the hidden layer is responsible for a series of mathematical operations. The purpose is to better linearly divide different types of data. The number of hidden layers determines the depth of the neural network.

A very important concept in neural networks is the activation function, which determines whether a certain neuron is activated and whether the information received by this neuron is useful. A neural network without an activation function is a linear regression model whose output is a linear combination of the inputs, no matter how many layers the network contains. The activation function introduces a nonlinear transformation, so that the neural network can be applied to a nonlinear model and can handle more complex tasks. Commonly used activation functions are sigmoid function, tanh function, and ReLU function.
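The claim that a network without activation functions stays linear can be checked on a tiny example. The layer sizes and random weights below are arbitrary assumptions; the point is only that two stacked linear layers equal a single linear map, whereas inserting an activation such as ReLU generally changes the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3))       # first layer: 3 inputs -> 4 hidden units
w2 = rng.normal(size=(2, 4))       # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

# Without an activation, two layers collapse into the single linear map w2 @ w1.
stacked_linear = w2 @ (w1 @ x)
single_linear = (w2 @ w1) @ x

# With ReLU between the layers the output is generally not that linear map.
with_relu = w2 @ relu(w1 @ x)

print("layers collapse to one linear map:", np.allclose(stacked_linear, single_linear))
print("sigmoid(0), tanh(0), relu(-2):", sigmoid(0.0), np.tanh(0.0), relu(-2.0))
```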

convolutional neural network

A convolutional neural network (CNN) is a feedforward neural network that was first proposed by LeCun and applied to handwritten character recognition. Because a CNN needs little hand-crafted preprocessing and can take the original image directly as input, it is simple to use and has been widely adopted, and many classic network architectures have emerged. A convolutional neural network consists of convolutional layers, pooling layers, fully connected layers, and an output layer. A typical convolutional neural network structure is shown in the following figure:

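Among these, the convolutional layer is the core of a CNN, and its main function is feature extraction. Shallow layers extract low-level features such as edges and contours, and deeper layers extract more complex features from the low-level ones. Convolving a two-dimensional image is similar to filtering it with a sliding window, so a convolution kernel is also called a filter. The choice of kernels determines the quality of the feature maps, and different kernels applied to the same image produce different feature maps. The more kernels there are, the more feature maps are extracted, but the computational cost rises accordingly; with too few kernels, the effective features of the input image cannot be extracted. The process of extracting image features with a convolution kernel is shown in the figure below: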
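As a small illustration of sliding-window filtering, the sketch below applies a hand-written vertical-edge kernel to a 6x6 image with a dark left half and a bright right half. The image, the kernel, and the function name `conv2d` are assumptions for the example; like most CNN libraries, it computes cross-correlation (the kernel is not flipped) with no padding.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D sliding-window filtering (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                                   # dark left half, bright right half
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)     # responds to left-to-right brightening
feature_map = conv2d(image, vertical_edge)
print(feature_map)                                   # every row is [0. 3. 3. 0.]
```

The feature map is nonzero only where the window straddles the edge, which is the sense in which a kernel "extracts" one kind of feature.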

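The pooling layer is also called the downsampling layer. The feature maps produced by a convolutional layer usually have too high a dimension, so a pooling layer is connected after the convolutional layer to reduce the feature dimension. It operates in much the same sliding-window way as the convolutional layer. The common pooling methods are max pooling and average pooling, as shown in the figure below. Max pooling takes the maximum value of the region covered by the sliding window as the output, and average pooling takes the average value of that region. Pooling not only reduces the feature dimension and the amount of computation, but also provides a degree of feature invariance and retains the most important features of the original image.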
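Both pooling methods can be sketched on a 4x4 feature map with a non-overlapping 2x2 window. The feature map values and the function name `pool2d` are made up for the example.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % size, :w - w % size]       # drop any ragged border
    blocks = blocks.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 9]], dtype=float)
print(pool2d(fmap, mode="max"))     # [[6. 2.] [2. 9.]]
print(pool2d(fmap, mode="mean"))    # [[3.5 1. ] [1.  6. ]]
```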

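The fully connected layer usually sits at the end of a convolutional neural network and is connected in the same way as in a traditional neural network. It is mainly responsible for joining all local features into global features and passing the output to the classifier. It does so by flattening the two-dimensional feature maps produced by the convolutions into a one-dimensional vector and multiplying that vector by a weight matrix. The size of the weight matrix is fixed and must match the size of the flattened vector, which means the input image must have a fixed size so that the feature maps reaching the fully connected layer match its weight matrix. Finally, the resulting image features are mapped to the output layer through a sigmoid function or another type of function to complete the classification task.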

Siamese neural network

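A Siamese neural network is an architecture that contains two or more identical sub-networks that share weights. Its goal is to obtain feature maps through multiple convolutional layers and then compare how similar two objects are. It is often used in tasks such as face verification and handwritten character recognition. Its structure is shown in the figure below: the two inputs enter the two sub-networks, which map them into a new space to form representations of the inputs in that space, and the similarity of the two inputs is evaluated by computing a loss.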
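The weight-sharing idea can be sketched with a toy embedding. This is only an illustration: the sub-network is a single random, untrained linear layer followed by tanh instead of a stack of convolutions, cosine similarity stands in for the learned similarity measure, and the names `embed` and `similarity` are hypothetical.

```python
import numpy as np

def embed(x, weights):
    """Shared sub-network: one linear layer followed by tanh."""
    return np.tanh(weights @ x)

def similarity(a, b, weights):
    """Cosine similarity between the two embeddings (both use the same weights)."""
    ea, eb = embed(a, weights), embed(b, weights)
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb)))

rng = np.random.default_rng(0)
shared_weights = rng.normal(size=(8, 16))             # untrained, for illustration only
template = rng.normal(size=16)
same_object = template + 0.05 * rng.normal(size=16)   # slightly perturbed view of the template
other_object = rng.normal(size=16)

print("template vs same object :", similarity(template, same_object, shared_weights))
print("template vs other object:", similarity(template, other_object, shared_weights))
```

Because both inputs pass through the same weights, identical inputs always map to identical embeddings; training would then adjust the shared weights so that different views of the same object also end up close together.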

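In addition, this kind of network can make full use of limited data for training, which is crucial for target tracking, because far less training data is available during tracking than in object detection.

If the sub-networks do not share weights, the network is called a pseudo-Siamese neural network. Its sub-networks may have the same structure or different structures. Unlike a Siamese network, a pseudo-Siamese network suits cases where the two inputs differ in nature, such as checking whether a title is consistent with the body text, or whether a text description matches the content of a picture. The network structure should be chosen according to the specific application.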


Origin blog.csdn.net/KANG157/article/details/129501093