GOTURN network understanding

The authors train the tracker completely offline and then run it with no online fine-tuning. Tracking reaches 100 fps on a GTX 680 and 160+ fps on a Titan X, which is genuinely fast. Fully offline training combined with 100 fps on a 680 is a real step forward for commercial applications.

Most previous work on deep-learning-based tracking could not meet real-time requirements: earlier CNN trackers reached only about 7 fps.

An aside: this paper is from 2016. Deep tracking has developed rapidly since then, and many trackers have surpassed GOTURN in accuracy (though not necessarily in speed). The network in this paper is similar to a Siamese network (for a follow-up, see SiameseFC, the tracker from Oxford's Luca Bertinetto).

First, the network diagram:

Correction: the following describes the inputs and output of the GOTURN network:

A visualization of the overall network structure:

Input 1: the previous frame, cropped to a region centered on the target.
Input 2: the current frame, cropped to the search region.
In the previous frame, assuming the target is centered at (cx, cy) with size (w, h), an image patch of size (2w, 2h) is extracted and fed into the CNN. Why a factor of 2? This is the new idea proposed by the author, based on the observation that the target box's frame-to-frame changes follow a Laplace distribution.
In the current frame, a patch of size (2w, 2h) is likewise extracted around the same center (cx, cy) and fed into the CNN.
From this pair of crops (previous and current frame), the network outputs the target window (its top-left and bottom-right coordinates).
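
As a rough illustration, here is a minimal NumPy sketch of the 2w × 2h cropping described above. It is my own sketch, not the paper's code: the function name is made up, and it clamps at the image border where the paper pads out-of-frame pixels instead.

```python
import numpy as np

def crop_patch(image, cx, cy, w, h, context=2.0):
    """Crop a (context*h, context*w) window centered on (cx, cy).

    Mirrors the crops described above: the previous frame is cropped
    around the last known target center, and the current frame is
    cropped at the same location to form the search region.
    NOTE: this clamps at the image border; the paper pads instead.
    """
    H, W = image.shape[:2]
    half_w, half_h = context * w / 2.0, context * h / 2.0
    x1, y1 = max(0, int(round(cx - half_w))), max(0, int(round(cy - half_h)))
    x2, y2 = min(W, int(round(cx + half_w))), min(H, int(round(cy + half_h)))
    return image[y1:y2, x1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)       # stand-in for a video frame
patch = crop_patch(frame, cx=320, cy=240, w=50, h=80)
print(patch.shape)                                    # (160, 100, 3): a 2h x 2w window
```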
The network structure and the role of each part:
The convolutional part uses a 5-layer structure (following the structure in CaffeNet; each convolutional layer uses a ReLU activation, and some are followed by pooling layers). These conv layers extract features from the target region and the search region, and are pre-trained on ImageNet and then fine-tuned.
The fully-connected part consists of 3 layers of 4096 nodes each, with dropout (supplement: see "understanding dropout") and ReLU activations between layers to guard against overfitting and vanishing gradients. The FC layers compare the target features with the search-region features and output the new target location: a 4-dimensional vector holding the top-left and bottom-right coordinates of the tracking window.
The overall framework is shown in the figure above: the author passes the previous frame's target and the current frame's search region through the CNN's conv layers, then feeds the conv outputs through the fully-connected layers to regress the target's position in the current frame.
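
To make the data flow concrete, here is a minimal PyTorch sketch of this two-branch structure. It is not the author's Caffe implementation: the conv tower is abbreviated to three layers (the real model reuses all five CaffeNet conv layers with ImageNet-pretrained weights), it is shared by both crops for brevity, and the layer sizes other than the three 4096-unit FC layers and the 4-dim output are placeholders.

```python
import torch
import torch.nn as nn

class GoturnLike(nn.Module):
    """Schematic two-branch regressor in the spirit of GOTURN.

    The abbreviated conv tower below is only a stand-in to keep the
    sketch self-contained; GOTURN uses the CaffeNet conv layers.
    """
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(            # 3 FC layers of 4096 units,
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),   # with ReLU and
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5), # dropout between
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5), # layers
            nn.Linear(4096, 4),             # (x1, y1, x2, y2) of the box
        )

    def forward(self, target, search):
        f_t = self.conv(target).flatten(1)  # previous-frame target features
        f_s = self.conv(search).flatten(1)  # current-frame search features
        return self.fc(torch.cat([f_t, f_s], dim=1))
```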
During training, the loss function is an L1 loss on the predicted box coordinates.
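
A quick sanity check of that regression loss, reusing the GoturnLike sketch above. The 227 × 227 input size and the normalized corner targets are my assumptions, not the paper's exact preprocessing.

```python
import torch
import torch.nn.functional as F

model = GoturnLike()                        # the sketch above
target = torch.randn(1, 3, 227, 227)        # previous-frame target patch
search = torch.randn(1, 3, 227, 227)        # current-frame search region
gt_box = torch.tensor([[0.25, 0.25, 0.75, 0.75]])   # normalized corners (assumed)

loss = F.l1_loss(model(target, search), gt_box)     # mean |pred - gt|
loss.backward()
```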

The author gives a table comparing several training-data configurations:


What is special about this paper:

Previously, no one had studied the relationship between target position and scale across frames. The author examined it using the ground truth: the current frame's target position and scale, relative to the previous frame's target, follow a particular distribution, namely a Laplace distribution. A detailed introduction to the Laplace distribution is given near the end of the paper.
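
This motion model is also what drives GOTURN's training-set augmentation: synthetic previous/current pairs are generated by perturbing a labeled box with Laplace-distributed shifts and scale changes. A minimal NumPy sketch follows; the b_shift and b_scale parameters are illustrative placeholders, not the paper's fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_motion(cx, cy, w, h, b_shift=0.2, b_scale=1/15):
    """Sample a plausible next-frame box under a Laplace motion model.

    Frame-to-frame translation (relative to box size) and scale change
    are drawn from zero-mean Laplace distributions; b_shift and b_scale
    are illustrative, not the paper's fitted parameters.
    """
    dx = rng.laplace(0.0, b_shift) * w
    dy = rng.laplace(0.0, b_shift) * h
    s = max(0.1, 1.0 + rng.laplace(0.0, b_scale))  # clipped scale change
    return cx + dx, cy + dy, w * s, h * s

print(sample_motion(100.0, 80.0, w=40.0, h=30.0))
```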

After finishing the paper, see the next post, which walks through the code implementation:

