Interpretation of Rotation Equivariant Networks for Tracking paper

1. Introduction

The task of visual object tracking with Siamese networks, referred to as Siamese tracking, casts tracking as similarity estimation between a template frame and a region sampled from a candidate frame.
Although Siamese trackers generally work well, they are prone to failure under partial occlusion, scale change, or when one of the two inputs is rotated.
The CNN architectures used in Siamese trackers are not inherently equivariant to in-plane rotations of the target. The model may therefore perform well on object orientations that are represented in the training set but fail on previously unseen orientations.
A straightforward way to make the model learn rotated variants is to train on a dataset in which in-plane rotations occur naturally or are introduced through data augmentation.
Limitations of Data Augmentation
1. Such procedures require learning separate representations for the different rotated variants of the data.
2. The more variations are considered, the more flexible the tracker model needs to be to capture them all.
3. Further, such an approach makes the model invariant to rotations, which makes predictions unreliable when the target is surrounded by similar objects, e.g., when tracking one fish in a school of fish.

Rotation non-equivariance

Regular CNN models used in object tracking are not rotation equivariant: applying an in-plane rotation $\psi_\theta$ to the output of the network $f$ does not give the same result as applying it to the input.

$$\psi_\theta(f(\cdot)) \neq f(\psi_\theta(\cdot))$$
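
The inequality can be checked numerically. The following is a minimal sketch, not the paper's experiment: a single hand-written cross-correlation in NumPy stands in for a CNN layer, and the rotation is restricted to 90 degrees so that `np.rot90` is exact and needs no interpolation. The helper `cross_correlate` and the random inputs are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(image, kernel):
    """'Valid' 2D cross-correlation, the operation a CNN layer applies."""
    windows = sliding_window_view(image, kernel.shape)  # (H', W', kh, kw)
    return np.einsum("ijab,ab->ij", windows, kernel)

rng = np.random.default_rng(0)
image = rng.normal(size=(9, 9))
kernel = rng.normal(size=(3, 3))   # generic filter without rotational symmetry

rotate = np.rot90                  # psi_theta for theta = 90 degrees (assumption)

rotate_then_filter = cross_correlate(rotate(image), kernel)   # f(psi(x))
filter_then_rotate = rotate(cross_correlate(image, kernel))   # psi(f(x))

# For a generic filter the two orders disagree: f is not rotation equivariant.
not_equivariant = not np.allclose(rotate_then_filter, filter_then_rotate)

# They agree once the filter is rotated together with the input.
matches_with_rotated_kernel = np.allclose(
    cross_correlate(rotate(image), rotate(kernel)), filter_then_rotate
)
print(not_equivariant, matches_with_rotated_kernel)
```

The second check hints at the remedy used later in these notes: if the layer also evaluates rotated copies of each filter, the response to a rotated input can be recovered from the response to the original input.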

Equivariance, invariance, and covariance

Equivariance:

The transformation and the function commute, so they can be applied in either order.
$$\mathrm{transform}[F(x)] = F(\mathrm{transform}[x])$$
Invariance:

The input $x$ is transformed, but the output of $F$ is unchanged.
$$F(x) = F(\mathrm{transform}[x])$$
Covariance:

The input $x$ is transformed by $\mathrm{transform}$, and the output of $F$ changes as well, though not by the same transformation; a different transformation $\mathrm{transform}^*$ applied to $F(x)$ reproduces the result.
$$\mathrm{transform}^*[F(x)] = F(\mathrm{transform}[x])$$
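
The three definitions can be illustrated with toy maps. This is a small sketch under stated assumptions: the transformation is a 90-degree rotation (`np.rot90`), and the three functions `pointwise`, `global_max`, and `transpose` are hypothetical stand-ins for $F$, chosen only because each one satisfies exactly one of the definitions in an obvious way.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))

def transform(a):
    """The input transformation: rotate by 90 degrees."""
    return np.rot90(a, k=1)

def transform_star(a):
    """A different output transformation: rotate by 90 degrees the other way."""
    return np.rot90(a, k=-1)

def pointwise(a):
    """Elementwise ReLU; acts on each pixel independently."""
    return np.maximum(a, 0.0)

def global_max(a):
    """Collapses the whole array to a single number."""
    return a.max()

def transpose(a):
    """Mirrors the array along its main diagonal."""
    return a.T

# Equivariance: transform[F(x)] == F(transform[x])
equivariant = np.allclose(transform(pointwise(x)), pointwise(transform(x)))

# Invariance: F(x) == F(transform[x])
invariant = np.isclose(global_max(x), global_max(transform(x)))

# Covariance: transform*[F(x)] == F(transform[x]), with transform* != transform
covariant = np.allclose(transform_star(transpose(x)), transpose(transform(x)))
transpose_is_not_equivariant = not np.allclose(
    transform(transpose(x)), transpose(transform(x))
)
```

Invariance discards the orientation entirely, which is why the notes above treat it as a drawback for tracking among similar objects; equivariance keeps the orientation information in the output.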

2. Related Work

Equivariant CNNs
SiamRPN++ proposed a training strategy that removes the spatial bias introduced by backbones that are not fully convolutional.
"Deeper and Wider Siamese Networks for Real-Time Visual Tracking" showed that existing tracking models induce a positional bias, which breaks strict translation equivariance.
"Scale Equivariance Improves Siamese Tracking" (SE-SiamNet) introduced scale-equivariant Siamese trackers, which matter when the camera zooms or when the target moves in depth.

3. Rotation Equivariant CNNs

Rotational background

Rotation Equivariance

SFC-NNs

"Learning Steerable Filters for Rotation Equivariant CNNs" indicated that one of the more robust ways of enforcing rotation equivariance in CNNs is to use steerable filters (SFC-NNs).
For rotation equivariance with steerable filters, the network must perform convolutions with different rotated versions of each filter.
Steerable filters not only make it efficient to compute responses for an arbitrary number of discrete filter rotations, they also have strong expressive power.


Knowledge Expansion: Spherical Harmonics

How harmonic functions are used in the paper

  1. Restricting the spherical harmonics to the plane, i.e. dropping $z$ and $\theta$ from the spherical coordinates, gives the circular harmonic function system ($i$ is the imaginary unit):

$$\psi_{jk}(r,\varphi) = \tau_j(r)\,e^{ik\varphi}$$

  • Two parameters determine the range of the basis functions: the angle $\varphi \in (-\pi, \pi]$ and the degree $j = 1, 2, \dots, J$, which indexes the radial function $\tau_j$.

  • $(r, \varphi)$ are the polar coordinates of the point $x = (x_1, x_2)$.

  • The angular function is $e^{ik\varphi}$; $k \in \mathbb{Z}$ is the angular frequency, also called the order. Its range depends on the current degree $j$ of the function system: $k \in [-j, j]$.

  2. Using Euler's formula, a rotation of the target by $\theta$ becomes a phase factor on each basis filter:

$$\rho_{\theta}\psi_{jk}(x) = e^{-ik\theta}\psi_{jk}(x)$$

This holds because the rotation only shifts the polar angle: $\psi_{jk}(r, \varphi - \theta) = \tau_j(r)\,e^{ik(\varphi - \theta)} = e^{-ik\theta}\,\psi_{jk}(r, \varphi)$. In image coordinates, where the vertical axis points down, $e^{-ik\theta}$ corresponds to a clockwise rotation by $\theta$ and $e^{+ik\theta}$ to a counterclockwise rotation by $\theta$.

Note that $\psi_{jk}(x)$ here stands for $\psi_{jk}(\cdot)$: $x$ is a generic argument, not a specific point.

  3. Each filter is constructed as a linear combination of the basis filters, with learned weights $w_{jk} \in \mathbb{C}$:

$$\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,\psi_{jk}(x)$$

  4. For a rotation by the angle $\theta$, the composed filter is steered through the phases of the basis filters:

$$\rho_{\theta}\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,e^{-ik\theta}\,\psi_{jk}(x)$$

A rotated version of the real-valued filter is obtained by taking the real part of $\Psi$, written $\mathrm{Re}\,\Psi(x)$.
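
The steering property above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the paper's implementation: the radial profiles $\tau_j$ are taken to be Gaussian rings, and the grid size, radii, `sigma`, and function names are all hypothetical. The check compares steering by $\theta = \pi/2$ against an exact 90-degree array rotation.

```python
import numpy as np

def circular_harmonic_basis(size, radii, max_freq, sigma=0.6):
    """Sample psi_jk(r, phi) = tau_j(r) * exp(i k phi) on an odd size x size grid."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    r = np.hypot(xs, ys)
    phi = np.arctan2(ys, xs)
    basis = np.zeros((len(radii), max_freq + 1, size, size), dtype=complex)
    for j, mu in enumerate(radii):
        tau = np.exp(-((r - mu) ** 2) / (2 * sigma ** 2))  # assumed Gaussian ring
        for k in range(max_freq + 1):
            psi = tau * np.exp(1j * k * phi)
            if k > 0:
                psi[r == 0] = 0.0  # the polar angle is undefined at the origin
            basis[j, k] = psi
    return basis

def steered_filter(weights, basis, theta):
    """Re sum_jk w_jk exp(-i k theta) psi_jk, i.e. the real part of rho_theta Psi."""
    k = np.arange(basis.shape[1])
    phase = np.exp(-1j * k * theta)
    return np.real(np.einsum("jk,k,jkhw->hw", weights, phase, basis))

rng = np.random.default_rng(0)
basis = circular_harmonic_basis(size=7, radii=[1.0, 2.0, 3.0], max_freq=3)
weights = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))  # w_jk in C

filter_0 = steered_filter(weights, basis, theta=0.0)
filter_90 = steered_filter(weights, basis, theta=np.pi / 2)

# Rows are indexed downward, so the phase factor exp(-i k theta) shows up as a
# clockwise rotation of the array, which np.rot90 expresses with k=-1.
steering_matches_rotation = np.allclose(filter_90, np.rot90(filter_0, k=-1))
```

Only the phases change between `filter_0` and `filter_90`; the basis is sampled once. That is the efficiency argument for steerable filters: any number of rotated copies costs one extra weighted sum each.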


4. Rotation Equivariant Siamese Trackers


4.1 Formulation Based on Siam-FC

The authors start from the basic SiamFC model and modify it, because of its simple design.

$$h(z,x) = f(z) * f(x)$$

$f(\cdot)$ is the feature extraction network, $z$ is the template, and $x$ is the search region.

$*$ denotes cross-correlation, implemented as a convolution operation.
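
As a rough illustration of this response map, the sketch below computes $h(z,x)$ for a single-channel toy input. It is not SiamFC itself: `embed` is a hypothetical placeholder for $f(\cdot)$ that only subtracts the mean, and the image sizes and target position are made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(search_features, template_features):
    """Slide the template features over the search features ('valid' mode)."""
    windows = sliding_window_view(search_features, template_features.shape)
    return np.einsum("ijab,ab->ij", windows, template_features)

def embed(image):
    """Placeholder for f(.): zero-mean normalisation instead of a CNN."""
    return image - image.mean()

rng = np.random.default_rng(0)
search = rng.normal(size=(40, 40))                        # x: search region
top, left = 11, 17                                        # hypothetical target position
template = search[top:top + 12, left:left + 12].copy()    # z: template crop

response = cross_correlate(embed(search), embed(template))    # h(z, x)
peak = np.unravel_index(np.argmax(response), response.shape)  # predicted location
```

The location of the maximum of `response` is the tracker's estimate of where the template appears in the search region.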

For the rotation-equivariant Siamese tracker, the authors introduce rotation-equivariant modules and a group max pooling module, which selects the cross-correlation encoding of the best-matching orientation among the multiple heatmaps generated.

Network model

  1. The candidate head of the network, which processes the search region, takes a single search image and is left unchanged.

  2. The template head of the network is modified so that multiple template images (the rotated templates shown in the figure) can be given as input. The $\Lambda$ rotated variants form the set $Z = \{z_1, z_2, \dots, z_\Lambda\}$, which covers all rotation angles considered.

  3. The feature $f(z)$ of the initial target is computed first and then rotated; because the network is rotation equivariant, this is theoretically equivalent to computing the features of the rotated templates.

  4. Rotate the target in the template:

$$y_{\hat{c}}^{(1)}(x,\theta) = \mathrm{Re}\sum_{c=1}^{C}\sum_{j=1}^{J}\sum_{k=0}^{K} w_{\hat{c}cjk}\,e^{-ik\theta}\,(I_c * \psi_{jk})(x)$$

where

  • $I_c$ is channel $c$ of the input image, $c \in \{1, 2, \dots, C\}$
  • $\rho_{\theta}\Psi_{\hat{c}c}^{(1)}$ is the rotated filter
  • $\hat{c} \in \{1, 2, \dots, \hat{C}\}$ indexes the output channels
  • The equidistant rotation angles $\theta$ are taken from the set $\Theta = \{0, \frac{2\pi}{\Lambda}, \dots, 2\pi\frac{\Lambda-1}{\Lambda}\}$
  • The bias term $\beta_{\hat{c}}^{(1)}$ is applied in the first layer to obtain the feature map $\zeta_{\hat{c}}^{(1)}$
  • The nonlinearity $\sigma_{\hat{c}}^{(1)}$ is applied in the first layer to obtain the feature map $\zeta_{\hat{c}}^{(1)}$
  5. Rotation-equivariant convolution

$$y_{\hat{c}}^{(l)}(x,\theta) = \mathrm{Re}\sum_{c=1}^{C}\sum_{\phi \in \Theta}\sum_{j,k} w_{\hat{c}cjk,\theta-\phi}\,e^{-ik\theta}\,\big(\zeta_c^{(l-1)}(\cdot,\phi) * \psi_{jk}\big)(x)$$

Here the subscript $\theta - \phi$ on the weight $w$ indicates that a group convolution is performed along the orientation dimension.

  6. Rotation-equivariant pooling

The output of the last group convolution layer is further processed along the orientation dimension. Unlike in a conventional classification network, this pooling is not performed over the spatial dimensions $W \times H$ but over the orientation group $\{0, \frac{2\pi}{8}, \frac{4\pi}{8}, \dots, \frac{14\pi}{8}\}$, which preserves rotation equivariance.

  7. Rotation-equivariant cross-correlation

  • The two sub-networks of RE-SiamNet produce the feature maps $\phi(z)$ and $\phi(x)$.

  • $\phi(z)$ is a set of feature maps, one for each of the $\Lambda$ rotation angles.

  • The cross-correlation layer computes one heatmap per rotated template feature map, giving the set $\{\hat{h}(z,x)\}$ with $\hat{h}(z_i, x) = \phi(z_i) * \phi(x)$ for $i = 1, \dots, \Lambda$.

  • Global max pooling over $\{\hat{h}(z,x)\}$ outputs a single heatmap $h(Z,x)$, i.e., it picks the largest $\hat{h}$ in $\{\hat{h}(z,x)\}$.
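
The rotation-equivariant first layer and the pooling over the orientation dimension described above can be sanity-checked with a simplified model. The sketch below makes several assumptions that differ from the paper: it uses $\Lambda = 4$ so that `np.rot90` gives exact rotations, it rotates a plain filter instead of steering a circular harmonic expansion, and it omits bias and nonlinearity. All function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

NUM_ORIENTATIONS = 4  # assumption: 90-degree steps, so np.rot90 is exact

def cross_correlate(image, kernel):
    """'Valid' 2D cross-correlation."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijab,ab->ij", windows, kernel)

def lifting_layer(image, kernel):
    """Correlate with every rotated copy of the filter: output (Lambda, H', W')."""
    return np.stack([
        cross_correlate(image, np.rot90(kernel, k=t))
        for t in range(NUM_ORIENTATIONS)
    ])

def orientation_pool(features):
    """Max over the orientation axis, not over the spatial axes."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
image = rng.normal(size=(10, 10))
kernel = rng.normal(size=(3, 3))

features = lifting_layer(image, kernel)
features_rot = lifting_layer(np.rot90(image), kernel)

# Rotating the input rotates every map spatially and cyclically shifts the
# orientation axis by one step.
expected = np.roll(np.rot90(features, k=1, axes=(1, 2)), shift=1, axis=0)
layer_is_equivariant = np.allclose(features_rot, expected)

# Pooling over orientations removes the cyclic shift but keeps the spatial
# rotation, so the pooled map is still rotation equivariant.
pooled_is_equivariant = np.allclose(
    orientation_pool(features_rot), np.rot90(orientation_pool(features))
)
```

The cyclic shift along the orientation axis is the discrete counterpart of the index $\theta - \phi$ in the group convolution: a rotation of the input moves information between orientation channels rather than destroying it.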


4.2 Constructing RE-SiamNet Framework

  1. Identify the precision of the tracker in discriminating between orientations of the rotational degree of freedom. The authors consider $\Lambda$ rotation groups, so that RE-SiamNet is perfectly equivariant to the angles in the set $\Theta = \{\frac{(i-1)}{\Lambda} \cdot 2\pi\}_{i=1}^{\Lambda}$; with $\Lambda = 8$ this is $\{(i-1)\frac{2\pi}{8}\}_{i=1}^{8}$, an arithmetic progression of angles.
  2. Define the non-parametric encoding $\phi(\cdot)$ based on an existing Siamese tracker. The discriminative power of the tracker varies with the choice of $\phi(\cdot)$.
  3. Replace all convolutional layers of $\phi(\cdot)$ with rotation-equivariant modules. Here the e2cnn library is used to implement them.
  4. Instead of a single convolution that generates $h(z,x)$, perform $\Lambda$ convolutions to generate $\Lambda$ different heatmaps (8 heatmaps for $\Lambda = 8$).
  5. Perform global max pooling over the resulting feature maps (the $\Lambda$ heatmaps) to generate $h(Z,x)$, which is then processed to localize the target.
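
The construction steps above can be put together in a toy pipeline. This is a sketch under explicit assumptions, not the RE-SiamNet code: `encode` is a hypothetical stand-in for the rotation-equivariant encoder $\phi(\cdot)$ (mean subtraction, which commutes with rotation), $\Lambda$ is set to 4 instead of 8 so that `np.rot90` rotates exactly, and the image sizes and target position are invented.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(search_features, template_features):
    """'Valid' 2D cross-correlation of template features over search features."""
    windows = sliding_window_view(search_features, template_features.shape)
    return np.einsum("ijab,ab->ij", windows, template_features)

def encode(image):
    """Stand-in for phi(.); it commutes with rotation, as an equivariant encoder would."""
    return image - image.mean()

def re_siam_response(template, search, num_orientations=4):
    """Return the Lambda heatmaps h_hat(z_i, x) and their maximum h(Z, x)."""
    template_features = encode(template)
    search_features = encode(search)
    heatmaps = np.stack([
        # Rotate the template features instead of re-encoding rotated templates.
        cross_correlate(search_features, np.rot90(template_features, k=i))
        for i in range(num_orientations)
    ])
    return heatmaps, heatmaps.max(axis=0)

rng = np.random.default_rng(0)
search = rng.normal(size=(40, 40))
top, left = 9, 21                                   # hypothetical target position
target = search[top:top + 12, left:left + 12]
template = np.rot90(target, k=-1).copy()            # template shows the target rotated

heatmaps, response = re_siam_response(template, search)
single = cross_correlate(encode(search), encode(template))  # one heatmap, no rotations

peak = np.unravel_index(np.argmax(response), response.shape)
```

In this toy setting the single unrotated heatmap has no pronounced peak at the target, while the pooled response `response` does, because one of the rotated template copies matches the target's orientation in the search image.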


5. Unsupervised Relative Rotation Estimation


5.1 Unsupervised 2D pose estimation


  • By design, RE-SiamNet provides an estimate of the relative change in the 2D pose of the target in a fully unsupervised manner. This information is obtained from the result of the group max pooling step.
  • Let $i \in \{1, 2, \dots, \Lambda\}$ denote one of the $\Lambda$ orientations of the template. Then $i$ is the number of rotation groups by which the pose of the template differs from its appearance in the candidate image if:
    $$h(Z,x) = \hat{h}(z_i, x) = \text{group-maxpool}(\{\hat{h}(z, x)\})$$
    With $\Lambda = 8$, $i$ identifies one of the 8 template rotations, and each rotation group corresponds to a step of 45 degrees.
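
Reading the relative rotation off the heatmaps amounts to an argmax over the orientation axis. The sketch below assumes the $\Lambda$ heatmaps are already available as an array of shape `(Lambda, H, W)`; the function name and the toy data are hypothetical. Note that the code uses a 0-based index, so it corresponds to $i - 1$ in the notation above and index 0 means no relative rotation.

```python
import numpy as np

def estimate_relative_rotation(heatmaps):
    """heatmaps: array (Lambda, H, W) holding h_hat(z_i, x) for each orientation.

    Returns the 0-based orientation index of the strongest response, the
    corresponding angle in degrees, and the spatial position of the peak.
    """
    num_orientations = heatmaps.shape[0]
    index, row, col = np.unravel_index(np.argmax(heatmaps), heatmaps.shape)
    step_degrees = 360.0 / num_orientations     # 45 degrees for Lambda = 8
    return int(index), float(index * step_degrees), (int(row), int(col))

# Toy stack for Lambda = 8 with one dominant response planted at orientation 3.
rng = np.random.default_rng(0)
heatmaps = rng.uniform(0.0, 1.0, size=(8, 17, 17))
heatmaps[3, 5, 9] = 2.0

index, angle, position = estimate_relative_rotation(heatmaps)
```

No rotation labels are involved at any point: the estimate falls out of which rotated template copy wins the group max pooling, which is why the notes call it unsupervised. Its resolution is limited to the step $360 / \Lambda$ degrees.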


Origin blog.csdn.net/Soonki/article/details/131252684