1. Introduction
Example of rotation non-equivariance in the regular CNN models used in object tracking:
A conventional CNN feature extractor $f$ is not rotation equivariant. For a rotation $\psi_\theta$ by angle $\theta$:
$$\psi_\theta(f(\cdot)) \neq f(\psi_\theta(\cdot))$$
Equivariance:
The transformation and the function commute, so they can be applied in either order:
$$\mathrm{transform}[F(x)] = F(\mathrm{transform}[x])$$
Invariance:
The input $x$ is transformed, but the output of $F$ is unchanged:
$$F(x) = F(\mathrm{transform}[x])$$
Covariance:
The input $x$ is transformed by $\mathrm{transform}$ and the output of $F$ changes as well, not by $\mathrm{transform}$ itself but by another transformation $\mathrm{transform}^{*}$ that makes the two sides equal:
$$\mathrm{transform}^{*}[F(x)] = F(\mathrm{transform}[x])$$
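The three definitions above can be checked on a toy 1D signal. This is a minimal sketch, not something from the paper: a cyclic shift stands in for the transformation, and the helper names `shift`, `pointwise_square`, `total`, and `argmax` are illustrative assumptions.

```python
def shift(xs, n):
    """Cyclically shift a list right by n positions (plays the role of 'transform')."""
    n %= len(xs)
    return xs[-n:] + xs[:-n] if n else list(xs)

def pointwise_square(xs):
    """Acts on each element independently, so it commutes with shifting."""
    return [v * v for v in xs]

def total(xs):
    """Sums all elements, so the result ignores the shift."""
    return sum(xs)

def argmax(xs):
    """Index of the largest element; a shift moves this index."""
    return max(range(len(xs)), key=lambda i: xs[i])

x = [1, 2, 3, 4]

# Equivariance: transform[F(x)] == F(transform[x])
equivariant = shift(pointwise_square(x), 1) == pointwise_square(shift(x, 1))

# Invariance: F(x) == F(transform[x])
invariant = total(x) == total(shift(x, 1))

# Covariance: the output changes by a different map (index + 1 mod length)
covariant = (argmax(x) + 1) % len(x) == argmax(shift(x, 1))

print(equivariant, invariant, covariant)
```

All three flags evaluate to true for this input. Each function relates to the shift in a different way, which is the distinction the definitions draw.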
2. Related Work
3. Rotation Equivariant CNNs
Rotational background
Rotation Equivariance
SFC-NNs
Background: Spherical Harmonics
How the paper uses spherical harmonics
- Dropping $z$ and the angle $\theta$ from spherical coordinates reduces the spherical harmonics to the circular harmonic function system:
$$\psi_{jk}(r,\varphi) = \tau_j(r)\,e^{ik\varphi}$$
- Two parameters set the range of the basis functions (with radial function $\tau_j$):
  - the angle $\varphi \in (-\pi, \pi]$
  - the degree $j = 1, 2, \dots, J$
- $(r, \varphi)$ are the polar coordinates of the point $(x_1, x_2)$, where $\varphi$ is the angle.
- The angular function is $e^{ik\varphi}$, where $k$ is also called the order.
- $k \in \mathbb{Z}$, and its value is tied to the current degree $j$ of the function system: $k \in [-j, j]$.
- Using Euler's formula, a rotation of the target by $\theta$ is expressed as a phase factor on each basis filter:
$$\rho_{\theta}\psi_{jk}(x) = e^{-ik\theta}\psi_{jk}(x)$$
Here $e^{-ik\theta}$ corresponds to a clockwise rotation by $\theta$ and $e^{+ik\theta}$ to a counterclockwise rotation by $\theta$. Note that $\psi_{jk}(x)$ stands for $\psi_{jk}(\cdot)$: $x$ is a generic argument, not a specific point.
- Each filter is constructed as a linear combination of the basis filters, with learned weights $w_{jk} \in \mathbb{C}$:
$$\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,\psi_{jk}(x)$$
- For a rotation by angle $\theta$, the composed filter is steered through the phases of the basis filters:
$$\rho_{\theta}\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,e^{-ik\theta}\,\psi_{jk}(x)$$
A filter at a given orientation is obtained by taking the real part of $\Psi$, written $\mathrm{Re}\,\Psi(x)$.
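The steering property can be checked numerically at a single point. The sketch below rests on assumptions that are not in the paper: a Gaussian ring is used for the radial profile $\tau_j$, the weights are arbitrary example values, and $\rho_\theta$ is taken to act by evaluating the filter at angle $\varphi - \theta$.

```python
import cmath
import math

def tau(j, r, sigma=0.6):
    # Assumed radial profile: a Gaussian ring centred at radius j.
    return math.exp(-((r - j) ** 2) / (2.0 * sigma ** 2))

def psi(j, k, r, phi):
    # Circular harmonic basis filter tau_j(r) * exp(i * k * phi).
    return tau(j, r) * cmath.exp(1j * k * phi)

def composed(weights, r, phi, theta=0.0):
    # Steered filter: each basis filter is multiplied by exp(-i * k * theta).
    return sum(w * cmath.exp(-1j * k * theta) * psi(j, k, r, phi)
               for (j, k), w in weights.items())

# Hypothetical complex weights w_jk, keyed by (j, k).
weights = {(1, 0): 0.5 + 0.0j, (1, 1): 0.3 - 0.2j, (2, 2): -0.1 + 0.4j}
r, phi, theta = 1.3, 0.7, math.pi / 4

steered = composed(weights, r, phi, theta)      # phase-steered filter value
resampled = composed(weights, r, phi - theta)   # unsteered filter at the rotated angle
real_filter_value = steered.real                # Re of the steered filter at this point

print(abs(steered - resampled) < 1e-12)
```

The two values agree up to floating-point error, so rotating the filter never requires resampling it: changing the phases of the weights is enough.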
4. Rotation Equivariant Siamese Trackers
4.1 Formulation Based on Siam-FC
$$h(z,x) = f(z) * f(x)$$
$f(\cdot)$ is the feature extraction network.
$*$ denotes the cross-correlation (convolution) operation.
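A minimal sketch of the Siam-FC response map $h(z,x)$ on nested lists. The identity function `f` is a placeholder assumption for the feature extractor (a real tracker uses a CNN), and `cross_correlate` is an illustrative helper, not code from the paper.

```python
def cross_correlate(template, search):
    """Valid 2D cross-correlation of two nested-list feature maps."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    return [[sum(template[a][b] * search[i + a][j + b]
                 for a in range(th) for b in range(tw))
             for j in range(sw - tw + 1)]
            for i in range(sh - th + 1)]

def f(image):
    # Placeholder feature extractor (identity), assumed for illustration only.
    return image

z = [[1, 0],
     [0, 1]]
x = [[0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]

h = cross_correlate(f(z), f(x))
# Location of the strongest response, as (value, (row, col)).
peak = max((v, (i, j)) for i, row in enumerate(h) for j, v in enumerate(row))
print(h, peak)
```

The peak of `h` marks where the template best matches the search region, which is how the tracker localizes the target.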
- The candidate head of the network, which processes the search region, takes a single search image and is unchanged.
- The template head of the network is modified to accept multiple template images (the rotated templates shown in the figure). The $\Lambda$ rotated versions form the set $Z = \{z_1, z_2, \dots, z_\Lambda\}$, which covers all possible rotation angles.
- First compute the feature $f(z)$ of the initial target, then rotate $f(z)$. Because the network is rotation equivariant, this is theoretically valid.
- Rotate the target in the template:
$$y_{\hat{c}}^{(1)}(x,\theta) = \mathrm{Re} \sum_{c=1}^{C}\sum_{j=1}^{J}\sum_{k=0}^{K} w_{\hat{c}cjk}\,e^{-ik\theta}\,(I_c * \psi_{jk})(x)$$
where
- $I_c$ is channel $c$ of the image, $c \in \{1, 2, \dots, C\}$.
- $\rho_{\theta}\Psi_{\hat{c}c}^{(1)}$ is the rotated filter.
- $\hat{c} \in \{1, 2, \dots, \hat{C}\}$.
- The equidistant rotation angles $\theta$ are taken from the set $\Theta = \{0, \frac{2\pi}{\Lambda}, \dots, 2\pi\frac{\Lambda-1}{\Lambda}\}$.
- The bias term $\beta_{\hat{c}}^{(1)}$ is applied in the first layer to obtain the feature map $\zeta_{\hat{c}}^{(1)}$.
- The nonlinearity $\sigma_{\hat{c}}^{(1)}$ is applied in the first layer to obtain the feature map $\zeta_{\hat{c}}^{(1)}$.
- Rotation-equivariant convolution:
$$y_{\hat{c}}^{(l)}(x,\theta) = \mathrm{Re}\sum_{c=1}^{C}\sum_{\phi \in \Theta}\sum_{j,k} w_{\hat{c}cjk,\theta-\phi}\,e^{-ik\theta}\,(\zeta_c^{(l-1)}(\cdot,\phi) * \psi_{jk})(x)$$
where the subscript $\theta-\phi$ on the weight $w$ indicates that a group convolution is performed over the angle dimension.
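The role of the $\theta-\phi$ index can be isolated in a stripped-down sketch. It keeps only the orientation axis and deliberately drops the spatial convolution, the channels, and the steering phase from the formula, so it is an illustration of the cyclic indexing rather than the paper's layer. The weights and responses are hypothetical integers.

```python
def cyclic_group_conv(w, zeta):
    """y[t] = sum over p of w[(t - p) mod n] * zeta[p], for n discrete orientations."""
    n = len(zeta)
    return [sum(w[(t - p) % n] * zeta[p] for p in range(n)) for t in range(n)]

def roll(values, s):
    """Cyclic shift: the entry at orientation t moves to orientation t + s."""
    n = len(values)
    return [values[(t - s) % n] for t in range(n)]

w = [2, 1, 0, -1]      # hypothetical weights, indexed by theta - phi
zeta = [1, 4, 0, 3]    # hypothetical responses at 4 orientations

y = cyclic_group_conv(w, zeta)
y_from_rotated_input = cyclic_group_conv(w, roll(zeta, 1))

# Shifting the input orientations shifts the output by the same amount.
print(y, y_from_rotated_input == roll(y, 1))
```

Because the weights depend only on the difference $\theta-\phi$, a cyclic shift of the input orientations produces the same shift in the output, which is the equivariance property along the angle dimension.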
- Rotation-equivariant pooling:
The output of the last group-convolution layer is further processed along the rotation dimension. Unlike in a conventional classification network, this pooling is not performed over the spatial dimensions $W \times H$ but over the orientation dimension $\{0, \frac{2\pi}{8}, \frac{4\pi}{8}, \dots, \frac{14\pi}{8}\}$, which preserves the rotation-equivariant features.
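A minimal sketch of pooling over the orientation axis instead of the spatial axes. The layout `features[o][i][j]` (orientation, row, column) and the example values are assumptions made for illustration; only the permutation of orientation channels is modelled here, not the accompanying spatial rotation.

```python
def orientation_max_pool(features):
    """features[o][i][j] -> pooled[i][j], taking the max over orientations o."""
    rows, cols = len(features[0]), len(features[0][0])
    return [[max(fmap[i][j] for fmap in features) for j in range(cols)]
            for i in range(rows)]

# Hypothetical responses for 3 orientations on a 2 x 2 spatial grid.
features = [
    [[0.1, 0.9], [0.3, 0.2]],
    [[0.4, 0.5], [0.8, 0.1]],
    [[0.7, 0.2], [0.0, 0.6]],
]

pooled = orientation_max_pool(features)

# A cyclic shift of the orientation channels leaves the pooled map unchanged.
shifted = features[1:] + features[:1]
print(pooled, orientation_max_pool(shifted) == pooled)
```

The spatial size is untouched, so localization information survives while the dependence on the orientation channel order is removed.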
- Rotation-equivariant cross-correlation:
  - The two sub-networks of RE-SiamNet produce the feature-map sets $\phi(z)$ and $\phi(x)$.
  - $\phi(z)$ is the set of feature maps for the $\Lambda$ rotation angles.
  - The cross-correlation layer computes the set of heat maps $\{\hat{h}(z,x)\}$, one per rotation angle of the template feature map: $\hat{h}(z_i,x) = \phi(z_i) * \phi(x)$.
  - Global max pooling over $\{\hat{h}(z,x)\}$ outputs a single heat map $h(Z,x)$, that is, it picks the largest $\hat{h}$ in $\{\hat{h}(z,x)\}$.
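The rotated-template matching described above can be sketched on a small grid. Several simplifications are assumed here and are not the paper's setup: $\Lambda = 4$ with exact 90-degree steps, raw templates rotated directly instead of equivariant features, and an unnormalised correlation. The helper names are illustrative.

```python
def rot90(m):
    """Rotate a square nested-list map by 90 degrees."""
    n = len(m)
    return [[m[j][n - 1 - i] for j in range(n)] for i in range(n)]

def cross_correlate(template, search):
    """Valid 2D cross-correlation of two nested-list maps."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    return [[sum(template[a][b] * search[i + a][j + b]
                 for a in range(th) for b in range(tw))
             for j in range(sw - tw + 1)]
            for i in range(sh - th + 1)]

def group_max_heatmap(template, search, n_rot=4):
    """Correlate every rotated template, then keep the map with the largest peak."""
    heatmaps, z = [], template
    for _ in range(n_rot):
        heatmaps.append(cross_correlate(z, search))
        z = rot90(z)
    peaks = [max(v for row in h for v in row) for h in heatmaps]
    best = max(range(n_rot), key=lambda i: peaks[i])
    return heatmaps[best], best

z = [[1, 2],
     [0, 0]]
# The search image contains the template rotated by one 90-degree step.
x = [[0, 0, 0, 0],
     [0, 2, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]

heat, best = group_max_heatmap(z, x)
print(best, heat)
```

The winning index (`best`, zero-based here) identifies which rotated template matched, and the returned heat map plays the role of $h(Z,x)$.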
4.2 Constructing RE-SiamNet Framework
- Decide the precision with which the tracker discriminates between orientations of the rotational degree of freedom. The authors consider $\Lambda$ rotation groups, for which RE-SiamNets are perfectly equivariant to the angles in the set $\Theta = \{\frac{i-1}{\Lambda} \cdot 2\pi\}_{i=1}^{\Lambda}$; for $\Lambda = 8$ this gives $\{(i-1)\frac{2\pi}{8}\}_{i=1}^{8}$.
- Define the non-parametric encoding $\phi(\cdot)$ based on existing Siamese trackers. The discriminative power of the tracker varies with the choice of $\phi(\cdot)$.
- Replace all the convolutional layers of $\phi(\cdot)$ with rotation-equivariant modules.
- Instead of a single convolution to generate $h(z,x)$, $\Lambda$ convolutions are performed to generate $\Lambda$ different heatmaps.
- Perform global max-pooling over the feature maps to generate $h(Z,x)$, which is then processed to localize the target.
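For concreteness, the angle set $\Theta$ from the first step can be enumerated directly. This is a small sketch; `rotation_angles` is a hypothetical helper name.

```python
import math

def rotation_angles(num_groups):
    """Angles (i - 1) / num_groups * 2 * pi for i = 1, ..., num_groups, in radians."""
    return [(i - 1) / num_groups * 2.0 * math.pi for i in range(1, num_groups + 1)]

theta = rotation_angles(8)
step_degrees = math.degrees(theta[1] - theta[0])
print(len(theta), round(step_degrees, 6))
```

With $\Lambda = 8$ the angles are spaced 45 degrees apart, so the network is exactly equivariant only to multiples of that step.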
5. Unsupervised Relative Rotation Estimation
5.1 Unsupervised 2D pose estimation
- The design of RE-SiamNet makes it possible to estimate relative changes in the 2D pose of the target in a fully unsupervised manner. This information comes from the result of the group max-pooling step.
- Let $i \in \{1, 2, \dots, \Lambda\}$ denote one of the $\Lambda$ orientations of the template. Then $i$ is the number of rotation groups by which the pose of the template differs from its appearance in the candidate image if:
$$h(Z,x) = \hat{h}(z_i, x) = \text{group-maxpool}(\{\hat{h}(z, x)\})$$
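A minimal sketch of reading the pose estimate off the group max-pooling step. It assumes the per-orientation heat maps are already available as nested lists, and it maps the winning index $i$ to the angle $(i-1)\frac{2\pi}{\Lambda}$ following the set $\Theta$ from Section 4.2; `estimate_relative_rotation` and the example values are hypothetical.

```python
import math

def estimate_relative_rotation(heatmaps):
    """Return (i, angle): 1-based index of the winning orientation and its angle in radians."""
    num_groups = len(heatmaps)
    peaks = [max(v for row in h for v in row) for h in heatmaps]
    i = max(range(num_groups), key=lambda n: peaks[n]) + 1
    angle = (i - 1) / num_groups * 2.0 * math.pi
    return i, angle

# Hypothetical heat maps for 4 template orientations.
heatmaps = [
    [[0.1, 0.2], [0.0, 0.1]],
    [[0.3, 0.1], [0.2, 0.2]],
    [[0.2, 0.9], [0.1, 0.0]],
    [[0.4, 0.3], [0.2, 0.1]],
]

i, angle = estimate_relative_rotation(heatmaps)
print(i, angle)
```

No rotation labels are involved: the estimate falls out of which orientation wins the max-pooling, which is why the procedure is unsupervised.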