[Paper Notes] VPS: "Video Polyp Segmentation: A Deep Learning Perspective"

Contributions:

  • Speed: 170 fps
  • Video polyp segmentation dataset: SUN-SEG
  • VPS baseline: PNS+ (a "baseline" here means a reference method; performance below it is considered unacceptable for future work)
  • VPS Benchmark

Challenges targeted: colonic polyp diversity (e.g., border contrast, shape, orientation, camera angle), internal artifacts (e.g., water flow, residue), and imaging degradation (e.g., color distortion, specular reflection).

SUN-SEG dataset

Based on the SUN dataset, SUN-SEG adds new annotations, including object masks, boundaries, scribbles, polygons, and attribute labels.

Network Architecture


Global Encoder

The first frame $(H', W', 3)$ of the $T$-frame sequence is taken as the anchor frame, and the anchor feature $A^h \in \mathbb{R}^{H^h \times W^h \times C^h}$ is extracted.

Local Encoder

Consecutive frames in the sliding window are used as input, and the encoder extracts two sets of features: high-level and low-level.
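A minimal sketch of how the local encoder might expose low-level and high-level features for each frame of the window, assuming a torchvision ResNet-50 trunk (the paper's actual backbone and feature stages may differ):

```python
import torch
import torchvision

# Hypothetical local encoder: a ResNet-50 trunk split into a low-level stage
# and a high-level stage (the paper's backbone and stage choices may differ).
backbone = torchvision.models.resnet50(weights=None)
low_stage = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1)                                   # low-level features
high_stage = torch.nn.Sequential(backbone.layer2, backbone.layer3)  # high-level features

frames = torch.randn(5, 3, 256, 448)   # T consecutive frames in the sliding window
low = low_stage(frames)                 # (T, 256, 64, 112)
high = high_stage(low)                  # (T, 1024, 16, 28)
print(low.shape, high.shape)
```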

NS (normalized self-attention) block

Dynamically update the receptive field


channel division

After obtaining the Q, K, and V matrices ($T \times H \times W \times C$), they are split into $N$ groups along the channel dimension to obtain $Q_i, K_i, V_i \in \mathbb{R}^{T\times H\times W\times \frac{C}{N}}$, which are fed into $N$ self-attention modules respectively.
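A minimal sketch of the channel split, assuming Q, K, V tensors of shape (T, H, W, C) and N heads; the sizes are toy values:

```python
import torch

T, H, W, C, N = 5, 16, 28, 32, 4   # toy sizes; C must be divisible by N
Q = torch.randn(T, H, W, C)
K = torch.randn(T, H, W, C)
V = torch.randn(T, H, W, C)

# Split each tensor into N groups along the channel dimension:
# each Q_i, K_i, V_i has shape (T, H, W, C // N) and feeds one attention head.
Q_groups = Q.chunk(N, dim=-1)
K_groups = K.chunk(N, dim=-1)
V_groups = V.chunk(N, dim=-1)
print(Q_groups[0].shape)  # torch.Size([5, 16, 28, 8])
```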

Query dependency rules

Referenced: PCSA

To model the spatio-temporal relationship between consecutive frames, the similarity between the split query features $(Q_i)_{i=1}^N$ and key features $(K_i)_{i=1}^N$ must be measured. Following PCSA, $N$ relevance measuring blocks are introduced to compute the spatio-temporal affinity within a constrained neighborhood of each target pixel.

In the non-local block, each query pixel is related to all pixels in K, i.e. the query position attends to key features at every position; in contrast, the blocks in this paper gradually expand the neighborhood from which keys are taken.


Specifically, similar to a pyramid network: given a pixel $X^q$ of the $Q_i$ matrix (more precisely, all $C/N$ channel values at height $x$, width $y$ of the $z$-th frame), and given the window size $k$ and the dilation rate $d_i$ of the dilated convolution, a region of $K_i$ with heights in $(x-kd_i,\, x+kd_i)$ and widths in $(y-kd_i,\, y+kd_i)$ is selected, and the values are summed over all channels in all frames. As the block index among the $N$ blocks increases, $d_i = 2i-1$ grows, which is equivalent to relating $Q_i$ to a larger range of $K_i$, similar to enlarging the receptive field.
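A rough sketch of how the dilated key window for head $i$ could be gathered with `torch.nn.functional.unfold`; the window size, dilation, and similarity computation here are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

T, H, W, Cn = 5, 16, 28, 8      # toy sizes; Cn = C / N channels per head
k, d_i = 3, 3                    # window size and dilation for head i (d_i = 2i - 1)
Q_i = torch.randn(T, Cn, H, W)
K_i = torch.randn(T, Cn, H, W)

# For every spatial location, gather the k*k dilated neighbourhood of keys.
pad = d_i * (k // 2)
K_win = F.unfold(K_i, kernel_size=k, dilation=d_i, padding=pad)  # (T, Cn*k*k, H*W)
K_win = K_win.view(T, Cn, k * k, H * W)

# Similarity between each query pixel and the keys inside its dilated window,
# accumulated over channels and over all T frames.
Q_flat = Q_i.view(T, Cn, 1, H * W)
affinity = (Q_flat * K_win).sum(dim=1)   # (T, k*k, H*W)
affinity = affinity.sum(dim=0)            # aggregate over frames -> (k*k, H*W)
print(affinity.shape)
```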

normalization rules

$Q_i$ is normalized with a $Norm(\cdot)$ layer (layer normalization along the temporal dimension):
$\hat{Q_i} = Norm(Q_i)$
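A small sketch of one possible implementation of this step, assuming layer normalization is taken over the temporal axis of a $(T, H, W, C/N)$ query tensor:

```python
import torch

T, H, W, Cn = 5, 16, 28, 8
Q_i = torch.randn(T, H, W, Cn)

# One possible reading of "layer normalization along the temporal dimension":
# move T to the last axis, apply LayerNorm over it, then move it back.
norm = torch.nn.LayerNorm(T)
Q_i_hat = norm(Q_i.permute(1, 2, 3, 0)).permute(3, 0, 1, 2)   # (T, H, W, C/N)
print(Q_i_hat.shape)
```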

correlation measure

The final relevance computation has the same overall form as the self-attention formula of the original transformer.

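For reference, a scaled dot-product form consistent with this description would be (here $C/N$ is the per-head channel dimension, and in the paper the keys are restricted to the constrained window described above; this is a plausible reading, not the exact equation from the paper):

$$M_i^A = \mathrm{Softmax}\!\left(\frac{\hat{Q_i}\, K_i^{\top}}{\sqrt{C/N}}\right)$$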

Spatial-Temporal Aggregation

Similar to the relevance computation, the V matrix is aggregated using the Q-K similarity results.

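Under the same reading, the spatio-temporal aggregation would be the usual attention-weighted sum of the value features (again restricted to the constrained window in the paper):

$$M_i^T = M_i^A\, V_i$$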

In fact, the overall computation is the same as the transformer's self-attention mechanism; only the way the pixel-wise correlation is computed has been changed.

soft-attention

This module fuses the affinity features $M^A_i$ and the spatio-temporal aggregation features $M^T_i$; it is intended to strengthen strongly correlated spatio-temporal patterns and suppress weakly correlated ones.

First, the group of affinity matrices $M_i^A$ is concatenated along the channel dimension to produce $M^A$.

The Max function takes the maximum of $M^A$ along the channel dimension; then the group of spatio-temporal aggregation features $M^T_i$ is concatenated along the channel dimension to produce $M^T$.

normalized self-attention

$W_T$ is a learnable weight, and ※ denotes the channel-wise Hadamard product (element-wise multiplication of corresponding matrix entries).


Hadamard product:

For two $m \times n$ matrices A and B, the elements at the same positions are multiplied:
$$\begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix} * \begin{pmatrix} b_{11} & b_{12} & b_{13}\\ b_{21} & b_{22} & b_{23}\\ b_{31} & b_{32} & b_{33} \end{pmatrix} = \begin{pmatrix} a_{11}b_{11} & a_{12}b_{12} & a_{13}b_{13}\\ a_{21}b_{21} & a_{22}b_{22} & a_{23}b_{23}\\ a_{31}b_{31} & a_{32}b_{32} & a_{33}b_{33} \end{pmatrix}$$

Output of the NS block

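A rough sketch of the fusion described above, assuming the soft-attention map is a channel-wise max over the concatenated affinities followed by a spatial softmax, and that $W_T$ re-weights the result channel-wise; this is an illustrative reading of the text, not the paper's exact formulation:

```python
import torch

N, T, H, W, Cn, K2 = 4, 5, 16, 28, 8, 9
# Per-head affinity maps and spatio-temporal aggregation features (toy tensors).
M_A = [torch.randn(T, K2, H, W) for _ in range(N)]   # per-head window affinities
M_T = [torch.randn(T, Cn, H, W) for _ in range(N)]   # per-head aggregated values

MA = torch.cat(M_A, dim=1)          # concatenate affinities along channels
MT = torch.cat(M_T, dim=1)          # concatenate aggregations -> (T, C, H, W)

# Soft-attention map: channel-wise max, then softmax over spatial positions.
att = MA.max(dim=1, keepdim=True).values                  # (T, 1, H, W)
att = torch.softmax(att.flatten(2), dim=-1).view_as(att)

W_T = torch.nn.Parameter(torch.ones(1, MT.shape[1], 1, 1))  # learnable channel weight
out = W_T * (att * MT)               # channel-wise Hadamard product (※)
print(out.shape)                     # (T, C, H, W)
```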

global-local learning strategy

Realizes long- and short-term spatio-temporal propagation over arbitrary temporal distances.


Global Spatial-Temporal Modeling


The first NS block models long-term relationships over arbitrary temporal distances and takes four-dimensional ($T \times H \times W \times C$) spatio-temporal features as input.


The anchor feature $A^h$ is used as the query matrix $Q^g$, and the high-level features produced by the local encoder are used as $K^g$ and $V^g$.

The purpose is to establish pixel-wise similarity between the anchor frame and the local high-level features; together with a residual connection this yields $Z^g$, where + denotes element-wise addition.
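A structural sketch of this step as described: the anchor feature is the query, the local high-level features are the key and value, and a residual connection (assumed here to add the local features back) produces $Z^g$. The `ns_block` function is a hypothetical placeholder, not the paper's implementation:

```python
import torch

def ns_block(query, key, value):
    """Hypothetical stand-in for the normalized self-attention block."""
    # ... constrained relevance measuring + aggregation + soft-attention ...
    return value  # placeholder output with the same shape as the value features

A_h    = torch.randn(1, 1024, 16, 28)   # anchor (first-frame) high-level feature
X_high = torch.randn(5, 1024, 16, 28)   # high-level features of the window frames

# Global spatio-temporal modelling: anchor as query, local highs as key/value,
# plus an element-wise residual connection.
Z_g = ns_block(query=A_h, key=X_high, value=X_high) + X_high
print(Z_g.shape)
```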

Global-to-Local Propagation

In the second NS block, the long-range dependency $Z^g$ is propagated to the frames within the sliding window.


decoder

The low-level features of the local encoder are combined with the output features of the second NS block, $Z^l$, restored to spatial form, and used as input to a two-stage U-Net-style decoder.
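A minimal sketch of a two-stage U-Net-style decoder that fuses the low-level skip features with $Z^l$; the channel sizes, layers, and upsampling choices are illustrative assumptions, not the paper's decoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageDecoder(nn.Module):
    """Illustrative two-stage decoder: fuse Z_l with low-level skip features,
    upsample twice, and predict a one-channel segmentation map."""
    def __init__(self, high_ch=1024, low_ch=256, mid_ch=64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(high_ch + low_ch, mid_ch, 3, padding=1),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.stage2 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(mid_ch, 1, 1)

    def forward(self, z_l, low):
        x = F.interpolate(z_l, size=low.shape[-2:], mode='bilinear', align_corners=False)
        x = self.stage1(torch.cat([x, low], dim=1))                       # stage 1: fuse + refine
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = self.stage2(x)                                                # stage 2: refine at higher resolution
        return self.head(x)                                               # per-pixel polyp logits

z_l = torch.randn(5, 1024, 16, 28)   # NS-block output restored to spatial form
low = torch.randn(5, 256, 64, 112)   # low-level features from the local encoder
print(TwoStageDecoder()(z_l, low).shape)   # (5, 1, 128, 224)
```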


The network is optimized with the binary cross-entropy loss.

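For reference, the standard per-pixel binary cross-entropy between a predicted probability $p$ and ground truth $y$, averaged over pixels, is:

$$\mathcal{L}_{BCE} = -\big[\, y \log p + (1-y)\log(1-p) \,\big]$$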

PCSA

CSA (constrained self-attention) focuses on local motion patterns instead of modeling the global context.

Considering that salient objects can have different sizes and move at different speeds, a set of CSAs is combined into a pyramid structure.

Constrained self-attention

Constrains the relevance measuring and context aggregation over consecutive frames to a neighborhood of the query position.

For example, an object in the first frame appears at a similar position in the adjacent frames. Based on this, for a feature element x(t, h, w) in the Q matrix, the correlation is measured only against the surrounding region in the K matrix, limited to frames 1..T, heights $[h-dr,\, h+dr]$, and widths $[w-dr,\, w+dr]$.


combination of pyramids

This is the design referenced by PNS-Net.

A single constrained self-attention with a fixed window cannot handle objects of various sizes moving at various speeds, so the multi-head mechanism gives each head a different window size and displacement range to adapt to different motion patterns.

Combining multiple heads with multiple scales

Multi-head: in parallel, the input features are divided into g groups along the channel dimension, and constrained self-attention is applied to each group


Multi-scale: different groups use different window sizes and different d and r (see the sketch below)
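A sketch of the multi-head, multi-scale grouping: channels are split into g groups and each group runs a constrained self-attention with its own window size and dilation. The `constrained_self_attention` function, group count, and per-group settings here are illustrative placeholders:

```python
import torch

def constrained_self_attention(x, window, dilation):
    """Placeholder for one CSA head restricted to a (window, dilation) neighbourhood."""
    return x  # a real implementation would attend within the constrained window

g = 4
windows   = [3, 3, 5, 5]        # illustrative per-group window sizes
dilations = [1, 2, 1, 2]        # illustrative per-group dilation rates
x = torch.randn(5, 64, 16, 28)  # (T, C, H, W) features

# Multi-head: split channels into g groups; multi-scale: each group gets its own
# window size and dilation, then the outputs are concatenated back along channels.
groups = x.chunk(g, dim=1)
out = torch.cat([constrained_self_attention(xg, w, d)
                 for xg, w, d in zip(groups, windows, dilations)], dim=1)
print(out.shape)   # same shape as x
```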


Origin blog.csdn.net/xqh_Jolene/article/details/124968813