[Semi-supervised medical image segmentation 2023 CVPR] BCP

Paper title: Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Chinese title: Bidirectional Copy-Paste Semi-Supervised Medical Image Segmentation

Paper link: https://arxiv.org/abs/2305.00673

Paper code: https://github.com/DeepMed-Lab-ECNU/BCP

Author affiliations: East China Normal University & Shanghai Jiao Tong University

Published: May 2023

DOI:

Quote:

Citation count:

Summary

In semi-supervised medical image segmentation, there is an empirical mismatch between the distributions of labeled and unlabeled data. If labeled and unlabeled data are processed separately or in an inconsistent manner, knowledge learned from labeled data may be largely discarded.

We propose a straightforward approach to alleviate this problem: bidirectional copy-pasting of labeled and unlabeled data in a simple Mean Teacher architecture. This method encourages unlabeled data to learn comprehensive common semantics from labeled data in both the inward and outward directions.

More importantly, the consistent learning procedure for labeled and unlabeled data can largely reduce the empirical distribution gap. In detail, we copy a random crop from a labeled image (foreground) onto an unlabeled image (background), and a random crop from an unlabeled image (foreground) onto a labeled image (background), respectively.

The two blended images are fed into a student network and supervised by a blended supervision signal of pseudo-labels and ground truth. We find that this simple mechanism of bidirectional copy-paste between labeled and unlabeled data is good enough, and experiments show solid gains over other state-of-the-art methods on several semi-supervised medical image segmentation datasets (e.g. over 21% Dice improvement on the ACDC dataset with 5% labeled data).

1. Introduction

Segmentation of internal structures from medical images such as computed tomography (CT) or magnetic resonance imaging (MRI) is crucial for many clinical applications [34]. Various techniques for medical image segmentation based on supervised learning have been proposed [4, 13, 45], which usually require a large amount of labeled data. However, due to the tedious and expensive process of manual delineation when annotating medical images, semi-supervised segmentation has attracted more attention in recent years and has become ubiquitous in the field of medical image analysis.

Generally, in semi-supervised medical image segmentation, labeled and unlabeled data are drawn from the same distribution (Fig. 1(a)). But in practical scenarios, it is difficult to estimate the precise distribution from labeled data because so little of it is available. Therefore, there is always an empirical distribution mismatch between a large amount of unlabeled data and a very small amount of labeled data [30] (Fig. 1(b) and (c)). Semi-supervised segmentation methods always try to train on labeled and unlabeled data symmetrically, in a consistent manner. For example, self-training [1, 48] generates pseudo-labels to supervise unlabeled data in a pseudo-supervised manner. Mean Teacher-based methods [40] employ a consistency loss to "supervise" unlabeled data with strong augmentation, similar to supervising labeled data with ground truth. DTC [16] proposes a dual-task consistency framework for both labeled and unlabeled data. ContrastMask [31] applies dense contrastive learning on labeled and unlabeled data. But most existing semi-supervised methods use labeled and unlabeled data under different learning paradigms. As a result, a large amount of the knowledge learned from labeled data is often discarded, and an empirical distribution mismatch between labeled and unlabeled data arises (Fig. 1(e)).

Figure 1. Illustration of the mismatch problem in the semi-supervised learning setting. Suppose the training set is drawn from the latent distribution in (a). But the empirical distributions of a small amount of labeled data and a large amount of unlabeled data are (b) and (c), respectively. It is difficult to construct an accurate distribution of the entire dataset from a small amount of labeled data. (d) With our BCP, the empirical distributions of labeled and unlabeled features are aligned. (e) Other methods, such as SSNet [35] or copy-paste across unlabeled data only, cannot solve the empirical distribution mismatch. All distributions are kernel density estimates of voxels belonging to the myocardium class in ACDC [2].

CutMix [42] is a simple and powerful data manipulation method, also known as copy-paste (CP), which has the potential to encourage unlabeled data to learn common semantics from labeled data, since pixels in the same map share closer semantics [29]. In semi-supervised learning, enforcing consistency between weak-strong augmentation pairs of unlabeled data is widely used [11, 14, 32, 47], and CP is usually used as the strong augmentation. But existing CP methods only consider CP across unlabeled data [8, 10, 14], or simply copy crops from labeled data as foreground and paste them onto other data [6, 9]. They neglect to design a consistent learning strategy for both labeled and unlabeled data, which limits their usefulness in reducing the distribution gap. Meanwhile, CP tries to enhance the generalization ability of the network by increasing the diversity of unlabeled data, but high performance is hard to achieve because the CutMixed images are supervised only by low-precision pseudo-labels. It is therefore intuitive to use more precise supervision to help the network segment the degraded regions cut by CP.


To alleviate the empirical mismatch between labeled and unlabeled data, a successful design should encourage unlabeled data to learn comprehensive common semantics from labeled data and, at the same time, further align the distributions through a consistent learning strategy for both labeled and unlabeled data.

We achieve this by proposing a surprisingly simple yet very effective bidirectional copy-paste (BCP) method instantiated in the Mean Teacher framework.

Specifically, to train the student network, we augment our inputs by copy-pasting random crops from labeled images (foreground) onto unlabeled images (background) and, conversely, copy-pasting random crops from unlabeled images (foreground) onto labeled images (background).

The student network is supervised by a supervision signal generated through the same bidirectional copy-paste between the teacher network's pseudo-labels for unlabeled images and the label maps of labeled images.

These two mixed images help the network learn the common semantics between labeled and unlabeled data bidirectionally and symmetrically. We compute Dice scores on both the labeled and unlabeled training sets of the LA dataset [39] for state-of-the-art models and for the model trained with our method, as shown in Figure 2. Previous models handle labeled and unlabeled data separately, and show a large performance gap between the two. For example, MC-Net achieves 95.59% Dice on labeled data but only 87.63% on unlabeled data. This means that previous models absorb knowledge from the ground truth well, but lose much of it when transferring to unlabeled data. Our method largely reduces the performance gap between labeled and unlabeled data (Fig. 1(d)). Interestingly, BCP also has a lower Dice on labeled data than the other methods, implying that BCP can alleviate overfitting to some extent.

Figure 2. Dice scores on unlabeled and labeled training data for different models on the LA dataset [39]. A smaller performance gap is observed for our method.

We validate BCP on three popular datasets: LA [39], Pancreas-NIH [21] and ACDC [2]. Extensive experiments show that our simple method improves Dice by more than 21% on the ACDC dataset with 5% labeled data. Ablation studies further show the effectiveness of each proposed module. Note that compared to the baselines (such as V-Net or U-Net), our method introduces no new trainable parameters and keeps the same computational cost.

2. Related work

2.1 Medical Image Segmentation

Segmenting internal structures from medical images is essential for many clinical applications [34]. Existing medical image segmentation methods can be divided into two categories. The first designs various 2D/3D segmentation network architectures [3, 4, 13, 18, 20, 49]. The second leverages medical prior knowledge for network training [23, 28, 33, 38].

2.2 Semi-supervised medical image segmentation

There have been many efforts in semi-supervised medical image segmentation. Entropy minimization (EM) and consistency regularization (CR) are two widely used loss functions. At the same time, many studies have extended the Mean Teacher framework in different ways. SASSNet [12] utilizes unlabeled data to enforce geometric shape constraints on the segmentation output. DTC [16] proposes a dual-task consistency framework by explicitly building task-level regularization. SimCVD [40] explicitly models geometric structure and semantic information and constrains them between the teacher and student networks. These methods use geometric constraints to supervise the output of the network. UA-MT [41] utilizes uncertainty information to guide the student network to gradually learn from meaningful and reliable targets of the teacher network. The work in [46] combines image-wise and patch-wise representations to explore more complex similarity cues, making outputs consistent given different input sizes. CoraNet [22] proposes a model that produces certain and uncertain regions, and the student network treats the regions indicated by the teacher network with different weights. UMCT [37] uses different views of the network to predict different views of the same image. It uses the predictions and the corresponding uncertainties to generate pseudo-labels, which supervise the predictions on unlabeled images. These methods improve the effectiveness of semi-supervised medical image segmentation. However, they ignore how to learn common semantics from labeled to unlabeled data. Treating labeled and unlabeled data separately often hinders the transfer of knowledge from labeled to unlabeled data.

2.3 Copy-paste

Copy-paste is a simple yet powerful data manipulation method for many tasks such as instance segmentation [7, 9], semantic segmentation [6, 25] and object detection [5]. In general, copy-paste refers to copying a crop of one image and pasting it onto another image. Mixup [43] and CutMix [42] are the classic methods for blending whole images and blending image crops, respectively. Many recent works extend them to address specific goals. GuidedMix-Net [25] uses Mixup to transfer knowledge from labeled data to unlabeled data, thereby generating high-quality pseudo-labels. InstaBoost [7] and Contextual Copy-Paste [5] carefully place a cropped foreground onto another image based on the surrounding visual context. CP2 [27] proposes a pre-training method that copy-pastes a random crop from one image onto another background image, which proved more suitable for downstream dense prediction tasks. The work in [9] conducts a systematic study of copy-paste in instance segmentation. UCC [6] copy-pastes pixels belonging to classes with lower confidence scores as foreground during training to alleviate distribution mismatch and class imbalance. Previous methods only consider copy-paste across unlabeled data, or simply copy crops from labeled data as foreground onto other data. They neglect to design a consistent learning strategy for labeled and unlabeled data. Therefore, a large distribution gap remains unavoidable.

3. Method

Mathematically, we define the 3D volume of a medical image as $\mathbf{X}\in\mathbb{R}^{W\times H\times L}$.

The goal of semi-supervised medical image segmentation is to predict a per-voxel label map $\widehat{\mathbf{Y}}\in\{0,1,...,K-1\}^{W\times H\times L}$, indicating where the background and the targets are in $\mathbf{X}$. $K$ is the number of classes.

Our training set $\mathcal{D}$ consists of $N$ labeled data and $M$ unlabeled data ($N\ll M$), expressed as two subsets: $\mathcal{D}=\mathcal{D}^{l}\cup\mathcal{D}^{u}$, where $\mathcal{D}^{l}=\{(\mathbf{X}_{i}^{l},\mathbf{Y}_{i}^{l})\}_{i=1}^{N}$ and $\mathcal{D}^{u}=\{\mathbf{X}_{i}^{u}\}_{i=N+1}^{M+N}$.

The overall pipeline of the proposed bidirectional copy-paste method, built on the Mean Teacher architecture, is shown in Fig. 3. We randomly select two unlabeled images $(\mathbf{X}_{p}^{u},\mathbf{X}_{q}^{u})$ and two labeled images $(\mathbf{X}_{i}^{l},\mathbf{X}_{j}^{l})$ from the training set.

Then we copy-paste a random crop from $\mathbf{X}_{i}^{l}$ (foreground) onto $\mathbf{X}_{q}^{u}$ (background) to generate the blended image $\mathbf{X}^{out}$, and from $\mathbf{X}_{p}^{u}$ (foreground) onto $\mathbf{X}_{j}^{l}$ (background) to generate another blended image $\mathbf{X}^{in}$.

Unlabeled images can thus learn comprehensive common semantics from labeled images in both the inward ($\mathbf{X}^{in}$) and outward ($\mathbf{X}^{out}$) directions. The images $\mathbf{X}^{in}$ and $\mathbf{X}^{out}$ are then fed into the student network to predict the segmentation masks $\widehat{\mathbf{Y}}^{in}$ and $\widehat{\mathbf{Y}}^{out}$. These masks are supervised by bidirectionally copy-pasting the teacher network's predictions for the unlabeled images and the label maps of the labeled images.

Figure 3. Overview of our bidirectional copy-paste framework in the Mean Teacher architecture, drawn with 2D inputs for better visualization. The inputs to the student network are generated by mixing two labeled images and two unlabeled images in the proposed bidirectional copy-paste manner. Then, to provide a supervision signal to the student network, we combine the ground truth and the pseudo-labels produced by the teacher network into one supervision signal through the same bidirectional copy-paste, so that the strong supervision of the ground truth helps the weak supervision of the pseudo-labels.

3.1 Bidirectional copy-paste

3.1.1 Mean Teacher and training strategy

In our BCP framework, there is a teacher network $\mathcal{F}_{t}\left(\mathbf{X}_{p}^{u},\mathbf{X}_{q}^{u};\mathbf{\Theta}_{t}\right)$ and a student network $\mathcal{F}_{s}\left(\mathbf{X}^{in},\mathbf{X}^{out};\mathbf{\Theta}_{s}\right)$, where $\mathbf{\Theta}_{t}$ and $\mathbf{\Theta}_{s}$ are their parameters. The student network is optimized by stochastic gradient descent, and the teacher network is updated as an exponential moving average (EMA) of the student network.

Our training strategy is divided into three steps.

First, we pre-train a model using only labeled data and use this pre-trained model as the teacher network to generate pseudo-labels for unlabeled images. Second, in each iteration, we optimize the student network parameters $\mathbf{\Theta}_{s}$ by stochastic gradient descent. Finally, we update the teacher network parameters $\mathbf{\Theta}_{t}$ with the EMA of the student parameters $\mathbf{\Theta}_{s}$.
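The EMA step can be illustrated with a minimal NumPy sketch. It assumes the usual Mean Teacher form, teacher = decay * teacher + (1 - decay) * student. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the `decay` argument are illustrative assumptions, and this section does not state which decay value BCP uses.

```python
import numpy as np

def ema_update(teacher, student, decay):
    """Update teacher parameters in place as an exponential moving average of the student.

    teacher, student: dicts mapping parameter names to NumPy arrays of equal shape.
    decay: smoothing factor in [0, 1); hypothetical here, not taken from the text.
    """
    for name, student_param in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_param
    return teacher

# Toy usage with a single two-element parameter.
teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 the teacher changes slowly, so its pseudo-labels are more stable than the student's own predictions from one iteration to the next.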

3.1.2 Pre-training via copy-paste

Inspired by prior work [9], we apply copy-paste augmentation to the labeled data to train a supervised model, which then generates pseudo-labels for unlabeled data during self-training. This strategy effectively improves segmentation performance; more details are given in the ablation study.

3.1.3 Bidirectional copy-paste of images

To copy-paste between a pair of images, we first generate a zero-centered mask $\mathcal{M}\in\{0,1\}^{W\times H\times L}$, indicating whether each voxel comes from the foreground (0) or the background (1) image. The size of the zero-value region is $\beta H\times\beta W\times\beta L$, where $\beta\in(0,1)$. Then we bidirectionally copy-paste the labeled and unlabeled images as follows:

$$\mathbf{X}^{in}=\mathbf{X}_{j}^{l}\odot\mathcal{M}+\mathbf{X}_{p}^{u}\odot\left(\mathbf{1}-\mathcal{M}\right), \tag{1}$$

$$\mathbf{X}^{out}=\mathbf{X}_{q}^{u}\odot\mathcal{M}+\mathbf{X}_{i}^{l}\odot\left(\mathbf{1}-\mathcal{M}\right), \tag{2}$$

where $\mathbf{X}_{i}^{l},\mathbf{X}_{j}^{l}\in\mathcal{D}^{l}$, $i\neq j$, $\mathbf{X}_{p}^{u},\mathbf{X}_{q}^{u}\in\mathcal{D}^{u}$, $p\neq q$, $\mathbf{1}\in\{1\}^{W\times H\times L}$, and $\odot$ denotes element-wise multiplication. Two labeled and two unlabeled images are used in order to maintain the diversity of the input.
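The mask construction and the two blending equations above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `center_zero_mask` and `bidirectional_copy_paste` are hypothetical, the zero region is placed exactly at the volume center, and its side lengths are obtained by truncating `beta * dim` to an integer.

```python
import numpy as np

def center_zero_mask(shape, beta):
    """Return a mask that is 0 inside a centered box and 1 elsewhere.

    The box covers a fraction beta of every axis (assumption: sizes are truncated to int).
    """
    mask = np.ones(shape, dtype=np.float32)
    box = []
    for dim in shape:
        size = int(dim * beta)
        start = (dim - size) // 2
        box.append(slice(start, start + size))
    mask[tuple(box)] = 0.0
    return mask

def bidirectional_copy_paste(x_l_i, x_l_j, x_u_p, x_u_q, mask):
    """Blend two labeled and two unlabeled volumes in both directions.

    x_in:  labeled background (x_l_j) with an unlabeled foreground crop (x_u_p).
    x_out: unlabeled background (x_u_q) with a labeled foreground crop (x_l_i).
    """
    x_in = x_l_j * mask + x_u_p * (1.0 - mask)
    x_out = x_u_q * mask + x_l_i * (1.0 - mask)
    return x_in, x_out

# Toy usage: constant volumes make it easy to see where each voxel came from.
shape = (8, 8, 4)
mask = center_zero_mask(shape, beta=0.5)
x_l_i, x_l_j = np.full(shape, 1.0), np.full(shape, 2.0)
x_u_p, x_u_q = np.full(shape, 3.0), np.full(shape, 4.0)
x_in, x_out = bidirectional_copy_paste(x_l_i, x_l_j, x_u_p, x_u_q, mask)
```

Because the same mask is shared by both blends, every voxel position is supervised once as labeled content and once as unlabeled content across the pair of mixed images.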

3.1.4 Bidirectional copy-paste of supervisory signals

To train the student network, the supervisory signals are also generated via the BCP operation. The unlabeled images $\mathbf{X}_{p}^{u}$ and $\mathbf{X}_{q}^{u}$ are fed into the teacher network, and their probability maps are computed as:

$$\mathbf{P}_{p}^{u}=\mathcal{F}_{t}(\mathbf{X}_{p}^{u};\mathbf{\Theta}_{t}),\quad\mathbf{P}_{q}^{u}=\mathcal{F}_{t}(\mathbf{X}_{q}^{u};\mathbf{\Theta}_{t}).$$

The initial pseudo-label $\widehat{\mathbf{Y}}^{u}$ (dropping $p$ and $q$ for simplicity) is determined by applying a common threshold of 0.5 to $\mathbf{P}^{u}$ for binary segmentation tasks, or by applying the argmax operation to $\mathbf{P}^{u}$ for multi-class segmentation tasks. The final pseudo-label $\widetilde{\mathbf{Y}}^{u}$ is obtained by selecting the largest connected component of $\widehat{\mathbf{Y}}^{u}$, which effectively removes outlier voxels. Then, we bidirectionally copy-paste the pseudo-labels of the unlabeled images and the ground-truth labels of the labeled images in the same way as Equation 1 and Equation 2 to obtain the supervisory signals:

$$\begin{gathered} \mathbf{Y}^{in}=\mathbf{Y}_{j}^{l}\odot\mathcal{M}+\widetilde{\mathbf{Y}}_{p}^{u}\odot\left(\mathbf{1}-\mathcal{M}\right), \\ \mathbf{Y}^{out}=\widetilde{\mathbf{Y}}_{q}^{u}\odot\mathcal{M}+\mathbf{Y}_{i}^{l}\odot\left(\mathbf{1}-\mathcal{M}\right). \end{gathered}$$

$\mathbf{Y}^{in}$ and $\mathbf{Y}^{out}$ are used as supervision for the student network's predictions on $\mathbf{X}^{in}$ and $\mathbf{X}^{out}$.
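The pseudo-label post-processing for the binary case can be sketched as below. It thresholds a probability map at 0.5, keeps the largest connected component, and blends labels with the same mask used for the images. The function names are hypothetical, and the use of 6-connectivity is an assumption, since the text does not say which connectivity defines a component. The flood fill is written out by hand so that the sketch depends only on NumPy.

```python
from collections import deque
import numpy as np

def largest_connected_component(binary):
    """Keep only the largest 6-connected foreground component of a 3D binary array."""
    binary = binary.astype(bool)
    visited = np.zeros(binary.shape, dtype=bool)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    best = []
    for seed in zip(*np.nonzero(binary)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, component = deque([seed]), []
        while queue:  # breadth-first flood fill from the seed voxel
            voxel = queue.popleft()
            component.append(voxel)
            for off in offsets:
                nb = tuple(v + o for v, o in zip(voxel, off))
                inside = all(0 <= c < s for c, s in zip(nb, binary.shape))
                if inside and binary[nb] and not visited[nb]:
                    visited[nb] = True
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    out = np.zeros(binary.shape, dtype=np.uint8)
    if best:
        out[tuple(np.array(best).T)] = 1
    return out

def binary_pseudo_label(prob, threshold=0.5):
    """Threshold a foreground probability map, then remove all but the largest component."""
    return largest_connected_component(prob > threshold)

def blend_labels(y_l_i, y_l_j, y_u_p, y_u_q, mask):
    """Mix ground-truth labels and pseudo-labels with the same mask used for the images."""
    y_in = y_l_j * mask + y_u_p * (1 - mask)
    y_out = y_u_q * mask + y_l_i * (1 - mask)
    return y_in, y_out
```

In practice a library routine such as `scipy.ndimage.label` can replace the hand-written flood fill. For multi-class tasks the thresholding step would be replaced by an argmax over the class axis, as described above.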

3.2 Loss function

Each input image to the student network consists of labeled and unlabeled image components. Intuitively, ground-truth masks for labeled images are usually more accurate than pseudo-labels for unlabeled images. We use $\alpha$ to control the contribution of unlabeled image pixels to the loss function. The loss functions for $\mathbf{X}^{in}$ and $\mathbf{X}^{out}$ are computed respectively by

$$\begin{aligned} &\mathcal{L}^{in}=\mathcal{L}_{seg}\left(\mathbf{Q}^{in},\mathbf{Y}^{in}\right)\odot\mathcal{M}+\alpha\mathcal{L}_{seg}\left(\mathbf{Q}^{in},\mathbf{Y}^{in}\right)\odot\left(\mathbf{1}-\mathcal{M}\right), \\ &\mathcal{L}^{out}=\mathcal{L}_{seg}\left(\mathbf{Q}^{out},\mathbf{Y}^{out}\right)\odot\left(\mathbf{1}-\mathcal{M}\right)+\alpha\mathcal{L}_{seg}\left(\mathbf{Q}^{out},\mathbf{Y}^{out}\right)\odot\mathcal{M}, \end{aligned}$$

where $\mathcal{L}_{seg}$ is a linear combination of Dice loss and cross-entropy loss. $\mathbf{Q}^{in}$ and $\mathbf{Q}^{out}$ are computed by

$$\mathbf{Q}^{in}=\mathcal{F}_{s}(\mathbf{X}^{in};\mathbf{\Theta}_{s}),\quad\mathbf{Q}^{out}=\mathcal{F}_{s}(\mathbf{X}^{out};\mathbf{\Theta}_{s}).$$

In each iteration, we update the parameters $\mathbf{\Theta}_{s}$ of the student network by stochastic gradient descent on the loss function

$$\mathcal{L}_{all}=\mathcal{L}^{in}+\mathcal{L}^{out}.$$
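A rough NumPy sketch of this loss for a binary task is given below. Several choices are assumptions made only for illustration: `masked_seg_loss` restricts both Dice and cross-entropy to the voxels selected by a region mask, the two terms are weighted equally, and `alpha` is left as a free argument because this section does not state its value. The function names are hypothetical, and the sketch computes values only; it has no gradients.

```python
import numpy as np

def masked_seg_loss(prob, target, region, eps=1e-8):
    """Equal-weight Dice + cross-entropy over the voxels where region == 1 (binary task)."""
    p = np.clip(prob, eps, 1.0 - eps)
    ce_map = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    ce = (ce_map * region).sum() / (region.sum() + eps)
    intersection = (prob * target * region).sum()
    denominator = (prob * region).sum() + (target * region).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (denominator + eps)
    return 0.5 * (dice + ce)

def bcp_loss(q_in, y_in, q_out, y_out, mask, alpha):
    """Total loss: labeled regions get weight 1, pseudo-labeled regions get weight alpha.

    In the inward image the labeled part is the background (mask == 1);
    in the outward image the labeled part is the foreground (mask == 0).
    """
    loss_in = (masked_seg_loss(q_in, y_in, mask)
               + alpha * masked_seg_loss(q_in, y_in, 1.0 - mask))
    loss_out = (masked_seg_loss(q_out, y_out, 1.0 - mask)
                + alpha * masked_seg_loss(q_out, y_out, mask))
    return loss_in + loss_out
```

Choosing `alpha` below 1 down-weights the regions supervised by pseudo-labels, which reflects the observation above that ground-truth masks are usually more accurate than pseudo-labels.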

4. Experiment
