MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model

Note up front: this write-up is based on the first version (v1) of the paper, so version differences can simply be ignored.

Please correct me if there are mistakes, thank you.

This first version of the paper is relatively simple overall. In total, two improvements are proposed:

1. Due to the special nature of medical images, diseased tissue is hard to distinguish from the background, especially in low-resolution images. In addition, the author argues that the original image contains plenty of target information but is hard to segment directly, while the segmentation map at any step t of the diffusion model has relatively enhanced target information but is not accurate. Based on these two points, the author proposes to combine the two so that they complement each other, via a dynamic conditional encoder (dynamic conditional encoding) that fuses the two feature maps at every step. Concretely, assume the diffusion model has generated a feature map at step t and the neural network must be trained to recover the image. The feature map from the diffusion model is passed to the original encoder, while the label (the paper says it is the raw image, but I think it should be the label rather than the original picture, since the author's framework figure also uses the label) is passed to the dynamic conditional encoder proposed by the author, and the two are fused at the end of each layer. See below for the specific fusion method.

But! During the fusion process, the author found that although the feature map at any step t of the diffusion model contains some target information, it also brings in high-frequency noise. Hence the second improvement below.

2. The author proposes the FF-Parser to suppress the high-frequency noise introduced by the diffusion model's feature map. The main idea is to do this with the Fourier transform; see below for details.

MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model

Paper: 2211.00611v1.pdf (arxiv.org)

Abstract:

The diffusion probabilistic model (DPM) has been one of the hot spots of computer vision research in recent years. Its image-generation applications, such as Imagen, latent diffusion models, and Stable Diffusion, have shown impressive generative capability and attracted extensive discussion. Many recent studies have also found it useful for other vision tasks, such as image deblurring, super-resolution, and anomaly detection. Inspired by the success of DPM, we propose the first DPM-based model for the general medical image segmentation task, which we name MedSegDiff. To enhance the step-wise regional attention of DPM in medical image segmentation, a dynamic conditional encoding method is proposed to establish state-adaptive conditions for each sampling step. We further propose a Feature Frequency Parser (FF-Parser) to eliminate the negative effect of high-frequency noise components in this process. We validate MedSegDiff on three medical segmentation tasks with different image modalities, namely optic cup segmentation on fundus images, brain tumor segmentation on MRI images, and thyroid nodule segmentation on ultrasound images. The experimental results show that MedSegDiff outperforms existing SOTA methods, demonstrating its generalization and effectiveness.

Introduction:

Medical image segmentation is the process of dividing medical images into meaningful regions. Segmentation is a fundamental step in many medical image analysis applications, such as diagnosis, surgical planning, and image-guided surgery. It is important because it allows doctors and other medical professionals to better understand what they are looking at, and it also makes it easier to compare images and track changes over time. In recent years, there has been increasing interest in automatic segmentation methods for medical images. These methods have the potential to reduce the time and effort required for manual segmentation and to improve the consistency and accuracy of the results. With the development of deep learning techniques, more and more studies have successfully applied neural network (NN) based models to medical image segmentation tasks, from the popular convolutional neural network (CNN) [1] to the recent vision transformer (ViT) [2], [3].

Recently, the diffusion probabilistic model (DPM) [4] has gained popularity as a powerful class of generative models capable of generating images with high diversity and synthesis quality. Recent large-scale diffusion models, such as DALL-E 2 [5], Imagen [6], and Stable Diffusion [7], have shown incredible generative ability. Diffusion models were originally applied to domains without an absolute ground truth. However, recent studies have shown that they are also effective for problems where the ground truth is unique, such as super-resolution [8] and deblurring [9].

Inspired by the recent success of DPM, we design a unique DPM-based medical image segmentation model. To the best of our knowledge, we are the first to propose a DPM-based image segmentation model in the context of general medical image segmentation. We notice that in medical image segmentation tasks, lesions/organs are usually ambiguous and difficult to distinguish from the background. In this case, an adaptive calibration process is the key to obtaining fine results. Following this line of thinking, we propose dynamic conditional encoding on top of the plain DPM and design a model named MedSegDiff. Note that during the iterative sampling process, MedSegDiff conditions each step on the image prior, from which it learns the segmentation map. For adaptive regional attention, we integrate the current-step segmentation mask into the image prior encoding at each step. The specific implementation is to fuse the current-step segmentation mask with the prior image at multiple scales of the feature layers. In this way, the corrupted current-step mask helps to dynamically enhance the conditional features, thus improving the reconstruction accuracy. In order to remove the high-frequency noise in the corrupted current-step mask during this process, we further propose the Feature Frequency Parser (FF-Parser), which filters the mask features in Fourier space. The FF-Parser is adopted on each skip-connection path for multi-scale integration. We validate MedSegDiff on three different medical segmentation tasks: optic cup segmentation, brain tumor segmentation, and thyroid nodule segmentation. The images of these tasks have different modalities, namely fundus images, brain MRI images, and ultrasound images. MedSegDiff outperforms the previous SOTA on all three tasks in different modalities, which demonstrates the generalization and effectiveness of the proposed method. In short, the contributions of this paper are:

  1. The first DPM-based model for general medical image segmentation is proposed.
  2. A dynamic conditional encoding strategy is proposed for step-wise attention.
  3. The FF-Parser is proposed for removing high-frequency noise.
  4. State-of-the-art performance on three medical segmentation tasks with different image modalities.

Theory

We design our model based on the diffusion model (DDPM) described in [4]. The diffusion model is a generative model composed of two stages: a forward diffusion stage and a reverse diffusion stage. In the forward process, Gaussian noise is gradually added to the segmentation label x_0 over a series of T noising steps. In the reverse process, a neural network is trained to recover the original data by reversing the noising process.
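For reference, these two stages are the standard DDPM formulation from [4] (nothing MedSegDiff-specific yet), which can be written as:

```latex
% Forward process: gradually corrupt the segmentation label x_0 with Gaussian noise
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\big),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)

% Reverse process: a neural network parameterizes the step-by-step denoising transition
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
```

Here the \beta_t form the fixed noise schedule, and \theta denotes the parameters of the denoising network.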

A. Dynamic Conditional Encoding

In most conditional DPMs, the conditional prior is the only given information. However, medical image segmentation is notorious for ambiguous objects. Lesions or tissues are often difficult to distinguish from their background, especially in low-resolution images such as MRI. Given only a static image I as the prior, each step is difficult to learn. To solve this problem, the author proposes dynamic conditional encoding for each step. We notice that, on the one hand, the original image contains accurate segmentation target information but is hard to distinguish from the background; on the other hand, the current-step segmentation map contains enhanced target regions, but not accurate ones. This motivates us to integrate the current-step segmentation information x_t into the conditional raw-image encoding so that the two complement each other. The integration is done at the feature level: in the raw-image encoder, the intermediate features are augmented with the current-step (x_t) encoded features, i.e. the conditional feature map at each scale is fused with the x_t feature map of the same scale. This operation is applied to the middle two stages, where each stage is a convolutional stage following ResNet34. This strategy helps MedSegDiff dynamically localize and calibrate the segmentation. Although effective, it introduces another problem: the x_t embedding brings in additional high-frequency noise. To address this, the FF-Parser is proposed to constrain the high-frequency components in the features.
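To make the fusion idea concrete, here is a minimal PyTorch sketch of an attention-style, element-wise fusion between the conditional image features and the current-step mask features. This is my own illustration of the mechanism described above, not the authors' code, and the exact fusion operator in the paper may differ (module and variable names are mine):

```python
import torch
import torch.nn as nn

class DynamicConditionFusion(nn.Module):
    """Illustrative fusion of image (prior) features with current-step mask features."""
    def __init__(self, channels: int):
        super().__init__()
        # GroupNorm with a single group normalizes over (C, H, W), LayerNorm-style.
        self.norm_img = nn.GroupNorm(1, channels)
        self.norm_mask = nn.GroupNorm(1, channels)

    def forward(self, m_img: torch.Tensor, m_mask: torch.Tensor) -> torch.Tensor:
        # Element-wise affinity between the normalized image and mask features ...
        affinity = torch.sigmoid(self.norm_img(m_img) * self.norm_mask(m_mask))
        # ... used to re-weight (dynamically enhance) the conditional image features.
        return m_img * affinity

# Usage: fuse same-scale feature maps coming from the two encoders.
fuse = DynamicConditionFusion(channels=64)
m_img = torch.randn(2, 64, 32, 32)    # features from the raw-image (prior) encoder
m_mask = torch.randn(2, 64, 32, 32)   # features from the x_t encoder
enhanced = fuse(m_img, m_mask)        # shape: (2, 64, 32, 32)
```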

B. FF-Parser

The FF-Parser mainly suppresses the high-frequency noise brought in by x_t via the Fourier transform: a learnable matrix A with the same size as the feature map m (in the frequency domain) is constructed, the features are transformed to Fourier space, modulated by A, and transformed back.
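A small PyTorch sketch of this idea (learnable filtering in the Fourier domain, in the spirit of the description above; the class and parameter names are mine, and the initialization is only one reasonable choice):

```python
import torch
import torch.nn as nn

class FFParser(nn.Module):
    """Sketch of the FF-Parser idea: filter a feature map in Fourier space
    with a learnable map A of the same size as its spectrum."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # rfft2 keeps width // 2 + 1 frequency columns; A is complex-valued,
        # stored as two real components (real and imaginary parts).
        A = torch.zeros(channels, height, width // 2 + 1, 2)
        A[..., 0] = 1.0  # start as an identity filter
        self.A = nn.Parameter(A)

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        _, _, h, w = m.shape
        M = torch.fft.rfft2(m, dim=(-2, -1), norm="ortho")        # to Fourier space
        M = M * torch.view_as_complex(self.A)                     # learnable frequency filter
        return torch.fft.irfft2(M, s=(h, w), dim=(-2, -1), norm="ortho")  # back to spatial space

# Usage: applied to the x_t skip-connection feature maps at each scale.
ffp = FFParser(channels=64, height=32, width=32)
m = torch.randn(2, 64, 32, 32)
m_filtered = ffp(m)   # same shape; A can learn to suppress high-frequency components
```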

C. Training and Architecture

The network is trained in the same way as DDPM. The only difference is the loss function: the noise-prediction network is additionally conditioned on the raw image I.
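Spelled out, and assuming the standard DDPM noise-prediction objective with the raw image I added as an extra condition (my reconstruction of the loss from the description above, not a formula quoted from the paper):

```latex
\mathcal{L} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}
\Big[ \big\lVert \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ I,\ t \big) \big\rVert^2 \Big]
```

where x_0 is the segmentation label and I is the raw image fed to the conditional encoder.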

In each training iteration, a label and its original image are randomly selected as a pair. The iteration number (timestep) is drawn from a uniform distribution. The main architecture of MedSegDiff is a modified ResUNet [11], implemented with a ResNet encoder followed by a UNet decoder. Refer to [12] for the specific network settings. I and x_t are processed by two separate encoders.
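A hedged sketch of what one such training iteration looks like in PyTorch, based on the description above (the model signature and helper names are placeholders, not the authors' API):

```python
import torch
import torch.nn.functional as F

def training_step(model, image, label, T, alphas_cumprod):
    """One MedSegDiff-style training iteration (illustrative sketch only).
    image: raw image I, label: ground-truth segmentation mask x_0."""
    b = label.shape[0]
    # Draw the iteration number (timestep) from a uniform distribution.
    t = torch.randint(0, T, (b,), device=label.device)
    # Corrupt the label with Gaussian noise according to the forward process.
    noise = torch.randn_like(label)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * label + (1.0 - a_bar).sqrt() * noise
    # The denoiser sees x_t and the raw image I through two separate encoders.
    pred_noise = model(x_t, image, t)
    # Standard DDPM noise-prediction loss.
    return F.mse_loss(pred_noise, noise)

# Example usage with placeholder components (for illustration only):
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_model = lambda x_t, img, t: torch.zeros_like(x_t)   # stands in for the real denoiser
loss = training_step(dummy_model, torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64),
                     T, alphas_cumprod)
```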

The encoder consists of three convolutional stages. Each stage contains several residual blocks, and the number of residual blocks per stage follows ResNet34. Each residual block consists of two convolutional blocks, each composed of a group-norm layer, a SiLU [13] activation, and a convolutional layer. The residual block receives the time embedding through a linear layer, a SiLU activation, and another linear layer.
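Based on this description, a residual block might look roughly like the following sketch (assuming the usual DDPM-style way of injecting the time embedding; the group count and names are my choices):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two conv blocks (GroupNorm + SiLU + Conv), with the time embedding
    injected through Linear + SiLU + Linear, plus a residual connection."""
    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.block1 = nn.Sequential(nn.GroupNorm(8, channels), nn.SiLU(),
                                    nn.Conv2d(channels, channels, 3, padding=1))
        self.block2 = nn.Sequential(nn.GroupNorm(8, channels), nn.SiLU(),
                                    nn.Conv2d(channels, channels, 3, padding=1))
        # Time embedding is received via a linear layer, SiLU, and another linear layer.
        self.time_mlp = nn.Sequential(nn.Linear(time_dim, time_dim), nn.SiLU(),
                                      nn.Linear(time_dim, channels))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.block1(x)
        h = h + self.time_mlp(t_emb)[:, :, None, None]   # broadcast over spatial dims
        h = self.block2(h)
        return x + h                                     # residual connection

# Usage:
block = ResBlock(channels=64, time_dim=128)
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 128))   # shape: (2, 64, 32, 32)
```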

Origin blog.csdn.net/qq_39333636/article/details/129270306