(Final Chapter) What is semantic segmentation? Principle + handwritten code implementation?

Unet semantic segmentation

Table of contents

Unet semantic segmentation

1. How to understand "semantics" and "segmentation".

2. Principle of semantic segmentation(Key points)

3. Semantic segmentation meaning

4. Semantic segmentation application scenarios

5. Advantages of Unet (medical field)

6. Prior knowledge reserve

7. Semantic segmentation process

8. Project structure and introduction

9. Installation environment (python=3.8, pytorch)

10. Implementation process (emphasis)

11. Loss function

12. Evaluation indicators

13. UNet paper

14. Source code address (permanent, free)

15. How to modify it into your own tasks

16. Personal growth experience

17. Project implementation

  1. How to understand "semantics" and "segmentation".

        Semantic segmentation belongs to the computer vision field of deep learning in artificial intelligence. Related tasks include object detection, image classification, instance segmentation, pose estimation, etc.

        There are four major categories of image recognition tasks in computer vision:

                (1) Classification: answers the question "what?", i.e. judges what kinds of objects a picture or a video contains.

                (2) Localization: answers the question "where?", i.e. locates the position of the target.

                (3) Detection: answers "where is it and what is it?", i.e. locates the target and identifies what the target object is.

                (4) Segmentation: divided into instance-level segmentation and scene-level segmentation, it answers "which target or scene does each pixel belong to?"

        "Semantics" refers to meaning that can be expressed and discussed in language, and "segmentation" refers to image segmentation. Semantic segmentation separates the parts of an entire image so that each part carries a category meaning. It differs from object detection: object detection only needs to find the targets in the picture, draw a box around them and classify them, whereas semantic segmentation partitions the whole image into regions with no gaps, where every region is a category and anything without a category defaults to the background.

        In summary, semantic segmentation aims to assign each pixel in an image to one of the predefined semantic categories. Unlike traditional object detection, semantic segmentation not only requires identifying the objects present in the image but also assigning a semantic label to every pixel. It helps computers achieve a more refined and accurate understanding of the objects in images, make better use of visual information, and support more intelligent applications in various fields.

2. Principle of semantic segmentation

        To recognize every part of an entire image means being accurate down to the pixel, so semantic segmentation actually classifies each pixel in the image and determines the category of every point (e.g. background, person, car, horse, etc.) in order to partition the regions.

        So how do you color the pixels?

        The output of semantic segmentation is similar to that of an image classification network. An image classifier outputs a one-dimensional one-hot vector whose length is the number of categories, e.g. five categories: [0,1,0,0,0]. The final output feature map of a semantic segmentation network is a three-dimensional tensor whose spatial size is similar to the original image and whose number of channels equals the number of categories; the pixels marked in each channel are the positions of that category in the image. Finally, argmax is taken across the channels of each pixel to synthesize a single image, and different colors are used to represent the different categories. As shown below:

        Input: Colored original image [3,256,256]. Output: Grayscale image [256,256]

        Note: The output labels y_true in this project use grayscale class indices rather than colors. The category list is [0,1,2,3,...,num_classes]; 0 represents the black background, and each following integer represents one category. The project's watch_result.py script can highlight each category.
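As a minimal sketch (shapes assumed from the [3,256,256] input and [256,256] output mentioned above), this is how a per-pixel class map can be obtained from the network output with argmax over the channel dimension:

import torch

num_classes, H, W = 2, 256, 256              # e.g. background + one category
logits = torch.randn(1, num_classes, H, W)   # hypothetical network output (batch of 1)

class_map = torch.argmax(logits, dim=1)      # [1, H, W], each value is a category index
class_map = class_map.squeeze(0)             # [256, 256] grayscale label map: 0 = background
print(class_map.shape, class_map.unique())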

3. Semantic segmentation meaning

       The power of a CNN lies in its multi-layer structure, which can automatically learn features at multiple levels: a shallower convolutional layer has a smaller receptive field and learns features of local areas, while a deeper convolutional layer has a larger receptive field and learns more abstract features. These abstract features are very helpful for classification and improve classification performance, so a CNN can determine very well which categories of objects an image contains. However, the drawback is also obvious: abstract features are insensitive to the size, position and orientation of objects, whereas semantic segmentation has to determine the category of every pixel and segment accurately, at the pixel level.

4. Semantic segmentation application scenarios

Autonomous driving: Autonomous vehicles have the ability to "environmentally perceive" so that they can drive safely.

Medical image diagnosis: The machine can intelligently analyze medical images, reducing the workload of doctors and greatly reducing the time required to run diagnostic tests, such as cell segmentation and recognition, lung shape diagnosis, and identification of tumors, lesions, and abnormal tissues.

Determination of the landing point of the drone: Before the drone lands, the image of the open space on the ground is identified and segmented, and the size and shape are used to determine whether it can land safely.

5. Advantages of Unet

Unet is a popular semantic segmentation model that has the following advantages over many other semantic segmentation algorithms:

  1. High Accuracy: Unet achieves impressive results on many public datasets, achieving high-accuracy segmentation in many cases.

  2. Small sample learning: Unet has fewer parameters and trainable variables, which makes it more robust to small sample learning and suitable for situations where the data set is small or the sample is unbalanced.

  3. Multi-task learning: Unet can handle multiple tasks at the same time, such as segmentation, detection, and classification, etc., thus integrating multiple tasks into a single model.

  4. Data augmentation: Because Unet takes into account the shape and contextual information of the segmented objects, it is able to perform effective data augmentation through data augmentation techniques such as rotation and mirroring, making the model more robust.

  5. Fast inference: Unet has a simple, intuitive architecture and parameter sharing, which makes it fast during inference. In addition, Unet can use GPU acceleration to further increase inference speed.

     Overall, Unet has advantages in accuracy, small-sample learning, multi-task learning, data augmentation and fast inference, which makes it a popular semantic segmentation algorithm.

In semantic segmentation tasks in the medical field, the advantages of the Unet network are particularly significant:

        ​ ​ ​ Less training data: Medical image data is often difficult to obtain and difficult to label. The Unet network can achieve better performance with less training data by using skip connections to combine low-level features in the encoder with high-level features in the decoder.

        Good generalization performance: Medical image data often suffers from noise, artifacts, and different imaging conditions. The Unet network avoids the loss of fine spatial information by using a deconvolution layer (transpose convolution) for upsampling operations, thereby achieving better generalization performance.

        Able to handle large-size images: Medical images usually have higher resolution and size, requiring models that can handle large-size images. The Unet network can effectively handle large-size images by using pooling operations for downsampling and deconvolution layers for upsampling.

        Multi-scale segmentation: The size and location of structures in medical images vary greatly, and multi-scale feature expression needs to be considered. By using skip connections and upsampling operations, the Unet network can extract and fuse information from feature maps at multiple scales to achieve multi-scale segmentation.

6. Prior knowledge reserve

(1) Encoder-decoder: the encoder is a classification network used to extract features, and the decoder gradually restores the spatial information that the encoder lost. This kind of method works to some extent and can recover part of the information, but information that has been lost cannot be fully restored. Typical algorithms with this structure include SegNet and RefineNet.

(2) Upsampling: in a convolutional neural network, after the input image passes through the CNN to extract features, the output usually becomes smaller, and sometimes we need to restore the image to its original size for further computation. Upsampling is the operation that maps an image from a lower resolution to a higher resolution. In Unet, the downsampled feature maps must be enlarged back to the size of the original image to raise the resolution, so that classification can be performed for the pixels of the original image. Convolution changes the number of channels, e.g. (1,128,40,40) => (1,64,40,40), while upsampling changes the spatial size, e.g. (1,128,40,40) => (1,128,80,80). Methods: bilinear interpolation, transposed convolution.
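A minimal PyTorch sketch of the two upsampling methods just mentioned (the tensor shapes are only illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 128, 40, 40)  # illustrative feature map (N, C, H, W)

# Bilinear interpolation: doubles the spatial size, channel count unchanged
print(F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False).shape)  # [1, 128, 80, 80]

# Transposed convolution: learnable upsampling, here also reducing channels 128 -> 64
print(nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)(x).shape)                 # [1, 64, 80, 80]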

(3) Feature fusion: since a CNN loses image details during convolution and pooling (the feature map gradually becomes smaller), it cannot clearly indicate the exact outline of an object or which object each pixel belongs to, and therefore cannot produce a precise segmentation. The purpose of feature fusion is to combine the features extracted from the image into a feature that is more discriminative than the inputs. How to fuse features correctly is a hard problem, and fusing features of different scales is an important means of improving segmentation performance. Low-level features have higher resolution and contain more location and detail information, but because they have passed through fewer convolutions they carry less semantics and more noise. High-level features have stronger semantic information but low resolution and poor perception of detail. Feature fusion makes comprehensive use of multiple image features so that their advantages complement each other and the recognition result is more robust and accurate. In Unet, upsampling is accompanied by fusion with the high-resolution features from the downsampling path. Two classic feature fusion methods:

     a.) concat: series fusion, directly concatenating the features along the channel dimension. If the two input features x and y have channel dimensions p and q, the output feature z has dimension p+q; the height and width of the two inputs must be the same. It completely retains all of the information, but the number of channels C increases (x2), which means more parameters in the subsequent convolutions. E.g. (1,64,40,40) concat (1,64,40,40) => (1,128,40,40)

     b.) add: parallel strategy, combining the two feature maps by element-wise addition, z = x + y. The channel count C, height and width of the two inputs must all be the same. It keeps less information than concat, but C, height and width stay unchanged after fusion. E.g. (1,64,40,40) add (1,64,40,40) => (1,64,40,40)
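The two fusion methods as a minimal PyTorch sketch (shapes taken from the examples above):

import torch

x = torch.randn(1, 64, 40, 40)
y = torch.randn(1, 64, 40, 40)

# concat: stack along the channel dimension, channel count doubles
print(torch.cat([x, y], dim=1).shape)  # torch.Size([1, 128, 40, 40])

# add: element-wise addition, all dimensions unchanged
print((x + y).shape)                   # torch.Size([1, 64, 40, 40])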

(4) FPN (featurized image pyramid)

        ​ ​ ​ First, the original image is scaled to obtain images of different sizes, then feature maps of different sizes are generated based on the images of each size, and finally predictions are made based on multi-size feature maps. This method requires generating feature maps for images of each size, which consumes more computing and memory resources.

(5) What is an 8-bit image? What is a single-channel image? What is a grayscale image? How are they related?

       An 8-bit image is a digital image in which each pixel uses an 8-bit binary number (one byte) to represent its color or brightness value. Eight bits can encode 256 possible values, from 0 to 255, representing different colors or gray levels. In the standard RGB color space, an image with 8 bits per channel contains three channels (red, green and blue), and each channel uses an 8-bit number for the color intensity of each pixel, so one pixel can take 256x256x256 different colors, about 16.7 million. The 8-bit format is widely used, for example in web design, photography and image editing; it has the advantages of small file size, simple processing and good compatibility, and images can still be improved by adjusting pixel colors and contrast. Note that because each channel of each pixel uses only 8 bits, the range of colors and detail it can represent is more limited than in images with higher bit depths. Color images and 8-bit images are related: in the RGB color space a color image consists of three channels, each stored with 8 bits per pixel, giving roughly 16.7 million colors; images in this mode are also called 24-bit or true-color images. When a color image is converted to an 8-bit image, each pixel uses a single 8-bit number for its color information, compressing the image to a smaller size; only 256 colors are then available per pixel, so the color accuracy is relatively low but the file is also smaller. 8-bit images can also use indexed color, mapping each pixel value to a color in a color table, to provide acceptable color accuracy while keeping the file small. In general, both are digital images, and converting or compressing between them serves different application needs: color images provide higher color accuracy and richer visuals, while 8-bit images suit scenarios that require small files and have lower color requirements.

       A single-channel image is an image whose color mode contains only one channel. For example, in a grayscale image each pixel has only one channel representing the brightness value; in an indexed-color image each pixel has only one channel representing the color index; and by keeping only one of the RGB channels you can obtain a single-channel red, green or blue image. Since single-channel images only need to store one channel of information, they have smaller file sizes and faster processing speeds. They also have some special application scenarios: 1. grayscale images are often used in medical imaging, text recognition and computer vision; 2. monochrome images can be used in the design of printed or knitted items, or to create black-and-white artwork; 3. color-indexed images are widely used in resource-constrained environments such as low-resolution screens, old computer games and embedded systems. Overall, single-channel images are a useful image type that allows more efficient processing without losing important information.

        A grayscale image, also called a gray-level or single-channel image, is a digital image without color information. In a grayscale image the value of each pixel represents the intensity or brightness of its corresponding area, and such images are often used for black-and-white photos or other images lacking color. A grayscale image consists of a single channel in which each pixel has one gray level. In a standard 8-bit grayscale image each pixel is represented by 8 bits, giving 256 possible gray levels, from 0 for the darkest black to 255 for the brightest white; grayscale images can also use a bit depth of 16 bits or more for greater precision. Grayscale images are widely used in medical imaging, computer vision, image processing and computer graphics, and because of their simplicity and low requirements they are also commonly used for testing and validating computer vision algorithms.

        ​ ​ ​ 8-bit images and grayscale images have a certain relationship, because they both use 8-bit binary numbers (that is, one byte) to represent the brightness value of each pixel. However, there are some differences between them. An 8-bit image contains three color channels (red, green, and blue), and each channel uses an 8-bit binary number to represent the color intensity of each pixel. This means that a pixel can have 256x256x256 different colors, which is approximately 16.7 million colors. In a standard 8-bit grayscale image, each pixel represents its brightness value as an 8-bit binary number, with only 256 possible brightness levels, ranging from 0 for the darkest black to 255 for the brightest white. Therefore, grayscale images only have black and white colors and no color information. Although both 8-bit images and grayscale images use the same bit depth (8 bits), they differ in how the data is represented and the color information. 8-bit images have richer color information and are suitable for scenarios where color images need to be displayed, such as network design, photography, and image editing. Since grayscale images only contain one channel of brightness information, they are usually suitable for medical images, computer vision, and image processing.

       Single-channel images and grayscale images are closely related concepts. A grayscale image is a single-channel image in which each pixel has only one gray value, representing the brightness level. In other words, a grayscale image is a single-channel image that uses only the luminance channel in color space. In a standard 8-bit grayscale image, each pixel uses an 8-bit binary number to represent its brightness value, which can have 256 different gray levels, from 0 representing the darkest black to 255 representing the brightest white. These gray levels describe the brightness value of each pixel in a grayscale image and do not contain color-related information. Therefore, a grayscale image is a single-channel image that contains only one channel of grayscale information. However, not all single-channel images are grayscale images. For example, in indexed color mode, each pixel's value represents a color index that corresponds to a specific color in the color table. This indexed color technology enables single-channel images to provide appropriate color accuracy while maintaining a relatively small file size. Therefore, single-channel images and grayscale images, while related, are not entirely equivalent.

This project is processed by converting the original RGB color image into a grayscale image.

(6) What is a confusion matrix? What is IoU?

 As shown in the figure, the left side is the label map (y_true) and the right side is the prediction map (y_pred).

       For the predicted results, besides judging the quality of the segmentation by eye, how can we use more precise numerical values to judge the quality of the model? This is what evaluation metrics are for. In the computer vision branch of deep learning there are many image tasks, such as semantic segmentation, object detection, instance segmentation and pedestrian re-identification, and the classification results of the target regions in the predicted image must be compared with the original labeled image for consistency. For semantic segmentation, this means comparing the two images (y_true, y_pred) statistically to count how many positive pixels were predicted correctly, how many pixels that should have been positive were predicted wrongly, how many negative pixels were predicted correctly, and how many pixels that should have been negative were predicted wrongly.

TP

True Positive (A): the actual value and the predicted value are the same, i.e. both are class A. The TP of class A is simply cell 1, which has a value of 15.

FP

False Positive (A): the actual value is negative (in our case class B or class C) but the model predicts it as positive, i.e. class A. It is the sum of the values in the corresponding column, excluding the TP cell.

False Positive (A) = (cell 4 + cell 7) = 7 + 2 = 9

TN

True Negative (A): the actual value and the predicted value agree that the sample is not class A; for A, class B and class C are the negative classes. It is the sum of the values of all rows and columns that do not belong to A.

True Negative (A) = (cell 5 + cell 6 + cell 8 + cell 9) = 15 + 8 + 3 + 45 = 71

FN

False Negative (A): the actual value is positive (in our case class A) but the model predicts it as negative, i.e. class B or class C. It is the sum of the values in the corresponding row, excluding the TP cell.

False Negative (A) = (cell 2 + cell 3) = 2 + 3 = 5

         We then collect these numbers into the table above and compute the proportion of correct predictions among all predictions (correct + incorrect); with these proportions we can make numerical judgments. The table is called a confusion matrix, and many statistical indicators can be computed from it:

    Accuracy: for the model as a whole
    Precision
    Sensitivity (Recall)
    Specificity

 A dedicated evaluation metric for segmentation: IoU

IoU (Intersection over Union) is a metric used to measure the performance of object detection or segmentation models, while accuracy, recall and precision are metrics widely used in machine-learning classification tasks. What they have in common is that they are all computed from the confusion matrix of the two images. Their differences are as follows:

  1. Calculation method: IoU is normally used in detection or segmentation tasks and computes the ratio of the intersection area to the union area between the model prediction and the ground-truth label. Accuracy, recall and precision directly compare the agreement between model predictions and real labels in binary or multi-class classification tasks.

  2. Formula: IoU = TP/(TP+FP+FN) = 15/(15+(7+2)+(2+3)) = 15/29 ≈ 0.5172

  3. Sensitivity to imbalanced data: since IoU computes an intersection-over-union ratio, it can be problematic when the data is imbalanced, because the model can reach a higher IoU simply by capturing the details of the large categories well. Accuracy, recall and precision handle imbalanced data somewhat better because they only count correct and incorrect classifications.

  4. Meaning of the metric: IoU reflects the similarity between the model prediction and the ground-truth label in a detection or segmentation task and helps us evaluate the segmentation quality for different categories. Accuracy, recall and precision focus more on correct and incorrect decisions in a classification task and help us evaluate the classification quality for different categories.

  5. mIoU is the average IoU over all categories. In this project name_classes = 1+1 and mIoU = sum(iou)/name_classes; with one foreground category, name_classes = 2 (the background counts as one class by default), with two categories, name_classes = 3. When computing the individual IoU of one category, pixels of all other categories are treated as negative examples.
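As an illustration of how the confusion matrix feeds into IoU and mIoU, here is a minimal numpy sketch (not the project's handle_evaluation.py; function and variable names are made up), assuming y_true and y_pred are integer class maps of the same shape:

import numpy as np

def compute_miou(y_true, y_pred, num_classes):
    # confusion[i, j] = number of pixels whose true class is i and predicted class is j
    mask = (y_true >= 0) & (y_true < num_classes)
    confusion = np.bincount(
        num_classes * y_true[mask].astype(int) + y_pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

    tp = np.diag(confusion)                 # correctly predicted pixels per class
    fp = confusion.sum(axis=0) - tp         # predicted as the class but actually another class
    fn = confusion.sum(axis=1) - tp         # pixels of the class that were missed
    iou = tp / np.maximum(tp + fp + fn, 1)  # IoU = TP / (TP + FP + FN), avoid division by zero
    return iou, iou.mean()

# Toy example: background = 0, one foreground category = 1
y_true = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 1]])
y_pred = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 1]])
iou, miou = compute_miou(y_true, y_pred, num_classes=2)
print(iou, miou)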

7. Semantic segmentation process

8. Project structure and introduction

 ./datasets: stores original images, labeled json images, and converted y_true grayscale images

./evaluation: evaluation metric script handle_evaluation.py. Input: y_true path + y_pred path + category list [0,1,2,...,num_classes]. Output: [miou, recall, precision]. Content: the IoU is computed separately for each category of each picture; when computing the IoU of one category, pixels of all other categories are treated as background and set to 0, so each per-class IoU is effectively a binary (class vs. background) computation.

./GPU: If the project environment requires gpu configuration, store the offline download package of torch_gpu. If not, you can configure it yourself according to the subsequent configuration environment steps. There is no need to use it when running on cpu

./params: model file storage address

./result: The storage location of the predicted result image (y_pred grayscale image)

./training_image: During the training process, the first image of each batch is taken and stored here so that the training effect can be viewed with the naked eye.

./rename.py: Rename the original data set in order 1.png, 2.png, 3.png, 4.png..., enter the path of the data set, and execute to get the result.

./make_mask.py (key point): converts the original images and the labeled json files into y_true and stores them in SegmentationClass. As explained before, y_true is a grayscale image; in order to distinguish the categories in y_true, line 12 of the code sets CLASS_NAMES=['tongue']: the black background is 0 by default, category one (tongue) is named 1, category two is named 2, category three is named 3, and so on. This project has only one category, tongue. As for why the naming starts from 0, see the explanation of grayscale images in 6. Prior knowledge reserve. One picture can hold at most 255 categories besides the background; in practice nowhere near that many are used, but it is more than enough. Code content: parse the json file to find the label and points fields, create a blank picture, and fill the region defined by the point coordinates with the label value.

./data.py: processes the images into torch.tensor format; it can also be used to check the number of samples in the dataset.

./util.py: Called in conjunction with data.py, the function is to unify the image size [256*256]. Create a blank rectangle of [256,256], paste the image onto the blank rectangle and return.

./net.py: Construct Unet network structure

./train.py: calls the data.py, util.py and net.py scripts and loads the dataset from the original-image path + y_true path. By default, line 16 sets the batch size to 1 image at a time, and line 29 runs 200 epochs, saving the model to the file path specified under ./params every 50 epochs. At the same time, the first image of each batch is saved to the ./training_image path. The optimizer is Adam and the loss function is CrossEntropyLoss (multi-class cross entropy). Input: training set + test set. Output: visualization images produced during training + the unet.pth model

./test.py: input: the path of a normal color image; output: the y_pred grayscale prediction, saved to ./result. After the original image is read, it is converted into tensor format, the model is loaded, the tensor is fed into the network, and the result is obtained.

./watch_result.py: views a highlighted version of the y_pred grayscale prediction in ./result. The script can also highlight any other grayscale image, e.g. y_true. Content: p = np.where(p == 1, 1, 0).astype(float): when the prediction map contains multiple categories, the label value of interest is kept and all other category values are treated as interference and set to background 0;

p = np.where(p == 1, 1, 0).astype(float)  

./number_classes.py: views the categories in the (y_true) label map and the (y_pred) prediction map: run number_classes.py with the label-map and prediction-map paths; the image is converted into an array, flattened with reshape(-1) and de-duplicated, which shows which categories the image contains.
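A minimal sketch of the idea behind number_classes.py (the file path is only an example): flatten a grayscale label or prediction image and list the distinct class indices.

import numpy as np
from PIL import Image

img = np.array(Image.open('datasets/SegmentationClass/1.png'))  # assumed example path
flat = img.reshape(-1)   # flatten to 1-D
print(np.unique(flat))   # e.g. [0 1] -> background + one category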

9. Installation environment

a.) Create the virtual environment at the path you want to specify (my environment path is D:\project\tongue_separate)

conda create --prefix D:\envs\handle_Unet python==3.8

b.) Activate virtual environment

conda activate D:/envs/handle_Unet

c.) Enter the project path (my project path is D:\project\handle_UNet)

cd D:\project\handle_UNet

d.) Install the gpu operating environment (if you don’t want to use a gpu or don’t have a gpu, the program will automatically identify and use the cpu)

Three are indispensable: CUDA configuration, cuDNN, torch-gpu version

Installation tutorial reference:

https://zhuanlan.zhihu.com/p/586913250

CUDA11.3 and PyTorch-GPU version installation (notes)_Senior Johngo

torch-gpu version: torch-1.13.1+cu117-cp38-cp38-win_amd64.whl

e.) Enter the virtual environment, enter the project path, and install all the environment packages used in the project I packaged.

pip install -r requirements.txt

If there is an error, please search Baidu for installation tutorials.

f.) If every script runs without errors, the setup is done

10. Implementation process

a.)

Input: ordinary color pictures, no size limit

Output: predicted grayscale image, size [256*256]

Image change process (see project documents for high-definition images): input->output

 b.) Introduction to UNet: simple and practical, mainly in the medical field

The red box is the downsampling and feature-extraction part. Like other convolutional neural networks, it extracts image features through stacked convolutions and compresses the feature maps through pooling. The feature extraction part can use an excellent backbone network such as ResNet50 or VGG. The blue box is the upsampling and image-restoration part.

The first part is the backbone feature extraction part, which we use to obtain successive feature layers. The backbone of Unet is similar to VGG: a stack of convolutions and max pooling. Using the backbone feature extraction part we obtain five preliminary effective feature layers, which are used for feature fusion in the second step.

The second part is the enhanced feature extraction part. The five preliminary effective feature layers obtained from the backbone are upsampled and fused (the upsampling results are stacked along the channel dimension) to obtain one final effective feature layer that fuses all the features.

The third part is the prediction part. The final effective feature layer is used to classify each feature point, which is equivalent to classifying each pixel.

Other algorithm structures: conditional random fields (undirected probabilistic graphical models that increase the correlation between pixels); GAN adversarial networks; DeepLab v1-v3; Dilation10; PSPNet (2016); FCN

 c.) Overall flow chart (the high-definition picture is in the relevant documents of the project folder, open the picture separately and enlarge it to make it clear):

  • Choose the right algorithm

        This project is tongue image recognition, that is, the user inputs a tongue image picture and identifies the tongue image area. Analysis shows that semantic segmentation in the field of computer vision is most suitable for this task. Searching for information shows that compared to other algorithms for semantic segmentation, the UNet algorithm is most suitable in the medical field. Therefore, the UNet network is used as the model architecture. Next, I started to study the ideas of the UNet paper. After getting familiar with the paper, I planned and determined the entire project process.

  • Dataset preparation

         Put the original pictures into ./datasets/JPEGImages and execute rename.py to rename 1.png, 2.png, 3.png... in order.

  • Data preprocessing

         Labeling: the purpose of labeling is to produce y_true from the original images, so that the y_pred predictions generated during training have ground truth against which the loss can be computed, allowing the loss to be reduced and the parameters to be adjusted. It also means that both y_true and y_pred are available for computing IoU after the model is trained.

pip install labelme==3.16.7 -i https://mirrors.aliyun.com/pypi/simple/
labelme

         Following 9.b and 9.c, enter the virtual environment, enter the project path, install labelme, and start labelme.exe. Mark the target area of each picture and save the result to D:\project\handle_UNet\datasets\before. The labeled files are in .json format. The original images corresponding to the .json files also need to be put into D:\project\handle_UNet\datasets\before.

       Then execute the script make_mask.py. It converts the original images and the labeled json files into y_true and stores them in SegmentationClass. Since y_true is a grayscale image, in order to distinguish the categories in y_true, line 12 of the code sets CLASS_NAMES=['tongue']: the black background is 0 by default, category one (tongue) is named 1, category two is named 2, category three is named 3, and so on. This project has only one category, tongue. As for why the naming starts from 0, see the explanation of grayscale images in 6. Prior knowledge reserve; one picture can hold at most 255 categories besides the background, which is more than enough in practice. After processing, y_true is automatically saved to the specified ./datasets/SegmentationClass. At this point y_true is a grayscale image that looks completely black to the naked eye.
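A minimal sketch of the idea behind make_mask.py (not the project's exact code): parse a labelme-style json file, draw each labeled polygon into a blank grayscale image, and fill it with the class index. CLASS_NAMES and the paths are assumptions for illustration.

import json
from PIL import Image, ImageDraw

CLASS_NAMES = ['tongue']  # background = 0, 'tongue' = 1, further classes = 2, 3, ...

def json_to_mask(json_path, out_path):
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # blank black image ('L' mode = single-channel grayscale)
    mask = Image.new('L', (data['imageWidth'], data['imageHeight']), 0)
    draw = ImageDraw.Draw(mask)
    for shape in data['shapes']:
        label = shape['label']                        # e.g. 'tongue'
        points = [tuple(p) for p in shape['points']]  # polygon vertices from labelme
        class_index = CLASS_NAMES.index(label) + 1    # 0 is reserved for the background
        draw.polygon(points, fill=class_index)
    mask.save(out_path)

# json_to_mask('datasets/before/1.json', 'datasets/SegmentationClass/1.png')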

         Write the data.py and utils.py scripts; their content is to convert x and y_true into torch.Tensor format, [3, 256, 256] and [256, 256] respectively, because in computer vision tasks images must first be converted into numerical values that the computer can work with.
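A minimal sketch of what data.py / utils.py do (the file layout and names are assumptions; the project's actual code may differ): pad each image to a 256x256 square and return (image tensor [3,256,256], label tensor [256,256]).

import os
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

def keep_image_size(path, size=(256, 256)):
    # paste the image onto a black square, then resize; NEAREST keeps label indices intact
    img = Image.open(path)
    longest = max(img.size)
    square = Image.new(img.mode, (longest, longest))
    square.paste(img, (0, 0))
    return square.resize(size, Image.NEAREST)

class SegDataset(Dataset):
    # assumed layout: root/JPEGImages/<name> and root/SegmentationClass/<name>
    def __init__(self, root):
        self.root = root
        self.names = os.listdir(os.path.join(root, 'SegmentationClass'))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        name = self.names[i]
        image = keep_image_size(os.path.join(self.root, 'JPEGImages', name)).convert('RGB')
        label = keep_image_size(os.path.join(self.root, 'SegmentationClass', name))  # assumed 'L' index image
        x = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0  # [3, 256, 256]
        y = torch.from_numpy(np.array(label)).long()                            # [256, 256]
        return x, y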

 

 

  • Build network architecture

According to the UNet network structure, upsampling + convolution + downsampling, encapsulated into the UNet class

Contents of self.out = nn.Conv2d(64, num_classes, 3, 1, 1):

  • nn.Conv2d: Indicates the use of two-dimensional convolution operation, that is, convolution operation on the input in two-dimensional space.
  • 64: Indicates that the number of input channels of the convolution layer is 64.
  • num_classes: Indicates that the number of output channels of the convolution layer is num_classes.
  • 3: Indicates that the size of the convolution kernel is 3x3.
  • 1: Indicates that the step size of the convolution operation is 1. Here, the step size refers to the distance the convolution kernel slides each time.
  • 1: Indicates that the padding value around the convolution kernel is 1. Here, padding refers to filling a circle of pixels with a value of 0 around the input data so that the convolution kernel can process edge pixels.

So what this line of code does is create a convolutional layer with 64 input channels and num_classes output channels, a convolution kernel of size 3x3, a stride of 1 and a padding of 1. Such a convolutional layer is a common building block in image classification, detection and segmentation models; here it is the final layer that outputs one channel per category, i.e. the per-pixel classification scores.
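A minimal sketch of a UNet-style building block and output head (a simplified illustration, not the project's net.py): two 3x3 convolutions per stage, max-pool downsampling, bilinear upsampling, and channel concatenation as the skip connection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = DoubleConv(3, 64)
        self.enc2 = DoubleConv(64, 128)
        self.dec1 = DoubleConv(128 + 64, 64)             # concat skip: 128 upsampled + 64 from enc1
        self.out = nn.Conv2d(64, num_classes, 3, 1, 1)   # final per-pixel classification layer

    def forward(self, x):
        e1 = self.enc1(x)                                # [B, 64, H, W]
        e2 = self.enc2(F.max_pool2d(e1, 2))              # [B, 128, H/2, W/2]
        up = F.interpolate(e2, scale_factor=2, mode='bilinear', align_corners=False)
        d1 = self.dec1(torch.cat([up, e1], dim=1))       # feature fusion by concatenation
        return self.out(d1)                              # [B, num_classes, H, W]

print(TinyUNet()(torch.randn(1, 3, 256, 256)).shape)     # torch.Size([1, 2, 256, 256])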

  • train

        ​​​​​Input: training set + test set. Output: renderings +unet.pth model during training       

        Call the data.py, util.py and net.py scripts and load the dataset from the original-image path + y_true path. By default, line 16 sets the batch size to 1 image at a time; line 29 runs 200 epochs, the loss is printed during each epoch, and the model is saved to the file path specified under ./params every 50 epochs. The optimizer is Adam and the loss function is CrossEntropyLoss (multi-class cross entropy).

        At the same time, the first picture of each batch is saved to the ./training_image file path.
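A minimal training-loop sketch following the description above (Adam + CrossEntropyLoss). SegDataset and TinyUNet refer to the illustrative sketches earlier in this post, and the save path/name is assumed, so this is not the project's exact train.py.

import torch
from torch.utils.data import DataLoader

device = 'cuda' if torch.cuda.is_available() else 'cpu'
dataset = SegDataset('datasets')              # from the data-preprocessing sketch above
loader = DataLoader(dataset, batch_size=1, shuffle=True)
net = TinyUNet(num_classes=2).to(device)      # from the network sketch above
opt = torch.optim.Adam(net.parameters())
loss_fn = torch.nn.CrossEntropyLoss()         # multi-class cross entropy, applied per pixel

for epoch in range(1, 201):                   # 200 epochs by default
    for x, y in loader:
        x, y = x.to(device), y.to(device)     # x: [B,3,256,256], y: [B,256,256] (long)
        out = net(x)                          # [B, num_classes, 256, 256]
        loss = loss_fn(out, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(epoch, loss.item())                 # loss of the last batch in this epoch
    if epoch % 50 == 0:
        torch.save(net.state_dict(), f'params/unet_{epoch}.pth')  # assumed save path (./params must exist)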

  • test

During Unet prediction, in order to obtain the final pixel-level segmentation result, the following steps need to be performed:

Convert the input image to a PyTorch tensor and move it to the GPU (if needed);
Use the trained Unet model to run a forward pass on the input tensor and obtain the output prediction tensor out;
For the output tensor out, find the category index with the highest predicted score for each pixel along the channel dimension, i.e. use torch.argmax() to take the index of the maximum along the second dimension (the channel/category dimension), which yields a two-dimensional integer tensor of shape (H, W).
Reduce the dimensionality of the output to remove the unnecessary batch and channel dimensions. Note that in the training phase the data is usually loaded in batches, so a training image has shape (B, C, H, W), where B is the batch size, C the number of channels (i.e. the number of categories), and H and W the height and width. In the testing phase only a single image is processed, so a batch dimension must first be added with torch.unsqueeze() so that the input has shape (1, C, H, W). After prediction, torch.squeeze() removes the redundant batch and channel dimensions, giving a two-dimensional integer tensor of shape (H, W). To meet the requirements of the subsequent visualization functions, a batch dimension is then added again with torch.unsqueeze().
The final output prediction result out is a three-dimensional integer tensor with a shape of (1, H, W), which represents the pixel-level segmentation result of the model on the input image.

        Load the model. Input: the path of a normal color image; output: the y_pred grayscale prediction, saved to ./result. After the original image is read, it is converted into tensor format [3,256,256], the model is loaded, the tensor is fed into the network, and the result is obtained.

        Model visualization: multiply the grayscale prediction by 255 to obtain a highlighted image, so the agreement between the prediction and the ground truth can be judged visually. Content: p = np.where(p == 1, 1, 0).astype(float): when the prediction map contains multiple categories, the label value of interest is kept and all other category values are treated as interference and set to background 0;
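A minimal inference + visualization sketch following the steps above. The checkpoint path, output paths and the keep_image_size/TinyUNet helpers are the illustrative names from the earlier sketches, not necessarily the project's exact test.py / watch_result.py code.

import numpy as np
import torch
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = TinyUNet(num_classes=2).to(device)
net.load_state_dict(torch.load('params/unet_200.pth', map_location=device))  # assumed checkpoint name
net.eval()

img = keep_image_size('datasets/JPEGImages/1.png').convert('RGB')            # pad/resize to 256x256
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0
x = x.unsqueeze(0).to(device)                                                # add batch dim: [1,3,256,256]

with torch.no_grad():
    out = net(x)                                                             # [1, num_classes, 256, 256]
pred = torch.argmax(out, dim=1).squeeze(0).cpu().numpy()                     # [256,256] class-index map

Image.fromarray(pred.astype(np.uint8)).save('result/1.png')                  # y_pred grayscale image

p = np.where(pred == 1, 1, 0).astype(float)                                  # keep class 1, others -> background
Image.fromarray((p * 255).astype(np.uint8)).save('result/1_highlight.png')   # highlighted visualization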

  • Evaluation model

Input: y_true path + y_pred path + category list [0,1,2,...,num_classes].

Output [miou, recall, precision]

Calculate the confusion matrix between the two and calculate MIOU

 

  • deploy

 11. Loss functions

       The difference from IoU: although both relate to judging the model, multi-class cross entropy and IoU are used in different places. Multi-class cross entropy is used on line 25 of train.py: during training, every pixel of the generated y_pred map has its own category, and the cross entropy compares the classification of each y_pred pixel against the true pixel category in y_true, continuously adjusting and optimizing the model parameters. IoU is used after training to evaluate y_pred against y_true and judge the quality of the model.

(1) Cross-entropy loss: binary or multi-class cross-entropy loss, i.e. the loss of classifying each pixel. The binary cross-entropy loss is BCELoss (binary_cross_entropy).

(2) Focal loss: takes the imbalance of class pixels into account, e.g. when the proportions of sky and people in a picture differ greatly. A larger weight can be given to a category to increase its importance.
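A minimal sketch (my own illustration, not the project's train.py) of the two options mentioned above: per-pixel cross entropy with class weights, and a simple multi-class focal loss built on top of it.

import torch
import torch.nn.functional as F

def weighted_ce(logits, target, class_weights):
    # logits: [B, num_classes, H, W], target: [B, H, W] with class indices
    return F.cross_entropy(logits, target, weight=class_weights)

def focal_loss(logits, target, gamma=2.0):
    ce = F.cross_entropy(logits, target, reduction='none')  # per-pixel cross entropy
    pt = torch.exp(-ce)                                      # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()                   # down-weight easy pixels

logits = torch.randn(1, 2, 256, 256)
target = torch.randint(0, 2, (1, 256, 256))
print(weighted_ce(logits, target, torch.tensor([0.2, 0.8])))  # rarer class weighted higher
print(focal_loss(logits, target))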

 12. Evaluation indicators

(1) Mean Intersection over Union (MIoU): as in object detection, it computes the ratio of the intersection to the union of two sets, here the ground truth and the predicted segmentation. The mean intersection over union is the average of the per-class IoU values. This metric judges both how well the target is captured (the prediction should coincide with the annotation as much as possible) and how precise the model is (the union should be covered by the overlap as much as possible). It is used in this project.

 

The differences between IoU and accuracy/recall/precision, and the averaging of per-class IoU into mIoU, are described in 6.(6) above.

(2) Pixel Accuracy (PA): the ratio of the number of correctly classified pixels to the total number of pixels.

(3) Mean Pixel Accuracy (MPA): a variant of PA; for each class, compute the ratio of correctly classified pixels to all pixels of that class in the ground truth, then average over all classes.
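A minimal sketch of PA and MPA computed from a confusion matrix (function names are illustrative; the 3x3 matrix below uses the values implied by the worked example in 6.(6)):

import numpy as np

def pixel_accuracy(confusion):
    # PA: correctly classified pixels / all pixels
    return np.diag(confusion).sum() / confusion.sum()

def mean_pixel_accuracy(confusion):
    # MPA: per-class accuracy (TP / ground-truth pixels of that class), then averaged
    per_class = np.diag(confusion) / np.maximum(confusion.sum(axis=1), 1)
    return per_class.mean()

confusion = np.array([[15, 2, 3],
                      [7, 15, 8],
                      [2, 3, 45]])   # rows = actual class, columns = predicted class
print(pixel_accuracy(confusion), mean_pixel_accuracy(confusion))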

13. UNet paper

CNKI search:

                 U-Net: Convolutional Networks for Biomedical Image Segmentation

14. Source code address

Free permanent link: https://download.csdn.net/download/weixin_46412999/87690385

15. How to modify it into your own tasks

a.) make_mask.py: on line 12, change CLASS_NAMES to your own categories (they must be the category names used when labeling); on line 27, if your original images are jpg, change png to jpg

b.) train.py: on line 15, num_classes = n+1, change n to your number of categories; on line 16, set batch_size to how many images to train at a time; on line 29, change the epoch count to the number of epochs you want to train; on line 38, change %1 to how often (in images) the loss is printed within each epoch; on line 47, change %50 to how often (in epochs) the model is saved.

c.) handle_evaluation.py: on line 42, change [0,1] to your category list, for example three categories is [0,1,2,3] and five categories is [0,1,2,3,4,5]

16. Personal experience

Starting from scratch:

Understanding semantic segmentation’s place in deep learning

Understand the principles of semantic segmentation

Understand evaluation metrics

Understand and choose a suitable UNet algorithm model for semantic segmentation and study the paper

Understand the principles of upsampling, downsampling, feature fusion, encoder-decoder, image processing, etc.

Understand gpu environment configuration

Understand how computers understand and process image data (different channel representation, calculation of the number of each category in a picture: deduplication of all pixels)

Hand-writing all of the scripts:

Handwrite make_mask.py to convert the labeled json files into y_true

Handwrite data.py, utils.py, rename.py and other data preprocessing code

Handwrite the UNet network structure

Handwrite the training code

Handwrite the test code

Handwrite the MIOU evaluation metric code

And record all the knowledge and make it into a blog like what you see

Reference blogger link:

Image segmentation UNet hard-core explanation (showing you the unet code), Bilibili

17. Project implementation application

 
