[Paper | Reproduction] Vertebra-Focused Landmark Detection for Scoliosis Assessment

Source: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI)

【paper】

translate

Abstract

In this paper, we propose a novel vertebra-focused landmark detection method.

Our model first localizes the vertebra centers, based on which it then traces the four corner landmarks of the vertebra through the learned corner offset. In this way, our method is able to keep the order of the landmarks.

In short: first locate the center of each vertebra, then use the learned corner offsets to trace its four corner points.

Intro

What AIS (adolescent idiopathic scoliosis) is; the gold standard for assessing it is the Cobb angle; how the Cobb angle is measured → manual measurement is inadequate → automatic methods have attracted a surge of interest.

Traditional unsupervised methods suffer from parameter sensitivity and complex processing → two supervised methods are cited, but both fall short and their improvements are insufficient (first, many parameters and heavy computation; then, downsampling that loses image information).

⇒ Keypoint-based methods do better than direct regression. This paper follows the keypoint approach; how is it done here?

Method

The section first explains why the centers must be located first and the four corners then found from offsets: detecting all points at once cannot guarantee their order, and the order matters for localizing each vertebra accurately. It then explains, clearly and logically, how the order of the points is preserved.

  1. Heatmap of Center Points

  2. Center Offset

    The input center position is mapped to the position in the downsized feature map.

    The center point is tracked from the down-sampled feature map, and then mapped back to the original input image using center offset.

    The center offsets at the center points are trained with L1 loss.

  3. Corner Offset

    The corner offsets are vectors that point from the vertebra center to its four corner points.

    use L1 loss to train the corner offsets at the center points

Summary: First, use ResNet34 and focal loss to obtain the center-point heatmap; second, obtain the mapping between center points in the input and in the feature map (the center offset), trained with L1 loss; third, define corner offset vectors from the center point to obtain the 4 corner points, also trained with L1 loss.
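The three outputs summarized above can be combined into landmarks roughly as follows. This is only a minimal NumPy sketch, not the authors' decoder: the function name `decode_landmarks`, the channel layout of the offset maps, the plain top-k peak selection, and the default `stride=4` and `k=17` are all assumptions made for illustration.

```python
import numpy as np

def decode_landmarks(heatmap, center_offset, corner_offset, stride=4, k=17):
    """Decode up to k vertebrae from toy network outputs (hypothetical layout).

    heatmap:       (H, W) center scores on the down-sampled feature map
    center_offset: (2, H, W) sub-pixel (dx, dy) correction of each center
    corner_offset: (8, H, W) four (dx, dy) vectors from center to corners
    """
    h, w = heatmap.shape
    # Pick the k highest-scoring cells as vertebra centers (simplified peak search).
    flat_idx = np.argsort(heatmap.ravel())[::-1][:k]
    ys, xs = np.unravel_index(flat_idx, (h, w))
    results = []
    for y, x in zip(ys, xs):
        # Refine the integer cell position with the center offset.
        center = np.array([x + center_offset[0, y, x], y + center_offset[1, y, x]])
        # Each corner is the center plus its own offset vector.
        corners = center + corner_offset[:, y, x].reshape(4, 2)
        # Map feature-map coordinates back to input-image coordinates.
        results.append((center * stride, corners * stride))
    # Sort top-to-bottom by center y so the vertebrae keep their order.
    results.sort(key=lambda item: item[0][1])
    return results

# Toy example: one vertebra centered in cell (x=2, y=1).
heatmap = np.zeros((4, 4))
heatmap[1, 2] = 1.0
center_offset = np.zeros((2, 4, 4))
center_offset[0, 1, 2] = 0.5
corner_offset = np.zeros((8, 4, 4))
corner_offset[:, 1, 2] = [-1, -1, 1, -1, -1, 1, 1, 1]
decoded = decode_landmarks(heatmap, center_offset, corner_offset, stride=4, k=1)
```

Because every corner is derived from one center, the four corners of a vertebra always stay grouped, which is the ordering property the notes describe.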

Experiments

  1. Dataset: 580 images = 348 (60%) training + 116 (20%) validation + 116 (20%) testing; original image size 2500x1000

  2. The backbone network ResNet34 [19] is pre-trained on ImageNet.

    Data augmentation, resized input, Adam optimizer, 100 epochs

  3. Metrics: SMAPE evaluates Cobb angle accuracy; Errordec evaluates landmark detection accuracy
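The notes do not spell out the SMAPE formula. A minimal sketch, assuming the form commonly used for Cobb angle evaluation (per image, the summed absolute angle errors divided by the summed predicted and ground-truth angles, averaged over images and expressed in percent); the function name `smape` and the sample angles are made up for illustration:

```python
import numpy as np

def smape(pred_angles, true_angles):
    """Symmetric mean absolute percentage error over N images.

    Both inputs have shape (N, n_angles), e.g. several Cobb angles per image.
    """
    pred = np.asarray(pred_angles, dtype=float)
    true = np.asarray(true_angles, dtype=float)
    # Per-image ratio: sum of absolute errors over sum of both angle sets.
    per_image = np.abs(pred - true).sum(axis=1) / (pred + true).sum(axis=1)
    return 100.0 * per_image.mean()

# One image with three angles: errors 2 + 2 + 0 = 4, angle sum = 120.
score = smape([[10.0, 20.0, 30.0]], [[12.0, 18.0, 30.0]])
```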

Results

  1. Comparison experiment (Fig. 3 + Table 1):
    1. Regression: the input is resized, so the localization is not accurate enough;
    2. Segmentation: localization is accurate, but when a vertebra is unclear, blurry, or low in resolution, the mask is inaccurate, which causes localization errors, especially in the lumbar spine;
    3. Ours: fails to localize the vertebra in case 5, where the morphological features are not distinct.
    4. The conclusion is that ours is best, but it is compared with only two methods, so the experimental section is thin; the visual results are still good. [7] (segmentation, 2019.2) is a Zone 4 journal paper; [10] (regression, 2019.12) is a Zone 1 top-journal paper, yet it performs poorly because its input was resized.

  2. The strategy of predicting center heatmaps enables our model to identify different vertebrae and allows it to detect landmarks robustly from the low contrast images and ambiguous boundaries.

【experiment】

Code: Vertebra-Focused

Experiment procedure
Experiment 1 (this experiment), errors:
1. Qt
loaded weights from weights_spinal/model_last.pth, epoch 16
processing 0/128 image ... sunhl-1th-01-Mar-2017-310 C AP.jpg
totol pts num is 17
qt.qpa.xcb: could not connect to display 
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "/home/think/anaconda3/envs/zj170/lib/python3.9/site-packages/cv2/qt/plugins" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.
Available platform plugins are: xcb.

Solution: The error occurs because the remote IDE cannot forward the graphical interface. For example, it is reported as soon as cv.imshow() is added to the code. The fix suggested online is to use third-party SSH software that supports graphics forwarding, such as MobaXterm.

The steps above were tried somewhat at random; in the end the results appeared after running the command in MobaXterm. So the problem lies in the visualization part of PyCharm. (Without a fundamental fix, rely on the third-party tool for now.)
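Another option, not tried in these notes, is to avoid opening a window when no display is available and write the image to disk instead. A minimal sketch, assuming a Linux server where a missing `DISPLAY` variable means there is no graphical session; the helper names `can_show_windows` and `show_or_save` are hypothetical and not part of the repository:

```python
import os

def can_show_windows() -> bool:
    """Return True if an X11 or Wayland display appears to be available (Linux)."""
    return bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))

def show_or_save(image, window_name, out_path):
    """Show the image in a window if possible, otherwise save it to out_path."""
    import cv2  # imported here so can_show_windows() works without OpenCV

    if can_show_windows():
        cv2.imshow(window_name, image)
        cv2.waitKey(0)
        return None
    # Headless session: no Qt window is created, so the xcb error is avoided.
    cv2.imwrite(out_path, image)
    return out_path
```

The saved files can then be copied from the server and inspected locally.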

2. TypeError: Expecting miMATRIX type here, got 3220159935

Experiment 2 (related experiments)

1. I thought the server was connected to the Internet, but it was not. Even after switching devices I got the same "network is unreachable" error; orz

2. Although the remote interpreter is configured locally, the environment still has to be set up on the server, because modules are missing there.

from scipy.io import loadmat       ModuleNotFoundError: No module named 'scipy'

pip install scipy

relevant
Lay a solid foundation
torch.manual_seed(): The 0 in torch.manual_seed(0) is the seed of the random number generator. The randomness here is not truly random but pseudo-random and controllable: the same seed reproduces the same sequence of numbers. You can replace the 0 with any other integer.
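A small check of this behavior, as a sketch (the tensor size and the seed values are arbitrary):

```python
import torch

torch.manual_seed(0)
a = torch.rand(3)

torch.manual_seed(0)   # same seed: the generator restarts the same sequence
b = torch.rand(3)

torch.manual_seed(1)   # different seed: a different sequence
c = torch.rand(3)

same_seed_matches = torch.equal(a, b)
other_seed_matches = torch.equal(a, c)
```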
os.mkdir() and os.makedirs(): os.mkdir(path) creates a single directory, and the parent directory must already exist; if it does not, an exception is raised. os.makedirs(path), as the name suggests, creates multi-level directories in one call: intermediate directories that do not exist are created for you.
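The difference can be seen in a temporary directory. A short sketch; the folder names `weights` and `run1` are made up:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "weights", "run1")
    try:
        os.mkdir(nested)            # fails: the parent "weights" does not exist
        mkdir_failed = False
    except FileNotFoundError:
        mkdir_failed = True
    # makedirs creates "weights" and "run1" in one call;
    # exist_ok=True avoids an error if the directory is already there.
    os.makedirs(nested, exist_ok=True)
    created = os.path.isdir(nested)
```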

Adam optimizer and code, detailed explanation of the PyTorch optimizer Adam and LossAll: The optimizer updates the parameters of the network from the gradients produced by backpropagation, so as to reduce the value of the loss function.

Epoch, Batch, Iteration, three concepts √: An epoch is one pass of training over all training samples. || When the number of samples in an epoch (that is, all training samples) is too large for the computer, the data is split into smaller chunks, that is, into multiple batches, for training. || Training on one batch is one iteration.

The CIFAR10 data set has 50,000 training images and 10,000 test images. Now select Batch Size = 256 to train the model.

  • Number of images to be trained per Epoch: 50,000
  • Number of batches in the training set: 50000/256 ≈ 195.3, rounded up to 195+1=196
  • Number of batches required for each Epoch: 196
  • Number of Iterations each Epoch has: 196
  • Number of model weight updates occurring per Epoch: 196
  • After 10 epochs of training, the number of model weight updates: 196*10=1960
  • Different epochs actually use data from the same training set. Although the 1st and the 10th epoch both use the same 50,000 training images, their weight updates are completely different, because the model sits at a different position in the cost-function space in each epoch: the later the epoch, the closer the model is to the bottom and the smaller its cost.
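The arithmetic in the list above can be reproduced in a few lines (variable names are arbitrary):

```python
import math

num_train = 50_000
batch_size = 256
epochs = 10

# The last, partial batch still counts as one iteration, hence the ceiling.
iters_per_epoch = math.ceil(num_train / batch_size)              # 196
last_batch_size = num_train - (iters_per_epoch - 1) * batch_size  # 80
total_updates = iters_per_epoch * epochs                          # 1960
```

The final batch of each epoch holds only 80 images, which is why 195 full batches are not enough.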

The relationship and difference between optimizer.step(), loss.backward() and scheduler.step() ★:

1. The optimizer needs to update the parameters of the network based on the gradient information backpropagated by the network to reduce the calculated value of the loss function. Starting from the role of the optimizer, two main things are needed to make the optimizer work:

(1) The optimizer needs to know the parameter space of the current network or other model, which is why in the training file, the parameters of the network need to be placed in the optimizer before officially starting training.

(2) The optimizer needs the gradient information from backpropagation.

2. The step function uses the grad stored with each parameter in param_groups, that is, the gradient for the current parameters. This also explains why the gradients must be zeroed (optimizer.zero_grad()) before use: if they are not cleared, the grad used by step still contains contributions from the previous mini-batch, which is not what we want.

        The optimizer updates the parameters from the backpropagated gradients, so optimizer.step() must be called after loss.backward(): loss.backward() comes first, followed by step.

        optimizer.step() belongs in the per-batch loop, not the per-epoch loop. Mini-batch training treats each mini-batch as one training step and updates the parameters once per step, so optimizer.step() is placed there.
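The ordering described above can be sketched on a toy model. The linear model, the random data, and the StepLR schedule are placeholders chosen only to make the example run; calling scheduler.step() once per epoch is a common convention and an assumption here, since the notes do not describe it.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.L1Loss()
# The optimizer receives the parameters it is allowed to update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
batches = [(inputs[:4], targets[:4]), (inputs[4:], targets[4:])]

num_updates = 0
for epoch in range(2):
    for x, y in batches:
        optimizer.zero_grad()          # clear gradients of the previous mini-batch
        loss = criterion(model(x), y)
        loss.backward()                # compute gradients for this mini-batch
        optimizer.step()               # update parameters once per mini-batch
        num_updates += 1
    scheduler.step()                   # adjust the learning rate once per epoch

final_lr = optimizer.param_groups[0]["lr"]
```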

np.savetxt()

np.savetxt(os.path.join(save_path, 'train_loss.txt'), train_loss, fmt='%.6f')
fname: path of the file to save
X: the NumPy array to save
fmt: format of the saved values
delimiter: column separator; the default is a space, and you can specify another one yourself
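A round trip with the same call, as a sketch (the loss values are invented and a temporary directory stands in for save_path):

```python
import os
import tempfile

import numpy as np

train_loss = np.array([0.9, 0.5, 0.25])

with tempfile.TemporaryDirectory() as save_path:
    fname = os.path.join(save_path, "train_loss.txt")
    # One value per line for a 1-D array, six decimals each.
    np.savetxt(fname, train_loss, fmt="%.6f")
    with open(fname) as f:
        lines = f.read().split()
    restored = np.loadtxt(fname)
```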

Loss and val_loss indicate the quality of the model: loss is the loss on the training set; val_loss is the loss on the test set.

Scenario 1: The train loss keeps decreasing and the test loss keeps decreasing, indicating that the network is still learning.

Solution: This is the best situation for the model; no other measures are needed.

Scenario 2: The train loss keeps decreasing and the test loss levels off, indicating that the network has overfitted.

Solution: Use data augmentation, max pooling, and regularization.

Scenario 3: The train loss levels off and the test loss keeps decreasing, indicating that there is almost certainly a problem with the dataset.

Solution: Check the dataset.

Scenario 4: The train loss levels off and the test loss levels off, indicating that learning has hit a bottleneck.

Solution: Reduce the learning rate or reduce the batch size.

Scenario 5: The train loss keeps rising and the test loss keeps rising, indicating a problem with the network design (the worst case).

Solution: Redesign the model structure and rebuild the dataset.

running_loss += loss.item(): item() returns the value of the loss. The values are accumulated into a total loss, which is finally divided by the number of mini-batches to obtain the average loss.

Why the loss needs item(): PyTorch 0.4.0 removed Variable as a separate type and merged Variable and Tensor. A Variable can be regarded as a Tensor with requires_grad=True; the dynamic-graph mechanism is unchanged.

Getting the data also became more elegant: use running_loss += loss.detach() to obtain the part that does not need gradient backpropagation.

Or use loss.item() to get the corresponding Python number directly.
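A tiny illustration of the two options, as a sketch with made-up loss tensors standing in for real mini-batch losses:

```python
import torch

# Three fake per-batch losses that are attached to a computation graph.
losses = [torch.tensor(0.5, requires_grad=True) * 2 for _ in range(3)]

running_loss = 0.0
for loss in losses:
    running_loss += loss.item()      # plain Python float, no graph kept alive
average_loss = running_loss / len(losses)

detached = losses[0].detach()        # still a tensor, but cut off from the graph
```

Accumulating the tensor itself (without item() or detach()) would keep every batch's graph reachable, which is the memory problem these two calls avoid.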

What are the pitfalls of PyTorch :

Commonly used loss function criteria and codes √

to(device) and cuda() in PyTorch: both methods specify whether the CPU or the GPU processes the data. to(device) can select either the CPU or a GPU; .cuda() can only select a GPU.

Both can achieve the same effect. PyTorch does not use the GPU automatically even if the machine has one; it must be requested explicitly in the program. model.cuda() loads the model onto the GPU, but model.to(device) is recommended instead, because it states explicitly which computing resource to use, which matters especially when there are multiple GPUs.
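A device-agnostic pattern built on to(device), as a minimal sketch (the linear layer and the input are placeholders):

```python
import torch
from torch import nn

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(2, 1).to(device)   # move the parameters
x = torch.ones(1, 2).to(device)      # move the data to the same device
y = model(x)
```

The same script then runs unchanged on a CPU-only machine and on a GPU server.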

with torch.no_grad()☆:

First, the with statement in Python is meant for accessing resources. It ensures that, whether or not an exception occurs during use, the necessary cleanup is performed to release the resource, such as automatically closing a file after use or acquiring and releasing a lock in threads. How it works:
(1) The expression following with is evaluated, the __enter__() method of the returned object is called, and its return value is assigned to the variable after as;
(2) When the code block under with has finished executing, the __exit__() method of that object is called.
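The two steps can be observed with a small hand-written context manager. This is a sketch; the class name `Resource` and the event log are made up for illustration:

```python
class Resource:
    """Minimal context manager that records when it is entered and exited."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self                  # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append("exit")      # runs even if the body raised
        return False                 # do not swallow the exception

events = []
try:
    with Resource(events) as r:
        events.append("body")
        raise ValueError("boom")
except ValueError:
    events.append("handled")
```

The log shows that the cleanup in __exit__() runs before the exception reaches the except clause.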

 Regarding torch.no_grad(), start with requires_grad: in PyTorch a tensor has a requires_grad attribute. If it is set to True, gradients are computed for the tensor automatically during backpropagation. requires_grad defaults to False. If a leaf node (a tensor you created yourself) has requires_grad set to True, then every node that depends on it also has requires_grad = True (even if the other tensors it depends on have requires_grad = False). When requires_grad is False, no gradient is computed for the tensor during backpropagation, which saves a lot of GPU or CPU memory.

What with torch.no_grad() does: inside this block, every tensor produced by a computation has requires_grad set to False. Even if a tensor x has requires_grad = True, a new tensor w (a scalar, say) computed from x inside torch.no_grad() has requires_grad = False and grad_fn = None, so w is not differentiated.

 Results with and without with torch.no_grad(): not every operation in PyTorch needs a computation graph (the record of operations used for gradient backpropagation). Tensor operations build the graph by default; use with torch.no_grad(): to force the code inside the block not to build it.

 torch.enable_grad() : Enable gradient calculation
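A short check of the behavior described in this section, as a sketch (x, y, w, and z are arbitrary names):

```python
import torch

x = torch.ones(2, requires_grad=True)

y = (x * 2).sum()                  # normal mode: the graph is recorded

with torch.no_grad():
    w = (x * 2).sum()              # no graph: requires_grad False, grad_fn None
    with torch.enable_grad():
        z = (x * 2).sum()          # gradient tracking switched back on locally
```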

criterion = loss.LossAll()

Multiple ways to define loss functions: the difference and connection between nn.Module and nn.functional.

When writing a custom layer, both nn.Module and nn.functional work, but the former is recommended. nn.Module is a wrapper class for defining a network layer that maintains state and stores parameters, while nn.functional only provides the computation and keeps neither state nor parameters. Inside a Module class, the layer's computation is actually implemented through nn.functional. For layers with no trainable parameters or state, such as activation functions (relu, sigmoid, etc.), dropout, and pooling, you can use the functional module, and of course you can also use the nn.Module class.

What loss functions and layers have in common: in essence both apply a function to an input to obtain an output, which is exactly what a layer does. The operations differ (linear combinations, convolutions, and so on), but they are all functions. Because of this, a loss function can be defined with the nn.Module class in the same way as a layer.

In PyTorch, many loss functions are already defined as classes (for example nn.L1Loss).
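As a sketch of this idea, a loss can be written as an nn.Module whose forward() delegates to nn.functional. The class name `MeanAbsError` is hypothetical; it simply mirrors the built-in nn.L1Loss and is not the repository's LossAll:

```python
import torch
from torch import nn
import torch.nn.functional as F

class MeanAbsError(nn.Module):
    """Loss defined like a layer: the module wraps a functional call."""

    def forward(self, pred, target):
        return F.l1_loss(pred, target)

pred = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 1.0, 1.0])

criterion = MeanAbsError()
custom_value = criterion(pred, target).item()      # mean of |0|, |1|, |3|
builtin_value = nn.L1Loss()(pred, target).item()   # built-in class, same result
```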

Focal loss: Start from binary cross-entropy (BCE) loss. That training objective requires the model to be very confident in its predictions. Focal loss lets the model be more "relaxed": it does not have to be 80-100% sure that an object is "something". In short, it gives the model more freedom to take some risks when predicting. This matters for highly imbalanced datasets, where in some cases (e.g. cancer detection) even false positives are acceptable, so the model should take risks and try to detect as much as possible. Focal loss is therefore particularly useful under class imbalance, especially in object detection, where most pixels are background and only a very few belong to the object of interest.

In the image above, the "blue" line represents the cross-entropy loss. The x-axis is the "probability of predicting the true label" (call it pt for simplicity). For example, suppose the model predicts that something is a bicycle with probability 0.6 and it is indeed a bicycle; then pt is 0.6. If, in the same situation, the object is not a bicycle, then pt is 0.4, because the true label is 0 and the predicted probability that the object is not a bicycle is 0.4 (1-0.6). The y-axis is the value of focal loss and CE loss at a given pt.

As the image shows, when the model predicts the true label with probability around 0.6, the cross-entropy loss is still around 0.5. To reduce the loss during training, the model therefore has to predict the true label with higher probability. In other words, cross-entropy loss requires the model to be very confident in its predictions. This can hurt performance: deep learning models can become overconfident, and their generalization ability then decreases.

When using focal loss with γ > 1, the training loss of "well-classified samples" (samples the model predicts correctly with high probability) is reduced sharply. For "hard-to-classify samples", such as those with a predicted probability below 0.5, the loss is not reduced much. Therefore, when the classes are imbalanced, the model focuses on the rare classes, because their samples have been seen less often and are harder to distinguish.
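The effect can be checked numerically with the basic focal loss form FL(pt) = -(1 - pt)^γ · log(pt). This is a sketch of that basic form only: the class-balancing weight α is left out, γ = 2 is just a common choice, and the heatmap loss used in the repository may be a different variant.

```python
import math

def cross_entropy(pt):
    """CE loss as a function of pt, the predicted probability of the true label."""
    return -math.log(pt)

def focal_loss(pt, gamma=2.0):
    """Basic focal loss: CE scaled by (1 - pt) ** gamma."""
    return (1.0 - pt) ** gamma * cross_entropy(pt)

ce_at_06 = cross_entropy(0.6)                       # about 0.51, as read off the plot
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # well-classified: scaled by 0.01
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)   # hard sample: scaled by 0.81
```

An easy sample (pt = 0.9) keeps about 1% of its cross-entropy loss, while a hard one (pt = 0.1) keeps about 81%, so the hard samples dominate training.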

What gather() does: it returns the values at the positions given by the index argument.
Both b.gather(...) and torch.gather(b, ...) can be used. The two parameters that matter are dim and index.

dim=0 means indexing by row: the values in index say which row to take.
dim=1 means indexing by column: the values in index say which column to take.
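A small example of both cases, as a sketch (the tensor b and the index values are arbitrary):

```python
import torch

b = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# dim=0: out[i][j] = b[index[i][j]][j], so each index value picks a row.
row_index = torch.tensor([[0, 1, 0]])
by_row = torch.gather(b, 0, row_index)     # [[1, 5, 3]]

# dim=1: out[i][j] = b[i][index[i][j]], so each index value picks a column.
col_index = torch.tensor([[2], [0]])
by_col = b.gather(1, col_index)            # [[3], [4]]
```

The output always has the same shape as index, not the shape of b.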

Understanding usage of gather :

torch.gather function

[Supplementary knowledge points]

The backbone network ResNet34 [19] is pre-trained on ImageNet.

1. Pre-trained on ImageNet.

Transfer learning: A neural network is trained on data; it extracts information from the data and stores it as weights. These weights can be extracted and transferred to other neural networks. By "transferring" the learned features, we do not need to train a network from scratch. It is used when the training dataset is small, to prevent overfitting. In computer vision, the pre-training is usually done on ImageNet.

Pre-trained model: Simply put, a pre-trained model is a model that others have already built to solve a similar problem. Instead of training a new model from scratch, you start from a model that has been trained on a similar problem. For example, if you want to build a self-driving car, you can spend several years building a high-performance image recognition algorithm from scratch, or you can start from the Inception model (a pre-trained model) that Google trained on the ImageNet dataset to recognize images. A pre-trained model may not be 100% accurate for your application, but it saves a lot of effort.

How to use a pre-trained model: When training a neural network, we hope the network finds suitable weights over many forward and backward passes. With a model pre-trained on a large dataset, we can take its structure and weights directly and apply them to our own problem. This is called "transfer learning": the pre-trained model is "transferred" to the specific problem at hand. ImageNet is widely used as the training set because it is large enough (including 1.2 million images) to train general-purpose models. The ImageNet training objective is to classify every image into one of 1,000 categories, which mostly come from daily life: kinds of cats and dogs, household items, everyday vehicles, and so on. In transfer learning, these pre-trained networks also generalize well to images outside ImageNet. Since the pre-trained model is already well trained, we do not change its weights much; in transfer learning we usually only fine-tune it, with a lower learning rate than in ordinary training.

Ways to fine-tune your model:
1. Feature extraction: use the pre-trained model as a feature extractor. Remove the output layer and apply the rest of the network, with fixed weights, to the new dataset.
2. Use the structure of the pre-trained model: keep the structure, but randomize all the weights first and then train on your own dataset.
3. Train specific layers and freeze the others: another way to use a pre-trained model is to train it partially. Keep the weights of some initial layers unchanged and retrain the later layers to obtain new weights. You can try several splits to find the best balance between frozen and retrained layers.

How to use and train the model depends on the size of the dataset and on how similar the new data is to the old (the pre-training dataset versus the dataset of the problem we want to solve).
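Freezing some layers and training the rest can be sketched as follows. A tiny stand-in network is used instead of a real pre-trained ResNet34 so that the example runs without downloading weights; the names `backbone` and `head` and the learning rate are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())  # stands in for pre-trained layers
head = nn.Linear(16, 2)                                # new task-specific layer
model = nn.Sequential(backbone, head)

# Freeze the backbone: its weights receive no gradients and are not updated.
for p in backbone.parameters():
    p.requires_grad = False

# Give the optimizer only the trainable parameters, with a small learning rate.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

frozen_before = backbone[0].weight.clone()
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
frozen_unchanged = torch.equal(frozen_before, backbone[0].weight)
```

Setting requires_grad back to True for some backbone layers turns this into partial fine-tuning.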

ImageNet pre-training: In image processing, people generally like to use ImageNet for network pre-training, for two main reasons. On the one hand, ImageNet is a dataset with a very large amount of pre-annotated training data in the image field, and that size is a big advantage. On the other hand, its images are general-purpose, so the pre-trained parameters can be used almost anywhere.

So how is pre-training done in the image field? The above figure shows the process. After designing the network structure, which for images is usually a multi-layer stacked CNN, we first pre-train the network on some training set, such as training set A or training set B, learn the network parameters on task A or task B, and save them for later use. Suppose we then face a third task C with the same network structure. The shallower CNN layers can be initialized with the parameters learned on task A or task B, while the higher CNN layers are still randomly initialized. We then train the network on the data of task C. There are two options. In the first, the loaded shallow-layer parameters stay fixed while training on task C; this is called "Frozen". In the second, the shallow-layer parameters are initialized from pre-training but keep changing while training on task C; this is generally called "Fine-Tuning", that is, adjusting the parameters so they better suit task C. This is how pre-training is generally done in the image or video field.

There are several advantages to doing this. First, the training set for task C may be small, while the CNNs in common use, such as Resnet/Densenet/Inception, are very deep, with millions or tens of millions of parameters as a starting point and often hundreds of millions. Such a complex network is hard to train well with little data. If most of its parameters are initialized from pre-training on a large dataset such as ImageNet and then fine-tuned on the small amount of data available for task C, things become much easier, and tasks that could not be trained before become solvable. Second, even when the task at hand has plenty of training data, pre-training greatly speeds up convergence. So this pre-training approach works in nearly every setting and gives good results, which is why it quickly became popular in image processing.

We already know that in a hierarchical CNN, neurons at different levels learn different types of image features, forming a hierarchy of features from bottom to top, as shown in the figure above. Take a face recognition task: after training the network, visualize the features learned by each layer of neurons. The bottom layer learns features such as line segments, the second hidden layer in the figure learns the outlines of facial features, and the third layer learns the outline of the face. These three steps form a feature hierarchy. The lower the level, the more basic and universal the features (edges, corners, lines, arcs), which are common to images in any field; the higher the level, the more the features are tied to the task at hand. Because of this, pre-trained parameters, especially those of the lower layers, are largely independent of the specific task and therefore general, which is why lower-layer pre-trained parameters are generally used to initialize networks for new tasks. High-level features are closely tied to the original task, so they can be left unused, or fine-tuning on the new dataset can wash out the irrelevant high-level feature extractors.

Do you want ImageNet pre-training :

Why pre-trained models are easy to use : First, there is too little domain data. Second, it is difficult to learn. Just like people studying, if you have general knowledge, such as high school Chinese, it will be easier to learn domain knowledge on this basis. If you don't even know how to make basic sentences, it will be a big headache to learn professional knowledge. The pre-training model uses a large amount of common data such as Wikipedia to teach the model basic knowledge. I think this is one of the reasons why the pre-training model chooses Wikipedia and the like as corpus (the easy availability of data is of course the more important reason emm)

Origin blog.csdn.net/sinat_40759442/article/details/126725058