[Jan Erxia | Knowledge Learning | Ing]

heat map

A Gaussian heatmap represents the location of each key point by placing a Gaussian distribution function centered on that key point. The function reaches its maximum at the key point and decreases gradually as the distance from the center increases. This represents the location as a continuous, smooth function, which suits pixel-level key point detection tasks. In such tasks we usually generate one Gaussian heatmap per key point to represent the probability distribution of its location; because the distribution is smooth and continuous, it represents the location information of key points well.

When generating a Gaussian heat map, the usual method is to generate a two-dimensional Gaussian distribution centered at the key point with a standard deviation of $\sigma$, and then superimpose this distribution onto the corresponding position in the image to get the final Gaussian heat map. In practice, a suitable Gaussian kernel size $k$ is usually selected and used to generate a Gaussian kernel $G$ of size $k \times k$. Next, we convolve the Gaussian kernel $G$ with the original heat map $H$ to obtain a smooth heat map $H'$. This can be implemented with a regular convolution operation, for example PyTorch's conv2d function. During implementation, we need to pay attention to the generation and normalization of the Gaussian kernel, as well as boundary handling during convolution.
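The rendering step described above can be sketched as follows. This is a minimal illustration with NumPy rather than any particular project's code; the function name `gaussian_heatmap`, the 64x64 size and the key point (20, 30) are assumptions chosen for the example, and the Gaussian is left non-normalized so that its peak value is 1.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma):
    """Render a non-normalized 2D Gaussian with peak value 1 at center_xy = (x, y)."""
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]        # ys holds row indices, xs column indices
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2   # squared distance of each pixel to the key point
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

heatmap = gaussian_heatmap(64, 64, center_xy=(20, 30), sigma=2.0)
peak_row, peak_col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(peak_col, peak_row)  # the peak sits on the key point
```

A larger `sigma` spreads the peak over more pixels; a smaller one concentrates it.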

||

To generate high-quality Gaussian heatmaps, we usually apply convolution smoothing to the heatmaps. The original heat map often contains some noise and discontinuities, which can hurt the accuracy of key point detection. Convolution smoothing removes this noise and makes the heat map smoother and more continuous, which improves detection accuracy. A Gaussian-smoothed heatmap first convolves and smooths the original image and then applies a Gaussian distribution function to generate the heat map. Unlike the plain Gaussian heatmap, the Gaussian-smoothed heatmap takes the similarity of adjacent pixels into account, which makes the heat map more spatially continuous and suits pixel-level key point detection tasks. In addition, since the Gaussian heat map is obtained by modeling the key point position with a Gaussian distribution, an implementation usually generates a Gaussian kernel and then applies it to each pixel of the image through a convolution operation to obtain the corresponding Gaussian heatmap. In this process, convolution smoothing is part of how the Gaussian kernel is used: a smooth kernel helps produce a high-quality Gaussian heat map.

||

Convolutional smoothing is a commonly used image processing method. Its main idea is to filter the image to remove noise and fine detail, making the image smoother and more continuous. Specifically, convolution smoothing takes a weighted average over the neighborhood of each pixel to obtain the smoothed value of that pixel. The weights come from the convolution kernel, a two-dimensional matrix; for a Gaussian kernel the center has the largest value and the other values decrease gradually with distance from the center. The smoothing itself is the convolution of the kernel with the image. Convolution smoothing effectively reduces noise and detail, which is useful for certain image processing tasks. For example, in key point detection, smoothing the original image gives a cleaner, more continuous feature representation, which helps improve detection accuracy.

Why smoothing can denoise: In image processing and computer vision, smoothing operations reduce noise because they remove the high-frequency content of the image and keep the low-frequency content. In images, high-frequency signals usually represent details and noise, and low-frequency signals usually represent the overall characteristics of the image. Removing the high-frequency part therefore reduces noise and highlights the main features.

In smoothing operations, we usually convolve the image with a smoothing kernel. The kernel averages pixel values within a certain range, which reduces noise and detail and highlights the features of the image. Commonly used smoothing kernels include the mean kernel and the Gaussian kernel. The Gaussian kernel is widely used because of its good smoothing effect and adjustable parameters.

It should be noted that smoothing not only reduces noise in the image but also blurs it. In practice, the kernel size and parameters have to be chosen for the specific situation to balance noise reduction against blurring.
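As a rough sketch of the smoothing described above, the snippet below builds a normalized Gaussian kernel and applies it as a weighted neighborhood average. The helper names `gaussian_kernel` and `smooth`, the zero padding at the border and the synthetic noisy image are assumptions made for illustration; a real pipeline would more likely call a library convolution.

```python
import numpy as np

def gaussian_kernel(k, sigma):
    """Return a k x k Gaussian kernel normalized to sum to 1 (k is assumed odd)."""
    ax = np.arange(k) - (k - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def smooth(image, kernel):
    """Weighted average over each pixel's neighborhood, zero-padded at the border.

    The kernel is symmetric, so this sliding-window sum equals a convolution.
    """
    k = kernel.shape[0]
    pad = k // 2
    h, w = image.shape
    padded = np.pad(image, pad, mode="constant")
    out = np.zeros_like(image, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

rng = np.random.default_rng(0)
noisy = np.ones((32, 32)) + rng.normal(0.0, 0.5, size=(32, 32))  # flat image plus noise
smoothed = smooth(noisy, gaussian_kernel(5, sigma=1.0))
print(noisy.std(), smoothed[4:-4, 4:-4].std())  # the interior varies less after smoothing
```

Increasing `k` or `sigma` removes more noise but also blurs more, which is the trade-off mentioned above.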

2D keypoint coordinates are extracted from the predicted heatmap. Following [2], the authors use a differentiable local soft-argmax to reduce the error caused by heat map quantization. Specifically, for each keypoint, they first convert its heatmap into a probability distribution using the softmax function. They then treat the grid coordinates and the corresponding probability values as pixels in a 2D image and use bilinear interpolation to estimate the location of the local maximum. Finally, these locations are decoded into the predicted 2D keypoint coordinates. This method is smoother than the traditional argmax and reduces errors caused by discretization.

||

argmax is a mathematical function that finds the argument at which a function attains its maximum within a given domain. For a sequence or vector, argmax returns the index of the element with the largest value. For example, argmax([3, 5, 1, 7]) returns 3 because 7 has index 3 in the array.

The softmax function is also a mathematical function. It compresses a K-dimensional vector (K is any positive integer) into another K-dimensional vector such that each element is between 0 and 1 and all elements sum to 1. The softmax function is often used to convert the output of a model into a probability distribution for training and prediction in classification tasks.

The argmax function is usually used to obtain the position of the maximum value in a sequence or vector, while the softmax function is often used to convert a vector into a probability distribution. In some cases, argmax is applied to the output of softmax to obtain the final classification result.
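The two functions can be written in a few lines of plain Python. This is only a sketch for illustration; the names `argmax` and `softmax` here are local helpers, not a library API.

```python
import math

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

def softmax(values):
    """Map a vector to positive numbers that sum to 1."""
    shift = max(values)                        # subtracting the max avoids overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3, 5, 1, 7]
probs = softmax(scores)
print(argmax(scores))  # 3, because 7 is at index 3
print(argmax(probs))   # softmax keeps the ordering, so the index is unchanged
```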

||

When we get a heatmap, we need to convert it into the corresponding coordinate point (keypoint). This can be done with the argmax function, that is, the position with the largest value in the heatmap is taken as the coordinate point.

However, there is a problem with this method, that is, because the pixel coordinates are discrete, the coordinates obtained by argmax may have errors. To solve this problem, you can use the soft-argmax function to convert discrete pixel coordinates into continuous coordinate values. The soft-argmax function can be obtained by multiplying the coordinate value of each pixel by the probability value of the corresponding pixel, and then summing the weighted coordinates of all pixels. This method can reduce the error caused by discretization.
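The weighted sum described above can be sketched with NumPy as follows. The function name `soft_argmax_2d` and the temperature `beta` are assumptions for this example, and the input is treated as a map of logits (here a log-Gaussian centered at the sub-pixel position (20.3, 30.6)), which may differ from how a specific model scales its heatmaps.

```python
import numpy as np

def soft_argmax_2d(logits, beta=1.0):
    """Expected (x, y) coordinate under softmax(beta * logits)."""
    h, w = logits.shape
    flat = beta * logits.reshape(-1)
    flat = flat - flat.max()                   # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum()
    probs = probs.reshape(h, w)
    ys, xs = np.mgrid[0:h, 0:w]
    # Probability-weighted sum of pixel coordinates.
    return float((probs * xs).sum()), float((probs * ys).sum())

ys, xs = np.mgrid[0:64, 0:64]
logits = -((xs - 20.3) ** 2 + (ys - 30.6) ** 2) / (2.0 * 2.0 ** 2)

hard_row, hard_col = np.unravel_index(np.argmax(logits), logits.shape)
soft_x, soft_y = soft_argmax_2d(logits)
print((hard_col, hard_row))   # integer pixel nearest to the true position
print((soft_x, soft_y))       # continuous estimate
```

The hard argmax can only return an integer pixel, while the weighted sum recovers a position between pixels.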

Local soft-argmax is an improvement based on soft-argmax. It restricts the pixel coordinates to a local area, which keeps the estimate from being overly sensitive to distant pixels and further reduces errors. This method usually requires setting a convolution kernel size and stride to determine the scope of the local area. Local soft-argmax is a method for extracting coordinate information from Gaussian heatmaps: each feature point is represented as the peak of a Gaussian distribution on the heat map, so the coordinates of each point can be regarded as the location of the peak. The traditional argmax operation leads to quantization errors because it only returns the position of the maximum value and ignores subtle changes around the peak.

To solve this problem, Local soft-argmax uses a differentiable method to interpolate the heat map to obtain more accurate coordinates. Specifically, it treats each peak as the center of a local coordinate system and performs bilinear interpolation on the pixels around each peak to obtain the sub-pixel level coordinates of the peak. Therefore, this method can effectively reduce quantization errors and improve the accuracy of coordinate prediction.
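One simple way to make soft-argmax local is to restrict the weighted sum to a window around the hard argmax. The sketch below does exactly that; it is an assumption-laden illustration (the name `local_soft_argmax_2d`, the `radius` parameter and the window-based formulation are all hypothetical) and is not necessarily the interpolation scheme used in [2].

```python
import numpy as np

def local_soft_argmax_2d(logits, radius=3):
    """Soft-argmax restricted to a (2*radius+1)^2 window around the peak."""
    h, w = logits.shape
    row, col = np.unravel_index(np.argmax(logits), logits.shape)
    r0, r1 = max(row - radius, 0), min(row + radius + 1, h)
    c0, c1 = max(col - radius, 0), min(col + radius + 1, w)
    window = logits[r0:r1, c0:c1]
    probs = np.exp(window - window.max())      # softmax over the window only
    probs /= probs.sum()
    ys, xs = np.mgrid[r0:r1, c0:c1]            # absolute pixel coordinates of the window
    return float((probs * xs).sum()), float((probs * ys).sum())

ys, xs = np.mgrid[0:64, 0:64]
logits = -((xs - 20.3) ** 2 + (ys - 30.6) ** 2) / (2.0 * 2.0 ** 2)
local_x, local_y = local_soft_argmax_2d(logits, radius=3)
print(local_x, local_y)  # sub-pixel estimate near (20.3, 30.6)
```

Because only the window contributes, pixels far from the peak cannot pull the estimate away, at the cost of a small bias when the window cuts the peak asymmetrically.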

||

Heatmap quantization error: In a heatmap, the value of each pixel represents the probability that the keypoint lies at that pixel. Therefore, when converting the heat map into keypoint coordinates, we need to find the most likely location, that is, the pixel with the maximum value. This process is called quantization because the maximum can only be expressed in discrete pixel coordinates. If the resolution is low, this leads to quantization errors: the predicted keypoint positions may deviate from their true positions, which degrades the performance of the model. Adopting a differentiable local soft-argmax method can reduce this error.

Discretized pixel coordinates: Pixel coordinates are the coordinates of each pixel in the image. They are expressed as integer values rather than continuous real values. For example, in a 1000x1000 image, the coordinates of each pixel are integers ranging from (0,0) to (999,999). Discrete here means that the pixel coordinates are finite and discrete rather than continuous real values. Pixel coordinates are therefore a typical discretized representation of a continuous function.

Discretization is needed because the data a computer processes, including pixel coordinates and pixel values, exists in discrete form. Therefore, computer vision tasks have to convert continuous data into discrete data. For example, in human pose estimation, key point coordinates are converted from continuous values into discrete Gaussian heat maps. Discretization not only matches the way computers process data, but also reduces computation and storage, allowing faster processing.

||

sargmax (Scaled Argmax) is an improvement on argmax that can be applied to various computer vision tasks, such as semantic segmentation and human pose estimation. With argmax we get a discrete point, while with sargmax we get a weighted average over the area around that discrete point. The weights are determined by the distance from the discrete point to its surrounding pixels.

Soft-argmax is a variant of sargmax that avoids discrete sampling by replacing the discrete argmax operation with a differentiable softmax operation. Compared with sargmax, soft-argmax responds more smoothly and continuously to approximately continuous functions.

Local soft-argmax is an improvement proposed for semantic segmentation tasks. In traditional soft-argmax all pixels take part in the weighting, while in local soft-argmax only pixels close to a specific pixel are weighted. This weighting improves the model's ability to perceive local details of the image.

The sargmax_coord attribute is the coordinate of each key point on the predicted heat map. It is obtained from the predicted heat map and represents the position of each key point in the image. In this code snippet, the attribute is copied into manual_pred so that it can be used to determine the location of the marker during manual annotation.

the non-normalized Gaussian kernel

The parameter σ controls the spread of the peak.

Heatmap key point detection algorithm_CSDN blog reading notes

Generally speaking, we can divide pose estimation tasks into two schools: Heatmap-based and Regression-based.

The main difference lies in the supervision signal. The Heatmap-based method supervises the model to learn a Gaussian probability distribution map: each point in the ground truth is rendered into a Gaussian heat map, the network outputs K feature maps for K key points, and the maximum point of each map is then taken as the estimate through argmax or soft-argmax. Since this method has to render a Gaussian heat map, and the maximum point in the heat map directly corresponds to the result, it must keep a relatively high-resolution heat map (commonly 64x64; if it is smaller, the lower bound on the error becomes too large and accuracy drops sharply). This naturally leads to a large amount of computation and memory overhead.

The Regression-based method is simple and blunt. It directly supervises the model to learn coordinate values and computes an L1 or L2 loss on them. Since there is no need to render a Gaussian heat map or maintain high resolution, the feature map output by the network can be very small (such as 14x14 or even 7x7). Taking ResNet-50 as an example, the FLOPs are roughly 1/20,000 of those of the Heatmap-based method. This is quite friendly to devices with weak computing power (such as mobile phones), so in actual projects this method is used more often.

The advantages of the Regression-based method are straightforward and can be summarized in the following three points:

1. No high-resolution heat map is needed, so computation and memory overhead both drop sharply.
2. The output is continuous, so there is no quantization error to worry about. (In the Heatmap-based method, the position of the maximum in the output heat map determines the corresponding point in the original image; the lower the heat map resolution, the less accurate that point becomes after it is scaled back up. The Regression-based output is a number that can carry many digits after the decimal point, so its accuracy is not affected by scaling.)
3. High scalability. Whether one-stage or two-stage, image-based or video-based, 2D or 3D, the Regression-based method can be used. It has been used to put 2D and 3D data together for joint training. The Heatmap-based method cannot do this because its output is highly customized: 2D output must render a 2D Gaussian heat map and 3D output must render a 3D Gaussian heat map, and the computation and memory overhead skyrocket accordingly.

The Heatmap-based method lets the model learn the target distribution of the output by explicitly rendering a Gaussian heat map. It can also be seen as the model simply learning a filter that turns the input image into the desired Gaussian heat map. This greatly reduces the learning difficulty and fits the nature of convolutional networks well (convolution itself can be regarded as a kind of filtering). Because this method prescribes the distribution to be learned, it is much more robust to various situations (occlusion, motion blur, truncation, etc.) than the Regression-based method, which is a black box for everything except the result.

Because of these advantages, Heatmap-based methods dominate the field of pose estimation, and SOTA solutions are built on them. This has also led to a gap between academic research and deployment: the metrics on various data sets and competitions keep soaring, but when a project has to ship, engineers can only worry, because the method is slow and memory-hungry and cannot be used in real projects.
-------------------------------------------------- -----Dividing line------------------------------------------- ----------

(2) Predict a Gaussian heat map and then use argmax to find the index of the peak, which is the coordinate point, as in CornerNet, Grid R-CNN, CPN, etc. Taking single-person pose estimation as an example, the input is a picture containing only one person, the output is the Gaussian heat maps of all key points, and the label is a Gaussian map generated from each key point. If 17 key points are regressed per person, the predicted output feature map has shape (batch,h_o,w_o,17), that is, each channel is a heat map predicting one joint, and argmax is applied to each channel to obtain integer coordinates.
The output based on a Gaussian heat map is more accurate than directly regressing the coordinate point. The reason is not that the Gaussian heat map output is inherently better, but that its output feature map is larger and its spatial generalization ability is stronger. This also explains why, if we still use (1) direct coordinate regression but replace the fully connected layer with a fully convolutional one, the accuracy is still lower than with the Gaussian heat map: even with fully convolutional output, the output feature maps of YOLOv2, SSD, etc. are very small, so their spatial generalization ability is not as good as that of method (2). From a numerical point of view, directly regressing the coordinate point is definitely better, because the output is a floating point number and loses no precision, while the peak position from a Gaussian heat map must be an integer, which implies a theoretical lower bound on the error. Assume that the input image is 512x512 and the output is downsampled by 4 times to 128x128. If a key point is at (507, 507), then after downsampling by 4 times (507 / 4 = 126.75, rounded down to 126), even if the Gaussian heat map is recovered without any error, there is still an error of 507 - 126*4 = 3 pixels; this 3 is the theoretical lower bound on the error. If the downsampling factor increases, this lower bound increases as well. Therefore, most current practices trade off speed against accuracy and adopt 4-fold downsampling.
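The arithmetic in the 512x512 example can be checked with a few lines of Python. The helper name `quantization_error` is hypothetical, and the sketch assumes the heat map index is obtained by rounding down, as in the example above.

```python
def quantization_error(coord, stride):
    """Pixel error left after mapping coord to the heat map grid and back."""
    heatmap_index = coord // stride      # position on the downsampled heat map
    restored = heatmap_index * stride    # mapped back to input resolution
    return coord - restored

print(quantization_error(507, 4))                           # 3
print(max(quantization_error(c, 4) for c in range(512)))    # 3
print(max(quantization_error(c, 8) for c in range(512)))    # 7
```

With this rounding rule the worst case is always `stride - 1` pixels, which is why the bound grows with the downsampling factor.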

The advantage of this type of approach is that the accuracy is usually higher than that of direct coordinate regression with a fully connected layer. The disadvantage is also obvious: the path from the input to the coordinate output is not fully differentiable, because the step from the heatmap to the coordinate point is done offline through argmax (since argmax is not differentiable, some papers use soft-argmax instead). And because the required output feature map is very large, training and forward passes are slow and memory consumption is high.

In the process of turning a heatmap into coordinates, the disadvantages are: (1) operations such as argmax are not differentiable and cannot be learned end to end; (2) there is a quantization error in going from heatmap to coordinates. The larger the downsampling factor between the input resolution and the heatmap, the larger the quantization error. What is more noteworthy is that the supervision is applied to the heatmap, so the loss function is decoupled from our metric (which is measured on coordinates). During inference, only a few pixels are used to compute the numerical coordinates, but during training, the loss is incurred on all pixels.

To sum up, although the prediction accuracy of the Gaussian heat map is usually higher than that of regression methods, it has several troublesome problems: (1) the output map is very large, resulting in high memory usage and slow inference and training; (2) there is a theoretical lower bound on the error; (3) the MSE loss may bias the learned results; (4) it is not a fully differentiable model.

Pooling

Ordinary pooling (Max Pooling or Average Pooling) is usually performed independently on each feature map. It divides each feature map into several non-overlapping sub-regions and takes the maximum (or average) value in each sub-region as the output of that sub-region. Specifically, for an input of C×W×H, Max Pooling downsamples each channel separately: it divides W and H into several non-overlapping sub-regions and takes the maximum value in each sub-region as its output. Average Pooling does the same but takes the average value in each sub-region. The shape of the final output feature map is C'×W'×H', where C'=C (the number of channels is unchanged), and W' and H' are determined by the input W and H and by the size and stride of the sub-regions.

Ordinary pooling is an operation commonly used in convolutional neural networks. It divides the input feature map into several areas according to the chosen window size, then pools the values in each area, for example by taking the average or the maximum, to obtain one pooled value per area. Pooling usually compresses the data, reduces the amount of computation and improves the robustness of the model.

The global pooling operation is different from ordinary pooling. It does not divide the feature map into multiple areas but pools the entire feature map. Global average pooling averages the values in each channel of the feature map to obtain a single value per channel, while global max pooling takes the maximum of the values in each channel to obtain a single value per channel. This pooling operation captures the global information of the entire feature map and is often used in feature extraction and classification tasks.

For an input feature map, its dimensions are [C, W, H], where C represents the number of channels, W represents the width of the feature map, and H represents the height of the feature map. In global average pooling, the feature maps in each channel are averaged to obtain a single value in a channel, so the output feature map dimension is [C, 1, 1]. That is, for each channel, all its elements will be affected by the average pooling operation, resulting in a single value within a channel.

In global max pooling, we take the maximum of all pixel values in each channel over the entire feature map to obtain a single value per channel. For example, if the size of the input feature map is C×W×H, where C is the number of channels and W and H are the width and height respectively, then after global max pooling each channel becomes a vector of length 1.

In global average pooling, we average all pixel values of each channel over the entire feature map to obtain a single value per channel. Similar to global max pooling, each channel becomes a vector of length 1.

These two pooling methods are usually used to encode the global information of the entire feature map into a vector in order to capture the characteristics of the image globally.

Both ordinary pooling and global pooling reduce the dimensionality of the input feature map to produce an output feature map, and both pool each channel separately. The difference is the pooling area. The spatial size of the output of ordinary pooling is smaller than that of the input, while the output of global pooling has only one element per channel (a spatial size of 1×1). Ordinary pooling takes the average or maximum over a local area, while global pooling takes it over the entire feature map. Therefore, global pooling aggregates the information of the whole feature map into a single value, while ordinary pooling produces a smaller feature map.
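The difference in output shape can be seen in a small NumPy sketch. The helper names `max_pool2d` and `global_avg_pool2d` are local to this example, the window is assumed to tile the feature map exactly, and the array layout is (C, H, W) rather than the C×W×H order used in the text.

```python
import numpy as np

def max_pool2d(x, size):
    """Non-overlapping max pooling on a (C, H, W) array; H and W must be divisible by size."""
    c, h, w = x.shape
    blocks = x.reshape(c, h // size, size, w // size, size)
    return blocks.max(axis=(2, 4))              # maximum inside each size x size block

def global_avg_pool2d(x):
    """Average each channel over its whole feature map, keeping a 1 x 1 spatial size."""
    return x.mean(axis=(1, 2), keepdims=True)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
pooled = max_pool2d(x, 2)
global_pooled = global_avg_pool2d(x)
print(pooled.shape)          # (2, 2, 2): smaller feature map, same channel count
print(global_pooled.shape)   # (2, 1, 1): one value per channel
```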

The output feature map of a convolutional neural network is smaller than the input image. This design not only reduces computational costs; the reduced output resolution also alleviates the imbalance between positive and negative points.

To put it simply, when a convolutional neural network extracts features, it usually reduces the size of the feature map step by step through multiple convolution and pooling operations, extracting more abstract and higher-level features along the way. Since these operations reduce the resolution, the output feature map is usually smaller than the input image.

This design saves computation because, as the output feature map shrinks, the number of feature points to compute shrinks as well. In addition, at a lower output resolution the difference in number between positive and negative points is smaller, which alleviates the resulting imbalance and improves the stability and performance of the model.

Reducing the output resolution does alleviate the imbalance between the number of positive and negative samples, but the impact is usually limited. Specifically, if after multiple convolution and pooling operations the output feature map is reduced from the original H × W to h × w, then each pixel of the output corresponds to a local area of size k × k in the original image, where k is usually an odd number. Each pixel on the output feature map can therefore be regarded as the result of convolution and pooling operations on a small area of the original image.

In classification tasks, the number of positive and negative samples is usually relatively balanced, so the number of positive and negative samples corresponding to each pixel on the output feature map will also be relatively balanced. However, in some special cases, such as in tasks such as target detection and semantic segmentation, the imbalance problem between the number of positive and negative samples may be serious. In this case, reducing the output resolution can indeed alleviate this problem to a certain extent.

Understand entropy (Entropy) and cross-entropy (Cross-Entropy) in one article

Experimental indicators

In the medical field, reliable performance verification is crucial. In general, a model gets better results with more training data, and a training set that is too small may cause the model to overfit, that is, to perform well on the training set but poorly on the test set. In the medical field, because data acquisition and labeling are complex and expensive, there is usually less data available for training. To obtain a reliable performance evaluation, the authors may use a small training set and a large test set. This allows a better assessment of the model's generalization ability, i.e., how well the model performs on unseen data.

|| 

Mean radial error (MRE) is a widely used keypoint detection performance metric. For an image containing K keypoints, MRE is defined as the average of the Euclidean distances between all keypoints and their corresponding estimated positions. It is computed as $\mathrm{MRE} = \frac{1}{K} \sum_{i=1}^{K} \lVert p_i - \hat{p}_i \rVert$, where $p_i$ is the true position of the i-th key point and $\hat{p}_i$ is the corresponding estimated position.

MRE is suitable for keypoint detection tasks because it considers the prediction error of all keypoints, not just the error of a single keypoint. In addition, MRE can well express the average accuracy of keypoint prediction results because it is the average of all keypoint errors.

It should be noted that MRE does not consider the direction of the prediction error, so in some cases the MRE may be the same but the actual key point prediction results are different. Therefore, when using MRE as an evaluation indicator, it should be considered in conjunction with specific task requirements and other evaluation indicators.
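A minimal NumPy sketch of the metric, with made-up coordinates, shows both the computation and the point about direction: the two predictions below err in different directions yet receive the same MRE. The function name `mean_radial_error` and the sample points are assumptions for illustration.

```python
import numpy as np

def mean_radial_error(true_points, pred_points):
    """Average Euclidean distance between matching true and predicted key points."""
    true_points = np.asarray(true_points, dtype=float)
    pred_points = np.asarray(pred_points, dtype=float)
    distances = np.linalg.norm(true_points - pred_points, axis=1)
    return float(distances.mean())

truth = [(10, 10), (20, 20)]
pred_a = [(13, 14), (20, 20)]   # first point is off by (3, 4), a distance of 5
pred_b = [(10, 5), (20, 20)]    # first point is off by (0, -5), also a distance of 5

print(mean_radial_error(truth, pred_a))  # 2.5
print(mean_radial_error(truth, pred_b))  # 2.5
```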

||

BRS (Backpropagating Refinement Scheme), f-BRS (feature BRS) and RITM (Reviving Iterative Training with Mask Guidance) are interactive image segmentation methods rather than evaluation indicators; they typically appear as baselines that a click-based model is compared against.

Specifically, both BRS and f-BRS refine the prediction at inference time by backpropagation so that it agrees with the user's clicks. BRS optimizes the input interaction maps, while f-BRS optimizes a small set of auxiliary parameters on intermediate features, which makes it cheaper to run.

RITM is a feed-forward, click-based model that feeds the mask from the previous step back in as an additional input and is trained iteratively, so it needs no optimization at inference time.

experiment procedure

self.model.train() puts the model in training mode, with Dropout and BatchNormalization enabled:

In PyTorch, calling the train() method sets the model to training mode. In this mode Dropout is active and the BatchNormalization layers use batch statistics and update their running estimates. Conversely, calling the eval() method sets the model to evaluation mode, in which Dropout is disabled and the BatchNormalization layers use their stored running statistics.

Next comes the part that performs model validation. During validation, the training behavior of the Dropout and BatchNormalization layers needs to be turned off, so self.model.eval() is used to switch the model into evaluation mode.

Then, the model is run forward on the validation set to obtain the output out and batch. During validation there is no need for backpropagation or parameter updates, so gradient calculation can be turned off with torch.no_grad() to improve efficiency.

Accumulate the validation loss of each batch into val_loss. Then use self.metric_manager.average_running_metric() to compute the average metric on the current validation set and store it in val_metric.

Finally, optimizer.scheduler_step() updates the learning rate based on the current validation metric. Specifically, this function adjusts the optimizer's learning rate based on the value of the decision metric, which is selected by the Train.decision_metric parameter in the configuration file (save_manager.config.Train.decision_metric).
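The train/validate/schedule sequence described above can be sketched in plain PyTorch as follows. This is not the project's code: the tiny model, the random data and the use of `ReduceLROnPlateau` are assumptions standing in for `self.model`, the real data loaders and the project's `optimizer.scheduler_step()` wrapper, whose internals are not shown in the text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the project's model, optimizer and data.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=0)
loss_fn = nn.MSELoss()
train_x, train_y = torch.randn(16, 4), torch.randn(16, 1)
val_batches = [(torch.randn(16, 4), torch.randn(16, 1)) for _ in range(3)]

# Training step: Dropout is active in train mode.
model.train()
optimizer.zero_grad()
loss_fn(model(train_x), train_y).backward()
optimizer.step()

# Validation: evaluation mode, no gradient tracking.
model.eval()
val_loss = 0.0
with torch.no_grad():
    for inputs, targets in val_batches:
        out = model(inputs)
        val_loss += loss_fn(out, targets).item()   # accumulate per-batch loss
val_metric = val_loss / len(val_batches)           # average over the validation set

# Let the scheduler decide whether to lower the learning rate.
scheduler.step(val_metric)
print(val_metric, optimizer.param_groups[0]["lr"])
```

Here the validation loss itself plays the role of the decision metric; a project may monitor a different metric, such as MRE, in the same way.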

Complete summary of Pytorch optimizers (1) SGD, ASGD, Rprop, Adagrad

Complete summary of Pytorch optimizers (2) Adadelta, RMSprop, Adam, Adamax, AdamW, NAdam, SparseAdam

Complete summary of Pytorch optimizer (3) Newton's method, BFGS, L-BFGS including code

Complete summary of Pytorch optimizers (4) Performance comparison of commonly used optimizers including code


Origin blog.csdn.net/sinat_40759442/article/details/129911248