Common concepts of object detection

 

Common concepts of object detection: https://zhuanlan.zhihu.com/p/290833705?utm_source=wechat_session

1. IoU: Intersection over Union, which measures the degree of overlap between the Detection Result produced by the model and the human-annotated Ground Truth

2. Formula: IoU = area(DR ∩ GT) / area(DR ∪ GT)
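A minimal Python sketch of this formula, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Intersection area (zero if the boxes do not overlap).
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: a Detection Result against a Ground Truth box.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```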

3. GT: Ground Truth, the human-annotated reference answer

4. DR: Detection Result

5. Bounding Box: the rectangle that encloses a detected object

6. NMS: Non-Maximum Suppression. Starting from the highest-scoring bounding box, it removes lower-scoring boxes that overlap it with a high IoU, eliminating duplicate detections of the same object.
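A greedy NMS sketch reusing the iou() helper above; the 0.5 IoU threshold is a common but arbitrary choice:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    # Visit boxes from highest to lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much:
        # those are duplicate detections of the same object.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```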

7. Bounding box regression: learning offsets that shift and scale a proposal box so that it more closely matches the ground-truth box.
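One concrete version of these offsets is the parameterization used in the R-CNN family; a sketch, where the (center x, center y, width, height) box format and the function name are illustrative:

```python
import math

def bbox_regression_targets(proposal, gt):
    """R-CNN-style regression targets from a proposal (cx, cy, w, h) to a GT box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw     # horizontal shift, normalized by proposal width
    ty = (gy - py) / ph     # vertical shift, normalized by proposal height
    tw = math.log(gw / pw)  # log-scale width change
    th = math.log(gh / ph)  # log-scale height change
    return tx, ty, tw, th
```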

8. mAP (mean Average Precision)

(1) mAP: mean Average Precision (mean: average; precision: accuracy). It is the average of the AP over all classes.

(2) PR curve: Precision-Recall curve

(3) Precision = TP / (TP + FP): the proportion of the predicted results that are correct

(4) Recall = TP / (TP + FN): the proportion of the ground-truth objects that are correctly detected

(5) In detection, each predicted box is judged by its IoU with the matched Ground Truth: boxes above the threshold (commonly 0.5) count as TP, boxes below it count as FP, and Ground Truths that no box matches count as FN. The four cases are spelled out below, followed by a code sketch.

(6) TP (True Positives): positive samples correctly classified as positive; here, the number of detection boxes with IoU >= 0.5 against a Ground Truth

(7) TN (True Negatives): negative samples correctly classified as negative (rarely used in detection, since the number of true negative boxes is unbounded)

(8) FP (False Positives): negative samples mistakenly classified as positive; here, the number of detection boxes with IoU < 0.5 (or redundant boxes on an already-matched Ground Truth)

(9) FN (False Negatives): positive samples mistakenly classified as negative; here, the number of Ground Truths that are not detected
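Putting (3) through (9) together, a sketch (reusing iou() from above) that greedily matches detections to ground truths at an IoU threshold of 0.5 and computes precision and recall; real benchmark toolkits use more careful matching, so this is only illustrative:

```python
def precision_recall(detections, ground_truths, iou_thresh=0.5):
    """Greedy matching of detections to GTs; returns (precision, recall).

    detections: list of (x1, y1, x2, y2), assumed sorted by confidence.
    ground_truths: list of (x1, y1, x2, y2).
    """
    matched = set()
    tp = 0
    for det in detections:
        # Find the best still-unmatched ground truth for this detection.
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(ground_truths):
            if g not in matched and iou(det, gt) > best_iou:
                best_iou, best_gt = iou(det, gt), g
        if best_iou >= iou_thresh:
            tp += 1               # correct detection
            matched.add(best_gt)  # each GT may be matched only once
    fp = len(detections) - tp     # unmatched detections
    fn = len(ground_truths) - tp  # undetected GTs
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall
```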

(10) FPS: frames per second, i.e. the number of images processed per second (equivalently, the time needed per image); only meaningful as a comparison under the same hardware conditions.

(11) FLOPs (forward-pass computation): floating-point operations, a count of how much computation a model's forward pass requires; it measures model complexity. Not to be confused with FLOPS, floating-point operations per second, which measures hardware speed.
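For intuition, a sketch of one common way to count a convolutional layer's forward-pass FLOPs; conventions differ on whether a multiply-add counts as one or two operations, and two is assumed here:

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulate-based FLOPs of a k x k convolutional layer."""
    # Each output value needs c_in * k * k multiplies and as many adds.
    macs = h_out * w_out * c_out * c_in * k * k
    return 2 * macs  # count a multiply-add as 2 floating point operations

# Example: 224x224 output, 3 input channels, 64 kernels of size 3x3.
print(conv_flops(224, 224, 3, 64, 3))  # ≈ 1.7e8 FLOPs
```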

(12) Feature map (output features): in a CNN, a feature map is what a convolution kernel produces as it slides over the input; convolving the original image with different kernels yields different feature maps. This can be understood as analyzing the picture from multiple angles: different kernels (kernel convolved with image) extract different features.

https://blog.csdn.net/nanhuaibeian/article/details/100528584?utm_medium=distribute.pc_relevant.none-task-blog-baidujs_baidulandingword-1&spm=1001.2101.3001.4242#commentBox

Concept: in each convolutional layer of a CNN (convolutional neural network), the data exists in three-dimensional form: many two-dimensional maps stacked together (like tofu skins stacked into a block of tofu), each of which is called a feature map.

Construction: at the input layer, a grayscale image has a single feature map, while a color (RGB) image generally has three feature maps (red, green, and blue). In the figure referenced below, the three parts are the input RGB image, the convolution kernels (also called filters), and the convolution result (output), with * denoting the convolution operation; the three leftmost slices are the three input feature maps. For a grayscale (two-dimensional) image there is only one slice on the left, a single feature map, and the corresponding kernel (filter) is also two-dimensional. In the other layers, there are several convolution kernels (kernels, also called filters) between adjacent layers; each kernel is convolved with every feature map of the previous layer to generate one feature map of the next layer, so N kernels produce N feature maps in the next layer.

Function: in a convolutional neural network, we hope to use the network to mimic the characteristics of the visual pathway. The idea of layering is to build neurons from simple to complex, bottom up. We hope to construct a set of bases that together form a complete description of a thing; for example, when describing a person, we describe height, weight, appearance, and so on. The same holds in convolutional networks: within one layer, we want descriptions of the picture from multiple angles. Concretely, we convolve the image with several different kernels and take the responses on the different kernels (descriptions) as the image features. Their connection is that they form descriptions of the image at the same level over different bases, as the sketch below illustrates.
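A sketch of this "multiple descriptions" idea: convolving one image with two different hand-picked kernels (simple horizontal and vertical edge detectors, chosen only for illustration) produces two different feature maps:

```python
import numpy as np

def conv2d(image, kernel):
    """Plain 'valid' 2-D convolution (really cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)  # a toy grayscale image
horizontal_edges = conv2d(image, np.array([[1, 1], [-1, -1]]))
vertical_edges = conv2d(image, np.array([[1, -1], [1, -1]]))
# Same image, two kernels -> two feature maps describing it from two "angles".
print(horizontal_edges.shape, vertical_edges.shape)  # (7, 7) (7, 7)
```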

[Figure: the input RGB image, the convolution kernels (filters), and the resulting output feature maps]

(13) Definition of convolution kernel

https://blog.csdn.net/qq_37764129/article/details/86220354?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.control&dist_request_id=&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.control#commentBox

A convolution kernel is also called a filter or kernel. A convolution kernel transforms a sub-node matrix of the current layer into a unit node matrix of the next layer of the neural network, where a unit node matrix is a node matrix whose height and width are both 1 but whose depth (length) is unrestricted.
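A NumPy sketch of that transformation, with all shapes made up for illustration: each kernel collapses a whole patch to a single number, and N kernels stack those numbers into a unit node matrix of depth N:

```python
import numpy as np

patch = np.random.rand(3, 3, 3)        # a 3x3 patch of a 3-channel feature map
kernels = np.random.rand(16, 3, 3, 3)  # 16 kernels, each matching the patch shape

# Each kernel reduces the whole patch to a single number (a dot product).
unit_node = np.array([np.sum(patch * k) for k in kernels])
print(unit_node.shape)  # (16,): height and width 1, depth 16
```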

(14) Input data


The input image size is 64*64*3. It is converted into a vector (a vector can be understood as a 1*n or n*1 array; the former is a row vector, the latter a column vector), so the total dimension of this vector is 64*64*3 = 12288: to the computer, this image is 12288 numbers. In the field of artificial intelligence, every value fed into a neural network is called a feature. This input image has 12288 features, and the 12288-dimensional vector is also called a feature vector. The neural network receives this feature vector as input and makes predictions.
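The same arithmetic as a quick check, with a random array standing in for a real image:

```python
import numpy as np

image = np.random.rand(64, 64, 3)   # stands in for a 64*64*3 input image
feature_vector = image.reshape(-1)  # flatten all pixels into one vector
print(feature_vector.shape)         # (12288,) = 64 * 64 * 3 features
```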

Observe the referenced figure: each convolution kernel (filter-1, filter-2, filter-3 ...) in step 1 actually contains a set of three weight matrices (Wt-R, Wt-G, Wt-B), which process the red (R), green (G), and blue (B) channels of the input image. During forward propagation, the R, G, and B pixel values of the image are multiplied by the Wt-R, Wt-G, and Wt-B weight matrices respectively, generating three intermediate activation maps; the outputs of the three weight matrices (the three intermediate activation maps) are then added to produce one activation map for each filter.

Subsequently, each of these activation maps is passed through the ReLU activation function for nonlinearity, and finally through the max pooling layer, which mainly reduces the spatial dimensions of the activation maps. What we get in the end is a set of activation maps processed by the activation function and the pooling layer, with the signal distributed across a set of 32 (the number of kernels) two-dimensional tensors: 32 feature maps, one per kernel.
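A NumPy sketch of this whole step, reusing conv2d() from the earlier sketch; the 32-filter count comes from the text, the other sizes are made up:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # zero out negative responses

def max_pool(x, size=2):
    """Max pooling: keep the largest value in each non-overlapping window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(32, 32, 3)      # toy RGB input
filters = np.random.rand(32, 3, 3, 3)  # 32 filters, each holding 3 weight matrices

feature_maps = []
for f in filters:
    # Convolve each color channel with its own weight matrix, then sum the
    # three intermediate activation maps into one activation map per filter.
    act = sum(conv2d(image[:, :, c], f[c]) for c in range(3))
    feature_maps.append(max_pool(relu(act)))
print(len(feature_maps), feature_maps[0].shape)  # 32 maps of shape (15, 15)
```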

The output from the convolutional layer is often used as the input for subsequent convolutional layers.

(15) Convolutional layer: a convolutional neural network (CNN) has five kinds of layers in total: input layer, convolutional layer, activation layer, pooling layer, and fully connected layer.

(16) Receptive field and local connection: each value on a feature map corresponds to only a small region of the original image; this local region on the original image is called its receptive field. The idea of local connection is inspired by the visual system in biology, where neurons in the visual cortex receive information locally.

(17) Pooling: compressing the amount of data while preserving features. Pooling is also called subsampling: one pixel replaces several adjacent pixels of the original map, shrinking the feature map while retaining its features.

The role of pooling:

Prevents data explosion, saving computation and computation time.

By collapsing large regions into small ones, helps prevent over-fitting and over-learning: the "high scores, low ability" model that dominates in the examination room (the training set) but cannot get by in society (the test set).
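A concrete instance of one pixel replacing several neighbors, reusing max_pool() from the sketch above:

```python
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])
print(max_pool(x))  # [[4 2]
                    #  [2 8]]  (each 2x2 block collapsed to its maximum)
```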

(18) Stochastic gradient descent: while driving down the loss function, proceed step by step, optimizing on a single input sample at a time instead of computing over all samples before updating. An individual sample introduces bias, but as the number of samples grows, the loss can still gradually approach its minimum. The future is bright, and the road is tortuous.
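A sketch of the single-sample update on a toy 1-D linear model y = w * x; the learning rate and data are made up:

```python
import random

# Toy data drawn from y = 3x plus noise; we fit y = w * x.
data = [(x, 3 * x + random.uniform(-0.1, 0.1)) for x in range(1, 6)]

w, lr = 0.0, 0.01
for _ in range(100):
    x, y = random.choice(data)  # a single sample, not the whole dataset
    grad = 2 * (w * x - y) * x  # gradient of this sample's squared error
    w -= lr * grad              # a biased step, but it trends to the minimum
print(w)  # ≈ 3
```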

(19) Gradient: the architectures and latest developments of deep learning, including CNNs, RNNs, and the GANs that create countless fake faces, are all inseparable from the gradient descent algorithm. The gradient can be understood as the direction of fastest ascent at a point on a hillside, and its opposite direction is the direction of fastest descent. To get down the mountain fastest, walk opposite to the gradient. What looks like a sand-table exercise is really a set of small balls we release, rolling to the bottom along the gradient.

Stochastic gradient descent: the ultimate goal of the gradient descent algorithm is to find the lowest point of the whole "terrain" (the global minimum), the valley with the lowest altitude. But the terrain may contain more than one valley (local minima), so we release many balls, let them fall into different valleys, and finally compare altitudes to find the lowest point.
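A sketch of the "many balls" idea on a made-up 1-D terrain with two valleys:

```python
import random

def f(x):  # a 1-D "terrain" with two valleys
    return x**4 - 3 * x**2 + x

def grad(x):  # derivative of f
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)  # roll downhill, against the gradient
    return x

# Release many balls at random spots; each rolls into the nearest valley.
balls = [descend(random.uniform(-2, 2)) for _ in range(20)]
print(round(min(balls, key=f), 2))  # ≈ -1.3, the deeper (global) valley
```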

 


Origin: blog.csdn.net/weixin_42133481/article/details/114882269