Notes for Machine Learning Beginners: Baidu Flying Paddle (PaddlePaddle) Zero-Basics Practical Deep Learning Course

Machine Learning

Machine learning is a branch of artificial intelligence that uses algorithms and statistical models to enable computer systems to automatically learn from empirical data and continuously improve their performance. In simple terms, machine learning is a method of training a computer program to recognize and predict patterns.

In machine learning, a computer program learns from the data it is given and then uses what it has learned to make decisions or predict future events. This process includes steps such as data acquisition, cleaning, processing, model training, model evaluation, and prediction.

This article is mainly a collection of the material I gathered before model tuning.

Model Evaluation and Tuning - Learn to evaluate the performance of your models and tune hyperparameters for better results, using techniques such as cross-validation, grid search, and more.
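As a small illustration of these tuning techniques, here is a minimal scikit-learn sketch (scikit-learn is not part of the Paddle course itself; the dataset, classifier, and parameter grid below are only illustrative placeholders):

```python
# Illustrative sketch: cross-validation and grid search with scikit-learn.
# The dataset, model, and parameter grid are placeholders, not from the course.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# 5-fold cross-validation: estimate generalization performance.
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy:", scores.mean())

# Grid search: try every combination of the listed hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
```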

Neural Networks:

Recurrent Neural Network (RNN)

Deep Neural Network (DNN)

Convolutional Neural Network (CNN)

Generative Adversarial Network (GAN)

Basic building blocks commonly used in convolutional neural networks (a minimal Paddle sketch follows this list):

Convolution:

Convolution kernel (kernel), padding, stride, receptive field (Receptive Field), multiple input channels, multiple output channels, and batch operations

Pooling

Activation function

Batch Normalization

Dropout
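To make these building blocks concrete, here is a minimal sketch that chains convolution, batch normalization, an activation, pooling, and dropout. It assumes the PaddlePaddle 2.x `paddle.nn` API; the channel counts and layer sizes are arbitrary examples, not taken from the course.

```python
# Minimal sketch of a conv block: Conv2D -> BatchNorm2D -> ReLU -> MaxPool2D -> Dropout.
# Assumes PaddlePaddle 2.x; channel counts and kernel sizes are arbitrary examples.
import paddle
import paddle.nn as nn

block = nn.Sequential(
    nn.Conv2D(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2D(16),
    nn.ReLU(),
    nn.MaxPool2D(kernel_size=2, stride=2),   # halves H and W
    nn.Dropout(p=0.5),
)

x = paddle.randn([4, 3, 32, 32])   # [N, C, H, W]
y = block(x)
print(y.shape)                     # [4, 16, 16, 16]
```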

Pooling:

Average Pooling

Max Pooling

Spatial Pyramid Pooling (SPP)

Global Average Pooling (GAP)

Global Max Pooling

NetVLAD Pooling

Stochastic Pooling

Overlapping Pooling

RoI Pooling
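A few of these pooling operations map directly onto Paddle layers. The sketch below (assuming the PaddlePaddle 2.x API, with an arbitrary input shape) contrasts average pooling, max pooling, and global average pooling:

```python
# Sketch: average pooling, max pooling, and global average pooling (GAP) in Paddle.
# Assumes PaddlePaddle 2.x; the input shape is an arbitrary example.
import paddle
import paddle.nn as nn

x = paddle.randn([1, 8, 6, 6])                   # [N, C, H, W]

avg = nn.AvgPool2D(kernel_size=2, stride=2)(x)   # -> [1, 8, 3, 3]
mx = nn.MaxPool2D(kernel_size=2, stride=2)(x)    # -> [1, 8, 3, 3]
gap = nn.AdaptiveAvgPool2D(output_size=1)(x)     # -> [1, 8, 1, 1], one value per channel

print(avg.shape, mx.shape, gap.shape)
```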

Activation function:

Sigmoid function

Tanh function

ReLU function

Leaky ReLU (LReLU)

PReLU

UP

Swish

HSwish
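The most common of these activations can be evaluated directly on a tensor. The sketch below uses plain Paddle operations (PaddlePaddle 2.x API assumed); Swish is written out from its definition x · sigmoid(x) rather than relying on a library call.

```python
# Sketch: a few common activation functions applied to the same tensor.
# Assumes PaddlePaddle 2.x; Swish is written out from its definition x * sigmoid(x).
import paddle
import paddle.nn.functional as F

x = paddle.to_tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(F.sigmoid(x))                          # 1 / (1 + exp(-x)), output in (0, 1)
print(paddle.tanh(x))                        # output in (-1, 1)
print(F.relu(x))                             # max(0, x)
print(F.leaky_relu(x, negative_slope=0.01))  # small slope for x < 0 (LReLU)
print(x * F.sigmoid(x))                      # Swish: x * sigmoid(x)
```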

Normalization methods:

Batch Normalization

Layer Normalization (LN)

Group Normalization (GN)

Instance Normalization (IN)

Here N denotes the batch size, H and W denote the height and width of the feature map, C denotes the number of channels of the feature map, and the blue pixels in the figure are normalized with the same mean and variance:

[Figure: the dimensions over which BN, LN, GN, and IN compute their mean and variance]

LN: normalizes the mean and variance over the [C, W, H] dimensions, i.e. across the channel direction, independently of the batch size; it can work better when the batch size is small.

GN: first groups the channels, then normalizes over the [C_i, W, H] dimensions within each group; it is also independent of the batch size.

IN: normalizes only over the [H, W] dimensions; it is well suited to image stylization tasks.
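All four normalization layers are available in `paddle.nn`. The sketch below (PaddlePaddle 2.x API assumed; the feature-map shape and group count are arbitrary examples) constructs them for the same feature map:

```python
# Sketch: BN, LN, GN, and IN applied to the same [N, C, H, W] feature map.
# Assumes PaddlePaddle 2.x; the shape and number of groups are arbitrary examples.
import paddle
import paddle.nn as nn

N, C, H, W = 4, 8, 16, 16
x = paddle.randn([N, C, H, W])

bn = nn.BatchNorm2D(C)                            # per channel, statistics over [N, H, W]
ln = nn.LayerNorm([C, H, W])                      # per sample, statistics over [C, H, W]
gn = nn.GroupNorm(num_groups=2, num_channels=C)   # per sample, per group of channels
inorm = nn.InstanceNorm2D(C)                      # per sample and per channel, over [H, W]

for layer in (bn, ln, gn, inorm):
    print(type(layer).__name__, layer(x).shape)
```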

Dropout methods:

DropConnect

Standout

Gaussian Dropout

Spatial Dropout

Cutout

Max-Drop

RNNDrop

Recurrent Dropout

Paddle provides two dropout modes:

downscale_in_infer

During training, neurons are randomly dropped with ratio p and their signals are not passed on to the next layer; at prediction time, the signals of all neurons are passed on, but the value of each neuron is multiplied by (1−p).

upscale_in_train

During training, neurons are randomly dropped with ratio p and their signals are not passed on to the next layer, but the values of the retained neurons are divided by (1−p); at prediction time, the signals of all neurons are passed on without any further processing.
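In code, the two behaviours correspond to the `mode` argument of `paddle.nn.Dropout`. A minimal sketch (PaddlePaddle 2.x API assumed; p=0.5 is an arbitrary ratio) is:

```python
# Sketch: the two dropout modes provided by paddle.nn.Dropout.
# Assumes PaddlePaddle 2.x; p=0.5 is an arbitrary drop ratio.
import paddle
import paddle.nn as nn

x = paddle.ones([2, 4])

# upscale_in_train (default): kept values are divided by (1 - p) during training,
# and the layer is an identity at inference time.
drop_train_scaled = nn.Dropout(p=0.5, mode="upscale_in_train")

# downscale_in_infer: no scaling during training; outputs are multiplied by (1 - p)
# at inference time instead.
drop_infer_scaled = nn.Dropout(p=0.5, mode="downscale_in_infer")

drop_train_scaled.train()
print(drop_train_scaled(x))   # some entries zeroed, kept entries scaled to 2.0

drop_infer_scaled.eval()
print(drop_infer_scaled(x))   # all entries scaled to 0.5
```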

Optimizer:

Four mainstream optimization algorithms

SGD

Momentum

AdaGrad

Adam


SGD: stochastic gradient descent; each step trains on a small batch of data, and sampling deviation causes the parameters to oscillate during convergence.

Momentum: introduces the physical concept of "momentum"; by accumulating velocity it reduces oscillation and makes the direction of parameter updates more stable.

The data in each batch contains sampling error, which makes the direction of the gradient update fluctuate considerably. If we introduce the physical concept of momentum and add a degree of "inertia" to the gradient descent process, we can reduce the oscillation along the update path: the gradient used in each update becomes a weighted sum of the "accumulated direction of the historical gradients" and the "gradient at the current step". The accumulated direction of the historical gradients is usually closer to the correct direction from a global point of view, which closely resembles physical "inertia" and is why the method is named "Momentum". Basketballs of different brands and materials differ somewhat in weight, and shooters in street basketball teams (who specialize in mid- and long-range shots) tend to prefer slightly heavier balls: a heavier ball has more inertia and is less affected by small deformations of the shooting gesture or by wind.

AdaGrad: dynamically adjusts the learning rate according to how far each parameter is from the optimal solution; the learning rate gradually decreases and is adapted separately for each parameter.

Experiments with different learning rates show that when a parameter's current value is far from the optimal solution (reflected in a large gradient magnitude), we want its update step to be larger so that it converges to the optimum faster; when the parameter's current value is close to the optimal solution (reflected in a small gradient magnitude), we want its update step to be smaller so that it approaches the optimum more finely. This is similar to golf: on the first shot a professional player usually drives the ball hard so that it lands as close to the hole as possible, while on the next shot, with the ball near the hole, they putt softly and carefully to avoid overshooting. Likewise, the parameter update step should gradually decrease as optimization proceeds, and the amount of decrease is related to the magnitude of the current gradient. The optimization algorithm built on this idea is called "AdaGrad"; Ada is short for Adaptive, meaning it adapts as conditions change. RMSProp is an improvement on AdaGrad: the learning rate adapts as the gradient changes, which fixes AdaGrad's problem of the learning rate dropping too sharply.

Adam: since momentum and adaptive learning rates are two orthogonal optimization ideas, they can be combined; Adam does exactly that and is a widely used algorithm at present.
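All four optimizers are available under `paddle.optimizer`. The sketch below (PaddlePaddle 2.x API assumed, with a throwaway linear model and arbitrary hyperparameter values) shows how each would be constructed:

```python
# Sketch: constructing the four optimizers in Paddle for the same model.
# Assumes PaddlePaddle 2.x; the model and hyperparameter values are arbitrary examples.
import paddle
import paddle.nn as nn

model = nn.Linear(10, 1)

opt_sgd = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
opt_momentum = paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9,
                                         parameters=model.parameters())
opt_adagrad = paddle.optimizer.Adagrad(learning_rate=0.01, parameters=model.parameters())
opt_adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())

# A single update step looks the same for all of them:
x, y = paddle.randn([8, 10]), paddle.randn([8, 1])
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt_adam.step()
opt_adam.clear_grad()
```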

Learning rate:

In deep neural network models, parameters are usually updated with the standard stochastic gradient descent algorithm, and the learning rate determines the size of each parameter update, i.e. the step size. When the learning rate is set well, the effective capacity of the model is largest and the final result is best.

A smaller learning rate is not always better: the smaller it is, the more slowly the loss decreases, so convergence takes longer, as shown in the left plot of Figure 2.

A larger learning rate is not always better either: the gradient is computed from only one batch of the full sample set, so sampling error means the computed gradient does not point exactly toward the global optimum and fluctuates. Near the optimal solution, an overly large learning rate makes the parameters oscillate around the optimum and the loss struggles to converge, as shown in the right plot of Figure 2.

[Figure 2: loss curves when the learning rate is too small (left) and too large (right)]
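In practice the learning rate is usually tried at a few orders of magnitude and often decayed as training proceeds. The sketch below (PaddlePaddle 2.x API assumed; the schedule boundaries and values are arbitrary examples, not from the course) wires a piecewise-decay schedule into an optimizer:

```python
# Sketch: setting the learning rate and decaying it during training.
# Assumes PaddlePaddle 2.x; the schedule boundaries and values are arbitrary examples.
import paddle
import paddle.nn as nn

model = nn.Linear(10, 1)

# Start at 0.1, drop to 0.01 after epoch 30 and to 0.001 after epoch 60.
scheduler = paddle.optimizer.lr.PiecewiseDecay(boundaries=[30, 60],
                                               values=[0.1, 0.01, 0.001])
opt = paddle.optimizer.Momentum(learning_rate=scheduler, momentum=0.9,
                                parameters=model.parameters())

# Typically called once at the end of each training epoch:
scheduler.step()
print(opt.get_lr())
```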

Regularization term:

To prevent the model from overfitting when enlarging the sample size is not an option, the only remedy is to reduce the model's complexity, which can be done by limiting the number of parameters or their possible values (keeping parameter values as small as possible). Concretely, a penalty term on the parameter sizes is added to the model's optimization objective (the loss): the more parameters there are, or the larger their values, the larger the penalty term. By adjusting the weight coefficient of the penalty term, the model can be balanced between "minimizing the training loss" and "maintaining its generalization ability". Generalization ability means the model remains effective on unseen samples. The regularization term increases the model's loss on the training set.

Paddle supports adding a uniform regularization term for all parameters, and also supports adding a regularization term for specific parameters. The former only requires setting the weight_decay argument of the optimizer, as shown in the code below. The coeff parameter adjusts the weight of the regularization term: the larger the weight, the heavier the penalty on model complexity.
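The code referred to above appears to be missing from these notes, so here is a minimal reconstruction based on the description (PaddlePaddle 2.x API assumed; the coeff value and model are arbitrary examples):

```python
# Sketch: adding an L2 regularization term for all parameters via weight_decay.
# Assumes PaddlePaddle 2.x; coeff=1e-4 and the model are arbitrary examples.
import paddle
import paddle.nn as nn

model = nn.Linear(10, 1)

opt = paddle.optimizer.Momentum(
    learning_rate=0.01,
    momentum=0.9,
    weight_decay=paddle.regularizer.L2Decay(coeff=1e-4),  # larger coeff = stronger penalty
    parameters=model.parameters(),
)
```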

Basic concepts of object detection:

Anchor:

Anchor-Based

Anchor-Based methods can be divided into two-stage detection algorithms and single-stage detection algorithms.

Anchor-Based methods use anchors to extract candidate target boxes, then classify and regress the anchors at every point on the feature map. A two-stage detection algorithm first uses anchors to generate candidate regions on the image, separating foreground from background, and then classifies the candidate regions and predicts the location of the target object. Typical two-stage detectors are the R-CNN series (Fast R-CNN, Faster R-CNN, etc.); the classic Faster R-CNN learns candidate regions (Region Proposals, RP) through an RPN (Region Proposal Network), then classifies and regresses these candidate regions, and finally outputs the target boxes and categories. Two-stage models that first generate candidate regions and then detect usually achieve better accuracy, but their prediction speed is slower.

In addition, Anchor-Based also includes single-stage models, which predict object category and location at the same time as generating candidate regions, without splitting the detection task into two stages. A typical single-stage family is the YOLO series (YOLOv2, YOLOv3, YOLOv4, PP-YOLO, PP-YOLOv2, etc.). Single-stage algorithms drop the RPN step that two-stage algorithms use to generate candidate regions, merging candidate-region generation and detection into one stage, which makes the network structure simpler and detection faster.

However, Anchor-Based methods have some drawbacks in practice: anchors must be designed by hand, which means choosing their number and size (aspect ratio); the detection boxes generated by densely sliding over every pixel of the feature map produce a large number of negative-sample regions, so the imbalance between positive and negative samples has to be handled; the anchor design introduces more network hyperparameters, making the model harder to train; and switching to a different dataset requires re-tuning the anchors.
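To make the "number and size (aspect ratio)" point concrete, here is a small NumPy sketch that generates the anchor boxes centred at one feature-map location. It is purely illustrative; the scales and ratios are arbitrary and not taken from any particular detector.

```python
# Sketch: generating anchor boxes of several scales and aspect ratios at one point.
# Purely illustrative NumPy code; the scales and ratios are arbitrary examples.
import numpy as np

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return [x1, y1, x2, y2] anchor boxes centred at (cx, cy)."""
    boxes = []
    for s in scales:        # each anchor covers an area of roughly s * s pixels
        for r in ratios:    # ratio = height / width
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

print(anchors_at(100, 100).shape)   # (9, 4): 3 scales x 3 ratios
```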

Anchor-Free

To address this, researchers proposed Anchor-Free methods, which no longer use preset anchors and usually detect targets by predicting their centers or corners. They include methods based on center-region prediction (FCOS, CenterNet, etc.) and methods based on the joint representation of multiple key points (CornerNet, RepPoints, etc.). Anchor-Free algorithms no longer need anchor design, so the model is simpler and inference time is reduced, but their accuracy is also lower than that of Anchor-Based methods.

Model structure:

R-CNN

SSD

YOLO (v1 to v7)

R-FCN

 


Source: blog.csdn.net/weixin_49844623/article/details/129806547