[Summary and Analysis of CV Knowledge Points] | Regularization

【Preface】

This series is aimed at readers who already know Python and have some programming experience, as well as those preparing for jobs in artificial intelligence, algorithms, and machine learning. It covers deep learning, machine learning, computer vision, feature engineering, and more. I believe it can help beginners get started with deep learning quickly, and help job seekers review the key algorithm knowledge points.

1. Definition of Overfitting

To describe underfitting and overfitting, let us first borrow a figure from Andrew Ng's course.

For a simple dataset $(x, y)$, $x$ represents the features and $y$ the target.
The left panel of the figure uses a hypothesis with only two parameters: $h_{\theta}(x)=\theta_{0}+\theta_{1}x$. This function does not fit the points well; such a model is called **underfitting**.

The middle panel adds one feature, giving a hypothesis with three parameters: $h_{\theta}(x)=\theta_{0}+\theta_{1}x+\theta_{2}x^{2}$. The resulting curve fits the data reasonably well.

The right panel adds several more parameters, so that the hypothesis becomes a quintic polynomial: $h_{\theta}(x)=\sum_{i=0}^{5}\theta_{i}x^{i}$. This curve passes through every point in the plot, but its accuracy on new test data may be very low, because it fits the training data too tightly and the hypothesis has become overly rigid. This situation is called **overfitting**. Its typical symptom is high accuracy on the training set but low accuracy on the test set.

2. Some reasons for overfitting

(1) **Poor sample selection during modeling**, including (but not limited to) too few samples, a flawed sampling method, or sampling that ignores the business scenario or its characteristics, so that the extracted samples cannot adequately represent the business logic or scenario;

(2) **Too much noise in the sample data**, so much that the model memorizes the noise instead of the real relationship between inputs and outputs;

(3) The "logical assumption" in modeling cannot be established when the model is applied . Any forecasting model can only be built and applied on the basis of assumptions. Common assumptions include: assuming that historical data can predict the future, assuming that there are no significant changes in business links, assuming that the modeling data is similar to the subsequent application data, etc. If the above assumptions violate the business scenario, the model built based on these assumptions cannot be effectively applied.

(4) Too many parameters and high model complexity

(5) **Decision tree models.** If the growth of a decision tree is not reasonably restricted and pruned, it can grow freely until every leaf contains purely event data or purely non-event data. Such a tree matches (fits) the training data perfectly, but once applied to new real-world data its performance collapses;

(6) Neural network model .

a. For a given sample, the hidden-unit representation may not be unique, i.e., the decision surface of the resulting classifier is not unique. As learning proceeds, the BP algorithm may drive the weights to converge to an overly complex decision surface;

b. Too many weight-update iterations (overtraining), so that the model fits the noise in the training data and unrepresentative features of the training samples.

3. What are some ways to solve overfitting during model training?

(1) **Weight decay**, mainly used in neural network models.

Weight decay shrinks every weight by a small factor on each iteration. This is equivalent to modifying the error function $E$ by adding a penalty term proportional to the total magnitude of the network weights. The motivation is to keep the weights small, biasing the learning process away from overly complex decision surfaces.
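As a minimal illustration (a NumPy sketch with illustrative names and values, not any particular library's API), one SGD step with weight decay looks like this:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, decay=1e-4):
    """One SGD step with weight decay: first shrink w by a small factor,
    then apply the usual gradient update."""
    w = (1.0 - lr * decay) * w      # multiplicative shrink toward 0
    return w - lr * grad            # standard gradient step

w = np.array([0.5, -2.0, 3.0])
grad = np.array([0.1, -0.2, 0.3])
w = sgd_step_with_weight_decay(w, grad)
```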

(2) Appropriate stopping criterion

For a quadratic error function there is an intuitive explanation of why early stopping gives results similar to weight decay. The ellipses are contours of constant error, and $w_{ML}$ marks the minimum of the error function. If the weight vector starts at the origin and moves along the local negative gradient, it follows a curved path. Stopping training early yields a weight vector $w$ that is qualitatively similar to the one obtained by adding a quadratic weight-decay regularization term and minimizing the regularized error function.

(3) Validation data

One of the most successful approaches is to give the algorithm a set of validation data in addition to the training data, and to use the number of iterations that produces the minimum error on the validation set. In practice it is not always obvious when the validation error has reached its minimum.
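Below is a self-contained sketch of this idea: a deliberately overparameterized polynomial fit in which training halts once the held-out error stops improving (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_val = rng.uniform(-1, 1, (50, 1)), rng.uniform(-1, 1, (20, 1))
y_tr = np.sin(3 * X_tr[:, 0]) + rng.normal(0, 0.1, 50)
y_val = np.sin(3 * X_val[:, 0]) + rng.normal(0, 0.1, 20)

# degree-9 polynomial features: prone to overfitting
Phi_tr, Phi_val = X_tr ** np.arange(10), X_val ** np.arange(10)

w = np.zeros(10)
best_val, best_w, wait, patience = np.inf, w, 0, 50
for epoch in range(5000):
    grad = 2 * Phi_tr.T @ (Phi_tr @ w - y_tr) / len(y_tr)
    w = w - 0.05 * grad                            # plain gradient step
    val = np.mean((Phi_val @ w - y_val) ** 2)      # error on held-out data
    if val < best_val - 1e-6:
        best_val, best_w, wait = val, w.copy(), 0  # remember the best weights
    else:
        wait += 1
        if wait >= patience:                       # stopped improving: halt
            break
w = best_w                                         # roll back to best checkpoint
```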

(4) Cross Validation

Holding out a separate validation set works well when extra data is available; for small training sets, where the overfitting problem is more severe, k-fold cross-validation makes better use of the limited data.
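For example, with scikit-learn, 5-fold cross-validation takes a few lines (the dataset and model here are chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: every sample is used for validation exactly once,
# making better use of a small dataset than a single held-out split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```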

(5) Add regularization terms.

L1 regularization tends to produce sparse solutions (many weights exactly zero), while L2 regularization drives the parameters $w$ toward small values near zero.

(6) For tree models:

a. Stop growth before the tree gets too large, e.g., by requiring a minimum number of samples (a threshold) in each leaf;

b. Let the tree grow large, then prune branches and leaves until any further pruning would reduce accuracy. A minimal sketch of both strategies follows below.
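A scikit-learn sketch of both strategies (parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# (a) pre-pruning: stop growth early by requiring enough samples per leaf
pre_pruned = DecisionTreeClassifier(min_samples_leaf=20, max_depth=5).fit(X, y)

# (b) post-pruning: grow the tree, then prune by cost-complexity (ccp_alpha)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)
```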

(7) Other common techniques for preventing overfitting include **early stopping**, **data augmentation**, **regularization**, and **Dropout**.

4. What is regularization?

First, consider regularity: regularity measures the smoothness of a function; the higher the regularity, the smoother the function. (Smoothness here refers to differentiability: a smooth function is infinitely differentiable, i.e., its derivatives of every order exist.)

In machine learning, regularization appears as an extra term added to the loss function. Two kinds are commonly used, written ℓ1 and ℓ2, and called L1 regularization and L2 regularization (or the L1 norm and L2 norm). L1 and L2 are norms in the mathematical sense, and using norms achieves exactly what we want: the L1 norm is the sum of absolute values, and the L2 norm is the square root of the sum of squares.

L1 and L2 regularization can be regarded as penalty terms on the loss function. "Penalty" means placing restrictions on certain parameters in the loss function. For linear regression, the model with L1 regularization is called Lasso regression, and the model with L2 regularization is called Ridge regression.

Regularization addresses the overfitting problem. As mentioned in Andrew Ng's machine learning course, there are two ways to tackle overfitting:

Method 1: Reduce the number of selected variables. Manually inspect each variable, decide which ones matter most, and keep only those feature variables. Clearly this requires a good understanding of the problem and professional experience or prior knowledge, so deciding which variables to keep is not easy. Moreover, discarding feature variables also discards information: perhaps every feature carries some signal for predicting house prices, and we do not really want to throw any of it away. A better approach is to impose constraints that automatically select the important feature variables and automatically discard the unnecessary ones.

Method 2: Regularization. Regularization automatically weakens unimportant feature variables, automatically "extracts" the important ones, and reduces the magnitude of the feature weights. It is very effective when there are many feature variables, each contributing a small amount to the prediction. In the house-price example we may have many features, each of them useful, so rather than removing any of them we regularize, which is exactly what motivates the concept.

5. Intuitive understanding of L1 and L2 regularization

In the figure, the colored circles in the upper right are contours of the error term, and the minimum of the regularized objective lies where the two sets of curves first touch. The L1 ball on the left has sharp corners, so contact is much more likely to occur on a coordinate axis, i.e., at points where some coordinate is exactly zero, producing many solutions with exact zeros. The L2 ball, by contrast, is a smooth arc, and contact can happen anywhere on it.

The figure shows the contours of the squared-error objective together with the contours of the L1 and L2 norm balls (L1 on the left). The regularized cost function seeks a balance between empirical risk and model complexity, visualized as the point where the black boundary meets the colored contours.

The colored lines are the contours encountered during optimization; each circle corresponds to one value of the objective, with the sample observation at the center (assuming a single sample) and the error as the radius. The black boundary is the constraint (the regularization term), and the point where the two meet is the optimal parameter.

In the left figure this contact point is $(w_1, w_2) = (0, w)$. Intuitively, because the L1 ball has many "protruding corners" (four in two dimensions, more in higher dimensions), the unregularized loss contours are far more likely to touch one of these corners than any other part of the ball, and at a corner many weights are exactly zero. This is why L1 yields sparse models, which can then be used for feature selection.

In the right figure, the L2 ball in two dimensions is a circle; compared with the diamond, its corners have been smoothed away. When the unregularized loss contours meet it, the probability that $w_1$ or $w_2$ is exactly zero is therefore much smaller, which is why L2 regularization does not produce sparsity.

In short, L2 regularization confines the parameters to a circular solution region, while L1 confines them to a diamond-shaped one. The "sharp-cornered" L1 region is far more likely to meet the objective's contours at a corner, yielding a sparse solution.

L1 regularization

Add an L1 regularization term to the original cost function: the sum of the absolute values of all weights $w$, multiplied by $\lambda/n$ (unlike the L2 term, no factor of 1/2 is needed here):

$$C = C_{0} + \frac{\lambda}{n} \sum_{w} |w|$$

First compute the derivative:

$$\frac{\partial C}{\partial w} = \frac{\partial C_{0}}{\partial w} + \frac{\lambda}{n} \operatorname{sgn}(w)$$

Here $\operatorname{sgn}(w)$ denotes the sign of $w$. The update rule for the weight $w$ is then:

$$w \rightarrow w' = w - \frac{\eta \lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_{0}}{\partial w}$$

Compared with the original update rule there is an extra term $\frac{\eta\lambda}{n}\operatorname{sgn}(w)$. When $w$ is positive the update makes it smaller; when $w$ is negative the update makes it larger. The effect is to pull $w$ toward 0, keeping as many weights as possible at zero, which reduces the complexity of the network and prevents overfitting.

One case remains: what happens when $w = 0$? At $w = 0$, $|w|$ is not differentiable, so we simply update $w$ by the original unregularized rule, i.e., drop the $\frac{\eta\lambda}{n}\operatorname{sgn}(w)$ term. Equivalently, we can define $\operatorname{sgn}(0) = 0$, which unifies the $w = 0$ case with the others.

(When programming, set sgn(0) = 0, sgn(w > 0) = 1, sgn(w < 0) = -1.)
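A direct NumPy translation of this update rule (illustrative constants; note that `np.sign(0)` already returns 0, matching the convention above):

```python
import numpy as np

def l1_update(w, grad_C0, eta=0.1, lam=0.01, n=1000):
    """Weight update with an L1 penalty:
    w' = w - (eta * lam / n) * sgn(w) - eta * dC0/dw."""
    return w - (eta * lam / n) * np.sign(w) - eta * grad_C0

w = np.array([0.0, -0.5, 2.0])
grad_C0 = np.array([0.3, -0.1, 0.2])
w = l1_update(w, grad_C0)   # the w == 0 entry gets no penalty term
```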

L2 regularization (weight decay)

L2 regularization adds the following term to the cost function:

$$C = C_{0} + \frac{\lambda}{2n} \sum_{w} w^{2}$$

$C_0$ is the original cost function; the second term is the L2 regularization term: the sum of the squares of all weights $w$, divided by the training-set size $n$. The coefficient $\lambda$ balances the regularization term against $C_0$. The extra factor of 1/2 is there purely for convenience: differentiating the squared term produces a 2, which the 1/2 cancels.

How does the L2 regularization term prevent overfitting? Let us derive it; first the derivatives:

$$\begin{aligned} \frac{\partial C}{\partial w} &= \frac{\partial C_{0}}{\partial w} + \frac{\lambda}{n} w \\ \frac{\partial C}{\partial b} &= \frac{\partial C_{0}}{\partial b} \end{aligned}$$

It can be found that the L2 regularization term has no effect on the update of b, but has an effect on the update of w:

$$\begin{aligned} w &\rightarrow w - \eta \frac{\partial C_{0}}{\partial w} - \frac{\eta \lambda}{n} w \\ &= \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_{0}}{\partial w} \end{aligned}$$

Without L2 regularization, the coefficient of $w$ in this update is 1; with it, the coefficient is $1 - \eta\lambda/n$. Since $\eta$, $\lambda$, and $n$ are all positive, $1 - \eta\lambda/n$ is less than 1, so the effect is to shrink $w$; this is the origin of the name weight decay. Of course, once the gradient term is taken into account, the final $w$ may still increase or decrease.

In addition, for mini-batch stochastic gradient descent the update formulas for $w$ and $b$ differ slightly from those above:

$$w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_{x} \frac{\partial C_{x}}{\partial w} \qquad b \rightarrow b - \frac{\eta}{m} \sum_{x} \frac{\partial C_{x}}{\partial b}$$

Compared with the earlier update for $w$, the gradient term has changed: it is now the sum of the per-example derivatives, multiplied by $\eta$ and divided by $m$, where $m$ is the number of samples in a mini-batch.
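The same mini-batch update written out in NumPy (a sketch with illustrative names; `grads_w` and `grads_b` hold per-example gradients stacked along the first axis):

```python
import numpy as np

def l2_minibatch_step(w, b, grads_w, grads_b, eta=0.1, lam=0.01, n=50000):
    """One mini-batch SGD step with L2 weight decay.
    grads_w: shape (m, d); grads_b: shape (m,), with m = batch size."""
    m = len(grads_w)
    w = (1 - eta * lam / n) * w - (eta / m) * np.sum(grads_w, axis=0)
    b = b - (eta / m) * np.sum(grads_b, axis=0)   # the bias is not decayed
    return w, b
```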

So far we have only shown that the L2 term makes $w$ "smaller", not why a smaller $w$ prevents overfitting. The usual "obvious" explanation is that smaller weights correspond, in some sense, to a lower-complexity network that fits the data just well enough (a rule also known as Occam's razor), and in practice this is borne out: L2-regularized models usually outperform unregularized ones. For many people (myself included) that explanation is not fully satisfying, so here is a slightly more mathematical one (quoted from Zhihu):

When a model overfits, the coefficients of the fitted function are often very large. Why? Overfitting means the fitted function must accommodate every point, so the final curve fluctuates strongly, and in some small intervals the function's value changes drastically. That means the (absolute) derivative of the function is very large in those intervals; and since the magnitude of the inputs can be large or small, only sufficiently large coefficients can produce such large derivatives.

Regularization constrains the norm of the parameters so that it cannot grow too large, and thus reduces overfitting to some extent.

6. The difference between L1 and L2 regularization

  1. **L2 regularizer**: biases the solution toward a $w$ with a smaller norm; by limiting the norm of $w$ it restricts the model space and thus avoids overfitting to a certain extent. However, ridge regression cannot produce sparse solutions: the learned coefficients still require all features in the data to compute a prediction, so there is no computational saving. Because of the "sparse solution" property of the L1 norm penalty, L1 is better suited to feature selection: it finds the most "critical" features and sets the less important ones to zero.

  2. **L1 regularizer**: its nice property is sparsity; it drives many entries of $w$ to exactly zero. Sparsity not only brings computational savings but, more importantly, interpretability. The L2 penalty, for its part, yields models whose parameter values are uniformly small, meaning such models resist interference well and adapt to different datasets and "extreme conditions".

In ordinary regression analysis, $w$ denotes the feature coefficients, and the formulas above show that the regularization term acts on (restricts) those coefficients.

The description of L1 regularization and L2 regularization is as follows:

  • L1 regularization is the sum of the absolute values of the elements of the weight vector $w$, usually written $\|w\|_1$;

  • L2 regularization is the square root of the sum of squares of the elements of $w$ (note that the L2 penalty in Ridge regression is squared), usually written $\|w\|_2$.

So what does adding L1 or L2 regularization actually do? Their roles, as stated in many articles, are:

  • L1 regularization produces a sparse weight matrix, i.e., a sparse model that can be used for feature selection;

  • L2 regularization prevents model overfitting; to a certain extent, L1 can also prevent overfitting. A small sketch contrasting the two follows below.
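A quick scikit-learn demonstration of the sparsity difference (synthetic data; the `alpha` values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 features, only 10 of which are truly informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```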

7. Do you know BN? What are the functions and advantages?

BN (Batch Normalization) was proposed by Google to address vanishing and exploding gradients in deep networks, and it also provides a certain regularization effect. Here is the principle:

Batch normalization: during each stochastic-gradient-descent training step, the output of each convolutional layer is normalized over the mini-batch so that the result has mean 0 and variance 1 in each dimension.

The BN operation has four steps. Given input $x_i$, the first step computes the mean:

$$\mu_{\beta} = \frac{1}{m} \sum_{i=1}^{m} x_{i}$$

The second step calculates the variance of the data:

$$\sigma_{\beta}^{2} = \frac{1}{m} \sum_{i=1}^{m} \left(x_{i} - \mu_{\beta}\right)^{2}$$

The third step is normalization:

$$x_{i}^{*} = \frac{x_{i} - \mu_{\beta}}{\sqrt{\sigma_{\beta}^{2} + \epsilon}}$$

The fourth step is scaling and shifting:

$$y_{i} = \gamma \cdot x_{i}^{*} + \beta = BN_{\gamma, \beta}\left(x_{i}\right)$$

$m$ is the number of examples in the mini-batch. BN essentially performs a **whitening** operation, which is linear; the final "scale and shift" step lets BN trade off between linearity and nonlinearity, and its parameters $\gamma$ and $\beta$ are learned by the network during training.

After BN, small outputs of each layer are "stretched" and large outputs are "compressed", which effectively avoids vanishing and exploding gradients. In short, BN is a learnable network layer with parameters $(\gamma, \beta)$.
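The four steps translate directly into NumPy (a training-mode sketch; `gamma` and `beta` would normally be learned parameters):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BN over a mini-batch x of shape (m, d),
    following the four steps above."""
    mu = x.mean(axis=0)                        # step 1: per-dimension mean
    var = x.var(axis=0)                        # step 2: per-dimension variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # step 3: normalize
    return gamma * x_hat + beta                # step 4: scale and shift

x = np.random.randn(32, 8) * 3 + 5
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))  # ~0 and ~1
```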

8. What is the difference between BN training and testing?

During training, the mean and variance are computed over each **batch**.

At test time, the mean and variance should describe the **entire dataset**. Therefore, besides the normal forward pass and backpropagation, the training process must also record the mean and variance of each batch, so that after training the overall statistics can be computed as follows:

$$\begin{aligned} \mathrm{E}[x] &\leftarrow \mathrm{E}_{\mathcal{B}}\left[\mu_{\mathcal{B}}\right] \\ \operatorname{Var}[x] &\leftarrow \frac{m}{m-1} \mathrm{E}_{\mathcal{B}}\left[\sigma_{\mathcal{B}}^{2}\right] \end{aligned}$$

In plain terms: at test time, the mean is the average of the batch means $\mu_{\beta}$, and the variance is an **unbiased estimate** based on the batch variances $\sigma_{\beta}^{2}$ (an unbiased estimate is an inference about a population parameter from sample statistics that is correct on average).

At the test stage, the BN formula becomes:

$$y = \frac{\gamma}{\sqrt{\operatorname{Var}[x] + \epsilon}} \cdot x + \left(\beta - \frac{\gamma\, \mathrm{E}[x]}{\sqrt{\operatorname{Var}[x] + \epsilon}}\right)$$
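In code, test-time BN reduces to a single affine transform using the accumulated statistics (a NumPy sketch mirroring the formula above):

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test-time BN: use statistics accumulated over training batches,
    folded into the form y = scale * x + shift."""
    scale = gamma / np.sqrt(running_var + eps)
    shift = beta - scale * running_mean
    return scale * x + shift
```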

9. What about BN and LN? What is the difference? Along which dimension does LN normalize?

LN (Layer Normalization) is "horizontal": it normalizes a **single sample across all neurons of the same layer**.

BN (Batch Normalization) is "vertical": it normalizes **each neuron across all samples in the batch**, so it depends on the **batch size**.

The purpose of both is to speed up model convergence and reduce training time.
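The difference is just the axis along which the statistics are computed, as this NumPy sketch shows (shapes are illustrative):

```python
import numpy as np

x = np.random.randn(32, 64)   # (batch, features)

# BN: one mean/variance per feature, computed across the batch ("vertical")
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# LN: one mean/variance per sample, computed across features ("horizontal"),
# hence independent of the batch size
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)
```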

10. How to use BN and dropout at the same time?

When BN and dropout are used together, a **variance shift** problem can arise.

For the variance shift, the paper proposes two solutions:

  1. Avoid the variance shift by placing dropout only after all BN layers. (Most open-source models insert BN throughout the network, so in practice dropout can only be added in the layer before the softmax, and the result is still acceptable, at least no worse than without dropout. Another option is to freeze the parameters after training, switch to test mode, recompute the BN mean and variance on the training data, and use those to normalize the test data; the paper shows this outperforms the baseline.)

  2. The original dropout paper proposed Gaussian dropout; this paper extends it further and proposes a uniformly distributed dropout (also called "Uout"). The advantage of this form of dropout is its reduced sensitivity to the variance shift.

11. The parameter distributions behind the two regularizers

L1 regularization assumes the parameters follow a **Laplace distribution**; L2 regularization assumes they follow a **normal (Gaussian) distribution**.
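One standard way to see this is to interpret regularized estimation as maximum a posteriori (MAP) inference (a sketch of the usual derivation):

$$\hat{w} = \arg\max_{w}\; p(D \mid w)\, p(w) = \arg\min_{w}\; \big[-\log p(D \mid w) - \log p(w)\big]$$

With a Laplace prior $p(w_i) \propto e^{-|w_i|/b}$, the prior term is $-\log p(w) = \frac{1}{b}\sum_i |w_i| + \text{const}$, an L1 penalty; with a Gaussian prior $p(w_i) \propto e^{-w_i^{2}/(2\sigma^{2})}$, it is $\frac{1}{2\sigma^{2}}\sum_i w_i^{2} + \text{const}$, an L2 penalty.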

12. At prediction time, should the weights trained with dropout be used as-is, or multiplied by keep_prob? Why?

Multiply by **keep_prob**.

Because neurons cannot be randomly dropped at prediction time, the "compensation" scheme is to multiply each neuron's weights by $p$ (the keep probability), so that the test data and training data look roughly the same "in expectation": scaling the weights by $p$ guarantees the same expected output at test time.

Note: the mainstream today is **inverted dropout** rather than vanilla dropout, and inverted dropout does not require multiplying by keep_prob at test time. Instead, during training the activations of a layer that has undergone dropout are divided by keep_prob, so the model needs no change at test time.
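A minimal NumPy sketch of inverted dropout (the `training` flag and names are illustrative):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True):
    """Inverted dropout: scale by 1/keep_prob at training time,
    so the test-time forward pass needs no change."""
    if not training:
        return a                                   # nothing to do at test time
    mask = np.random.rand(*a.shape) < keep_prob    # keep each unit w.p. keep_prob
    return a * mask / keep_prob                    # keep the expectation unchanged
```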

13. Why can L1 regularization alleviate overfitting?

L1 is the sum of the absolute values of the model parameters, $\|\vec{w}\|_{1}$. After optimizing the objective, some of the parameters become exactly 0 while the rest take nonzero real values, which acts as a **feature filter**. Since overfitting stems from having too many features, and L1 filters features, it can **alleviate overfitting**.

14. BN+CONV fusion formula and function

After training, at the inference stage the convolutional layer and the BN layer are usually fused to speed up computation:

At inference, $\mathrm{E}[x]$ is the running (moving-average) **mean** and $\operatorname{Var}[x]$ is the running **variance**.

Fusing the BN layer into the convolutional layer amounts to modifying the convolution kernel; it does not increase the cost of the convolution, and it eliminates the entire BN computation.
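Concretely, writing the convolution output as $z = W * x + b$ and substituting it into the inference-time BN formula from Section 8 gives the fused kernel and bias (a standard derivation; $\mathrm{E}[x]$ and $\operatorname{Var}[x]$ are the running statistics):

$$W_{\text{fused}} = \frac{\gamma}{\sqrt{\operatorname{Var}[x] + \epsilon}}\, W, \qquad b_{\text{fused}} = \frac{\gamma\,\big(b - \mathrm{E}[x]\big)}{\sqrt{\operatorname{Var}[x] + \epsilon}} + \beta$$

Inference then needs only the single convolution $y = W_{\text{fused}} * x + b_{\text{fused}}$.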

15. What other Normalization methods are there?

1 Layer Normalization

To obtain reasonable statistics even when only a single training example is available, the most direct idea is this: an MLP hidden layer itself contains several neurons; similarly, a convolutional layer in a CNN contains $k$ output channels, each channel contains $m \times n$ neurons, so the whole layer contains $k \cdot m \cdot n$ neurons; likewise, the hidden state of each RNN time step contains several neurons. We can therefore take the responses of all neurons in the same hidden layer as the set $S$ over which mean and variance are computed. This is the basic idea of Layer Normalization. The figure shows the range of the set $S$ for Layer Normalization in MLPs, CNNs, and RNNs; it is intuitive enough that we will not elaborate.

As noted above, BN is awkward to use in RNNs, whereas Layer Normalization, which computes statistics within a single hidden layer, suits dynamic networks such as RNNs much better. At present LayerNorm appears to be the only normalization that works well in RNNs, but it seems suitable mainly for RNN scenarios; in CNNs and other settings it is not as effective as BatchNorm or GroupNorm. Normalization mechanisms for dynamic networks remain a field worth further study.

2 Instance Normalization

Layer Normalization removed the dependence on the mini-batch by taking all neurons in a layer as the statistical range; can that range be narrowed further? For CNNs it clearly can: each kernel in a convolutional layer produces one output channel, and each output channel is a two-dimensional plane containing many activations, so the statistical range can be shrunk to the single output channel produced by one kernel. This is Instance Normalization in CNNs: for a given convolutional layer, the neurons of each output channel form their own set $S$ for computing mean and variance. For RNNs or MLPs, narrowing the same hidden layer this way leaves a single neuron whose output is a single value rather than a 2D plane, so no set $S$ can be formed; hence RNNs and MLPs cannot perform Instance Normalization, which is easy to understand.

If the batch size of BN in a CNN is set to 1, is that equivalent to Instance Normalization? At first glance Instance Normalization looks like the special case of Batch Normalization with batch size 1, but on closer thought there is still a difference between the two; readers can work out for themselves what it is.

Instance Normalization is clearly better than BN for some image-generation tasks such as style transfer, but for many other image tasks, such as classification, it is not as effective as BN.

3 Group Normalization

Layer Normalization and Instance Normalization are two extremes: LN takes all neurons in a layer as the statistical range, while IN takes each output channel of a convolutional layer on its own. Is there a statistical range in between? Channel grouping is a common optimization technique in CNNs, so it is natural to group the output (or input) channels of a convolutional layer and compute statistics within each group. This is the core idea of Group Normalization, an improved model proposed by Kaiming He's research group at Facebook in 2018.

The figure shows Group Normalization in a CNN. In principle MLPs and RNNs could adopt the same scheme, but no such study has appeared yet; one would guess that with so few neurons per group the statistics would lack validity, so the effect would probably not be good.

Group Normalization outperforms BN in scenarios requiring small batch sizes, such as object detection and video classification.
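A compact NumPy sketch of Group Normalization on an $(N, C, H, W)$ feature map; note that `num_groups=1` recovers LayerNorm over $(C, H, W)$, and `num_groups=C` recovers InstanceNorm:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization for a feature map x of shape (N, C, H, W):
    split the channels into groups and normalize within each (sample, group)."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)     # per-(sample, group) mean
    var = g.var(axis=(2, 3, 4), keepdims=True)     # per-(sample, group) variance
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.randn(2, 8, 4, 4)
y = group_norm(x, num_groups=4)
```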

【Project recommendation】

A core code library of top-conference papers for beginners: https://github.com/xmu-xiaoma666/External-Attention-pytorch

A YOLO object-detection library for beginners: https://github.com/iscyy/yoloair

Analyses of top-journal and top-conference papers for beginners: https://github.com/xmu-xiaoma666/FightingCV-Paper-Reading

References

https://blog.csdn.net/qq_35556254/article/details/90813562

https://blog.csdn.net/SecondLieutenant/article/details/78931706

https://blog.csdn.net/zwqjoy/article/details/79806989

https://www.nowcoder.com/issue/tutorial?zhuanlanId=qMKkxM&uuid=24c5ab2b16094e04b0c8c5d44c6c949a
