[Machine Learning] What is an effective method for model convergence?

Hello everyone. Data normalization plays a crucial role in model convergence. How did normalization methods evolve, step by step, from classical machine learning to deep learning, and why is normalization such an effective tool for getting deep learning models to converge?

1. Normalization in classical machine learning

There are two main data normalization algorithms in classical machine learning: 0-1 normalization and Z-Score normalization. In practice the two behave similarly, and both process the input data column by column.

1.1 0-1 Normalization

0-1 normalization (min-max scaling) is the simplest method, the easiest to think of, and the most commonly used normalization method in classical machine learning.

It traverses the input features column by column, records each column's Max and Min, and then scales every value into the range [0, 1] according to the formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

import torch
t = torch.arange(12).reshape(6, 2).float()
print(t)

# tensor([[ 0.,  1.],
#         [ 2.,  3.],
#         [ 4.,  5.],
#         [ 6.,  7.],
#         [ 8.,  9.],
#         [10., 11.]])

def min_max(t):
    # max of each column
    t_max = t.max(0)[0]
    # min of each column
    t_min = t.min(0)[0]
    # scale each column into [0, 1]
    zo_ans = (t - t_min) / (t_max - t_min)
    return zo_ans
    
zo_ans = min_max(t)
print(zo_ans)
# tensor([[0.0000, 0.0000],
#         [0.2000, 0.2000],
#         [0.4000, 0.4000],
#         [0.6000, 0.6000],
#         [0.8000, 0.8000],
#         [1.0000, 1.0000]])

1.2 Z-Score Normalization

Unlike 0-1 normalization, Z-Score normalization uses the mean and standard deviation of the original data. It also works column by column: each value has its column's mean subtracted and is then divided by the column's standard deviation:

$$x' = \frac{x - \mu}{\sigma}$$

After this processing, the data becomes zero-centered, and if the original data follows a normal distribution, it will follow the standard normal distribution after Z-Score processing.

def z_score(t):
    # standard deviation and mean of each column
    std = t.std(0)
    mean = t.mean(0)
    # subtract the column mean, divide by the column std
    ans = (t - mean) / std
    return ans 
    
zs_ans = z_score(t)
print(zs_ans)
# tensor([[-1.3363, -1.3363],
#         [-0.8018, -0.8018],
#         [-0.2673, -0.2673],
#         [ 0.2673,  0.2673],
#         [ 0.8018,  0.8018],
#         [ 1.3363,  1.3363]])

1.3 Limitations of Z-Score

Although Z-Score normalization can keep gradients stable to some extent, improving the model's convergence speed and sometimes even its final quality, it only modifies the "initial values". As the number of iterations grows, the zero-centering it established is gradually destroyed, until the condition disappears entirely.

This problem can also be seen as a limitation of classical machine learning normalization methods when applied to deep neural networks.
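To see the limitation concretely, here is a small sketch (not from the original article; the random layer and data sizes are made up for illustration): a Z-Score-normalized input loses its zero-centering after passing through just one randomly initialized layer.

import torch

torch.manual_seed(0)
X = torch.randn(100, 4)
X = (X - X.mean(0)) / X.std(0)   # Z-Score: each column now has mean ~0
print(X.mean(0))                 # close to zero in every column

W = torch.randn(4, 4)
h = torch.sigmoid(X @ W)         # one hidden layer with sigmoid activation
print(h.mean(0))                 # roughly 0.5 per column: zero-centering is gone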

2. The idea of deep learning normalization

Let's review the gradient formula for the third layer of a three-layer neural network (every layer follows the same principle; the third layer is shown because its formula is the simplest). With the layer's input written as $X_3$, its weights as $w_3$, and the activation as $F(\cdot)$, the gradient with respect to $w_3$ is

$$\frac{\partial L}{\partial w_3} = X_3^{\top}\left(\frac{\partial L}{\partial \hat{y}} \odot F'(X_3 w_3)\right)$$

It can be seen that the gradient is affected by three things: the activation function $F(\cdot)$, the layer's input data $X$, and the parameters $w$.

2.1 Parameters

I introduced the Glorot condition in a previous article: by cleverly setting the initial values of the parameters, the gradient of each layer is kept as smooth as possible during training.

However, the parameters are special: they are the very goal of learning, so we can only set their initial values. Once the model starts iterating, the parameters are adjusted "out of our control", and an initial-value setting cannot keep the gradient stable for long. This is the same problem that Z-Score has when it only initializes the data.
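As a reminder, this is roughly what Glorot (Xavier) initialization looks like in PyTorch; the layer sizes below are arbitrary, and the snippet only fixes the starting point, which is exactly the limitation described above.

import torch.nn as nn

layer = nn.Linear(256, 128)               # arbitrary illustrative sizes
nn.init.xavier_uniform_(layer.weight)     # variance scaled by fan_in + fan_out
nn.init.zeros_(layer.bias)
# Once training starts, these carefully chosen values are overwritten
# by gradient updates, so the Glorot condition is not maintained.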

2.2 Activation function

When discussing vanishing and exploding gradients, it was mentioned that the sigmoid and tanh activation functions easily cause gradients to vanish or explode, which is why ReLU was invented: when the input is greater than 0, its gradient is a constant 1, keeping gradients balanced as a complex model updates.

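A quick sketch (added here for illustration) of why ReLU helps: sigmoid's gradient shrinks toward zero as |x| grows, while ReLU's gradient is exactly 1 for any positive input.

import torch

x = torch.tensor([-5., -1., 0.5, 5.], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)   # ~tensor([0.0066, 0.1966, 0.2350, 0.0066]): saturates at both ends

x.grad = None
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.]): constant 1 wherever x > 0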

2.3 Input

At this point, the input X is the only remaining way to ensure gradient stability. Currently, the most widely used data normalization method, and the one with the best proven practical results, is Batch Normalization [1]. By modifying the data distribution within each batch, it improves the stability of each layer's gradients, and with that the model's learning efficiency and training results.
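A minimal sketch of Batch Normalization using PyTorch's built-in layer (the batch and feature sizes are illustrative):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)      # one scale/shift pair per feature
x = torch.randn(32, 4) * 10 + 3          # a batch whose columns are shifted and scaled
with torch.no_grad():
    y = bn(x)                            # normalized per column over this batch
print(y.mean(0))                         # ~0 for every feature
print(y.std(0, unbiased=False))          # ~1 for every feature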

2.4 Is it just multiple uses of Z-Score?

Batch Normalization differs from classical machine learning normalization in two main ways:

  • The purpose is different:

Classical machine learning normalization mainly eliminates the influence of different feature scales (dimensions) by adjusting the distribution of each column, and not all machine learning models require normalized data.

The goal of deep learning normalization is to keep gradients stable so the model can be trained efficiently; it is a necessary optimization method applicable to essentially all deep models.

  • The calculation methods are different:

Z-Score computes each column's mean and standard deviation once, over the entire dataset. Batch Normalization instead recomputes statistics on every mini-batch, at every normalized layer, on every training step (switching to accumulated running averages at inference), and it additionally learns an affine rescaling through the parameters gamma and beta, as the sketch below illustrates.
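Both differences can be seen directly on an nn.BatchNorm1d layer; this snippet is illustrative, with arbitrary sizes:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
print(bn.weight, bn.bias)       # learnable gamma (initialized to 1) and beta (to 0)

bn.train()
_ = bn(torch.randn(16, 3))      # uses this mini-batch's mean/var, updates running stats
bn.eval()
_ = bn(torch.randn(16, 3))      # uses the accumulated running mean/var instead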
