[Recommended Collection] 30,000-word detailed explanation of TensorFlow deep learning essential knowledge points (Part 2)

Hello classmates. I previously shared [Recommended Collection] 30,000-word detailed explanation of TensorFlow deep learning essential knowledge points (Part 1) with you.

Today we continue with the essential knowledge points for TensorFlow deep learning. The directory is as follows:

1. metrics performance indicators
2. compile model configuration
3. fit model training
4. evaluate model evaluation
5. predict prediction
6. custom network

1. Metrics performance indicators

Weighted mean: tf.keras.metrics.Mean
Accuracy of predicted vs. true values: tf.keras.metrics.Accuracy

1.1 Create a new metrics indicator

The accuracy metric metrics.Accuracy() is generally used to track the accuracy of predictions on the test set, while the weighted mean metric metrics.Mean() is generally used to track the average loss on the training set.

# 新建准确度指标
acc_meter = metrics.Accuracy() 
# 新建平均值指标
mean_meter = metrics.Mean()  

1.2 Add data to metrics

Add data with update_state(). In each iteration, the true values and predicted values of the test data are added to the accuracy metric; the accuracy is accumulated in a buffer and can be retrieved when needed. Likewise, the loss produced by each training step is added to the mean metric, which computes a weighted average over all added values (sample_weight specifies the weight of each item); the result is also kept in the buffer until it is retrieved.

# 计算真实值和预测值之间的准确度
acc_meter.update_state(y_true, predict) 
# 计算平均损失
mean_meter.update_state(loss, sample_weight=None)

1.3 Get data from metrics

Get the data with result().numpy(). result() returns a tensor; .numpy() converts it to a numpy value.

# 取出准确率
acc_meter.result().numpy() 
# 取出训练集的损失值的均值
mean_meter.result().numpy()

1.4 Clear the cache

Clear the cache with reset_states(). The data from previous iterations stays in the buffer; before starting the next round of statistics, empty the buffer so the metric accumulates data from scratch.

# 清空准确率的缓存
acc_meter.reset_states()
# 清空加权均值的缓存
mean_meter.reset_states()
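
As a minimal sketch of how the three steps above fit together in a custom training loop (network, optimizer, and the datasets train_db / test_db are assumed to be defined already, with one-hot encoded labels):

# 示意代码:在自定义训练循环中使用metrics(network、optimizer、train_db、test_db为假设已定义)
acc_meter = metrics.Accuracy()
mean_meter = metrics.Mean()

for step, (x, y) in enumerate(train_db):
    with tf.GradientTape() as tape:
        logits = network(x)          # 前向传播
        loss = tf.reduce_mean(
            tf.losses.categorical_crossentropy(y, logits, from_logits=True))
    grads = tape.gradient(loss, network.trainable_variables)
    optimizer.apply_gradients(zip(grads, network.trainable_variables))

    mean_meter.update_state(loss)    # 累计平均损失

    if step % 100 == 0:
        print(step, 'loss:', mean_meter.result().numpy())
        mean_meter.reset_states()    # 清空缓存,重新统计

        # 在测试集上统计准确率
        acc_meter.reset_states()
        for x_test, y_test in test_db:
            pred = tf.argmax(network(x_test), axis=1)
            acc_meter.update_state(tf.argmax(y_test, axis=1), pred)
        print(step, 'test acc:', acc_meter.result().numpy())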

2. compile model configuration

compile(optimizer, loss, metrics, loss_weights)

parameter settings:

optimizer: The optimizer used to configure the model. You can call the tf.keras.optimizers API to configure the optimizer required by the model.

loss: The loss function used to configure the model, you can call the defined loss function in the tf.losses API by name.

metrics: methods used to configure model evaluation, metrics during model training and testing, such as accuracy, mse, etc.

loss_weights: a list of floats, the loss weighting coefficients; the total loss is the weighted sum of all individual losses, and the number of elements corresponds 1:1 to the number of model outputs.

# 选择优化器Adam,loss为交叉熵损失,测试集评价指标accuracy
network.compile(optimizer=optimizers.Adam(lr=0.01), #学习率0.01
    loss = tf.losses.CategoricalCrossentropy(from_logits=True),
    metrics = ['accuracy'])
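
For a model with several outputs, loss_weights weights each output's loss in the total loss. A hedged sketch (multi_output_model and the choice of losses here are hypothetical):

# loss_weights用法示意(两个输出的模型multi_output_model为假设):总损失 = 1.0*loss_1 + 0.2*loss_2
multi_output_model.compile(
    optimizer=optimizers.Adam(lr=0.01),
    loss=[tf.losses.CategoricalCrossentropy(from_logits=True), tf.losses.MeanSquaredError()],
    loss_weights=[1.0, 0.2],
    metrics=['accuracy'])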

3. Fit model training

fit(x, y, batch_size, epochs, validation_split, validation_data, shuffle,validation_freq)

parameter:

x: The input data of the training set, which can be array or tensor type.

y: The target data of the training set, which can be array or tensor type.

batch_size: The size of each batch, the default is 32

epochs: the number of iterations

validation_split: the proportion of the training data to hold out as validation data, ranging from 0 to 1.

validation_data: the validation data (input features and targets). If validation_split has been configured, this parameter can be omitted; if both are configured, validation_data takes precedence and validation_split is ignored.

shuffle: whether to randomly shuffle the training data before each epoch. This parameter has no effect when steps_per_epoch is set to something other than None.

validation_freq: run validation once every this many training epochs.

# ds为包含输入特征及目标的数据集
network.fit(ds, epochs=20, validation_data=ds_val, validation_freq=2)
# validation_data给定测试集,validation_freq每多少次大循环做一次测试,测试时自动计算准确率
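
When the training data is passed in as arrays or tensors rather than a Dataset, batch_size, validation_split and shuffle can be used directly. A minimal sketch (x_train and y_train are assumed to be prepared numpy arrays):

# 数组输入时的写法示意(x_train、y_train为假设已准备好的numpy数组)
network.fit(x_train, y_train,
            batch_size=64,          # 每批64个样本
            epochs=10,              # 迭代10次
            validation_split=0.1,   # 从训练数据中切出10%做验证
            shuffle=True)           # 每个epoch前打乱训练数据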

4. evaluate model evaluation

evaluate(x, y, batch_size, sample_weight, steps)

Returns the loss and accuracy of the model and other related indicators

parameter:

x: input test set feature data

y: target data for the test set

batch_size: Integer or None. Number of samples per batch of computation. If not specified, batch_size defaults to 32. If the data is a dataset or generator, do not specify batch_size.

sample_weight: An optional Numpy array of weights for the test samples used to weight the loss function.

steps: Integer or None. The total number of steps (batches of samples) before declaring the evaluation round finished.
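
A minimal usage sketch, assuming network has already been compiled and trained and ds_test is a preprocessed test Dataset:

# evaluate用法示意(network已compile并训练,ds_test为预处理好的测试集)
loss, acc = network.evaluate(ds_test)
print('test loss:', loss, 'test accuracy:', acc)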

5. predict prediction

predict(x, batch_size, steps)

parameter:

x: numpy array or tensor. The feature data to predict on.

batch_size: the number of samples per batch. If not specified, batch_size defaults to 32.

steps: Integer or None, the total number of steps (batches of samples) before the prediction round is declared complete.

For example:

sample = next(iter(ds_pred)) # 每次从验证数据中取出一组batch
x = sample[0] # x 保存第0组验证集特征值
pred = network.predict(x)  # 获取每一个分类的预测结果
pred = tf.argmax(pred, axis=1) # 获取值最大的所在的下标即预测分类的结果
print(pred)

6. Sequential

The Sequential model is suitable for simply stacking network layers, that is, each layer has only one input and one output.

# ==1== 设置全连接层
# [b,784]=>[b,256]=>[b,128]=>[b,64]=>[b,32]=>[b,10],中间层一般从大到小降维
network = Sequential([
    layers.Dense(256, activation='relu'), #第一个连接层,输出256个特征
    layers.Dense(128, activation='relu'), #第二个连接层
    layers.Dense(64, activation='relu'), #第三个连接层
    layers.Dense(32, activation='relu'), #第四个连接层
    layers.Dense(10), #最后一层不需要激活函数,输出10个分类
    ])
# ==2== 设置输入层维度
network.build(input_shape=[None, 28*28])
# ==3== 查看网络结构
network.summary()
# ==4== 查看网络的所有权重和偏置
network.trainable_variables
# ==5== 前向传播:调用network(x)会自动把x从第一层传到最后一层(内部调用call方法)
out = network(x)

7. Custom layers to build networks

Subclass tf.keras.Model and define your own forward-propagation model: create the layers in the __init__ method and set them as attributes of the class instance, then define the forward pass in the call method.

# 自定义Dense层
class MyDense(layers.Layer): #必须继承layers.Layer层,放到sequential容器中
    # 初始化方法
    def __init__(self, input_dim, output_dim):
        super(MyDense, self).__init__() # 调用母类初始化,必须
        
        # 自己发挥'w''b'指定名字没什么用,创建shape为[input_dim, output_dim]的权重
        # 使用add_variable创建变量
        self.kernel = self.add_variable('w', [input_dim, output_dim])
        self.bias = self.add_variable('b', [output_dim])
    
    # call方法,training来指示现在是训练还是测试
    def call(self, inputs, training=None):
        out = inputs @ self.kernel + self.bias
        return out


# 自定义层来创建网络
class MyModel(keras.Model):  # 必须继承keras.Model大类,才能使用complie、fit等功能
    # 
    def __init__(self):
        super(MyModel, self).__init__() # 调用父类keras.Model的初始化
        # 使用自定义层创建5层
        self.fc1 = MyDense(28*28,256) #input_dim=784,output_dim=256
        self.fc2 = MyDense(256,128)
        self.fc3 = MyDense(128,64)
        self.fc4 = MyDense(64,32)
        self.fc5 = MyDense(32,10)

    def call(self, inputs, training=None):
        # x从输入层到输出层
        x = self.fc1(inputs)
        x = tf.nn.relu(x)
        x = self.fc2(x)
        x = tf.nn.relu(x)        
        x = self.fc3(x)
        x = tf.nn.relu(x)
        x = self.fc4(x)
        x = tf.nn.relu(x)
        x = self.fc5(x) #logits层
        return x
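
A minimal sketch of putting the custom network to use (ds and ds_val are assumed to be preprocessed training and validation Datasets, as in the earlier sections):

# 自定义网络的使用示意(ds、ds_val为假设已预处理好的数据集)
network = MyModel()
network.compile(optimizer=optimizers.Adam(lr=0.01),
                loss=tf.losses.CategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
network.fit(ds, epochs=5, validation_data=ds_val, validation_freq=2)
network.evaluate(ds_val)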

Hello classmates. Today I will share the cross-validation and regularization methods in TensorFlow 2.0 deep learning, and finish with a small custom-network example.

Cross-validation, regularization, custom networks

1. Cross-validation

Cross-validation mainly prevents the overfitting caused by overly complex models and finds the parameters that give the model the best generalization ability. We divide the data into a training set, a validation set, and a test set. The training set is fed into the network as learning samples; the validation set is used to evaluate the model during the iterative process and find the optimal solution; the test set is used to evaluate the model after the entire network has finished training.

K-fold cross-validation divides the training data into K equal parts. In each round, one part is used as validation data and the remaining K-1 parts are used as training data; a different part is chosen as the validation data in each of the K rounds, and the K experimental results are finally averaged. A minimal index-splitting sketch is shown below.
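
A minimal sketch of the K-fold index split (K=5 and N=50000 are assumed values; x and y stand for the full training features and targets):

# K折交叉验证的索引划分示意(K=5、N=50000为假设值)
import numpy as np

N, K = 50000, 5
idx = np.random.permutation(N)      # 打乱索引
folds = np.array_split(idx, K)      # 均分成K份

for k in range(K):
    val_idx = folds[k]                                                  # 第k份做验证集
    train_idx = np.concatenate([folds[i] for i in range(K) if i != k])  # 其余K-1份做训练集
    x_train, y_train = tf.gather(x, train_idx), tf.gather(y, train_idx)
    x_val, y_val = tf.gather(x, val_idx), tf.gather(y, val_idx)
    # 在(x_train, y_train)上训练,在(x_val, y_val)上验证,最后对K次结果取平均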

Division method

(1) Division when constructing a dataset

First import the training set (x, y) and the test set (x_test, y_test). The cross-validation split is applied to the training set: 500 iterations are specified, and in each iteration part of the training set is taken as the validation data ds_val while the rest is used as the training data ds_train. tf.random.shuffle() randomly shuffles the index order without breaking the correspondence between x and y, and tf.gather() picks values by index.

# 以手写数字为例,获取训练集和测试集
(x,y),(x_test,y_test) = datasets.mnist.load_data()

# 预处理函数
def processing(x,y): 
    # 从[0,255]=>[-1,1]
    x = 2 * tf.cast(x, dtype=tf.float32) / 255.0 - 1
    y = tf.cast(y, dtype=tf.int32)
    return(x,y)

# 交叉验证K=500
for epoch in range(500):

    idx = tf.range(60000) # 假设training数据一共有60k张图象,生成索引
    idx = tf.random.shuffle(idx) # 随机打乱索引
    
    # 利用随机打散的索引来收集数据,不改变xy之间的关联
    x_train, y_train = tf.gather(x, idx[:50000]), tf.gather(y, idx[:50000])
    x_val, y_val = tf.gather(x, idx[-10000:]), tf.gather(y, idx[-10000:])
    
    # 构建训练集
    ds_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))  # 自动将输入的xy转变成tensor类型
    ds_train = ds_train.map(processing).shuffle(10000).batch(128) # 对数据集中的所有数据使用预处理函数
    
    # 构建验证集
    ds_val = tf.data.Dataset.from_tensor_slices((x_val, y_val))  
    ds_val = ds_val.map(processing).batch(128) # 每次迭代取128组数据,验证不需要打乱数据

(2) Use the parameter division in the training function fit()

If constructing the data set as above is too much trouble, you can instead specify validation_split=0.1 in the training function fit(). Each fit call then holds out 0.1 of the training data as the validation set and uses the rest as the training set. ds_train_val must be the training data that has not yet been split (note that validation_split only works when the inputs are arrays or tensors, not a tf.data.Dataset). In this case there is no need to specify validation_data; the validation data is generated automatically by the split.

# ds_train_val指没有划分过的train和val数据集,validation_split=0.1动态切割,0.1比例的数据分给val
network.fit(ds_train_val, epochs=6, validation_split=0.1, validation_freq=2)
# 不需要再指定validation_data,已经在被包含在validation_split中了

Use the validation set during training to see when the model performs best, and stop the loop early once the best point is found. When selecting model parameters on the validation set, save the weights corresponding to the minimum validation error; if the error measured later is larger, fall back to those saved weights. A sketch with an early-stopping callback is shown below.
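
One common way to implement this is Keras's EarlyStopping callback; a hedged sketch (ds_train and ds_val are assumed preprocessed Datasets, and restore_best_weights=True restores the weights from the epoch with the lowest validation error):

# 提前停止示意(ds_train、ds_val为假设已构建好的数据集)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',        # 监控验证集损失
                                           patience=3,                # 连续3个epoch不下降就停止
                                           restore_best_weights=True) # 恢复验证误差最小时的权重
network.fit(ds_train, epochs=100, validation_data=ds_val, callbacks=[early_stop])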

2. Regularization

When a more complex model is used to fit the data, it easily overfits, which reduces the model's generalization ability. Adding a regularization term to the model limits its complexity and strikes a balance between complexity and performance.

L1 regularization adds the absolute values of the weight parameters to the original loss function. L1 can produce exactly-zero solutions, i.e. it yields sparse solutions.

J(\theta) = J(w,x,y) + \lambda \sum_{i=1}^{n} \left| w_{i} \right|

L2 regularization adds the sum of squares of the weight parameters to the original loss function. L2 produces solutions that approach but do not reach zero, i.e. non-zero dense solutions.

J( \theta )= J(w,x,y)+\lambda \sum_{i=1}^{n} w_{i}^{2}

Specify the regularization with the kernel_regularizer parameter when building the network layer, using the two-norm method keras.regularizers.l2 with a penalty coefficient of 0.001.

# 使用二范数正则化,loss = loss + 0.001*regularizer,指定正则化的权重
model = keras.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001), activation=tf.nn.relu),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001), activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)])
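
The regularization losses added this way are collected in model.losses; in a custom training loop they have to be added to the total loss manually. A small sketch (the input shape [4, 20] is an assumed value, used only to build the model defined above):

# 查看正则化损失示意(输入形状[4, 20]为假设值,仅用于构建上面的模型)
x = tf.random.normal([4, 20])
y_pred = model(x)                        # 前向传播后,每个带kernel_regularizer的层贡献一项L2损失
print(model.losses)                      # 正则化损失列表
total_reg_loss = tf.add_n(model.losses)  # 自定义训练循环中需把它加到总损失上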

3. Customize the network

3.1 Data acquisition

First import the library files we need, import image data from the system, and divide the test set and training set.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import datasets, layers, optimizers, Sequential, metrics
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # 输出框只输出有意义的信息

#(1)数据获取
(x,y),(x_test,y_test) = datasets.cifar10.load_data() #获取图像分类数据
# 查看数据信息
print(f'x.shape: {x.shape}, y.shape: {y.shape}')  # 查看训练集的维度信息
print(f'x_test.shape: {x_test.shape}, y_test.shape: {y_test.shape}')  # 查看测试集的维度信息
print(f'y[:5]: {y[:5]}')  # 查看训练集目标的前5项
# 绘图展示
import matplotlib.pyplot as plt
for i in range(10): # 展示前10张图片
    plt.subplot(2,5,i+1)  # 2行5列第i+1个位置
    plt.imshow(x[i])
    plt.xticks([]) # 不显示x和y轴坐标刻度
    plt.yticks([])

# 输入的图像形状
# x.shape: (50000, 32, 32, 3), y.shape: (50000, 1)
# x_test.shape: (10000, 32, 32, 3), y_test.shape: (10000, 1)

The training images (the first 10 are displayed by the plotting code above) are not very clear themselves, and here we only cover the structure of a basic custom network, so the accuracy reaches at most about 80%; model optimization is left to the convolutional neural network chapter.

3.2 Data preprocessing

Since the imported target y has the two-dimensional shape [50k, 1], the axis=1 dimension needs to be squeezed out to give a one-dimensional vector of shape [50k]; use tf.squeeze() and specify the axis. The target values are then one-hot encoded: the value at the corresponding index becomes 1 and the values at all other indexes become 0, giving shape [b, 10]. Finally, the range of the feature values x is mapped to [-1, 1].

#(2)数据预处理
# 定义预处理函数
def processing(x,y): 
    # 由于目标数据是二维的,把shape=1的轴删除,从向量变成标量
    y = tf.squeeze(y)  # 默认压缩所有维度为1的轴,shape为[50k]
    y = tf.one_hot(y, depth=10) # one-hot编码,分成10个类别,shape为[50k,10],对应下标所在的值为1
    # 每个像素值的范围在[-1,1]之间,从[0,255]=>[-1,1]
    x = 2 * tf.cast(x, dtype=tf.float32) / 255.0 - 1
    y = tf.cast(y, dtype=tf.int32)
    return(x,y)

# 构建训练集数据集
ds_train = tf.data.Dataset.from_tensor_slices((x, y))  # 自动将输入的xy转变成tensor类型
ds_train = ds_train.map(processing).shuffle(10000).batch(128)  # 对数据集中的所有数据使用预处理函数,先打乱再分批

# 构建测试集数据集
ds_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))  
ds_test = ds_test.map(processing).batch(128) # 每次迭代取128组数据,测试不需要打乱数据

# 构造迭代器,查看数据集是否正确
sample = next(iter(ds_train))  # 每次运行从训练数据集中取出一组xy
print('x_batch.shape', sample[0].shape, 'y_batch.shape', sample[1].shape)
# x_batch.shape (128, 32, 32, 3)   y_batch.shape (128, 10)

3.3 Custom Network

#(3)构造网络
class MyDense(layers.Layer): #必须继承layers.Layer层,放到sequential容器中
    # 代替layers.Dense层
    def __init__(self, input_dim, output_dim):
        super(MyDense, self).__init__()   # 调用母类初始化,必须

        # 自己发挥'w''b'指定名字没什么用,创建shape为[input_dim, output_dim]的权重
        # 使用add_variable创建变量        
        self.kernel = self.add_variable('w',[input_dim, output_dim])
        self.bias = self.add_variable('b', [output_dim])

    # call方法,training来指示现在是训练还是测试         
    def call(self, inputs, training=None):
        
        x = inputs @ self.kernel + self.bias
        
        return x

# 自定义网络层
class MyNetwork(keras.Model):  # 必须继承keras.Model大类,才能使用complie、fit等功能
    
    def __init__(self):
        super(MyNetwork, self).__init__()  # 调用父类keras.Model的初始化
        # 新建五个层次
        self.fc1 = MyDense(32*32*3, 256)  #input_dim=3072,output_dim=256
        self.fc2 = MyDense(256, 128)
        self.fc3 = MyDense(128, 64)
        self.fc4 = MyDense(64, 32)        
        self.fc5 = MyDense(32, 10)
  
    def call(self, inputs, training=None):
        # 前向传播,可以接收四维的tensor
        x = tf.reshape(inputs, [-1,32*32*3]) # 改变输入特征的形状
        x = self.fc1(x) #第一层[b,32*32*3]==>[b,256]
        x = tf.nn.relu(x) #激活函数
        x = self.fc2(x)
        x = tf.nn.relu(x)
        x = self.fc3(x)
        x = tf.nn.relu(x)
        x = self.fc4(x)
        x = tf.nn.relu(x)
        x = self.fc5(x)  #logits层
        return x

3.4 Network Configuration

#(4)网络配置
network = MyNetwork()       
network.compile(optimizer = optimizers.Adam(lr=0.001),  # 指定优化器
                loss = tf.losses.CategoricalCrossentropy(from_logits=True), #交叉熵损失
                metrics = ['accuracy'])  # 测试指标     

#(5)网络训练,输入训练数据,循环5次,验证集为ds_test,每一次大循环做一次测试
network.fit(ds_train, epochs=5, validation_data=ds_test, validation_freq=1)

# 循环5次后的结果为
Epoch 5/5
391/391 [==============================] - 3s 8ms/step - loss: 1.2197 - accuracy: 0.5707 - val_loss: 1.3929 - val_accuracy: 0.5182

Learning rate decay strategy

This part shows how to use TensorFlow to build a polynomial learning rate decay strategy, a single-cycle cosine annealing decay strategy, and a multi-cycle cosine annealing decay strategy, and uses the Fashion-MNIST dataset to verify that the constructed methods work.

The custom learning rate classes created below all inherit from tf.keras.optimizers.schedules.LearningRateSchedule.

1. Polynomial decay

1.1 Method introduction

Polynomial learning rate decay has two cases, as shown in the figure below. First set the maximum and minimum values of the learning rate; the learning rate then decays from the highest point to the lowest point. (1) cycle==False: after reaching the minimum, the learning rate stays at the minimum value; (2) cycle==True: after reaching the minimum, the learning rate jumps back up to a new, higher value and starts to decay again, in a fixed-period pattern, rising again each time it reaches the minimum.

[Figure: polynomial learning rate decay, cycle=False vs. cycle=True]

(1) Attenuation formula for cycle==False

First determine whether the current step is within the decay period decay_period. If it is, let current_step = step, meaning the learning rate is still decaying; if it is not, i.e. the learning rate has already dropped to its minimum, let current_step = decay_period, meaning the decay process has ended.

Calculated as follows:

lr is the adjusted learning rate; initial_lr is the initial, i.e. maximum, learning rate; min_lr is the minimum learning rate; power is the power of the polynomial; the remaining symbols are as above.

lr = (initial_lr - min_lr) * (1 - current_step / decay_period) ** (power) + min_lr

(2) Attenuation formula for cycle==True

First determine which cycle the current step is in, using the formula below. current_period is the total step count of the cycle that the current step falls in; decay_period is the number of steps in one decay cycle; ceil means rounding up.

current_period = decay_period * ceil(step / decay_period)

The next step is to calculate the decayed learning rate. lr is the adjusted learning rate; initial_lr is the initial, i.e. maximum, learning rate; min_lr is the minimum learning rate; power is the power of the polynomial.

The term step / current_period in the formula is a number between 0 and 1. As step grows it gets closer and closer to the step count of the current cycle, so this term approaches 1 and the whole lr approaches min_lr.

lr = (initial_lr - min_lr) * (1 - step / current_period) ** (power) + min_lr

1.2 Code display

The cycle==True decay in the code differs slightly from the formula: a very small number keras.backend.epsilon(), infinitesimally close to 0, is added to the denominator current_period to prevent the denominator from being 0, which would make the learning rate infinite.

lr = (initial_lr - min_lr) * (1 - step / (current_period + keras.backend.epsilon())) ** (power) + min_lr

The custom class below is a learning rate scheduler that inherits from keras.optimizers.schedules.LearningRateSchedule. To show clearly how the learning rate changes during training, the learning rate is printed whenever the current step is an integer multiple of the externally specified print_step, and the list self.learning_rate_list stores the learning rate of every step so it can be inspected after training.

# ----------------------------------------------------------------------- #
# 学习率多项式衰减
# ----------------------------------------------------------------------- #
# eager模式防止graph报错
tf.config.experimental_run_functions_eagerly(True)
# ----------------------------------------------------------------------- #
# 继承自定义学习率的类
class PolynomialDecay(keras.optimizers.schedules.LearningRateSchedule):
    '''
    initial_lr: 初始的学习率
    decay_period: 一次多项式衰减的周期
    power: 多项式的幂
    min_lr: 学习率的最小值
    cycle: 是否进行多个多项式衰减
    print_step: 训练时多少个step打印一次学习率
    '''
    # 初始化
    def __init__(self, initial_lr, decay_period, power, min_lr, cycle, print_step):
        # 继承父类的初始化方法
        super(PolynomialDecay, self).__init__()
        
        # 属性分配
        self.initial_lr = tf.cast(initial_lr, dtype=tf.float32)
        self.decay_period = tf.cast(decay_period, dtype=tf.float32)
        self.power = power
        self.min_lr = tf.cast(min_lr, dtype=tf.float32)
        self.cycle = cycle
        self.print_step = print_step
        
        # 保存每个step的学习率
        self.learning_rate_list = []
        
        
    # 前向传播
    def __call__(self, step):
        
        #(1)学习率达到最低学习率后,就一直保持最低学习率
        if self.cycle is False:
            
            # 比较找出当前step是否超出了一个周期
            current_step = tf.where(step<self.decay_period, step, self.decay_period)
            
            # 计算衰减后的学习率
            decayed_learning_rate = (self.initial_lr - self.min_lr) *                            \
                                    (1 - current_step / self.decay_period) ** (self.power) +     \
                                    self.min_lr
            
            # 保存每个step的学习率
            self.learning_rate_list.append(decayed_learning_rate.numpy().item())
                        
            # 训练时每个epoch打印一次学习率
            if step % self.print_step == 0:
                # 打印当前epoch的学习率
                print('learning_rate has changed to: ', decayed_learning_rate.numpy().item())
                
            # 返回调整后的学习率
            return decayed_learning_rate


        #(2)学习率达到最低后,再上升一个较高的学习率再下降
        if self.cycle is True:
            
            # 计算目前处于第几个周期, tf.math.ceil向上取整
            current_period = self.decay_period * tf.math.ceil(step / self.decay_period)
            
            # 计算衰减后的学习率, 分母加上一个很小的数keras.backend.epsilon()防止分母为0
            decayed_learning_rate = (self.initial_lr - self.min_lr) *                                \
                                    (1 - step / (current_period + keras.backend.epsilon())) **       \
                                    (self.power) + self.min_lr
            
            
            # 保存每个step的学习率
            self.learning_rate_list.append(decayed_learning_rate.numpy().item())
                        
            
            # 训练时每个epoch打印一次学习率
            if step % self.print_step == 0:
                # 打印当前epoch的学习率
                print('learning_rate has changed to: ', decayed_learning_rate.numpy().item())


            return decayed_learning_rate
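
A minimal usage sketch (the parameter values here are assumptions): pass the custom schedule to the optimizer, and read learning_rate_list after training to plot the curve.

# 多项式衰减学习率的使用示意(参数取值为假设)
lr_schedule = PolynomialDecay(initial_lr=0.01, decay_period=1000, power=2,
                              min_lr=1e-6, cycle=True, print_step=100)
optimizer = keras.optimizers.Adam(lr_schedule)  # 训练时每个step调用一次__call__计算学习率
# 训练结束后可查看学习率变化曲线
# plt.plot(lr_schedule.learning_rate_list)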

2. Single-cycle cosine annealing decay

2.1 Method introduction

In the traditional training process, the strategy for setting the learning rate is often stepped or exponentially decaying. If a constant learning rate is used for training, the model will start to oscillate when it is close to the optimal solution, and it will not be able to reach the optimal solution at the lowest point of the loss function . Therefore, the decaying learning rate is used. In the vicinity of the optimal solution, the gradient gradually decreases, correspondingly reducing the learning rate, so that the model can smoothly converge to the correct desired position .

However, in the actual process, due to the complexity of the model, it is difficult to correctly describe the optimal solution position and the structure of the loss function , which makes the model often converge to a local optimal solution . Eventually, due to the decay of the learning rate, the model eventually falls into a local optimal solution instead of a global optimal solution.

The use of cosine annealing method for the learning rate of the training process is to continuously adjust the learning rate , and after it has decayed to a certain value, re-adjust the recovery learning rate, jump out of the current local optimal solution and re-find the global optimal solution .

The single-cycle cosine annealing image is as follows:

[Figure: single-cycle cosine annealing learning rate curve]

The cosine-curve part is computed with the formula below, where initial_lr is the maximum learning rate, min_lr is the minimum learning rate, warmup_step is the number of steps in the linear warm-up part, and total_step is the number of steps in one cycle.

lr = min_lr + 0.5 * (initial_lr - min_lr) * (1 + cos(pi * (step - warmup_step) / (total_step - warmup_step)))

The curve obtained from this formula is visualized below; the peak of the cosine curve sits at the end point of the linear warm-up part.

[Figure: linear warm-up followed by cosine decay]

The linear warm-up part is computed as below and can be understood as a line of the form y = kx + b. With warmup_step as the boundary, the left side is the linear rise and the right side is the cosine decay.

# 增长系数k
k = (initial_lr - min_lr) / warmup_step 
# 增长线段 y=kx+b
warmup = k * step + min_lr

2.2 Code display

The key formulas were explained above. The purpose of tf.where(step<self.warmup_step, warmup, decayed_learning_rate) is to take the linear part while the current step is still in the warm-up stage, and the cosine-decay part once the step passes the warm-up stage; warmup_step is thus the boundary between the two learning rate segments.

As before, the custom class inherits from keras.optimizers.schedules.LearningRateSchedule, prints the learning rate whenever the current step is an integer multiple of print_step, and stores the learning rate of every step in self.learning_rate_list for inspection after training.

# ----------------------------------------------------------------------- #
# 单周期余弦退火衰减
# ----------------------------------------------------------------------- #
# eager模式防止graph报错
tf.config.experimental_run_functions_eagerly(True)
# ------------------------------------------------ #
import math

# 继承自定义学习率的类
class CosineWarmupDecay(keras.optimizers.schedules.LearningRateSchedule):
    '''
    initial_lr: 初始的学习率, 即最大学习率
    min_lr: 学习率的最小值
    warmup_step: 线性上升部分需要的step
    total_step: 整个余弦退火需要对总step
    print_step: 多少个step打印一次学习率
    '''
    # 初始化
    def __init__(self, initial_lr, min_lr, warmup_step, total_step, print_step):
        # 继承父类的初始化方法
        super(CosineWarmupDecay, self).__init__()
        
        # 属性分配
        self.initial_lr = tf.cast(initial_lr, dtype=tf.float32)
        self.min_lr = tf.cast(min_lr, dtype=tf.float32)
        self.warmup_step = warmup_step
        self.total_step = total_step
        self.print_step = print_step
        
        # 保存训练过程中每个step的学习率
        self.learning_rate_list = []
        
        
    # 前向传播
    def __call__(self, step):
        
        # 余弦曲线计算公式
        decayed_learning_rate = self.min_lr + 0.5 * (self.initial_lr - self.min_lr) *       \
                                (1 + tf.math.cos(math.pi * (step-self.warmup_step) /        \
                                 (self.total_step-self.warmup_step)))
        
        # 线性上升线段计算公式
        # 增长系数k
        k = (self.initial_lr - self.min_lr) / self.warmup_step 
        # 增长线段 y=kx+b
        warmup = k * step + self.min_lr
        
        # 将余弦部分和增长线段组合,以warmup_step为界限
        decayed_learning_rate = tf.where(step<self.warmup_step, warmup, decayed_learning_rate)
        
        # 保存每个step的学习率
        self.learning_rate_list.append(decayed_learning_rate.numpy().item())
        
        # 训练时每个epoch打印一次学习率
        if step % self.print_step == 0:
            # 打印当前epoch的学习率
            print('learning_rate has changed to: ', decayed_learning_rate.numpy().item())
    
        # 返回更新后的学习率
        return decayed_learning_rate

3. Multi-cycle cosine annealing decay

3.1 Method introduction

Before looking at the multi-cycle version, make sure you understand the single-cycle version above.

This can be understood as a stochastic gradient descent algorithm with restart . When the network model is updated, since there are many local optimal solutions, the model will fall into the local optimal solution, that is, the optimization function has multiple peaks. This requires that when the model falls into a local optimal solution, it can jump out and continue to search for the next optimal solution until the global optimal solution is found . To make the model jump out of the local optimal solution, it is necessary to suddenly increase the learning rate when the model falls into the local optimal solution, that is, restart the learning rate .

The schematic diagram of the multi-cycle cosine annealing decay is as follows:

[Figure: multi-cycle cosine annealing learning rate curve]

The formula for multi-cycle cosine annealing is the same as for the single cycle; only small changes are needed in the code: a variable self.step is added, and an if condition is added in the __call__() method.

The idea is: if the current step reaches the end of a cycle, reset it to 0, restart from the linear warm-up, and lengthen both the warm-up segment and the whole cycle. If there is a better way, please point it out in the comments.

# ----------------------------------------------------------------------- #
# 多周期余弦退火衰减
# ----------------------------------------------------------------------- #
# eager模式防止graph报错
tf.config.experimental_run_functions_eagerly(True)
# ------------------------------------------------ #
import math

# 继承自定义学习率的类
class CosineWarmupDecay(keras.optimizers.schedules.LearningRateSchedule):
    '''
    initial_lr: 初始的学习率
    min_lr: 学习率的最小值
    max_lr: 学习率的最大值
    warmup_step: 线性上升部分需要的step
    total_step: 第一个余弦退火周期需要对总step
    multi: 下个周期相比于上个周期调整的倍率
    print_step: 多少个step并打印一次学习率
    '''
    # 初始化
    def __init__(self, initial_lr, min_lr, warmup_step, total_step, multi, print_step):
        # 继承父类的初始化方法
        super(CosineWarmupDecay, self).__init__()
        
        # 属性分配
        self.initial_lr = tf.cast(initial_lr, dtype=tf.float32)
        self.min_lr = tf.cast(min_lr, dtype=tf.float32)
        self.warmup_step = warmup_step  # 初始为第一个周期的线性段的step
        self.total_step = total_step    # 初始为第一个周期的总step
        self.multi = multi
        self.print_step = print_step
        
        # 保存每一个step的学习率
        self.learning_rate_list = []
        # 当前步长
        self.step = 0
        
        
    # 前向传播, 训练时传入当前step,但是上面已经定义了一个,这个step用不上
    def __call__(self, step):
        
        # 如果当前step达到了当前周期末端就调整
        if  self.step>=self.total_step:
            
            # 乘上倍率因子后会有小数,这里要注意
            # 调整一个周期中线性部分的step长度
            self.warmup_step = self.warmup_step * (1 + self.multi)
            # 调整一个周期的总step长度
            self.total_step = self.total_step * (1 + self.multi)
            
            # 重置step,从线性部分重新开始
            self.step = 0
            
        # 余弦部分的计算公式
        decayed_learning_rate = self.min_lr + 0.5 * (self.initial_lr - self.min_lr) *       \
                                (1 + tf.math.cos(math.pi * (self.step-self.warmup_step) /        \
                                  (self.total_step-self.warmup_step)))
        
        # 计算线性上升部分的增长系数k
        k = (self.initial_lr - self.min_lr) / self.warmup_step 
        # 线性增长线段 y=kx+b
        warmup = k * self.step + self.min_lr
        
        # 以学习率峰值点横坐标为界,左侧是线性上升,右侧是余弦下降
        decayed_learning_rate = tf.where(self.step<self.warmup_step, warmup, decayed_learning_rate)
        
        
        # 每个epoch打印一次学习率
        if step % self.print_step == 0:
            # 打印当前step的学习率
            print('learning_rate has changed to: ', decayed_learning_rate.numpy().item())
        
        # 每个step保存一次学习率
        self.learning_rate_list.append(decayed_learning_rate.numpy().item())

        # 计算完当前学习率后step加一用于下一次
        self.step = self.step + 1
        
        # 返回调整后的学习率
        return decayed_learning_rate

4. Practical verification

Let's take the Fashion-MNIST dataset as an example to verify that the multi-cycle cosine annealing learning rate decay defined above works. Preprocessing and network construction are fairly basic and not discussed here; we go straight to part (6) in the code below.

First instantiate our custom learning rate class with the necessary initialization parameters, cosinewarmupdecay = CosineWarmupDecay(...), then pass this learning rate schedule to the Adam optimizer, keras.optimizers.Adam(cosinewarmupdecay). During training, a current step value is passed into the class on every update; it computes the adjusted learning rate and returns it to the model.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# 调用GPU加速
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)


# ----------------------------------------------------------------------- #
# (1)fashion_mnist数据预加载及预处理
# ----------------------------------------------------------------------- #
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
print('x_train.shape:', x_train.shape, 'y_train.shape:', y_train.shape) # (60000, 28, 28) , (60000,)
print('x_test.shape:', x_test.shape)  # (10000, 28, 28)

# 记录训练集的数量
total_train_num = x_train.shape[0]


# ----------------------------------------------------------------------- #
# 学习率多周期余弦退火衰减
# ----------------------------------------------------------------------- #
# eager模式防止graph报错
tf.config.experimental_run_functions_eagerly(True)
# ------------------------------------------------ #
import math

# 继承自定义学习率的类
class CosineWarmupDecay(keras.optimizers.schedules.LearningRateSchedule):
    '''
    initial_lr: 初始的学习率
    min_lr: 学习率的最小值
    max_lr: 学习率的最大值
    warmup_step: 线性上升部分需要的step
    total_step: 第一个余弦退火周期需要对总step
    multi: 下个周期相比于上个周期调整的倍率
    print_step: 多少个step并打印一次学习率
    '''
    # 初始化
    def __init__(self, initial_lr, min_lr, warmup_step, total_step, multi, print_step):
        # 继承父类的初始化方法
        super(CosineWarmupDecay, self).__init__()
        
        # 属性分配
        self.initial_lr = tf.cast(initial_lr, dtype=tf.float32)
        self.min_lr = tf.cast(min_lr, dtype=tf.float32)
        self.warmup_step = warmup_step  # 初始为第一个周期的线性段的step
        self.total_step = total_step    # 初始为第一个周期的总step
        self.multi = multi
        self.print_step = print_step
        
        # 保存每一个step的学习率
        self.learning_rate_list = []
        # 当前步长
        self.step = 0
        
        
    # 前向传播, 训练时传入当前step,但是上面已经定义了一个,这个step用不上
    def __call__(self, step):
        
        # 如果当前step达到了当前周期末端就调整
        if  self.step>=self.total_step:
            
            # 乘上倍率因子后会有小数,这里要注意
            # 调整一个周期中线性部分的step长度
            self.warmup_step = self.warmup_step * (1 + self.multi)
            # 调整一个周期的总step长度
            self.total_step = self.total_step * (1 + self.multi)
            
            # 重置step,从线性部分重新开始
            self.step = 0
            
        # 余弦部分的计算公式
        decayed_learning_rate = self.min_lr + 0.5 * (self.initial_lr - self.min_lr) *       \
                                (1 + tf.math.cos(math.pi * (self.step-self.warmup_step) /        \
                                  (self.total_step-self.warmup_step)))
        
        # 计算线性上升部分的增长系数k
        k = (self.initial_lr - self.min_lr) / self.warmup_step 
        # 线性增长线段 y=kx+b
        warmup = k * self.step + self.min_lr
        
        # 以学习率峰值点横坐标为界,左侧是线性上升,右侧是余弦下降
        decayed_learning_rate = tf.where(self.step<self.warmup_step, warmup, decayed_learning_rate)
        
        
        # 每个epoch打印一次学习率
        if step % self.print_step == 0:
            # 打印当前step的学习率
            print('learning_rate has changed to: ', decayed_learning_rate.numpy().item())
        
        # 每个step保存一次学习率
        self.learning_rate_list.append(decayed_learning_rate.numpy().item())

        # 计算完当前学习率后step加一用于下一次
        self.step = self.step + 1
        
        # 返回调整后的学习率
        return decayed_learning_rate


# ----------------------------------------------------------------------- #
# (3)参数设置
# ----------------------------------------------------------------------- #
# 每个step处理多少张图像
batch_size = 32
# 迭代次数
num_epochs = 15
# 初始学习率
initial_lr = 0.001
# 学习率下降的最小值
min_lr = 1e-7
# 余弦退火的周期调整倍率
multi = 0.25

# 一个epoch包含多少个batch也是多少个steps, 即1875
one_epoch_batchs = int(total_train_num / batch_size)

# 第一个余弦退火周期需要的总step,以三个epoch为一个周期
total_step = one_epoch_batchs * 3

# 线性上升部分需要的step, 一个周期的四分之一的epoch用于线性上升
warmup_step = int(total_step * 0.25)

# 多少个step打印一次学习率, 一个epoch打印一次
print_step = one_epoch_batchs


# ----------------------------------------------------------------------- #
# (4)划分数据集
# ----------------------------------------------------------------------- #

# 预处理
def preprocessing(x, y):
    x = tf.cast(x, dtype=tf.float32) / 255.0  # 像素归一化
    x = tf.expand_dims(x, axis=-1)  # 增加通道维度
    y = tf.cast(y, dtype=tf.int32)  # 标签转为tensor类型
    return x,y

# 训练集
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)) 
train_ds = train_ds.map(preprocessing).batch(batch_size).shuffle(10000)
# 测试集
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_ds = test_ds.map(preprocessing).batch(batch_size)

# 迭代器查看数据是否正确
sample = next(iter(train_ds))  
print('x_batch:', sample[0].shape, 'y_batch:', sample[1].shape)  # (32, 28, 28, 1), (32,)


# ----------------------------------------------------------------------- #
# (5)网络构建
# ----------------------------------------------------------------------- #
inputs = keras.Input(sample[0].shape[1:])  # 构造输入层
# [28,28,1]==>[28,28,32]
x = layers.Conv2D(32, kernel_size=3, padding='same', activation='relu')(inputs)
# [28,28,32]==>[14,14,32]
x = layers.MaxPool2D(pool_size=(2,2), strides=2, padding='same')(x)
# [14,14,32]==>[14,14,64]
x = layers.Conv2D(64, kernel_size=3, padding='same', activation='relu')(x)
# [14,14,64]==>[7,7,64]
x = layers.MaxPool2D(pool_size=(2,2), strides=2, padding='same')(x)
# [7,7,64]==>[None,7*7*64]
x = layers.Flatten()(x)
# [None,7*7*64]==>[None,128]
x = layers.Dense(128)(x)
# [None,128]==>[None,10]
outputs = layers.Dense(10, activation='softmax')(x)
# 构建模型
model = keras.Model(inputs, outputs)


# ------------------------------------------------------------------ #
# (6)模型训练
# ------------------------------------------------------------------ #
# 接收学习率调整方法
cosinewarmupdecay = CosineWarmupDecay(initial_lr=initial_lr, # 初始学习率,即最大学习率
                                  min_lr=min_lr,             # 学习率下降的最小值
                                  warmup_step=warmup_step,   # 线性上升部分的step
                                  total_step=total_step,     # 训练的总step
                                  multi=multi,               # 周期调整的倍率
                                  print_step=print_step)     # 每个epoch打印一次学习率值


# 设置adam优化器,指定学习率
opt = keras.optimizers.Adam(cosinewarmupdecay)

# 网络编译
model.compile(optimizer=opt,   # 学习率
              loss='sparse_categorical_crossentropy',  # 损失
              metrics=['accuracy'])  # 监控指标

# 网络训练
model.fit(train_ds, epochs=num_epochs, validation_data=test_ds)

# 绘制学习率变化曲线
plt.plot(cosinewarmupdecay.learning_rate_list)
plt.xlabel("Train step")
plt.ylabel("Learning_Rate")
plt.title('cosinewarmupdecay')
plt.grid()
plt.show()

I set the learning rate to be printed once per epoch during the training process. The training process is as follows:

Epoch 1/15
learning_rate has changed to:  1.0000000116860974e-07
1875/1875 [==============================] - 27s 14ms/step - loss: 0.9364 - accuracy: 0.6849 - val_loss: 0.3792 - val_accuracy: 0.8629
Epoch 2/15
learning_rate has changed to:  0.0009698210633359849
1875/1875 [==============================] - 25s 13ms/step - loss: 0.3030 - accuracy: 0.8920 - val_loss: 0.2907 - val_accuracy: 0.8989
------------------------------------------------------------
------------------------------------------------------------
Epoch 14/15
learning_rate has changed to:  0.0009987982921302319
1875/1875 [==============================] - 29s 15ms/step - loss: 0.1430 - accuracy: 0.9470 - val_loss: 0.2871 - val_accuracy: 0.9107
Epoch 15/15
learning_rate has changed to:  0.0008539927075617015
1875/1875 [==============================] - 29s 15ms/step - loss: 0.1213 - accuracy: 0.9563 - val_loss: 0.2902 - val_accuracy: 0.9156

The current learning rate is saved once per training step into self.learning_rate_list. After training finishes, this list can be read via cosinewarmupdecay.learning_rate_list to plot the learning rate curve.

[Figure: learning rate curve over training steps for multi-cycle cosine annealing]

Comparison of the cosine annealing learning rate decay method with a traditional monotonically decaying learning rate:

[Figure: cosine annealing decay vs. traditional continuous decay]

Origin blog.csdn.net/m0_59596937/article/details/127217557