[Pytorch] Learning Record (2) Regression and Gradient Descent

(1) Regression (relationship fitting)

To start, define a class Net that inherits from torch.nn.Module. The class needs an __init__ method and a forward method; almost every PyTorch model follows this pattern. The following code builds the simplest neural network:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)   # 100 points in [-1, 1], shape (100, 1)
y = x.pow(2) + 0.2*torch.rand(x.size())                  # y = x^2 plus some random noise
x, y = Variable(x), Variable(y)            # prepare the data (Variable is kept for compatibility; plain tensors also work in modern PyTorch)

class Net(torch.nn.Module):
    def __init__(self, n_features, n_hidden, n_output):     # input features, hidden units, output units
        super(Net, self).__init__()        # this line is always written like this
        self.hidden = torch.nn.Linear(n_features, n_hidden)
                                           # named hidden: a single linear layer (the hidden layer)
        self.predict = torch.nn.Linear(n_hidden, n_output)
                                           # the prediction layer; n_hidden is the number of hidden neurons

    def forward(self, x):                  # this is where the network is actually wired together; x is the input data
        x = F.relu(self.hidden(x))         # pass x through hidden, then activate it with ReLU
        x = self.predict(x)                # x from the previous line goes through predict and is output directly
        return x

net = Net(1, 10, 1)                        # instantiate: 1 input feature, 10 hidden neurons, 1 output
print(net)

So far, the neural network has been built. Next, set up the optimizer and the loss function, and then train the network:

optimizer = torch.optim.SGD(net.parameters(), lr=0.5)    # lr is the learning rate, usually less than 1
loss_func = torch.nn.MSELoss()       # the loss function; MSE is enough for a regression problem

# ----------------- start training ----------------- #
for i in range(100):
    prediction = net(x)              # the prediction at each step
    loss = loss_func(prediction, y)  # y is the ground truth, prediction is the predicted value

    optimizer.zero_grad()            # reset the gradients to zero so previous steps do not interfere
    loss.backward()                  # backpropagate the loss
    optimizer.step()                 # let the optimizer update the parameters with learning rate 0.5
    

The code above does not include the visualization part; visualization will be explained in a separate article. Splicing the two pieces of code together gives a complete model-training framework.

Figure 1 Visualization sample diagram
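
As a quick sanity check once the two pieces have been spliced together, the trained network can be queried on a new input. The snippet below is a minimal sketch that assumes net, x, y and loss_func from the code above are already defined and the training loop has finished; the printed values will vary from run to run because the training data are randomly noised.

with torch.no_grad():                        # no gradients are needed for inference
    x_new = torch.tensor([[0.5]])            # a single new input, shape (1, 1)
    print('prediction at x=0.5:', net(x_new).item())
    print('final training loss:', loss_func(net(x), y).item())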

In addition, the complete code collection contains another scheme based on the same principle: it sweeps the weight w over a range of values and records the MSE for each one.

import numpy as np
import matplotlib.pyplot as plt

x_data = [1.0, 25.0, 36.0]
y_data = [3.1, 58.0, 73.0]


def forward(x):                      # linear model y_hat = x * w
    return x * w


def loss(x, y):                      # squared error for a single sample
    y_pred = forward(x)
    return (y_pred - y) * (y_pred - y)


w_list = []
mse_list = []
for w in np.arange(0.0, 4.1, 0.1):   # sweep the weight w from 0.0 to 4.0
    print('w=', w)
    l_sum = 0
    for x_val, y_val in zip(x_data, y_data):
        y_pred_val = forward(x_val)
        loss_val = loss(x_val, y_val)
        l_sum += loss_val
        print('\t', x_val, y_pred_val, loss_val)
    print('MSE=', l_sum / 3)
    w_list.append(w)
    mse_list.append(l_sum / 3)

plt.plot(w_list, mse_list)           # plot MSE as a function of w
plt.show()

Plotting the result gives the following figure:

Figure 2 Plot of the loss (MSE versus w)

(2) Re-exploring gradient descent

Generally speaking, a curve as well-behaved as the one in Figure 2 is rarely encountered. In most cases there is more than one dimension and more than one weight. For example, if the model is \hat{y}(w_1, w_2, x) with two weights, an exhaustive linear search is no longer practical; it would be far too expensive. One alternative is divide and conquer. With two weights w1 and w2, first do a relatively sparse search, taking, say, 4 candidate values for w1 and 4 for w2, which gives 16 grid points in total. Find the grid point with the smallest loss among these 16; it marks the approximate region we are looking for. Then run another 4×4 search inside that region (the green frame in Figure 3), so after a few rounds the search range shrinks considerably.

Figure 3 Divide and conquer
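
To make the divide-and-conquer idea concrete, here is a small sketch of a coarse-to-fine 4×4 grid search over two weights. The loss surface, the search box and the number of refinement rounds are illustrative assumptions, not code from the original post.

import numpy as np

def cost(w1, w2):                            # hypothetical loss surface, only for illustration
    return (w1 - 1.3) ** 2 + (w2 + 0.7) ** 2

lo1, hi1, lo2, hi2 = -4.0, 4.0, -4.0, 4.0    # initial search box for (w1, w2)
for round_ in range(3):                      # a few rounds of 4x4 refinement
    w1_grid = np.linspace(lo1, hi1, 4)
    w2_grid = np.linspace(lo2, hi2, 4)
    # evaluate all 16 grid points and keep the one with the smallest loss
    _, w1_best, w2_best = min((cost(a, b), a, b) for a in w1_grid for b in w2_grid)
    # shrink the box around the current best point (one grid spacing on each side)
    step1, step2 = (hi1 - lo1) / 3, (hi2 - lo2) / 3
    lo1, hi1 = w1_best - step1, w1_best + step1
    lo2, hi2 = w2_best - step2, w2_best + step2
    print('round', round_, 'best (w1, w2) =', (round(w1_best, 2), round(w2_best, 2)))

As the text notes, if the coarse grid skips over the basin that contains the global minimum, this procedure can lock onto the wrong region.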

But this method is not good, because with a fixed starting grid it may miss the global minimum and settle on a local one. That leads to the algorithm discussed in this section: gradient descent. Gradient descent is based on differentiation: take the derivative of the loss function with respect to the weight and move in the direction of -\frac{\partial cost}{\partial \omega}. The negative sign is there so that we descend along the direction opposite to the slope. The complete update rule of gradient descent is therefore \omega = \omega - \alpha\frac{\partial cost}{\partial \omega}, where ω is the weight, cost is the loss function, and α is the learning rate, which is usually kept small so that each step only moves a small amount.
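
For the linear model \hat{y} = x\cdot\omega used in the code below, with the mean-squared-error cost, the gradient can be written out explicitly; this is exactly what the gradient() function in the next code block computes:

cost(\omega) = \frac{1}{N}\sum_{n=1}^{N}(x_n\cdot\omega - y_n)^2 ,\qquad \frac{\partial cost}{\partial \omega} = \frac{1}{N}\sum_{n=1}^{N}2x_n\cdot(x_n\cdot\omega - y_n)

so one update step is \omega = \omega - \alpha\cdot\frac{1}{N}\sum_{n=1}^{N}2x_n\cdot(x_n\cdot\omega - y_n).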

Similarly, the gradient descent algorithm has its own flaws. First of all, gradient descent can get stuck at local minima and saddle points. Local minima are not too hard to deal with, but saddle points are harder to escape: at a saddle point the gradient is zero, so the update becomes \omega = \omega - \alpha\cdot 0 = \omega and the iteration stops making progress.

Figure 4 Saddle point

 

x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]
w = 1.0                              # initial guess for the weight


def forward(x):                      # linear model y_hat = x * w
    return x * w


def cost(xs, ys):                    # mean squared error over the whole data set
    cost = 0
    for x, y in zip(xs, ys):
        y_pred = forward(x)
        cost += (y_pred - y) ** 2
    return cost / len(xs)


def gradient(xs, ys):                # d(cost)/dw = mean of 2*x*(x*w - y)
    grad = 0
    for x, y in zip(xs, ys):
        grad += 2 * x * (x * w - y)
    return grad / len(xs)


print('Predict (before training)', 4, forward(4))
for epoch in range(100):
    cost_val = cost(x_data, y_data)
    grad_val = gradient(x_data, y_data)
    w -= 0.01 * grad_val             # update rule: w = w - alpha * gradient, with alpha = 0.01
    print('Epoch:', epoch, 'w={:.2f}'.format(w), 'loss =', cost_val)
print('Predict (after training)', 4, forward(4))

This code uses the gradient descent algorithm and is easier to interpret than the scheme in (1). In actual training, the loss curve can be very rough, with a lot of jitter; to obtain a smooth curve, an exponentially weighted moving average is applied to the recorded loss values.
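
A minimal sketch of that smoothing, assuming the per-epoch loss values have been collected into a list and using an illustrative smoothing factor beta = 0.9 (neither is from the original post):

def smooth(losses, beta=0.9):
    # exponentially weighted moving average of a recorded loss curve
    smoothed, running = [], losses[0]
    for l in losses:
        running = beta * running + (1 - beta) * l   # mix the old average with the new value
        smoothed.append(running)
    return smoothed

# e.g. plt.plot(smooth(loss_history)) instead of plt.plot(loss_history)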

In deep learning applications, plain (batch) gradient descent is not used that much; stochastic gradient descent is used far more often, and one reason for choosing it is that it helps with the saddle point problem. Because each update is computed from a single randomly drawn sample, the updates carry random noise; even if the weights land on a saddle point, that noise can push them off it, so the descent can continue.
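
As an illustration, the batch gradient descent code above can be turned into stochastic gradient descent by updating w after every single sample instead of after the whole data set. The sketch below assumes the same x_data, y_data and forward() as above and keeps the learning rate at 0.01; it is one common way to write SGD, not the only one.

import random

def sgd_gradient(x, y):                  # gradient of the single-sample loss (x*w - y)^2
    return 2 * x * (x * w - y)

w = 1.0
data = list(zip(x_data, y_data))
for epoch in range(100):
    random.shuffle(data)                 # visit the samples in a random order each epoch
    for x_val, y_val in data:
        w -= 0.01 * sgd_gradient(x_val, y_val)   # update w from one sample at a time
    print('Epoch:', epoch, 'w={:.2f}'.format(w))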


Origin blog.csdn.net/m0_55080712/article/details/122831135