Machine Learning & Deep Learning - Implementing Linear Regression from Scratch

Although modern deep learning frameworks can automate almost all of the work below, implementing it from scratch gives us a better understanding of how things work, which makes it easier to customize models, layers, or loss functions later.

import random
import torch
from d2l import torch as d2l

generate dataset

Construct an artificial dataset from a linear model with noise. The task is to use this dataset to recover the parameters of the model. We use low-dimensional data, which allows for easier visualization.
In the code below, we generate a dataset with 1000 samples, each containing 2 features sampled from a standard normal distribution. Our dataset is a 1000×2 matrix X.
We use the linear model parameters w = [2, -3.4]^T, b = 4.2 and a noise term δ to generate the dataset and its labels:

y = Xw + b + δ
where δ can be regarded as the potential observation error in the measurements and labels. Here we adopt the standard assumption that δ follows a normal distribution with mean zero, and to simplify the problem we set its standard deviation to 0.01. The following code generates the synthetic dataset:

def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise"""
    # Draw a num_examples x len(w) (here 1000 x 2) dataset from the standard normal distribution
    X = torch.normal(0, 1, (num_examples, len(w)))
    # Compute the noiseless labels y = Xw + b (a length-1000 vector)
    y = torch.matmul(X, w) + b
    # Add Gaussian noise with mean 0 and standard deviation 0.01
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

Each row in features contains a two-dimensional data sample, and each row in labels contains a one-dimensional label value (a scalar).

print('features:', features[0], '\nlabel:', labels[0])

result:

features: tensor([-0.5829, -0.2094])
label: tensor([3.7491])

By plotting the second feature features[:, 1] against labels in a scatter plot, we can see the linear relationship between the two:

d2l.set_figsize()
d2l.plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), 1)
d2l.plt.show()

(Figure: scatter plot of features[:, 1] against labels)

read dataset

When training a model, we need to iterate over the dataset, taking a small batch of samples each time and using it to update the model. We therefore define a function that shuffles the samples in the dataset and fetches the data in mini-batches.
The data_iter function below takes the batch size, a feature matrix, and a label vector as input, and yields mini-batches of size batch_size, each containing a set of features and the corresponding labels.

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))  # indices 0 to 999, in order
    random.shuffle(indices)  # shuffle so that samples are read in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i+batch_size, num_examples)]  # the next batch of shuffled indices
        )
        yield features[batch_indices], labels[batch_indices]
        # yield returns a generator that can be iterated over in a for loop,
        # instead of returning all batches at once

In practice we take advantage of hardware parallelism to process mini-batches of reasonable size: the model can be evaluated on each sample in parallel, and the gradient of the loss with respect to each sample can also be computed in parallel.
To get an intuitive feel for mini-batch processing, read and print the first mini-batch of samples:

batch_size = 10
for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

result:

tensor([[-1.0186, 1.8338],
[ 0.6455, 1.1226],
[-0.5020, 0.2105],
[ 1.3583, 0.6979],
[ 0.3024, -0.8929],
[ 0.4045, -0.4207],
[ 0.5201, -0.3263],
[ 0.6037, -0.1332],
[ 1.6171, 0.2449],
[-0.6540, 1.0338]])
tensor([[-4.0795],
[ 1.6835],
[ 2.5014],
[ 4.5346],
[ 7.8678],
[ 6.4298],
[ 6.3537],
[ 5.8528],
[ 6.6194],
[-0.6216]])

As we iterate, we obtain different mini-batches one after another until the entire dataset has been traversed. However, the iterator implemented above is quite inefficient and may cause problems in practice. The built-in iterators provided by deep learning frameworks are considerably more efficient and can handle both data stored in files and data supplied by data streams.
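
For reference, here is a minimal sketch of that framework alternative (assuming the standard torch.utils.data API), which replaces the hand-written data_iter with a built-in iterator:

from torch.utils.data import TensorDataset, DataLoader

# Wrap the tensors in a dataset object; DataLoader then handles shuffling and batching.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=10, shuffle=True)

X, y = next(iter(loader))  # one mini-batch, same shapes as data_iter yields
print(X.shape, y.shape)    # torch.Size([10, 2]) torch.Size([10, 1])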

Initialize model parameters

Initialize the weights by sampling random numbers from a normal distribution with mean 0 and standard deviation 0.01, and initialize the bias to 0:

w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

After initializing the parameters, our task is to update them until they fit the data well enough. Each update requires the gradient of the loss function with respect to the model parameters; given this gradient, each parameter can be moved in the direction that reduces the loss.
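
As a toy illustration of where these gradients come from (a standalone autograd example with a made-up parameter w_toy, not part of the training code):

w_toy = torch.tensor([1.0], requires_grad=True)  # a single trainable parameter
loss_toy = ((3 * w_toy - 6) ** 2).sum()          # a toy scalar loss
loss_toy.backward()                              # populates w_toy.grad
print(w_toy.grad)  # tensor([-18.]): d/dw (3w - 6)^2 = 2*(3w - 6)*3 = -18 at w = 1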

define model

Defining the model means relating the model's inputs and parameters to its outputs.
To compute the output of the linear model, we only need the matrix-vector product of the input features X and the model weights w, plus the bias b. Xw is a vector and b is a scalar; when we add a scalar to a vector, the scalar is added to each component of the vector (the broadcasting mechanism):

def linreg(X, w, b):  #@save
    """线性回归模型"""
    return torch.matmul(X, w) + b
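
A quick toy example of this broadcasting (standalone demo tensors, not the training data): a bias of shape (1,) is added to every row of the (n, 1) product Xw.

X_demo = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # two samples, two features
w_demo = torch.tensor([[2.0], [-3.4]])           # weights of shape (2, 1)
b_demo = torch.tensor([4.2])                     # bias of shape (1,)
print(torch.matmul(X_demo, w_demo) + b_demo)     # tensor([[-0.6000], [-3.4000]]); b_demo is added to each row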

Define the loss function

To compute gradients of the loss, we naturally need to define the loss function first. The squared loss function is defined below:

def squared_loss(y_hat, y):  #@save
    """均方损失"""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
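
As a quick sanity check (toy tensors only), note that squared_loss returns one loss value per sample with the same shape as y_hat, which is why the training loop below sums it before calling backward():

y_hat_demo = torch.tensor([[2.5], [0.0]])
y_demo = torch.tensor([[3.0], [-0.5]])
print(squared_loss(y_hat_demo, y_demo))  # tensor([[0.1250], [0.1250]]), one value per sample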

Define the optimization algorithm

Linear regression has an analytical solution, but most other models do not, so we still optimize with stochastic gradient descent.
At each step, we use a mini-batch drawn at random from the dataset and compute the gradient of the loss with respect to the parameters. We then update the parameters in the direction that reduces the loss.
The stochastic gradient descent update function below takes a set of model parameters, a learning rate, and a batch size as input. The size of each update step is determined by the learning rate lr. Because the loss we compute is a sum over a batch of samples, we normalize the step size by the batch size batch_size, so that the step size does not depend on our choice of batch size:

def sgd(params, lr, batch_size):  #@save
    """小批量随机梯度下降"""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

Here torch.no_grad() is a context manager that disables gradient computation inside its block. When gradients are not needed, using it improves execution efficiency.
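
As noted above, linear regression also has an analytical solution. Here is a minimal sketch of recovering it directly as a sanity check (assuming a PyTorch version that provides torch.linalg.lstsq); the training loop below should arrive at roughly the same parameters:

# Append a column of ones so the bias is estimated together with the weights.
X_aug = torch.cat([features, torch.ones(features.shape[0], 1)], dim=1)
solution = torch.linalg.lstsq(X_aug, labels).solution  # shape (3, 1)
print(solution[:2].flatten())  # should be close to true_w = [2, -3.4]
print(solution[2])             # should be close to true_b = 4.2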

train

In each iteration, we read a small batch of training samples and run them through our model to obtain a set of predictions. After calculating the loss, we start backpropagation, storing the gradient of each parameter. Finally, the optimization algorithm sgd is called to update the model parameters.
In a nutshell, it is to execute the following cycle:
1. Initialize parameters
2. Repeat the training until it is completed:
   compute the gradient: g ← ∂_(w,b) (1/|B|) Σ_{i∈B} l(x^(i), y^(i), w, b)
   update the parameters: (w, b) ← (w, b) − ηg
In each epoch, we use the data_iter function to traverse the entire dataset, using every sample in the training set once (assuming the number of samples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyperparameters, set here to 3 and 0.03 respectively. (Setting hyperparameters is fiddly; we ignore the details for now.)

batch_size = 10
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # mini-batch loss on X and y
        # l has shape (batch_size, 1) rather than being a scalar, so sum its
        # elements before computing the gradients with respect to [w, b]
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # update the parameters using their gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

result:

epoch 1, loss 0.040672
epoch 2, loss 0.000146
epoch 3, loss 0.000047

The parameters learned by training are in fact very close to the true parameters:

print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')

result:

error in estimating w: tensor([0.0006, -0.0002], grad_fn=)
error in estimating b: tensor([0.0004], grad_fn=)

Origin blog.csdn.net/m0_52380556/article/details/131889305