Hands-on Deep Learning v2, 08: Linear regression + basic optimization algorithms

1. Linear Regression

Regression refers to a class of methods that model the relationship between one or more independent variables and a dependent variable. In the natural and social sciences, regression is often used to represent the relationship between inputs and outputs. Most tasks in machine learning involve prediction in some form.

Regression problems are involved when we want to predict a numerical value. Common examples include: predicting prices (houses, stocks, etc.), predicting length of stay (for hospitalized patients), predicting demand (retail sales), etc.

But not all predictions are regression problems. In later chapters, we will introduce the classification problem. The goal of a classification problem is to predict which of a set of classes the data belongs to.

1.1 Basic elements of linear regression

Linear regression is based on several simple assumptions: first, it is assumed that the relationship between the independent variable x and the dependent variable y is linear, i.e., y can be expressed as a weighted sum of the elements of x, usually allowing for some observation noise; second, we assume that any noise is well behaved, for example that it follows a normal distribution.

To explain linear regression, let's take a practical example: We want to estimate the price of a house (in dollars) based on its size (square feet) and age (years).

In order to develop a model that can predict house prices, we need to collect a real dataset. This dataset includes the sale price, size and age of houses.

In machine learning terminology, this dataset is called the training dataset or training set, and each row of data (here, the data corresponding to one house transaction) is called a sample, also known as a data point or data instance.

The target we are trying to predict (in this case the house price) is called the label or target.

The independent variables (area and age) on which the prediction is based are called features or covariates.

Typically, we use n to denote the number of samples in the dataset.

1.2 Linear Model

The linear assumption means that the target (house price) can be expressed as a weighted sum of the features (area and age):

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

$w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called the bias (also known as the offset or intercept).

The weights determine the influence of each feature on our predicted value.

Bias refers to what the predicted value should be when all features take the value 0. Without the bias term, the expressive power of our model would be limited.

Strictly speaking, the above formula is an affine transformation of the input features: a linear transformation of the features via the weighted sum, combined with a translation via the bias term. In vector form, $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$.

Given a dataset, our goal is to find the weights $\mathbf{w}$ and bias $b$ of the model such that the predictions made by the model roughly match the true prices in the data. The predicted output is determined by the affine transformation of the input features through the linear model, and the affine transformation is determined by the chosen weights and bias.

Before we can start looking for the best model parameters w and b, we need two more things:
(1) a measure of the quality of the model;
(2) a way to update the model to improve the quality of the model's predictions.

1.3 Loss function

The loss function quantifies the gap between the actual value of the target and the predicted value. We usually choose a non-negative number as the loss: the smaller the value, the smaller the loss, and a perfect prediction has a loss of 0. The most commonly used loss function in regression problems is the squared error.

For further explanation, consider the following example, in which we plot the regression problem for the one-dimensional case.

Because of the quadratic term in the squared error, large differences between the estimate $\hat{y}^{(i)}$ and the observation $y^{(i)}$ produce even larger losses. The squared loss for sample $i$ is

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

To measure the quality of the model on the entire dataset of $n$ samples, we take the mean of the losses over the training set:

$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b)$$

When training a model, we want to find a set of parameters (w,b) that minimizes the total loss over all training samples.

1.4 Analytical solution

Linear regression happens to be a very simple optimization problem. Unlike most of the other models we will cover in this book, the solution of linear regression can be expressed by a simple formula; this type of solution is called an analytical solution.

Simple problems like linear regression have analytical solutions, but most problems do not. Analytical solutions lend themselves to nice mathematical analysis, but their requirements are so restrictive that they rule out almost all of deep learning.
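For reference, if we absorb the bias into $\mathbf{w}$ by appending a column of ones to the design matrix $\mathbf{X}$, the analytical solution that minimizes the squared loss is $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$. Below is a minimal sketch of evaluating it in PyTorch; the data here is a synthetic placeholder, not part of the original text:

import torch

# Placeholder data: 1000 samples, 2 features (illustrative only).
X = torch.randn(1000, 2)
y = torch.matmul(X, torch.tensor([2.0, -3.4])) + 4.2

# Absorb the bias by appending a column of ones to X.
X_aug = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)

# Least squares is the numerically stable way to evaluate (X^T X)^{-1} X^T y.
w_star = torch.linalg.lstsq(X_aug, y.reshape(-1, 1)).solution
print(w_star)  # close to [2.0, -3.4, 4.2]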

1.5 Mini-batch stochastic gradient descent

Gradient descent, which can optimize almost all deep learning models, reduces the error by continuously updating the parameters in the direction that decreases the loss function.


The simplest use of gradient descent is to compute the derivative of the loss function (the mean of the losses over all samples in the dataset) with respect to the model parameters (this derivative can also be called the gradient here). In practice, however, this can be very slow: we must traverse the entire dataset before every single parameter update.

Therefore, we usually draw a random mini-batch of samples each time we need to compute an update; this variant is called mini-batch stochastic gradient descent.

In each iteration, we first randomly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training samples. Then, we compute the derivative (also called the gradient) of the mini-batch's average loss with respect to the model parameters. Finally, we multiply the gradient by a predetermined positive number $\eta$ and subtract the result from the current parameter values.

We express this update process with the following formula ($\partial$ denotes the partial derivative):

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b)$$

To summarize, the steps of the algorithm are as follows:

(1) Initialize the values of the model parameters, e.g., by random initialization;

(2) Randomly sample a mini-batch of samples from the dataset and update the parameters in the direction of the negative gradient; iterate this step repeatedly.

For the squared loss and affine transformation, we can write this out explicitly:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$$

Both $\mathbf{w}$ and $\mathbf{x}$ in this equation are vectors; this vector notation is more readable than the coefficient notation ($w_1, w_2, \ldots, w_d$). $|\mathcal{B}|$ denotes the number of samples in each mini-batch, also called the batch size, and $\eta$ denotes the learning rate.
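To make the update rule concrete, here is a minimal one-step sketch using PyTorch autograd; the mini-batch values and zero initialization are hypothetical, and Section 3 builds a full training loop around exactly this kind of step:

import torch

# Hypothetical mini-batch of 4 samples with 2 features each.
X = torch.randn(4, 2)
y = torch.randn(4, 1)

w = torch.zeros((2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
eta = 0.03  # learning rate

# Summed squared loss over the mini-batch.
l = ((torch.matmul(X, w) + b - y) ** 2 / 2).sum()
l.backward()  # populates w.grad and b.grad

with torch.no_grad():
    w -= eta * w.grad / X.shape[0]  # divide by |B| to average the gradient
    b -= eta * b.grad / X.shape[0]
    w.grad.zero_()
    b.grad.zero_()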

The values of the batch size and learning rate are usually specified manually in advance, rather than learned through model training.

Hyperparameters are parameters that can be adjusted but are not updated during training. They are usually tuned based on the results of training iterations, evaluated on an independent validation dataset.

Tuning: The process of choosing hyperparameters.

After training for a predetermined number of iterations (or until some other stopping condition is met), we record the estimated values of the model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. However, even if our function were truly linear and noise-free, these estimates would not exactly minimize the loss function: the algorithm makes the loss converge slowly towards the minimum, but cannot reach it exactly within a finite number of steps.

Linear regression happens to be a learning problem with only one minimum over the entire domain. But for complex models like deep neural networks, the loss surface usually contains multiple minima. Fortunately, for reasons that are not yet fully understood, deep learning practitioners rarely struggle to find a set of parameters that minimizes the loss on the training set.

In fact, what's harder to do is to find a set of parameters that achieves low loss on data we've never seen before, a challenge known as generalization.

1.5.1 Predicting with the learned model

Now we can estimate the price of a new house that is not included in the training data, given the size of the house x1 and the age of the house x2. The process of estimating a target given features is often referred to as prediction or inference.
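As a minimal sketch, assuming training has produced learned parameters (the values of w_hat, b_hat, and the new house's features below are hypothetical stand-ins):

import torch

# Hypothetical learned parameters.
w_hat = torch.tensor([[2.0], [-3.4]])
b_hat = torch.tensor([4.2])

# Features (size x1, age x2) of a house not seen during training.
x_new = torch.tensor([[1.2, 0.5]])

price = torch.matmul(x_new, w_hat) + b_hat  # prediction (inference)
print(price)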

1.6 Normal distribution and squared loss

1.6.1 Normal distribution

The normal distribution (also known as the Gaussian distribution) is closely related to linear regression. The probability density of a normal distribution with mean $\mu$ and variance $\sigma^2$ is

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$

Below we define a Python function to compute it.

import math

import numpy as np
from d2l import torch as d2l

def normal(x, mu, sigma):
    """Probability density of the normal distribution."""
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)

We now visualize the normal distribution.

# Use numpy again for the visualization
x = np.arange(-7, 7, 0.01)

# Pairs of mean and standard deviation
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

As the figure shows, changing the mean of the Gaussian distribution shifts it along the x-axis, and increasing the variance spreads the distribution out and lowers its peak.

1.6.2 Mean squared error loss function (squared loss)

One reason the squared loss can be used for linear regression is that we assume the observations contain noise, and that the noise follows a normal distribution.

The noise model is

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2)$$

1.6.3 Maximum likelihood estimation

The estimator selected according to the maximum likelihood estimation method is called the maximum likelihood estimator.

While maximizing the product of many exponential functions may seem difficult, we can simplify it, without changing the objective, by maximizing the log-likelihood instead. And because optimization problems are conventionally stated as minimization, we minimize the negative log-likelihood.
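For completeness, a sketch of the standard derivation under the Gaussian noise assumption above. The likelihood of observing $y$ for a given $\mathbf{x}$ is

$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right)$$

and the negative log-likelihood of the whole dataset is

$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left[\frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2\right]$$

The first term does not depend on $\mathbf{w}$ or $b$, and the second is the squared loss up to the constant factor $1/\sigma^2$.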

Therefore, under the assumption of Gaussian noise, minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model.

1.7 From linear regression to deep networks

1.7.1 Neural network diagram

We can describe the linear regression model as a neural network. The diagram shows only the connection pattern, i.e., how each input is connected to the output, and omits the values of the weights and biases.

Since the model focuses on where the computation takes place, we usually do not count the input layer when counting the number of layers. That is, the neural network in Figure 3.1.2 has 1 layer.

For linear regression, every input is connected to every output (here there is only one output); we call this transformation (the output layer in Figure 3.1.2) a fully-connected layer, also called a dense layer.

1.7.2 Neural networks and biology

Figure 3.1.3 shows a biological neuron, composed of dendrites (input terminals), the nucleus (the "CPU"), an axon (the output wire), and axon terminals (output terminals), which connect to other neurons through synapses.

2. Basic optimization methods

3. Implementing linear regression from scratch

In this section, we will implement linear regression using only tensors and automatic differentiation.

PS: the d2l package can be installed by running pip install -U d2l at the conda prompt.

%matplotlib inline
import random
import torch
from d2l import torch as d2l

3.1 Generating the dataset

In this section we generate a dataset with 1000 samples; each sample contains 2 features drawn from a standard normal distribution.

def synthetic_data(w, b, num_examples):  #@save
    """生成 y = Xw + b + 噪声。"""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

Note that each row of features contains a two-dimensional data sample, and each row of labels contains a one-dimensional label value (a scalar), as printed below.

print('features:', features[0],'\nlabel:', labels[0])
features: tensor([-1.5273,  0.5069])
label: tensor([-0.5713])

By plotting a scatter plot of the second feature features[:, 1] against labels, the linear relationship between the two can be observed visually.

d2l.set_figsize()
d2l.plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), 1);

3.2 Reading the dataset

We define a data_iter function that takes the batch size, feature matrix, and label vector as input and generates mini-batches of size batch_size, each containing a set of features and labels.

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The samples are read in random order, with no particular structure
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

Typically, we use reasonably sized mini-batches to take advantage of GPU hardware, which excels at parallel processing. The model computation can be performed in parallel across samples, and the gradient of the per-sample loss can also be computed in parallel. A GPU can process hundreds of samples in scarcely more time than it takes to process a single one.

Let's build some intuition by reading and printing the first mini-batch of samples. The feature shape of each batch shows the batch size and the number of input features; likewise, the first dimension of the batch's label shape equals batch_size.

batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
tensor([[-0.5309,  0.2644],
        [ 0.5849, -0.4633],
        [ 0.0618, -1.1239],
        [ 0.2240,  0.0085],
        [ 0.7633,  0.9449],
        [ 0.8836,  0.8245],
        [-2.5749, -0.5567],
        [-0.0975,  0.8569],
        [ 0.8215, -0.3621],
        [ 0.7872,  0.2790]])
 tensor([[2.2405],
        [6.9425],
        [8.1252],
        [4.6162],
        [2.5107],
        [3.1595],
        [0.9252],
        [1.0848],
        [7.0856],
        [4.8352]])

The iteration implemented above is fine for teaching, but it performs poorly and can get us into trouble on real problems. For example, it requires loading all the data into memory and performs a lot of random memory access.

The built-in iterators implemented in deep learning frameworks are much more efficient and can process data stored in files and data provided through data streams.
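As a taste of that, here is a minimal sketch using PyTorch's built-in DataLoader, which Section 4.2 covers properly; the num_workers argument, which enables background worker processes, is an illustrative assumption:

from torch.utils import data

dataset = data.TensorDataset(features, labels)
loader = data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=2)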

3.3 Initializing model parameters

In the code below, we initialize the weights by sampling random numbers from a normal distribution with mean 0 and standard deviation 0.01, and initialize the bias to 0.

w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

After initializing the parameters, our task is to update them until they fit our data sufficiently well.

Each update requires computing the gradient of the loss function with respect to the model parameters. Given this gradient, we can update each parameter in the direction that reduces the loss.

3.4 Defining the model

Next, we must define the model, connecting the model's inputs and parameters to its outputs. Recall that to compute the output of the linear model, we simply compute the matrix-vector product of the input features X and the model weights w, and add the bias b.

def linreg(X, w, b):  #@save
    """线性回归模型。"""
    return torch.matmul(X, w) + b

3.5 Defining the loss function

Because updating the model requires computing the gradient of the loss function, we define the loss function first. Here we use the squared loss. In the implementation, we need to reshape the true values y to match the shape of the predictions y_hat.

def squared_loss(y_hat, y):  #@save
    """均方损失。"""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
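A quick sanity check of this definition (the numbers are chosen purely for illustration):

y_hat = torch.tensor([[2.0]])
y = torch.tensor([1.0])
print(squared_loss(y_hat, y))  # tensor([[0.5000]]), i.e. (2 - 1)**2 / 2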

3.6 Defining the optimization algorithm

At each step, we use a mini-batch drawn at random from the dataset and compute the gradient of the loss with respect to the parameters. We then update the parameters in the direction that reduces the loss.

The following function implements mini-batch stochastic gradient descent updates. It accepts a list of model parameters, a learning rate, and a batch size as input. The size of each update step is determined by the learning rate lr. Because the loss we compute is a sum over the batch of samples, we normalize the step size by the batch size (batch_size), so that the step size does not depend on our choice of batch size.

def sgd(params, lr, batch_size):  #@save
    """小批量随机梯度下降。"""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

3.7 Training

Now that we have everything we need for model training, we can implement the main parts of the training process.

In each iteration, we read a mini-batch of training samples and pass it through our model to obtain a set of predictions. After computing the loss, we run backpropagation, storing the gradient with respect to each parameter.

Finally, we call the optimization algorithm sgd to update the model parameters.

To recap, we will execute the following loop: (1) initialize the parameters (w, b); (2) repeat until done: compute the gradient over a mini-batch and update the parameters in the direction of the negative gradient.

In each epoch, we use the data_iter function to traverse the entire dataset, using every sample of the training set once (assuming the number of samples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyperparameters, set here to 3 and 0.03, respectively.

Setting hyperparameters is tricky and requires tuning through trial and error.

lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # mini-batch loss for `X` and `y`
        # Because `l` has shape (`batch_size`, 1) rather than being a scalar,
        # we sum the elements of `l` and compute the gradient w.r.t. [`w`, `b`]
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # update the parameters using their gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
epoch 1, loss 0.043538
epoch 2, loss 0.000165
epoch 3, loss 0.000050

Because we synthesized the dataset ourselves, we know the true parameters. We can therefore evaluate the success of training by comparing the true parameters with those learned during training. They indeed turn out to be very close.

print(f'Estimated error of w: {true_w - w.reshape(true_w.shape)}')
print(f'Estimated error of b: {true_b - b}')
Estimated error of w: tensor([0.0010, 0.0004], grad_fn=<SubBackward0>) 
Estimated error of b: tensor([2.0504e-05], grad_fn=<RsubBackward1>)

Note that we should not assume that we can recover the parameters perfectly.

In machine learning, we are usually less concerned with recovering the true parameters and more concerned with those parameters that can be predicted with high accuracy.

Fortunately, even on complex optimization problems, stochastic gradient descent usually finds very good solutions. One reason is that there are many parameter combinations in deep networks that enable highly accurate predictions.

4. Concise implementation of linear regression

Deep learning frameworks can automate the repetitive work in gradient-based learning algorithms.

In the previous section, we only relied on: (1) data storage and linear algebra via tensors; (2) gradient computation via automatic differentiation. In fact, since data iterators, loss functions, optimizers, and neural network layers are commonly used, modern deep learning libraries implement these components for us as well.

We will show how the linear regression model in the previous section can be implemented succinctly by using a deep learning framework.

4.1 Generating the dataset

import torch
from torch.utils import data
from d2l import torch as d2l


true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)

4.2 Reading the dataset

We can call the framework's existing API to read data. We pass features and labels as arguments and specify batch_size when instantiating the data iterator object.

Additionally, the boolean is_train indicates whether we want the data iterator object to shuffle the data in each epoch.

def load_array(data_arrays, batch_size, is_train=True):  #@save
    """构造一个PyTorch数据迭代器。"""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

batch_size = 10
data_iter = load_array((features, labels), batch_size)

We can use data_iter in the same way we used the data_iter function in Section 3.2. To verify that it works, let's read and print the first mini-batch of samples. Here we use iter to construct a Python iterator and next to fetch the first item from it.

next(iter(data_iter))
[tensor([[-0.6054, -0.8844],
         [-0.7336,  1.4959],
         [ 0.0251, -0.5782],
         [-0.3874, -2.7577],
         [-0.2987, -1.6454],
         [-1.6794,  0.9353],
         [-1.1240,  0.0860],
         [ 0.7230, -0.3869],
         [ 0.2812, -1.3614],
         [ 1.4122,  0.7235]]),
 tensor([[ 5.9993],
         [-2.3407],
         [ 6.2172],
         [12.8106],
         [ 9.2043],
         [-2.3530],
         [ 1.6639],
         [ 6.9765],
         [ 9.3874],
         [ 4.5664]])]

4.3 Defining the model

When we implemented linear regression from scratch in Section 3, we explicitly defined the model parameter variables and wrote out the calculation so that the outputs were obtained through basic linear algebra operations.

We first define a model variable net, which is an instance of the Sequential class. The Sequential class defines a container for multiple layers that are chained together. When given input data, the Sequential instance passes the data into the first layer, then uses the output of the first layer as the input of the second layer, and so on.

In the example below, our model contains only one layer, so Sequential is not actually needed. But since almost all models in the future will be multi-layered, using Sequential here will familiarize you with standard pipelines.

Recalling the single-layer network architecture in Figure 3.1.2, this single layer is called a fully-connected layer because each of its inputs is connected to each of its outputs by matrix-vector multiplication.

# `nn` is an abbreviation for neural networks
from torch import nn

net = nn.Sequential(nn.Linear(2, 1))

4.4 Initializing model parameters

Deep learning frameworks usually provide predefined methods to initialize parameters. Here we specify that each weight parameter should be randomly sampled from a normal distribution with mean 0 and standard deviation 0.01, and that the bias parameter should be initialized to zero.

Just as we specified the input and output dimensions when constructing nn.Linear, we can now access the parameters directly to set their initial values: we select the first layer of the network via net[0], access its parameters via weight.data and bias.data, and overwrite their values with the in-place methods normal_ and fill_.

net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)
tensor([0.])

4.5 Defining the loss function

We compute the mean squared error using the MSELoss class, which is also known as the squared L2 norm. By default, it returns the average of the per-sample losses.

loss = nn.MSELoss()
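The reduction behavior is configurable; a small sketch contrasting the default mean reduction with sum reduction, using illustrative inputs:

mse_mean = nn.MSELoss()                 # reduction='mean' is the default
mse_sum = nn.MSELoss(reduction='sum')

y_hat = torch.tensor([1.0, 2.0])
y = torch.tensor([0.0, 0.0])
print(mse_mean(y_hat, y))  # tensor(2.5000), (1 + 4) / 2
print(mse_sum(y_hat, y))   # tensor(5.), 1 + 4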

4.6 Defining the optimization algorithm

The mini-batch stochastic gradient descent algorithm is a standard tool for optimizing neural networks, and PyTorch implements many variants of this algorithm in the optim module.

When we instantiate an SGD instance, we specify the parameters to optimize (obtained from our model via net.parameters()) together with the hyperparameters required by the optimization algorithm.

Mini-batch stochastic gradient descent only requires setting the value of lr, which we set to 0.03 here.

trainer = torch.optim.SGD(net.parameters(), lr=0.03)
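As a side note, the other optimizers in torch.optim are constructed the same way; two common variants, with illustrative hyperparameter values:

# SGD with momentum, and Adam; both take the parameters and a learning rate.
trainer_momentum = torch.optim.SGD(net.parameters(), lr=0.03, momentum=0.9)
trainer_adam = torch.optim.Adam(net.parameters(), lr=0.001)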

4.7 Training

Implementing our model through the high-level API of a deep learning framework requires relatively little code: we do not have to allocate parameters individually, define our loss function, or implement mini-batch stochastic gradient descent by hand.

To recap: in each epoch, we traverse the dataset (train_data) completely, repeatedly fetching one mini-batch of inputs and the corresponding labels. For each mini-batch, we perform the following steps:

(1) generate predictions by calling net(X) and compute the loss l (the forward pass);

(2) compute the gradients by running backpropagation;

(3) update the model parameters by calling the optimizer.

To better measure training progress, we compute the loss after each epoch and print it to monitor the training process.

num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        trainer.step()
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
epoch 1, loss 0.000254
epoch 2, loss 0.000098
epoch 3, loss 0.000098

Below we compare the true parameters of the synthetic dataset with the model parameters learned from limited data. To access the parameters, we first select the desired layer from net and then read that layer's weight and bias. As in the from-scratch implementation, our estimated parameters are very close to the true ones.

w = net[0].weight.data
print('Estimated error of w:', true_w - w.reshape(true_w.shape))
b = net[0].bias.data
print('Estimated error of b:', true_b - b)

Estimated error of w: tensor([0.0007, 0.0007])

Estimated error of b: tensor([-0.0007])

5. Summary

1. We can implement the model more concisely using PyTorch's high-level API.

2. In PyTorch, the data module provides data processing tools, and the nn module defines a large number of neural network layers and common loss functions.

3. We can initialize parameters in place using the methods whose names end with _, such as normal_ and fill_.
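A minimal illustration of the trailing-underscore (in-place) convention:

t = torch.empty(3)
t.normal_(0, 0.01)  # in-place: fill t with samples from N(0, 0.01**2)
t.fill_(0)          # in-place: set every element of t to 0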

QA

1. It is recommended to use Colab.

2. Why average the loss? Averaging is equivalent to dividing the learning rate by the batch size, so the step size does not depend on the batch size.

3. The batch size should preferably be small.

4. The learning rate and batch size do not affect the final result.

5. The "stochastic gradient" is computed over a mini-batch of multiple samples, not a single one.

6. The role of detach(): the detached tensor does not participate in gradient computation.

7. Not doing learning rate decay is not a big problem.


Origin blog.csdn.net/qq_39237205/article/details/121489104