Hands-on Deep Learning v2, Part 2: Linear Neural Networks, Linear Regression

3. Linear Neural Networks

Regression is a class of methods that can model the relationship between one or more independent variables and a dependent variable. In the natural and social sciences, regression is often used to represent the relationship between inputs and outputs.

Most tasks in machine learning are related to prediction. Regression problems arise whenever we want to predict a numerical value. Common examples include predicting prices (of houses, stocks, etc.), predicting lengths of stay (for hospital inpatients), and predicting demand (for retail sales). But not all prediction problems are regression problems. In later chapters we will cover classification problems, where the goal is to predict which of a set of classes a data point belongs to.

3.1. Linear regression


3.1.1. Basic elements of linear regression

Linear regression dates back to the early 19th century, and it is the simplest and most popular of the standard tools for regression. Linear regression rests on a few simple assumptions: first, that the relationship between the independent variables x and the dependent variable y is linear, i.e., y can be expressed as a weighted sum of the elements of x, usually allowing for some noise in the observations; second, that any noise is well behaved, for example that it follows a normal distribution.

To explain linear regression, let's take a practical example: we want to estimate the price of a house (in dollars) based on its area (in square feet) and age (in years). To develop a model that can predict house prices, we need to collect a real dataset, including the sale price, area, and age of each house. In machine learning terminology, this dataset is called the training dataset or training set. Each row of data (for example, the data corresponding to one housing transaction) is called a sample, also known as a data point or data instance. The target we are trying to predict (here, the house price) is called the label or target. The independent variables that the prediction is based on (area and age) are called features or covariates.

Usually, we use n to denote the number of samples in a dataset. For a sample with index i, its input is expressed as x(i) = [x1(i), x2(i)]⊤, and its corresponding label is y(i).

1. Linear model

The linear assumption means that the target (house price) can be expressed as a weighted sum of features (area and age), as shown in the following formula:

\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.

The w_area and w_age in (3.1.1) are called weights; the weights determine the influence of each feature on the predicted value. b is called the bias, offset, or intercept. The bias is the value the prediction takes when all features are 0. Even though in reality no house has an area of 0 or an age of exactly 0 years, we still need the bias term; without it, the expressive power of our model would be limited. Strictly speaking, (3.1.1) is an affine transformation of the input features: a linear transformation of the features via the weighted sum, combined with a translation via the bias term.

Given a dataset, our goal is to find the weights w and bias b of the model such that, for a new sample drawn from the same distribution as the training data, the error between the predicted label and the true label is as small as possible. The predicted output is determined by an affine transformation of the input features, and the affine transformation is determined by the chosen weights and bias.

In machine learning we usually work with high-dimensional datasets, for which linear-algebra notation is more convenient. When our input contains d features, we write the prediction y^ (the "hat" symbol denotes an estimate of y) as:

\hat{y} = w_1 x_1 + ... + w_d x_d + b.

Putting all the features into the vector x and all the weights into the vector w, we can express the model succinctly in the form of a dot product:

\hat{y} = \mathbf{w}^\top \mathbf{x} + b.

In (3.1.3), the vector x corresponds to the features of a single data sample. The matrix X conveniently refers to all n samples of the dataset, where each row of X is a sample and each column is a feature.

For the feature matrix X, the predictions y^ (again, the "hat" denotes estimates) can be expressed as a matrix-vector product:

\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b

The summation in this process uses the broadcasting mechanism (described in detail in Section 2.1.3). Given training features X and corresponding known labels y, the goal of linear regression is to find weights w and bias b such that, for new samples drawn from the same distribution as X, the error in predicting their labels is as small as possible.
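To make the broadcasting concrete, here is a minimal sketch with made-up numbers (not from the original text): the scalar b is added to every component of Xw.

import torch

X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])  # n = 4 samples, d = 2 features
w = torch.tensor([0.5, -1.0])
b = 4.2
y_hat = X @ w + b  # shape (4,); b is broadcast across all components
print(y_hat)       # tensor([ 2.7000,  1.7000,  0.7000, -0.3000])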

Even if we believe the best model for predicting y given x is linear, it is hard to find a real dataset of n samples where y(i) exactly equals w⊤x(i) + b for all 1 ≤ i ≤ n. Whatever instruments we use to observe the features X and labels y, there may be small observation errors. Therefore, even when we are confident that the underlying relationship is linear, we add a noise term to account for such errors.

Before starting to find the best model parameters w and b, we need two more things: (1) a measure of model quality; (2) a method that can update the model to improve the quality of the model's prediction.

2. Loss function

Before we start thinking about how to fit the model to the data, we need to determine a measure of fit quality. The loss function quantifies the gap between the target's actual value and the predicted value. Usually we choose a non-negative number as the loss, where smaller values mean smaller loss and a perfect prediction has loss 0. The most commonly used loss function in regression problems is the squared error. When the prediction for sample i is y^(i) and its corresponding true label is y(i), the squared error is defined as:

l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.

The constant 1/2 makes no substantive difference, but it is notationally convenient, since it cancels when we differentiate the loss function. Since the training dataset is given and not under our control, the empirical error is a function of the model parameters only. To illustrate further, consider the one-dimensional regression problem drawn in Figure 3.1.1.

A larger difference between the estimated value y^(i) and the observed value y(i) results in a larger loss, due to the quadratic term in the squared-error function. To measure the quality of the model on the entire dataset, we compute the average loss (equivalent, up to a constant, to the sum) over the n samples of the training set:

L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.

When training the model, we want to find parameters (\mathbf{w}^*, b^*) that minimize the total loss over all training samples, as follows:

\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).

3. Analytical solution

Linear regression happens to be a very simple optimization problem. Unlike most of the other models we will cover in this book, the solution of linear regression can be expressed by a simple formula; this type of solution is called an analytical solution. First, we fold the bias b into the parameter vector w by appending a column of all ones to the matrix of features. The prediction problem is then to minimize ||y - Xw||^2. There is only one critical point on the loss surface, and it corresponds to the global minimum of the loss. Setting the derivative of the loss with respect to w to 0 yields the analytical solution:

\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}.

Simple problems like linear regression have analytical solutions, but most problems do not. Analytical solutions lend themselves to nice mathematical analysis, but the requirements for one to exist are so restrictive that they exclude most of deep learning.
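For linear regression itself, though, the recipe is easy to carry out. A minimal sketch with illustrative data (using the trick above of folding the bias into w via a column of ones):

import torch

torch.manual_seed(0)
X = torch.randn(50, 2)
y = X @ torch.tensor([2.0, -3.4]) + 4.2 + 0.01 * torch.randn(50)

X1 = torch.cat([X, torch.ones(50, 1)], dim=1)     # the ones column carries the bias
w_star = torch.linalg.solve(X1.T @ X1, X1.T @ y)  # (X^T X)^{-1} X^T y
print(w_star)  # approximately tensor([ 2.0000, -3.4000,  4.2000])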

4. Stochastic Gradient Descent

Even in cases where we cannot get an analytical solution, we can still train the model efficiently. On many tasks, models that are difficult to optimize perform better. Therefore, it is very important to figure out how to train these difficult-to-optimize models.

In this book we use a method called gradient descent, which can optimize almost all deep learning models. It reduces the error by continuously updating the parameters in the direction of decreasing loss function.

The simplest use of gradient descent is to compute the derivative (the gradient) of the loss function, averaged over all samples in the dataset, with respect to the model parameters. In practice this can be very slow, because we must traverse the entire dataset before making a single update. Therefore, we usually randomly sample a small batch of examples each time we need to compute an update, a variant called minibatch stochastic gradient descent.

In each iteration, we first randomly sample a mini-batch B, which consists of a fixed number of training samples. Then, we compute the derivatives (also called gradients) of the average loss over the mini-batch with respect to the model parameters. Finally, we multiply the gradient by a predetermined positive number η and subtract it from the current parameter value.

We express this update process with the following mathematical formula (∂ denotes a partial derivative):

(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).

To sum up, the steps of the algorithm are as follows: (1) Initialize the value of the model parameters, such as random initialization; (2) Randomly select a small batch of samples from the data set and update the parameters in the direction of the negative gradient, and continuously iterate this step. For the squared loss and affine transformation, we can explicitly write it as follows:

\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}

Both w and x in (3.1.10) are vectors. Here, the vector notation is more readable than writing out individual coefficients (such as w1, w2, ..., wd). |B| denotes the number of samples in each minibatch, also called the batch size, and η denotes the learning rate. The values of the batch size and learning rate are usually specified manually in advance rather than obtained through model training. These parameters, which can be tuned but are not updated during training, are called hyperparameters. Hyperparameter tuning is the process of choosing them, typically based on the results of training iterations as evaluated on a separate validation dataset.
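As a minimal sketch, one such update step for the squared loss can be written directly from the gradients in (3.1.10) (the names X_batch, y_batch, and eta are illustrative):

import torch

def sgd_step(X_batch, y_batch, w, b, eta):
    """One minibatch SGD update with the explicit squared-loss gradients."""
    residual = X_batch @ w + b - y_batch          # w^T x^(i) + b - y^(i) for each sample
    grad_w = X_batch.T @ residual / len(y_batch)  # gradient averaged over the batch
    grad_b = residual.mean()
    return w - eta * grad_w, b - eta * grad_b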

After training for a predetermined number of iterations (or until some other stopping condition is met), we record the estimated model parameters, denoted \hat{\mathbf{w}}, \hat{b}. However, even if the underlying function were truly linear and noise-free, these estimates would not minimize the loss exactly, because the algorithm converges slowly toward the minimum and cannot reach it exactly in a finite number of steps.

Linear regression happens to be a learning problem with only one minimum over the whole domain. But for complex models like deep neural networks, the loss surface usually contains many minima. Deep learning practitioners rarely struggle to find parameters that minimize the loss on the training set; the harder task is to find parameters that achieve low loss on data we have never seen before, a challenge called generalization.

5. Prediction with the model

Given the "learned" linear regression model \hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}, we can now estimate the price of a new house (not included in the training data) by house size x1 and house age x2. The process of estimating a target given features is often called prediction or inference.

This book will try to stick with the word prediction. While inference has become standard deep learning terminology, it is a bit of a misnomer: in statistics, inference more often refers to estimating parameters from a dataset. This misuse of terminology is a common source of confusion when deep learning practitioners talk to statisticians.

2. Vectorization for speed

When training our models, we often want to be able to process an entire mini-batch of examples at once. In order to achieve this, we need to vectorize the computation, thus taking advantage of linear algebra libraries, rather than writing expensive for loops in Python.

%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l
# To illustrate why vectorization matters so much, we consider two ways to add vectors. We instantiate two all-ones vectors of dimension n (n = 100000 below). In one method we traverse the vectors with a Python for loop; in the other we rely on a single call to +.
# Since we will benchmark running times frequently throughout this book, we define a timer:

n = 100000
a = torch.ones([n])
b = torch.ones([n])
class Timer:  #@save
    """Record multiple running times"""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer"""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the elapsed time in the list"""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time"""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the total time"""
        return sum(self.times)

    def cumsum(self):
        """Return the cumulative times"""
        return np.array(self.times).cumsum().tolist()
# Now we can benchmark the workloads.
# First, we use a for loop, performing one addition at a time.

c = torch.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'
# '0.89696 sec'
# Alternatively, we use the overloaded + operator to compute the element-wise sum.
timer.start()
d = a + b
f'{timer.stop():.5f} sec'
# '0.00019 sec'
# The result is clear: the second method is dramatically faster than the first. Vectorized code, which operates on whole tensors at once, often yields order-of-magnitude speedups. Moreover, pushing more of the math into the library means we write fewer calculations ourselves, reducing the potential for errors.

3. Normal distribution and squared loss

Next, we interpret the squared loss objective function through assumptions about the noise distribution.

There is a close relationship between the normal distribution and linear regression. The normal distribution, also known as the Gaussian distribution, was first applied to astronomical research by the German mathematician Gauss. Simply put, if a random variable x has mean μ and variance σ^2 (standard deviation σ), its normal probability density function is as follows:

p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).

Below we define a Python function to compute the normal distribution.

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)
# Visualize again using numpy
x = np.arange(-7, 7, 0.01)

# Pairs of means and standard deviations
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

As we can see, changing the mean produces a shift along the x-axis, and increasing the variance spreads out the distribution, reducing its peak value.

One reason the mean squared error loss function (mean square loss for short) can be used for linear regression is that we assume the observations contain noise, and that the noise follows a normal distribution:

y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2).

Thus, we can now write out the likelihood of observing a particular y for a given x:

P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).

Now, according to maximum likelihood estimation, the optimal values of the parameters w and b are those that maximize the likelihood of the entire dataset:

P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).

Estimators chosen according to the maximum likelihood method are called maximum likelihood estimators. While maximizing the product of many exponential functions may look difficult, we can simplify it, without changing the objective, by maximizing the log-likelihood instead. For historical reasons, optimization is usually phrased as minimization rather than maximization, so we can equivalently minimize the negative log-likelihood -log P(y|X). The resulting formula is:

-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.

Now we need only assume that σ is some fixed constant, so the first term can be ignored, because it does not depend on w or b. The second term is identical to the squared-error loss introduced earlier, except for the multiplicative constant 1/σ^2. Fortunately, the solution does not depend on σ. It follows that, under the assumption of additive Gaussian noise, minimizing the mean squared error is equivalent to maximum likelihood estimation of the linear model.
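We can check this equivalence numerically. In the sketch below (made-up one-dimensional data, and only the intercept b is varied, to keep it short), the b that minimizes the squared error is exactly the b that minimizes the negative log-likelihood, for any fixed σ:

import math
import torch

torch.manual_seed(0)
x = torch.randn(200)
y = 2.0 * x + 4.2 + 0.1 * torch.randn(200)

# scan candidate values of b with the slope fixed at its true value
bs = torch.linspace(3.0, 5.0, 401)
sq_err = torch.stack([((y - 2.0 * x - b) ** 2).sum() for b in bs])
sigma = 0.5  # any fixed constant
nll = 200 * 0.5 * math.log(2 * math.pi * sigma ** 2) + sq_err / (2 * sigma ** 2)

print(bs[sq_err.argmin()], bs[nll.argmin()])  # the same minimizer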

4. From Linear Regression to Deep Networks

So far we have only talked about linear models. While neural networks cover a much richer family of models, we can begin by describing the linear model in the language of neural networks, so that the linear model can be regarded as a neural network. First, we rewrite the model in "layer" notation.

Neural Network Diagram

Deep learning practitioners love to draw diagrams to visualize what is happening in their models. In Figure 3.1.2, we depict the linear regression model as a neural network. Note that this figure shows only the connectivity pattern, i.e., how each input is connected to the output; the values of the weights and biases are omitted.

Summary

  • The key elements in a machine learning model are the training data, the loss function, the optimization algorithm, and the model itself.

  • Vectorization makes the mathematical expressions more concise and the code faster.

  • Minimizing the objective function is equivalent to performing maximum likelihood estimation.

  • A linear regression model is also a simple neural network.

Exercises

  1. Suppose we have some data x1, ..., xn ∈ R. Our goal is to find a constant b that minimizes ∑i(xi - b)^2.
    1. Find the analytical solution for the optimal value b.
    2. How does this problem and its solution relate to the normal distribution?
  1. Find the analytical solution for the optimal value b:

    To find $b$ that minimizes $\sum_i (x_i - b)^2$, you can set the derivative to 0:

    $\frac{d}{db} \sum_i (x_i - b)^2 = -2 \sum_i (x_i - b) = 0$

    The solution is: $b = \frac{1}{n}\sum_i x_i$, that is, $b$ is equal to the mean of all $x_i$.

  2. The problem and its solution are related to the normal distribution:

    $b$ minimizes the sum of squared residuals $\sum_i (x_i - b)^2$, which is equivalent to maximizing the likelihood that the observed data $x_i$ comes from a normal distribution with a mean of $b$ and a constant variance.

  2. Derive the analytical solution to the linear regression optimization problem with squared error. To simplify matters, ignore the bias b (we can do this by adding to X a column whose values are all 1).
    1. Write the optimization problem in matrix-vector notation (treat all the data as a single matrix and all the target values as a single vector).
    2. Compute the gradient of the loss with respect to w.
    3. Find the analytical solution by setting the gradient to 0 and solving the matrix equation.
    4. When might this be better than using stochastic gradient descent? When does this approach fail?
  1. The optimization problem is:

    \min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2

  2. Calculate the gradient:

    \frac{\partial}{\partial \mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 = 2\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})

  3. Setting the gradient to 0 and solving the equation gives:

    \mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

  4. When the sample size and feature dimension are small, computing the analytical solution costs less than running stochastic gradient descent; otherwise stochastic gradient descent is more efficient. The analytical solution fails when X^T X is not invertible, for example when features are linearly dependent (see the sketch below).
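A small sketch of that failure mode (illustrative data): duplicating a column of X makes X^T X singular, so the normal equation cannot be solved directly, although a pseudoinverse still returns a least-squares solution.

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.01 * torch.randn(100)

w_hat = torch.linalg.solve(X.T @ X, X.T @ y)  # fine: X^T X is invertible
print(w_hat)

X_sing = torch.cat([X, X[:, :1]], dim=1)      # duplicate the first column
# torch.linalg.solve(X_sing.T @ X_sing, X_sing.T @ y)  # would fail: the matrix is singular
w_pinv = torch.linalg.pinv(X_sing) @ y        # minimum-norm least-squares solution instead
print(w_pinv)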

  3. Assume that the noise model governing the additive noise ε is the exponential distribution, that is, p(ε) = (1/2) exp(-|ε|).
    1. Write the negative log-likelihood of the data under the model, -log P(y|X).
    2. Try to write down the analytical solution.
    3. Propose a stochastic gradient descent algorithm to solve this problem. What could go wrong? (Hint: what happens near the stationary point as we keep updating the parameters?) Try to fix this.
    1. Negative log-likelihood:

      -\log P(\mathbf y \mid \mathbf X) = n \log 2 + \sum_{i=1}^n \left|y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right|

    2. There is no closed-form analytical solution in general, since the objective is a sum of absolute values.

    3. Stochastic gradient descent can be used, but near the minimum the gradient of |ε| keeps magnitude 1 (and is undefined at exactly 0), so with a fixed learning rate the iterates oscillate around the optimum instead of settling. Possible fixes: decay the learning rate over time, or smooth the loss near zero (for example, use a squared loss in a small neighborhood of 0, as the Huber loss does).

The following is a PyTorch implementation:

import torch

# Generate data
X = torch.randn(100, 10)
y = torch.randn(100)

# Define the model parameters
w = torch.randn(10, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Define the loss function
def loss_fn(y_pred, y):
  return torch.sum((y_pred - y)**2)

# Train
optimizer = torch.optim.SGD([w, b], lr=1e-3)
for iter in range(100):
  y_pred = X @ w + b
  loss = loss_fn(y_pred, y)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

# Print the trained parameters
print(w) 
print(b)

This code trains a linear regression model by minimizing the squared loss and prints the learned weight w and bias b, showing that stochastic gradient descent can be used to solve this kind of linear regression problem.

3.2. Implementation of linear regression from scratch

After understanding the key ideas of linear regression, we can begin to implement it in code. In this section we implement the entire method from scratch, including the data pipeline, the model, the loss function, and the minibatch stochastic gradient descent optimizer. While modern deep learning frameworks can automate almost all of this work, implementing things from scratch ensures that we really know what we are doing, and understanding the finer details will make it easier to customize models, layers, or loss functions later. In this section we use only tensors and automatic differentiation; in later chapters we will take full advantage of the deep learning framework and present a more concise implementation.

1. Generate a dataset

For simplicity, we construct an artificial dataset from a linear model with additive noise. Our task is to recover the model's parameters from this finite set of samples. We use low-dimensional data so that it can be visualized easily. In the code below, we generate a dataset of 1000 samples, each with 2 features drawn from a standard normal distribution, so the synthetic dataset is a matrix X ∈ R1000×2. We use the true parameters w = [2, -3.4]⊤ and b = 4.2, plus a noise term ε, to generate the dataset and its labels: y = Xw + b + ε. (3.2.1) Here ε can be viewed as capturing potential observation errors in the features and labels. We take the standard assumption to hold, namely that ε follows a normal distribution with mean 0; to keep the problem simple, its standard deviation is set to 0.1 in the code below.

Note that each row in features contains a two-dimensional data sample, and each row in labels contains a one-dimensional label value (a scalar).

By generating a scatter plot of the second feature features[:, 1] against labels, the linear relationship between the two can be observed visually.

%matplotlib inline
import random
import torch
from d2l import torch as d2l
def synthetic_data(w, b, num_examples): #@save  # define the synthetic data generator
        # w: weight vector
        # b: bias
        # num_examples: number of samples to generate

    """Generate y = Xw + b + noise"""

    X = torch.normal(0, 1, (num_examples, len(w)))   # draw the feature matrix X from a standard normal, shape (num_examples, len(w))
    y = torch.matmul(X, w) + b # linear part of the labels: matrix-vector product of X and w, plus b
    print('y.shape',y.shape)
#     y = X*w + b # would be wrong: X is 2-D (num_examples, len(w)) and w is 1-D, so elementwise * does not compute the intended matrix-vector product
    y += torch.normal(0, 0.1, y.shape) # add Gaussian noise with standard deviation 0.1 (variance 0.01) to the labels

    return X, y.reshape((-1, 1)) # return X and the labels reshaped to (num_examples, 1); -1 means that dimension is inferred from the others, 1 makes the second dimension of size 1

true_w = torch.tensor([2, -3.4]) # true weight vector, 2 elements
true_b = 4.2 # true bias
features, labels = synthetic_data(true_w, true_b, 1000) # generate a synthetic dataset of 1000 samples into features and labels
    # true_w: the true weights defined above
    # true_b: the true bias defined above
    # 1000: number of samples to generate

print('features:', features[0],'\nlabel:', labels[0]) # print the features and label of the first sample
d2l.set_figsize()
d2l.plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), 1); # scatter plot of the second feature column against the labels, point size 1

y.shape torch.Size([1000])
features: tensor([0.0875, 0.5291]) 
label: tensor([2.5143])

2. Read the dataset

Recall that training a model involves traversing the dataset, taking small batches of samples each time, and using them to update our model. Since this process is the basis for training a machine learning algorithm, it is necessary to define a function that shuffles the samples in the dataset and obtains the data in small batches.

In the code below, we define a data_iter function that takes a batch size, a feature matrix, and a label vector as input and generates minibatches of size batch_size. Each minibatch contains a set of features and labels.

Typically, we take advantage of GPU parallelism and process "mini-batches" of reasonable size. Model calculations can be performed in parallel for each sample, and the gradient of the loss function for each sample can also be calculated in parallel. A GPU can process hundreds of samples in no more time than one sample.

Let's get an intuitive feel for minibatches: read the first small batch of data samples and print it. The shape of the features in each batch shows the batch size and the number of input features; likewise, the batch of labels has a first dimension equal to batch_size.

def data_iter(batch_size, features, labels): # batch_size: number of feature/label pairs returned per iteration
    num_examples = len(features) # number of samples
    indices = list(range(num_examples)) # list of sample indices

    # the samples are read in random order, with no particular structure
    random.shuffle(indices) # shuffle the indices

    for i in range(0, num_examples, batch_size): # walk over the indices in steps of batch_size
        batch_indices = torch.tensor(indices[i: i + batch_size]) # the indices for this minibatch
        # Note: slicing past the end of a list is safe in Python,
        # e.g. a = [1, 2, 3]; a[:5] raises no error,
        # so the last minibatch may simply be smaller than batch_size.
        yield features[batch_indices], labels[batch_indices] # select the features and labels by index
a=list(range(10))
random.shuffle(a)
print(a)

batch_size = 10
for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
[6, 2, 9, 4, 0, 7, 3, 5, 8, 1]
tensor([[ 0.4282, -0.6740],
        [-0.3198, -0.1568],
        [-0.8597,  0.6003],
        [-0.6630,  0.4034],
        [ 0.1831, -0.5057],
        [ 0.0660,  0.9484],
        [-0.6181,  1.1546],
        [ 2.1839, -0.3741],
        [-0.8864,  0.3907],
        [ 1.1164, -0.7436]]) 
 tensor([[ 7.3615],
        [ 4.1884],
        [ 0.3505],
        [ 1.5880],
        [ 6.2715],
        [ 1.2592],
        [-0.9332],
        [ 9.8630],
        [ 1.0079],
        [ 8.7573]])

As we run iterations, we successively get different mini-batches until we have traversed the entire dataset. The iteration implemented above is fine for teaching, but it performs inefficiently and can get into trouble with practical problems. For example, it requires us to load all the data into memory and perform a lot of random memory accesses. The built-in iterators implemented in deep learning frameworks are much more efficient and can work with data stored in files and data provided by data streams.

3. Initialize model parameters, define model, loss function, optimization algorithm

Initialize model parameters

Before we start optimizing our model parameters with mini-batch stochastic gradient descent, we need some parameters. In the code below, we initialize the weights by sampling random numbers from a normal distribution with mean 0 and standard deviation 0.01, and initialize the bias to 0.

w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

After initializing the parameters, our task is to update these parameters until they fit our data sufficiently. Each update requires computing the gradient of the loss function with respect to the model parameters. With this gradient, we can update each parameter in the direction of reducing the loss. No one calculates gradients by hand because it is tedious and error-prone. We  compute gradients using automatic differentiation introduced in Section 2.5 .

Define the model

Next, we must define the model, relating its inputs and parameters to its outputs. Recall that to compute the output of the linear model, we simply take the matrix-vector product of the input features X and the model weights w, and add the bias b. Note that Xw is a vector while b is a scalar. Recall the broadcasting mechanism described in Section 2.1.3: when we add a scalar to a vector, the scalar is added to each component of the vector.

def linreg(X, w, b):  #@save
    """线性回归模型"""
    return torch.matmul(X, w) + b

Loss function

Because we need to compute the gradient of the loss function, we define the loss function first. Here we use the squared loss function described in Section 3.1. In the implementation, we need to reshape the true value y to match the shape of the predicted value y_hat.

def squared_loss(y_hat, y):  #@save
    """均方损失"""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

Optimization algorithm

As we discussed in Section 3.1, linear regression has an analytical solution; however, the other models in this book do not. Here we introduce minibatch stochastic gradient descent.

At each step, a minibatch drawn at random from the dataset is used to compute the gradient of the loss with respect to the parameters; we then update the parameters in the direction that reduces the loss. The function below implements the minibatch stochastic gradient descent update. It takes a set of model parameters, a learning rate, and a batch size as input. The size of each update step is determined by the learning rate lr. Because the loss we compute is a sum over the minibatch, we normalize the step by the batch size (batch_size), so that the step magnitude does not depend on our choice of batch size.

def sgd(params, lr, batch_size):  #@save
    """小批量随机梯度下降"""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

4. Training

Now that we have all the pieces needed for model training in place, we can implement the main training loop. Understanding this code is crucial, because nearly identical training loops appear over and over again in deep learning. In each iteration, we read a minibatch of training samples and pass them through the model to obtain a set of predictions. After computing the loss, we run backpropagation, storing the gradient of each parameter. Finally, we call the optimization algorithm sgd to update the model parameters.

To recap, we will execute the following loop:

  • Initialize the parameters

  • Repeat until done:

    • Compute the gradient: \mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)

    • Update the parameters: (\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}

In each iteration cycle (epoch), we use the data_iter function to traverse the entire dataset, using every sample in the training set once (assuming the number of samples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyperparameters, set here to 10 and 0.03, respectively. Setting hyperparameters is tricky and requires tuning by trial and error. We ignore these details for now; they are covered in detail in Section 11.

w = torch.normal(0, 0.01, size=(2,1), requires_grad=True) # initialize w: mean 0, std 0.01, shape (2, 1), with gradient tracking
b = torch.zeros(1, requires_grad=True) # initialize b to 0, with gradient tracking

def linreg(X, w, b): #@save
    """Linear regression model"""
    return torch.matmul(X, w) + b # X is the input; w and b are the model parameters; matmul performs the matrix product

def squared_loss(y_hat, y): #@save
    """Squared loss"""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2 # y_hat are predictions, y are labels; compute the squared error

def sgd(params, lr, batch_size): #@save
    """Minibatch stochastic gradient descent"""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size # update each parameter using lr and batch_size
            param.grad.zero_() # reset the gradient

lr = 0.03
num_epochs = 10

net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y) # minibatch loss on X and y

        # l has shape (batch_size, 1) rather than being a scalar, so all elements
        # of l are summed and the gradient with respect to [w, b] is computed from the sum
        l.sum().backward()

        sgd([w, b], lr, batch_size) # update the parameters using their gradients

    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
print(f'estimation error of w: {true_w - w.reshape(true_w.shape)}')
print(f'estimation error of b: {true_b - b}')
epoch 1, loss 0.038583

epoch 10, loss 0.005485
estimation error of w: tensor([-0.0028, -0.0045], grad_fn=<SubBackward0>)
estimation error of b: tensor([-0.0045], grad_fn=<RsubBackward1>)

Because we are using a dataset we synthesized ourselves, we know what the real parameters are. Therefore, we can evaluate the success of the training by comparing the real parameters with the parameters learned through training. In fact, the real parameters and the parameters learned through training are indeed very close.

Note that we should not take for granted that we can solve for the parameters perfectly. In machine learning, we are usually less concerned with recovering the true parameters and more concerned with how to predict them with a high degree of accuracy. Fortunately, even on complex optimization problems, stochastic gradient descent can often find very good solutions. One reason for this is that there are many combinations of parameters in deep networks that can achieve highly accurate predictions.

Summary

  • We learned how deep networks are implemented and optimized. Only tensors and automatic differentiation are used in this process, no need to define layers or complex optimizers.

  • This section only scratches the surface. In the following sections, we will describe additional models based on the concepts just introduced and learn how to implement them more concisely.

Exercises

  1. What happens if we initialize the weights to zero? Does the algorithm still work?

For a single-layer model such as linear regression, the algorithm still works. The loss is convex, and the gradient for each weight depends on that weight's own feature values, so the weights differentiate from one another after the first update; zero initialization is therefore harmless here. The symmetry problem only arises in networks with hidden layers, where identically initialized hidden units would receive identical updates and never diverge from one another.
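A quick check (reusing data_iter, linreg, squared_loss, and sgd defined earlier in this section, along with the synthetic features and labels): training converges even when w and b start at zero.

w = torch.zeros((2, 1), requires_grad=True)  # zero-initialized weights
b = torch.zeros(1, requires_grad=True)

for epoch in range(3):
    for X, y in data_iter(10, features, labels):
        l = squared_loss(linreg(X, w, b), y)
        l.sum().backward()
        sgd([w, b], 0.03, 10)
    with torch.no_grad():
        train_l = squared_loss(linreg(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')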

  2. Suppose you are trying to build a model for the relationship between voltage and current. Can automatic differentiation be used to learn the parameters of the model?

Yes. Automatic differentiation can be used to learn the parameters of a model of the voltage-current relationship. We define the model, use automatic differentiation to compute the gradient of the loss with respect to the model parameters, and iterate gradient descent on the parameters to fit the data.
import torch

# Model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Generate data
X = torch.rand(20, 1)
y = 3*X + 0.1*torch.randn(20, 1)

# Build the model
model = Model()

# Define the loss function and the optimizer
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Train
for epoch in range(100):
    y_pred = model(X)
    loss = criterion(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  3. Can the spectral energy density be used to determine the temperature of an object based on Planck's law?

Yes. According to Planck's law, the spectral energy density of blackbody radiation at a given wavelength is determined by the temperature, so a measured spectral energy density can be inverted to recover the temperature of the object.
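As a sketch of that inversion (the standard form of Planck's law, stated here for reference rather than taken from the original text): the spectral radiance of a blackbody at wavelength λ and temperature T is

B(\lambda, T) = \frac{2 h c^2}{\lambda^5} \cdot \frac{1}{\exp\left(\frac{h c}{\lambda k_B T}\right) - 1},

so a measured B at a known λ yields

T = \frac{h c}{\lambda k_B \ln\left(1 + \frac{2 h c^2}{\lambda^5 B}\right)}.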

  4. What problems might we encounter when computing the second derivative? How can they be solved?

When computing the second derivative numerically, you may run into numerical instability, because approximation errors are amplified when differentiating twice. Tricks such as central differences and careful step-size selection can alleviate this; see also the autograd sketch below.
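With automatic differentiation there is a separate pitfall: the first backward pass frees the computation graph by default, so differentiating a second time fails unless the graph is kept alive. A minimal sketch:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
(g,) = torch.autograd.grad(y, x, create_graph=True)  # dy/dx = 3x^2; create_graph keeps the graph
(h,) = torch.autograd.grad(g, x)                     # d2y/dx2 = 6x
print(g.item(), h.item())  # 12.0 12.0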

  5. Why is the reshape function needed in the squared_loss function?

reshape is needed in squared_loss so that the label vector y has the same shape as the predictions y_hat, making the subtraction element-wise; otherwise broadcasting can silently produce a matrix of pairwise differences, as the sketch below shows.
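A minimal illustration of the pitfall (shapes chosen for illustration): subtracting a tensor of shape (3,) from one of shape (3, 1) broadcasts to a (3, 3) matrix instead of three element-wise differences.

import torch

y_hat = torch.ones(3, 1)
y = torch.ones(3)
print((y_hat - y).shape)                       # torch.Size([3, 3]): a silent broadcasting bug
print((y_hat - y.reshape(y_hat.shape)).shape)  # torch.Size([3, 1]): the intended shape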

  6. Experiment with different learning rates and observe how quickly the loss function decreases.

Different learning rates change how quickly the loss decreases. A larger learning rate makes the loss fall faster at first but can overshoot the minimum; a smaller learning rate avoids overshooting but converges more slowly. A quick experiment is sketched below.
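A quick experiment (reusing data_iter, linreg, squared_loss, and sgd from this section; the specific rates are arbitrary choices):

for lr in [0.001, 0.03, 1.0]:  # lr = 1.0 may diverge (loss becomes nan)
    w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    for epoch in range(3):
        for X, y in data_iter(10, features, labels):
            l = squared_loss(linreg(X, w, b), y)
            l.sum().backward()
            sgd([w, b], lr, 10)
    with torch.no_grad():
        train_l = squared_loss(linreg(features, w, b), labels)
        print(f'lr {lr}, loss {float(train_l.mean()):f}')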

  7. How does the behavior of the data_iter function change if the number of samples is not divisible by the batch size?

If the number of samples is not divisible by the batch size, the last minibatch contains fewer than batch_size samples, so its shape differs from the earlier batches. If a consistent shape is required, we can pad the last batch with zeros or simply drop it; the check below shows the default behavior.
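A minimal check (reusing the data_iter defined above): with 10 samples and batch_size = 3, the final minibatch contains only one sample.

import torch

toy_features = torch.arange(20.0).reshape(10, 2)
toy_labels = torch.arange(10.0).reshape(10, 1)
for X, y in data_iter(3, toy_features, toy_labels):
    print(X.shape)  # three batches of shape (3, 2), then one of shape (1, 2)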

References: 3.1. Linear Regression — Hands-on Deep Learning 2.0.0 documentation

Origin blog.csdn.net/m0_61634551/article/details/131757289