"Hands-on Deep Learning" - Linear Neural Networks

References:

  • "Hands-On Deep Learning"

3.1 Linear regression

3.1.1 Basic elements of linear regression

Sample: $n$ denotes the number of samples; $\mathbf{x}^{(i)}=[x^{(i)}_1,x^{(i)}_2,\cdots,x^{(i)}_d]$ denotes the $i$-th sample.

Prediction: $\hat{y}=\mathbf{w}^\top\mathbf{x}+b$ is the predicted value for a single sample; $\hat{\mathbf{y}}=\mathbf{X}\mathbf{w}+b$ gives the predicted values for all samples.

Loss function:
$$L(\mathbf{w},b)=\sum\limits_{i=1}^{n}\frac12\Big(\hat{y}^{(i)}-y^{(i)}\Big)^2$$

Stochastic Gradient Descent: in each iteration, we first randomly sample a mini-batch $\mathcal{B}$, which consists of a fixed number of training samples. The parameters are then updated as follows:
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum\limits_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$
Here $\eta$ is the learning rate, a hyperparameter.

3.1.2 Vectorization Acceleration

Use efficient linear algebra libraries whenever possible.
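
As a quick illustration of why this matters (a sketch of my own, not from the original notes; the vector size and printed timings are arbitrary), compare an explicit Python loop with a single vectorized call:

import time
import torch

n = 10000
a = torch.ones(n)
b = torch.ones(n)

# Element-wise addition with an explicit Python loop
start = time.time()
c = torch.zeros(n)
for i in range(n):
    c[i] = a[i] + b[i]
print(f'loop: {time.time() - start:.5f} sec')

# The same addition as one vectorized call to the underlying library
start = time.time()
d = a + b
print(f'vectorized: {time.time() - start:.5f} sec')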

3.1.3 Normal distribution and square loss

Assume the observations contain noise $\epsilon$:
$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$.

The likelihood of $y$ given $\mathbf{x}$ is therefore:
$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y - \mathbf{w}^\top \mathbf{x} - b)^2\right)$$
Then the likelihood function is:
$$L(\mathbf{w},b) = \prod\limits_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)})$$
Taking the logarithm and negating, we get:
$$-\log L(\mathbf{w},b) = \sum\limits_{i=1}^n \bigg(\frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2\bigg).$$
Since $\pi$ and $\sigma$ are constants, it follows that minimizing the mean squared error of the linear model is equivalent to maximum likelihood estimation.

3.1.4 From linear regression to deep network

(Figure: linear regression viewed as a single-layer neural network with $d$ inputs and one output.)

3.2 Implementation of linear regression from scratch

3.2.1 Generate dataset

Suppose we want to generate a dataset of 1000 samples, where each sample has 2 features drawn from the standard normal distribution. The labels are:
$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$
where $\mathbf{w} = [2, -3.4]^\top$, $b = 4.2$, and $\epsilon$ follows a normal distribution with mean 0 and standard deviation 0.01.

import torch

def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise"""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    # Without y.reshape, y would have only one dimension
    return X, y.reshape((-1, 1))
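
A usage sketch matching the setup above (the names true_w and true_b are my own; the later sections assume features and labels were created this way):

true_w = torch.tensor([2, -3.4])
true_b = 4.2
# 1000 samples with 2 features each; labels follow y = Xw + b + noise
features, labels = synthetic_data(true_w, true_b, 1000)
print(features.shape, labels.shape)  # torch.Size([1000, 2]) torch.Size([1000, 1])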

3.2.2 Read dataset

Since stochastic gradient descent requires us to randomly select a subset of the samples at each step, we can define `data_iter` to draw mini-batches:

import random

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The samples are read in random order, with no particular structure
    random.shuffle(indices)
    # One pass over all samples constitutes an epoch
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        # Each parameter update only uses a small subset of the samples
        yield features[batch_indices], labels[batch_indices]

The code above is only meant to illustrate how mini-batches are drawn; in an actual implementation the framework's built-in iterator (see 3.3.2) should be used.
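
For example (a sketch reusing features and labels from 3.2.1; batch_size = 10 is an arbitrary choice that the training loop below also relies on):

batch_size = 10

# Take one mini-batch and inspect its shape
for X, y in data_iter(batch_size, features, labels):
    print(X.shape, y.shape)  # torch.Size([10, 2]) torch.Size([10, 1])
    break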

3.2.3 Initialize model parameters

w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

3.2.4 Define the model

def linreg(X, w, b):
    """The linear regression model"""
    return torch.matmul(X, w) + b

3.2.5 Define the loss function

def squared_loss(y_hat, y):
    """Squared loss"""
    # The y.reshape here is not strictly necessary, since labels was already reshaped above
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

3.2.6 Define the optimization algorithm

def sgd(params, lr, batch_size):
    """Mini-batch stochastic gradient descent"""
    # No gradient computation is needed inside the following block
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            # Reset the gradient to zero
            param.grad.zero_()

3.2.7 Training

lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Mini-batch loss for X and y
        # Because l has shape (batch_size, 1) rather than being a scalar,
        # all elements of l are summed before computing the gradient w.r.t. [w, b]
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # Update the parameters using their gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
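
After training, a common sanity check (a sketch assuming true_w and true_b from the data-generation step in 3.2.1) is to compare the learned parameters against the ones used to generate the data:

with torch.no_grad():
    print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
    print(f'error in estimating b: {true_b - b}')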

3.3 Concise implementation of linear regression

3.3.1 Generate data

This part is the same as 3.2.1.

3.3.2 Reading Datasets

from torch.utils import data

We can directly use the API in `torch.utils.data` to sample mini-batches:

def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator"""
    # TensorDataset packs the given tensors together; they must agree in dimension 0
    # The * "unpacks" the argument list
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

batch_size = 10
data_iter = load_array((features, labels), batch_size)

# Access the data
for X, y in data_iter:
    print(X, y)

3.3.3 Define the model

# nn是神经网络的缩写
from torch import nn

net = nn.Sequential(nn.Linear(2, 1))

In the code above, `Sequential` connects multiple layers in series, and `Linear` implements a fully connected layer; its arguments `2, 1` specify the input dimension and the output dimension.

3.3.4 Initialize model parameters

# net[0] selects layer 0 of the network
net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

3.3.5 Define the loss function

# Returns the mean of the losses over all samples
loss = nn.MSELoss()

3.3.6 Define the optimization algorithm

# SGD takes the parameters and the hyperparameters as input
trainer = torch.optim.SGD(net.parameters(), lr=0.03)

3.3.7 Training

num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        # Update the parameters with the optimizer
        trainer.step()
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')

3.4 softmax regression

3.4.1 Classification problems

Generally, different categories are represented by one-hot encoding .

3.4.2 Network Architecture

Assuming that each sample has 4 features and 3 possible categories, the network structure of softmax regression is shown in the figure below:

(Figure: softmax regression as a single-layer, fully connected network with 4 inputs and 3 outputs.)

3.4.3 Parameter overhead of the fully connected layer

In general, a fully connected layer with $d$ inputs and $q$ outputs has a parameter cost of $O(dq)$.

3.4.4 softmax operation

For a classification problem, what we want is the probability that the input belongs to each category, so we need to process the outputs so that they satisfy the basic axioms of probability:
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum\limits_k \exp(o_k)}$$
Each component of $\hat{\mathbf{y}}$ is positive and they sum to $1$, and softmax does not change the relative ordering of the entries of $\mathbf{o}$.

3.4.5 Vectorization of mini-batch samples

$$\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}, \qquad \hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O})$$

3.4.6 Loss function

The likelihood function of softmax regression is:
$$L(\theta)=\prod\limits_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$
Taking the negative logarithm, we get:
$$-\log L(\theta)=\sum\limits_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})=\sum\limits_{i=1}^n\sum\limits_{j=1}^q-y_j^{(i)}\log \hat{y}_j^{(i)}$$

This can be read as follows: since each sample's label is a one-hot vector of length $q$, the inner sum picks out exactly the negative log-probability of the true label given the input, i.e. it equals $-\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$.

We call
$$l(\mathbf{y}, \hat{\mathbf{y}})=\sum\limits_{j=1}^{q}-y_j\log \hat{y}_j$$
the cross-entropy loss. Substituting the softmax definition and differentiating:
$$\begin{aligned} l(\mathbf{y}, \hat{\mathbf{y}}) &= - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\ &= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\ &= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\ \partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) &= \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j \end{aligned}$$
The gradient is thus the difference between the estimated value $\mathrm{softmax}(\mathbf{o})_j$ and the observed value $y_j$, which makes gradient computation much easier in practice.
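
A small numerical check of this gradient formula (my own sketch, not from the notes): compute the loss for a single sample with autograd and compare the resulting gradient against softmax(o) − y.

import torch

o = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)  # logits for q = 3 classes
y = torch.tensor([0.0, 1.0, 0.0])                       # one-hot label

# Cross-entropy written exactly as l(y, y_hat) above
l = -(y * torch.log_softmax(o, dim=0)).sum()
l.backward()

print(o.grad)                       # gradient from autograd
print(torch.softmax(o, dim=0) - y)  # softmax(o) - y, should match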

3.5 Image Classification Dataset

3.5.2 Reading small batches of data

batch_size = 256

def get_dataloader_workers():
    """Use 4 worker processes to read the data"""
    return 4

train_iter = data.DataLoader(mnist_train, batch_size, shuffle=True,
                             num_workers=get_dataloader_workers())
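
The (omitted) section 3.5.1 is where mnist_train above comes from; a minimal sketch of that step, assuming the standard torchvision Fashion-MNIST dataset is intended:

import torchvision
from torchvision import transforms
from torch.utils import data

# Convert images to float32 tensors in [0, 1]
trans = transforms.ToTensor()
mnist_train = torchvision.datasets.FashionMNIST(
    root="../data", train=True, transform=trans, download=True)
mnist_test = torchvision.datasets.FashionMNIST(
    root="../data", train=False, transform=trans, download=True)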

3.6 Implementation of softmax regression from scratch

3.6.1 Initialize model parameters

The input is a 28×28 image, which can be treated as a vector of length 784; the output is the probability of each of the 10 possible categories, so $W$ should be a $784 \times 10$ matrix and $b$ a $1 \times 10$ row vector:

num_inputs = 784
num_outputs = 10

W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
b = torch.zeros(num_outputs, requires_grad=True)

3.6.2 Define softmax operation

Implementing softmax consists of three steps:

  1. Exponentiate each term;
  2. Sum over each row (each sample is a row in the mini-batch) to get the normalization constant for each sample;
  3. Divide each row by its normalization constant, ensuring that the result sums to 1.

The corresponding code is:

def softmax(X):
    X_exp = torch.exp(X)
    # keepdim ensures the summed tensor keeps its number of dimensions
    partition = X_exp.sum(1, keepdim=True)
    return X_exp / partition  # Broadcasting is applied here
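
A quick check (my own sketch): for a random input, every row of the output should be non-negative and sum to 1.

X = torch.normal(0, 1, (2, 5))
X_prob = softmax(X)
print(X_prob)         # all entries positive
print(X_prob.sum(1))  # each row sums to 1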

3.6.3 Defining the model

def net(X):
    return softmax(torch.matmul(X.reshape((-1, W.shape[0])), W) + b)

Why can the input here be raw images? Because `X.reshape((-1, W.shape[0]))` flattens each 28×28 image in the mini-batch into a row vector of length 784 before the matrix multiplication.

3.6.4 Define the loss function

def cross_entropy(y_hat, y):
    return - torch.log(y_hat[range(len(y_hat)), y])

cross_entropy(y_hat, y)

Here, `y` is a list of labels giving each sample's class index (e.g. `[0, 1, 3]`), and `y_hat[range(len(y_hat)), y]` picks out, for each sample, the predicted probability of its true class.
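
To see the fancy indexing at work, here is a small example (the values of y_hat and y are made up for illustration):

y = torch.tensor([0, 2])
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])

# For sample 0 pick column 0, for sample 1 pick column 2
print(y_hat[range(len(y_hat)), y])  # tensor([0.1000, 0.5000])
print(cross_entropy(y_hat, y))      # tensor([2.3026, 0.6931])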

3.6.5 Classification Accuracy

Accuracy = number of correct predictions / total number of predictions

def accuracy(y_hat, y):
    """Compute the number of correct predictions"""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())

The code above works as follows: if `y_hat` is a matrix, its second dimension is assumed to hold the prediction scores for each class. We use `argmax` to obtain the index of the largest entry in each row, i.e. the predicted class. We then compare the predicted classes with the ground-truth `y` element-wise. Since the equality operator `==` is sensitive to data types, we first convert `y_hat` to the same data type as `y`. The result is a tensor of 0s (false) and 1s (true); summing it gives the number of correct predictions.
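
Continuing the small illustrative example from 3.6.4 (the y_hat and y values are made up), the fraction of correct predictions would be:

y = torch.tensor([0, 2])
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
print(accuracy(y_hat, y) / len(y))  # 0.5: only the second sample is predicted correctly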

3.7 Concise implementation of softmax regression

3.7.1 Initialize model parameters

# PyTorch does not implicitly reshape the inputs. Therefore,
# we add a flatten layer before the linear layer to adjust the shape of the network input
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

# apply runs init_weights on every layer in net,
# so the type check on m restricts initialization to the Linear layers
net.apply(init_weights);

3.7.2 Define the loss function

`CrossEntropyLoss` takes the logits $\mathbf{o}$ (without softmax) and the list of labels as input, and outputs the cross-entropy. In other words, we do not convert the outputs into probabilities with softmax before computing the loss, because the exponentiation inside softmax can easily overflow.
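
A tiny illustration of the overflow issue (my own sketch): exponentiating even a moderately large logit already produces inf, whereas CrossEntropyLoss works on the logits directly and stays finite.

import torch
from torch import nn

o = torch.tensor([[100.0, 0.0]])  # logits with one large entry
y = torch.tensor([0])

print(torch.exp(o))                                 # tensor([[inf, 1.]])
print(nn.CrossEntropyLoss(reduction='none')(o, y))  # tensor([0.]) -- no overflow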

# reduction='none' means the per-sample results are not aggregated: loss is a vector
# whose entries are the cross-entropy of each sample.
# We choose 'none' here because both the sum and the mean of the losses are needed later
loss = nn.CrossEntropyLoss(reduction='none')

3.7.3 Optimization algorithm

trainer = torch.optim.SGD(net.parameters(), lr=0.1)

3.7.4 Training

# Accumulator class
class Accumulator:
    """Accumulate sums over n variables"""
    def __init__(self, n):
        self.data = [0.0] * n

    # Add the arguments, one by one, to the accumulated values
    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def train_epoch_ch3(net, train_iter, loss, updater):
    """Train the model for one epoch (defined in Chapter 3)"""
    # Put the model into training mode
    if isinstance(net, torch.nn.Module):
        net.train()
    # Sum of training loss, sum of training accuracy, number of samples
    metric = Accumulator(3)
    for X, y in train_iter:
        # Compute gradients and update the parameters
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):
            # Use PyTorch's built-in optimizer and loss function
            updater.zero_grad()
            l.mean().backward()
            updater.step()
        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    # Return the training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]

def evaluate_accuracy(net, data_iter):  #@save
    """Compute the model's accuracy on the given dataset"""
    if isinstance(net, torch.nn.Module):
        net.eval()  # Put the model into evaluation mode
    metric = Accumulator(2)  # Number of correct predictions, total number of predictions
    with torch.no_grad():
        for X, y in data_iter:
            # accuracy is defined in 3.6.5
            metric.add(accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]

def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
    """Train the model (defined in Chapter 3)"""
    # Animator is a plotting utility from the book's d2l package
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):
        # Train for one epoch
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        # Evaluate accuracy on the test set
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, train_metrics + (test_acc,))
    train_loss, train_acc = train_metrics
    # This assert means: continue if train_loss < 0.5, otherwise raise an
    # AssertionError whose message is the value of train_loss
    assert train_loss < 0.5, train_loss
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc

num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

3.7.5 Prediction

Just use y_hat.argmax(axis=1).
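
A prediction sketch under the same setup (assuming the trained net and the test_iter data iterator used in training above; mapping class indices to label names is omitted here):

# Take one test mini-batch and predict its classes
for X, y in test_iter:
    break
preds = net(X).argmax(axis=1)
print(preds[:10])  # predicted class indices
print(y[:10])      # ground-truth class indices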


Source: https://blog.csdn.net/MaTF_/article/details/131537961