Hands-on Deep Learning (PyTorch version) (2): Linear Neural Networks

1. Linear neural network

  • The entire training process of a neural network includes: defining a simple network architecture, processing the data, specifying the loss function, and training the model. Linear regression and softmax regression, two classical statistical learning techniques, can be viewed as linear neural networks

1.1 Linear regression

  • Regression is a class of methods that can model the relationship between one or more independent variables and a dependent variable. In the natural and social sciences, regression is often used to represent the relationship between inputs and outputs
  • Most tasks in the field of machine learning are related to prediction. Regression problems arise whenever we want to predict a numerical value. Common examples include: predicting prices (houses, stocks, etc.), predicting length of stay (for hospital inpatients, etc.), and predicting demand (retail sales, etc.)

1.1.1 Basic elements of linear regression

  • Linear regression is based on a few simple assumptions

    • First, assume that the relationship between the independent variable $\mathbf{x}$ and the dependent variable $y$ is linear, i.e., $y$ can be expressed as a weighted sum of the elements of $\mathbf{x}$
    • Second, the observations are usually allowed to contain some noise, and the noise is assumed to be well behaved, e.g., to follow a normal distribution
  • To give a practical example: suppose we want to estimate the price of a house (in dollars) based on its area (in square feet) and age (in years)

    • In order to develop a model that can predict housing prices, a real dataset needs to be collected
      • This dataset includes the sales price, size and age of houses
      • In machine learning terminology, this data set is called the training set
    • Each row of data (such as the data corresponding to one housing transaction) is called a sample, which can also be called a data point or a data instance
    • Call the target you are trying to predict (such as predicting house prices) a label or target
    • The independent variables (area and age) on which the prediction is based are called features or covariates
Linear model
  • The linear assumption means that the target (house price) can be expressed as the weighted sum of features (area and house age)
    $$\mathrm{price}=w_{\mathrm{area}}\cdot\mathrm{area}+w_{\mathrm{age}}\cdot\mathrm{age}+b$$

    • $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights; the weights determine the influence of each feature on the predicted value
    • $b$ is called the bias, offset, or intercept. The bias is the value the prediction takes when all features are 0. Without a bias term, the expressive power of the model would be limited
    • The formula above is an affine transformation of the input features: a linear transformation of the features through the weighted sum, combined with a translation through the bias term
  • Given a dataset, the goal is to find the weights $\mathbf{w}$ and bias $b$ of the model such that the predictions made by the model roughly match the true prices in the data. The predicted output is determined by an affine transformation of the input features, and the affine transformation is determined by the chosen weights and bias

  • In machine learning, high-dimensional datasets are common. When the input contains $d$ features, the prediction $\hat{y}$ (the "hat" symbol denotes an estimate of $y$) is expressed as
    $$\hat{y}=w_1x_1+\ldots+w_dx_d+b$$

  • Collecting all the features into a vector $\mathbf{x}\in\mathbb{R}^{d}$ and all the weights into a vector $\mathbf{w}\in\mathbb{R}^{d}$, the model can be expressed concisely as a dot product
    $$\hat{y}=\mathbf{w}^\top\mathbf{x}+b$$

  • In the formula above, the vector $\mathbf{x}$ corresponds to the features of a single data sample. The matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ is a convenient way to refer to $n$ samples, where each row of $\mathbf{X}$ is a sample and each column is a feature. For the feature matrix $\mathbf{X}$, the predictions $\hat{\mathbf{y}}\in\mathbb{R}^{n}$ can be expressed as a matrix-vector product (a small code sketch follows this list)
    $$\hat{\mathbf{y}}=\mathbf{X}\mathbf{w}+b$$

  • Given the training data features $\mathbf{X}$ and the corresponding known labels $\mathbf{y}$, the goal of linear regression is to find a weight vector $\mathbf{w}$ and bias $b$

    • When new sample features are drawn from the same distribution as $\mathbf{X}$, these weights and bias should make the error in predicting the labels of the new samples as small as possible
    • Even when we are sure that the underlying relationship between features and labels is linear, a noise term is added to account for the impact of observation errors
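  • To make the matrix form concrete, the following is a minimal sketch (the tensors and values are made up for illustration, not taken from the book's code) that evaluates the dot-product form for one sample and the matrix-vector form for a whole feature matrix
    import torch
    
    X = torch.randn(5, 3)          # 5 samples, d = 3 features (illustrative shapes)
    w = torch.randn(3)             # one weight per feature
    b = 1.0                        # scalar bias, broadcast over all samples
    
    y_hat_single = torch.dot(X[0], w) + b  # y_hat = w^T x + b for a single sample
    y_hat_batch = torch.matmul(X, w) + b   # y_hat = Xw + b for all n samples at once
    print(y_hat_batch.shape)               # torch.Size([5])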

Before starting to search for the best model parameters $\mathbf{w}$ and $b$, two more things are needed:

  • (1) A measure of model quality
  • (2) A method that can update the model to improve the quality of the model prediction
Loss function
  • Before considering how to fit the model to the data, a measure of the quality of fit is needed. The loss function quantifies the gap between the true value of the target and the predicted value. It is usually a non-negative number, where smaller values mean smaller loss, and a perfect prediction has a loss of 0. The most commonly used loss function in regression problems is the squared error. When the prediction for sample $i$ is $\hat{y}^{(i)}$ and its corresponding true label is $y^{(i)}$, the squared error is defined by the following formula
    $$l^{(i)}(\mathbf{w},b)=\frac{1}{2}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$$

  • Because of the quadratic term in the squared error function, larger differences between the estimate $\hat{y}^{(i)}$ and the observation $y^{(i)}$ lead to larger losses. To measure the quality of the model on the entire dataset, compute the average loss over the $n$ training samples (equivalent, up to a constant factor, to summing)
    $$L(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^{n}l^{(i)}(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\left(\mathbf{w}^\top\mathbf{x}^{(i)}+b-y^{(i)}\right)^2$$

  • When training the model, we want to find a set of parameters $(\mathbf{w}^*,b^*)$ that minimizes the total loss over all training samples
    $$\mathbf{w}^*,b^*=\underset{\mathbf{w},b}{\operatorname{argmin}}\,L(\mathbf{w},b)$$
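  • As a minimal sketch (function and tensor values are illustrative, not from the book), the squared error and the average loss above can be computed as follows
    import torch
    
    def squared_loss(y_hat, y):
        # l(i) = 0.5 * (y_hat(i) - y(i))^2, computed element-wise
        return 0.5 * (y_hat - y.reshape(y_hat.shape)) ** 2
    
    y_hat = torch.tensor([2.5, 0.0, 2.0])
    y = torch.tensor([3.0, -0.5, 2.0])
    print(squared_loss(y_hat, y))         # per-sample losses l^(i)
    print(squared_loss(y_hat, y).mean())  # L(w, b): average loss over the samples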

Analytical solution
  • Unlike most other models, the solution of linear regression can be expressed by a simple formula, called the analytical solution. First, the bias $b$ is merged into the parameter $\mathbf{w}$ by appending a column (of ones) to the matrix containing all the features. The prediction problem is then to minimize $\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2$. The loss surface has only one critical point, which corresponds to the minimum of the loss over the whole domain. Setting the derivative of the loss with respect to $\mathbf{w}$ to 0 yields the analytical solution
    $$\mathbf{w}^*=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
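  • A small sketch of the closed-form solution (data and shapes are made up for illustration; a column of ones is appended to X so that the bias is folded into w, as described above)
    import torch
    
    X = torch.randn(100, 2)
    X = torch.cat([X, torch.ones(100, 1)], dim=1)  # append a constant column for the bias
    true_w = torch.tensor([[2.0], [-3.4], [4.2]])
    y = X @ true_w + 0.01 * torch.randn(100, 1)
    
    w_star = torch.inverse(X.T @ X) @ X.T @ y      # w* = (X^T X)^{-1} X^T y
    print(w_star.squeeze())                        # close to [2.0, -3.4, 4.2]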
Stochastic gradient descent
  • The gradient descent method can optimize almost all deep learning models. It reduces the error by continuously updating the parameters in the direction of decreasing loss function.

  • Usually, a small batch of samples is randomly selected each time an update needs to be calculated. This is called minibatch stochastic gradient descent.

    • In each iteration, first randomly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training samples
    • Then, calculate the derivative (also called gradient ) of the average loss of the mini-batch with respect to the model parameters
    • Finally, multiply the gradient by a predetermined positive number $\eta$ and subtract it from the current parameter values
      $$\begin{aligned}\mathbf{w}&\leftarrow\mathbf{w}-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\partial_{\mathbf{w}}l^{(i)}(\mathbf{w},b)=\mathbf{w}-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathbf{x}^{(i)}\left(\mathbf{w}^\top\mathbf{x}^{(i)}+b-y^{(i)}\right),\\b&\leftarrow b-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\partial_{b}l^{(i)}(\mathbf{w},b)=b-\frac{\eta}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\left(\mathbf{w}^\top\mathbf{x}^{(i)}+b-y^{(i)}\right).\end{aligned}$$
  • $|\mathcal{B}|$ denotes the number of samples in each mini-batch, also called the batch size, and $\eta$ denotes the learning rate. The values of the batch size and the learning rate are usually specified manually in advance rather than obtained through model training (a code sketch of one update step follows this list)

    • These parameters that can be tuned but not updated during training are called hyperparameters. Hyperparameter tuning is the process of selecting hyperparameters
    • Hyperparameters are usually tuned based on the results of training iterations, which are evaluated on an independent validation dataset
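  • Below is a sketch of a single mini-batch SGD update for linear regression (the learning rate, batch size, and data are illustrative; autograd supplies the gradients, and taking the mean over the batch already divides the summed per-sample gradients by |B|)
    import torch
    
    eta, batch_size = 0.03, 10
    w = torch.zeros(2, 1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    
    X = torch.randn(batch_size, 2)                 # one randomly sampled mini-batch B
    y = X @ torch.tensor([[2.0], [-3.4]]) + 4.2    # synthetic targets for illustration
    
    loss = 0.5 * ((X @ w + b - y) ** 2).mean()     # average squared loss over the mini-batch
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad                          # w <- w - eta * gradient of the average loss
        b -= eta * b.grad
        w.grad.zero_()
        b.grad.zero_()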
Make predictions with models
  • Given a "learned" linear regression model w ^ ⊤ x + b ^ \mathbf{\hat{w}}^{\top}\mathbf{x}+\hat{b}w^x+b^ , can now pass house areax 1 x_1x1and house age x 2 x_2x2to estimate the price of a new house (not included in the training data). The process of estimating a target given features is often called prediction or inference

1.1.2 Vectorization Acceleration

  • When training a model, it is often desirable to process an entire mini-batch of samples at once. To achieve this, the computation needs to be vectorized
  • To illustrate the importance of vectorization, consider two methods for adding vectors, instantiating two 10,000-dimensional vectors of all ones
    • In one approach, use a Python for loop to iterate over the vectors
    • In the other approach, rely on a single call to the overloaded + operator
    import math
    import time
    import numpy as np
    import torch
    
    n = 10000
    a = torch.ones([n])
    b = torch.ones([n])
    
    # Define a timer
    class Timer:
        def __init__(self):
            self.times = []
            self.start()
        def start(self):
            self.tik = time.time()
    
        def stop(self):
            self.times.append(time.time() - self.tik)
            return self.times[-1]
    
        def avg(self):
            return sum(self.times) / len(self.times)
    
        def sum(self):
            return sum(self.times)
    
        def cumsum(self):
            return np.array(self.times).cumsum().tolist()
    
    # Use a for loop, performing one elementwise addition at a time
    c = torch.zeros(n)
    timer = Timer()
    for i in range(n):
        c[i] = a[i] + b[i]
    print(f'{timer.stop():.5f} sec')
    
    # Use the overloaded + operator to compute the elementwise sum
    # Vectorized code usually brings an order-of-magnitude speedup
    timer.start()
    d = a + b
    print(f'{timer.stop():.5f} sec')
    
    # Output
    '0.20727 sec'
    '0.00020 sec'
    

1.1.3 Normal distribution and square loss

  • The squared loss objective function can be interpreted through an assumption about the noise distribution. The normal distribution, also known as the Gaussian distribution, is defined as follows: if a random variable $x$ has mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$), its probability density function is
    $$p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$$

  • One reason the mean squared error loss function can be used for linear regression is the assumption that the observations contain noise and that the noise follows a normal distribution, $\epsilon\sim\mathcal{N}(0,\sigma^2)$
    $$y=\mathbf{w}^\top\mathbf{x}+b+\epsilon$$

  • Therefore, the likelihood of observing a particular $y$ for a given $\mathbf{x}$ can now be written as
    $$P(y\mid\mathbf{x})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}\left(y-\mathbf{w}^\top\mathbf{x}-b\right)^2\right)$$

  • Now, according to the maximum likelihood estimation method, the optimal values of the parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset
    $$P(\mathbf{y}\mid\mathbf{X})=\prod_{i=1}^{n}p\left(y^{(i)}\mid\mathbf{x}^{(i)}\right)$$

  • Maximizing this likelihood is equivalent to minimizing the negative log-likelihood, which, apart from constants, is the sum of squared errors; minimizing the squared loss is therefore equivalent to maximum likelihood estimation under Gaussian noise
    $$-\log P(\mathbf{y}\mid\mathbf{X})=\sum_{i=1}^{n}\frac{1}{2}\log\left(2\pi\sigma^2\right)+\frac{1}{2\sigma^2}\left(y^{(i)}-\mathbf{w}^\top\mathbf{x}^{(i)}-b\right)^2$$

import math
import numpy as np
import matplotlib.pyplot as plt

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma ** 2 * (x - mu) ** 2)

x = np.arange(-7, 7, 0.01)

# Changing the mean shifts the curve along the x-axis; increasing the variance
# spreads the distribution out and lowers its peak
params = [(0, 1), (0, 2), (3, 1)]
plt.figure(figsize=(8, 6))
for mu, sigma in params:
    plt.plot(x, normal(x, mu, sigma), label=f'mean {mu}, std {sigma}')

plt.xlabel('x')
plt.ylabel('p(x)')
plt.legend()
plt.show()

[Figure: normal distribution density curves for (mean, std) = (0, 1), (0, 2), (3, 1)]

1.1.4 From linear regression to deep network

Neural Network Diagram
  • In the neural network shown below

    • The inputs are $x_1,\ldots,x_d$, so the number of inputs (the feature dimensionality) of the input layer is $d$
    • The output of the network is $o_1$, so the number of outputs of the output layer is 1
  • The input values are all given, and there is only one computational neuron. Since the focus of the model is where the computation occurs, the input layer is usually not counted when counting layers. In other words, the number of layers of the neural network in the figure below is 1

  • A linear regression model can be thought of as a neural network consisting of only a single artificial neuron, i.e., a single-layer neural network. For linear regression, every input is connected to every output (in this case there is only one output), and this transformation (the output layer in the figure) is called a fully-connected layer or dense layer

[Figure: linear regression as a single-layer neural network with inputs $x_1,\ldots,x_d$ and output $o_1$]

1.2 Simple implementation of linear regression

1.2.1 Generate dataset

  • Generate a dataset of 1000 samples, each containing 2 features sampled from a standard normal distribution. The synthetic dataset is a matrix $\mathbf{X}\in\mathbb{R}^{1000\times 2}$
  • Use the linear model parameters $\mathbf{w}=[2,-3.4]^\top$, $b=4.2$ and a noise term $\epsilon$ to generate the dataset and its labels
    • $\epsilon$ can be regarded as the potential observation error in model prediction and labeling; assume $\epsilon$ follows a normal distribution with mean 0 and standard deviation 0.01
      $$\mathbf{y}=\mathbf{X}\mathbf{w}+b+\epsilon$$
    import numpy as np
    import torch
    from torch.utils import data
    
    def synthetic_data(w, b, num_examples):
        X = torch.normal(0, 1, (num_examples, len(w)))
        y = torch.matmul(X, w) + b
        y += torch.normal(0, 0.01, y.shape)
        return X, y.reshape((-1, 1))
    
    true_w = torch.tensor([2, -3.4])
    true_b = 4.2
    features, labels = synthetic_data(true_w, true_b, 1000)
    

1.2.2 Read dataset

  • Call the existing API in the framework to read the data. Pass features and labels as parameters of the API, and specify batch_size through the data iterator. Additionally, the boolean is_train indicates whether the data iterator object should shuffle the data in each epoch
    def load_array(data_arrays, batch_size, is_train=True):
        dataset = data.TensorDataset(*data_arrays)
        return data.DataLoader(dataset, batch_size, shuffle=is_train)
    
    batch_size = 10
    data_iter = load_array((features, labels), batch_size)
    
    # To verify that everything works, read and print the first mini-batch of samples
    # Use iter to construct a Python iterator and next to fetch the first item from it
    print(next(iter(data_iter)))
    
    # Output
    [tensor([[ 1.0829, -0.0883],
            [ 0.0989,  0.7460],
            [ 1.0245, -0.1956],
            [-0.7932,  1.7843],
            [ 1.2336,  1.0276],
            [ 2.1166,  0.2072],
            [-0.1430,  0.4944],
            [ 0.7086,  0.3950],
            [-0.0851,  1.4635],
            [ 0.2977,  1.8625]]), 
    tensor([[ 6.6616],
            [ 1.8494],
            [ 6.9229],
            [-3.4516],
            [ 3.1747],
            [ 7.7283],
            [ 2.2302],
            [ 4.2612],
            [-0.9383],
            [-1.5352]])]
    

1.2.3 Define the model

  • For standard deep learning models, you can use the predefined layers of the framework, you only need to focus on which layers are used to construct the model, and you don't have to pay attention to the implementation details of the layers
  • In PyTorch, fully connected layers are defined in the Linear class. It is worth noting that two parameters are passed to nn.Linear
    • The first specifies the input feature shape, which is 2
    • The second specifies the output feature shape (single scalar), which is 1
    from torch import nn
    
    net = nn.Sequential(nn.Linear(2, 1))
    

1.2.4 Initialize model parameters

  • Model parameters need to be initialized before using net, such as weights and biases in a linear regression model . Deep learning frameworks usually have predefined methods to initialize parameters. Here, it is specified that each weight parameter should be randomly sampled from a normal distribution with mean 0 and standard deviation 0.01, and the bias parameter will be initialized to zero
    net[0].weight.data.normal_(0, 0.01)
    net[0].bias.data.fill_(0)
    

1.2.5 Define the loss function

  • The mean squared error is calculated using the MSELoss class, also known as the squared $L_2$ norm. By default, it returns the mean of the losses over all samples
    loss = nn.MSELoss()
    

1.2.6 Define the optimization algorithm

  • The mini-batch stochastic gradient descent algorithm is a standard tool for optimizing neural networks, and PyTorch implements many variants of this algorithm in the optim module. When instantiating an SGD instance, specify the parameters to optimize and a dictionary of hyperparameters required by the optimization algorithm. Mini-batch stochastic gradient descent only requires setting the value of lr, which is set to 0.03 here
    trainer = torch.optim.SGD(net.parameters(), lr=0.03)
    

1.2.7 Training

  • In each epoch, the dataset is traversed completely once, repeatedly fetching a mini-batch of inputs and the corresponding labels. For each mini-batch, the following steps are performed
    • Generate predictions and compute the loss l by calling net(X) (forward propagation)
    • Compute the gradients by running backpropagation
    • Update the model parameters by calling the optimizer
  • To better monitor training, the loss is computed and printed after each epoch
    num_epochs = 3
    for epoch in range(num_epochs):
        for X, y in data_iter:
            l = loss(net(X) ,y)
            trainer.zero_grad()
            l.backward()
            trainer.step()
        l = loss(net(features), labels)
        print(f'epoch {epoch + 1}, loss {l:f}')
    
  • code summary
    import numpy as np
    import torch
    from torch.utils import data
    from torch import nn
    
    # Generate the dataset
    def synthetic_data(w, b, num_examples):
        X = torch.normal(0, 1, (num_examples, len(w)))
        y = torch.matmul(X, w) + b
        y += torch.normal(0, 0.01, y.shape)
        return X, y.reshape((-1, 1))
    
    true_w = torch.tensor([2, -3.4])
    true_b = 4.2
    features, labels = synthetic_data(true_w, true_b, 1000)
    
    # Read the dataset
    def load_array(data_arrays, batch_size, is_train=True):
        dataset = data.TensorDataset(*data_arrays)
        return data.DataLoader(dataset, batch_size, shuffle=is_train)
    
    batch_size = 10
    data_iter = load_array((features, labels), batch_size)
    
    # Define the model
    net = nn.Sequential(nn.Linear(2, 1))
    
    # Initialize model parameters
    net[0].weight.data.normal_(0, 0.01)
    net[0].bias.data.fill_(0)
    
    # Define the loss function
    loss = nn.MSELoss()
    
    # Define the optimization algorithm
    trainer = torch.optim.SGD(net.parameters(), lr=0.03)
    
    # Training
    num_epochs = 3
    for epoch in range(num_epochs):
        for X, y in data_iter:
            l = loss(net(X) ,y)
            trainer.zero_grad()
            l.backward()
            trainer.step()
        l = loss(net(features), labels)
        print(f'epoch {epoch + 1}, loss {l:f}')
    
    w = net[0].weight.data
    print('estimation error of w:', true_w - w.reshape(true_w.shape))
    b = net[0].bias.data
    print('estimation error of b:', true_b - b)
    
    # Output
    epoch 1, loss 0.000216
    epoch 2, loss 0.000104
    epoch 3, loss 0.000102
    estimation error of w: tensor([-0.0002,  0.0004])
    estimation error of b: tensor([0.0002])
    

1.3 softmax regression

1.3.1 Classification problem

  • Start with an image classification problem. Assume each input is a 2 x 2 grayscale image. Each pixel value can be represented by a scalar, so each image corresponds to four features $x_1, x_2, x_3, x_4$. Also assume that each image belongs to exactly one of the categories "cat", "chicken", and "dog"
  • A simple way to represent categorical data is one-hot encoding. A one-hot encoding is a vector with as many components as there are classes; the component corresponding to the sample's category is set to 1 and all other components are set to 0. In this example the label $y$ is a three-dimensional vector, where $(1,0,0)$ corresponds to "cat", $(0,1,0)$ to "chicken", and $(0,0,1)$ to "dog"
    $$y\in\{(1,0,0),(0,1,0),(0,0,1)\}$$
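  • As a small sketch, PyTorch's one_hot function can produce these vectors from integer class indices (the index order 0 = cat, 1 = chicken, 2 = dog is an assumption made only for this example)
    import torch
    import torch.nn.functional as F
    
    labels = torch.tensor([0, 2, 1])        # "cat", "dog", "chicken"
    print(F.one_hot(labels, num_classes=3))
    # tensor([[1, 0, 0],
    #         [0, 0, 1],
    #         [0, 1, 0]])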

1.3.2 Network Architecture

  • To estimate the conditional probabilities of all possible classes, a model with multiple outputs is needed, one per class. To solve classification problems with linear models, as many affine functions as there are outputs are needed, and each output corresponds to its own affine function. In this example there are 4 features and 3 possible output classes, so 12 scalars are needed to represent the weights ($w$ with subscripts) and 3 scalars to represent the biases ($b$ with subscripts). The following computes three unnormalized predictions (logits), $o_1$, $o_2$ and $o_3$, for each input
    $$\begin{aligned}o_1&=x_1w_{11}+x_2w_{12}+x_3w_{13}+x_4w_{14}+b_1,\\o_2&=x_1w_{21}+x_2w_{22}+x_3w_{23}+x_4w_{24}+b_2,\\o_3&=x_1w_{31}+x_2w_{32}+x_3w_{33}+x_4w_{34}+b_3.\end{aligned}$$

  • This computational process can be described by a neural network diagram. Like linear regression, softmax regression is also a single-layer neural network. Since the computation of each output $o_1$, $o_2$ and $o_3$ depends on all inputs $x_1$, $x_2$, $x_3$ and $x_4$, the output layer of softmax regression is also a fully connected layer

[Figure: softmax regression as a single-layer neural network with inputs $x_1,\ldots,x_4$ and outputs $o_1, o_2, o_3$]

1.3.3 Parameter overhead of fully connected layer

  • Fully connected layers are "fully" connected, so they may have many learnable parameters. Specifically, for a fully connected layer with $d$ inputs and $q$ outputs, the parameter cost is $\mathcal{O}(dq)$. The cost of transforming $d$ inputs into $q$ outputs can be reduced to $\mathcal{O}(\frac{dq}{n})$, where the hyperparameter $n$ can be chosen flexibly in practice to balance parameter savings against model effectiveness
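  • For example (a small sketch, with d = 784 and q = 10 chosen only for illustration), a fully connected layer has d*q weights plus q biases
    import torch
    from torch import nn
    
    layer = nn.Linear(784, 10)
    num_params = sum(p.numel() for p in layer.parameters())
    print(num_params)  # 784 * 10 + 10 = 7850, i.e. O(dq)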

1.3.4 softmax operation

  • The softmax function transforms unnormalized predictions into non-negative numbers that sum to 1, while keeping the model differentiable. To do this, each unnormalized prediction is first exponentiated, which guarantees non-negative outputs. To ensure that the final output probabilities sum to 1, each exponentiated result is divided by their sum
    $$\hat{\mathbf{y}}=\mathrm{softmax}(\mathbf{o})\quad\text{where}\quad\hat{y}_j=\frac{\exp(o_j)}{\sum_k\exp(o_k)}$$

  • Although softmax is a nonlinear function, the output of softmax regression is still determined by the affine transformation of the input features. So softmax regression is a linear model

1.3.5 Vectorization of mini-batch samples

  • In order to improve computational efficiency and make full use of the GPU, vector calculations are usually performed on the data of the mini-batch of samples. The vector calculation expression of softmax regression is
    $$\begin{aligned}\mathbf{O}&=\mathbf{X}\mathbf{W}+\mathbf{b},\\\hat{\mathbf{Y}}&=\mathrm{softmax}(\mathbf{O})\end{aligned}$$
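  • A minimal sketch of the softmax operation and this vectorized forward pass (X, W, b and the shapes are random placeholders, not trained parameters)
    import torch
    
    def softmax(O):
        # naive version; in practice the row maximum is subtracted before exp for numerical stability
        O_exp = torch.exp(O)
        return O_exp / O_exp.sum(dim=1, keepdim=True)  # each row sums to 1
    
    n, d, q = 4, 5, 3                 # n samples, d features, q classes
    X = torch.randn(n, d)
    W = torch.randn(d, q)
    b = torch.zeros(q)
    
    O = X @ W + b                     # unnormalized predictions (logits)
    Y_hat = softmax(O)
    print(Y_hat.sum(dim=1))           # tensor([1., 1., 1., 1.])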

1.3.6 Loss function

  • Omitted here; the treatment is essentially the same as for linear regression

1.3.7 Information theory basis

  • Information theory deals with encoding, decoding, sending, and processing information or data as concisely as possible

  • The core idea of information theory is to quantify the information content in data; this quantity is called the entropy of a distribution $P$
    $$H[P]=\sum_j -P(j)\log P(j)$$
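  • As a small worked example (the probabilities are made up for illustration), the entropy of a discrete distribution can be computed directly from the formula above
    import torch
    
    P = torch.tensor([0.5, 0.25, 0.25])
    entropy = -(P * torch.log(P)).sum()
    print(entropy)  # 1.5 * log(2) ≈ 1.0397 nats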

1.3.8 Model prediction and evaluation

  • After training a softmax regression model, given any sample features, the probability of each output class can be predicted. Usually the class with the highest predicted probability is taken as the output class. A prediction is correct if it agrees with the actual class (label). The performance of the model is evaluated using accuracy, which equals the ratio between the number of correct predictions and the total number of predictions
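  • A minimal sketch of this evaluation (the predicted probabilities and labels are made up for illustration): take the class with the highest predicted probability via argmax and compare it with the labels
    import torch
    
    y_hat = torch.tensor([[0.1, 0.3, 0.6],
                          [0.3, 0.2, 0.5],
                          [0.8, 0.1, 0.1]])
    y = torch.tensor([2, 1, 0])
    
    predictions = y_hat.argmax(dim=1)             # predicted class for each sample
    accuracy = (predictions == y).float().mean()  # correct predictions / total predictions
    print(accuracy)                               # tensor(0.6667)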

1.4 Image Classification Dataset

1.4.1 Read dataset

import torch
import torchvision
from torch.utils import data
from torchvision import transforms
import matplotlib.pyplot as plt

# Use a ToTensor instance to convert the image data from PIL type to 32-bit floating point
# and divide by 255 so that all pixel values lie between 0 and 1
trans = transforms.ToTensor()
# root: path where the dataset is downloaded/saved; train: whether to load the training or the test split
# transform: transformations applied to the dataset; download: whether to download the dataset
mnist_train = torchvision.datasets.FashionMNIST(
    root="./data", train=True, transform=trans, download=True)
mnist_test = torchvision.datasets.FashionMNIST(
    root="./data", train=False, transform=trans, download=True)

# Convert numeric labels into their corresponding class names
def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    # This is a list comprehension:
        # 1. Convert each element of labels, used as an index, into its text label
        # 2. Collect these elements into a new list and return it
    return [text_labels[int(i)] for i in labels]

def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
    figsize = (num_cols * scale, num_rows * scale)
    # The first variable _ is a conventional throwaway name for a value we don't need
    # The second variable axes is an array containing all the subplot objects
    # This naming convention signals that only axes matters, not the first return value
    _, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()  # flatten axes into a 1-D array
    # Iterate over axes and imgs together: i is the index, ax the current subplot, img the current image
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        if isinstance(img, torch.Tensor):  # img is a torch.Tensor
            # img is a tensor of shape (C, H, W): C channels, H height, W width
            # permute(1, 2, 0) rearranges the dimensions from (C, H, W) to (H, W, C) for imshow
            ax.imshow(img.permute(1, 2, 0))
        else:
            ax.imshow(img) 
        ax.axis('off')  # hide the image axes
        if titles:
            ax.set_title(titles[i])
    plt.show()

X, y = next(iter(data.DataLoader(mnist_train, batch_size=18)))
show_images(X, 2, 9, titles=get_fashion_mnist_labels(y))

[Figure: a sample of Fashion-MNIST images with their text labels]

1.4.2 Reading small batches

  • To make reading the training and test sets easier, use the built-in data iterators instead of creating them from scratch. In each iteration, the data loader reads a mini-batch of data of size batch_size. The built-in data iterator randomly shuffles all samples, so mini-batches are read without bias
    • When dealing with larger datasets, feeding the whole dataset to the network at once does not train well. Usually the full set of samples is split into multiple batches, and the number of samples in each batch is called the batch size (batch_size)
    batch_size = 256
    
    def get_dataloader_workers():
        return 4  # use 4 worker processes to read the data
    train_iter = data.DataLoader(mnist_train, batch_size, shuffle=True, 
                                 num_workers=get_dataloader_workers())
    

1.4.3 Integrating all components

  • Now define the load_data_fashion_mnist function to fetch and read the Fashion-MNIST dataset. This function returns the data iterators for the training and validation sets. In addition, this function also accepts an optional parameter resize, which is used to resize the images to another shape
    def load_data_fashion_mnist(batch_size, resize=None):
        # Download the Fashion-MNIST dataset and load it into memory
        trans = [transforms.ToTensor()]
        if resize:
            trans.insert(0, transforms.Resize(resize))
        trans = transforms.Compose(trans)
        mnist_train = torchvision.datasets.FashionMNIST(
            root="./data", train=True, transform=trans, download=True)
        mnist_test = torchvision.datasets.FashionMNIST(
            root="./data", train=False, transform=trans, download=True)
        return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                                num_workers=get_dataloader_workers()),
                data.DataLoader(mnist_test, batch_size, shuffle=False,
                                num_workers=get_dataloader_workers()))
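  • Usage sketch (the batch size and resize value are arbitrary, chosen only to show the parameters): read one mini-batch and inspect its shape and dtype
    train_iter, test_iter = load_data_fashion_mnist(32, resize=64)
    for X, y in train_iter:
        print(X.shape, X.dtype, y.shape, y.dtype)
        # torch.Size([32, 1, 64, 64]) torch.float32 torch.Size([32]) torch.int64
        break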
    

1.5 Concise implementation of softmax regression

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

# Set a random seed to make the results reproducible
torch.manual_seed(42)

# Define hyperparameters
batch_size = 128        # number of samples per batch
learning_rate = 0.1     # learning rate, controls the step size of parameter updates
num_epochs = 100        # number of training epochs

# Load the Fashion-MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),                # convert images to tensors
    transforms.Normalize((0.5,), (0.5,))  # normalize pixel values to the range [-1, 1]
])

# Load the training and test sets, converting the data to tensors
train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create data loaders for the training and test sets to fetch data in batches
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Define the model
# Create a class named SoftmaxRegression that inherits from nn.Module
class SoftmaxRegression(nn.Module):
    def __init__(self, input_size, num_classes):  # constructor
        super(SoftmaxRegression, self).__init__()
        # Define a linear layer (nn.Linear) as the model's only layer,
        # with input size input_size and output size num_classes
        self.linear = nn.Linear(input_size, num_classes)

    # Forward pass: pass the input through the linear layer to get the output
    def forward(self, x):
        out = self.linear(x)
        return out

model = SoftmaxRegression(input_size=784, num_classes=10)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()    # cross-entropy loss for multi-class classification
optimizer = optim.SGD(model.parameters(), lr=learning_rate)  # stochastic gradient descent optimizer for updating the model parameters

# Train the model
train_losses = []
test_losses = []
# One complete pass of the model over all the data (forward and backward propagation) is called an epoch
# During gradient-descent training, the network gradually moves from underfitting towards a good fit; past the optimum it starts to overfit
# So a larger number of epochs is not always better; the more diverse the data, the larger the appropriate number of epochs
for epoch in range(num_epochs):
    train_loss = 0.0

    # 1. Set the model to training mode
    model.train()  
    for images, labels in train_loader:
        # Flatten the input images
        images = images.reshape(-1, 784)

        # Forward pass, loss computation, backward pass, and optimization
        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # 2. Set the model to evaluation mode (compute the loss on the test set)
    model.eval()  
    test_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.reshape(-1, 784)
            outputs = model(images)
            loss = criterion(outputs, labels)
            test_loss += loss.item()

            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    train_loss /= len(train_loader)
    test_loss /= len(test_loader)
    accuracy = 100 * correct / total

    train_losses.append(train_loss)
    test_losses.append(test_loss)

    print(f'Epoch [{epoch + 1}/{num_epochs}], Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}, Accuracy: {accuracy:.2f}%')

# Plot the losses
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Output
Epoch [1/100], Train Loss: 0.6287, Test Loss: 0.5182, Accuracy: 81.96%
Epoch [2/100], Train Loss: 0.4887, Test Loss: 0.4981, Accuracy: 82.25%
Epoch [3/100], Train Loss: 0.4701, Test Loss: 0.4818, Accuracy: 82.49%
Epoch [4/100], Train Loss: 0.4554, Test Loss: 0.4719, Accuracy: 82.90%
Epoch [5/100], Train Loss: 0.4481, Test Loss: 0.4925, Accuracy: 82.57%
Epoch [6/100], Train Loss: 0.4360, Test Loss: 0.4621, Accuracy: 83.53%
Epoch [7/100], Train Loss: 0.4316, Test Loss: 0.4662, Accuracy: 83.53%
Epoch [8/100], Train Loss: 0.4293, Test Loss: 0.4543, Accuracy: 83.80%
Epoch [9/100], Train Loss: 0.4289, Test Loss: 0.5460, Accuracy: 81.09%
...

[Figure: training and test loss curves over the epochs]

Origin blog.csdn.net/qq_42994487/article/details/132341723