"Hands-on Learning Deep Learning" Notes (2) Linear Neural Network

3. Linear neural network

3.1 Linear regression

3.1.1 Introduction

  1. Regression is a class of methods for modeling the relationship between one or more independent variables and a dependent variable. Linear regression rests on a few simple assumptions: ① the relationship between the independent variables and the dependent variable is linear; ② noise is allowed, but it follows a normal distribution.
  2. Basic terminology: training dataset/training set; sample/data point/data instance; label/target (the value to be predicted); feature/covariate (the independent variables the prediction is based on). Let $n$ denote the number of samples in the dataset; for the sample with index $i$, the input is written as $\mathbf{x}^{(i)}=[x_1^{(i)}, x_2^{(i)}]^{\top}$ and the corresponding label is $y^{(i)}$.
  3. The linear assumption involves weights and a bias, i.e. an affine transformation of the input features. Collecting all the features into a vector $\mathbf{x}\in\mathbb{R}^d$ and all the weights into a vector $\mathbf{w}\in\mathbb{R}^d$ gives the concise form of the linear model: $\hat{y}=\mathbf{w}^{\top}\mathbf{x}+b$, and for the whole dataset: $\hat{\mathbf{y}}=\mathbf{X}\mathbf{w}+b$. To find the best model parameters $\mathbf{w}$ and $b$, two more things are needed:
(1) A measure of model quality: the loss function
$$L(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^n\frac{1}{2}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)}+b-y^{(i)}\right)^2\tag{1}$$
$$\mathbf{w}^*,b^*=\underset{\mathbf{w},b}{\arg\min}\ L(\mathbf{w},b)\tag{2}$$
  The solution of linear regression can be expressed analytically with a simple formula, but such closed-form solutions impose strict restrictions on the problem and therefore cannot be used widely in deep learning. This motivates the second ingredient: (2) a method for updating the model to improve its prediction quality. Gradient descent computes the derivative of the loss function with respect to the model parameters; in actual practice, a small batch of samples is randomly drawn each time an update is computed. This variant is called minibatch stochastic gradient descent.
  Randomly sample a mini-batch $\mathcal{B}$ and fix a positive learning rate $\eta$ in advance; the update rule is:
$$(\mathbf{w},b)\leftarrow(\mathbf{w},b)-\frac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\partial_{(\mathbf{w},b)}l^{(i)}(\mathbf{w},b)\tag{3}$$
  Here $\vert\mathcal{B}\vert$ is the number of samples in each mini-batch, also known as the batch size, and $\eta$ is the learning rate, which is set manually in advance. Parameters like these, which can be tuned but are not updated during training, are called hyperparameters. A minimal from-scratch sketch of this procedure follows.
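  A minimal sketch combining the squared loss (1) with the update rule (3). The synthetic data generation and all variable names here are illustrative assumptions, not code from the book:

```python
import torch

# Hypothetical toy setup: synthetic data generated from known "true" parameters.
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
X = torch.normal(0, 1, (1000, 2))
y = X @ true_w + true_b + torch.normal(0, 0.01, (1000,))

# Parameters to learn.
w = torch.normal(0, 0.01, size=(2,), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr, batch_size, num_epochs = 0.03, 10, 3   # eta, |B|, epochs: hyperparameters

for epoch in range(num_epochs):
    # Random shuffling yields the mini-batches B of equation (3).
    indices = torch.randperm(len(X))
    for start in range(0, len(X), batch_size):
        idx = indices[start:start + batch_size]
        # Mean squared loss of equation (1) over the mini-batch.
        loss = ((X[idx] @ w + b - y[idx]) ** 2 / 2).mean()
        loss.backward()
        with torch.no_grad():
            # Update rule of equation (3).
            w -= lr * w.grad
            b -= lr * b.grad
            w.grad.zero_()
            b.grad.zero_()
```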
  4. Generalization: finding parameters that also work on data the model has never seen; using the model to make predictions.
  5. Vectorization for speed: when training the model, vectorize the computation so that a linear algebra library is used instead of expensive Python for loops, reducing the program's running time. A small timing sketch follows.
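  For example, the rough comparison below (a sketch using the standard `time` module; the exact numbers depend on the machine, but the vectorized version is typically orders of magnitude faster):

```python
import time
import torch

n = 10000
a, b = torch.ones(n), torch.ones(n)

# Element-wise addition with an explicit Python for loop.
c = torch.zeros(n)
start = time.time()
for i in range(n):
    c[i] = a[i] + b[i]
loop_time = time.time() - start

# The same computation as a single vectorized operation.
start = time.time()
d = a + b
vec_time = time.time() - start

print(f'loop: {loop_time:.5f} sec, vectorized: {vec_time:.5f} sec')
```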
  6. Normal distribution and squared loss: modelling the noise in the observations as normally distributed gives $y=\mathbf{w}^{\top}\mathbf{x}+b+\epsilon$, where $\epsilon\sim\mathcal{N}(0,\sigma^2)$. Maximum likelihood estimation under this assumption leads to the same parameters as minimizing the squared loss; the derivation is sketched below.
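  The standard derivation (filled in here for completeness; the notes only name it): under the Gaussian noise assumption, the likelihood of observing $y$ given $\mathbf{x}$ is

$$P(y\mid\mathbf{x})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{1}{2\sigma^2}\bigl(y-\mathbf{w}^{\top}\mathbf{x}-b\bigr)^2\right)$$

so the negative log-likelihood of the whole dataset is

$$-\log P(\mathbf{y}\mid\mathbf{X})=\sum_{i=1}^{n}\left(\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2\sigma^2}\bigl(y^{(i)}-\mathbf{w}^{\top}\mathbf{x}^{(i)}-b\bigr)^2\right)$$

  The first term and the factor $\frac{1}{\sigma^2}$ do not depend on $\mathbf{w}$ or $b$, so minimizing the negative log-likelihood is equivalent to minimizing the squared loss (1).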
  7. From linear regression to deep networks: neural networks cover a rich family of models, and the linear model can be described in the same language as a neural network. Below, the model is rewritten in "layer" notation. (Note: the values of the weights and biases are not shown in the figure below.)

Linear regression is a single layer neural network

  For linear regression, every input is connected to the output, and this transformation is called a fully connected layer or dense layer. The ideas behind these algorithms are partly inspired by the study of real biological nervous systems.

3.1.2 Implementation of linear regression from scratch

3.1.3 Simple implementation of linear regression

3.2 softmax regression

3.2.1 Introduction

  1. [Classification problems] Regression answers "how much" or "how many" questions; we may also be interested in classification problems, which ask not "how much" but "which one". There are two subtly different variants: ① distinguishing only the "hard" categories; ② obtaining "soft" categories, i.e. the probability of belonging to each category.
  2. [One-hot encoding] Statisticians long ago devised a simple way to represent categorical data: one-hot encoding, a vector with as many components as there are categories. With three categories, the label $y$ can be expressed as:
$$y\in\{(1,0,0),(0,1,0),(0,0,1)\}$$
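  For example, with PyTorch's `torch.nn.functional.one_hot` (a small illustration, not part of the original notes):

```python
import torch
import torch.nn.functional as F

# Integer class labels for three samples (e.g. cat=0, chicken=1, dog=2).
y = torch.tensor([0, 2, 1])
# One row per sample, one column per category.
print(F.one_hot(y, num_classes=3))
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```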
  3. [Multi-output model] Taking a 4-pixel image to be classified into three categories as an example, estimating the probability of each category requires a multi-output model:
$$\begin{aligned}o_1&=x_1w_{11}+x_2w_{12}+x_3w_{13}+x_4w_{14}+b_1\\ o_2&=x_1w_{21}+x_2w_{22}+x_3w_{23}+x_4w_{24}+b_2\\ o_3&=x_1w_{31}+x_2w_{32}+x_3w_{33}+x_4w_{34}+b_3\end{aligned}\tag{4}$$
  The corresponding neural network diagram is shown below. As can be seen, softmax regression is also a single-layer neural network, and its output layer is also a fully connected layer.

softmax regression is a single layer neural network

  4. [Parameter overhead] Fully connected layers are ubiquitous in deep learning. For a fully connected layer with $d$ inputs and $q$ outputs, the parameter overhead is $\mathcal{O}(dq)$, which may be unacceptable in practice; this cost can be reduced to $\mathcal{O}(\frac{dq}{n})$, where the hyperparameter $n$ can be chosen flexibly to balance parameter savings against model effectiveness (Zhang et al., 2021).

  5. [Softmax function] To interpret the outputs as probabilities, they must be non-negative and sum to 1 for any input. The softmax function converts the unnormalized predictions into non-negative values that sum to 1 while keeping the model differentiable; the formula is as follows:
$$\hat{\mathbf{y}}=\mathrm{softmax}(\mathbf{o})\quad\text{where}\quad\hat{y}_j=\frac{\exp(o_j)}{\sum_k\exp(o_k)}\tag{5}$$
  In this process, $\arg\max_j\hat{y}_j=\arg\max_j o_j$: although softmax is nonlinear, the outputs of softmax regression are still determined by an affine transformation of the input features, so softmax regression is a linear model. A minimal implementation sketch follows.
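  A minimal sketch of equation (5), not the book's implementation. Subtracting the row-wise maximum is a common numerical-stability trick; it cancels in the ratio and does not change the result:

```python
import torch

def softmax(O):
    """Row-wise softmax of equation (5): exp(o_j) / sum_k exp(o_k)."""
    # Subtract the row-wise max to avoid overflow in exp.
    O = O - O.max(dim=1, keepdim=True).values
    O_exp = torch.exp(O)
    return O_exp / O_exp.sum(dim=1, keepdim=True)

O = torch.tensor([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
Y_hat = softmax(O)
print(Y_hat)                                  # rows are non-negative and sum to 1
print(Y_hat.argmax(dim=1), O.argmax(dim=1))   # the argmax is preserved
```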
  6. [Vectorization of mini-batches] To speed up computation, the whole mini-batch is processed with a single matrix product of $\mathbf{X}$ and $\mathbf{W}$, as sketched below.
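  A sketch with illustrative shapes (d = 784 inputs and q = 10 outputs, as for Fashion-MNIST; the sizes are my choice for illustration):

```python
import torch

batch_size, d, q = 4, 784, 10            # illustrative sizes
X = torch.randn(batch_size, d)           # mini-batch of flattened images
W = torch.randn(d, q) * 0.01
b = torch.zeros(q)

O = X @ W + b                            # one matrix product for the whole mini-batch
print(O.shape)                           # torch.Size([4, 10])
```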
  7. [Loss function] The softmax function yields a vector $\hat{\mathbf{y}}$, interpreted as "the conditional probability of each class given the input $\mathbf{x}$". The optimization goal is to maximize $P(\mathbf{Y}\mid\mathbf{X})$, which leads to the following objective:
$$-\log P(\mathbf{Y}\mid\mathbf{X})=\sum_{i=1}^n-\log P(\mathbf{y}^{(i)}\mid\mathbf{x}^{(i)})=\sum_{i=1}^n l(\mathbf{y}^{(i)},\hat{\mathbf{y}}^{(i)})\tag{6}$$
$$l(\mathbf{y}^{(i)},\hat{\mathbf{y}}^{(i)})=-\sum_{j=1}^q y_j\log\hat{y}_j\tag{7}$$
  The loss in (7) is usually called the cross-entropy loss, one of the most commonly used losses for classification problems. Since the exponential function is always positive, every predicted probability $\hat{y}_j$ is strictly between 0 and 1; the loss is therefore always positive and can only approach zero when the true class is predicted with probability close to 1. A small sketch of this loss follows.
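  A minimal sketch of equation (7), assuming `y_hat` already holds valid probabilities. With one-hot labels only the log-probability of the true class contributes, so it can be picked out by indexing:

```python
import torch

def cross_entropy(y_hat, y):
    """Equation (7): -log of the predicted probability of the true class."""
    return -torch.log(y_hat[range(len(y_hat)), y])

y = torch.tensor([0, 2])                      # true class indices
y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])       # predicted probabilities
print(cross_entropy(y_hat, y))                # tensor([2.3026, 0.6931])
```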
  8. [Derivative Calculation] Substituting formula (5) into formula (7), we can get:
$$l(\mathbf{y},\hat{\mathbf{y}})=-\sum_{j=1}^q y_j\log\frac{\exp(o_j)}{\sum_k\exp(o_k)}=\log\sum_{k=1}^q\exp(o_k)-\sum_{j=1}^q y_jo_j\tag{8}$$
Differentiating gives the derivative of the loss with respect to any unnormalized prediction (logit) $o_j$:
$$\partial_{o_j}l(\mathbf{y},\hat{\mathbf{y}})=\frac{\exp(o_j)}{\sum_{k=1}^q\exp(o_k)}-y_j=\mathrm{softmax}(\mathbf{o})_j-y_j\tag{9}$$
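  A quick numerical check (my own, not from the notes) that autograd reproduces equation (9), i.e. the gradient with respect to the logits equals $\mathrm{softmax}(\mathbf{o})-\mathbf{y}$:

```python
import torch
import torch.nn.functional as F

o = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0])            # one-hot label, true class = 1

loss = -(y * F.log_softmax(o, dim=0)).sum()  # cross-entropy, as in equation (8)
loss.backward()

print(o.grad)                    # gradient computed by autograd
print(F.softmax(o, dim=0) - y)   # equation (9): the two match
```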
  9. [Entropy] The core idea of information theory is to quantify the information content in data. This quantity is called the entropy of a distribution $P$, computed as $H[P]=\sum_j-P(j)\log P(j)$; its unit is the nat when the natural logarithm is used, or the bit for base-2 logarithms.
  [Compression and prediction] Imagine a data stream to be compressed: data that are easy to predict are also easy to compress, and whenever we cannot perfectly predict the next item we feel "surprised". Shannon chose the information content $\log\frac{1}{P(j)}=-\log P(j)$ to quantify this surprise: the lower the probability we assigned to an event, the greater our surprise and the greater the information content. The entropy above is the expected information when the assigned probabilities truly match the data-generating process.
  [Understanding entropy and cross-entropy] Think of the entropy $H(P)$ as "the degree of surprise experienced by someone who knows the true probabilities", and the cross-entropy $H(P,Q)$ as "the expected surprise of an observer with subjective probabilities $Q$ upon seeing data generated with probabilities $P$"; the cross-entropy is minimized when $P=Q$.
  [Cross-entropy as a classification objective] The cross-entropy objective can be viewed from two angles: ① maximizing the likelihood of the observed data; ② minimizing the surprise required to communicate the labels. A small numerical illustration follows.
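  A small numerical illustration (the distributions are chosen arbitrarily) showing that the cross-entropy equals the entropy when $P=Q$ and is larger otherwise:

```python
import torch

def entropy(P):
    """H[P] = sum_j -P(j) log P(j), in nats."""
    return -(P * torch.log(P)).sum()

def cross_ent(P, Q):
    """H(P, Q): expected surprise of an observer who believes Q when data follow P."""
    return -(P * torch.log(Q)).sum()

P = torch.tensor([0.7, 0.2, 0.1])
Q = torch.tensor([1/3, 1/3, 1/3])
print(entropy(P))          # H[P]
print(cross_ent(P, P))     # equals the entropy
print(cross_ent(P, Q))     # larger: H(P, Q) >= H(P), minimum at Q = P
```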

  10. [Evaluating model predictions] Use the class with the highest predicted probability as the output class; if it agrees with the actual class, the prediction is correct. Accuracy (the ratio of the number of correct predictions to the total number of predictions) is used to evaluate the model's performance, as sketched below.
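  A minimal sketch of this accuracy metric (the function name and example data are illustrative):

```python
import torch

def accuracy(y_hat, y):
    """Fraction of samples whose highest-probability class matches the label."""
    preds = y_hat.argmax(dim=1)          # class with the highest predicted probability
    return (preds == y).float().mean().item()

y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])
y = torch.tensor([2, 0])
print(accuracy(y_hat, y))   # 0.5: the first prediction is correct, the second is not
```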

3.2.2 Implementation of softmax regression from scratch

3.2.3 Simple implementation of softmax regression

3.3 Image Classification Dataset

  MNIST (LeCun et al., 1998) is one of the most widely used datasets for image classification, but it is too simple to serve as a benchmark. The Fashion-MNIST dataset (Xiao et al., 2017) is similar but more complex: it is a clothing classification dataset consisting of images from 10 categories, with 6,000 images per category in the training set and 1,000 per category in the test set.

""" 导入所需库 """
%matplotlib inline
import torch
import torchvision
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l

d2l.use_svg_display()

""" 读取数据集 """
# 通过ToTensor实例将图像数据从PIL类型变换成32位浮点数格式,
# 并除以255使得所有像素的数值均在0~1之间
trans = transforms.ToTensor()
mnist_train = torchvision.datasets.FashionMNIST(
    root="../data", train=True, transform=trans, download=True)
mnist_test = torchvision.datasets.FashionMNIST(
    root="../data", train=False, transform=trans, download=True)

len(mnist_train), len(mnist_test) # (60000, 10000)
mnist_train[0][0].shape # torch.Size([1, 28, 28])

""" 在数字标签索引及其文本名称之间转换 """
def get_fashion_mnist_labels(labels):  #@save
    """返回Fashion-MNIST数据集的文本标签"""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

""" 可视化样本图像 """
def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
    """绘制图像列表"""
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        if torch.is_tensor(img):
            # Image tensor
            ax.imshow(img.numpy())
        else:
            # PIL image
            ax.imshow(img)
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes

""" 展示图像 """
X, y = next(iter(data.DataLoader(mnist_train, batch_size=18)))
show_images(X.reshape(18, 28, 28), 2, 9, titles=get_fashion_mnist_labels(y));

To make it easier to read the training and test sets, PyTorch's built-in data iterator is generally used instead of creating one from scratch. The data iterator reads a mini-batch of data of size batch_size in each iteration and can randomly shuffle the samples to keep them unbiased.

batch_size = 256
def get_dataloader_workers():  #@save
    """使用4个进程来读取数据"""
    return 4

train_iter = data.DataLoader(mnist_train, batch_size, shuffle=True,
                             num_workers=get_dataloader_workers())
""" 统计时间 """
timer = d2l.Timer()
for X, y in train_iter:
    continue
f'{timer.stop():.2f} sec'

Integrating all components

def load_data_fashion_mnist(batch_size, resize=None):  #@save
    """下载Fashion-MNIST数据集,然后将其加载到内存中"""
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="../data", train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=get_dataloader_workers()),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=get_dataloader_workers()))

train_iter, test_iter = load_data_fashion_mnist(32, resize=64)
for X, y in train_iter:
    print(X.shape, X.dtype, y.shape, y.dtype)  # torch.Size([32, 1, 64, 64]) torch.float32 torch.Size([32]) torch.int64
    break

Origin blog.csdn.net/lj164567487/article/details/129177711