[Hands-on Deep Learning] Li Mu - Multilayer Perceptron

Implementation of multi-layer perceptron from scratch

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

# Initialize model parameters
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens, requires_grad=True) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs, requires_grad=True) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
params = [W1, b1, W2, b2]


def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)


def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(X @ W1 + b1)  # @ is shorthand for matrix multiplication
    return H @ W2 + b2


loss = nn.CrossEntropyLoss(reduction='none')

num_epochs, lr = 10, 0.1
updater = torch.optim.SGD(params, lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
plt.show()
d2l.predict_ch3(net,test_iter)
plt.show()

Simple implementation of multi-layer perceptron

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))


def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)


net.apply(init_weights)


batch_size, lr, num_epochs = 256, 0.01, 10
loss = nn.CrossEntropyLoss(reduction='none')


train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

optimer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter,test_iter, loss, num_epochs, optimer)


plt.show()

Weight decay

Simple implementation

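import torch
from torch import nn
# Example values (assumed here; the original snippet does not define them)
num_inputs, wd, lr = 200, 3, 0.003
net = nn.Sequential(nn.Linear(num_inputs, 1))
for param in net.parameters():
    param.data.normal_()
optimer = torch.optim.SGD([{"params": net[0].weight, 'weight_decay': wd},
                           {"params": net[0].bias}], lr=lr)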

Setting 'weight_decay' to wd in the first parameter group applies weight decay to the weights. The bias group has no 'weight_decay' entry, so the bias is not decayed.
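
To make the effect of this option concrete, here is a minimal sketch in plain Python. It assumes the usual formulation of SGD with weight decay, where the term `wd * w` is added to the gradient before the step. The function name `sgd_step` and the numbers are illustrative only and do not come from the original notes.

```python
def sgd_step(w, grad, lr, wd=0.0):
    """One SGD update on a scalar weight; wd * w is the gradient of the L2 penalty."""
    return w - lr * (grad + wd * w)


w, grad, lr, wd = 2.0, 0.5, 0.1, 3.0
plain = sgd_step(w, grad, lr)          # no decay, like the bias group above
decayed = sgd_step(w, grad, lr, wd)    # with decay, like the weight group above
print(plain, decayed)
```

With decay, the weight is pulled toward zero in addition to following the loss gradient, which is why a larger wd gives a more strongly regularized model.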

Dropout

A good model should be robust to perturbations of its input data, that is, it should not be affected by noise. Training on data with injected noise, in a way that keeps the model from fitting that noise, is equivalent to a form of regularization. Dropout applies this idea by adding noise between layers.

By definition, dropout adds unbiased noise: noise is added to the original input $\pmb{x}$ to obtain $\pmb{x}^{\prime}$, and we want the mean to remain unchanged, that is:
$E[\pmb{x}^{\prime}]=\pmb{x}$
Concretely, dropout perturbs each element as follows:
$x^{\prime}_i=\begin{cases} 0 & \text{with probability } p\\ \frac{x_i}{1-p} & \text{otherwise} \end{cases}$
This keeps the expectation unchanged:
$E[x^{\prime}_i]=p\times 0 + (1-p)\times \frac{x_i}{1-p}=x_i$
The drop probability $p$ is a hyperparameter that controls the complexity of the model.

Dropout is usually applied to the outputs of the hidden layers of a multi-layer perceptron.

Dropout is used only during training, where it affects how the model parameters are updated. It is not applied during testing, which keeps the output deterministic. Experimentally, it achieves an effect similar to regularization.

When dropout is applied to the output of a hidden layer, the weights of the neurons that are set to 0 in a given step are not updated in that step. Each dropout pass can therefore be viewed as selecting a subset of the hidden neurons and updating only that subset.

The specific implementation can directly call the nn.Dropout() layer.
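
As a rough illustration of the perturbation rule above, the following sketch implements dropout on a plain Python list instead of a tensor. The helper name `dropout_layer`, the `training` flag and the `rng` argument are assumptions made for this example; `nn.Dropout` remains the layer to use in a real model.

```python
import random


def dropout_layer(xs, p, training=True, rng=random):
    """Zero each element with probability p and rescale survivors by 1 / (1 - p)."""
    assert 0.0 <= p <= 1.0
    if not training or p == 0.0:
        return list(xs)            # test time: dropout is the identity
    if p == 1.0:
        return [0.0 for _ in xs]   # every element is dropped
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in xs]


rng = random.Random(0)
xs = [1.0] * 100_000
out = dropout_layer(xs, p=0.5, rng=rng)
mean_out = sum(out) / len(out)     # should stay close to the input mean of 1.0
print(round(mean_out, 2))
```

Each surviving element is scaled up to 2.0 here, so the average over many elements stays near the original value, which is the unbiasedness property derived above.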

Numerical stability

When calculating gradients in a deep network, the derivative of a vector with respect to a vector is a matrix, so the gradient for an early layer is a product of many matrices. Such long chains of matrix multiplications can run into exploding or vanishing gradients.

Suppose most of the entries in these matrices are slightly larger than 1. After that many multiplications the gradient may become extremely large and explode. If instead the entries are slightly smaller than 1, the gradient will be close to 0 after that many multiplications.
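
A quick numerical check makes the scale of this effect visible. The sketch below treats each layer's contribution as a single scalar factor, which is a simplification of the matrix product. The factors 1.5 and 0.8 and the depth of 100 are example values chosen here, not taken from the notes.

```python
def repeated_product(factor, depth):
    """Multiply `depth` identical per-layer gradient factors together."""
    result = 1.0
    for _ in range(depth):
        result *= factor
    return result


exploding = repeated_product(1.5, 100)   # factor above 1
vanishing = repeated_product(0.8, 100)   # factor below 1
print(f"{exploding:.2e}  {vanishing:.2e}")
```

Even factors this close to 1 end up many orders of magnitude away from 1 after 100 layers.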

Exploding gradients cause the following problems:

- The value exceeds the range that the numeric type can represent.
- Training becomes more sensitive to the learning rate:
  - When the learning rate is relatively large, multiplying it by a large gradient gives a large update, which makes training hard to stabilize.
  - When the learning rate is too small, the weights whose gradients did not explode can hardly be updated.

Vanishing gradients arise, for example, when the sigmoid function is used as the activation. Its derivative $\sigma(x)(1-\sigma(x))$ is at most 0.25 and is close to 0 when the input is far from 0.

Multiplying many such small gradients together can make the gradient vanish. The main problems are:

- The value falls below the representable range (underflow), so most gradient values become 0.
- Training makes no progress, because a gradient of 0 leaves the parameters unchanged.
- The problem is especially serious for the bottom layers. The gradient is computed by backpropagation from the output layer, so the closer a layer is to the bottom, the more factors are multiplied together and the more likely the gradient is to vanish. In effect, only the top layers can be trained properly.
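
The size of the sigmoid gradient can be checked directly. This is a small sketch in plain Python; the sample inputs and the depth of 10 layers are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(x, round(sigmoid_grad(x), 4))

# Upper bound on a product of 10 sigmoid derivatives (each is at most 0.25)
bound_10_layers = sigmoid_grad(0.0) ** 10
print(bound_10_layers)
```

The derivative peaks at 0.25 for an input of 0 and shrinks quickly away from it, so even in the best case ten stacked sigmoid layers scale the gradient by less than one millionth.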

The primary goal in making training more stable is to keep the gradient values within a reasonable range. For example, some methods convert gradient multiplication into addition, or normalize or clip the gradients. Another important approach is to initialize the weights properly and to choose a suitable activation function.

Specifically, the conclusion is that the weights should be initialized by sampling from a distribution with mean 0 and variance $\gamma_t=\frac{2}{n_{t-1}+n_t}$, where $n_{t-1}$ and $n_{t}$ are the numbers of neurons in the two layers connected by this weight matrix. The variance of the weight distribution must therefore be chosen based on the shape of the layer.
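
The variance rule can be sketched without any deep learning library. The helper `xavier_normal` below is a hypothetical name for this example and uses a normal distribution, which is one possible choice of distribution with the required mean and variance. The layer sizes 784 and 256 are borrowed from the MLP earlier in these notes.

```python
import random


def xavier_normal(n_in, n_out, rng):
    """Return an n_in x n_out weight matrix with mean 0 and variance 2 / (n_in + n_out)."""
    std = (2.0 / (n_in + n_out)) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)
n_in, n_out = 784, 256
W = xavier_normal(n_in, n_out, rng)
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
target_var = 2.0 / (n_in + n_out)
print(var, target_var)
```

In PyTorch the same rule is available as `nn.init.xavier_normal_`, which could replace the fixed `std=0.01` used in `init_weights` above.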

Analysing the activation functions in the same way shows that tanh(x) and ReLU(x) have good properties in this respect, while sigmoid(x) needs to be rescaled to $4\times sigmoid(x)-2$ to achieve the same effect as the other two.
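
One way to see why this rescaling helps is that $4\times sigmoid(x)-2$ behaves like the identity near 0, just as tanh does. The sketch below compares the two numerically; the function name `scaled_sigmoid` and the sample points are chosen here for illustration.

```python
import math


def scaled_sigmoid(x):
    """4 * sigmoid(x) - 2, the rescaled sigmoid mentioned above."""
    return 4.0 / (1.0 + math.exp(-x)) - 2.0


for x in (-0.1, -0.01, 0.0, 0.01, 0.1):
    print(x, scaled_sigmoid(x), math.tanh(x))
```

For small inputs both columns stay very close to x itself, whereas the plain sigmoid outputs values near 0.5 with a slope of only 0.25.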

Environment and distribution shift

1. Types of distribution shifts

There are mainly the following types of shift:

Covariate shift: the distribution of the inputs $p(x)$ changes. For example, if the training distribution $p_1(x)$ differs from the test distribution $p_{2}(x)$, it is hard for the model to perform well on the test data. The assumption behind covariate shift is that although the input distribution may change over time, the labeling function (the conditional distribution $P(y\mid x)$) does not. For example, we train a classifier on photos of real cats and dogs but test it on cartoon cats and dogs. The training and test data sets differ, but the labeling function is the same and still labels cats and dogs correctly.
Label shift: the converse of covariate shift. Here the label marginal $P(y)$ is assumed to change, while the class-conditional distribution $P(x\mid y)$ stays fixed across domains. An example is predicting a patient's disease, where the symptoms are x and the disease is the label y. The relative prevalence of diseases ($P(y)$) may change, but the symptoms caused by a specific disease ($P(x\mid y)$) do not.
Concept shift: the definition of the label itself changes. For example, our definition of beauty may change over time, so the concept behind the label "beauty" changes as well.

2. Correcting distribution shift

First we need to distinguish empirical risk from true risk. During training we usually minimize the loss function (ignoring the regularization term), that is:
$\min_{f} \frac{1}{Num}\sum_{i=1}^{Num} loss(f(x_i),y_i)$
This average loss on the training data set is called the empirical risk. The empirical risk approximates the true risk, which is the expected loss under the true distribution of the data. In practice we cannot obtain the true distribution, so we minimize the empirical risk in the hope that this approximately minimizes the true risk.

Covariate shift correction

For the existing data set (x,y), we want to estimate $P(y\mid x)$, but the observations $x_i$ are drawn from some source distribution $q(x)$ (the distribution of the training data set) rather than from the target distribution $p(x)$ (the true data distribution, or the distribution of the test data). Covariate shift assumes that $p(y\mid x)=q(y\mid x)$. Therefore:
$\iint loss(f(x),y)p(y\mid x)p(x)dxdy~=~ \iint loss(f(x),y)q(y\mid x)q(x)\frac{p(x)}{q(x)}dxdy$
So we need the ratio between the target density and the source density, which is used to reweight each sample:
$\beta_i=\frac{p(x_i)}{q(x_i)}$
Substituting this weight for each data sample, the model can be trained with weighted empirical risk minimization:
$\min_f \frac{1}{Num}\sum_{i=1}^{Num}\beta_i loss(f(x_i),y_i)$
The remaining problem is estimating $\beta$. The approach is to draw samples from both distributions. Samples from the target distribution $p(x)$ come from the test data set, and samples from the source distribution $q(x)$ come directly from the training data set. Does accessing the test data set lead to data leakage? It does not, because only the features $x\sim p(x)$ are accessed, never the labels y. With these samples there is a very effective way to compute $\beta$: logistic regression.

Assume we draw the same number of samples from each of the two distributions. Samples from p are labeled z=1 and samples from q are labeled z=-1. The label probability in this mixed data set is then:
$P(z=1\mid x)=\frac{p(x)}{p(x)+q(x)}$
and therefore
$\frac{P(z=1\mid x)}{P(z=-1\mid x)}=\frac{p(x)}{q(x)}$
If we use logistic regression, that is $P(z=1\mid x)=\frac{1}{1+exp(-h(x))}$ (where h is a parameterized function), then:
$\beta_i = \frac{P(z=1\mid x_i)}{P(z=-1 \mid x_i)}=exp(h(x_i))$
Therefore, once $h(x)$ has been trained, $\beta_i$ is available.
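
The procedure can be tried on a toy problem where the true ratio is known. The sketch below is a minimal example under stated assumptions: the source is a standard normal, the target is a normal with mean 1, h is linear, and labels are encoded as 0/1 instead of -1/+1. For these two Gaussians the true ratio is exp(x - 0.5), so the fit should end up near w = 1 and b = -0.5. The function `fit_logistic` and all constants are hypothetical.

```python
import math
import random


def fit_logistic(xs, zs, lr=0.5, steps=300):
    """Fit h(x) = w * x + b by full-batch gradient descent on the logistic loss.

    zs holds labels in {0, 1}: 1 for target samples, 0 for source samples.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, z in zip(xs, zs):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - z) * x
            gb += p - z
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


rng = random.Random(0)
n = 1000
source = [rng.gauss(0.0, 1.0) for _ in range(n)]   # training features ~ q(x)
target = [rng.gauss(1.0, 1.0) for _ in range(n)]   # test features ~ p(x), no labels needed

w, b = fit_logistic(source + target, [0] * n + [1] * n)
beta = [math.exp(w * x + b) for x in source]       # beta_i = exp(h(x_i))
print(round(w, 2), round(b, 2))
```

The resulting `beta` values would then multiply the per-sample losses of the training set in the weighted empirical risk above.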

But the above algorithm relies on an important assumption: every data sample in the target distribution (test set distribution) must have a non-zero probability of appearing during training. Otherwise we would have $p(x_i)>0,q(x_i)=0$, and the corresponding weight $\beta_i$ would be infinite.

Label shift correction

Similarly, it is assumed here that the distribution of labels changes over time, $q(y)\neq p(y)$, but the class-conditional distribution remains unchanged, $q(x\mid y)=p(x\mid y)$. Then:
$\iint loss(f(x),y)p(x\mid y)p(y)dxdy=\iint loss(f(x),y)q(x\mid y)q(y)\frac{p(y)}{q(y)}dxdy$
Here the importance weights are the analogous label ratios:
$\beta_i=\frac{p(y_i)}{q(y_i)}$
To estimate the target label distribution, we first take an off-the-shelf classifier with reasonably good performance (usually trained on the training data) and compute its confusion matrix on the validation set. The confusion matrix $C$ is $k\times k$ (k is the number of classes). Each entry $c_{ij}$ is the proportion of validation samples with true label j that the model predicts as i.

However, we cannot compute the confusion matrix on the target data, because we do not know the true labels there. What we can do is average the predictions of the existing model at test time to obtain the mean model output $\mu(\hat{y})\in R^k$, whose i-th element is the proportion of test-set predictions assigned to class i.

Specifically, if the classifier is reasonably accurate to begin with, the target data contain only classes we have seen before (the training set and the test set share the same classes), and the label shift assumption holds, then the label distribution of the test set can be estimated by solving a simple linear system:
$Cp(y)=\mu(\hat{y})$
If the classifier is accurate enough, C is invertible, and the solution is:
$p(y)=C^{-1}\mu(\hat{y})$
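
The linear system can be illustrated on a made-up 3-class problem. In the sketch below, the confusion matrix, the target label distribution and the uniform training label distribution are all hypothetical numbers, and C is assumed to be normalized per column so that `C[i][j]` estimates the probability of predicting i given true label j. The small solver `solve_linear` is written out only to keep the example dependency-free; a library routine such as `numpy.linalg.solve` would do the same job.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


# Hypothetical confusion matrix: C[i][j] = P(predict i | true label j)
C = [[0.90, 0.10, 0.05],
     [0.05, 0.80, 0.05],
     [0.05, 0.10, 0.90]]
p_true = [0.2, 0.3, 0.5]     # target label distribution (unknown in practice)
mu = [sum(C[i][j] * p_true[j] for j in range(3)) for i in range(3)]  # mean prediction
p_est = solve_linear(C, mu)  # p(y) = C^{-1} mu
q = [1 / 3, 1 / 3, 1 / 3]    # assumed training label distribution
beta = [p / qy for p, qy in zip(p_est, q)]   # importance weight per class
print([round(p, 3) for p in p_est])
```

Each training sample with label y_i would then receive the weight `beta[y_i]` in the weighted empirical risk.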

Concept shift correction

Concept shift is difficult to correct with any exact method. However, such changes are usually rare or very slow. What we can generally do is keep training the existing network on new data so that it adapts to the change.

Hands-on Kaggle competition: predicting house prices

import numpy as np
import pandas as pd
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l
import hashlib
import os
import tarfile
import zipfile
import requests

DATA_HUB = dict()
DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'


def download(name, cache_dir=os.path.join("dataset", "data_kaggle")):  # @save
    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}"
    url, shal_hash = DATA_HUB[name]
    os.makedirs(cache_dir, exist_ok=True)  # create the directory; exist_ok=True means no error if it already exists
    fname = os.path.join(cache_dir, url.split('/')[-1])
    if os.path.exists(fname):  # the data set has already been downloaded
        shal = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)  # read the file in chunks of at most 1048576 bytes (1 MiB)
                if not data:  # an empty read means the end of the file
                    break
                shal.update(data)
        if shal.hexdigest() == shal_hash:
            return fname  # cache hit
    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    # stream=True defers downloading the body until it is read; verify=True checks the TLS certificate
    with open(fname, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1048576):  # write chunk by chunk to limit memory use
            f.write(chunk)
    return fname


# Download and extract a zip or tar file
def download_extract(name, folder=None):  # @save
    fname = download(name)
    base_dir = os.path.dirname(fname)  # directory that contains the downloaded file
    data_dir, ext = os.path.splitext(fname)  # split into the path without the extension, and the extension
    if ext == '.zip':  # zip archive
        fp = zipfile.ZipFile(fname, 'r')
    elif ext in ('.tar', '.gz'):
        fp = tarfile.open(fname, 'r')
    else:
        assert False, "Only zip/tar files can be extracted"
    fp.extractall(base_dir)  # extract all files in the archive into base_dir
    return os.path.join(base_dir, folder) if folder else data_dir


def download_all():  # @save
    for name in DATA_HUB:
        download(name)


# Download and cache the house price data sets
DATA_HUB['kaggle_house_train'] = (  # @save
    DATA_URL + 'kaggle_house_pred_train.csv',
    '585e9cc93e70b39160e7921475f9bcd7d31219ce'
)

DATA_HUB['kaggle_house_test'] = (  # @save
    DATA_URL + 'kaggle_house_pred_test.csv',
    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90'
)

train_data = pd.read_csv(download('kaggle_house_train'))
test_data = pd.read_csv(download('kaggle_house_test'))

# print(train_data.shape)
# print(test_data.shape)
# print(train_data.iloc[0:4,[0,1,2,3,-3,-2,-1]])

# Drop the Id column (and, for the training data, the final price column), then stack the training and test features vertically
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

# Standardize the numeric features: subtract the mean and divide by the standard deviation
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index  # in pandas, dtype object denotes strings
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std())  # apply an anonymous function to each column
)
# After standardization every feature has mean 0, so missing values can be set to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)

# One-hot encode the categorical features
# dummy_na=True also encodes NaN as its own category; dtype=float keeps every column numeric for torch.tensor
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)

# print(all_features.shape)

n_train = train_data.shape[0]  # number of training examples
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)  # training features
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)  # test features
train_labels = torch.tensor(train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)  # price column of the training data

loss = nn.MSELoss()
in_features = train_features.shape[1]  # number of features


# Network architecture
def get_net():
    net = nn.Sequential(nn.Linear(in_features, 1))
    return net


# Take logarithms so that the error is measured relative to the magnitude of the price
def log_rmes(net, features, labels):
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    # arguments: the tensor to clamp, the minimum, the maximum; predictions below 1 are set to 1
    rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()


# Training function
def train(net, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate,
          weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)  # build the data iterator
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate, weight_decay=weight_decay)
    # Adam is another optimizer that is less sensitive to the learning rate; weight_decay is the L2 regularization coefficient
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()  # reset the gradients first
            l = loss(net(X), y)  # compute the loss
            l.backward()  # backpropagate to compute the gradients
            optimizer.step()  # update the parameters
        train_ls.append(log_rmes(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmes(net, test_features, test_labels))
    return train_ls, test_ls


# K-fold cross-validation: return the training and validation data for fold i
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)  # slice object for the j-th fold
        X_part, y_part = X[idx, :], y[idx]  # index with the slice object
        if j == i:  # fold i is the validation set
            X_valid, y_valid = X_part, y_part
        elif X_train is None:  # first training fold: initialize the training set
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)  # append the other folds to the training set
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid


# Run k-fold cross-validation: train k times, once per fold
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate, weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls], xlabel="epoch",
                     ylabel='log rmse', xlim=[1, num_epochs], legend=["train", 'valid'], yscale='log')
        print(f"fold {i + 1}, train log rmse {float(train_ls[-1]):f}, "
              f"valid log rmse {float(valid_ls[-1]):f}")
    return train_l_sum / k, valid_l_sum / k


k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)
print(f"{k}-fold validation: average train log rmse: {float(train_l):f}",
      f"average valid log rmse: {float(valid_l):f}")
plt.show()

The following are the results of my own tuning:

def get_net():
    net = nn.Sequential(nn.Linear(in_features, 256),
                        nn.ReLU(),
                        nn.Linear(256,1))
    return net
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
5-fold validation: average train log rmse: 0.045112 average valid log rmse: 0.157140

I always felt that it was not good to go directly from 256 to 1, so I adjusted the structure of the model:

def get_net():
    net = nn.Sequential(nn.Linear(in_features, 128),
                        nn.ReLU(),
                        nn.Linear(128,1))
    return net
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 0.03, 1, 64
5-fold validation: average train log rmse: 0.109637 average valid log rmse: 0.136201

With more complex models, I did not manage to reduce the error any further.