Hands-on deep learning: overfitting and underfitting

People often say that the power of deep learning lies in its ability to fit almost any curve, as long as the underlying relationship can be expressed by a neural network. However, such a network needs sufficient data for training, which leads to the concepts of overfitting and underfitting. When the network is very large but the data is scarce, the network can memorize the characteristics of each individual sample, which leads to overfitting. Conversely, when the network is small or its fitting capacity is still weak but there is a lot of data, underfitting arises.

Overfitting and underfitting

  1. Training error and generalization error
    The training error is the error the model exhibits on the training data set;
    the generalization error is the expected error of the model on any test data sample, and it is usually approximated by the error on the test data set.
  2. Model selection
    During training, in order to choose better hyperparameters we need a validation data set, which is usually split off from the training data set. The split is commonly done with K-fold cross-validation (see the sketch after this list); in papers K = 10 is typical, and the final result is the average over the K folds.
  3. Overfitting and underfitting
    Viewed in terms of training error and generalization error: when the model cannot achieve a low training error, we call the phenomenon underfitting; when the training error is much smaller than the model's error on the test data set, we call the phenomenon overfitting. Both phenomena are usually related to the complexity of the model and the size of the training data set.
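
To make item 2 concrete, the sketch below shows one common way to produce a training/validation split for K-fold cross-validation. It is not part of the original experiment; the function name get_k_fold_data and the example tensors are illustrative.

    import torch

    def get_k_fold_data(k, i, X, y):
        # Split the examples into k folds of equal size; fold i becomes the
        # validation set, the remaining k-1 folds form the training set.
        assert k > 1
        fold_size = X.shape[0] // k
        X_train, y_train = None, None
        for j in range(k):
            idx = slice(j * fold_size, (j + 1) * fold_size)
            X_part, y_part = X[idx, :], y[idx]
            if j == i:
                X_valid, y_valid = X_part, y_part   # the held-out validation fold
            elif X_train is None:
                X_train, y_train = X_part, y_part
            else:
                X_train = torch.cat((X_train, X_part), dim=0)
                y_train = torch.cat((y_train, y_part), dim=0)
        return X_train, y_train, X_valid, y_valid

    # Example: 10-fold split of a 200-example data set (the tensors are illustrative)
    X, y = torch.randn(200, 3), torch.randn(200)
    X_train, y_train, X_valid, y_valid = get_k_fold_data(10, 0, X, y)

Training on each of the K splits in turn and averaging the K validation errors gives the cross-validated estimate used for model selection.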

Polynomial fitting experiment

Taking an n-th order polynomial model as an example (the larger n is, the higher the model complexity), the relationship between model complexity and error for a given training data set is shown below:

[Figure: training error and generalization error as a function of model complexity]

It is worth mentioning that in deep learning the training data is rarely plentiful enough while the model is usually expressive enough, so overfitting is the more common case.

%matplotlib inline
import torch
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)

# Generate the data set: true model parameters, features, and labels
n_train, n_test, true_w, true_b = 100, 100, [1.2, -3.4, 5.6], 5
features = torch.randn((n_train + n_test, 1))
poly_features = torch.cat((features, torch.pow(features, 2), torch.pow(features, 3)), 1) 
labels = (true_w[0] * poly_features[:, 0] + true_w[1] * poly_features[:, 1]
          + true_w[2] * poly_features[:, 2] + true_b)
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

# Define the plotting helper, loss, and training routine
def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(3.5, 2.5)):
    # d2l.set_figsize(figsize)
    d2l.plt.xlabel(x_label)
    d2l.plt.ylabel(y_label)
    d2l.plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        d2l.plt.semilogy(x2_vals, y2_vals, linestyle=':')
        d2l.plt.legend(legend)

num_epochs, loss = 100, torch.nn.MSELoss()

def fit_and_plot(train_features, test_features, train_labels, test_labels):
    # Initialize the network: a single linear layer
    net = torch.nn.Linear(train_features.shape[-1], 1)
    # According to the Linear documentation, PyTorch already initializes the parameters,
    # so we do not initialize them manually here
    
    # Set the batch size
    batch_size = min(10, train_labels.shape[0])    
    dataset = torch.utils.data.TensorDataset(train_features, train_labels)      # build the dataset
    train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True) # build the data iterator
    
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)                      # optimizer: stochastic gradient descent
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:                                                 # take a mini-batch of data
            l = loss(net(X), y.view(-1, 1))                                     # forward pass and compute the loss against the labels
            optimizer.zero_grad()                                               # zero the gradients so they do not accumulate
            l.backward()                                                        # backpropagate to compute the gradients
            optimizer.step()                                                    # take an optimizer step to update the parameters
        train_labels = train_labels.view(-1, 1)
        test_labels = test_labels.view(-1, 1)
        train_ls.append(loss(net(train_features), train_labels).item())         # record the training loss in train_ls
        test_ls.append(loss(net(test_features), test_labels).item())            # record the test loss in test_ls
    print('final epoch: train loss', train_ls[-1], 'test loss', test_ls[-1])    
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('weight:', net.weight.data,
          '\nbias:', net.bias.data)

# Run the experiments
fit_and_plot(poly_features[:n_train, :], poly_features[n_train:, :], labels[:n_train], labels[n_train:]) # normal fit
fit_and_plot(features[:n_train, :], features[n_train:, :], labels[:n_train], labels[n_train:]) # underfitting
fit_and_plot(poly_features[0:2, :], poly_features[n_train:, :], labels[0:2], labels[n_train:]) # overfitting
  • Third-order polynomial fit (normal)

Although the network itself is a linear model, the input features of each sample are its polynomial terms, so the linear combination of these features with the learned parameters yields a polynomial model.

  • Linear fit (underfitting)
  • Training set is too small (over-fitting)

Preventing overfitting

  1. L2 regularization (also called weight decay)
    Adding an L2 regularization term prevents individual parameters from becoming extremely large and thus helps prevent overfitting. To the loss being minimized we add an L2 penalty:
    \ell\left(w_{1}, w_{2}, b\right)+\frac{\lambda}{2 n}\|\boldsymbol{w}\|^{2}
optimizer_w = torch.optim.SGD(params=[net.weight], lr=lr, weight_decay=wd) # apply weight decay to the weight parameters
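
Equivalently, the penalty can be added to the data loss by hand instead of passing weight_decay to the optimizer. A minimal sketch, assuming a net, loss, and training loop like the ones above (the helper l2_penalty and the value of lambd are illustrative, not from the original post):

    def l2_penalty(w):
        # L2 regularization term: ||w||^2 / 2
        return (w ** 2).sum() / 2

    lambd = 3  # regularization strength (illustrative value)
    # inside the training loop, add the penalty to the data loss before backward():
    #   l = loss(net(X), y.view(-1, 1)) + lambd * l2_penalty(net.weight)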
  2. Dropout
    Dropout deactivates units with a certain probability (i.e., sets the corresponding values to 0) to avoid over-reliance on particular neurons during training. The derivation below shows that dropout does not change the expected value of its input:
    h_{i}^{\prime}=\frac{\xi_{i}}{1-p} h_{i}, \qquad E\left(h_{i}^{\prime}\right)=\frac{E\left(\xi_{i}\right)}{1-p} h_{i}=h_{i}
def dropout(X, drop_prob):
    X = X.float()
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # In this case all elements are dropped
    if keep_prob == 0:
        return torch.zeros_like(X)
    # keep each element with probability keep_prob
    mask = (torch.rand(X.shape) < keep_prob).float()
    print(mask)  # print the mask for inspection
    
    # scale the kept elements by 1/keep_prob so the expectation is unchanged
    return mask * X / keep_prob
# Usage: a network with two hidden layers that applies dropout during training
def net(X, is_training=True):
    X = X.view(-1, num_inputs)
    H1 = (torch.matmul(X, W1) + b1).relu()
    if is_training:  # only apply dropout when training the model
        H1 = dropout(H1, drop_prob1)  # add a dropout layer after the first fully connected layer
    H2 = (torch.matmul(H1, W2) + b2).relu()
    if is_training:
        H2 = dropout(H2, drop_prob2)  # add a dropout layer after the second fully connected layer
    return torch.matmul(H2, W3) + b3

# PyTorch built-in implementation
nn.Dropout(drop_prob1)
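
When using the built-in layer, note that nn.Dropout is only active in training mode and acts as the identity in evaluation mode. A minimal sketch of a network analogous to the manual version above (the layer sizes and probabilities are illustrative):

    import torch
    import torch.nn as nn

    drop_prob1, drop_prob2 = 0.2, 0.5
    net = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(), nn.Dropout(drop_prob1),  # dropout after the first hidden layer
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(drop_prob2),  # dropout after the second hidden layer
        nn.Linear(256, 10),
    )

    net.train()              # training mode: dropout randomly zeroes activations
    out = net(torch.randn(2, 784))
    net.eval()               # evaluation mode: dropout layers do nothing
    out = net(torch.randn(2, 784))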

A few closing words

Some questions to think about:

  1. How can you tell that a model is overfitting? What methods can prevent overfitting?
  2. What are the respective principles behind L2 regularization and dropout for preventing overfitting? How do you implement each of them in PyTorch?