A fun project: learning to write a deep learning framework by yourself (1) - backpropagation

Foreword

I have always wanted to implement a deep learning framework. For me this is a very interesting, and of course challenging, thing to do. Since my stock of knowledge in this area is far from enough, I have spent most of my time recently collecting material. In the process of learning I found that there is not much material available online, so I want to take notes while implementing it and record the whole process in text and video.

I recently watched a video by George Hotz; most of the code below is a reproduction of his live coding. The video runs nearly 4 hours, and I studied it for close to a week. Of course I didn't spend all of that time on the video itself; at a ratio of at least 1:5, roughly 3 hours of video took me about 15 hours to work through. Following someone else's coding can be a head-scratching exercise, but I really did learn a lot in the process.

This article covers a lot of content, and each topic has a certain depth, so it will be updated and supplemented later.

The basic idea

The idea is to take PyTorch as the teacher: first build a small example with PyTorch as the reference, and then imitate the same network based on numpy. In the process you get to learn a lot of the underlying knowledge and implementation techniques.

Import dependencies

Since this framework is meant to imitate the PyTorch API, PyTorch itself is imported as well:

%pylab inline
import numpy as np
from tqdm import trange
np.set_printoptions(suppress=True)
import torch
import torch.nn as nn
import torch.nn.functional as F
# torch.set_printoptions()
# np.set_printoptions(suppress=True)

The dataset used here is MNIST, the handwritten-digit dataset. If you have studied or tried deep learning you are probably already familiar with it: it is the "hello world" dataset of deep learning. I won't go into detail about it here; there is plenty of information about it online.

def fetch(url):
  import requests, gzip, os, hashlib, numpy
  # cache the download locally under a file named by the md5 hash of the URL
  fp = os.path.join("D:\\workspaces\\aNet\\tmp", hashlib.md5(url.encode('utf-8')).hexdigest())
  if os.path.isfile(fp):
    with open(fp, "rb") as f:
      dat = f.read()
  else:
    with open(fp, "wb") as f:
      dat = requests.get(url).content
      f.write(dat)
  # decompress the gzip file and return the raw bytes as a numpy array
  return numpy.frombuffer(gzip.decompress(dat), dtype=np.uint8).copy()
X_train = fetch("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")[0x10:].reshape((-1, 28, 28))
Y_train = fetch("http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")[8:]
X_test = fetch("http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")[0x10:].reshape((-1, 28, 28))
Y_test = fetch("http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")[8:]

Building the network with PyTorch

This network is just an example: a two-layer fully connected network. The first layer compresses the 784-dimensional input vector down to 128 dimensions, and the result then goes through an activation function for a nonlinear transformation; the activation chosen here is ReLU. The second layer compresses the 128 dimensions down to the number of classes, i.e. 10 dimensions. This is the feature-extraction stage of the network. Next comes the prediction logic: applying softmax over the output dimensions gives the probability that the sample belongs to each class, and the class with the largest probability is the recognition result given by the model.

torch.set_printoptions(sci_mode=False)
class ANet(torch.nn.Module):
    def __init__(self):
        super(ANet,self).__init__()
        #(m,784) -> (m,128)
        self.l1 = nn.Linear(784,128,bias=False)
        self.l2 = nn.Linear(128,10,bias=False)
        self.sm = nn.LogSoftmax(dim=1)

    def forward(self,x):
        x = F.relu(self.l1(x))
        x = self.l2(x)
        x = self.sm(x)
        return x

The network structure above is fairly simple.
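As a quick sanity check, here is a small sketch (my own addition, not from the original post; net is just a throwaway instance) that prints the shapes of the two weight matrices. Note that nn.Linear stores its weight as (out_features, in_features), which is why the weights get transposed later when they are copied into numpy arrays.

# inspect the parameter shapes of the two linear layers (no biases):
# 128*784 + 10*128 = 101,632 weights in total
net = ANet()
for name, p in net.named_parameters():
    print(name, tuple(p.shape))
# l1.weight (128, 784)
# l2.weight (10, 128)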

LogSoftmax and Softmax

You are probably familiar with the Softmax function, while LogSoftmax may seem a little unfamiliar, and you may wonder why LogSoftmax is used here instead of Softmax.

Softmax converts an input vector of real numbers into a probability distribution: after the transformation each component lies between 0 and 1, and all components sum to 1.

As the name "soft max" suggests, this activation is not winner-take-all, i.e. not a vector with one component equal to 1 and all other components equal to 0; instead every component gets some probability.

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Looking at the formula, each component $z_i$ is exponentiated, and the result is divided by the sum of the exponentials of all components as a normalization. In deep learning, softmax is usually used as an activation function: a neuron applies a weighted sum plus a bias to its input, i.e. a linear transformation, and a nonlinear transformation is then applied to the result. However, because of the exponentiation, the result can become a very large number:

$$e^{19} = 178482300.96318725 \\ e^{20} = 485165195.4097903$$

The result of the exponentiation can be so large that it exceeds the range the computer can handle, so the output may end up as nan. In addition, the softmax formula above divides by a very large number, which can be numerically unstable. This is the main reason for using LogSoftmax instead of Softmax.

$$\log\left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right) = z_i - \log\sum_j e^{z_j}$$

We use log-probabilities instead of probabilities; a log-probability is simply the logarithm of a probability. Using log-probabilities means representing probabilities on a logarithmic scale rather than on the standard unit interval. For independent events the probabilities are multiplied, and multiplying probabilities can produce very small numbers; the logarithm turns multiplication into addition, so the log-probabilities of independent events can simply be added.
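To make the overflow problem concrete, here is a small numpy sketch (a toy example of my own, not part of the framework): a naive softmax on large inputs produces nan, while the log-softmax computed with the maximum subtracted first stays finite.

# naive softmax on large logits: exp() overflows to inf and inf/inf gives nan
z = np.array([1000.0, 1001.0, 1002.0])
naive = np.exp(z) / np.exp(z).sum()
print(naive)        # [nan nan nan], with overflow warnings

# log-softmax with the max subtracted first stays finite
stable = (z - z.max()) - np.log(np.exp(z - z.max()).sum())
print(stable)       # [-2.4076... -1.4076... -0.4076...]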

input = torch.randn(2,3)
input
tensor([[-2.4280, 0.6736, -0.3681], [ 0.7437, 0.6434, -0.6621]])
softmax_fn = nn.Softmax(dim=1)
output = softmax_fn(input)
output
tensor([[0.0322, 0.7154, 0.2524], [0.4652, 0.4208, 0.1140]])
logsoftmax_fn = nn.LogSoftmax(dim=1)
output = logsoftmax_fn(input)
output
tensor([[-3.4365, -0.3349, -1.3766], [-0.7654, -0.8656, -2.1712]])

Start training

  • The optimizer used during training is SGD
  • The batch size is 128
model = ANet()
epochs = 1000
batch_size = 128
# define the loss function: NLLLoss, which together with LogSoftmax gives the cross-entropy loss

loss_fn = nn.NLLLoss(reduction='none')
# define the optimizer
optim = torch.optim.SGD(model.parameters(),lr=0.001,momentum=0)

losses,accs = [],[]

for i in (t:=trange(epochs)):
    # randomly sample a mini-batch from the training set at each step
    samp = np.random.randint(0,X_train.shape[0],size=(batch_size))
    X = torch.tensor(X_train[samp].reshape((-1,28*28))).float()
    Y = torch.tensor(Y_train[samp]).long()
    # zero the gradients
    model.zero_grad()

    # model output
    out = model(X)
    # compute the accuracy
    pred = torch.argmax(out,dim=1)
    acc = (pred == Y).float().mean()

    # compute the loss
    loss = loss_fn(out,Y)
    loss = loss.mean()

    # compute the gradients
    loss.backward()
    # update the parameters
    optim.step()

    loss, acc = loss.item(),acc.item()
    losses.append(loss)
    accs.append(acc)
    t.set_description(f"loss:{loss:0.2f}, acc: {acc:0.2f}")
# figsize(6,6)
plt.ylim(-0.1,1.1)
plot(losses)
plot(accs)
    loss:0.27, acc: 0.97: 100%|███████████████████████████████████████████████████████| 1000/1000 [00:04<00:00, 237.69it/s]

output_3_2.png

step_by_step.jpeg

From the training run the results look quite good: the loss gradually converges while the accuracy keeps climbing.

Evaluation

Y_test_preds = torch.argmax(model(torch.tensor(X_test.reshape(-1,28*28)).float()),dim=1).numpy()
(Y_test == Y_test_preds).mean()
0.9313
l1 = np.zeros((784,128),dtype=np.float32)
l2 = np.zeros((128,10),dtype=np.float32)
l1[:] = model.l1.weight.detach().numpy().transpose()
l2[:] = model.l2.weight.detach().numpy().transpose()

Here the parameters of the model trained with PyTorch are taken as the initial values of the network, and a forward pass is implemented with numpy. The model is then evaluated on the test dataset.

def forward(x):
    x = x.dot(l1)
    x = np.maximum(x,0)
    x = x.dot(l2)
    return x
Y_test_preds_out = forward(X_test.reshape((-1,28*28)))
Y_test_preds = np.argmax(Y_test_preds_out,axis=1)
(Y_test == Y_test_preds).mean()
0.9313
figsize(6,6)
imshow(X_test[1])
output_7_1.png
samp= list(range(32))
model.zero_grad()
out = model(torch.tensor(X_test[samp].reshape((-1,28*28))).float())
out.retain_grad()
loss = loss_fn(out,torch.tensor(Y_test[samp]).long()).mean()
loss = loss.mean()
loss.retain_grad()
loss.backward()
figsize(16,16)
imshow(model.l1.weight.grad)
figure()
imshow(model.l2.weight.grad)
loss.grad,out.grad

compare.png

There is quite a lot to explain here. After training the model we want to inspect the gradients of the weights, which means accessing the grad of the tensors involved. For intermediate (non-leaf) nodes, PyTorch by default frees the gradient once it has been used during the backward pass. If you want to keep the grad attribute, that is, to inspect the gradient of an intermediate tensor, you need to call retain_grad() on it. The gradients of the torch weights l1 and l2 are displayed below; next, taking PyTorch's l1 and l2 weights as the reference, we try to use numpy to compute the gradients at each stage.


    (tensor(1.),
     tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.0312,
               0.0000,  0.0000],
             [ 0.0000,  0.0000, -0.0312,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
               0.0000,  0.0000],
             [ 0.0000, -0.0312,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
               0.0000,  0.0000],
             [-0.0312,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
               0.0000,  0.0000],
               ...
             ]))

Gradient map of l1

output_8_1.png

Gradient map of l2

output_8_2.png
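Since retain_grad() may be unfamiliar, here is a minimal standalone sketch (toy tensors of my own, unrelated to the MNIST code) showing the default behavior of non-leaf tensors:

a = torch.tensor([1.0, 2.0], requires_grad=True)   # leaf tensor: its grad is always kept
b = a * 3                                          # non-leaf (intermediate) tensor
b.retain_grad()                                    # without this call, b.grad would stay None
c = b.sum()
c.backward()
print(a.grad)   # tensor([3., 3.])
print(b.grad)   # tensor([1., 1.]), only available because retain_grad() was called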

We need to pause here for some analysis. x_l2 = x_relu.dot(l2) is the output of the output layer; wrapping it as gin = torch.tensor(x_l2, requires_grad=True) lets us later check the gradient at this point against autograd (see the commented-out code below).

$$x_{l2} = W_{l2}^T x_{relu}$$
$$\frac{\partial L}{\partial W_{l2}} = \frac{\partial L}{\partial x_{l2}} \frac{\partial x_{l2}}{\partial W_{l2}}$$
# forward pass
x = X_test[1:2].reshape((-1,28*28))
x_l1 = x.dot(l1)
x_relu = np.maximum(x_l1,0)
# shapes: x_relu (1,128), l2 (128,10)
x_l2 = x_relu.dot(l2)
# check the shape of each layer's output tensor
print(x_l1.shape,x_relu.shape,x_l2.shape)
(1, 128) (1, 128) (1, 10)
$$x_{l1} = W_{l1}^T x \\ x_{relu} = relu(x_{l1}) \\ x_{l2} = W_{l2}^T x_{relu}$$
  • First compute d_l2, i.e. $\partial L/\partial W_{l2} = (\partial L/\partial x_{l2})(\partial x_{l2}/\partial W_{l2})$
  • Next compute dx_relu, i.e. $\partial L/\partial x_{relu} = (\partial L/\partial x_{l2})(\partial x_{l2}/\partial x_{relu})$
  • dx_l1 = (x_relu > 0).astype(np.float32) * dx_relu: by the chain rule, (x_relu > 0).astype(np.float32) is the derivative of ReLU
  • Finally, d_l1 = x.T.dot(dx_l1), i.e. $\partial L/\partial W_{l1}$
# out = torch.tensor(out)
# gin = torch.tensor(x_l2,requires_grad=True)
# gout = torch.nn.functional.log_softmax(gin,dim=1)
# gout.retain_grad()
# loss = (-out*gout).mean()
# loss.backward()
# dx_sm = gin.grad.numpy()
x_l2.max(axis=1).reshape((-1,1)) + np.log(np.exp(x_l2 - x_l2.max(axis=1).reshape((-1,1))).sum(axis=1))
        array([[14.839776, 14.827087],
               [24.500969, 24.488281]], dtype=float32)
# the adjustment made here in logsumexp deals with the numerical overflow caused by exponentiation
def logsumexp(x):
    # subtract the per-row maximum before exponentiating (the log-sum-exp trick)
    c = x.max(axis=1)
    return c + np.log(np.exp(x - c.reshape((-1,1))).sum(axis=1))
    

Let's analyze the shapes in the logsumexp implementation. The input x is a (batch_size, 10) tensor; taking the max along axis=1 finds the maximum of each sample, so c has shape (batch_size,), and c.reshape((-1,1)) turns it into (batch_size, 1) so that it broadcasts correctly against x.
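As a quick sanity check of the shapes and values, here is a sketch that compares our logsumexp with scipy.special.logsumexp (scipy is used only for comparison and is not part of the framework):

from scipy.special import logsumexp as scipy_logsumexp

x = np.random.randn(4, 10).astype(np.float32)   # shaped like x_l2: (batch_size, 10)
print(logsumexp(x).shape)                        # (4,), one value per sample
print(np.allclose(logsumexp(x), scipy_logsumexp(x, axis=1), atol=1e-5))  # expected: True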

The NLLLoss loss function

x_loss = (-out * x_lsm), where x_lsm is the LogSoftmax output and out is the one-hot encoding of the labels.

$$\ell(x, y) = L = (l_1, \dots, l_N)^T, \qquad l_n = -w_{y_n}\, x_{n, y_n}$$

loss = (-out*gout).mean()

$$\mathrm{LogSoftmax}(x_i) = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right)$$
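Before walking through the full numpy forward/backward pass, here is a quick check (a sketch, using torch only for comparison) that picking out the log-probability of the true label is the same as the one-hot multiplication used in forward_backward below:

y = torch.tensor([3, 1])                             # hypothetical labels
log_probs = F.log_softmax(torch.randn(2, 10), dim=1)

one_hot = torch.zeros(2, 10)
one_hot[torch.arange(2), y] = 1

nll_manual = (-(one_hot * log_probs).sum(dim=1)).mean()
nll_torch = F.nll_loss(log_probs, y)                 # default reduction='mean'
print(torch.allclose(nll_manual, nll_torch))         # expected: True

Note that x_loss in forward_backward below uses .mean(axis=1) over the 10 classes, which only rescales the reported per-sample loss by a factor of 10; the gradient code still mirrors PyTorch's mean-reduced NLLLoss.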
def forward_backward(x,y):
    # one-hot encode the labels
    out = np.zeros((len(y),10),np.float32)
    out[range(out.shape[0]),y]= 1

    # forward pass
    x_l1 = x.dot(l1)
    x_relu = np.maximum(x_l1,0)
    x_l2 = x_relu.dot(l2)
    x_lsm = x_l2 - logsumexp(x_l2).reshape(-1,1)
    x_loss = (-out * x_lsm).mean(axis=1)

    # backward pass; the tricky part here is the gradient of LogSoftmax
    d_out = -out/len(y)

    dx_lsm = d_out - np.exp(x_lsm)*d_out.sum(axis=1).reshape(-1,1)
    d_l2 = x_relu.T.dot(dx_lsm)
    dx_relu = dx_lsm.dot(l2.T)

    dx_l1 = (x_relu > 0).astype(np.float32)* dx_relu
    d_l1 = x.T.dot(dx_l1)
    return x_loss, x_l2,d_l1,d_l2
samp = [0,1,2,3]
x_loss, x_l2,d_l1,d_l2 = forward_backward(X_test[samp].reshape((-1,28*28)),Y_test[samp])

imshow(d_l1.T)
figure()
imshow(d_l2.T)

Implementing NLLLoss

For this part you can refer to the official PyTorch documentation.

x_loss = (-out * x_lsm).mean(axis=1)
$$L(\hat{y}, y) = -y\, x_{y_n}$$

$\partial L / \partial S_i$

This is the difficult part; the other material can easily be found online, but to understand this part thoroughly you also need some familiarity with the Jacobian matrix, i.e. matrix differentiation.

$$y_i = f(x_i) \\ y_i = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right) \\ y_i = x_i - \log\sum_j \exp(x_j)$$

The input is x and the output, i.e. the prediction, is y; both x and y are vectors, and f here denotes the LogSoftmax function.

The Jacobian matrix

$$\frac{\partial y_i}{\partial x_i} = 1 - \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

For the case where i and k are different:

$$\frac{\partial y_i}{\partial x_k} = -\frac{\exp(x_k)}{\sum_j \exp(x_j)}$$

The matrix below is denoted $Jf$:

$$\begin{bmatrix} 1-E(x_1) & -E(x_2) & -E(x_3) & \cdots \\ -E(x_1) & 1-E(x_2) & -E(x_3) & \cdots \\ -E(x_1) & -E(x_2) & 1-E(x_3) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

where $E(x_i) = \exp(x_i) / \sum_j \exp(x_j)$.
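This Jacobian structure can be checked numerically with torch autograd; here is a small sketch of my own with a toy 3-dimensional input, just to verify the matrix above:

x = torch.randn(3)
J = torch.autograd.functional.jacobian(lambda t: F.log_softmax(t, dim=0), x)
E = torch.softmax(x, dim=0)              # E(x_i) = exp(x_i) / sum_j exp(x_j)
manual = torch.eye(3) - E                # row i: dy_i/dx_k = delta_ik - E(x_k)
print(torch.allclose(J, manual, atol=1e-6))   # expected: True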

By the chain rule,

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j}\, \frac{\partial y_j}{\partial x_i}$$

which is equivalent to

$$\frac{\partial L}{\partial x} = Jf^{\,T}\, \frac{\partial L}{\partial y}$$

Note that $\exp(y_j) = E(x_j)$, i.e. exponentiating the LogSoftmax output recovers the softmax probabilities, which also sum to 1:

$$\sum_j \exp(y_j) = \sum_j \exp\!\left(\log\!\left(\exp(x_j)\Big/\sum_k \exp(x_k)\right)\right) = \sum_j \exp(x_j)\Big/\sum_k \exp(x_k) = 1$$

Substituting the Jacobian into the chain rule therefore gives

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} - \exp(y_i)\sum_j \frac{\partial L}{\partial y_j}$$
dx_lsm = d_out - np.exp(x_lsm)*d_out.sum(axis=1).reshape(-1,1)
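To verify this formula (and the dx_lsm line above, which is exactly the gradient with respect to x_l2), here is a small sketch that compares it against torch autograd on toy tensors:

z = torch.randn(4, 10, requires_grad=True)          # hypothetical logits, like x_l2
y_ls = F.log_softmax(z, dim=1)
dL_dy = torch.randn(4, 10)                          # an arbitrary upstream gradient dL/dy
y_ls.backward(dL_dy)

# dL/dx_i = dL/dy_i - exp(y_i) * sum_j dL/dy_j
manual = dL_dy - torch.exp(y_ls.detach()) * dL_dy.sum(dim=1, keepdim=True)
print(torch.allclose(z.grad, manual, atol=1e-6))    # expected: True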

Computing $\partial L/\partial out$

d_out = -out/len(y) is the derivative of the mean: the scalar loss (as in the PyTorch reference) is the mean over the N samples of $-x_{lsm}[n, y_n]$, so its derivative with respect to $x_{lsm}[n, c]$ is $-1/N$ when $c = y_n$ and 0 otherwise, which is exactly -out/len(y).

The ReLU activation function and its derivative

$$ReLU(x) = \max(0, x)$$

The derivative of ReLU is as follows:

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$
$$x_{relu} = ReLU(W_{l1}^T x)$$
$$\frac{\partial L}{\partial x_{l1}} = \frac{\partial L}{\partial x_{relu}}\, \frac{\partial x_{relu}}{\partial x_{l1}}$$
dx_l1 = (x_relu > 0).astype(np.float32)* dx_relu

output_16_1.png

output_16_2.png

Comparing with the gradient maps of the weights l1 and l2 obtained from PyTorch, it is easy to see that the gradient maps of l1 and l2 produced by the numpy backpropagation are exactly the same. With this we have taken a first step and are one step closer to the goal.
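To go beyond comparing pictures, here is a sketch of my own that checks the agreement numerically. It assumes both sides are run on the same mini-batch (here the first 32 test samples) with the same trained weights; since the PyTorch weights are the transposes of the numpy l1 and l2, the results are compared against grad.T:

samp = list(range(32))

# PyTorch side
model.zero_grad()
out = model(torch.tensor(X_test[samp].reshape((-1, 28*28))).float())
loss = loss_fn(out, torch.tensor(Y_test[samp]).long()).mean()
loss.backward()

# numpy side
_, _, d_l1, d_l2 = forward_backward(X_test[samp].reshape((-1, 28*28)), Y_test[samp])

print(np.allclose(d_l1, model.l1.weight.grad.numpy().T, rtol=1e-3, atol=1e-4))  # expected: True
print(np.allclose(d_l2, model.l2.weight.grad.numpy().T, rtol=1e-3, atol=1e-4))  # expected: True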



Origin juejin.im/post/7122024588380209166