Detailed Explanation of the Automatic Differentiation Mechanism

  Automatic differentiation is a technique for computing the derivatives of a function in a computer program, and its basic principle is nothing more than the ordinary rules of differentiation. Because it can obtain the first-order derivative of a function with respect to each of its independent variables without manually deriving formulas, the idea and method of automatic differentiation are used in essentially every deep learning framework. Before automatic differentiation appeared there were also methods such as numerical differentiation and symbolic differentiation. Although those methods are ingenious in their own right, they are difficult to apply widely to practical problems.

One: Basic principles

1. Forward propagation

  The principle of automatic differentiation is the ordinary rules of differentiation, and the algorithm is organized like a recursive tree. For a multivariate function f(x1, x2, ..., xn), every operation in it can be regarded as a binary operation between two independent variables, or between an independent variable and a constant, for example:

f(a,b,c,d)=2\times a \times b+c \times c\times d+d\times d

Its computation tree is a binary tree: the leaves are the variables a, b, c, d and the constant 2, and each internal node combines its two children with either + or ×.

  We find that any operation can be decomposed step by step into binary operations, and there are only two kinds: addition and multiplication. Even when there are many additions and multiplications, evaluation is still just the process of pairs of child nodes being combined and passed upward toward the root. This step-by-step evaluation from the leaf nodes to the root node is the forward pass, and it is the first step in computing the partial derivative (gradient) of the function with respect to any of its independent variables. For tensors the set of operations is richer, but the structure of the tree is the same.
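
  To make the decomposition concrete, the example function above can be evaluated as the following chain of binary operations (the intermediate names v_1, ..., v_6 are introduced here only for illustration; any equivalent ordering of the tree gives the same result):

v_1=2\times a \\ v_2=v_1\times b \\ v_3=c\times c \\ v_4=v_3\times d \\ v_5=v_2+v_4 \\ v_6=d\times d \\ f=v_5+v_6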

import numpy as np

class auto:
    def __init__(self,value):
        self.value=value #value stored at this node
    def __add__(self,other):
        '''number or matrix addition, so that + works directly'''
        return auto(value=self.value+other.value)
    def __mul__(self, other):
        '''multiplication of numbers (element-wise for arrays), so that * works directly'''
        return auto(value=self.value*other.value)
    def multiply(self,other):
        '''element-wise matrix multiplication'''
        return auto(value=np.multiply(self.value,other.value))
    def dot(self,other):
        '''matrix multiplication'''
        return auto(value=np.dot(self.value,other.value))
if __name__=='__main__':
    a=auto(1)
    b=auto(2)
    c=auto(3)
    f1=a*b+c #uses the overloaded + and *
    d=auto(np.random.randint(0,2,size=(3,2)))
    e=auto(np.random.randint(0,2,size=(2,3)))
    f2=d.dot(e)
    print('f1',f1.value)
    print('f2',f2.value)

  Before any operation, we wrap the number or matrix in an auto object, and then use the methods of auto to perform the corresponding computations. The __add__ and __mul__ here are built-in special methods that override addition and multiplication, so that '+' and '*' can be used directly. Of course, you could also define ordinary add and mul methods instead, in which case a + b would have to be written as a.add(b); the end result is the same. (If you think about it, this pattern shows up in many places: libraries such as numpy and pytorch also require data to be converted into their own array or tensor objects before computation, and the way they dispatch operations is similar.)
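
  As a minimal sketch of the alternative calling style mentioned above (the class name auto2 and the method names are just illustrative choices, not part of the code shown earlier):

class auto2:
    '''same idea as auto, but with ordinary method names instead of operator overloading'''
    def __init__(self,value):
        self.value=value
    def add(self,other):
        return auto2(value=self.value+other.value)
    def mul(self,other):
        return auto2(value=self.value*other.value)

x=auto2(1)
y=auto2(2)
z=auto2(3)
g=x.mul(y).add(z) #equivalent to x*y+z with the operator-overloading version
print(g.value)    #prints 5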

  The reason for using an auto object is that it lets the computation recurse step by step: during the forward pass each node can record its two child nodes and the operation that combined them, so that the subsequent backpropagation can walk back through the tree and compute the derivatives.

2. Backpropagation

  The idea of backpropagation is that, once the computation tree has been built by the forward pass, the derivatives are passed step by step from the root node down to the leaf nodes. Still taking f(a,b,c,d) from section 1 as an example, we first compute the partial derivative of f with respect to f itself, which is 1 (in theory this step can be omitted, but for matrix operations it cannot); the partial derivative of f with respect to (2ab+ccd) is 1; the partial derivative of f with respect to dd is 1; then the partial derivative of f with respect to 2ab is:

\frac{\partial f}{\partial f}\cdot\frac{\partial f}{\partial (2ab+ccd)}\cdot\frac{\partial (2ab+ccd)}{\partial (2ab)}=1\times 1\times 1=1

At this point the pattern is clear: to find the derivative of the function with respect to some node, we need the partial derivative of the function with respect to that node's parent, and the partial derivative of the function with respect to a node equals the partial derivative of the function with respect to the parent multiplied by the partial derivative of the parent with respect to that node. Following this chain, the partial derivative of f with respect to a is:

\frac{\partial f}{\partial a}=\frac{\partial f}{\partial f}\cdot\frac{\partial f}{\partial (2ab+ccd)}\cdot\frac{\partial (2ab+ccd)}{\partial (2ab)}\cdot\frac{\partial (2ab)}{\partial (2a)}\cdot\frac{\partial (2a)}{\partial a} \\ =1\times1\times1\times b\times 2=2b

The final result agrees with what we get by differentiating by hand.

  Of course, here a appears in only one branch. If a appears in several branches, the result is the sum of the corresponding chain products. For example, for f = 2a + 2ab the derivative with respect to a is 1×2 + 1×2×b = 2 + 2b. This accumulation needs to be handled when we implement the algorithm later.
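
  Written out with the chain rule, the two branches through which a enters f are summed:

\frac{\partial f}{\partial a}=\frac{\partial f}{\partial (2a)}\cdot\frac{\partial (2a)}{\partial a}+\frac{\partial f}{\partial (2ab)}\cdot\frac{\partial (2ab)}{\partial (ab)}\cdot\frac{\partial (ab)}{\partial a}=1\times 2+1\times 2\times b=2+2b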

  For matrix derivatives, we cannot simply multiply factors from left to right in any order, because matrix differentiation distinguishes left multiplication from right multiplication, and transposes also appear. For this part you can refer to Tutu's earlier blog posts "Detailed Explanation of Matrix Derivation (Essence, Principle and Derivation)" and "Multilayer Perceptron, Fully Connected Neural Network...Detailed Explanation". Tutu gives a commonly used matrix-derivative result here:

Y_1=W_1.A_1 \\ Y_2=W_2.Y_1 \\ \Rightarrow \\ \frac{\partial Y_2}{\partial Y_1}=W_{2}^T .\frac{\partial Y_2}{\partial Y_2} \\ \frac{\partial Y_2}{\partial W_2} =\frac{\partial Y_2}{\partial Y_2}.Y_{1}^T \\ \frac{\partial Y_2}{\partial A_1}=W_{1}^T.W_{2}^T.\frac{\partial Y_2}{\partial Y_2} \\ \frac{\partial Y_2}{\partial W_1}=W_{2}^T.\frac{\partial Y_2}{\partial Y_2}.A_{1}^T

  This rule is fairly easy to spot. For a matrix product, if the variable we are differentiating with respect to is the left factor, the other factor is transposed and multiplied on the right of the incoming gradient, and vice versa. If the matrices are treated as scalars, then since the transpose of a number is itself and multiplication of numbers is commutative, the rule degenerates to ordinary scalar differentiation. By coding this rule into the auto object, the gradient of every parameter in the BP algorithm can be computed without manually deriving formulas or implementing them case by case. Of course, for a convolutional layer we can likewise write down the partial derivative of the loss with respect to the convolution kernel, which is still a matrix derivative in essence. For activation functions we can also write down the derivative; an activation function generally takes one element and outputs one result (apart from functions such as softmax), and the rule is applied in the same way.
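
  As a quick sanity check of this transpose rule, here is a numerical sketch with finite differences; the shapes and the loss np.sum(Y2) are arbitrary choices for illustration and are not from the original derivation:

import numpy as np

# Y1 = W1 . A1, Y2 = W2 . Y1, scalar "loss" L = sum(Y2)
np.random.seed(0)
W1 = np.random.randn(4, 5)
A1 = np.random.randn(5, 2)
W2 = np.random.randn(3, 4)

def loss(W1_, A1_, W2_):
    return np.sum(W2_ @ (W1_ @ A1_))

G = np.ones((3, 2))            # dL/dY2 is a matrix of ones for L = sum(Y2)
dW1_rule = W2.T @ G @ A1.T     # the rule from the text: dL/dW1 = W2^T . (dL/dY2) . A1^T

# finite-difference check of a single entry of dL/dW1
eps = 1e-6
E = np.zeros_like(W1)
E[1, 2] = eps
dW1_fd = (loss(W1 + E, A1, W2) - loss(W1 - E, A1, W2)) / (2 * eps)
print(dW1_rule[1, 2], dW1_fd)  # the two numbers should be very close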

Two: Algorithm Implementation

  The difficulty of automatic differentiation lies in implementing the algorithm. Because it is built on a recursive tree structure, it can be hard to follow at first; the same is true of similar algorithms such as decision trees and Monte Carlo trees.

  For automatic differentiation we define an auto object that stores, for each node in the computation tree produced by the forward pass: the node's value (value), the child nodes that produced it (depend), the operation that combined those children (opt), and the gradient (grad), which starts out as None. During backpropagation, the derivative of the function with respect to each node is then computed step by step according to the chain rule.

import numpy as np

class auto:
    '''automatic differentiation'''
    def __init__(self,value,depend=None,opt=''):
        self.value=value #value of this node
        self.depend=depend #the two child nodes that produced this node
        self.opt=opt #operation that combined the two child nodes
        self.grad=None #gradient of the function with respect to this node
    def add(self,other):
        '''addition of numbers or matrices'''
        return auto(value=self.value+other.value,depend=[self,other],opt='+')
    def mul(self, other):
        '''multiplication of numbers, or of a number and a matrix'''
        return auto(value=self.value*other.value,depend=[self,other],opt='*')
    def dot(self,other):
        '''matrix multiplication'''
        return auto(value=np.dot(self.value,other.value),depend=[self,other],opt='dot')
    def sigmoid(self):
        '''sigmoid activation function'''
        return auto(value=1/(1+np.exp(-self.value)),depend=[self],opt='sigmoid')
    def backward(self,backward_grad=None):
        '''backward pass'''
        if backward_grad is None:
            if isinstance(self.value,(int,float)):
                self.grad=1
            else:
                a,b=self.value.shape
                self.grad=np.ones((a,b))
        else:
            if self.grad is None:
                self.grad=backward_grad
            else:
                self.grad+=backward_grad #accumulate when a node is reached from several branches
        if self.opt=='+':
            self.depend[0].backward(self.grad)
            self.depend[1].backward(self.grad) #for addition, pass this node's gradient straight to both children
        if self.opt=='*':
            new=self.depend[1].value*self.grad
            self.depend[0].backward(new)
            new=self.depend[0].value*self.grad
            self.depend[1].backward(new)
        if self.opt=='dot':
            new=np.dot(self.grad,self.depend[1].value.T) #left factor: gradient times the right factor transposed
            self.depend[0].backward(new)
            new=np.dot(self.depend[0].value.T,self.grad) #right factor: left factor transposed times the gradient
            self.depend[1].backward(new)
        if self.opt=='sigmoid':
            s=1/(1+np.exp(-self.depend[0].value))
            new=self.grad*s*(1-s) #sigmoid'(x)=sigmoid(x)*(1-sigmoid(x))
            self.depend[0].backward(new)

if __name__=='__main__':
    a=auto(3)
    b=auto(4)
    c=auto(5)
    f1=a.mul(b).add(c).sigmoid()
    f1.backward()
    print(a.grad,b.grad,c.grad) #derivatives of f1 with respect to a, b, c
    A=auto(np.random.randint(1,20,size=(3,4)))
    B=auto(np.random.randint(1,20,size=(4,3)))
    F=A.dot(B)
    F.backward()
    print(A.grad,B.grad)

  After building auto, we wrap numbers or matrices into auto objects, perform the computation with the methods it provides, and then call backward on the final result to obtain the gradient of that function with respect to each independent variable. Whichever node backward is called on is the one whose partial derivatives with respect to the independent variables are computed; in deep learning this is usually the loss function differentiated with respect to the parameters. If you are interested, you can add more operations to the code above to make auto richer.
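
  As one more usage sketch, the matrix rule from section One can be exercised with two chained dot products. This assumes the auto class defined above is in scope; the shapes and the implicit all-ones seed gradient are arbitrary choices for illustration:

A1 = auto(np.random.randn(5, 2))
W1 = auto(np.random.randn(4, 5))
W2 = auto(np.random.randn(3, 4))

Y1 = W1.dot(A1)      #Y1 = W1 . A1
Y2 = W2.dot(Y1)      #Y2 = W2 . Y1
Y2.backward()        #seeds dY2/dY2 with a matrix of ones, shape (3, 2)

print(W1.grad.shape) #(4, 5), same shape as W1: equals W2^T . ones . A1^T
print(A1.grad.shape) #(5, 2), same shape as A1: equals W1^T . W2^T . ones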

  Careful readers will notice that deep learning frameworks such as pytorch and tensorflow also expose a backward() method for computing gradients, used almost exactly like the one here. This shows that the same automatic differentiation idea is at work behind those frameworks, just in a far more elaborate form.
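
  For comparison, here is roughly the same scalar example written with PyTorch's autograd (assuming torch is installed; the numbers mirror the a=3, b=4, c=5 example above, and this is only meant as a side-by-side illustration):

import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)
c = torch.tensor(5.0, requires_grad=True)

f1 = torch.sigmoid(a * b + c)  #same computation as a.mul(b).add(c).sigmoid()
f1.backward()                  #populates .grad on the leaf tensors

print(a.grad, b.grad, c.grad)  #analogous to the grads computed by the auto class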

Three: Summary

  As a method for computing gradients, automatic differentiation has greatly advanced deep learning. It frees us from the tedious work of deriving formulas by hand and implementing them one by one, and makes it easy to compute gradients in convolutional, fully connected, and recurrent neural networks. Although we rarely need to write an automatic differentiation engine ourselves in practice, its recursive algorithmic idea is important and helps us better understand this whole class of algorithms.
