Introduction to Deep Learning Theory for Beginners (1): Essential Mathematical Foundations (personal notes for quick reference)

Notes and figures are based on the book "The Mathematics of Deep Learning".

The mathematics of deep learning (ituring.com.cn)

 

Table of contents

1. Normal distribution

2. Recurrence relations

3. ∑ symbol

4. Vector

4.1 Vector inner product

4.2 Cauchy-Schwarz inequality

4.3 Coordinate representation of inner product

4.4 Generalization of vectors

5. Matrix

5.1 Identity matrix

5.2 Hadamard product

6. Derivative (single variable function)

6.1 Definition of derivatives

6.2 Derivatives of fractional functions and derivatives of Sigmoid functions

6.3 How to find the minimum value

7. Partial derivatives (multivariable functions)

7.1 Multivariable functions

7.2 Partial derivatives

7.3 How to find the minimum value 

7.4 The Lagrange multiplier method

8. Chain Rule 

8.1 Composite functions

8.2 Derivative formula of composite function of single variable

8.3 Derivative formula of multi-variable composite functions

9. Approximation formulas for functions

9.1 Approximation formulas for functions of one variable

9.2 Approximation formulas for multivariable functions

9.3 Vector representation of the approximation formula

10. Gradient descent method

10.1 Basic formula of gradient descent method for two-variable functions

The derivation process

10.2 Gradient

10.3 Basic formula of gradient descent method for three-variable function 

10.4 The Hamilton operator ∇ (nabla)

10.5 The meaning of η: the learning rate

 11. Optimization Problems and Regression Analysis


1. Normal distribution

Using random numbers that follow a normal distribution to set the initial values of the weights and biases generally gives good results.
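As a minimal sketch (the layer sizes and standard deviation are illustrative choices, not values from the book), weights and biases can be initialized with NumPy's normal random number generator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, chosen for illustration only.
n_inputs, n_hidden = 784, 128

# Draw initial weights and biases from a normal distribution
# with mean 0 and a small standard deviation.
W = rng.normal(loc=0.0, scale=0.01, size=(n_hidden, n_inputs))
b = rng.normal(loc=0.0, scale=0.01, size=n_hidden)

print(W.shape, b.shape)  # (128, 784) (128,)
```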

2. Recurrence relations

Computers are good at computing recurrence relations.

For example, let's look at the calculation of factorial. The factorial of a natural number n is the product of integers from 1 to n, represented by the symbol n!.

n! = 1×2×3×…×n

People usually calculate n! directly from the formula above, whereas computers typically compute it with the following recurrence relation.

a₁ = 1,  aₙ₊₁ = (n + 1)aₙ

The error backpropagation method exploits this kind of calculation, which computers are good at, to compute neural networks.
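A small sketch of the two viewpoints, the direct product and the recurrence a₁ = 1, aₙ₊₁ = (n + 1)aₙ (the function names are mine):

```python
def factorial_product(n):
    # Direct definition: n! = 1 * 2 * ... * n
    result = 1
    for k in range(1, n + 1):
        result *= k
    return result

def factorial_recurrence(n):
    # Recurrence relation: a_1 = 1, a_(n+1) = (n + 1) * a_n
    a = 1
    for k in range(1, n):
        a = (k + 1) * a
    return a

print(factorial_product(5), factorial_recurrence(5))  # 120 120
```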

3. ∑ symbol
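In words, ∑ abbreviates a sum: ∑ of aₖ for k = 1 to n equals a₁ + a₂ + ⋯ + aₙ. A quick check with arbitrary values:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 7.0])

# np.sum computes a_1 + a_2 + ... + a_n in one call.
print(np.sum(a), a[0] + a[1] + a[2] + a[3])  # 17.0 17.0
```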

 

4. Vector

4.1 Vector inner product

a·b = |a||b| cos θ   (θ is the angle between a and b)

4.2 Cauchy-Schwarz inequality

The Cauchy-Schwarz inequality states that −|a||b| ≤ a·b ≤ |a||b|. In particular, the inner product is largest when a and b point in the same direction and smallest when they point in opposite directions. This property is what the gradient descent method relies on (the function decreases fastest in the direction opposite to the gradient).

4.3 Coordinate representation of inner product

Two-dimensional space: for a = (a1, a2) and b = (b1, b2), a·b = a1b1 + a2b2.

Three-dimensional space: for a = (a1, a2, a3) and b = (b1, b2, b3), a·b = a1b1 + a2b2 + a3b3.
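A quick numerical check (the vectors are arbitrary) that the coordinate formula agrees with np.dot, and that the angle can be recovered from the inner product:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

# Coordinate formula: a·b = a1*b1 + a2*b2 + a3*b3
manual = a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
print(manual, np.dot(a, b))   # 11.0 11.0

# The angle between a and b, recovered from a·b = |a||b|cos(theta)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # ≈ 42.8 degrees
```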

4.4 Generalization of vectors

In the calculation process of neural networks, the vector perspective is very beneficial.

When a neural unit has multiple inputs x1, x2, …, xn, they can be organized into the following weighted input: z = w1x1 + w2x2 + ⋯ + wnxn + b, which is the inner product w·x + b of the weight vector w = (w1, w2, …, wn) and the input vector x = (x1, x2, …, xn), plus the bias b.
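A minimal sketch of a single neural unit's weighted input as an inner product (the weights, inputs, and bias are made-up values):

```python
import numpy as np

x = np.array([1.0, 0.5, -1.0])   # inputs  x1, x2, x3
w = np.array([0.2, -0.4, 0.1])   # weights w1, w2, w3
b = 0.3                          # bias

# Weighted input: z = w1*x1 + w2*x2 + w3*x3 + b = w·x + b
z = np.dot(w, x) + b
print(z)   # ≈ 0.2
```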

5. Matrix

5.1 Identity matrix

The identity matrix is a square matrix with 1s on the diagonal and 0s everywhere else; it is usually denoted by E.

For example, the 2-row-2-column and 3-row-3-column identity matrices E (called the 2nd-order and 3rd-order identity matrices) are, respectively,

E = [[1, 0], [0, 1]]   and   E = [[1, 0, 0], [0, 1, 0], [0, 0, 1]].

 

The identity matrix plays the same role as the number 1 in multiplication: its product with any matrix A satisfies the following commutative law.

AE = EA = A
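A quick check with NumPy (the matrix A is arbitrary):

```python
import numpy as np

E = np.eye(3)                      # 3rd-order identity matrix
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# AE = EA = A
print(np.allclose(A @ E, A), np.allclose(E @ A, A))  # True True
```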

5.2 Hadamard product

For two matrices A and B of the same shape, the Hadamard product A ⊙ B is their element-wise product.
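A quick illustration with two arbitrary 2×2 matrices (NumPy's * on arrays is element-wise):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Hadamard (element-wise) product A ⊙ B
print(A * B)
# [[ 5. 12.]
#  [21. 32.]]
```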

6. Derivative (single variable function)

6.1 Definition of derivatives

The derivative of f(x) is defined as the limit f′(x) = lim(Δx→0) [f(x + Δx) − f(x)] / Δx.

6.2 Derivatives of fractional functions and derivatives of Sigmoid functions

Derivative of a fractional function: {1/f(x)}′ = −f′(x) / {f(x)}².

Derivative of the sigmoid function σ(x) = 1/(1 + e^(−x)): σ′(x) = σ(x)(1 − σ(x)).
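A small sketch comparing the formula σ′(x) = σ(x)(1 − σ(x)) with a numerical derivative (the test point is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Central-difference check at x = 0.5
x, h = 0.5, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(sigmoid_prime(x), numeric)  # both ≈ 0.2350
```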

 

6.3 How to find the minimum value

When the function f(x) takes its minimum value at x = a, then f′(a) = 0. For example, f(x) = x² − 4x has f′(x) = 2x − 4, which vanishes at x = 2, where the minimum is attained. Note that this is a necessary condition, not a sufficient one: f′(a) = 0 does not by itself guarantee a minimum.

7. Partial derivatives (multivariable functions)

7.1 Multivariable functions

A function with two or more independent variables is called a multivariable function.

f(x1, x2, …, xn): a function with n independent variables x1, x2, …, xn.

The functions that appear in neural networks have thousands of variables.

7.2 Partial derivatives

The derivative with respect to one particular variable, taken while the other variables are held fixed, is called a partial derivative; for example, ∂f/∂x is the partial derivative of f with respect to x.
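A numerical illustration on f(x, y) = x² + 3xy (an example of my own): each partial derivative is an ordinary derivative taken while the other variable is held fixed.

```python
def f(x, y):
    return x**2 + 3*x*y

h = 1e-6
x0, y0 = 1.0, 2.0

# ∂f/∂x at (1, 2): differentiate in x with y held fixed -> 2x + 3y = 8
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)

# ∂f/∂y at (1, 2): differentiate in y with x held fixed -> 3x = 3
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(df_dx, df_dy)   # ≈ 8.0  ≈ 3.0
```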

7.3 How to find the minimum value

A necessary condition for the function z = f(x, y) to take a minimum value at a point is that ∂f/∂x = 0 and ∂f/∂y = 0 there.

7.4 The Lagrange multiplier method

To find the extrema of a function f under a constraint g = 0, form L = f − λg with a multiplier λ and set all partial derivatives of L (with respect to the original variables and λ) equal to 0.

8. Chain Rule 

8.1 Composite functions

Given a function y = f(u) where u itself is a function of x, u = g(x), y can be expressed as a function of x with the nested form y = f(g(x)) (u and x may themselves be multivariable). A function of this nested form, f(g(x)), is called the composite function of f(u) and g(x).

The functions used in neural networks are typical composite functions.

8.2 Derivative formula for a composite function of a single variable

For y = f(u) with u = g(x): dy/dx = (dy/du)(du/dx).
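A quick check of dy/dx = (dy/du)(du/dx) on y = u², u = 3x + 1 (an example of my own):

```python
# y = f(u) = u**2,  u = g(x) = 3x + 1,  so y = (3x + 1)**2
# Chain rule: dy/dx = dy/du * du/dx = 2u * 3 = 6*(3x + 1)

def dy_dx_chain(x):
    u = 3*x + 1
    dy_du = 2*u
    du_dx = 3
    return dy_du * du_dx

# Central-difference check at x = 2
h = 1e-6
x = 2.0
numeric = (((3*(x + h) + 1)**2) - ((3*(x - h) + 1)**2)) / (2*h)
print(dy_dx_chain(x), numeric)   # 42.0  ≈ 42.0
```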

8.3 Derivative formula of multi-variable composite functions

For z = f(u, v), where u and v are functions of x and y, the partial derivative with respect to x is: ∂z/∂x = (∂z/∂u)(∂u/∂x) + (∂z/∂v)(∂v/∂x).

The partial derivative with respect to y is: ∂z/∂y = (∂z/∂u)(∂u/∂y) + (∂z/∂v)(∂v/∂y).

For C = f(u, v, w), where u, v, and w are functions of x, the same pattern applies (and likewise for the other variables): ∂C/∂x = (∂C/∂u)(∂u/∂x) + (∂C/∂v)(∂v/∂x) + (∂C/∂w)(∂w/∂x).
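A numerical check of the two-intermediate-variable formula, using z = u² + v² with u = x + y and v = xy (an example of my own, not from the book):

```python
# z = f(u, v) = u**2 + v**2, with u = x + y and v = x*y.
# Chain rule: ∂z/∂x = ∂z/∂u * ∂u/∂x + ∂z/∂v * ∂v/∂x = 2u*1 + 2v*y

def dz_dx_chain(x, y):
    u, v = x + y, x * y
    return 2*u * 1 + 2*v * y

def z_of(x, y):
    u, v = x + y, x * y
    return u**2 + v**2

h, x, y = 1e-6, 1.0, 2.0
numeric = (z_of(x + h, y) - z_of(x - h, y)) / (2*h)
print(dz_dx_chain(x, y), numeric)   # 14.0  ≈ 14.0
```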

9. Approximation formulas for functions

The gradient descent method is a representative method for determining the parameters of a neural network. Applying it requires an approximation formula for multivariable functions.

9.1 Approximation formulas for functions of one variable

When Δx is small, f(x + Δx) ≈ f(x) + f′(x)Δx.

9.2 Approximation formulas for multivariable functions

For a two-variable function, when x and y change by small amounts Δx and Δy: f(x + Δx, y + Δy) ≈ f(x, y) + (∂f/∂x)Δx + (∂f/∂y)Δy.

To simplify the formula, let Δz denote the change of z = f(x, y) when x and y change by Δx and Δy, that is, Δz = f(x + Δx, y + Δy) − f(x, y). The simplified approximation formula is then Δz ≈ (∂z/∂x)Δx + (∂z/∂y)Δy, and the approximation formula for a three-variable function has the same form with one extra term.

9.3 Vector representation of the approximation formula

Defining ∇z = (∂z/∂x, ∂z/∂y) and Δx = (Δx, Δy), the approximation formula can be written as the inner product of these two vectors: Δz ≈ ∇z·Δx.
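A small check that Δz ≈ ∇z·Δx for the toy function z = x² + y² near (1, 2) (the function and step are arbitrary):

```python
import numpy as np

def z(x, y):
    return x**2 + y**2

x0, y0 = 1.0, 2.0
grad = np.array([2*x0, 2*y0])       # ∇z = (∂z/∂x, ∂z/∂y) = (2x, 2y)
delta = np.array([0.01, -0.02])     # Δx = (Δx, Δy), a small displacement

exact_change = z(x0 + delta[0], y0 + delta[1]) - z(x0, y0)
approx_change = np.dot(grad, delta)  # Δz ≈ ∇z · Δx

print(exact_change, approx_change)   # ≈ -0.0595  -0.06
```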

10. Gradient descent method

The gradient descent method is the most commonly used method for finding the minimum value of a function: at each iteration, the negative gradient direction is taken as the new search direction, so that the objective function being optimized decreases step by step.

10.1 Basic formula of gradient descent method for two-variable functions

The derivation process

When x changes by Δx and y changes by Δy, the change Δz of the function f(x, y) is approximately: Δz ≈ (∂f/∂x)Δx + (∂f/∂y)Δy.

The above formula can be expressed as the inner product of the two vectors a = (∂f/∂x, ∂f/∂y) and b = (Δx, Δy).

By the property of the inner product, a·b takes its minimum value when b points in the direction opposite to a, that is, when b = −ka for some positive constant k.

In other words, when the directions of a and b are exactly opposite, Δz is smallest (the function decreases fastest).

From this we obtain the basic formula of the gradient descent method for a two-variable function.

When moving from the point (x, y) to the point (x + Δx, y + Δy), choosing

(Δx, Δy) = −η(∂f/∂x, ∂f/∂y)   (η is a small positive constant)

makes the function z = f(x, y) decrease fastest.
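A minimal sketch of the update (Δx, Δy) = −η(∂f/∂x, ∂f/∂y), applied to the toy function f(x, y) = x² + y² (the function, starting point, and η are arbitrary choices):

```python
import numpy as np

def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + y**2
    return np.array([2*x, 2*y])

eta = 0.1                     # learning rate
p = np.array([3.0, -2.0])     # starting point (x, y)

for _ in range(50):
    p = p - eta * grad_f(p[0], p[1])   # (Δx, Δy) = -η ∇f

print(p)   # close to the minimum at (0, 0)
```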

10.2 Gradient

The gradient is a vector: at a given point, the directional derivative of the function attains its maximum along the gradient direction, meaning the function increases fastest in that direction, and the maximum rate of change equals the magnitude (modulus) of the gradient.

10.3 Basic formula of gradient descent method for three-variable function 

The basic formula of the gradient descent method for two-variable functions extends easily to three or more variables. When the function f has n independent variables x1, x2, …, xn, the basic formula generalizes to (Δx1, Δx2, …, Δxn) = −η(∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn).

10.4 The Hamilton operator ∇ (nabla)

The symbol ∇ collects the partial derivatives into a single vector, ∇f = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn), so the basic formula of the gradient descent method can be written compactly as (Δx1, Δx2, …, Δxn) = −η∇f.

10.5 The meaning of η: the learning rate

In the world of neural networks, η is called the learning rate. There is no definite rule for choosing it; an appropriate value has to be found by trial and error.

 11. Optimization Problems and Regression Analysis

In optimization, the total error to be minimized is called the error function, loss function, or cost function. Optimization that minimizes the sum of squared errors is called the least squares method.
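A minimal sketch of the least squares method on made-up data: fit y = a + bx by minimizing the sum of squared errors (the data values and names are illustrative; np.linalg.lstsq solves the least squares problem directly):

```python
import numpy as np

# Made-up data points (x_i, y_i), roughly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = a + b*x by minimizing the cost (sum of squared errors)
# C(a, b) = sum_i (y_i - (a + b*x_i))**2
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

cost = np.sum((y - (a + b * x))**2)
print(a, b, cost)   # slope b close to 2, small cost
```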


Original post: blog.csdn.net/weixin_45662399/article/details/132823890