Machine Learning && Deep Learning - Preliminary Knowledge (Part 2)

4 Calculus

4.1 Derivatives and differentials

Know the concepts and properties

4.2 Partial derivatives

Know the concepts and properties

4.3 Gradient

Concatenating the partial derivatives of a multivariate function with respect to all of its variables yields the function's gradient vector. For $f:\mathbb{R}^n \to \mathbb{R}$, the gradient is
$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top$$
A few frequently used rules (for $\mathbf{x} \in \mathbb{R}^n$ and a matrix $\mathbf{A}$ of compatible shape): $\nabla_{\mathbf{x}} \mathbf{A}\mathbf{x} = \mathbf{A}^\top$, $\nabla_{\mathbf{x}} \mathbf{x}^\top\mathbf{A} = \mathbf{A}$, $\nabla_{\mathbf{x}} \mathbf{x}^\top\mathbf{A}\mathbf{x} = (\mathbf{A}+\mathbf{A}^\top)\mathbf{x}$, and $\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top\mathbf{x} = 2\mathbf{x}$.
The derivations are straightforward.
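
A quick sanity check (an illustrative sketch added here, not part of the original post): the quadratic-form rule above can be verified numerically with PyTorch autograd, which section 5 introduces.

import torch

# Verify grad_x (x^T A x) = (A + A^T) x on random data
A = torch.randn(4, 4)
x = torch.randn(4, requires_grad=True)
y = x @ A @ x  # the quadratic form x^T A x (a scalar)
y.backward()
print(torch.allclose(x.grad, (A + A.T) @ x.detach()))  # expected: True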

4.4 The chain rule

Finding gradients directly this way can be difficult, because the multivariate functions in deep learning are usually composite. The chain rule lets us differentiate composite functions. For $y = f(u)$ with $u = g(x)$:
$$\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}$$
In the multivariate case, if $y$ depends on $u_1, \ldots, u_m$ and each $u_j$ depends on $x_i$, then:
$$\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m}\frac{\partial y}{\partial u_j}\frac{\partial u_j}{\partial x_i}$$
You will have seen this in calculus already.
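
A quick worked example (illustrative, not from the original post): let $y = u^2$ with $u = 3x + 1$. Then
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 = 6(3x + 1)$$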

5 Automatic Differentiation

Taking derivatives by hand is simple enough, but for complex models, working out the updates manually is painful and error-prone.
Deep learning frameworks speed this up by computing derivatives automatically, a technique known as automatic differentiation. In practice, based on the designed model, the system builds a computational graph that tracks which data and operations are combined to produce each output. Automatic differentiation then enables the system to backpropagate gradients: backpropagation traverses the entire computational graph in reverse, filling in the partial derivative with respect to each parameter.

5.1 Simple example

Take the derivative of the function $y = 2\mathbf{x}^\top\mathbf{x}$ with respect to the column vector $\mathbf{x}$:

import torch

x = torch.arange(4.0)
# Don't allocate new memory every time we take a derivative w.r.t. a parameter:
# calling requires_grad_ allocates memory for the tensor's gradient
x.requires_grad_(True)
# The gradient can be accessed via x.grad (it is None until backward has been called)

# Compute y
y = 2 * torch.dot(x, x)

# Next, call the backpropagation function to automatically compute the gradient of y
# with respect to each component of x, and print these gradients.
y.backward()
print(x.grad)

# The output above is tensor([ 0.,  4.,  8., 12.]), so the gradient w.r.t. x is 4x
# We can verify this:
print(x.grad == 4 * x)

# Now compute the gradient of the sum of x
x.grad.zero_()  # PyTorch accumulates gradients by default, so clear the previous values
y = x.sum()
y.backward()
print(x.grad)

Result:

tensor([ 0.,  4.,  8., 12.])
tensor([True, True, True, True])
tensor([1., 1., 1., 1.])
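
To see why the gradient is $4\mathbf{x}$ (a one-line derivation added for clarity): $y = 2\mathbf{x}^\top\mathbf{x} = 2\sum_i x_i^2$, so $\partial y / \partial x_i = 4x_i$, i.e. $\nabla_{\mathbf{x}} y = 4\mathbf{x}$. With $\mathbf{x} = (0, 1, 2, 3)$ this gives $(0, 4, 8, 12)$, matching the first output.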

5.2 Backpropagation for non-scalar variables

When y is not a scalar, the most natural interpretation of the derivative of a vector y with respect to a vector x is a matrix (the Jacobian); for higher-order, higher-dimensional y and x, the result of differentiation is a high-order tensor.
In deep learning, however, when we call backward on a vector we are usually computing the derivative of the loss for each sample in a batch of training examples. The goal is not the full differentiation matrix, but the sum of the partial derivatives computed individually for each sample in the batch.

import torch

x = torch.arange(4.0)
x.requires_grad_(True)
# Calling backward on a non-scalar requires passing in a gradient argument,
# which specifies the gradient of the differentiated function w.r.t. self
# Here we only want the sum of the partial derivatives, so passing a gradient of ones is appropriate
y = x * x
# Equivalent to y.backward(torch.ones(len(x)))
y.sum().backward()
print(x.grad)

Result:
tensor([0., 2., 4., 6.])
If this result is confusing, recall the dot product and matrix/vector products: y = x * x is an elementwise product, so y.sum() $= \sum_i x_i^2$, whose gradient with respect to x is 2x. The derivation is easy from there.
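
As a sketch of the equivalence mentioned in the comment above (same setup as the previous block), the gradient vector can also be passed to backward explicitly:

import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
# A vector of ones weights every component equally, matching y.sum().backward()
y.backward(gradient=torch.ones(len(x)))
print(x.grad)  # tensor([0., 2., 4., 6.])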

5.3 Detaching computation

Here it helps to understand the computational graph used when computing partial derivatives. It is similar to the composite functions of calculus, except that the computational graph is drawn like a tree.
Sometimes we want to move certain computations outside of the recorded computational graph. For example: y is a function of x, and z is a function of y and x. If we want the gradient of z with respect to x but, for some reason, wish to treat y as a constant, we only consider the role x plays after y has been computed.
Just look at the example:

import torch

x = torch.arange(4.0)
x.requires_grad_(True)
# Here we can detach y to return a new variable u that has the same value as y,
# but discards any information about how y was computed in the graph
# In other words, the gradient will not flow backwards through u to x
# The computation below takes the partial of z = u * x w.r.t. x, treating u as a constant,
# instead of the partial of z = x * x * x w.r.t. x
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
print(x.grad == u)

# Since the computation of y was recorded, we can still call backpropagation on y
# to get the derivative of y = x * x with respect to x
x.grad.zero_()  # don't forget this step
y.sum().backward()
print(x.grad == 2 * x)

result:

tensor([True, True, True, True])
tensor([True, True, True, True])

5.4 Computing the gradient of Python control flow

One benefit of automatic differentiation is that even when building a function's computational graph requires passing through Python control flow (conditionals, loops, arbitrary function calls), we can still compute the gradient of the resulting variable:

import torch

def f(a):
    b = a * 2
    while a.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


# Compute the gradient
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

# From the definition of f above, f(a) is piecewise linear in a, i.e. f(a) = k * a
# for some scalar k, so we can verify the gradient with d / a
print(a.grad == d / a)

Result:
tensor(True)

6 Probability

6.1 Basic probability theory

The most common example: throwing dice.
The law of large numbers: as the number of trials increases, the estimated probability of an event (number of occurrences / total number of trials) gets closer and closer to the true underlying probability.
Sampling: drawing samples from a probability distribution (a distribution can be viewed as an assignment of probabilities to events).
To verify the law of large numbers, it helps to first understand the Multinomial distribution class; see: Multinomial distribution multinomial.Multinomial().sample() analysis in PyTorch.

import torch
from torch.distributions import multinomial
from d2l import torch as d2l

# To draw one sample of a die roll, simply pass in a probability vector; the output
# is another vector of the same length:
# the value at index i is the number of times i appeared in the sampling result
fair_probs = torch.ones([6])/6  # tensor([0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667])
# Draw one sample
print(multinomial.Multinomial(1, fair_probs).sample())
# Draw ten samples
print(multinomial.Multinomial(10, fair_probs).sample())
# Draw 1000 samples and compute the relative frequencies as an estimate of the true probabilities:
counts = multinomial.Multinomial(1000, fair_probs).sample()
print(counts / 1000)

Result:

tensor([1., 0., 0., 0., 0., 0.])
tensor([2., 1., 2., 2., 1., 2.])
tensor([0.1670, 0.1540, 0.1780, 0.1600, 0.1700, 0.1710])

Note that the entries of each of the first two outputs sum to the number of draws; the last output illustrates the law of large numbers.
We can also watch these probabilities converge to the true probabilities over time. Let's run an experiment of 500 groups, each drawing 10 samples:

import torch
from torch.distributions import multinomial
from d2l import torch as d2l
fair_probs = torch.ones([6])/6
counts = multinomial.Multinomial(10, fair_probs).sample((500,))  # sample's shape argument sets the number of draws; the default is a single draw
# print(counts)
cum_counts = counts.cumsum(dim=0)  # cumulative sums over rows, i.e. sum[i][j] = a[1][j] + a[2][j] + ... + a[i][j], to make the law of large numbers easy to see
# print(cum_counts)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)  # divide each cumulative row by that row's total to get probability estimates
d2l.set_figsize((6, 4.5))
for i in range(6):
    d2l.plt.plot(estimates[:, i].numpy(),
                 label=("P(die=" + str(i + 1) + ")"))
d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Groups of experiments')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend()
d2l.plt.show()

Result:
[Figure: estimated probability of each of the six die values versus the number of experiment groups, with a dashed line at the true probability 1/6]
Each curve corresponds to one of the six die values and gives the estimated probability of that value after each group of experiments. The more data, the closer the curves converge to the true probability.

6.1.1 Axioms of probability theory

Probability can be thought of as a function mapping a set of events to a real value. The probability P(A) of an event A satisfies the following axioms:
1. For any event A, $P(A) \ge 0$;
2. The probability of the entire sample space is 1, i.e. $P(\Omega) = 1$;
3. For any countable sequence of mutually exclusive events $A_1, A_2, \ldots$, the probability of their union equals the sum of their probabilities: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.

6.2 Dealing with Multiple Random Variables

As an example, an image contains millions of pixels and therefore millions of random variables; in addition, all of its metadata, such as position, time, aperture, and camera type, can also be treated as random variables.

6.2.1 Joint probability

P(A=a, B=b) denotes the probability that A=a and B=b hold simultaneously. Note that $P(A=a, B=b) \le P(A=a)$.
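
For example (illustrative): for two independent rolls of a fair die, $P(A=1, B=1) = \frac{1}{36} \le \frac{1}{6} = P(A=1)$.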

6.2.2 Conditional probability

From this inequality of the joint probability, we can derive:
$$0 \le \frac{P(A=a, B=b)}{P(A=a)} \le 1$$
We call this ratio the conditional probability, denoted P(B=b|A=a): the probability of B=b given that A=a has occurred.
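
A quick worked example (illustrative, not from the original post): roll a fair die, and let A be the event "the result is even" and B the event "the result is 6". Then P(A, B) = 1/6 (only the outcome 6 satisfies both), P(A) = 1/2, and so P(B|A) = (1/6)/(1/2) = 1/3.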

6.2.3 Bayes' Theorem

A very important theorem. According to the multiplication rule, we have:
$$P(A, B) = P(B|A)P(A)$$
By symmetry, we also have:
$$P(A, B) = P(A|B)P(B)$$
Assuming P(B) > 0, it follows that:
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
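
As a numerical sanity check (an illustrative sketch, not from the original post; the joint table is made up), Bayes' theorem can be verified on an explicit joint distribution. The marginals are obtained by the summation rule of the next subsection:

import torch

# Illustrative joint distribution P(A, B): rows index the values of A,
# columns index the values of B (entries are made up but sum to 1)
P_AB = torch.tensor([[0.10, 0.30],
                     [0.20, 0.40]])
P_A = P_AB.sum(dim=1)              # marginalize out B to get P(A)
P_B = P_AB.sum(dim=0)              # marginalize out A to get P(B)
P_B_given_A = P_AB / P_A[:, None]  # P(B|A) = P(A, B) / P(A)
P_A_given_B = P_AB / P_B[None, :]  # P(A|B) = P(A, B) / P(B)
# Bayes' theorem: P(A|B) should equal P(B|A) * P(A) / P(B)
print(torch.allclose(P_A_given_B, P_B_given_A * P_A[:, None] / P_B[None, :]))  # True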

6.2.4 Marginalization

According to the sum rule, the probability of B amounts to considering all possible choices of A and summing the joint probabilities over them:
$$P(B) = \sum_{A} P(A, B)$$
This is also known as marginalization; the resulting probability is the marginal probability, and the resulting distribution is the marginal distribution.
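
For instance (illustrative): for two independent flips of a fair coin, P(second flip = heads) = P(HH) + P(TH) = 1/4 + 1/4 = 1/2.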

6.2.5 Independence

When events A and B are independent:
$$P(A|B) = \frac{P(A, B)}{P(B)} = P(A)$$
equivalently, P(A, B) = P(A)P(B). Similarly, given a random variable C, A and B are conditionally independent if and only if:
$$P(A, B|C) = P(A|C)P(B|C)$$
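
A small illustrative sketch (not from the original post; the marginals are made up): a joint distribution is independent exactly when it equals the outer product of its marginals:

import torch

P_A = torch.tensor([0.3, 0.7])
P_B = torch.tensor([0.5, 0.5])
P_AB = torch.outer(P_A, P_B)  # independent by construction: P(A, B) = P(A)P(B)
# Recover the marginals from the joint table and check the factorization
print(torch.allclose(P_AB, torch.outer(P_AB.sum(dim=1), P_AB.sum(dim=0))))  # True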

6.3 Expectation and variance

Expectation
1. The expectation of a random variable X:
$$E[X] = \sum_x x\,P(X=x)$$
2. When the input of a function f(x) is a random variable drawn from the distribution P, the expected value of f(x) is:
$$E_{x \sim P}[f(x)] = \sum_x f(x)P(x)$$
Variance
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$
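
As an illustrative sketch (not from the original post), the definitions above can be evaluated directly for a fair six-sided die:

import torch

x = torch.arange(1.0, 7.0)        # possible values 1..6
p = torch.ones(6) / 6             # uniform probabilities
E = (x * p).sum()                 # E[X] = sum_x x P(X=x)
Var = ((x - E) ** 2 * p).sum()    # Var[X] = E[(X - E[X])^2]
print(E, Var)                     # tensor(3.5000) tensor(2.9167)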

7 Consult Documentation

7.1 Find all functions and classes in a module

Call the dir function, for example to query all attributes of the random number generation module:

import torch

print(dir(torch.distributions))

7.2 Find usages of specific functions and classes

import torch

help(torch.ones)

Source: blog.csdn.net/m0_52380556/article/details/131835225