Entropy
Information entropy formula
H(X) = -\sum_{x} P(x)\log P(x)
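As a quick numerical check, the formula can be evaluated directly over a probability vector (a minimal sketch; the helper name `entropy` and the base-2 convention are my choices, not from the notes):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x P(x) log P(x).

    Zero-probability terms contribute 0 by the convention
    lim_{p->0} p log p = 0, so we drop them before taking logs.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # avoid log(0)
    return float(-np.sum(p * np.log(p)) / np.log(base))

# A fair coin carries exactly 1 bit of uncertainty.
print(entropy([0.5, 0.5]))  # 1.0
```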
Conditional entropy
H(X \mid Y) = -\sum_{x,y} P(x,y)\log P(x \mid y)
Joint entropy
H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y)
Mutual information
I(X;Y) = H(X) - H(X \mid Y) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{P(x)P(y)}
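All three quantities above can be computed from a single joint-distribution table; the 2x2 joint distribution below is a hypothetical example (independent uniform variables), chosen so the mutual information is exactly zero:

```python
import numpy as np

# Hypothetical joint distribution P(x, y); X and Y are independent here.
P = np.array([[0.25, 0.25],
              [0.25, 0.25]])

def H(p):
    """Entropy (in bits) of any probability array; zeros are ignored."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_XY = H(P)                       # joint entropy H(X, Y)
H_X  = H(P.sum(axis=1))           # marginal entropy H(X)
H_Y  = H(P.sum(axis=0))           # marginal entropy H(Y)
H_X_given_Y = H_XY - H_Y          # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X - H_X_given_Y          # mutual information I(X;Y)

print(I_XY)  # 0.0 — independent variables share no information
```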
Cross entropy
H(p,q) = -\sum_{x} P(x)\log Q(x)
Relative entropy (KL divergence)
D_{KL}(p \parallel q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)} = H(p,q) - H(p)
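The identity D_KL(p‖q) = H(p,q) − H(p) can be checked numerically; the two distributions below are made up for illustration:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x P(x) log Q(x), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x P(x) log(P(x) / Q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
H_p = float(-np.sum(p * np.log(p)))

# D_KL(p||q) = H(p, q) - H(p), and KL is always non-negative.
assert abs(kl_divergence(p, q) - (cross_entropy(p, q) - H_p)) < 1e-12
```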
Optimization
Gradient descent algorithm
The influence of batch size in gradient descent
- Large batch: closer to the global optimum and easy to parallelize, but training is slow when there are many training samples
- Small batch: fast training with a slight drop in accuracy, but prone to local optima and oscillation during training
Comparison of SGD and GD: SGD uses the data more efficiently, especially when the samples are redundant; its early iterations make noticeably fast progress; and when the dataset is large, SGD has a clear advantage in computational cost per update.
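The trade-off above can be sketched on a small least-squares problem; every name and hyperparameter here (learning rate, epoch count, batch sizes) is illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

def grad(w, Xb, yb):
    """Gradient of 0.5 * mean squared error on one batch."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.1, epochs=50):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            w -= lr * grad(w, X[b], y[b])
    return w

w_full = train(batch_size=len(y))  # full-batch GD: one stable step per epoch
w_mini = train(batch_size=8)       # mini-batch SGD: many noisy steps per epoch

# Both approach w_true; SGD makes far more updates per pass over the data.
```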
## Newton Method
To solve f(x) = 0:
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
So to find an extremum of f(x), i.e. to solve f'(x) = 0:
x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
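Both uses of the iteration (root finding on f, extremum seeking on f') fit in one small helper; the function name `newton_1d` and the test functions are my own:

```python
def newton_1d(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root finding: solve x^2 - 2 = 0, i.e. compute sqrt(2).
root = newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)

# Minimization: apply the same iteration to f'(x) = 0.
# Minimize f(x) = (x - 3)^2 by solving f'(x) = 2(x - 3) = 0.
xmin = newton_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)

print(root, xmin)  # ≈ 1.41421356, 3.0
```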
For the multivariate case:
x_{n+1} = x_n - (\nabla^2 f(x_n))^{-1}\nabla f(x_n) = x_n - H^{-1}g
where H is the Hessian matrix and g is the gradient.
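A single multivariate Newton step can be sketched as below; note that in practice one solves the linear system H d = g rather than forming H^{-1}. The quadratic test function (matrix A, vector b) is a hypothetical example:

```python
import numpy as np

def newton_step(x, grad_f, hess_f):
    """One Newton update: x_{n+1} = x_n - H^{-1} g,
    implemented by solving H d = g instead of inverting H."""
    g = grad_f(x)
    H = hess_f(x)
    return x - np.linalg.solve(H, g)

# Quadratic f(x) = 0.5 x^T A x - b^T x, whose minimum is at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b       # gradient g
hess_f = lambda x: A               # constant Hessian H

x = newton_step(np.zeros(2), grad_f, hess_f)
# For a quadratic, one Newton step lands exactly on the minimum A^{-1} b.
```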
Problems
- The Hessian may not be invertible
- Inverting the Hessian is computationally expensive
- May not converge to an optimum (convergence is not even guaranteed)
## Quasi-Newton Method
Instead of computing the inverse of the Hessian matrix, approximate it by other means, e.g. from successive gradient differences as in BFGS and L-BFGS.
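One such scheme is BFGS, which maintains an approximation B of the inverse Hessian built only from gradient differences. A compact sketch (not production code; the step-size handling is simplified to backtracking, and the test problem is a made-up quadratic):

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=100):
    """Minimal BFGS sketch: update an inverse-Hessian approximation B
    from gradient differences, never computing the true Hessian."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    B = np.eye(n)                       # initial inverse-Hessian guess
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -B @ g                      # quasi-Newton search direction
        t = 1.0
        # Backtracking (Armijo) line search for a stable step length.
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        s = t * d
        x_new = x + s
        g_new = grad_f(x_new)
        yv = g_new - g
        if s @ yv > 1e-12:              # curvature condition: safe to update B
            rho = 1.0 / (s @ yv)
            V = np.eye(n) - rho * np.outer(s, yv)
            B = V @ B @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Hypothetical quadratic test problem: minimum at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = bfgs(lambda x: 0.5 * x @ A @ x - b @ x,
              lambda x: A @ x - b,
              x0=np.zeros(2))
```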