Machine learning: entropy and optimization methods

Entropy

Information entropy formula

H(X) = -\sum_{x} P(x) \log P(x)
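As a quick check of the formula, here is a minimal NumPy sketch that computes the entropy of a discrete distribution (log base 2, so the result is in bits; the base only changes the unit):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x P(x) log P(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))            # fair coin -> 1.0 bit
print(entropy([0.9, 0.1]))            # biased coin -> ~0.469 bits
```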

Conditional entropy

H(X \mid Y) = -\sum_{x,y} P(x,y) \log P(x \mid y)

Joint entropy

H(X, Y) = -\sum_{x,y} P(x,y) \log P(x,y)
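Conditional and joint entropy can both be read off a joint distribution table. The sketch below uses a small hypothetical P(x, y) and the chain rule H(X|Y) = H(X,Y) - H(Y), which is equivalent to the definition above:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# hypothetical joint distribution P(x, y): rows index x, columns index y
P_xy = np.array([[0.25, 0.25],
                 [0.10, 0.40]])

H_XY = entropy(P_xy)                 # joint entropy H(X, Y)
P_y = P_xy.sum(axis=0)               # marginal P(y)
H_X_given_Y = H_XY - entropy(P_y)    # chain rule: H(X|Y) = H(X,Y) - H(Y)
print(H_XY, H_X_given_Y)
```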

Mutual information

I(X; Y) = H(X) - H(X \mid Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}
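Continuing with the same kind of toy joint distribution (hypothetical numbers), the definition-based sum and the identity I(X;Y) = H(X) + H(Y) - H(X,Y) give the same value, which is a useful sanity check:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_xy = np.array([[0.25, 0.25],       # hypothetical joint distribution P(x, y)
                 [0.10, 0.40]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

# definition: I(X;Y) = sum_{x,y} P(x,y) log P(x,y) / (P(x) P(y))
outer = np.outer(P_x, P_y)
mask = P_xy > 0
I_def = np.sum(P_xy[mask] * np.log2(P_xy[mask] / outer[mask]))

# identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_id = entropy(P_x) + entropy(P_y) - entropy(P_xy)
print(I_def, I_id)                   # the two values agree
```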

Cross entropy

H(p, q) = -\sum_{x} P(x) \log Q(x)
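In classification, cross entropy is the usual loss: p is the true (often one-hot) label distribution and q is the model's predicted distribution. A minimal sketch with made-up numbers:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x P(x) log Q(x)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [1.0, 0.0, 0.0]          # hypothetical one-hot label
q = [0.7, 0.2, 0.1]          # hypothetical model prediction
print(cross_entropy(p, q))   # -log2(0.7) ≈ 0.515
```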

Relative entropy (KL divergence)

D_{KL}(p \| q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = H(p, q) - H(p)
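The identity D_KL(p‖q) = H(p,q) - H(p) is easy to verify numerically; the sketch below uses two hypothetical distributions:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl(p, q):
    """D_KL(p || q) = sum_x P(x) log (P(x) / Q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p, q = [0.6, 0.3, 0.1], [0.5, 0.25, 0.25]   # hypothetical distributions
print(kl(p, q))                              # >= 0, zero only when p == q
print(cross_entropy(p, q) - entropy(p))      # same value: H(p, q) - H(p)
```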

Optimization

Gradient descent algorithm

The influence of batch size on gradient descent

  • Large batch: moves toward the global optimum and is easy to parallelize, but training is slow when there are many training samples
  • Small batch: each update is fast, but accuracy drops slightly, the result may be a local optimum, and training oscillates

Comparison of SGD and GD: SGD makes more effective use of the data, especially when the samples are redundant; its early iterations make noticeable progress; and when the amount of data is large, SGD has a clear advantage in computational cost.
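To make the comparison concrete, here is a small sketch (linear regression on synthetic data, all names hypothetical) where the same loop runs as full-batch gradient descent or as mini-batch SGD, depending only on the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # synthetic features
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)       # synthetic targets

def grad(w, Xb, yb):
    """Gradient of the mean-squared-error loss on a batch."""
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

def train(batch_size, lr=0.05, epochs=20):
    w = np.zeros(5)
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            w -= lr * grad(w, X[b], y[b])
    return w

print(train(batch_size=len(y)))   # full-batch GD: one smooth step per epoch
print(train(batch_size=32))       # mini-batch SGD: many cheap, noisier steps per epoch
```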

Newton Method

To solve f(x) = 0, Newton's method iterates:

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
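A minimal implementation of this root-finding iteration (function and derivative supplied by the caller; the names are my own):

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# sqrt(2) as the positive root of x^2 - 2 = 0
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))   # ≈ 1.41421356
```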

To find an extremum of f(x), we instead solve f'(x) = 0, which gives the iteration:

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
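The same loop, applied to f'(x) = 0, minimizes a one-dimensional function (a sketch; on a quadratic it lands on the minimizer in a single step because f'' is constant):

```python
def newton_minimize_1d(fprime, fsecond, x0, tol=1e-10, max_iter=50):
    """Newton's method on f'(x) = 0: x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# minimize f(x) = (x - 3)^2 + 1, whose minimizer is x = 3
print(newton_minimize_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))   # 3.0
```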

For the multivariate case:

x_{n+1} = x_n - (\nabla^2 f(x_n))^{-1} \nabla f(x_n) = x_n - H^{-1} g
where H is the Hessian matrix and g is the gradient.
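One multivariate Newton step can be written without forming H^{-1} explicitly, by solving the linear system H d = g instead. A sketch on a hypothetical quadratic, which Newton's method minimizes in a single step:

```python
import numpy as np

def newton_step(grad, hess, x):
    """x_{n+1} = x_n - H^{-1} g, computed by solving H d = g."""
    g, H = grad(x), hess(x)
    return x - np.linalg.solve(H, g)

# hypothetical quadratic f(x) = 0.5 x^T A x - b^T x, minimized where A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x1 = newton_step(grad, hess, np.zeros(2))
print(x1, np.linalg.solve(A, b))   # identical: one Newton step solves a quadratic exactly
```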

Problems

  • The Hessian may not be invertible
  • Inverting the Hessian is computationally expensive
  • The iteration may not converge to an optimal solution (convergence is not guaranteed at all)

Quasi-Newton Method

Instead of computing the inverse of the Hessian matrix directly, approximate it by other means.
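BFGS is the standard example: it builds an approximation to the inverse Hessian from successive gradient differences, so the Hessian is never formed or inverted. A sketch using SciPy's implementation on the Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    """Rosenbrock function, a standard non-quadratic test problem."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

# BFGS only needs function values and gradients; the inverse-Hessian
# approximation is updated internally from gradient differences.
res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="BFGS")
print(res.x)   # converges to the minimizer [1, 1]
```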

Improved iterative scaling method

Origin: blog.csdn.net/lovoslbdy/article/details/104860379