[Statistical Learning | Book Reading] Chapter 6: Logistic Regression and Maximum Entropy Model, pp. 77-88

Main idea

Logistic regression is a classic classification method in statistical learning. Maximum entropy is a criterion for learning probabilistic models; applying it to classification problems yields the maximum entropy model.

logistic regression model

Logistic distribution: Let $X$ be a continuous random variable. $X$ follows the logistic distribution if it has the following distribution function and density function:
$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$
$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$
where $\mu$ is the location parameter and $\gamma > 0$ is the scale parameter.
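A minimal NumPy sketch of these two formulas; the defaults below give the standard logistic distribution with $\mu = 0$, $\gamma = 1$:

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma))."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density f(x) = F'(x), the derivative of the distribution function."""
    z = np.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

print(logistic_cdf(0.0))  # 0.5: the curve is symmetric about x = mu
print(logistic_pdf(0.0))  # 0.25: the density peaks at x = mu with value 1/(4*gamma)
```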

Logistic regression model: The binomial logistic regression model is the following conditional probability distribution:
$$P(Y=1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$$
$$P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)}$$
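A minimal sketch of evaluating the two conditional probabilities, with $w$, $x$ as NumPy vectors and $b$ a scalar; the sigmoid is written with $\exp(-s)$, which equals $\exp(s)/(1+\exp(s))$ but avoids overflow for large positive scores:

```python
import numpy as np

def binary_logreg_proba(x, w, b):
    """Return (P(Y=1|x), P(Y=0|x)) for the binomial logistic regression model."""
    s = np.dot(w, x) + b
    p1 = 1.0 / (1.0 + np.exp(-s))  # equals exp(s) / (1 + exp(s))
    return p1, 1.0 - p1

# Toy usage with made-up weights
p1, p0 = binary_logreg_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1)
print(p1, p0)  # the two probabilities sum to 1
```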

Parameter estimation of the model: The model parameters can be estimated by maximum likelihood, which yields the logistic regression model. With $b$ absorbed into $w$ through a constant feature, the log-likelihood function is
$$L(w) = \sum_{i=1}^{N} \left[ y_i (w \cdot x_i) - \log\left(1 + \exp(w \cdot x_i)\right) \right].$$
Maximizing $L(w)$ gives the estimate of $w$. In this way, the problem becomes an optimization problem with the log-likelihood function as the objective function. Gradient descent and Newton's method are commonly used in logistic regression learning.
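A minimal sketch of maximizing $L(w)$ by gradient ascent, assuming the bias is absorbed into $w$ via a constant column; the gradient of the log-likelihood above works out to $\nabla L(w) = \sum_i (y_i - \sigma(w \cdot x_i))\, x_i$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=2000):
    """Estimate w by gradient ascent on the log-likelihood L(w)."""
    X = np.hstack([X, np.ones((len(X), 1))])   # constant column absorbs b
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ w))      # gradient of L(w)
        w += lr * grad / len(y)
    return w

# Toy data: label 1 exactly when the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(fit_logreg(X, y))  # learned [w, b]
```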

The multinomial logistic regression model for multi-class classification is defined analogously, as sketched below.
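A sketch of one common parameterization, with one weight vector per class normalized by a softmax (the book instead fixes the last class as a reference; subtracting the maximum score is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def multinomial_proba(x, W):
    """P(Y=k|x) for k = 0..K-1, with one weight row W[k] per class."""
    s = W @ x
    s = s - s.max()   # stabilize exp() without changing the probabilities
    e = np.exp(s)
    return e / e.sum()
```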

maximum entropy model

Maximum Entropy Model Definition

The principle of maximum entropy can be expressed as selecting the model with the largest entropy in the set of models satisfying the constraints.
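A small numerical illustration: with no constraints beyond normalization, the uniform distribution has the largest entropy, matching the intuition that maximum entropy means assuming nothing beyond what the constraints force. The sketch below uses the natural logarithm, as the model definitions in this chapter do:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p * log p, natural logarithm, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # log 4 ~ 1.386, maximal over 4 outcomes
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ~ 0.940, more concentrated, lower entropy
```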

Maximum entropy model: Assume the set of models satisfying all the constraint conditions is
$$\mathcal{C} = \left\{ P \;\middle|\; E_P(f_i) = E_{\tilde{P}}(f_i), \; i = 1, 2, \dots, n \right\},$$
where $E_{\tilde{P}}(f_i)$ is the expected value of the feature function $f_i(x, y)$ with respect to the empirical distribution $\tilde{P}(X, Y)$, and $E_P(f_i)$ is its expected value with respect to the model $P(Y \mid X)$ and the empirical marginal $\tilde{P}(X)$.

The conditional entropy defined on the conditional probability distribution $P(Y \mid X)$ is
$$H(P) = -\sum_{x,y} \tilde{P}(x) \, P(y \mid x) \log P(y \mid x).$$
The model in $\mathcal{C}$ whose conditional entropy $H(P)$ is largest is called the maximum entropy model; the logarithm in the formula is the natural logarithm.
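A direct sketch of this formula, with the empirical marginal $\tilde{P}(x)$ and the model $P(y \mid x)$ passed as plain dictionaries (all names illustrative):

```python
import numpy as np

def conditional_entropy(p_tilde_x, p_y_given_x):
    """H(P) = -sum_{x,y} P~(x) P(y|x) log P(y|x), natural logarithm."""
    h = 0.0
    for x, px in p_tilde_x.items():
        for pyx in p_y_given_x[x].values():
            if pyx > 0:
                h -= px * pyx * np.log(pyx)
    return h

# Two inputs; the model is uniform over two labels for each input
p_tilde_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.5, 1: 0.5}}
print(conditional_entropy(p_tilde_x, p_y_given_x))  # log 2 ~ 0.693
```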

Learning of Maximum Entropy Models

The learning process of the maximum entropy model is the process of solving for the model itself, and it can be formalized as a constrained optimization problem.

The constrained primal optimization problem is transformed into the unconstrained dual problem, and the primal problem is solved by solving the dual, as sketched below.
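Following the book's derivation, the primal problem can be written roughly as follows (minimizing $-H(P)$ is equivalent to maximizing $H(P)$):
$$
\begin{aligned}
\min_{P \in \mathcal{C}} \quad & -H(P) = \sum_{x,y} \tilde{P}(x) \, P(y \mid x) \log P(y \mid x) \\
\text{s.t.} \quad & E_P(f_i) - E_{\tilde{P}}(f_i) = 0, \quad i = 1, 2, \dots, n \\
& \sum_{y} P(y \mid x) = 1
\end{aligned}
$$
Introducing Lagrange multipliers $w_0, w_1, \dots, w_n$ gives a Lagrangian $L(P, w)$; the primal problem is $\min_P \max_w L(P, w)$ and the dual is $\max_w \min_P L(P, w)$, and since $-H(P)$ is a convex function of $P$, the two share the same solution. Solving the inner minimization over $P$ in closed form yields the parametric form below.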

The resulting maximum entropy model is
$$P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\left( \sum_{i=1}^{n} w_i f_i(x, y) \right), \quad \text{where} \quad Z_w(x) = \sum_{y} \exp\left( \sum_{i=1}^{n} w_i f_i(x, y) \right).$$
$Z_w(x)$ is the normalization factor, $f_i(x, y)$ are the feature functions, $w_i$ are the feature weights, and $w$ is the parameter vector of the maximum entropy model.
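A sketch of evaluating this parametric form, with the feature functions given as Python callables $f(x, y)$ and a finite label set (all names illustrative):

```python
import numpy as np

def maxent_proba(x, y, labels, features, w):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x)."""
    score = lambda yy: np.exp(sum(wi * f(x, yy) for wi, f in zip(w, features)))
    z_w = sum(score(yy) for yy in labels)   # normalization factor Z_w(x)
    return score(y) / z_w

# Toy example: one indicator feature, f1(x, y) = 1 if y matches x's sign else 0
features = [lambda x, y: 1.0 if (x > 0) == (y == 1) else 0.0]
w = [1.5]
print(maxent_proba(2.0, 1, labels=[0, 1], features=features, w=w))  # e^1.5/(e^1.5+1)
```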

maximum likelihood estimation

Maximizing the dual function in maximum entropy model learning is equivalent to maximum likelihood estimation of the maximum entropy model. In this way, the learning problem is transformed into maximizing the log-likelihood function or, equivalently, maximizing the dual function.
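Concretely, substituting the parametric form $P_w$ into the log-likelihood under the empirical distribution $\tilde{P}(X, Y)$ gives, following the book,
$$
L_{\tilde{P}}(P_w) = \sum_{x,y} \tilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log Z_w(x),
$$
and this expression coincides with the dual function, so maximizing either one yields the same $w$.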

Optimization Algorithms for Model Learning

Learning both the logistic regression model and the maximum entropy model reduces to an optimization problem with the likelihood function as the objective, which is usually solved by iterative algorithms. From the optimization point of view, the objective function here has good properties: it is a smooth convex function, so many optimization methods apply and are guaranteed to find the global optimum. Commonly used methods are the improved iterative scaling method, gradient descent, and Newton or quasi-Newton methods; Newton and quasi-Newton methods generally converge faster.
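As an illustration of the Newton family on the logistic regression objective, one Newton step looks roughly like this; the Hessian of $L(w)$ is $-\sum_i p_i(1 - p_i)\, x_i x_i^{\top}$. This is a sketch, not a production solver:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(X, y, w):
    """One Newton update w <- w - H^{-1} grad for maximizing L(w)."""
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)                  # gradient of L(w)
    H = -(X.T * (p * (1.0 - p))) @ X      # Hessian of L(w), negative definite
    return w - np.linalg.solve(H, grad)
```

Quasi-Newton methods such as BFGS build an approximation to this Hessian from successive gradients, avoiding the explicit linear solve at each step.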
