Machine Learning (3): Generalized Linear Models

Copyright notice: https://blog.csdn.net/qq_26386707/article/details/79406364



Chenjing Ding
2018/02/28


| notation | meaning |
|---|---|
| $M$ | dimensionality of the feature space $\phi$ |
| $K$ | the number of classes |
| $N$ | the number of training samples |
| $W$ | the weight matrix |
| $w_k$ | the weight vector of class $C_k$ |
| $w_{kj}^{(\tau+1)}$ | the new weight of the $j$-th entry of $w_k$ |
| $\phi_n$ | $\phi_n = \phi(x_n)$ |
| $\phi_{ni}$ | $\phi_{ni} = \phi_i(x_n)$, the $i$-th entry of $\phi_n$ |
| $y(\phi_n)$ | column vector, the discriminant function vector over all classes |
| $y_k(\phi_n)$ | the $k$-th entry of $y(\phi_n)$, the discriminant function for class $k$ |

1. Linear Model and Generalized Linear Model

1.1 Definition

linear model:

$$y_k(x) = w_k^T x$$

$$Y(x_n) = [y_1(x_n),\, y_2(x_n),\, \dots,\, y_K(x_n)]^T = W^T x_n$$

$$\hat{Y}(X) = [Y(x_1),\, Y(x_2),\, \dots,\, Y(x_N)]^T = X W$$
See Machine Learning (3): Least-Squares Classification for more details.

generalized model:

$$y_k(x) = g(w_k^T x)$$

$g$ is the activation function:

$$g(a) = \frac{1}{1 + e^{-a}} \qquad \text{(logistic regression)}$$

$$g_k(x) = \frac{e^{x_k}}{\sum_{j=1}^K e^{x_j}} \qquad \text{(softmax regression)}$$
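The two activation functions above can be sketched in NumPy (a minimal illustration; the function names are ours):

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(x):
    """Softmax g_k(x) = exp(x_k) / sum_j exp(x_j)."""
    z = x - np.max(x)   # shifting by the max leaves the result unchanged, avoids overflow
    e = np.exp(z)
    return e / e.sum()

print(logistic(0.0))                    # 0.5
print(softmax(np.array([1.0, 1.0])))    # [0.5 0.5]
```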

1.2 Nonlinear Basis Functions

If g is monotonic (which is typically the case), the resulting decision boundaries are still linear in the input space x. To obtain nonlinear boundaries, transform the vector x with D nonlinear basis functions $\phi_i(x)$:

$$y_k(x_n) = g\big(w_k^T \phi(x_n)\big) = g\Big(\sum_{i=0}^{D} w_{ki}\,\phi_i(x_n)\Big), \qquad \phi_0(x_n) = 1$$

Advantages:
Nonlinear decision boundaries become possible, and by choosing the right $\phi_i(x)$, every continuous function can (in principle) be approximated with arbitrary accuracy.
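As a small sketch of such a basis transformation (Gaussian bumps are just one illustrative choice; `gaussian_basis`, its centers, and its width are ours):

```python
import numpy as np

def gaussian_basis(x, centers, s=1.0):
    """Map scalar inputs x to [1, phi_1(x), ..., phi_D(x)] using Gaussian bumps."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    phi = np.exp(-(x - centers) ** 2 / (2 * s ** 2))   # (N, D) basis responses
    return np.hstack([np.ones((len(x), 1)), phi])      # prepend phi_0 = 1 bias column

Phi = gaussian_basis([0.0, 0.5, 1.0], centers=np.array([0.0, 1.0]))
print(Phi.shape)   # (3, 3): bias column plus 2 basis functions
```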

1.3 Why Use a Generalized Linear Model Instead of a Linear Model

  • Can be used to limit the effect of outliers.
    In the linear model, $y_k(x_n)$ can grow arbitrarily large for some $x_n$. As a result, "too correct" points (far from the decision boundary but classified correctly) have a strong influence on the boundary (see ML(3) Least-squares classification).
    By choosing a suitable nonlinear activation function (e.g., the sigmoid in section 2.1), we can limit this influence, because the sigmoid output lies in (0, 1).

  • The choice of a sigmoid leads to a nice probabilistic interpretation (section 2.1).

However, least-squares minimization then in general no longer leads to a closed-form analytical solution, so we need gradient descent to update the weights (section 3).

2. How to obtain a generalized linear model

2.1 Logistic Sigmoid Activation Function

A nice probabilistic interpretation:
Consider 2 classes:

$$P(C_1|x) = \frac{p(x|C_1)P(C_1)}{p(x|C_1)P(C_1) + p(x|C_2)P(C_2)} = \frac{1}{1 + \frac{p(x|C_2)P(C_2)}{p(x|C_1)P(C_1)}} = \frac{1}{1 + e^{-a}}$$

with

$$a = \ln\frac{p(x|C_1)P(C_1)}{p(x|C_2)P(C_2)}, \qquad P(C_1|x) = g(a) = \frac{1}{1 + e^{-a}}$$

Thus, using the function g(a), the model output has a direct probabilistic interpretation.

logistic function:
$g(a)$ is the logistic function, also written $\sigma(a)$. Some properties of this function:

$$\sigma(-a) = 1 - \sigma(a), \qquad \frac{d\sigma}{da} = \sigma\,(1 - \sigma)$$
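Both properties are easy to verify numerically (a quick sanity check of our own, using a finite difference for the derivative):

```python
import numpy as np

def sigma(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
# Property 1: sigma(-a) = 1 - sigma(a)
assert np.isclose(sigma(-a), 1.0 - sigma(a))

# Property 2: d(sigma)/da = sigma(a) * (1 - sigma(a)), checked by finite difference
h = 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)
assert np.isclose(numeric, sigma(a) * (1.0 - sigma(a)), atol=1e-8)
```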

logistic regression:
In the following, we will consider models of the form

$$p(C_1|\phi) = y_1(\phi) = \sigma(a) = \sigma\big(w_1^T \phi(x)\big), \qquad p(C_2|\phi) = 1 - p(C_1|\phi)$$

This model is called logistic regression.

Why use logistic regression:
Because fewer parameters need to be updated.
Assume an M-dimensional feature space $\phi$. If we instead model $p(x|C_1)$ and $p(x|C_2)$ as Gaussians, we need:

  1. means: $2M$ ($M$ for $p(x|C_1)$, $M$ for $p(x|C_2)$);
  2. covariance: $\frac{M(M+1)}{2}$ (assuming $p(x|C_1)$ and $p(x|C_2)$ share the same covariance matrix);
  3. prior probability: $1$ (since $p(C_2) = 1 - p(C_1)$).

Thus in total, the Gaussian representation requires $2M + \frac{M(M+1)}{2} + 1 = \frac{M(M+5)}{2} + 1$ parameters to update. In logistic regression, we only need to update the $M$ parameters of $p(C_1|\phi)$, because $p(C_2|\phi) = 1 - p(C_1|\phi)$.

2.2 Normalized Exponential

For $t_n \in \{1, 2, \dots, K\}$:

$$P(C_k|x_n) = \frac{p(x_n|C_k)P(C_k)}{\sum_{i=1}^K p(x_n|C_i)P(C_i)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_j = \ln p(x_n|C_j)P(C_j) \tag{2.2.1}$$

Formula 2.2.1 is known as the softmax function, or normalized exponential.

$$y(x_n; W) = \begin{bmatrix} P(y{=}1|W, x_n) \\ P(y{=}2|W, x_n) \\ \vdots \\ P(y{=}K|W, x_n) \end{bmatrix} = \frac{1}{\sum_{k=1}^K \exp(w_k^T x_n)} \begin{bmatrix} \exp(w_1^T x_n) \\ \exp(w_2^T x_n) \\ \vdots \\ \exp(w_K^T x_n) \end{bmatrix} \tag{2.2.2}$$
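Formula 2.2.2 amounts to applying a softmax to the activations $W^T x_n$; a minimal sketch (the toy weights and input are ours):

```python
import numpy as np

def class_posteriors(W, x):
    """y(x; W) = softmax(W^T x): one posterior per class (formula 2.2.2)."""
    a = W.T @ x                 # activations a_k = w_k^T x, shape (K,)
    a = a - a.max()             # shift for numerical stability; result unchanged
    e = np.exp(a)
    return e / e.sum()

W = np.eye(2)                   # toy weight matrix: 2 features, 2 classes
y = class_posteriors(W, np.array([2.0, 0.0]))
print(np.round(y, 3))           # [0.881 0.119]
```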

3. Gradient descent to update the weights

3.1 Steps of Gradient Descent

Gradient descent is iterative minimization.
Step 1: start with an initial guess $w_{kj}^{(0)}$ for the parameter values.
Step 2: move towards a (local) minimum by following the negative gradient:

$$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E(W)}{\partial w_{kj}}\bigg|_{W^{(\tau)}} \tag{3.1}$$

Formula (3.1) corresponds to a first-order Taylor expansion. If you are interested in why gradient descent corresponds to a first-order Taylor expansion, read on.


Using the first-order Taylor expansion of $E(W)$ at $W^{(\tau)}$, with the update direction $\Delta = \frac{\partial E(W^{(\tau)})}{\partial W^{(\tau)}}$:

$$E\big(W^{(\tau)} - \eta\Delta\big) \approx E\big(W^{(\tau)}\big) + \Delta^T(-\eta\Delta) = E\big(W^{(\tau)}\big) - \eta\,\|\Delta\|^2 < E\big(W^{(\tau)}\big)$$

since $\eta > 0$.

Thus updating $W$ this way decreases $E(W)$ and leads towards a (local) minimum.
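The update rule can be sketched on a toy quadratic error (everything here, including the choice of $\eta$ and the target $w^*$, is illustrative):

```python
import numpy as np

# Minimal gradient-descent loop on the toy error E(w) = ||w - w*||^2,
# illustrating w^(tau+1) = w^(tau) - eta * dE/dw.
w_star = np.array([1.0, -2.0])
grad = lambda w: 2.0 * (w - w_star)   # dE/dw for this quadratic

w = np.zeros(2)
eta = 0.1
for _ in range(200):
    w = w - eta * grad(w)
print(np.round(w, 4))   # converges to w_star = [ 1. -2.]
```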


3.2 Batch Learning

Process the full data set at once to compute the gradient:

$$E(W) = \sum_{n=1}^N E_n(W), \qquad w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E(W)}{\partial w_{kj}}\bigg|_{W^{(\tau)}}$$

3.3 Stochastic Learning / Sequential Updating

Choose a single training sample $x_n$ to obtain $E_n(W)$:

$$w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,\frac{\partial E_n(W)}{\partial w_{kj}}\bigg|_{W^{(\tau)}}$$
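A minimal sketch of sequential updating on a noiseless toy least-squares problem (the data, step size, and epoch count are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true                       # noiseless targets

# Sequential updating: one sample n per step,
# w <- w - eta * dEn/dw with En = 0.5 * (t_n - w^T x_n)^2.
w = np.zeros(3)
eta = 0.05
for _ in range(50):                  # 50 passes over the shuffled data
    for n in rng.permutation(len(X)):
        err = X[n] @ w - t[n]
        w -= eta * err * X[n]
print(np.round(w, 3))                # approaches w_true
```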

3.4 Delta Rule / LMS Rule

The Delta/LMS rule is based on the least-squares error.
Error function (least-squares error) of the linear model:

$$E(W) = \frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K \big(t_{kn} - y_k(x_n)\big)^2 = \frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K \Big(t_{kn} - \sum_{j=1}^M w_{kj}\,\phi_j(x_n)\Big)^2$$

$$E_n(W) = \frac{1}{2}\sum_{k=1}^K \big(t_{kn} - y_k(x_n)\big)^2 = \frac{1}{2}\sum_{k=1}^K \Big(t_{kn} - \sum_{j=1}^M w_{kj}\,\phi_j(x_n)\Big)^2$$

$$\frac{\partial E_n(W)}{\partial w_{\hat{k}j}} = \big(y_{\hat{k}}(x_n) - t_{\hat{k}n}\big)\,\phi_j(x_n)$$

$$w_{\hat{k}j}^{(\tau+1)} = w_{\hat{k}j}^{(\tau)} - \eta\,\big(y_{\hat{k}}(x_n) - t_{\hat{k}n}\big)\,\phi_j(x_n) \tag{3.4.1}$$

Cases with a differentiable, nonlinear activation function $g$:

$$E_n(W) = \frac{1}{2}\sum_{k=1}^K \Big(t_{kn} - g\Big(\sum_{j=1}^M w_{kj}\,\phi_j(x_n)\Big)\Big)^2$$

$$\frac{\partial E_n(W)}{\partial w_{\hat{k}j}} = g'(a_{\hat{k}n})\,\big(y_{\hat{k}}(x_n) - t_{\hat{k}n}\big)\,\phi_j(x_n), \qquad a_{\hat{k}n} = \sum_{j=1}^M w_{\hat{k}j}\,\phi_j(x_n) \tag{3.4.2}$$

Both formulas 3.4.1 and 3.4.2 are called the Delta/LMS rule.
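One Delta/LMS update for the nonlinear case (formula 3.4.2), using a sigmoid as the differentiable activation; the function name and toy inputs are ours:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule_step(W, phi_n, t_n, eta=0.1):
    """One step of W <- W - eta * g'(a) * (y - t) * phi_n^T  (formula 3.4.2)."""
    a = W @ phi_n                                 # activations, shape (K,)
    y = sigma(a)                                  # outputs y_k(x_n)
    g_prime = y * (1.0 - y)                       # sigmoid derivative g'(a)
    grad = np.outer(g_prime * (y - t_n), phi_n)   # (K, M) per-sample gradient
    return W - eta * grad

W = delta_rule_step(np.zeros((2, 3)), np.array([1.0, 0.5, -0.5]), np.array([1.0, 0.0]))
print(W.shape)   # (2, 3)
```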

3.5 Logistic Regression

3.5.1 Gradient Descent (1st order)

Let’s consider a data set $(\phi_n, t_n)$ with $n = 1, \dots, N$, where $\phi_n = \phi(x_n)$, $t_n \in \{0, 1\}$, and $t = (t_1, t_2, \dots, t_N)^T$.

With $y_n = p(C_1|\phi_n)$, we can write the likelihood as

$$p(t|w) = \prod_{n=1}^N y_n^{t_n}\,(1 - y_n)^{1 - t_n}$$

Since $y_n$ depends only on the parameter vector $w$, and the second output is simply $y_{C_2}(x_n) = 1 - y_n$, a single weight vector $w$ suffices; we need not use a capital weight matrix $W$.
Define the error function as the negative log-likelihood:

$$E(w) = -\ln p(t|w) = -\sum_{n=1}^N \big[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big] \tag{3.5.1}$$

Formula 3.5.1 is the so-called cross-entropy error function.
With $y_n = \sigma(w^T\phi(x_n))$, we have $\frac{\partial y_n}{\partial w} = y_n(1 - y_n)\,\phi(x_n)$, so

$$\frac{\partial E(w)}{\partial w} = -\sum_{n=1}^N \bigg[\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\bigg]\frac{\partial y_n}{\partial w} = -\sum_{n=1}^N \big[t_n(1 - y_n) - (1 - t_n)\,y_n\big]\,\phi(x_n)$$

$$= \sum_{n=1}^N (y_n - t_n)\,\phi(x_n) \tag{3.5.2}$$

Formula 3.5.2 is the gradient for logistic regression; it has the same form as the delta rule (formula 3.4.1). We can use sequential updating:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\frac{\partial E_n(w^{(\tau)})}{\partial w^{(\tau)}} = w^{(\tau)} - \eta\,(y_n - t_n)\,\phi(x_n)$$

Disadvantage:
Gradient descent is relatively slow to converge, so the Newton-Raphson method (2nd order) is introduced.
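The sequential update above can be sketched on toy 1-D data (the data, step size, and epoch count are ours):

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sequential updating for logistic regression: w <- w - eta * (y_n - t_n) * phi_n.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
t = (x > 0).astype(float)                      # true boundary at x = 0
Phi = np.column_stack([np.ones_like(x), x])    # phi_0 = 1 (bias), phi_1 = x

w = np.zeros(2)
eta = 0.1
for _ in range(100):
    for n in rng.permutation(len(x)):
        y_n = sigma(w @ Phi[n])
        w -= eta * (y_n - t[n]) * Phi[n]

accuracy = ((sigma(Phi @ w) > 0.5) == (t > 0.5)).mean()
print(accuracy)   # training accuracy close to 1.0
```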

3.5.2 Newton-Raphson (2nd order)

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,H^{-1}\,\frac{\partial E(w^{(\tau)})}{\partial w^{(\tau)}}$$

$H$ is the Hessian matrix (the matrix of second derivatives). According to 3.5.2:

$$\frac{\partial E(w)}{\partial w} = \sum_{n=1}^N (y_n - t_n)\,\phi(x_n) = \Phi^T(y - t)$$

$$H = \frac{\partial}{\partial w}\,\Phi^T(y - t) = \Phi^T\,\frac{\partial y}{\partial w} = \Phi^T R\,\Phi$$

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,(\Phi^T R\,\Phi)^{-1}\,\Phi^T(y - t)$$

$R$ is an $N \times N$ diagonal matrix with $R_{nn} = y_n(1 - y_n)$, $\Phi$ is the $N \times M$ design matrix whose $n$-th row is $\phi_n^T$, and $y$ and $t$ are column vectors.
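A sketch of this Newton-Raphson update with $\eta = 1$; the toy data, the deliberately flipped labels (to keep the data non-separable), and the tiny ridge term (added only to keep $H$ invertible) are ours:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(w, Phi, t, ridge=1e-8):
    """One step: w <- w - (Phi^T R Phi)^(-1) Phi^T (y - t), with R_nn = y_n (1 - y_n)."""
    y = sigma(Phi @ w)
    R = np.diag(y * (1.0 - y))
    H = Phi.T @ R @ Phi + ridge * np.eye(len(w))   # tiny ridge keeps H invertible
    return w - np.linalg.solve(H, Phi.T @ (y - t))

x = np.linspace(-3, 3, 100)
t = (x > 0).astype(float)
t[48], t[51] = 1.0, 0.0                # flip two labels near 0: data not separable
Phi = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
for _ in range(10):
    w = newton_step(w, Phi, t)
print(np.round(w, 2))                  # finite weights with a positive slope
```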

Logistic regression only handles 2 classes; for multiple classes, softmax regression is introduced.

3.6 Softmax Regression

For $t_n \in \{1, 2, \dots, K\}$:

$$y(x_n; W) = \begin{bmatrix} y_1(x_n) \\ y_2(x_n) \\ \vdots \\ y_K(x_n) \end{bmatrix} = \frac{1}{\sum_{k=1}^K \exp(w_k^T x_n)} \begin{bmatrix} \exp(w_1^T x_n) \\ \exp(w_2^T x_n) \\ \vdots \\ \exp(w_K^T x_n) \end{bmatrix}$$

cross-entropy error function:

$$E(W) = -\sum_{n=1}^N\sum_{k=1}^K I(t_n = k)\,\ln y_k(x_n; W) = -\sum_{n=1}^N\sum_{k=1}^K I(t_n = k)\,\ln \frac{\exp(w_k^T x_n)}{\sum_{i=1}^K \exp(w_i^T x_n)}$$

To obtain the first-order gradient, write

$$y_{kn} = y_k(W^T x_n) = \frac{\exp(w_k^T x_n)}{\sum_{i=1}^K \exp(w_i^T x_n)}$$

$$\frac{\partial y_k(\hat{x})}{\partial \hat{x}_k} = \frac{e^{\hat{x}_k}\sum_i e^{\hat{x}_i} - e^{\hat{x}_k}\,e^{\hat{x}_k}}{\big(\sum_i e^{\hat{x}_i}\big)^2} = y_k(\hat{x}) - y_k(\hat{x})^2 \tag{3.6.1}$$

$$\frac{\partial y_k(\hat{x})}{\partial \hat{x}_j} = \frac{0 - e^{\hat{x}_j}\,e^{\hat{x}_k}}{\big(\sum_i e^{\hat{x}_i}\big)^2} = -\,y_k(\hat{x})\,y_j(\hat{x}) \qquad (j \neq k) \tag{3.6.2}$$

According to 3.6.1 and 3.6.2:

$$\frac{\partial y_k(W^T x_n)}{\partial(w_k^T x_n)} = y_k(W^T x_n) - y_k(W^T x_n)^2, \qquad \frac{\partial y_k(W^T x_n)}{\partial(w_j^T x_n)} = -\,y_k(W^T x_n)\,y_j(W^T x_n)$$

Let $V = [v_1, v_2, \dots, v_K]$ with

$$v_j = \frac{\partial E_n(W)}{\partial y_j(W^T x_n)} = -\frac{I(t_n = j)}{y_j(W^T x_n)}$$

where $K$ is the number of classes (the derivative vanishes for $j \neq t_n$). Then

$$\frac{\partial E_n(W)}{\partial w_k} = \sum_{j=1}^K \frac{\partial E_n(W)}{\partial y_{jn}}\,\frac{\partial y_{jn}}{\partial(w_k^T x_n)}\,\frac{\partial(w_k^T x_n)}{\partial w_k} = \Big(\sum_{j=1,\,j\neq k}^K v_j\,(-y_{kn}\,y_{jn}) + v_k\,(y_{kn} - y_{kn}^2)\Big)\,x_n = \Big[v_k\,y_{kn} - y_{kn}\sum_{j=1}^K v_j\,y_{jn}\Big]\,x_n$$

$$= y_{kn}\,\big(v_k - V\,y(x_n)\big)\,x_n \tag{3.6.3}$$

Formula 3.6.3 is a matrix computation; it is highly recommended to use this form in your code.
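As a numerical check (our own, not from the post), formula 3.6.3 with $v_j = -I(t_n{=}j)/y_{jn}$ reproduces the well-known closed form $(y_{kn} - I(t_n{=}k))\,x_n$ for the softmax cross-entropy gradient:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(3)
K, D = 4, 5
W = rng.normal(size=(D, K))
x = rng.normal(size=D)
t_n = 2                                    # true class index of this sample

y = softmax(W.T @ x)                       # y_kn, shape (K,)
v = np.zeros(K)
v[t_n] = -1.0 / y[t_n]                     # v_j = dE_n/dy_jn = -I(t_n = j) / y_jn

G = np.outer(y * (v - v @ y), x)           # formula 3.6.3: y_kn (v_k - V y) x_n, (K, D)
G_closed = np.outer(y - np.eye(K)[t_n], x) # closed form (y_kn - I(t_n = k)) x_n
print(np.allclose(G, G_closed))            # True
```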


To help your understanding, some related matrices are listed here; they are very helpful when coding:

$$\phi_n = W^T x_n$$

Softmax Jacobian matrix ($K \times N$):

$$J = \begin{bmatrix} \frac{\partial y_1(\phi_1)}{\partial \phi_1} & \frac{\partial y_1(\phi_2)}{\partial \phi_2} & \cdots & \frac{\partial y_1(\phi_N)}{\partial \phi_N} \\ \frac{\partial y_2(\phi_1)}{\partial \phi_1} & \frac{\partial y_2(\phi_2)}{\partial \phi_2} & \cdots & \frac{\partial y_2(\phi_N)}{\partial \phi_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_K(\phi_1)}{\partial \phi_1} & \frac{\partial y_K(\phi_2)}{\partial \phi_2} & \cdots & \frac{\partial y_K(\phi_N)}{\partial \phi_N} \end{bmatrix}$$

gradient matrix:

$$V = \begin{bmatrix} \frac{\partial E(W)}{\partial y_1(\phi_1)} & \frac{\partial E(W)}{\partial y_2(\phi_1)} & \cdots & \frac{\partial E(W)}{\partial y_K(\phi_1)} \\ \frac{\partial E(W)}{\partial y_1(\phi_2)} & \frac{\partial E(W)}{\partial y_2(\phi_2)} & \cdots & \frac{\partial E(W)}{\partial y_K(\phi_2)} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial E(W)}{\partial y_1(\phi_N)} & \frac{\partial E(W)}{\partial y_2(\phi_N)} & \cdots & \frac{\partial E(W)}{\partial y_K(\phi_N)} \end{bmatrix}$$

$$V_n = \Big[\tfrac{\partial E(W)}{\partial y_1(\phi_n)}\ \ \tfrac{\partial E(W)}{\partial y_2(\phi_n)}\ \ \cdots\ \ \tfrac{\partial E(W)}{\partial y_K(\phi_n)}\Big]$$

$$y(x_n) = [y_1(x_n)\ \ y_2(x_n)\ \ \cdots\ \ y_K(x_n)]^T$$

$$G = \frac{\partial E(W)}{\partial W} = \begin{bmatrix} \frac{\partial E_1(W)}{\partial w_1} & \frac{\partial E_2(W)}{\partial w_1} & \cdots & \frac{\partial E_N(W)}{\partial w_1} \\ \frac{\partial E_1(W)}{\partial w_2} & \frac{\partial E_2(W)}{\partial w_2} & \cdots & \frac{\partial E_N(W)}{\partial w_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial E_1(W)}{\partial w_K} & \frac{\partial E_2(W)}{\partial w_K} & \cdots & \frac{\partial E_N(W)}{\partial w_K} \end{bmatrix}$$

$$G_{kn} = y_{kn}\,\big(V_{nk} - V_n\,y(x_n)\big)\,x_n$$
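Summing the per-sample gradients $G_{kn}$ over $n$ gives the full batch gradient; with one-hot targets this collapses to a single matrix product $(Y - T)^T X$. A vectorized sketch (the shapes and random data are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, K = 6, 3, 4
X = rng.normal(size=(N, D))          # data matrix, one sample per row
W = rng.normal(size=(D, K))          # weight matrix, one column per class
t = rng.integers(0, K, size=N)       # class labels in {0, ..., K-1}

A = X @ W                            # activations, N x K
A -= A.max(axis=1, keepdims=True)    # stabilize exp row-wise
Y = np.exp(A)
Y /= Y.sum(axis=1, keepdims=True)    # softmax outputs, rows sum to 1
T = np.eye(K)[t]                     # one-hot targets, N x K

G = (Y - T).T @ X                    # batch gradient dE/dW, K x D
print(G.shape)                       # (4, 3)
```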

