Machine learning review: a simple introduction to logistic regression and how to call it in code

Recently I have needed to review machine-learning fundamentals, so I am recording my notes here.

1. Introduction

Linear regression: $h(x) = w^T x + b$

Logistic regression adds a sigmoid function $g$ on top of the linear model, i.e. $h(x) = g(w^T x + b)$ with $g(z) = 1/(1 + e^{-z})$.
It converts the output of linear regression into a probability. Here $h(x)$ represents the probability that the event occurs, which we can also write as $p(Y = 1 \mid x)$.
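As a minimal numpy sketch of the formulas above (the helper names sigmoid and predict_proba are mine, purely for illustration):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # h(x) = g(w^T x + b), interpreted as p(Y = 1 | x)
    return sigmoid(X @ w + b)

# toy usage with made-up weights
w = np.array([0.5, -0.3])
b = 0.1
print(predict_proba(np.array([[1.0, 2.0]]), w, b))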

2. The loss function of logistic regression

Logistic regression uses the cross-entropy loss function.

For binary logistic regression, the cross-entropy loss of a single sample is $J(\theta) = -[y \ln(y') + (1 - y) \ln(1 - y')]$, where $y'$ is the predicted value.

In practice we care about the loss over all $m$ training samples, so:

$J(\theta) = -\frac{1}{m}\sum_i \left[ y_i \ln(y_i') + (1 - y_i) \ln(1 - y_i') \right]$
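A small numpy sketch of this averaged cross-entropy; the eps clipping is my addition to avoid log(0) and is not part of the formula:

import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # J = -(1/m) * sum[ y*ln(y') + (1-y)*ln(1-y') ]
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))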

3. The optimization method of logistic regression

3.1 Gradient Descent

The gradient of a function points in the direction in which the function increases fastest, so the opposite direction of the gradient is the direction in which it decreases fastest. To find the minimum of a function, we therefore move in the direction opposite to its gradient.
Suppose we need to optimize the function $f(X) = f(x_1, \dots, x_n)$.

First we initialize the independent variables at a starting point $X^{(0)} = (x_1^{(0)}, \dots, x_n^{(0)})$ and set a learning rate $\eta$.
For each iteration $i \ge 0$:

If we are minimizing $f$:

$x_1^{(i+1)} = x_1^{(i)} - \eta \frac{\partial f}{\partial x_1}(x^{(i)})$

$x_n^{(i+1)} = x_n^{(i)} - \eta \frac{\partial f}{\partial x_n}(x^{(i)})$

Conversely, if we want the maximum of $f$, then:

$x_1^{(i+1)} = x_1^{(i)} + \eta \frac{\partial f}{\partial x_1}(x^{(i)})$

$x_n^{(i+1)} = x_n^{(i)} + \eta \frac{\partial f}{\partial x_n}(x^{(i)})$
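A minimal sketch of these update rules as a gradient-descent loop; the toy function $f(x) = x_1^2 + x_2^2$, the learning rate, and the iteration count are illustrative choices, not from the original:

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_iters=100):
    # x^{(i+1)} = x^{(i)} - eta * grad f(x^{(i)})  (use + instead of - for maximization)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eta * grad_f(x)
    return x

# f(x) = x_1^2 + x_2^2 has gradient 2x and its minimum at (0, 0)
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))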

3.2 Optimization of Logistic Regression

The objective function to optimize for logistic regression is:
$J(w, b) = -\frac{1}{m}\sum_i \left[ y_i \ln(\sigma(w^T x_i + b)) + (1 - y_i) \ln(1 - \sigma(w^T x_i + b)) \right]$

We need to optimize the parameters $w, b$ so that $J(w, b)$ is as small as possible on the known samples $X, y$; this is what we usually call minimizing the empirical risk.

First we need to differentiate $J(w, b)$.

First let $g = \sigma(w^T x + b)$.

$\frac{\partial J(g)}{\partial g} = -\frac{\partial}{\partial g}\left[ y \ln(g) + (1 - y) \ln(1 - g) \right] = -\frac{y}{g} + \frac{1 - y}{1 - g}$

Next let $a = w^T x + b$.

$\frac{\partial g}{\partial a} = \frac{\partial}{\partial a}\left( \frac{1}{1 + e^{-a}} \right) = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \cdot \frac{1 + e^{-a} - 1}{1 + e^{-a}} = \sigma(a)(1 - \sigma(a)) = g(1 - g)$

Notice that $g = \sigma(a)$, yet the derivative of $g$ with respect to $a$ is simply $g(1 - g)$. In the subsequent gradient-descent optimization, this property of the sigmoid function saves a lot of unnecessary computation.

Next we compute the gradients with respect to the parameters $w, b$ that need to be optimized.
By the chain rule:

$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial g}\frac{\partial g}{\partial a}\frac{\partial a}{\partial w} = \left( -\frac{y}{g} + \frac{1 - y}{1 - g} \right) g(1 - g)\, x = (g - y)\, x$

$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial g}\frac{\partial g}{\partial a}\frac{\partial a}{\partial b} = \left( -\frac{y}{g} + \frac{1 - y}{1 - g} \right) g(1 - g) = (g - y)$
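Putting the derived gradients $(g - y)x$ and $(g - y)$ into a plain batch gradient-descent loop gives a from-scratch trainer; this is only a sketch with an assumed learning rate and iteration count, not production code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        g = sigmoid(X @ w + b)      # predicted probabilities
        dw = X.T @ (g - y) / m      # dJ/dw: average of (g - y) x over the batch
        db = np.mean(g - y)         # dJ/db: average of (g - y)
        w -= eta * dw
        b -= eta * db
    return w, b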

4. Calling LogisticRegression in sklearn

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load the iris dataset
iris = datasets.load_iris()
X, Y = iris['data'], iris['target']

# stratified split so the class proportions are preserved in the test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, stratify=Y)

model = LogisticRegression(penalty='l2',
                           class_weight=None,
                           random_state=None,
                           max_iter=100)
model.fit(X_train, y_train)
model.predict_proba(X_test)  # predicted class probabilities for the test set

penalty: the regularization type (what we often call regularization); 'l2' by default, and 'l1' is also available with a compatible solver such as liblinear or saga.

class_weight: per-class weights, generally used when the classes are imbalanced. For example, {0: 0.1, 1: 1} means that when computing the loss, the loss of class 0 is multiplied by 0.1; when class 0 has far more samples, this effectively up-weights class 1 (see the example below).

max_iter: the maximum number of iterations for the solver.
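For example, on an imbalanced binary problem you could pass the weight dictionary from above, or let sklearn derive weights from the class frequencies (the numbers here are illustrative, not from the original):

from sklearn.linear_model import LogisticRegression

# explicit weights: down-weight the majority class 0
model = LogisticRegression(penalty='l2', class_weight={0: 0.1, 1: 1.0}, max_iter=200)
# or weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced', max_iter=200)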

5. Why features are often discretized in logistic regression

This is a common practice in industry. Instead of feeding continuous values into the logistic regression model directly, we usually discretize them into 0/1 variables (see the sketch after this list). The benefits are:

1: Inner products with sparse 0/1 variables are fast to compute, the results are convenient to store, and the representation is easy to extend;

2: Discretized features are robust to abnormal data. For example, if a feature is "age > 30" it takes the value 1, otherwise 0. Without discretization, an abnormal value such as an age of 300 would disturb the model greatly;

3: Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N binary variables, each one has its own weight, which is equivalent to introducing nonlinearity into the model and improves its ability to fit;

4: After discretization, feature crosses can be formed, going from M + N variables to M * N variables, which introduces further nonlinearity and improves expressive power;

5: After discretization the model is more stable. For example, if user age is bucketed with 20-30 as one interval, a user does not become a completely different sample just because they are one year older. Of course, samples near the interval boundaries behave in the opposite way, so how the intervals are chosen matters.
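As a sketch of the discretization idea, sklearn's KBinsDiscretizer can bucket a continuous feature into one-hot 0/1 columns; the bin count and the data below are just for illustration:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# a continuous "age" feature, including one abnormal value
age = np.array([[18], [25], [31], [42], [58], [300]])

# quantile buckets, encoded as dense 0/1 indicator columns;
# each bucket then gets its own weight in logistic regression
disc = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='quantile')
print(disc.fit_transform(age))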

Original post: blog.csdn.net/zzpl139/article/details/129284471