Logistic regression model and Python code implementation

Principle of Logistic Regression

The basic linear regression model introduced earlier, together with its regularized variants, can only predict continuous values (i.e., what the label value is). To apply it to classification problems (where the label is 1 or 0), a further step is needed: logistic regression, the subject of this article, which essentially predicts the probabilities that the label is 1 or 0, respectively.

Since a continuous value ranges over $(-\infty, +\infty)$ while a probability ranges over $(0, 1)$, the continuous value produced by the linear regression model must be transformed and compressed into $(0, 1)$.

The sigmoid function

In the logistic regression model, continuous values are compressed using the sigmoid function, whose expression is

$$f(x) = \frac{1}{1 + e^{-x}}$$

The graph of the sigmoid function is shown below; clearly the range of $f(x)$ is $(0, 1)$.

The Python code for plotting this curve is:

import matplotlib.pyplot as plt
import numpy as np
import math


# sigmoid function
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
    

def plot_sigmoid():
    
    # construct the x and y data
    x = np.linspace(-6, 6, 100)
    y = []
    for i in x:
        y.append(sigmoid(i))
    
    # plot the curve
    plt.figure(figsize=(5, 5))
    plt.plot(x, y)
    
    # move the axes to the origin and remove the right spine
    ax = plt.gca()
    ax.xaxis.set_ticks_position('bottom')
    ax.spines['bottom'].set_position(('data', 0))
    ax.yaxis.set_ticks_position('left')
    ax.spines['left'].set_position(('data', 0))
    ax.spines['right'].set_color('none')
    
    # add axis labels
    plt.xlabel('x', loc='right')
    plt.ylabel('f(x)', loc='top', rotation=0)

    plt.show()


if __name__ == '__main__':
    # plot the sigmoid curve
    plot_sigmoid()

In addition to the graphical view above, we can also derive this mathematically. For a sample, the predicted value $\hat{y}$ from linear regression is

$$\hat{y} = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b$$

The range of $\hat{y}$ is $(-\infty, +\infty)$, so the range of the exponential $e^{\hat{y}}$ is $(0, +\infty)$. Construct the expression

$$\frac{e^{\hat{y}}}{e^{\hat{y}} + 1}$$

Its range is clearly $(0, 1)$. Dividing both the numerator and the denominator by $e^{\hat{y}}$ gives

$$\frac{1}{1 + e^{-\hat{y}}}$$

which is exactly the sigmoid function.
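As a quick numerical sanity check (a minimal sketch, not from the original post; the vectorized sigmoid helper here is just for illustration), we can verify that the sigmoid output falls strictly inside (0, 1) and that the two forms above are identical:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
p = sigmoid(z)
print(p)  # every value lies strictly between 0 and 1
print(np.allclose(p, np.exp(z) / (np.exp(z) + 1)))  # the two forms agree -> True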

At this point a natural question arises: why does logistic regression use the sigmoid function? Could other functions be used instead?
I searched online for a long time but did not find an explanation that fully convinced me; one of the more comprehensive answers is included here for reference.

Optimization modeling

Based on the above analysis, we can make the following definition: for any sample, the probability that its classification result is 1 is

$$f = \frac{1}{1 + e^{-\hat{y}}}$$

where $\hat{y} = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b$.

The probability that the sample is classified as 0 is

$$g = 1 - \frac{1}{1 + e^{-\hat{y}}} = \frac{e^{-\hat{y}}}{1 + e^{-\hat{y}}}$$

Multiplying both the numerator and the denominator by $e^{\hat{y}}$ gives

$$g = \frac{1}{1 + e^{\hat{y}}}$$

For a single sample, an accurate classification model should maximize $f$ if the true label $y$ is 1, and maximize $g$ otherwise.
Therefore, for the entire training set, the following optimization model can be established: find the best $w_1, w_2, ..., w_n, b$ that maximize the expression

$$F \cdot G$$

where $F$ is the product of the $f$ values of all samples whose true label is 1, and $G$ is the product of the $g$ values of all samples whose true label is 0.
This is in fact the maximum likelihood function from probability theory, but the name is not important; what matters is understanding the modeling process.

The problem defined above is an unconstrained optimization problem that can be solved with many optimization algorithms, such as gradient-based methods.
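One practical note: since $F \cdot G$ is a product of many probabilities all less than 1, its value quickly becomes vanishingly small. The standard remedy (and the one used in the self-written code below) is to maximize its logarithm instead, i.e. to minimize the negative log-likelihood

$$\min_{w, b} \; -\sum_{i:\, y_i = 1} \log f_i \; - \sum_{i:\, y_i = 0} \log g_i$$

Because the logarithm is monotonically increasing, this change does not affect the optimal solution.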

Code

Self-written code

Next, let's look at how to write our own code to optimize the weight coefficients $w$ and the intercept $b$.

The following code uses the forge dataset. This dataset has 2-dimensional features, so there are three variables to optimize: $w_1, w_2, b$. The variable values accumulates the product as a sum of logarithms; taking the log guards against premature convergence caused by the product itself becoming too small. The optimization is carried out with scipy.optimize.

import math

from scipy import optimize


# self-written code: the objective function to be minimized
def f(t):

    # [Dataset 1] forge; forge() is assumed to be a helper defined elsewhere that returns features X and labels y
    X, y = forge()

    values = 0

    for i in range(len(y)):
        # take the log of each factor; otherwise the product is too small to optimize
        if y[i] == 1:
            values -= math.log(1 / (1 + math.exp(-t[0] * X[i, 0] - t[1] * X[i, 1] - t[2])))
        if y[i] == 0:
            values -= math.log(1 / (1 + math.exp(t[0] * X[i, 0] + t[1] * X[i, 1] + t[2])))

    return values


# optimize the weight coefficients and intercept with self-written code
def logreg_by_self():

    res = optimize.minimize(f, [0, 0, 0], method='BFGS')
    print('coef_by_self: {}, intercept_by_self: {}'.format(res.x[0:2], res.x[2]))
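As a side note, the loop above can be vectorized with NumPy, using np.logaddexp(0, -s) to evaluate log(1 + e^(-s)) without overflow. This is only a sketch under the same assumption that forge() returns X and y, not the original author's code:

import numpy as np
from scipy import optimize


def f_vec(t, X, y):
    # linear predictor z = w1*x1 + w2*x2 + b for every sample
    z = X @ t[:2] + t[2]
    # signed margin: +z for samples labeled 1, -z for samples labeled 0
    s = np.where(y == 1, z, -z)
    # -log(sigmoid(s)) = log(1 + exp(-s)), evaluated stably by logaddexp
    return np.logaddexp(0, -s).sum()


# usage, assuming forge() is available:
# X, y = forge()
# res = optimize.minimize(f_vec, np.zeros(3), args=(X, y), method='BFGS')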

sklearn code

If you prefer to call a toolkit, you can use LogisticRegression from sklearn.linear_model directly.

from sklearn.linear_model import LogisticRegression

# optimize the weight coefficients and intercept with sklearn
def logreg_by_sklearn():
    # [Dataset 1] forge
    X, y = forge()

    # penalty='none' disables regularization (newer sklearn versions spell it penalty=None)
    logreg = LogisticRegression(penalty='none', solver="lbfgs")
    logreg.fit(X, y)
    print('coef_by_sklearn: {}, intercept_by_sklearn: {}'.format(logreg.coef_, logreg.intercept_))
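Once fitted, the model can be used for prediction: predict returns class labels and predict_proba returns the probabilities of classes 0 and 1 (a brief usage sketch continuing from the fitted logreg above):

# predicted labels and class probabilities for the first three samples
print(logreg.predict(X[:3]))
print(logreg.predict_proba(X[:3]))  # each row holds [P(y=0), P(y=1)] and sums to 1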

Code testing

Principle test

First, verify the correctness of the self-written code: call the two functions above and compare the weight coefficients and intercepts they produce.

if __name__ == '__main__':

    # compare the self-written and sklearn results
    logreg_by_self()
    logreg_by_sklearn()

The output is shown below. The weight coefficients and intercepts obtained by the two methods are clearly consistent, which verifies the correctness of both the principle and the code.

coef_by_self: [1.28004377 4.13686808], intercept_by_self: -21.37174954243782
coef_by_sklearn: [[1.28004046 4.13685915]], intercept_by_sklearn: [-21.37170026]

Cross-validation

In fact, a cross-validation strategy can also be applied to the logistic regression model to further improve its performance. The following code uses cross-validation on the cancer dataset to find the best regularization coefficient C.

import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split


# logistic regression with cross-validation in sklearn
def logreg_cv_by_sklearn(X, y):
    # split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # construct a range of candidate C values; LogisticRegressionCV finds the best C by cross-validation
    Cs = np.logspace(-5, 2, 20)
    # the default lbfgs solver warns that it fails to converge here, so liblinear is used instead
    logregCV = LogisticRegressionCV(Cs=Cs, cv=5, solver="liblinear")
    logregCV.fit(X_train, y_train)
    print('Logreg_best_C: {}'.format(logregCV.C_))

    # retrain with the best C
    logreg = LogisticRegression(C=logregCV.C_[0], solver="liblinear")
    logreg.fit(X_train, y_train)

    print('Training set score: {}'.format(logreg.score(X_train, y_train)))
    print('Test set score: {}'.format(logreg.score(X_test, y_test)))


if __name__ == '__main__':

    # [Dataset 3] cancer, used for cross-validation; breast_cancer() is assumed to be a helper wrapping sklearn's breast cancer dataset
    cancer_data, X_arr, y_arr = breast_cancer()
    # cross-validation with sklearn
    logreg_cv_by_sklearn(X_arr, y_arr)

Running the above program produces the following results.

Logreg_best_C: [100.]
Training set score: 0.9741784037558685
Test set score: 0.965034965034965
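To see how the cross-validated accuracy varies over the candidate C values, the scores stored by LogisticRegressionCV can be inspected (a small sketch continuing from the code above; scores_ is a dict keyed by class label, and each row of the stored array corresponds to one fold):

# mean cross-validated accuracy for each candidate C (key 1 is the positive class)
mean_scores = logregCV.scores_[1].mean(axis=0)
for C, s in zip(Cs, mean_scores):
    print('C={:.5f}, mean CV accuracy={:.4f}'.format(C, s))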
