Derivation of the Logistic Regression Algorithm and a Detailed Python Implementation

1 Overview of Logistic Regression


Logistic regression is a statistical learning method for classification problems. It builds on linear regression and classifies by mapping the output of a linear function to a probability value in the [0, 1] interval.

The input to logistic regression is a set of feature variables. The model forms a linear function by multiplying each feature by its corresponding coefficient and adding an intercept term, then passes the function's output through the sigmoid function to obtain a probability value.
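A minimal sketch of this computation in Python; the feature values, coefficients, and intercept below are made-up numbers purely for illustration:

import numpy as np

def sigmoid(z):
    # Map any real number to the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example: 3 features, coefficients w, intercept b
x = np.array([0.5, -1.2, 3.0])   # feature vector
w = np.array([0.8, 0.4, -0.3])   # coefficients
b = 0.1                          # intercept term

z = np.dot(w, x) + b             # linear function: w^T x + b
p = sigmoid(z)                   # probability that y = 1
print(p)                         # about 0.29 -> predict class 0 at a 0.5 threshold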

Logistic regression is most often used for binary classification, that is, dividing samples into two categories, such as judging whether an email is spam. It can also be extended to multi-class problems with three or more classes, as in the sketch below.
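For instance, scikit-learn's LogisticRegression handles a three-class problem out of the box; a minimal sketch using the built-in iris dataset (this example is an addition for illustration, not part of the original article's code):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 classes: 0, 1, 2
clf = LogisticRegression(max_iter=1000)    # larger max_iter so the solver converges
clf.fit(X, y)
print(clf.predict(X[:3]))                  # predicted class labels
print(clf.predict_proba(X[:3]).shape)      # (3, 3): one probability per class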

Logistic regression has the advantages of simplicity, efficiency, and ease of understanding, and is widely used in practical applications, such as financial risk control, medical diagnosis, and recommendation systems.

2 Derivation and Solution of the Logistic Regression Formula

2.1 Formula derivation

You may be familiar with the formula of logistic regression:

$$P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}}$$

but not with its derivation. In fact, the derivation is very simple.

Given an input feature vector $\mathbf{x}$, the output $y$ can be expressed as:

$$y = P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$$

Here $\sigma(z)$ is the sigmoid function, defined as $\sigma(z) = 1 / (1 + e^{-z})$. This function also appears frequently in deep learning.

As for why the sigmoid function is used: the reason is simple, but many textbooks skip this step of the reasoning, so it is spelled out here. We want a function whose range is $[0, 1]$, but such a function is not easy to find directly. The linear regression formula $y = \theta^T \mathbf{x}$ has range $(-\infty, +\infty)$. We therefore introduce the odds,

$$\text{odds} = \frac{P}{1-P},$$

which take values in $(0, +\infty)$. The log function has exactly this domain, $(0, +\infty)$, and its range is $(-\infty, +\infty)$. We can therefore set

$$\log\left(\frac{P}{1-P}\right) = \theta^T \mathbf{x}$$

and solve for $P$: exponentiating both sides gives $\frac{P}{1-P} = e^{\theta^T \mathbf{x}}$, hence

$$P = \frac{e^{\theta^T \mathbf{x}}}{1 + e^{\theta^T \mathbf{x}}} = \frac{1}{1 + e^{-\theta^T \mathbf{x}}},$$

the basic form of logistic regression, which is exactly the sigmoid function.

2.2 Formula solution

For convenience of derivation, assume the training data set contains $m$ samples, each with $n$ features, i.e., $\mathbf{X} \in \mathbb{R}^{m \times n}$, with labels $\mathbf{y} \in \{0, 1\}^m$. To build the model, we use the training data to solve for the model parameters $\mathbf{w}$.

We use maximum likelihood estimation to solve for the model parameters: the goal is to find the parameters $\mathbf{w}$ that maximize the probability of the observed training data. Suppose the $i$-th sample in the training set has input features $\mathbf{x}_i$ and output $y_i$; its probability is:

$$P(y_i \mid \mathbf{x}_i; \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x}_i)^{y_i} \, (1 - \sigma(\mathbf{w}^T \mathbf{x}_i))^{1-y_i}$$

The likelihood of the whole training data set can then be expressed as:

$$P(\mathbf{y} \mid \mathbf{X}; \mathbf{w}) = \prod_i P(y_i \mid \mathbf{x}_i; \mathbf{w}) = \prod_i \sigma(\mathbf{w}^T \mathbf{x}_i)^{y_i} \, (1 - \sigma(\mathbf{w}^T \mathbf{x}_i))^{1-y_i}$$

The log-likelihood function is:

$$L(\mathbf{w}) = \log P(\mathbf{y} \mid \mathbf{X}; \mathbf{w}) = \sum_i \left[ y_i \log(\sigma(\mathbf{w}^T \mathbf{x}_i)) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \right]$$

Our goal is to maximize the log-likelihood function $L(\mathbf{w})$, which we do with the gradient ascent algorithm. Differentiating $L(\mathbf{w})$ with respect to $\mathbf{w}$ gives:

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \sum_i (y_i - \sigma(\mathbf{w}^T \mathbf{x}_i)) \, \mathbf{x}_i$$

Gradient ascent then updates $\mathbf{w}$ at each step as:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_i (y_i - \sigma(\mathbf{w}^T \mathbf{x}_i)) \, \mathbf{x}_i$$

where $\alpha$ is the learning rate.
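This update rule translates directly into code. Below is a minimal from-scratch sketch of logistic regression trained by gradient ascent on a tiny made-up dataset (no regularization, fixed learning rate; this is an illustration of the math above, not the sklearn implementation used later):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, Xb, y):
    # L(w) = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ],  p_i = sigmoid(w^T x_i)
    p = sigmoid(Xb @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    # Prepend a column of ones so the intercept is folded into w
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        grad = Xb.T @ (y - sigmoid(Xb @ w))  # gradient of the log-likelihood
        w += alpha * grad                    # ascent step: move *up* the gradient
    return w, Xb

# Tiny made-up dataset: one feature, classes separated around x = 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 0, 1, 1])
w, Xb = fit_logistic(X, y)
print("w =", w, " log-likelihood =", log_likelihood(w, Xb, y))
print("P(y=1 | x=2.5) =", sigmoid(np.array([1.0, 2.5]) @ w))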

3 Implementation Based on Python

3.1 Main parameters

In Python, logistic regression models can be created using the LogisticRegression class from the Scikit-learn library.

The following are the main parameters of the LogisticRegression class (an example combining them follows the list):

  • penalty: Penalty term; one of 'l1', 'l2', 'elasticnet', 'none'. The default is 'l2'.
  • C: Inverse of regularization strength, used to control model complexity; the smaller the value of C, the stronger the regularization and the simpler the model. The default is 1.0.
  • solver: The algorithm used for the optimization problem; one of 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'. The default is 'lbfgs'.
  • max_iter: The maximum number of iterations of the optimization algorithm. The default is 100.
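For instance, the parameters above can be combined as follows. This is a sketch rather than tuned settings; note that the 'l1' penalty requires a compatible solver such as 'liblinear' or 'saga':

from sklearn.linear_model import LogisticRegression

# Default configuration: l2 penalty, C=1.0, lbfgs solver, 100 iterations
clf_default = LogisticRegression()

# Stronger regularization (smaller C) with an l1 penalty; 'lbfgs' does not
# support 'l1', so a compatible solver such as 'liblinear' must be chosen
clf_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=200)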

3.2 Complete code example

We use the Wisconsin breast cancer data set that comes with sklearn for model training and prediction.

  1. First, import the necessary packages and load the dataset:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the dataset
data = load_breast_cancer()
# Convert to a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)

The dataset bundled with sklearn is already a cleaned version. If you are working with the original Wisconsin breast cancer data or your own dataset, you need to inspect the data, clean it, and do a preliminary feature screening. For example, the feature set may contain useless information such as a patient ID column, which can be deleted directly, as in the sketch below.
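A minimal sketch of that step; the file name breast_cancer_raw.csv and the column name 'id' are hypothetical and depend on your own data:

# Hypothetical raw data file with an 'id' column (both names are assumptions)
raw = pd.read_csv('breast_cancer_raw.csv')
raw = raw.drop(columns=['id'])  # the patient ID carries no predictive information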

You can also use common functions to inspect the state of the dataset, for example:

# Check column dtypes and missing values
print(df.info())
# Check summary statistics for outliers
print(df.describe())

Running this on the official dataset shows that no changes are needed: all columns are numeric (float64) and there are no missing values. If object appears in the Dtype column of the info() output, it usually means the column contains non-numeric values; in that case you can use

df['A'] = pd.to_numeric(df['A'], errors='coerce').astype(float)

This converts column A of df to the float type, turning any value that cannot be converted into a null value (NaN). Those rows can then be deleted together with other missing values using dropna(), as in the sketch below.
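A sketch of that two-step cleanup, using the same column name A:

# Coerce non-numeric entries in column A to NaN, then drop incomplete rows
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype(float)
df = df.dropna()   # drops every row that still contains a missing value
print(df.info())   # confirm the dtypes and row count after cleaning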
2. Model building and fitting

# Split the dataset
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Create the logistic regression model
clf = LogisticRegression(max_iter=100)

# Fit the model
clf.fit(X_train, y_train)

When building the model, there is usually no need to modify the hyperparameters, but they can be adjusted based on the characteristics of your dataset or on warnings raised during fitting (for example, a warning that the solver did not converge), as in the sketch below.
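For example, if fitting raises a convergence warning, two common remedies are increasing max_iter or standardizing the features first. A sketch of both (the Pipeline approach is an addition for illustration, not part of the original code):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Remedy 1: allow the solver more iterations
clf = LogisticRegression(max_iter=10000)

# Remedy 2: standardize features, which usually lets lbfgs converge quickly
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100))
clf.fit(X_train, y_train)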
3. Model prediction and plotting

# Predict
y_pred = clf.predict(X_test)

# Evaluate
accuracy = clf.score(X_test, y_test)

print("Predictions:", y_pred)
print("Accuracy:", accuracy)

plt.plot(range(len(X_test)), y_test, 'ro', markersize=4, zorder=3, label='True value')
plt.plot(range(len(X_test)), y_pred, 'go', markersize=10, zorder=2, label='Predicted value')
plt.legend()
plt.show()

[Figure: true values (small red circles) and predicted values (large green circles) on the test set]

As the plot shows, points where the red and green circles coincide are correctly predicted samples, while points where only one color appears are wrongly predicted. The predictions here use the default cutoff: a sample is assigned to a class when its probability exceeds 0.5. If instead we want to classify a sample as 0 only when its probability of being 0 exceeds 0.8 (to reduce the chance of a missed diagnosis), we can use the following method.

4. Custom threshold
print(clf.predict_proba(X_test)[:,0]>0.8)

clf.predict_proba(X_test) outputs the probability of each class for every sample: an array with one row per sample and one column per class, where column 0 holds the probability of class 0 and column 1 the probability of class 1.
If you want the samples whose probability of being class 0 exceeds 0.8, extract that column and compare it against the threshold, as the line above does.
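To turn that boolean judgment into actual class labels, a minimal sketch:

import numpy as np

proba_class0 = clf.predict_proba(X_test)[:, 0]   # P(y = 0) for each test sample
# Predict 0 only when we are at least 80% confident, otherwise predict 1
y_pred_custom = np.where(proba_class0 > 0.8, 0, 1)
print(y_pred_custom)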


Originally published at blog.csdn.net/nkufang/article/details/129760817