1. Algorithm idea
Logistic regression builds on multiple linear regression, which has the form y = w0 + w1*x1 + w2*x2 + ... + wn*xn
The output of this linear function ranges over (-∞, +∞). Logistic regression maps that range to (0, 1) so the result can be read as a probability. The usual approach is to pass the value of the linear function through the sigmoid function, which returns a value in (0, 1), and then apply a threshold to make a binary classification. Multi-class tasks can be handled by transforming the data set into several binary classification tasks.
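The mapping described above can be sketched in a few lines. This is a minimal illustration, not scikit-learn's implementation; the weights, the sample, and the 0.5 threshold are hypothetical values chosen only to show the steps.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and one sample, chosen only for illustration.
w0 = -1.0                    # intercept
w = np.array([0.8, -0.5])    # w1, w2
x = np.array([2.0, 1.0])     # x1, x2

z = w0 + w @ x               # linear score, can be any real number
p = sigmoid(z)               # value in (0, 1), read as a probability
label = int(p >= 0.5)        # threshold turns the probability into a class

print(z, p, label)
```

Raising or lowering the threshold trades off the two kinds of classification error without retraining the model.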
I have covered this in an earlier blog post, so I will not go into detail here. For details, see the post: 6. Logistic regression
2. Official website API
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
There are quite a lot of parameters. For the full details, work through the demos on the official website and experiment; the most commonly used parameters are explained below.
Import: from sklearn.linear_model import LogisticRegression
①Penalty term penalty
The penalty defaults to L2 regularization. Regularization is simply a constraint placed on the loss function.
In linear regression, L1 regularization is also called Lasso regression and can produce sparse models.
In linear regression, L2 regularization is also called Ridge regression; it yields small parameter values and helps prevent overfitting.
None: no penalty is added
'l1': add L1 regularization
'l2': add L2 regularization (default)
'elasticnet': add both L1 and L2 regularization
Usage
LogisticRegression(penalty='l2')
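To make the sparsity claim concrete, here is a small sketch that fits the same data with 'l1' and 'l2' and counts the coefficients that are exactly zero. The synthetic data set, the solver choice (liblinear, because it accepts both penalties), and C=0.05 are assumptions made for the illustration, not settings from this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Same solver and C for both models, only the penalty differs.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

# Count coefficients that were driven exactly to zero.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```

With L1 the uninformative features tend to drop out entirely, while L2 only shrinks them toward zero.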
The penalty cannot be chosen freely: the allowed values depend on the optimization algorithm set by the solver parameter. By default, solver is lbfgs.
②Optimization algorithm solver
The solver parameter selects the optimization algorithm used to minimize the logistic regression loss function.
'lbfgs': the default; supports l2 regularization or None (no penalty)
'liblinear': supports l1 or l2 regularization; a good first choice for small data sets; multi-class problems are handled one-vs-rest only, not with the multinomial loss
'newton-cg': supports l2 regularization or None; supports the multinomial loss
'newton-cholesky': supports l2 regularization or None; use it when the number of training samples is much larger than the number of features
'sag': supports l2 regularization or None; trains faster on large data sets
'saga': supports elasticnet, l1, or l2 regularization, or None; trains faster on large data sets
Usage
LogisticRegression(penalty='l1',solver='liblinear')
LogisticRegression(penalty='l2',solver='newton-cg')
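The coupling between penalty and solver can be checked directly. The sketch below uses a synthetic data set (an assumption for illustration) and shows that, in the scikit-learn versions I am aware of, an unsupported combination is rejected with a ValueError when fit is called.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Compatible: liblinear accepts 'l1'.
LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)

# Incompatible: lbfgs accepts only 'l2' or no penalty.
error_raised = False
try:
    LogisticRegression(penalty='l1', solver='lbfgs').fit(X, y)
except ValueError as exc:
    error_raised = True
    print("ValueError:", exc)
```

Because the check happens in fit rather than in the constructor, a bad combination may only surface once training starts.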
③Inverse of regularization strength C
C: the inverse of the regularization strength; must be a positive float. The smaller the value, the stronger the regularization. The default is 1.0.
Usage
LogisticRegression(C = 1.2)
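A quick way to see the effect of C is to fit the same data with several values and compare the size of the learned coefficients. This is a sketch on synthetic data; the three C values and max_iter=1000 are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit with weak, default, and strong regularization (large C = weak).
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```

Smaller C pulls the coefficients toward zero, which is the stronger regularization described above.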
④Random seed random_state
When comparing runs with the other variables held fixed, set the random seed to the same integer each time. The seed is used by the 'sag', 'saga' and 'liblinear' solvers, which shuffle the data.
Usage
LogisticRegression(random_state = 42)
⑤Binary or multi-class task multi_class
'auto': the default; chooses automatically based on the task: 'ovr' when the task is binary or solver is 'liblinear', and 'multinomial' otherwise
'ovr': one-vs-rest; a binary classification problem is fit for each label
'multinomial': multi-class task; the multinomial loss is fit across all classes
In general, leave this parameter unset and use the default.
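The two strategies can be compared without setting multi_class at all. The sketch below uses the iris data set (3 classes) as a stand-in and relies on OneVsRestClassifier from sklearn.multiclass to build the one-vs-rest variant explicitly; both choices are mine, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# Default settings: the estimator handles the 3 classes directly.
direct = LogisticRegression(max_iter=1000).fit(X, y)
proba = direct.predict_proba(X[:2])  # one column per class

# Explicit one-vs-rest: one binary LogisticRegression per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(proba.shape)
print(len(ovr.estimators_))
```

The one-vs-rest wrapper makes the "several binary tasks" idea visible: it holds one fitted binary model per class.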
⑥Finally build the model
LogisticRegression(penalty='l2',solver='newton-cholesky',C=0.8,random_state=42)
3. Code implementation
①Import packages
The steps below train, evaluate, save and load the model, which requires the following packages. If an import fails, install the missing package with pip.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
②Load the data set
The data set is a simple CSV file that you can create yourself. Mine has 6 independent variables X and 1 dependent variable Y.
fiber = pd.read_csv("./fiber.csv")
fiber.head(5) # show the first 5 rows
③Divide the data set
The first six columns are the independent variable X, and the last column is the dependent variable Y
Commonly used parameters of the data-splitting function train_test_split (see the official API):
test_size: proportion of the data used for the test set
train_size: proportion of the data used for the training set
random_state: random seed
shuffle: whether to shuffle the data before splitting
My data set has 48 samples in total. With a training share of 0.75 and a test share of 0.25, that gives 36 training samples and 12 test samples.
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)
print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)
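For readers without the CSV file, the split can be reproduced on a synthetic stand-in. Everything about the stand-in is an assumption: the feature names f1..f6 are hypothetical, the values are random, and four evenly sized grades are assumed. The sketch also adds stratify=Y, an optional train_test_split argument not used above, which keeps the class proportions equal in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fiber.csv: 48 rows, 6 features, 4 grades.
# The feature names f1..f6 and the grade values are hypothetical.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(48, 6)),
                    columns=[f"f{i}" for i in range(1, 7)])
data["Grade"] = np.repeat([1, 2, 3, 4], 12)

X = data.drop(columns=["Grade"])
Y = data["Grade"]

# stratify=Y keeps the grade proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().sort_index().to_dict())
```

With only 12 test samples, stratifying helps make sure every grade appears in the test set.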
④Build LR model
You can try setting and adjusting the parameters yourself.
lr = LogisticRegression(penalty='l2',solver='newton-cholesky',C=0.8,random_state=42)
⑤Model training
It is that simple: a single call to fit trains the model.
lr.fit(X_train,y_train)
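One optional refinement, not part of the original walkthrough: when features sit on very different scales, as the sample values used later in this post suggest, regularized logistic regression often behaves better if the features are standardized first. The sketch below shows one way to do that with make_pipeline and StandardScaler on synthetic data; the data and the scale factor are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)
X[:, 5] *= 1e6   # mimic one feature on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is fit on the training data only, then applied before the model.
scaled_lr = make_pipeline(StandardScaler(),
                          LogisticRegression(C=0.8, max_iter=1000))
scaled_lr.fit(X_train, y_train)
pipeline_score = scaled_lr.score(X_test, y_test)
print(pipeline_score)
```

The pipeline exposes the same fit, predict and score methods as the bare model, so the later steps work unchanged.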
⑥Model evaluation
Throw the test set in and get the predicted test results
y_pred = lr.predict(X_test)
Compare the predictions with the true test labels element by element: a match counts as 1 and a mismatch as 0. The mean of these values is the accuracy.
accuracy = np.mean(y_pred==y_test)
print(accuracy) # 0.8333333333333333
Accuracy can also be obtained with score. The result and the idea are the same: both measure the fraction of samples the model classifies correctly. The score method wraps the computation, and its arguments differ: it takes the test features and the true labels rather than the predictions. The equivalent standalone function is accuracy_score (from sklearn.metrics import accuracy_score), which takes the true labels and the predictions.
score = lr.score(X_test,y_test) # accuracy score
print(score)
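The three ways of computing accuracy can be checked against each other, and the confusion_matrix and classification_report functions imported earlier give a per-class view. This sketch uses the iris data set as a stand-in for the fiber data, which is an assumption made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

acc_manual = np.mean(y_pred == y_test)        # mean of matches
acc_score = lr.score(X_test, y_test)          # takes features and labels
acc_metric = accuracy_score(y_test, y_pred)   # takes labels and predictions
print(acc_manual, acc_score, acc_metric)

cm = confusion_matrix(y_test, y_pred)         # rows: true class, columns: predicted
print(cm)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```

On a small test set a single accuracy number hides a lot, so the confusion matrix is worth a look to see which classes get mixed up.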
⑦Model testing
Take a single sample and use the trained model to predict its class.
The sample has six independent variables, with values picked arbitrarily. Pass it to the model to get the prediction, then print the result to see whether it matches the correct label.
test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = lr.predict(test)
print(prediction) #[2]
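Two details are worth knowing when predicting a single sample. If the model was trained on a DataFrame, passing a bare NumPy array typically triggers a warning about missing feature names, which a one-row DataFrame with the training column names avoids. Also, predict_proba shows how confident the model is. The sketch uses the iris data set as a stand-in, so the column names and sample values are not those of the fiber data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

lr = LogisticRegression(max_iter=1000).fit(X, y)

# A one-row DataFrame that reuses the training column names.
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)

prediction = lr.predict(sample)        # predicted class label
proba = lr.predict_proba(sample)       # one probability per class
print(prediction)
print(dict(zip(lr.classes_, proba[0].round(3))))
```

The predicted label is the class with the highest probability in the predict_proba row.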
⑧Save the model
lr is the name of the model variable and must match the one used above.
The second argument is the path where the model is saved.
joblib.dump(lr, './lr.model') # save the model
⑨Load and use the model
lr_yy = joblib.load('./lr.model')
test = np.array([[11,99498,5369,9045.27,28.47,3827588.56]]) # an arbitrarily chosen sample
prediction = lr_yy.predict(test) # pass in the data and predict
print(prediction) #[4]
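A simple sanity check after saving is to reload the model and confirm it predicts the same labels as the original. The sketch below does this round trip in a temporary directory; the iris data set and the temporary path are stand-ins chosen so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
lr = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "lr.model")
    joblib.dump(lr, model_path)           # save the fitted model
    lr_loaded = joblib.load(model_path)   # load it back

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(lr.predict(X), lr_loaded.predict(X))
print(same_predictions)
```

A saved model is generally best loaded with the same scikit-learn version that created it.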
Complete code
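The complete code covers model training and evaluation; steps ⑧ and ⑨ are not included. Note that it uses solver='liblinear' and random_state=0 for the split, so its results can differ from the step-by-step run above.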
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import joblib
fiber = pd.read_csv("./fiber.csv")
# split into independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# split the data set
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
lr = LogisticRegression(penalty='l2',solver='liblinear',C=0.8,random_state=42)
lr.fit(X_train,y_train) # fit the model
y_pred = lr.predict(X_test) # model predictions
accuracy = np.mean(y_pred==y_test) # accuracy
print(accuracy)
score = lr.score(X_test,y_test) # score
print(score)