2. Logistic regression algorithm (LR, Logistic Regression) (supervised learning)

1. Algorithm idea

Logistic regression is essentially built on multiple linear regression, which has the form y = w0 + w1*x1 + w2*x2 + ... + wn*xn.
The range of this multivariate function is (-∞, +∞); logistic regression maps it into the (0, 1) interval so the output can be read as a probability. The usual approach is to feed the value of the linear function into the sigmoid function to obtain a value in (0, 1), then apply a threshold to make a binary decision. Multi-class tasks only require processing the data set accordingly to turn them into several binary classification tasks.
I have written a related blog post on this before, so I won't go into detail here; for details see: 6. Logistic regression
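To make the mapping concrete, here is a minimal sketch (plain NumPy, not part of scikit-learn) of the sigmoid and a 0.5 threshold:

import numpy as np

def sigmoid(z):
    # maps any real value into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])   # example outputs of the linear function
p = sigmoid(z)                   # probabilities, roughly [0.047, 0.5, 0.924]
labels = (p >= 0.5).astype(int)  # threshold at 0.5 -> [0, 1, 1]
print(p, labels)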

2. Official website API

Official website API

class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

There are quite a lot of parameters here. For the specific usage of each parameter you can study the demos provided on the official website and experiment with them; below I explain some commonly used parameters.
Import: from sklearn.linear_model import LogisticRegression

①Penalty term penalty

The penalty term defaults to L2 regularization; regularization is, simply put, a constraint added to the loss function.
In linear regression, L1 regularization is also known as Lasso regression and can produce sparse models.
In linear regression, L2 regularization is also known as Ridge regression; it keeps the parameters small and helps prevent overfitting.
'l1': add L1 regularization
'l2': add L2 regularization, the default
'elasticnet': add both L1 and L2 regularization
None: no penalty term is added



Usage

LogisticRegression(penalty='l2')
The penalty term here cannot be chosen freely; it is constrained by the optimization algorithm parameter solver. By default, solver is 'lbfgs'.
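Below is a sketch of valid penalty/solver pairings (note that penalty=None requires a recent scikit-learn version; older versions use the string 'none'):

from sklearn.linear_model import LogisticRegression

lr_l2 = LogisticRegression(penalty='l2')                      # default solver lbfgs supports l2
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear')  # l1 needs liblinear or saga
lr_en = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)  # mix of l1 and l2
lr_no = LogisticRegression(penalty=None)                      # no regularization at all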

②Optimization algorithm solver

The optimization algorithm parameter solver selects the method used to minimize the logistic regression loss function.
'lbfgs': the default; supports l2 regularization or None (no penalty)
'liblinear': supports l1 or l2 regularization; preferred for small data sets; limited to one-vs-rest, so it does not support the multinomial loss
'newton-cg': supports l2 regularization or None; supports the multinomial loss
'newton-cholesky': supports l2 regularization or None; use when the number of training samples is much larger than the number of features
'sag': supports l2 regularization or None; faster on large data sets
'saga': supports l1, l2, elasticnet regularization or None; faster on large data sets



Usage

LogisticRegression(penalty='l1',solver='liblinear')
LogisticRegression(penalty='l2',solver='newton-cg')
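If you are unsure which solver suits your data, a quick loop can compare them. This sketch assumes the X_train/X_test/y_train/y_test variables from the data-split section later in this post, and 'newton-cholesky' requires scikit-learn >= 1.2:

for solver in ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']:
    lr = LogisticRegression(solver=solver, max_iter=1000)  # default l2 penalty works with all of these
    lr.fit(X_train, y_train)
    print(solver, lr.score(X_test, y_test))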

③Inverse of regularization strength C

C: the inverse of the regularization strength; must be a positive float; the smaller the value, the stronger the regularization; the default is 1.0.


Usage

LogisticRegression(C = 1.2)
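A small sketch showing the effect: the smaller C is, the stronger the regularization and the more the learned coefficients shrink toward zero (assumes X_train and y_train from the data-split section later in this post):

for C in [0.01, 1.0, 100.0]:
    lr = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    print(C, np.abs(lr.coef_).mean())  # mean absolute coefficient grows as C grows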

④Random seed random_state

If you need to control variables for comparison, it is best to set the random seed here to the same integer.


Usage

LogisticRegression(random_state = 42)

⑤Binary or multi-class task multi_class

'auto': selected automatically according to the task, default
'ovr': one-vs-rest; used when the task is binary or when solver='liblinear'
'multinomial': multinomial loss for multi-class tasks; used in all other cases
In general you can leave this parameter alone and use the default; a small multi-class example is sketched below.
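As a quick illustration on scikit-learn's built-in iris data (3 classes), the default 'auto' setting picks the multinomial loss for the lbfgs solver. This sketch is independent of the fiber data set used below:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)
print(clf.predict(X_iris[:3]))        # predicted class labels
print(clf.predict_proba(X_iris[:3]))  # one probability per class, rows sum to 1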


⑥Finally build the model

LogisticRegression(penalty='l2',solver='newton-cholesky',C=0.8,random_state=42)

3. Code implementation

①Import packages

Here we need to train, evaluate, save and load the model, so the following packages are needed. If an error is reported during import, just install the missing package with pip.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

②Load the data set

The data set can simply be a csv file you create yourself. The one used here has 6 independent variables X and 1 dependent variable Y.

fiber = pd.read_csv("./fiber.csv")
fiber.head(5) # show the first 5 rows of the data


③Divide the data set

The first six columns are the independent variables X, and the last column is the dependent variable Y.

Official API of the commonly used splitting function: train_test_split
test_size: proportion of the data used for the test set
train_size: proportion of the data used for the training set
random_state: random seed
shuffle: whether to shuffle the data before splitting
My data set here has 48 samples in total; with train_size=0.75 and test_size=0.25 that gives 36 training samples and 12 test samples.

X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']

X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)

print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)
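With only 48 samples, the class proportions can drift between the two splits; if that matters, train_test_split also accepts a stratify argument (an optional sketch):

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, train_size=0.75, test_size=0.25, random_state=42, shuffle=True, stratify=Y)  # keep class proportions equal in both splits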

④Build LR model

You can try setting and adjusting the parameters yourself.

lr = LogisticRegression(penalty='l2',solver='newton-cholesky',C=0.8,random_state=42)

⑤Model training

It's that simple: a single call to the fit function trains the model.

lr.fit(X_train,y_train)

⑥Model evaluation

Feed the test set to the model to get its predictions.

y_pred = lr.predict(X_test)

Check whether each prediction matches the actual test label: a match counts as 1, otherwise 0; the mean of these values is the accuracy.

accuracy = np.mean(y_pred==y_test)
print(accuracy) # 0.8333333333333333

You can also evaluate with score. The calculation and the idea are the same: both measure the proportion of samples the model predicts correctly, but score is already encapsulated in the estimator. The accuracy_score function takes different arguments and must be imported separately: from sklearn.metrics import accuracy_score

score = lr.score(X_test,y_test) # accuracy score
print(score)
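The metrics imported at the top (confusion_matrix, classification_report, accuracy_score) give a more detailed picture from the same predictions; a short sketch:

print(confusion_matrix(y_test, y_pred))      # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred)) # precision / recall / F1 per class
print(accuracy_score(y_test, y_pred))        # same value as lr.score above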

⑦Model testing

Take a single sample and evaluate it with the trained model.
There are six feature values; put them all into an array: test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
Feed it to the model to get the prediction: prediction = lr.predict(test)
Print the prediction to see whether it matches the true label: print(prediction)

test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = lr.predict(test)
print(prediction) #[2]
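Because the model was fitted on a pandas DataFrame, recent scikit-learn versions emit a feature-names warning when you predict on a bare NumPy array; wrapping the sample in a DataFrame with the training columns avoids it (an optional sketch):

test_df = pd.DataFrame(test, columns=X_train.columns)  # reuse the training column names
print(lr.predict(test_df))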

⑧Save the model

lr is the variable name of the trained model and must match the model you fitted.
The second argument is the path where the model is saved.

joblib.dump(lr, './lr.model') # save the model

⑨Load and use the model

lr_yy = joblib.load('./lr.model')

test = np.array([[11,99498,5369,9045.27,28.47,3827588.56]]) # an arbitrary sample
prediction = lr_yy.predict(test) # predict on the new sample
print(prediction) #[4]

Complete code

The complete code for model training and evaluation (steps ⑧ and ⑨ not included):

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import joblib


fiber = pd.read_csv("./fiber.csv")
# separate the independent variables and the dependent variable
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

lr = LogisticRegression(penalty='l2',solver='liblinear',C=0.8,random_state=42)
lr.fit(X_train,y_train) # fit the model

y_pred = lr.predict(X_test) # predictions on the test set
accuracy = np.mean(y_pred==y_test) # accuracy
print(accuracy)
score = lr.score(X_test,y_test) # accuracy score
print(score)


Origin: blog.csdn.net/qq_41264055/article/details/133016215