## 1. Algorithm idea

Logistic regression is built on multiple linear regression: `y = w0 + w1*x1 + w2*x2 + ... + wn*xn`

The range of this linear function is (-∞, +∞), and logistic regression maps it into (0, 1) so the output can be interpreted as a probability. The usual approach is to feed the value of the linear function into the sigmoid function, which yields a value in the (0, 1) interval; a threshold then decides the binary classification. Multi-class tasks only require processing the data set accordingly, turning them into multiple binary classification tasks.
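The mapping above can be sketched in a few lines; the function names here are illustrative, not part of any library API:

```python
import numpy as np

def sigmoid(z):
    """Map a real value z into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(w, x, threshold=0.5):
    """Linear combination w·x fed through sigmoid, then thresholded."""
    prob = sigmoid(np.dot(w, x))
    return int(prob >= threshold)

print(sigmoid(0.0))  # 0.5, exactly at the decision boundary
print(predict_label(np.array([1.0, -2.0]), np.array([3.0, 1.0])))  # 1
```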

Since I have written a related blog post on the theory before, I will not repeat the details here. For details, please refer to that post: 6. Logistic regression

## 2. Official website API

```
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
```

There are quite a few parameters here. For specifics, you can learn from the demos provided on the official website and experiment; below are explanations of the commonly used parameters.

Import: `from sklearn.linear_model import LogisticRegression`

### ①Penalty term penalty

The penalty term defaults to L2 regularization; regularization is, simply put, a **constraint** imposed on the loss function.

In linear regression, L1 regularization is also called Lasso regression, which can produce sparse models.

In linear regression, L2 regularization is also called Ridge regression, which keeps the parameters small and helps prevent overfitting.

The options are:

- '**l1**': add L1 regularization
- '**l2**': add L2 regularization (default)
- '**elasticnet**': add both L1 and L2 regularization
- **None**: no penalty term is added

The specific official website details are as follows:

#### Usage

`LogisticRegression(penalty='l2')`

The penalty term cannot be chosen freely; it must be compatible with the **optimization algorithm parameter solver**. By default, **solver is 'lbfgs'**.

### ②Optimization algorithm solver

The solver parameter selects the optimization method for logistic regression's loss function.

- '**lbfgs**': default; supports **l2** regularization or **None** (no penalty)
- '**liblinear**': supports **l1** or **l2** regularization; **preferred for small data sets**
- '**newton-cg**': supports **l2** regularization or **None**; **supports the multinomial loss function**
- '**newton-cholesky**': supports **l2** regularization or **None**; **use when the number of training samples far exceeds the number of features**
- '**sag**': supports **l2** regularization or **None**; **faster on large data sets**
- '**saga**': supports **l1**, **l2**, **elasticnet** regularization or **None**; **faster on large data sets**

The specific official website details are as follows:

#### Usage

`LogisticRegression(penalty='l1',solver='liblinear')`

`LogisticRegression(penalty='l2',solver='newton-cg')`
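Not every penalty/solver pair is valid; an incompatible combination is refused when fitting. A quick sketch (the tiny data set here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 1, 0])

# Valid pairing: liblinear supports the l1 penalty
LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)

# Invalid pairing: lbfgs only supports l2 or no penalty
try:
    LogisticRegression(penalty='l1', solver='lbfgs').fit(X, y)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True: the incompatible pair is refused
```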

### ③The inverse of regularization strength C

C: the inverse of the regularization strength; must be a positive float; **the smaller the value, the stronger the regularization**; the default is 1.0.

The specific official website details are as follows:

#### Usage

`LogisticRegression(C = 1.2)`
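To see the effect, a small sketch on synthetic data (illustrative only): a smaller C regularizes harder, shrinking the learned coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

strong = LogisticRegression(C=0.01).fit(X, y)   # strong regularization
weak = LogisticRegression(C=100.0).fit(X, y)    # weak regularization

# Stronger regularization pushes the coefficients toward zero
print(np.linalg.norm(strong.coef_))
print(np.linalg.norm(weak.coef_))
```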

### ④Random seed random_state

If you need controlled comparisons, it is best to set the random seed here to the same integer each time.

The specific official website details are as follows:

#### Usage

`LogisticRegression(random_state = 42)`

### ⑤Binary or multi-class task multi_class

- '**auto**': chosen automatically based on the task; default
- '**ovr**': one-vs-rest; used for **binary tasks** or when the **solver is 'liblinear'**
- '**multinomial**': multinomial loss for multi-class tasks; used in all other cases

In general, just leave it unset and use the default.

The specific official website details are as follows:

### ⑥Finally build the model

`LogisticRegression(penalty='l2', solver='newton-cholesky', C=0.8, random_state=42)`

## 3. Code implementation

### ①Guide package

Here we need to evaluate, train, save, and load the model. Below are the necessary packages; if an import fails, just install the package with pip.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
```

### ②Load the data set

The data set can be simply created by itself in csv format. What I use here is 6 independent variables X and 1 dependent variable Y.

```
fiber = pd.read_csv("./fiber.csv")
fiber.head(5) # show the first 5 rows of the data
```

### ③Divide the data set

The first six columns are the independent variables X, and the last column is the dependent variable Y.

Official API of the commonly used split function: train_test_split

- `test_size`: proportion of the data used for the test set
- `train_size`: proportion of the data used for the training set
- `random_state`: random seed
- `shuffle`: whether to shuffle the data before splitting

My data set here has 48 rows in total; with a training proportion of 0.75 and a test proportion of 0.25, that gives 36 training rows and 12 test rows.

```
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)
print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)
```

### ④Build LR model

You can try setting and adjusting the parameters yourself.

```
lr = LogisticRegression(penalty='l2',solver='newton-cholesky',C=0.8,random_state=42)
```

### ⑤Model training

It's that simple: a single fit call trains the model.

```
lr.fit(X_train,y_train)
```

### ⑥Model evaluation

Feed the test set in to get the predicted results.

```
y_pred = lr.predict(X_test)
```

Compare the predictions with the actual test labels: a match counts as 1, a mismatch as 0; the mean of these is the accuracy.

```
accuracy = np.mean(y_pred==y_test)
print(accuracy) # 0.8333333333333333
```

The model can also be evaluated with score. The computation and idea are the same: both measure the proportion of samples the model gets right, but the score function is already encapsulated. The arguments passed in differ, though; to compute accuracy from predictions directly you need to import **accuracy_score**: **from sklearn.metrics import accuracy_score**.

```
score = lr.score(X_test,y_test) # accuracy on the test set
print(score)
```
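As a sketch of the accuracy_score route mentioned above (the toy label arrays here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([2, 3, 2, 4, 3, 2])  # actual labels
y_hat = np.array([2, 3, 4, 4, 3, 2])   # predicted labels

# accuracy_score takes (true labels, predicted labels), unlike
# lr.score, which takes (X_test, y_test) and predicts internally
print(accuracy_score(y_true, y_hat))  # 5 of 6 correct
```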

### ⑦Model testing

Take a piece of data and evaluate it with the trained model.

There are six independent variables here, so I throw in one row with all six: `test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])`

Feed it to the model to get the prediction: `prediction = lr.predict(test)`

Then print the prediction to see whether it matches the true label: `print(prediction)`

```
test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = lr.predict(test)
print(prediction) #[2]
```
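Besides the hard label, LogisticRegression can report per-class probabilities via predict_proba; a sketch on synthetic data (the real fiber.csv is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=6, random_state=42)
lr = LogisticRegression().fit(X, y)

sample = X[:1]
probs = lr.predict_proba(sample)  # one probability per class
print(probs)
print(lr.predict(sample))         # the class with the highest probability

# The probabilities across classes sum to 1
print(probs.sum())
```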

### ⑧Save the model

lr is the model object; the name must match the one you trained.

The second argument is the path where the model is saved.

```
joblib.dump(lr, './lr.model') # save the model
```

### ⑨Load and use the model

```
lr_yy = joblib.load('./lr.model')
test = np.array([[11,99498,5369,9045.27,28.47,3827588.56]]) # an arbitrary row of data
prediction = lr_yy.predict(test) # feed in the data and predict
print(prediction) #[4]
```

### Complete code

Model training and evaluation; does not include steps ⑧ and ⑨.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import joblib
fiber = pd.read_csv("./fiber.csv")
# split into independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# split the data set
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
lr = LogisticRegression(penalty='l2',solver='liblinear',C=0.8,random_state=42)
lr.fit(X_train,y_train) # fit the model
y_pred = lr.predict(X_test) # predictions on the test set
accuracy = np.mean(y_pred==y_test) # accuracy
print(accuracy)
score = lr.score(X_test,y_test) # accuracy score
print(score)
```