1. Logistic regression
Logistic regression is a classification model in machine learning. Despite its name it is a classification algorithm, although it is closely related to regression. Because the algorithm is simple and efficient, it is widely used in practice.
Application scenarios: predicting advertisement click-through, detecting spam, diagnosing disease, and detecting financial fraud or fake accounts. What these have in common is a decision between two categories, and logistic regression is a powerful tool for such binary classification problems.
Principle: To master logistic regression, you need to know what its input is and how its output is interpreted.
The input is the result of a linear regression, that is, a weighted sum of the features plus a bias: h(w) = w_1*x_1 + w_2*x_2 + ... + b
2. Activation function
The sigmoid function: g(z) = 1 / (1 + e^(-z)). It maps any real number z onto an S-shaped curve with values between 0 and 1.
Logistic regression makes its final classification from the probability that a sample belongs to a given category. By default that category is marked 1 (positive example) and the other category is marked 0 (negative example), which is convenient for computing the loss.
Judgment criteria
- The regression result is fed into the sigmoid function
- Output: a probability value in the interval [0, 1], with 0.5 as the default threshold
Interpretation of the output: Suppose there are two categories, A and B, and the probability value is the probability of belonging to category A (1). If a sample yields a logistic regression output of 0.6, the probability exceeds 0.5, so the predicted category is A (1). Conversely, if the output is 0.3, the predicted category is B (0).
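The judgment criteria above can be sketched in a few lines of plain Python. The helper names `sigmoid` and `classify` and the two scores `z_a` and `z_b` are assumptions made for illustration; the scores are chosen so that the probabilities come out close to the 0.6 and 0.3 used in the text.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 (category A) if the probability exceeds the threshold, else 0 (category B)."""
    return 1 if sigmoid(z) > threshold else 0


# Hypothetical linear-regression outputs (log-odds of 0.6 and of 0.3).
z_a = math.log(0.6 / 0.4)
z_b = math.log(0.3 / 0.7)

print(round(sigmoid(z_a), 2), classify(z_a))  # probability near 0.6, category A (1)
print(round(sigmoid(z_b), 2), classify(z_b))  # probability near 0.3, category B (0)
```

Changing `threshold` moves the decision boundary without retraining anything, which is why the threshold is described as a default rather than a fixed part of the model.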
3. Loss and optimization
3.1 Loss
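The loss of logistic regression is called the log-likelihood loss. With h(x) denoting the sigmoid output (the predicted probability of category 1) and y the true label, the formula is as follows:
- Separate categories: cost(h(x), y) = -log(h(x)) if y = 1, and cost(h(x), y) = -log(1 - h(x)) if y = 0
- Integrated full loss function: cost(h(x), y) = sum over i = 1..m of [ -y_i * log(h(x_i)) - (1 - y_i) * log(1 - h(x_i)) ]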
The substitution calculation is as follows
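As a minimal sketch of such a substitution, the snippet below evaluates the log-likelihood loss for three samples. The probabilities in `probs` and the labels in `labels` are hypothetical values picked for illustration, not numbers from the original worked example, and the function names are assumptions as well.

```python
import math


def log_loss_single(p: float, y: int) -> float:
    """Loss for one sample: -log(p) if y == 1, otherwise -log(1 - p)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def log_loss_total(probs, labels) -> float:
    """Combined form: sum of -y*log(p) - (1 - y)*log(1 - p) over all samples."""
    return sum(-y * math.log(p) - (1 - y) * math.log(1.0 - p)
               for p, y in zip(probs, labels))


# Hypothetical predicted probabilities of category 1 and the true labels.
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 1]

per_sample = [log_loss_single(p, y) for p, y in zip(probs, labels)]
total = log_loss_total(probs, labels)

print([round(v, 3) for v in per_sample])
print(round(total, 3))
```

The confident and correct predictions (0.9 for a positive, 0.2 for a negative) contribute little to the total, while the hesitant 0.6 for a positive contributes the most. The per-category form and the combined form give the same sum.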
3.2 Optimization
Use the gradient descent optimization algorithm to reduce the value of the loss function. Each step updates the weight parameters of the linear regression that feeds the sigmoid, so that the predicted probability rises for samples that truly belong to category 1 and falls for samples that truly belong to category 0.
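The update can be sketched for a single feature in plain Python. This is an illustration under stated assumptions, not the procedure scikit-learn runs internally: the data in `xs` and `ys`, the learning rate, and the number of steps are all hypothetical, and the loss is averaged over the samples so that the step size does not depend on the dataset size.

```python
import math


def sigmoid(z: float) -> float:
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_log_loss(w: float, b: float, xs, ys) -> float:
    """Average log-likelihood loss of the model sigmoid(w*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def gradient_step(w: float, b: float, xs, ys, lr: float):
    """One gradient descent step on the average loss."""
    m = len(xs)
    # d(loss)/dw = mean((p - y) * x), d(loss)/db = mean(p - y)
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / m
    return w - lr * grad_w, b - lr * grad_b


# Hypothetical one-feature dataset; the labels overlap around x = 2.0 to 2.5.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = 0.0, 0.0
loss_before = mean_log_loss(w, b, xs, ys)
for _ in range(5000):
    w, b = gradient_step(w, b, xs, ys, lr=0.1)
loss_after = mean_log_loss(w, b, xs, ys)

print(loss_before, loss_after)
print("decision boundary near x =", -b / w)
```

The decision boundary is the value of x at which w*x + b = 0, that is, where the predicted probability is exactly 0.5.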
4. Logistic regression API
- sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0): solver is an optional parameter, one of {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}; it selects the algorithm used to optimize the problem. The default was 'liblinear' in older releases and has been 'lbfgs' since scikit-learn 0.22
- 'liblinear' is a good choice for small datasets, while 'sag' and 'saga' are faster for large datasets
- For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to 'one-versus-rest' classification
- penalty: the type of regularization
- C: the inverse of the regularization strength; smaller values mean stronger regularization
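By convention, the category with fewer samples is treated as the positive example. In scikit-learn itself the labels are sorted into estimator.classes_, and for a binary problem the second entry is the category whose probability the model estimates. LogisticRegression is comparable to SGDClassifier(loss="log", penalty=" ") (in scikit-learn 1.1 and later this loss is spelled "log_loss"). SGDClassifier implements plain stochastic gradient descent learning, whereas LogisticRegression can use SAG (stochastic average gradient) through solver='sag'.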
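A short sketch of how these parameters are passed, assuming scikit-learn is installed. The tiny one-feature dataset `X`, `y` and the two C values are hypothetical and chosen only to make the effect of C visible; `penalty` is left at its default of 'l2'.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset: small values are class 0, large values class 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]

# Large C = weak regularization, small C = strong regularization.
weak_reg = LogisticRegression(solver="lbfgs", C=100.0, max_iter=1000).fit(X, y)
strong_reg = LogisticRegression(solver="lbfgs", C=0.01, max_iter=1000).fit(X, y)

print("classes:", weak_reg.classes_)
print("weight with C=100: ", weak_reg.coef_[0][0])
print("weight with C=0.01:", strong_reg.coef_[0][0])

# predict returns labels; predict_proba returns one column per entry of classes_.
print(weak_reg.predict([[1.0], [4.0]]))
print(weak_reg.predict_proba([[4.0]]))
```

Stronger regularization shrinks the learned weight toward zero, which in turn pulls the predicted probabilities toward 0.5.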
5. Case: Classification prediction of benign/malignant breast cancer tumors
Original data download address: https://archive.ics.uci.edu/ml/machine-learning-databases/
- data description
- 699 samples with 11 columns in total: the first column is the sample ID, the next 9 columns are medical features of the tumor, and the last column is the tumor class (2 = benign, 4 = malignant)
- Contains 16 missing values, marked with "?"
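The workflow is: get the data, handle the missing values, select the feature values and the target value, split the data, standardize the features, train the logistic regression model, and evaluate it.
The complete code is as follows: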
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# import ssl  # workaround for certificate problems
# ssl._create_default_https_context = ssl._create_unverified_context
# 1. Get the data
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
         'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
         'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                   names=names)
data.head()
--------------------------------------------
data.describe()
# 2. Basic data processing
# 2.1 Handle missing values
data = data.replace(to_replace="?", value=np.nan)  # np.NaN was removed in NumPy 2.0
data = data.dropna()  # drop rows that contain NaN
data.describe()
--------------------------------------------
# 2.2 Select the feature values and the target value
x = data.iloc[:, 1:10]
x.head()
y = data["Class"]
y.head()
--------------------------------------------
# 2.3 Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)
x_train.head()
--------------------------------------------
# 3. Feature engineering (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
x_train
--------------------------------------------
# 4. Machine learning (logistic regression)
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
--------------------------------------------
# 5. Model evaluation
y_predict = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print('Predicted values:', y_predict, '\nAccuracy:', score)
Output:
Predicted values: [2 4 4 2 2 2 2 2 2 2 2 2 2 4 2 2 4 4 4 2 4 2 4 4 4 2 4 2 2 2 2 2 4 2 2 2 4
2 2 2 2 4 2 4 4 4 4 2 4 4 2 2 2 2 2 4 2 2 2 2 4 4 4 4 2 4 2 2 4 2 2 2 2 4
2 2 2 2 2 2 4 4 4 2 4 4 4 4 2 2 2 4 2 4 2 2 2 2 2 2 4 2 2 4 2 2 4 2 4 4 2
2 2 2 4 2 2 2 2 2 2 4 2 4 2 2 2 4 2 4 2 2 2 4 2 2 2]
Accuracy: 0.9854014598540146