Logistic regression: the sigmoid activation function, loss and optimization, and a case study in code

1. Logistic regression

Logistic regression is a classification model in machine learning. Despite its name, it is a classification algorithm, although it is closely related to linear regression. Because the algorithm is simple and efficient, it is widely used in practice.

Application scenarios: predicting advertisement click-through, judging whether an email is spam, whether a patient is sick, whether a transaction is financial fraud, whether an account is fake, and so on. What these share is a judgment between two categories, and logistic regression is a sharp tool for solving such binary classification problems.

Principle: to master logistic regression, you need to know what its input is and how its output is judged.

The input is the result of a linear regression:

h(w)=w_{1} x_{1}+w_{2} x_{2}+w_{3} x_{3}+\cdots+b

2. Activation function

The sigmoid function, an S-shaped curve that maps any real input into the interval (0, 1), is defined as

g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}}

Logistic regression makes its final classification judgment from the probability of belonging to a certain category. By default that category is labeled 1 (the positive example) and the other category is labeled 0 (the negative example), which is convenient for loss calculation.

Judgment criteria

  • The linear regression result is fed into the sigmoid function
  • Output result: a probability value in the interval [0, 1], with 0.5 as the default threshold

Interpretation of the output: assume there are two categories, A and B, and that the probability value produced is the probability of belonging to category A (1). If a sample fed into the logistic regression produces an output of 0.6, the probability exceeds the 0.5 threshold, so the training or prediction result is category A (1). Conversely, if the result is 0.3, the training or prediction result is category B (0).
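
A minimal sketch of this decision rule; the weights, bias, and sample values below are assumed purely for illustration:

import numpy as np

def sigmoid(z):
    # map any real value into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# assumed weights, bias, and one sample, for illustration only
w = np.array([0.4, -0.2, 0.1])
b = 0.05
x = np.array([1.2, 0.7, 3.0])

z = np.dot(w, x) + b           # the linear regression result h(w)
p = sigmoid(z)                 # probability of belonging to category A (1)
label = 1 if p > 0.5 else 0    # 0.5 is the default threshold
print(p, label)                # p is about 0.67 here, so the prediction is A (1)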

3. Loss and optimization

3.1 Loss

The loss of logistic regression is called the log-likelihood loss, and the formula is as follows:

  • Separate categories:

 \operatorname{cost}\left(h_{\theta}(x), y\right)=\left\{\begin{array}{ll} -\log \left(h_{\theta}(x)\right) & \text { if } \mathrm{y}=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } \mathrm{y}=0 \end{array}\right.

  • Integrated full loss function

\operatorname{cost}\left(h_{\theta}(x), y\right)=\sum_{i=1}^{m}\left[-y_{i} \log \left(h_{\theta}(x_{i})\right)-\left(1-y_{i}\right) \log \left(1-h_{\theta}(x_{i})\right)\right]

Substituting concrete values into this formula shows how the loss behaves.
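
The sketch below substitutes assumed labels and predicted probabilities into the integrated loss; the numbers are illustrative, not from real data:

import numpy as np

# assumed true labels and predicted probabilities, purely for illustration
y_true = np.array([1, 0, 0, 1, 1])
h = np.array([0.9, 0.2, 0.4, 0.7, 0.8])  # sigmoid outputs h_theta(x_i)

# integrated log-likelihood loss from the formula above
loss = np.sum(-y_true * np.log(h) - (1 - y_true) * np.log(1 - h))
print(loss)  # each confident, correct prediction contributes only a small term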

 3.2 Optimization

The gradient descent optimization algorithm is used to reduce the value of the loss function. Updating the weight parameters of the linear part of the logistic regression model increases the predicted probability for samples that truly belong to category 1 and decreases it for samples that truly belong to category 0.
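
As a rough illustration of this optimization loop, here is a minimal NumPy gradient-descent sketch; the toy data, learning rate, and iteration count are all assumptions, not values from the original post:

import numpy as np

def sigmoid(z):
    # squash the linear output into a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# toy data: 4 samples with 2 features each (assumed for illustration)
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])  # weights
b = 0.0                   # bias
lr = 0.1                  # learning rate (assumed)

for _ in range(1000):
    p = sigmoid(X @ w + b)            # current predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # gradient of the mean loss w.r.t. w
    grad_b = np.mean(p - y)           # gradient w.r.t. b
    w -= lr * grad_w                  # step against the gradient to reduce the loss
    b -= lr * grad_b

print(w, b)
print(sigmoid(X @ w + b))  # probabilities move toward 0 for class-0 samples, toward 1 for class-1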

4. Logistic regression API

  • sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0): solver is an optional parameter in {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'} that selects the algorithm used to solve the optimization problem (the default is 'liblinear' in older scikit-learn versions; since version 0.22 it is 'lbfgs')
    • 'liblinear' is a good choice for small datasets, while 'sag' and 'saga' are faster for large ones
    • For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-versus-rest classification
    • penalty: the type of regularization
    • C: the inverse of the regularization strength; smaller values mean stronger regularization

By default, the class with the smaller number of samples is treated as the positive example. LogisticRegression is equivalent to SGDClassifier(loss="log"): SGDClassifier implements ordinary stochastic gradient descent learning, while LogisticRegression can also use SAG (stochastic average gradient).
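
As a short usage sketch, the parameter values below simply restate the defaults described above:

from sklearn.linear_model import LogisticRegression

# a minimal sketch; these values restate the documented defaults
estimator = LogisticRegression(solver='liblinear',  # good choice for small datasets
                               penalty='l2',        # type of regularization
                               C=1.0)               # inverse regularization strength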

5. Case: Classification prediction of benign/malignant breast cancer tumors

Original data download address: https://archive.ics.uci.edu/ml/machine-learning-databases/

  • data description
    • 699 samples and 11 columns of data: the first column is the sample id, the middle 9 columns are medical features related to the tumor, and the last column is the tumor class (2 for benign, 4 for malignant)
    • Contains 16 missing values, marked with "?"

The complete code is as follows:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# import ssl  # uncomment in case of certificate problems when downloading
# ssl._create_default_https_context = ssl._create_unverified_context
# 1. Load the data
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
         'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
         'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                   names=names)
data.head()
--------------------------------------------
data.describe()

# 2. Basic data processing
# 2.1 Handle missing values
data = data.replace(to_replace="?", value=np.nan)
data = data.dropna()  # drop the rows containing NaN
data.describe()
--------------------------------------------
# 2.2 Select the feature values and the target value
x = data.iloc[:, 1:10]   # the 9 medical feature columns
x.head()
y = data["Class"]        # the tumor class (2 benign, 4 malignant)
y.head()
--------------------------------------------
# 2.3 Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)
x_train.head()
--------------------------------------------
# 3. Feature engineering (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)  # fit on the training set only
x_test = transfer.transform(x_test)        # reuse the training-set statistics
x_train
--------------------------------------------
# 4. Machine learning (logistic regression)
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
--------------------------------------------
# 5. Model evaluation
y_predict = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print('Predictions:', y_predict, '\nAccuracy:', score)

Output:

Predictions: [2 4 4 2 2 2 2 2 2 2 2 2 2 4 2 2 4 4 4 2 4 2 4 4 4 2 4 2 2 2 2 2 4 2 2 2 4
 2 2 2 2 4 2 4 4 4 4 2 4 4 2 2 2 2 2 4 2 2 2 2 4 4 4 4 2 4 2 2 4 2 2 2 2 4
 2 2 2 2 2 2 4 4 4 2 4 4 4 4 2 2 2 4 2 4 2 2 2 2 2 2 4 2 2 4 2 2 4 2 4 4 2
 2 2 2 4 2 2 2 2 2 2 4 2 4 2 2 2 4 2 4 2 2 2 4 2 2 2] 
Accuracy: 0.9854014598540146

Learning to navigate: http://xqnav.top/


Origin: blog.csdn.net/qq_43874317/article/details/128283780