Logistic regression, tumor prediction case

1. Logistic regression

(1) Definition and usage scenarios

Logistic regression is a classification model in machine learning. Despite the word "regression" in its name, it is a classification algorithm. Because it is simple and efficient, it is widely used in practice.

Examples of application scenarios:

  • Is an email spam?
  • Is a patient sick?
  • Is a financial transaction fraudulent?
  • Is an account fake?

The examples above share one characteristic: each is a decision between two categories. Logistic regression is a method for solving this kind of binary classification problem.

(2) Input and output of logistic regression

The input of logistic regression is the output of a linear regression:

h(w) = w_1 x_1 + w_2 x_2 + ... + b

Put simply, h(w) is the value that is fed into logistic regression.

Given this input, how is the output class decided? This is where the activation function comes in: h(w) is passed through the sigmoid function, which returns a probability between 0 and 1. The threshold is usually 0.5: a result greater than 0.5 is predicted as a positive example, and a result less than 0.5 as a negative example.
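The two steps above (linear output, then sigmoid and threshold) can be sketched in a few lines of plain Python. The weights, bias and sample below are hypothetical values chosen only for illustration, and `predict_label` is a helper name introduced here, not part of any library.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample."""
    # Linear part: h(w) = w1*x1 + w2*x2 + ... + b
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    # Greater than the threshold -> positive example (1), otherwise negative (0)
    return probability, int(probability > threshold)


# Hypothetical weights, bias and sample (assumptions for illustration only)
weights = [0.8, -0.4]
bias = 0.1
prob, label = predict_label([2.0, 1.0], weights, bias)
print("probability:", round(prob, 3), "label:", label)
```

Here the linear output is 0.8*2.0 - 0.4*1.0 + 0.1 = 1.3, so the sigmoid returns a probability above 0.5 and the sample is labelled as a positive example.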

(3) Loss calculation

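The loss of logistic regression is called the log-likelihood loss (also known as log loss). For a single sample with true value y and predicted probability h, the loss is -log(h) when y = 1 and -log(1 - h) when y = 0. Summed over all m samples:

$$J = -\sum_{i=1}^{m} \left[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \right]$$

The shape of the -log function shows that: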

  • When the true value is 1, the closer the predicted value (the probability returned by the sigmoid activation function) is to 1, the smaller the loss
  • When the true value is 0, the closer the predicted value is to 0, the smaller the loss

Example of loss calculation:
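As a minimal sketch of the calculation, the per-sample log loss can be computed directly with the standard library. The labels and predicted probabilities below are hypothetical and are not taken from the original post; `log_loss_per_sample` is a helper name introduced here.

```python
import math


def log_loss_per_sample(y_true, p_pred):
    """Log loss of one sample: -log(p) if y is 1, -log(1 - p) if y is 0."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))


# Hypothetical true values and sigmoid outputs (assumptions for illustration)
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.4, 0.7]

losses = [log_loss_per_sample(y, p) for y, p in zip(y_true, p_pred)]
total_loss = sum(losses)

for y, p, loss in zip(y_true, p_pred, losses):
    print(f"true={y} predicted={p} loss={loss:.4f}")
print(f"total loss={total_loss:.4f}")
```

The two confident, correct predictions (0.9 for a true 1, 0.2 for a true 0) contribute small losses, while the two predictions on the wrong side of 0.5 dominate the total.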

(4) sklearn logistic regression API

sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)

  • solver: one of {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}; the algorithm used to solve the optimization problem.

    • Default: 'lbfgs' since scikit-learn 0.22 ('liblinear' in earlier versions).
    • For small data sets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster for large data sets.

    • For multi-class problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-versus-rest classification.

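  • penalty: the type of regularization ('l2' by default)

  • C: the inverse of the regularization strength; a positive float, where a smaller value means stronger regularization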
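A short sketch of how these parameters behave, assuming a synthetic data set from `make_classification` (not the tumor data used later). It fits the same model with a small and a large C and compares the size of the learned coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C is the inverse of the regularization strength:
# a small C regularizes strongly and shrinks the coefficients
strong = LogisticRegression(solver="liblinear", penalty="l2", C=0.01).fit(X, y)
weak = LogisticRegression(solver="liblinear", penalty="l2", C=100.0).fit(X, y)

strong_norm = float(np.linalg.norm(strong.coef_))
weak_norm = float(np.linalg.norm(weak.coef_))

print("coefficient norm with C=0.01:", round(strong_norm, 3))
print("coefficient norm with C=100: ", round(weak_norm, 3))
```

On recent scikit-learn releases, passing `penalty` explicitly may emit a deprecation warning; the call is kept here to mirror the signature shown above.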

Official API:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn-linear-model-logisticregression

Note: LogisticRegression fits the same kind of model as SGDClassifier(loss="log", penalty=" ") (the loss is named "log_loss" in scikit-learn 1.1 and later). SGDClassifier trains it with plain stochastic gradient descent.

 

2. Tumor prediction case

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Skip SSL certificate verification for the download
import ssl
ssl._create_default_https_context = ssl._create_unverified_context


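"""
1. Get the data
2. Basic data processing
2.1 Handle missing values
2.2 Select the feature values and the target value
2.3 Split the data
3. Feature engineering (standardization)
4. Machine learning (logistic regression)
5. Model evaluation

"""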

# 1. Get the data
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                   'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                   'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                  names=names)

# 2. Basic data processing
# 2.1 Handle missing values (marked with "?" in this data set)
data = data.replace("?", np.nan)
data = data.dropna()

# 2.2 Select the feature values and the target value
x = data.iloc[:, 1:10]
# y = data.iloc[:, 10]
# print(type(y))
# or, equivalently
y = data["Class"]
print(type(y))

# 2.3 Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

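# 3. Feature engineering (standardization)
# Fit the scaler on the training set only, then reuse it on the test set
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)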

# 4. Machine learning (logistic regression)
estimator = LogisticRegression(max_iter=10000, n_jobs=-1)
estimator.fit(x_train, y_train)

# 5. Model evaluation
res = estimator.score(x_test, y_test)
print("Accuracy:", res)
y_pred = estimator.predict(x_test)

# Precision | recall
report = classification_report(y_test, y_pred)
print(report)

# Map the class labels to 0/1 (2 = benign -> 0, 4 = malignant -> 1)
y_test = np.where(y_test > 2.5, 1, 0)
print("AUC:", roc_auc_score(y_test, y_pred))
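The AUC above is computed from the hard class predictions. As a hedged alternative, the sketch below scores the model with the probabilities from `predict_proba`, which lets the ROC curve use every threshold rather than a single one. To stay runnable offline it assumes scikit-learn's bundled `load_breast_cancer` data, which is a different (diagnostic) breast cancer data set from the one downloaded above, so the numbers are not comparable with the case study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled stand-in data set (an assumption; not the UCI file used above)
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=22, test_size=0.2
)

# Standardize with statistics learned from the training set only
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train)

# Column 1 holds the probability of the class stored in model.classes_[1]
positive_proba = model.predict_proba(x_test)[:, 1]

auc_from_proba = roc_auc_score(y_test, positive_proba)
auc_from_labels = roc_auc_score(y_test, model.predict(x_test))
print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```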

 

 


Origin blog.csdn.net/qq_39197555/article/details/115282870