Machine Learning: Linear Model Learning Summary (2): Logistic Regression Classification

These notes draw on Zhou Zhihua's "Machine Learning", my earlier study notes, and other material found online to summarize the linear-model chapter. This post continues from: Machine Learning: Summary of Linear Model Learning (1).
Study time: 2022.04.18

1. Use SK-Learn to build a logistic regression model

For classification: sklearn.linear_model.LogisticRegression

  • penalty: Regularization method. Default is 'l2'.
    • The options are 'l2', 'l1' and 'elasticnet', corresponding to the penalties used in ridge regression, Lasso regression and elastic-net regression respectively.
  • l1_ratio: Elastic-net mixing parameter, in [0, 1]. It sets the ratio r between the 'l1' and 'l2' penalties (1 is pure L1, 0 is pure L2) and is used only when penalty='elasticnet'. Default=None.
  • tol: Tolerance for the stopping criterion. The solver stops iterating once its convergence measure falls below this value. Default=1e-4.
  • C: Inverse of regularization strength, must be a positive float. As with support vector machines, smaller values specify stronger regularization. Default=1.0.
  • solver: Algorithms for optimization problems. {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'.
    • ‘newton-cg’: A member of the Newton method family. It uses the second-order derivative matrix of the loss function (the Hessian matrix) to iteratively optimize the loss function. Supports L2 regularization or no penalty.
    • ‘lbfgs’: A quasi-Newton method. Instead of computing the Hessian matrix exactly, it approximates it from recent gradient evaluations. Supports L2 regularization or no penalty.
    • ‘liblinear’: Supports L1 and L2 regularization but does not support penalty='none'. It handles multiclass problems only through one-vs-rest, so it is mostly used for binary classification.
    • ‘sag’: Stochastic average gradient descent, a variant of gradient descent. Supports L2 regularization or no penalty. Fast convergence is only guaranteed when the features are on roughly the same scale, so scale the data first.
    • ‘saga’: A variant of ‘sag’ that supports L1, L2 and elastic-net regularization. It also expects scaled data.
  • max_iter: Maximum number of iterations, default=100.
  • multi_class: Multiclass strategy, default='auto'. Recent scikit-learn releases (1.5 and later) deprecate this parameter.
    • ‘auto’: Chosen automatically from the data and the solver.
    • ‘ovr’: One-vs-rest. A binary classification problem is fit for each label.
    • ‘multinomial’: Softmax regression. The loss minimized is the multinomial loss fit across the entire probability distribution, even when the data is binary. ('multinomial' is not available when solver='liblinear'.)
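To make the effect of C and penalty concrete, here is a minimal sketch. The synthetic data from make_classification and the helper name fit_coefs are assumptions made for illustration and do not come from the original notes. It fits the same model at several values of C and prints how the L2 coefficient norm and the number of exactly-zero L1 coefficients change.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption, used only for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)


def fit_coefs(C, penalty):
    """Hypothetical helper: fit with liblinear and return the coefficient vector."""
    model = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
    model.fit(X, y)
    return model.coef_.ravel()


for C in (0.01, 1.0, 100.0):
    l2_coefs = fit_coefs(C, 'l2')
    l1_coefs = fit_coefs(C, 'l1')
    # Smaller C means stronger regularization: the L2 norm shrinks,
    # and the L1 penalty tends to push more coefficients to exactly zero.
    print(f"C={C:<6} L2 norm={np.linalg.norm(l2_coefs):.3f}  "
          f"L1 zero coefficients={int(np.sum(l1_coefs == 0))}")
```

liblinear is used here only because it accepts both 'l1' and 'l2', which keeps the comparison to a single solver.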
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

logitReg = LogisticRegression()
# Adjust the parameters to search according to the model being tuned
params = [
    {'penalty': ['l2'], 'C': [0.1, 0.08, 0.12], 'solver': ['lbfgs'],
     'max_iter': [1000, 2000], 'multi_class': ['auto', 'multinomial']},
    {'penalty': ['l1', 'l2'], 'C': [0.1, 0.08, 0.12], 'solver': ['saga'],
     'max_iter': [1000, 2000], 'multi_class': ['auto', 'multinomial']},
    {'penalty': ['elasticnet'], 'l1_ratio': [0.1, 0.11, 0.12], 'C': [0.1, 0.08, 0.12], 'solver': ['saga'],
     'max_iter': [1000, 2000], 'multi_class': ['auto', 'multinomial']}
]
scores = ['accuracy', 'f1']
best_logitReg = GridSearchCV(logitReg, param_grid=params, n_jobs=-1, scoring=scores, refit='f1', error_score='raise')

# Run the grid search (train_x and train_y are the training features and labels)
best_logitReg.fit(train_x, train_y)
# Take the best model found by the search
logitReg = best_logitReg.best_estimator_

# With refit='f1' the best estimator has already been refit on the full training data,
# so this extra fit is optional
logitReg.fit(train_x, train_y)
train_result = logitReg.predict(train_x)
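The grid search above assumes that train_x and train_y already exist. As a self-contained sketch of the same idea, the block below runs a multi-metric GridSearchCV on synthetic data (an assumption) and wraps the model in a Pipeline with StandardScaler, since the ‘saga’ solver expects scaled features. The step names 'scale' and 'clf' are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (an assumption, used only for illustration).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='saga', max_iter=2000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {'clf__penalty': ['l1', 'l2'], 'clf__C': [0.08, 0.1, 0.12]}

search = GridSearchCV(pipe, param_grid=param_grid,
                      scoring=['accuracy', 'f1'], refit='f1', cv=5)
search.fit(X, y)

print(search.best_params_)
# With several scorers, cv_results_ holds one column per metric.
print(search.cv_results_['mean_test_accuracy'][search.best_index_])
print(search.cv_results_['mean_test_f1'][search.best_index_])
# best_estimator_ is already refit on all of X, y and can predict directly.
print(search.best_estimator_.predict(X[:5]))
```

Because refit='f1', best_params_, best_score_ and best_estimator_ all refer to the combination with the highest mean F1 score, not the highest accuracy.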

2. Use SK-Learn to evaluate classification models

2.1 Simply call the classification_report function

sklearn.metrics.classification_report

Possible parameters:

  • y_true: 1-dimensional array, the true class labels.
  • y_pred: 1-dimensional array, the class labels predicted by the model.
  • labels=None: List, the labels to include in the report.
  • target_names=None: List, display names for the labels (one name per label, in the same order).
  • sample_weight=None: 1-dimensional array, the weight of each data point in the evaluation results.
  • digits=2: Number of decimal places kept in the report. This parameter has no effect when output_dict=True; the returned values are not rounded.
  • output_dict=False: If True, the evaluation result is returned as a dictionary.
  • zero_division='warn': The value to return when a division by zero occurs. If set to 'warn', the value is 0 and a warning is also raised.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, digits=6))
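When the numbers are needed in code and not only on screen, output_dict=True is convenient. The sketch below uses a tiny hand-made label set (an assumption for illustration) and shows how to index the returned dictionary.

```python
from sklearn.metrics import classification_report

# Tiny hand-made labels (an assumption, used only for illustration).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

report = classification_report(
    y_true, y_pred,
    target_names=['False', 'True'],  # display names for labels 0 and 1
    output_dict=True,
)

# Per-class entries are keyed by the display name; averages have their own keys.
print(report['accuracy'])
print(report['True']['recall'])
print(report['macro avg']['f1-score'])
```

The dictionary can also be passed to pandas.DataFrame(report).T if a table is preferred.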

2.2 Build functions for batch use

Alternatively, build your own template function that outputs the relevant evaluation metrics in one call:

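from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def classification_evaluation(y_true, y_pred):
    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)

    # Confusion matrix
    matrix = confusion_matrix(y_true, y_pred)

    # Precision (positive class) and macro-averaged precision
    precision = precision_score(y_true, y_pred)
    macro_precision = precision_score(y_true, y_pred, average='macro')

    # Recall (positive class) and macro-averaged recall
    recall = recall_score(y_true, y_pred)
    macro_recall = recall_score(y_true, y_pred, average='macro')

    # F1 score (positive class) and macro-averaged F1 score
    f1 = f1_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average='macro')

    # ROC-AUC score, computed here from hard class labels
    roc_auc = roc_auc_score(y_true, y_pred)

    # Output the metrics
    print('accuracy:', accuracy)
    print('precision:', precision, 'macro precision:', macro_precision)
    print('recall:', recall, 'macro recall:', macro_recall)
    print('f1:', f1, 'macro f1:', macro_f1)
    print('roc_auc:', roc_auc)

    # Assumption: the listing has no return statement, but section 3 passes this
    # function's result to plot_confusion_matrix, so the confusion matrix is returned
    return matrix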
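The average argument changes what precision_score, recall_score and f1_score report, which matters as soon as the classes are imbalanced. A minimal sketch on hand-made labels (an assumption for illustration) compares the per-class, 'macro' and 'weighted' F1 scores.

```python
from sklearn.metrics import f1_score

# Imbalanced hand-made labels: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

per_class_f1 = f1_score(y_true, y_pred, average=None)      # one value per class
macro_f1 = f1_score(y_true, y_pred, average='macro')       # plain mean over classes
weighted_f1 = f1_score(y_true, y_pred, average='weighted') # mean weighted by class support

print(per_class_f1)
print(macro_f1)
print(weighted_f1)
```

Here class 0 scores 0.75 and class 1 scores 0.5, so the macro average is 0.625, while the weighted average is pulled toward the larger class (about 0.667).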

3. Complete code

As before, the data set is the training set of Spaceship Titanic.

import pandas as pd
# The following modules are the author's own local helper files
from Data_processing_by_Pandas import mango_processing
from Classification_Model_evaluation import classification_evaluation
from Classification_Model_evaluation import plot_confusion_matrix
from Classification_Model_evaluation import plot_curve

# Read the data
train = pd.read_csv('Titanic.csv')
print(train.describe())

train_target = train['Transported']
train_feature_before = train.drop(['PassengerId', 'Cabin', 'Name', 'Transported'], axis=1)

# Preprocess the data
train_feature = mango_processing(train_feature_before)

# Split into training and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(train_feature, train_target, test_size=0.2, random_state=42)

# Use grid search to find the best model
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

logitReg = LogisticRegression()

# Adjust the parameters to search according to the model being tuned
params = [
    {'penalty': ['l2'], 'C': [0.05, 0.08, 0.03], 'solver': ['lbfgs'],
     'max_iter': [1000, 800], 'multi_class': ['auto', 'multinomial']},
    {'penalty': ['l1'], 'C': [0.05, 0.08, 0.03], 'solver': ['liblinear'],
     'max_iter': [1000, 800], 'multi_class': ['auto']},
    {'penalty': ['l2'], 'C': [0.05, 0.08, 0.03], 'solver': ['sag'],
     'max_iter': [1000, 800], 'multi_class': ['auto', 'multinomial']}
]
scores = ['accuracy', 'f1']
best_logitReg = GridSearchCV(logitReg, param_grid=params, n_jobs=-1, scoring=scores, refit='f1', error_score='raise')

# Run the grid search
best_logitReg.fit(train_x, train_y)

# Print the best score and the best parameters
print(best_logitReg.best_score_)
print(best_logitReg.best_params_)

# Take the best model found by the search
logitReg = best_logitReg.best_estimator_
# Train the model (optional: refit='f1' has already refit it on the training set)
logitReg.fit(train_x, train_y)
# Predict on the training set
train_result = logitReg.predict(train_x)

# Evaluate with the custom evaluation functions
labels = ['False', 'True']
# Training-set evaluation
plot_confusion_matrix(classification_evaluation(train_y, train_result), labels)
plot_curve(train_y, train_result)

# Test-set evaluation
test_result = logitReg.predict(test_x)
plot_confusion_matrix(classification_evaluation(test_y, test_result), labels)
plot_curve(test_y, test_result)
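The code above passes hard class predictions to roc_auc_score and to plot_curve. plot_curve is the author's own helper and its implementation is not shown, so what follows is only a hedged sketch: if the helper draws a ROC curve, continuous scores such as logitReg.predict_proba(test_x)[:, 1] usually carry more information than hard labels. The tiny hand-made scores below (an assumption for illustration) show how far the two can differ.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hand-made example: the scores rank every positive above every negative,
# but none of them reaches the 0.5 decision threshold.
y_true = [0, 0, 1, 1]
y_score = [0.2, 0.3, 0.4, 0.45]              # e.g. model.predict_proba(X)[:, 1]
y_label = [int(s >= 0.5) for s in y_score]   # hard predictions: all zeros

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

# roc_curve on the scores gives the points needed to draw the full curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(auc_from_scores)
print(auc_from_labels)
```

The ranking is perfect, so the score-based AUC is 1.0, while the thresholded labels are constant and give only 0.5.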

Origin blog.csdn.net/Morganfs/article/details/124251740