This post summarizes the linear model chapter, drawing on Zhou Zhihua's "Machine Learning", my previous study notes, and other material from the Internet. Continued from: Machine Learning: Summary of Linear Model Learning (1).
Study time: 2022.04.18
1. Use SK-Learn to build a logistic regression model
For classification: sklearn.linear_model.LogisticRegression
- penalty: Regularization method, default='l2'. The options are 'l1', 'l2' and 'elasticnet', which correspond to the penalties used in Lasso regression, ridge regression and elastic net regression respectively.
- l1_ratio: Elastic net mixing parameter in [0, 1]. It sets the ratio between the 'l1' and 'l2' penalties and is used only if penalty='elasticnet'. Default=None.
- tol: Tolerance for the stopping criterion. The solver stops iterating once its convergence measure falls below this value. Default=1e-4.
- C: Inverse of the regularization strength. As with support vector machines, smaller values specify stronger regularization. Default=1.0.
- solver: Algorithm used for the optimization problem. {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'.
  - 'newton-cg': A member of the Newton method family. It uses second-order derivative information of the loss function (the Hessian matrix) to optimize the loss iteratively.
  - 'lbfgs': A quasi-Newton method. Instead of computing the Hessian matrix explicitly, it builds an approximation of it from recent gradients to optimize the loss iteratively.
  - 'liblinear': Does not support penalty='none'. Supports L1 (and L2) regularization; mostly used for binary classification.
  - 'sag': Stochastic Average Gradient descent, a variant of gradient descent. Supports L2 regularization. Requires the data to be scaled.
  - ⭐ 'saga': Supports L1, L2 and elastic net regularization. Requires the data to be scaled.
- max_iter: Maximum number of iterations, default=100.
- multi_class: Multi-class strategy, default='auto'.
  - 'auto': Chosen automatically.
  - 'ovr': One-vs-rest; a binary classification problem is fit for each label.
  - 'multinomial': Softmax (multi-class) formulation. The loss minimized is the multinomial loss fit across the entire probability distribution, even when the data is binary. ('multinomial' is not available when solver='liblinear'.)
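As a rough illustration of how penalty, solver and l1_ratio fit together, the sketch below trains an elastic net model with the 'saga' solver on scaled data. The synthetic dataset from make_classification, the pipeline, and the specific values (C=0.1, l1_ratio=0.5) are assumptions made for this example, not settings taken from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (a stand-in for a real dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# 'saga' expects features on a similar scale, so standardize first
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga',
                       l1_ratio=0.5, C=0.1, max_iter=2000),
)
model.fit(X, y)

# make_pipeline names each step after its lowercased class name
clf = model.named_steps['logisticregression']
print(clf.coef_.shape)    # one row of coefficients for a binary problem
print(model.score(X, y))  # accuracy on the training data
```

Swapping solver='saga' for 'lbfgs' in this sketch would require penalty='l2', since 'lbfgs' does not handle the L1 part of the penalty.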
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

logitReg = LogisticRegression()
# Adjust the parameters to search according to the model being tuned
params = [
    {'penalty': ['l2'], 'C': [0.1, 0.08, 0.12], 'solver': ['lbfgs'],
     'max_iter': [1000, 2000], 'multi_class': ['auto', 'multinomial']},
    {'penalty': ['l1', 'l2'], 'C': [0.1, 0.08, 0.12], 'solver': ['saga'],
     'max_iter': [1000, 2000], 'multi_class': ['auto', 'multinomial']},
    {'penalty': ['elasticnet'], 'l1_ratio': [0.1, 0.11, 0.12], 'C': [0.1, 0.08, 0.12], 'solver': ['saga'],
     'max_iter': [1000, 2000], 'multi_class': ['auto', 'multinomial']}
]
scores = ['accuracy', 'f1']
best_logitReg = GridSearchCV(logitReg, param_grid=params, n_jobs=-1, scoring=scores, refit='f1', error_score='raise')
# Run the grid search (train_x and train_y are the training features and labels)
best_logitReg.fit(train_x, train_y)
# Take the best model found by the search
logitReg = best_logitReg.best_estimator_
# Optional: with refit='f1', best_estimator_ has already been refit on the full training set
logitReg.fit(train_x, train_y)
train_result = logitReg.predict(train_x)
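When several metrics are passed to scoring, GridSearchCV stores one set of results per metric and uses the metric named in refit to pick the best parameters and refit the model. A minimal sketch of reading those results, assuming a synthetic dataset and a deliberately small grid (both chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for train_x / train_y
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring=['accuracy', 'f1'],
    refit='f1',  # metric used to select best_params_ and refit the model
    cv=5,
)
search.fit(X, y)

best = search.best_estimator_  # already fitted on all of X, y
print(search.best_params_)
print(search.cv_results_['mean_test_accuracy'])  # one value per candidate
print(search.cv_results_['mean_test_f1'])
print(best.predict(X[:5]))
```

Because the best estimator is refit automatically, calling fit on it again with the same data is harmless but not required.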
2. Use SK-Learn to evaluate classification models
2.1 Simply call the classification_report function
sklearn.metrics.classification_report
Possible parameters:
- y_true: 1-dimensional array, the true class labels.
- y_pred: 1-dimensional array, the class labels predicted by the model.
- labels=None: List, the label values to include in the report.
- target_names=None: List, display names for the labels (one name per label, in the same order).
- sample_weight=None: 1-dimensional array, the weight of each data point in the evaluation results.
- digits=2: Number of decimal digits kept in the report. If output_dict=True, this parameter has no effect and the returned values are not rounded.
- output_dict=False: If True, the evaluation result is returned as a dictionary.
- zero_division='warn': The value to return when a division by zero occurs. If set to 'warn', it acts as 0 but also raises a warning.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, digits=6))
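To read individual values from the report instead of printing it, output_dict=True can be used. The tiny label lists below are made up purely to show the structure of the returned dictionary, whose per-class keys are the labels converted to strings:

```python
from sklearn.metrics import classification_report

# Hand-made labels: class 0 is predicted correctly once and missed once
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

report = classification_report(y_true, y_pred, output_dict=True)

print(report['0']['recall'])     # 1 of the 2 true class-0 samples was found
print(report['1']['precision'])  # 2 of the 3 class-1 predictions are correct
print(report['accuracy'])        # 3 of the 4 predictions are correct
```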
2.2 Build functions for batch use
Alternatively, build your own template function that outputs the relevant evaluation metrics in one batch:
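from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

def classification_evaluation(y_true, y_pred):
    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)
    # Confusion matrix
    matrix = confusion_matrix(y_true, y_pred)
    # Precision
    precision = precision_score(y_true, y_pred)
    # Macro-averaged precision
    macro_precision = precision_score(y_true, y_pred, average='macro')
    # Recall
    recall = recall_score(y_true, y_pred)
    # Macro-averaged recall
    macro_recall = recall_score(y_true, y_pred, average='macro')
    # F1 score
    f1 = f1_score(y_true, y_pred)
    # Macro-averaged F1 score
    macro_f1 = f1_score(y_true, y_pred, average='macro')
    # ROC-AUC score (computed here from hard label predictions)
    roc_auc = roc_auc_score(y_true, y_pred)
    # Print every metric
    for name, value in [('accuracy', accuracy), ('precision', precision),
                        ('macro precision', macro_precision), ('recall', recall),
                        ('macro recall', macro_recall), ('f1', f1),
                        ('macro f1', macro_f1), ('roc_auc', roc_auc)]:
        print(f'{name}: {value:.4f}')
    # Assumption: return the confusion matrix, because section 3 passes
    # this function's result to plot_confusion_matrix
    return matrix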
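The average argument changes what the multi-class summary means: 'macro' takes the plain mean of the per-class scores, while 'weighted' weights each class by its number of true samples. A small sketch with made-up, imbalanced labels shows that the two can differ:

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: three samples of class 0, one of class 1
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

per_class = f1_score(y_true, y_pred, average=None)       # F1 for class 0 and class 1
macro = f1_score(y_true, y_pred, average='macro')        # unweighted mean of per-class F1
weighted = f1_score(y_true, y_pred, average='weighted')  # mean weighted by class support

print(per_class)
print(macro)
print(weighted)
```

Here class 0 has F1 = 0.8 and class 1 has F1 = 2/3, so the macro average is about 0.733 and the weighted average about 0.767.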
3. Complete code
The data set still uses the training set of Spaceship Titanic.
import pandas as pd
from Data_processing_by_Pandas import mango_processing
from Classification_Model_evaluation import classification_evaluation
from Classification_Model_evaluation import plot_confusion_matrix
from Classification_Model_evaluation import plot_curve

# Read the data
train = pd.read_csv('Titanic.csv')
print(train.describe())
train_target = train['Transported']
train_feature_before = train.drop(['PassengerId', 'Cabin', 'Name', 'Transported'], axis=1)
# Preprocess the data
train_feature = mango_processing(train_feature_before)

# Split into training and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(train_feature, train_target, test_size=0.2, random_state=42)

# Use grid search to find the best model
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
logitReg = LogisticRegression()
# Adjust the parameters to search according to the model being tuned
params = [
    {'penalty': ['l2'], 'C': [0.05, 0.08, 0.03], 'solver': ['lbfgs'],
     'max_iter': [1000, 800], 'multi_class': ['auto', 'multinomial']},
    {'penalty': ['l1'], 'C': [0.05, 0.08, 0.03], 'solver': ['liblinear'],
     'max_iter': [1000, 800], 'multi_class': ['auto']},
    {'penalty': ['l2'], 'C': [0.05, 0.08, 0.03], 'solver': ['sag'],
     'max_iter': [1000, 800], 'multi_class': ['auto', 'multinomial']}
]
scores = ['accuracy', 'f1']
best_logitReg = GridSearchCV(logitReg, param_grid=params, n_jobs=-1, scoring=scores, refit='f1', error_score='raise')
# Run the grid search
best_logitReg.fit(train_x, train_y)
# Print the best score and the best parameters
print(best_logitReg.best_score_)
print(best_logitReg.best_params_)
# Take the best model
logitReg = best_logitReg.best_estimator_
# Train the model (optional: refit='f1' has already refit it on the training set)
logitReg.fit(train_x, train_y)
# Predict on the training set
train_result = logitReg.predict(train_x)

# Evaluate with the custom evaluation functions
labels = ['False', 'True']
# Evaluate the training set results
plot_confusion_matrix(classification_evaluation(train_y, train_result), labels)
plot_curve(train_y, train_result)
# Evaluate the test set results
test_result = logitReg.predict(test_x)
plot_confusion_matrix(classification_evaluation(test_y, test_result), labels)
plot_curve(test_y, test_result)
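The code above passes hard label predictions to plot_curve and roc_auc_score. What plot_curve does internally is not shown here, but if it draws a ROC curve, predicted probabilities usually give a more informative curve than labels, because labels yield only a single operating point. A minimal sketch of the difference, assuming synthetic data and a plain LogisticRegression (neither taken from the code above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and targets
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels_pred = clf.predict(X_test)        # hard 0/1 predictions
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_test, labels_pred)
auc_from_proba = roc_auc_score(y_test, proba)

# Points of the ROC curve obtained by sweeping the decision threshold
fpr, tpr, thresholds = roc_curve(y_test, proba)

print(auc_from_labels)
print(auc_from_proba)
print(len(thresholds))
```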