Machine learning model building and evaluation


Study notes for Chapter 3 (model building and evaluation) of the Datawhale community's hands-on data analysis course

  • model building
    • data division
      • train_test_split()
    • Model Selection and Fitting Data
      • model instantiation, fit()
    • predict
      • predict()
      • predict_proba()
  • model evaluation
    • Cross-validation
      • cross_val_score()
      • cross_val_predict()
    • confusion matrix
      • confusion_matrix()
      • classification_report() , precision_score() , recall_score() , f1_score()
      • precision_recall_curve()
    • ROC curve
      • roc_curve()
      • roc_auc_score()

Chapter 3 Model Building and Evaluation – Modeling

We have the Titanic dataset, so our goal this time is to predict whether a Titanic passenger survived.

import pandas as pd
import numpy as np
%matplotlib inline

Load the cleaned data provided with the course (clear_data.csv) as well as the original data (train.csv), and describe how they differ

# Write your code here
orgin_data = pd.read_csv('train.csv')
orgin_data.head(3)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
# Write your code here
clean_data = pd.read_csv('clear_data.csv')
clean_data.head(3)
   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0            0       3  22.0      1      0   7.2500           0         1           0           0           1
1            1       1  38.0      1      0  71.2833           1         0           1           0           0
2            2       3  26.0      0      0   7.9250           1         0           0           0           1

# Differences

  1. The Survived column was removed (the label is kept separately)
  2. The text features Sex and Embarked were converted to one-hot encoding
  3. The Name, Ticket and Cabin features of the original data were dropped
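
As a rough illustration of difference 2, columns like Sex_female and Embarked_C can be produced with pandas' get_dummies. The four toy rows below are invented for the example and are not read from train.csv; how clear_data.csv was actually generated is an assumption here.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from train.csv
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# One-hot encode the two text columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, which is why the cleaned frame has two Sex_* columns and three Embarked_* columns.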

model building

  • After the earlier data processing we have our modeling data; the next step is to choose an appropriate model
  • Before choosing a model, we need to know whether the task is supervised or unsupervised learning
  • The choice of model is determined partly by the task
  • Beyond the task, it can also be guided by the sample size and the sparsity of the features
  • We usually start with a simple model as a baseline, then train other models for comparison, and finally choose the one with better generalization ability or performance

[Thinking] Which differences between datasets cause the fitted model to change?

# Answer
The number of samples, the number of features, the importance of the features, the gap between sample distributions, noise, etc.

Task 1: Split the training set and test set

Here the dataset is split using the hold-out method

  • Divide the dataset into independent variables (features) and the dependent variable (label)
  • Split into training and test sets by proportion (typical test-set proportions are 30%, 25%, 20%, 15% and 10%)
  • Use stratified sampling
  • Set a random seed so the results are reproducible

[Thinking]

  • What methods are there for partitioning a dataset?
  • Why use stratified sampling, and what are its benefits?

# Answer

  1. Methods for partitioning datasets
  • hold-out method
    • random partition
    • stratified sampling
  • K-fold cross-validation
  • bootstrapping
  2. Stratified sampling keeps the class distribution of each split close to that of the original data
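
A small sketch of the second answer: with stratify=y, train_test_split keeps the class ratio nearly identical in both splits. The labels below are synthetic (30% positives) and the feature column is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 30% positives, 70% negatives
y = np.array([1] * 30 + [0] * 70)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y asks for the same class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without stratify, a small test set can end up with noticeably more or fewer positives than the full data, which makes its score less representative.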

Task Tip 1

  • The purpose of splitting the dataset is to estimate how well the model will generalize to unseen data
  • The function for splitting a dataset in sklearn is train_test_split
  • To view the function documentation in Jupyter Notebook, type train_test_split? and press Enter
  • Stratification and the random seed are set through its parameters (stratify and random_state)

Extract the arguments needed by train_test_split() from clear_data.csv and train.csv

# Write your code here
X = clean_data
y = orgin_data.Survived
# Write your code here
from sklearn.model_selection import train_test_split
# Write your code here
?train_test_split
# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((712, 11), (179, 11), (712,), (179,))

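[Thinking]

  • Under what circumstances is random selection not needed when splitting the dataset?
# Answer

1. The dataset is very large: random splitting may waste a lot of time and may also lower accuracy
2. Time-series data: split it into contiguous segments to avoid data leakage
3. Class-imbalanced data: resampling is used instead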
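
For the time-series case, one possible sketch uses sklearn's TimeSeriesSplit, which only ever trains on observations that come before the test segment. The twelve ordered observations are synthetic, and n_splits=3 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (synthetic, for illustration)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # Every training index precedes every test index, so no future data leaks in
    print("train:", train_idx, "test:", test_idx)
```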

Task 2: Model Creation

  • Create a classification model based on a linear model (logistic regression)
  • Create tree-based classification models (decision trees, random forests)
  • Train each model and obtain its score on the training set and the test set
  • View the parameters of each model, change their values and observe how the model changes

Hint

  • Logistic regression is a classification model, not a regression model; do not confuse it with LinearRegression
  • A random forest is an ensemble of decision trees, built to reduce the overfitting of individual decision trees
  • Linear models are in the sklearn.linear_model module
  • Random forests are in the sklearn.ensemble module (single decision trees are in sklearn.tree)
# Write your code here
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Write your code here
lr = LogisticRegression(max_iter=3000)
lr.fit(X_train, y_train)
lr.get_params()
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 3000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
# Write your code here: check the scores on the training set and the test set
print("Training set score: {:.2f} ,Testing set score: {:.2f}".format(lr.score(X_train, y_train), lr.score(X_test, y_test)))
Training set score: 0.81 ,Testing set score: 0.80
# Write your code here
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc.get_params()
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
# Write your code here: check the scores on the training set and the test set
print("Training set score: {:.2f} ,Testing set score: {:.2f}".format(rfc.score(X_train, y_train), rfc.score(X_test, y_test)))
Training set score: 1.00 ,Testing set score: 0.82
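
Task 2 also asks for a single decision tree, which these notes do not fit. A minimal sketch follows, using synthetic data from make_classification rather than the Titanic features; max_depth=3 is an arbitrary value chosen to show the effect of changing one parameter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (an assumption, for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# An unrestricted tree versus one limited to depth 3
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```

The unrestricted tree memorizes the training set, much like the random forest's training score of 1.00 above; comparing the two test scores shows whether limiting the depth helps on a given dataset.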

[Thinking]

  • Why can linear models perform classification tasks, and what is the mathematical relationship behind this?
  • How does a linear model handle multi-class problems?

# Answer

  • Compute a weighted sum of the input features, s = w^T x, then map it into the interval (0, 1) with an activation function (the sigmoid, in logistic regression) and classify by thresholding that value
  • A multi-class problem can be converted into multiple binary classification problems
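
The first answer can be checked numerically. The sketch below assumes a binary LogisticRegression fitted on synthetic data and adds the intercept b to the weighted sum; it recomputes the positive-class probability by hand and compares it with predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption, for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Weighted sum s = w^T x + b, then the sigmoid squashes it into (0, 1)
s = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-s))

# Same numbers as the positive-class column of predict_proba
p_sklearn = clf.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```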

Task 3: Output model prediction results

  • Output the class labels predicted by the model
  • Output the predicted probabilities for each class label

Tip 3

  • In sklearn, supervised models generally provide predict to output predicted labels, and predict_proba to output the label probabilities
# Write your code here
pred = lr.predict(X_test)
pred[:5]
array([0, 0, 0, 0, 1], dtype=int64)
# Write your code here
pred_proba = lr.predict_proba(X_test)
pred_proba[:5]
array([[0.92961653, 0.07038347],
       [0.95523041, 0.04476959],
       [0.8395754 , 0.1604246 ],
       [0.96137521, 0.03862479],
       [0.34067921, 0.65932079]])

[Thinking]

  • How does predicting the probability of a label help us?
# Answer
It tells us how confident the model is in its prediction
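
One practical use of those probabilities is choosing a decision threshold other than 0.5. The sketch below reuses the five positive-class probabilities printed above; the 0.1 threshold is a hypothetical value picked only to show the effect.

```python
import numpy as np

# Positive-class probabilities copied from the predict_proba output above
p_survived = np.array([0.07038347, 0.04476959, 0.1604246, 0.03862479, 0.65932079])

# The default decision corresponds to a 0.5 threshold
default_pred = (p_survived >= 0.5).astype(int)

# A lower, hypothetical threshold flags more passengers as survivors
lenient_pred = (p_survived >= 0.1).astype(int)
print(default_pred, lenient_pred)
```

With the 0.5 threshold the labels match the predict output shown earlier; lowering the threshold flips the third passenger to the positive class.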

Chapter 3 Model Building and Evaluation - Evaluation

From the modeling section we know how to build a model with the sklearn library and how to split the dataset. But how do we know whether a model is any good, so that we can trust the results it gives us? That is what this section on evaluation addresses.

Load the following libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
plt.rcParams['figure.figsize'] = (10, 6)  # set the output figure size

Task: Load data and split test and training sets

# Write your code here
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']
from sklearn.model_selection import train_test_split
# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Write your code here
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
LogisticRegression(max_iter=1000)

model evaluation

Model evaluation tells us how well the model generalizes.

  • K-fold cross-validation
  • confusion matrix (confusion_matrix)
    • Precision
    • Recall
    • F1 score
  • ROC curve

Task 1: Cross Validation

  • Evaluate the previous logistic regression model with 10-fold cross-validation
  • Calculate the mean of the cross-validation accuracy
# Hint: cross-validation
Image('Snipaste_2020-01-05_16-37-56.png')


Tip 4

  • The cross-validation functions in sklearn are in the sklearn.model_selection module
# Import the module
from sklearn.model_selection import cross_val_score
# K-fold cross-validation
lr = LogisticRegression(C=100, max_iter=1000)
scores = cross_val_score(lr, X_train, y_train, cv=10)
scores
array([0.8358209 , 0.7761194 , 0.82089552, 0.80597015, 0.85074627,
       0.86567164, 0.73134328, 0.85074627, 0.75757576, 0.6969697 ])
# Mean score over the K folds
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Average cross-validation score: 0.80
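
To make explicit what cross_val_score does, here is a sketch that runs the folds by hand and compares the result with the library call. It assumes synthetic data in place of X_train and y_train, and relies on the fact that an integer cv with a classifier uses unshuffled StratifiedKFold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for X_train, y_train (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

library_scores = cross_val_score(model, X, y, cv=5)
print(np.round(manual_scores, 3), np.round(library_scores, 3))
```

Each fold fits a new model on k-1 parts and scores it on the remaining part, so k folds mean k separate fits.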

Thinking 4

  • What is the effect of using more folds (a larger k)?
# Answer
It increases the computational cost

Task 2: Confusion Matrix

  • Calculate the confusion matrix for a binary classification problem
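  • Calculate precision, recall and F-score

[Thinking] What is the confusion matrix of a binary classification problem? Understand this concept and know what tasks it is mainly used for.

# Answer
The confusion matrix is used to evaluate classifier performance; the overall idea is to count how many times instances of class A are classified as class B
Rows of the confusion matrix represent the actual class, and columns represent the predicted class
# Hint: confusion matrix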
Image('Snipaste_2020-01-05_16-38-26.png')



# Hint: how to compute accuracy, precision, recall and F-score
Image('Snipaste_2020-01-05_16-39-27.png')



Tip 5

  • The confusion matrix function is in the sklearn.metrics module
  • The confusion matrix takes the true labels and the predicted labels as input
  • Precision, recall and F-score are available from the classification_report function
# Write your code here
from sklearn.metrics import confusion_matrix
# Write your code here
lr = LogisticRegression(C=100, max_iter=500)
lr.fit(X_train, y_train)
LogisticRegression(C=100, max_iter=500)
# Write your code here
pred = lr.predict(X_train)
# Write your code here
confusion_matrix(y_train, pred)
array([[350,  62],
       [ 71, 185]], dtype=int64)
from sklearn.metrics import classification_report
print(classification_report(y_train, pred))
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       412
           1       0.75      0.72      0.74       256

    accuracy                           0.80       668
   macro avg       0.79      0.79      0.79       668
weighted avg       0.80      0.80      0.80       668
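
The class-1 row of the report can be recomputed from the confusion matrix printed above. This sketch uses only those four counts, read with the sklearn layout of rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[350, 62],
               [71, 185]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The rounded values agree with the report: precision 0.75, recall 0.72, F1 0.74 for class 1, and overall accuracy 0.80.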

[Thinking]

  • What should you pay attention to when implementing the confusion matrix yourself?
# Answer
The meaning and position of each value in the confusion matrix, i.e. TN, FP, FN, TP.
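
A minimal hand-written version, assuming labels are coded 0 and 1 and following the same layout as sklearn (rows actual, columns predicted). The function name and the six toy labels are made up for the example.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual, columns are predicted."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Tiny made-up labels for illustration
cm = binary_confusion_matrix([0, 0, 1, 1, 1, 0], [0, 1, 1, 0, 1, 0])
print(cm)
```

Swapping the two arguments transposes the matrix and exchanges FP with FN, which is the most common mistake when writing this by hand.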

Draw the PR curve

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, lr.decision_function(X_test))
plt.plot(precisions, recalls)
plt.xlabel("precision")
plt.ylabel("recall")
plt.grid();


def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.legend(loc=0)
    plt.xlabel("thresholds")
    plt.grid()

plot_precision_recall_vs_threshold(precisions, recalls, thresholds);
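
The [:-1] slices in the plotting helper are needed because precision_recall_curve returns one more precision and recall value than thresholds. A small sketch with made-up labels and scores shows the lengths:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Tiny made-up labels and scores for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls carry one extra trailing point (precision 1, recall 0)
# that has no threshold, so they are one element longer than thresholds
print(len(precisions), len(recalls), len(thresholds))
```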


Task 3: ROC curve

  • Draw the ROC curve

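[Thinking] What is the ROC curve, and what problem is it meant to solve?

# From 《机器学习实战》
**Receiver operating characteristic (ROC) curve**
It plots the true positive rate (recall, sensitivity) against the false positive rate (FPR).
The FPR is the proportion of negative instances incorrectly classified as positive. It equals 1 minus the true negative rate (TNR), which is the proportion of negative instances correctly classified as negative, also called specificity.


**Choosing between the ROC curve and the PR curve:**
As a rule of thumb, prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives; otherwise use the ROC curve.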
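
As a worked example of these definitions, the rates can be computed for a single threshold from the training-set confusion matrix shown earlier (TN=350, FP=62, FN=71, TP=185). One such (FPR, TPR) pair is one point on a ROC curve.

```python
# Counts taken from the training-set confusion matrix shown earlier
tn, fp, fn, tp = 350, 62, 71, 185

tpr = tp / (tp + fn)   # true positive rate = recall = sensitivity
fpr = fp / (fp + tn)   # false positive rate
tnr = tn / (tn + fp)   # true negative rate = specificity

print(round(tpr, 3), round(fpr, 3), round(1 - tnr, 3))
```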

Tip 6

  • The ROC curve functions are in the sklearn.metrics module
  • The larger the area under the ROC curve, the better
# Write your code here
from sklearn.metrics import roc_curve
# Write your code here
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
plt.grid();
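
The area under the curve mentioned in Tip 6 can be computed with roc_auc_score, which is listed in the outline but not used in these notes. The sketch below uses four made-up labels and scores, and also integrates the curve points with auc to show that both routes agree.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Tiny made-up labels and scores for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Area computed directly from labels and scores
auc_direct = roc_auc_score(y_true, scores)

# Same area, integrated from the curve points with the trapezoidal rule
fpr, tpr, _ = roc_curve(y_true, scores)
auc_from_curve = auc(fpr, tpr)
print(auc_direct, auc_from_curve)
```

For the model above, the equivalent call would take y_test and lr.decision_function(X_test) as its two arguments.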



Thinking 6

  • How do you draw a ROC curve for a multi-class problem?
# Answer
Draw one ROC curve per class
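
One way to get a curve per class is the one-vs-rest approach: binarize the labels and treat each class in turn as the positive class. The sketch assumes synthetic three-class data and a logistic regression model; neither comes from the Titanic task, which is binary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class data (an assumption, for illustration)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                        # shape (n_samples, 3)
y_onehot = label_binarize(y_te, classes=clf.classes_)  # one column per class

# One-vs-rest: treat each class in turn as the positive class
curves = {}
for k, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_onehot[:, k], proba[:, k])
    curves[label] = (fpr, tpr, auc(fpr, tpr))
    print(label, round(curves[label][2], 3))
```

Each stored (fpr, tpr) pair can then be passed to plt.plot in the same way as the binary curve above.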

[Thinking] What information can you get from this ROC curve? What can this information do?

# Answer
The area under the ROC curve (AUC), which can be used to guide model selection.

[Note] These are personal study notes only; see the Datawhale community's open-source hands-on data analysis course for details


Origin blog.csdn.net/qq_38869560/article/details/128758063