Machine learning model building and evaluation

Model building and evaluation

Study notes for Chapter 3 (model building and evaluation) of the Datawhale community's hands-on data analysis course

model building
- data division
  - train_test_split()
- Model Selection and Fitting Data
  - model instantiation, fit()
- predict
  - predict()
  - predict_proba()
model evaluation
- Cross-validation
  - cross_val_score()
  - cross_val_predict()
- confusion matrix
  - confusion_matrix()
  - classification_report() , precision_score() , recall_score() , f1_score()
  - precision_recall_curve()
- ROC curve
  - roc_curve()
  - roc_auc_score()

Chapter 3 Model Building and Evaluation – Modeling

We have the Titanic dataset, so our goal this time is to predict whether each Titanic passenger survived.

import pandas as pd
import numpy as np

%matplotlib inline

Load the cleaned data provided with the course (clear_data.csv) together with the original data (train.csv), and describe the differences between them

# Write your code here
orgin_data = pd.read_csv('train.csv')
orgin_data.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

# Write your code here
clean_data = pd.read_csv('clear_data.csv')
clean_data.head(3)

	PassengerId	Pclass	Age	SibSp	Fare	Sex_female	Sex_male	Embarked_C	Embarked_S
0	0	3	22.0	1	7.2500	0	1	0	1
1	1	1	38.0	1	71.2833	1	0	1	0
2	2	3	26.0	0	7.9250	1	0	0	1

# Differences

The Survived column was removed (the label is kept separately)
The text features Sex and Embarked were converted to one-hot encoding
The Name, Ticket, and Cabin features of the original data were dropped
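
The one-hot step can be illustrated with a minimal sketch in plain Python. The helper name one_hot and the two sample rows are illustrative assumptions, not the course's preprocessing code; in pandas the same kind of result is usually obtained with pd.get_dummies.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category.

    `rows` is a list of dicts; `one_hot` is a hypothetical helper for illustration.
    """
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Two passengers whose Pclass, Sex and Embarked values match the head of train.csv.
passengers = [
    {"Pclass": 3, "Sex": "male", "Embarked": "S"},
    {"Pclass": 1, "Sex": "female", "Embarked": "C"},
]
encoded = one_hot(one_hot(passengers, "Sex"), "Embarked")
print(encoded[0])
```

Each category becomes its own 0/1 column, which is why Sex turns into Sex_female and Sex_male in the cleaned data.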

model building

With the data processed in the previous chapters, we now have modeling data; the next step is to choose an appropriate model.
Before selecting a model, we need to know whether the task is supervised or unsupervised learning.
The choice of model is determined partly by the task.
It also depends on the sample size and the sparsity of the features.
We usually start with a basic model as a baseline, then train other models for comparison, and finally choose the one with better generalization ability or performance.

[Thinking] Which differences between datasets cause a model to behave differently when it is fitted to the data?

# Answer
The number of samples, the number of features, the importance of features, the gap between sample distributions, noise, etc.

Task 1: Split the training set and test set

Here the dataset is split using the hold-out method

Separate the dataset into independent variables (features) and the dependent variable (label)
Split the training set and test set by proportion (typical test set proportions are 30%, 25%, 20%, 15%, and 10%)
Use stratified sampling
Set a random seed so the results are reproducible

[Thinking]

What are the methods for partitioning datasets?
Why use stratified sampling and what are the benefits?

# Answer

Methods for Partitioning Datasets

Hold-out method
- random partition
- stratified sampling
K-fold cross-validation
Bootstrapping

Stratified sampling makes each split preserve the class distribution of the original data
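
To make the idea concrete, here is a minimal from-scratch sketch of a stratified hold-out split. The function name stratified_split and the toy labels are assumptions made for illustration; this is not how sklearn implements its stratify option, only the same idea in a few lines.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so each class keeps roughly its original share."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # sample each class separately
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 negatives and 40 positives (made-up data).
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

Because every class is sampled separately, the 60/40 ratio of the full data is carried over to both the training part and the test part.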

Task Tip 1

The purpose of splitting the dataset is to evaluate how well the model generalizes to unseen data
The sklearn function for splitting a dataset is train_test_split
To view the function documentation in Jupyter Notebook, type train_test_split? and press Enter
Stratification and the random seed are both controlled through its parameters (stratify and random_state)

Extract the inputs that train_test_split() needs from clear_data.csv and train.csv

# Write your code here
X = clean_data
y = orgin_data.Survived

# Write your code here
from sklearn.model_selection import train_test_split

# Write your code here
?train_test_split

# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 11), (179, 11), (712,), (179,))

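[Thinking]

In which situations should the dataset not be split by simple random selection?

# Answer

1. The dataset is very large: random splitting may waste a lot of time and may also lower accuracy
2. Time-series data: split by contiguous segments to avoid data leakage
3. Imbalanced classes: resampling is used instead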
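
For the time-series case, a split by contiguous segments can be sketched as follows. The function name chronological_split and the "time" field are hypothetical names chosen for this example; the point is only that the most recent records are held out and nothing is shuffled.

```python
def chronological_split(records, test_size):
    """Hold out the most recent records as the test set, without shuffling."""
    ordered = sorted(records, key=lambda r: r["time"])
    n_test = round(len(ordered) * test_size)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Ten made-up observations with an integer timestamp.
records = [{"time": t, "value": t * t} for t in range(10)]
train, test = chronological_split(records, test_size=0.2)
print([r["time"] for r in test])
```

Every training record is older than every test record, so the model never sees information from the future during training.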

Task 2: Model Creation

Create a classification model based on a linear model (logistic regression)
Create tree-based classification models (decision trees, random forests)
Train each of these models and check its score on the training set and the test set
View the model parameters, and change their values to observe how the model changes

Hint

Logistic regression is a classification model, not a regression model; do not confuse it with LinearRegression
A random forest is an ensemble of decision trees designed to reduce the overfitting of individual decision trees
Linear models are in the sklearn.linear_model module
Random forests are in the sklearn.ensemble module (single decision trees are in sklearn.tree)

# Write your code here
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Write your code here
lr = LogisticRegression(max_iter=3000)
lr.fit(X_train, y_train)
lr.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 3000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

# Write your code here: check the scores on the training set and the test set
print("Training set score: {:.2f} ,Testing set score: {:.2f}".format(lr.score(X_train, y_train), lr.score(X_test, y_test)))

Training set score: 0.81 ,Testing set score: 0.80

# Write your code here
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

# Write your code here: check the scores on the training set and the test set
print("Training set score: {:.2f} ,Testing set score: {:.2f}".format(rfc.score(X_train, y_train), rfc.score(X_test, y_test)))

Training set score: 1.00 ,Testing set score: 0.82

[Thinking]

Why can linear models perform classification tasks, and what is the mathematical relationship behind them?
For multi-class problems, how does a linear model classify?

# Answer

The model computes a weighted sum of the input features, $s = w^Tx$, and then maps it to a value between 0 and 1 with an activation function (the sigmoid in logistic regression); that value is used for classification
A multi-class problem can be converted into multiple binary classification problems (for example, one-vs-rest)
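
The mapping from a weighted sum to a class can be sketched in a few lines. The weights, the bias term, and the function names below are assumptions for illustration (the formula in the answer omits the bias, and these weights are not fitted to the Titanic data).

```python
import math


def sigmoid(s):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def predict_proba_positive(weights, bias, x):
    """Probability of the positive class for one sample: sigmoid(w^T x + b)."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(s)


# Hypothetical weights for two features; not fitted to any real data.
weights, bias = [0.8, -1.2], 0.1
p = predict_proba_positive(weights, bias, x=[1.0, 0.5])
label = 1 if p >= 0.5 else 0  # the usual 0.5 cut-off turns the probability into a class
print(round(p, 3), label)
```

A score of 0 maps to exactly 0.5, so the sign of the weighted sum decides the class; for a one-vs-rest multi-class setup, one such scorer would be trained per class and the class with the highest score chosen.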

Task 3: Output model prediction results

Output the classification labels predicted by the model
Output the predicted probabilities for the different class labels

Tip 3

In sklearn, supervised models generally provide predict to output predicted labels; predict_proba outputs the label probabilities

# Write your code here
pred = lr.predict(X_test)
pred[:5]

array([0, 0, 0, 0, 1], dtype=int64)

# Write your code here
pred_proba = lr.predict_proba(X_test)
pred_proba[:5]

array([[0.92961653, 0.07038347],
       [0.95523041, 0.04476959],
       [0.8395754 , 0.1604246 ],
       [0.96137521, 0.03862479],
       [0.34067921, 0.65932079]])

[Thinking]

How does predicting the probability of a label help us?

# Answer
It tells us how confident the model is in its prediction.
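
One practical use of the probabilities is choosing a decision threshold other than 0.5. The sketch below reuses the five positive-class probabilities from the predict_proba output above; the helper name to_labels and the 0.15 cut-off are assumptions made for illustration.

```python
# Positive-class probabilities copied from the predict_proba output above.
proba_survived = [0.07038347, 0.04476959, 0.1604246, 0.03862479, 0.65932079]


def to_labels(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels at a chosen threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


default_labels = to_labels(proba_survived)        # 0.5 cut-off
lenient_labels = to_labels(proba_survived, 0.15)  # hypothetical lower cut-off
print(default_labels, lenient_labels)
```

With the 0.5 cut-off the labels match the predict() output shown earlier; lowering the cut-off flags the third passenger as well, trading precision for recall.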

Chapter 3 Model Construction and Evaluation - Evaluation

In the previous part we learned how to build models with the sklearn library and how to split the dataset. But how do we know whether a model is actually useful, so that we can trust the results it gives us? That is what the evaluation covered in this part is for.

Load the following libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
plt.rcParams['figure.figsize'] = (10, 6)  # set the output figure size

Task: Load data and split test and training sets

# Write your code here
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']

from sklearn.model_selection import train_test_split

# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Write your code here
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

model evaluation

The purpose of model evaluation is to estimate the generalization ability of the model.

K-fold cross-validation
Confusion matrix (confusion_matrix)
- Precision
- Recall
- F1 score
ROC curve

Task 1: Cross Validation

Evaluate the previous logistic regression model with 10-fold cross-validation
Calculate the mean of the cross-validation accuracy

# Hint: cross-validation
Image('Snipaste_2020-01-05_16-37-56.png')


Tip 4

The cross-validation functions in sklearn are in the sklearn.model_selection module

# Load the module
from sklearn.model_selection import cross_val_score

# K-fold cross-validation
lr = LogisticRegression(C=100, max_iter=1000)
scores = cross_val_score(lr, X_train, y_train, cv=10)
scores

array([0.8358209 , 0.7761194 , 0.82089552, 0.80597015, 0.85074627,
       0.86567164, 0.73134328, 0.85074627, 0.75757576, 0.6969697 ])

# Mean of the K-fold scores
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Average cross-validation score: 0.80

Thinking 4

What is the effect of using a larger number of folds k?

# Answer
It increases the computational cost, because the model is trained once per fold.
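
The cost can be seen from how the folds are built. Below is a minimal sketch of plain (unshuffled, unstratified) k-fold index generation; the function name k_fold_indices is an assumption for illustration, and sklearn's cross_val_score uses stratified folds for classifiers, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds, without shuffling."""
    # The first n_samples % k folds get one extra sample so every sample is tested once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


for k in (5, 10):
    folds = list(k_fold_indices(20, k))
    train_size = len(folds[0][0])
    print(f"k={k}: {len(folds)} model fits, {train_size} training samples per fit")
```

A larger k means more model fits, each on a larger share of the data, so the total training time grows while each validation fold gets smaller.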

Task 2: Confusion Matrix

Calculate the confusion matrix for a binary classification problem
Calculate precision, recall, and f-scores

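[Thinking] What is the confusion matrix of a binary classification problem? Understand this concept and know what tasks it is mainly used for.

# Answer
A confusion matrix is used to evaluate classifier performance; the general idea is to count how many times instances of class A are classified as class B.
Rows of the confusion matrix represent the actual classes, and columns represent the predicted classes.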

# Hint: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')


# Hint: how to compute accuracy, precision, recall, and F-score
Image('Snipaste_2020-01-05_16-39-27.png')


Tip 5

The confusion matrix function is in the sklearn.metrics module
The confusion matrix takes the true labels and the predicted labels as input
Precision, recall, and F-scores are available from the classification_report function

# Write your code here
from sklearn.metrics import confusion_matrix

# Write your code here
lr = LogisticRegression(C=100, max_iter=500)
lr.fit(X_train, y_train)

LogisticRegression(C=100, max_iter=500)

# Write your code here
pred = lr.predict(X_train)

# Write your code here
confusion_matrix(y_train, pred)

array([[350,  62],
       [ 71, 185]], dtype=int64)

from sklearn.metrics import classification_report

print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84       412
           1       0.75      0.72      0.74       256

    accuracy                           0.80       668
   macro avg       0.79      0.79      0.79       668
weighted avg       0.80      0.80      0.80       668
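
As a cross-check, the class 1 row of the report can be recomputed by hand from the four counts in the confusion matrix output above. This is a minimal sketch; it assumes the sklearn layout in which rows are actual classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion_matrix output above (rows: actual, columns: predicted).
tn, fp = 350, 62
fn, tp = 71, 185

precision = tp / (tp + fp)  # of the predicted positives, how many are truly positive
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these values agree with the class 1 row and the accuracy row of the report.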

[Thinking]

What should you pay attention to when implementing the confusion matrix yourself?

# Answer
The meaning and position of each value in the confusion matrix, that is, TN, FP, FN, and TP.
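
A hand-written version makes the positions explicit. The sketch below is a minimal binary implementation with made-up labels; the function name confusion_matrix_2x2 is an assumption, and the layout follows the row = actual, column = predicted convention described earlier.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] with rows = actual class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # labels must be 0 or 1
    return matrix


# Made-up labels for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
(tn, fp), (fn, tp) = confusion_matrix_2x2(y_true, y_pred)
print(tn, fp, fn, tp)
```

Swapping the two index positions would silently exchange FP and FN, which is the most common mistake when writing this by hand.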

Draw the PR curve

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, lr.decision_function(X_test))
plt.plot(precisions, recalls)
plt.xlabel("precision")
plt.ylabel("recall")
plt.grid();


def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.legend(loc=0)
    plt.xlabel("thresholds")
    plt.grid()

plot_precision_recall_vs_threshold(precisions, recalls, thresholds);


Task 3: ROC curve

Draw the ROC curve

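[Thinking] What is the ROC curve, and what problem does it exist to solve?

# From 《机器学习实战》  
**Receiver operating characteristic curve (ROC)**  
It plots the true positive rate (recall, sensitivity) against the false positive rate (FPR).  
The FPR is the ratio of negative instances that are incorrectly classified as positive. It equals 1 minus the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative, also called specificity.  


**Choosing between the ROC curve and the PR curve:**  
As a rule of thumb, prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives; otherwise, use the ROC curve.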

Tip 6

The ROC curve functions in sklearn are in the sklearn.metrics module
The larger the area under the ROC curve, the better

# Write your code here
from sklearn.metrics import roc_curve

# Write your code here
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
plt.grid();


Thinking 6

How do you draw the ROC curve for a multi-class problem?

# Answer
Draw one ROC curve for each class.

[Thinking] What information can you get from this ROC curve? What can this information do?

# Answer
The area under the ROC curve, which can guide model selection.
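
To show what the area summarizes, here is a from-scratch sketch that builds the ROC points by lowering the threshold one score at a time and then integrates with the trapezoidal rule. The function names and the four labelled scores are made up for illustration, and the sketch assumes there are no tied scores; in practice sklearn's roc_auc_score would be used instead.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    # Visit samples from the highest score to the lowest; assumes no tied scores.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))


# Made-up labels and scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
print(fpr, tpr, auc(fpr, tpr))
```

An area of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking, which is why the area is a convenient single number for comparing models.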

[Note] These are personal study notes only; for details, see the Datawhale community's open-source hands-on data analysis course.