Model Selection and Evaluation in Python Machine Learning

1. Introduction

In a machine learning system, how to train a better model, how to evaluate its performance, and how to tell whether it is overfitting all have a direct bearing on how well the model ultimately works in practice.

2. Model fitting effect

When a machine learning model is trained, three situations can arise: underfitting, normal fitting, and overfitting. Underfitting and overfitting are both undesirable. The following sections introduce how to judge, from several angles, which situation a trained model falls into.

2.1 Underfitting and overfitting performance

The three fitting situations of the data are explained as follows.

  • Underfitting: the model fails to learn the useful patterns in the training data, so it predicts poorly on both the training data and the data to be predicted. Using too few training samples makes it more likely that the trained model will underfit.
  • Normal fitting: the trained model learns from the training set a model with strong generalization ability and small prediction error, and it also predicts well on the data to be tested, giving satisfactory results.
  • Overfitting: the model fits a particular data set so precisely that it does not generalize to other data or predict future observations well. An overfitted model has low bias but high variance.

The following shows how a trained model and its training data look under these fitting situations, first for classification problems and then for regression problems.

For a binary classification problem, the decision boundary of the trained model can be drawn together with the training data, as in the figure below.
[Figure: decision boundaries on a binary classification problem under underfitting, normal fitting, and overfitting]

As the figure shows, the underfitted model is too simple, so its prediction error is large. The overfitted model is the opposite: its decision boundary classifies all the training data correctly, but the model is too complex; although it predicts the training data 100% correctly, it has a high error rate on new test data. The normally fitted model lies between the two: training yields a less complex model, which preserves generalization ability on the test set. The prediction error on the training set therefore follows underfitting > normal fitting > overfitting, while the prediction error on the test set follows underfitting > overfitting > normal fitting.

For regression problems, where a continuous variable is predicted, the three fitting situations are shown in the figure below: underfitting, normal fitting, and overfitting of a curve to a set of continuous observations.
[Figure: underfitting, normal fitting, and overfitting on a regression problem]
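As a concrete illustration of the regression case, the sketch below (an illustrative example, not taken from the original post) fits polynomials of degree 1, 4, and 15 to noisy samples of a sine curve; the low-degree model underfits, the middle one fits normally, and the high-degree one overfits.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))                        # 30 noisy samples of a sine curve
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
x_plot = np.linspace(0, 1, 200)

plt.figure(figsize=(9, 3))
for i, degree in enumerate([1, 4, 15]):                   # underfit / normal fit / overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x[:, None], y)
    plt.subplot(1, 3, i + 1)
    plt.scatter(x, y, s=15)
    plt.plot(x_plot, model.predict(x_plot[:, None]), "r")
    plt.ylim(-1.5, 1.5)
    plt.title("degree=" + str(degree))
plt.tight_layout()
plt.show()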

In many cases, with high-dimensional data it is difficult to visualize a classifier's decision boundary or a regression model's predictions, so how can the fit of the model be judged? Two approaches are commonly used. The first is to compare the prediction error on the training set with that on the test set: a normally fitted model has similar, low prediction error on both and predicts well; an underfitted model predicts poorly on both; an overfitted model has a small prediction error on the training set but a large prediction error on the test set. The second is to plot, over the course of training, the loss on the training data and on the test (or validation) data for the three fitting situations, as shown in the figure below.
[Figure: training and validation loss curves during training for underfitting, normal fitting, and overfitting]
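A minimal sketch of the first approach, comparing training and cross-validated accuracy as the amount of training data grows, using scikit-learn's learning_curve (an illustrative example with a decision tree, not code from the original post):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
# Training and cross-validated scores for increasing amounts of training data
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training accuracy")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="validation accuracy")
plt.xlabel("number of training samples")
plt.ylabel("accuracy")
plt.legend()
plt.show()
# A large, persistent gap between the two curves suggests overfitting;
# two curves that are close together but both low suggest underfitting.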

2.2 Methods to avoid underfitting and overfitting

In practice, if the trained model is found to underfit or overfit the data, the model usually needs to be adjusted. Solving these problems is a complicated process and often requires several rounds of adjustment. Some commonly used methods are introduced below.

1. Increase the amount of data
If the training data are scarce, the model may underfit, and overfitting to the training set can also occur. More training samples usually make the model more stable: increasing the number of training samples not only yields more effective training but also improves the fit of the model to some extent and enhances its generalization ability. When the available training samples are limited, data augmentation techniques can be used to expand the existing data set, as in the sketch below.
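A minimal sketch of one simple form of augmentation for tabular data, jittering existing samples with small Gaussian noise (an illustrative assumption using the Iris data; image data would typically use flips, crops, rotations, and similar transforms instead):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Create 2 noisy copies of every sample; the noise scale is a small fraction of each feature's std
noise_scale = 0.05 * X.std(axis=0)
X_aug = np.vstack([X] + [X + rng.normal(scale=noise_scale, size=X.shape) for _ in range(2)])
y_aug = np.concatenate([y] * 3)

print(X.shape, "->", X_aug.shape)   # (150, 4) -> (450, 4)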

2. Reasonable data splitting
When training a model on an existing data set, the data can be divided into a training set, a validation set, and a test set (or cross-validation can be used instead). After the split, the training set is used to fit the model, the validation set is used to monitor the learning process so that training can be stopped before the network overfits, and, once training has finished, the test set is used to assess the generalization ability of the result.
Of course, while making sure as far as possible that the splits come from the same distribution, how to split the data effectively also matters. The traditional split uses a ratio of roughly 60:20:20, but the appropriate ratio depends on the amount of data. In the era of big data, when a data set contains millions or even hundreds of millions of samples, the traditional 60:20:20 division is no longer suitable. A better choice is to use about 98% of the data for training, so that as many samples as possible are used to fit the model, 1% for the validation set, which already contains enough samples to monitor whether the model is overfitting, and the final 1% to test the generalization ability of the network. The split ratio should therefore be chosen according to the size of the data set and the number of model parameters, as in the sketch below.
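A minimal sketch of a 60:20:20 split obtained by applying scikit-learn's train_test_split twice (the ratios, random_state, and use of the Iris data are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set (20%), then split the remainder into train (60%) and validation (20%)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)  # 0.25 of 80% = 20%

print(len(X_train), len(X_val), len(X_test))   # 90 30 30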

3. Regularization method
Regularization is a means of combating model overfitting. It adds a norm penalty on the trainable parameters to the loss function, and this penalty constrains the parameters during training so that the model is less likely to overfit. The commonly used penalties are the L1 and L2 norms: the L1 penalty drives the sum of the absolute values of the parameters to be small, while the L2 penalty drives the sum of their squares to be small. Using regularization to prevent overfitting is very effective. In the classical linear regression model, using L1 regularization gives LASSO regression, and using L2 regularization gives Ridge regression; both methods are introduced in later chapters.
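A minimal sketch of the two regularized linear models as implemented in scikit-learn (the dataset and alpha values are illustrative; a larger alpha means stronger regularization):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso(alpha=0.5).fit(X_train, y_train)   # L1 penalty: can shrink some coefficients to exactly 0
ridge = Ridge(alpha=0.5).fit(X_train, y_train)   # L2 penalty: shrinks coefficients toward 0

print("LASSO R^2:", lasso.score(X_test, y_test), "non-zero coefficients:", (lasso.coef_ != 0).sum())
print("Ridge R^2:", ridge.score(X_test, y_test))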

3. Example analysis

3.1 Iris data set

The Sklearn machine learning package ships with a variety of built-in datasets, among them the very commonly used Iris dataset. It contains 150 samples of iris plants from three species: Iris-setosa, Iris-versicolor, and Iris-virginica. Each sample has 4 feature variables, the length and width of the sepals and of the petals, and 1 categorical variable, the species label.
[Figure: preview of the Iris dataset]
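A minimal sketch of loading the dataset from Sklearn and inspecting its size, feature names, and class names:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 samples, 4 features
print(iris.feature_names)    # sepal length/width and petal length/width, in cm
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']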

3.2 Classifying the iris data

## Output high-resolution figures
## Handle the display of Chinese characters in figures
import matplotlib
matplotlib.rcParams['axes.unicode_minus'] = False
import seaborn as sns  ## Set the plotting theme
sns.set(font="Kaiti", style="ticks", font_scale=1.4)

import pandas as pd  # Set the maximum display width of each cell in a data table
pd.set_option("max_colwidth", 100)
import numpy as np
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from mlxtend.plotting import plot_decision_regions
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

## Data preparation: load the Iris dataset
X, y = load_iris(return_X_y=True)
## To simplify visual analysis, reduce the data to a two-dimensional space
pca = PCA(n_components=2, random_state=3)
X = pca.fit_transform(X)
# Visualize the distribution of the data after dimensionality reduction
plt.figure(figsize=(4, 3))
sns.scatterplot(x=X[:, 0], y=X[:, 1], style=y)
plt.title("Iris Dimension Reduction")
plt.legend(loc="lower right")
plt.grid()
plt.show()

## Classify the Iris dataset using KFold cross-validation
kf = KFold(n_splits=6, random_state=1, shuffle=True)
datakf = kf.split(X, y)
## Obtain the 6 folds of the data
## Classify the data with the linear discriminant analysis algorithm
LDA_clf = LinearDiscriminantAnalysis(n_components=2)
scores = []
## Used to store the accuracy on each test fold
plt.figure(figsize=(7, 4))
for ii, (train_index, test_index) in enumerate(datakf):
    # Train the model on the training data of this fold
    LDA_clf = LDA_clf.fit(X[train_index], y[train_index])
    # Compute the prediction accuracy on the test data of this fold
    prey = LDA_clf.predict(X[test_index])
    acc = metrics.accuracy_score(y[test_index], prey)
    ## Visualize each model's decision regions on its training data
    plt.subplot(2, 3, ii + 1)
    plot_decision_regions(X[train_index], y[train_index], LDA_clf)
    plt.title("Test Acc:" + str(np.round(acc, 4)))
    scores.append(acc)

plt.tight_layout()
plt.show()
# Compute the mean accuracy
print("Mean Acc:", np.mean(scores))

Result:
[Figure: scatter plot of the Iris data after PCA dimensionality reduction]
[Figure: decision regions and test accuracy of the LDA classifier on each of the 6 folds]
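For comparison, the same kind of mean accuracy can be estimated more compactly with cross_val_score, which is already imported at the top of the script (a sketch, reusing the PCA-reduced X and a fresh KFold with the same settings so the folds match):

# A more compact alternative to the manual fold loop above
kf2 = KFold(n_splits=6, random_state=1, shuffle=True)
cv_scores = cross_val_score(LinearDiscriminantAnalysis(n_components=2), X, y, cv=kf2)
print("Mean Acc:", cv_scores.mean())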

Origin: blog.csdn.net/wokaowokaowokao12345/article/details/128444410