Learning curve: learning_curve

Determine the cross-validated train and test scores for different training set sizes.

A learning curve is a tool for judging a trained model: by inspecting the plotted curve we can see at a glance what state the model is in, such as overfitting or underfitting.

The cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes are used to train the estimator, and a score is computed for each training subset size and for the test set. Afterwards, for each training subset size, the scores over all k runs are averaged.

sklearn.model_selection.learning_curve(estimator, X, y, *, groups=None, train_sizes=array([0.1, 0.325, 0.55, 0.775, 1.0]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=None, pre_dispatch='all', verbose=0, shuffle=False, random_state=None, error_score=nan, return_times=False, fit_params=None)
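A minimal sketch of a typical call (using the same digits dataset as the full example further below):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Evaluate five training-set sizes, from 10% to 100% of the available
# training data, with 5-fold cross-validation.
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(train_sizes)         # absolute numbers of training samples used
print(train_scores.shape)  # (n_ticks, n_cv_folds)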

Parameters
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.

X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,) or (n_samples, n_outputs), optional
Target relative to X for classification or regression; None for unsupervised learning.

groups : array-like, shape (n_samples,), optional
Group labels for the samples, used when splitting the dataset into training/test sets. Only used in conjunction with a group-aware cross-validation instance (e.g. GroupKFold).

train_sizes : array-like, shape (n_ticks,), dtype float or int
Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, the values are treated as fractions of the maximum size of the training set (determined by the selected cross-validation method), i.e. they must be within (0, 1]; otherwise they are interpreted as absolute sizes. Note that for classification the number of samples usually has to be large enough to contain at least one sample from each class. (default: np.linspace(0.1, 1.0, 5))
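For instance, both of the calls below are valid; the sizes in the comments are roughly what 5-fold cross-validation on the 1797-sample digits dataset would produce (a sketch, not from the original post):

from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Fractions of the maximum training-set size (must lie in (0, 1])...
sizes_frac, _, _ = learning_curve(GaussianNB(), X, y,
                                  train_sizes=[0.1, 0.5, 1.0], cv=5)

# ...or absolute numbers of training samples.
sizes_abs, _, _ = learning_curve(GaussianNB(), X, y,
                                 train_sizes=[100, 500, 1000], cv=5)

print(sizes_frac)  # e.g. [ 143  718 1437] -- fractions of the 1437 max training samples
print(sizes_abs)   # [ 100  500 1000]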

cv : int, cross-validation generator or iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:

●None, to use the default three-fold cross-validation (changed to five-fold in v0.22),
●an integer, to specify the number of folds in a (stratified) KFold,
●a CV splitter,
●an iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, StratifiedKFold is used if the estimator is a classifier and y is binary or multiclass. In all other cases, KFold is used.
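A few sketches of the accepted cv inputs (hypothetical settings, not from the original post):

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve, ShuffleSplit

X, y = load_digits(return_X_y=True)
est = SVC(gamma=0.001)

# An integer: the number of folds of a (Stratified)KFold.
_ = learning_curve(est, X, y, cv=5)

# A CV splitter object, as in the full example below.
_ = learning_curve(est, X, y,
                   cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

# None: the library default described above.
_ = learning_curve(est, X, y, cv=None)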

scoring : string, callable or None, optional, default: None
A string (see the model evaluation documentation) or a scorer callable object/function with signature scorer(estimator, X, y).

exploit_incremental_learning : boolean, optional, default: False
If the estimator supports incremental learning, this parameter will be used to speed up fitting different training set sizes.
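For instance, SGDClassifier implements partial_fit and so qualifies; a minimal sketch (assuming the incremental path reuses the fit from each smaller training-set size instead of refitting from scratch):

from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# SGDClassifier supports partial_fit, so each larger training-set size can
# continue from the previous fit rather than starting over.
train_sizes, train_scores, test_scores = learning_curve(
    SGDClassifier(random_state=0), X, y,
    cv=5, exploit_incremental_learning=True)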

n_jobs : int or None, optional (default=None)
Number of jobs to run in parallel. None means 1. -1 means use all processors. See the glossary for more details.

pre_dispatch : integer or string, optional
Number of pre-dispatched jobs for parallel execution (default is all). This option can reduce the allocated memory. The string can be an expression like "2*n_jobs".

verbose : integer, optional
Control verbosity: the higher, the more messages.

shuffle : boolean, optional
Whether to shuffle the training data before taking prefixes of it based on train_sizes.

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when shuffle is True.

error_score : 'raise' | 'raise-deprecating' or numeric
Value to assign to the score if an error occurs during estimator fitting. If set to 'raise', the error is raised. If set to 'raise-deprecating', a FutureWarning is printed before the error is raised. If a numeric value is given, a FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error. The default is 'raise-deprecating', but from version 0.22 it will change to np.nan.

Returns
train_sizes_abs : array, shape (n_unique_ticks,), dtype int
Number of training examples that have been used to generate the learning curve. Note that the number of ticks may be less than n_ticks since duplicate entries will be removed.

train_scores : array, shape (n_ticks, n_cv_folds)
Scores on the training sets.

test_scores : array, shape (n_ticks, n_cv_folds)
Scores on the test sets.
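The plotting code below averages these per-fold scores; a compact sketch of how the three return values fit together:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, test_scores = learning_curve(GaussianNB(), X, y, cv=5)

# train_scores and test_scores both have shape (n_ticks, n_cv_folds);
# averaging over the fold axis gives one curve point per training-set size.
for n, tr, te in zip(train_sizes,
                     np.mean(train_scores, axis=1),
                     np.mean(test_scores, axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  test={te:.3f}")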

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve # computes the learning curve
from sklearn.model_selection import ShuffleSplit # configures the cross-validation scheme
from time import time
import datetime

def plot_learning_curve(estimator, title, X, y,
                        ax,          # subplot to draw on
                        ylim=None,   # y-axis limits
                        cv=None,     # cross-validation scheme
                        n_jobs=None  # number of parallel jobs to use
                       ):
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y,
                                                            cv=cv, n_jobs=n_jobs)
    # train_sizes: number of training samples at each point on the curve
    # train_scores: scores on the training set
    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.grid() # draw a background grid (optional)
    ax.plot(train_sizes, np.mean(train_scores, axis=1), 'o-'
            , color="r",label="Training score")
    ax.plot(train_sizes, np.mean(test_scores, axis=1), 'o-'
            , color="g",label="Test score")
    ax.legend(loc="best")
    return ax
digits = load_digits()
X, y = digits.data, digits.target
X.shape # (1797, 64)
X # a sparse matrix: most entries are zero

title = ["Naive Bayes","DecisionTree","SVM, RBF kernel","RandomForest","Logistic"]
model = [GaussianNB(),DTC(),SVC(gamma=0.001)
         ,RFC(n_estimators=50),LR(C=.1,solver="lbfgs")]
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0) # n_splits: number of re-shuffled splits; each holds out 20% of the data as a test set

fig, axes = plt.subplots(1,5,figsize=(30,6))
for ind, title_, estimator in zip(range(len(title)),title,model):
    times = time()
    plot_learning_curve(estimator, title_, X, y,
                        ax=axes[ind], ylim = [0.7, 1.05],n_jobs=4, cv=cv)
    print("{}:{}".format(title_,datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f")))
plt.show()

[Figure: learning curves of the five models on the handwritten digits dataset]
The states shown by the five models are very interesting.

The first result we get is the running time of each algorithm. As you can see, the decision tree and Naive Bayes take roughly the same time (if you do not observe this, run the code a few more times; their running times tend to converge). The decision tree runs fast because sklearn's classification tree is "lazy" when selecting features: it does not compute the information entropy of every feature but randomly evaluates only a subset of them, so its speed is understandable. We also know, however, that the computational cost of a decision tree grows as the sample size increases, whereas Naive Bayes achieves good results on very few samples, so we can expect Naive Bayes to become faster than the decision tree as the sample size grows. The computation speed of Naive Bayes is far better than that of SVM or of complex models such as random forests; the running time of logistic regression is strongly affected by the maximum number of iterations and by the input data (logistic regression generally runs faster on linear data, but here it is presumably slowed by the sparse matrix). So in terms of computation time, Naive Bayes has a clear advantage.

Next, let's look at how well each algorithm fits the training set. The handwritten digits dataset is relatively simple: the decision tree, random forest, SVC and logistic regression all reach 100% training accuracy, while the training accuracy of Naive Bayes never exceeds about 95%. This supports what we said at the beginning: the classification performance of Naive Bayes is in fact weaker than that of other classifiers, and its inherent learning capacity is comparatively limited. Notice also that as the training sample size grows, the other models keep their training fit at the 100% level while the training accuracy of Naive Bayes gradually declines: the larger the sample size, the more Naive Bayes has to learn, and the worse its fit on the training set becomes. Conversely, a relatively small number of samples gives Naive Bayes a higher training accuracy.

Now let's look at the overfitting problem. At first glance, all the models are overfitting when the sample size is small (performing well on the training set but poorly on the test set); as the number of samples increases, the overfitting gradually disappears, but each model deals with it differently. The more powerful classifiers, such as SVM, random forest and logistic regression, alleviate overfitting by quickly improving their performance on the test set. The decision tree also reduces overfitting by improving its test-set performance, but its test score rises very slowly as the number of training samples increases. Naive Bayes is unique: it gradually resolves the overfitting by letting its training accuracy fall while its test accuracy rises.

Next, look at the results of each algorithm on the test set, that is, the size of the generalization error. As the number of training samples rises, the test performance of every model rises, but Naive Bayes and the decision tree perform far worse on the test set than SVM, random forest and logistic regression. When the amount of training data reaches about 1,500 samples, SVM's test performance is very close to 100%, random forest and logistic regression are above 95%, while the decision tree and Naive Bayes are still hovering around 85%. Yet the two models are in very different situations: although the decision tree's test score is not high, it still has potential, because its overfitting is severe; by pruning branches (sketched below) we could bring its test score closer to its training score. For Naive Bayes, by contrast, the overfitting has almost disappeared by about 1,500 training samples: its training and test scores are very close, and only with very few samples does the test score ever exceed the training score, so we can essentially conclude that roughly 85% is the limit of Naive Bayes on this dataset. We can predict that with parameter tuning the decision tree should eventually reach a prediction accuracy of about 90%, while Naive Bayes has almost no room left.
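As a rough sketch of that pruning idea (the parameter grid here is hypothetical, not from the original post), restricting tree depth and leaf size pulls the overfit training score down toward a hopefully higher test score:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Limiting max_depth and min_samples_leaf prunes the tree, trading training
# accuracy for (ideally) better generalization.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [8, 12, 16, None],
                                "min_samples_leaf": [1, 3, 5]},
                    cv=cv)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))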

From this comparison we can see that Naive Bayes is a very fast algorithm with only average classification performance, and that its results after the first training run are already very close to its limit, leaving almost no room for parameter tuning. In other words: if we want probability predictions that are as accurate as possible, we should prefer logistic regression; if the data are very complex, or form a sparse matrix, we stick firmly with Naive Bayes. If our classification goal is not accurate probability estimates, we can first try Gaussian Naive Bayes (it runs very fast anyway and does not need many samples); if it works well, we are lucky and have a model that is both fast and good, and if not, we can switch to a more complex model.


Origin blog.csdn.net/qq_45694768/article/details/121090321