Iris dataset classification with random forest (feature-pair traversal + overfitting analysis)


The basics of random-forest classification on the iris dataset were covered in an earlier post: https://blog.csdn.net/weixin_42567027/article/details/107488666. Here we traverse every pair of features in the dataset directly and analyze the overfitting behavior.

Data set


Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.ensemble import RandomForestClassifier


def iris_type(s):
    # Map class-name strings to integer labels (an alternative to the pd.Categorical call below)
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

# Feature names: sepal length, sepal width, petal length, petal width
iris_feature = u'sepal length', u'sepal width', u'petal length', u'petal width'

if __name__ == "__main__":
    mpl.rcParams['font.sans-serif'] = [u'SimHei']  # SimHei font for CJK text (FangSong/KaiTi also work)
    mpl.rcParams['axes.unicode_minus'] = False

    data = pd.read_csv(r'F:\pythonlianxi\shuju\iris.data', header=None)  # local copy of iris.data
    x_prime = data[range(4)]  # the first four columns are the features
    y = pd.Categorical(data[4]).codes  # encode class names as integers 0/1/2

    feature_pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
    plt.figure(figsize=(16, 9), facecolor='#FFFFFF')
    for i, pair in enumerate(feature_pairs):
        # Prepare the data for this feature pair
        x = x_prime[pair]

        # Random forest: 200 decision trees, maximum depth 3
        clf = RandomForestClassifier(n_estimators=200, criterion='entropy', max_depth=3)
        clf.fit(x, y.ravel())

        # 画图
        N, M = 50, 50  # number of sampling points along each axis
        x1_min, x2_min = x.min()
        x1_max, x2_max = x.max()
        t1 = np.linspace(x1_min, x1_max, N)
        t2 = np.linspace(x2_min, x2_max, M)
        x1, x2 = np.meshgrid(t1, t2)  # generate the grid of sampling points
        x_test = np.stack((x1.flat, x2.flat), axis=1)  # test points

        # Predictions on the training set
        y_hat = clf.predict(x)
        y = y.reshape(-1)
        c = np.count_nonzero(y_hat == y)    # number of correct predictions
        print('Features: ', iris_feature[pair[0]], ' + ', iris_feature[pair[1]], end='\t')
        print('\tCorrect predictions:', c, end='\t')
        print('\tAccuracy: %.2f%%' % (100 * float(c) / float(len(y))))

        # Display the decision regions
        cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
        y_hat = clf.predict(x_test)  # predicted classes on the grid
        y_hat = y_hat.reshape(x1.shape)  # reshape to match the grid
        plt.subplot(2, 3, i+1)
        plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # predicted class regions
        plt.scatter(x[pair[0]], x[pair[1]], c=y, edgecolors='k', cmap=cm_dark)  # training samples
        plt.xlabel(iris_feature[pair[0]], fontsize=18)
        plt.ylabel(iris_feature[pair[1]], fontsize=18)
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.grid()
    plt.tight_layout(pad=2.5)
    plt.subplots_adjust(top=0.92)
    plt.suptitle(u'Random forest classification results for each pair of iris features', fontsize=18)
    plt.show()

Experiment analysis

When the features are combined two at a time, the final classification results differ from pair to pair, which shows that some features discriminate between the classes poorly; selecting the more informative features therefore yields the best recognition performance.
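
As a rough sketch of how that feature selection could be automated (not part of the original post; it reuses the local file path and the RandomForestClassifier settings from the code above), the fitted forest's feature_importances_ attribute can rank the four features:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv(r'F:\pythonlianxi\shuju\iris.data', header=None)
x_all = data[[0, 1, 2, 3]]                 # all four features
y_all = pd.Categorical(data[4]).codes      # integer class labels

# Fit one forest on all four features and rank them by learned importance.
clf = RandomForestClassifier(n_estimators=200, criterion='entropy', max_depth=3)
clf.fit(x_all, y_all)

feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print('%-14s importance: %.3f' % (name, imp))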

Overfitting analysis

Previous experiments show that the recognition rate varies with the depth of the decision trees. An appropriate depth gives good recognition, but too large a depth also causes overfitting and makes the algorithm less robust.
When growing each random decision tree, the random forest algorithm uses bootstrap sampling, so not all samples are used to build a given tree. The unused samples are called out-of-bag (OOB) samples (the oob data set); on average roughly 1/e ≈ 37% of the training samples are out of bag for each tree. These samples can be used to evaluate the accuracy of that tree, every other tree can be evaluated on the same principle, and the results are finally averaged.
oob_score=True: use the oob data set as a validation set to estimate the generalization ability of the algorithm.
oob_score is False by default, in which case the oob data set is not used for validation.

Modify the code as follows to analyze the overfitting phenomenon.

clf = RandomForestClassifier(n_estimators=200, criterion='entropy', max_depth=3, oob_score=True)
clf.fit(x, y.ravel())
print(clf.oob_score_, end='\t')  # out-of-bag accuracy estimate

When max_depth=3, there is no obvious overfitting.
When max_depth=5, the feature pair sepal length + sepal width begins to show overfitting.
When max_depth=10, the feature pair sepal length + sepal width overfits severely.
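
A minimal sketch of how this depth comparison can be reproduced (not from the original post; the depths and the sepal length + sepal width pair come from the observations above, and the file path follows the earlier code): train one forest per max_depth with oob_score=True and compare the training accuracy with the OOB score, where a widening gap indicates overfitting.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv(r'F:\pythonlianxi\shuju\iris.data', header=None)
x = data[[0, 1]]                      # sepal length + sepal width
y = pd.Categorical(data[4]).codes

for depth in (3, 5, 10):
    clf = RandomForestClassifier(n_estimators=200, criterion='entropy',
                                 max_depth=depth, oob_score=True)
    clf.fit(x, y)
    train_acc = np.count_nonzero(clf.predict(x) == y) / len(y)
    # A large gap between training accuracy and the OOB estimate suggests overfitting.
    print('max_depth=%2d  train accuracy: %.2f%%  oob score: %.2f%%'
          % (depth, 100 * train_acc, 100 * clf.oob_score_))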


Origin: https://blog.csdn.net/weixin_42567027/article/details/107526826