[Python things] Preparing data: the training set and the test set

Getting ready

Data must be carefully prepared before it is fed to a machine learning algorithm. Splitting the data into a training set and a test set with consistent class distributions is important for building a successful classification model. Continuing with the iris dataset, 80% of the records go into the training set and the remaining 20% are used as the test set.

How to do it

# Import the necessary libraries
from sklearn.datasets import load_iris
import numpy as np
from sklearn.model_selection import train_test_split

# Load the iris dataset
def get_iris_data():
    data = load_iris()
    x = data['data']    # instances (feature values)
    y = data['target']  # class labels
    # Append the class labels to the instances as an extra column
    input_dataset = np.column_stack([x, y])
    # Shuffle the data so that records end up randomly distributed
    # between the training set and the test set
    np.random.shuffle(input_dataset)
    return input_dataset

# Print the size of the original dataset and of the train/test split
def print_data(input_dataset):
    train_size = 0.80           # 80% of the records form the training set
    test_size = 1 - train_size  # the remaining 20% form the test set
    train, test = train_test_split(input_dataset, train_size=train_size)
    print('Compare Data Set Size')
    print('==========================')
    print('Original Dataset size: {}'.format(input_dataset.shape))
    print('Train size: {}'.format(train.shape))
    print('Test size: {}'.format(test.shape))
    return train, test, test_size

# Define the get_class_distribution function. It takes the class labels y as its
# argument and returns a dictionary whose keys are the class labels and whose
# values are the fraction of the records that belong to each class. Calling it
# on the training set and on the test set lets us compare their class
# distributions.
def get_class_distribution(y):
    d = {}
    set_y = set(y)
    for y_label in set_y:
        no_elements = len(np.where(y == y_label)[0])
        d[y_label] = no_elements
    dist_percentage = {class_label: count / (1.0 * sum(d.values())) for class_label, count in d.items()}
    return dist_percentage
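
As a quick sanity check (an illustrative snippet, assuming the imports and the get_class_distribution function above are in scope), calling it on the full iris target vector should report roughly one third per class, since iris holds 50 records for each of its three classes:

y_all = load_iris()['target']
print(get_class_distribution(y_all))
# expected: {0: 0.3333..., 1: 0.3333..., 2: 0.3333...}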

# Define the print_class_label_split function, which takes the training set and
# the test set as its arguments.
def print_class_label_split(train, test):
    # Print the class distribution of the training set
    y_train = train[:, -1]
    train_distribution = get_class_distribution(y_train)
    print('\n Train data set class label distribution')
    for k, v in train_distribution.items():
        print('Class label = %d, percentage records = %.2f' % (k, v))

    # Print the class distribution of the test set
    y_test = test[:, -1]
    test_distribution = get_class_distribution(y_test)
    print('\n Test data set class label distribution')
    for k, v in test_distribution.items():
        print('Class label = %d, percentage records = %.2f' % (k, v))

if __name__ == '__main__':
    train, test, test_size = print_data(input_dataset=get_iris_data())
    print_class_label_split(train, test)

output:

Compare Data Set Size
==========================
Original Dataset size: (150, 5)
Train size: (120, 5)
Test size: (30, 5)

 Train data set class label distribution
Class label = 0, percentage records = 0.33
Class label = 1, percentage records = 0.35
Class label = 2, percentage records = 0.32

 Test data set class label distribution
Class label = 0, percentage records = 0.33
Class label = 1, percentage records = 0.27
Class label = 2, percentage records = 0.40

Analysis of the results: if you look closely at this output, you will notice that the class label distribution of the training set does not match that of the test set. 40% of the instances in the test set belong to class 2, while only 32% of the instances in the training set do. This shows that the splitting approach used so far is not suitable, because the class distributions in the training set and the test set should be the same.

Next, let's see how to evenly distribute the class labels in the training and test sets:

if __name__ == '__main__':
    # Import StratifiedShuffleSplit
    from sklearn.model_selection import StratifiedShuffleSplit

    input_dataset = get_iris_data()
    train, test, test_size = print_data(input_dataset)
    print_class_label_split(train, test)

    # Build the StratifiedShuffleSplit object. n_splits=1 asks for a single
    # split and test_size defines the size of the test set. Stratifying on the
    # class labels (the last column) keeps the class proportions identical in
    # the training set and the test set.
    stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=test_size)

    for train_indx, test_indx in stratified_split.split(input_dataset[:, :-1], input_dataset[:, -1]):
        train = input_dataset[train_indx]
        test = input_dataset[test_indx]

    print_class_label_split(train, test)

output:

 Train data set class label distribution
Class label = 0, percentage records = 0.33
Class label = 1, percentage records = 0.33
Class label = 2, percentage records = 0.33

 Test data set class label distribution
Class label = 0, percentage records = 0.33
Class label = 1, percentage records = 0.33
Class label = 2, percentage records = 0.33

Now the class distributions of the training and test sets are the same!
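
As a side note, recent versions of scikit-learn can also produce this kind of stratified split directly from train_test_split through its stratify parameter. A minimal sketch, assuming get_class_distribution from above is in scope:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
# Passing the labels to stratify preserves the class proportions
# in both the training set and the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y)
print(get_class_distribution(y_train))
print(get_class_distribution(y_test))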
