Visualizing the data distributions of the training set and test set (runs as-is)

The code is attached below. You only need to change these two path lines:

# Raw strings (r'...') keep the backslashes in a Windows path literal
test = pd.read_csv(r'D:\wangyong\Wang\kaggle\year/test.csv')
train = pd.read_csv(r'D:\wangyong\Wang\kaggle\year/train.csv')

Here is the full source code:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
test = pd.read_csv(r'D:\wangyong\Wang\kaggle\year/test.csv')
train = pd.read_csv(r'D:\wangyong\Wang\kaggle\year/train.csv')

print(train.head())
print(train.describe())
print(test.describe())
key_train = train.keys()
key_test = test.keys()

print(key_train)
print(key_test)

# Skip the first column (assumed to be an ID) and plot every remaining
# test-set column against the training-set column of the same name.
for i in range(len(key_test) - 1):
    col = key_test[i + 1]
    plt.figure(figsize=(8, 4), dpi=150)
    # fill=True replaces the deprecated shade=True (seaborn >= 0.11)
    sns.kdeplot(train[col], color="red", fill=True, label="train")
    ax = sns.kdeplot(test[col], color="blue", fill=True, label="test")

    ax.set_xlabel(col)        # label the axis with the column actually plotted
    ax.set_ylabel("density")  # a KDE plot shows estimated density, not raw values
    ax.legend()
    plt.show()
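
The loop above assumes that the first column is an ID and that every other test-set column is numeric and also present in the training set. If that is not true for your data, a small helper can pick the columns by name and dtype instead. This is only a sketch: `shared_numeric_columns` and the toy frames are hypothetical and not part of the original post.

```python
import pandas as pd

def shared_numeric_columns(train, test, skip=()):
    """Return the numeric columns present in both frames, in train's order."""
    numeric_train = train.select_dtypes(include="number").columns
    numeric_test = set(test.select_dtypes(include="number").columns)
    return [c for c in numeric_train if c in numeric_test and c not in skip]

# Toy frames standing in for train.csv / test.csv (made-up data).
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "f1": [0.1, 0.2, 0.3],
    "name": ["a", "b", "c"],   # non-numeric, so it is ignored
    "target": [0, 1, 0],       # missing from the test set, so it is ignored
})
toy_test = pd.DataFrame({
    "id": [4, 5],
    "f1": [0.4, 0.5],
    "name": ["d", "e"],
})

print(shared_numeric_columns(toy_train, toy_test, skip=("id",)))  # ['f1']
```

The returned list can then drive the plotting loop in place of the positional index.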

This gives the general picture. The plots show that there is still a certain gap between the distribution of the training set and that of the test set.

This is a visual way to compare the distributions of the training set and the test set.
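
If you also want a number to go with the plots, one option (not used in the original post) is the two-sample Kolmogorov-Smirnov statistic: the largest vertical distance between the two empirical CDFs, ranging from 0 (identical samples) to 1 (no overlap). Below is a minimal NumPy sketch; the function name `ks_statistic` is hypothetical.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values, then sort so searchsorted can count points <= x.
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    # The maximum difference is always attained at one of the sample points.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

For one feature you could call `ks_statistic(train["some_column"], test["some_column"])`, where `some_column` is a placeholder for a numeric column in your data. A larger value means a bigger gap between the two distributions for that feature.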

Origin blog.csdn.net/weixin_53374931/article/details/131067487