Seaborn kdeplot: jagged and pulse-shaped curves

Background: While checking the distribution of a data set, I plotted it with kdeplot and got jagged, pulse-shaped curves, which were puzzling at first glance.

A real case from my own data exploration (Spark environment):

from scipy.stats import kstest
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Define kde_plot: the KDE uses a Gaussian kernel (gau) by default; plot the distribution of one feature for two data sets
def kde_plot(dataset_1,dataset_2,feature_num):
    
    plot_01 = np.array(dataset_1.select('features').collect())[:,0,feature_num]
    plot_02 = np.array(dataset_2.select('features').collect())[:,0,feature_num]
    
    sns.kdeplot(plot_01,legend=True)
    sns.kdeplot(plot_02,legend=True)

# Visualization
plt.figure(figsize=[16,9])

for i in range(1,12):
    plt.subplot(3,4,i)
    kde_plot(train,test,i-1)

Running the code above produces the following plots:

The second-to-last plot looks abnormal. This was unexpected, because a KS test run beforehand had indicated that the feature distributions of the two data sets were consistent.

The second-to-last plot is abnormal
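The KS check mentioned above is not shown in this post. A minimal sketch of what such a check might look like follows, assuming `scipy.stats.ks_2samp` (the two-sample variant; the code above only imports `kstest`) and plain NumPy arrays in place of the collected Spark columns. The helper name `same_distribution`, the 0.05 threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def same_distribution(sample_a, sample_b, alpha=0.05):
    """Two-sample KS test; returns (statistic, p_value, consistent).

    `consistent` is True when the test fails to reject the hypothesis
    that both samples come from the same distribution at level `alpha`.
    """
    result = ks_2samp(sample_a, sample_b)
    return result.statistic, result.pvalue, bool(result.pvalue > alpha)


# Hypothetical stand-ins for one feature column of train and test.
rng = np.random.default_rng(0)
train_col = rng.normal(size=500)
test_col = rng.normal(size=200)

stat, p_value, consistent = same_distribution(train_col, test_col)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}, consistent={consistent}")
```

Note that a large p-value only means the test found no evidence of a difference. It does not prove that the two distributions are identical.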

To find out why, I looked at the data for the second-to-last feature and found that its values span a very wide range. To verify that this causes the problem, I deliberately built a data set with a large gap between max and min:

make_list = [1.22,1.23,1.25,2.34,5.33,2.11,2,0.001,100.55,1000]
make_array = np.array(make_list)
sns.kdeplot(make_array);

Pulse-like curve

Sure enough, a pulse-like curve appeared instead of a smooth density curve.
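One way to see why the curve turns into pulses is to compute a Gaussian KDE by hand. The sketch below is plain NumPy, not seaborn's implementation, and the bandwidth of 2.0 is an assumed value chosen for illustration; the bandwidth that `kdeplot` actually picks depends on the seaborn version and its settings. With a kernel that narrow relative to a data range of about 1000, each outlier gets its own isolated spike and the density in between is essentially zero.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    Each sample contributes a normal bump with standard deviation
    `bandwidth`; the estimate is the average of those bumps.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)


make_list = [1.22, 1.23, 1.25, 2.34, 5.33, 2.11, 2, 0.001, 100.55, 1000]
grid = np.linspace(-50.0, 1050.0, 2201)  # grid step of 0.5

# Assumed bandwidth, small compared with the range of the data.
density = gaussian_kde_1d(make_list, grid, bandwidth=2.0)

# Compare the density near the samples with the empty region around 500.
for x in (2.0, 100.5, 500.0, 1000.0):
    idx = np.argmin(np.abs(grid - x))
    print(f"x={grid[idx]:7.1f}  density={density[idx]:.3e}")
```

Passing a much larger `bandwidth` to the same function merges the spikes into broad bumps, which is the trade-off any bandwidth choice has to make on data like this.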

Therefore, for a visual comparison of data that is spread this widely, a plain histogram is probably the more reliable choice:

x = np.array(train.select('features').collect())[:,0,9]
y = np.array(test.select('features').collect())[:,0,9]

plt.hist(x,bins=30)
plt.hist(y,bins=30);

Histogram visualization

The histograms show that the two distributions are indeed broadly consistent, although some regions do not line up. The main reason is that the data set is small overall, so the test set (orange) contains few samples after the split.
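Because the test set is smaller, raw counts make the two histograms hard to compare. A possible refinement, sketched here under the assumption that both columns are plain NumPy arrays, is to give both histograms the same bin edges and normalise each one to unit area. The arrays `x` and `y` below are synthetic stand-ins, not the data from this post.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=800)  # stand-in for the train column
y = rng.exponential(scale=2.0, size=200)  # stand-in for the smaller test column

# Shared edges so that bin i covers the same interval in both histograms.
edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=30)

# density=True rescales the counts so that each histogram integrates to 1.
x_density, _ = np.histogram(x, bins=edges, density=True)
y_density, _ = np.histogram(y, bins=edges, density=True)

widths = np.diff(edges)
print("area under x histogram:", float(np.sum(x_density * widths)))
print("area under y histogram:", float(np.sum(y_density * widths)))
```

The same `edges` array can be passed to `plt.hist(x, bins=edges, density=True, alpha=0.5)`, and likewise for `y`, so that the overlapping bars stay visible.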

In addition, by constructing other test cases, I found that pulse-like or jagged curves also appear when the values of a feature occur in very unequal proportions, for example:

# When the ratio of 0s to 1s is fairly close to 1:1
make_list03 = [1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0]
make_array = np.array(make_list03)
sns.kdeplot(make_array);

Visualization results:

Smooth distribution curve

# When there are clearly more 0s than 1s
make_list03 = [0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0]
make_array = np.array(make_list03)
sns.kdeplot(make_array);

Visualization results:

Jagged curve

Here the curve is not only jagged; it also misrepresents the ratio of 0s to 1s. Cases like this are better visualized with a histogram.
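For a feature that takes only a few distinct values, the proportions can also be read off directly, without any smoothing. A minimal sketch using `np.unique` on the imbalanced list from above (the helper name `value_proportions` is made up for this example):

```python
import numpy as np


def value_proportions(values):
    """Return {value: share of samples} for a discrete feature."""
    uniques, counts = np.unique(np.asarray(values), return_counts=True)
    total = float(counts.sum())
    return {u.item(): float(c) / total for u, c in zip(uniques, counts)}


make_list03 = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for value, share in value_proportions(make_list03).items():
    print(f"value={value}  share={share:.3f}")
```

These shares can then be drawn as bars, for example with `plt.bar`, which shows the actual ratio of 0s to 1s.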


Origin blog.csdn.net/weixin_45281949/article/details/105310108