Background: While checking the distributions of a dataset, I plotted them with seaborn's kdeplot and found that some of the curves came out jagged or pulse-shaped, which was puzzling at first glance.
A real case from my own data exploration (Spark environment):
from scipy.stats import kstest
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Define kde_plot: overlay the KDE of one feature from two datasets (kdeplot uses a Gaussian kernel by default)
def kde_plot(dataset_1, dataset_2, feature_num):
    plot_01 = np.array(dataset_1.select('features').collect())[:, 0, feature_num]
    plot_02 = np.array(dataset_2.select('features').collect())[:, 0, feature_num]
    sns.kdeplot(plot_01, legend=True)
    sns.kdeplot(plot_02, legend=True)

# Visualize the first 11 features on a 3x4 grid
plt.figure(figsize=[16, 9])
for i in range(1, 12):
    plt.subplot(3, 4, i)
    kde_plot(train, test, i - 1)
Running the code above produces the following plots:
The penultimate plot looked abnormal. This was surprising, because a KS test run beforehand had found the feature distributions of the two datasets to be consistent.
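The KS check itself is not shown in this post. As a rough sketch of what such a check might look like, the snippet below uses `scipy.stats.ks_2samp` (the two-sample variant, swapped in here for the imported `kstest` as an assumption) on synthetic arrays. The names `train_feature`, `test_feature` and `shifted_feature` are hypothetical stand-ins, not the real Spark columns.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical stand-ins for one feature column of the train and test sets
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
test_feature = rng.normal(loc=0.0, scale=1.0, size=500)

# Null hypothesis: both samples are drawn from the same distribution
statistic, p_value = ks_2samp(train_feature, test_feature)
print(f"same source:    KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")

# For contrast, a sample whose mean is shifted by one standard deviation
shifted_feature = rng.normal(loc=1.0, scale=1.0, size=500)
shifted_stat, shifted_p = ks_2samp(train_feature, shifted_feature)
print(f"shifted source: KS statistic = {shifted_stat:.4f}, p-value = {shifted_p:.2e}")
```

A large p-value only means the test found no evidence of a difference; it does not prove the two distributions are identical, which is why a visual check is still worthwhile.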
To find out why, I inspected the data for the penultimate feature and found that its values vary widely. To verify that this was the cause, I deliberately constructed a dataset with a large gap between max and min:
Sure enough, a pulse-like curve appeared instead of a smooth density curve.
Therefore, for a visual comparison of discrete data like this, a plain histogram is probably the more reliable choice:
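The constructed dataset is not included in the post, so the following is only a minimal sketch of how the effect might be reproduced numerically. It assumes a hypothetical feature that takes a few discrete levels spread over a wide range, and uses `scipy.stats.gaussian_kde` with its default bandwidth rather than seaborn, so the exact curve may differ from the one kdeplot draws.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical feature: a few discrete levels with a huge gap between min and max
levels = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
sample = rng.choice(levels, size=5000)

# Gaussian kernels with scipy's default (Scott's rule) bandwidth
kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 2001)
density = kde(grid)

# Compare the density in the empty gap around x = 500 with the highest peak
mid_density = density[np.argmin(np.abs(grid - 500.0))]
print(f"peak density     = {density.max():.3e}")
print(f"density near 500 = {mid_density:.3e}")
```

The estimated density collapses to almost zero between the clusters of values, so the curve consists of isolated pulses separated by flat stretches instead of one smooth hump.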
# Pull feature index 9 (the penultimate of the 11 features plotted above) from both datasets
x = np.array(train.select('features').collect())[:, 0, 9]
y = np.array(test.select('features').collect())[:, 0, 9]
plt.hist(x, bins=30)
plt.hist(y, bins=30);
The histograms show that the two distributions are indeed consistent, although some regions do not line up. The main reason is that the total amount of data is small, so the test set (orange) is left with relatively few samples after the split.
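When the two sets differ in size, raw counts are hard to compare bar by bar. One possible way to reduce this, sketched below on synthetic data, is to give both histograms the same bin edges and normalize each to unit area. The arrays `train_x` and `test_x` and the 80/20 split are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: an 80/20 split of the same discrete-valued feature
train_x = rng.integers(0, 10, size=8000).astype(float)
test_x = rng.integers(0, 10, size=2000).astype(float)

# Shared bin edges so that both histograms are directly comparable
edges = np.histogram_bin_edges(np.concatenate([train_x, test_x]), bins=30)

# density=True rescales each histogram to integrate to 1,
# which removes the effect of the different sample sizes
train_density, _ = np.histogram(train_x, bins=edges, density=True)
test_density, _ = np.histogram(test_x, bins=edges, density=True)

max_gap = np.max(np.abs(train_density - test_density))
print(f"largest per-bin density gap: {max_gap:.4f}")
```

The same idea carries over to plotting: `plt.hist` accepts `bins=edges` and `density=True`, and an `alpha` below 1 keeps the overlapping bars visible.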
In addition, experimenting with other constructed datasets shows that pulse-like or jagged curves also appear when the proportions of a feature's values differ greatly, such as:
Here the output is not only far from smooth, but even the ratio of 0s to 1s is distorted. Cases like this are better visualized with a histogram (hist).
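For a binary feature, the ratio can also be read directly from value counts, with no smoothing involved. The sketch below assumes a hypothetical 0/1 feature in which about 2% of the rows are 1; the name `feature` and the 2% rate are illustrative, not taken from the original data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical binary feature where the value 1 is rare (about 2% of rows)
feature = (rng.random(10000) < 0.02).astype(int)

# Exact proportions computed from counts
values, counts = np.unique(feature, return_counts=True)
proportions = counts / counts.sum()

for value, proportion in zip(values, proportions):
    print(f"value {value}: {proportion:.2%}")
```

These proportions are exactly what a bar chart or a two-bin histogram would display, so they make a useful reference when a KDE curve looks suspicious.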