Performing the KS test on the training, test, and validation sets in PySpark

Purpose of the KS test: verify that the data distributions of the splits are consistent.

After the data is split into training, test, and validation sets, their feature distributions may differ, which introduces unnecessary error into model training.

Run the KS test via the scipy.stats library:

(Null hypothesis: the two datasets being compared follow the same distribution; significance level α = 0.05)
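Before bringing Spark into the picture, a minimal standalone sketch of what `ks_2samp` computes, using plain NumPy arrays drawn from the same distribution (sample sizes and seed are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)  # sample from N(0, 1)
b = rng.normal(0, 1, 500)  # second sample from the same distribution

# ks_2samp returns the KS statistic (max distance between the two
# empirical CDFs) and the p-value of the two-sample test
stat, p = ks_2samp(a, b)

# A high p-value means we cannot reject the hypothesis
# that a and b come from the same distribution.
print('KS statistic: %.4f, p-value: %.4f' % (stat, p))
```

Comparing an array against itself gives a statistic of 0 and a p-value of 1, the degenerate "identical distributions" case.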

import numpy as np
from pyspark.sql.functions import col
from scipy.stats import ks_2samp

# Spark environment: df_new is an existing DataFrame
data1 = df_new.select(col('label').alias('label'),
                      col('features').alias('features'))
# Split the dataset
train, test, validation = data1.randomSplit([0.6, 0.2, 0.2], seed=10)

# Pick the pair of splits and the column to compare, then convert each
# pyspark DataFrame column to a 1-D array so ks_2samp can be applied
set_1, set_2, col_name = train, test, 'label'
set_1_array = np.array(set_1.select(col_name).collect())[:, 0]
set_2_array = np.array(set_2.select(col_name).collect())[:, 0]

ks_stat, p_value = ks_2samp(set_1_array, set_2_array)

print('The KS statistic for %s is: %.8f' % (col_name, ks_stat))
print('The P value for %s is: %.8f' % (col_name, p_value))

Output:

The KS statistic for label is: 0.04651163
The P value for label is: 0.99999955
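The snippet above compares one pair of splits on one column. To check all three splits against each other, a small helper can run the test on every pair. This is a sketch on NumPy arrays (in the Spark setting each array would come from `.collect()` as shown above; the helper name and synthetic data are illustrative, not from the original post):

```python
from itertools import combinations
import numpy as np
from scipy.stats import ks_2samp

def pairwise_ks(splits, alpha=0.05):
    """Run ks_2samp on every pair of named 1-D arrays.

    Returns {(name_a, name_b): (statistic, p_value, consistent)},
    where consistent=True means we cannot reject H0 at level alpha.
    """
    results = {}
    for (na, a), (nb, b) in combinations(splits.items(), 2):
        stat, p = ks_2samp(a, b)
        results[(na, nb)] = (stat, p, p > alpha)
    return results

# Synthetic stand-in for the collected Spark columns
rng = np.random.default_rng(10)
data = rng.exponential(1.0, 1000)
splits = {'train': data[:600], 'test': data[600:800], 'validation': data[800:]}

for pair, (stat, p, ok) in pairwise_ks(splits).items():
    print('%s vs %s: KS=%.4f, p=%.4f, consistent=%s' % (*pair, stat, p, ok))
```

With three splits this produces three comparisons (train/test, train/validation, test/validation), so a single inconsistent split is easy to spot.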

From the above results, p_value > α, so we cannot reject the null hypothesis that the two datasets follow the same distribution. The splits therefore retain consistent distribution characteristics, and the trained model is unlikely to suffer unnecessary error from uneven feature distributions.


Origin blog.csdn.net/weixin_45281949/article/details/105302704