Using the KS test to check whether the training set and test set follow the same distribution

Why the KS test was introduced: when using an SVM for a classification problem, I found that accuracy and other metrics on the test set were much higher than on the training set. After some analysis, I suspected that the distributions of the training set and the test set might be inconsistent, so I wanted to check the two data distributions with a KS test.

What the KS test is: the Kolmogorov–Smirnov test (KS test for short) is a non-parametric hypothesis test in statistics, used to check whether a single sample follows a given distribution, or whether two samples follow the same distribution.

How to use the KS test:

     Single sample:
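A minimal sketch with scipy.stats.kstest; the data here is illustrative, drawn from a standard normal distribution and then tested against the reference distribution 'norm':

import numpy as np
from scipy import stats

# illustrative sample: 1000 points drawn from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)

# test whether data follows the standard normal distribution
statistic, pvalue = stats.kstest(data, 'norm')
print(statistic, pvalue)   # large p-value: cannot reject that data is N(0, 1)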

 

      Two samples:
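A minimal sketch with scipy.stats.ks_2samp; data1 and data2 stand for the two samples to compare, for example the same feature taken from the training set and the test set (the random data below is only for illustration):

import numpy as np
from scipy import stats

data1 = np.random.normal(0, 1, 500)   # e.g. a feature column from the training set
data2 = np.random.normal(0, 1, 300)   # the same feature column from the test set

# test whether the two samples come from the same distribution
statistic, pvalue = stats.ks_2samp(data1, data2)
print(statistic, pvalue)   # large p-value: cannot reject that they are identically distributed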

Note: data1 and data2 must be one-dimensional arrays, not lists. Convert a list to an array with np.array(); flatten a multi-dimensional array into a one-dimensional array with .flatten(). For instance (feature_list below is a hypothetical nested list):
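import numpy as np

feature_list = [[1.2, 3.4], [5.6, 7.8]]
arr = np.array(feature_list)    # list -> ndarray with shape (2, 2)
flat = arr.flatten()            # -> one-dimensional array of length 4, usable in ks_2samp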

KS test return values: the KS test generally returns two values. The first is the KS statistic, the maximum distance between the two distributions; the smaller it is, the smaller the gap between the two distributions and the more consistent they are. The second is the p-value, used to decide the outcome of the hypothesis test: the larger the p-value, the less evidence there is to reject the null hypothesis (that the two samples are identically distributed), i.e. the more likely it is that the two distributions are the same.

     example:
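A hedged sketch of such a check over every feature column, assuming X_train and X_test are 2-D NumPy arrays with matching columns (the names, the random data, and the 0.05 threshold are all illustrative):

import numpy as np
from scipy import stats

# hypothetical train/test feature matrices with matching columns
X_train = np.random.normal(0, 1, size=(800, 5))
X_test = np.random.normal(0, 1, size=(200, 5))

for col in range(X_train.shape[1]):
    statistic, pvalue = stats.ks_2samp(X_train[:, col], X_test[:, col])
    # p < 0.05: reject the null hypothesis, i.e. this feature is distributed
    # differently in the training set and the test set
    print("feature %d: statistic=%.4f, p=%.4f" % (col, statistic, pvalue))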

Another application of the KS statistic: judging whether a binary classification model separates positive and negative samples well.
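A hedged sketch of this idea, assuming we have a fitted binary classifier and compare the predicted scores of the positive samples with those of the negative samples (the model and data below are only illustrative):

import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# illustrative binary classification task and model
X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # predicted probability of the positive class

# KS test between the score distributions of the positive and negative samples;
# the statistic is the maximum gap between their cumulative distributions
statistic, pvalue = stats.ks_2samp(scores[y == 1], scores[y == 0])
print(statistic, pvalue)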

Output result: the larger the returned KS statistic (the maximum distance between the cumulative score distributions of the positive and negative samples), the better the model separates the two classes.

 

 

 


Origin blog.csdn.net/weixin_58222015/article/details/129176914