How to solve the problem of inconsistent distribution of training set and test set

First, why should the distribution of the training set and test set be consistent

When the distributions of the training set and the test set are inconsistent, the data set is said to have shifted (dataset shift), and this affects the final predictions of the model. Perhaps future, more robust and smarter models and algorithms will even need this kind of inconsistent division to strengthen their robustness and real-world effectiveness, but for now it has a relatively large impact on model training: it leads to lower scores when predictions on the test set are submitted, and therefore to unsatisfactory results.

There are two main causes (among various possible ones):

  1. Sample selection bias: the training set was obtained in a biased way, for example by non-uniform selection, so that it cannot represent the real sample space well.
  2. Environmental imbalance (different situations at different times or places): this occurs when the environment in which the training data were collected differs from that of the test set, usually because of changes in time or space.

For example, in classification tasks (image classification, NLP classification, etc.), the random split of the data set, or the original data set itself, may not take class balance into account: one class may heavily dominate the training set but not the test set, or a class that is only a small minority of the training set may make up most of the test set, and so on. This kind of sample selection bias makes the trained model less robust on the test set, because the training set does not cover the whole sample space well. Besides the target variable, the input features can also suffer from sample selection bias. For example, when predicting the survival of Titanic passengers, if the "gender" feature in the training set is mostly male while in the test set it is mostly female, the model will also perform poorly on the test set. Bias introduced through the input features is mainly due to insufficient handling of the data in the feature engineering stage, and it leads to low scores and poor robustness on the test set.

As for dataset shift caused by environmental imbalance, it mainly stems from various objective and subjective factors during data collection, which make the collected data inconsistent.

Second, how to judge whether the distributions are consistent

  1. KDE (Kernel Density Estimation) distribution plot
    KDE is a nonparametric method for estimating an unknown probability density function. By plotting histograms or kernel density estimates of the same feature in the training set and the test set, you can visually compare the two distributions.

# Main function used: sns.kdeplot()

Sample code:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create sample features (two bivariate normal samples; only the first component is used)
train_mean, train_cov = [0, 2], [(1, .5), (.5, 1)]
test_mean, test_cov = [0, .5], [(1, .6), (.6, 1)]
train_feat, _ = np.random.multivariate_normal(train_mean, train_cov, size=50).T
test_feat, _ = np.random.multivariate_normal(test_mean, test_cov, size=50).T

# Plot the two KDEs to compare the distributions
sns.kdeplot(train_feat, fill=True, color='r', label='train')
sns.kdeplot(test_feat, fill=True, color='b', label='test')
plt.xlabel('Feature')
plt.legend()
plt.show()
  2. KS test
    The Kolmogorov-Smirnov (KS) test is also a nonparametric test (it makes no assumption about the underlying distribution). It compares the CDFs (cumulative distribution functions) of two samples to test whether the two data sets follow the same distribution.

Example:
[Figure: CDF curves of the two data sets]
As shown in the figure above, the two curves are the CDFs of two different data sets, and the maximum vertical difference between them can be used to describe the difference between the distributions.

# Main function used: scipy.stats.ks_2samp() [returns the KS statistic (the maximum vertical difference) and the p-value of the hypothesis test]

Code:

from scipy import stats
stats.ks_2samp(train_feat, test_feat)

When the KS statistic is small and the p-value is large, we accept the null hypothesis H0 that the two data distributions are the same; when the p-value is below 0.01, the null hypothesis is rejected.

  3. Adversarial validation
    We build a classifier to distinguish training-set samples from test-set samples. If the classifier can separate them clearly, there is a clear difference between the training set and the test set (that is, their distributions are inconsistent).

(1) Merge the training set and the test set, and add a label 'Is_Test' that marks training samples as 0 and test samples as 1.

(2) Train a classifier (e.g. LightGBM or XGBoost) on the merged data set (cross-validation can be used), fitting the target label 'Is_Test'.

(3) Report the cross-validated AUC. The larger the AUC (the closer to 1), the more inconsistent the distributions of the training set and the test set; an AUC near 0.5 means the classifier cannot tell them apart.
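
A minimal sketch of this procedure, assuming hypothetical DataFrames train_df and test_df that share the same feature columns (with the original target column already dropped):

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Merge train and test, label the origin of each row
data = pd.concat([train_df, test_df], axis=0, ignore_index=True)
data['Is_Test'] = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

X = data.drop(columns=['Is_Test'])
y = data['Is_Test']

# Cross-validated AUC of the train-vs-test classifier
clf = lgb.LGBMClassifier(n_estimators=200)
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
print(f'Adversarial validation AUC: {auc:.3f}')  # near 0.5 => distributions look alike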

Third, how to solve the inconsistent distribution of data sets

  1. Construct a suitable validation set (one whose distribution is similar to that of the test set)

(1) Manually divide the validation set
Take time series as an example. Since the test set usually consists of future data, we must make sure that the training set contains historical data and the held-out validation set contains future data; otherwise "time travel" data leakage occurs and the model overfits (for example, using the future to predict the past). Two ways of dividing the validation set can be used for reference:

a. TimeSeriesSplit provided by sklearn: expanding-window splits in which each validation fold always comes after its training fold in time.

b. Fixed sliding-window split: fix a time window and slide it over the data set to obtain successive training and validation sets.
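
A short illustration of the first method, sklearn's TimeSeriesSplit, on toy time-ordered data:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # toy time-ordered samples

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # each validation fold lies strictly after its training fold in time,
    # so the model never trains on the "future"
    print(f'fold {fold}: train 0-{train_idx.max()}, val {val_idx.min()}-{val_idx.max()}')
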
For data other than time series, validation-set division should follow the same principle: match the data pattern of the test set as closely as possible.

(2) Select the samples most similar to the test set as the validation set

In adversarial validation we train a classifier to distinguish the training set from the test set, so we can also predict, for each training sample, the probability that it belongs to the test set (its predicted probability under the 'Is_Test' label). Sort the training samples by this probability in descending order and take the top 20% with the highest probability as the validation set; this yields a validation set whose distribution is close to that of the test set. Afterwards we can evaluate how close the divided validation set and the test set really are. **Evaluation method:** run adversarial validation on the validation set versus the test set; the smaller the AUC, the closer their distributions (that is, the harder it is for the classifier to distinguish the validation set from the test set).
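
A sketch of this selection step, reusing clf, X, y and train_df from the adversarial-validation sketch above (all hypothetical names):

import numpy as np
from sklearn.model_selection import cross_val_predict

# Out-of-fold probability that each merged row belongs to the test set
oof_proba = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
train_proba = oof_proba[:len(train_df)]            # keep only the original training rows

n_val = int(0.2 * len(train_df))                   # top 20% most test-like samples
val_idx = np.argsort(train_proba)[::-1][:n_val]
valid_part = train_df.iloc[val_idx]                # validation set close to the test distribution
train_part = train_df.drop(train_df.index[val_idx])
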
(3) Weighted cross-validation

If we give larger sample weights to training samples whose distribution is closer to that of the test set, and smaller weights to training samples that are inconsistent with the test distribution, we can, to some extent, obtain offline evaluation scores with less jitter. The Dataset constructor of the lightgbm library provides a sample-weight parameter weight; see the lightgbm documentation [8] for details. The 'Is_Test' probability of each training sample predicted by the adversarial-validation classifier can be used directly as the weight.
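
A sketch of passing these weights to lightgbm, assuming the hypothetical names feature_cols and 'target' and the train_proba array from the sketch above:

import lightgbm as lgb

# Larger weight = the sample looks more like the test set
dtrain = lgb.Dataset(
    train_df[feature_cols],
    label=train_df['target'],
    weight=train_proba,
)
params = {'objective': 'binary', 'metric': 'auc', 'verbosity': -1}
cv_result = lgb.cv(params, dtrain, num_boost_round=200, nfold=5)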

  2. Remove features with inconsistent distributions

When a feature has an inconsistent distribution and is also of low importance, we can simply delete it.

For features whose distribution is inconsistent but which are very important, we need to weigh feature importance against distribution shift according to the actual situation and handle them case by case.
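
One simple way to flag such features is a per-feature KS test, sketched below with the hypothetical feature_cols list of numeric columns:

from scipy import stats

# Flag features whose train/test distributions differ strongly
inconsistent = []
for col in feature_cols:
    ks_stat, p_value = stats.ks_2samp(train_df[col], test_df[col])
    if p_value < 0.01:                       # reject "same distribution" at the 1% level
        inconsistent.append((col, ks_stat))

# Candidates for removal are the flagged features that are also unimportant to the model
print(sorted(inconsistent, key=lambda t: -t[1]))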

  3. Fix prediction outputs with inconsistent distributions

Examine the distribution of the target variable to see whether there is room for correction; that is, apply a correction to the model's predictions after training so that they better match the expected target distribution, which can give better and more efficient results.
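
As an illustration only (not the author's specific method), one very simple form of such correction is rescaling the predictions toward a mean assumed to hold on the test set; model and feature_cols are hypothetical names:

import numpy as np

preds = model.predict_proba(test_df[feature_cols])[:, 1]
expected_mean = 0.35                                  # assumed/estimated test-target mean
corrected = np.clip(preds * (expected_mean / preds.mean()), 0, 1)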

  4. Fix feature inputs with inconsistent distributions

When comparing the KDE (kernel density) plots of the training set and the test set, if we find that a mathematical transformation of the data (addition, subtraction, multiplication, division, etc.) or adding/removing samples can bring the two distributions close together, then such preprocessing can be applied to correct the data and train a better model.
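
A hedged example of this kind of correction, using a made-up skewed column name 'amount' and a log transform, then re-checking the KDE plots:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# If the feature differs mainly in scale/skew, a simple transform may realign it
train_fixed = np.log1p(train_df['amount'])
test_fixed = np.log1p(test_df['amount'])

sns.kdeplot(train_fixed, fill=True, label='train (log1p)')
sns.kdeplot(test_fixed, fill=True, label='test (log1p)')
plt.legend()
plt.show()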

  5. Pseudo-labeling

Pseudo-labeling is a semi-supervised method that utilizes unlabeled data for training.

General methods and steps:
(1) Use the labeled training set to train the model M;

(2) Then use the model M to predict the unlabeled test set;

(3) Select samples with high prediction confidence in the test set to join the training set;

(4) Use labeled samples and high-confidence predicted samples to train the model M';

(5) Predict the test set and output the prediction results.
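
A minimal sketch of these steps for a binary task, with the hypothetical names train_df, test_df, feature_cols and 'target':

import pandas as pd
import lightgbm as lgb

model = lgb.LGBMClassifier(n_estimators=200)                     # step (1): train M
model.fit(train_df[feature_cols], train_df['target'])

proba = model.predict_proba(test_df[feature_cols])[:, 1]         # step (2): predict the test set
confident = (proba > 0.95) | (proba < 0.05)                      # step (3): keep high confidence only

pseudo = test_df.loc[confident, feature_cols].copy()
pseudo['target'] = (proba[confident] > 0.5).astype(int)

augmented = pd.concat([train_df[feature_cols + ['target']], pseudo],
                      ignore_index=True)
model2 = lgb.LGBMClassifier(n_estimators=200)                    # step (4): train M'
model2.fit(augmented[feature_cols], augmented['target'])

final_pred = model2.predict_proba(test_df[feature_cols])[:, 1]   # step (5): final predictions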

Training the model this way introduces some test-set samples, which is equivalent to introducing part of the test-set distribution. A few caveats: (1) Compared with the previous methods, pseudo-labeling usually does not perform very well, because the test samples it introduces are the high-confidence ones, and those are likely to be close to the training distribution already (which is why their predicted probabilities are high). The introduced test distribution is therefore not very different, and overfitting often occurs in practice. (2) Only introduce high-confidence samples; low-confidence samples bring in a lot of noise. It is also not advisable to add too many high-confidence samples to the training set, again to avoid overfitting. (3) Pseudo-labeling is used somewhat more often in the image domain.



This article is shared only as a learning record; it will be removed upon request if it infringes.


Origin blog.csdn.net/qq_53250079/article/details/128385744