Summary of solutions to training/test set distribution inconsistencies

Reference articles:

1. The distributions of the training set and the test set are different - Zhihu

2. Summary of solutions to training/test set distribution inconsistency - Zhihu

Different situations call for different handling.

First, a large distribution gap means a large gap in the values of the important features. For example, one set ranges over 0-1 while the other ranges over 0.5-2. In this case there is no real modeling fix; all you can do is expand the training set, because once the important features deviate, even if the ranges overlap mathematically, the actual contexts behind them are very different. This usually traces back to non-technical issues such as data collection processes and specifications.
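Before deciding which of the cases below applies, it helps to quantify how far each feature has drifted. A minimal sketch, assuming numeric features stored in pandas DataFrames and using the per-feature two-sample Kolmogorov-Smirnov statistic as a drift score (the function name and the 0.2 threshold are only illustrative):

```python
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(train_df: pd.DataFrame, test_df: pd.DataFrame,
                     threshold: float = 0.2) -> dict:
    """Score each shared numeric feature with the KS statistic and
    return the ones whose train/test distributions differ strongly."""
    drifted = {}
    for col in train_df.columns.intersection(test_df.columns):
        stat, _ = ks_2samp(train_df[col].dropna(), test_df[col].dropna())
        if stat > threshold:  # illustrative cut-off, tune per problem
            drifted[col] = stat
    return drifted
```

Features that show up here and are also important to the model fall under this first case; features that drift but matter little fall under the second case below.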

Second, the gap in the important features is small, but the gap in the less important features is large. In this case those features can be shielded (dropped), or, borrowing an idea similar to transfer learning, the feature inputs can be constrained so that they do not deviate too much from the training set; a simple recipe for the first option is sketched below.
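One simple way to shield such features is to drop those that drift strongly but contribute little to the model. A sketch combining the drift scores from the previous snippet with feature importances from any fitted model that exposes `feature_importances_` (the names and thresholds are hypothetical):

```python
def features_to_drop(drift_scores: dict, importances: dict,
                     drift_thr: float = 0.2, imp_thr: float = 0.01) -> list:
    """Features that shift strongly between train and test but carry
    little predictive importance are candidates for removal."""
    return [feat for feat, drift in drift_scores.items()
            if drift > drift_thr and importances.get(feat, 0.0) < imp_thr]

# Example usage (model and frames are assumed to exist):
# importances = dict(zip(X_train.columns, model.feature_importances_))
# drop_cols = features_to_drop(drifted_features(X_train, X_test), importances)
# X_train = X_train.drop(columns=drop_cols)
```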

Third, the feature values themselves do not differ much, but the correlation statistics between features do. For example, A and B are more strongly correlated in the training set, while A and C are more strongly correlated in the test set. This matters a lot for the model itself: high-order feature combinations need to be constrained, so a model that learns such combinations freely, such as a DNN, is not a wise choice in the early stage.
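To check whether this third case applies, the pairwise correlation matrices of the training and test sets can be compared directly. A rough sketch, again assuming numeric pandas DataFrames (the function name is illustrative):

```python
import numpy as np
import pandas as pd

def correlation_shift(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """Report the feature pair whose Pearson correlation changes most
    between the training set and the test set."""
    diff = (train_df.corr() - test_df.corr()).abs()
    # Zero the diagonal so self-correlations are ignored.
    diff = diff.where(~np.eye(len(diff), dtype=bool), 0.0)
    pair = diff.stack().idxmax()
    return pair, diff.loc[pair[0], pair[1]]
```

A large shift for pairs involving important features is a warning sign against models that rely heavily on learned feature interactions.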

Fourth, neither the feature values nor the feature correlations differ much, but the gap in the target values is too large. This is easy to handle: change the task to predict a common intermediate target. For example, the target can be taken as a relative value, such as a growth rate or a Sharpe ratio, rather than as an absolute value.
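For time-ordered targets, switching from the absolute level to a period-over-period growth rate is often enough to bring the training and test targets back onto a comparable scale. A minimal sketch assuming the target is a pandas Series ordered in time:

```python
import pandas as pd

def to_growth_rate(target: pd.Series) -> pd.Series:
    """Turn absolute target values into period-over-period growth rates,
    a relative quantity that is more comparable across time periods."""
    return target.pct_change().dropna()

# y_train_rel = to_growth_rate(y_train)  # the model predicts growth, not level
```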


Origin: blog.csdn.net/ytusdc/article/details/128515236