Handling imbalanced samples in machine learning

Sample imbalance often leads to two problems:

1. The model overfits the majority class: samples are almost always assigned to the class with the larger number of training examples;

2. A typical related problem is the Accuracy Paradox: the model achieves high accuracy on the sample data, yet its generalization ability is poor.

The reason is that the model simply classifies most samples into the majority class.
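
To make the Accuracy Paradox concrete, here is a minimal sketch (the 99:1 class ratio is made up for illustration) of a "model" that always predicts the majority class and still scores 99% accuracy:

```python
import numpy as np

# Made-up labels: 99 negatives, 1 positive (a 99:1 imbalance).
y_true = np.array([0] * 99 + [1])

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99 -- looks great, yet no positive sample is ever detected
```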

For imbalanced samples, there are several common solutions:

  1. Gather more data
  2. Change the evaluation metric
  3. Resample the data
  4. Synthesize new samples
  5. Change sample weights

Gather more data

Collecting more data can bring the ratio of positive and negative samples back into balance. In practice this is often the most overlooked approach, yet when the cost of collecting more data is low, it is also the most effective one.

Note, however, that when the scenario generating the data produces imbalanced data by nature, collecting more of it will not fix the imbalanced ratio.

Change the evaluation metric

Changing the evaluation metric means no longer judging and selecting models by accuracy, because of the Accuracy Paradox described above. There are metrics designed specifically for evaluating models on imbalanced data, such as precision, recall, the F1 score, ROC (AUC), and Kappa.

As often noted, the ROC curve has the useful property of being insensitive to the class ratio, so it reflects the quality of a classifier more faithfully when the samples are imbalanced.
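
As an illustration, these metrics can all be computed with scikit-learn; a minimal sketch, where the label and score arrays are made up for the example:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, cohen_kappa_score)

# Made-up imbalanced ground truth, hard predictions, and predicted scores.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))  # uses scores, not labels
print("kappa:    ", cohen_kappa_score(y_true, y_pred))
```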

Resample the data

Resampling changes the class ratio of the data directly. There are two approaches: over-sampling and under-sampling. Over-sampling increases the number of samples in the minority class, in the simplest case by directly copying existing samples; under-sampling reduces the number of samples in the majority class by discarding some of them.

Generally speaking, consider under-sampling when the total number of samples is large, and over-sampling when it is small.
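
Both strategies amount to resampling indices. Below is a minimal numpy sketch (the function name and the 0/1 labels are our assumptions; the imbalanced-learn library offers ready-made RandomOverSampler and RandomUnderSampler classes for the same purpose):

```python
import numpy as np

def random_resample(X, y, minority=1, mode="over", seed=None):
    """Balance a binary dataset by random over- or under-sampling."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    if mode == "over":
        # Over-sampling: draw minority samples with replacement
        # until the minority matches the majority count.
        extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
        keep = np.concatenate([maj_idx, extra])
    else:
        # Under-sampling: discard majority samples at random
        # until the majority matches the minority count.
        kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        keep = np.concatenate([kept_maj, min_idx])
    keep = rng.permutation(keep)
    return X[keep], y[keep]

# Example: 95 majority vs. 5 minority samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)
X_res, y_res = random_resample(X, y, mode="over", seed=0)
print(np.bincount(y_res))  # [95 95]
```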

Synthesize new samples

Synthesizing samples (Synthetic Samples) also increases the number of minority-class samples, but here "synthesizing" means combining the features of existing samples to generate new ones.

One of the simplest approaches is to randomly pick a value for each feature from the existing minority samples and splice the picks together into a new sample. This increases the size of the minority class with the same effect as the over-sampling described above; the difference is that over-sampling simply copies samples, whereas here new samples are obtained by splicing.

A representative method of this kind is SMOTE (Synthetic Minority Over-sampling Technique), which randomly selects similar (nearest-neighbor) samples from the minority class and interpolates between their features to splice together new samples.
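
Here is a minimal sketch of the SMOTE idea in numpy (the function name and parameters are ours, not the reference implementation; imbalanced-learn ships a full version as imblearn.over_sampling.SMOTE):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest neighbors.
    Assumes X_min holds at least two minority samples."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # Each sample's k nearest neighbors (column 0 is the sample itself).
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
    new = np.empty((n_new, d))
    for i in range(n_new):
        a = rng.integers(n)                                  # base sample
        b = neighbors[a, rng.integers(neighbors.shape[1])]   # random neighbor
        gap = rng.random()                                   # in [0, 1)
        # Splice: move a random fraction of the way toward the neighbor.
        new[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return new

# Example: 20 minority points in 2-D, synthesize 30 new ones.
X_min = np.random.default_rng(0).normal(size=(20, 2))
print(smote_sketch(X_min, n_new=30, seed=0).shape)  # (30, 2)
```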

Change sample weights

Changing sample weights means increasing the weight of the minority class: when such a sample is misclassified, its loss is multiplied by the corresponding weight, so the classifier pays more attention to the minority class.
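
In scikit-learn, for instance, many estimators expose a class_weight parameter that implements exactly this idea; a minimal sketch (the labels and the 1:10 ratio are made up):

```python
from sklearn.linear_model import LogisticRegression

# "balanced" weights each class inversely proportional to its frequency,
# so misclassifying a minority sample contributes more to the loss.
clf = LogisticRegression(class_weight="balanced")

# An explicit mapping also works: here errors on class 1 cost 10x as much.
clf = LogisticRegression(class_weight={0: 1, 1: 10})
```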

Origin www.cnblogs.com/tianqizhi/p/12156216.html