The problem of imbalanced sample sets

2019-08-27 11:01:52

Problem Description: In binary classification, if the positive and negative samples in the training set are severely imbalanced, say at a ratio of 1:1000 or worse, how should the data be handled so that a better model can be trained?

Solution:

Why does severely imbalanced training data cause problems for many classification models? The root cause is that the objective function optimized during training is inconsistent with the evaluation criterion that matters at test time. One source of this mismatch is that the test distribution does not match the training distribution: for example, positive and negative samples are severely imbalanced during training, yet at test time their ratio is close to 1:1. Another source is that the relative importance of the classes differs between training and testing: for example, training treats every sample's contribution as equal, while at test time the accuracy required on certain samples is much higher.

Given the causes above, sample imbalance can generally be handled from two angles.

  • Data-based methods

The core of data-based methods is to resample the original, imbalanced training set so that it becomes balanced. Concretely, there are two options: the first is oversampling, the second is undersampling.

Oversampling: oversampling essentially expands the minority class so that its sample count grows. Oversampling algorithms should be understood as heuristics: they are intuitively reasonable, but no oversampling algorithm comes with an absolute guarantee about its effect.

1) The simplest approach is to draw samples from the minority set with replacement, which in essence duplicates minority data. The disadvantage is that no information is added to the training set, so it easily leads to overfitting.

2) The SMOTE algorithm: for each sample x in the minority set, randomly pick a sample y from among its K nearest neighbors within the minority set, then choose a random point on the line segment connecting x and y as a newly synthesized sample (see the sketch after this list).

3) Add small perturbations or noise to minority samples, or transform them (for image datasets: cropping, flipping, rotating, changing the lighting, etc.) to construct new samples.
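
As an illustration of item 2), here is a minimal NumPy sketch of SMOTE-style interpolation. The function name `smote_sketch` and its parameters are mine, chosen for this example; it sketches the idea under simple assumptions (dense feature vectors, Euclidean distance) rather than serving as a reference implementation, and libraries such as imbalanced-learn provide production-ready versions.

```python
import numpy as np

def smote_sketch(X_min, k=5, n_new=100, seed=0):
    """Synthesize n_new samples from the minority set X_min ((n, d) array, n > k)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class only.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)             # a sample is not its own neighbor
    knn = np.argsort(dist, axis=1)[:, :k]      # K nearest minority neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        x = rng.integers(len(X_min))           # pick a minority sample x
        y = knn[x, rng.integers(k)]            # pick one of its K neighbors y
        lam = rng.random()                     # random point on the segment x-y
        synthetic[i] = X_min[x] + lam * (X_min[y] - X_min[x])
    return synthetic
```

Note that lam = 0 would reproduce x exactly, i.e. the plain duplication of item 1); the interpolation is what lets SMOTE add at least some new information.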

Undersampling: undersampling essentially shrinks the majority class so that its sample count drops. Like oversampling, undersampling algorithms are heuristics, and no standard algorithm can guarantee the effect of undersampling.

1) The simplest approach is to randomly select part of the majority class to form a new, smaller majority set. The disadvantage is that information is lost, so the model learns only part of the overall pattern.

2) Informed undersampling algorithms can be used to address the information loss caused by random undersampling. The most common is the EasyEnsemble algorithm: each round, randomly draw a subset of the majority class and train a classifier on it together with the minority class; repeat this to obtain multiple classifiers, then integrate their outputs into the final classification (a sketch follows this list).

3) The NearMiss algorithm, which uses K-nearest-neighbor information to select representative majority samples; alternatively, samples can be selected according to clustering results.
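
Here is a minimal sketch of the EasyEnsemble idea from item 2), assuming scikit-learn is available. The helper name `easy_ensemble_predict`, the logistic-regression base learner, and the probability-averaging vote are my simplifications (the published EasyEnsemble trains AdaBoost learners on each subset); imbalanced-learn also offers a ready-made EasyEnsembleClassifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_predict(X_maj, X_min, X_test, n_rounds=10, seed=0):
    """Train one classifier per random majority subset; average their votes."""
    rng = np.random.default_rng(seed)
    # Each round is balanced: all minority samples plus an equal-sized
    # random subset of the majority class.
    y = np.r_[np.zeros(len(X_min)), np.ones(len(X_min))]  # 0 = minority, 1 = majority
    probas = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X_round = np.vstack([X_min, X_maj[idx]])
        clf = LogisticRegression(max_iter=1000).fit(X_round, y)
        probas.append(clf.predict_proba(X_test)[:, 1])     # P(majority)
    # Integrate the per-round classifiers into the final prediction.
    return (np.mean(probas, axis=0) > 0.5).astype(int)     # 1 = majority
```

Every classifier sees all of the minority data but a different slice of the majority data, so the ensemble as a whole discards far less information than a single random undersampling would.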

  • Algorithm-based methods

When the samples are imbalanced, the imbalance can also be corrected by changing the objective function used to train the model. The most common solution is to assign different weights to the different classes, as in the sketch below.
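
For example, many scikit-learn estimators expose a `class_weight` parameter for exactly this purpose; passing 'balanced' weights each class inversely to its frequency. A minimal sketch:

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' reweights each class's contribution to the loss inversely to
# its frequency, so with a 1:1000 positive:negative ratio each positive
# sample counts roughly 1000x as much as a negative one.
clf = LogisticRegression(class_weight="balanced")

# Weights can also be set explicitly per label (here assuming labels
# 0 = negative/majority and 1 = positive/minority):
clf = LogisticRegression(class_weight={0: 1.0, 1: 1000.0})
```

The same idea carries over to hand-written objectives: multiply each sample's term in the loss by the weight of its class.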

Origin www.cnblogs.com/TIMHY/p/11417389.html