How to Deal with Imbalanced Data

 

Definition

In binary classification, for example, suppose the dataset is S, with S_maj the majority class and S_min the minority class. The ratio of majority to minority samples is often 100:1, 1000:1, or even 10000:1. Data like this are imbalanced, and imbalanced learning is the problem of extracting the useful information from data whose class distribution is this uneven.

 

The problem: predicting on imbalanced data is deceptively easy to get "right": just always predict the majority class, especially when that class makes up most of the data. For example, if the majority class accounts for more than 90% of the samples and the minority class for less than 10%, a model that always predicts the majority class already reaches 90% accuracy.

 

Why class imbalance is bad

1. From the perspective of the training process

From the training perspective, if a class has very few samples, then that class provides too little "information" for the model to learn from.

The learning criterion is empirical risk minimization (minimizing the average loss on the training set). Suppose the loss function is the 0-1 loss (a typical cost-insensitive loss that charges every error equally); then the optimization objective is equivalent to minimizing the error rate, i.e. maximizing accuracy. Consider an extreme case: 1000 training samples, 999 in the positive class and 1 in the negative class. At the end of some iteration the model classifies every sample as positive. Although the single negative sample is misclassified, the loss it contributes is minuscule and accuracy is already 99.9%, so the stopping condition is satisfied (or the maximum number of iterations is reached) and there is no pressure to optimize further. Training ends, and the model never learns to distinguish the minority class.
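A minimal sketch of the 999-vs-1 scenario above, using plain NumPy: a degenerate model that predicts "positive" for everything already reaches 99.9% accuracy, so the 0-1 loss gives it almost no incentive to learn the minority class.

```python
import numpy as np

y_true = np.array([1] * 999 + [0])   # 999 positive samples, 1 negative sample
y_pred = np.ones_like(y_true)        # degenerate model: always predict the positive class

accuracy = (y_true == y_pred).mean()
print(f"accuracy = {accuracy:.4f}")  # prints 0.9990
```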

2. From the perspective of the prediction process

Consider a binary logistic regression model. Given an input sample x, the model outputs ŷ, its estimate of the probability that the sample belongs to the positive class. When ŷ > 0.5 the model judges the sample to be positive, otherwise negative.

Why 0.5? This can be viewed from the angle of a maximum a posteriori decision: choosing 0.5 means that whenever the estimated posterior probability of the sample being positive is larger than the estimated posterior probability of it being negative, the sample is judged positive. But is the estimated posterior probability actually accurate?

Consider the odds: the odds of a sample are the ratio of the likelihood that it belongs to the positive class to the likelihood that it belongs to the negative class. For a given sample, the model's predicted odds are ŷ / (1 − ŷ).

When making decisions, we would of course like the model to follow the true class distribution of the overall population. Let n be the fraction of positive samples in that population, i.e. the number of positive samples divided by the total number of samples; the true odds are then n / (1 − n). When the predicted odds exceed the true odds, i.e. ŷ / (1 − ŷ) > n / (1 − n), the sample should be judged positive.

We cannot observe the true population, but there is an assumption about the training set: it is an unbiased sample of that population. Because of this assumption, the odds observed on the training set, n̂ / (1 − n̂) (where n̂ is the training-set positive fraction), are taken to represent the true odds n / (1 − n). When the training set is imbalanced, n̂ is far from 0.5, so the naive 0.5 threshold no longer matches this decision rule.
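To make the decision rule implied by this argument concrete, here is a minimal sketch: compare the model's odds against the odds observed in the training set (the function name and example numbers are illustrative only, not from the original).

```python
def predict_with_observed_odds(y_hat, pos_fraction):
    """y_hat: model's estimated P(positive); pos_fraction: positive fraction in the training set."""
    model_odds = y_hat / (1.0 - y_hat)
    observed_odds = pos_fraction / (1.0 - pos_fraction)
    return model_odds > observed_odds  # True -> judge the sample as positive

# With a balanced training set (pos_fraction = 0.5) this reduces to y_hat > 0.5.
print(predict_with_observed_odds(0.7, 0.5))  # True
print(predict_with_observed_odds(0.7, 0.9))  # False: 0.7 is below the 90% base rate
```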

 

Solutions

Method 1: Find a way to get more data

First, think about whether we can obtain more data. Data collected early on often reflects a particular trend, and during that period one class may appear far more often than normal; once data from a later period arrives, the trend, and with it the class distribution, may look quite different.

If we never get the later data, predictions over the whole period may not be very accurate, so finding a way to obtain more data is likely to improve the situation.

Method 2: Evaluate in a different way

Under normal circumstances we judge machine learning results by accuracy and by error (cost), but in the face of imbalanced data, high accuracy and low error become far less useful and far less important.

So we can change the way we evaluate: often we use the confusion matrix to compute precision and recall, and then compute the F1 score (F-score) from precision and recall. With these metrics we can, to a large extent, tell performance apart on imbalanced data and give more meaningful scores.
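A minimal sketch with scikit-learn (assumed to be available): on a 95/5 split, a classifier that always predicts the majority class gets high accuracy, while the confusion matrix, precision, recall, and F1 score expose how useless it is.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = np.array([0] * 95 + [1] * 5)   # 95 majority samples, 5 minority samples
y_pred = np.zeros_like(y_true)          # lazy model: always predict the majority class

print(accuracy_score(y_true, y_pred))                    # 0.95 -- looks impressive
print(confusion_matrix(y_true, y_pred))                  # [[95  0]
                                                         #  [ 5  0]]
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 -- minority class never found
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```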

Method 3: Resample the data

This method is the simplest and most brute-force: resample the imbalanced data so that it becomes balanced. The first way is to duplicate the minority-class samples until their number is close to that of the majority class; this is oversampling. The second way is to cut away part of the majority-class samples so that the two classes are roughly equal in number; this is undersampling.

However, crudely removing or duplicating data easily changes the original distribution and reduces the model's ability to generalize, so the distribution of the data needs to be taken into account.

Random oversampling: repeatedly sample with replacement from the minority class to obtain more samples.

Cons: the minority samples are copied many times, which enlarges the dataset, adds to the complexity of the model, and makes overfitting likely.
Solution: SMOTE algorithm
Simply put, for each minority-class sample x, randomly pick a sample y from its K nearest neighbours within the minority class, then randomly choose a point on the line segment connecting x and y as the newly synthesized sample. This avoids simply replicating minority samples, increases sample diversity, and can reduce the risk of overfitting.
However, this method increases the degree of overlap between the classes and can generate synthetic samples that provide no useful information. For this reason there are variants such as Borderline-SMOTE (which synthesizes new samples only from the minority samples lying on the classification boundary) and ADASYN (which synthesizes different numbers of new samples for different minority samples).
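A short sketch of SMOTE using the third-party imbalanced-learn package (assumed to be installed and imported as imblearn; the dataset here is synthetic and only for illustration):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a synthetic 95% / 5% binary dataset.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print(Counter(y))        # roughly {0: 950, 1: 50}

# SMOTE interpolates between each minority sample and its k nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))    # both classes now have the same count
```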

Random undersampling: randomly select (with or without replacement) a smaller number of samples from the majority class.
Cons: part of the samples are discarded, which may lose useful information, so the model learns only part of the overall pattern.
Solution: Easy Ensemble algorithm
Each round, draw a random subset from the majority class and train a classifier on it together with the minority class; repeat several times to obtain multiple classifiers, and fuse them for the final result (a sketch follows after this list).
Balance Cascade algorithm
Cascade structure: at each level, randomly select a subset from the majority class and train a classifier on it together with the minority class, then remove from the majority class the samples that the current classifier can already identify correctly, and continue to the next level; repeat several times, and fuse the classifiers at all levels for the final result.
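A minimal hand-rolled sketch of the EasyEnsemble idea with NumPy and scikit-learn (not a library implementation; labels are assumed to be 0 for the majority class and 1 for the minority class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def easy_ensemble_fit(X, y, n_rounds=10, seed=0):
    """Train one classifier per balanced subset: all minority samples plus a random majority subset."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_rounds):
        sampled_maj = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, sampled_maj])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    """Fuse the sub-classifiers by averaging their predicted probabilities."""
    avg_prob = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (avg_prob >= 0.5).astype(int)
```

The imbalanced-learn package also ships packaged versions of these undersampling ensembles if you prefer not to hand-roll them.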

In fact, data augmentation, which is often used in practice, is also a form of oversampling: perturb some samples with noise or transform them (cropping, flipping, brightness changes, and the like).
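For example, a tiny sketch of augmentation-as-oversampling on image-like arrays (the 32x32 placeholder images and the noise level are illustrative only):

```python
import numpy as np

def augment(image, rng):
    flipped = image[:, ::-1]                                                # horizontal flip
    noisy = np.clip(image + rng.normal(0.0, 0.01, image.shape), 0.0, 1.0)   # slight noise perturbation
    return [flipped, noisy]

rng = np.random.default_rng(0)
minority_images = [rng.random((32, 32))]   # placeholder minority-class images
augmented = [aug for img in minority_images for aug in augment(img, rng)]
print(len(augmented))                      # two extra samples per original image
```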

 

Method 4: Use a different machine learning method

Some machine learning methods, such as neural networks, are rather helpless in the face of imbalanced data, while methods such as decision trees are not affected by the imbalance to the same degree.

Method 5: Modify the algorithm

Of all the methods, the most creative is to modify the algorithm itself. If you are using a sigmoid function, there is a threshold on its prediction: if the output falls below the threshold, the prediction is "pear"; if it exceeds the threshold, the prediction is "apple".

But because there are now far too many pears, we need to adjust the position of this threshold, shifting it more toward the apple side, so that the model predicts apple only when it is very confident; trained and used this way, the model gives better results.
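A small sketch of this threshold shift (the probabilities and the 0.8 cut-off are illustrative values, not from the original):

```python
import numpy as np

probs = np.array([0.55, 0.62, 0.91, 0.40])  # sigmoid outputs: estimated P(apple) for four fruits

default_pred = probs > 0.5   # default threshold: three of the four are called "apple"
shifted_pred = probs > 0.8   # shifted threshold: call "apple" only when the model is very confident
print(default_pred)          # [ True  True  True False]
print(shifted_pred)          # [False False  True False]
```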

You can also change the objective function used to train the model, or convert the problem into one-class learning.
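For changing the objective function, one common cost-sensitive option is class weighting, sketched here with scikit-learn (the choice of logistic regression is arbitrary):

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights the loss inversely to class frequencies,
# so errors on the minority class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
# clf.fit(X_train, y_train)  # fit on your imbalanced training data

# For the one-class formulation mentioned above, sklearn.svm.OneClassSVM is one option.
```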

Source: www.cnblogs.com/shona/p/12165786.html