Resampling an Imbalanced Sample Set

Original: http://www.iterate.site/2019/04/13/07-%E4%B8%8D%E5%9D%87%E8%A1%A1%E6%A0%B7%E6%9C%AC%E9%9B%86%E7%9A%84%E9%87%8D%E9%87%87%E6%A0%B7/

 

When training a binary classification model, for example for medical diagnosis, network intrusion detection, or credit card fraud detection, we often run into the problem of imbalanced positive and negative samples: very few positive samples and a great many negative samples.

For many classification algorithms, directly training on such an imbalanced sample set causes problems. For example, if the ratio of positive to negative samples reaches 1:99, a classifier that simply labels every sample as negative already achieves 99% accuracy, which is obviously not what we want: we want the classifier to have sufficiently high precision and recall on both the positive and the negative class.

Sampling, data augmentation

For binary classification, when the positive and negative samples in the training set are highly imbalanced, how should we process the data to train a better classification model?

Why do many classification models run into problems when the training data is imbalanced?

The essential reason is that the objective function the model optimizes during training is inconsistent with the evaluation criterion people use at test time. Specifically:

  • This "mismatch" may be due to inconsistent with the desired test sample distribution when the training sample distribution data, for example, is optimized at the time of training the entire training set (positive and negative samples could be 1:99 ratio) in the correct ratio, and the test when you might want to model the average accuracy rate in the positive and negative samples as large as possible (in fact desirable ratio of positive and negative samples 1:1);
  • Also may be due to weight different types of training phase weight (importance) is inconsistent with the test phase, for example, believe that the contribution of all the samples are equal when training, and testing false positive samples (False Positive) and false negative samples (False Negative) has different price. Yes, different price
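
As a concrete illustration of the first point, here is a minimal numpy sketch (the toy data and variable names are my own) comparing overall accuracy with class-averaged accuracy for an "always predict negative" classifier on a 1:99 dataset:

```python
import numpy as np

# Hypothetical 1:99 imbalanced labels: 1 = positive, 0 = negative
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)            # a classifier that always predicts "negative"

overall_acc = (y_pred == y_true).mean()   # accuracy on the whole set -> 0.99

# Class-averaged ("balanced") accuracy: the mean of the per-class recalls -> 0.5
acc_pos = (y_pred[y_true == 1] == 1).mean()
acc_neg = (y_pred[y_true == 0] == 0).mean()
balanced_acc = (acc_pos + acc_neg) / 2

print(f"overall accuracy:  {overall_acc:.2f}")   # 0.99
print(f"balanced accuracy: {balanced_acc:.2f}")  # 0.50
```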

Based on this analysis, we usually handle sample imbalance from two angles [17].

Data-based methods

The idea is to resample the data so that the originally imbalanced samples become balanced. Denote the majority class by C_maj and the minority class by C_min, with corresponding sample sets S_maj and S_min. By assumption, |S_maj| >> |S_min|.

The simplest approach is to randomly sample the imbalanced sample set.

Random sampling is generally divided into oversampling (over-sampling) and undersampling (under-sampling).

  • Random oversampling draws samples from the minority class set S_min with replacement, duplicating them to obtain more samples;
  • Random undersampling does the opposite: it selects a smaller number of samples from the majority class set S_maj (with or without replacement). A minimal sketch of both follows this list.
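
Here is that sketch, using only numpy; the function names and toy data are my own, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(S_min, target_size):
    """Draw indices from the minority set with replacement until it reaches target_size."""
    idx = rng.integers(0, len(S_min), size=target_size)
    return S_min[idx]

def random_undersample(S_maj, target_size, replace=False):
    """Keep only target_size samples from the majority set (with or without replacement)."""
    idx = rng.choice(len(S_maj), size=target_size, replace=replace)
    return S_maj[idx]

# Toy feature matrices: 5 minority samples, 500 majority samples
S_min = rng.normal(loc=1.0, size=(5, 2))
S_maj = rng.normal(loc=-1.0, size=(500, 2))

balanced_min = random_oversample(S_min, len(S_maj))   # 500 minority samples (duplicated)
balanced_maj = random_undersample(S_maj, len(S_min))  # 5 majority samples
```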

It feels like both oversampling and undersampling have some issues, though; let me note them here.

Although direct random sampling can make the sample set balanced, it brings some problems, for example:

  • Oversampling makes multiple copies of minority-class samples, which enlarges the data and increases the cost of model training, and also easily leads to overfitting;
  • Undersampling discards some samples, which may lose useful information and cause the model to learn only part of the overall pattern.

To solve these problems, oversampling usually does not simply duplicate samples but instead uses methods that generate new samples. For example, for each sample x in the minority set S_min, the SMOTE algorithm randomly selects one of its k nearest neighbors y within S_min, and then randomly picks a point on the line segment connecting x and y as a newly synthesized sample (the process is repeated as many times as the required oversampling ratio demands), as shown in Figure 8.14. This seems quite powerful, but how exactly is the synthesis done? How do two samples get combined into one new sample?

[Figure 8.14: illustration of the SMOTE algorithm]
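
To answer my own question above: the "synthesis" is just linear interpolation between the two samples in feature space. Here is a minimal sketch of that core step, assuming Euclidean nearest neighbors and leaving out the bookkeeping for the oversampling ratio (function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_one(x, S_min, k=5):
    """Synthesize one new sample from minority sample x and one of its k nearest
    neighbors within the minority set S_min (Euclidean distance)."""
    dists = np.linalg.norm(S_min - x, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]   # skip index 0: x itself (distance 0)
    y = S_min[rng.choice(neighbor_idx)]
    lam = rng.random()                          # random point on the segment between x and y
    return x + lam * (y - x)

# Toy minority set: 20 samples with 2 features
S_min = rng.normal(size=(20, 2))
new_sample = smote_one(S_min[0], S_min)
```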

This kind of oversampling, which synthesizes new samples, can reduce the risk of overfitting.

However, SMOTE synthesizes the same number of new samples for every minority-class sample, which may increase the overlap between classes and generate samples that provide no useful information.

Improved algorithms such as Borderline-SMOTE and ADASYN have appeared to address this:

  • Borderline-SMOTE synthesizes new samples only from those minority-class samples that lie near the classification boundary,
  • while ADASYN synthesizes different numbers of new samples for different minority-class samples.

In addition, some data cleaning methods (e.g., ones based on Tomek links) can be employed to further reduce the overlap between classes introduced by the synthesized samples, yielding better-defined class clusters and thus a better-trained classifier. Wow, so many methods! All kinds of tricks; quite impressive. Worth summarizing them properly.
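
For the curious: a Tomek link is a pair of samples from different classes that are each other's nearest neighbor; the cleaning step then removes the majority-class member (or both). A minimal numpy sketch of detecting such pairs, with names of my own choosing:

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) with different labels that are each other's
    nearest neighbor (a Tomek link); such pairs straddle the class boundary."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    nn = d.argmin(axis=1)                # nearest neighbor of every sample
    return [(i, j) for i, j in enumerate(nn)
            if y[i] != y[j] and nn[j] == i and i < j]
```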

Similarly, for undersampling, Informed Undersampling can be used to mitigate the information loss caused by random undersampling. Common Informed Undersampling algorithms include:

  • Easy Ensemble algorithm. Each round, randomly select a subset E (with |E| ≈ |S_min|) from the majority class S_maj, then train a classifier on E + S_min; repeat this process several times to obtain multiple classifiers, and fuse their outputs as the final classification result (see the sketch after this list). Ah, that does make sense.
  • Balance Cascade algorithm. It uses a cascade structure: at each stage, randomly select a subset E from the majority class S_maj and train that stage's classifier on E + S_min; then remove from S_maj the samples that the current classifiers can already classify correctly, and continue to the next stage. Repeating this several times yields the cascade, and the final output is the fusion of the classifiers at all levels. This feels a bit like a decision tree.
  • Other algorithms such as NearMiss (which uses k-nearest-neighbor information to select representative samples) and One-sided Selection (which uses data cleaning techniques). These should be summarized later as well.
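
A minimal sketch of the Easy Ensemble idea from the first bullet, using scikit-learn's LogisticRegression purely as a placeholder base classifier; the base model and the number of rounds are my own assumptions, not the book's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def easy_ensemble(X_maj, X_min, n_rounds=10):
    """Each round: train one classifier on (random majority subset E + full minority set),
    with |E| == |S_min|; the ensemble averages the classifiers' probabilities."""
    y_round = np.r_[np.zeros(len(X_min)), np.ones(len(X_min))]   # labels for E, then X_min
    models = []
    for _ in range(n_rounds):
        E = X_maj[rng.choice(len(X_maj), size=len(X_min), replace=False)]
        models.append(LogisticRegression().fit(np.vstack([E, X_min]), y_round))
    return models

def ensemble_predict_proba(models, X):
    """Fuse the individual classifiers by averaging their positive-class probabilities."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```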

In practice, the specific sampling operation may not be exactly the same as the algorithms above, but the basic idea is often the same. For example:

  • Clustering-based sampling uses the cluster structure of the data to guide the oversampling / undersampling operation;
  • Data augmentation, which is often used as well, is a form of oversampling: new samples are constructed by perturbing or transforming a small number of existing samples (for image datasets: cropping, flipping, rotating, adding lighting changes, etc.); a small sketch follows this list;
  • Hard Negative Mining is a kind of undersampling that extracts the harder samples and feeds them back to iteratively train the classifier.
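
Here is that small sketch: two of the listed image transforms (horizontal flip and random crop) in plain numpy; the image shape and crop size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    """Flip an (H, W, C) image left-right."""
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w):
    """Cut a random (crop_h, crop_w) patch out of an (H, W, C) image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [horizontal_flip(img), random_crop(img, 28, 28)]
```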

Algorithm-based methods

When the samples are imbalanced, the imbalance can also be corrected by changing the objective function used in model training (e.g., cost-sensitive learning, where different classes carry different weights); when the sample counts are extremely imbalanced, the learning problem can also be transformed into one-class learning or anomaly detection. This section focuses on sampling, so these are not covered in detail here. Wow, what a clever line of thought! It can actually be converted into one-class learning or anomaly detection. I should add more on this later.
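
As a quick illustration of the cost-sensitive idea (just a sketch, not the book's recipe): scikit-learn classifiers expose a class_weight parameter that re-weights each class's contribution to the loss. The toy data and weights below are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced data: 990 negatives, 10 positives
X = np.vstack([rng.normal(-1, 1, size=(990, 2)), rng.normal(1, 1, size=(10, 2))])
y = np.r_[np.zeros(990), np.ones(10)]

# 'balanced' weights each class inversely to its frequency, so a mistake on the
# rare positive class costs roughly 99x more in the training objective.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Explicit costs can also be supplied, e.g. positives weighted 50x:
clf_custom = LogisticRegression(class_weight={0: 1, 1: 50}).fit(X, y)
```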

Summary and extensions

In an actual interview, there are many knowledge points that can extend from this question. For example:

  • Model evaluation criteria on imbalanced sample sets;
  • How to choose a suitable processing method for different sample sizes (in absolute numbers), e.g., considering the difference between positive-to-negative counts of 1:100 and 1000:100000;
  • The differences and connections between cost-sensitive learning and sampling methods, and a comparison of their effects. Ah, I would like to know this.

I still need to summarize and clarify these.

Original and related

  • "Hundred face machine learning"
