Implementing SMOTE oversampling in Spark

I. SMOTE theory

(1).

SMOTE is an improvement over ordinary oversampling. Plain oversampling simply duplicates minority samples, which leaves the training set with many repeated samples.

SMOTE stands for Synthetic Minority Over-sampling Technique.

Rather than directly resampling the minority class, SMOTE uses an algorithm to synthesize new minority-class samples.

For convenience, assume the positive class is the minority class and the negative class is the majority class.

The algorithm synthesizes a new positive (minority-class) sample as follows:

  1. Pick a positive sample $S$.
  2. Find the $k$ nearest neighbors of $S$; $k$ is typically set to 5, 10, or similar. These $k$ neighbors may be positive or negative.
  3. Randomly select one of the $k$ neighbors, denoted $R$.
  4. Synthesize a new positive sample $S' = \lambda S + (1 - \lambda) R$, where $\lambda$ is a random number in $(0, 1)$. In other words, the new point lies on the line segment connecting $S$ and $R$.

 

Repeating these steps generates as many new positive samples as needed; see the sketch below.
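As a concrete illustration of steps 1-4, here is a minimal plain-Scala sketch that synthesizes a single sample. The function and variable names are illustrative (not from the original post), and the neighbour search is a brute-force scan over the full data set, since step 2 allows neighbours from either class:

```scala
import scala.util.Random

// Minimal sketch of steps 1-4; names are illustrative.
// `positives` holds minority-class feature vectors; `all` is the full data
// set, since step 2 allows neighbours from either class.
def synthesizeOne(positives: Array[Array[Double]],
                  all: Array[Array[Double]],
                  k: Int, rng: Random): Array[Double] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val s = positives(rng.nextInt(positives.length))         // step 1: pick a positive sample S
  val knn = all.filter(_ ne s).sortBy(dist(s, _)).take(k)  // step 2: k nearest neighbours of S (excluding S itself, by reference)
  val r = knn(rng.nextInt(knn.length))                     // step 3: pick one neighbour R at random
  val lambda = rng.nextDouble()                            // step 4: lambda drawn from (0, 1)
  s.zip(r).map { case (si, ri) => lambda * si + (1 - lambda) * ri } // S' = lambda*S + (1-lambda)*R
}
```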

======= Added a few figures as an update ======

A figure-by-figure description of the SMOTE steps:

1. Start from a positive sample (assuming the positive class is the minority class).

2. Find the k nearest neighbors of that positive sample (here k = 5); the 5 neighbors are circled in the figure.

3. Randomly select one of those k nearest neighbors (marked with a green circle).

4. Pick a random point on the line segment between the positive sample and the selected neighbor; this point is the new synthetic positive sample (marked with a green plus sign).

The above description comes from http://sofasofa.io/forum_main_post.php?postid=1000817.

 

(2).

With this approach, the positive class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbours. Depending upon the amount of over-sampling required, neighbours from the k nearest neighbours are randomly chosen. This process is illustrated in the following figure, where $x_i$ is the selected point, $x_{i1}$ to $x_{i4}$ are some selected nearest neighbours, and $r_1$ to $r_4$ are the synthetic data points created by the randomized interpolation. The implementation of this work uses only one nearest neighbour with the Euclidean distance, and balances both classes to a 50% distribution.

Synthetic samples are generated in the following way: Take the difference between the feature vector (sample) under consideration and its nearest neighbour. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general. An example is detailed in the next Figure.
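As a made-up numeric illustration of this procedure: if the sample under consideration is $s = (1, 2)$ and its nearest neighbour is $n = (3, 6)$, the difference is $n - s = (2, 4)$; with a random factor $\lambda = 0.25$, the synthetic sample is $s + \lambda(n - s) = (1.5, 3)$, a point one quarter of the way along the segment from $s$ to $n$.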

In short, the main idea is to form new minority class examples by interpolating between several minority class examples that lie close together. In contrast with common replication techniques (for example, random oversampling), in which the decision region usually becomes more specific, with SMOTE the overfitting problem is somewhat avoided, because the decision boundaries for the minority class become larger and spread further into the majority class space, and the classifier is given related (rather than duplicated) minority class samples to learn from. Selecting a small k value can also reduce the risk of including noise in the data.

The above is from the description at https://sci2s.ugr.es/multi-imbalanced.

 

II. Implementing SMOTE in Spark
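Below is a minimal sketch of how SMOTE can be wired into Spark (Scala). It assumes the minority class is small enough to collect to the driver, which is a common simplification for imbalanced data sets; the object name, toy data, and parameter values are all illustrative assumptions, not from the original post. Neighbours are restricted to the minority class here, as in the description from section (2):

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

// A minimal SMOTE sketch on Spark. It assumes the minority class is small
// enough to collect to the driver; names and data are illustrative.
object SmoteSketch {

  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Generate `n` synthetic samples from the minority set, interpolating
  // between each chosen sample and one of its k nearest minority neighbours.
  def smote(minority: Array[Array[Double]], k: Int, n: Int, rng: Random): Array[Array[Double]] =
    Array.fill(n) {
      val s = minority(rng.nextInt(minority.length))
      val neighbours = minority.filter(_ ne s).sortBy(euclidean(s, _)).take(k)
      val r = neighbours(rng.nextInt(neighbours.length))
      val lambda = rng.nextDouble()
      // S' = lambda * S + (1 - lambda) * R: a random point on the segment S-R
      s.zip(r).map { case (si, ri) => lambda * si + (1 - lambda) * ri }
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("smote-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy data: label 1.0 = positive (minority), 0.0 = negative (majority).
    val data = sc.parallelize(Seq(
      (1.0, Array(1.0, 2.0)), (1.0, Array(1.5, 1.8)), (1.0, Array(2.0, 2.2)),
      (0.0, Array(8.0, 8.0)), (0.0, Array(9.0, 9.5)), (0.0, Array(8.5, 9.0)),
      (0.0, Array(9.2, 8.7)), (0.0, Array(8.8, 9.3))
    ))

    val minority = data.filter(_._1 == 1.0).map(_._2).collect()
    val majorityCount = data.filter(_._1 == 0.0).count()
    val toGenerate = (majorityCount - minority.length).toInt // balance classes to 50/50

    val synthetic = smote(minority, 2, toGenerate, new Random(42)).map(v => (1.0, v))
    val balanced = data.union(sc.parallelize(synthetic))

    balanced.collect().foreach { case (label, f) => println(s"$label -> ${f.mkString(",")}") }
    spark.stop()
  }
}
```

Collecting the minority class to the driver avoids a distributed k-NN join; for a minority class too large to collect, an approximate nearest-neighbour approach (for example, LSH-based joins) would be needed instead.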

 
