Comparative summary of oversampling techniques for handling imbalanced data

Classification algorithms trained on imbalanced data often produce models with poor prediction quality: the model is heavily biased towards the majority class and ignores the minority examples that are critical to many use cases. This makes the model impractical for real-world problems involving rare but high-priority events.

Oversampling provides a way to rebalance classes before model training begins. By duplicating or synthesizing minority class data points, oversampling balances the training data and prevents the algorithm from ignoring small but important classes. Although it carries a risk of overfitting, oversampling can offset the negative impact of imbalanced learning and enable the model to handle the use cases that matter most.

Common oversampling techniques include random oversampling, SMOTE (Synthetic Minority Oversampling Technique), and ADASYN (Adaptive Synthetic Sampling). Random oversampling simply duplicates minority class samples, while SMOTE and ADASYN strategically generate synthetic data to augment the real samples.

What is oversampling

Oversampling is a data augmentation technique used to address class imbalance (where one class significantly outnumbers the others). It rebalances the training data distribution by increasing the number of samples in underrepresented classes.

Oversampling increases the number of minority class samples by copying existing samples or generating synthetic data points. This is accomplished either by replicating real minority observations or by creating artificial data points based on the patterns present in the real ones.

Amplifying underrepresented classes through oversampling before model training helps the model learn to represent all classes more fully, rather than being heavily skewed towards the dominant one. This improves the evaluation metrics that matter when the goal is to detect important but uncommon events.

Why oversampling

When dealing with imbalanced datasets, we are usually most interested in correctly classifying the minority class. The cost of a false negative (failing to detect a minority class instance) is typically much higher than the cost of a false positive (incorrectly flagging a sample as belonging to the minority class).

Traditional machine learning algorithms such as logistic regression and random forests optimize objectives and performance metrics that implicitly assume a roughly balanced class distribution. Models trained on skewed data therefore tend to be heavily biased towards the majority class, while ignoring patterns in the small but important classes.

By oversampling minority class samples, the dataset is rebalanced to reflect a more equal cost of misclassification across outcomes. This helps the classifier identify underrepresented classes more accurately and reduces costly false negatives.

Oversampling vs Undersampling

Oversampling and undersampling are both techniques to resolve class imbalance by balancing the training data distribution. They achieve this balance in opposite ways.

Oversampling solves the imbalance problem by replicating or generating new samples to increase the minority class. Undersampling, on the other hand, balances classes by reducing the number of samples in the overrepresented majority class.

Undersampling is appropriate when the majority class contains many redundant or similar samples, or when the dataset is huge. However, discarding samples can cause information loss and lead to a biased model.

Oversampling is appropriate when the dataset is small and the available minority class samples are limited. However, it can lead to overfitting, because duplicated data points, or synthetic data that does not represent the real distribution, may mislead the model.
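
To make the contrast concrete, here is a minimal sketch of both approaches using imbalanced-learn's RandomOverSampler and RandomUnderSampler on an assumed toy dataset built with scikit-learn's make_classification (the dataset and ratios are illustrative, not from the original article):

 from collections import Counter

 from sklearn.datasets import make_classification
 from imblearn.over_sampling import RandomOverSampler
 from imblearn.under_sampling import RandomUnderSampler

 # Toy imbalanced dataset: roughly 90% majority / 10% minority
 X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
 print(Counter(y))  # e.g. {0: ~900, 1: ~100}

 # Oversampling: duplicate minority samples up to the majority count
 X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
 print(Counter(y_over))  # both classes at roughly the majority count

 # Undersampling: discard majority samples down to the minority count
 X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
 print(Counter(y_under))  # both classes at roughly the minority count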

Below we will explore different types of oversampling methods.

1. Random oversampling

Random oversampling randomly copies minority class samples to balance the class distribution, so it is very simple to implement. It randomly selects existing samples from the underrepresented classes and replicates them unchanged. When the dataset is small, this effectively increases the number of observations without the need to collect additional real-world data.

The RandomOverSampler class in the imbalanced-learn library implements this oversampling process.

 import matplotlib.pyplot as plt

 from imblearn.over_sampling import RandomOverSampler
 from imblearn.pipeline import make_pipeline

 # create_dataset, plot_decision_function and clf are helpers defined earlier in
 # the original example: create_dataset builds a toy imbalanced dataset,
 # plot_decision_function draws a classifier's decision regions, and clf is any
 # scikit-learn classifier.
 X, y = create_dataset(n_samples=100, weights=(0.05, 0.25, 0.7))

 fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

 # Left panel: classifier trained on the raw, imbalanced data
 clf.fit(X, y)
 plot_decision_function(X, y, clf, axs[0], title="Without resampling")

 # Right panel: the same classifier trained after random oversampling
 sampler = RandomOverSampler(random_state=0)
 model = make_pipeline(sampler, clf).fit(X, y)
 plot_decision_function(X, y, model, axs[1],
                        f"Using {model[0].__class__.__name__}")

 fig.suptitle(f"Decision function of {clf.__class__.__name__}")
 fig.tight_layout()

As the figure shows, after the minority samples are duplicated, the classifier's decision function correctly identifies the minority classes.

2. Smoothed bootstrap oversampling

Smoothed bootstrap oversampling (sometimes called noisy random oversampling) is an improved version of simple random oversampling that aims to mitigate its overfitting problem. Rather than replicating minority class samples exactly, this approach synthesizes new data points by adding a small amount of random noise to existing samples.

By default, RandomOverSampler performs a plain bootstrap (sampling with replacement). Its shrinkage parameter adds a small perturbation to the resampled points, producing a smoothed bootstrap. The figure below shows the difference between the two generation strategies.

 fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

 # plot_resampling is another helper from the original example that scatters the
 # resampled data; sampler is the RandomOverSampler created above.
 # shrinkage=None keeps the default, plain bootstrap (exact copies)...
 sampler.set_params(shrinkage=None)
 plot_resampling(X, y, sampler, ax=axs[0], title="Normal bootstrap")

 # ...while a positive shrinkage perturbs each resampled point slightly
 sampler.set_params(shrinkage=0.3)
 plot_resampling(X, y, sampler, ax=axs[1], title="Smoothed bootstrap")

 fig.suptitle(f"Resampling with {sampler.__class__.__name__}")
 fig.tight_layout()

Rather than repeating a handful of observed samples verbatim, the smoothed bootstrap creates new data points scattered around the real samples. The effect is to extend the limited minority data beyond the original records through data expansion rather than direct copying.

The generated points are "smoothed" versions that occupy the feature space around the real samples rather than lying exactly on top of them. Smoothed bootstrap oversampling therefore produces genuinely new synthetic minority samples instead of exact duplicates. This helps mitigate the overfitting caused by plain replication while still balancing the class distribution.
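
One quick way to see this difference is to count the distinct minority points after resampling: with a plain bootstrap every generated point is an exact copy of a real sample, while the smoothed bootstrap scatters them. A minimal sketch, again assuming a toy make_classification dataset that is not part of the original article:

 import numpy as np
 from sklearn.datasets import make_classification
 from imblearn.over_sampling import RandomOverSampler

 X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)

 # Plain bootstrap: every resampled minority point is an exact copy of an original
 X_plain, y_plain = RandomOverSampler(random_state=0).fit_resample(X, y)
 print(len(np.unique(X_plain[y_plain == 1], axis=0)))   # roughly the original minority count

 # Smoothed bootstrap: generated points are perturbed, so nearly all are distinct
 X_smooth, y_smooth = RandomOverSampler(shrinkage=0.5, random_state=0).fit_resample(X, y)
 print(len(np.unique(X_smooth[y_smooth == 1], axis=0)))  # close to the resampled minority count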

The benefit of random oversampling is that it is a very straightforward and simple technique. It does not require complex algorithms or assumptions about the underlying distribution of the data. Therefore, it can be easily applied to any imbalanced data set without requiring special prior knowledge.

But random oversampling is also limited by the possibility of overfitting. Since it only replicates existing minority examples rather than generating truly new ones, the duplicated observations add no new information about the underrepresented classes. The duplication can also amplify noise in the training data instead of representing the minority class more comprehensively.

Models trained this way may overfit to the specific quirks of the original dataset rather than capturing the true underlying pattern, which limits their ability to generalize to new, unseen data.

3. SMOTE

SMOTE (Synthetic Minority Oversampling Technique) is an oversampling method widely used in machine learning to alleviate the class imbalance problem.

The key idea behind SMOTE is that it generates new synthetic data points for underrepresented classes through interpolation rather than copying. It randomly selects a minority class observation and finds its k nearest minority class neighbors based on distance in feature space.

New synthetic samples are then generated by interpolating between the initial sample and one of those k neighbors. This interpolation strategy synthesizes data points that fill the regions between real observations, effectively extending the few available samples without duplicating the original records.

The workflow of SMOTE is as follows:

  1. For each minority class sample, find its K nearest neighbors among the other minority class samples in the feature space, where K is a user-defined parameter.
  2. For each minority class sample, randomly select one sample from its K nearest neighbors.
  3. Compute the difference between the selected neighbor and the current minority class sample, multiply it by a random number in [0, 1], and add the product to the current sample to generate a new synthetic sample (see the sketch after this list).
  4. Repeat the above steps to generate a certain number of synthetic samples for each minority class sample.
  5. The generated synthetic samples are merged with the original data and used to train the classification model.
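
The interpolation in step 3 can be sketched directly with NumPy and scikit-learn's NearestNeighbors. This is an illustrative toy version of the core formula x_new = x_i + λ · (x_neighbor − x_i), not imbalanced-learn's actual SMOTE implementation; the helper name smote_sample and the toy data are made up for the example.

 import numpy as np
 from sklearn.neighbors import NearestNeighbors

 def smote_sample(X_min, k=5, n_new=100, rng=np.random.default_rng(0)):
     """Generate n_new synthetic samples by SMOTE-style interpolation between minority points."""
     # k + 1 neighbors because each point's nearest neighbor is itself
     nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
     _, neighbor_idx = nn.kneighbors(X_min)

     synthetic = []
     for _ in range(n_new):
         i = rng.integers(len(X_min))            # pick a random minority sample
         j = rng.choice(neighbor_idx[i][1:])     # pick one of its k nearest minority neighbors
         lam = rng.random()                      # interpolation factor in [0, 1]
         synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
     return np.array(synthetic)

 # Toy usage: 30 minority points in 2-D, augmented with 100 synthetic ones
 X_min = np.random.default_rng(42).normal(size=(30, 2))
 print(smote_sample(X_min).shape)  # (100, 2)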

The key advantage of SMOTE is that it can increase the number of minority class samples in the data set by synthesizing samples, rather than simply repeating existing samples. This helps prevent the model from overfitting to minority class samples while improving generalization performance to unseen samples.

There are also variants of SMOTE, such as Borderline-SMOTE and ADASYN, which take the decision boundary and the local density of samples into account when generating synthetic data, further improving the handling of class imbalance.
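
In imbalanced-learn these variants are largely drop-in replacements for one another. A rough usage sketch, again on an assumed make_classification toy dataset:

 from collections import Counter

 from sklearn.datasets import make_classification
 from imblearn.over_sampling import SMOTE, BorderlineSMOTE

 X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

 # Standard SMOTE vs. the borderline-focused variant
 X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
 X_bl, y_bl = BorderlineSMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

 print(Counter(y_sm), Counter(y_bl))  # both are balanced, but the synthetic points differ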

4. Adaptive Synthetic Sampling (ADASYN)

Adaptive Synthetic Sampling (ADASYN) is a data resampling method that synthesizes new minority class samples in the feature space to balance the class distribution. Unlike simple oversampling methods (such as repeating minority class samples), ADASYN adaptively generates new samples according to the local density of the data, putting more emphasis on regions where the minority class is sparse in order to improve the model's generalization near the decision boundary.

The workflow of ADASYN is as follows:

  1. For each minority class sample, calculate its K nearest neighbor samples in the feature space, where K is a user-defined parameter.
  2. For each minority class sample, calculate the proportion of majority class samples among its K nearest neighbors. This ratio reflects how strongly the region around the sample is dominated by the majority class, i.e., how sparse the minority class is there.
  3. For each minority class sample, generate a number of synthetic samples proportional to this ratio, so that synthetic samples are concentrated in regions where the minority class is sparse and hard to learn (see the sketch after this list).
  4. The generated synthetic samples are merged with the original data and used to train the classification model.
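
Steps 2 and 3 can be sketched as follows: for each minority sample, the fraction of majority points among its K nearest neighbors gives a weight, and the normalized weights decide how many synthetic samples to create around each point. This is an illustrative toy version of the weighting step, with made-up function and variable names, not imbalanced-learn's ADASYN implementation:

 import numpy as np
 from sklearn.neighbors import NearestNeighbors

 def adasyn_counts(X, y, minority_label=1, k=5):
     """Return, for each minority sample, how many synthetic samples to generate around it."""
     X_min = X[y == minority_label]
     n_needed = (y != minority_label).sum() - len(X_min)   # samples needed to balance the classes

     # k nearest neighbors in the full dataset (the first neighbor is the point itself)
     nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
     _, idx = nn.kneighbors(X_min)
     neighbor_labels = y[idx[:, 1:]]

     # r_i: fraction of majority-class neighbors -> larger in hard / sparse minority regions
     r = (neighbor_labels != minority_label).mean(axis=1)
     r = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))

     # g_i: per-sample budget of synthetic points, concentrated where r_i is large
     return np.round(r * n_needed).astype(int)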

The main goal of ADASYN is to increase the number of minority class samples while preserving classifier performance near the decision boundary. If some of a minority sample's nearest neighbors come from the opposite class, that sample is more likely to be used as a template, and the more opposite-class neighbors it has, the more likely it is to be chosen. After selecting a template, ADASYN generates samples by interpolating between the template and its nearest neighbors of the same class.

Comparison of generation methods

 from imblearn import FunctionSampler  # identity sampler: passes the data through unchanged
 from imblearn.over_sampling import ADASYN, SMOTE

 X, y = create_dataset(n_samples=150, weights=(0.1, 0.2, 0.7))

 fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))

 # Compare the original data with three oversampling strategies
 samplers = [
     FunctionSampler(),                  # no resampling
     RandomOverSampler(random_state=0),
     SMOTE(random_state=0),
     ADASYN(random_state=0),
 ]

 for ax, sampler in zip(axs.ravel(), samplers):
     title = "Original dataset" if isinstance(sampler, FunctionSampler) else None
     plot_resampling(X, y, sampler, ax, title=title)
 fig.tight_layout()

The figure above shows the difference between ADASYN and SMOTE: ADASYN focuses on samples that are difficult to classify, while regular SMOTE makes no such distinction.

Let's take a look at the classification results with the different samplers:

 X, y = create_dataset(n_samples=150, weights=(0.05, 0.25, 0.7))

 fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))

 # Train the same classifier without resampling and inside ADASYN / SMOTE pipelines
 models = {
     "Without sampler": clf,
     "ADASYN sampler": make_pipeline(ADASYN(random_state=0), clf),
     "SMOTE sampler": make_pipeline(SMOTE(random_state=0), clf),
 }

 for ax, (title, model) in zip(axs, models.items()):
     model.fit(X, y)
     plot_decision_function(X, y, model, ax=ax, title=title)

 fig.suptitle(f"Decision function using a {clf.__class__.__name__}")
 fig.tight_layout()

Without oversampling, the minority classes are essentially indistinguishable; with oversampling, they are effectively separated.

SMOTE treats all minority class samples equally, regardless of their local density. ADASYN takes the neighborhood of each minority sample into account, generating more synthetic samples around minority points with few same-class neighbors so as to better cover the decision boundary.

However, the choice between the two algorithms depends on the actual application. For example, the impact of the changes in the yellow and blue decision boundaries in the figure above needs to be measured on real data before deciding which algorithm suits the application better.

https://avoid.overfit.cn/post/1814d699b1574f258fd3aea341d9e487

Author: Abdallah Ashraf
