Dealing with imbalanced data: technical details and case studies

Imbalanced datasets are a common problem in machine learning and data science. An imbalanced dataset is a classification problem in which the classes of the target variable are unevenly distributed: the number of samples in one class far exceeds that in the others. This article explains in detail how to deal with imbalanced data, covering resampling methods, ensemble methods, and performance metrics suited to imbalanced data.

Table of contents

1. Resampling Methods

Oversampling

Undersampling

2. Ensemble Methods

Bagging

Boosting

3. Performance Metrics

Conclusion


1. Resampling Methods

Resampling is a common way to deal with imbalanced data; its two main forms are oversampling and undersampling.

Oversampling

Oversampling increases the number of minority-class samples until the minority and majority classes are roughly the same size. Here is an example of random oversampling using Python's imbalanced-learn library:

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
import numpy as np

# Create an imbalanced dataset (90% / 10% class split)
X, y = make_classification(n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)

# Print the class distribution of the original dataset
print('Original dataset class distribution: %s' % np.bincount(y))

# Oversample
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

# Print the class distribution of the oversampled dataset
print('Resampled dataset class distribution: %s' % np.bincount(y_res))

In this example, we first create an imbalanced binary classification dataset, then use RandomOverSampler to randomly duplicate minority-class samples, and finally print the class distributions of the original and oversampled datasets.
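
Random oversampling simply duplicates existing minority samples, which can encourage overfitting. A popular alternative is SMOTE, which synthesizes new minority samples by interpolating between neighboring ones. Below is a minimal sketch using imbalanced-learn's SMOTE on the same X and y; the k_neighbors value shown is the library default, included only for clarity:

from imblearn.over_sampling import SMOTE

# SMOTE builds synthetic minority samples by interpolating between
# each minority point and its nearest minority-class neighbors
smote = SMOTE(k_neighbors=5, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)

# The classes should now be balanced
print('SMOTE dataset class distribution: %s' % np.bincount(y_sm))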

Undersampling

Undersampling reduces the number of majority-class samples until the majority and minority classes are roughly the same size. Here is an example of random undersampling using the imbalanced-learn library:

from imblearn.under_sampling import RandomUnderSampler

# Undersample
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

# Print the class distribution of the undersampled dataset
print('Resampled dataset class distribution: %s' % np.bincount(y_res))

In this example, we use RandomUnderSampler to randomly discard majority-class samples and then print the class distribution of the undersampled dataset.
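
One practical caveat: resampling should only ever be applied to the training split, never to the test split, or the evaluation will be biased. A minimal sketch of this pattern, using imbalanced-learn's Pipeline (which applies samplers only during fit) together with a logistic-regression classifier chosen purely for illustration:

from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out a test set before any resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# The imblearn Pipeline resamples only when fitting,
# so the test data is never touched by the sampler
pipe = Pipeline([
    ('undersample', RandomUnderSampler(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe.score(X_test, y_test))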

2. Ensemble Methods

Ensemble methods are another common approach to imbalanced data; the two main families are Bagging and Boosting.

Bagging

Bagging trains multiple models on multiple subsets of the data and then combines their predictions. When dealing with imbalanced data, we can combine undersampling with bagging: each subset is undersampled before a model is trained on it.

Here is an example of undersampling and bagging using Python's imbalanced-learn library:

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the base classifier
base_cls = DecisionTreeClassifier()

# Create a BalancedBaggingClassifier, which undersamples each bootstrap
# subset before training a base classifier on it
# (older imbalanced-learn versions name this parameter base_estimator)
bbc = BalancedBaggingClassifier(estimator=base_cls, random_state=42)

# Train the model
bbc.fit(X, y)

In this example, we first create a decision tree classifier as the base classifier, and then use BalancedBaggingClassifier to perform undersampling and bagging together.
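
imbalanced-learn also provides a related ensemble, EasyEnsembleClassifier, which trains AdaBoost learners on independently undersampled, balanced subsets. A minimal sketch, assuming the same X and y as above; n_estimators=10 is the library default, shown only for clarity:

from imblearn.ensemble import EasyEnsembleClassifier

# Each of the 10 AdaBoost learners is trained on its own
# randomly undersampled, balanced subset of the data
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X, y)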

Boosting

Boosting trains a sequence of models, where each model tries to correct the mistakes of the previous one. When dealing with imbalanced data, we can use boosting methods such as AdaBoost and Gradient Boosting.

Here is an example of AdaBoost using Python's scikit-learn library:

from sklearn.ensemble import AdaBoostClassifier

# Create an AdaBoostClassifier
abc = AdaBoostClassifier(random_state=42)

# Train the model
abc.fit(X, y)

In this example, we use AdaBoostClassifier to train an AdaBoost model.
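
For a boosting variant designed specifically for imbalanced data, imbalanced-learn offers RUSBoostClassifier, which randomly undersamples the majority class at each boosting round. A minimal sketch, again assuming the X and y created earlier:

from imblearn.ensemble import RUSBoostClassifier

# RUSBoost combines random undersampling with AdaBoost:
# each boosting iteration is fit on a rebalanced sample
rbc = RUSBoostClassifier(random_state=42)
rbc.fit(X, y)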

3. Performance Metrics

When dealing with imbalanced data, we cannot rely on accuracy alone to evaluate model performance: because the majority class far outnumbers the minority class, a model that always predicts the majority class can still score high accuracy. For instance, on a dataset that is 90% negative, predicting "negative" for every sample already achieves 90% accuracy. We therefore need performance metrics suited to imbalanced data, such as the confusion matrix, precision, recall, F1 score, and the ROC curve with its AUC value.

Here is an example of computing these metrics using Python's scikit-learn library:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

# Predict (on the training data here for simplicity;
# in practice, evaluate on a held-out test set)
y_pred = bbc.predict(X)

# Compute the confusion matrix
print('Confusion Matrix:\n', confusion_matrix(y, y_pred))

# Compute precision
print('Precision: ', precision_score(y, y_pred))

# Compute recall
print('Recall: ', recall_score(y, y_pred))

# Compute the F1 score
print('F1 Score: ', f1_score(y, y_pred))

# Compute the AUC; roc_auc_score expects scores or probabilities
# rather than hard labels, so use predict_proba
print('AUC: ', roc_auc_score(y, bbc.predict_proba(X)[:, 1]))

In this example, we first use the model to make predictions, and then calculate the confusion matrix, precision, recall, F1 score, and AUC value (the last computed from predicted probabilities).
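
scikit-learn can also print several of these metrics at once. A short sketch using classification_report, which shows per-class precision, recall, and F1 alongside the support of each class:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support in one table
print(classification_report(y, y_pred))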

Conclusion

Dealing with imbalanced data is an important and complex task that draws on a range of techniques, including resampling, ensemble methods, and performance metrics suited to imbalanced data. No single method works best in every situation; we need to choose and adapt methods according to the specific problem and dataset.

For example, if a dataset is very large, oversampling may strain computing resources, while undersampling may discard useful information; in such cases we can consider ensemble methods, or combine resampling with ensembling. We can also weight performance metrics differently depending on the application: in medical diagnosis or financial risk control, for instance, we care most about the recall of the minority class (disease or fraud).

After studying the theory, we recommend practicing on real data and problems to better understand and master these methods. We hope this article helps you deal with imbalanced data and achieve better results in your machine learning projects.

Stay tuned for the next post, where we'll explore how to deal with missing data!
