How to handle class imbalance in machine learning classification tasks

1. Preparation

(1) Imblearn installation

What should we do when we encounter an imbalanced class distribution? In Python, the imbalanced-learn (imblearn) package is designed specifically for dealing with imbalanced data.

Install imbalanced-learn. It requires Python 3.6 or above by default. Install it with administrator privileges, otherwise an error may be reported: on Windows, open the cmd window as an administrator; on Linux, prepend sudo.

pip install imbalanced-learn
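
After installation, a quick check (a minimal sketch; the printed version will depend on your environment) confirms that the package can be imported:

# Verify that imbalanced-learn is installed and importable
import imblearn
print(imblearn.__version__)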

(2) Creating an imbalanced dataset

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate sample data with make_classification
X, y = make_classification(n_samples=5000, 
                           n_features=2,  # total features = n_informative + n_redundant + n_repeated
                           n_informative=2,  # number of informative features
                           n_redundant=0,   # redundant features: random linear combinations of the informative ones
                           n_repeated=0,  # repeated features: duplicated at random from the informative and redundant ones
                           n_classes=3,  # number of classes
                           n_clusters_per_class=1,  # number of clusters that make up each class
                           weights=[0.01, 0.05, 0.94],  # list of class proportions
                           random_state=12)

Data visualization:

View the class distribution of the generated data with collections.Counter. Of the 5000 generated samples, 64 belong to class 0, 262 to class 1, and 4674 to class 2.

from collections import Counter
print(Counter(y))

 

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

2. Oversampling method

Oversampling adds samples of the minority classes to the training set so that the numbers of minority and majority examples become close, and then the learner is trained on the rebalanced data.

(1) Random oversampling method

Random oversampling can be implemented with imblearn as follows:

# Random oversampling with imblearn
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
# Check the result
print(Counter(y_resampled))

# Class counts after oversampling
# Counter({2: 4674, 1: 4674, 0: 4674})

# Visualize the resampled dataset
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()

In the resulting plot, the purple dots of the minority classes appear denser (thicker); this is the effect of random oversampling, which simply duplicates existing samples on top of one another.

Disadvantages:

  • For random oversampling, the training set grows because minority samples are replicated, which increases the complexity of model training.
  • It also easily leads to overfitting: random oversampling simply copies the initial samples, so the rules the learner extracts become too specific to those duplicates, which hurts the learner's generalization performance (see the small check below).
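
A small check (a sketch that reuses X_resampled from the random oversampling above) makes the duplication visible by counting distinct rows in the oversampled feature matrix:

import numpy as np

# Far fewer unique rows than total rows means many samples are exact copies
unique_rows = np.unique(X_resampled, axis=0)
print(len(X_resampled), len(unique_rows))  # total samples vs. distinct samples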

(2) A representative oversampling algorithm: SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) does not simply copy minority samples; it synthesizes new ones by interpolating between an existing minority sample and one of its nearest minority-class neighbors, which mitigates the overfitting caused by plain duplication.


# Oversampling with SMOTE
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

# Class counts after resampling
# [(0, 4674), (1, 4674), (2, 4674)]

# Visualize the resampled dataset
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()
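
By default SMOTE oversamples every minority class up to the size of the majority class. The sketch below (parameter values are illustrative, not taken from the example above) shows two commonly used options: the number of nearest neighbors used for interpolation and an explicit per-class target size.

# Illustrative SMOTE configuration; the target counts below are example values
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5,  # neighbors used to interpolate synthetic samples
              sampling_strategy={0: 2000, 1: 2000},  # oversample classes 0 and 1 to 2000 samples each
              random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(sorted(Counter(y_res).items()))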

 

3. Under-sampling method

Under-sampling removes some samples of the majority class from the training set so that the numbers of majority and minority examples become close, and then the learner is trained on the reduced data.

(1) Random under-sampling method

Code:

# Random undersampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

# Class counts after resampling
# [(0, 64), (1, 64), (2, 64)]

# Visualize the resampled dataset
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()

Disadvantages:

  • Random undersampling rebalances the class distribution by changing the proportion of majority-class samples, but this comes at a cost: because the sampled set is smaller than the original one, some information is inevitably lost. Deleting majority-class samples may discard examples that carry important information about the majority class, so the classifier can miss it.
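
Whichever resampling method is used, the resampled data is ultimately fed to a learner. One convenient option (a minimal sketch, not part of the original example; the classifier choice is arbitrary) is imblearn's Pipeline, which applies the sampler only when fitting, so the evaluation folds in cross-validation keep the original distribution:

# Minimal sketch: chain a sampler and a classifier with imblearn's Pipeline
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),            # resampling happens only during fit
    ("clf", LogisticRegression(max_iter=1000)),  # any classifier could be used here
])

# Each training fold is resampled; the test folds are left untouched
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1_macro")
print(scores.mean())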

 

 

 
