Handling extremely imbalanced datasets

num = 0
print("len(y_train_df):\n", len(y_train_df))
for i in range(len(y_train_df)):
    if y_train_df[i] == 1:
        num = num + 1

print("{}{}".format("The number of 1s in y_train_df is: ", num))

The displayed result is:

len(y_train_df):
 709903
The number of 1s in y_train_df is: 3293

3293/709903 ≈ 0.00464, so only about 0.46% of the training samples are positive.
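The loop above can be replaced with a vectorized count. A minimal sketch, assuming `y_train_df` is a pandas Series of 0/1 labels (the toy data below is a hypothetical stand-in, not the challenge data):

```python
import pandas as pd

# Hypothetical stand-in for y_train_df: mostly 0s, a few 1s
y_train_df = pd.Series([0] * 97 + [1] * 3)

# Vectorized count of the positive class, no Python loop needed
num_pos = int((y_train_df == 1).sum())
ratio = num_pos / len(y_train_df)

print("positives:", num_pos)     # 3
print("positive ratio:", ratio)  # 0.03
```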

The training set of the Ping An Geek Challenge is really hard to make sense of. After reading up on it this morning, I at least picked up a few technical terms: SMOTE + KNN, undersampling, oversampling.

When the class distribution is this unbalanced, plain accuracy, mean squared error, and similar metrics are all fooled by the data: a model that predicts 0 for every sample would already score over 99.5% accuracy here.

Instead, recall and the F1 score should be used to judge how reliable the model really is.
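A small sketch of why this matters, using sklearn's metric functions on hypothetical labels where a lazy model predicts 0 for everything:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 1 positive out of 10 samples,
# and a model that always predicts the majority class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.9, looks fine
print("recall:  ", recall_score(y_true, y_pred))    # 0.0, the positive was missed
print("f1:      ", f1_score(y_true, y_pred))        # 0.0
```

Accuracy looks healthy while recall and F1 expose that the model never finds the minority class.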

Then I found that sklearn has a companion package, imbalanced-learn (installed with pip install imbalanced-learn, imported as imblearn), which solves this problem out of the box, so there is no need to dig into the library internals.

http://contrib.scikit-learn.org/imbalanced-learn/stable/over_sampling.html

I followed the tutorial and installed it successfully; now let's see how well I can do.

I just hope the recall won't come out as 0. Let's go!

