Too Many Wrong Labels and No Time to Check Them Manually? Try Confident Learning to Find Errors Automatically

In machine learning, the test set is the benchmark we use to measure a model's performance. In practice, however, no matter how the labeled data is obtained, it inevitably contains some labeling errors, and these errors cannot be ignored when we try to improve model accuracy.

In an earlier paper, researchers from MIT CSAIL and Amazon examined 10 mainstream machine learning datasets [1] and found that, on average, 3.3% of the data was mislabeled; for well-known datasets such as ImageNet and CIFAR-100, the error rate is actually close to 6%.

Figure 1 Examples of label errors in mainstream datasets

Figure 2 Label error rates of mainstream datasets

Therefore, quickly and conveniently finding the mislabeled (or likely mislabeled) samples in a dataset has become very important.

This article introduces a method for finding mislabeled samples with confident learning [2] and walks through the main steps of the approach with experiments on the MNIST dataset. The details follow.

1. Method introduction

 NO.1 

What is confident learning

The concept of confident learning comes from a paper jointly written by researchers at MIT and Google: Confident Learning: Estimating Uncertainty in Dataset Labels [2]. Confident learning (CL) is an emerging, principled framework for identifying label errors, characterizing label noise, and learning with noisy labels.

Confident learning has the following advantages:

● It directly estimates the joint distribution of noisy labels and true labels, which is theoretically well founded;

● It requires no hyperparameters; it only needs the predicted probabilities of the samples obtained via cross-validation;

● It does not assume random uniform label noise (an assumption that is usually unrealistic in practice);

● It is model-agnostic: any model can be used, unlike many noisy-label learning methods that are tightly coupled to a specific model and training process;

● The authors have open-sourced cleanlab, a toolkit for confident learning that can be called with a single line of code.

 NO.2 

The confident learning process

Confident learning consists of three main steps:

● Count: estimate the joint distribution of noisy labels and true labels;

● Clean: find the noisy samples according to the joint distribution;

● Re-Training: retrain after filtering out the noisy samples.

In the Count stage, cross-validation is first performed (the process is shown in Figure 3) to obtain predicted probabilities for all samples; the average predicted probability of each manually labeled class is then taken as that class's confidence threshold, as shown in Formula 1.

Next, the confidently predicted class of each sample is computed: the class with the largest predicted probability among the classes whose probability exceeds their confidence threshold, as shown in Formula 2.

Then a count matrix (similar to a confusion matrix) between the predicted classes and the given classes is built, as shown in Formula 3.

Finally, the count matrix is calibrated so that its entries sum to the total amount of data, and then normalized to obtain the joint distribution of predicted and given labels, as shown in Formula 4.

Figure 3 Schematic diagram of cross-validation

Formula 1 (per-class confidence threshold):

$$t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y}=j;\ x, \theta)$$

Formula 2 (confidently predicted label):

$$\hat{y}^{*}_{x} = \underset{j:\ \hat{p}(\tilde{y}=j;\ x,\theta)\ \ge\ t_j}{\arg\max}\ \hat{p}(\tilde{y}=j;\ x, \theta)$$

Formula 3 (count matrix of given vs. predicted labels):

$$C_{\tilde{y}=i,\ y^{*}=j} = \Big|\big\{\, x \in X_{\tilde{y}=i} :\ \hat{p}(\tilde{y}=j;\ x,\theta) \ge t_j,\ \ j = \underset{l:\ \hat{p}(\tilde{y}=l;\ x,\theta)\ \ge\ t_l}{\arg\max}\ \hat{p}(\tilde{y}=l;\ x,\theta) \,\big\}\Big|$$

Formula 4 (calibration and normalization to the joint distribution):

$$\hat{Q}_{\tilde{y}=i,\ y^{*}=j} = \frac{\dfrac{C_{\tilde{y}=i,\ y^{*}=j}}{\sum_{j'} C_{\tilde{y}=i,\ y^{*}=j'}}\cdot |X_{\tilde{y}=i}|}{\sum\limits_{i',\,j'} \left(\dfrac{C_{\tilde{y}=i',\ y^{*}=j'}}{\sum_{j''} C_{\tilde{y}=i',\ y^{*}=j''}}\cdot |X_{\tilde{y}=i'}|\right)}$$

The variables appearing in the formulas above have the following meanings:

● $t_j$ : confidence threshold for deciding whether a prediction belongs to class $j$

● $\tilde{y}$ : the given label (the original label, possibly noisy)

● $y^{*}$ : the latent true label, here approximated by the predicted label

● $X$ : the sample space (the dataset)

● $x$ : a single sample

● $\theta$ : the parameters of the prediction model

● $\hat{p}$ : the predicted probability

● $C_{\tilde{y},\,y^{*}}$ : the count matrix of given labels versus predicted labels

● $Q_{\tilde{y},\,y^{*}}$ : the joint distribution matrix of given and predicted labels
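To make the Count stage more concrete, here is a minimal NumPy sketch of Formulas 1-4. It is only an illustration of the procedure described above (the variable and function names are my own, not cleanlab's); the actual cleanlab implementation is used in the hands-on section below.

import numpy as np

# Minimal sketch of the Count stage (Formulas 1-4). `labels` holds the given
# labels in {0, ..., K-1}; `pred_probs` is an (N, K) array of out-of-sample
# predicted probabilities obtained via cross-validation.
def estimate_joint(labels, pred_probs):
    N, K = pred_probs.shape

    # Formula 1: the confidence threshold of class j is the average predicted
    # probability of class j over the samples whose given label is j
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(K)])

    # Formulas 2 and 3: C[i][j] counts the samples with given label i whose
    # confidently predicted label is j
    C = np.zeros((K, K), dtype=np.int64)
    for n in range(N):
        above = np.where(pred_probs[n] >= thresholds)[0]   # classes above their threshold
        if len(above) == 0:
            continue                                       # not confidently assigned to any class
        j = above[np.argmax(pred_probs[n, above])]         # confidently predicted label
        C[labels[n], j] += 1

    # Formula 4: calibrate row sums to the given-label class counts, then
    # normalize so the entries sum to 1 (the estimated joint distribution)
    class_counts = np.bincount(labels, minlength=K)
    row_sums = C.sum(axis=1, keepdims=True).clip(min=1)
    calibrated = C / row_sums * class_counts[:, None]
    Q = calibrated / calibrated.sum()
    return thresholds, C, Q

Row i of Q then estimates what fraction of the whole dataset is labeled i but actually belongs to each class j.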

In the Clean stage, there are five methods for filtering noisy labels:

1. Filter the samples whose predicted class is inconsistent with the manually annotated class;

2. Filter the samples that fall into off-diagonal cells of the count matrix;

3. For each class c, filter N*p samples, where N is the number of samples whose given label is c and p is the sum of the probabilities in the joint distribution matrix for class c excluding Q(c,c);

4. For each off-diagonal cell of the count matrix, filter N*p samples, where N is the total number of samples and p is the probability of the corresponding cell in the joint distribution matrix;

5. The combination of methods 3 and 4.

Among them, method 2 is the one the authors consider most reasonable from a theoretical standpoint, but their experiments show that the results of the five methods differ little. A rough sketch of methods 1 and 2 is given after this paragraph.
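As an illustration only, the sketch below implements methods 1 and 2 on top of the Count-stage sketch above. The names are my own; cleanlab exposes equivalent strategies through the filter_by argument of find_label_issues, shown later.

import numpy as np

# Rough sketch of filtering methods 1 and 2, reusing the `thresholds` computed
# in the Count-stage sketch above; not cleanlab's actual implementation.
def filter_noise_indices(labels, pred_probs, thresholds):
    # Method 1: the argmax prediction disagrees with the given label
    method1 = np.where(pred_probs.argmax(axis=1) != labels)[0]

    # Method 2: the sample falls into an off-diagonal cell of the count matrix,
    # i.e. it is confidently predicted as class j while its given label is i != j
    method2 = []
    for n, label in enumerate(labels):
        above = np.where(pred_probs[n] >= thresholds)[0]
        if len(above) == 0:
            continue
        j = above[np.argmax(pred_probs[n, above])]
        if j != label:
            method2.append(n)
    return method1, np.asarray(method2)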

The whole process is summarized in Figure 4:

Figure 4 Schematic diagram of the confident learning process

 NO.3 

The effect of confident learning

The authors of the paper ran extensive ablation experiments to verify the effect of confident learning; here we only look at its effect on a real dataset. Figure 5 shows the authors' results of applying confident learning to the ImageNet (ILSVRC 2012) dataset:

Figure 5 The results of confident learning on the ILSVRC 2012 dataset

Figure 5(a) shows that after noisy labels are filtered out by confident learning, accuracy improves by up to 0.6 percentage points compared with removing the same number of samples at random. Comparing the (b), (c) and (d) groups of experiments shows that the more wrong labels a dataset contains, the more obvious the improvement brought by confident learning.

2. Hands-on practice

The authors of confident learning have open-sourced their code as the cleanlab library, which can be installed with a single command: pip install cleanlab. We tried it on MNIST to walk through the practical steps of confident learning. The code consists of the following parts:

 NO.1 

Parameter definition

import numpy as np
import torch
import warnings

# fix random seeds so the results are reproducible
SEED = 123
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
warnings.filterwarnings("ignore", "Lazy modules are a new feature.*")

 NO.2 

Import the dataset

from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", as_frame=False)  # fetch the MNIST dataset as NumPy arrays

X = mnist.data.astype("float32")  # 2D array (each image flattened to 784 pixels)
X /= 255.0  # normalize pixel values to the 0~1 range
X = X.reshape(len(X), 1, 28, 28)  # reshape images to [N, C, H, W]

y = mnist.target.astype("int64")  # 1D array of labels
print(X.shape, y.shape)

The printed data sizes are shown in the figure below: 70000 is the number of images, 1 is the number of channels (i.e., grayscale images), and 28*28 is the image resolution.
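The original screenshot is not reproduced here; assuming the full 70,000-image MNIST dump from OpenML, the printout should be:

# Expected output of the print above (full MNIST from OpenML):
# (70000, 1, 28, 28) (70000,)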

 NO.3 

Define a classification model


from torch import nn

class ClassifierModule(nn.Module):
    def __init__(self):
        super().__init__()

        self.cnn = nn.Sequential(  # convolutional feature extractor: two conv blocks
            nn.Conv2d(1, 6, 3),
            nn.ReLU(),
            nn.BatchNorm2d(6),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(6, 16, 3),
            nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.out = nn.Sequential(  # classifier head: flatten + two fully connected layers
            nn.Flatten(),
            nn.Linear(400, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
            nn.Softmax(dim=-1),
        )

    def forward(self, X):
        X = self.cnn(X)
        X = self.out(X)
        return X


from skorch import NeuralNetClassifier
model_skorch = NeuralNetClassifier(ClassifierModule)

Because the MNIST dataset is relatively simple, a small classification network with two convolutional layers and two fully connected layers is defined in PyTorch and wrapped with skorch, which makes it convenient to call with sklearn later. A quick sanity check of the wrapper is sketched below.
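As a quick sanity check (my own illustration, not part of the original walkthrough), the wrapped model can be used exactly like an sklearn estimator:

# Illustrative check: the skorch wrapper exposes the usual sklearn estimator API.
model_skorch.fit(X[:1000], y[:1000])       # quick fit on a small subset just to verify the pipeline
probs = model_skorch.predict_proba(X[:5])  # (5, 10) array of class probabilities
print(probs.shape, probs.sum(axis=1))      # each row should sum to roughly 1
# Note: cross_val_predict below clones the estimator, so this trial fit does
# not affect the cross-validation results.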

 NO.4 

K-fold cross-validation


from sklearn.model_selection import cross_val_predict

num_crossval_folds = 3 
pred_probs = cross_val_predict(
    model_skorch,
    X,
    y,
    cv=num_crossval_folds,
    method="predict_proba",
)

Here K=3 folds are used. The cross-training process is shown in the figure below; pred_probs holds the out-of-sample predicted probabilities needed for the subsequent confident learning step (a quick shape check is sketched below).
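# Optional check (my own addition, assuming the full MNIST): there is one
# out-of-fold probability vector per sample, so pred_probs should have shape
# (70000, 10) and each row should sum to roughly 1.
print(pred_probs.shape)
print(pred_probs[:3].sum(axis=1))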

 NO.5 

Overall accuracy from cross-validation

from sklearn.metrics import accuracy_score

predicted_labels = pred_probs.argmax(axis=1)
acc = accuracy_score(y, predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {acc}")

The results are as follows:

This result will be compared with the result after removing noisy labels.

 NO.6 

Find noisy labels through the cleanlab library

from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    y,
    pred_probs,
    return_indices_ranked_by="self_confidence",
)            
# The filter_by argument can be used to choose the filtering method; some other details can also be adjusted

print(f"Cleanlab found {len(ranked_label_issues)} label issues.")
print(f"Top 15 most likely label errors: \n {ranked_label_issues[:15]}")

The function returns a list of indices of the noisy samples. Here cleanlab found a total of 127 label issues; the indices of the 15 samples most likely to be mislabeled are as follows:
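For reference (not part of the original run), the filtering method can be switched with the filter_by argument of find_label_issues. To my knowledge cleanlab accepts values such as "prune_by_class", "prune_by_noise_rate", "both", "confident_learning" and "predicted_neq_given", which roughly correspond to the five methods listed earlier, but please check the documentation of your installed version:

# Example of selecting a different filtering method (option names assumed as above;
# verify against your cleanlab version's documentation).
issues_by_noise_rate = find_label_issues(
    y,
    pred_probs,
    filter_by="prune_by_noise_rate",  # roughly corresponds to method 4 above
    return_indices_ranked_by="self_confidence",
)
print(f"Found {len(issues_by_noise_rate)} label issues with prune_by_noise_rate.")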

 NO.7 

Visualize some results


import matplotlib.pyplot as plt

def plot_examples(id_iter, nrows=1, ncols=1):
    plt.figure(figsize=(12,8))
    for count, id in enumerate(id_iter):
        plt.subplot(nrows, ncols, count + 1)
        plt.imshow(X[id].reshape(28, 28), cmap="gray")
        plt.title(f"id: {id} \n label: {y[id]}")
        plt.axis("off")

    plt.tight_layout(h_pad=5.0)
    
plot_examples(ranked_label_issues[:50], 5, 10)

The first 50 problematic samples are shown here, as follows:

It can be seen that most of them are indeed mislabeled or ambiguous, and the remainder are mostly irregularly written digits.

 NO.8 

Re-training after removing noise labels

clean_X = np.delete(X, list(ranked_label_issues), 0)
clean_y = np.delete(y, list(ranked_label_issues), 0)
print(clean_X.shape, clean_y.shape)

clean_pred_probs = cross_val_predict(
    model_skorch,
    clean_X,
    clean_y,
    cv=num_crossval_folds,
    method="predict_proba",
)
clean_predicted_labels = clean_pred_probs.argmax(axis=1)
clean_acc = accuracy_score(clean_y, clean_predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {clean_acc}")

After removing the noisy labels, the data size is as shown in the figure below: 127 samples fewer than the original data.

The final accuracy is as follows:

Compared with the previous accuracy of 0.9766, the improvement is only slight. This is because only 127 of the 70,000 MNIST images were removed, so the impact is small. Combined with the paper, we know that the overall error rate of MNIST is quite low; confident learning should perform better on datasets that contain more label errors.

 NO.9 

Supplementary experiment

Since the accuracy improvement above is not very obvious, and 127 noisy images have little influence among the 70,000 images in MNIST, a supplementary experiment was carried out: select a subset of MNIST (some clean data plus the 127 noisy samples) and test the effect of cleanlab on a dataset with a somewhat higher noise rate.

● Prepare the dataset

cleanlab found 127 noisy images above. To keep the noise rate of the new dataset at 5% (since the 127 flagged images are not necessarily all wrong, the actual noise rate should be a little lower than 5%), the total number of images should be 127*20 = 2540, 127 of which are the noisy images. The construction code is as follows:

import random
small_Num = 127*20
small_clean_index = random.sample(list(range(clean_X.shape[0])), small_Num-len(ranked_label_issues))
# the new dataset consists of 127 noisy samples and (2540-127) clean samples
small_X = np.concatenate([clean_X[small_clean_index], X[ranked_label_issues]])
small_y = np.concatenate([clean_y[small_clean_index], y[ranked_label_issues]])

# shuffle the combined dataset
random_index = list(range(small_X.shape[0]))
random.shuffle(random_index)
small_X = small_X[random_index]
small_y = small_y[random_index]
print(small_X.shape, small_y.shape)

The dimensions of the new dataset are as follows; the amount of data is reduced to 2540, and everything else stays the same:

● Cross validation

model_skorch = NeuralNetClassifier(ClassifierModule)
num_crossval_folds = 3  
pred_probs = cross_val_predict(
    model_skorch,
    small_X,
    small_y,
    cv=num_crossval_folds,
    method="predict_proba",
)
predicted_labels = pred_probs.argmax(axis=1)
acc = accuracy_score(small_y, predicted_labels)
print("=============================================================")
print(f"Cross-validated estimate of accuracy on held-out data: {acc}")

The process of cross-training and the final accuracy are as follows:

It can be seen that with a higher proportion of noisy data and a smaller amount of data, the cross-validation accuracy is only 0.8236.

● cleanlab looks for noisy labels

Rerun confident learning on a small dataset:


ranked_label_issues = find_label_issues(
    small_y,
    pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"Cleanlab found {len(ranked_label_issues)} label issues.")
print(f"Top 15 most likely label errors: \n {ranked_label_issues[:15]}")

The results of this search are as follows:

Since the overall dataset has changed, the noisy samples found have also changed; this time 101 noisy images were found.

● re-training

Retrain after removing 101 noise data:

small_clean_X = np.delete(small_X, list(ranked_label_issues), 0)
small_clean_y = np.delete(small_y, list(ranked_label_issues), 0)
print(small_clean_X.shape, small_clean_y.shape)

clean_small_pred_probs = cross_val_predict(
    model_skorch,
    small_clean_X,
    small_clean_y,
    cv=num_crossval_folds,
    method="predict_proba",
)
clean_small_predicted_labels = clean_small_pred_probs.argmax(axis=1)
clean_small_acc = accuracy_score(small_clean_y, clean_small_predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {clean_small_acc}")

The accuracy of the re-run cross-validation is as follows:

After removing the 101 noisy samples, the accuracy rises from 0.8236 to 0.8396, an improvement of 1.6 percentage points. It can be seen that when the noise rate of a dataset is around 5%, confident learning plays a more visible role.

3. Postscript

This article introduced the basic process of using confident learning and tried cleanlab on the MNIST dataset, hoping to help readers understand both the principle and the practical usage of confident learning. In the future we will continue to introduce other methods for finding noisy labels and try experiments on object detection datasets.

References

[1] C. G. Northcutt, A. Athalye, and J. Mueller. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. arXiv:2103.14749, 2021.

[2] C. G. Northcutt, L. Jiang, and I. Chuang. Confident Learning: Estimating Uncertainty in Dataset Labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.

More datasets are being added, together with more comprehensive dataset interpretations, online Q&A and an active community of peers. Welcome to add the WeChat account opendatalab_yunying to join the official OpenDataLab communication group.
