Pseudo labels (semi-supervised learning)

    Pseudo-labeling is a semi-supervised learning technique that offers an easy and fast way to boost scores in machine learning competitions. Let me introduce semi-supervised learning based on pseudo-labels. In traditional supervised learning, our training set has labels, and the test set also has labels. The model we train on the training set can then be evaluated for accuracy on the test set.

    With pseudo-labeling, however, we first use the training set to train the best model we can, then strip the real labels from the test set and use this trained model to predict labels for it. We then treat these predicted labels as if they were real, i.e. as "pseudo-labels", add the pseudo-labeled data to the original training set, and train a new model from scratch.

    Finally, we evaluate this new model against the real labels of the test set. The overall process is shown in the figure below:

    In semi-supervised learning, the advantages of using unlabeled data are as follows:

1. Labeled data is often costly and difficult to obtain, while unlabeled data is plentiful and cheap.
2. Unlabeled data can sharpen the decision boundary, which improves the robustness of the model.
3. It is frequently used to boost leaderboard scores in machine learning competitions.
    The specific steps are organized as follows; let's walk through them together:

1. Split the labeled data into two parts, train_set and validation_set, and train the best model you can (model1).
2. Use model1 to predict labels for the unlabeled data (test_set); these predictions are the pseudo-labels.
3. Carve out part of the train_set as a new validation_set, merge the remaining part with the pseudo-labeled data to form a new train_set, and train a new best model (model2).
4. Use model2 to predict labels for the unlabeled data (test_set) to obtain the final result.
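The four steps above can be sketched in code. This is a minimal, self-contained illustration: it uses a toy nearest-centroid classifier on synthetic Gaussian blobs so it runs without external dependencies — in practice you would substitute any real estimator (e.g. a scikit-learn model) and your own data. All names here (`NearestCentroid`, `model1`, `model2`, the blob parameters) are illustrative assumptions, not part of the original post.

```python
import numpy as np

class NearestCentroid:
    """Toy classifier standing in for a real model (illustrative only)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One centroid per class: the mean of that class's training points.
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Assign each sample to the class of its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

rng = np.random.default_rng(0)
# Synthetic labeled train set (two well-separated Gaussian blobs)
# and an "unlabeled" test set drawn from the same distribution.
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])

# Steps 1-2: train model1 on the labeled data, then predict
# pseudo-labels for the unlabeled test set.
model1 = NearestCentroid().fit(X_train, y_train)
pseudo_labels = model1.predict(X_test)

# Step 3: merge the pseudo-labeled test data into the training set
# and train model2 on the enlarged set.
X_new = np.vstack([X_train, X_test])
y_new = np.concatenate([y_train, pseudo_labels])
model2 = NearestCentroid().fit(X_new, y_new)

# Step 4: the final predictions come from model2.
final = model2.predict(X_test)
print(len(final))
```

In a real pipeline you would also hold out the new validation_set mentioned in step 3 to check that model2 actually improves on model1 before trusting its predictions.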
 

Origin blog.csdn.net/Starinfo/article/details/131744984