Machine Learning 10 - Semi-supervised Learning

1 Why semi-supervised learning

Supervised machine learning has proven effective in many fields, such as the ImageNet image classification task, where deep learning models surpassed human-level accuracy as early as 2017. Machine learning is a data-driven discipline, and its fuel splits into two parts: data and labels. Under normal circumstances the data itself is relatively easy to obtain, while labels have to be annotated, which makes them far more precious. For image classification, for example, the Internet holds a vast number of images, but most of them are unlabeled.

Under normal circumstances, we can label a small portion of the data to form a supervised set, and then make good use of the remaining unlabeled data. This approach is called semi-supervised learning. Its value lies in the following:

  1. It greatly reduces a model's dependence on labeled data, which is especially valuable when a project is in a cold-start phase.
  2. The distribution of the unlabeled data carries a lot of information and can guide model iteration, even though the labels themselves are missing.
  3. Unlabeled data is easy to obtain in most cases and comes in large quantities. What it lacks in quality it makes up for in quantity, and used properly it can deliver great value.

 

2 Semi-supervised methods

2.1 Self-training

The general process of self-training is as follows:

  1. Train a model on the labeled data; there are no restrictions on the training method.
  2. Use the trained model to predict the unlabeled data and obtain labels for it; these are called pseudo labels.
  3. Select some pseudo-labeled samples and add them to the labeled data. The selection rule is also customizable; for example, you can keep only predictions with higher confidence.
  4. Retrain the model, then continue iterating.

Note that pseudo labels must be hard labels, not soft labels. What is a hard label? Imagine a binary classification task in which the model predicts the probability distribution [0.8, 0.2] for some sample: [0.8, 0.2] is the soft label, while [1, 0] is the hard label. Why can't we use the soft label? Because in self-training the soft label is exactly the model's own prediction; feeding it back as the training target gives the model nothing new to learn, and its parameters stay unchanged. Hence hard labels are mandatory.
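
To make the loop concrete, here is a minimal sketch in Python. It is only a sketch: the LogisticRegression base model, the 0.9 confidence threshold, and the round limit are illustrative assumptions, not part of the original method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled,
                  confidence_threshold=0.9, max_rounds=5):
    """Iteratively move confident pseudo-labeled samples into the labeled set."""
    X_l, y_l, X_u = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        model.fit(X_l, y_l)                    # step 1: train on labeled data
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)       # step 2: predict unlabeled data
        picked = proba.max(axis=1) >= confidence_threshold
        if not picked.any():
            break                              # nothing confident enough left
        # step 3: hard pseudo labels (argmax), never the soft distribution itself
        pseudo = model.classes_[proba[picked].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[picked]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~picked]                     # step 4: retrain next round
    return model
```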


2.2 Entropy-based Regularization

Another way to make full use of unlabeled data is entropy-based regularization. The idea is that the model's predicted probability distribution for a sample should concentrate on one category as much as possible; a model with such sharp predictions is highly discriminative and tends to work well. This makes intuitive sense: if the model assigned the same probability to every category, the sample's class could not be determined at all. We therefore want each predicted distribution to concentrate on a single category.


So what can measure whether a probability distribution is concentrated on one category? Entropy. The expression is as follows; the smaller the entropy, the more concentrated the distribution.

$$E(\hat{y}) = -\sum_{i=1}^{K} \hat{y}_i \ln \hat{y}_i$$

where $K$ is the number of classes and $\hat{y}_i$ is the predicted probability of class $i$.
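
As a quick numeric check (a small NumPy illustration added for this point, reusing the [0.8, 0.2] soft label from earlier):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))  # epsilon guards log(0)

print(entropy([0.5, 0.5]))  # ~0.693: uniform, maximally uncertain
print(entropy([0.8, 0.2]))  # ~0.500: fairly concentrated
print(entropy([1.0, 0.0]))  # ~0.000: fully concentrated on one class
```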

Now the problem becomes simple. Our semi-supervised objective has two parts:

  1. On labeled data, the model loss should be as small as possible, e.g. cross-entropy for classification tasks.
  2. On unlabeled data, the entropy should be as small as possible, so that the predicted distributions are as concentrated as possible.

The objective function can then be changed to optimize both items together; the expression is as follows:

$$L = \sum_{x \in \text{labeled}} C(y, \hat{y}) + \lambda \sum_{x \in \text{unlabeled}} E(\hat{y})$$

where $C(y, \hat{y})$ is the supervised loss (e.g. cross-entropy) and $\lambda$ balances the two terms.
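
Below is a minimal PyTorch sketch of this combined objective; the weight lam and the two-batch setup are illustrative assumptions, not a definitive implementation.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_labeled, y_labeled, x_unlabeled, lam=0.1):
    """Supervised cross-entropy plus an entropy penalty on unlabeled predictions."""
    # first term: ordinary cross-entropy on the labeled batch
    supervised = F.cross_entropy(model(x_labeled), y_labeled)

    # second term: mean entropy of the predicted distributions on the unlabeled batch
    probs = F.softmax(model(x_unlabeled), dim=1)
    ent = -(probs * torch.log(probs + 1e-12)).sum(dim=1).mean()

    return supervised + lam * ent  # lam trades off the two goals
```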

It can be seen that unlabeled data is still very valuable for machine learning models.

 

2.3 Clustering

There is also a way to exploit unlabeled data through similarity: compute the similarity between an unlabeled sample and all labeled samples, and adopt the category of the most similar one. This sounds reasonable, but it has a big problem: similarity does not imply the same category. As shown below:

[Figure: three handwritten digits, a 2 on the left, a differently written 2 in the middle, and a 3 on the right]

The 2 in the middle looks more similar to the 3 on the far right, yet it actually belongs to the same category as the 2 on the far left. Since similarity cannot be used directly in semi-supervised learning, does that mean it is useless? No. Imagine that when the amount of data is large, we have many sample points of various shapes; through chains of pairwise similarity, samples can be linked into the same dense region even if any two of them are not especially similar to each other. Samples in the same dense region can then be assigned one category. How do we find these dense regions? With clustering, of course.

The steps for semi-supervised learning with a clustering algorithm are:

  1. Cluster all of the data. Personally, I find density-based clustering algorithms such as DBSCAN and HDBSCAN more effective here.
  2. Samples in the same cluster are assumed to share a category. Use the labeled samples to determine each cluster's category, and assign it to the unlabeled samples in that cluster.
  3. Train the model on the resulting supervised data.

The crux of this method is the purity of the clusters: in general it is hard to guarantee that all data in one cluster belong to the same category. Clustering usually involves two steps: first embed each sample as a vector and compute pairwise distances, then apply a clustering algorithm (k-means, HDBSCAN, OPTICS, spectral clustering, etc.). Personally, I think the embedding matters more.
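
Here is a rough sketch of steps 1 and 2 using scikit-learn's DBSCAN. The eps and min_samples values are placeholders; embeddings is assumed to be an (n, d) float array, and -1 marks unlabeled samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def propagate_labels_by_cluster(embeddings, labels, eps=0.5, min_samples=5):
    """Assign each cluster's majority known label to its unlabeled members."""
    clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    out = labels.copy()
    for c in np.unique(clusters):
        if c == -1:
            continue                      # DBSCAN noise points stay unlabeled
        members = clusters == c
        known = labels[members]
        known = known[known != -1]        # labeled samples inside this cluster
        if len(known) == 0:
            continue                      # cluster has no labeled anchor
        out[members & (labels == -1)] = np.bincount(known).argmax()  # majority vote
    return out
```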


In addition, the clustering can be used to verify annotation accuracy: samples that fall in the same cluster but carry different labels can be sent back for re-labeling, thereby reducing annotation noise.

2.4 Graph-based Approach

Clustering was introduced above as one way to use unlabeled data; we can also use a graph to capture the connections between samples. Relationships between data are sometimes easier to obtain than labels, such as jumps between web links, the order of product clicks, or citation relationships between papers. Using these relationships, combined with a k-nearest-neighbor algorithm, discrete samples can be built into a graph: each sample is a node, each relationship between samples is an edge, and edges can carry weights.


Nodes connected to each other are considered to belong to the same category, so starting from the labeled nodes, categories can be propagated along the edges. The keys to this method are:

  1. The accuracy of the graph construction, including the node connections and the edge weights.
  2. The amount of data must be large enough; otherwise the graph may break into disconnected pieces because some bridging nodes are missing.
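
One off-the-shelf way to try this idea is scikit-learn's LabelSpreading over a kNN graph. The toy sketch below is only for illustration; make_moons and the parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# toy data: 200 points, only 10 of them labeled; -1 marks unlabeled nodes
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)
y = np.full(len(X), -1)
labeled_idx = np.random.RandomState(0).choice(len(X), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# build a kNN graph over all points and spread labels along its edges
model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y)
print((model.transduction_ == y_true).mean())  # transductive accuracy on all nodes
```

After fitting, model.transduction_ holds the inferred label for every node, labeled or not.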

 

3 Summary

Semi-supervised learning is genuinely useful for improving data efficiency, especially during cold start. Combined with active learning, label-noise learning, and few-shot learning, it can greatly reduce labeling costs and speed up model iteration and time to production.


Origin blog.csdn.net/u013510838/article/details/108539697