Supervised learning, semi-supervised learning, unsupervised learning, self-supervised learning, reinforcement learning, and contrastive learning

Table of contents

1. Supervised learning

2. Semi-supervised learning

3. Unsupervised learning

3.1. Clustering algorithm

3.2. Dimensionality reduction algorithm

3.3. Anomaly detection

3.4. Autoencoders

3.5. Generative models

3.6. Association rule learning

3.7. Self-Organizing Map (SOM)

4. Self-supervised learning

4.1. Context based

4.2. Temporal Based

4.3. Contrastive Based

5. Reinforcement learning

6. Contrastive learning

6.1. Momentum Contrast

6.2. SimCLR


This article is an interim summary; feel free to skip to the key points you need.

1. Supervised learning

The defining feature of supervised learning is that its data set is labeled. In other words, supervised learning fits a model on labeled training data and then uses that model to predict the label of new data. The labels in the data set are like the answer key for a homework assignment, and the labels we predict are the answers we hand in. When our answer differs from the answer key, teachers and parents correct it, we learn from the mistake, our understanding of the topic deepens, and our accuracy rises.

Common algorithms: classification algorithms (KNN, naive Bayes, SVM, decision trees, random forests, logistic regression, BP neural networks, etc.) and regression algorithms (linear regression, etc.). Despite its name, logistic regression is a classification algorithm.

Application scenarios: classification and regression scenarios, such as spam classification, heart disease prediction, etc.
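As a minimal sketch of "learn from labeled data, then predict the label of new data", here is a tiny k-nearest-neighbors classifier. The data points, the function name `knn_predict`, and the choice of k are made up for illustration; NumPy is assumed to be available.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy labeled data: two features per sample, label 0 or 1 (illustrative values).
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]))
pred_b = knn_predict(X_train, y_train, np.array([4.0, 4.0]))
print(pred_a, pred_b)
```

The labels play the role of the "standard answers": the prediction for a new point is simply the answer most common among the labeled points closest to it.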

2. Semi-supervised learning

Semi-supervised learning lets the learner automatically exploit unlabeled samples to improve learning performance, without relying on external interaction.

In many practical applications it is easy to collect a large number of unlabeled examples, but obtaining class-labeled samples requires manual labeling with special equipment or expensive, time-consuming experiments. The result is a very small number of class-labeled examples and a surplus of unlabeled ones. People therefore try adding the many unlabeled samples to the few labeled samples and training on both together, hoping to improve learning performance; this gave rise to semi-supervised learning, as shown in Figure 1. Semi-supervised learning avoids wasting data and resources, and it addresses both the weak generalization of supervised models trained on few labels and the imprecision of unsupervised models.

Semi-supervised learning can be further divided into pure semi-supervised learning and transductive learning. The former assumes that the unlabeled samples in the training data are not the data to be predicted, while the latter assumes that the unlabeled samples seen during learning are exactly the data to be predicted, and the goal of learning is the best possible performance on those unlabeled samples.

Generally speaking, semi-supervised learning algorithms can be divided into self-training algorithms, graph-based semi-supervised algorithms, and semi-supervised support vector machines (S3VM). A brief introduction follows:

1. Simple self-training: train a classifier on the labeled data, then use it to classify the unlabeled data, which produces pseudo-labels or soft labels. Select the unlabeled samples that appear to be classified correctly (this requires a selection criterion, such as a confidence threshold), add them with their pseudo-labels to the training set, and retrain the classifier.

2. Co-training: a form of self-training built on a neat idea. Assume each sample can be described from different views, so a different classifier can be trained on each view. These classifiers then classify the unlabeled samples, and the samples they classify with high confidence are added to the training set. Because the classifiers are trained on different views, they complement one another and improve classification accuracy, just as something is understood better when looked at from several angles.

3. Semi-supervised dictionary learning: also a form of self-training. First use the labeled data as a dictionary to classify the unlabeled data, then select the unlabeled samples that appear to be classified correctly and add them to the dictionary (which thereby becomes a semi-supervised dictionary).

4. Label propagation algorithm: a graph-based semi-supervised algorithm. It builds a graph over the training data (data points are vertices, and the similarities between points are edge weights) to find the relationship between labeled and unlabeled data. Note that this happens only within the training data: label propagation is a transductive algorithm that classifies just the unlabeled samples in the training set. That can feel a lot like supervised classification, but it is not, because labels also flow through unlabeled data: some unlabeled samples receive their label information from other unlabeled samples, so the algorithm exploits the connections among the unlabeled data.

5. Semi-supervised support vector machine: a supervised support vector machine classifies by structural risk minimization. The semi-supervised version additionally uses the spatial distribution of the unlabeled data: the decision hyperplane should be consistent with that distribution, passing through regions where the unlabeled data is sparse. This is an assumption; if it does not hold, the distribution of the unlabeled data will mislead the decision hyperplane, and performance can be worse than when only labeled data is used.
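The self-training loop from item 1 can be sketched as follows. This is only an illustration under stated assumptions: the base classifier is a nearest-centroid model chosen for brevity, the confidence score is a softmax over negative distances, and the threshold of 0.9 and the toy data are invented. None of these choices come from the text above.

```python
import numpy as np

def fit_centroids(X, y):
    """A deliberately simple classifier: one centroid per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_with_confidence(X, classes, centroids):
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    probs = np.exp(-dist)
    probs /= probs.sum(axis=1, keepdims=True)    # soft labels
    return classes[probs.argmax(axis=1)], probs.max(axis=1)

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        classes, centroids = fit_centroids(X_lab, y_lab)
        pseudo, conf = predict_with_confidence(X_unlab, classes, centroids)
        keep = conf >= threshold                  # selection criterion
        if not keep.any():
            break
        # Move confidently pseudo-labeled samples into the training set.
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        X_unlab = X_unlab[~keep]
    return fit_centroids(X_lab, y_lab)

rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [0.4, 0.2], [5.0, 5.0], [4.7, 5.2]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                     rng.normal(5.0, 0.5, size=(20, 2))])

classes, centroids = self_train(X_lab, y_lab, X_unlab)
predictions, _ = predict_with_confidence(
    np.array([[0.3, -0.2], [5.1, 4.8]]), classes, centroids)
print(predictions)
```

If the selection criterion is too loose, wrong pseudo-labels enter the training set and reinforce themselves, which is why the threshold matters.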

Application scenarios: Some scenarios where labeled data is difficult to obtain.
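To make the label propagation idea from item 4 more concrete, here is a rough sketch. It assumes a Gaussian similarity for the edge weights, a fixed number of iterations, and the convention that `-1` marks an unlabeled point; the function name and the one-dimensional toy data are illustrative only.

```python
import numpy as np

def label_propagation(X, y, sigma=1.0, n_iter=100):
    """Propagate labels over a similarity graph; y uses -1 for unlabeled points."""
    classes = np.unique(y[y >= 0])
    # Edge weights: Gaussian similarity between every pair of points.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    P = W / W.sum(axis=1, keepdims=True)          # row-normalized transitions

    labeled_rows = np.flatnonzero(y >= 0)
    F = np.zeros((len(X), len(classes)))
    F[labeled_rows, np.searchsorted(classes, y[labeled_rows])] = 1.0
    clamp = F[labeled_rows].copy()

    for _ in range(n_iter):
        F = P @ F                                 # labels flow along edges
        F[labeled_rows] = clamp                   # known labels stay fixed
    return classes[F.argmax(axis=1)]

# Two groups on a line; only the two end points are labeled.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, -1, -1, -1, -1, 1])
predicted = label_propagation(X, y)
print(predicted)
```

The point at 10.0 has no labeled neighbor nearby; its label reaches it mainly through the unlabeled point at 11.0, which is the "flow through unlabeled data" described above.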

3. Unsupervised learning

The defining feature of unsupervised learning is that the training data has no labels, so its goal is to reveal the inherent characteristics and regularities of the data by learning from the unlabeled samples; clustering is the representative example. Supervised learning learns against a given standard (the labels), while unsupervised learning learns against a relative standard (the differences among the data themselves). Take classification as an example. As a child learning to tell cats from dogs, other people told you "this is a cat" and "that is a dog", and eventually you could tell them apart and name each one. That is the result of supervised learning. If instead nobody ever taught you, but you noticed that cats and dogs differ and concluded that they must be two kinds of animals (you can tell them apart without knowing the concepts "cat" and "dog"), that is the result of unsupervised learning. Typical algorithms include:

3.1. Clustering algorithm

Clustering algorithms group samples into clusters based on their similarity. The goal is to divide the data so that examples within a group are more similar to each other than to examples in other groups.

There are many clustering methods, including centroid-based methods, density-based methods, and hierarchical methods. Centroid-based methods, such as k-means, partition the data into K clusters, where each cluster is defined by a centroid (i.e., the mean of the points assigned to it). Density-based methods, such as DBSCAN, divide the data into clusters based on the density of the examples. Hierarchical methods, such as agglomerative clustering, build a hierarchy of clusters: each example starts as its own cluster, and clusters are then merged step by step based on their similarity.
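A minimal k-means sketch, assuming Euclidean distance and random initialization from the data points; the function name, the seed, and the toy data are illustrative, and a production implementation would add smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Note that the cluster indices themselves are arbitrary: the algorithm discovers that there are two groups, but not what they are called.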

3.2. Dimensionality reduction algorithm

Dimensionality reduction algorithms are used to reduce the number of features in a dataset while preserving as much information as possible. Dimensionality reduction is often used in machine learning to improve the performance of learning algorithms, because it reduces the complexity of the data and can help reduce overfitting. It is also useful for data visualization, as it reduces the number of dimensions to a more manageable size, allowing data to be plotted in a lower-dimensional space.

There are many methods of dimensionality reduction, including linear methods and nonlinear methods. Linear methods include principal component analysis (PCA), which finds the linear combinations of features that capture the greatest variance in the data, and linear discriminant analysis (LDA), which is supervised and finds the linear combinations that best separate the labeled classes. Nonlinear methods include techniques such as t-SNE and Isomap; t-SNE preserves the local structure of the data, while Isomap preserves geodesic distances along the data manifold.

In addition to linear and nonlinear methods, there are feature selection methods (selecting a subset of the most important features) and feature extraction methods (transforming data into a new space with fewer dimensions).
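As an illustration of a linear method, PCA can be sketched with a singular value decomposition of the centered data. The function name `pca` and the synthetic data (points scattered along the line y = 2x) are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (directions of greatest variance)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2.0 * t]) + 0.05 * rng.normal(size=(100, 2))

Z, components, explained_variance = pca(X, n_components=1)
print(Z.shape, components[0])
```

Here two features are reduced to one, and the recovered direction is (up to sign) close to the line the data was generated along.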

3.3. Anomaly detection

Anomaly detection is a type of unsupervised learning that identifies examples that are unusual or unexpected compared with the rest of the data. Anomaly detection algorithms are often used for fraud detection or to identify malfunctioning devices. There are many methods for anomaly detection, including statistical methods, distance-based methods, and density-based methods. Statistical methods compute statistical properties of the data, such as the mean and standard deviation, and flag examples that fall outside a certain range. Distance-based methods calculate the distance between an example and the bulk of the data, and flag examples that are too far away. Density-based methods flag examples that lie in low-density regions of the data.
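The statistical approach can be sketched with z-scores: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 3 and the sensor-style readings below are illustrative assumptions.

```python
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Return the indices of values whose z-score magnitude exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.flatnonzero(np.abs(z) > threshold)

readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 25.0]
anomalies = zscore_anomalies(readings)
print(anomalies)
```

One caveat of this simple version is that the outlier itself inflates the mean and standard deviation, so robust statistics such as the median are often preferred on small samples.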

3.4. Autoencoders

An autoencoder is a type of neural network used for dimensionality reduction. It works by encoding the input data into a low-dimensional representation and then decoding it back to the original space. Autoencoders are commonly used for tasks such as data compression, denoising, and anomaly detection. They are particularly useful for datasets that are high-dimensional and have a large number of features, as they can learn low-dimensional representations of the data that capture the most important features.
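The encode-then-decode idea can be shown with the smallest possible case: a linear autoencoder trained by plain gradient descent. Real autoencoders use nonlinear layers and a deep learning framework; the layer sizes, learning rate, step count, and synthetic data here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 samples in 5 dimensions that really lie on a 2-D subspace.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5))

# Encoder maps 5 -> 2, decoder maps 2 -> 5 (both linear, no bias, for brevity).
W_enc = rng.normal(scale=0.1, size=(5, 2))
W_dec = rng.normal(scale=0.1, size=(2, 5))

def reconstruction_error(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_error = reconstruction_error(X, W_enc, W_dec)
learning_rate = 0.01
for _ in range(2000):
    code = X @ W_enc                         # encode to the low-dimensional space
    X_hat = code @ W_dec                     # decode back to the original space
    grad_out = 2.0 * (X_hat - X) / X.size    # gradient of the mean squared error
    grad_dec = code.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc
final_error = reconstruction_error(X, W_enc, W_dec)
print(initial_error, final_error)
```

The training signal is the input itself: no labels are needed, only the requirement that the 2-dimensional code be enough to rebuild the 5-dimensional input.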

3.5. Generative models

These algorithms are used to learn the distribution of data and generate new examples similar to the training data. Some popular generative models include generative adversarial networks (GANs) and variational autoencoders (VAEs). Generative models have many applications, including data generation, image generation, and language modeling. They are also used for tasks such as style transfer and image super-resolution.

3.6. Association rule learning

This algorithm is used to discover relationships between variables in a dataset. It is commonly used in market basket analysis to identify items that are frequently purchased together. A popular association rule learning algorithm is the Apriori algorithm.
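The two basic quantities behind association rules, support and confidence, can be sketched as below, together with the Apriori pruning idea restricted to item pairs. The transactions, the function names, and the minimum support of 0.6 are invented for illustration; the full Apriori algorithm extends the same pruning to larger itemsets.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_pairs(transactions, min_support):
    # Apriori pruning: a pair can only be frequent if both of its items are frequent.
    items = sorted({item for t in transactions for item in t})
    frequent_items = [i for i in items if support({i}, transactions) >= min_support]
    return {pair: support(pair, transactions)
            for pair in combinations(frequent_items, 2)
            if support(pair, transactions) >= min_support}

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

pairs = frequent_pairs(transactions, min_support=0.6)
beer_to_diapers = confidence({"beer"}, {"diapers"}, transactions)
print(sorted(pairs), beer_to_diapers)
```

In this toy data every basket that contains beer also contains diapers, so the rule "beer implies diapers" has confidence 1.0.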

3.7. Self-Organizing Map (SOM)

A self-organizing map (SOM) is a neural network architecture for visualization and feature learning. It is an unsupervised learning algorithm that can be used to discover structure in high-dimensional data. SOMs are commonly used for tasks such as data visualization, clustering, and anomaly detection. They are especially useful for visualizing high-dimensional data in two-dimensional space, as they can reveal patterns and relationships that might not have been apparent in the original data.

Application scenarios: clustering scenarios, such as news aggregation websites.

4. Self-supervised learning

Self-supervised learning uses auxiliary (pretext) tasks to mine supervisory signals from large-scale unlabeled data, and trains the network on this constructed supervision so that it learns representations that are valuable for downstream tasks.
That is, self-supervised learning does not need any externally labeled data; the labels are derived from the input data itself. It still follows the pretrain-finetune pattern: the network is first pre-trained on the pretext task, and the learned parameters are then transferred to the downstream task network and fine-tuned to obtain the final network.
The methods of self-supervised learning fall mainly into three categories:

4.1. Context based

Many pretext tasks can be constructed from the context information in the data itself. A jigsaw pretext divides a picture into 9 patches and defines the loss on predicting the relative positions of those patches. A cutout (inpainting) pretext randomly removes part of the picture and predicts the removed part from the rest. A colorization pretext takes the grayscale version of an image as input and predicts its colors.

Paper 1: "S4L: Self-Supervised Semi-Supervised Learning"

Self-supervised learning and semi-supervised learning (a large amount of unlabeled data plus a small amount of labeled data) can also be combined: unlabeled data is trained with a self-supervised task (rotation prediction), while labeled data is trained jointly with the self-supervised task and the supervised task. The paper evaluates on semi-supervised splits of ImageNet with 10% or 1% of the labels, and finally analyzes the impact of several hyperparameters on final performance.

For labeled data, the model predicts both the rotation angle and the label. For unlabeled data, only the rotation angle is predicted, and rotation prediction can be replaced by any other unsupervised task. The authors propose two algorithms: S^4L-Rotation, where the unsupervised loss is a rotation prediction task, and S^4L-Exemplar, where the unsupervised loss is a triplet loss based on image transformations (cropping, mirroring, color transformation, etc.).

In general, an unsupervised pretext task has to be created for the unlabeled data, and this pretext task lets the model learn a good feature representation from a large amount of unlabeled data.
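The way a pretext task manufactures its own labels can be illustrated with rotation prediction. The sketch below only builds the training pairs (rotated image, rotation class); the network that would predict the class is omitted. The function name and the random arrays standing in for images are hypothetical.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Rotate each square image by 0, 90, 180 or 270 degrees.

    Returns the rotated images and the rotation class (0..3) as the free label.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k=int(k)) for img, k in zip(images, labels)])
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.random((6, 4, 4))            # six fake 4x4 grayscale "images"
rotated, labels = make_rotation_batch(images, rng)
print(rotated.shape, labels)
```

No human annotation is involved: the label is known simply because we applied the rotation ourselves, which is what "labels derived from the input data itself" means in practice.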

4.2. Temporal Based

In fact, there are many constraint relationships between samples. Here we introduce self-supervised learning based on temporal constraints. The data type that best reflects temporal structure is video.

Paper 2: "Time-Contrastive Networks: Self-Supervised Learning from Video"

The first idea is based on frame similarity. In simple terms, the features of adjacent frames in a video can be considered similar, while the features of frames that are far apart are dissimilar; self-supervised constraints are built from such similar (positive) and dissimilar (negative) samples. (As shown below.)

In addition, the same object may be filmed from multiple viewing angles (multi-view). Features of the same moment seen from different views can be considered similar, and features of different moments can be considered dissimilar. (As shown below.)

Paper 3: "Unsupervised Learning of Visual Representations Using Videos"

Another idea is based on unsupervised tracking, from the ICCV 2015 paper "Unsupervised Learning of Visual Representations Using Videos". First, unsupervised tracking is run on a large number of unlabeled videos to obtain many tracked object patches. The features of one tracked object in different frames should then be similar (positive), while the features of patches tracking different objects should be dissimilar (negative).

 

Paper 4: "Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification"

Besides feature similarity, the order of video frames is also a self-supervisory signal. For example, Misra et al. (ECCV 2016), in "Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification", proposed a method based on order constraints: correctly ordered and incorrectly ordered frame sequences are sampled from videos to construct positive and negative samples for training. In short, the model is designed to judge whether a given frame sequence is in the correct order.

4.3. Contrastive Based

Representations are learned by encoding how similar or dissimilar two things are: positive and negative samples are constructed, and self-supervised learning is achieved by measuring the distances to the positive and negative samples. The core idea is that the similarity between a sample and its positive samples should be far greater than the similarity between the sample and its negative samples.

Paper 5: "Representation Learning with Contrastive Predictive Coding"

We first introduce DIM (Deep InfoMax) from ICLR 2019. The idea of DIM is that the hidden representation offers both global features (the final output of the encoder) and local features (features from an intermediate layer of the encoder), and the model must classify whether a global feature and a local feature come from the same image. So here x is the global feature of one image, the positive samples are local features of that image, and the negative samples are local features of other images. This work was groundbreaking and has since been applied to other fields, such as graphs.

CPC (Contrastive Predictive Coding) is a self-supervised framework based on contrastive constraints. It can be applied to any form of data, such as text, speech, video, and images (an image can be regarded as a sequence of pixels or image patches).

CPC learns feature representations by encoding information shared across multiple time points while discarding local information. These features are called "slow features": features that do not change rapidly over time. For example: the identity of the speaker in the video, the activities in the video, the objects in the image, etc.

CPC uses an autoregressive model to encode the information shared between data points separated by multiple time steps. The resulting representation c_t summarizes the past information; the positive samples are the inputs of the same sequence after time t, and the negative samples are drawn at random from other sequences. The main idea of CPC is to predict future data from past information and to train by sampling negatives.

Application scenarios: semantic segmentation, target detection, image classification and human action recognition, etc.

5. Reinforcement learning

Reinforcement learning (RL) studies how an agent can maximize the reward it obtains in a complex and uncertain environment. The agent perceives how the state of the environment responds to its actions through the reward, uses that feedback to choose better actions, and thereby seeks the greatest return. This is called learning through interaction, and such a learning method is called reinforcement learning.

During reinforcement learning, the agent constantly interacts with the environment. The agent observes the state of the environment and uses it to output an action, that is, a decision. The decision is applied to the environment, and the environment returns the next state and the reward for the decision just taken. The goal of the agent is to collect as much reward from the environment as possible.

Reinforcement learning mainly has the following characteristics:

1. Trial-and-error learning: reinforcement learning generally has no direct guidance, so the agent must keep interacting with the environment and find the best policy through trial and error.

2. Delayed rewards: feedback in reinforcement learning is sparse and often arrives only after the fact (at the final state). In Go, for example, the winner is known only at the end of the game.

 

The basic elements of reinforcement learning are:

1. Environment: an external system in which the agent lives. The agent can perceive this system and take actions based on the state it perceives.

2. Agent: a system embedded in the environment that can change the state of the environment by taking actions.

3. State / observation: the state is a complete description of the world and hides no information about it. An observation is a partial description of the state, and some information may be missing.

4. Action: different environments allow different types of actions. In a given environment, the set of valid actions is called the action space, which may be discrete or continuous. For example, a maze-walking robot that can only move east, south, west, or north has a discrete action space; a robot that can move in any direction over 360° has a continuous action space.

5. Reward: a scalar feedback signal given by the environment that indicates how well the agent did by taking a certain action at a certain step.
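The interaction loop and the elements above can be sketched with tabular Q-learning, a standard algorithm that the text above does not name and that is used here purely as an illustration. The five-state corridor environment, the reward of 1 at the goal, and all hyperparameters are invented for this example.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (terminal)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right (a discrete action space)

def step(state, action):
    """Environment: returns next state, reward, and whether the episode ended."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0          # delayed reward: only at the goal
    return next_state, reward, done

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy policy with random tie-breaking (trial and error)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _step in range(100):           # cap on episode length
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

Although the reward appears only at the last step, the update rule passes its value backward one state at a time, so the agent ends up preferring "right" even in states far from the goal.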

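Common algorithms: Monte Carlo methods and temporal-difference methods such as Q-learning. The underlying problem is usually modeled as a Markov decision process (not a hidden Markov model).

Application scenarios: scenarios that require sequential decision-making, such as autonomous driving and AlphaGo playing Go.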

6. Contrastive learning

The basic idea is to provide a set of positive and negative samples. The goal of the loss function is to find representations that minimize the distance between a sample and its positive samples while maximizing the distance to its negative samples. The similarity of two encoded images can be computed with a dot product, which is exactly what we want! So does this mean that self-supervised learning (SSL) in computer vision is now solved? Not entirely.

Why not? Because images are very high-dimensional objects, it is almost impossible to enumerate all negative samples, and even if it were possible it would be very inefficient. The following methods were developed in response.

Before describing the methods, let us first discuss the contrastive loss, which will help in understanding the algorithms below. Contrastive learning can be viewed as a dictionary look-up task. Imagine an image or patch is encoded as a query and then matched against a set of random negative samples (anything other than the original image) plus a few positive samples (augmented views of the original image). This set of samples can be viewed as a dictionary, and each sample in it is called a key. Assuming there is only one positive example, the query should match exactly one of the keys well. In this way, contrastive learning reduces the distance between a query and its matching key while increasing the distance to all other keys.
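The dictionary look-up view corresponds to a loss of the InfoNCE form: a softmax over the similarities between the query and all keys, with the positive key as the correct "class". The sketch below assumes L2-normalized vectors, dot-product similarity, and a temperature of 0.07; the function name and the random vectors are illustrative.

```python
import numpy as np

def info_nce(query, keys, positive_index, temperature=0.07):
    """Cross-entropy of picking the positive key out of the dictionary."""
    query = query / np.linalg.norm(query)
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ query / temperature          # dot-product similarities
    logits = logits - logits.max()               # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return float(-log_prob[positive_index])

rng = np.random.default_rng(0)
query = rng.normal(size=8)
positive_key = query + 0.01 * rng.normal(size=8)   # e.g. another view of the same image
negative_keys = rng.normal(size=(5, 8))            # other images
keys = np.vstack([positive_key, negative_keys])

loss_correct = info_nce(query, keys, positive_index=0)
loss_wrong = info_nce(query, keys, positive_index=1)
print(loss_correct, loss_wrong)
```

The loss is small when the query is much closer to its positive key than to any negative key, and large otherwise, which is the behavior described above.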

At present, the two key algorithms in contrastive learning are as follows:

6.1. Momentum Contrast

MoCo (Momentum Contrast) was published by Kaiming He's team at CVPR 2020. With contrastive learning, MoCo brings unsupervised pre-training on ImageNet to the point where it matches, and on several downstream detection and segmentation tasks surpasses, supervised pre-training. MoCo focuses on how the number of samples affects the quality of learning. Positive and negative samples are generated by random cropping: two crops of the same picture form a positive pair, and crops of different pictures form negative pairs, so the task is to judge whether two crops come from the same picture. First consider the simple end-to-end model structure, as shown below:

Let xq be the query sample and xk the key samples (positive and negative), and let q and k be their encoded representations. The end-to-end model can encode xq and xk with the same encoder or with two encoders, and then computes the loss from their inner products. Note that in the end-to-end model the dictionary size equals the mini-batch size, and every negative sample in the batch contributes to the loss. During backpropagation, gradients flow back through the encoder function f() for all of these samples, so in practice the number of negative samples is limited by batch_size, which in turn limits the performance of the model. To address this problem, the authors discuss the Memory Bank model and propose MoCo, whose structures are shown in the figure below:

The Memory Bank model decouples the dictionary size from the mini-batch size: negative samples are not taken from the current batch but sampled from a bank that stores the features of all samples. With random sampling, the negatives seen by a query can to some extent represent the whole data set. The problem is that every mini-batch update changes the encoder parameters. Re-encoding all samples after each update would cost too much memory and computation, but if only the samples of the current mini-batch are refreshed in the bank, the stored representations lag behind the current parameters. MoCo combines the end-to-end and Memory Bank approaches and solves these problems by adding a momentum encoder and by maintaining the dictionary as a first-in, first-out queue. The goal is to build a dictionary that is large and stays consistent during training. The queue holds the feature representations of the samples from the most recent mini-batches and serves as a subset of all samples. The parameters θk of the key encoder are not trained by backpropagation; they follow the parameters θq of the query encoder through a momentum update:

θk ← m·θk + (1 − m)·θq

where m ∈ [0, 1) is the momentum coefficient; a value close to 1 makes the key encoder change slowly.

Summary: learning good representations requires a large dictionary with many negative examples, while the encoder of the dictionary keys must stay as consistent as possible. The core of the approach is to treat the dictionary as a queue rather than a static memory bank or a single mini-batch. This gives the dynamic dictionary a rich set of negative samples and decouples the dictionary size from the mini-batch size, so the number of negative samples can grow as needed.
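The two mechanisms in MoCo, the momentum update of the key encoder and the queue used as a dictionary, can be sketched as follows. Encoder parameters are represented by a plain vector, and the class name `KeyQueue`, the queue length, and the momentum value 0.9 are assumptions chosen to keep the example small; this is not the authors' implementation.

```python
from collections import deque

import numpy as np

def momentum_update(theta_k, theta_q, m=0.999):
    """Key encoder slowly follows the query encoder: theta_k <- m*theta_k + (1-m)*theta_q."""
    return m * theta_k + (1.0 - m) * theta_q

class KeyQueue:
    """Dictionary of keys kept as a FIFO queue of recent mini-batches."""

    def __init__(self, max_batches):
        # deque with maxlen drops the oldest batch automatically when full.
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, keys):
        self.batches.append(keys)

    def all_keys(self):
        return np.concatenate(list(self.batches), axis=0)

theta_q = np.ones(3)
theta_k = np.zeros(3)
theta_k = momentum_update(theta_k, theta_q, m=0.9)

queue = KeyQueue(max_batches=2)
for batch_id in range(3):
    # Each fake batch holds 4 keys of dimension 8, filled with its batch id.
    queue.enqueue(np.full((4, 8), float(batch_id)))
dictionary = queue.all_keys()
print(theta_k, dictionary.shape)
```

The dictionary here holds two batches' worth of keys regardless of the batch size, and the oldest batch (id 0) has been pushed out, mirroring how the queue decouples dictionary size from mini-batch size.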

6.2. SimCLR

SimCLR was published by Ting Chen et al. at ICML 2020. Compared with MoCo, SimCLR focuses on how positive and negative samples are constructed. SimCLR also explores the role of a nonlinear layer in contrastive learning, and analyzes the effect of hyperparameters such as batch_size and the number of training epochs. The structure diagram of the SimCLR model is as follows:

 

Given an input anchor x, positive and negative sample pairs are first generated through data augmentation (random cropping, color distortion, Gaussian blur, etc.). ResNet-50 serves as the encoder, i.e. the function f(). After encoding, an MLP maps the representation into the space where the contrastive loss is computed. The goal is for different augmentations of the same image to have similar representations, and for augmentations of other images in the mini-batch to be far away.

In the data augmentation step, the authors generate samples in the following way:

The authors experimented with a variety of data augmentation methods and concluded that data augmentation significantly improves contrastive learning, and that combining several augmentations works better; data augmentation also benefits contrastive learning more than it benefits supervised learning. In addition, by adding a nonlinear transformation after the encoder, they found that the encoder output h retains information related to the augmentation, and that the nonlinear layer removes this information so that the representation used in the loss returns to the essence of the data, which improves contrastive learning. The authors did not use a memory bank or similar mechanism; they simply increased batch_size to 8192 (using 128 TPUs) and trained for more epochs, verifying that a larger batch_size and longer training significantly improve contrastive learning. In the end, SimCLR outperforms MoCo on ImageNet by more than 7%.

Summary: the core ideas of SimCLR are a larger batch size (8192, to get a rich set of negative samples), stronger data augmentation (cropping, color distortion, and Gaussian blur), a nonlinear transformation before similarity matching, larger models, and longer training. These may seem obvious, but each had to be established by trial and error, and the paper shows empirically that they significantly improve performance.
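The objective described here, pulling two views of the same image together while pushing away the other images in the mini-batch, is commonly written as the NT-Xent (normalized temperature-scaled cross-entropy) loss. The sketch below assumes the projections after the MLP are already available as arrays; the function name, the temperature of 0.5, and the random inputs are illustrative.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """z_a[i] and z_b[i] are projections of two augmented views of image i."""
    n = len(z_a)
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature                   # cosine similarity / temperature
    np.fill_diagonal(sim, -np.inf)                # a view is never compared with itself
    # For row i the positive is the other view of the same image.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
loss_aligned = nt_xent_loss(z_a, z_a + 0.01 * rng.normal(size=(8, 16)))
loss_random = nt_xent_loss(z_a, rng.normal(size=(8, 16)))
print(loss_aligned, loss_random)
```

With a batch of N images, each view has one positive and 2N − 2 negatives, which is why a larger batch directly provides more negative samples.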

But contrastive learning also has limitations:

1. A large number of negative samples are required to learn better representations.

2. Training requires large batches or large dictionaries.

3. It does not scale well to higher dimensions.

4. Some kind of asymmetry is needed to avoid constant solutions.

Application scenarios: face recognition, NLP, etc.

Origin blog.csdn.net/weixin_45684362/article/details/128683954