Illustrated self-supervised learning: the biggest slice of the AI cake

Author: Amit Chaudhary
Translation: ronghuaiyang (AI Park)
Original link:


If artificial intelligence is a cake, then most of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning.

Yann LeCun introduced this "cake analogy" in a talk to illustrate the importance of self-supervised learning. Although the analogy is debated, we have already seen the impact of self-supervised learning in natural language processing, where recent developments (Word2Vec, GloVe, ELMo, BERT) embraced self-supervision and achieved state-of-the-art results.


Curious about applications of self-supervised learning in computer vision, I went through a recent survey paper by Jing et al., which reviews the literature on self-supervision in computer vision.

This article is my intuitive summary of the problem patterns used in self-supervised learning.

The key idea

To use supervised learning, we need enough labeled data. To get it, human annotators must manually label data (images/text), which is a slow and expensive process. There are also domains, such as medicine, where obtaining enough data is itself a challenge.

This is where self-supervised learning comes into play. It poses the following question to address the problem:

Can we design tasks such that a virtually unlimited number of labels can be generated from existing images, and use those labels to learn image representations?

We replace the block of manually labeled data by creatively exploiting some property of the data to set up a supervised task. For example, instead of labeling images as cat/dog, we can rotate them by 0/90/180/270 degrees and train a model to predict the rotation. This lets us generate a virtually unlimited supply of training data from millions of freely available images.
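As a toy illustration of this idea, the sketch below generates such "free" rotation labels in pure Python, representing an image as a 2D grid of numbers (a real pipeline would of course use an image library and tensors; the function names here are hypothetical):

```python
import random

def rotate90(img):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img):
    """Return (rotated_image, label), where label 0..3 encodes a
    rotation of 0/90/180/270 degrees -- no human annotation needed."""
    label = random.randrange(4)
    rotated = img
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label
```

Running `make_rotation_example` over a large unlabeled image collection yields an arbitrarily large labeled training set at zero annotation cost.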

Existing creative approaches

Below is a list of approaches researchers have proposed to exploit image and video properties for self-supervised representation learning.

Learning from images

1. Image colorization

Formulation:

Prepare pairs of (grayscale, color) images from millions of freely available color images.

We can train a fully convolutional neural network with an encoder-decoder architecture on this task, computing the L2 loss between the predicted and the actual color image.

To solve this task, the model must understand the different objects present in the image and their related parts, so that it can paint those parts the same color. The learned representations are therefore useful for downstream tasks.

Papers: Colorful Image Colorization | Real-Time User-Guided Image Colorization with Learned Deep Priors | Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification
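A minimal sketch of how such free training pairs and the L2 loss could be formed, in toy pure-Python form (a real pipeline would operate on image tensors; the helper names are hypothetical):

```python
def to_gray(rgb_img):
    """Derive the 'free' grayscale input from a color image (rows of
    (r, g, b) tuples) using the standard luminance weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_img]

def l2_loss(pred, target):
    """Mean squared error between predicted and actual color values."""
    flat_p = [v for row in pred for px in row for v in px]
    flat_t = [v for row in target for px in row for v in px]
    return sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_t)
```

The color image is both the source of the input (via `to_gray`) and the regression target, which is exactly what makes the task self-supervised.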

2. Image super-resolution

Formulation:

Prepare (small, upscaled) training pairs by downsampling images.

GAN-based models such as SRGAN are popular for this task. The generator takes a low-resolution image and outputs a high-resolution image using a fully convolutional network. The actual and generated images are compared using mean squared error and a content loss to mimic human-perceived quality. A binary-classification discriminator takes an image and classifies whether it is a real high-resolution image (1) or a fake generated super-resolution image (0). The interplay between these two models leads the generator to learn to produce images with fine detail.

Both the generator and the discriminator learn semantic features that can be used for downstream tasks.

Paper: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
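The "free" training pairs for this task come from downsampling alone. A toy sketch (2x average pooling over a 2D grid; real systems use bicubic downsampling on image tensors, and the function names here are hypothetical):

```python
def downsample2x(img):
    """Average-pool a 2D grid by a factor of 2 to create the
    low-resolution half of a (low, high) training pair."""
    h, w = len(img), len(img[0])
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def make_sr_pair(img):
    """The high-resolution target is simply the original image itself."""
    return downsample2x(img), img
```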

3. Image inpainting

Formulation:

Prepare (damaged, original) training pairs by randomly removing a portion of an image.

Similar to super-resolution, we can use a GAN-based architecture, in which the generator learns to reconstruct the image and the discriminator learns to separate real images from generated ones.

For downstream tasks, Pathak et al. showed that the semantic features learned by such a generator give a 10.2% improvement over random initialization on the PASCAL VOC 2012 semantic segmentation challenge, and an improvement of under 4% for classification and object detection.

Paper: Context Encoders: Feature Learning by Inpainting
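Generating the (damaged, original) pairs requires no labels at all. A toy sketch, zeroing out a random square block of a 2D grid (real context encoders mask patches of image tensors; the helper name is hypothetical):

```python
import random

def corrupt(img, hole=2):
    """Zero out a random hole x hole block, returning the
    (damaged, original) training pair for inpainting."""
    h, w = len(img), len(img[0])
    top = random.randrange(h - hole + 1)
    left = random.randrange(w - hole + 1)
    damaged = [row[:] for row in img]  # copy; keep the original intact
    for i in range(top, top + hole):
        for j in range(left, left + hole):
            damaged[i][j] = 0
    return damaged, img
```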

4. Image jigsaw puzzle

Formulation:

Generate training pairs by randomly shuffling patches of an image.

Even with only 9 patches, there are 362,880 possible puzzles. To keep the task tractable, only a subset of the possible permutations is used, for example the 64 permutations with the highest Hamming distance.

Suppose we shuffle the image using the permutation shown; say it is permutation number 64 out of our 64 chosen permutations.

Now, to identify the permutation applied to the tiles, Noroozi et al. proposed a neural network called the context-free network (CFN), shown in the figure. Here, each tile is passed through the same siamese convolutional layers with shared weights. The features are then combined in a fully connected layer. In the output, the model must predict which of the 64 possible permutation classes was used. If we know the permutation, we can solve the puzzle.

To solve the jigsaw task, the model needs to learn how parts are assembled into an object, the relative positions of different parts of an object, and the shape of the object. These representations are therefore useful for downstream classification and detection tasks.

Paper: Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
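One way to pick a well-separated subset of permutations is a greedy maximin search over Hamming distance. The sketch below is a small-scale stand-in (4 tiles instead of 9, since enumerating all 362,880 permutations of 9 tiles greedily is slow in pure Python); the function names are hypothetical:

```python
from itertools import permutations

def hamming(a, b):
    """Number of positions where two permutations disagree."""
    return sum(x != y for x, y in zip(a, b))

def select_permutations(n_tiles, k):
    """Greedily pick k tile permutations that are mutually far apart
    in Hamming distance -- a toy version of choosing the 64
    maximally-distant permutations over 9 tiles."""
    all_perms = list(permutations(range(n_tiles)))
    chosen = [all_perms[0]]
    while len(chosen) < k:
        best = max(all_perms,
                   key=lambda p: min(hamming(p, c) for c in chosen))
        chosen.append(best)
    return chosen
```

Each chosen permutation then serves as one class label for the classification head.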

5. Context prediction

Formulation:

Prepare training pairs by randomly selecting an image patch and one of its neighboring patches.

To solve this pretext task, Doersch et al. used an architecture similar to the jigsaw one. The two image patches are passed through two siamese convolutional neural networks to extract features, which are concatenated and classified into 8 classes representing the 8 possible neighbor positions.

Paper: Unsupervised Visual Representation Learning by Context Prediction
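The 8-way label here is again free: it is just the relative position of the sampled neighbor. A toy sketch over patch coordinates in a 3x3 grid (the names are hypothetical; a real pipeline would crop the corresponding pixel patches):

```python
import random

# (row, col) offsets of the 8 neighbours of a centre patch in a 3x3
# grid, indexed by class label 0..7
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]

def make_context_example(centre=(1, 1)):
    """Sample (centre_position, neighbour_position, label); the label
    encodes which of the 8 neighbour positions was chosen."""
    label = random.randrange(8)
    dr, dc = NEIGHBOUR_OFFSETS[label]
    r, c = centre
    return centre, (r + dr, c + dc), label
```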

6. Geometric transformation recognition

Formulation:

Prepare (rotated image, rotation angle) training pairs by randomly rotating images.

To solve this pretext task, Gidaris et al. proposed a framework in which a rotated image is passed through a convolutional neural network that must classify it into one of four classes (0/90/180/270 degrees).

Although this is a very simple idea, the model must understand the location, type, and pose of the objects in the image to complete the task, so the learned representations are useful for downstream tasks.
Paper: Unsupervised Representation Learning by Predicting Image Rotations

7. Image clustering

Formulation:

Prepare training pairs by clustering images and using the cluster assignments as labels.

To solve this pretext task, Caron et al. proposed an architecture called deep clustering. Here, the images are first clustered, and the clusters are used as classes. The task of the convolutional neural network is then to predict the cluster label of an input image.

Paper: Deep Clustering for Unsupervised Learning of Visual Features
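To make the pseudo-labeling step concrete, here is a toy 1-D k-means in pure Python that turns unlabeled points into cluster labels a classifier could then be trained to predict. This is only a sketch: DeepCluster actually runs k-means on high-dimensional CNN features and alternates clustering with training, and the function name is hypothetical:

```python
def kmeans_labels(points, k, iters=10):
    """Toy 1-D k-means producing pseudo-labels for unlabeled points."""
    centers = list(points[:k])  # naive initialization: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = [min(range(k), key=lambda c: abs(p - centers[c]))
                  for p in points]
        # move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels
```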

8. Image synthesis

Formulation:

Prepare (image, properties) training pairs by generating synthetic images with a game engine and adapting them to real images.

To solve this pretext task, Ren et al. proposed an architecture in which weight-shared convolutional networks are trained on both synthetic and real images, and a discriminator learns to classify whether a feature map comes from a synthetic or a real image. Thanks to this adversarial setup, the representations shared between real and synthetic images become better.

Learning from videos

1. Video frame order verification

Formulation:

Prepare (video frames, correct-order label) training pairs by shuffling frames from a video.

To solve this pretext task, Misra et al. proposed a framework in which video frames are passed through weight-sharing ConvNets, and the model must determine whether the frames are in the correct temporal order. In doing so, the model learns not only spatial features but also temporal features.

Paper: Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
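The order-verification labels also come for free from the video itself. A toy sketch over frame indices (a real pipeline would sample actual frame tuples; the function name is hypothetical):

```python
import random

def make_order_example(frames):
    """Return (sequence, label): label 1 if the frames are in their
    original temporal order, 0 if they were shuffled."""
    if random.random() < 0.5:
        return list(frames), 1
    shuffled = list(frames)
    while shuffled == list(frames):  # ensure the order actually changes
        random.shuffle(shuffled)
    return shuffled, 0
```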

English original: https://amitness.com/2020/02/illustrated-self-supervised-learning/

 

Posted 2020-3-2.



Source: blog.csdn.net/weixin_42137700/article/details/105266424