Unsupervised Visual Representation Learning by Context Prediction (2015)

       Given only a large, unlabeled collection of images, we extract random pairs of patches from each image and train a convolutional neural network to predict the position of the second patch relative to the first. We argue that doing this well requires the model to learn to recognize objects and their parts. We demonstrate that feature representations learned using this intra-image context indeed capture visual similarity across images.
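
       As a concrete illustration of this pretext task, the following is a minimal sketch (not the authors' released code) of how one might sample a patch pair and its relative-position label. The patch size, gap, and the assumption that the image is large enough are illustrative choices.

```python
import numpy as np

PATCH = 96  # illustrative patch size
GAP = 48    # gap between patches, roughly half the patch width

# offsets of the 8 neighbour positions, indexed by the classification label
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_pair(image):
    """image: H x W x 3 numpy array (assumed large enough for a full 3x3 grid).
    Returns (center_patch, neighbour_patch, relative_position_label)."""
    h, w, _ = image.shape
    step = PATCH + GAP
    # choose a center location that leaves room for all 8 neighbours
    y = np.random.randint(step, h - step - PATCH)
    x = np.random.randint(step, w - step - PATCH)
    label = np.random.randint(8)
    dy, dx = OFFSETS[label]
    center = image[y:y + PATCH, x:x + PATCH]
    ny, nx = y + dy * step, x + dx * step
    neighbour = image[ny:ny + PATCH, nx:nx + PATCH]
    return center, neighbour, label
```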

       Recently, new computer vision methods have leveraged large datasets of millions of labeled examples to learn rich, high-performance visual representations [29]. However, efforts to scale these methods to truly Internet-scale datasets (i.e., hundreds of billions of images) have been hampered by the enormous cost of the required human annotation. Unfortunately, despite decades of sustained effort, unsupervised methods have still not been able to extract useful information from large collections of full-sized, real images. After all, without labels it is not even clear what should be represented.

       This turns an apparently unsupervised problem (finding a good measure of similarity between words) into a "self-supervised" problem: learning a function from a given word to the words around it.

       Our underlying assumption is that doing well at this task requires understanding both the scene and the objects; that is, a good visual representation for this task requires extracting objects and their parts in order to reason about their relative spatial locations.

       Despite being trained with the objective function operating on one image at a time, our representation generalizes to all images. That said, instance-level supervision appears to improve performance on category-level tasks.

       One way to think about good image representations is as latent variables for suitable generative models. An ideal generative model of natural images should both generate images according to their natural distribution, and be concise, that is, find common causes of different images and share information among them. Inferring the latent structure of a given image is difficult even for relatively simple models.

       Unsupervised representation learning can also be formulated as learning embeddings (i.e., feature vectors for each image) where semantically similar images are close together and semantically dissimilar images are far apart.
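
       For illustration, one common way to express such an embedding objective is a margin-based contrastive loss. The sketch below is a generic PyTorch example of this idea, not the objective used in this paper.

```python
import torch
import torch.nn.functional as F

# Illustrative contrastive loss: pull embeddings of similar images together and
# push dissimilar ones at least `margin` apart. `z1`, `z2` are batches of
# embedding vectors; `same` is 1.0 for similar pairs and 0.0 for dissimilar pairs.
def contrastive_loss(z1, z2, same, margin=1.0):
    d = F.pairwise_distance(z1, z2)
    pos = same * d.pow(2)
    neg = (1 - same) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()
```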

       We argue that current reconstruction-based algorithms struggle to deal with low-level phenomena, such as stochastic textures, and that it is hard even to measure whether such a model generates images well.

       The key problem that these methods must address is that predicting pixels is much more difficult than predicting words, since the same semantic object may produce a large number of different pixel patterns. In the text domain, an interesting idea is to switch from a purely predictive task to a discriminative task [38, 9]. In this case, the pretext task is to discern real text fragments from the same fragments with randomly substituted words.

       However, in the domain of 2D images, such a task would be trivial, since discerning low-level color statistics and illumination would suffice. To make the task harder and more high-level, in this paper we instead classify between multiple possible configurations of patches sampled from the same image, which means the patches share lighting and color statistics; a minimal sketch of such a pair classifier follows.
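
       In the sketch below, both patches pass through a shared encoder, their features are fused, and a small head predicts one of the 8 relative positions. The layer sizes are illustrative placeholders, not the paper's exact AlexNet-style architecture.

```python
import torch
import torch.nn as nn

class RelativePositionNet(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # shared encoder applied to each patch independently
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # late-fusion head over the concatenated patch features
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 8),  # 8 possible relative positions
        )

    def forward(self, patch_a, patch_b):
        fa, fb = self.encoder(patch_a), self.encoder(patch_b)
        return self.head(torch.cat([fa, fb], dim=1))
```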

       Another line of work on unsupervised learning from images aims to discover object categories using hand-crafted features and various forms of clustering. Such a representation loses shape information, however, and will readily find clusters of, say, foliage.

       We ultimately wish to learn feature embeddings for individual patches such that visually similar patches (across different images) are close in the embedding space.
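
       As a hedged sketch of how such an embedding could be used for retrieval, the snippet below finds the most similar patches across images by cosine similarity; the `embed` function here is only a stand-in for the trained network.

```python
import numpy as np

def embed(patches):
    # placeholder for the learned network; returns one feature vector per patch
    return patches.reshape(len(patches), -1).astype(np.float32)

def nearest_neighbours(query_patch, database_patches, k=5):
    """query_patch: H x W x 3 array; database_patches: N x H x W x 3 array.
    Returns the indices of the k most similar database patches."""
    q = embed(query_patch[None])[0]
    db = embed(database_patches)
    q = q / np.linalg.norm(q)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]
```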

       Low-level cues, such as boundary patterns or textures continuing across patches, may serve as such shortcuts. Therefore, for the relative prediction task, it is important to include a gap between patches (in this paper, about half the patch width). Even with a gap, long edges spanning adjacent patches could still give away the correct answer. Another shortcut is chromatic aberration, which arises because a lens focuses light of different wavelengths differently: once the network has learned a patch's absolute position relative to the lens, solving the relative position task becomes trivial.

       For computational efficiency, we only sample patches from a grid-like pattern, such that each sampled patch can participate in up to 8 separate pairings. We build robustness to pixelation by (1) mean subtraction, (2) projecting or discarding colors (see above), and (3) randomly downsampling some patches to a total of about 100 pixels and then upsampling them. In addition, our final implementation employs batch normalization, which forces network activations to vary across samples.
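
       These preprocessing steps can be sketched roughly as follows; the exact parameters, the choice of dropping two color channels, and the ordering are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from PIL import Image

def preprocess(patch, p_downsample=0.33):
    """patch: H x W x 3 uint8 array; returns a float32 array of the same size."""
    h, w, _ = patch.shape
    # (3) occasionally downsample to roughly 100 pixels in total, then upsample back
    if np.random.rand() < p_downsample:
        scale = np.sqrt(100.0 / (h * w))
        small = Image.fromarray(patch).resize(
            (max(1, int(w * scale)), max(1, int(h * scale))))
        patch = np.asarray(small.resize((w, h)), dtype=np.uint8)
    x = patch.astype(np.float32)
    # (1) per-patch mean subtraction
    x -= x.mean(axis=(0, 1), keepdims=True)
    # (2) randomly keep one color channel and discard the other two
    keep = np.random.randint(3)
    for c in range(3):
        if c != keep:
            x[:, :, c] = 0.0
    return x
```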

       The performance after fine-tuning is slightly worse than ImageNet pretraining, but still a considerable improvement over training from scratch.

       Visual data mining [41, 13, 47, 42], or unsupervised object discovery [48, 44, 20], aims to use large image collections to discover image fragments that depict the same semantic objects.

       The discovery of birds and torsos — which are notoriously deformable — provides further evidence of the invariance learned by our algorithm.

       The main disadvantages of our algorithm relative to [12] are 1) some loss of purity, and 2) we cannot currently determine object masks automatically (although one could imagine dynamically adding more sub-patches to each candidate). As shown in Figure 9, we make substantial progress in coverage, which indicates an increase in the invariance of our learned features.

       One possible reason the pretext task is so difficult is that it is nearly impossible to solve for most patches in each image. Thus, while our algorithm is sensitive to objects, it is almost equally sensitive to the layout of the rest of the image.
