Three basic tasks of computer vision: classification, detection (localization), segmentation (semantic and instance)

Foreword

When you are new to computer vision, it is easy to be confused about how the different tasks differ and which network architecture to choose. This article therefore summarizes the relevant basics and tries to answer two questions:

  1. What are the different tasks, and what does each one cover?
  2. Which type of network should you choose for each task?

Classification, detection (localization), segmentation (semantic and instance)

Computer vision tasks are usually divided into three or four categories; based on my own understanding, this article uses three. In order of complexity and difficulty: instance segmentation > semantic segmentation > object detection > classification.

First, let's get an intuitive feel for the differences and connections between the tasks from a single picture:

Figure 1. (a) Image classification; (b) Object detection and localization; (c) Semantic segmentation; (d) Instance segmentation
Image from Zhihu user Zhang Hao: An intuitive overview of deep learning - four basic tasks of computer vision

Classification Task

Classification task: summarize the image as a piece of category information, describing the picture with a predetermined category label (or instance ID). Classification tasks divide into binary classification and multi-class classification, and they focus on describing the content of the picture as a whole.

Binary classification task: there are only two target classes, positive and negative; an image that contains the object of interest is a positive sample, and one that does not is a negative sample. The output layer has 1 neuron, the sigmoid function is used as the activation, and cross-entropy is used as the loss function.

Multi-class classification task: in contrast to binary classification, there are n target classes, e.g. mouse, cat, dog, wolf, tiger, and elephant. The output layer has n neurons, one per category; the softmax function gives the probability of each class, and cross-entropy is used as the loss function.
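To make the two output layers concrete, here is a minimal sketch in PyTorch (the framework choice, the 512-dimensional features, and the batch size are illustrative assumptions, not mandated by anything above):

```python
import torch
import torch.nn as nn

features = torch.randn(8, 512)          # hypothetical backbone features (batch of 8)

# Binary classification: 1 output neuron, sigmoid activation, cross-entropy loss.
binary_head = nn.Linear(512, 1)
binary_loss = nn.BCEWithLogitsLoss()    # fuses sigmoid with binary cross-entropy
y_bin = torch.randint(0, 2, (8, 1)).float()
loss_bin = binary_loss(binary_head(features), y_bin)

# Multi-class classification: n output neurons, softmax, cross-entropy loss.
n_classes = 6                           # e.g. mouse, cat, dog, wolf, tiger, elephant
multi_head = nn.Linear(512, n_classes)
multi_loss = nn.CrossEntropyLoss()      # fuses log-softmax with cross-entropy
y_multi = torch.randint(0, n_classes, (8,))
loss_multi = multi_loss(multi_head(features), y_multi)
```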

Localization and Detection Tasks

The detection task focuses on specific object targets and requires both the category and the location of each target. Object detection involves two problems: one is to judge whether an object of a certain class appears in the picture; the other is to localize that object, where the location is usually represented by the coordinates of a rectangular bounding box.
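Since the location is a rectangle, the standard way to measure how well a predicted box matches a ground-truth box is intersection-over-union (IoU); a minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """IoU of two boxes given in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)            # overlap / union

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
```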

Semantic Segmentation

Semantic segmentation labels the image pixel by pixel with an object category, but different instances of the same object do not need to be segmented separately. As shown in Figure 1c, the image contains 1 bottle, 1 cup, and 3 cubes; only bottle, cup, and cube need to be marked, and there is no need to distinguish cube1, cube2, and cube3.
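In other words, semantic segmentation is per-pixel classification; a minimal PyTorch sketch (shapes and class count are illustrative):

```python
import torch
import torch.nn as nn

n_classes = 3                                      # e.g. bottle, cup, cube
logits = torch.randn(1, n_classes, 64, 64)         # network output: (batch, classes, H, W)
labels = torch.randint(0, n_classes, (1, 64, 64))  # ground truth: one class id per pixel

loss = nn.CrossEntropyLoss()(logits, labels)       # cross-entropy averaged over all pixels
pred = logits.argmax(dim=1)                        # (1, 64, 64) predicted class map
```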

Instance Segmentation

Instance segmentation is a hybrid of object detection and semantic segmentation: (1) compared with the rectangular bounding box of object detection, instance segmentation is accurate down to the object's edges; (2) compared with semantic segmentation, instance segmentation distinguishes different instances of the same object, such as cube1, cube2, and cube3.
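This hybrid nature shows up directly in a model's output: an instance segmentation model returns detection-style boxes and per-instance masks side by side. A sketch using torchvision's pretrained Mask R-CNN (assuming a recent torchvision; weights download on first use):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)              # stand-in for a real photo
with torch.no_grad():
    out = model([image])[0]

# 'boxes' are detection-style rectangles; 'masks' are pixel-accurate
# per-instance masks, so cube1, cube2, cube3 each get their own mask.
print(out["boxes"].shape, out["masks"].shape, out["labels"])
```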

Network Architecture Selection

The Development of Classical Convolutional Neural Networks

LeNet-5 (Yann LeCun et al., 1998): One of the earliest published convolutional neural networks; at the time its performance was comparable to that of support vector machines.

AlexNet (Alex Krizhevsky, 2012): The first modern (21st century) deep convolutional neural network.

The key points of AlexNet: (1) the ReLU activation function is used, giving better gradient behavior and faster training; (2) dropout is used for regularization; (3) data augmentation techniques are used extensively. The significance of AlexNet is that it won the ILSVRC competition that year with roughly 10 percentage points better accuracy than the runner-up, which made people aware of the advantages of convolutional neural networks. AlexNet also made people realize that GPUs can be used to accelerate convolutional network training.
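The three ingredients above, sketched in modern PyTorch form (illustrative, not AlexNet's exact configuration; layer sizes are assumptions):

```python
import torch.nn as nn
from torchvision import transforms

# (1) ReLU activations and (2) dropout in the fully connected classifier.
classifier = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),       # randomly zeroes activations during training
    nn.Linear(4096, 1000),   # 1000 ILSVRC classes
)

# (3) Data augmentation in the spirit of AlexNet: random crops and flips.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```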

VGG (Simonyan & Zisserman, 2014): Introduces the idea of VGG blocks.

The key points of VGG: (1) The structure is simple: there are only two configurations, 3×3 convolution and 2×2 pooling, and the same block pattern is stacked repeatedly. The convolutional layers do not change the spatial size, which is halved each time it passes through a pooling layer. (2) The number of parameters is large, and most of them are concentrated in the fully connected layers. The 16 in the name VGG-16 means the network has 16 conv/fc layers. (3) Proper network initialization and the use of batch normalization layers are important for training deep networks.
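A VGG block as described in point (1): a run of 3×3 convolutions that preserve the spatial size, closed by a 2×2 pooling layer that halves it. A minimal sketch in PyTorch:

```python
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1),   # spatial size unchanged
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # spatial size halved
    return nn.Sequential(*layers)

# VGG nets stack such blocks, e.g. vgg_block(2, 64, 128).
```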

NiN [Lin et al., 2013]: Network in Network.

GoogLeNet [Szegedy et al., 2015]: A network with parallel connections (Inception blocks).

The key points of GoogLeNet: (1) Multiple branches process the input separately and their results are concatenated. (2) To reduce computation, 1×1 convolutions are used for dimensionality reduction. GoogLeNet also replaces the fully connected layers with global average pooling, which greatly reduces the number of parameters.

Inception's name comes from the "we need to go deeper" meme from the movie Inception.
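An Inception-style block, sketched below: four parallel branches, 1×1 convolutions for channel reduction, and channel-wise concatenation of the results (branch channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_c, c1, c2, c3, c4):
        super().__init__()
        self.b1 = nn.Conv2d(in_c, c1, kernel_size=1)                 # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_c, c2[0], kernel_size=1),  # reduce
                                nn.ReLU(),
                                nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_c, c3[0], kernel_size=1),  # reduce
                                nn.ReLU(),
                                nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_c, c4, kernel_size=1))

    def forward(self, x):
        # Each branch keeps the spatial size; results are concatenated on channels.
        return torch.cat([torch.relu(self.b1(x)), torch.relu(self.b2(x)),
                          torch.relu(self.b3(x)), torch.relu(self.b4(x))], dim=1)
```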

ResNet [He et al., 2016a]: ResNet uses residual connections to address the degradation problem, where training becomes harder as the network gets deeper.

The key points of ResNet: (1) Shortcut (skip) connections are used to make deep networks easier to train, and the same block pattern is stacked repeatedly. (2) ResNet makes heavy use of batch normalization. (3) For very deep networks (more than 50 layers), ResNet uses a more efficient bottleneck structure.
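A basic (non-bottleneck) residual block as a minimal sketch, assuming the input and output shapes match so the shortcut is a plain addition:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)   # batch normalization after each conv
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + x)               # shortcut: add the input back
```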

DenseNet [Huang et al., 2017]: A logical extension of ResNet that likewise aims to counteract vanishing gradients.
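Where ResNet adds the input back, DenseNet concatenates it: each layer in a dense block sees the feature maps of all earlier layers. A minimal sketch (layer count and growth rate are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1))
            for i in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # concatenate, not add
        return x
```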

Choice of Network Architecture for Different Tasks

"To be continued"

References

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS.
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv:1312.4400.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Szegedy, C., et al. (2015). Going deeper with convolutions. CVPR.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR.
Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. CVPR.

Origin blog.csdn.net/yohangzhang/article/details/127664621