[Computer Vision | Image Models] An introductory collection of common computer vision image models (CNNs & Transformers) (1)


Image models build image representations for downstream tasks such as classification and object detection. The most popular subcategory is convolutional neural networks. Below is a continually updated list of image models.

1. Residual Network

The residual network (ResNet) learns a residual function with reference to the layer input, rather than learning an unreferenced function. Instead of expecting every few stacked layers to directly fit a desired underlying mapping, residual networks let those layers fit a residual mapping. Residual blocks are stacked together to form the network: ResNet-50, for example, has 50 layers built from these blocks.

There is empirical evidence that these networks are easier to optimize and can gain accuracy from considerably increased depth.
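
In code, the idea reduces to adding the block's input back onto the output of its stacked layers. A minimal NumPy sketch, with toy dense layers standing in for the convolutions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """The stacked layers fit the residual F(x); the block
    outputs F(x) + x via the identity shortcut."""
    fx = relu(x @ w1) @ w2   # the residual function F(x)
    return relu(fx + x)      # add the shortcut, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero init
w2 = rng.standard_normal((8, 8)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights, F(x) is close to 0 and the block approximates
# the identity mapping -- the property that makes very deep stacks trainable.
```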

2. Vision Transformer

The Vision Transformer (ViT) is an image classification model that applies a Transformer-like architecture over patches of the image. The image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of adding an extra learnable "classification token" to the sequence is used.
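
The patch-splitting and embedding pipeline can be sketched in a few lines of NumPy (toy image size and embedding dimension; the Transformer encoder itself is omitted):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector of length p*p*C."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))    # toy image
p, d = 8, 64                              # patch size, embedding dimension

patches = patchify(img, p)                # (16, 192): a 4x4 grid of patches
w_embed = rng.standard_normal((p * p * 3, d)) * 0.02
tokens = patches @ w_embed                # linear patch embedding

cls = np.zeros((1, d))                    # learnable classification token (zeros here)
pos = rng.standard_normal((tokens.shape[0] + 1, d)) * 0.02  # position embeddings
seq = np.concatenate([cls, tokens]) + pos # the sequence fed to the encoder
```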

3. VGG

VGG is a classic convolutional neural network architecture. It grew out of an analysis of how to increase the depth of such networks. The network uses small 3 x 3 filters throughout. Beyond that, the network is characterized by its simplicity: the only other components are pooling layers and fully connected layers.
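
The appeal of stacking small 3 x 3 filters shows up in a quick parameter count: two stacked 3 x 3 layers cover the same 5 x 5 receptive field as a single 5 x 5 layer, but with fewer parameters and an extra non-linearity (the channel count below is illustrative):

```python
C = 64  # input channels == output channels (illustrative)

# two stacked 3x3 conv layers vs one 5x5 conv layer, same receptive field
params_two_3x3 = 2 * (3 * 3 * C * C)   # 73,728 weights
params_one_5x5 = 5 * 5 * C * C         # 102,400 weights
```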

4. DenseNet

DenseNet is a convolutional neural network that exploits dense connections between layers via dense blocks, where we connect all layers (with matching feature map sizes) directly to each other. To maintain the feed-forward nature, each layer takes additional input from all previous layers and passes its own feature map to all subsequent layers.
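
A shape-level NumPy sketch of one dense block, with toy dense-plus-ReLU layers standing in for the conv-BN-ReLU layers (`k` is the growth rate, i.e. how many new channels each layer contributes):

```python
import numpy as np

def dense_block(x, weights):
    """Each layer receives the concatenation of all preceding
    feature maps and appends its own k new channels."""
    features = [x]
    for w in weights:
        inp = np.concatenate(features, axis=-1)  # inputs from all previous layers
        out = np.maximum(inp @ w, 0.0)           # one toy layer (dense + ReLU)
        features.append(out)
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
k0, k, L = 16, 8, 3                              # input channels, growth rate, layers
x = rng.standard_normal((1, k0))
# layer i sees k0 + i*k input channels and emits k channels
weights = [rng.standard_normal((k0 + i * k, k)) * 0.1 for i in range(L)]

y = dense_block(x, weights)
# output channels = k0 + L*k = 16 + 3*8 = 40
```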

5. VGG-16

VGG-16 is the 16-layer configuration of the VGG architecture described above: 13 convolutional layers using 3 x 3 filters, followed by 3 fully connected layers.

6. MobileNetV2

MobileNetV2 is a convolutional neural network architecture designed to perform well on mobile devices. It is based on an inverted residual structure, where the residual connections run between the bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. Overall, the architecture of MobileNetV2 consists of an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers.
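
The "inverted" shape of the block — narrow at both ends, wide in the middle — is easiest to see in the filter shapes and parameter counts (channel sizes below are illustrative):

```python
c_in, t, c_out = 24, 6, 24     # bottleneck width, expansion factor, output width
wide = t * c_in                # 144 channels in the expanded middle layer

# the three stages of one inverted residual block:
params_expand    = 1 * 1 * c_in * wide     # 1x1 conv widens 24 -> 144 (3,456 weights)
params_depthwise = 3 * 3 * wide            # one 3x3 filter per channel (1,296 weights)
params_project   = 1 * 1 * wide * c_out    # linear 1x1 narrows 144 -> 24 (3,456 weights)

# for contrast: a full 3x3 conv at the wide width would be enormous
params_standard  = 3 * 3 * wide * wide     # 186,624 weights
# the shortcut connects the two narrow ends, since c_in == c_out
```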

7. AlexNet

AlexNet is a classic convolutional neural network architecture. It consists of convolutions, max pooling, and dense layers as its basic building blocks. Grouped convolutions were used in order to fit the model across two GPUs.
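
The GPU-splitting effect of grouped convolutions is a simple parameter-count argument (filter and channel sizes below are illustrative, loosely modeled on AlexNet's second convolutional layer):

```python
# With g groups, each filter sees only c_in/g input channels, so the
# layer splits cleanly into g independent halves -- one per GPU.
k, c_in, c_out, g = 5, 48, 128, 2

params_full    = k * k * c_in * c_out          # 153,600 weights, one big layer
params_grouped = k * k * (c_in // g) * c_out   # 76,800 weights, split across g GPUs
```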

8. EfficientNet

EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales network depth, width, and input resolution using a compound coefficient. The intuition behind compound scaling is that if the input image is larger, the network needs more layers to increase the receptive field and more channels to capture finer-grained patterns in the larger image.
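
The compound scaling rule itself is small enough to write out; alpha, beta, and gamma are the constants reported for EfficientNet, and phi is the user-chosen compound coefficient:

```python
# Compound scaling: one coefficient phi scales depth, width, and
# resolution together, with alpha * beta**2 * gamma**2 ~= 2 so that
# each increment of phi roughly doubles the FLOPs.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale(phi):
    depth = alpha ** phi        # multiplier on the number of layers
    width = beta ** phi         # multiplier on the number of channels
    resolution = gamma ** phi   # multiplier on the input image size
    return depth, width, resolution

flops_factor = alpha * beta ** 2 * gamma ** 2   # ~1.92, close to 2
d, w, r = scale(2)   # e.g. phi = 2 gives a B2-style scaling (illustrative)
```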

In addition to squeeze-and-excitation blocks, the base EfficientNet-B0 network is also built on the inverted bottleneck residual blocks of MobileNetV2.

EfficientNet also transfers well and achieves state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.

9. Darknet-53

Darknet-53 is a convolutional neural network that serves as the backbone network for the YOLOv3 object detection method. Improvements over its predecessor Darknet-19 include the use of residual connections and more layers.

10. Swin Transformer

The Swin Transformer is a type of vision Transformer. It builds hierarchical feature maps by merging image patches in deeper layers, and it has computational complexity linear in input image size because self-attention is computed only within each local window. It can therefore serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce a single low-resolution feature map and have computational complexity quadratic in input image size, due to the computation of global self-attention.
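
The linear-versus-quadratic complexity claim can be checked with a token count (the sizes below follow the common case of a 56 x 56 feature map with 7 x 7 windows):

```python
# Self-attention cost on an H x W feature map with M x M local windows.
H, W, M = 56, 56, 7
tokens = H * W                              # 3,136 tokens

# global attention: every token attends to every token -> quadratic
cost_global = tokens * tokens               # 9,834,496 pairwise scores

# windowed attention: attention only inside each M x M window -> linear,
# since M is fixed and the number of windows grows with H*W
num_windows = (H // M) * (W // M)           # 64 windows
cost_window = num_windows * (M * M) ** 2    # 153,664 pairwise scores
```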

11. Xception

Xception is a convolutional neural network architecture that relies solely on depthwise separable convolutional layers.
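
The payoff of that design choice is the parameter count: a depthwise separable convolution factorizes a standard convolution into a per-channel (depthwise) 3 x 3 plus a 1 x 1 pointwise convolution (channel counts below are illustrative):

```python
k, c_in, c_out = 3, 128, 256

# one standard k x k convolution mixing space and channels at once
params_standard  = k * k * c_in * c_out          # 294,912 weights

# depthwise (spatial, per channel) + pointwise (1x1, channel mixing)
params_separable = k * k * c_in + c_in * c_out   # 33,920 weights
```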

12. GoogLeNet

GoogLeNet is a convolutional neural network based on the Inception architecture. It leverages the Inception module, allowing the network to choose between multiple convolutional filter sizes in each block. The Inception network stacks these modules together, occasionally using max-pooling layers with a stride of 2 to halve the resolution of the grid.
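
A shape-level NumPy sketch of an Inception module: parallel branches whose outputs are concatenated along the channel axis. The branch outputs here are zero stand-ins for the actual convolutions; the channel split follows the module known as Inception(3a):

```python
import numpy as np

def inception_module(x, c1, c3, c5, cp):
    """Parallel 1x1, 3x3, 5x5, and pooling branches, concatenated
    channel-wise (zeros stand in for the branch convolutions)."""
    n, h, w, _ = x.shape
    branches = [
        np.zeros((n, h, w, c1)),   # 1x1 conv branch
        np.zeros((n, h, w, c3)),   # 1x1 -> 3x3 conv branch
        np.zeros((n, h, w, c5)),   # 1x1 -> 5x5 conv branch
        np.zeros((n, h, w, cp)),   # 3x3 max-pool -> 1x1 conv branch
    ]
    return np.concatenate(branches, axis=-1)

x = np.zeros((1, 28, 28, 192))
y = inception_module(x, 64, 128, 32, 32)
# output channels: 64 + 128 + 32 + 32 = 256, spatial size unchanged
```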

13. ResNeXt

ResNeXt repeats a building block that aggregates a set of transformations with the same topology. Compared with ResNet, it exposes a new dimension, cardinality (the size of the set of transformations) C, as an essential factor alongside the dimensions of depth and width.
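
Since aggregating C same-topology transformations is equivalent to a grouped convolution with C groups, the effect of cardinality shows up directly in the parameter count (sizes below follow the common 32x4d configuration):

```python
C, width = 32, 4                 # cardinality 32, 4 channels per path ("32x4d")
channels = C * width             # 128 channels total in the bottleneck

# grouped 3x3: each of the C paths sees only channels/C inputs
params_resnext = 3 * 3 * (channels // C) * channels   # 4,608 weights

# a dense 3x3 at the same width, for comparison
params_resnet  = 3 * 3 * channels * channels          # 147,456 weights
```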

14. Detection Transformer

DETR, or Detection Transformer, is a set-based object detector that uses a Transformer on top of a convolutional backbone. It uses a conventional CNN backbone to learn a 2D representation of the input image. The model flattens it and supplements it with a positional encoding before passing it into a Transformer encoder. A Transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. Each output embedding of the decoder is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class.
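
A shape-level NumPy sketch of the prediction heads: N decoder output embeddings pass through shared heads (reduced here to single linear maps, which is a simplification of DETR's FFNs) to produce class logits, including the "no object" class, and normalized boxes:

```python
import numpy as np

def detr_heads(decoder_out, w_cls, w_box):
    """Shared prediction heads: per-query class logits (with an extra
    'no object' class) and a sigmoid-normalized box (cx, cy, w, h)."""
    logits = decoder_out @ w_cls                       # (N, num_classes + 1)
    boxes = 1.0 / (1.0 + np.exp(-(decoder_out @ w_box)))  # (N, 4), in [0, 1]
    return logits, boxes

rng = np.random.default_rng(0)
N, d, num_classes = 100, 256, 91    # 100 object queries; COCO-style class count
decoder_out = rng.standard_normal((N, d))
w_cls = rng.standard_normal((d, num_classes + 1)) * 0.02
w_box = rng.standard_normal((d, 4)) * 0.02

logits, boxes = detr_heads(decoder_out, w_cls, w_box)
# every query yields one prediction: a class (possibly "no object") and a box
```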

15. CSPDarknet53

CSPDarknet53 is a convolutional neural network and object-detection backbone based on Darknet-53. It applies the CSPNet strategy: the feature map of the base layer is partitioned into two parts, which are then merged through a cross-stage hierarchy. This split-and-merge strategy allows more gradient flow through the network.

This CNN is used as the backbone of YOLOv4.
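
A shape-level NumPy sketch of the split-and-merge (the stage's layers are reduced to a toy function `f`):

```python
import numpy as np

def csp_block(x, f):
    """CSPNet cross-stage partial block: split the base layer's
    feature map along channels, run the stage's layers f on one
    half only, let the other half bypass them, then concatenate."""
    c = x.shape[-1] // 2
    part1, part2 = x[..., :c], x[..., c:]   # partition into two parts
    part2 = f(part2)                        # heavy stage on one part only
    return np.concatenate([part1, part2], axis=-1)  # cross-stage merge

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16, 16, 64))
y = csp_block(x, f=lambda t: np.maximum(t, 0.0))  # toy stage: just a ReLU
# the bypassed half reaches the merge untouched, giving the gradient
# a short unimpeded path through the network
```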

Original post: blog.csdn.net/wzk4869/article/details/132869846