Image models are methods for building image representations for downstream tasks such as classification and object detection. The most popular subcategory is convolutional neural networks. Below you can find a constantly updated list of image models.
1. Residual Network
A residual network (ResNet) learns residual functions with reference to the layer inputs, instead of learning unreferenced functions. Rather than expecting every few stacked layers to directly fit a desired underlying mapping, residual networks let these layers fit a residual mapping. Residual blocks are stacked to form the network: ResNet-50, for example, has 50 layers built from these blocks.
There is empirical evidence that these networks are easier to optimize and can gain accuracy from considerably increased depth.
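The residual idea can be sketched in a few lines of plain Python (the functions here are toy stand-ins, not a real implementation):

```python
def residual_block(x, f):
    """Compute f(x) + x element-wise: the stacked layers f only need to
    learn the residual between input and target, not the whole mapping."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the identity mapping were optimal, f merely has to learn to output zeros,
# which is easier than learning the identity from scratch:
zero_residual = lambda v: [0.0] * len(v)
print(residual_block([1.0, 2.0, 3.0], zero_residual))  # [1.0, 2.0, 3.0]
```

This is why very deep residual networks remain optimizable: each block only perturbs its input rather than replacing it.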
2. Vision Transformer
The Vision Transformer (ViT) is an image classification model that applies a Transformer-like architecture over patches of the image. The image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of adding an extra learnable "classification token" to the sequence is used.
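The token arithmetic behind this is simple; a small sketch (the 224/16 setting is a typical ViT configuration, used here for illustration):

```python
def vit_sequence_length(image_size, patch_size):
    """Number of tokens fed to the Transformer encoder:
    (H/P) * (W/P) patch embeddings plus one learnable classification token."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    n_patches = (image_size // patch_size) ** 2
    return n_patches + 1  # +1 for the classification token

# A 224x224 image cut into 16x16 patches yields 14*14 = 196 patches:
print(vit_sequence_length(224, 16))  # 197 tokens (196 patches + 1 class token)
```

The classification head then reads only the output embedding at the classification-token position.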
3. VGG
VGG is a classic convolutional neural network architecture. It grew out of an analysis of how to increase the depth of such networks. The network uses small 3 x 3 filters throughout. Apart from that, it is characterized by its simplicity: the only other components are pooling layers and fully connected layers.
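One reason for the small filters: two stacked 3 x 3 convolutions cover the same 5 x 5 receptive field as a single 5 x 5 convolution, with fewer weights. A quick illustrative count (the channel width of 64 is an arbitrary example):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution, ignoring biases."""
    return k * k * c_in * c_out

c = 64                                # hypothetical channel count
stacked = 2 * conv_params(3, c, c)    # two 3x3 layers: 5x5 receptive field
single = conv_params(5, c, c)         # one 5x5 layer: same receptive field
print(stacked, single)  # 73728 vs 102400 weights, ~28% fewer for the stack
```

The stacked version also inserts an extra non-linearity between the two layers, making the decision function more discriminative.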
4. DenseNet
DenseNet is a convolutional neural network that exploits dense connections between layers via dense blocks, where we connect all layers (with matching feature map sizes) directly to each other. To maintain the feed-forward nature, each layer takes additional input from all previous layers and passes its own feature map to all subsequent layers.
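Because each layer receives the concatenation of all earlier feature maps, channel counts inside a dense block grow linearly with the growth rate. A small sketch (the 64/32/6 figures are hypothetical, chosen only to illustrate the pattern):

```python
def dense_block_input_channels(k0, growth_rate, n_layers):
    """Input width seen at each position in a dense block: the block input
    (k0 channels) concatenated with every earlier layer's output
    (growth_rate channels each)."""
    return [k0 + i * growth_rate for i in range(n_layers + 1)]

# Hypothetical block: 64 input channels, growth rate 32, 6 layers
print(dense_block_input_channels(64, 32, 6))
# [64, 96, 128, 160, 192, 224, 256] -- the last entry is the block's output width
```

This is why DenseNets can use narrow layers (small growth rates) yet still give every layer access to all earlier features.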
5. VGG-16
VGG-16 is the 16-weight-layer configuration of the VGG architecture: thirteen 3 x 3 convolutional layers followed by three fully connected layers.
6. MobileNetV2
MobileNetV2 is a convolutional neural network architecture designed to perform well on mobile devices. It is based on an inverted residual structure, where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. Overall, the architecture of MobileNetV2 consists of an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers.
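The "inverted" shape of the block, thin at both ends and wide in the middle, can be traced just by following channel counts (the 24-channel, expansion-6 setting below is a hypothetical example in the spirit of the paper's bottlenecks):

```python
def inverted_residual_channels(c_in, expansion, c_out):
    """Channel widths through an inverted residual block:
    1x1 expand -> 3x3 depthwise (width unchanged) -> 1x1 project."""
    expanded = c_in * expansion
    return [c_in, expanded, expanded, c_out]

# Hypothetical bottleneck: 24 channels in, expansion factor 6, 24 out.
# The shortcut connects the two thin ends (24 -> 24), not the wide middle.
print(inverted_residual_channels(24, 6, 24))  # [24, 144, 144, 24]
```

A classic residual bottleneck is wide-narrow-wide; MobileNetV2 inverts this to narrow-wide-narrow, which is what makes the depthwise middle stage cheap.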
7. AlexNet
AlexNet is a classic convolutional neural network architecture. Its basic building blocks are convolutional, max-pooling, and dense layers. Grouped convolutions were used so that the model could be split across two GPUs.
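Grouped convolution also cuts the parameter count, since each group only connects a slice of the input channels to a slice of the outputs. A sketch of the counting (the 256-to-384 layer widths are illustrative):

```python
def grouped_conv_params(k, c_in, c_out, groups):
    """Weights of a k x k grouped convolution: each of the `groups` groups
    maps c_in/groups input channels to c_out/groups output channels."""
    return k * k * (c_in // groups) * (c_out // groups) * groups

dense = grouped_conv_params(3, 256, 384, 1)   # ordinary convolution
split = grouped_conv_params(3, 256, 384, 2)   # two-GPU style split
print(dense, split)  # 884736 vs 442368: the grouped version has half the weights
```

In general, using g groups divides the weight count of the layer by g.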
8. EfficientNet
EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales network depth, width, and input resolution using a compound coefficient. The intuition behind compound scaling is that if the input image is larger, the network needs more layers to increase the receptive field and more channels to capture the finer-grained patterns in the larger image.
In addition to squeeze-and-excitation blocks, the base EfficientNet-B0 network is built on the inverted bottleneck residual blocks of MobileNetV2.
EfficientNet also transfers well and achieves state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.
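Compound scaling can be written down directly. The sketch below uses the coefficients reported in the EfficientNet paper (alpha = 1.2, beta = 1.1, gamma = 1.15, found by grid search for B0), constrained so that FLOPs grow roughly by a factor of 2 per step of the compound coefficient phi:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Scale factors for depth, width, and resolution under compound scaling:
    depth *= alpha**phi, width *= beta**phi, resolution *= gamma**phi.
    The paper constrains alpha * beta**2 * gamma**2 to be about 2, so total
    FLOPs grow roughly as 2**phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(1)
# FLOPs scale with depth, width squared, and resolution squared:
flops_growth = d * w ** 2 * r ** 2
print(round(flops_growth, 3))  # close to the target factor of 2 per phi step
```

Larger EfficientNet variants (B1 through B7) correspond to increasing values of phi applied to the B0 baseline.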
9. Darknet-53
Darknet-53 is a convolutional neural network that serves as the backbone network for the YOLOv3 object detection method. Improvements over its predecessor Darknet-19 include the use of residual connections and more layers.
10. Swin Transformer
The Swin Transformer is a type of vision Transformer. It builds hierarchical feature maps by merging image patches in deeper layers, and its computational complexity is linear in input image size because self-attention is computed only within each local window. It can therefore serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce a single low-resolution feature map and have complexity quadratic in input image size, since they compute self-attention globally.
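The linear-vs-quadratic difference falls out of simple counting. The sketch below compares attention costs on one feature map (56 x 56 tokens, 96 channels, 7 x 7 windows are the typical Swin-T first-stage values, used here for illustration; constant factors are dropped):

```python
def global_attention_cost(h, w, c):
    """Global self-attention over all h*w tokens: quadratic in the token count."""
    return (h * w) ** 2 * c

def window_attention_cost(h, w, c, m):
    """Self-attention restricted to m x m windows: cost per window is fixed,
    so the total is linear in the number of tokens."""
    n_windows = (h // m) * (w // m)
    return n_windows * (m * m) ** 2 * c

# Typical Swin-T stage-1 sizes: 56x56 tokens, 96 channels, 7x7 windows
print(global_attention_cost(56, 56, 96) // window_attention_cost(56, 56, 96, 7))
# the windowed variant is cheaper by the ratio of image area to window area
```

Swin recovers cross-window connections by shifting the window grid between consecutive layers.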
11. Xception
Xception is a convolutional neural network architecture that relies solely on depthwise separable convolutional layers.
12. GoogLeNet
GoogLeNet is a convolutional neural network based on the Inception architecture. It leverages the Inception module, allowing the network to choose between multiple convolutional filter sizes in each block. The Inception network stacks these modules together, occasionally using max-pooling layers with a stride of 2 to halve the resolution of the grid.
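Since the parallel branches of an Inception module are concatenated along the channel axis, the module's output width is just the sum of its branch widths. A sketch using the branch sizes of GoogLeNet's inception (3a) module:

```python
def inception_output_channels(branch_widths):
    """An Inception module runs parallel branches (e.g. 1x1 conv, 3x3 conv,
    5x5 conv, pooling projection) and concatenates their outputs channel-wise."""
    return sum(branch_widths)

# GoogLeNet inception (3a): 64 (1x1) + 128 (3x3) + 32 (5x5) + 32 (pool proj)
print(inception_output_channels([64, 128, 32, 32]))  # 256
```

The 1x1 convolutions placed before the larger filters also serve as dimensionality reduction, keeping the concatenated widths from exploding as modules stack.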
13. ResNeXt
ResNeXt repeats a building block that aggregates a set of transformations with the same topology. Compared with ResNet, it exposes a new dimension, cardinality (the size of the set of transformations) C, as an essential factor in addition to the depth and width dimensions.
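The aggregation itself is just a sum over C parallel branches sharing one topology. A toy sketch (the scaling branches are stand-ins for the real bottleneck transformations):

```python
def aggregated_transform(x, branches):
    """Body of a ResNeXt block: the sum of C parallel transformations with
    the same topology, where C = len(branches) is the cardinality."""
    total = [0.0] * len(x)
    for f in branches:
        for i, v in enumerate(f(x)):
            total[i] += v
    return total

# Toy example with cardinality C = 4; each hypothetical branch scales the input
branches = [lambda v, s=s: [s * vi for vi in v] for s in (1, 2, 3, 4)]
print(aggregated_transform([1.0, 2.0], branches))  # [10.0, 20.0]
```

Increasing C (more branches) turned out to be a more effective use of capacity than making individual branches deeper or wider.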
14. Detection Transformer
DETR, or Detection Transformer, is a set-based object detector that uses a Transformer on top of a convolutional backbone. It uses a conventional CNN backbone to learn a 2D representation of the input image. The model flattens this representation, supplements it with a positional encoding, and passes it to a Transformer encoder. The Transformer decoder then takes as input a small, fixed number of learned positional embeddings (called object queries) and additionally attends to the encoder output. Each output embedding of the decoder is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class.
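The fixed-size-set framing can be made concrete by looking at output shapes alone (the query and class counts below match DETR's common COCO setup, but the function is a shape sketch, not the model):

```python
def detr_output_shapes(num_queries, num_classes):
    """DETR always predicts a fixed-size set: one (class, box) pair per object
    query. Class logits carry one extra slot for the 'no object' label; boxes
    are 4 normalized coordinates (center x, center y, width, height)."""
    class_logits_shape = (num_queries, num_classes + 1)
    boxes_shape = (num_queries, 4)
    return class_logits_shape, boxes_shape

# 100 object queries and 91 COCO class ids:
print(detr_output_shapes(100, 91))  # ((100, 92), (100, 4))
```

Because the prediction set is fixed-size, training matches predictions to ground-truth objects with a bipartite (Hungarian) matching, and unmatched queries are supervised toward "no object".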
15. CSPDarknet53
CSPDarknet53 is a convolutional neural network that applies the CSPNet strategy to Darknet-53 and serves as a backbone for object detection. The CSPNet strategy partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. This split-and-merge strategy allows more gradient flow through the network.
This CNN is used as the backbone of YOLOv4.
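The split-and-merge pattern is easy to sketch with a toy channel list (the negating transform is a placeholder for the stage's real convolutional layers):

```python
def csp_stage(features, transform):
    """CSPNet strategy: split the base feature map channel-wise, route one
    half through the stage's layers, keep the other half as an untouched
    cross-stage path, then merge (concatenate) the two at the stage's end."""
    half = len(features) // 2
    shortcut, dense_path = features[:half], features[half:]
    return shortcut + transform(dense_path)  # cross-stage concatenation

# Toy 4-channel input; the hypothetical transform just negates its half
print(csp_stage([1, 2, 3, 4], lambda p: [-v for v in p]))  # [1, 2, -3, -4]
```

Because the shortcut half bypasses the stage entirely, its gradients reach earlier layers undiluted, which is the "more gradient flow" the CSPNet design aims for.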