[Computer Vision | Image Models] An Introductory Collection of Common Computer Vision Image Models (CNNs & Transformers) (3)

1. MnasNet

MnasNet is a convolutional neural network optimized for mobile devices, discovered through mobile neural architecture search (MNAS), which explicitly incorporates model latency into the main objective so that the search can identify models that achieve a good trade-off between accuracy and latency. Its main building block is the inverted residual block (from MobileNetV2).
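
For reference, here is a minimal PyTorch sketch of the MobileNetV2-style inverted residual block that MnasNet builds on; the expansion factor and ReLU6 activation are illustrative defaults, not MnasNet's searched configuration:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 linear project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        mid = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch  # skip connection only when shapes match
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),                       # 1x1 expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),                      # 1x1 linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```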

2. GhostNet

GhostNet is a convolutional neural network built from the Ghost module, which aims to generate more feature maps from fewer parameters, thereby improving efficiency.

GhostNet mainly consists of a stack of Ghost bottlenecks, with Ghost modules as their building blocks. The first layer is a standard convolutional layer with 16 filters, followed by a series of Ghost bottlenecks with gradually increasing channel counts. These Ghost bottlenecks are grouped into stages according to the size of their input feature maps. All Ghost bottlenecks use stride=1, except the last one in each stage, which uses stride=2. Finally, global average pooling and a convolutional layer transform the feature map into a 1280-dimensional feature vector for classification. A squeeze-and-excitation (SE) module is also applied to the residual layer in some Ghost bottlenecks.
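
A minimal PyTorch sketch of the Ghost module itself, assuming the common configuration of a 1x1 primary convolution and a 3x3 depthwise "cheap" operation with ratio 2 (these defaults are illustrative):

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module sketch: a 1x1 conv produces 'primary' features, then a cheap
    3x3 depthwise conv generates 'ghost' features from them; both are concatenated."""
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        primary = math.ceil(out_ch / ratio)
        cheap = primary * (ratio - 1)
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap_op = nn.Sequential(
            nn.Conv2d(primary, cheap, 3, padding=1, groups=primary, bias=False),
            nn.BatchNorm2d(cheap), nn.ReLU(inplace=True))
        self.out_ch = out_ch

    def forward(self, x):
        y1 = self.primary_conv(x)
        y2 = self.cheap_op(y1)                      # "ghost" features, cheap to compute
        return torch.cat([y1, y2], dim=1)[:, :self.out_ch]
```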

Compared with MobileNetV3, GhostNet does not use the Hard-swish nonlinearity because of its high latency.

3. Compact Convolutional Transformers (CCT)

Compact convolutional transformers utilize sequence pooling and replace patch embeddings with convolutional embeddings, giving the model a better inductive bias and making positional embeddings optional, which in turn increases flexibility with respect to input size. CCT achieves better accuracy than ViT-Lite (a smaller ViT).
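
A minimal PyTorch sketch of the sequence-pooling idea: instead of a class token, the output sequence is collapsed with an attention-weighted average (the single-linear-layer scoring here follows the common formulation; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Sequence pooling sketch: collapse the token sequence with an
    attention-weighted average instead of using a class token."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (B, N, D)
        w = torch.softmax(self.score(x), dim=1)    # (B, N, 1), weights over tokens
        return (w.transpose(1, 2) @ x).squeeze(1)  # (B, D)
```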

4. NesT

NesT stacks canonical Transformer layers that perform local self-attention on each image block independently, and then "nests" the blocks hierarchically. Coupling of information between spatially adjacent blocks is achieved through the proposed block aggregation between every two hierarchies. The overall hierarchy is determined by two key hyperparameters: the patch size and the number of block hierarchies. All blocks within a hierarchy share one set of parameters. Given an input, each image patch is linearly projected to an embedding; all embeddings are partitioned into blocks and flattened to form the final input. Each Transformer layer consists of a multi-head self-attention (MSA) layer followed by a feedforward fully connected network (FFN), with skip connections and layer normalization. Positional embeddings are added to encode spatial information before the blocks are fed in. Finally, block aggregation builds the nested hierarchy: every four spatially adjacent blocks are merged into one.
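
A small PyTorch sketch of the block partitioning that lets self-attention run within each block independently (the subsequent block aggregation, which merges every 2x2 neighborhood of blocks via convolution and pooling, is omitted; shapes are illustrative):

```python
import torch

def block_partition(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Split a (B, H, W, D) grid of patch embeddings into non-overlapping blocks
    so self-attention can run within each block independently.
    Returns (B * num_blocks, block_size * block_size, D); H and W are assumed
    divisible by block_size."""
    B, H, W, D = x.shape
    x = x.view(B, H // block_size, block_size, W // block_size, block_size, D)
    x = x.permute(0, 1, 3, 2, 4, 5)          # group the two block-index axes first
    return x.reshape(-1, block_size * block_size, D)
```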

5. Res2Net

Res2Net is an image model that employs a variant of the bottleneck residual block. The motivation is to represent features at multiple scales. It achieves this with a novel CNN building block that constructs hierarchical residual-like connections within a single residual block, representing multi-scale features at a granular level and increasing the receptive field of each network layer.
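
A minimal PyTorch sketch of the hierarchical split at the core of the Res2Net block (the surrounding 1x1 convolutions of the bottleneck are omitted; `scale=4` and a channel count divisible by `scale` are assumptions):

```python
import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Res2Net core (sketch): split channels into `scale` groups; each group after
    the first goes through a 3x3 conv, with the previous group's output added to
    its input, building hierarchical receptive fields within one block."""
    def __init__(self, ch, scale=4):
        super().__init__()
        assert ch % scale == 0, "channel count must be divisible by scale"
        w = ch // scale
        self.scale = scale
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, 3, padding=1, bias=False) for _ in range(scale - 1))

    def forward(self, x):
        xs = torch.chunk(x, self.scale, dim=1)
        ys = [xs[0]]                                # first split: identity
        prev = None
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if prev is None else xs[i + 1] + prev
            prev = torch.relu(conv(inp))
            ys.append(prev)
        return torch.cat(ys, dim=1)
```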

6. EfficientNetV2

EfficientNetV2 is a family of convolutional neural networks that trains faster and is more parameter-efficient than previous models. To develop these models, the authors combined training-aware neural architecture search and scaling to jointly optimize training speed and parameter efficiency. The models were searched from a search space enriched with new operations such as Fused-MBConv.

Architecturally, the main differences are:

EfficientNetV2 makes extensive use of both MBConv and the newly added Fused-MBConv in the early layers (a sketch of Fused-MBConv follows this list).
EfficientNetV2 prefers smaller expansion ratios for MBConv, since smaller expansion ratios tend to incur less memory-access overhead.
EfficientNetV2 prefers the smaller 3x3 kernel size, but adds more layers to compensate for the reduced receptive field that comes with the smaller kernel.
EfficientNetV2 entirely removes the last stride-1 stage of the original EfficientNet, probably because of its large parameter count and memory-access overhead.
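
A minimal PyTorch sketch of Fused-MBConv, assuming illustrative defaults (expansion 4, SiLU activation, no squeeze-and-excitation): the 1x1 expansion and 3x3 depthwise convolutions of MBConv are fused into a single regular 3x3 convolution, which maps better to accelerators in the early, high-resolution stages:

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Fused-MBConv sketch: single 3x3 conv replaces MBConv's expand + depthwise pair."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride, 1, bias=False),  # fused 3x3 expansion
            nn.BatchNorm2d(mid), nn.SiLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),            # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```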

7. Capsule Network

A capsule network is a machine learning system, a type of artificial neural network, that can be used to better model hierarchical relationships. The approach attempts to more closely mimic biological neural organization.

8. Pyramid Vision Transformer

PVT (Pyramid Vision Transformer) is a vision transformer that uses a pyramid structure, making it an effective backbone for dense prediction tasks. Specifically, it allows fine-grained inputs (4 x 4 pixels per patch) while shrinking the Transformer's sequence length as the network deepens, thereby reducing computational cost. In addition, a spatial-reduction attention (SRA) layer further cuts resource consumption when learning high-resolution features.

The entire model is divided into four stages, each consisting of a patch embedding layer and a Transformer encoder. Following the pyramid structure, the output resolution of the four stages progressively shrinks from high (stride 4) to low (stride 32).
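
A minimal sketch of spatial-reduction attention in PyTorch, using torch's built-in multi-head attention as a stand-in for PVT's own attention implementation: queries keep full resolution, while keys and values come from a feature map downsampled by a strided convolution with reduction ratio R (`sr_ratio=4` here; `dim` divisible by `num_heads` and H, W divisible by `sr_ratio` are assumptions):

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """SRA sketch: keys/values computed from a spatially reduced feature map."""
    def __init__(self, dim=64, num_heads=8, sr_ratio=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                        # x: (B, H*W, D)
        B, N, D = x.shape
        kv = x.transpose(1, 2).reshape(B, D, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)    # (B, H*W / R^2, D)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out
```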

9. Dual Path Network (DPN)

Dual Path Network (DPN) is a convolutional neural network with a new topology of internal connection paths. The intuition is that ResNets enable feature re-use while DenseNets enable exploration of new features, and both are important for learning good representations. To enjoy the benefits of both topologies, a dual path network shares common features while retaining the flexibility to explore new features through its dual-path architecture.

We formulate such a dual-path architecture as follows:
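
The formulation below follows the DPN paper (Chen et al., 2017); since it is reconstructed from the paper's notation, treat the exact symbols as an assumption:

$$
\begin{aligned}
x^{k} &= \sum_{t=1}^{k-1} f_t^{k}\left(h^{t}\right) && \text{(densely connected path)} \\
y^{k} &= \sum_{t=1}^{k-1} v_t\left(h^{t}\right) = y^{k-1} + \phi^{k-1}\left(y^{k-1}\right) && \text{(residual path)} \\
r^{k} &= x^{k} + y^{k}, \qquad h^{k} = g^{k}\left(r^{k}\right)
\end{aligned}
$$

where $h^{t}$ is the hidden state at step $t$, $f_t^{k}$ and $v_t$ are feature-learning functions, $\phi^{k}$ is the residual-path transformation, and $g^{k}$ transforms the combined features before the next step.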

10. Dense Prediction Transformer (DPT)

The Dense Prediction Transformer (DPT) is a vision transformer for dense prediction tasks.

The input image is converted into tokens (orange) either by extracting non-overlapping patches and linearly projecting their flattened representations (DPT-Base and DPT-Large) or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embeddings are augmented with positional embeddings, and a patch-independent readout token (red) is added. The tokens pass through multiple Transformer stages. Tokens from different stages are reassembled into image-like representations (green) at multiple resolutions. Fusion modules (purple) progressively fuse and upsample the representations to produce a fine-grained prediction.
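
A minimal sketch of the "reassemble" step in PyTorch, assuming the readout token is simply ignored (DPT also offers variants that add or project it back into the patch tokens):

```python
import torch

def reassemble(tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold patch tokens back into an image-like feature map.
    tokens: (B, 1 + H*W, D) with the readout token first -> (B, D, H, W)."""
    patch_tokens = tokens[:, 1:]             # drop the readout token ("ignore" variant)
    B, N, D = patch_tokens.shape
    return patch_tokens.transpose(1, 2).reshape(B, D, H, W)
```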

11. Inception v2

Inception v2 is the second generation of the Inception convolutional neural network architecture, notable for introducing batch normalization. Other changes include removing dropout and removing local response normalization, owing to the benefits of batch normalization.
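
A minimal sketch of the resulting Conv + BatchNorm + ReLU unit in PyTorch (the unit itself is generic; the exact placement inside Inception v2's branches is not shown):

```python
import torch
import torch.nn as nn

class ConvBN(nn.Module):
    """Conv + BatchNorm + ReLU unit: batch normalization after every convolution
    is the key change in Inception v2, which lets dropout and LRN be removed."""
    def __init__(self, in_ch, out_ch, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

layer = ConvBN(3, 32, kernel_size=3, padding=1)  # e.g. a stem convolution
```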

12. Inception-ResNet-v2

Inception-ResNet-v2 is a convolutional neural architecture that builds on the Inception family but incorporates residual connections, replacing the filter-concatenation stage of the Inception architecture.

13. RegNetY

For RegNetY, the one change relative to RegNetX is the addition of squeeze-and-excitation (SE) modules.
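
A minimal PyTorch sketch of the squeeze-and-excitation module (the reduction ratio of 4 is an illustrative default):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE sketch: global-average 'squeeze', two-layer 'excitation' MLP
    (as 1x1 convs), then channel-wise reweighting of the input."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):                    # x: (B, C, H, W)
        return x * self.fc(x.mean(dim=(2, 3), keepdim=True))
```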

14. CheXNet

CheXNet is a 121-layer DenseNet trained on the ChestX-ray14 dataset for pneumonia detection.
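
A hedged sketch of a CheXNet-style setup using torchvision (the weights string assumes torchvision >= 0.13, and the single-logit sigmoid head is illustrative; the paper's exact head may differ):

```python
import torch.nn as nn
import torchvision

# CheXNet-style model (sketch): DenseNet-121 backbone, ImageNet-pretrained,
# with the classifier replaced by a single sigmoid output for pneumonia.
model = torchvision.models.densenet121(weights="IMAGENET1K_V1")
model.classifier = nn.Sequential(
    nn.Linear(model.classifier.in_features, 1),  # 1024 -> 1 logit
    nn.Sigmoid(),
)
```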

15. R(2+1)D

The R(2+1)D convolutional neural network is a network for action recognition that employs (2+1)D convolutions in a ResNet-inspired architecture. Compared with regular 3D convolutions, these convolutions reduce computational complexity, help prevent overfitting, and introduce additional nonlinearities that allow more complex functions to be modeled.
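
A minimal PyTorch sketch of a (2+1)D convolution: a (t, k, k) 3D convolution is factorized into a (1, k, k) spatial convolution followed by a (t, 1, 1) temporal convolution, with a nonlinearity in between; `mid` follows the paper's choice of roughly matching the parameter count of the full 3D convolution:

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """(2+1)D convolution sketch: 2D spatial conv, ReLU, then 1D temporal conv."""
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        # intermediate width chosen so parameters ~ match a full (t, k, k) 3D conv
        mid = (t * k * k * in_ch * out_ch) // (k * k * in_ch + t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid, (1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid)
        self.temporal = nn.Conv3d(mid, out_ch, (t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):                    # x: (B, C, T, H, W)
        return self.temporal(torch.relu(self.bn(self.spatial(x))))
```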

Origin: blog.csdn.net/wzk4869/article/details/132875186