[Computer Vision | Image Models] An introductory collection of common computer vision image models (CNNs & Transformers) (10)

1. GreedyNAS-A

GreedyNAS-A is a convolutional neural network discovered with the GreedyNAS neural architecture search method. Its basic building blocks are the inverted residual block (from MobileNetV2) and the squeeze-and-excitation block, both sketched below.
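A minimal PyTorch sketch of these two building blocks, assuming illustrative channel sizes and a standard expansion ratio of 6 (these hyperparameters are not taken from the GreedyNAS paper):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: reweight channels using globally pooled statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global spatial average
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation: per-channel gates in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)

class InvertedResidual(nn.Module):
    """MobileNetV2 inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 linear project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6, use_se=True):
        super().__init__()
        mid = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        layers = [
            nn.Conv2d(in_ch, mid, 1, bias=False),          # expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
        ]
        if use_se:
            layers.append(SqueezeExcite(mid))
        layers += [nn.Conv2d(mid, out_ch, 1, bias=False),  # linear projection
                   nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```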

2. ASLFeat

ASLFeat is a convolutional neural network for learning local features that uses deformable convolutional networks to densely estimate and apply local transformations. It also exploits the inherent feature hierarchy to recover spatial resolution and low-level details for accurate keypoint localization. Finally, it uses peak measurements to correlate feature responses and derive a more indicative detection score.

3. GreedyNAS-B

GreedyNAS-B is a convolutional neural network discovered with the GreedyNAS neural architecture search method. Like GreedyNAS-A, its basic building blocks are the inverted residual block (from MobileNetV2) and the squeeze-and-excitation block.

4. Twins-PCPVT

Twins-PCPVT is a vision transformer that combines the global attention of the Pyramid Vision Transformer (specifically, its global sub-sampled attention) with conditional position encodings (CPE), which replace the absolute position encodings used in PVT.

The Positional Encoding Generator (PEG), which produces the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used, i.e. a 2D depthwise convolution without batch normalization. For image-level classification, following CPVT, the class token is removed and global average pooling is applied at the end of the last stage. For other vision tasks, the design of PVT is followed.
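A sketch of this simplest PEG form: a depthwise convolution over the token map whose output is added back to the tokens. The 3x3 kernel size is the common choice from the CPVT paper; H and W denote the token-grid height and width.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: a 2D depthwise conv (no BatchNorm) produces
    conditional position encodings that are added to the token sequence."""
    def __init__(self, dim, k=3):
        super().__init__()
        # groups=dim makes this a depthwise convolution
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=k // 2, groups=dim)

    def forward(self, tokens, H, W):
        B, N, C = tokens.shape                              # tokens: (batch, H*W, dim)
        feat = tokens.transpose(1, 2).reshape(B, C, H, W)   # tokens -> 2D feature map
        return tokens + self.proj(feat).flatten(2).transpose(1, 2)

tokens = torch.randn(2, 14 * 14, 64)
print(PEG(64)(tokens, 14, 14).shape)  # torch.Size([2, 196, 64])
```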

5. MoGA-A

MoGA-A is a convolutional neural network optimized for mobile latency, discovered through the Mobile GPU-Aware (MoGA) neural architecture search. Its basic building blocks are MBConvs (inverted residual blocks) from MobileNetV2, and squeeze-and-excitation layers were also explored in the search.

6. MoGA-C

MoGA-C is a convolutional neural network optimized for mobile latency, discovered through the Mobile GPU-Aware (MoGA) neural architecture search. Its basic building blocks are MBConvs (inverted residual blocks) from MobileNetV2, and squeeze-and-excitation layers were also explored in the search.

7. Visformer

Visformer, the vision-friendly Transformer, is an architecture that combines features of Transformer-based architectures with those of convolutional neural networks. Visformer adopts a stage-wise (hierarchical) design for higher base performance, but uses self-attention only in the last two stages, since self-attention at high-resolution stages is relatively inefficient even when FLOPs are balanced. In the first stage, Visformer adopts a bottleneck block and, inspired by ResNeXt, uses 3×3 group convolutions inside it. It also introduces BatchNorm into the patch embedding modules, as in CNNs.
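A sketch of such a bottleneck block with a 3×3 group convolution, in the spirit of ResNeXt; the widths and group count here are illustrative, not Visformer's exact configuration:

```python
import torch
import torch.nn as nn

class GroupBottleneck(nn.Module):
    """Bottleneck: 1x1 reduce -> 3x3 group conv -> 1x1 expand, with residual."""
    def __init__(self, channels, mid, groups=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=groups, bias=False),  # 3x3 group conv
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

x = torch.randn(1, 96, 28, 28)
print(GroupBottleneck(96, 48)(x).shape)  # torch.Size([1, 96, 28, 28])
```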

8. Multi-Heads of Mixed Attention

The multi-head of mixed attention (MHMA) combines self-attention and cross-attention, encouraging high-level learning of interactions between the entities captured in the various attention features. It is built from multiple attention heads, each of which can implement either self-attention or cross-attention. Self-attention means the key features and query features are the same or come from the same domain; cross-attention means the key and query features are generated from different features. MHMA modeling allows the model to identify relationships between features from different domains. This is useful in tasks involving relational modeling, such as human-object interaction, tool-tissue interaction, and human-computer interaction.
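The self- vs. cross-attention distinction can be made concrete with a single attention head whose queries come from one source and keys/values from another; passing the same tensor for both reduces it to self-attention. This is a generic sketch, not the MHMA paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """One attention head. query_src == key_src gives self-attention;
    different sources give cross-attention."""
    def __init__(self, dim, head_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, head_dim)
        self.k = nn.Linear(dim, head_dim)
        self.v = nn.Linear(dim, head_dim)
        self.scale = head_dim ** -0.5

    def forward(self, query_src, key_src):
        q, k, v = self.q(query_src), self.k(key_src), self.v(key_src)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

# hypothetical example: features from two different domains
humans, objects = torch.randn(1, 5, 128), torch.randn(1, 7, 128)
head = AttentionHead(128)
self_out = head(humans, humans)    # self-attention within one domain
cross_out = head(humans, objects)  # cross-attention between domains
print(self_out.shape, cross_out.shape)  # (1, 5, 64) (1, 5, 64)
```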

9. LocalViT

LocalViT introduces depthwise convolutions to enhance the local feature modeling capability of ViT. As shown in Figure (c) of the paper, the network injects a locality mechanism into the transformer through depthwise convolutions (denoted "DW"). To accommodate the convolution operations, conversions between token sequences and image feature maps are added via "Seq2Img" and "Img2Seq". The computation is as follows:

The input token sequence is first reshaped into a feature map arranged on the 2D lattice. Two pointwise convolutions and one depthwise convolution are applied to the feature map, which is then reshaped back into a sequence of tokens; these are consumed by the self-attention of the next transformer layer.
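A sketch of that token-to-feature-map round trip; the expansion ratio and GELU activations are common choices, not necessarily LocalViT's exact settings:

```python
import torch
import torch.nn as nn

class LocalFFN(nn.Module):
    """Seq2Img -> 1x1 conv -> 3x3 depthwise conv ("DW") -> 1x1 conv -> Img2Seq."""
    def __init__(self, dim, expand=4):
        super().__init__()
        mid = dim * expand
        self.net = nn.Sequential(
            nn.Conv2d(dim, mid, 1), nn.GELU(),                          # pointwise
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid), nn.GELU(),   # depthwise
            nn.Conv2d(mid, dim, 1),                                     # pointwise
        )

    def forward(self, tokens, H, W):
        B, N, C = tokens.shape
        feat = tokens.transpose(1, 2).reshape(B, C, H, W)   # Seq2Img
        feat = self.net(feat)
        return feat.flatten(2).transpose(1, 2)              # Img2Seq

tokens = torch.randn(2, 14 * 14, 96)
print(LocalFFN(96)(tokens, 14, 14).shape)  # torch.Size([2, 196, 96])
```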

10. SPP-Net

SPP-Net is a convolutional neural architecture that uses spatial pyramid pooling to remove the network's fixed-size input constraint. Specifically, an SPP layer is added on top of the last convolutional layer. The SPP layer pools the features and produces a fixed-length output, which is then fed into the fully connected layers (or another classifier). In other words, information is aggregated deeper in the network hierarchy (between the convolutional and fully connected layers) to avoid the need for cropping or warping the input at the start.
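A minimal SPP layer using adaptive pooling at a few pyramid levels; the (4, 2, 1) grid sizes follow the classic configuration, and the output length is fixed regardless of input resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    """Spatial pyramid pooling: pool at several grid sizes and concatenate,
    yielding a fixed-length vector for any input spatial size."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                       # x: (B, C, H, W), arbitrary H and W
        B = x.size(0)
        feats = [F.adaptive_max_pool2d(x, n).reshape(B, -1) for n in self.levels]
        return torch.cat(feats, dim=1)          # (B, C * (16 + 4 + 1))

spp = SPP()
print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 9, 17)).shape)   # same fixed length: torch.Size([1, 5376])
```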

11. The Ikshana Hypothesis of Human Scene Understanding Mechanism

12. DetNASNet

DetNASNet is a convolutional neural network designed as an object detection backbone and discovered through the DetNAS neural architecture search. It uses ShuffleNet V2 blocks as its basic building blocks.
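The defining operation inside a ShuffleNet V2 block is the channel shuffle, which lets the block's two split branches exchange information; a minimal sketch:

```python
import torch

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the split branches mix information."""
    B, C, H, W = x.shape
    x = x.reshape(B, groups, C // groups, H, W)
    x = x.transpose(1, 2).reshape(B, C, H, W)   # regroup: (group, channel) -> (channel, group)
    return x

x = torch.arange(8.0).reshape(1, 8, 1, 1)
print(channel_shuffle(x).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```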

13. TResNet

TResNet is a family of ResNet variants designed to improve accuracy while maintaining GPU training and inference efficiency. It incorporates a number of design refinements, including a SpaceToDepth stem, anti-aliased downsampling, in-place activated BatchNorm (Inplace-ABN), dedicated block-type selection, and squeeze-and-excitation layers.
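The SpaceToDepth stem replaces the usual strided conv plus pooling: it losslessly rearranges each 4×4 pixel block into channels and follows with a single convolution. PyTorch's `pixel_unshuffle` implements the rearrangement; the conv kernel size here is illustrative:

```python
import torch
import torch.nn as nn

class SpaceToDepthStem(nn.Module):
    """Rearrange 4x4 spatial blocks into channels, then apply one convolution,
    downsampling 224x224 -> 56x56 with no information lost before the conv."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.PixelUnshuffle(4),                    # (B, 3, H, W) -> (B, 48, H/4, W/4)
            nn.Conv2d(3 * 16, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)

print(SpaceToDepthStem()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 56, 56])
```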

14. MoGA-B

MoGA-B is a convolutional neural network optimized for mobile latency, discovered through the Mobile GPU-Aware (MoGA) neural architecture search. Its basic building blocks are MBConvs (inverted residual blocks) from MobileNetV2, and squeeze-and-excitation layers were also explored in the search.

15. Colorization Transformer

The Colorization Transformer (ColTran) colorizes grayscale images in stages. For the coarse low-resolution colorization, a conditional variant of the Axial Transformer is applied, and the authors exploit the Axial Transformer's semi-parallel sampling mechanism. Finally, a fast, parallel, deterministic upsampling model super-resolves the coarse color image into the final high-resolution output.

16. CSPDenseNet-Elastic

CSPDenseNet-Elastic is a convolutional neural network and object detection backbone obtained by applying the Cross Stage Partial Network (CSPNet) approach to DenseNet-Elastic. CSPNet splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy; this split-and-merge strategy allows more gradient paths to flow through the network.
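A generic sketch of the cross-stage partial idea (not the exact CSPDenseNet-Elastic wiring): split the channels, send one half through the stage's blocks, keep the other half as an identity path, then merge through a transition layer.

```python
import torch
import torch.nn as nn

class CSPStage(nn.Module):
    """Cross Stage Partial stage: only half the channels pass through the
    (potentially heavy) blocks; the two halves are re-merged at the end."""
    def __init__(self, channels, blocks):
        super().__init__()
        self.blocks = blocks        # any module operating on channels // 2
        self.transition = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                           # split the base feature map
        b = self.blocks(b)                                 # only part b crosses the stage
        return self.transition(torch.cat([a, b], dim=1))   # cross-stage merge

# stand-in for a dense block, operating on 32 of the 64 channels
stage = CSPStage(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
print(stage(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```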

17. Harm-Net

A harmonic network, or Harm-Net, is a convolutional neural network that replaces convolutional layers with "harmonic blocks" that use discrete cosine transform (DCT) filters. These blocks can truncate high-frequency information, exploiting redundancy in the spectral domain.
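A rough sketch of a harmonic block under simplifying assumptions: fixed 2D DCT-II basis filters extract spectral features per channel, a learned 1x1 convolution mixes them, and dropping filters truncates high frequencies (the naive truncation order here stands in for the paper's frequency ordering):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def dct_filters(k=3):
    """k*k fixed 2D DCT-II basis filters of size k x k."""
    basis = torch.zeros(k * k, k, k)
    for u in range(k):
        for v in range(k):
            for x in range(k):
                for y in range(k):
                    basis[u * k + v, x, y] = (
                        math.cos(math.pi * u * (2 * x + 1) / (2 * k))
                        * math.cos(math.pi * v * (2 * y + 1) / (2 * k)))
    return basis

class HarmonicBlock(nn.Module):
    """Fixed DCT filters per input channel, then a learned 1x1 mixing conv.
    keep < k*k truncates the filter bank (spectral redundancy)."""
    def __init__(self, in_ch, out_ch, k=3, keep=None):
        super().__init__()
        filt = dct_filters(k)[: keep or k * k]
        n = filt.size(0)
        # tile the filter bank across input channels for a grouped (per-channel) conv
        self.register_buffer("weight", filt.unsqueeze(1).repeat(in_ch, 1, 1, 1))
        self.in_ch, self.k = in_ch, k
        self.mix = nn.Conv2d(in_ch * n, out_ch, 1, bias=False)

    def forward(self, x):
        spec = F.conv2d(x, self.weight, padding=self.k // 2, groups=self.in_ch)
        return self.mix(spec)

print(HarmonicBlock(16, 32, keep=6)(torch.randn(1, 16, 28, 28)).shape)  # (1, 32, 28, 28)
```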

18. PReLU-Net

PReLU-Net is a convolutional neural network that uses the parametric ReLU (PReLU) activation function. It also introduced a robust initialization scheme, later known as Kaiming initialization, that accounts for rectifier nonlinearities.
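Both ideas are one-liners in PyTorch; the layer sizes below are illustrative:

```python
import torch
import torch.nn as nn

# PReLU(x) = x if x > 0 else a * x, with the negative slope `a` learned per channel
layer = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.PReLU(64))

# Kaiming (He) initialization scales weights to account for the rectifier;
# `a` is the negative slope assumed for the PReLU / leaky part
for m in layer.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, a=0.25, nonlinearity='leaky_relu')

print(layer(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```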

19. Twins-SVT

Twins-SVT is a vision transformer that uses a spatially separable self-attention mechanism (SSSA) composed of two types of attention operations: (i) locally-grouped self-attention (LSA), which captures fine-grained, short-range information, and (ii) global sub-sampled attention (GSA), which handles long-range, global information. In addition, it uses conditional positional encodings and the architectural design of the Pyramid Vision Transformer.
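A sketch of the GSA half of the mechanism: keys and values come from a spatially reduced copy of the feature map, so attention cost drops by the square of the reduction ratio. Single-head for brevity; the real model is multi-head and alternates this with LSA:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSA(nn.Module):
    """Global sub-sampled attention: queries attend over a sub-sampled key/value map."""
    def __init__(self, dim, sr_ratio=2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr = nn.Conv2d(dim, dim, sr_ratio, stride=sr_ratio)  # spatial sub-sampling
        self.scale = dim ** -0.5

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x)
        red = self.sr(x.transpose(1, 2).reshape(B, C, H, W))      # (B, C, H/r, W/r)
        red = red.flatten(2).transpose(1, 2)
        k, v = self.kv(red).chunk(2, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

x = torch.randn(1, 28 * 28, 64)
print(GSA(64)(x, 28, 28).shape)  # torch.Size([1, 784, 64])
```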

20. EsViT

EsViT proposes two techniques for developing efficient self-supervised vision transformers for visual representation learning: a multi-stage architecture with sparse self-attention, and a new region-matching pre-training task. The multi-stage architecture reduces modeling complexity but loses the ability to capture fine-grained correspondences between image regions; the region-matching pre-training task lets the model capture those fine-grained region dependencies, significantly improving the quality of the learned visual representations.
