[Computer Vision | Image Model] An introductory collection of common computer vision image models (CNNs & Transformers) (9)

1. GreedyNAS-C

GreedyNAS-C is a convolutional neural network discovered using the GreedyNAS neural architecture search method. The basic building blocks are the inverted residual block (from MobileNetV2) and the squeeze-and-excitation block.
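As a rough illustration of the second building block, here is a minimal NumPy sketch of squeeze-and-excitation; the random weights and the reduction ratio are illustrative only, not GreedyNAS-C's learned parameters:

```python
import numpy as np

def squeeze_excitation(x, reduction=4):
    """Illustrative squeeze-and-excitation on a (C, H, W) tensor."""
    c = x.shape[0]
    # Squeeze: global average pool each channel to one descriptor.
    z = x.mean(axis=(1, 2))                      # (C,)
    # Excitation: two tiny fully connected layers. Random weights here,
    # just to show the shapes; a real network learns w1 and w2.
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c))
    w2 = rng.standard_normal((c, c // reduction))
    s = np.maximum(w1 @ z, 0)                    # ReLU
    gate = 1 / (1 + np.exp(-(w2 @ s)))           # sigmoid, in (0, 1)
    # Scale: reweight each channel by its gate value.
    return x * gate[:, None, None]

x = np.ones((8, 4, 4))
y = squeeze_excitation(x)
print(y.shape)  # (8, 4, 4)
```

The inverted residual block would wrap a depthwise convolution between two pointwise convolutions; squeeze-and-excitation is typically inserted after the depthwise stage.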


2. RegionViT

RegionViT consists of two tokenization processes that convert an image into region tokens (upper path) and local tokens (lower path). Each tokenization is a convolution with a different patch size: the region tokens use a patch size of 28², the local tokens use 4², and both are projected to the same dimension C. One region token therefore covers 7² local tokens based on spatial locality, giving a window size of 7². In stage 1, the two sets of tokens are passed through the proposed regional-to-local transformer encoder. In the later stages, to balance the computational load and obtain feature maps of different resolutions, the method uses a downsampling process that halves the spatial resolution while doubling the channel dimension of both the region and local tokens before entering the next stage. Finally, at the end of the network, the remaining region tokens are simply averaged to form the final embedding for classification, while detection uses all the local tokens at each stage, since they provide more fine-grained location information. Through the pyramid structure, this ViT can generate multi-scale features and can thus be easily extended to more vision applications beyond image classification, such as object detection.
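Assuming a standard 224×224 input resolution (an assumption, not stated above), the token bookkeeping implied by these patch sizes works out as follows:

```python
# Token counts for a 224x224 input with the patch sizes above.
image = 224
region_patch, local_patch = 28, 4

region_tokens = (image // region_patch) ** 2            # 8 * 8  = 64
local_tokens = (image // local_patch) ** 2              # 56 * 56 = 3136
locals_per_region = (region_patch // local_patch) ** 2  # 7 * 7  = 49

print(region_tokens, local_tokens, locals_per_region)   # 64 3136 49
```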


3. DenseNAS-B

DenseNAS-B is a mobile convolutional neural network discovered through the DenseNAS neural architecture search method. The basic building block is the MBConv (inverted bottleneck residual) from the MobileNet architectures.


4. DenseNAS-C

DenseNAS-C is a mobile convolutional neural network discovered through the DenseNAS neural architecture search method. The basic building block is the MBConv (inverted bottleneck residual) from the MobileNet architectures.


5. DiCENet

DiCENet is a convolutional neural network architecture built around dimension-wise convolutions and dimension-wise fusion. A dimension-wise convolution applies lightweight convolutional filtering over each dimension of the input tensor, while dimension-wise fusion efficiently combines these dimension-wise representations; this lets the DiCE units in the network efficiently encode both the spatial and the channel-wise information contained in the input tensor.


6. uNetXST

uNetXST is a uNet-style neural network architecture that takes multiple (X) tensors as input and includes spatial transformer units (ST).


7. CSPPeleeNet

CSPPeleeNet is a convolutional neural network and object detection backbone obtained by applying the Cross Stage Partial Network (CSPNet) approach to PeleeNet. CSPNet splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. This split-and-merge strategy allows more gradient flow through the network.
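The split-and-merge strategy can be sketched as follows; the lambda stands in for PeleeNet's dense block, and the even half-half channel split is a simplification:

```python
import numpy as np

def csp_stage(x, block):
    """Cross-stage partial connection (simplified): split the feature
    map's channels in two, send only one part through the block, then
    concatenate the untouched part back before the transition layer."""
    c = x.shape[0] // 2
    part1, part2 = x[:c], x[c:]
    return np.concatenate([part1, block(part2)], axis=0)

x = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)
y = csp_stage(x, block=lambda t: t * 2)  # stand-in for a dense block
print(y.shape)  # (8, 2, 2)
```

Because one half of the channels bypasses the block entirely, its gradients reach the base layer directly, which is the "more gradient flow" benefit described above.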


8. PocketNet

PocketNet is a family of face recognition models discovered through neural architecture search. Training is based on multi-step knowledge distillation.


9. OODformer

OODformer is a Transformer-based out-of-distribution (OOD) detection architecture that leverages the contextualization capability of the Transformer. Using a transformer as the principal feature extractor makes it possible to exploit object concepts and their discriminative attributes, as well as their co-occurrence, through visual attention.

OODformer uses ViT and its data-efficient variant DeiT as backbones. Each encoder layer consists of multi-head self-attention (MSA) and multi-layer perceptron (MLP) blocks. The combination of MSA and MLP layers in the encoder jointly encodes attribute importance, association relevance and co-occurrence. The [class] token (the representation of the image) integrates multiple attributes and their associated characteristics through global context. The [class] token of the last layer is used for OOD detection in two ways: first, it is passed to a softmax to obtain a confidence score, and second, it is used for a latent-space distance calculation.
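Both scoring routes can be sketched with NumPy; the toy logits, embedding and class means below are made up purely for illustration:

```python
import numpy as np

def softmax_confidence(logits):
    """Maximum softmax probability; low values suggest OOD input."""
    e = np.exp(logits - logits.max())
    return (e / e.sum()).max()

def distance_score(embedding, class_means):
    """Distance from the [class]-token embedding to the nearest class
    mean in latent space; large distances suggest OOD input."""
    return min(np.linalg.norm(embedding - m) for m in class_means)

logits = np.array([4.0, 1.0, 0.5])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(round(softmax_confidence(logits), 3))          # 0.926
print(distance_score(np.array([3.0, 4.0]), means))   # 1.0
```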


10. DeepSIM

DeepSIM is a generative model for conditional image manipulation based on a single image. The network learns to map a primitive representation of the image to the image itself. At manipulation time, the generator allows complex image changes to be made by modifying the primitive input representation and mapping it through the network. The choice of primitive representation affects the ease and expressiveness of the manipulation, and it can be automatic (e.g., edges), manual, or a hybrid (e.g., edges on top of a segmentation).


11. Conditional Positional Encoding Vision Transformer (CPVT)

CPVT (Conditional Positional Encoding Vision Transformer) is a vision transformer that utilizes conditional positional encodings. Apart from the new encoding, it follows the same architecture as ViT and DeiT.
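The conditional encoding is generated on the fly from the tokens themselves by a local operator on the 2D token map (in CPVT, a learned depthwise convolution). A simplified NumPy sketch with a fixed 3×3 averaging kernel standing in for the learned weights; note that the zero padding at the borders is what makes the output position-dependent:

```python
import numpy as np

def conditional_pos_encoding(tokens, h, w):
    """PEG-style conditional positional encoding (simplified): reshape
    the token sequence to its 2D layout, apply a 3x3 averaging filter
    per channel, and add the result back to the tokens."""
    n, c = tokens.shape
    grid = tokens.reshape(h, w, c)
    pad = np.pad(grid, ((1, 1), (1, 1), (0, 0)))  # zero-pad the borders
    enc = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            enc += pad[dy:dy + h, dx:dx + w] / 9.0
    return tokens + enc.reshape(n, c)

tokens = np.ones((16, 4))  # a 4x4 token grid with 4 channels
out = conditional_pos_encoding(tokens, 4, 4)
print(out.shape)  # (16, 4)
```

Even with identical input tokens, border tokens receive a different encoding than interior ones, so the model can infer position without a fixed positional table.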


12. ESPNetv2

ESPNetv2 is a convolutional neural network that utilizes sets of point-wise and depthwise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters.
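The large effective receptive field comes from dilation. The standard effective-kernel-size identity (a generic formula, not specific to ESPNetv2) shows how it grows with the dilation rate at constant parameter count:

```python
# Effective kernel size of a dilated convolution:
# k_eff = k + (k - 1) * (d - 1), where d is the dilation rate.
def effective_kernel(k, d):
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4, 8):
    print(d, effective_kernel(3, d))  # 3, 5, 9, 17
```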


13. Shuffle Transformer

The Shuffle Transformer block consists of a shuffle multi-head self-attention module (Shuffle-MHSA), a neighbor-window connection module (NWC) and an MLP module. To introduce cross-window connections while retaining the efficient computation of non-overlapping windows, a strategy of alternating WMSA and Shuffle-WMSA in consecutive Shuffle Transformer blocks is proposed: the first window-based Transformer block uses a regular window partitioning strategy, while the second uses window-based self-attention together with spatial shuffling. In addition, a neighbor-window connection (NWC) module is added to each block to strengthen the connections between neighboring windows. The proposed Shuffle Transformer block can therefore build rich cross-window connections and enhance the representation, with successive blocks alternating the two attention variants, each followed by the NWC module and the MLP.

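The spatial shuffle that gives Shuffle-WMSA its cross-window connections can be illustrated in 1D (the real module shuffles tokens across 2D windows, but the reshape-transpose trick is the same):

```python
import numpy as np

def spatial_shuffle(x, window):
    """Shuffle a 1D token line so that each new window mixes tokens
    drawn from every original window."""
    n = x.shape[0]
    groups = n // window
    return x.reshape(groups, window).T.reshape(n)

tokens = np.arange(8)              # two windows: [0..3] and [4..7]
print(spatial_shuffle(tokens, 4))  # [0 4 1 5 2 6 3 7]
```

After the shuffle, window-local attention on [0, 4, 1, 5] and [2, 6, 3, 7] attends across both original windows, which is exactly the cross-window information flow the block needs.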

14. ECA-Net

ECA-Net is a convolutional neural network that utilizes efficient channel attention modules.
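The "efficient" part is that channel attention is computed with a single fast 1D convolution over the pooled channel descriptor, whose kernel size is chosen adaptively from the channel count. A small sketch of the kernel-size rule commonly stated for ECA (γ=2, b=1; treat the exact constants as an assumption here):

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D-conv kernel size for ECA: grows with log2(C),
    rounded up to the nearest odd number."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))  # 64->3, 128->5, 256->5, 512->5
```

The appeal of the rule is that wider layers automatically get a slightly larger attention kernel without any extra fully connected layers or channel dimensionality reduction.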


15. CSPDenseNet

CSPDenseNet is a convolutional neural network and object detection backbone obtained by applying the Cross Stage Partial Network (CSPNet) approach to DenseNet. CSPNet splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. This split-and-merge strategy allows more gradient flow through the network.



Origin blog.csdn.net/wzk4869/article/details/132899849