[Computer Vision | Image Models] An introductory collection of common computer vision image models (CNNs & Transformers) (7)

1. CSPResNeXt

CSPResNeXt is a convolutional neural network that applies the Cross Stage Partial Network (CSPNet) approach to ResNeXt. CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. This split-and-merge strategy allows more gradient flow through the network.
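
To make the split-and-merge idea concrete, here is a minimal PyTorch sketch (the `CSPStage` class and its layer sizes are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class CSPStage(nn.Module):
    """Minimal sketch of a cross-stage partial (CSP) stage: split the
    base-layer feature map along channels, pass only one half through the
    stage's blocks, then merge the two parts with a 1x1 transition conv."""
    def __init__(self, channels, blocks):
        super().__init__()
        assert channels % 2 == 0
        self.blocks = blocks                           # operates on channels // 2
        self.transition = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)        # split the feature map
        part2 = self.blocks(part2)                     # only this half goes deep
        return self.transition(torch.cat([part1, part2], dim=1))  # cross-stage merge

# `blocks` stands in for any stack of residual blocks (e.g. ResNeXt blocks)
stage = CSPStage(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
print(stage(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```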

2. ProxylessNet-Mobile

ProxylessNet-Mobile is a convolutional neural architecture learned by the ProxylessNAS neural architecture search algorithm and optimized for mobile devices. It uses the inverted residual blocks (MBConvs) of MobileNetV2 as its basic building block.
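
For reference, a minimal PyTorch sketch of such an inverted residual block (the hyperparameters are illustrative, not a searched ProxylessNAS configuration):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """MobileNetV2-style inverted residual block: 1x1 expansion ->
    depthwise 3x3 -> 1x1 linear projection, with a skip connection
    whenever the input and output shapes match."""
    def __init__(self, in_ch, out_ch, expand=6, stride=1):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),           # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride,
                      padding=1, groups=hidden, bias=False),   # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),          # linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

print(MBConv(32, 32)(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```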

3. ProxylessNet-CPU

ProxylessNet-CPU is an image model learned by the ProxylessNAS neural architecture search algorithm and optimized for CPUs. Like its mobile counterpart, it uses the inverted residual blocks (MBConvs) of MobileNetV2 as its basic building block.

4. RandWire

RandWire is a convolutional neural network whose architecture is a randomly wired network sampled from a stochastic network generator, where a human-designed random process defines the generation.
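
A minimal sketch of the generator idea, assuming the Watts-Strogatz generator (one of those used in the paper) and orienting edges by node index to obtain a DAG (a simplification of the full pipeline, in which each node would be a convolutional unit):

```python
import networkx as nx

def random_wiring(num_nodes=32, k=4, p=0.75, seed=0):
    """Sample an undirected small-world graph, then orient every edge from
    the lower-indexed node to the higher-indexed one to get an acyclic
    wiring between (hypothetical) convolutional nodes."""
    g = nx.watts_strogatz_graph(num_nodes, k, p, seed=seed)
    dag = nx.DiGraph((min(u, v), max(u, v)) for u, v in g.edges())
    inputs = [n for n in dag.nodes if dag.in_degree(n) == 0]    # graph inputs
    outputs = [n for n in dag.nodes if dag.out_degree(n) == 0]  # graph outputs
    return dag, inputs, outputs

dag, inputs, outputs = random_wiring()
print(len(dag.edges()), inputs, outputs)
```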

5. McKernel

McKernel introduces a framework for using kernel approximations in a mini-batch setting with stochastic gradient descent (SGD) as an alternative to deep learning.

The core library was developed in 2014 as part of a Master of Science thesis at Carnegie Mellon University and City University of Hong Kong [1, 2]. The original intent was to speed up Random Kitchen Sinks (Rahimi & Recht, 2007) by writing a very efficient Hadamard transform, which was the main computational bottleneck of the approach. The code was later extended at ETH Zürich (Curtó et al., McKernel, 2017) into a framework in which both kernel methods and neural networks can be interpreted. This manuscript and the corresponding paper constituted one of the first uses (if not the first) of Fourier features in the deep learning literature, and it subsequently attracted widespread research attention and interest from the community.

More information can be found in the first author's talk at ICLR 2020 (iclr2020_DeCurto).
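
To make the Random Kitchen Sinks idea concrete, here is a minimal NumPy sketch of random Fourier features for an RBF kernel; note that McKernel itself accelerates the projection with a fast Hadamard transform rather than the dense matrix used here:

```python
import numpy as np

# Random Kitchen Sinks (Rahimi & Recht, 2007): approximate the RBF kernel
# k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) with random features z(x)
# such that z(x) @ z(y) ~ k(x, y).
rng = np.random.default_rng(0)
d, D, sigma = 16, 2048, 1.0
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # samples from the kernel's spectrum
b = rng.uniform(0, 2 * np.pi, size=D)            # random phases

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
print(exact, z(x) @ z(y))   # the two values should be close
```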

6. Assemble-ResNet

Assemble-ResNet is a modification of the ResNet architecture that assembles several tweaks, including ResNet-D, channel attention, anti-aliased downsampling, and Big-Little Networks.
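
As an example of one of these tweaks, here is a minimal PyTorch sketch of anti-aliased (blur) downsampling, assuming a fixed 3x3 binomial filter (the filter size is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool(nn.Module):
    """Anti-aliased downsampling: low-pass filter each channel with a fixed
    binomial kernel before subsampling, instead of skipping pixels directly."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        kernel = k[:, None] * k[None, :]              # 3x3 binomial filter
        kernel = kernel / kernel.sum()
        self.register_buffer("kernel", kernel[None, None].repeat(channels, 1, 1, 1))
        self.stride = stride
        self.channels = channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")    # keep borders sensible
        return F.conv2d(x, self.kernel, stride=self.stride,
                        groups=self.channels)          # per-channel blur + subsample

print(BlurPool(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 28, 28])
```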

7. Convolution-enhanced Image Transformer (CeiT)

The Convolution-enhanced Image Transformer (CeiT) combines the strengths of CNNs in extracting low-level features and enforcing locality with the strengths of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) an Image-to-Tokens (I2T) module extracts patches from generated low-level features rather than tokenizing the raw input image directly; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) module is attached at the top of the Transformer to utilize the multi-level representations.
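
A minimal PyTorch sketch of the LeFF idea (dimensions and the class-token handling are simplified relative to the paper):

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Locally-enhanced Feed-Forward: project tokens up, restore them to a
    2D grid, apply a depthwise 3x3 conv so neighboring tokens interact
    spatially, then flatten and project back down. The class token skips
    the spatial step."""
    def __init__(self, dim, expand=4):
        super().__init__()
        hidden = dim * expand
        self.up = nn.Linear(dim, hidden)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.down = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):                             # x: (B, 1 + H*W, dim)
        cls, tokens = x[:, :1], x[:, 1:]
        b, n, _ = tokens.shape
        h = w = int(n ** 0.5)
        t = self.act(self.up(tokens))
        t = t.transpose(1, 2).reshape(b, -1, h, w)    # tokens -> 2D feature map
        t = self.act(self.dw(t))                      # local (depthwise) mixing
        t = self.down(t.reshape(b, -1, n).transpose(1, 2))
        return torch.cat([self.down(self.act(self.up(cls))), t], dim=1)

print(LeFF(192)(torch.randn(2, 1 + 14 * 14, 192)).shape)  # torch.Size([2, 197, 192])
```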

8. IICNet

The Invertible Image Conversion Network (IICNet) is a generic framework for reversible image conversion tasks. Unlike previous encoder-decoder based methods, IICNet maintains a highly invertible structure based on invertible neural networks (INNs) to better retain information during conversion. Relation modules strengthen the INN's nonlinearity so it can extract cross-image relations, and channel squeeze layers improve the network's flexibility.
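
To illustrate why an INN retains information, here is a minimal sketch of an additive coupling layer, the standard invertible building block (a simplification; IICNet's actual blocks and relation modules are more elaborate):

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Additive coupling: split channels into (x1, x2) and add a function of
    x1 to x2. The mapping is exactly invertible, so no information is lost
    in the forward pass."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.f = nn.Sequential(nn.Conv2d(half, half, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(half, half, 3, padding=1))

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        return torch.cat([x1, x2 + self.f(x1)], dim=1)

    def inverse(self, y):
        y1, y2 = torch.chunk(y, 2, dim=1)
        return torch.cat([y1, y2 - self.f(y1)], dim=1)

layer = AdditiveCoupling(8)
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(layer.inverse(layer(x)), x, atol=1e-6))  # True
```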

9. RegNetX

RegNetX is a family of convolutional networks drawn from the RegNet design space (Radosavovic et al., "Designing Network Design Spaces", 2020). Rather than designing a single network, the authors design a design space in which the block widths and depths of good networks are explained by a quantized linear function: widths grow linearly with block index and are then quantized into a small number of stages. The RegNetY variants additionally add squeeze-and-excitation blocks.
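
A small NumPy sketch of that parameterization (the helper name and the example values of `w_0`, `w_a`, `w_m`, and `depth` are illustrative, not a specific RegNetX variant):

```python
import numpy as np

def regnet_widths(w_0=24, w_a=24.48, w_m=2.54, depth=22):
    """Quantized-linear width rule: widths grow linearly with block index
    (w_j = w_0 + w_a * j), are snapped to a geometric grid with ratio w_m,
    and are rounded to multiples of 8."""
    j = np.arange(depth)
    widths_cont = w_0 + w_a * j                               # linear rule
    s = np.round(np.log(widths_cont / w_0) / np.log(w_m))
    widths = w_0 * np.power(w_m, s)                           # geometric quantization
    return (np.round(widths / 8) * 8).astype(int)             # multiples of 8

print(regnet_widths())  # runs of equal consecutive widths form the stages
```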

10. SCARLET

SCARLET is a convolutional neural architecture learned through the SCARLET-NAS neural architecture search method; its three variants are SCARLET-A, SCARLET-B, and SCARLET-C. The basic building block is the MBConv from MobileNetV2, and squeeze-and-excitation layers were also experimented with.
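
For reference, a minimal PyTorch sketch of a squeeze-and-excitation layer of the kind explored in the search space (the reduction ratio is illustrative):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: global-average-pool ("squeeze"), pass through
    a small bottleneck MLP, and use sigmoid gates to rescale each channel
    ("excitation")."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        scale = self.fc(x.mean(dim=(2, 3)))        # squeeze to (B, C)
        return x * scale[:, :, None, None]         # per-channel gating

print(SqueezeExcite(32)(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 32, 28, 28])
```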

11. ProxylessNet-GPU

ProxylessNet-GPU is a convolutional neural network architecture learned by the ProxylessNAS neural architecture search algorithm and optimized for GPUs. It uses the inverted residual blocks (MBConvs) of MobileNetV2 as its basic building block.

12. VATT

The Video-Audio-Text Transformer (VATT) is a framework for learning multi-modal representations from unlabeled data using a convolution-free Transformer architecture. Specifically, it takes raw signals as input and extracts multi-dimensional representations rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture of BERT and ViT, except that the tokenization layer and linear projection are kept separate for each modality. The design follows the same spirit as ViT of making minimal changes to the architecture, so that the learned model can transfer its weights to various frameworks and tasks.

VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the different granularities of the modalities, and noise contrastive estimation is employed to train the model.
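
A minimal sketch of such a contrastive objective in PyTorch (an InfoNCE-style simplification; VATT's actual losses operate on its hierarchical common space):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Align two batches of paired embeddings: matching pairs (the diagonal)
    should score higher than all mismatched pairs (the noise)."""
    z_a = F.normalize(z_a, dim=-1)            # e.g. video embeddings, (B, d)
    z_b = F.normalize(z_b, dim=-1)            # e.g. audio embeddings, (B, d)
    logits = z_a @ z_b.t() / temperature      # similarity of every pair
    targets = torch.arange(z_a.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

print(contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```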

13. GPU-Efficient Network

GENet (or GPU-Efficient Network) is a family of efficient models found through neural architecture search. The search is performed over several types of convolutional blocks, including depthwise convolutions, batch normalization, ReLU, and inverted bottleneck structures.

14. DELG

DELG is a convolutional neural network for image retrieval that combines generalized mean (GeM) pooling for global features with attentive selection of local features. By carefully balancing the gradient flow between the two heads, the entire network can be learned end-to-end, requiring only image-level labels. This enables efficient inference: a single model extracts an image's global feature, detected keypoints, and local descriptors.

The model is built by exploiting the hierarchical image representations that emerge in CNNs, combined with generalized mean pooling for the global feature and attention for local feature detection. Second, a convolutional autoencoder module successfully learns low-dimensional local descriptors; it integrates easily into the unified model and avoids the commonly used post-processing learning steps such as PCA. Finally, a procedure trains the proposed model end-to-end using only image-level supervision, which requires carefully controlling the gradient flow between the global and local network heads during backpropagation to avoid destroying the desired representations.
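
To make the global-feature pooling concrete, here is a minimal PyTorch sketch of generalized mean (GeM) pooling (the exponent `p = 3` is illustrative; in practice `p` can be learned):

```python
import torch

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized mean pooling over a CNN feature map: a single exponent p
    interpolates between average pooling (p = 1) and max pooling (p -> inf).
    x: (B, C, H, W) -> (B, C) global descriptor."""
    return x.clamp(min=eps).pow(p).mean(dim=(2, 3)).pow(1.0 / p)

feats = torch.randn(2, 2048, 7, 7)
print(gem_pool(feats).shape)  # torch.Size([2, 2048])
```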

15. HaloNet

HaloNet is a self-attention based model for efficient image classification. It relies on a blocked local self-attention formulation that maps efficiently to existing hardware through haloing: queries come from non-overlapping blocks, while keys and values extend a small halo beyond each block. This formulation breaks translational equivariance, but the authors find that it improves throughput and accuracy compared to centered local self-attention. The method also uses a strided self-attention operation for downsampling in multi-scale feature extraction.
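
A minimal sketch of the blocked-attention bookkeeping with a halo, using `F.unfold` to gather query blocks and their haloed key/value neighborhoods (block and halo sizes are illustrative; the attention computation itself is omitted):

```python
import torch
import torch.nn.functional as F

def blocked_with_halo(x, block=4, halo=1):
    """Queries come from non-overlapping block x block windows; keys/values
    come from the same windows grown by `halo` pixels on every side, so
    neighboring blocks share context without global attention."""
    # query blocks: non-overlapping windows
    q = F.unfold(x, kernel_size=block, stride=block)           # (B, C*b*b, nBlocks)
    # key/value blocks: the same windows enlarged by the halo
    kv = F.unfold(F.pad(x, [halo] * 4),
                  kernel_size=block + 2 * halo, stride=block)  # (B, C*(b+2h)^2, nBlocks)
    return q, kv

q, kv = blocked_with_halo(torch.randn(1, 32, 16, 16))
print(q.shape, kv.shape)  # torch.Size([1, 512, 16]) torch.Size([1, 1152, 16])
```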
