[Computer Vision | CNN] Introduction to Common Algorithms: Image Model Blocks (3)

1. Inception-ResNet-v2-C

Inception-ResNet-v2-C is an image model block used for the 8 x 8 grid in the Inception-ResNet-v2 architecture. It largely follows the ideas of Inception modules and grouped convolutions, but also adds residual connections.

[Figure: Inception-ResNet-v2-C block structure]
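
As a rough illustration, here is a minimal PyTorch sketch of an Inception-ResNet-style residual block in the spirit of the C block: parallel convolution branches are concatenated, projected back to the input width with a linear 1x1 convolution, scaled, and added to the input. The channel counts and the residual scaling factor are illustrative assumptions, not the exact published values.

```python
# Minimal sketch of an Inception-ResNet-style "C" block (8 x 8 grid).
# Channel counts and the residual scale are illustrative assumptions.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size, padding=0):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class InceptionResNetC(nn.Module):
    def __init__(self, in_ch=2080, scale=0.2):
        super().__init__()
        # Branch 1: pointwise convolution.
        self.branch0 = conv_bn_relu(in_ch, 192, 1)
        # Branch 2: 1x1 followed by factorized 1x3 and 3x1 convolutions.
        self.branch1 = nn.Sequential(
            conv_bn_relu(in_ch, 192, 1),
            conv_bn_relu(192, 224, (1, 3), padding=(0, 1)),
            conv_bn_relu(224, 256, (3, 1), padding=(1, 0)),
        )
        # Linear 1x1 convolution restores the input channel count so the
        # residual addition is shape-compatible.
        self.project = nn.Conv2d(192 + 256, in_ch, 1)
        self.scale = scale
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch0(x), self.branch1(x)], dim=1)
        # Scaled residual connection, as used throughout Inception-ResNet.
        return self.relu(x + self.scale * self.project(out))
```

Because the projection restores the input width, a 2080-channel 8 x 8 tensor keeps its shape through the block: `InceptionResNetC()(torch.randn(1, 2080, 8, 8))`.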

2. Inception-ResNet-v2-B

Inception-ResNet-v2-B is an image model block used for the 17 x 17 grid in the Inception-ResNet-v2 architecture. It largely follows the ideas of Inception modules and grouped convolutions, but also adds residual connections.

[Figure: Inception-ResNet-v2-B block structure]

3. Inception-ResNet-v2-A

Inception-ResNet-v2-A is an image model block used for the 35 x 35 grid in the Inception-ResNet-v2 architecture.

[Figure: Inception-ResNet-v2-A block structure]

4. Inception-ResNet-v2 Reduction-B

Inception-ResNet-v2 Reduction-B is the image model block used in the Inception-ResNet-v2 architecture.

[Figure: Inception-ResNet-v2 Reduction-B block structure]

5. Convolutional Block Attention Module

The Convolutional Block Attention Module (CBAM) is an attention module for convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions (channel and spatial) and multiplies them with the input feature map for adaptive feature refinement.

[Figure: CBAM overview — sequential channel and spatial attention sub-modules]
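
A minimal PyTorch sketch of the two sub-modules follows, assuming the commonly used reduction ratio of 16 and a 7x7 spatial kernel; treat the exact layer choices here as an approximation of the paper's design.

```python
# Minimal sketch of CBAM-style channel and spatial attention, applied
# sequentially and multiplied with the input feature map.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled channel descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)   # refine along channels first
        return x * self.sa(x)  # then along spatial positions
```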

6. Efficient Channel Attention

Efficient Channel Attention (ECA) is an architectural unit, based on squeeze-and-excitation blocks, that reduces model complexity without dimensionality reduction. It was proposed as part of the ECA-Net CNN architecture.

After channel-wise global average pooling without dimensionality reduction, ECA captures local cross-channel interaction by considering each channel together with its k neighbors. This can be implemented efficiently with a fast 1D convolution of size k, where the kernel size k determines the coverage of the local cross-channel interaction, i.e., how many neighbors take part in the attention prediction for a channel.

[Figure: Efficient Channel Attention (ECA) module]
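
A minimal PyTorch sketch of the idea, with a fixed kernel size k = 3; the paper derives k adaptively from the channel count, so the fixed value here is an illustrative assumption.

```python
# Minimal sketch of Efficient Channel Attention: global average pooling without
# dimensionality reduction, then a 1D convolution of size k over the channels.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        # 1D conv over the channel dimension; each channel attends to its
        # k neighbors (k should be odd so padding preserves the length).
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                 # channel-wise global average pooling
        y = self.conv(y.unsqueeze(1))          # (b, 1, c): local cross-channel interaction
        w = torch.sigmoid(y).view(b, c, 1, 1)  # per-channel attention weights
        return x * w
```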

7. ShuffleNet V2 Downsampling Block

The ShuffleNet V2 Downsampling Block is the spatial downsampling block used in the ShuffleNet V2 architecture. Unlike the regular ShuffleNet V2 block, it removes the channel split operator, so the number of output channels is doubled.

[Figure: ShuffleNet V2 downsampling block]
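
A minimal PyTorch sketch of such a downsampling unit: with the channel split removed, both branches process the full input at stride 2, and concatenating them halves the spatial resolution while doubling the channels. The exact widths and layer ordering are illustrative assumptions.

```python
# Minimal sketch of a ShuffleNet V2 downsampling unit (no channel split,
# two stride-2 branches, channel shuffle after concatenation).
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleV2Downsample(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # Branch 1: 3x3 depthwise (stride 2) -> 1x1 pointwise.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )
        # Branch 2: 1x1 -> 3x3 depthwise (stride 2) -> 1x1.
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Concatenating the two stride-2 branches halves the spatial size
        # and doubles the number of channels.
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return channel_shuffle(out)
```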

8. FBNet Block

The FBNet Block is an image model block used in the FBNet architectures, which were discovered through DNAS neural architecture search. Its basic building blocks are depthwise convolutions and residual connections.

[Figure: FBNet block structure]
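
A minimal PyTorch sketch of an FBNet-style block: an expansion 1x1 convolution, a depthwise convolution, a linear 1x1 projection, and a skip connection when shapes allow. The expansion factor and kernel size are among the choices searched by DNAS; the values here are illustrative assumptions.

```python
# Minimal sketch of an FBNet-style block (inverted-residual form with a
# depthwise convolution and an optional residual connection).
import torch.nn as nn

class FBNetBlock(nn.Module):
    def __init__(self, in_ch, out_ch, expansion=6, kernel_size=3, stride=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                 # expand
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,    # depthwise
                      padding=kernel_size // 2, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                # linear project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        # Residual connection only when stride is 1 and widths match.
        return x + out if self.use_residual else out
```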

9. Neural Attention Fields

NEAT, or Neural Attention Fields, is a feature representation for end-to-end imitation learning models. NEAT is a continuous function that maps locations in bird's-eye-view (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows the model to selectively attend to regions of the input that are relevant to the driving task while ignoring irrelevant information, effectively associating the images with the BEV representation. In addition, visualizing the attention maps of NEAT's intermediate representation improves interpretability.

[Figure: NEAT (Neural Attention Fields) overview]

10. Inception-A

Inception-A is the image model block used in the Inception-v4 architecture.

[Figure: Inception-A block structure]

11. Inception-B

Inception-B is the image model block used in the Inception-v4 architecture.

[Figure: Inception-B block structure]

12. Inception-C

Inception-C is the image model block used in the Inception-v4 architecture.

[Figure: Inception-C block structure]

13. Reduction-B

Reduction-B is the image model block used in the Inception-v4 architecture.

[Figure: Reduction-B block structure]

14. Global Context Block

The Global Context Block is an image model block for global context modeling. It aims to combine the advantages of the simplified non-local block, which effectively models long-range dependencies, with those of the squeeze-and-excitation block, which is computationally lightweight.

[Figure: Global Context (GC) block — simplified non-local context modeling followed by an SE-style transform]
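
A minimal PyTorch sketch of the idea: a simplified non-local operation pools a single global context vector with softmax attention over positions, and an SE-style bottleneck transform fuses it back into every position. The reduction ratio is an illustrative assumption.

```python
# Minimal sketch of a Global Context (GC) block.
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, 1)          # context attention logits
        hidden = channels // reduction
        self.transform = nn.Sequential(                # SE-style bottleneck transform
            nn.Conv2d(channels, hidden, 1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Softmax over all spatial positions yields one global attention map.
        weights = torch.softmax(self.attn(x).view(b, 1, h * w), dim=-1)     # (b, 1, hw)
        context = torch.bmm(weights, x.view(b, c, h * w).transpose(1, 2))   # (b, 1, c)
        context = context.transpose(1, 2).view(b, c, 1, 1)
        # Fuse the transformed global context into every position (broadcast add).
        return x + self.transform(context)
```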

15. One-Shot Aggregation

One-Shot Aggregation is an image model block that replaces the dense block by aggregating intermediate features only once. It was proposed as part of the VoVNet architecture. Each convolutional layer has two-way connections: one connects to the next layer to produce features with a larger receptive field, and the other is aggregated only once into the final output feature map. The difference from DenseNet is that the output of each layer is not routed to all subsequent intermediate layers, which keeps the input size of the intermediate layers constant.

[Figure: One-Shot Aggregation (OSA) module]
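
A minimal PyTorch sketch of an OSA module: each 3x3 convolution feeds only the next one, and all intermediate outputs (plus the input) are concatenated exactly once and fused by a 1x1 convolution. The number of layers and the channel widths are illustrative assumptions.

```python
# Minimal sketch of a One-Shot Aggregation (OSA) module.
import torch
import torch.nn as nn

class OSAModule(nn.Module):
    def __init__(self, in_ch, stage_ch, out_ch, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, stage_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(stage_ch), nn.ReLU(inplace=True),
            ))
            ch = stage_ch  # intermediate layers keep a constant input width
        # Single aggregation: concatenate the input and all intermediate maps.
        self.concat_conv = nn.Sequential(
            nn.Conv2d(in_ch + num_layers * stage_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return self.concat_conv(torch.cat(feats, dim=1))
```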

16. Pyramidal Residual Unit

A pyramidal residual unit is a type of residual unit in which the number of channels gradually increases with the depth at which the layer occurs, resembling a pyramid whose shape gradually widens from top to bottom. It was introduced as part of the PyramidNet architecture.

[Figure: Pyramidal residual unit]
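
A small sketch of the additive widening rule behind this design, in which the channel count grows by a constant step alpha / N at every unit instead of doubling only at downsampling stages; the base width, alpha, and depth below are illustrative assumptions.

```python
# Minimal sketch of PyramidNet-style additive channel widening.
def pyramidal_channel_widths(base_channels=16, alpha=48, num_units=12):
    widths, ch = [], float(base_channels)
    for _ in range(num_units):
        ch += alpha / num_units          # gradual, linear increase per unit
        widths.append(int(round(ch)))    # round to a usable channel count
    return widths

print(pyramidal_channel_widths())  # [20, 24, 28, ..., 64]
```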

17. Pyramidal Bottleneck Residual Unit

A pyramidal bottleneck residual unit is a residual unit in which the number of channels gradually increases with the depth at which the layer occurs, resembling a pyramid whose shape gradually widens from top to bottom. It also contains a bottleneck built from 1x1 convolutions. It was introduced as part of the PyramidNet architecture.

[Figure: Pyramidal bottleneck residual unit]

18. Fractal Block

A Fractal Block is the image model block used in the FractalNet architecture. It is built by recursively applying an expansion rule that joins parallel convolutional paths of different depths.

[Figure: Fractal block expansion rule]

19. Transformer in Transformer

The Transformer is a self-attention-based neural network originally applied to NLP tasks. Recently, purely transformer-based models have been proposed to solve computer vision problems. These vision transformers usually treat an image as a sequence of patches, ignoring the intrinsic structural information inside each patch. Transformer-iN-Transformer (TNT) addresses this by modeling both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes the patch embeddings, while an inner transformer block extracts local features from the pixel embeddings. The pixel-level features are projected into the patch-embedding space through a linear layer and then added to the patch embeddings. Stacking TNT blocks yields a TNT model for image recognition.

[Figure: Transformer-iN-Transformer (TNT) block]
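
A minimal PyTorch sketch of one TNT block, using the standard nn.TransformerEncoderLayer as a stand-in for the paper's transformer blocks; the embedding dimensions and head counts are illustrative assumptions.

```python
# Minimal sketch of a TNT block: inner transformer on pixel embeddings,
# projection into the patch space, outer transformer on patch embeddings.
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    def __init__(self, patch_dim=384, pixel_dim=24, pixels_per_patch=16, heads=6):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(pixel_dim, nhead=4, batch_first=True)
        self.outer = nn.TransformerEncoderLayer(patch_dim, nhead=heads, batch_first=True)
        # Linear projection from flattened pixel embeddings to the patch space.
        self.proj = nn.Linear(pixels_per_patch * pixel_dim, patch_dim)

    def forward(self, patch_tokens, pixel_tokens):
        # patch_tokens: (batch, num_patches, patch_dim)
        # pixel_tokens: (batch * num_patches, pixels_per_patch, pixel_dim)
        b, n, _ = patch_tokens.shape
        pixel_tokens = self.inner(pixel_tokens)            # pixel-level mixing
        local = self.proj(pixel_tokens.reshape(b, n, -1))  # project to patch space
        patch_tokens = self.outer(patch_tokens + local)    # patch-level mixing
        return patch_tokens, pixel_tokens
```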

20. Dilated Bottleneck Block

Dilated Bottleneck Block is the image model block used in the DetNet convolutional neural network architecture. It adopts a bottleneck structure with dilated convolution to effectively expand the receptive field.

[Figure: Dilated bottleneck block (DetNet)]
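
A minimal PyTorch sketch of a dilated bottleneck of this kind: a 1x1 reduction, a 3x3 convolution with dilation 2, a 1x1 expansion, and a residual connection; the channel widths are illustrative assumptions.

```python
# Minimal sketch of a DetNet-style dilated bottleneck block.
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    def __init__(self, channels=256, bottleneck=64, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            # Dilated 3x3 keeps the spatial resolution while enlarging the receptive field.
            nn.Conv2d(bottleneck, bottleneck, 3, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))
```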

Source: blog.csdn.net/wzk4869/article/details/132914935