[Computer Vision | CNN] A Collection of Common Algorithms: An Introduction to Image Model Blocks (5)

1. Harmonic Block

The harmonic block is an image model component that utilizes discrete cosine transform (DCT) filters. Convolutional neural networks (CNNs) learn filters to capture local correlation patterns in feature space. The DCT, by contrast, uses preset spectral filters that compress information well, since natural images are redundant in the spectral domain.

The DCT has been used successfully in JPEG encoding to convert image blocks into spectral representations, capturing most of the information with a small number of coefficients. The harmonic block learns how to optimally combine the spectral coefficients at every layer, producing a fixed-size representation defined as a weighted sum of DCT filter responses. Using DCT filters also lends itself to model compression, since less informative spectral coefficients can be dropped.
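
To make the structure concrete, here is a minimal PyTorch sketch (the class name, filter size, and the unnormalized DCT basis are illustrative assumptions, not the paper's exact configuration): fixed DCT basis filters extract per-channel spectral responses, and a learnable 1 × 1 convolution forms their weighted sum.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HarmonicBlock(nn.Module):
    """Minimal sketch of a harmonic block: fixed DCT filters + learned 1x1 mix.
    Names and sizes are illustrative; normalization of the basis is omitted."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # Build the k*k two-dimensional DCT-II basis filters (fixed, not learned).
        basis = torch.zeros(k * k, 1, k, k)
        for u in range(k):
            for v in range(k):
                for i in range(k):
                    for j in range(k):
                        basis[u * k + v, 0, i, j] = (
                            math.cos(math.pi * u * (i + 0.5) / k)
                            * math.cos(math.pi * v * (j + 0.5) / k))
        # One copy of the basis per input channel, applied depthwise.
        self.register_buffer("dct", basis.repeat(in_ch, 1, 1, 1))
        self.in_ch, self.k = in_ch, k
        # Learnable weighted combination of the spectral responses.
        self.combine = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, x):
        # Depthwise conv with the fixed DCT filters -> per-channel spectra.
        spec = F.conv2d(x, self.dct, padding=self.k // 2, groups=self.in_ch)
        return self.combine(spec)
```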

2. Spatial Group-wise Enhance

Spatial Group-wise Enhance (SGE) is a convolutional neural network module that adjusts the importance of each sub-feature by generating an attention factor for every spatial location in each semantic group, letting each individual group autonomously enhance its learned representation and suppress possible noise.

Within each feature group, the spatial enhancement mechanism scales the feature vectors at all locations with an attention mask. This mask aims to suppress possible noise and highlight the correct semantic feature regions. Unlike other popular attention methods, SGE generates the mask from the similarity between the global statistical feature of the group and the local feature at each location.
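
The mechanism is compact enough to sketch in PyTorch, following the structure of the published SGE formulation (the group count and the small epsilon are typical defaults, not mandated values):

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    """Sketch of SGE: per-group spatial attention from local-global similarity."""

    def __init__(self, groups=8):
        super().__init__()
        self.groups = groups
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.ones(1, groups, 1, 1))

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.size()
        x = x.view(b * self.groups, -1, h, w)          # split channels into groups
        sim = (x * self.pool(x)).sum(1, keepdim=True)  # local-global similarity map
        t = sim.view(b * self.groups, -1)
        t = (t - t.mean(1, keepdim=True)) / (t.std(1, keepdim=True) + 1e-5)
        t = t.view(b, self.groups, h, w) * self.weight + self.bias
        mask = torch.sigmoid(t.view(b * self.groups, 1, h, w))
        return (x * mask).view(b, c, h, w)             # enhance/suppress per location
```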

3. Residual SRM

Residual SRM is a convolutional neural network module that places a style-based recalibration module (SRM) inside the residual block structure. SRM adaptively recalibrates intermediate feature maps by exploiting their styles.
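
A sketch of SRM and one way to place it in a residual block (the two-conv body is an illustrative stand-in for the actual residual branch):

```python
import torch
import torch.nn as nn

class SRM(nn.Module):
    """Sketch of the style-based recalibration module: style pooling
    (per-channel mean and std) followed by a channel-wise FC and a gate."""

    def __init__(self, channels):
        super().__init__()
        # Channel-wise fully connected layer over the two style statistics.
        self.cfc = nn.Conv1d(channels, channels, kernel_size=2,
                             groups=channels, bias=False)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c = x.shape[:2]
        flat = x.view(b, c, -1)
        style = torch.stack([flat.mean(-1), flat.std(-1)], dim=2)  # (B, C, 2)
        gate = torch.sigmoid(self.bn(self.cfc(style)))             # (B, C, 1)
        return x * gate.view(b, c, 1, 1)

class ResidualSRM(nn.Module):
    """Residual block with SRM recalibrating the residual branch."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.srm = SRM(channels)

    def forward(self, x):
        return torch.relu(x + self.srm(self.body(x)))
```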

4. DiCE Unit

DiCE units are image model blocks built from dimension-wise convolution and dimension-wise fusion. Dimension-wise convolution applies lightweight convolutional filtering along each dimension of the input tensor, while dimension-wise fusion efficiently combines these dimension-wise representations, allowing DiCE units to efficiently encode both the spatial and the channel information in the input tensor.

Standard convolutions encode spatial and channel information simultaneously, but they are computationally expensive. Separable convolutions were introduced to improve their efficiency: spatial and channel information are encoded separately using depthwise and pointwise convolutions, respectively. Although this factorization is efficient, it places most of the computational load on the pointwise convolutions, making them the bottleneck.

DiCE units use dimension-wise convolutions to encode depth, width, and height information independently. These convolutions capture local information along each dimension of the input tensor but not global information. A pointwise convolution could provide the global view, but it is computationally expensive, so dimension-wise fusion factorizes it into two steps: (1) local fusion and (2) global fusion.
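
A simplified sketch of dimension-wise convolution (the fixed spatial size is an assumption of this version, since the height- and width-wise branches treat H and W as channel axes):

```python
import torch
import torch.nn as nn

class DimWiseConv(nn.Module):
    """Sketch of dimension-wise convolution: a depthwise k x k conv is applied
    three times, each time with a different tensor dimension (channel, height,
    width) playing the role of the depth axis."""

    def __init__(self, channels, height, width, k=3):
        super().__init__()
        p = k // 2
        self.conv_c = nn.Conv2d(channels, channels, k, padding=p, groups=channels)
        self.conv_h = nn.Conv2d(height, height, k, padding=p, groups=height)
        self.conv_w = nn.Conv2d(width, width, k, padding=p, groups=width)

    def forward(self, x):                          # x: (B, C, H, W)
        out_c = self.conv_c(x)                     # depth-wise branch
        out_h = self.conv_h(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # height-wise
        out_w = self.conv_w(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # width-wise
        return torch.cat([out_c, out_h, out_w], dim=1)  # (B, 3C, H, W)
```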

5. Dimension-wise Fusion

Dimension-wise fusion combines the dimension-wise representations produced by the DiCE unit's dimension-wise convolutions. As described above, it factorizes the expensive pointwise convolution into two steps, local fusion followed by global fusion, so that channel interactions are captured at a lower computational cost.
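
Under the same assumptions, a hypothetical sketch of the two-step fusion: a grouped pointwise convolution for local fusion, then a squeeze-style global gate:

```python
import torch
import torch.nn as nn

class DimWiseFusion(nn.Module):
    """Hypothetical sketch: a grouped point-wise conv mixes channels within
    each dimension-wise group (local fusion), then a gate driven by globally
    pooled context injects global information (global fusion)."""

    def __init__(self, in_channels, out_channels, groups=3):
        super().__init__()
        # Local fusion: each group corresponds to one dimension-wise response.
        self.local = nn.Conv2d(in_channels, out_channels, 1, groups=groups)
        # Global fusion: gate every channel with globally pooled context.
        self.gate = nn.Sequential(nn.Conv2d(out_channels, out_channels, 1),
                                  nn.Sigmoid())

    def forward(self, x):              # x: e.g. the (B, 3C, H, W) DimWiseConv output
        y = self.local(x)
        return y * self.gate(y.mean(dim=(2, 3), keepdim=True))
```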

6. Strided EESP

Strided EESP units are based on EESP units, modified to learn representations more efficiently at multiple scales. The depthwise dilated convolutions are given strides, an average pooling operation is added in place of the identity connection, and the element-wise addition is replaced with concatenation, which helps expand the dimensionality of the feature maps cheaply.
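
A much-simplified sketch (branch widths and dilation rates are illustrative, and the grouped pointwise convolutions of the real ESPNetv2 unit are omitted):

```python
import torch
import torch.nn as nn

class StridedEESP(nn.Module):
    """Simplified sketch of a strided EESP unit."""

    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3, 4)):
        super().__init__()
        mid = (out_ch - in_ch) // len(dilations)  # concat with pooled input -> out_ch
        self.reduce = nn.Conv2d(in_ch, mid, kernel_size=1)
        # Parallel depthwise dilated 3x3 convolutions, each with stride 2.
        self.branches = nn.ModuleList(
            nn.Conv2d(mid, mid, 3, stride=2, padding=d, dilation=d, groups=mid)
            for d in dilations)
        self.pool = nn.AvgPool2d(3, stride=2, padding=1)  # replaces the identity link

    def forward(self, x):
        r = self.reduce(x)
        outs, acc = [], None
        for branch in self.branches:
            y = branch(r)
            acc = y if acc is None else acc + y    # hierarchical feature fusion
            outs.append(acc)
        # Concatenation (not addition) expands the output dimensionality.
        return torch.cat(outs + [self.pool(x)], dim=1)
```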

7. Compact Global Descriptor

A compact global descriptor is an image model block that models the interactions between positions across different dimensions (e.g., channels and frames). The descriptor lets subsequent convolutions access informative global features, and it is a form of attention.
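
The paper's exact formulation is not reproduced here; the following is only a loose, hypothetical illustration of the idea of pooling to a compact descriptor, modeling interactions, and feeding the result back:

```python
import torch
import torch.nn as nn

class CompactGlobalDescriptor(nn.Module):
    """Loose, hypothetical illustration only (not the paper's formulation):
    pool each channel into a compact scalar descriptor, model cross-channel
    interactions with a small learned map, and feed the result back."""

    def __init__(self, channels):
        super().__init__()
        self.interact = nn.Linear(channels, channels, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        d = x.mean(dim=(2, 3))                     # compact global descriptor, (B, C)
        m = torch.tanh(self.interact(d))           # cross-position interaction
        return x * (1 + m.unsqueeze(-1).unsqueeze(-1))  # modulate the features
```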

8. OSA (identity mapping + eSE)

One-shot aggregation (OSA) with identity mapping and eSE is an image model block that extends one-shot aggregation with a residual connection and an effective squeeze-and-excitation (eSE) block. It was proposed as part of the VoVNetV2 CNN architecture.

The module adds an identity mapping to the OSA module: the input path is added to the end of the OSA module so that, as in ResNet, the gradients of every OSA module can be backpropagated end-to-end at each stage. In addition, a channel attention module, effective squeeze-and-excitation (eSE), is applied; it is similar to regular squeeze-and-excitation but uses a single FC layer with C channels instead of two FC layers, avoiding the channel dimensionality reduction and thus preserving channel information.
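
A sketch of the module (layer count and widths are illustrative; in_ch must equal out_ch for the identity mapping):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OSAModule(nn.Module):
    """Sketch of OSA with identity mapping and eSE."""

    def __init__(self, in_ch, mid_ch, out_ch, n_layers=5):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)))
            ch = mid_ch
        self.concat_conv = nn.Conv2d(in_ch + n_layers * mid_ch, out_ch, 1)
        self.ese_fc = nn.Conv2d(out_ch, out_ch, 1)  # eSE: single FC, no reduction

    def forward(self, x):
        feats, y = [x], x
        for layer in self.layers:
            y = layer(y)
            feats.append(y)                       # aggregate all outputs once (OSA)
        y = self.concat_conv(torch.cat(feats, dim=1))
        s = y.mean(dim=(2, 3), keepdim=True)      # global average pool (squeeze)
        y = y * F.hardsigmoid(self.ese_fc(s))     # eSE channel attention
        return y + x                              # identity mapping
```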

9. Effective Squeeze-and-Excitation Block

The effective squeeze-and-excitation (eSE) module is an image model module based on squeeze-and-excitation, except that it uses one FC layer fewer. The authors point out a limitation of the SE module: to avoid excessive model complexity, its two FC layers must reduce the channel dimension, and this dimensionality reduction loses channel information.
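
The module then reduces to a few lines (a hard-sigmoid gate, as used in VoVNetV2's implementation, is assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESE(nn.Module):
    """Sketch of effective squeeze-and-excitation: global average pool, then
    a single FC (1x1 conv) with NO channel reduction, then a gating function."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)      # squeeze: (B, C, 1, 1)
        return x * F.hardsigmoid(self.fc(s))      # excite: per-channel gate
```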

10. Elastic ResNeXt Block

The Elastic ResNeXt block is a modified ResNeXt block that adds downsampling and upsampling in parallel branches at each layer. It is called "elastic" because every layer in the network can flexibly choose the better scale through a soft policy.
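
A sketch of the core idea (grouped 3 × 3 convolutions stand in for the ResNeXt branch; even spatial dimensions are assumed so the two paths align):

```python
import torch
import torch.nn as nn

class ElasticBranch(nn.Module):
    """Sketch of the elastic idea: a full-resolution branch runs in parallel
    with a downsample-conv-upsample branch, and their outputs are merged so
    the layer can favor whichever scale works best."""

    def __init__(self, channels, groups=4):
        super().__init__()
        self.high = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.low = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.down = nn.AvgPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        return self.high(x) + self.up(self.low(self.down(x)))
```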

11. Depthwise Fire Module

The depthwise fire module is a modification of the fire module that uses depthwise separable convolutions to improve inference-time performance. It is used for object detection in the CornerNet-Lite architecture.
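
A sketch (channel sizes are illustrative):

```python
import torch
import torch.nn as nn

class DepthwiseFireModule(nn.Module):
    """Sketch of a fire module whose 3x3 expand conv is replaced with a
    depthwise-separable convolution."""

    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, 1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, 1)
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_ch, squeeze_ch, 3, padding=1, groups=squeeze_ch),
            nn.Conv2d(squeeze_ch, expand_ch, 1))   # depthwise then pointwise

    def forward(self, x):
        s = torch.relu(self.squeeze(x))
        return torch.cat([torch.relu(self.expand1x1(s)),
                          torch.relu(self.expand3x3(s))], dim=1)
```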

12. CornerNet-Squeeze Hourglass Module

The CornerNet-Squeeze hourglass module is the image model block used in CornerNet-Lite. It is based on the hourglass module but uses a modified fire module in place of the residual block. Beyond replacing the residual blocks, further modifications include: reducing the maximum feature map resolution of the hourglass modules by adding one more downsampling layer before them; removing one downsampling layer inside each hourglass module; replacing the 3 × 3 filters in CornerNet's prediction modules with 1 × 1 filters; and replacing the nearest-neighbor upsampling in the hourglass network with transposed convolutions with 4 × 4 kernels.

13. Contextual Residual Aggregation

Contextual Residual Aggregation (CRA) is an image inpainting module. It generates the high-frequency residuals for the missing content as a weighted aggregation of residuals from contextual patches, so the network only needs to produce a low-resolution prediction. Specifically, a neural network predicts a low-resolution inpainting result, which is upsampled to yield a large but blurry image. The high-frequency residuals for the patches inside the hole are then generated by aggregating the weighted high-frequency residuals of the context patches. Finally, the aggregated residuals are added to the large blurry image to obtain a sharp result.
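
A heavily simplified, pixel-level sketch of the final compositing step; the real module works on patches, and the aggregation weights `attn` are assumed to be given by the network's attention scores:

```python
import torch

def cra_composite(blur_up, hires, mask, attn):
    """blur_up: upsampled low-res prediction, (B, C, H, W)
    hires:   original high-resolution input, (B, C, H, W)
    mask:    1 inside the hole, 0 in the context, (B, 1, H, W)
    attn:    hole-to-context aggregation weights, (B, H*W, H*W), rows assumed
             normalized over context locations (given, not computed here)."""
    residual = (hires - blur_up) * (1 - mask)     # high-freq residual of the context
    b, c, h, w = residual.shape
    r = residual.flatten(2).transpose(1, 2)       # (B, H*W, C)
    agg = torch.bmm(attn, r).transpose(1, 2).view(b, c, h, w)  # weighted aggregation
    # Keep the context as-is; fill the hole with blurry content + sharp residual.
    return hires * (1 - mask) + (blur_up + agg) * mask
```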

14. Content-Conditioned Style Encoder

The content-conditioned style encoder (COCO) is a style encoder for image-to-image translation in the COCO-FUNIT architecture. Unlike the style encoder in FUNIT, COCO takes both the content image and the style image as input. This content-conditioning scheme creates a direct feedback path during learning, letting the content image influence how the style code is computed, and it helps reduce the direct impact of the style image on the extracted style code.
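
Schematically, in a hypothetical simplification (the actual COCO-FUNIT encoder architecture differs):

```python
import torch
import torch.nn as nn

class ContentConditionedStyleEncoder(nn.Module):
    """Hypothetical sketch: the style code is computed from BOTH the style
    image and the content image, so content conditions the style code."""

    def __init__(self, feat_ch=64, style_dim=64):
        super().__init__()
        def backbone():
            return nn.Sequential(
                nn.Conv2d(3, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.style_net = backbone()
        self.content_net = backbone()
        self.fc = nn.Linear(2 * feat_ch, style_dim)

    def forward(self, style_img, content_img):
        s = self.style_net(style_img).mean(dim=(2, 3))      # pooled style features
        c = self.content_net(content_img).mean(dim=(2, 3))  # pooled content features
        return self.fc(torch.cat([s, c], dim=1))  # content-conditioned style code
```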

15. Hierarchical-Split Block

The hierarchical-split block is a representation block for multi-scale feature representation. It contains many hierarchical split-and-concatenate connections within a single residual block.

Specifically, the feature maps in a deep neural network are divided into s groups, each with w channels. Only the first group is connected directly to the next layer. The second group of feature maps is first passed through 3 × 3 filters to extract features, and the resulting feature maps are split into two subgroups along the channel dimension. One subgroup is connected directly to the next layer, while the other is concatenated with the next group's input feature maps along the channel dimension. The concatenated feature maps are processed by another set of 3 × 3 filters, and this process repeats until the remaining input groups have been processed. Finally, the feature maps from all groups are concatenated and sent through a 1 × 1 filter to reconstruct the features.
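
The split-and-carry bookkeeping is easiest to see in code (a sketch; HS-ResNet's actual convolution widths differ, and group widths are assumed to stay even where they are halved):

```python
import torch
import torch.nn as nn

class HierarchicalSplitBlock(nn.Module):
    """Sketch of a hierarchical-split block."""

    def __init__(self, channels, splits=4):
        super().__init__()
        assert channels % splits == 0
        self.splits = splits
        self.w = channels // splits
        convs, in_ch = [], self.w
        for _ in range(splits - 1):
            convs.append(nn.Conv2d(in_ch, in_ch, 3, padding=1))
            in_ch = in_ch // 2 + self.w    # carried half + next input group
        self.convs = nn.ModuleList(convs)
        self.fuse = nn.Conv2d(channels, channels, 1)  # final 1x1 reconstruction

    def forward(self, x):
        groups = torch.split(x, self.w, dim=1)
        outs = [groups[0]]                 # group 1 passes straight through
        carry = groups[1]
        for i, conv in enumerate(self.convs):
            y = conv(carry)
            if i < len(self.convs) - 1:    # split: half out, half carried on
                half = y.shape[1] // 2
                outs.append(y[:, :half])
                carry = torch.cat([y[:, half:], groups[i + 2]], dim=1)
            else:
                outs.append(y)             # last group: everything goes out
        return self.fuse(torch.cat(outs, dim=1))
```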

16. Attentional Liquid Warping Block

The Attentional Liquid Warping Block (AttLWB) is a module for human image synthesis GANs that propagates source information (such as texture, style, color, and face identity) in both image space and feature space to the synthesized reference. It first learns the global feature similarities among all multi-source features, then fuses the multi-source features through a linear combination of the learned similarities with the multi-source features in feature space. Finally, to better propagate the source identity (style, color, and texture) into the global stream, the fused source features are warped into the global stream via spatially-adaptive normalization (SPADE).

17. Patch Merger Module

PatchMerger is a Vision Transformer module that reduces the number of tokens/patches passed to each subsequent Transformer encoder block, maintaining performance while reducing compute. PatchMerger linearly transforms an input of shape N × D (N patches of dimension D) using a learnable weight matrix of shape M × D, producing a score matrix of shape M × N. A softmax is applied to the scores over the N patches, and the result is multiplied with the original input to produce an output of shape M × D, i.e., M merged tokens.

Mathematically:

$\text{PatchMerger}(X) = \operatorname{softmax}(W X^{\top})\,X$

where $X \in \mathbb{R}^{N \times D}$ is the input and $W \in \mathbb{R}^{M \times D}$ is the learnable weight matrix.
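
A sketch in PyTorch (the LayerNorm placement and the weight initialization are assumptions):

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Sketch of PatchMerger: reduce N input tokens to M output tokens."""

    def __init__(self, dim, out_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.weight = nn.Parameter(torch.randn(out_tokens, dim) * dim ** -0.5)  # (M, D)

    def forward(self, x):                          # x: (B, N, D)
        scores = torch.matmul(self.weight, self.norm(x).transpose(1, 2))  # (B, M, N)
        attn = scores.softmax(dim=-1)   # each output token attends over all patches
        return torch.matmul(attn, x)    # (B, M, D)
```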

18. Global Local Attention Module

The Global-Local Attention Module (GLAM) is an image model block that attends to the channel and spatial dimensions of the feature map both locally and globally. The locally-attended feature map, the globally-attended feature map, and the original feature map are then fused via a weighted sum (with learnable weights) to obtain the final feature map.
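
The attention branches themselves are involved, but the final fusion is tiny. In this sketch the softmax normalization of the three weights is an assumption, and local_feat/global_feat are assumed to come from GLAM's local and global attention branches:

```python
import torch
import torch.nn as nn

class GlamFusion(nn.Module):
    """Sketch of GLAM's final step: fuse the locally-attended map, the
    globally-attended map, and the original map with learnable weights."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))   # one learnable weight per map

    def forward(self, local_feat, global_feat, x):
        w = torch.softmax(self.w, dim=0)       # keep the weights normalized
        return w[0] * local_feat + w[1] * global_feat + w[2] * x
```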

Source: blog.csdn.net/wzk4869/article/details/132915391