Paper translation: Searching Central Difference Convolutional Networks for Face Anti-Spoofing, with implementation code


Original article and model source code: https://github.com/ZitongYu/CDCN

Summary

Face anti-spoofing (FAS) plays a vital role in face recognition systems. Most state-of-the-art FAS methods 1) rely on stacked convolutions and expert-designed networks, which are weak at describing detailed fine-grained information and easily fail when the environment changes (e.g., under different lighting), and 2) tend to use long sequences as input to extract dynamic features, which makes them hard to deploy in scenarios that require a fast response. In this paper, we propose a frame-level FAS method based on Central Difference Convolution (CDC), which is able to capture intrinsic detailed patterns by aggregating both intensity and gradient information. A network built with CDC, called the Central Difference Convolutional Network (CDCN), provides stronger modeling capability than its counterpart built with ordinary convolution. Furthermore, over a specifically designed CDC search space, Neural Architecture Search (NAS) is used to discover a more powerful network structure (CDCN++), which can be combined with a Multiscale Attention Fusion Module (MAFM) to further boost performance. Comprehensive experiments are conducted on six benchmark datasets, and the results show that 1) the proposed method achieves superior performance in intra-dataset testing (in particular, 0.2% ACER on Protocol-1 of the OULU-NPU dataset), and 2) it also generalizes well in cross-dataset testing (in particular, 6.5% HTER from CASIA-MFSD to Replay-Attack). The code is available at https://github.com/ZitongYu/CDCN .

1 Introduction

Face recognition has been widely used in many interactive artificial intelligence systems because of its convenience. However, vulnerability to presentation attacks (PA) limits its reliable deployment. Simply presenting a printed photo or a video replay to a biometric sensor can fool a face recognition system; printed photos, video replays and 3D masks are typical examples of such attacks. To ensure the reliability of face recognition systems, face anti-spoofing (FAS) methods are essential for detecting these presentation attacks.

Notes:
VanillaConv refers to ordinary (vanilla) convolution, i.e., the standard convolution operator used in plain convolutional networks, without any modification.

Figure 1: Feature responses of vanilla convolution (VanillaConv) and center difference convolution (CDC) for spoofing faces in different offset domains (illumination and input camera). Ordinary convolutions fail to capture consistent spoofing patterns, while CDC is able to extract invariant detailed spoofing features, such as lattice artifacts.

In recent years, both methods based on handcrafted features [7, 8, 15, 29, 45, 44] and methods based on deep learning [49, 64, 36, 26, 62, 4, 19, 20] have been proposed for presentation attack detection (PAD). On the one hand, classic handcrafted descriptors (e.g., Local Binary Patterns (LBP) [7]) exploit local relations among neighbors as discriminative features, which are robust for describing detailed invariant information between live and spoofed faces (e.g., color textures, moiré patterns, and noise artifacts). On the other hand, thanks to stacked convolutions with nonlinear activations, convolutional neural networks (CNNs) have strong representation ability to distinguish real faces from PAs. However, CNN-based methods focus on deeper semantic features, which are weak at describing detailed fine-grained information between live and spoofed faces and easily become ineffective when the environment changes (e.g., different lighting). How to combine local descriptors with convolution operations to achieve robust feature representation is a worthwhile research problem.

Recent deep-learning-based FAS methods are usually built on backbones designed for image classification tasks [61, 62, 20], such as VGG [54], ResNet [22] and DenseNet [23]. These networks are usually supervised by a binary cross-entropy loss, which easily learns unimportant cues, such as screen borders, instead of the nature of the spoofing pattern. To address this issue, several deeply supervised FAS methods [4, 36] have been developed that utilize pseudo depth map labels as auxiliary supervision. Accordingly, automatically discovering the network best suited to the FAS task under such auxiliary deep supervision should be considered.

Most existing state-of-the-art FAS methods [36, 56, 62, 32] require multiple frames as input to extract dynamic spatio-temporal features (e.g., motion [36, 56] and rPPG [62, 32]) for PAD. However, long video sequences may not be suitable for deployment scenarios where decisions need to be made quickly. Therefore, despite somewhat inferior performance compared to video-level methods, frame-level PAD methods are advantageous from a usability perspective. Designing high-performance frame-level methods is crucial for practical FAS applications.

Based on the above discussion, we propose a new convolution operator called center-difference convolution (CDC), which is good at describing fine-grained invariant information. As shown in Figure 1, CDC is more likely than vanilla convolutions to extract intrinsic deceiving patterns (e.g., lattice artifacts) in different environments. Furthermore, on a specially designed CDC search space, Neural Architecture Search (NAS) is utilized to discover excellent frame-level networks for deeply supervised face anti-spoofing tasks. Our contributions include:

  1. We design a new convolution operator, called center-difference convolution (CDC), suitable for the FAS task due to its remarkable representation ability on invariant fine-grained features in different environments. Without introducing any additional parameters, CDC can replace common ordinary convolutions in existing neural networks and plug-and-play to form a central difference convolutional network (CDCN) with more robust modeling capabilities.
  2. We propose CDCN++, an extended version of CDCN, consisting of a searched backbone network and a multi-scale attention fusion module (MAFM), which can efficiently aggregate multi-level CDC features.
  3. To our knowledge, this is the first method to search for the neural architecture of the FAS task. Unlike previous NAS classification tasks supervised by softmax loss, we search for well-suited frame-level networks for deeply supervised FAS tasks on a specially designed CDC search space.
  4. Our proposed method achieves state-of-the-art performance on all 6 benchmark datasets with intra- and cross-dataset tests.

2. Related work

Face Anti-Spoofing

Traditional face anti-spoofing methods usually extract hand-crafted features from facial images to capture spoofing patterns. Some classic local descriptors, such as LBP [7, 15], SIFT [44], SURF [9], HOG [29] and DoG [45], are used to extract frame-level features, while video-level methods usually capture dynamic cues, such as dynamic texture [28], micro-motion [53] and eye blinking [41]. Recently, several frame-level and video-level deep learning methods for face anti-spoofing have been proposed. Among them, auxiliary deeply supervised FAS methods [4, 36] are introduced to efficiently learn more detailed information. On the other hand, several video-level CNN methods are proposed to exploit dynamic spatio-temporal [56, 62, 33] or rPPG [31, 36, 32] features for PAD. Despite achieving state-of-the-art performance, video-level deep learning methods require long sequences as input. Furthermore, compared with traditional descriptors, CNNs are prone to overfitting, making it difficult to generalize well to unseen scenes.

convolution operator

Convolution operators are a common way of extracting basic visual features in deep learning frameworks. Recently, extensions to the common convolution operator have been proposed. In one direction, classical local descriptors (e.g., LBP [2] and Gabor filters [25]) are considered in the convolution design. Representative works include local binary convolution [27] and Gabor convolution [38], which are proposed to save computational cost and improve the resistance to spatial variation, respectively. Another direction is to modify the spatial extent of aggregation. Two related works are dilated convolution [63] and deformable convolution [14]. However, these convolution operators may not be suitable for the FAS task due to their limited ability to represent invariant fine-grained features.

Neural Architecture Search

Our work is motivated by recent NAS work [11, 17, 35, 47, 68, 69, 60], while we focus on searching for a high-performance deeply supervised model for the face anti-spoofing task rather than a binary classification model. Existing NAS methods mainly fall into three categories: 1) reinforcement-learning-based [68, 69], 2) evolutionary-algorithm-based [51, 52], and 3) gradient-based [35, 60, 12]. Most NAS methods search the network on a small proxy task and transfer the found architecture to another large target task. From the perspective of computer vision applications, NAS has been developed for face recognition [67], action recognition [46], person ReID [50], object detection [21] and segmentation [65]. To the best of our knowledge, there is currently no NAS-based method for the face anti-spoofing task. To fill this gap, we search for a deeply supervised FAS network over a search space built on a specially designed convolution operator.

3. Methodology

In this section, we first introduce central difference convolution in Section 3.1, then introduce the Central Difference Convolutional Network (CDCN) for face anti-spoofing in Section 3.2, and finally introduce the searched network with attention mechanism (CDCN++) in Section 3.3.

3.1 Central difference convolution

In modern deep learning frameworks, feature maps and convolutions are represented in three-dimensional shapes (two-dimensional spatial domain and additional channel dimension). Since the convolution operation remains the same across the channel dimension, for simplicity, in this subsection the convolution is described in 2D, while the extension to 3D is straightforward.

basic convolution

Since 2D spatial convolution is a fundamental operation in CNNs for vision tasks, here we denote it as ordinary convolution and review it first. Two-dimensional convolution has two main steps:

1) Sample the local receptive domain area R on the input feature map x;
2) Aggregate the sampled values ​​by weighted summation. Therefore, the output feature map y can be expressed as
$$y(p_0)=\sum_{p_n\in R}w(p_n)\,x(p_0+p_n)\ \ \ \ \ \ \ (1)$$

where $p_0$ denotes the current position on the input and output feature maps, and $p_n$ enumerates the positions in $R$. For example, the local receptive field region $R$ of a 3×3 kernel is $R=\{(-1,-1),(-1,0),\dots,(0,1),(1,1)\}$.
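To make Eq. (1) concrete, the following sketch (our own illustration, not from the paper's code) evaluates a 3×3 vanilla convolution at a single position by explicitly enumerating the offsets in R, and checks the result against F.conv2d:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 5, 5)          # single-channel input feature map
w = torch.randn(1, 1, 3, 3)          # 3x3 kernel

# local receptive field R of a 3x3 kernel, relative to the center p0
R = [(-1, -1), (-1, 0), (-1, 1),
     ( 0, -1), ( 0, 0), ( 0, 1),
     ( 1, -1), ( 1, 0), ( 1, 1)]

p0 = (2, 2)                          # current position on the feature map
y_manual = sum(w[0, 0, dy + 1, dx + 1] * x[0, 0, p0[0] + dy, p0[1] + dx]
               for dy, dx in R)      # Eq. (1): weighted summation over R

y_conv = F.conv2d(x, w)              # valid convolution, output is 3x3
print(torch.allclose(y_manual, y_conv[0, 0, 1, 1]))  # True: same value at p0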

Basic convolution combined with central difference operation

Inspired by the well-known Local Binary Pattern (LBP) [7], which describes local relations in a binary centered difference, we also introduce a centered difference operation in the basic convolution to enhance its representation and generalization capabilities.
Similarly, central difference convolution also consists of two steps, sampling and aggregation. The sampling step is the same as in basic convolution, while the aggregation step differs: as shown in Figure 2, central difference convolution tends to aggregate the center-oriented gradients of the sampled values. Equation (1) becomes:
$$y(p_0)=\sum_{p_n\in R}w(p_n)\,\big(x(p_0+p_n)-x(p_0)\big)\ \ \ \ \ \ \ (2)$$
When $p_n=(0,0)$, i.e., at the center position $p_0$ itself, the gradient value is always zero.
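Analogously, Eq. (2) can be evaluated at a single position by replacing each sampled value x(p0+pn) with its difference to the center value x(p0); a small illustrative sketch (our own, not from the paper's code):

import torch

torch.manual_seed(0)
x = torch.randn(1, 1, 5, 5)
w = torch.randn(1, 1, 3, 3)

R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
p0 = (2, 2)

# Eq. (2): aggregate center-oriented gradients w(pn) * (x(p0+pn) - x(p0))
y_cd = sum(w[0, 0, dy + 1, dx + 1] *
           (x[0, 0, p0[0] + dy, p0[1] + dx] - x[0, 0, p0[0], p0[1]])
           for dy, dx in R)
# note: for pn = (0, 0) the difference x(p0) - x(p0) is zero, so the center
# weight does not contribute to the central difference term
print(y_cd)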

[Figure 2: illustration of central difference convolution]

Central difference
In convolutional neural networks, a central difference operation can be implemented by introducing a difference kernel alongside the convolution kernel: a kernel that compares the central pixel with its neighboring pixels, whose response is combined with that of the original convolution kernel so as to better capture local changes in pixel intensity.
Taking the first-order difference as an example, suppose the convolution kernel is of size $3\times 3$; the difference kernel is then also $3\times 3$, with a weight of $0$ for the central pixel and weights of $-1$ or $1$ for the neighboring pixels, for example:
$$\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$$
When performing the convolution, the difference kernel and the convolution kernel can be combined in a certain way before convolving with the image. Concretely, if the convolution kernel weight matrix is $W$, the output of the convolution can be written as:
$$y(p_0)=\sum_{p_n\in R}\Big(\sum_{i,j}W(i,j)\,K(p_n+i,p_n+j)\Big)\,x(p_0+p_n)$$
where $K$ denotes the difference kernel, $W(i,j)$ denotes the weight of the convolution kernel, $x(p_0+p_n)$ denotes the pixel value at coordinate $p_0+p_n$ in the input image, and $y(p_0)$ denotes the output of the convolution.

For the face anti-spoofing task, both the intensity-level semantic information and the gradient-level detail information are crucial for distinguishing real faces from spoofed ones, which suggests that combining ordinary convolution with central difference convolution may be a feasible way to provide stronger modeling capability. Therefore, we generalize central difference convolution as:
$$y(p_0)=\theta\sum_{p_n\in R}w(p_n)\,\big(x(p_0+p_n)-x(p_0)\big)+(1-\theta)\sum_{p_n\in R}w(p_n)\,x(p_0+p_n)\ \ \ \ \ \ \ (3)$$

The first term is the central difference convolution and the second term is the basic convolution; they capture gradient-level detail information and intensity-level semantic information, respectively.

Here, the hyperparameter θ ∈ [0, 1] trades off the contributions of intensity-level and gradient-level information: the higher θ is, the more important the central difference gradient information. Hereafter, we refer to this generalized central difference convolution simply as CDC, which should be clear from context.

Implementation of central difference convolution

To efficiently implement CDC in modern deep learning frameworks, we decompose and merge Eq. (3) into a vanilla convolution with an additional central difference term:
$$y(p_0)=\sum_{p_n\in R}w(p_n)\,x(p_0+p_n)+\theta\Big(-x(p_0)\sum_{p_n\in R}w(p_n)\Big)\ \ \ \ \ \ \ (4)$$

According to Eq. (4), CDC can be easily implemented with a few lines of code in PyTorch [42] or TensorFlow [1]. The derivation of Eq. (4) and the corresponding PyTorch code are as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC(nn.Module):
    def __init__(self, IC, OC, K=3, P=1, theta=0.7):
        # IC, OC: in channels, out channels
        # K, P: kernel size, padding
        # theta: hyperparameter in CDC
        super(CDC, self).__init__()
        self.vani = nn.Conv2d(IC, OC, kernel_size=K, padding=P)
        self.theta = theta

    def forward(self, x):
        # x: input features with shape [N, C, H, W]
        out_vanilla = self.vani(x)
        # sum the vanilla kernel over its spatial dims to build the difference kernel
        kernel_diff = self.vani.weight.sum(2).sum(2)
        kernel_diff = kernel_diff[:, :, None, None]
        out_CD = F.conv2d(input=x, weight=kernel_diff, padding=0)
        return out_vanilla - self.theta * out_CD

This code defines a central difference convolution (CDC) PyTorch model class. It mainly includes the following components:

init method: defines the initialization function of the CDC class, and the input parameters include the number of input channels IC, the number of output channels OC, the size of convolution kernel K, padding P, CDC hyperparameter theta, etc. In this function, first call the initialization function of nn.Module, then define a normal convolution vani, and save the hyperparameter theta.

forward method: defines the forward propagation of the CDC module. The input x is a feature tensor of shape [N, C, H, W], where N is the batch size, C the number of input channels, and H and W the height and width of the feature map. During forward propagation, the input is first passed through the ordinary convolution self.vani(x) to obtain out_vanilla; then the difference kernel kernel_diff is computed from self.vani.weight and applied to the input x to obtain the central difference term out_CD. Finally, theta * out_CD is subtracted from out_vanilla to give the output.

Here, F.conv2d is PyTorch's functional convolution, self.vani.weight is the weight of the ordinary convolution inside the CDC module, and None inserts a new dimension at that position. super(CDC, self).__init__() calls the constructor of the parent class nn.Module, and nn.Conv2d defines the ordinary convolution operation.

self.vani.weight is the kernel of the ordinary convolution in the CDC module, with shape [OC, IC, K, K]. sum(2).sum(2) sums over the third and fourth dimensions (i.e., the height and width of the kernel), giving a tensor of shape [OC, IC] whose entries are the spatial sums of each kernel, i.e., the $\sum_{p_n\in R}w(p_n)$ term in Eq. (4).

Next, kernel_diff[:, :, None, None] adds two new dimensions, turning kernel_diff into a tensor of shape [OC, IC, 1, 1], which can then be used as a 1×1 convolution kernel applied to the input x to compute the central difference term.
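As a sanity check (our own sketch, not from the released code), the decomposition from Eq. (3) to Eq. (4) can be verified numerically: a naive loop over the nine offsets pn gives the same result as the efficient form, i.e., one vanilla convolution minus theta times a 1×1 convolution with the spatially summed kernel:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
theta = 0.7
x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)

# naive Eq. (3): loop over the 3x3 offsets pn, accumulating
# w(pn) * (x(p0+pn) - x(p0)) and w(pn) * x(p0+pn) separately
xp = F.pad(x, (1, 1, 1, 1))                                 # zero padding so p0+pn is always valid
cd_term = torch.zeros(2, 4, 8, 8)
vanilla_term = torch.zeros(2, 4, 8, 8)
for dy in (-1, 0, 1):
    for dx in (-1, 0, 1):
        shifted = xp[:, :, 1 + dy:9 + dy, 1 + dx:9 + dx]    # x(p0+pn)
        wk = w[:, :, dy + 1, dx + 1]                        # w(pn), shape [4, 3]
        vanilla_term += torch.einsum('oc,nchw->nohw', wk, shifted)
        cd_term += torch.einsum('oc,nchw->nohw', wk, shifted - x)
y_naive = theta * cd_term + (1 - theta) * vanilla_term

# efficient Eq. (4): one vanilla conv minus theta * (1x1 conv with the summed kernel)
kernel_diff = w.sum(dim=(2, 3))[:, :, None, None]
y_fast = F.conv2d(x, w, padding=1) - theta * F.conv2d(x, kernel_diff)
print(torch.allclose(y_naive, y_fast, atol=1e-5))           # True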

relationship to previous work

This paper discusses the relationship between CDC and basic convolution, local binary convolution [27] and Gabor convolution [38], which have similar design ideas but different emphases. Ablation studies in Section 4.3 show the superior performance of CDC on face anti-spoofing tasks.

Relationship to basic convolution. CDC is more general. It can be seen from Equation (3) that basic convolution is a special case of CDC when θ = 0, i.e. when aggregating local intensity information without gradient information.

Relation to Gabor convolution [38]. Gabor convolution (GaborConv) focuses on enhancing the representation ability of spatial transformations (i.e., orientation and scale changes), while CDC focuses more on representing fine-grained robust features in different environments.

3.2 CDCN

Deeply supervised face anti-spoofing methods [36, 4] exploit the 3D-shape-based discrimination between spoofed and live faces, providing pixel-wise details for FAS models to capture spoofing cues. On this basis, this paper adopts a similar deeply supervised network, DepthNet [36], as the baseline. In order to extract more fine-grained and robust features for estimating the facial depth map, CDC is introduced to form the Central Difference Convolutional Network (CDCN). Note that DepthNet is a special case of the proposed CDCN when θ = 0 is used for all CDC operators.

The details of CDCN are shown in Table 1. Given a single RGB facial image of size 3×256×256, multi-level (low, medium, high-level) fusion features are extracted to predict grayscale facial depth of size 32×32. We use θ = 0.7 as the default setting, and an ablation study on θ will be shown in Section 4.3.

Table 1. Architecture of DepthNet and CDCN. Inside the square brackets are the filter size and feature dimensions. All convolutional layers have stride 1 and are followed by a BN-ReLU layer, while max pooling layers have stride 2.

For the loss function, the mean squared error loss $L_{MSE}$ is used for pixel-wise supervision. In addition, to meet the fine-grained supervision needs of the FAS task, the contrastive depth loss $L_{CDL}$ [56] is considered to help the network learn more detailed features. Therefore, the overall loss can be expressed as $L_{overall} = L_{MSE} + L_{CDL}$.

$L_{CDL}$
Here $L_{CDL}$ denotes the contrastive depth loss introduced in [56]. Instead of only penalizing the absolute error of each predicted depth value (which $L_{MSE}$ already does), it supervises the relative depth relations between each pixel and its neighbors, so that the network learns fine-grained local depth structure.

Concretely, a set of eight "contrastive" 3×3 kernels is used, each responding with the difference between one of the eight neighbors and the center pixel. These kernels are applied to both the predicted depth map and the ground-truth depth map, and the loss penalizes the squared error between the two responses, roughly of the form
$$L_{CDL}=\sum_{i=1}^{8}\big\|K_i^{contrast}\circledast D_{pred}-K_i^{contrast}\circledast D_{gt}\big\|_2^2,$$
where $K_i^{contrast}$ is the $i$-th contrastive kernel, $\circledast$ denotes convolution, and $D_{pred}$ and $D_{gt}$ are the predicted and ground-truth depth maps.
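A minimal sketch of this loss (our own reconstruction of the description above; the exact kernel layout and normalization in the authors' released code may differ):

import torch
import torch.nn.functional as F

def contrastive_depth_loss(pred, gt):
    """Penalize differences in the eight neighbor-minus-center contrasts.
    pred, gt: predicted / ground-truth depth maps of shape [N, 32, 32]."""
    # eight 3x3 contrastive kernels: +1 at one neighbor, -1 at the center
    kernels = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            k = torch.zeros(3, 3)
            k[1 + dy, 1 + dx] = 1.0
            k[1, 1] = -1.0
            kernels.append(k)
    weight = torch.stack(kernels).unsqueeze(1)               # [8, 1, 3, 3]

    pred_c = F.conv2d(pred.unsqueeze(1), weight, padding=1)  # [N, 8, 32, 32]
    gt_c = F.conv2d(gt.unsqueeze(1), weight, padding=1)
    return F.mse_loss(pred_c, gt_c)

# overall loss used to supervise CDCN: L_overall = L_MSE + L_CDL
def overall_loss(pred, gt):
    return F.mse_loss(pred, gt) + contrastive_depth_loss(pred, gt)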

3.3 CDCN++

From Table 1, we can see that CDCN has a rather coarse architectural design (e.g., simply repeating the same block structure at different levels), which may be sub-optimal for the face anti-spoofing task. Inspired by the classic visual object understanding model [40], we propose an extended version, CDCN++ (see Fig. 5), which consists of a NAS-searched backbone network and a Multiscale Attention Fusion Module (MAFM) with selective attention capability.

Backbone Network Search for Face Anti-Spoofing Tasks

Our search algorithm is based on two gradient-based NAS methods [35, 60], and more technical details can be referred to the original papers. Here we mainly introduce new contributions on the backbone of FAS task search.

NAS is the abbreviation of Neural Architecture Search, and the Chinese translation is neural network architecture search. It is an automated deep learning method that improves the performance of deep learning models by searching for optimal neural network architectures. In NAS, methods such as heuristic algorithms, evolutionary algorithms, and reinforcement learning are used to explore a large number of neural network structure spaces, and to select the best performing network structure. Through NAS, the design and optimization of deep learning models can be accelerated to make them more suitable for specific tasks and data sets.

Figure 3. Architecture search space using CDC. (a) A network consisting of three stacked cells, where stem and head layers employ CDC with 3x3 kernels and θ = 0.7. (b) A cell contains 6 nodes, including an input node, four intermediate nodes B1, B2, B3, B4 and an output node.
(c) Edges between two nodes represent possible operations. The operation space includes eight candidates, where CDC_2_r means first using two stacked CDCs to increase the number of channels by a ratio r, and then drop back to the original channel size. The size of the total search space is 3 × 8 × 10 = 240.


As shown in Figure 3(a), the goal is to search for the cells at three levels (low, mid and high) that form the network backbone for the FAS task. Inspired by the dedicated hierarchically organized neurons in the human visual system [40], we prefer to search these multi-level cells freely (i.e., cells with different structures), which is more flexible and general. We name this configuration "varied cells" and study its impact in Section 4.3 (see Table 2). Unlike previous works [35, 60], we only use the output of the immediately preceding cell as the input of the current cell.

For the cell-level structure, Figure 3(b) shows that each cell is represented as a directed acyclic graph (DAG) of $N$ nodes $\{x\}_{i=0}^{N-1}$, where each node represents a network layer. We denote the operation space as $O$, and Figure 3(c) shows the 8 designed candidate operations (none, skip connection and CDC blocks). Each edge $(i,j)$ can be represented by a function $\tilde{o}^{(i,j)}$, where $\tilde{o}^{(i,j)}(x_i)=\sum_{o\in O}\eta_o^{(i,j)}\,o(x_i)$. The softmax function is used to map the architecture parameters $\alpha^{(i,j)}$ to the operation weights, i.e., $\eta_o^{(i,j)}=\frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'\in O}\exp(\alpha_{o'}^{(i,j)})}$. An intermediate node can be expressed as $x_j=\sum_{i<j}\tilde{o}^{(i,j)}(x_i)$, while the output node $x_{N-1}$ is the weighted sum of all intermediate nodes, i.e., $x_{N-1}=\sum_{0<i<N-1}\beta_i\,x_i$. Here we propose a node attention strategy to learn the importance weights $\beta$ of the intermediate nodes, namely $\beta_i=\frac{\exp(\beta_i')}{\sum_{0<j<N-1}\exp(\beta_j')}$, where $\beta_i'$ is the raw learnable weight of intermediate node $x_i$.
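The following is a small illustrative sketch (our own, not the authors' search code) of how one edge mixes candidate operations with softmax-normalized architecture parameters, and how the output node aggregates the intermediate nodes with the proposed node attention; the candidate operation list here is a placeholder for the real none/skip/CDC blocks:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One searchable edge (i, j): o~(x) = sum_o eta_o * o(x), with eta = softmax(alpha)."""
    def __init__(self, channels, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList([build(channels) for build in candidate_ops])
        # architecture parameters alpha^(i,j), one per candidate operation
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        eta = F.softmax(self.alpha, dim=0)      # operation weights eta_o^(i,j)
        return sum(e * op(x) for e, op in zip(eta, self.ops))

# placeholder candidates; the paper's space uses none/skip/CDC blocks (e.g. CDC_2_r)
candidate_ops = [
    lambda c: nn.Identity(),                    # skip connection
    lambda c: nn.Conv2d(c, c, 3, padding=1),    # stand-in for a CDC block
]

edge = MixedOp(channels=8, candidate_ops=candidate_ops)
x_i = torch.randn(1, 8, 16, 16)
mixed = edge(x_i)                               # o~(i,j)(x_i)

# node attention: the output node is a softmax-weighted sum of intermediate nodes
intermediates = [torch.randn(1, 8, 16, 16) for _ in range(4)]   # B1..B4
beta_raw = torch.zeros(4, requires_grad=True)                   # learnable beta'_i
beta = F.softmax(beta_raw, dim=0)
output_node = sum(b * t for b, t in zip(beta, intermediates))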

In the search phase, $L_{train}$ and $L_{val}$ denote the training loss and the validation loss, respectively, both of which are based on the deeply supervised loss $L_{overall}$ described in Section 3.2. The network parameters $w$ and the architecture parameters $\alpha$ are learned through the following bilevel optimization problem:
$$\min_{\alpha}\ L_{val}(w^*(\alpha),\alpha),\qquad \text{s.t.}\ \ w^*(\alpha)=\arg\min_{w}\,L_{train}(w,\alpha)\ \ \ \ \ \ \ (5)$$

This formulation describes a bilevel optimization problem. The outer problem minimizes the validation loss $L_{val}(w^*(\alpha),\alpha)$, where $w^*(\alpha)$ denotes the network parameters obtained by solving the inner problem, i.e., the parameters found by minimizing the training loss, $w^*(\alpha)=\arg\min_{w}L_{train}(w,\alpha)$. The inner problem is usually solved with optimization algorithms such as gradient descent. Thus the network parameters $w$ are learned by solving the inner problem and are then used in the outer problem to minimize the validation loss $L_{val}$. This kind of optimization is often described as "learning how to learn".
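Gradient-based NAS methods such as [35, 60] approximate this bilevel problem by alternating single gradient steps on the two sets of parameters. The toy sketch below (our own; a plain MSE loss stands in for L_overall and the two-operation model is purely illustrative) shows the alternation:

import torch
import torch.nn as nn
import torch.nn.functional as F

# toy model with both network weights w and architecture parameters alpha
class TinySearchNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, 3, padding=1)   # "network weights" w
        self.alpha = nn.Parameter(torch.zeros(2))   # "architecture parameters" alpha

    def forward(self, x):
        eta = F.softmax(self.alpha, dim=0)
        y = self.conv(x)
        return eta[0] * y + eta[1] * F.avg_pool2d(y, 3, 1, 1)

model = TinySearchNet()
w_opt = torch.optim.Adam([model.conv.weight, model.conv.bias], lr=1e-4, weight_decay=5e-5)
a_opt = torch.optim.Adam([model.alpha], lr=6e-4, weight_decay=1e-3)

x_tr, d_tr = torch.randn(4, 3, 32, 32), torch.rand(4, 1, 32, 32)
x_val, d_val = torch.randn(4, 3, 32, 32), torch.rand(4, 1, 32, 32)

for step in range(10):
    # outer step (upper level of Eq. 5): update alpha on validation data
    a_opt.zero_grad()
    F.mse_loss(model(x_val), d_val).backward()
    a_opt.step()
    # inner step (lower level of Eq. 5, one-step approximation): update w on training data
    w_opt.zero_grad()
    F.mse_loss(model(x_tr), d_tr).backward()
    w_opt.step()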

After convergence, the final discrete architecture is derived as follows: 1) set $o^{(i,j)}=\arg\max_{o\in O,\,o\neq none}\,p_o^{(i,j)}$; 2) for each intermediate node, choose one incoming edge with the largest $\max_{o\in O,\,o\neq none}\,p_o^{(i,j)}$; and 3) for each output node, choose the intermediate node with the largest $\max_{0<i<N-1}\beta_i$ (denoted as "node attention") as its input. A more intuitive alternative is to choose the last intermediate node as the output node; we compare these two settings in Section 4.3 (see Table 2).

MAFM

Although simply fusing low-mid-high level features can improve the performance of the searched CDC architecture, it is still hard for the network to focus on the important regions, which is not conducive to learning discriminative features. Inspired by selective attention in the human visual system [40, 55], neurons at different levels may have various attention stimuli within their receptive fields. Therefore, we propose a Multiscale Attention Fusion Module (MAFM), which refines and fuses low-mid-high level CDC features via spatial attention.
As shown in Fig. 4, features $F$ from different levels are refined by spatial attention [58] with attention kernel sizes related to their receptive fields (i.e., in our case the high/semantic level uses a small attention kernel while the low level uses a large one) and are then concatenated together. The refined feature $F'$ can be expressed as:
$$F_i'=F_i\odot\big(\sigma(C_i([A(F_i),M(F_i)]))\big),\quad i\in\{low,\,mid,\,high\}\ \ \ \ \ \ \ (6)$$
where $\odot$ denotes the Hadamard product, $A$ and $M$ denote the average pooling and max pooling layers respectively, $\sigma$ denotes the sigmoid function, and $C_i$ denotes a convolutional layer. A 7×7 vanilla convolution kernel is used for $C_{low}$, a 5×5 kernel for $C_{mid}$, and a 3×3 kernel for $C_{high}$. CDC is not used here because of its limited capacity for global semantic cognition, which is important for spatial attention. The corresponding ablation study is given in Section 4.3.

4. Experiment

In this section, we conduct extensive experiments to demonstrate the effectiveness of our method. Next, we describe the datasets and metrics used (Section 4.1), implementation details (Section 4.2), results (Sections 4.3–4.5) and analysis (Section 4.6), in order.

4.1 Datasets and metrics

database

Our experiments use six databases, namely OULU-NPU [10], SiW [36], CASIA-MFSD [66], Replay-Attack [13], MSU-MFSD [57] and SiW-M [37] . OULU-NPU and SiW are high-resolution databases containing four and three protocols, respectively, to verify the generalization of the models (e.g., unseen lighting and attack vectors), for internal testing. CASIA-MFSD, Replay-Attack, and MSU-MFSD are databases containing low-resolution videos for cross-testing. SiW-M is designed to conduct cross-type testing to verify the robustness against unknown attacks, which contains up to 13 attack types.

Performance

On the OULU-NPU and SiW datasets, we follow their original protocols and metrics, namely the Attack Presentation Classification Error Rate (APCER), the Bona Fide Presentation Classification Error Rate (BPCER) and the ACER [24], for a fair comparison. The Half Total Error Rate (HTER) is adopted for the cross-dataset testing between CASIA-MFSD and Replay-Attack. The Area Under the Curve (AUC) is used for the intra-database cross-type testing on CASIA-MFSD, Replay-Attack and MSU-MFSD. For the cross-type testing on SiW-M, APCER, BPCER, ACER and the Equal Error Rate (EER) are used.
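For reference, the sketch below (our own helper functions, not part of the released code) shows how these metrics relate to the raw error counts:

def apcer(false_accepts, num_attacks):
    """Attack Presentation Classification Error Rate:
    fraction of attack (spoof) samples wrongly classified as bona fide."""
    return false_accepts / num_attacks

def bpcer(false_rejects, num_bonafide):
    """Bona Fide Presentation Classification Error Rate:
    fraction of bona fide (live) samples wrongly classified as attacks."""
    return false_rejects / num_bonafide

def acer(apcer_value, bpcer_value):
    """Average Classification Error Rate: mean of APCER and BPCER."""
    return (apcer_value + bpcer_value) / 2

def hter(far, frr):
    """Half Total Error Rate: mean of the False Acceptance Rate and the
    False Rejection Rate at a threshold fixed on the development set."""
    return (far + frr) / 2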

4.2 Implementation Details

Depth map generation

The dense face alignment PRNet [18] is employed to estimate the 3D shape of live faces and generate facial depth maps of size 32×32. More details and examples can be found in [56]. To distinguish real faces from spoofed ones, during the training phase, we normalize the real depth maps to the range [0,1], while set the spoofed ones to 0, which is similar to [36].
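A minimal sketch of how these supervision targets could be assembled (our own illustration; prnet_depth stands for the PRNet-based depth estimator of [18] and is a placeholder, and resizing to 32×32 is assumed to happen inside or after it):

import numpy as np

def make_depth_label(face_img, is_live, prnet_depth, size=32):
    """Build the 32x32 depth supervision target for one face crop.
    prnet_depth: placeholder callable returning a dense depth map for a live face."""
    if is_live:
        depth = prnet_depth(face_img).astype(np.float32)     # 3D-shape-based depth
        depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # normalize to [0, 1]
        return depth
    # spoof faces are supervised with an all-zero depth map
    return np.zeros((size, size), dtype=np.float32)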

Training and Test Setup

Our proposed method is implemented using Pytorch. In the training phase, the model is trained using the Adam optimizer with an initial learning rate (lr) and weight decay (wd) of 1e-4 and 5e-5, respectively. We train for a maximum of 1300 epochs, halving the learning rate every 500 epochs. The batch size is 56 and eight 1080Ti GPUs are used. In the testing phase, we compute the average of the predicted depth maps as the final score.
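A schematic sketch of this setup (our own code; train_loader, test_images and overall_loss are hypothetical names, and CDCN is defined in the Code section below):

import torch

model = CDCN()                                    # or CDCNpp(); see the Code section below
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)  # halve lr every 500 epochs

for epoch in range(1300):
    for images, depth_gt in train_loader:         # hypothetical DataLoader of (face crop, 32x32 depth label)
        map_x, *_ = model(images)
        loss = overall_loss(map_x, depth_gt)      # L_overall = L_MSE + L_CDL
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()

# testing: the mean value of the predicted depth map is used as the liveness score
model.eval()
with torch.no_grad():
    map_x, *_ = model(test_images)                # test_images: a batch of test frames
    scores = map_x.mean(dim=(1, 2))               # one score per frame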

search settings

Similar to [60], partial channel connections and edge normalization are adopted. The initial number of channels of the network is {32, 64, 128, 128, 128, 64, 1}, which is doubled after the search (see Figure 3(a)). Adam optimizer with lr=1e-4 and wd=5e-5 is used when training model weights. The architecture parameters are trained using the Adam optimizer with lr=6e-4 and wd=1e-3. On Protocol-1 of OULU-NPU, a batch size of 12 is used to search for 60 epochs, and the first 10 epochs do not update the architecture parameters. The whole process took a day on three 1080Tis.

4.3 Ablation experiment

In this subsection, we conduct all ablation experiments for the proposed CDC, CDCN and CDCN++ on Protocol-1 of OULU-NPU [10], in which the training and test sets differ in lighting conditions and location, to explore their details.

Impact of θ in CDCN

According to Eq. (3), θ controls the contribution of the gradient-based details: the higher θ is, the more local detail information is included. As shown in Fig. 6(a), CDC always achieves better performance than ordinary convolution (θ = 0, ACER = 3.8%) when θ > 0.3, which indicates that the central difference based fine-grained information is helpful for the FAS task. Since the best performance (ACER = 1.0%) is obtained with θ = 0.7, we use this setting in the following experiments. Besides keeping θ constant across all layers, we also explore an adaptive CDC method that learns θ for each layer, as shown in Appendix B.

Figure 6. (a) Impact of θ in CDCN. (b) Comparison among different convolutions (only the best-performing hyperparameters are shown). Lower ACER means better performance.

CDC vs other convolutions

The relation between CDC and prior convolutions is discussed in Section 3.1. We argue that CDC is more suitable for the face anti-spoofing task because the detailed spoofing artifacts in diverse environments should be represented by gradient-based invariant features. Figure 6(b) shows that CDC outperforms the other convolutions by a margin of more than 2% ACER. It is interesting that LBConv outperforms vanilla convolution, suggesting that local gradient information is important for the FAS task. GaborConv performs the worst because it is designed to capture spatially invariant features, which is not helpful for face anti-spoofing.

Impact of NAS configuration

Table 2 presents ablation studies of the two NAS configurations described in Section 3.3, i.e., varied cells and node attention. Both configurations improve the searched performance over the baseline setting (shared cells and the last intermediate node as output node). There are two reasons: 1) with a more flexible search constraint, NAS is able to find dedicated cells at different levels, which is closer to the human visual system [40]; 2) taking the last intermediate node as the output may not be optimal, while choosing the most important one is more reasonable.

Effectiveness of NAS-based Backbone and MAFM

The proposed CDCN++ consists of a NAS-based backbone network and MAFM, as shown in Fig. 5. Clearly, the cells at the three levels are quite different, and the cell at the mid level is deeper (with four CDC layers). Table 3 shows the ablation studies on the NAS-based backbone and MAFM. As can be seen from the first two rows, the NAS-based backbone with direct multi-level fusion outperforms the backbone without NAS by 0.3% ACER, indicating the effectiveness of the searched architecture. Meanwhile, the backbone with MAFM achieves 0.5% lower ACER than the backbone with direct multi-level fusion, which shows the effectiveness of MAFM. We also analyze the convolution type and kernel size in MAFM, and find that vanilla convolution is more suitable for capturing the semantic spatial attention. Furthermore, the attention kernel size should be large (7×7) for low-level features and small (3×3) for high-level features.

4.4 Internal testing

Internal tests are performed on both OULU-NPU and SiW datasets. We strictly follow the four protocols of OULU-NPU and the three protocols of SiW for evaluation. All comparison methods including STASN [62] are trained without additional datasets for fair comparison.

Results on OULU-NPU

In Table 4, our proposed CDCN++ ranks first on all four protocols (0.2%, 1.3%, 1.8% and 5.0% ACER, respectively), which shows that it generalizes well to variations in external environment, attack mediums and input cameras. Unlike other state-of-the-art methods that extract multi-frame dynamic features (Auxiliary [36], STASN [62], GRADIANT [6] and FAS-TD [56]), our method only requires frame-level input, which is well suited to practical deployment. It is worth noting that although the NAS backbone is searched on Protocol-1, it transfers to all protocols with good generalization.

Results on SiW

Table 5 compares the performance of our method with three state-of-the-art methods, Auxiliary [36], STASN [62] and FAS-TD [56], on the SiW dataset. As can be seen from Table 5, our method performs the best on all three protocols, revealing the excellent generalization ability of CDC with respect to (1) variations of facial pose and expression, (2) variations of different spoofing mediums, and (3) cross/unknown presentation attacks.

4.5 Cross-testing

To further demonstrate the generalization ability of our model, we conduct cross-type and cross-dataset tests respectively to verify its generalization ability to unknown representation attacks and unseen environments.

cross-type testing

Following the protocol proposed in [3], we use CASIA-MFSD [66], Replay-Attack [13] and MSU-MFSD [57] to perform intra-dataset cross-type testing between replay and print attacks. As shown in Table 6, our proposed CDC-based method achieves the best overall performance (even outperforming the zero-shot learning based method DTN [37]), indicating consistently good generalization to unknown attacks. In addition, we also conduct cross-type testing on the latest SiW-M dataset [37] and achieve the best average ACER (12.7%) and EER (11.9%) among 13 attacks. Detailed results are given in Appendix C.

Test across datasets


In this experiment, there are two cross-dataset testing protocols. One is training on CASIA-MFSD and testing on Replay-Attack, named protocol CR; the other swaps the training and test datasets, named protocol RC. As shown in Table 7, our proposed CDCN++ achieves 6.5% HTER on protocol CR, outperforming the previous state of the art by a large margin of 11%. For protocol RC, we also outperform the state-of-the-art frame-level methods (see the second half of Table 7). Performance may be further improved by introducing temporal dynamic features similar to Auxiliary [36] and FAS-TD [56].

4.6 Analysis and visualization

In this subsection, we provide two perspectives to demonstrate the analysis of why CDC performs well.

Robustness to domain drift.


Protocol-1 of OULU-NPU is used to verify the robustness of CDC under domain shift (i.e., large illumination differences between the training/development sets and the test set). Figure 7 shows that the network with ordinary convolution has a low ACER on the development set (blue curve) but a high ACER on the test set (gray curve), which indicates that ordinary convolution easily overfits to the seen domain but generalizes poorly when the lighting changes. In contrast, the model with CDC achieves more consistent performance on the development set (red curve) and the test set (yellow curve), indicating that CDC is robust to domain shift.

feature visualization

Figure 8 shows the distribution of the multi-level fused features of the test videos on OULU-NPU Protocol-1, visualized with t-SNE [39]. The features using CDC exhibit better-clustered behavior than those using plain convolution, which illustrates the ability of CDC to distinguish real faces from spoofed ones. Visualizations of the feature maps (with/without CDC) and the attention maps of MAFM can be found in Appendix D.

5. Summary and future work

This paper proposes a novel operator called Central Difference Convolution (CDC) for the face anti-spoofing task. Based on CDC, a Central Difference Convolutional Network (CDCN) is designed. We also propose CDCN++, consisting of a searched CDC backbone and a Multiscale Attention Fusion Module (MAFM). Extensive experiments demonstrate the effectiveness of the proposed method. We note that research on CDC is still at an early stage. Future directions include: 1) designing context-aware adaptive CDC for each layer/channel; 2) exploring other properties of CDC (e.g., domain generalization) and its applicability to other vision tasks (e.g., image quality assessment [34] and face forgery detection).

6. Acknowledgments

This research was supported by the MiGA project of the Academy of Finland (grant number 316765), the ICT 2023 project (grant number 328115) and Infotech Oulu. In addition, the authors would like to thank the Finnish CSC IT Center for providing computing resources.

References

The references are too numerous to list here; for details, please refer to the original paper and the model source code: https://github.com/ZitongYu/CDCN

appendix

A. Derivation and code of CDC

Here, we show the detailed derivation of CDC in Equation (7) (Equation (4) in the main text), and the PyTorch code for CDC (see Figure 9).
The reproducible code has been given above (see the CDC implementation in Section 3.1).

B. Adaptive θ of CDC

Although for the face anti-spoofing task, the optimal hyperparameter θ = 0.7 can be measured manually, it is still cumbersome to find the most suitable θ when applying center difference convolution (CDC) to other datasets/tasks. Here, we consider θ as the data-driven learnable weights of each layer. A simple way to implement it is to use Sigmoid(θ) to ensure that the output range is within [0,1].
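A minimal sketch of such an adaptive variant (our own modification of the Conv2d_cd layer shown in the Code section, not the authors' exact implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv2d_cd_adaptive(nn.Module):
    """CDC layer where theta is a learnable per-layer parameter kept in [0, 1]."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1,
                 padding=1, bias=False, init_theta=0.0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=bias)
        # raw parameter; the sigmoid keeps the effective theta within [0, 1]
        self.theta_raw = nn.Parameter(torch.tensor(float(init_theta)))

    def forward(self, x):
        theta = torch.sigmoid(self.theta_raw)
        out_normal = self.conv(x)
        kernel_diff = self.conv.weight.sum(dim=(2, 3))[:, :, None, None]
        out_diff = F.conv2d(x, kernel_diff, bias=self.conv.bias,
                            stride=self.conv.stride, padding=0, groups=self.conv.groups)
        return out_normal - theta * out_diff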

As shown in Figure 10(a), interestingly, the learned weights at the low (2nd–4th) and high (8th–10th) levels are relatively small, while the learned weights at the mid (5th–7th) level are larger. This suggests that the central difference gradient information may be more important for mid-level features. In terms of performance, Fig. 10(b) shows that adaptive CDC achieves comparable results (1.8% ACER) to CDC with a constant θ = 0.7 (1.0% ACER).


C. Cross-type testing on the SiW-M dataset

On the SiW-M dataset [37], we compare our proposed method with three state-of-the-art face anti-spoofing methods [10, 36, 37] under the same cross-type testing protocol (13 attacks tested separately) to verify the generalization ability to unseen attacks. As shown in Table 8, our CDCN++ improves ACER and EER over the previous state-of-the-art method [37] by 24% and 26%, respectively. Specifically, we detect almost all "Impersonation" and "Partial Paper" attacks (EER = 0%), while previous methods perform poorly on "Impersonation" attacks. We also significantly reduce the EER and ACER of mask attacks ("HalfMask", "SiliconeMask", "TransparentMask" and "MannequinHead"), which shows that our CDC-based method generalizes well to 3D non-planar attacks.

D. Feature Visualization

Figure 11 shows the low-level features of MAFM and the corresponding spatial attention maps. It is clear that the features and attention maps are quite different between live and spoofed faces. 1) For low-level features (see the second and third rows of Figure 11), the neural activations from spoofed faces seem to be more uniform between the face and background regions, while the neural activations from live faces are more concentrated in the face region. It is worth noting that detailed fraud patterns (such as grid artifacts in "Print1" and reflection artifacts in "Replay2") are easier to capture using CDC's features. 2) For the spatial attention maps of MAFM (see the fourth row of Fig. 11), all regions (hair, face and background) of live faces have relatively strong activations, while facial regions of spoofed faces contribute weakly to the activations.

Code

CDC block

Recall the CDC formula (Eq. (4)):
$$y(p_0)=\sum_{p_n\in R}w(p_n)\,x(p_0+p_n)+\theta\Big(-x(p_0)\sum_{p_n\in R}w(p_n)\Big)$$

The code is as follows:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class Conv2d_cd(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1,
                 padding=1, dilation=1, groups=1, bias=False, theta=0.7):

        super(Conv2d_cd, self).__init__() 
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, dilation=dilation, groups=groups, bias=bias)
        self.theta = theta

    def forward(self, x):
        out_normal = self.conv(x)

        if math.fabs(self.theta - 0.0) < 1e-8:
            return out_normal 
        else:
            #pdb.set_trace()
            [C_out,C_in, kernel_size,kernel_size] = self.conv.weight.shape
            kernel_diff = self.conv.weight.sum(2).sum(2)
            kernel_diff = kernel_diff[:, :, None, None]
            out_diff = F.conv2d(input=x, weight=kernel_diff, bias=self.conv.bias, stride=self.conv.stride, padding=0, groups=self.conv.groups)

            return out_normal - self.theta * out_diff

spatial attention mechanism

class SpatialAttention(nn.Module):
    def __init__(self, kernel = 3):
        super(SpatialAttention, self).__init__()


        self.conv1 = nn.Conv2d(2, 1, kernel_size=kernel, padding=kernel//2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        x = torch.cat([avg_out, max_out], dim=1)
        x = self.conv1(x)
        
        return self.sigmoid(x)

First, the average pooling and maximum pooling operations are performed on the input x to obtain the average feature map avg_out and the maximum feature map max_out. These two feature maps are then concatenated along the channel dimension into a tensor x of shape (N, 2, H, W).

Next, input x to the convolutional layer self.conv1 for convolution operation to obtain a tensor with shape (N, 1, H, W). Finally, the convolution result is input into the sigmoid function for activation, and the attention weight of each pixel is obtained, that is, the output result.

This operation can be understood as performing a weighting process on each pixel of the input feature map, assigning higher weights to more important pixels, thereby improving the recognition accuracy of the model. Specifically, the spatial attention mechanism can help the model to better focus on important regions in the image, reduce the interference of irrelevant information, and improve the robustness and generalization ability of the model.
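For example (our own usage snippet, assuming the SpatialAttention class above), applying the module to a low-level feature map and re-weighting it, mirroring what the CDCN++ forward pass does below:

import torch

sa = SpatialAttention(kernel=7)          # 7x7 attention kernel, as used for low-level features
feat = torch.randn(2, 128, 128, 128)     # [N, C, H, W] low-level feature map
attention = sa(feat)                     # [N, 1, H, W], values in (0, 1)
refined = attention * feat               # broadcast over channels: F' = F ⊙ attention
print(refined.shape)                     # torch.Size([2, 128, 128, 128])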

CDCN

The following code implements the CDCN neural network architecture

class CDCN(nn.Module):

    def __init__(self, basic_conv=Conv2d_cd, theta=0.7):   
        super(CDCN, self).__init__()
        
        
        self.conv1 = nn.Sequential(
            basic_conv(3, 64, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(64),
            nn.ReLU(),    
        )
        
        self.Block1 = nn.Sequential(
            basic_conv(64, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),   
            basic_conv(128, 196, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(196),
            nn.ReLU(),  
            basic_conv(196, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),   
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            
        )
        
        self.Block2 = nn.Sequential(
            basic_conv(128, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),   
            basic_conv(128, 196, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(196),
            nn.ReLU(),  
            basic_conv(196, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),  
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        
        self.Block3 = nn.Sequential(
            basic_conv(128, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),   
            basic_conv(128, 196, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(196),
            nn.ReLU(),  
            basic_conv(196, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),   
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        
        self.lastconv1 = nn.Sequential(
            basic_conv(128*3, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),    
        )
        
        self.lastconv2 = nn.Sequential(
            basic_conv(128, 64, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(64),
            nn.ReLU(),    
        )
        
        self.lastconv3 = nn.Sequential(
            basic_conv(64, 1, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.ReLU(),    
        )
        
        
        # resize multi-level features to 32x32; nn.functional.interpolate cannot be
        # instantiated like a module here, so nn.Upsample is used instead
        self.downsample32x32 = nn.Upsample(size=(32, 32), mode='bilinear')

 
    def forward(self, x):	    	# x [3, 256, 256]
        
        x_input = x
        x = self.conv1(x)		   
        
        x_Block1 = self.Block1(x)	    	    	# x [128, 128, 128]
        x_Block1_32x32 = self.downsample32x32(x_Block1)   # x [128, 32, 32]  
        
        x_Block2 = self.Block2(x_Block1)	    # x [128, 64, 64]	  
        x_Block2_32x32 = self.downsample32x32(x_Block2)   # x [128, 32, 32]  
        
        x_Block3 = self.Block3(x_Block2)	    # x [128, 32, 32]  	
        x_Block3_32x32 = self.downsample32x32(x_Block3)   # x [128, 32, 32]  
        
        x_concat = torch.cat((x_Block1_32x32,x_Block2_32x32,x_Block3_32x32), dim=1)    # x [128*3, 32, 32]  
        
        #pdb.set_trace()
        
        x = self.lastconv1(x_concat)    # x [128, 32, 32] 
        x = self.lastconv2(x)    # x [64, 32, 32] 
        x = self.lastconv3(x)    # x [1, 32, 32] 
        
        map_x = x.squeeze(1)
        
        return map_x, x_concat, x_Block1, x_Block2, x_Block3, x_input

The size of the input x is 3×256×256, representing an RGB image. First input x to self.conv1 for convolution operation, and get a tensor x with a shape of 64×256×256.

Next, x is passed through the convolution block self.Block1, yielding a tensor x_Block1 of shape 128×128×128. x_Block1 is then resized (downsampled) to 32×32, yielding a tensor x_Block1_32x32 of shape 128×32×32.

x_Block1 is passed through the convolution block self.Block2, yielding a tensor x_Block2 of shape 128×64×64. x_Block2 is then resized to 32×32, yielding a tensor x_Block2_32x32 of shape 128×32×32.

x_Block2 is passed through the convolution block self.Block3, yielding a tensor x_Block3 of shape 128×32×32. x_Block3 is resized to 32×32 (it already has that size, so it is unchanged), yielding a tensor x_Block3_32x32 of shape 128×32×32.

x_Block1_32x32, x_Block2_32x32 and x_Block3_32x32 are concatenated along the channel dimension into a tensor x_concat of shape (128×3)×32×32, i.e., 384×32×32. x_concat is passed to self.lastconv1, giving a tensor x of shape 128×32×32.

Input x to self.lastconv2 for convolution operation, and get a tensor x with a shape of 64×32×32. Input x to self.lastconv3 for convolution operation, and get a tensor x with a shape of 1×32×32.

Finally, the channel dimension of x is squeezed, resulting in a tensor map_x of shape 32×32 for each sample. The forward pass also returns x_concat, x_Block1, x_Block2, x_Block3 and x_input.
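As a quick shape check (our own usage snippet, assuming the Conv2d_cd and CDCN classes above), running a dummy batch through the model confirms the sizes described:

import torch

model = CDCN(basic_conv=Conv2d_cd, theta=0.7)
dummy = torch.randn(2, 3, 256, 256)                     # a batch of two RGB face crops
map_x, x_concat, b1, b2, b3, x_input = model(dummy)

print(map_x.shape)     # torch.Size([2, 32, 32])        predicted depth maps
print(x_concat.shape)  # torch.Size([2, 384, 32, 32])   fused multi-level features
print(b1.shape, b2.shape, b3.shape)
# torch.Size([2, 128, 128, 128]) torch.Size([2, 128, 64, 64]) torch.Size([2, 128, 32, 32])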

CDCN++

The network structure (see Fig. 5) is implemented by the following code:

class CDCNpp(nn.Module):

    def __init__(self, basic_conv=Conv2d_cd, theta=0.7 ):   
        super(CDCNpp, self).__init__()
        
        
        self.conv1 = nn.Sequential(
            basic_conv(3, 64, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(64),
            nn.ReLU(),    
            
        )
        
        self.Block1 = nn.Sequential(
            basic_conv(64, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),  
            
            basic_conv(128, int(128*1.6), kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(int(128*1.6)),
            nn.ReLU(),  
            basic_conv(int(128*1.6), 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(), 
            
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            
        )
        
        self.Block2 = nn.Sequential(
            basic_conv(128, int(128*1.2), kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(int(128*1.2)),
            nn.ReLU(),  
            basic_conv(int(128*1.2), 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),  
            basic_conv(128, int(128*1.4), kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(int(128*1.4)),
            nn.ReLU(),  
            basic_conv(int(128*1.4), 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),  
            
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        
        self.Block3 = nn.Sequential(
            basic_conv(128, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(), 
            basic_conv(128, int(128*1.2), kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(int(128*1.2)),
            nn.ReLU(),  
            basic_conv(int(128*1.2), 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(), 
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        
        # Original
        
        self.lastconv1 = nn.Sequential(
            basic_conv(128*3, 128, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            basic_conv(128, 1, kernel_size=3, stride=1, padding=1, bias=False, theta= theta),
            nn.ReLU(),    
        )
        
      
        self.sa1 = SpatialAttention(kernel = 7)
        self.sa2 = SpatialAttention(kernel = 5)
        self.sa3 = SpatialAttention(kernel = 3)
        self.downsample32x32 = nn.Upsample(size=(32, 32), mode='bilinear')

 
    def forward(self, x):	    	# x [3, 256, 256]
        
        x_input = x
        x = self.conv1(x)		   
        
        x_Block1 = self.Block1(x)	    	    	
        attention1 = self.sa1(x_Block1)
        x_Block1_SA = attention1 * x_Block1
        x_Block1_32x32 = self.downsample32x32(x_Block1_SA)   
        
        x_Block2 = self.Block2(x_Block1)	    
        attention2 = self.sa2(x_Block2)  
        x_Block2_SA = attention2 * x_Block2
        x_Block2_32x32 = self.downsample32x32(x_Block2_SA)  
        
        x_Block3 = self.Block3(x_Block2)	    
        attention3 = self.sa3(x_Block3)  
        x_Block3_SA = attention3 * x_Block3	
        x_Block3_32x32 = self.downsample32x32(x_Block3_SA)   
        
        x_concat = torch.cat((x_Block1_32x32,x_Block2_32x32,x_Block3_32x32), dim=1)    
        
        #pdb.set_trace()
        
        map_x = self.lastconv1(x_concat)
        
        map_x = map_x.squeeze(1)
        
        return map_x, x_concat, attention1, attention2, attention3, x_input
		

Here is an explanation of each part of the code:

init method: initialize the CDCNpp class. This method creates the layers and submodules of the network.

The basic_conv parameter is a custom convolution layer, the default is Conv2d_cd.
The theta parameter is a hyperparameter passed to the custom convolutional layer.
conv1: defines a convolutional layer sequence, including a convolutional layer, batch normalization (BatchNorm2d) layer and activation function (ReLU).

Block1, Block2, and Block3: Three similar subnetworks are defined, each subnetwork contains multiple convolutional layers, batch normalization layers, activation functions, and pooling layers.

lastconv1: Defines a sequence of convolutional layers to process the results of the three subnetworks.

sa1, sa2 and sa3: Three spatial attention modules are defined.

downsample32x32: Defines a resizing layer (nn.Upsample) that rescales feature maps to a fixed 32×32 size.

forward method: defines the forward propagation process of the network.

Pass the input x to the conv1 layer.
Pass the results to the Block1, Block2, and Block3 subnetworks.
A spatial attention module is applied on the output of each sub-network, and the result is upsampled to 32x32 size.
Concatenate the outputs of the three sub-networks along the channel dimension.
Pass the concatenated result to the lastconv1 layer.
Squeeze the single channel dimension of the final output tensor to obtain the depth map.
The forward method returns six outputs:

map_x: The final depth estimation map.
x_concat: Feature map after splicing.
attention1, attention2 and attention3: Spatial attention module outputs of the three sub-networks.
x_input: Raw input image.
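As a quick shape check (our own usage snippet, assuming the Conv2d_cd and CDCNpp classes above), running a dummy batch through the model confirms the output sizes:

import torch

model = CDCNpp(basic_conv=Conv2d_cd, theta=0.7)
dummy = torch.randn(2, 3, 256, 256)
map_x, x_concat, att1, att2, att3, x_input = model(dummy)

print(map_x.shape)    # torch.Size([2, 32, 32])        predicted depth maps
print(x_concat.shape) # torch.Size([2, 384, 32, 32])   fused multi-level features
print(att1.shape)     # torch.Size([2, 1, 128, 128])   low-level spatial attention map
print(att2.shape)     # torch.Size([2, 1, 64, 64])
print(att3.shape)     # torch.Size([2, 1, 32, 32])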


Source: https://blog.csdn.net/qq_51957239/article/details/129776820