Dynamic ReLU: A ReLU Determined Dynamically by the Input

I recently came across a paper that proposes Dynamic ReLU (DY-ReLU), which encodes the global context of the input through a hyperfunction and adjusts a piecewise linear activation function accordingly. Compared with the traditional ReLU, the extra computational cost of DY-ReLU is negligible, while its representation ability is significantly stronger. It is also simple to implement, so existing models can be modified with very little effort.

Dynamic ReLU (DY-ReLU)

For a given input vector (or tensor) x, DY-ReLU is defined as a function $f_{\theta(x)}(x)$ whose parameters θ(x) are learned and adapt to the input x. It consists of two functions:

Hyperfunction θ(x): computes the parameters of the activation function.

Activation function $f_{\theta(x)}(x)$: uses the parameters θ(x) to generate the activations for all channels.

1. Function definition

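The traditional, static ReLU is y = max(x, 0). It can be generalized to a parametric piecewise linear function for each channel c:

$$y_c = f_{\theta(x)}(x_c) = \max_{1 \le k \le K} \{ a_c^k(x)\, x_c + b_c^k(x) \}$$

where $x_c$ is the input on channel c and the coefficients $(a_c^k, b_c^k)$ are the output of the hyperfunction θ(x):

$$[a_1^1, \dots, a_C^1, \dots, a_1^K, \dots, a_C^K, b_1^1, \dots, b_C^1, \dots, b_1^K, \dots, b_C^K]^T = \theta(x)$$

Here K is the number of linear functions and C is the number of channels. The paper uses K = 2.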
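The per-channel maximum over K linear functions can be sketched in a few lines of plain Python. This is only an illustration: the name `dy_relu`, the list-based layout (one value per channel), and the coefficient values are assumptions made for the example, not taken from the paper.

```python
def dy_relu(x, a, b):
    """Apply y_c = max_k(a[k][c] * x[c] + b[k][c]) to a per-channel input.

    x: list of C floats (one value per channel, for illustration)
    a, b: K lists of C floats (slopes and intercepts from the hyperfunction)
    """
    num_channels = len(x)
    return [
        max(a_k[c] * x[c] + b_k[c] for a_k, b_k in zip(a, b))
        for c in range(num_channels)
    ]


# K = 2 segments, C = 3 channels; coefficient values are made up for illustration.
x = [-2.0, 0.5, 3.0]
a = [[1.0, 1.0, 0.5],   # slopes of segment k = 1
     [0.0, 0.1, 0.0]]   # slopes of segment k = 2
b = [[0.0, 0.0, 0.0],   # intercepts of segment k = 1
     [0.0, 0.0, 1.0]]   # intercepts of segment k = 2
y = dy_relu(x, a, b)    # [0.0, 0.5, 1.5]
```

In a real network the coefficients would not be fixed like this. They would be recomputed by the hyperfunction for every input.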

2. Implementation of the hyperfunction θ(x)

A lightweight network models the hyperfunction. Its structure is similar to the SE (squeeze-and-excitation) module in SENet.

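The network outputs 2KC elements, corresponding to the residuals of a and b, written $\Delta a_c^k$ and $\Delta b_c^k$. The function 2σ(x) - 1, where σ is the sigmoid function, normalizes the residuals to the range between -1 and 1. The final coefficients are the sum of an initialization value and the scaled residual:

$$a_c^k(x) = \alpha^k + \lambda_a \Delta a_c^k(x), \qquad b_c^k(x) = \beta^k + \lambda_b \Delta b_c^k(x)$$

where $\alpha^k$ and $\beta^k$ are the initialization values of $a_c^k$ and $b_c^k$, and $\lambda_a$ and $\lambda_b$ are scalars that control the range of the residuals. This is the computation depicted in the figure above.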
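A minimal plain-Python sketch of such a hyperfunction follows, assuming an SE-style layout: global average pooling, then two fully connected layers with a ReLU in between. The helper names, the ordering of the 2KC outputs, and the default values (`alpha=(1, 0)`, `beta=(0, 0)`, `lam_a=1.0`, `lam_b=0.5`) are assumptions for illustration and should be checked against the paper or its official code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(w, v, bias):
    """Fully connected layer: w is a list of rows, v and bias are lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(w, bias)]


def hyperfunction(feature_map, w1, b1, w2, b2, num_k=2,
                  alpha=(1.0, 0.0), beta=(0.0, 0.0), lam_a=1.0, lam_b=0.5):
    """Return (a, b), each K lists of C coefficients, for one input.

    feature_map: C lists holding the H*W values of each channel.
    """
    num_c = len(feature_map)
    # Squeeze: global average pooling gives one value per channel.
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    # Two fully connected layers with a ReLU in between (assumed SE-style).
    hidden = [max(h, 0.0) for h in matvec(w1, pooled, b1)]
    out = matvec(w2, hidden, b2)  # 2*K*C raw outputs
    # Normalize to residuals in (-1, 1) with 2*sigmoid(z) - 1.
    res = [2.0 * sigmoid(z) - 1.0 for z in out]
    # Assumed layout: the first K*C entries are slope residuals,
    # the remaining K*C entries are intercept residuals.
    a = [[alpha[k] + lam_a * res[k * num_c + c] for c in range(num_c)]
         for k in range(num_k)]
    offset = num_k * num_c
    b = [[beta[k] + lam_b * res[offset + k * num_c + c] for c in range(num_c)]
         for k in range(num_k)]
    return a, b


# With all-zero weights the residuals vanish, so the coefficients equal the
# initialization values, which makes the activation a plain ReLU.
C, K, hidden_dim = 2, 2, 1
fmap = [[1.0, 3.0], [-2.0, 0.0]]
w1, b1 = [[0.0] * C for _ in range(hidden_dim)], [0.0] * hidden_dim
w2, b2 = [[0.0] * hidden_dim for _ in range(2 * K * C)], [0.0] * (2 * K * C)
a, b = hyperfunction(fmap, w1, b1, w2, b2)
```

Because the residuals are bounded, each slope stays within `lam_a` of its initialization and each intercept within `lam_b`, so the learned activation cannot drift arbitrarily far from the starting ReLU.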

3. Relation to prior work

ReLU, Leaky ReLU, and PReLU can each be seen as a special case of DY-ReLU.
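One way to check this is to fix the coefficients of the K = 2 maximum by hand and compare against the familiar definitions. The sketch below does this for scalar inputs. The function names and the PReLU slope of 0.25 (a stand-in for a learned per-channel value) are assumptions for illustration.

```python
def piecewise_max(x, coeffs):
    """y = max over (a, b) pairs of a * x + b, for a scalar x."""
    return max(a * x + b for a, b in coeffs)


def relu(x):
    return max(x, 0.0)


def leaky_relu(x, slope=0.01):
    return x if x >= 0 else slope * x


xs = [-3.0, -0.5, 0.0, 2.0]

# Each activation is the pair (a1, b1) = (1, 0) plus a fixed second segment.
relu_coeffs = [(1.0, 0.0), (0.0, 0.0)]
leaky_coeffs = [(1.0, 0.0), (0.01, 0.0)]
prelu_slope = 0.25  # stand-in for a learned per-channel slope
prelu_coeffs = [(1.0, 0.0), (prelu_slope, 0.0)]

relu_match = all(piecewise_max(x, relu_coeffs) == relu(x) for x in xs)
leaky_match = all(piecewise_max(x, leaky_coeffs) == leaky_relu(x) for x in xs)
prelu_match = all(
    piecewise_max(x, prelu_coeffs) == leaky_relu(x, prelu_slope) for x in xs
)
```

The maximum form only reproduces these activations when the negative-side slope is at most 1, which is the usual case. The difference in DY-ReLU is that the coefficients are no longer constants or plain learned weights but functions of the input.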

4. Variants of DY-ReLU

DY-ReLU-A: the activation function is shared across both spatial positions and channels.

DY-ReLU-B: the activation function is shared across spatial positions but differs per channel.

DY-ReLU-C: the activation function differs per spatial position and per channel.
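To make the difference between the three variants concrete, the sketch below counts how many distinct (a, b) values the activation uses for a single input under each sharing scheme. The function name and the example sizes are hypothetical. For variant C the count is conceptual, since how the hyperfunction produces per-position coefficients efficiently is not covered here.

```python
def num_coefficients(variant, num_k, channels, height, width):
    """Number of distinct (a, b) values the activation uses for one input."""
    if variant == "A":  # shared across channels and spatial positions
        return 2 * num_k
    if variant == "B":  # shared across spatial positions, one set per channel
        return 2 * num_k * channels
    if variant == "C":  # one set per channel and per spatial position
        return 2 * num_k * channels * height * width
    raise ValueError(f"unknown variant: {variant!r}")


# Example: K = 2 on a 64-channel, 7x7 feature map (sizes chosen arbitrarily).
counts = {
    v: num_coefficients(v, num_k=2, channels=64, height=7, width=7)
    for v in "ABC"
}
```

For this example the counts are 4, 256, and 12544 for variants A, B, and C.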

Results

1. Ablation study

All three variants improve over the baseline, but the channel-wise variants (DY-ReLU-B and DY-ReLU-C) significantly outperform the channel-shared DY-ReLU-A.

Based on these results, the paper uses DY-ReLU-B for ImageNet classification and DY-ReLU-C for COCO keypoint detection.

2. ImageNet classification

MobileNetV2 (×0.35 and ×1.0) is used, with ReLU replaced by different activation functions. The proposed method significantly outperforms all previous activation functions, including Maxout, which has a higher computational cost. This shows that DY-ReLU not only has stronger representation ability but is also computationally efficient.

The figure above plots the DY-ReLU input and output values in different blocks (from low-level to high-level) over the 50,000 validation images. The learned DY-ReLU is clearly dynamic: for a given input x, the activation value y varies within a range (the area covered by the blue dots) rather than taking a single value.

The figure below analyzes the angle between the two segments of DY-ReLU, measured as the slope difference $|a_c^1 - a_c^2|$. The activation function has lower curvature at higher layers.

3. COCO keypoint detection

With MobileNetV3 as the backbone, removing the SE modules and replacing ReLU and h-Swish with DY-ReLU also improves the results.

Summary

Simply replacing the existing activation function with DY-ReLU makes the model perform significantly better than the baseline.

The paper was published at ECCV 2020. The Papers with Code address is posted here so that the code can be used directly:

https://avoid.overfit.cn/post/8db206f03cd54167b9eb2d06ebaffc6b

Author: sh-tsang