Document name: MobileNetV2: Inverted Residuals and Linear Bottlenecks
Published: 2018
Download address: https://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf
1. Introduction
MobileNet V1 has the following issues:
- no residual structure
- After training, many parameters of the depth-wise convolutions turn out to be 0, for the following three reasons:
- The depth-wise convolution has very few weights; each filter acts on a single thin channel, so it can only process two-dimensional (spatial) information, not three-dimensional (cross-channel) information
- ReLU sets outputs below 0 to 0, so the corresponding parameters stop updating and stay at 0
- Low-precision data representation on mobile devices
MobileNet V2 makes further improvements on the above problems.
2. Innovation points of the paper
1) Inverted residual structure – Inverted residual block
The similarities and differences between ResNet's residual block and MobileNet's inverted residual block:
- Similarity: both have the structure conv 1x1 → conv 3x3 → conv 1x1, and finally make a cross-layer shortcut connection
- Differences:
  - The channel count of the residual block is first decreased and then increased, while the channel count of the inverted residual block is first increased and then decreased
  - The middle conv 3x3 layer of the residual block is a standard 3x3 convolution, while the middle layer of the inverted residual block is a depth-wise convolution
  - The shortcut of the residual block connects two high-dimensional layers (two layers with a large number of channels), while the shortcut of the inverted residual block connects two low-dimensional layers (some sources say the inverted residual block connects two bottleneck layers across layers, where a bottleneck layer means a layer with a small number of channels)
  - In the residual block, each layer is followed by a ReLU activation function; in the inverted residual block, the first conv 1x1 and the middle conv 3x3 are followed by ReLU6, while the last conv 1x1 (the dimension-reducing one) is followed by a linear activation function. That is, the layer indicated by the diagonal slash in the figure below is followed by the linear activation layer
Inverted residual block structure description:
- First do a conv 1x1, expanding the channel count from $k$ up to $tk$, where $t$ is the expansion factor. The computation of this layer is $h \times w \times k \times (tk)$
- Then do a depth-wise convolution, keeping the channel count unchanged. The computation of this layer is $h \times w \times (tk) \times 3^2$
- Finally do a conv 1x1, reducing the dimension from $tk$ down to $k'$. The computation of this layer is $h \times w \times (tk) \times k'$

So the total computation of an inverted residual block is $h \times w \times (tk) \times (k + 3^2 + k')$
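As a sanity check on the arithmetic above, a small helper can add up the three layers and confirm the closed form (the function name and sample sizes here are illustrative, not from the paper):

```python
def inverted_residual_flops(h, w, k, t, k_out):
    """Multiply-accumulate count of one inverted residual block.

    h, w  : spatial size of the feature map
    k     : input channels; t: expansion factor; k_out: output channels (k')
    """
    expand = h * w * k * (t * k)          # conv 1x1, k -> tk
    depthwise = h * w * (t * k) * 3 ** 2  # 3x3 depth-wise conv
    project = h * w * (t * k) * k_out     # conv 1x1, tk -> k'
    return expand + depthwise + project

# Matches the closed form h*w*(tk)*(k + 3^2 + k'):
assert inverted_residual_flops(56, 56, 24, 6, 24) == 56 * 56 * (6 * 24) * (24 + 9 + 24)
```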
Why use an inverted residual structure?
(Conclusion first, then elaboration.) Because the nonlinear transformation (ReLU) causes information loss, the block first increases the dimension to create redundant dimensions, then applies the nonlinear transformation (ReLU) in those redundant dimensions, and finally reduces the dimension back, keeping only the necessary, useful information.
The author did an experiment: take a spiral line $X_{2 \times n}$ composed of $n$ points in 2-dimensional space, map it to $m$ dimensions through a matrix $T_{m \times 2}$, and apply ReLU. The expression is $Y = \mathrm{ReLU}(T_{m \times 2} \cdot X_{2 \times n})$. Then convert $Y$ back to two-dimensional space through $T^{-1}$, denote the result $\hat X$, and compare $X$ with $\hat X$ to observe the information loss.
** The dimension $m$ corresponds to dim = 2 / 3 / 5 / 15 / 30 in the figure below
** $T^{-1}$ is the generalized inverse matrix of $T$
The experimental conclusion: if the mapping dimension $m$ is relatively low, a lot of information is lost after the ReLU transformation; if $m$ is relatively high, much less information is lost. This is why, after the conv 1x1 dimensionality reduction, a linear activation function is used instead of ReLU.
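The experiment can be reproduced in a few lines of NumPy. This is an illustrative sketch, not the authors' code: $T$ is drawn as a random Gaussian matrix, the generalized inverse is `np.linalg.pinv`, and (my own addition, to focus on structural distortion rather than the uniform shrinkage ReLU introduces) the reconstruction is compared to the input up to a best-fit global scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# n points on a 2-D spiral: X has shape (2, n), like X_{2 x n} in the text
n = 1000
theta = np.linspace(0.1, 4 * np.pi, n)
X = np.stack([theta * np.cos(theta), theta * np.sin(theta)])

def relu_roundtrip_error(m, trials=20):
    """Map X to m dims with a random T_{m x 2}, apply ReLU, map back with
    the generalized inverse pinv(T), and measure the loss versus X
    (up to a best-fit global scale), averaged over random draws of T."""
    errs = []
    for _ in range(trials):
        T = rng.standard_normal((m, 2))
        Y = np.maximum(T @ X, 0.0)     # Y = ReLU(T X)
        X_hat = np.linalg.pinv(T) @ Y  # back to 2-D via the generalized inverse
        a = np.sum(X_hat * X) / np.sum(X_hat * X_hat)  # best scalar fit
        errs.append(float(np.mean((a * X_hat - X) ** 2)))
    return float(np.mean(errs))

# dim = 2 / 3 / 5 / 15 / 30, as in the paper's figure; low m loses much more
errors = {m: relu_roundtrip_error(m) for m in (2, 3, 5, 15, 30)}
```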
2) ReLU6
Why use ReLU6?
Because MobileNet is designed for devices with relatively little memory, such as mobile or embedded devices, it generally requires low-precision representation, for example using 8 bits to represent a number (there is no general-purpose data type like float8; such a representation serves entirely specific purposes).
(The following is my personal understanding; please point it out if it is wrong.)
Using ReLU6 caps the value at 6, so the integer part occupies only 3 bits and the remaining bits can represent the fractional part, which keeps the precision loss small.
Then why 6, and not 5, 7, or 8?
The author said: the value 6 fits best in 8 bits, which probably covers the most common use case.
That is, most usage scenarios on mobile devices use 8-bit low-precision representation, and capping at 6 is the most appropriate choice.
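For reference, ReLU6 is simply ReLU clipped at 6; a minimal scalar definition:

```python
def relu6(x):
    """ReLU capped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)

# With activations bounded to [0, 6], a low-bit fixed-point format needs
# only 3 bits for the integer part, leaving the rest for the fraction.
assert relu6(-1.0) == 0.0 and relu6(3.5) == 3.5 and relu6(100.0) == 6.0
```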
3. Network structure
t: expansion rate
c: the number of output channels
n: the number of repetitions of the bottleneck
s: (in repeated n modules) the step size of the first module, and the step size of the remaining modules is 1
Among these, some blocks have a cross-layer shortcut connection and some do not. The criterion is whether the block downsamples: a downsampling block cannot have a shortcut because its input and output sizes differ. Only blocks without downsampling (blocks with stride = 1) have a shortcut. As shown below:
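In typical reimplementations (an assumption on my part; check the linked code for the exact condition used there), the shortcut test also requires that the channel count is unchanged, since the elementwise sum needs identical input and output shapes:

```python
def use_shortcut(stride, in_channels, out_channels):
    """A block gets a residual shortcut only when it neither downsamples
    (stride == 1) nor changes the channel count, so the input and output
    tensors have identical shapes and can be summed elementwise."""
    return stride == 1 and in_channels == out_channels

assert use_shortcut(1, 32, 32)      # same shape: shortcut
assert not use_shortcut(2, 32, 32)  # downsampling: no shortcut
assert not use_shortcut(1, 32, 64)  # channel change: no shortcut
```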
Code address: https://github.com/Enzo-MiMan/cv_related_collections/blob/main/classification/MobileNet/model_MobileNet_v2.py