Classic neural network (7) DenseNet and its application on the Fashion-MNIST data set
1 Brief description of DenseNet
- DenseNet improves the learning ability of the network not through deeper or wider structures, but through feature reuse.
- ResNet's idea is to create a direct connection from "layers near the input" to "layers near the output". DenseNet takes this idea further: it connects all layers to one another in a feed-forward fashion, which is why the network is called DenseNet.
- DenseNet has the following advantages:
  - It alleviates the vanishing-gradient problem. Because each layer can obtain the gradient directly from the loss function and information directly from the original input, the network is easy to train.
  - The dense connections also have a regularizing effect, alleviating overfitting on tasks with small training sets.
  - It encourages feature reuse: the network combines the feature maps learned by different layers.
  - It significantly reduces the number of parameters, because each layer uses small convolution kernels and a small number of output channels (determined by the growth rate).
- DenseNet has fewer parameters than traditional convolutional networks because it does not need to relearn redundant feature maps.
- A traditional feedforward neural network can be viewed as an algorithm that passes a "state" from layer to layer: each layer receives the state of the previous layer and passes a new state to the next layer. It changes the state, but it also has to transmit the information that needs to be preserved.
- ResNet passes the information that needs to be preserved directly through identity mappings, so only the change of state needs to be transferred between layers.
- DenseNet saves the states of all layers into a "collective knowledge", and each layer adds a small number of feature maps of its own to this collective knowledge.
- DenseNet layers are very narrow (that is, the number of feature-map channels is small); for example, the output of each layer may have only 12 channels.
- For cross-layer connections, unlike ResNet, where input and output are added, DenseNet concatenates input and output along the channel dimension. The main building blocks of DenseNet are dense blocks and transition layers. When building a DenseNet, transition layers are added to reduce the number of channels again and thereby control the dimensionality of the network.
- Although DenseNet is computationally efficient and has relatively few parameters, it is not memory-friendly. Sharing memory can be considered to solve this problem.
- Paper download address: https://arxiv.org/pdf/1608.06993.pdf
1.1 Dense block
The key difference between ResNet and DenseNet is that DenseNet's output is a concatenation (denoted by [, ] in the figure below), rather than a simple addition as in ResNet.
The name DenseNet comes from the "dense connections" between variables, with the last layer being closely connected to all previous layers.
Note: when the size of the feature maps changes, they can no longer be concatenated along the channel dimension. The network is therefore divided into multiple dense blocks; within each block all feature maps have the same size, while feature-map sizes differ between blocks.
1.1.1 Growth rate
- In a dense block, every layer H (that is, BN-ReLU-Conv) outputs feature maps with the same number of channels k. This k is an important hyperparameter called the growth rate of the network. The number of input channels of the l-th layer is k0 + k(l-1), where k0 is the number of channels of the block's input.
- An important difference between DenseNet and existing networks is that DenseNet is very narrow, that is, the number of output feature-map channels is small, for example k = 12.
- A small growth rate can already achieve good results. One explanation is that each layer of a dense block has access to the output feature maps of all previous layers in the block, and these feature maps can be regarded as the global state of the block. The output feature maps of each layer are added to this global state, which can be understood as the "collective knowledge" of the block and is shared by all layers in it. The growth rate determines the proportion of new features each layer contributes to the global state.
- As a result, feature maps do not need to be copied layer by layer (because they are globally shared), which also distinguishes DenseNet from traditional network structures. This facilitates feature reuse across the network and yields more compact models.
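As a small numeric illustration of the channel formula above (a sketch only; k0 = 64 and k = 32 are example values that happen to match the simplified implementation later in this note):

# Input channels of the l-th layer inside one dense block: k0 + k*(l-1)
k0, k = 64, 32           # example: 64 channels entering the block, growth rate 32
for l in range(1, 5):    # a dense block with 4 layers
    print(f'layer {l}: input channels = {k0 + k * (l - 1)}, output channels = {k}')
print(f'block output channels = {k0 + 4 * k}')   # 64 + 4*32 = 192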
1.1.2 Nonlinear transformation
- H can be a composite function including operations such as batch normalization (BN), ReLU, pooling, or convolution.
- The structure used in the paper is: first BN, then ReLU, followed by a 3x3 convolution, that is, BN-ReLU-Conv(3x3).
- The PyTorch implementation is as follows:
import torch.nn as nn
import torch

'''
DenseNet uses the "batch normalization, activation, convolution" architecture
of the improved version of ResNet.
Convolution block: BN-ReLU-Conv
'''
def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1)
    )
1.1.3 Bottleneck layer
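This subsection was left empty in the original note; the following summary is based on the DenseNet paper. To limit the cost of the 3x3 convolution on the ever-growing concatenated input, the paper introduces a bottleneck variant (DenseNet-B) in which each H becomes BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3): the 1x1 convolution first reduces the concatenated input to 4k channels, and the 3x3 convolution then produces the k output channels. A minimal sketch in the same style as conv_block above (assuming the same imports; the factor 4 follows the paper's choice of 4k bottleneck channels):

def bottleneck_block(input_channels, num_channels):
    # BN-ReLU-Conv(1x1): reduce the concatenated input to 4*k channels
    # BN-ReLU-Conv(3x3): produce the k output channels (k = num_channels)
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, 4 * num_channels, kernel_size=1),
        nn.BatchNorm2d(4 * num_channels),
        nn.ReLU(),
        nn.Conv2d(4 * num_channels, num_channels, kernel_size=3, padding=1)
    )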
1.1.4 PyTorch implementation of dense blocks
import torch.nn as nn
import torch

'''
DenseNet uses the "batch normalization, activation, convolution" architecture
of the improved version of ResNet.
Convolution block: BN-ReLU-Conv
'''
def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1)
    )

'''
A dense block consists of multiple convolution blocks, each using the same number of output channels.
In the forward pass, however, we concatenate the input and output of each convolution block
along the channel dimension.
'''
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(
                conv_block(num_channels * i + input_channels, num_channels)  # a dense block consists of multiple convolution blocks
            )
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # concatenate the input and output of each block along the channel dimension
            X = torch.cat((X, Y), dim=1)
        return X

if __name__ == '__main__':
    '''
    1. Dense block
    We define a DenseBlock with 2 convolution blocks of 10 output channels each.
    With an input of 3 channels, we get an output of 3 + 2 x 10 = 23 channels.
    The number of channels of the convolution blocks controls how fast the output channels grow
    relative to the input channels, and is therefore also called the growth rate.
    '''
    blk = DenseBlock(2, 3, 10)
    # After the first convolution block X becomes (4, 10, 8, 8); concatenated with the original X (4, 3, 8, 8)
    # along dimension 1, X becomes (4, 13, 8, 8).
    # The second convolution block maps the (10 + 3) channels to 10, so its output Y is (4, 10, 8, 8).
    # X and Y are then concatenated along dimension 1, giving the final output (4, 23, 8, 8).
    X = torch.randn(4, 3, 8, 8)
    Y = blk(X)
    print(Y.shape)  # (4, 23, 8, 8)
1.2 Transition layer
1.2.1 Introduction to transition layer
- A DenseNet consists of multiple dense blocks connected by transition layers. The layers between dense blocks are called transition layers, and their main role is to connect the different dense blocks.
- A transition layer can contain convolution or pooling operations, thereby changing the size (spatial size and number of channels) of the feature maps output by the previous dense block.
  - The transition layer in the paper consists of a BN layer, a 1x1 convolutional layer, and a 2x2 average pooling layer. The 1x1 convolutional layer is used to reduce the number of output channels of the dense block and improve the compactness of the model.
  - If the number of output channels of the dense blocks is not reduced, then after several dense blocks the number of feature-map channels becomes very large (the channel count follows the growth-rate formula given in section 1.1.1).
- If the number of channels output by a dense block is m, the number of feature-map channels output by the transition layer can be theta × m, where 0 < theta <= 1 is the compression factor. A small sketch of this choice is given after this list.
  - When theta = 1, the number of channels passing through the transition layer remains unchanged.
  - When theta < 1, the number of channels passing through the transition layer decreases. In this case the network is called DenseNet-C.
  - The improved network that combines both the bottleneck layers (DenseNet-B) and compression (DenseNet-C) is called DenseNet-BC.
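A small sketch (not from the original note) of how the compression factor could be used to pick the transition layer's output channels; transition_block is implemented in the next subsection, and theta = 0.5 matches the DenseNet-BC setting used later:

def compressed_channels(m, theta=0.5):
    # output channels of the transition layer: floor(theta * m), with 0 < theta <= 1
    return int(theta * m)

# e.g. a dense block that ends with m = 256 channels is compressed to 128 channels:
# blk = transition_block(256, compressed_channels(256, theta=0.5))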
1.2.2 Implementation of transition layer
'''
Since every dense block increases the number of channels, using too many of them makes the model overly complex.
A transition layer is used to control the model complexity. It reduces the number of channels with a 1x1
convolutional layer and halves the height and width with an average pooling layer of stride 2, further reducing
the model complexity.
'''
def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),  # 1x1 convolution to reduce the number of channels
        nn.AvgPool2d(kernel_size=2, stride=2)  # average pooling with stride 2 halves the height and width
    )
if __name__ == '__main__':
    '''
    1. Dense block
    We define a DenseBlock with 2 convolution blocks of 10 output channels each.
    With an input of 3 channels, we get an output of 3 + 2 x 10 = 23 channels.
    The number of channels of the convolution blocks controls how fast the output channels grow
    relative to the input channels, and is therefore also called the growth rate.
    '''
    blk = DenseBlock(2, 3, 10)
    # After the first convolution block X becomes (4, 10, 8, 8); concatenated with the original X (4, 3, 8, 8)
    # along dimension 1, X becomes (4, 13, 8, 8).
    # The second convolution block maps the (10 + 3) channels to 10, so its output Y is (4, 10, 8, 8).
    # X and Y are then concatenated along dimension 1, giving the final output (4, 23, 8, 8).
    X = torch.randn(4, 3, 8, 8)
    Y = blk(X)
    print(Y.shape)  # (4, 23, 8, 8)

    '''
    2. Transition layer
    '''
    blk = transition_block(23, 10)
    print(blk(Y).shape)  # torch.Size([4, 10, 4, 4])
1.3 DenseNet network performance
1.3.1 Network structure
Network structure: the DenseNet structures used for ImageNet training, with growth rate k = 32.
- In the table, conv denotes the BN-ReLU-Conv combination. For example, 1x1 conv means: first BN, then ReLU, and finally a 1x1 convolution.
- DenseNet-xx indicates a DenseNet with xx layers. For example, DenseNet-169 is a DenseNet with L = 169 layers.
- All of these DenseNets use the DenseNet-BC structure; the input image size is 224x224, the initial convolution is 7x7 with 2k output channels and stride 2, and the compression factor is theta = 0.5.
- After the last dense block there is a global average pooling layer, whose result is used as the input of the softmax output layer.
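To make the table's numbers concrete, here is a small illustrative helper (not from the original note) that traces how the channel count evolves for the DenseNet-121 configuration from the paper (dense blocks of 6, 12, 24, 16 layers, k = 32, theta = 0.5, and 2k = 64 channels after the initial convolution):

def channel_progression(block_sizes=(6, 12, 24, 16), k=32, theta=0.5, init_channels=64):
    # init_channels = 2k channels after the initial 7x7 convolution
    c = init_channels
    for i, num_layers in enumerate(block_sizes):
        c += num_layers * k                    # every layer adds k channels
        print(f'after dense block {i + 1}: {c} channels')
        if i != len(block_sizes) - 1:          # no transition layer after the last block
            c = int(theta * c)
            print(f'after transition {i + 1}: {c} channels')
    return c

channel_progression()   # ends with 1024 channels before global average pooling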
1.3.2 Error rate on the ImageNet validation set
The figure below compares the error rates of DenseNet and ResNet on the ImageNet validation set (single-crop). The left plot is against the number of parameters, and the right plot against the amount of computation.
The experiments show that, compared with ResNet, DenseNet significantly reduces both the number of parameters and the amount of computation.
- The validation error of DenseNet-201 with 20M parameters is close to that of ResNet-101 with 40M parameters.
- The validation error of DenseNet-201, whose computational cost is close to that of ResNet-50 (almost half that of ResNet-101), is close to the validation error of ResNet-101.
1.3.3 Implementation of a simple version of DenseNet
We implement a simplified version of DenseNet (plain DenseNet rather than DenseNet-BC) and apply it to the Fashion-MNIST dataset.
Dense block and transition layer
import torch.nn as nn
import torch

'''
DenseNet uses the "batch normalization, activation, convolution" architecture
of the improved version of ResNet.
Convolution block: BN-ReLU-Conv
'''
def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1)
    )

'''
A dense block consists of multiple convolution blocks, each using the same number of output channels.
In the forward pass, however, we concatenate the input and output of each convolution block
along the channel dimension.
'''
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(
                conv_block(num_channels * i + input_channels, num_channels)  # a dense block consists of multiple convolution blocks
            )
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # concatenate the input and output of each block along the channel dimension
            X = torch.cat((X, Y), dim=1)
        return X

'''
Since every dense block increases the number of channels, using too many of them makes the model overly complex.
A transition layer is used to control the model complexity. It reduces the number of channels with a 1x1
convolutional layer and halves the height and width with an average pooling layer of stride 2, further reducing
the model complexity.
'''
def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),  # 1x1 convolution to reduce the number of channels
        nn.AvgPool2d(kernel_size=2, stride=2)  # average pooling with stride 2 halves the height and width
    )

if __name__ == '__main__':
    '''
    1. Dense block
    We define a DenseBlock with 2 convolution blocks of 10 output channels each.
    With an input of 3 channels, we get an output of 3 + 2 x 10 = 23 channels.
    The number of channels of the convolution blocks controls how fast the output channels grow
    relative to the input channels, and is therefore also called the growth rate.
    '''
    blk = DenseBlock(2, 3, 10)
    # After the first convolution block X becomes (4, 10, 8, 8); concatenated with the original X (4, 3, 8, 8)
    # along dimension 1, X becomes (4, 13, 8, 8).
    # The second convolution block maps the (10 + 3) channels to 10, so its output Y is (4, 10, 8, 8).
    # X and Y are then concatenated along dimension 1, giving the final output (4, 23, 8, 8).
    X = torch.randn(4, 3, 8, 8)
    Y = blk(X)
    print(Y.shape)  # (4, 23, 8, 8)

    '''
    2. Transition layer
    '''
    blk = transition_block(23, 10)
    print(blk(Y).shape)  # torch.Size([4, 10, 4, 4])
DenseNet
import torch.nn as nn
import torch
from _08_dense_block import DenseBlock, transition_block

class DenseNet(nn.Module):
    def __init__(self):
        super(DenseNet, self).__init__()
        '''
        1. DenseNet first uses the same single convolutional layer and max pooling layer as ResNet.
        '''
        b1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        '''
        2. Next, analogous to the 4 residual blocks used by ResNet, DenseNet uses 4 dense blocks.
           As with ResNet, we can set how many convolution layers each dense block uses. Here we use 4,
           consistent with the ResNet-18 from earlier.
           The number of channels of the convolution layers in a dense block (i.e. the growth rate) is set
           to 32, so each dense block adds 128 channels.
        3. Between the modules, ResNet reduces the height and width with residual blocks of stride 2, while
           DenseNet uses transition layers to halve the height and width and halve the number of channels.
        '''
        # num_channels is the current number of channels
        num_channels, growth_rate = 64, 32
        num_convs_in_dense_blocks = [4, 4, 4, 4]
        blks = []
        for i, num_convs in enumerate(num_convs_in_dense_blocks):
            # add a dense block
            blks.append(DenseBlock(num_convs, num_channels, growth_rate))
            # number of output channels of the previous dense block
            num_channels += num_convs * growth_rate
            # add a transition layer between dense blocks to halve the number of channels
            if i != len(num_convs_in_dense_blocks) - 1:
                blks.append(transition_block(num_channels, num_channels // 2))
                num_channels = num_channels // 2
        '''
        4. As with ResNet, a global pooling layer and a fully connected layer are attached at the end
           to produce the output.
        '''
        self.model = nn.Sequential(
            b1,
            *blks,
            nn.BatchNorm2d(num_channels),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(num_channels, 10)
        )

    def forward(self, X):
        return self.model(X)

if __name__ == '__main__':
    net = DenseNet()
    X = torch.rand(size=(1, 1, 224, 224), dtype=torch.float32)
    for layer in net.model:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:', X.shape)
Sequential output shape: torch.Size([1, 64, 56, 56])
DenseBlock output shape: torch.Size([1, 192, 56, 56])
Sequential output shape: torch.Size([1, 96, 28, 28])
DenseBlock output shape: torch.Size([1, 224, 28, 28])
Sequential output shape: torch.Size([1, 112, 14, 14])
DenseBlock output shape: torch.Size([1, 240, 14, 14])
Sequential output shape: torch.Size([1, 120, 7, 7])
DenseBlock output shape: torch.Size([1, 248, 7, 7])
BatchNorm2d output shape: torch.Size([1, 248, 7, 7])
ReLU output shape: torch.Size([1, 248, 7, 7])
AdaptiveAvgPool2d output shape: torch.Size([1, 248, 1, 1])
Flatten output shape: torch.Size([1, 248])
Linear output shape: torch.Size([1, 10])
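For reference, the size of this simplified model can be checked directly (a quick snippet added here for illustration; the exact count depends on the configuration above):

# total number of trainable parameters of the simplified DenseNet
num_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print('number of trainable parameters:', num_params)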
1.4 The problem of DenseNet's excessive memory (GPU memory) consumption
Although DenseNet has high computational efficiency and relatively few parameters, it is not memory-friendly. Given the limited size of GPU memory, it is impossible to train deeper DenseNets.
1.4.1 Memory calculation
Assume that a dense block contains L layers. For the l-th layer: x_l = H_l([x_0, x_1, ..., x_{l-1}]).
Assume that the output feature map of each layer has size W×H with k channels, and that H consists of BN-ReLU-Conv(3x3). Then:
- Concat operation: a temporary feature map must be generated as the input of the l-th layer; the memory consumption is W×H×k×l.
- BN operation: a temporary feature map must be generated as the input of ReLU; the memory consumption is W×H×k×l.
- ReLU operation: can be performed in place, so no additional feature map is needed to store the ReLU output.
- Conv operation: the output feature map must be generated as the output of the l-th layer; this is a necessary overhead.
Therefore, in addition to the memory needed for the output feature maps of layers 1, 2, ..., L, the l-th layer also needs 2×W×H×k×l of memory to store the temporary feature maps generated in the middle.
The entire dense block therefore requires W×H×k×(L+1)×L of memory to store the temporary feature maps generated in the middle. That is, the memory consumption of a dense block is O(L^2), quadratic in the network depth.
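As a quick sanity check of this total (an illustrative snippet with arbitrary example sizes, not part of the original note): summing the per-layer temporary memory 2×W×H×k×l over l = 1..L gives exactly W×H×k×(L+1)×L.

W, H, k, L = 32, 32, 12, 10    # arbitrary example sizes
per_layer = [2 * W * H * k * l for l in range(1, L + 1)]
assert sum(per_layer) == W * H * k * (L + 1) * L
print(sum(per_layer))          # grows as O(L^2) in the depth L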
1.4.2 Why concatenation is necessary and why it consumes memory
- The Concat operation is necessary because convolution is computationally more efficient when its inputs are stored in a contiguous region of memory. In a dense block, the input of the l-th layer is obtained by concatenating the output feature maps of the previous layers along the channel dimension, and those output feature maps do not lie in contiguous memory.
- This memory consumption is not caused by the structure of the dense block itself, but by the deep learning library. When TensorFlow/PyTorch implement a neural network, they store the intermediate temporary nodes (such as the output of BN) so that the values of these temporary nodes can be read directly during the backpropagation stage.
- This is a compromise between time cost and space cost: computation in the backpropagation stage is saved by allocating more space to store temporary values.
1.4.3 Network parameters also consume memory
In addition to the memory consumed by the temporary feature maps, the network parameters also consume memory. Assuming that H consists of BN-ReLU-Conv(3x3), the number of parameters of the l-th layer is 9×l×k^2 (ignoring BN).
The number of parameters of the entire dense block is therefore 9k^2(L+1)L/2, that is, O(L^2).
- Since the number of parameters of DenseNet grows quadratically with the network depth, a DenseNet has more parameters and a larger network capacity. This is also an important factor in DenseNet's advantage over other networks.
- Usually W×H > 9×k/2, where W and H are the width and height of the feature map and k is the growth rate. Therefore, the memory consumed by the network parameters is much smaller than the memory consumed by the temporary feature maps.
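A similar sanity check for the parameter count, together with the W×H > 9×k/2 comparison above (again an illustrative snippet with arbitrary example sizes, not part of the original note):

W, H, k, L = 56, 56, 32, 12                      # arbitrary example sizes
params = sum(9 * l * k ** 2 for l in range(1, L + 1))
assert params == 9 * k ** 2 * (L + 1) * L // 2   # closed form from above
temp_feature_map_entries = W * H * k * (L + 1) * L
print(params, temp_feature_map_entries, W * H > 9 * k / 2)   # parameters << temporary feature maps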
1.5 DenseNet memory optimization: shared memory
The idea is again to exploit the compromise between time cost and space cost, but this time by sacrificing time in exchange for space.
The supporting observation is that the Concat and BN operations are computationally very cheap but expensive in terms of space, so this approach works very well for DenseNet.
1.5.1 Traditional practices
The left figure shows the l-th layer of a traditional dense block. First the feature maps are copied into a contiguous memory block, performing the concatenation during the copy. Then the BN, ReLU, and Conv operations are performed in sequence.
The temporary feature maps of this layer consume 2×W×H×k×l of memory, and the output feature map of this layer consumes W×H×k of memory.
- In addition, some implementations (such as LuaTorch) also need to allocate memory for the gradients of the backpropagation process, as shown in the lower half of the left figure. For example, when computing the gradients of the BN layer outputs, the gradient of the l-th output layer and the output of the BN layer are needed; storing these gradients requires an additional O(l×k) of memory.
- Other implementations (such as PyTorch and MXNet) use a shared memory region to store these gradients, and therefore only require O(k) of memory.
1.5.2 Shared memory practices
The right figure shows the l-th layer of the memory-optimized dense block. Two groups of pre-allocated shared memory regions (Shared memory Storage location) are used to temporarily store the output feature maps of the concat and BN operations.
First group of pre-allocated shared memory: the shared region for the concat operation. The concat outputs of layers 1, 2, ..., L are all written into this region, and the write of layer (l+1) overwrites the result of layer l.
- For the whole dense block, this shared region only needs W×H×k×L of memory (the size of the largest feature map), i.e. the memory consumption is O(kL) (compared with O(kL^2) for the traditional DenseNet).
- The subsequent BN operation reads its data directly from this shared region.
- Since the write of layer (l+1) overwrites the result of layer l, the data stored here is temporary and easily lost. Therefore, the result of the Concat operation of layer l needs to be recomputed during the backpropagation stage. Because the concat operation is computationally very cheap, this extra computation is inexpensive.
Second group of pre-allocated shared memory: the shared region for the BN operation. The BN outputs of layers 1, 2, ..., L are all written into this region, and the write of layer (l+1) overwrites the result of layer l.
- For the whole dense block, this shared region only needs W×H×k×L of memory (the size of the largest feature map), i.e. the memory consumption is O(kL) (compared with O(kL^2) for the traditional DenseNet).
- The subsequent convolution operation reads its data directly from this shared region.
- For the same reason as with the concat shared region, the results of the BN operation of layer l also need to be recomputed during the backpropagation stage. BN is also computationally very efficient; only about 5% extra computational cost is needed.
Since BN and concat operations are widely used in neural networks, this method of pre-allocating shared memory regions can be applied widely. It saves a large amount of memory at the cost of a small amount of extra computation time.
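In PyTorch, the most common way to obtain this "recompute instead of store" behavior is gradient checkpointing via torch.utils.checkpoint (torchvision's DenseNet exposes it through a memory_efficient flag). The sketch below only illustrates the idea applied to the conv_block structure used in this note; it is not the exact shared-buffer mechanism described above.

import torch
import torch.nn as nn
from torch.utils import checkpoint as cp

class CheckpointedDenseLayer(nn.Module):
    """One BN-ReLU-Conv(3x3) layer of a dense block that recomputes the cheap
    concat/BN/ReLU part during backpropagation instead of storing it."""
    def __init__(self, input_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(input_channels)
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(input_channels, growth_rate, kernel_size=3, padding=1)

    def _bn_function(self, *prev_features):
        # concat + BN + ReLU + Conv; when checkpointing is active, its intermediates
        # are freed after the forward pass and recomputed in the backward pass
        x = torch.cat(prev_features, dim=1)
        return self.conv(self.relu(self.bn(x)))

    def forward(self, prev_features):
        # prev_features: list of the feature maps produced by all earlier layers
        if self.training and any(f.requires_grad for f in prev_features):
            return cp.checkpoint(self._bn_function, *prev_features)
        return self._bn_function(*prev_features)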
2 Application examples of DenseNet on the Fashion-MNIST data set
2.1 Create DenseNet network model
As shown in 1.3.3.
2.2 Read the Fashion-MNIST data set
batch_size = 256
# To keep training on Fashion-MNIST short, reduce the input height and width from 224 to 96 to simplify computation
train_iter, test_iter = get_mnist_data(batch_size, resize=96)
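get_mnist_data is a helper from the author's own codebase; below is a minimal sketch of what such a helper might look like using torchvision (the root directory './data' and num_workers=4 are arbitrary choices here).

import torchvision
from torchvision import transforms
from torch.utils import data

def get_mnist_data(batch_size, resize=None):
    # Load Fashion-MNIST and return the training and test DataLoaders
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    train = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=trans, download=True)
    test = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=trans, download=True)
    return (data.DataLoader(train, batch_size, shuffle=True, num_workers=4),
            data.DataLoader(test, batch_size, shuffle=False, num_workers=4))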
2.3 Model training on GPU
from _08_DenseNet import DenseNet

# initialize the model
net = DenseNet()

lr, num_epochs = 0.1, 10
train_ch(net, train_iter, test_iter, num_epochs, lr, try_gpu())
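try_gpu and train_ch are likewise helpers from the author's codebase (in the style of the d2l book); a minimal sketch of equivalent functions, assuming cross-entropy loss and plain SGD, is given below.

import torch
from torch import nn

def try_gpu():
    # Return a GPU device if one is available, otherwise the CPU
    return torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_ch(net, train_iter, test_iter, num_epochs, lr, device):
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        net.train()
        for X, y in train_iter:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        # evaluate test accuracy at the end of each epoch
        net.eval()
        correct = total = 0
        with torch.no_grad():
            for X, y in test_iter:
                X, y = X.to(device), y.to(device)
                correct += (net(X).argmax(dim=1) == y).sum().item()
                total += y.numel()
        print(f'epoch {epoch + 1}, test acc {correct / total:.3f}')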