Deep Learning: Fully Convolutional Network (FCN)

A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels into pixel categories [36]. Unlike the convolutional neural networks introduced earlier, an FCN uses a transposed convolution layer to transform the height and width of the intermediate feature map back to those of the input image, so that the predictions correspond one-to-one with the input image in the spatial dimensions (height and width): given a position in the spatial dimensions, the output in the channel dimension is the category prediction for the pixel at that position.

We first import the packages or modules needed for the experiment, and then explain what a transposed convolutional layer is.

In [1]: %matplotlib inline
        import d2lzh as d2l
        from mxnet import gluon, image, init, nd
        from mxnet.gluon import data as gdata, loss as gloss, model_zoo, nn
        import numpy as np
        import sys

9.10.1 Transposed Convolutional Layer

As the name implies, the transposed convolutional layer is named after the matrix transpose operation. In fact, convolution can also be realized by matrix multiplication. In the following example, we define an input X with both height and width of 4, and a convolution kernel K with both height and width of 3. We print the output of the two-dimensional convolution operation and the convolution kernel. As you can see, the output has height and width of 2.

In [2]: X = nd.arange(1, 17).reshape((1, 1, 4, 4))
        K = nd.arange(1, 10).reshape((1, 1, 3, 3)) 
        conv = nn.Conv2D(channels=1, kernel_size=3) 
        conv.initialize(init.Constant(K))
        conv(X), K

Out[2]: (
         [[[[348. 393.]
            [528. 573.]]]]
         <NDArray 1x1x2x2 @cpu(0)>, 
         [[[[1. 2. 3.]
            [4. 5. 6.]
            [7. 8. 9.]]]]
         <NDArray 1x1x3x3 @cpu(0)>)

Below we rewrite the convolution kernel K into a sparse matrix W containing a large number of zero elements, that is, the weight matrix. The shape of the weight matrix is (4, 16), and its non-zero elements come from the elements of the convolution kernel K. Flatten the input X row by row to get a vector of length 16. Then multiply W by the vectorized X to get a vector of length 4. After reshaping it, we get the same result as the convolution operation above. As you can see, in this example we have implemented the convolution operation using matrix multiplication.

In [3]: W, k = nd.zeros((4, 16)), nd.zeros(11)
        k[:3], k[4:7], k[8:] = K[0, 0, 0, :], K[0, 0, 1, :], K[0, 0, 2, :]
        W[0, 0:11], W[1, 1:12], W[2, 4:15], W[3, 5:16] = k, k, k, k
        nd.dot(W, X.reshape(16)).reshape((1, 1, 2, 2)), W

Out[3]: (
         [[[[348. 393.]
            [528. 573.]]]]
         <NDArray 1x1x2x2 @cpu(0)>,
         [[1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0. 0. 0. 0. 0.]
          [0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0. 0. 0. 0.]
          [0. 0. 0. 0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0.]
          [0. 0. 0. 0. 0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9.]]
         <NDArray 4x16 @cpu(0)>)

Now we describe the convolution operation from the perspective of matrix multiplication. Suppose the input vector is x and the weight matrix is W. The forward computation function of convolution can be implemented as multiplying the function input by the weight matrix and outputting the vector y = Wx. We know that backpropagation needs to follow the chain rule. Since ∇x = Wᵀ∇y, the backpropagation function of convolution can be implemented as multiplying the function input by the transposed weight matrix Wᵀ. The transposed convolution layer exactly swaps the forward computation function and the backpropagation function of the convolution layer: these two functions can be regarded as multiplying the function input vector by Wᵀ and W, respectively.
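As a quick numerical check of this claim (our own sketch, not part of the book's code), we can use MXNet's autograd with the W and X defined above: the gradient of y = Wx with respect to x is exactly Wᵀ applied to the upstream gradient.

    from mxnet import autograd

    x = X.reshape(16)
    x.attach_grad()
    with autograd.record():
        y = nd.dot(W, x)          # forward: multiply by W
    y.backward()                  # upstream gradient defaults to ones
    # Backpropagation multiplies by W^T: x.grad equals dot(W.T, ones(4)).
    (x.grad == nd.dot(W.T, nd.ones(4))).sum()  # -> 16, every entry matches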

It is not difficult to see that the transposed convolution layer can be used to swap the shapes of the input and output of a convolution layer. Let us continue to describe convolution with matrix multiplication. Let the weight matrix W have shape (4, 16); for an input vector of length 16, the convolution forward computation outputs a vector of length 4. If the input vector has length 4, the transposed weight matrix Wᵀ has shape (16, 4), so the transposed convolution layer outputs a vector of length 16. In model design, transposed convolution layers are often used to transform smaller feature maps into larger ones. In a fully convolutional network, when the input is a feature map with small height and width, the transposed convolution layer can be used to enlarge the height and width to those of the input image.
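Continuing the small example above (our own illustration, not from the book), multiplying by W maps a length-16 vector to length 4, while multiplying by Wᵀ maps a length-4 vector back to length 16:

    y = nd.dot(W, X.reshape(16))    # convolution direction: length 16 -> 4
    z = nd.dot(W.T, y)              # transposed direction:  length 4 -> 16
    y.shape, z.shape                # -> ((4,), (16,))
    # Note: z only recovers the *shape* of X, not its values; transposed
    # convolution swaps input and output shapes but is not an inverse.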

Let's look at an example. We construct a convolution layer conv and let the input X have shape (1, 3, 64, 64). The convolution output Y has its number of channels increased to 10, but its height and width are halved.

In [4]: conv = nn.Conv2D(10, kernel_size=4, padding=1, strides=2) 
        conv.initialize()

        X = nd.random.uniform(shape=(1, 3, 64, 64)) 
        Y = conv(X)
        Y.shape

Out[4]: (1, 10, 32, 32)

Next, we construct the transposed convolution layer conv_trans by creating a Conv2DTranspose instance. Here we set the kernel shape, padding, and stride of conv_trans to the same values as those in conv, and set the number of output channels to 3. When the input is the output Y of the convolution layer conv, the output of the transposed convolution layer has the same height and width as the input of the convolution layer: the transposed convolution layer magnifies the height and width of the feature map by a factor of 2.

In [5]: conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2) 
        conv_trans.initialize()
        conv_trans(Y).shape

Out[5]: (1, 3, 64, 64)

In some literature, transposed convolution is also referred to as fractionally-strided convolution [12].

9.10.2 Constructing the Model

Here we give the most basic design of a fully convolutional network model. As shown in Figure 9-11, the fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of categories through a 1 × 1 convolution layer, and finally transforms the height and width of the feature map to those of the input image through the transposed convolution layer. The model output has the same height and width as the input image and corresponds to it one-to-one in spatial positions: the final output channels contain the category prediction for the pixel at each spatial position.

Figure 9-11 Fully convolutional network

Below we use a ResNet-18 model pre-trained on the ImageNet dataset to extract image features, and denote the network instance as pretrained_net. As you can see, the last two layers of the model member variable features are the global average pooling layer GlobalAvgPool2D and the sample flattening layer Flatten, while the output module contains a fully connected layer used for output. A fully convolutional network does not need these layers.

In [6]: pretrained_net = model_zoo.vision.resnet18_v2(pretrained=True) 
        pretrained_net.features[-4:], pretrained_net.output

Out[6]: (HybridSequential(
           (0): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
           (1): Activation(relu)
           (2): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True)
           (3): Flatten
         ), Dense(512 -> 1000, linear))

Below we create a fully convolutional network instance net. It copies all the layers of the pretrained_net instance's member variable features except the last two, together with the model parameters obtained by pre-training.

In [7]: net = nn.HybridSequential()
        for layer in pretrained_net.features[:-2]: 
            net.add(layer)

Given an input with a height and width of 320 and 480 respectively, the forward calculation of net reduces the input height and width to 1/32 of the original, namely 10 and 15.

In [8]: X = nd.random.uniform(shape=(1, 3, 320, 480)) 
        net(X).shape

Out[8]: (1, 512, 10, 15)

Next, we use a 1 × 1 convolution layer to transform the number of output channels into the number of categories of the Pascal VOC2012 dataset, namely 21. Finally, we need to magnify the height and width of the feature map by a factor of 32 to change them back to the height and width of the input image. Recall the method for calculating the output shape of a convolution layer described in Section 5.2. Since (320 − 64 + 16 × 2 + 32)/32 = 10 and (480 − 64 + 16 × 2 + 32)/32 = 15, we construct a transposed convolution layer with a stride of 32, and set the height and width of the convolution kernel to 64 and the padding to 16. It is not difficult to see that if the stride is s, the padding is s/2 (assuming s/2 is an integer), and the height and width of the convolution kernel are 2s, then the transposed convolution kernel magnifies the input height and width by a factor of s.

In [9]: num_classes = 21
        net.add(nn.Conv2D(num_classes, kernel_size=1), 
                nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16,
                                   strides=32))
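As a quick sanity check of this rule (our own sketch, not part of the book's code), we can try a few values of s and confirm that the output height and width are s times the input:

    # For stride s, padding s/2, and kernel 2s, the transposed convolution
    # output size is (n - 1) * s - 2 * (s / 2) + 2 * s = n * s.
    for s in (2, 4, 32):
        tconv = nn.Conv2DTranspose(1, kernel_size=2 * s, padding=s // 2, strides=s)
        tconv.initialize()
        print(s, tconv(nd.zeros((1, 1, 10, 15))).shape)
    # prints: 2 (1, 1, 20, 30) / 4 (1, 1, 40, 60) / 32 (1, 1, 320, 480)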

9.10.3 Initializing the Transposed Convolutional Layer

We already know that the transposed convolution layer can magnify a feature map. In image processing, we sometimes need to magnify an image, that is, to upsample it. There are many methods for upsampling, and bilinear interpolation is a commonly used one. Simply put, to get the pixel of the output image at coordinates (x, y), we first map these coordinates to coordinates (x′, y′) on the input image, for example, according to the ratio of the input size to the output size. The mapped x′ and y′ are usually real numbers. Then we find the 4 pixels closest to coordinates (x′, y′) on the input image. Finally, the pixel of the output image at coordinates (x, y) is computed from these 4 pixels and their relative distances to (x′, y′). Upsampling by bilinear interpolation can be implemented by a transposed convolution layer whose convolution kernel is constructed by the bilinear_kernel function below. Due to space limitations, we only give the implementation of the bilinear_kernel function and do not discuss the principle of the algorithm.
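To make the 4-pixel weighting concrete before looking at the kernel, here is a minimal NumPy sketch of our own (bilinear_sample is a hypothetical helper, not the book's code):

    def bilinear_sample(img, x, y):
        # Interpolate a 2-D array img at real-valued coordinates (x, y).
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1 = min(x0 + 1, img.shape[0] - 1)
        y1 = min(y0 + 1, img.shape[1] - 1)
        dx, dy = x - x0, y - y0
        # Weight the 4 nearest pixels by their relative distances to (x, y).
        return ((1 - dx) * (1 - dy) * img[x0, y0] + dx * (1 - dy) * img[x1, y0] +
                (1 - dx) * dy * img[x0, y1] + dx * dy * img[x1, y1])

    bilinear_sample(np.arange(16.0).reshape(4, 4), 1.5, 2.5)  # -> 8.5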

In [10]: def bilinear_kernel(in_channels, out_channels, kernel_size): 
             factor = (kernel_size + 1) // 2
             if kernel_size % 2 == 1: 
                 center = factor - 1
             else:
                 center = factor - 0.5
             og = np.ogrid[:kernel_size, :kernel_size]
             filt = (1 - abs(og[0] - center) / factor) * \
                    (1 - abs(og[1] - center) / factor)
             weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size),
                               dtype='float32')
             weight[range(in_channels), range(out_channels), :, :] = filt
             return nd.array(weight)

Let's experiment with upsampling by bilinear interpolation using a transposed convolution layer. We construct a transposed convolution layer that magnifies the input height and width by a factor of 2, and initialize its convolution kernel with the bilinear_kernel function.

In [11]: conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2) 
         conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))

Read the image X, and record the result of upsampling as Y. In order to print the image, we need to adjust the position of the channel dimension.

In [12]: img = image.imread('../img/catdog.jpg')
         X = img.astype('float32').transpose((2, 0, 1)).expand_dims(axis=0) / 255
         Y = conv_trans(X)
         out_img = Y[0].transpose((1, 2, 0))

As you can see, the transposed convolution layer magnifies both the height and width of the image by a factor of 2. It is worth mentioning that, apart from the different coordinate scales, the image enlarged by bilinear interpolation looks no different from the original image printed in Section 9.3.

In [13]: d2l.set_figsize()
         print('input image shape:', img.shape) 
         d2l.plt.imshow(img.asnumpy()); 
         print('output image shape:', out_img.shape) 
         d2l.plt.imshow(out_img.asnumpy());

input image shape: (561, 728, 3)
output image shape: (1122, 1456, 3)

In a fully convolutional network, we initialize the transposed convolution layer to perform upsampling by bilinear interpolation. For the 1 × 1 convolution layer, we use Xavier random initialization.

In [14]: net[-1].initialize(init.Constant(bilinear_kernel(num_classes, num_classes,
                                                        64)))
         net[-2].initialize(init=init.Xavier())

9.10.4 Reading the Dataset

We use the method described in Section 9.9 to read the dataset. Here we specify the shape of the randomly cropped output image as 320 × 480: both the height and the width are divisible by 32.

In [15]: crop_size, batch_size, colormap2label = (320, 480), 32, nd.zeros(256**3)
         for i, cm in enumerate(d2l.VOC_COLORMAP): 
             colormap2label[(cm[0] * 256 + cm[1]) * 256 + cm[2]] = i
         voc_dir = d2l.download_voc_pascal(data_dir='../data')

         num_workers = 0 if sys.platform.startswith('win32') else 4
         train_iter = gdata.DataLoader(
             d2l.VOCSegDataset(True, crop_size, voc_dir, colormap2label), batch_size, 
             shuffle=True, last_batch='discard', num_workers=num_workers)
         test_iter = gdata.DataLoader(
             d2l.VOCSegDataset(False, crop_size, voc_dir, colormap2label), batch_size, 
             last_batch='discard', num_workers=num_workers)

read 1114 examples
read 1078 examples

9.10.5 Training the Model

Now we can start training the model. The loss function and accuracy calculation here are not essentially different from those used in image classification. Because we use the channels of the transposed convolution layer to predict pixel categories, the axis=1 (channel dimension) option is specified in SoftmaxCrossEntropyLoss. In addition, the model calculates accuracy based on whether the predicted category of each pixel is correct.
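For instance, a small shape check of our own (not from the book) shows how axis=1 pairs a (batch, 21, height, width) prediction with an integer label map of shape (batch, height, width):

    # With axis=1 the class scores live in dimension 1; the per-pixel
    # cross-entropy is averaged over each image, giving one value per sample.
    loss_fn = gloss.SoftmaxCrossEntropyLoss(axis=1)
    loss_fn(nd.random.uniform(shape=(2, 21, 320, 480)),
            nd.zeros((2, 320, 480))).shape  # -> (2,)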

In [16]: ctx = d2l.try_all_gpus()
         loss = gloss.SoftmaxCrossEntropyLoss(axis=1) 
         net.collect_params().reset_ctx(ctx)
         trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1,
                                                            'wd': 1e-3})
         d2l.train(train_iter, test_iter, net, loss, trainer, ctx, num_epochs=5)

training on [gpu(0), gpu(1), gpu(2), gpu(3)]
epoch 1, loss 1.3306, train acc 0.726, test acc 0.811, time 17.5 sec
epoch 2, loss 0.6524, train acc 0.811, test acc 0.820, time 16.6 sec
epoch 3, loss 0.5364, train acc 0.838, test acc 0.812, time 16.3 sec
epoch 4, loss 0.4650, train acc 0.856, test acc 0.842, time 16.5 sec
epoch 5, loss 0.4017, train acc 0.872, test acc 0.851, time 16.3 sec

9.10.6 Predicting Pixel Categories

When predicting, we need to standardize the input image in each channel and convert it into the four-dimensional input format required by the convolutional neural network.

In [17]: def predict(img):
             X = test_iter._dataset.normalize_image(img)
             X = X.transpose((2, 0, 1)).expand_dims(axis=0)
             pred = nd.argmax(net(X.as_in_context(ctx[0])), axis=1)
             return pred.reshape((pred.shape[1], pred.shape[2])) 

In order to visualize the predicted category of each pixel, we map the predicted categories back to their label colors in the dataset.

In [18]: def label2image(pred):
             colormap = nd.array(d2l.VOC_COLORMAP, ctx=ctx[0], dtype='uint8')
             X = pred.astype('int32')
             return colormap[X, :]

The images in the test dataset vary in size and shape. Since the model uses a transposed convolution layer with a stride of 32, when the height or width of the input image is not divisible by 32, the output height or width of the transposed convolution layer deviates from the input image's size. To solve this problem, we can crop multiple rectangular regions whose heights and widths are integer multiples of 32 from the image, and perform forward computation separately on the pixels in these regions; the union of these regions must completely cover the input image. When a pixel is covered by multiple regions, the average of the transposed convolution layer outputs from the forward computations over the different regions can be used as input to the softmax operation to predict its category (see the sketch below).
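The book does not implement this tiling scheme; below is a hedged sketch of our own (predict_tiled is a hypothetical helper) showing one way to cover the image with 320 × 480 windows and average the outputs where windows overlap:

    def predict_tiled(img, tile_h=320, tile_w=480):
        # Assumes img height >= tile_h and width >= tile_w.
        h, w = img.shape[0], img.shape[1]
        sum_out = nd.zeros((num_classes, h, w), ctx=ctx[0])
        counts = nd.zeros((1, h, w), ctx=ctx[0])
        # Window anchors; the last window is pinned to the bottom/right edge
        # so that the union of windows covers the whole image.
        tops = sorted(set(range(0, h - tile_h, tile_h)) | {h - tile_h})
        lefts = sorted(set(range(0, w - tile_w, tile_w)) | {w - tile_w})
        for top in tops:
            for left in lefts:
                crop = image.fixed_crop(img, left, top, tile_w, tile_h)
                X = test_iter._dataset.normalize_image(crop)
                X = X.transpose((2, 0, 1)).expand_dims(axis=0)
                out = net(X.as_in_context(ctx[0]))[0]  # (num_classes, tile_h, tile_w)
                sum_out[:, top:top + tile_h, left:left + tile_w] += out
                counts[:, top:top + tile_h, left:left + tile_w] += 1
        # Average overlapping predictions, then take the per-pixel argmax.
        return nd.argmax(sum_out / counts, axis=0)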

For simplicity, we only read a few larger test images and crop a 320 × 480 region starting from the upper-left corner of each image: only this region is used for prediction. For each input image, we print the cropped region first, then print the prediction result, and finally print the labeled categories (see also color illustration 20).

In [19]: test_images, test_labels = d2l.read_voc_images(is_train=False) 
         n, imgs = 4, []
         for i in range(n):
             crop_rect = (0, 0, 480, 320)
             X = image.fixed_crop(test_images[i], *crop_rect)
             pred = label2image(predict(X))
             imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
         d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n);

Summary

The convolution operation can be realized by matrix multiplication.

A fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of categories through a 1 × 1 convolution layer, and finally transforms the height and width of the feature map to those of the input image through the transposed convolution layer, so as to output the category of each pixel.

In a fully convolutional network, the transposed convolution layer can be initialized to perform upsampling by bilinear interpolation.

This article is excerpted from Dive into Deep Learning.

This book aims to deliver an interactive learning experience about deep learning. It not only explains the principles of deep learning algorithms but also demonstrates their implementation and operation. Unlike traditional books, each section of this book is a Jupyter notebook that can be downloaded and run, combining text, formulas, images, code, and running results. In addition, readers can visit the book's website and join the discussion of its contents.

The content of the book is divided into 3 parts: the first part introduces the background of deep learning, provides the prerequisite knowledge, and covers the basic concepts and techniques of deep learning; the second part describes the important components of deep learning computation, and explains the convolutional neural networks and recurrent neural networks that have made deep learning successful in many fields in recent years; the third part discusses optimization algorithms, examines the important factors affecting the computational performance of deep learning, and describes important applications of deep learning in computer vision and natural language processing.

The book covers the methods and practices of deep learning, and is mainly intended for college students, engineers, and researchers. Reading it requires a basic knowledge of Python programming, as well as the linear algebra, differentiation, and probability basics described in the appendix.

 
