Three-dimensional convolution

Link: https://www.cnblogs.com/xiaojianliu/articles/9905278.html#_label0

Convolutions over volumes

Suppose you want to detect features not only in grayscale images but also in RGB color images. If the color image is 6×6×3, the 3 refers to the three color channels; you can think of the image as a stack of three 6×6 matrices. To detect edges or other features, you convolve it not with the original 3×3 filter but with a three-dimensional filter of dimension 3×3×3, so this filter also has three layers, corresponding to the red, green, and blue channels.

Giving these dimensions names: in the original image, the first 6 is the height, the second 6 is the width, and the 3 is the number of channels. Likewise, your filter has a height, a width, and a number of channels, and the number of channels of the image must match the number of channels of the filter, so these two numbers (the two marked by the purple box) must be equal.

The output of this convolution operation is a 4×4 image. Note that it is 4×4×1 — the last dimension is 1, not 3.

So this is a 6×6×3 image and this is a 3×3×3 filter; the last number, the channel count, must match between the two. To simplify the picture of this 3×3×3 filter, instead of drawing it as a stack of three matrices, we draw it as a three-dimensional cube.

To compute the output of this convolution operation, first place the 3×3×3 filter in the upper-left corner. This 3×3×3 filter has 27 numbers, 27 parameters — three 3×3 slices. Take each of these 27 numbers in turn and multiply it by the corresponding number in the red, green, or blue channel: first the 9 numbers of the red channel, then the green channel, then the blue channel, multiplied by the corresponding 27 numbers covered by the yellow cube on the left. Then add these products together to get the first output number.

To compute the next output, you slide the cube over by one unit, multiply the 27 pairs of numbers, add them all together to get the next output, and so on.
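The sliding computation above can be sketched in NumPy. This is a minimal illustration of the idea (stride 1, no padding), not an efficient implementation; the function name and random inputs are just for demonstration:

```python
import numpy as np

def conv3d_single_filter(image, filt):
    """Convolve an H x W x C image with an f x f x C filter (stride 1, no padding).

    Each output value is the sum of all f*f*C element-wise products, so the
    three channels collapse into a single 2-D feature map.
    """
    h, w, c = image.shape
    f = filt.shape[0]
    assert filt.shape == (f, f, c), "filter channel count must match the image"
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(h - f + 1):
        for j in range(w - f + 1):
            # For a 3x3x3 filter: 27 products, summed into one number
            out[i, j] = np.sum(image[i:i + f, j:j + f, :] * filt)
    return out

# A 6x6x3 image convolved with a 3x3x3 filter gives a 4x4 output.
image = np.random.rand(6, 6, 3)
filt = np.random.rand(3, 3, 3)
print(conv3d_single_filter(image, filt).shape)  # (4, 4)
```

Note that the output is two-dimensional even though the input and filter are both three-dimensional, because the sum runs over all three channels at once.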

So what can this do? For example, suppose this 3×3×3 filter should detect edges only in the red channel of the image. Then you can set its first slice to:

while the green channel slice is all zeros:

and the blue slice is also all zeros. Stacking these three together forms a 3×3×3 filter that detects vertical edges, but only in the red channel. Alternatively, if you don't care which color channel a vertical edge appears in, you can use a filter like this:

where the same values appear in all three channels. With this second choice of parameters you have an edge detector — a 3×3×3 detector that responds to vertical edges in any color channel. Different parameter choices give you different feature detectors, all of them 3×3×3 filters.
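The two filters just described can be written out explicitly. This sketch assumes channel order R, G, B and uses the 1/0/−1 vertical-edge kernel from the earlier grayscale examples:

```python
import numpy as np

# Vertical-edge kernel: positive left column, negative right column.
vertical = np.array([[1.0, 0.0, -1.0],
                     [1.0, 0.0, -1.0],
                     [1.0, 0.0, -1.0]])

# Filter 1: vertical edges in the red channel only (green and blue all zero).
red_only = np.zeros((3, 3, 3))
red_only[:, :, 0] = vertical

# Filter 2: vertical edges in any channel (same kernel in all three slices).
any_channel = np.stack([vertical, vertical, vertical], axis=-1)

print(red_only.shape, any_channel.shape)  # (3, 3, 3) (3, 3, 3)
```

Both filters have the same 3×3×3 shape; only the parameter values differ, which is exactly what it means for different parameter choices to give different feature detectors.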

By convention in computer vision, when your input has a given height, width, and number of channels, your filter can have a different height and a different width, but its number of channels must match the input's. In principle it is fine for a filter to attend only to the red channel, or only to the green or blue channel.

Look at this convolution cube again: a 6×6×3 input image convolved with a 3×3×3 filter gives a 4×4 two-dimensional output.

What if you want to use multiple filters at the same time?

Convolving this 6×6×3 image with this 3×3×3 filter gives a 4×4 output. This first filter might be a vertical edge detector, or it might learn to detect some other feature. A second filter, drawn in orange, might be a horizontal edge detector.

Convolving with the first filter gives the first 4×4 output, and convolving with the second filter gives a different 4×4 output. When the convolutions are done, take the two 4×4 outputs, put the first in front and the second behind, and stack them together so that you get a 4×4×2 output cube. That is, starting from a 6×6×3 image and convolving it with two different 3×3×3 filters, you get two 4×4 outputs that stack into a 4×4×2 cube, where the 2 comes from using two different filters.
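The stacking step can be sketched by extending the single-filter loop to a list of filters. This is an illustrative, unoptimized sketch (stride 1, no padding); the function name and random stand-in filters are hypothetical:

```python
import numpy as np

def conv_forward(image, filters):
    """Convolve an n x n x nc image with each f x f x nc filter in `filters`
    (stride 1, no padding) and stack the 2-D outputs along a new channel axis."""
    h, w, c = image.shape
    f = filters[0].shape[0]
    maps = []
    for filt in filters:
        out = np.zeros((h - f + 1, w - f + 1))
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                out[i, j] = np.sum(image[i:i + f, j:j + f, :] * filt)
        maps.append(out)
    # Output channels = number of filters used
    return np.stack(maps, axis=-1)

image = np.random.rand(6, 6, 3)
vertical = np.random.rand(3, 3, 3)    # stand-in for a vertical edge detector
horizontal = np.random.rand(3, 3, 3)  # stand-in for a horizontal edge detector
print(conv_forward(image, [vertical, horizontal]).shape)  # (4, 4, 2)
```

The last axis of the result has size 2 precisely because two filters were applied, which is the "2" in the 4×4×2 cube.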

To summarize the dimensions: if you have an n×n×n_c input image (n_c is the number of channels), in this example 6×6×3, and you convolve it with an f×f×n_c filter, in this example 3×3×3, then you get an (n−f+1)×(n−f+1)×n_c′ output:

Here n_c′ is the number of channels in the next layer, which is the number of filters you use. In our example that gives 4×4×2. This assumes a stride of 1 and no padding; if you use a different stride or add padding, the n−f+1 values change accordingly.
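The dimension formula, including the stride and padding generalization just mentioned, can be checked with a few lines of code (the function name is illustrative):

```python
def conv_output_shape(n, f, n_filters, stride=1, pad=0):
    """Output shape of a square convolution.

    With stride s and padding p, the spatial size generalizes from n - f + 1
    to floor((n + 2p - f) / s) + 1; the channel count equals the filter count.
    """
    size = (n + 2 * pad - f) // stride + 1
    return (size, size, n_filters)

print(conv_output_shape(6, 3, 2))            # (4, 4, 2) -- the example above
print(conv_output_shape(6, 3, 2, stride=2))  # (2, 2, 2)
```

With stride 1 and no padding this reduces to the (n−f+1)×(n−f+1)×n_c′ formula in the text.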

This idea of convolving over volumes is really useful: you can now operate directly on three-channel RGB images. More importantly, you are not limited to detecting two features such as vertical and horizontal edges — you can detect 10, 128, or several hundred different features, and the number of output channels will equal the number of features you are detecting.

Origin: blog.csdn.net/ch206265/article/details/109579614