[Hands-on Deep Learning Notes] From Fully Connected Layers to Convolutions

A principle of object recognition: invariance

Translation invariance

No matter where in the image the detected object appears, the first few layers of the neural network should respond similarly to the same image region. This is "translation invariance".

Locality

The first few layers of the neural network should only explore local regions of the input image, without paying too much attention to relationships between distant regions. This is the principle of "locality". Eventually, these local features can be aggregated to make predictions at the whole-image level.

Let me state my view up front:
Convolution is an improvement of the fully connected layer, made in order to satisfy these invariance principles.

Limitations of Fully Connected Layers

Too many parameters

For high-dimensional perceptual data, networks built from fully connected layers alone may become impractical.

Suppose we have an adequately sized dataset of labeled photos, each with one million pixels, so each input to the network has one million dimensions. Even if the hidden-layer dimension is reduced to 1,000, this fully connected layer would have $10^6 \times 10^3 = 10^9$ parameters. Training such a model is hardly feasible: it would require a large number of GPUs, experience in distributed optimization, and extraordinary patience.
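As a quick sanity check on that count, a few lines of Python (the layer itself is never allocated; the dimensions are the hypothetical ones above):

```python
# Parameter count of the hypothetical fully connected layer,
# computed without ever allocating the (infeasible) weight matrix.
d_in = 10**6      # one million input dimensions, one per pixel
d_hidden = 10**3  # hidden layer reduced to 1000 units

n_weights = d_in * d_hidden  # 10^9 weights
n_biases = d_hidden          # one bias per hidden unit
print(f"{n_weights + n_biases:,} parameters")  # 1,000,001,000
```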

Weak invariance

Let's first take a look at what the fully connected layer looks like when processing two-dimensional data.

For one-dimensional input, the fully connected layer works like this:
$$[\mathbf{H}]_{i} = [\mathbf{U}]_{i} + \sum_j [\mathsf{W}]_{i,j} [\mathbf{X}]_{j}$$
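As a minimal NumPy sketch of this formula (the dimensions below are arbitrary illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4                      # illustrative dimensions
X = rng.standard_normal(d_in)           # input vector [X]_j
W = rng.standard_normal((d_out, d_in))  # weight matrix [W]_{i,j}
U = rng.standard_normal(d_out)          # bias vector [U]_i

H = U + W @ X  # the matrix-vector product carries out the sum over j
```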

In order to preserve spatial information, we now feed in two-dimensional data directly.

So now the input is two-dimensional, carrying width and height information, and the output is two-dimensional as well. Correspondingly, the weight of the fully connected layer also gains width and height dimensions, becoming a four-dimensional weight tensor.

An example to understand the structure of this four-dimensional weight tensor:
Suppose our input is $20 \times 20$ and we need a $15 \times 15$ output.
Then $\mathsf{W}$ has shape $15 \times 15 \times 20 \times 20$: think of it as a $15 \times 15$ large matrix, each element of which is itself a $20 \times 20$ small matrix.
To compute the output at $(3, 4)$, take the small matrix at position $(3, 4)$ of the large matrix, perform an element-by-element weighted sum with the input, then add the value at position $(3, 4)$ of the bias matrix to obtain the output.
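A small NumPy sketch of this worked example (the $20 \times 20$ / $15 \times 15$ shapes are the hypothetical ones above; the random values are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 20))          # 2-D input
W = rng.standard_normal((15, 15, 20, 20))  # 15x15 grid of 20x20 "small matrices"
U = rng.standard_normal((15, 15))          # bias matrix

# Output at (3, 4): element-wise weighted sum of X against the
# small matrix stored at position (3, 4), plus the bias there.
h_34 = U[3, 4] + np.sum(W[3, 4] * X)

# All outputs at once, summing over the last two axes:
H = U + np.einsum('ijkl,kl->ij', W, X)
assert np.isclose(H[3, 4], h_34)
```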

Then similarly, the fully connected layer that processes two-dimensional input can be expressed as:
$$[\mathbf{H}]_{i,j} = [\mathbf{U}]_{i,j} + \sum_k \sum_l [\mathsf{W}]_{i,j,k,l} [\mathbf{X}]_{k,l}$$
$[\mathbf{X}]_{k,l}$ denotes the input pixel at position $(k, l)$
$[\mathbf{H}]_{i,j}$ denotes the output pixel at position $(i, j)$
$[\mathbf{U}]_{i,j}$ is the bias added to the output at position $(i, j)$

Next, perform a subscript substitution: let $k = i + a$ and $l = j + b$. Then:

$$\begin{aligned} [\mathbf{H}]_{i,j} &= [\mathbf{U}]_{i,j} + \sum_k \sum_l [\mathsf{W}]_{i,j,k,l} [\mathbf{X}]_{k,l} \\ &= [\mathbf{U}]_{i,j} + \sum_a \sum_b [\mathsf{V}]_{i,j,a,b} [\mathbf{X}]_{i+a,j+b}. \end{aligned}$$
$\mathsf{V}$ is a reindexing of $\mathsf{W}$, with $[\mathsf{V}]_{i,j,a,b} = [\mathsf{W}]_{i,j,i+a,j+b}$. We can treat $[\mathbf{X}]_{i,j}$ as the center pixel: by varying the offsets $a$ and $b$, we can access any position of $[\mathbf{X}]$.
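To convince ourselves the substitution changes nothing, here is a small numerical check (a throwaway $5 \times 5$ example; the loop bounds simply keep $i+a$ and $j+b$ inside the image):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n))
W = rng.standard_normal((n, n, n, n))
U = rng.standard_normal((n, n))

# Direct form: H[i,j] = U[i,j] + sum_{k,l} W[i,j,k,l] * X[k,l]
H_direct = U + np.einsum('ijkl,kl->ij', W, X)

# Reindexed form: k = i + a, l = j + b, with V[i,j,a,b] = W[i,j,i+a,j+b]
H_reindexed = np.empty((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for a in range(-i, n - i):      # keeps 0 <= i + a < n
            for b in range(-j, n - j):  # keeps 0 <= j + b < n
                s += W[i, j, i + a, j + b] * X[i + a, j + b]
        H_reindexed[i, j] = U[i, j] + s

assert np.allclose(H_direct, H_reindexed)
```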

In this form, we can see that if the coordinates of the center pixel change, the weights $\mathsf{V}$ responsible for the computation change with them, which violates translation invariance: $\mathsf{V}$ should not depend on the coordinates of the center pixel.
At the same time, by varying the offsets $a$ and $b$ we can access any position of $[\mathbf{X}]$, which violates locality as well.

Translation invariance adjustment

We should keep the weights $\mathsf{V}$ responsible for the computation fixed when the position of the center pixel changes:
$$[\mathsf{V}]_{i,j,a,b} = [\mathbf{V}]_{a,b}$$
With the bias likewise fixed to a position-independent constant, $[\mathbf{U}]_{i,j} = u$, we get:
$$[\mathbf{H}]_{i,j} = u + \sum_a \sum_b [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}.$$
In this way, whenever $[\mathbf{X}]_{i,j}$ is used as the center pixel, the pixels around it are weighted by the same matrix. This is convolution.
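Here is a minimal sketch of this shared-weight form, under the same assumptions as the check above (a small $n \times n$ input; the offsets still range over the whole image, so $\mathbf{V}$ is stored as a $(2n-1) \times (2n-1)$ array indexed by $(a + n - 1,\ b + n - 1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n))
V = rng.standard_normal((2 * n - 1, 2 * n - 1))  # one weight per offset (a, b)
u = 0.1                                          # position-independent bias

H = np.empty((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for a in range(-i, n - i):
            for b in range(-j, n - j):
                s += V[a + n - 1, b + n - 1] * X[i + a, j + b]
        H[i, j] = u + s  # every center pixel is weighted by the same V
```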

Locality adjustment

We restrict how far the offsets $a$ and $b$ may stray from the center pixel $(i, j)$:
that is, whenever $|a| > \Delta$ or $|b| > \Delta$, we set $[\mathbf{V}]_{a,b} = 0$. Finally, we obtain:
$$[\mathbf{H}]_{i,j} = u + \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}.$$

This is the convolutional layer (Convolutional Layer): $\mathbf{V}$ is the legendary convolution kernel (Kernel), also called a filter (Filter), and its size is $(2\Delta + 1) \times (2\Delta + 1)$.
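Putting the two adjustments together, here is a minimal sketch of such a layer (my own illustration, not the book's code; pixels outside the image are treated as zero, one common boundary convention):

```python
import numpy as np

def conv2d(X, V, u, delta):
    """H[i,j] = u + sum over |a| <= delta, |b| <= delta of V[a,b] * X[i+a, j+b].
    V is stored as a (2*delta+1) x (2*delta+1) array indexed by (a+delta, b+delta)."""
    n, m = X.shape
    H = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < n and 0 <= j + b < m:  # out-of-range neighbors count as zero
                        s += V[a + delta, b + delta] * X[i + a, j + b]
            H[i, j] = u + s
    return H

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
V = rng.standard_normal((3, 3))   # delta = 1, so a 3x3 kernel
H = conv2d(X, V, u=0.0, delta=1)  # output keeps the 6x6 shape
```

Strictly speaking, this computes a cross-correlation (the kernel is not flipped), which is exactly what deep learning frameworks implement under the name "convolution".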

References

6.1. From fully connected layers to convolutions — hands-on deep learning 2.0.0-beta1 documentation

Origin: blog.csdn.net/qq_41129489/article/details/126401931