Vision Transformer (2) || ViT-B/16 network structure

1. Schematic diagram

An image is first split into patches, as shown in the figure below; the numbers 0, 1, 2, ..., 8, 9 in the figure record the position information of each piece.

[Figure: an image split into patches, with positions numbered 0–9]


[Figure: ViT-B/16 schematic diagram]
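To make this step concrete, below is a minimal PyTorch sketch of the patch-splitting and position-embedding stage, assuming the standard ViT-B/16 settings (224×224 input, 16×16 patches, 768-dim embeddings); the class name `PatchEmbed` and the demo tensors are illustrative, not from the original post.

```python
# Minimal sketch of patch + position embedding (assumed ViT-B/16 settings).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a 224x224 image into 16x16 patches and project each to 768 dims."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to "split, then linearly project".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                 # x: [B, 3, 224, 224]
        x = self.proj(x)                  # [B, 768, 14, 14]
        x = x.flatten(2).transpose(1, 2)  # [B, 196, 768]
        return x

embed = PatchEmbed()
cls_token = nn.Parameter(torch.zeros(1, 1, 768))              # position "0"
pos_embed = nn.Parameter(torch.zeros(1, embed.num_patches + 1, 768))

x = torch.randn(2, 3, 224, 224)
tokens = embed(x)                                             # [2, 196, 768]
cls = cls_token.expand(tokens.shape[0], -1, -1)               # [2, 1, 768]
tokens = torch.cat([cls, tokens], dim=1) + pos_embed          # [2, 197, 768]
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The strided convolution is exactly equivalent to cutting the image into non-overlapping 16×16 patches and applying the same linear projection to each one.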

2. Transformer Encoder structure diagram (L× means this block is stacked L times)

[Figure: Transformer Encoder structure (the Encoder Block stacked L times)]


3. Implementation process:

[Figure: ViT-B/16 implementation process]

A more detailed diagram of the Encoder Block:

[Figure: detailed Encoder Block structure]
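As a hedged sketch, one such Encoder Block can be written in PyTorch as follows, assuming the pre-norm layout (Layer Norm before Multi-Head Attention and before the MLP, each wrapped in a residual connection); `EncoderBlock` is an illustrative name, not the author's code.

```python
# Minimal sketch of one pre-norm Encoder Block (assumed ViT-B/16 sizes).
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)                      # 768 * 4 = 3072
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(drop),
            nn.Linear(hidden, dim), nn.Dropout(drop),
        )

    def forward(self, x):                                  # x: [B, 197, 768]
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # residual 1
        x = x + self.mlp(self.norm2(x))                    # residual 2
        return x

# Per the "L x" note in section 2, ViT-B/16 stacks this block L = 12 times:
encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
```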

The MLP Block inside that Encoder Block is:

[Figure: MLP Block structure]
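Written out as its own module (the same MLP used inline in the sketch above), a minimal version might look like this, assuming the usual ViT-B/16 sizes: expand 768 → 3072, GELU, project back to 768, with Dropout after each Linear. `MLPBlock` is an illustrative name.

```python
# Minimal sketch of the MLP Block (assumed sizes: 768 -> 3072 -> 768).
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, dim=768, hidden=3072, drop=0.0):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)   # expand by a factor of 4
        self.act = nn.GELU()
        self.drop1 = nn.Dropout(drop)
        self.fc2 = nn.Linear(hidden, dim)   # project back to 768
        self.drop2 = nn.Dropout(drop)

    def forward(self, x):
        return self.drop2(self.fc2(self.drop1(self.act(self.fc1(x)))))
```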

4. MLP Head layer

[Figure: MLP Head layer]

Note: There is a Dropout layer in front of the Transformer Encoder and a Layer Norm layer after it.

When training your own network, you can simply treat the MLP Head as a single fully connected layer.
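Under that simplification, a sketch of the MLP Head (together with the Layer Norm that follows the Encoder) might look like this; the 1000-class output and the demo tensor are assumptions for illustration.

```python
# Minimal sketch of the MLP Head as a single fully connected layer.
import torch
import torch.nn as nn

norm = nn.LayerNorm(768)           # the Layer Norm after the Encoder
head = nn.Linear(768, 1000)        # MLP Head reduced to one FC layer

tokens = torch.randn(2, 197, 768)  # Encoder output: [B, 197, 768]
cls_out = norm(tokens)[:, 0]       # keep only the class token (position 0)
logits = head(cls_out)             # [2, 1000]
```

Only the class token's output feeds the head; the other 196 patch tokens are discarded at classification time.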

5. Summary of ViT-B/16 network structure

[Figure: ViT-B/16 network structure summary]
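Putting the pieces together, here is a compact end-to-end sketch of ViT-B/16 built from PyTorch's stock `nn.TransformerEncoderLayer` (with `norm_first=True` to match the pre-norm Encoder Block above). The hyperparameters follow the standard ViT-Base configuration (12 layers, 768 hidden dims, 12 heads, 3072 MLP dims); this is a sketch under those assumptions, not the author's exact code.

```python
# Compact end-to-end sketch of ViT-B/16 (assumed standard hyperparameters).
import torch
import torch.nn as nn

class ViT_B16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Patch Embedding: 224x224 image -> 196 tokens of dim 768
        self.patch = nn.Conv2d(3, 768, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 768))
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
        self.drop = nn.Dropout(0.0)      # the Dropout in front of the Encoder
        # norm_first=True gives the pre-norm Encoder Block shown in section 3
        layer = nn.TransformerEncoderLayer(
            d_model=768, nhead=12, dim_feedforward=3072, dropout=0.0,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)  # L = 12
        self.norm = nn.LayerNorm(768)    # the Layer Norm after the Encoder
        self.head = nn.Linear(768, num_classes)   # MLP Head (one FC layer)

    def forward(self, x):                             # [B, 3, 224, 224]
        x = self.patch(x).flatten(2).transpose(1, 2)  # [B, 196, 768]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = self.drop(torch.cat([cls, x], dim=1) + self.pos_embed)
        x = self.norm(self.encoder(x))
        return self.head(x[:, 0])                     # class-token logits

model = ViT_B16()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```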

where the Encoder Block is:

[Figure: detailed Encoder Block structure (repeated from section 3)]

and the MLP Block is:

[Figure: MLP Block structure (repeated from section 3)]

Source: blog.csdn.net/qq_56039091/article/details/124785401