1. Schematic diagram
Split the image into patches as shown in the figure below. The numbers 0, 1, 2, ..., 8, 9 in the figure record the position information of the image patches.
2. Transformer Encoder structure diagram (L× means the block is stacked L times)
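The splitting step can be sketched in plain Python as follows. This is a minimal illustration, not the original implementation: the helper `split_into_patches`, the toy 6x6 image, and the patch size of 2 are all assumptions chosen so that the result has 9 patches. It also assumes, as in the original ViT figure, that index 0 belongs to the extra class token and indices 1 to 9 belong to the patches.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (a list of rows) into non-overlapping square patches.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy single-channel "image" with pixel values 0..35 (hypothetical data).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = split_into_patches(image, patch_size=2)  # 3 x 3 grid of patches

# Assumption: position 0 is reserved for the class token,
# so the 9 patches take positions 1..9.
position_ids = list(range(len(patches) + 1))
```

Each patch would then be flattened and linearly projected before the position embedding for its index is added.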
3. Implementation process:
A more detailed Encoder Block diagram:
The MLP Block in the figure above expands as follows:
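For reference, the Encoder Block and MLP Block can also be written as equations. The block below is a sketch that assumes the diagrams follow the standard pre-norm design of the original ViT paper; the symbols (E for the patch projection, E_pos for the position embedding, MSA for multi-head self-attention, LN for Layer Norm) are taken from that paper rather than from these notes, and the Dropout layers are omitted for brevity.

```latex
\begin{aligned}
z_0     &= [\,x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E\,] + E_{\text{pos}} \\
z'_\ell &= \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \dots, L \\
z_\ell  &= \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \dots, L \\
\mathrm{MLP}(x) &= W_2\, \mathrm{GELU}(W_1 x + b_1) + b_2
\end{aligned}
```

The two residual additions correspond to the two skip connections in the Encoder Block, and the MLP Block is two fully connected layers with a GELU activation between them.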
4. MLP Head layer
Note: There is a Dropout layer before the Transformer Encoder and a Layer Norm layer after it.
When training your own network, you can simply treat the MLP Head as a single fully connected layer.
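The sketch below illustrates the last two steps in plain Python: Layer Norm after the encoder, then a single fully connected layer acting as the MLP Head. It is a toy example under stated assumptions: the sizes are tiny rather than those of ViT-B/16, the encoder output is random stand-in data, the Layer Norm has no learned scale or shift, and the head reads only the class token at index 0, as in the original ViT design. The helper names `layer_norm` and `linear` are hypothetical.

```python
import math
import random


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance.

    Simplification: no learned scale (gamma) or shift (beta).
    """
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def linear(vector, weight, bias):
    """Fully connected layer; weight has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


random.seed(0)
hidden_dim, num_tokens, num_classes = 8, 10, 5  # toy sizes (assumption)

# Stand-in for the Transformer Encoder output: one vector per token.
encoder_output = [[random.gauss(0, 1) for _ in range(hidden_dim)]
                  for _ in range(num_tokens)]

# Layer Norm after the encoder, applied to each token independently.
normalized = [layer_norm(tok) for tok in encoder_output]

# Keep only the class token (index 0) and pass it through the head.
class_token = normalized[0]
head_weight = [[random.gauss(0, 0.02) for _ in range(hidden_dim)]
               for _ in range(num_classes)]
head_bias = [0.0] * num_classes
logits = linear(class_token, head_weight, head_bias)  # one score per class
```

When fine-tuning on a new dataset, only `num_classes` (and therefore the head's weight shape) needs to change; the encoder stays the same.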
5. Summary of ViT-B/16 network structure
Where the Encoder Block is:
Where the MLP Block is:
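The tensor shapes through ViT-B/16 can be traced with a few lines of arithmetic. The sketch below uses the published ViT-B/16 hyperparameters (patch size 16, hidden size 768, 12 layers, 12 heads, MLP size 3072); the 224x224 RGB input is an assumption, and `ViTConfig` and `trace_shapes` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    image_size: int = 224   # assumption: 224x224 input
    patch_size: int = 16    # the "/16" in ViT-B/16
    in_channels: int = 3
    hidden_dim: int = 768
    num_layers: int = 12    # L, the number of stacked Encoder Blocks
    num_heads: int = 12
    mlp_dim: int = 3072     # hidden width of the MLP Block (4 x hidden_dim)


def trace_shapes(cfg):
    """Return the per-image shape after each stage (batch dimension omitted)."""
    grid = cfg.image_size // cfg.patch_size
    num_patches = grid * grid
    num_tokens = num_patches + 1  # patches plus one class token
    return {
        "input": (cfg.in_channels, cfg.image_size, cfg.image_size),
        "patch_embedding": (num_patches, cfg.hidden_dim),
        "with_class_token": (num_tokens, cfg.hidden_dim),
        "mlp_block_hidden": (num_tokens, cfg.mlp_dim),
        "encoder_output": (num_tokens, cfg.hidden_dim),
        "class_token_output": (cfg.hidden_dim,),
        "head_dim": cfg.hidden_dim // cfg.num_heads,
    }


shapes = trace_shapes(ViTConfig())
for stage, shape in shapes.items():
    print(f"{stage:>20}: {shape}")
```

Every Encoder Block keeps the sequence shape unchanged, which is why the same block can be stacked L times.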