Transformer~45

This article is about the opportunities and challenges facing end-side AI chips in the Transformer era. Thanks first to the original author; I reposted it here for my own study.

Transformers seem to have gradually taken over CNN's position, yet they require more computation and are more easily constrained by bandwidth and cache. So how should end-side AI chip companies deal with this opportunity and challenge? This article walks through the typical Transformer models of each modality and analyzes their advantages and disadvantages from the end-side perspective.

Over the past two years, Transformers have quietly overtaken CNNs in the speech, vision, and radar modalities. Given that models in these fields account for almost all of the end-side workload, how should the many end-side AI chip companies respond next, and how should they think about the design and selection of next-generation chips? Friends in the industry have surely done plenty of research: Transformers are excellent, but at end-side compute scales they usually need a much larger amount of computation than CNNs to gain a substantive advantage, their arithmetic intensity is lower, and they are more easily constrained by bandwidth and cache... So it seems necessary to adopt a strategy similar to that of LLM-oriented big-compute AI chips: add cache and increase bandwidth.

Hmm, is that really enough? Two months ago I genuinely thought so. I lamented that the era of Transformer unification had arrived and that stacking hardware resources was all it took, and I even expressed a somewhat pessimistic attitude toward the future of the AiSys field in another answer. But recently, after studying the structures of the popular Transformers in these fields and talking with friends working on the algorithms, I found that the way Transformers are applied in other modalities differs considerably from NLP, and that design decisions intended for LLM-oriented chips cannot be applied directly to the end side.

Next, I will walk through the typical Transformer models of each modality and their advantages and disadvantages from the end-side perspective; my thoughts and summary come at the end. Readers not interested in the models can skip the build-up and jump straight to the summary.

In addition, for better communication (and harmony), let me as always define the subject of discussion. The end-side AI chips referred to in this article are those in the circled range of the figure below: Mobile, Edge, Autonomous. On the one hand, AI chips in this range are more cost-sensitive, so the stacking of cache and bandwidth is relatively conservative; on the other hand, the more energy-efficient DSA makes up a higher proportion of the design, so programming flexibility is relatively weak.

Inventory of typical end-side Transformer models

Visual backbone

ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020

Introduction

  • Overall process: the image is chopped up and encoded directly by patch embedding, converting it into Transformer-style sequence features; after the patch position encoding is added, the classic Transformer blocks take over (a minimal sketch follows this list).

  • From an algorithm perspective: the ViT structure converts visual information into sequence features with only a small amount of feature preprocessing, which is elegant and concise. However, since scale information is determined solely by patch_size in the patch embedding stage, the multi-scale information crucial to vision tasks is missing; as a result, ViT's performance in visual tasks other than classification is not ideal.
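To make the "chop up and encode" step concrete, here is a minimal sketch of a ViT-style patch embedding; the sizes (patch_size=16, embed_dim=768, 224x224 input) are illustrative assumptions, not values taken from this article.

import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768                 # illustrative sizes
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)                 # (B, C, H, W)
feat = proj(x)                                  # (B, embed_dim, H/16, W/16) = (1, 768, 14, 14)
seq = feat.flatten(2).transpose(1, 2)           # (B, seqlen, embed_dim), seqlen = (H/16)*(W/16) = 196
print(seq.shape)                                # torch.Size([1, 196, 768])

This is also the "special conv2d" discussed below: a stride-16 convolution on 3 input channels, which can equivalently be rewritten as permute+gemm.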

Advantages and disadvantages from the end-side perspective

  • Advantage:

    • Except for patch embedding and the output form, the ViT structure is identical to the classic Transformer, so it can directly enjoy most Transformer inference optimization strategies (aside from those involving the kv-cache).

  • Disadvantages:

    • The amount of computation grows quadratically with resolution (seqlen = (H/patch_size) x (W/patch_size)). In high-resolution scenarios (such as detection and segmentation), neither the FFN nor the matrix multiplications in MHA can be loaded entirely into the resource-constrained near-core cache on the end side. Although various parallel tiling strategies can support this, the bandwidth overhead of repeatedly swapping data between cache and main memory still severely limits compute efficiency.

    • The specialized conv2d with patch_size-scale kernels (16/32) has poor utilization efficiency. That said, the focus is still on the first point: this second issue is very similar to the first in_channels=3 convolution of a CNN and is not a bottleneck, and it can also be converted into permute+gemm to sidestep it.

SwinTransformer: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV, 2021

Introduction

  • Overall process: referring to the figure above, PatchMerging downsamples stage by stage (the initial PatchPartition + LinearEmbedding can be regarded as a specialized PatchMerging). A single SwinTransformer block first converts visual features into sequence features at local-window granularity through WindowPartition, feeds them to the classic Transformer structure for processing, and then converts them back to visual features through WindowUnpartition; every other SwinTransformer block shifts the windows to let information flow globally (a minimal sketch of the windowing step follows this list).

  • Algorithm perspective: SwinTransformer's modified WindowAttention reaffirms the locality inherent in vision tasks. In addition, switching back and forth between visual and sequence features throughout the pipeline makes attention-based multi-scale feature extraction possible, which matters greatly for applying Transformers to vision tasks such as detection and segmentation.
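A minimal sketch of the WindowPartition idea, assuming a hypothetical 56x56x96 feature map and window size 7; it shows how each window becomes an independent sequence of length ws x ws and how the effective batch grows to B x (H/ws) x (W/ws).

import torch

def window_partition(x, ws):
    # Split (B, H, W, C) visual features into non-overlapping ws x ws windows;
    # each window becomes an independent sequence of length ws*ws.
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # effective batch grows to B * (H/ws) * (W/ws)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 56, 56, 96)     # hypothetical stage-1 feature map (B, H, W, C)
seqs = window_partition(x, ws=7)
print(seqs.shape)                  # torch.Size([64, 49, 96]): 64 windows, seqlen 49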

Advantages and disadvantages from the end-side perspective

  • Advantage

    • The WindowAttention mechanism makes the amount of computation grow linearly with resolution, which solves ViT's main problem. Moreover, the sequence length depends only on the window size ws (seqlen = ws x ws), so as long as embed_dim is designed appropriately, the matrix operations can stay resident in the AI chip's near-core cache regardless of resolution.

    • This mechanism also effectively enlarges the batch size (Batch = batch_origin x (H/ws) x (W/ws), where ws is the window size): parallelism with no synchronization, completely free!! This is definitely a pleasant surprise for multi-core architecture design.

  • disadvantage

    • Besides inheriting ViT's special PatchEmbedding convolution, WindowPartition/Unpartition and window shifting introduce permute/torch.roll, and mismatches between window size and resolution introduce Pad. These high-frequency data rearrangements undoubtedly have a considerable impact on performance.

Subsequent evolution

  • On the attention mechanism: combining window attention (SwinTransformer) with global attention (ViT), for example ViTDet: Exploring Plain Vision Transformer Backbones for Object Detection, ECCV, 2022. That is, some Transformer blocks use window attention, and others use global attention.

  • Position encoding: applying position-encoding innovations from NLP to vision tasks, such as RoPE: RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021. As shown below, the core computation is a rotation over odd/even channel pairs, (x, y) --> (-y, x). At the ONNX level this gets split into a subgraph of low-dimensional split + neg + low-dimensional concat, turning an obviously single-pass operation into multi-pass, which has some impact on performance.

import torch
from einops import rearrange

def rotate_half(x):
    # pair adjacent channels, then rotate each pair: (x1, x2) -> (-x2, x1)
    x = rearrange(x, '... (d r) -> ... d r', r=2)
    x1, x2 = x.unbind(dim=-1)
    x = torch.stack((-x2, x1), dim=-1)
    return rearrange(x, '... d r -> ... (d r)')
Visual detection

DETR: End-to-End Object Detection with Transformers, ECCV, 2020

Introduction

  • Overall process: as shown above. The backbone extracts features; after a ChannelMapping and position encoding, they are fed into the classic Transformer encoder-decoder module, and finally an FFN predicts the boxes. It is worth noting that the query embedding in the decoding part is a preset fixed value (a learned parameter); a minimal sketch follows this list.

  • From an algorithm perspective: a Transformer-based end-to-end framework with no NMS and no hand-crafted features or priors, though training is difficult. From a deployment perspective, it is simply a stunning design.
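A minimal sketch of the "preset learned query" idea, using PyTorch's generic TransformerDecoder as a stand-in for DETR's decoder; the sizes (100 queries, 256 channels) are illustrative, and real DETR injects the query embedding as a positional term inside its own decoder rather than as the target directly.

import torch
import torch.nn as nn

num_queries, embed_dim = 100, 256                     # illustrative sizes
query_embed = nn.Embedding(num_queries, embed_dim)    # learned once, fixed at inference time

memory = torch.randn(1, 600, embed_dim)               # stand-in for encoder output (B, HW, C)
layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=1)

tgt = query_embed.weight.unsqueeze(0)                 # (1, num_queries, embed_dim)
out = decoder(tgt, memory)                            # (1, 100, 256): one slot per predicted object
print(out.shape)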

Advantages and disadvantages from the end-side perspective

  • Advantage

    • No NMS, fully end-to-end, so there is no need to offload post-processing to the CPU.

    • Like ViT, it can basically seamlessly use most transformer inference optimization strategies.

  • disadvantage

    • Looking at typical detection workloads, the amount of computation is still too large. To achieve the same algorithmic results, the overhead compared with the detection-head combinations of the CNN era is excessive; the game is not worth the candle.

Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2020

Introduction

  • Overall process: Deformable DETR's pipeline is a bit involved; I only skimmed the code and converted the model to get started, and the two structure diagrams above are not easy to follow on their own. Interpreting it concisely from the model-structure perspective: Deformable DETR takes the multi-scale visual features extracted by the backbone and feeds them into its customized DeformableTransformer. Inside this structure, query features exist in two representations, multi-scale visual and sequence. The coordinates of the feature points of interest (reference points in Fig. 2) and their offsets (offsets in Fig. 2) are predicted from the sequence features; the feature points are then sampled from the multi-scale visual features at those coordinates and combined with weights computed from the sequence features to produce the query input of the next DeformableTransformer round (a minimal sketch of the sampling step follows this list).

  • From an algorithm perspective: it reduces the difficulty of training.
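A minimal sketch of the grid_sample-based sampling at the heart of DeformableAttention, restricted to a single scale and single head with hypothetical sizes; the real module predicts both offsets and weights from the query features and works over multiple scales.

import torch
import torch.nn.functional as F

B, C, H, W = 1, 256, 32, 32
value = torch.randn(B, C, H, W)                        # single-scale visual feature map
num_queries, num_points = 300, 4                       # illustrative sizes

ref = torch.rand(B, num_queries, 1, 2)                 # reference points, normalized to [0, 1]
offset = 0.01 * torch.randn(B, num_queries, num_points, 2)   # predicted offsets (random stand-in)
loc = (ref + offset).clamp(0, 1) * 2 - 1               # map to grid_sample's [-1, 1] range

# grid_sample fetches one C-dim feature per (query, point): heavy random memory access
sampled = F.grid_sample(value, loc, align_corners=False)     # (B, C, num_queries, num_points)
weights = torch.softmax(torch.randn(B, 1, num_queries, num_points), dim=-1)
out = (sampled * weights).sum(dim=-1)                  # (B, C, num_queries)
print(out.shape)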

Advantages and disadvantages from the end-side perspective

  • Advantage

    • It is hard to find a deployment-level advantage for Deformable DETR. If pressed, the matrix computation in its modified DeformableAttention is indeed somewhat smaller.

  • disadvantage

    • If the data reordering (permute) that WindowPartition/WindowUnpartition constantly inserts into the model already hurts end-side performance, then DeformableAttention's high-frequency grid_sample for extracting feature points can only be described as a disaster for the inflexible, low-bandwidth end side.

    • Not to mention the specialized split (non-equal segmentation) and high-dimensional permute introduced when switching between multi-scale visual features and sequence features.

Subsequent evolution

  • DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022, further introduces query selection on top of Deformable DETR, adding target selection (topk+gather) at the decoding stage; a minimal sketch follows.
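A minimal sketch of such a topk+gather query-selection step, with hypothetical sizes; it is meant only to show the kind of data movement involved, not DINO's exact implementation.

import torch

B, seqlen, C, k = 1, 1000, 256, 300                   # illustrative sizes
enc_feat = torch.randn(B, seqlen, C)                  # encoder output tokens
scores = torch.randn(B, seqlen)                       # per-token objectness/class score

topk_idx = scores.topk(k, dim=1).indices              # (B, k)
queries = enc_feat.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, C))   # (B, k, C)
print(queries.shape)                                  # torch.Size([1, 300, 256])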

Point Cloud

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets, CVPR 2023

Introduction

  • Overall process: the whole DSVT pipeline (the DSVT part of the figure above) can be understood as a point-cloud version of SwinTransformer. 1) Parallel local self-attention: DSVT divides the sparse voxels into multiple windows and further into an equal number of sets, and performs attention within each set. 2) Global information flow (the analogue of shifted windows): DSVT partitions along x and y alternately.

  • Data switching: the indices of each voxel feature in the x/y ordering are precomputed, and gather/scatter realizes the switch between the voxel point-cloud feature space and the Transformer sequence feature space (a minimal sketch follows this list).

  • Voxelization and BEV in the pipeline are domain terms not discussed in this article; for the voxel part, see VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, CVPR 2018.

  • Apart from the pipeline itself, DSVT's typical loads (the sets in DSVT) are similar to SwinTransformer's.
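A minimal sketch of this gather/scatter data movement, with hypothetical sizes and randomly generated set indices; the real DSVT precomputes indices that partition the voxels and handles padding/masking of partially filled sets, all omitted here.

import torch

N, C = 5000, 128                          # non-empty voxels, channels (illustrative)
num_sets, set_size = 200, 36              # voxels regrouped into fixed-size sets

voxel_feat = torch.randn(N, C)
set_idx = torch.randint(0, N, (num_sets, set_size))   # stand-in for the precomputed x/y-order indices

seq = voxel_feat[set_idx]                 # gather: (num_sets, set_size, C), fed to attention
seq = seq + 1.0                           # placeholder for the Transformer block
out = torch.zeros_like(voxel_feat)
out.index_copy_(0, set_idx.reshape(-1), seq.reshape(-1, C))   # scatter back to voxel space
print(out.shape)                          # torch.Size([5000, 128])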

Advantages and disadvantages from the end-side perspective

  • Advantage

    • It inherits all the benefits of WindowAttention: it reduces the total amount of computation and the scale of any single matrix operation, and improves parallelism.

  • disadvantage

    • The cost of switching data between the Transformer sequence feature space and the voxel point-cloud feature space is much higher than for visual features. First, random access (shuffle) such as scatter/gather is far more time-consuming than data rearrangement (reorder); second, whether these operations are even supported on the device side is a question mark. Offload everything to the CPU? That is no joke.

    • In a word, this end-side unfriendliness has little to do with DSVT itself; it comes from the bad genes of the point-cloud data structure.

Evolution of 3D detection head

The algorithmic evolution of 3D detection heads is relatively complicated and needs a lot of background, such as CenterNet and CenterPoint from the CNN era, but the ending is the same as for 2D detection: DETR. So I will not spend too much ink on it.

Multimodal

CLIP: Learning Transferable Visual Models From Natural Language Supervision, ICML 2021

Introduction

  • Overall process: the image-encoding and text-encoding branches each pass through a Transformer encoder (or other encoding structure); their respective features are then fused across modalities (norm), and the pairwise cosine similarity (matmul, cos) is computed; a minimal sketch follows.
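A minimal sketch of this fusion step, with hypothetical embedding sizes and temperature (CLIP learns its temperature; the value here is illustrative): normalize each modality's embedding, then a single matmul gives the cosine-similarity matrix.

import torch
import torch.nn.functional as F

image_feat = torch.randn(8, 512)              # stand-in for image-encoder outputs (illustrative sizes)
text_feat = torch.randn(4, 512)               # stand-in for text-encoder outputs

image_feat = F.normalize(image_feat, dim=-1)
text_feat = F.normalize(text_feat, dim=-1)
logits = 100.0 * image_feat @ text_feat.t()   # (8, 4) cosine similarities, scaled by a temperature
probs = logits.softmax(dim=-1)                # per-image distribution over the 4 prompts
print(probs.shape)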

Advantages and disadvantages from the end-side perspective

  • Advantage

    • Like ViT, its simple, unadorned structure, a faithful practitioner of simplicity, is friendly to end-side deployment.

  • Disadvantage

    • The text encoder involves prompt engineering, and its input text sequence changes in real time with the application context, which places higher demands on the end-side software stack's support for dynamic shapes.

Lightweight Transformer

From the perspective of model structure, I divide lightweight Transformers into two camps: the organic-combination camp and the deep-reform camp.

The organic-combination camp

Representative works of this camp:

  • EfficientFormer: Vision Transformers at MobileNet Speed NIPS 2022

  • MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, arXiv 2021

Just from the architecture overview we can spot the clues: the familiar conv+relu+bn and stage-by-stage downsampling. This type of model still keeps the typical characteristics of a CNN architecture, with Transformer units implanted only in parts of the head and trunk.

Although this type of algorithm can flexibly shift its lineage (adjusting the proportion of Transformer units) to balance algorithmic effect against inference performance, from a system perspective I find its positioning rather vague: is it aimed at the CNN compute range, the Transformer compute range, or the middle ground between the two? At least from what I have read so far, the middle ground is this camp's main arena.

The deep-reform camp

Representative works of this camp:

  • Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios, 2022

  • EfficientViT: Lightweight Multi-Scale Attention for On-Device Semantic Segmentation, ECCV, 2022

This camp is no longer satisfied with superficially implanting Transformers. They transform the Transformer structure at a deeper level, using pooling to cut the number of attention tokens and thereby shrink the attention matrix, and using relu/pooling to approximate the softmax-style global attention mechanism (a minimal sketch of the relu-based attention follows).
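A minimal sketch of the relu-instead-of-softmax global attention idea, in the spirit of EfficientViT's linear attention, single-head and without the multi-scale aggregation details; its cost grows linearly with sequence length rather than quadratically.

import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    # softmax(QK^T)V replaced by relu(Q) @ (relu(K)^T V); cost is linear in sequence length
    q, k = torch.relu(q), torch.relu(k)
    kv = k.transpose(-2, -1) @ v                                # (B, C, C)
    out = q @ kv                                                # (B, N, C)
    denom = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (B, N, 1) normalizer
    return out / (denom + eps)

q = k = v = torch.randn(1, 196, 64)                             # illustrative sizes
print(relu_linear_attention(q, k, v).shape)                     # torch.Size([1, 196, 64])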

Compared with the organic-combination camp, this camp accepts transforming the algorithm around the characteristics of existing AI chips, and has a clear goal: to compete within the CNN compute range. Designing a "Domain Specific Algorithm" is naturally ideal for deployment and optimization. Yet this reverse adaptation of the algorithm to the platform, moving forward in the shackles of previous-generation AI chips, always feels to me like it runs counter to the mainstream direction of innovation.

Summary

Better programmability

The difference in inductive bias has profoundly shaped how CNNs and Transformers evolve. CNN's locality has guided more innovation at the global, network-structure level (ResNet, FPN, and so on), while the Transformer's global nature means improvements show up more as tinkering with Attention and FFN details, that is, at the operator level. Moreover, Transformers are usually stacked in the model as Blocks/Layers, so any operator innovation introduced around a Transformer gets replicated across every Transformer unit. In the Transformer era, therefore, the proportion of these modality-specific specialized operators will be much higher than before. The grid_sample/gather/scatter and high-dimensional tensors of DeformableDETR/DSVT listed above, the WindowPartition/Unpartition operations (plus the extra Pad and high-dimensional Permute) introduced by SwinT and ViTDet, and special position-encoding operations such as RoPE are the best illustration.

At this point, if you still follow the CNN-era approach and offload the loads the compiler software stack cannot handle, such as grid_sample/gather/scatter, to the CPU, the pipeline will be interrupted frequently and extra bandwidth consumed, which is surely a big mistake. A better design direction is to configure a programmable acceleration unit that shares the cache with the TensorCore, or to embed a TensorCore inside a DSP.

Considering the adoption of FlashAttention, an optimization strategy that demands extremely high computational flexibility (I mean the quantized version of FA), the above decision is only reinforced. Going one step further: under this trend, will compute-in-memory AI chips taking the end-side route face even greater challenges (after all, not every operation and optimization is quantization-friendly)?

More reasonable resource allocation

A better cache size. Whether it is the visual backbones and point-cloud DSVT bounded by WindowAttention, the detection tasks bounded by the number of Detection Objects in the QueryEmbedding, or the simplified prompts of the multimodal models, these algorithm designs imply that the sequence length does not grow without bound as the model's computation grows. A quick calculation for a large parameter configuration of each modality's Transformer (mlp_ratio=4, queryobj=seqlen=1024, embed_dim=1024) shows that an 8 MB cache can fully hold the weight data of any single matrix computation in single-batch 8-bit inference; see the back-of-the-envelope check below. So there is no need to stack resources endlessly; the size just has to be right.
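A back-of-the-envelope check of that claim under the stated configuration; the arithmetic below is my own illustration, not numbers from the original.

# Single-batch, int8 weights: largest single matmul weights per Transformer block.
embed_dim, mlp_ratio = 1024, 4
attn_proj = embed_dim * embed_dim             # one Q/K/V/O projection: 1,048,576 params ~ 1 MB
ffn_up = embed_dim * (mlp_ratio * embed_dim)  # FFN up-projection:      4,194,304 params ~ 4 MB
print(attn_proj / 2**20, ffn_up / 2**20)      # 1.0 4.0 -> both fit in an 8 MB near-core cache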

Multi-core designs have potential advantages. Judging from workloads such as SwinTransformer/ViTDet, algorithm designers in locality-heavy tasks are quite willing to turn the Transformer's attention input into mutually independent subsequences representing local features. In that case, the benefits of a multi-core design become more obvious.

More precise compute positioning. At present there are two innovation patterns in Transformer-era algorithms. One is the lightweight Transformer, deeply integrated with CNN: keeping the original structure mostly unchanged and introducing a small number of Transformer modules, taking the lower-bound route. The other starts from the characteristics of each modality, introduces new operators to adapt the feature data, and customizes the Transformer, taking the upper-bound route. I dare not weigh in on the route dispute, but one thing is certain: the two occupy completely different compute ranges. An AI chip's design positioning cannot stay ambiguous; it has to pick one of the two. Trying to serve both ends, with no cost advantage on the lower route and insufficient compute resources for the upper route, may end badly. It is therefore worth having the hardware-software co-design people on the team invest more at the design stage to pin down the boundary where the Transformer's advantage in accuracy kicks in.

More detailed Buffer modeling

These workloads fully reflect:

  • How non-language models work across feature spaces: data switches back and forth between sequence features and other modal features in search of better feature extraction. For example, shifting and downsampling under image features in SwinT versus Attention and FFN under sequence features; or feature grouping under point-cloud features in DSVT versus Attention/FFN under the same sequence features.

  • Fancy encodings of positional information, for example RoPE and the decoder position operations in DINO.

If Buffer modeling at the IR level can span linearly from tensor granularity down to item granularity, then on the one hand the reorder (permute) and shuffle (grid_sample/gather) operations introduced by crossing feature spaces can be described better, leaving room to explore fusing them with their predecessor and successor eltwise/gemm nodes; on the other hand, the gathering and dependency relationships between the Buffers in fancy position-encoding subgraphs (RoPE and the like) can also be modeled better, making it possible to turn the whole subgraph from multi-pass into single-pass. The accumulation of these performance gains may well decide the final performance benchmark.

A more scalable software stack

These characteristics of the Transformer algorithm also have a certain impact on the support of the software stack:

  • Among the specialized operators introduced by each modality's Transformer, apart from the shuffle operations, most workloads (high-dimensional tensor operations, high-dimensional Permute, non-equal split, and so on) fall within the responsibility of the compiler software stack, which must break these loads down into primitives and adapt them to the hardware.

  • Although workloads outside of multimodality do not have a strong demand for dynamic shapes, given the huge potential shown by multimodal models, it is still necessary to plan dynamic-shape support in advance and schedule it sensibly across the compiler software stack's releases.

And these things may never have appeared in the CNN era's test cases, or even in the feature list at all. This clearly raises the bar for the scalability of the AI chip's compiler software stack. There are too many dimensions to consider at design time; on the testing side, I feel it is time to bring in fuzzing.

Perhaps we will get a glimpse of each company's response from their subsequent SDK version changes and release notes.

Origin blog.csdn.net/qq_29788741/article/details/132309467