Intensive reading of the paper: Ansor: Generating High-Performance Tensor Programs for Deep Learning

1. Abstract

  High-performance tensor programs are key to the efficient execution of deep neural networks. However, it is very challenging to obtain well-performing tensor programs for different operators on various hardware platforms. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to obtain well-performing tensor programs. These approaches have two drawbacks:
  (1) a huge engineering effort is required to develop platform-specific optimized code;
  (2) the limited search space and ineffective search strategies make it difficult to discover high-performance tensor programs.
  To address these disadvantages, the authors propose Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor has the following characteristics:
  (1) it explores more optimization combinations by sampling programs from a hierarchical representation of the search space;
  (2) it fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs;
  (3) it uses a task scheduler to simultaneously optimize multiple subgraphs of a deep neural network.
  The authors' experiments show that Ansor can find high-performance programs that lie outside the search space of existing state-of-the-art (SOTA) methods, improving execution performance over SOTA by up to 3.8x on Intel CPU, 2.6x on ARM CPU, and 1.7x on NVIDIA GPU.

2. Introduction

  Low-latency execution of deep neural networks (DNNs) plays a vital role in autonomous driving, augmented reality, language translation, and other AI applications. A DNN can be expressed as a directed acyclic graph (DAG), in which nodes represent operators (convolution, matrix multiplication) and directed edges represent dependencies between operators. Existing deep learning frameworks (TensorFlow, PyTorch, MXNet) map the operators in a DNN to vendor-provided kernel libraries (cuDNN, MKL-DNN) to achieve high performance. However, these kernel libraries require a huge engineering effort to be manually tuned for each hardware platform and operator, and the large amount of manual work required to produce efficient operator implementations for each target accelerator limits the development of new operators and new accelerators, and innovation in general.
  Given the importance of DNN performance, researchers and industry practitioners have turned to search-based compilation to automatically generate tensor programs, i.e., low-level implementations of tensor operators. For an operator or a subgraph of multiple operators, the user defines the computation in a high-level declarative language, and the compiler then searches for programs customized for different hardware platforms.

3. Background

  The deep learning ecosystem is embracing a rapidly growing diversity of hardware platforms, including CPUs, GPUs, FPGAs, and ASICs. In order to deploy DNNs on these platforms, high-performance tensor programs need to be provided for the operators used in DNNs; the required set of operators usually contains standard operators (matmul, conv2d) as well as new operators invented by machine learning researchers (capsule conv2d, dilated conv2d). To deliver these operators portably and efficiently across a wide range of hardware platforms, various compiler techniques have emerged (TVM, Halide, Tensor Comprehensions). Users define the computation in a high-level declarative language in a form similar to a mathematical expression, and the compiler generates optimized tensor programs from this definition. The figure below shows the computation definition of matrix multiplication in TVM's tensor expression language; the user mainly needs to define the shapes of the input tensors and how each element of the output tensor is computed.

[Figure: computation definition of matrix multiplication in TVM's tensor expression language]
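  The blog's figure is not reproduced here; as an assumed illustration (shapes and names are my own choice, not taken from the figure), a matmul computation definition in TVM's tensor expression language looks roughly like this:

```python
import tvm
from tvm import te

N, L, M = 1024, 1024, 1024
A = te.placeholder((N, L), name="A")          # shapes of the input tensors
B = te.placeholder((L, M), name="B")
k = te.reduce_axis((0, L), name="k")          # reduction axis
# How each element of the output tensor C is computed:
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
```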
  However, it is extremely difficult to automatically generate high-performance tensor programs from high-level definitions. Depending on the architecture of the target platform, the compiler has to search in an extremely large and complex space of combinatorial optimization choices (e.g., tile structure, tile size, vectorization, parallelization); finding high-performance programs requires a search strategy that covers a comprehensive space and explores it efficiently.


4. Design Overview

[Figure: Ansor design overview]
  Program sampler: A key challenge Ansor must address is generating a large search space for a given computational graph. To cover tensor programs with various high-level structures and low-level details, Ansor uses a hierarchical representation of the search space with two levels: sketches and annotations. Ansor defines the high-level structure of a program as a sketch and leaves billions of low-level choices (e.g., tile size, parallelization, unroll annotations) as annotations. This representation allows Ansor to flexibly enumerate high-level structures and efficiently sample low-level details.
  Performance tuner: The performance of randomly sampled programs is not necessarily good, so the next challenge is to fine-tune them. Ansor performs fine-tuning iteratively using evolutionary search and a learned cost model. In each iteration, the evolutionary search starts from an initial population consisting of newly resampled programs together with good programs from previous iterations. Evolutionary search fine-tunes programs through mutation and crossover, which perform out-of-order rewrites and thus avoid the limitations of sequential construction. Querying the learned cost model is orders of magnitude faster than actual measurement, so thousands of programs can be evaluated in seconds.
  Task scheduler: With program sampling and performance fine-tuning, Ansor can find high-performance tensor programs for computational graphs. Intuitively, treating a complete DNN as a single computational graph and generating a complete tensor program for it could potentially achieve the best performance. However, this is inefficient because it suffers from an unnecessary exponential explosion of the search space. Usually, the compiler partitions the large computational graph of a DNN into several small subgraphs; thanks to the layer-by-layer construction of DNNs, the impact of this partitioning on performance is negligible. This brings Ansor's last challenge: how should time resources be allocated when generating programs for multiple subgraphs?
  The task scheduler in Ansor uses a gradient descent-based scheduling algorithm to allocate resources to the subgraphs that are more likely to improve end-to-end DNN performance.

5. Program Sampling

  The search space that an algorithm explores determines the best programs it can find. The search space considered by existing methods is limited by the following factors:
  (1)Manual enumeration (TVM): It is impractical to manually enumerate all possible choices through templates, so existing manual templates can only heuristically cover a limited search space;
  (2)Aggressive early pruning (Halide auto-scheduler): Aggressive early pruning based on evaluating incomplete programs prevents the search algorithm from exploring certain regions in the space.
  To solve (1), we automatically expand the search space by recursively applying a flexible set of derivation rules;
  To avoid (2), we randomly sample complete programs in the search space.
  Since random sampling gives every sampled point an equal chance, the proposed search algorithm can potentially explore every program in the considered space. Ansor does not rely on random sampling alone to find the optimal program, because every sampled program is later fine-tuned.
  At the top level, sketches are generated by recursively applying derivation rules. At the bottom level, these sketches are randomly annotated to obtain complete programs. This representation summarizes a few basic structures out of billions of low-level choices, enabling flexible enumeration of high-level structures and efficient sampling of low-level details.

[Figure: examples of program sampling — input subgraphs, generated sketches, and randomly annotated programs]

5.1 Sketch Generation

  The first column in the figure above shows two input examples. The input has three equivalent forms: the mathematical expression, the corresponding naive program obtained by directly expanding the loop indices, and the corresponding computational graph (DAG).

  In the field of computer programming, a "naive program" usually refers to a simple or plain way of implementing something. Such a program may not consider all possible cases or take advantage of existing optimization techniques. The term "naive" is often used to describe code written by inexperienced or less skilled programmers; in such cases, the programmer may use a basic algorithm or data structure without considering more complex or efficient solutions. Such programs usually consume more computing resources and run slowly.------by ChatGPT

  To generate sketches for a DAG with multiple nodes, we visit all nodes in topological order and build the structure iteratively. For compute nodes that are computationally intensive and have abundant data-reuse opportunities (conv2d, matmul), we build basic tile and fusion structures for them as the sketch; for simple element-wise nodes (ReLU, element-wise add), we can safely inline them. Note that new nodes (caching nodes, layout transform nodes) may also be introduced into the DAG during sketch generation.
  The authors propose a derivation-based enumeration method to generate all possible sketches by recursively applying several basic rules. This procedure takes a DAG as input and returns a list of sketches. We define a state $\sigma = (S; i)$, where $S$ is the sketch generated for the current part of the DAG and $i$ is the index of the current working node; the nodes in the DAG are sorted in topological order from output to input. The derivation starts from the initial naive program and the last node, i.e., the initial state $\sigma = (naive\ program;\ index\ of\ the\ last\ node)$. We then try to recursively apply all derivation rules to these states. For each rule, if the current state satisfies its application condition, we apply the rule to $\sigma = (S; i)$ and obtain $\sigma' = (S'; i')$ with $i' \le i$, so that the index $i$ (the working node) decreases monotonically. A state becomes a terminal state when $i = 0$. During enumeration, multiple rules can be applied to one state to generate multiple subsequent states, and a single rule can also generate multiple possible subsequent states. Therefore, we maintain a queue to store all intermediate states; the process ends when the queue is empty. The $\sigma.S$ of all terminal states form the sketch list at the end of sketch generation. For typical subgraphs, the number of sketches is less than 10.

// Recursively apply several basic rules to generate all possible sketches
// Derivation rule based enumeration
// (pnow/pnext hold the queues of intermediate states described above)
Array<State> out_states;
while (!pnow->empty()) {
  pnext->clear();
  for (const State& state : *pnow) {
    int stage_id = cur_stage_id_map[state];

    // Reached the terminal stage: the state becomes a finished sketch
    if (stage_id < 0) {
      out_states.push_back(state);
      continue;
    }

    // Try all derivation rules on the current working stage
    for (const auto& rule : sketch_rules) {
      auto cond = rule->MeetCondition(*this, state, stage_id);
      if (cond != SketchGenerationRule::ConditionKind::kSkip) {
        // A rule may produce several follow-up states
        for (const auto& pair : rule->Apply(*this, state, stage_id)) {
          cur_stage_id_map[pair.first] = pair.second;
          pnext->push_back(pair.first);
        }
        // Skip the rest of the rules
        if (cond == SketchGenerationRule::ConditionKind::kApplyAndSkipRest) {
          break;
        }
      }
    }
  }
  std::swap(pnow, pnext);
}
// Conv2d(3, 64, kernel_size=(7, 7), stride=2, padding=1) generates 3 sketches

[Table: derivation rules for CPU sketch generation]
  Derivation rules: The table above lists the derivation rules for CPU. The authors first give the definitions of the predicates used, then describe the functionality of each rule; the values of these predicates are obtained by static analysis of the computation definitions, which is done automatically by parsing the read/write patterns in the mathematical expressions. I have organized the table as follows:

| Condition | Description |
| --- | --- |
| $IsStrictInlinable(S, i)$ | Node $i$ in $S$ is a simple element-wise operator, such as element-wise add and ReLU |
| $HasDataReuse(S, i)$ | Node $i$ in $S$ is a compute-intensive operator with abundant data-reuse opportunities inside the operator, such as matmul and conv2d |
| $HasFusibleConsumer(S, i)$ | Node $i$ in $S$ has only one consumer node $j$, and node $j$ can be fused into node $i$, such as matmul + bias_add and conv2d + relu |
| $HasMoreReductionParallel(S, i)$ | Node $i$ in $S$ has little parallelism in the space dimensions but enough parallelism in the reduction dimensions, e.g., computing the L2 norm of a matrix, or the matrix product $C_{2\times2} = A_{2\times512} \cdot B_{512\times2}$ |

  In computer programming, "inline" usually refers to a compiler optimization technique that directly replaces a function call with the code of the function body when compiling the code. In this way, the extra overhead of calling the function can be avoided, thereby improving execution efficiency.
  In C++, we can use the keyword "inline" to tell the compiler to treat a function as an inline function. The advantage of using inline functions in a C++ program is that the overhead of function calls can be reduced, thereby improving the running efficiency of the program. In addition, using an inline function reduces code duplication, since each call to the function embeds the function's code at the call site.
  It should be noted that although inline functions can improve program performance, not all functions are suitable to be inline functions. In general, small, frequently called functions are best suited to be inline, while larger, more complex functions are not. In addition, inline functions may increase the size of the code, so there is a trade-off between code size and performance.------by ChatGPT

  Rule 1 simply skips a node if the node is not strictly inlinable;
  Rule 2 always inlines a strictly inlinable node. Since the conditions of Rule 1 and Rule 2 are mutually exclusive, a state with $i > 1$ can always satisfy one of them and continue the derivation;
  Rule 3 performs multi-level tiling for nodes with data reuse. For CPU, we use an "SSRSRS" tile structure, where "S" stands for one tile level of space loops and "R" stands for one tile level of reduction loops. For example, in the matrix multiplication $C(i,j) = \sum_k A[i,k] \times B[k,j]$, $i$ and $j$ are space loops and $k$ is a reduction loop. The "SSRSRS" tile structure expands the original 3-level loop $(i, j, k)$ into a 10-level loop $(i_0, j_0, i_1, j_1, k_0, i_2, j_2, k_1, i_3, j_3)$. Although it does not reorder the loops, this multi-level tiling can still cover some reordering cases: for example, the 10-level loop above can be specialized into the simple reordering $(k_0, j_2, j_3)$ by setting the lengths of the other loops to 1. The "SSRSRS" tile structure is general for compute-intensive operators in deep learning (matmul, conv2d, conv3d), because they all consist of space loops and reduction loops (a concrete schedule sketch follows this list);
  Rule 4 also performs multi-level tiling, but additionally fuses a fusible consumer. For example, element-wise nodes (ReLU, bias_add) can be fused into the tiled nodes (conv2d, matmul);
  Rule 5 adds a cache node if the current node with data reuse has no fusible consumer. For example, the final output node in a DAG has no consumer, so by default it writes its results directly to main memory, which is inefficient due to the high latency of memory access. By adding a cache node, we introduce a new fusible consumer into the DAG, and Rule 4 can then be applied to fuse this newly added cache node into the final output node. With the cache node fused, the final output node now writes its results into a cache block, and the whole block is written to main memory at once when all of its data have been computed;
  Rule 6 can use rfactor to factorize a reduction loop into space loops to increase parallelism.
  As can be seen, Rule 3, Rule 4, and Rule 5 handle multi-level tiling and fusion for nodes with data reuse.
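  To make the "SSRSRS" structure concrete, below is a minimal sketch using TVM's tensor expression schedule primitives (my own illustration, not code from the paper); the tile sizes here are arbitrary placeholders, whereas Ansor leaves them to be filled in later by random annotation:

```python
import tvm
from tvm import te

N = 1024
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis
(k,) = s[C].op.reduce_axis

# Split each space loop into 4 tile levels and the reduction loop into 2 levels,
# expanding the 3-level loop (i, j, k) into the 10-level "SSRSRS" structure.
i0, i_ = s[C].split(i, factor=256)
i1, i_ = s[C].split(i_, factor=32)
i2, i3 = s[C].split(i_, factor=4)
j0, j_ = s[C].split(j, factor=256)
j1, j_ = s[C].split(j_, factor=32)
j2, j3 = s[C].split(j_, factor=4)
k0, k1 = s[C].split(k, factor=16)

# SSRSRS order: (i0, j0) (i1, j1) (k0) (i2, j2) (k1) (i3, j3)
s[C].reorder(i0, j0, i1, j1, k0, i2, j2, k1, i3, j3)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```

  Setting some of these tile lengths to 1 recovers simpler loop orders, which is how a single sketch can cover many concrete programs.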

  For GPU, we use an "SSSRRSRS" tile structure. The loops in the first three space tile levels are bound to BlockIdx, virtual threads (used to reduce bank conflicts), and ThreadIdx, respectively. We also add two sketch derivation rules: one exploits shared memory by inserting cache nodes (similar to Rule 5), and the other performs cross-thread reduction (similar to Rule 6).

  The figure at the beginning of this section shows three examples of generated sketches. Sketches differ from TVM's manual templates in that manual templates specify both the high-level structure and the low-level details, whereas sketches only define the high-level structure. For the example Input 1, the sort order of the four nodes in the DAG is $(A, B, C, D)$. To derive the sketches for the DAG, we start from the output node $D$ ($i = 4$) and apply the rules to the nodes one by one. Specifically, the derivation process of the generated Sketch 1 is:

[Figure: derivation of Sketch 1 for Input 1]

  To understand the DAG of Input 1 on the right: A and B are the input data nodes, C is the matmul node, and D is the output node. Nodes A, B, and D apply Rule 1, and node C applies Rule 4.

  For the example Input 2, the sort order of the five nodes is $(A, B, C, D, E)$. Similarly, we start from the output node $E$ ($i = 5$) and apply the rules recursively; the derivation process of the generated Sketch 2 is:

[Figure: derivation of Sketch 2 for Input 2]

  To understand the DAG of Input 2 on the right: A and D are the input data nodes, B is the max node, C is the xxx node, E is the matmul node and also the output node. Nodes A, C, and D apply Rule 1, node B applies Rule 2, and node E applies Rule 5 to insert a cache node and then applies Rule 4.

  Likewise, the derivation process of the generated Sketch 3 is:

[Figure: derivation of Sketch 3 for Input 2]

5.2 Random Annotation

  The sketches generated in the previous subsection are incomplete programs, because they only have tile structures without specific tile sizes and loop annotations such as parallelization, unrolling, and vectorization. In this subsection, sketches are annotated to become complete programs for fine-tuning and evaluation.
  Given a list of generated sketches, we randomly pick one sketch, randomly fill in the tile sizes, parallelize some outer loops, vectorize some inner loops, and unroll some inner loops. We also randomly change the compute locations of some nodes in the program to make slight adjustments to the tile structure. "Random" here means a uniform distribution over all valid values. If some special algorithm requires custom annotations to be effective (e.g., special unrolling), the user is allowed to give simple hints in the computation definition to adjust the annotation policy. Finally, since changing the layout of constant tensors can be done at compile time with no runtime overhead, we rewrite the layouts of constant tensors according to the multi-level tile structure to make them as cache-friendly as possible. This optimization works because the weight tensors of convolutional or fully connected layers are constants for inference applications.
  An example of random sampling can be seen in the figure at the beginning of this section. The sampled program may have fewer loops than the sketch, because loops of length 1 are simplified away.

  A loop annotation refers to a specific tag added to a loop to tell the compiler about the loop's nature and characteristics, helping the compiler better optimize the loop's execution. These tags are usually added to the loop body in the form of annotations in the code.
  Common loop annotations include:
  1. unroll: unroll the loop, i.e., replicate the code in the loop body several times to reduce the overhead of the loop control statements.
  2. vectorize: vectorize the loop, i.e., combine the same operation executed multiple times into a single operation to speed up the execution of the loop body.
  3. parallelize: parallelize the loop, i.e., distribute the iterations of the loop body across different processor cores or threads to speed up the execution of the loop body.
  4. pipeline: divide the iterations of the loop body into multiple stages and execute them simultaneously on different processor cores or threads to speed up the execution of the loop body.
  Using loop annotations requires selecting the appropriate tag for the specific situation and optimizing according to the characteristics of the hardware. Although loop annotations can improve the execution efficiency of a loop body, too many of them may reduce the readability and maintainability of the code. Therefore, trade-offs and evaluation are required to determine the best optimization scheme when using them.------by ChatGPT
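  As a concrete illustration of such annotations (my own sketch using TVM's te schedule primitives, not taken from the paper or the blog), the snippet below applies parallel, unroll, and vectorize to a tiled element-wise computation; during sampling, Ansor picks such annotations and tile sizes randomly:

```python
import tvm
from tvm import te

N = 512
A = te.placeholder((N, N), name="A")
B = te.compute((N, N), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
i, j = s[B].op.axis
io, ii = s[B].split(i, factor=32)
jo, ji = s[B].split(j, factor=8)
s[B].reorder(io, jo, ii, ji)

s[B].parallel(io)        # parallelize an outer loop across CPU threads
s[B].unroll(ii)          # unroll an inner loop
s[B].vectorize(ji)       # vectorize the innermost loop (SIMD)

print(tvm.lower(s, [A, B], simple_mode=True))
```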

6. Performance Fine-tuning

  The programs sampled by the program sampler have good coverage of the search space, but their quality is not guaranteed, because optimization options such as tile structures and loop annotations are randomly sampled. The authors therefore introduce a performance tuner, which fine-tunes the performance of the sampled programs through evolutionary search and a learned cost model.
  Fine-tuning is performed iteratively. In each iteration, we first use evolutionary search to find a small batch of promising programs according to the learned cost model, then measure these programs on the hardware to obtain their actual execution costs, and finally use the measured performance data to retrain the cost model and make it more accurate.
  Evolutionary search uses randomly sampled programs together with previously measured high-quality programs as the initial population and applies mutation and crossover to generate the next generation. A learned cost model is used to predict the fitness of each program, which in our case is the throughput of the program. We run a fixed number of generations and select the best programs found during the search. We use a learned cost model because it can estimate program fitness with reasonable accuracy while being orders of magnitude faster than actual measurement; it lets us compare tens of thousands of programs in the search space within seconds and select the promising ones for actual measurement.
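  The fine-tuning loop described above can be summarized with a self-contained toy (this is not Ansor's code: "programs" are stood in for by small integer vectors, and measure/ToyCostModel are invented placeholders that only mirror the roles of hardware measurement and the learned cost model):

```python
import random

def measure(prog):                       # stand-in for real hardware measurement
    return 50 - sum((x - 3) ** 2 for x in prog)      # higher is better ("throughput")

class ToyCostModel:
    def __init__(self):
        self.noise = 8.0
    def predict(self, prog):             # cheap, imperfect fitness estimate
        return measure(prog) + random.gauss(0, self.noise)
    def update(self, records):           # "retraining" makes the estimate sharper
        self.noise = max(1.0, self.noise * 0.7)

def sample_program():
    return [random.randint(0, 7) for _ in range(4)]

def mutate(prog):
    p = list(prog)
    p[random.randrange(len(p))] = random.randint(0, 7)
    return p

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

model, measured = ToyCostModel(), []
for _ in range(5):                                        # fine-tuning iterations
    # initial population: fresh random samples + previously measured good programs
    population = [sample_program() for _ in range(32)] + [p for p, _ in measured[-32:]]
    for _ in range(4):                                    # evolutionary generations
        parents = sorted(population, key=model.predict, reverse=True)[:16]
        population = [mutate(random.choice(parents)) if random.random() < 0.5
                      else crossover(*random.sample(parents, 2)) for _ in range(64)]
    best = sorted(population, key=model.predict, reverse=True)[:8]
    records = [(p, measure(p)) for p in best]             # expensive real measurements
    measured += records
    model.update(records)                                 # retrain the cost model
print(max(measured, key=lambda r: r[1]))                  # best program found
```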

6.1 Evolutionary Search

  Tile size mutation: This operation scans the program and randomly selects a tiled loop. For this tiled loop, it divides the tile size of one tile level by a random factor and multiplies that factor into another tile level. Since this operation keeps the product of the tile sizes equal to the original loop length, the changed program is always valid.
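  A minimal sketch of this mutation on a list of tile sizes (a hypothetical helper, not Ansor's implementation):

```python
import random

def mutate_tile_sizes(tile_sizes):
    """Move a random factor from one tile level to another, keeping the product
    (and hence the original loop length) unchanged."""
    src, dst = random.sample(range(len(tile_sizes)), 2)
    factors = [f for f in range(2, tile_sizes[src] + 1) if tile_sizes[src] % f == 0]
    if not factors:                      # the chosen level has size 1, nothing to move
        return tile_sizes
    f = random.choice(factors)
    mutated = list(tile_sizes)
    mutated[src] //= f
    mutated[dst] *= f
    return mutated

# Example: (i0, i1, i2, i3) = (4, 8, 2, 16); the product stays 1024 after mutation.
print(mutate_tile_sizes([4, 8, 2, 16]))
```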

  Parallel mutation: This operation scans the program and randomly selects a loop annotated with parallel. For this loop, the operation changes the parallelization granularity by fusing adjacent loop levels or splitting a level by a factor.

  Pragma mutation: Some optimizations in a program are specified by compiler-specific pragmas. This operation scans the program and randomly selects a pragma. For this pragma, the operation randomly changes it to another valid value. For example, our underlying code generator supports automatic unrolling with a maximum number of steps via an auto_unroll_max_step=N pragma, and we randomly adjust the number N.

  Computation location: This operation scans the program and randomly selects a flexible node that is not multi-level tiled (e.g., a padding node in a convolutional layer). For this node, the operation randomly changes its compute location to another valid attach point.

  Node-based crossover: In Ansor, the genes of a program are its rewriting steps. Every program generated by Ansor is rewritten from an initial naive implementation, and Ansor preserves a complete rewriting history for each program during sketch generation and random annotation. We can treat the rewriting steps as the genes of a program, since they describe how the program was formed from the initial naive one. Based on this, we can combine the rewriting steps of two existing programs to generate a new program. However, arbitrarily combining rewriting steps from two programs may break the dependencies among steps and create invalid programs. Therefore, the granularity of the crossover operation in Ansor is the nodes of the DAG, since rewriting steps across different nodes usually have fewer dependencies. Ansor randomly selects a parent for each node and merges the rewriting steps of the selected nodes. When there are dependencies between nodes, Ansor tries to analyze and adjust these steps with simple heuristics. Ansor further verifies the merged program to guarantee functional correctness. Verification is simple because Ansor only uses a small number of loop-transformation rewriting steps, and the underlying code generator can check correctness via dependency analysis.

  Evolutionary search iteratively generates new sets of candidate programs using mutation and crossover over several rounds, and outputs a set of top-scoring programs, which are compiled and measured on the target hardware to obtain their real running costs. The collected measurement data are used to update the cost model. In this way, the accuracy of the learned cost model gradually improves to match the target hardware, and evolutionary search gradually generates higher-quality programs for the target hardware platform.
  Unlike the search algorithms in TVM and FlexTensor, which only work in a fixed grid-like parameter space, the evolution operations in Ansor are designed specifically for tensor programs: they apply to general tensor programs and can handle search spaces with complex dependencies. Unlike the unfolding rules in the Halide auto-scheduler, these operations can modify programs out of order, sidestepping sequential ordering constraints.

6.2 Learned Cost Model

  Since our target programs are mainly data-parallel tensor programs consisting of multiple interleaved loop nests with several assignment statements as the innermost statements, we train the cost model to predict the score of one innermost non-loop statement in a loop nest. For a complete program, we predict a score for each innermost non-loop statement and sum the predictions as the program's score. We build the feature vector of an innermost non-loop statement by extracting features in the context of the full program; the extracted features include arithmetic features and memory access features. Refer to the paper's other sections for an introduction to the features.
  We use weighted squared error as the loss function. Since we mainly care about identifying the well-performing programs in the search space, we give more weight to programs that run faster. Specifically, for a program $P$ with throughput $y$, the loss function of the model $f$ is
$$loss(f, P, y) = w_p \Big( \sum_{s \in S(P)} f(s) - y \Big)^2 = y \Big( \sum_{s \in S(P)} f(s) - y \Big)^2$$
where $S(P)$ is the set of innermost non-loop statements in $P$; the throughput $y$ is used directly as the weight $w_p$.
  We train a gradient boosted decision tree (XGBoost) as the underlying model $f$. A single model is trained for the tensor programs from all DAGs, and the throughputs of all programs from the same DAG are normalized to the range [0, 1]. When optimizing a DNN, the number of measured programs is usually less than 30,000; training XGBoost on such a small data set is very fast, so we train a new model from scratch every time instead of doing incremental updates.
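  A simplified sketch of such a cost model (an assumed setup, not Ansor's actual implementation: it regresses whole-program throughput directly and uses the throughput as the sample weight, ignoring the per-statement decomposition of the loss above):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 64))            # feature vectors of measured programs (64 is arbitrary)
y = rng.random(1000)                  # throughputs, normalized to [0, 1] per DAG

model = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y, sample_weight=y)      # faster programs receive larger weights

candidates = rng.random((5, 64))
print(model.predict(candidates))      # predicted scores used to rank candidate programs
```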

7. Task Scheduler

  A DNN can be divided into many independent subgraphs (for example, conv2d + relu). For some subgraphs, spending time tuning them does not significantly improve end-to-end DNN performance, for two reasons:
  (1) the subgraph is not a performance bottleneck;
  (2) tuning brings only minimal improvement to the subgraph's performance.
  To avoid wasting time on tuning unimportant subgraphs, Ansor dynamically assigns different amounts of time resources to different subgraphs. Take ResNet-50 for example: after graph partitioning it has 29 unique subgraphs. Most of these subgraphs are convolutional layers with different shape configurations (input size, kernel size, stride, etc.). We need to generate different programs for different convolutional layers, because the best tensor program depends on these shape configurations. In fact, all of a user's applications may involve multiple DNNs. This results in more subgraphs and more opportunities to reduce the total tuning time, since we can share and reuse knowledge between subgraphs, and a subgraph can also appear multiple times in one DNN or across different DNNs.
  We define a task as the process of generating a high-performance program for a subgraph. This means that optimizing a single DNN requires completing dozens of tasks (e.g., 29 tasks for ResNet-50). The task scheduler in Ansor allocates time resources to tasks iteratively. In each iteration, Ansor selects a task, generates a batch of promising programs for its subgraph, and measures the programs on the hardware. We define such an iteration as one unit of time resources. When we allocate a unit of time resources to a task, the task gets an opportunity to generate and measure new programs, which means a chance to find better programs.

7.1 Problem Formulation

  When tuning one DNN or a set of DNNs, users can have various kinds of goals, for example, reducing the latency of a DNN, meeting latency requirements for a set of DNNs, or minimizing tuning time when further tuning no longer significantly improves DNN performance. Therefore, we provide users with a set of objective functions to express their goals; users can also provide their own objective functions.
  Suppose there are $n$ tasks in total. Let $t \in \mathcal{Z}^n$ be the allocation vector, where $t_i$ is the number of time units spent on task $i$. Let $g_i(t)$ be the minimum subgraph latency achieved for task $i$ under allocation vector $t$, and let the end-to-end cost of the DNN be a function of the subgraph latencies, $f\big(g_1(t), g_2(t), \dots, g_n(t)\big)$. Our goal is to minimize the end-to-end cost:
$$\text{minimize } f\big(g_1(t), g_2(t), \dots, g_n(t)\big)$$
  To minimize the end-to-end latency of a single DNN, we can define
$$f(g_1, g_2, \dots, g_n) = \sum_{i=1}^{n} w_i \times g_i$$
where $w_i$ is the number of times task $i$ appears in the DNN. This formulation is straightforward because $f$ is an approximation of the end-to-end DNN latency.
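  As a toy numeric illustration of this objective (hypothetical numbers):

```python
# f = sum_i w_i * g_i: weighted sum of per-task best latencies.
best_latency = [0.12, 0.45, 0.08]     # g_i in ms, best latency found so far per task
occurrences  = [4, 1, 16]             # w_i, how many times each subgraph appears in the DNN

f = sum(w * g for w, g in zip(occurrences, best_latency))
print(f"approximate end-to-end latency: {f:.2f} ms")   # 0.48 + 0.45 + 1.28 = 2.21 ms
```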

[Table: examples of objective functions for tuning multiple DNNs]

  The table above shows examples of objective functions for tuning multiple DNNs. Let $m$ be the number of DNNs and $S(j)$ the set of tasks that belong to DNN $j$. $f_1$ adds up the latencies of all DNNs, which corresponds to optimizing the cost of running a pipeline of all DNNs once, one after another. In $f_2$, we define $L_j$ as the latency requirement of DNN $j$, meaning that once the latency requirement of a DNN is satisfied, we do not want to spend more time on it. In $f_3$, we define $B_j$ as the reference latency of DNN $j$, so our goal is to maximize the geometric mean of the speedups over the given reference latencies. Finally, in $f_4$, we define a function $ES(g_i, t)$ that returns an early-stopping value by looking at the latency history of task $i$, which enables per-task early stopping.

7.2 Optimizing with Gradient Descent

  To optimize the objective function effectively, the authors propose a gradient descent-based scheduling algorithm. The idea is that, given the current allocation $t$, we approximate the gradient $\frac{\partial f}{\partial t_i}$ of the objective function and select the task $i = \mathrm{argmax}_i \big| \frac{\partial f}{\partial t_i} \big|$. The gradient is approximated by making an optimistic guess and considering the similarity between tasks.
  The gradient approximation formula is as follows:
$$\frac{\partial f}{\partial t_i} = \frac{\partial f}{\partial g_i} \bigg( \alpha \frac{g_i(t_i) - g_i(t_i - \Delta t)}{\Delta t} + (1 - \alpha) \Big( \min\big(-\frac{g_i(t_i)}{t_i},\ \beta \frac{C_i}{\max_{k \in N(i)} V_k} - g_i(t_i)\big) \Big) \bigg)$$
where $\Delta t$ is a small backward window size, $g_i(t_i)$ and $g_i(t_i - \Delta t)$ are known from the allocation history, $N(i)$ is the set of tasks similar to task $i$, $C_i$ is the number of floating-point operations in task $i$, $V_k$ is the number of floating-point operations per second that task $k$ can achieve, and the parameters $\alpha$ and $\beta$ control how much we trust the respective predictions.
  To run the algorithm, Ansor starts from $t = 0$ and warms up with a round-robin pass to obtain an initial allocation vector $t = (1, 1, \dots, 1)$. After the warm-up, in each iteration we compute the gradient of each task and select $\mathrm{argmax}_i \big| \frac{\partial f}{\partial t_i} \big|$; we then allocate one resource unit to task $i$ and update the allocation vector with $t_i = t_i + 1$. The optimization process continues until the time budget is exhausted. To encourage exploration, we use an $\epsilon$-greedy strategy, which selects a random task with probability $\epsilon$.
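  A condensed sketch of the scheduler's task-selection step (not Ansor's actual code; the input format, the interpretation of $V_k$ as $C_k$ divided by the best latency of task $k$, and the assumption that $\frac{\partial f}{\partial g_i} = 1$ — i.e., the simple objective $f = \sum_i g_i$ — are my own simplifications):

```python
import random

def pick_next_task(t, history, flops, similar, alpha=0.2, beta=2.0, eps=0.05, dt=1):
    """Pick the task with the largest |df/dt_i| under an Ansor-style approximation.

    t       -- time units already allocated to each task
    history -- history[i]: best latencies g_i observed after each allocated unit
    flops   -- flops[i]: floating-point operations of task i (C_i)
    similar -- similar[i]: indices of tasks similar to task i (N(i))
    """
    n = len(t)
    if random.random() < eps:                          # epsilon-greedy exploration
        return random.randrange(n)
    grads = []
    for i in range(n):
        hist = history[i]
        prev = hist[-1 - dt] if len(hist) > dt else hist[-1]
        backward = (hist[-1] - prev) / dt              # progress in the recent window
        # Optimistic guess: either keep improving at the current average rate, or
        # approach the throughput of the best similar task (scaled by beta).
        if similar[i]:
            best_v = max(flops[k] / history[k][-1] for k in similar[i])
        else:
            best_v = flops[i] / hist[-1]
        optimistic = min(-hist[-1] / t[i], beta * flops[i] / best_v - hist[-1])
        grads.append(alpha * backward + (1 - alpha) * optimistic)
    return max(range(n), key=lambda j: abs(grads[j]))  # argmax_i |df/dt_i|

# Toy usage with three tasks (hypothetical numbers); after the round-robin warm-up,
# each call decides which task receives the next unit of time resources.
t = [3, 1, 5]
history = [[2.0, 1.5, 1.4], [0.9], [4.0, 3.0, 2.5, 2.4, 2.3]]
flops = [1e9, 2e8, 4e9]
similar = [[2], [], [0]]
print(pick_next_task(t, history, flops, similar))
```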


Origin blog.csdn.net/qq_42730750/article/details/129364173