1. Abstract
High-performance tensor programs are key to the efficient execution of deep neural networks. However, obtaining well-performing tensor programs for different operators on various hardware platforms is very challenging. Currently, deep learning systems rely on kernel libraries provided by hardware vendors or on various search strategies to obtain well-performing tensor programs. These methods have two drawbacks:
(1) a huge engineering effort is required to develop platform-specific optimized code;
(2) a limited search space and ineffective search strategies make it difficult to discover high-performance tensor programs.
To address these drawbacks, the authors propose Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor has the following characteristics:
(1) it explores more optimization combinations by sampling programs from a hierarchical representation of the search space;
(2) it fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs;
(3) it uses a task scheduler to optimize multiple subgraphs of a deep neural network simultaneously.
The authors' experiments show that Ansor is able to find high-performance programs outside the search space of existing state-of-the-art (SOTA) methods, improving the execution performance of deep neural networks by up to 3.8× on the Intel CPU, 2.6× on the ARM CPU and 1.7× on the NVIDIA GPU.
2. Introduction
Low-latency execution of deep neural networks (DNNs) plays a vital role in autonomous driving, augmented reality, language translation and other AI applications. A DNN can be expressed as a directed acyclic graph (DAG) in which nodes represent operators (e.g., convolution, matrix multiplication) and directed edges represent dependencies between operators. Existing deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet) map the operators in DNNs to vendor-provided kernel libraries (e.g., cuDNN, MKL-DNN) to achieve high performance. However, these kernel libraries require a huge engineering effort to tune manually for each hardware platform and operator. The manual work needed to produce an efficient operator implementation for each target accelerator limits the development of, and innovation in, new operators and specialized accelerators.
Given the importance of DNN performance, researchers and industry practitioners have turned to search-based compilation for the automatic generation of tensor programs, i.e., low-level implementations of tensor operators. For an operator or a subgraph of multiple operators, users define the computation in a high-level declarative language, and the compiler then searches for programs tailored to different hardware platforms.
3. Background
The deep learning ecosystem is embracing a rapidly growing diversity of hardware platforms, including CPUs, GPUs, FPGAs and ASICs. To deploy DNNs on these platforms, high-performance tensor programs need to be provided for the operators used in DNNs. The required set of operators usually contains standard operators (e.g., matmul, conv2d) and new operators invented by machine learning researchers (e.g., capsule conv2d, dilated conv2d). To provide portable performance for these operators across a wide range of hardware platforms in an efficient manner, various compiler techniques have emerged (e.g., TVM, Halide, Tensor Comprehensions). Users define the computation in a high-level declarative language in a form similar to mathematical expressions, and the compiler generates optimized tensor programs based on the definition. The figure below shows the computation definition of matrix multiplication in the TVM tensor expression language. The user mainly needs to define the shapes of the input tensors and how to compute each element of the output tensor.
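As a rough illustration of what such a computation definition contains, the sketch below uses plain Python rather than the TVM tensor expression language. The names `matmul_definition` and `naive_program` are made up for this note. The definition only states the output shape, the reduction extent and the rule for one output element; the naive program is then obtained by looping over every index.

```python
# Illustrative only: a declarative matmul definition and the naive program
# obtained by expanding it into loops. This is NOT the TVM API.

def matmul_definition(n, m, k):
    """Return the output shape, the reduction extent, and the per-element rule."""
    def element(A, B, i, j, r):
        # Contribution of reduction index r to the output element C[i][j].
        return A[i][r] * B[r][j]
    return (n, m), k, element

def naive_program(A, B, definition):
    """Expand the definition into three nested loops (i, j, r)."""
    (n, m), k, element = definition
    C = [[0] * m for _ in range(n)]
    for i in range(n):          # loop over output rows
        for j in range(m):      # loop over output columns
            for r in range(k):  # loop over the summed (reduction) index
                C[i][j] += element(A, B, i, j, r)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = naive_program(A, B, matmul_definition(2, 2, 2))
print(C)
```

A search-based compiler starts from a definition like this and is free to choose how the loops are tiled, ordered and annotated, as long as every output element is still computed by the same rule.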
However, it is extremely difficult to automatically generate high-performance tensor programs from high-level definitions. Depending on the architecture of the target platform, the compiler needs to search an extremely large and complex space made up of combinations of optimization choices (e.g., tile structure, tile size, vectorization, parallelization). Finding high-performance programs requires a search strategy that covers a comprehensive space and explores it efficiently.
4. Design Overview
Program sampler: A key challenge that Ansor must address is generating a large search space for a given computational graph. To cover diverse tensor programs with various high-level structures and low-level details, Ansor uses a hierarchical representation of the search space with two levels: sketches and annotations. Ansor defines the high-level structure of a program as a sketch and leaves billions of low-level choices (e.g., tile size, parallel, unroll annotations) as annotations. This representation allows Ansor to enumerate high-level structures flexibly and to sample low-level details efficiently.
Performance tuner: Randomly sampled programs do not necessarily perform well, so the next challenge is to fine-tune them. Ansor performs fine-tuning iteratively using evolutionary search and a learned cost model. In each iteration, Ansor starts the evolutionary search from an initial population made up of newly re-sampled programs and good programs from previous iterations. Evolutionary search fine-tunes programs through mutation and crossover, which rewrite programs out of order and so overcome the limitations of sequential construction. Querying the learned cost model is orders of magnitude faster than actual measurement, so thousands of programs can be evaluated in seconds.
Task scheduler: Program sampling and performance fine-tuning allow Ansor to find high-performance tensor programs for a computational graph. Intuitively, treating a complete DNN as a single computational graph and generating a complete tensor program for it could potentially achieve the best performance. However, this is inefficient because it has to deal with an unnecessary exponential explosion of the search space. Typically, the compiler partitions the large computational graph of a DNN into several small subgraphs. Because DNNs are constructed layer by layer, this partitioning has a negligible effect on performance. This brings Ansor's last challenge: how to allocate time resources when generating programs for multiple subgraphs. The task scheduler in Ansor uses a gradient descent-based scheduling algorithm to allocate resources to the subgraphs that are more likely to improve end-to-end DNN performance.
5. Program Sampling
The search space that an algorithm explores determines the best programs it can find. The search space considered by existing methods is limited by the following factors:
(1) Manual enumeration (TVM): It is impractical to enumerate all possible choices manually through templates, so existing manual templates can only heuristically cover a limited search space;
(2) Aggressive early pruning (Halide auto-scheduler): Aggressive early pruning based on evaluating incomplete programs prevents the search algorithm from exploring certain regions of the space.
To solve (1), we automatically expand the search space by recursively applying a flexible set of derivation rules; to avoid (2), we randomly sample complete programs in the search space.
Since random sampling gives every point an equal chance of being sampled, the proposed search algorithm can potentially explore every program in the considered space. It does not rely on random sampling alone to find the optimal program, because every sampled program is later fine-tuned.
At the top level, sketches are generated by recursively applying a few derivation rules. At the bottom level, these sketches are randomly annotated to obtain complete programs. This representation abstracts a few basic structures out of billions of low-level choices, enabling flexible enumeration of high-level structures and efficient sampling of low-level details.
5.1 Sketch Generation
The first column in the figure above shows two input examples. The input has three equivalent forms: the mathematical expression, the corresponding naive program obtained by directly expanding the loop indices, and the corresponding computational graph (DAG).
In computer programming, a "naive program" usually refers to a simple, straightforward implementation. Such a program may not handle every possible case or take advantage of existing optimization techniques. The term "naive" is also often used for code written by inexperienced or less skilled programmers, who may use a basic algorithm or data structure without considering more complex or efficient solutions. Such programs usually consume a lot of computing resources and run slowly.------by ChatGPT
To generate sketches for a DAG with multiple nodes, we visit all nodes in topological order and build the structure iteratively. For compute nodes that are computationally intensive and have many opportunities for data reuse (e.g., conv2d, matmul), we build basic tiling and fusion structures for them as the sketch. For simple element-wise nodes (e.g., ReLU, elementwise add), we can safely inline them. Note that new nodes (e.g., caching nodes, layout transform nodes) can also be introduced into the DAG during sketch generation.
The authors propose a derivation-based enumeration method that generates all possible sketches by recursively applying several basic rules. The procedure takes a DAG as input and returns a list of sketches. We define the state $\sigma = (S; i)$, where $S$ is the partially generated sketch for the DAG and $i$ is the index of the current working node. The nodes in the DAG are sorted in topological order from output to input. The derivation starts from the initial naive program and the last node, i.e., the initial state $\sigma = (\text{naive program};\ \text{index of the last node})$. We then try to apply all derivation rules to the states recursively. For each rule, if the current state satisfies the application condition, we apply the rule to $\sigma = (S; i)$ and obtain $\sigma' = (S'; i')$ with $i' \le i$. In this way the index $i$ (the working node) decreases monotonically, and a state becomes a terminal state when $i = 0$. During enumeration, several rules may apply to one state and generate several succeeding states, and a single rule can also generate several possible succeeding states. We therefore maintain a queue that stores all intermediate states, and the process ends when the queue is empty. All $\sigma.S$ in terminal states form the sketch list at the end of sketch generation. For a typical subgraph, the number of sketches is less than 10.
// Recursively apply several basic rules to generate all possible sketches
// Derivation rule based enumeration
Array<State> out_states;
while (!pnow->empty()) {
  pnext->clear();
  for (const State& state : *pnow) {
    int stage_id = cur_stage_id_map[state];
    // Reaches to the terminal stage
    if (stage_id < 0) {
      out_states.push_back(state);
      continue;
    }
    // Try all derivation rules
    for (const auto& rule : sketch_rules) {
      auto cond = rule->MeetCondition(*this, state, stage_id);
      if (cond != SketchGenerationRule::ConditionKind::kSkip) {
        for (const auto& pair : rule->Apply(*this, state, stage_id)) {
          cur_stage_id_map[pair.first] = pair.second;
          pnext->push_back(pair.first);
        }
        // Skip the rest rules
        if (cond == SketchGenerationRule::ConditionKind::kApplyAndSkipRest) {
          break;
        }
      }
    }
  }
  std::swap(pnow, pnext);
}
// For Conv2d(3, 64, kernel_size=(7, 7), stride=2, padding=1), 3 sketches are generated
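To make the enumeration easier to follow outside the TVM code base, here is a minimal Python sketch of the same queue-based idea. Everything in it is a simplification made for this note: the node kinds ("input", "reuse", "elementwise"), the three toy rules and the representation of a sketch as a list of rewrite labels are hypothetical and do not correspond to TVM classes.

```python
from collections import deque

# Toy rules: each returns a list of (new_sketch, new_index) pairs, or an
# empty list if its condition does not hold. Names are hypothetical.
def rule_skip(sketch, i, dag):
    if dag[i - 1] != "elementwise":
        return [(sketch + [f"skip:{i}"], i - 1)]
    return []

def rule_inline(sketch, i, dag):
    if dag[i - 1] == "elementwise":
        return [(sketch + [f"inline:{i}"], i - 1)]
    return []

def rule_tile(sketch, i, dag):
    if dag[i - 1] == "reuse":
        return [(sketch + [f"tile:{i}"], i - 1)]
    return []

def generate_sketches(dag, rules):
    """Enumerate every terminal state reachable from (naive program, last node)."""
    queue = deque([([], len(dag))])  # initial state: no rewrites yet, last node
    sketches = []
    while queue:
        sketch, i = queue.popleft()
        if i == 0:                   # terminal state: all nodes visited
            sketches.append(sketch)
            continue
        for rule in rules:           # a state may fork into several successors
            queue.extend(rule(sketch, i, dag))
    return sketches

# Nodes in topological order: an input, a matmul-like node, an element-wise node.
dag = ["input", "reuse", "elementwise"]
sketches = generate_sketches(dag, [rule_skip, rule_inline, rule_tile])
for s in sketches:
    print(s)
```

In this toy setting the "reuse" node satisfies both the skip rule and the tiling rule, so the state forks and two sketches come out, which mirrors how one state can produce multiple succeeding states.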
Derivation rules: The table above lists the derivation rules for CPU. The authors first define the predicates used and then describe the functionality of each rule. The values of these predicates are obtained by static analysis of the computation definition; the analysis is done automatically by parsing the read/write patterns in the mathematical expressions. The predicates from the table are summarized below:
Condition | Description |
---|---|
IsStrictInliable(S, i) | Node $i$ in $S$ is a simple element-wise operator, such as element-wise add and ReLU |
HasDataReuse(S, i) | Node $i$ in $S$ is a compute-intensive operator with many opportunities for data reuse within the operator, such as matmul and conv2d |
HasFusibleConsumer(S, i) | Node $i$ in $S$ has only one consumer node $j$, and node $j$ can be fused into node $i$, such as matmul + bias_add and conv2d + relu |
HasMoreReductionParallel(S, i) | Node $i$ in $S$ has little parallelism in the space dimensions but ample parallelism in the reduction dimensions, such as computing the L2 norm of a matrix or the matrix multiplication $C_{2\times2} = A_{2\times512} \cdot B_{512\times2}$ |
In computer programming, "inline" usually refers to a compiler optimization that replaces a function call with the body of the function at compile time. This avoids the extra overhead of the call and improves the execution efficiency of the code.
In C++, the keyword "inline" can be used to ask the compiler to treat a function as an inline function. The advantage of inline functions in a C++ program is that the overhead of function calls can be reduced, which improves the running efficiency of the program. In addition, the programmer writes the function only once, while each call to the function embeds the function's code at the call site.
Note that although inline functions can improve performance, not all functions are suitable for inlining. In general, small, frequently called functions are the best candidates, while larger, complex functions are not. In addition, inline functions may increase code size, so there is a trade-off between code size and performance.------by ChatGPT
Rule 1 simply skips a node if it is not strictly inlinable;
Rule 2 always inlines strictly inlinable nodes. Since the conditions of Rule 1 and Rule 2 are mutually exclusive, a state with $i > 1$ can always satisfy one of them and continue the derivation;
Rule 3 performs multi-level tiling for data-reusable nodes. For CPU, we use the "SSRSRS" tile structure, where "S" stands for one tile level of space loops and "R" stands for one tile level of reduction loops. For example, in the matrix multiplication $C(i,j) = \sum_k A[i,k] \times B[k,j]$, $i$ and $j$ are space loops and $k$ is a reduction loop. The "SSRSRS" tile structure expands the original 3-level loop $(i,j,k)$ into the 10-level loop $(i_0,j_0,i_1,j_1,k_0,i_2,j_2,k_1,i_3,j_3)$. Although the loop order is not permuted, this multi-level tiling also covers some reordering cases. For example, the 10-level loop above can be specialized to the simple reordering $(k_0,j_2,j_3)$ by setting the lengths of the other loops to 1. The "SSRSRS" tile structure is general for compute-intensive operators in deep learning (e.g., matmul, conv2d, conv3d), because they all consist of space loops and reduction loops;
Rule 4 performs multi-level tiling and also fuses the fusible consumers. For example, element-wise nodes (e.g., ReLU, bias_add) can be fused into the tiled nodes (e.g., conv2d, matmul);
Rule 5 adds a cache node if the current data-reusable node has no fusible consumer. For example, the final output node in a DAG does not have any consumer, so by default it writes its results directly to main memory, which is inefficient because of the high latency of memory access. By adding a cache node, we introduce a new fusible consumer into the DAG, and Rule 4 can then be applied to fuse this newly added cache node into the final output node. With the cache node fused, the final output node writes its results to a cache block, and the cache block is written to main memory in one go once all the data in the block has been computed;
Rule 6 can use rfactor to factorize a reduction loop into a space loop, which brings more parallelism.
As can be seen, Rule 3, Rule 4 and Rule 5 handle multi-level tiling and fusion for nodes with data reuse.
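The "SSRSRS" structure for matmul can be written out explicitly. The following is a small sketch in plain Python, written for this note rather than taken from Ansor: `tiled_matmul` is a hypothetical helper, the tile sizes are arbitrary, and `itertools.product` stands in for ten nested loops in the order $(i_0,j_0,i_1,j_1,k_0,i_2,j_2,k_1,i_3,j_3)$.

```python
import itertools

def tiled_matmul(A, B, ti, tj, tk):
    """Matmul with the "SSRSRS" loop structure.

    ti and tj hold four tile sizes each (one per space level);
    tk holds two tile sizes (one per reduction level).
    """
    n, m, k = len(A), len(B[0]), len(B)
    assert ti[0] * ti[1] * ti[2] * ti[3] == n
    assert tj[0] * tj[1] * tj[2] * tj[3] == m
    assert tk[0] * tk[1] == k
    C = [[0] * m for _ in range(n)]
    # Loop order: i0, j0, i1, j1, k0, i2, j2, k1, i3, j3 (rightmost is innermost).
    extents = (ti[0], tj[0], ti[1], tj[1], tk[0], ti[2], tj[2], tk[1], ti[3], tj[3])
    for i0, j0, i1, j1, k0, i2, j2, k1, i3, j3 in itertools.product(*map(range, extents)):
        # Rebuild the original indices from the tile indices.
        i = ((i0 * ti[1] + i1) * ti[2] + i2) * ti[3] + i3
        j = ((j0 * tj[1] + j1) * tj[2] + j2) * tj[3] + j3
        r = k0 * tk[1] + k1
        C[i][j] += A[i][r] * B[r][j]
    return C

A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
naive = [[sum(A[i][r] * B[r][j] for r in range(4)) for j in range(4)] for i in range(4)]
tiled = tiled_matmul(A, B, ti=(2, 1, 2, 1), tj=(1, 2, 1, 2), tk=(2, 2))
print(tiled == naive)
```

Setting most tile sizes to 1, for example `ti=(1, 1, 1, 4)`, collapses the corresponding loops, which is how one fixed tile structure can still represent several effective loop orders.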
For GPU, we use the "SSSRRSRS" tile structure. The loops in the first three space tile levels are bound to BlockIdx, virtual threads (virtual thread, used to reduce bank conflicts) and ThreadIdx, respectively. Two sketch derivation rules are also added: one uses shared memory by inserting a cache node (similar to Rule 5), and the other performs cross-thread reduction (similar to Rule 6).
The figure at the beginning of this section shows three examples of generated sketches. Sketches differ from the manual templates in TVM in that manual templates specify both the high-level structure and the low-level details, whereas sketches only define the high-level structure. For Input 1, the sorted order of the four nodes in the DAG is $(A,B,C,D)$. To derive the sketches for the DAG, we start from the output node $D\ (i=4)$ and apply rules to the nodes one by one. Specifically, the derivation of the generated Sketch 1 is:
$$\text{Input 1} \rightarrow \sigma(S_0, i=4) \xrightarrow{\text{Rule 1}} \sigma(S_1, i=3) \xrightarrow{\text{Rule 4}} \sigma(S_2, i=2) \xrightarrow{\text{Rule 1}} \sigma(S_3, i=1) \xrightarrow{\text{Rule 1}} \text{Sketch 1}$$
Referring to the DAG of Input 1 on the right: A and B are input data nodes, C is the matmul node and D is the output node. Rule 1 is applied to nodes A, B and D, and Rule 4 is applied to node C.
For Input 2, the sorted order of the five nodes is $(A,B,C,D,E)$. Similarly, we start from the output node $E\ (i=5)$ and apply the rules recursively. The derivation of the generated Sketch 1 is:
$$\text{Input 2} \rightarrow \sigma(S_0, i=5) \xrightarrow{\text{Rule 5}} \sigma(S_1, i=5) \xrightarrow{\text{Rule 4}} \sigma(S_2, i=4) \xrightarrow{\text{Rule 1}} \sigma(S_3, i=3) \xrightarrow{\text{Rule 1}} \sigma(S_4, i=2) \xrightarrow{\text{Rule 2}} \sigma(S_5, i=1) \xrightarrow{\text{Rule 1}} \text{Sketch 1}$$
Referring to the DAG of Input 2 on the right: A and D are input data nodes, B is the max node, C is the xxx node, and E is the matmul node and also the output node. Rule 1 is applied to nodes A, C and D, and Rule 2 is applied to node B. For node E, Rule 5 is applied first to insert a cache node, and then Rule 4 is applied.
Likewise, the derivation of the generated Sketch 3 is:
5.2 Random Annotation
The sketches generated in the previous subsection are incomplete programs because they only have tile structures, without specific tile sizes or loop annotations such as parallelization, unrolling and vectorization. In this subsection, sketches are annotated to turn them into complete programs for fine-tuning and evaluation.
Given a list of generated sketches, we randomly pick one sketch, randomly fill in the tile sizes, parallelize some outer loops, vectorize some inner loops and unroll a few inner loops. We also randomly change the computation location of some nodes in the program to make slight adjustments to the tile structure. Every "random" here means a uniform distribution over all valid values. If a special algorithm requires custom annotations to be effective (e.g., special unrolling), the user is allowed to give simple hints in the computation definition to adjust the annotation policy. Finally, since changing the layout of constant tensors can be done at compile time with no runtime overhead, we rewrite the layouts of constant tensors according to the multi-level tile structure to make them as cache-friendly as possible. This optimization is effective because the weight tensors of convolutional and fully connected layers are constant in inference applications.
Examples of random sampling can be seen in the figure at the beginning of this section. A sampled program may have fewer loops than the sketch, because loops of length 1 are simplified away.
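The tile size part of random annotation can be sketched as follows. This is a simplified illustration with made-up names (`sample_tile_sizes`, `annotate`, and the candidate values for `unroll_max_step`), not Ansor's sampler. Each step picks uniformly among the divisors that are still valid at that point, which is simpler than, and not necessarily identical to, a uniform draw over all complete factorizations.

```python
import random
from math import prod

def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def sample_tile_sizes(extent, levels, rng):
    """Split `extent` into `levels` factors whose product is `extent`."""
    sizes, remaining = [], extent
    for _ in range(levels - 1):
        size = rng.choice(divisors(remaining))
        sizes.append(size)
        remaining //= size
    sizes.append(remaining)  # the last level takes whatever is left
    return sizes

def annotate(sketch, rng):
    """Fill a sketch (loop name -> (extent, tile levels)) with random choices."""
    # The candidate unroll values are hypothetical, chosen for illustration.
    program = {"tiles": {}, "unroll_max_step": rng.choice([0, 16, 64, 512])}
    for loop, (extent, levels) in sketch.items():
        program["tiles"][loop] = sample_tile_sizes(extent, levels, rng)
    return program

# A matmul-like sketch: four tile levels for i and j, two for k.
sketch = {"i": (512, 4), "j": (512, 4), "k": (512, 2)}
program = annotate(sketch, random.Random(0))
for loop, sizes in program["tiles"].items():
    print(loop, sizes, prod(sizes))
```

Because every sampled factor divides what remains of the loop extent, the product of the tile sizes always equals the original loop length, so every sampled program is a valid completion of the sketch.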
A loop annotation is a tag attached to a loop that tells the compiler about the nature and characteristics of the loop and helps it optimize the loop's execution. These tags are usually added in the form of comments on the loop in the code.
Common loop annotations include:
1. unroll: Unroll the loop, i.e., copy the code in the loop body multiple times to reduce the overhead of the loop control statements.
2. vectorize: Vectorize the loop, i.e., combine the same operation applied to multiple data elements into a single operation to speed up the loop.
3. parallelize: Parallelize the loop, i.e., assign different iterations of the loop to different processor cores or threads to speed up the loop.
4. pipeline: Divide the iterations of the loop into multiple stages and execute them concurrently on different processor cores or threads to speed up the loop.
Using loop annotations requires choosing the appropriate tag for the specific situation and optimizing according to the characteristics of the hardware. Although loop annotations can improve the execution efficiency of a loop, too many of them may reduce the readability and maintainability of the code. Trade-offs and evaluation are therefore needed to find the best optimization scheme when using loop annotations.------by ChatGPT
6. Performance Fine-tuning
The programs sampled by the program sampler have good coverage of the search space, but their quality is not guaranteed because optimization choices such as tile structures and loop annotations are sampled randomly. The authors therefore introduce a performance tuner, which fine-tunes the sampled programs through evolutionary search and a learned cost model.
Fine-tuning is performed iteratively. In each iteration, we first use evolutionary search to find a small batch of promising programs according to the learned cost model. We then measure these programs on hardware to obtain their actual execution time. Finally, the measured performance data is used to retrain the cost model and make it more accurate.
Evolutionary search uses randomly sampled programs together with high-quality programs from previous measurements as the initial population, and applies mutation and crossover to generate the next generation. The learned cost model is used to predict the fitness of each program, which in our case is the throughput of the program. We run evolution for a fixed number of generations and pick the best programs found during the search. We use a learned cost model because it can estimate the fitness of programs with reasonable accuracy while being orders of magnitude faster than actual measurement. It allows us to compare tens of thousands of programs in the search space within seconds and to select the promising ones for actual measurement.
6.1 Evolutionary Search
Tile size mutation: This operation scans the program and randomly selects a tiled loop. For this tiled loop, it divides the tile size of one tile level by a random factor and multiplies the tile size of another level by the same factor. Since this operation keeps the product of the tile sizes equal to the original loop length, the mutated program is always valid.
Parallel mutation: This operation scans the program and randomly selects a loop annotated with parallel. For this loop, the operation changes the parallel granularity by either fusing adjacent loop levels or splitting the loop by a factor.
Pragma mutation: Some optimizations in a program are specified by compiler-specific pragmas. This operation scans the program and randomly selects a pragma. For this pragma, the operation randomly mutates it into another valid value. For example, our underlying code generator supports automatic unrolling with a maximum number of steps through the pragma auto_unroll_max_step=N, and we randomly adjust the number N.
Computation location mutation: This operation scans the program and randomly selects a flexible node that is not multi-level tiled (e.g., a padding node in a convolutional layer). For this node, the operation randomly changes its computation location to another valid attach point.
Node-based crossover: In Ansor, the genes of a program are its rewriting steps. Every program generated by Ansor is rewritten from an initial naive implementation, and Ansor preserves the complete rewriting history of each program during sketch generation and random annotation. We can treat the rewriting steps as the genes of a program, since they describe how the program was formed from the initial naive program. Based on this, we can generate a new program by combining the rewriting steps of two existing programs. However, arbitrarily combining rewriting steps from two programs may break dependencies among the steps and create an invalid program. Therefore, the granularity of the crossover operation in Ansor is based on the nodes in the DAG, since rewriting steps across different nodes usually have fewer dependencies. Ansor randomly selects one parent for each node and merges the rewriting steps of the selected nodes. When there are dependencies between nodes, Ansor tries to analyze and adjust the steps with simple heuristics. Ansor further verifies the merged program to guarantee functional correctness. Verification is simple because Ansor only uses a small set of loop transformation rewriting steps, and the underlying code generator can check correctness through dependency analysis.
Evolutionary search uses mutation and crossover to generate a new set of candidate programs repeatedly over several rounds and outputs a small set of programs with the highest scores. These programs are compiled and measured on the target hardware to obtain their real running time cost, and the collected measurement data is then used to update the cost model. In this way, the accuracy of the learned cost model gradually improves to match the target hardware, and evolutionary search gradually generates higher-quality programs for the target hardware platform.
Unlike the search algorithms in TVM and FlexTensor, which can only work in a fixed grid-like parameter space, the evolutionary operations in Ansor are designed specifically for tensor programs. They can be applied to general tensor programs and can handle search spaces with complex dependencies. Unlike the unfolding rules in the Halide auto-scheduler, these operations can perform out-of-order modifications to programs, which addresses the limitation of sequential construction.
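The interplay between mutation and the cost model can be sketched in a few lines of Python. This is a deliberately reduced illustration: a "program" is just a list of tile sizes for one loop, only tile size mutation is implemented (no crossover), selection simply keeps the top-scoring candidates, and `toy_cost_model` is a made-up stand-in for the learned model.

```python
import random

def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def mutate_tile_sizes(sizes, rng):
    """Move a random factor from one tile level to another; the product is unchanged."""
    sizes = list(sizes)
    src, dst = rng.sample(range(len(sizes)), 2)
    factor = rng.choice(divisors(sizes[src]))
    sizes[src] //= factor
    sizes[dst] *= factor
    return sizes

def evolutionary_search(population, cost_model, rng, generations=4, keep=8):
    """Return the `keep` candidates with the highest predicted fitness."""
    for _ in range(generations):
        children = [mutate_tile_sizes(p, rng) for p in population]
        # Rank parents and children together by predicted fitness.
        population = sorted(population + children, key=cost_model, reverse=True)[:keep]
    return population

# Hypothetical cost model: pretends that an innermost tile of 8 is ideal.
def toy_cost_model(sizes):
    return -abs(sizes[-1] - 8)

rng = random.Random(0)
start = [[64, 1, 1, 1] for _ in range(8)]
best = evolutionary_search(start, toy_cost_model, rng)
print(best[0], toy_cost_model(best[0]))
```

In the full loop described above, the top candidates returned by the search would be measured on hardware and the measurements fed back to retrain the cost model before the next round.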
6.2 Learned Cost Model
Since our target programs are mainly data-parallel tensor programs, which consist of multiple interleaved loop nests with several assignment statements as the innermost statements, we train a cost model to predict the score of an innermost non-loop statement in a loop nest. For a complete program, we make a prediction for each innermost non-loop statement and sum the predictions as the score. We build the feature vector of an innermost non-loop statement by extracting features in the context of the complete program. The extracted features include arithmetic features and memory access features. Refer to other subsections for an introduction to the features.
We use weighted squared error as the loss function. Because we mainly care about identifying well-performing programs in the search space, we put more weight on programs that run faster. Specifically, for a program $P$ with throughput $y$, the loss function of the model $f$ is
$$loss(f,P,y) = w_p \Big( \sum_{s \in S(P)} f(s) - y \Big)^2 = y \Big( \sum_{s \in S(P)} f(s) - y \Big)^2$$
where $S(P)$ is the set of innermost non-loop statements in $P$. We directly use the throughput $y$ as the weight $w_p$.
We train a gradient boosted decision tree (XGBoost) as the underlying model $f$. A single model is trained for all tensor programs coming from all DAGs, and the throughput of all programs from the same DAG is normalized to the range [0,1]. When optimizing a DNN, the number of measured programs is usually less than 30,000, and training XGBoost on such a small dataset is very fast, so we train a new model every time instead of doing incremental updates.
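The weighted squared error loss is simple enough to compute by hand. The sketch below is plain Python, not the XGBoost training code, and the helper names are invented for this note. It assumes that normalization divides by the best throughput seen for a DAG; the text only states that values end up in [0,1].

```python
def normalize_throughputs(throughputs):
    """Scale the throughputs of programs from one DAG into [0, 1].

    Assumption: divide by the best throughput of that DAG.
    """
    best = max(throughputs)
    return [y / best for y in throughputs]

def weighted_squared_error(statement_scores, y):
    """loss = y * (sum of per-statement predictions - y) ** 2."""
    prediction = sum(statement_scores)
    return y * (prediction - y) ** 2

ys = normalize_throughputs([50.0, 100.0, 200.0])
print(ys)

# The same absolute prediction error (0.5) on a fast and on a slow program:
fast_loss = weighted_squared_error([0.25, 0.25], ys[2])  # y = 1.0, predicted 0.5
slow_loss = weighted_squared_error([0.5, 0.25], ys[0])   # y = 0.25, predicted 0.75
print(fast_loss, slow_loss)
```

The fast program incurs four times the loss of the slow one for the same absolute error, which is the intended effect of using the throughput as the weight.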
7. Task Scheduler
A DNN can be partitioned into many independent subgraphs (e.g., conv2d + relu). For some subgraphs, spending time tuning them does not significantly improve end-to-end DNN performance, for two reasons:
(1) the subgraph is not a performance bottleneck;
(2) tuning brings only minimal improvement to the subgraph's performance.
To avoid wasting time on tuning unimportant subgraphs, Ansor dynamically allocates different amounts of time resources to different subgraphs. Take ResNet-50 as an example: after graph partitioning it has 29 unique subgraphs. Most of these subgraphs are convolutional layers with different shape configurations (input size, kernel size, stride, etc.). We need to generate different programs for different convolutional layers, because the best tensor program depends on these shape configurations. In practice, a user's application may contain multiple DNNs. This leads to more subgraphs and more opportunities to reduce the total tuning time, since we can share and reuse knowledge between subgraphs, and a subgraph can also appear multiple times in one DNN or across different DNNs. We define a task as the process performed to generate high-performance programs for a subgraph. This means that optimizing a single DNN requires finishing dozens of tasks (e.g., 29 tasks for ResNet-50). The task scheduler in Ansor allocates time resources to tasks iteratively. In each iteration, Ansor selects a task, generates a batch of promising programs for the subgraph and measures the programs on hardware. We define such an iteration as one unit of time resources. When we allocate one unit of time resources to a task, the task gets an opportunity to generate and measure new programs, which also means a chance to find better programs.
7.1 Problem Formulation
When tuning a DNN or a set of DNNs, users can have various types of goals, such as reducing the latency of a DNN, meeting latency requirements for a set of DNNs, or minimizing tuning time when tuning no longer improves DNN performance significantly. Therefore, we provide users with a set of objective functions to express their goals, and users can also provide their own objective functions.
Suppose there are $n$ tasks in total. Let $t \in \mathcal{Z}^n$ be the allocation vector, where $t_i$ is the number of time units spent on task $i$. Let the minimum subgraph latency achieved by task $i$ be a function of the allocation vector, $g_i(t)$, and let the end-to-end cost of the DNNs be a function of the subgraph latencies, $f\big(g_1(t), g_2(t), \dots, g_n(t)\big)$. Our goal is to minimize the end-to-end cost:
$$\text{minimize}\ f\big(g_1(t), g_2(t), \dots, g_n(t)\big)$$
To minimize the end-to-end latency of a single DNN, we can define $f(g_1, g_2, \dots, g_n) = \sum_{i=1}^{n} w_i \times g_i$, where $w_i$ is the number of times task $i$ appears in the DNN. This formula is straightforward because $f$ is an approximation of the end-to-end latency of the DNN.
The table above shows example objective functions for tuning multiple DNNs. Let $m$ be the number of DNNs and $S(j)$ be the set of tasks that belong to DNN $j$. $f_1$ adds up the latency of every DNN, which means optimizing the cost of a pipeline that runs all DNNs once, one after another. In $f_2$, we define $L_j$ as the latency requirement of DNN $j$, meaning that we do not want to spend time on a DNN whose latency already meets the requirement. In $f_3$, we define $B_j$ as the reference latency of DNN $j$, so our goal is to maximize the geometric mean of the speedup against the given reference latency. Finally, in $f_4$, we define a function $ES(g_i,t)$ that returns an early-stopping value by looking at the latency history of task $i$, which achieves the effect of per-task early stopping.
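The objective functions described above can be sketched directly from their verbal descriptions. The Python below is a reading of those descriptions made for this note, not a copy of the table: the `max` form used for $f_2$ and the negated geometric mean used for $f_3$ (so that minimizing the objective maximizes the speedup) are assumptions, and $f_4$ is omitted because $ES$ is not specified here. Each DNN is represented as a pair of lists (weights $w_i$, latencies $g_i$).

```python
import math

def dnn_latency(weights, latencies):
    """Approximate end-to-end latency of one DNN: sum of w_i * g_i."""
    return sum(w * g for w, g in zip(weights, latencies))

def f1(dnns):
    """Total latency of running every DNN once, one after another."""
    return sum(dnn_latency(w, g) for w, g in dnns)

def f2(dnns, requirements):
    """Assumed form: latency below the requirement L_j brings no further gain."""
    return sum(max(dnn_latency(w, g), L) for (w, g), L in zip(dnns, requirements))

def f3(dnns, references):
    """Assumed form: negated geometric mean of the speedups B_j / latency_j."""
    speedups = [B / dnn_latency(w, g) for (w, g), B in zip(dnns, references)]
    return -(math.prod(speedups) ** (1.0 / len(speedups)))

# Two DNNs: the first has two tasks (one of which appears twice), the second has one.
dnns = [([2, 1], [1.0, 2.0]), ([1], [8.0])]
print(f1(dnns))
print(f2(dnns, requirements=[5.0, 6.0]))
print(f3(dnns, references=[8.0, 8.0]))
```

With these numbers the first DNN already meets its latency requirement, so under `f2` further tuning of its tasks cannot lower the objective, while tuning the second DNN still can.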
7.2 Optimizing with Gradient Descent
To optimize the objective function efficiently, the authors propose a scheduling algorithm based on gradient descent. The idea is that, given the current allocation $t$, we approximate the gradient of the objective function $\frac{\partial f}{\partial t_i}$ in order to choose the task $i$ such that $i = \mathrm{argmax}_i \big| \frac{\partial f}{\partial t_i} \big|$. We approximate the gradient by making an optimistic guess and by taking the similarity between tasks into account.
The gradient approximation formula is as follows:
$$\frac{\partial f}{\partial t_i} \approx \frac{\partial f}{\partial g_i} \bigg( \alpha \frac{g_i(t_i) - g_i(t_i - \Delta t)}{\Delta t} + (1 - \alpha) \Big( \min\Big( -\frac{g_i(t_i)}{t_i},\ \beta \frac{C_i}{\max_{k \in N(i)} V_k} - g_i(t_i) \Big) \Big) \bigg)$$
where $\Delta t$ is a small backward window size, $g_i(t_i)$ and $g_i(t_i - \Delta t)$ are known from the history of allocations, $N(i)$ is the set of tasks similar to task $i$, $C_i$ is the number of floating point operations in task $i$, and $V_k$ is the number of floating point operations per second achieved in task $k$. The parameters $\alpha$ and $\beta$ control the weights given to trusting each prediction.
To run the algorithm, Ansor starts from $t = 0$ and warms up with one round of round-robin to obtain the initial allocation vector $t = (1,1,\dots,1)$. After the warm-up, at each iteration we compute the gradient of every task and pick $\mathrm{argmax}_i \big| \frac{\partial f}{\partial t_i} \big|$. We then allocate one resource unit to task $i$ and update the allocation vector with $t_i = t_i + 1$. The optimization process continues until the time budget runs out. To encourage exploration, we adopt an $\epsilon$-greedy strategy, which keeps a probability $\epsilon$ of selecting a task at random.
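The scheduling loop can be sketched as follows. This is a toy model under several assumptions made for this note: every task is treated as similar to every other task (so $N(i)$ is the whole task list), $V_k$ is taken as flops divided by the best latency so far, $\frac{\partial f}{\partial g_i}$ is taken to be the task weight $w_i$, the default values of `alpha`, `beta`, `epsilon` and `window` are illustrative, and `make_task` fakes tuning by replaying a fixed list of latencies.

```python
import random

def approx_gradient(history, t_i, flops, best_similar_speed,
                    df_dg=1.0, alpha=0.2, beta=2.0, window=1):
    """Approximate df/dt_i; history[s] is the best latency after s + 1 units."""
    g_now = history[-1]
    g_before = history[max(0, len(history) - 1 - window)]
    backward = (g_now - g_before) / window               # observed recent slope
    optimistic = -g_now / t_i                            # optimistic guess
    similar = beta * flops / best_similar_speed - g_now  # catch up with similar tasks
    return df_dg * (alpha * backward + (1 - alpha) * min(optimistic, similar))

def schedule(tasks, budget, rng, epsilon=0.05):
    """Allocate `budget` time units to tasks; returns the allocation vector t."""
    n = len(tasks)
    t = [0] * n
    history = [[] for _ in range(n)]

    def spend_unit(i):
        latency = tasks[i]["tune"]()
        best = min(history[i][-1], latency) if history[i] else latency
        history[i].append(best)
        t[i] += 1

    for i in range(n):                # round-robin warm-up: t = (1, 1, ..., 1)
        spend_unit(i)
    for _ in range(budget - n):
        if rng.random() < epsilon:    # epsilon-greedy exploration
            i = rng.randrange(n)
        else:
            speeds = [task["flops"] / h[-1] for task, h in zip(tasks, history)]
            grads = [approx_gradient(history[j], t[j], tasks[j]["flops"], max(speeds),
                                     df_dg=tasks[j]["weight"]) for j in range(n)]
            i = max(range(n), key=lambda j: abs(grads[j]))
        spend_unit(i)
    return t

def make_task(latencies, flops, weight=1):
    """Hypothetical task whose tuning results replay a fixed list of latencies."""
    it = iter(latencies)
    def tune():
        return next(it, latencies[-1])
    return {"tune": tune, "flops": flops, "weight": weight}

tasks = [make_task([10.0] * 20, flops=100.0),
         make_task([10.0, 8.0, 6.0, 4.0, 3.0, 2.5], flops=100.0)]
allocation = schedule(tasks, budget=12, rng=random.Random(0))
print(allocation)
```

The allocation always sums to the budget and gives every task at least one unit because of the warm-up round; how the remaining units are split depends on the latency histories and on the chosen `alpha` and `beta`.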