HLS Lesson 29 (UG871: Design Analysis and Optimization)

The main goal of design optimization is to improve the initiation interval (II) and the latency.

High-level synthesis might automatically inline small functions to improve the quality of results (QoR). You can prevent this by adding the Inline directive with the -off option to any function being automatically inlined.
If you do not want a small sub-function to be inlined, apply the INLINE directive with the off option to that sub-function.
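
A minimal sketch of where the directive goes, assuming the pragma form of the Inline directive. The names scale_and_clip and top_scale are hypothetical and do not come from the UG871 design. A standard C++ compiler ignores the pragma, so the sketch also runs as plain C++.

```cpp
// Hypothetical small helper that HLS might otherwise inline automatically.
static int scale_and_clip(int x) {
// Keep this function as its own block in the RTL hierarchy.
#pragma HLS INLINE off
    int y = x * 3;
    return (y > 255) ? 255 : y;
}

// Hypothetical top-level function that calls the helper in a loop.
void top_scale(const int in[8], int out[8]) {
    Scale: for (int i = 0; i < 8; i++) {
        out[i] = scale_and_clip(in[i]);
    }
}
```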

Loops ensure the design will have small area but the design will take multiple iterative states to complete: each iteration of the loop will complete before the next iteration starts.
To improve performance, these loops should be pipelined.
For a for loop, the first optimization to try is PIPELINE, which can reduce the II.

Expanding these loops in Performance view shows both loops call function dct_dct_1d2.
Unless this function itself is pipelined, there is no benefit in pipelining the loop.
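Applying PIPELINE to a for loop only turns the operations of the loop body into pipeline stages.
If one of those operations is itself a sequential block that has not been pipelined (here, the called function), it becomes the bottleneck of the whole pipeline.
So the sub-function called inside the loop body has to be optimized first.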

You can choose to do one of the following:
You can pipeline the function and then pipeline the loop that calls it.
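That is, pipeline the sub-function called in the loop body first, then pipeline the for loop itself.
You can pipeline the loops within this function and simply make this function execute faster.
Alternatively, go into the sub-function and pipeline the for loops inside it.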
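
The second option can be sketched as follows. dot8 and mac_rows are hypothetical names, not functions from the dct design. For the first option, the PIPELINE pragma would instead sit at the top of the body of dot8 and again inside the Rows loop.

```cpp
const int N = 8;

// Hypothetical sub-function that is called from a loop.
static int dot8(const short a[N], const short b[N]) {
    int acc = 0;
    Dot: for (int k = 0; k < N; k++) {
// Option 2: pipeline the loop inside the sub-function.
#pragma HLS PIPELINE II=1
        acc += a[k] * b[k];
    }
    return acc;
}

// Hypothetical caller: one dot product per row of m.
void mac_rows(const short m[N][N], const short v[N], int out[N]) {
    Rows: for (int i = 0; i < N; i++) {
        out[i] = dot8(m[i], v);
    }
}
```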

Pipelining the function unrolls all the loops within it, and thus greatly increases the area. If the objective is to get the highest possible performance with no regard for area, this may be the best optimization to perform.
When PIPELINE is applied at the function level, HLS unrolls every for loop inside the function.
Function-level pipelining is therefore a blunt, brute-force move.

You should pipeline the outer loop instead. This causes the inner loop to be completely unrolled.
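When PIPELINE is applied to the outer loop, HLS completely unrolls the inner loop.
Because the inner loop is now completely unrolled, remove any PIPELINE directive that was previously placed on it.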

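Completely unrolling the inner loop often gives only a limited II improvement.
The reason is that the inner loop body may access array elements.
The limited number of ports on the array can then keep the unrolled code from accessing all the elements in the same cycle. The accesses have to be scheduled one after another, which takes several cycles.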

Applying an ARRAY_PARTITION directive to the variable increases the number of access ports on the array.
For example:

short buf_2d_in[DCT_SIZE][DCT_SIZE];
#pragma HLS ARRAY_PARTITION variable=buf_2d_in complete dim=2

Completely partitioning dim 2 splits the original two-dimensional array into independent one-dimensional arrays.

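Whether to partition dim 1 or dim 2 depends on which elements the completely unrolled code needs to access at the same time.
In this example the simultaneous accesses are along the second dimension (the column dimension), so the second dimension is completely partitioned.
Note that HLS numbers dimensions from left to right: dim=1 is the leftmost index, and dim=0 selects all dimensions. This differs from the habit of counting from right to left, that is, from the lowest dimension to the highest.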
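
A small sketch of why dim 2 is the one to partition. row_sums is a hypothetical function, not part of the dct design; only the buffer name and the directive are taken from the text above. Pipelining Sum_Row unrolls Sum_Col, so all elements buf_2d_in[i][0..7] of one row are needed in the same cycle.

```cpp
const int DCT_SIZE = 8;

// Hypothetical function: sums each row of an 8x8 block.
void row_sums(const short in[DCT_SIZE][DCT_SIZE], int sums[DCT_SIZE]) {
    // Local buffer; dim 2 (the column index) is split into separate memories.
    short buf_2d_in[DCT_SIZE][DCT_SIZE];
#pragma HLS ARRAY_PARTITION variable=buf_2d_in complete dim=2

    Copy_Row: for (int i = 0; i < DCT_SIZE; i++) {
        Copy_Col: for (int j = 0; j < DCT_SIZE; j++) {
#pragma HLS PIPELINE
            buf_2d_in[i][j] = in[i][j];
        }
    }

    Sum_Row: for (int i = 0; i < DCT_SIZE; i++) {
// Pipelining the outer loop completely unrolls Sum_Col.
#pragma HLS PIPELINE
        int acc = 0;
        Sum_Col: for (int j = 0; j < DCT_SIZE; j++) {
            // After unrolling, every column of row i is read together.
            acc += buf_2d_in[i][j];
        }
        sums[i] = acc;
    }
}
```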

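A small tip here.
If a resource directive such as this one is needed, where is the best place to apply it?
On the array parameter of the sub-function, or on the local array defined in the calling function?
The recommendation is to apply it to the local array in the calling function.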

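A further way to reduce the II is
to apply the DATAFLOW directive in the top function.
HLS then inserts channels (FIFOs or ping-pong buffers) between the called sub-functions and lets each sub-function run under its own FSM instead of being scheduled sequentially.
Note that, as a matter of design style, DATAFLOW is recommended only in the top function.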

One way to have the blocks in dct_2d operate in parallel would be to pipeline the entire function. This, however, would unroll all the loops, which can sometimes lead to a large area increase. An alternative is to use dataflow optimization on function dct_2d.
As noted above, function-level pipelining unrolls every loop, so this heavy weapon must be used with care.

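DATAFLOW can also be tried at the sub-function level. It works, but it is not the recommended design style; DATAFLOW is normally used only in the top function.
So what is the alternative if a sub-function should benefit from dataflow optimization?
Inline the sub-function. Once it is inlined into the top function, which carries the DATAFLOW directive, the inlined code gets the dataflow optimization as well.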

Of course, where inlining does not suit the situation, DATAFLOW can still be applied separately at the sub-function level.
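
A minimal sketch of DATAFLOW in a top function. The names top_df, stage_scale, and stage_offset are hypothetical. Each stage writes or reads the intermediate array mid in order, which is the producer/consumer pattern the directive expects.

```cpp
const int LEN = 16;

// Hypothetical first stage: produces mid from in.
static void stage_scale(const int in[LEN], int mid[LEN]) {
    Scale: for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE
        mid[i] = in[i] * 2;
    }
}

// Hypothetical second stage: consumes mid and produces out.
static void stage_offset(const int mid[LEN], int out[LEN]) {
    Offset: for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE
        out[i] = mid[i] + 1;
    }
}

// Hypothetical top function: the two stages can overlap in time.
void top_df(const int in[LEN], int out[LEN]) {
#pragma HLS DATAFLOW
    int mid[LEN];
    stage_scale(in, mid);
    stage_offset(mid, out);
}
```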

+++++++++++++++++++++++++++++++++++++++++++++
Let's look at the first example.

void matrixmul(
      mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
      mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
      result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  // Iterate over the rows of the A matrix
  Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    // Iterate over the columns of the B matrix
    Col: for(int j = 0; j < MAT_B_COLS; j++) {
      res[i][j] = 0;
      // Do the inner product of a row of A and col of B
      Product: for(int k = 0; k < MAT_B_ROWS; k++) {
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}

Without any optimization, the II is determined by the array accesses in each iteration.

For a for loop, the first thing that comes to mind is the PIPELINE directive.

High-Level Synthesis automatically applies loop flattening, collapsing the nested loops, removing the loop transitions (essentially creating a single loop with more iterations but overall fewer clock cycles).

With PIPELINE applied to the innermost loop, Product, the II does not change; it is still 80.

Look at the following key code:

Col: for(int j = 0; j < MAT_B_COLS; j++) {
  res[i][j] = 0;
  // Do the inner product of a row of A and col of B
  Product: for(int k = 0; k < MAT_B_ROWS; k++) {
    res[i][j] += a[i][k] * b[k][j];
  }
}

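The adder operation depends on the preceding write (res[i][j] = 0), and that write goes to an external interface, so it cannot be optimized away. This is the cause of the bottleneck.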

Because res is a top-level function argument, it is a write to a port in the RTL:
This operation must happen before the operations in loop Product are executed. Because it is not an internal operation but has an impact on the I/O behavior, this operation cannot be moved or optimized.
This prevents the Product loop from being flattened into the Row_Col loop.

It is worth addressing why only an initiation interval (II) of 2 was possible for the Product loop.
The tool reports: Unable to enforce a carried dependency constraint.
The issue is a carried dependency. This is a dependency between an operation in one iteration of a loop and an operation in a different iteration of the same loop.

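You can see line 60 is a read from array res (due to the += operator) and a write to array res.
These are the operations which have a dependency between loop iterations.
The += operation needs one read followed by one write: the result of the previous iteration must be read before the result of the current iteration is written.
This is why the II is 2.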
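
To make the dependency visible, the += can be written out as a separate read and write. This is only an illustration with a hypothetical function name (accumulate_in_place); it performs the same arithmetic as one res[i][j] accumulation for a row of a and a column of b.

```cpp
const int K = 3;

// Hypothetical helper: same arithmetic as res[i][j] += a[i][k] * b[k][j],
// with the read and the write of the result element split apart.
void accumulate_in_place(int *res_elem, const int a_row[K], const int b_col[K]) {
    *res_elem = 0;
    for (int k = 0; k < K; k++) {
        // Read: needs the value written by iteration k-1.
        int previous = *res_elem;
        int next = previous + a_row[k] * b_col[k];
        // Write: must complete before iteration k+1 can read.
        *res_elem = next;
    }
}
```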

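Move the PIPELINE directive up one level, to the Col loop.
The II drops to 21.
This is because the inner Product loop is completely unrolled, which largely removes the read/write dependency.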
INFO: [XFORM 203-502] Unrolling all sub-loops inside loop ‘Col’ (matrixmul.cpp:56) in function ‘matrixmul’ for pipelining.
INFO: [XFORM 203-541] Flattening a loop nest ‘Row’ (matrixmul.cpp:54:37) in function ‘matrixmul’.
WARNING: [SCHED 204-69] Unable to schedule ‘load’ operation (‘a_load_1’, matrixmul.cpp:60) on array ‘a’ due to limited memory ports.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
This time the limit is no longer the read/write dependency of the for loop but a resource limit: there are not enough memory ports.

By accessing array a through a single block RAM interface, there are not enough ports to be able to read all three values in one clock cycle.
At this point an array directive is needed.

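ARRAY_PARTITION splits an array into several independent sub-arrays, which exposes several independent ports.
In this example a and b become the bottleneck, because once the loop body is completely unrolled, all the array elements indexed by k have to be accessed at the same time.
With array partitioning, the dimension indexed by k is therefore the one to partition completely.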

For array a, this is dimension 2 because its access pattern is a[i][k]; for array b, this is dimension 1 because its access pattern is b[k][j].
So for a it is dim 2, and for b it is dim 1.
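
Putting the two steps together, a sketch of the example with PIPELINE on the Col loop and both arrays partitioned along the k dimension. The 3 × 3 sizes and the char/short types are assumptions made so that the block is self-contained; the lab's own header defines the real ones.

```cpp
// Assumed sizes and types, for illustration only.
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(
      mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
      mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
      result_t res[MAT_A_ROWS][MAT_B_COLS])
{
// k indexes dim 2 of a (a[i][k]) and dim 1 of b (b[k][j]).
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=b complete dim=1

  Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    Col: for(int j = 0; j < MAT_B_COLS; j++) {
// Pipelining Col completely unrolls the Product loop below.
#pragma HLS PIPELINE
      res[i][j] = 0;
      Product: for(int k = 0; k < MAT_B_ROWS; k++) {
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```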

Alternatively, we can use re-shape instead of partition allowing one wide array (port) to be created instead of k ports.

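We can also use ARRAY_RESHAPE.
It effectively partitions the array first and then concatenates the resulting independent ports, bit by bit, into a single wider port.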

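ARRAY_RESHAPE fits this kind of resource optimization particularly well.
The situation is that one dimension of an array has to be accessed all at once, that is, a one-dimensional vector within the array is traversed, and in C that traversal is written as an innermost for loop.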
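
A plain C++ model of the idea, not what the tool generates. With a 3 × 3 array of 8-bit elements, reshaping dim 2 corresponds conceptually to storing each row as one 24-bit word, so a single read returns all three elements. pack_row and unpack_elem are hypothetical helpers, and the bit order (element 0 in the low byte) is an assumption of this sketch. In the design itself the directive would take the form #pragma HLS ARRAY_RESHAPE variable=a complete dim=2.

```cpp
#include <cstdint>

const int COLS = 3;

// Concatenate the three 8-bit elements of one row into a single wide word.
static uint32_t pack_row(const uint8_t row[COLS]) {
    uint32_t word = 0;
    for (int k = 0; k < COLS; k++) {
        word |= static_cast<uint32_t>(row[k]) << (8 * k);
    }
    return word;
}

// Extract element k from the wide word delivered by the single port.
static uint8_t unpack_elem(uint32_t word, int k) {
    return static_cast<uint8_t>((word >> (8 * k)) & 0xFFu);
}
```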

After the resource optimization the II drops to 11.

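At this point the II of the Col loop is already 1, so the for loop is no longer the bottleneck.
Now the execution order of the whole function has to be analyzed.
The flattened Row/Col loop nest has a trip count of 9, so it needs at least 9 clocks; together with the cycles needed to fill the pipeline and return to the start, that is why the II of the whole function is 11.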

After 9 iterations/samples (Trip count) it completes all samples.
The function can then complete and return to start to process the next set of data.

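As we can see, the whole function is scheduled around a set of data: the data is assumed to become ready batch by batch, is then processed, and is then output as a whole.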

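But this is not the usual processing scenario.
Usually data does not become ready in batches but as a stream: it arrives in monotonically increasing address order and is output in monotonically increasing address order as well.
To make HLS treat the data as a stream, the coding style has to change.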

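Without modifying the code, we can still optimize further so that the outermost Row loop is parallelized as well.
To reduce the II of the whole function without code changes, apply PIPELINE at the function level.
The II then drops to 5.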
INFO: [XFORM 203-502] Unrolling all loops for pipelining in function ‘matrixmul’

The pipelined function results in the best performance. Pipelining loops gives you an easy way to control resources, with the option of partially unrolling the design to meet performance.

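Even with PIPELINE at the function level, the II of the whole function cannot be reduced to 1, because HLS treats the input arrays as batch data rather than streaming data.
So directives have to be applied to the input arrays to make HLS treat them as streaming data.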

Apply interface directives to the array parameters.
The ap_fifo interface can be used here, but it requires code changes. More on this below.

++++++++++++++++++++++++++++++++++++++++++++
For the read ports, the data must be cached internally to ensure the design does not have to re-read the port. For the write port res, the data must be saved into a temporary variable and only written to the port in the cycles shown in red.

Internal cache buffers are the main way to solve the I/O problem.

void matrixmul(
      mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
      mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
      result_t res[MAT_A_ROWS][MAT_B_COLS])
{

  // One row of A holds MAT_A_COLS elements
  mat_a_t a_row[MAT_A_COLS];
  mat_b_t b_copy[MAT_B_ROWS][MAT_B_COLS];
  int tmp = 0;

  // Iterate over the rows of the A matrix
  Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    // Iterate over the columns of the B matrix
    Col: for(int j = 0; j < MAT_B_COLS; j++) {
      // Do the inner product of a row of A and col of B
      tmp = 0;
      // Cache each row (so it's only read once per function)
      if (j == 0)
        Cache_Row: for(int k = 0; k < MAT_A_COLS; k++)
          a_row[k] = a[i][k];

      // Cache all cols (so they are only read once per function)
      if (i == 0)
        Cache_Col: for(int k = 0; k < MAT_B_ROWS; k++)
          b_copy[k][j] = b[k][j];

      Product: for(int k = 0; k < MAT_B_ROWS; k++) {
        tmp += a_row[k] * b_copy[k][j];
      }

      res[i][j] = tmp;
    }
  }
}

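This code changes the coding style of the previous example by adding internal cache buffers. That improves the I/O: it no longer has to be batch data and can be implemented as streaming data.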

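The first improvement is tmp.
Inside the for loop, the variable that is repeatedly read and written is no longer res but tmp. res is written only once per element: after the Product loop finishes, the final value of tmp is written to res.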

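The second improvement is a_row.
The inner for loop no longer reads array a repeatedly. The body of the Col loop needs the same row of a again and again; with the cache, the row is reloaded only when the row index changes, and the whole row is read into the cache in one go.
A conditional guard is used inside the inner loop: if (j == 0), that is, when the column index is 0, one row read is started. This keeps the for-loop nest compact.
A row of data is a one-dimensional vector, and in C the access to a one-dimensional vector is written as a for loop.
Here k is used to traverse the row.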

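The third improvement is b_copy.
The inner for loop no longer reads array b repeatedly. The body of the Col loop needs a column of b again and again; with the cache, each column is read into the cache in one go the first time its column index is visited.
Again a conditional guard is used inside the inner loop: if (i == 0), that is, when the row index is 0, one column read is started. This keeps the for-loop nest compact.
A column of data is a one-dimensional vector, and in C the access to a one-dimensional vector is written as a for loop.
Here k is used to traverse the column.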

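The fourth improvement is that the two caches, a_row and b_copy, are the operands used in the computation.
The I/O is thereby separated out: port accesses happen only when a cache is loaded.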

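Note that a_row is a one-dimensional cache while b_copy is a two-dimensional cache. Why?
It depends on how long the cached data stays valid.
If resources allow, caching the whole batch of data is of course the most robust choice.
In this example, however, a row of a is needed only during the iterations for that row and can be discarded afterwards. Allocating a cache for every row would waste most of that storage.
A column of b, on the other hand, is needed not only during the iterations for the current row but also during the iterations for every other row; it stays valid from start to finish.
This is what these notes call a data fallback access: going back to data that was used in an earlier iteration.
For a there is no data fallback access. As the column index j advances, all the data needed is found in the cache (a cache hit). When the row index i advances, the cache is reloaded once, and the data of row i-1 or i-2 of a is never needed again. The cache never misses.
For b there is data fallback access: after the row index i advances, the data used in rows i-1 and i-2 is needed again. As the column index advances, a different column is needed each time. To get a cache hit, the cache must still hold the data loaded during the earlier rows.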

+++++++++++++++++++++++++++++++++++++++++++++++++
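With the coding style changed as above, interface directives can now be applied to constrain the interfaces to streaming data.
First, the for loop is optimized again.
This time a data stream is expected, so the PIPELINE directive on the Col loop can be made stricter by adding rewind.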
Because the for-loops to cache the row and column would require multiple cycles to perform the reads, the pipeline directive has been applied to the Col for-loop, ensuring these cache for-loops are automatically unrolled.

Col: for(int j = 0; j < MAT_B_COLS; j++) {
#pragma HLS PIPELINE rewind

Then constrain the interfaces:
add ap_fifo directives to the array parameters.

#pragma HLS INTERFACE ap_fifo port=a
#pragma HLS INTERFACE ap_fifo port=b
#pragma HLS INTERFACE ap_fifo port=res
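
For reference, a sketch of the cached version with these directives in place. The 3 × 3 sizes and the char/short types are assumptions made so that the block is self-contained; the pragmas are the ones quoted above, placed inside the function body and the Col loop.

```cpp
// Assumed sizes and types, for illustration only.
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(
      mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
      mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
      result_t res[MAT_A_ROWS][MAT_B_COLS])
{
#pragma HLS INTERFACE ap_fifo port=a
#pragma HLS INTERFACE ap_fifo port=b
#pragma HLS INTERFACE ap_fifo port=res

  mat_a_t a_row[MAT_A_COLS];
  mat_b_t b_copy[MAT_B_ROWS][MAT_B_COLS];
  int tmp = 0;

  Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    Col: for(int j = 0; j < MAT_B_COLS; j++) {
// Pipelining Col unrolls Cache_Row, Cache_Col and Product.
#pragma HLS PIPELINE rewind
      tmp = 0;
      // Load row i of a once, when the first column is processed.
      if (j == 0)
        Cache_Row: for(int k = 0; k < MAT_A_COLS; k++)
          a_row[k] = a[i][k];

      // Load column j of b once, while the first row is processed.
      if (i == 0)
        Cache_Col: for(int k = 0; k < MAT_B_ROWS; k++)
          b_copy[k][j] = b[k][j];

      Product: for(int k = 0; k < MAT_B_ROWS; k++) {
        tmp += a_row[k] * b_copy[k][j];
      }

      // Single write to the output port per result element.
      res[i][j] = tmp;
    }
  }
}
```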
