HLS Lesson 30 (UG1270, Design Optimization)

The optimization steps recommended by Xilinx are as follows:
Simulate Design----Validate the C function

Synthesize Design----Baseline design

1: Initial Optimizations----Define loop trip counts, define interfaces (and data packing)

2: Pipeline for Performance----Pipeline and dataflow

3: Optimize Structures for Performance----Partition memories and ports, remove false dependencies

4: Reduce Latency----Optionally specify latency requirements

5: Improve Area----Optionally recover resources through sharing

Designs will reach the optimum performance after step 3.
The key factors in the performance estimates are the timing, interval, and latency in that order.

In an ideal hardware function, the hardware processes data at the rate of one sample per clock cycle. If the largest data set passed into the hardware is size N (e.g., my_array[N]), the most optimal II is N + 1: the function processes N data samples in N clock cycles and can accept new data one clock cycle after all N samples are processed.

The loop initiation interval is the number of clock cycles before the next iteration of a loop starts to process data.
The loop iteration latency is the number of clock cycles it takes to complete one iteration of a loop, and the loop latency is the number of cycles to execute all iterations of the loop.

If the design has loops with variable loop bounds, the compiler cannot determine the latency or II and uses the “?” to indicate this condition.
In for loops, use a fixed number of iterations whenever possible.
If a variable loop bound really is needed, apply the LOOP_TRIPCOUNT directive.

Because the LOOP_TRIPCOUNT value is only used for reporting, and has no impact on the resulting hardware implementation, any value can be used.
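As a minimal sketch (the function, bound, and trip-count values here are hypothetical, not from the guide), a variable-bound loop can be annotated so that the reports show latency and II estimates instead of "?":

typedef int data_t;	// illustrative type

data_t accumulate(data_t *data, int num_samples)
{
	data_t sum = 0;
	// num_samples is a runtime argument, so the loop bound is variable
	Loop: for (int i = 0; i < num_samples; i++) {
	#pragma HLS LOOP_TRIPCOUNT min=8 max=256 avg=64
		sum += data[i];
	}
	return sum;
}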

+++++++++++++++++++++++++++++++++++++++++
Pipeline for Performance

Pipelining results in the greatest level of concurrency and the highest level of performance.

A recommended strategy is to work from the bottom up and be aware of the following:

The non-pipelined sub-function will be the limiting factor.
When you use the PIPELINE directive, the directive automatically unrolls all loops in the hierarchy below. This can create a great deal of logic. It might make more sense to pipeline the loops in the hierarchy below.
Loops with variable bounds cannot be unrolled, and any loops and functions in the hierarchy above these loops cannot be pipelined.
Here, the guide again stresses the parallelization problems caused by variable loop bounds.

Pipeline these loops with variable bounds, and use the DATAFLOW optimization to ensure the pipelined loops operate concurrently to maximize the performance of the tasks that contain the loops.
If variable loop bounds really cannot be avoided, try combining them with the DATAFLOW directive.

The key optimization directives for obtaining a high-performance design are the PIPELINE and DATAFLOW directives.

+++++++++++++++++++++++++++++++++++++
Frame-Based C Code
The function processes multiple data samples - a frame of data - typically supplied as an array or pointer, with data accessed through pointer arithmetic during each transaction.
A transaction is considered to be one complete execution of the C function.
In this coding style, the data is typically processed through a series of loops or nested loops.

void foo(
	data_t in1[HEIGHT][WIDTH],
	data_t in2[HEIGHT][WIDTH],
	data_t out[HEIGHT][WIDTH] )
{
	Loop1: for(int i = 0; i < HEIGHT; i++) {
		Loop2: for(int j = 0; j < WIDTH; j++) {
			out[i][j] = in1[i][j] * in2[i][j];
			Loop3: for(int k = 0; k < NUM_BITS; k++) {
				. . . .
			}
		}
	}
}

You want to place the pipeline optimization directive at the level where a sample of data is processed.

If the function is pipelined with II = 1 (read a new set of inputs every clock cycle), this informs the compiler to read all HEIGHT*WIDTH values of in1 and in2 in a single clock cycle. It is unlikely this is the design you want.

Because the data is accessed in a sequential manner, the arrays on the interface to the hardware
function can be implemented as multiple types of hardware interface:
• Block RAM interface
• AXI4 interface
• AXI4-Lite interface
• AXI4-Stream interface
• FIFO interface

The logic in Loop1 processes an entire row of the two-dimensional matrix. Placing the PIPELINE directive here would create a design which seeks to process one row in each clock cycle.
To do so would require the inputs to be arrays of HEIGHT data words, with each word WIDTH*<number of bits in data_t> bits wide.
This would again result in a case where there are many highly parallel hardware resources that cannot operate in parallel due to bandwidth limitations.

The logic in Loop2 seeks to process one sample from the arrays. In an image algorithm, this is the level of a single pixel. This is the level to pipeline if the design is to process one sample per clock cycle.
Doing so will cause Loop3 to be completely unrolled, while the design still processes one sample per clock.
This is the most common, and usually the ideal, place to apply the PIPELINE optimization.

In a typical design, the logic in Loop3 is a shift register or is processing bits within a word.
Loop3 will typically be doing bit-level or data shifting tasks, so this level is doing multiple operations per pixel.

For cases where there are multiple loops at the same level of hierarchy, the best location to place the PIPELINE directive can be determined for each loop, and then the DATAFLOW directive applied to the function to ensure each of the loops executes in a concurrent manner.
The recommended practice: apply a PIPELINE directive to each for loop,
and if several for loops are arranged sequentially, apply the DATAFLOW directive to that sequential structure (a sketch follows).
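A minimal sketch of this pattern (names, sizes, and the arithmetic are hypothetical): each loop carries its own PIPELINE directive, and DATAFLOW on the function lets the two sequential loops overlap:

#define N 64		// illustrative size
typedef int data_t;	// illustrative type

void two_stage(data_t in[N], data_t out[N])
{
#pragma HLS DATAFLOW
	data_t tmp[N];
	Stage1: for (int i = 0; i < N; i++) {
	#pragma HLS PIPELINE II=1
		tmp[i] = in[i] * 2;	// first stage
	}
	Stage2: for (int i = 0; i < N; i++) {
	#pragma HLS PIPELINE II=1
		out[i] = tmp[i] + 1;	// second stage, overlapped by DATAFLOW
	}
}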

++++++++++++++++++++++++++++++++++++++++++
Sample-Based C Code
The primary characteristic of this coding style is that the function processes a single data sample during each transaction.

void foo (data_t *in, data_t *out)
{
	static data_t acc;
	Loop1: for (int i=N-1;i>=0;i--) {
		acc+= ..some calculation..;
	}
	
	*out=acc>>N;
}

The location of the PIPELINE directive is clear: to achieve an II = 1 and process one data value each clock cycle, the function itself must be pipelined.
For examples like this, function-level pipelining is the most appropriate choice.

In this type of coding style, the loops are typically operating on arrays and performing shift-register or line-buffer functions. It is typical to partition these arrays into their individual elements.
After function-level pipelining, the loops are usually constrained by the number of array access ports, so in this situation the arrays generally need to be completely partitioned (a sketch follows).
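A minimal sketch under these assumptions (the type, length, and computation are illustrative): the whole function is pipelined at II = 1, and the static shift-register array is completely partitioned so that every element gets its own register and all of them can be accessed in the same cycle:

#define N 8		// illustrative shift-register length
typedef int data_t;	// illustrative type

void sample_fn(data_t *in, data_t *out)
{
#pragma HLS PIPELINE II=1
	static data_t shift_reg[N];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
	data_t acc = 0;
	Shift: for (int i = N - 1; i > 0; i--) {
		shift_reg[i] = shift_reg[i - 1];	// shift the chain by one
		acc += shift_reg[i];
	}
	shift_reg[0] = *in;
	acc += shift_reg[0];
	*out = acc;
}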

++++++++++++++++++++++++++++++++++++++++++
Optimize Structures for Performance

dout_t bottleneck(...) 
{
	...
	SUM_LOOP: for(i=3;i<N;i=i+4) {
		#pragma HLS PIPELINE
		sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
	}
	...
}

WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2', bottleneck.c:62) on array 'mem' due to limited memory ports.
The memory port limitation issue can be solved by using the ARRAY_PARTITION directive on the mem array.

Partitioning the array provides more data ports and allows a higher performance pipeline.
When the elements that must be accessed in the same cycle are consecutive array elements, use cyclic partitioning, which scatters adjacent elements across different sub-arrays.
Here, a factor of 2 splits mem into two banks; because each bank is a dual-port block RAM, four accesses per clock cycle become possible, matching the four reads in SUM_LOOP.

#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=2 dim=1

+++++++++++++++++++++++++++++++++++++++
Reducing Area
The most common area optimization is the optimization of dataflow memory channels to reduce the number of block RAM resources required to implement the hardware function.

If you used the DATAFLOW optimization and the compiler cannot determine whether the tasks in the design are streaming data, it implements the memory channels between dataflow tasks using ping-pong buffers.
These require two block RAMs each of size N, where N is the number of samples to be transferred between the tasks (typically the size of the array passed between tasks).

If the design is pipelined and the data is in fact streaming from one task to the next with values produced and consumed in a sequential manner, you can greatly reduce the area by using the STREAM directive to specify that the arrays are to be implemented in a streaming manner that uses a simple FIFO for which you can specify the depth.
For most applications, the depth can be specified as 1, resulting in the memory channel being implemented as a simple register.
Use the STREAM directive whenever possible, so that array transfers between tasks are implemented as FIFOs (a sketch follows).
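A minimal sketch (names and sizes are hypothetical): inside a DATAFLOW region, the STREAM directive with depth=1 turns the intermediate array into a single register instead of a ping-pong buffer:

#define N 64		// illustrative size
typedef int data_t;	// illustrative type

void stream_top(data_t in[N], data_t out[N])
{
#pragma HLS DATAFLOW
	data_t tmp[N];
#pragma HLS STREAM variable=tmp depth=1
	Produce: for (int i = 0; i < N; i++) {
	#pragma HLS PIPELINE II=1
		tmp[i] = in[i] * 3;	// produced in order...
	}
	Consume: for (int i = 0; i < N; i++) {
	#pragma HLS PIPELINE II=1
		out[i] = tmp[i] + 1;	// ...and consumed in order, so depth=1 suffices
	}
}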

For tasks which reduce the data rate by a factor of X-to-1, specify arrays at the input of the task to stream with a depth of X. All arrays prior to this in the function should also have a depth of X to ensure the hardware function does not stall because the FIFOs are full.

For tasks which increase the data rate by a factor of 1-to-Y, specify arrays at the output of the task to stream with a depth of Y. All arrays after this in the function should also have a depth of Y to ensure the hardware function does not stall because the FIFOs are full.
For example, a 4-to-1 decimation task should have its input arrays, and all upstream channels, stream with a depth of 4. Be sure to reserve enough depth for each STREAM FIFO.

If the ARRAY_PARTITION directive is used to improve the initiation interval, you might want to consider using the ARRAY_RESHAPE directive instead.

++++++++++++++++++++++++++++++++++++++++++++
Data Access Patterns

The use of block RAM can be minimized by using the DATAFLOW optimization and streaming the data through small efficient FIFOs, but this will require the data to be used in a streaming sequential manner.

Each access to data that has already been fetched negatively impacts the performance of the system.
One of the keys to a high-performance FPGA is to minimize the access to and from the PS.

The flow of data between the horizontal and vertical loops should be managed via a FIFO.
Code that instead requires arbitrary/random accesses needs a ping-pong block RAM to improve performance, and such code suffers from the same repeated accesses for data.

The summary from this review is that the following poor data access patterns negatively impact
the performance and size of the FPGA implementation:
• Multiple accesses to read and then re-read data. Use local cache where possible.
• Accessing data in an arbitrary or random access manner. This requires the data to be stored locally in arrays and costs resources.
• Setting default values in arrays costs clock cycles and performance.

The key to implementing the convolution example reviewed in the previous section as a high-performance design with minimal resources is to:
• Maximize the flow of data through the system. Refrain from using any coding techniques or algorithm behavior that inhibits the continuous flow of data.
• Maximize the reuse of data. Use local caches to ensure there are no requirements to re-read data and the incoming data can keep flowing.
• Embrace conditional branching. This is expensive on a CPU, GPU, or DSP but optimal in an FPGA.
This is different from writing C programs for a CPU: in a program running on a CPU, a conditional test performed in every iteration carries a large time cost.
In HLS the situation is different, because the C code is not a program to be executed but a description of hardware. Conditional branching should therefore be used freely inside the inner loops: synthesis builds the branches as parallel processing paths, and the condition selects which path is activated and whose output result is used.

The convolution algorithm shown below embraces this style of coding.

However, there are now intermediate buffers, hconv and vconv, between each loop. Because these are accessed in a streaming manner, they are optimized into single registers in the final implementation.

template<typename T, int K>
static void convolution_strm(
	int width,
	int height,
	T src[TEST_IMG_ROWS][TEST_IMG_COLS],
	T dst[TEST_IMG_ROWS][TEST_IMG_COLS],
	const T *hcoeff,
	const T *vcoeff)
{
	T hconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
	T vconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
	T *phconv, *pvconv;
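	// vconv_xlim is used below but not declared in this excerpt; it is
	// assumed to be defined in the full listing as:
	int vconv_xlim = width - (K - 1);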
	// These assertions let HLS know the upper bounds of loops
	assert(height < MAX_IMG_ROWS);
	assert(width < MAX_IMG_COLS);
	assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
	
	// Horizontal convolution
	HConvH:for(int col = 0; col < height; col++) {
		HConvW:for(int row = 0; row < width; row++) {
			HConv:for(int i = 0; i < K; i++) {
			...
			}
		}
	}
	// Vertical convolution
	VConvH:for(int col = 0; col < height; col++) {
		VConvW:for(int row = 0; row < vconv_xlim; row++) {
			VConv:for(int i = 0; i < K; i++) {
			...
			}
		}	
	}
	
	Border:for (int i = 0; i < height; i++) {
		for (int j = 0; j < width; j++) {
		...
		}
	}
}

All three processing loops now embrace conditional branching to ensure the continuous
processing of data.
+++++++++++++++++++++++++++++++++++++
Optimal Horizontal Convolution
The algorithm must use the K previous samples to compute the convolution result.
It therefore copies the sample into a temporary cache hwin.
This use of local storage means there is no need to re-read values from the PS and interrupt the flow of data.
The algorithm keeps reading input samples and caching them into hwin; each new sample pushes the oldest one out, so at any time only the last K samples are stored in hwin.

As shown below, the code to perform these operations uses both a local cache to prevent re-reads from the PS, and extensive conditional branching to ensure each new data sample can be processed in a different manner.

	// Horizontal convolution
	phconv=hconv_buffer; // set / reset pointer to start of buffer
	
	// These assertions let HLS know the upper bounds of loops
	assert(height < MAX_IMG_ROWS);
	assert(width < MAX_IMG_COLS);
	assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
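	T hwin[K]; // assumed: window of the K most recent samples (declared at function scope in the full listing)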
	
	HConvH:for(int col = 0; col < height; col++) {
		HConvW:for(int row = 0; row < width; row++) {
		#pragma HLS PIPELINE
			T in_val = *src++;
			// Reset pixel value on-the-fly - eliminates an O(height*width) loop
			T out_val = 0;
			HConv:for(int i = 0; i < K; i++) {
				hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
				out_val += hwin[i] * hcoeff[i];
			}
		
			if (row >= K - 1) {
				*phconv++=out_val;
			}
		}
	}

The outputs from the task are either discarded or used, but the task keeps constantly computing.
Note: because C in HLS describes hardware, prefer placing if conditionals inside the inner loops as much as possible.

+++++++++++++++++++++++++++++++++++++++++++++++
Optimal Vertical Convolution

The data must be accessed by column but you do not wish to cache the entire image. The
solution is to use line buffers.
Once again, the samples are read in a streaming manner, this time from the local buffer hconv.
The algorithm requires at least K-1 lines of data before it can process the first sample.
A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another sample is pushed out the line buffer.
The newest sample is used in the calculation; the sample is then stored into the line buffer, and the oldest sample is ejected.
This ensures that only K-1 lines are required to be cached.

	// Vertical convolution
	phconv=hconv_buffer; // set/reset pointer to start of buffer
	pvconv=vconv_buffer; // set/reset pointer to start of buffer
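	// linebuf is used below but not declared in this excerpt; it is assumed
	// to be a function-scope line buffer such as: static T linebuf[K - 1][MAX_IMG_COLS];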
	
	VConvH:for(int col = 0; col < height; col++) {
		VConvW:for(int row = 0; row < vconv_xlim; row++) {
		#pragma HLS DEPENDENCE variable=linebuf inter false
		#pragma HLS PIPELINE
			T in_val = *phconv++;
			// Reset pixel value on-the-fly - eliminates an O(height*width) loop
			T out_val = 0;
			VConv:for(int i = 0; i < K; i++) {
				T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
				out_val += vwin_val * vcoeff[i];
				
				if (i > 0)
					linebuf[i - 1][row] = vwin_val;
			}
			
			if (col >= K - 1) {
				*pvconv++ = out_val;
			}
		}
	}

++++++++++++++++++++++++++++++++++++++++++++
Optimal Border Pixel Convolution

To ensure the constant flow of data and data reuse, the algorithm makes use of local caching.
• Each sample is read from the vconv output from the vertical convolution.
• The sample is then cached as one of four possible pixel types.
• The sample is then written to the output stream.

	// Border pixels
	pvconv=vconv_buffer; // set/reset pointer to start of buffer
	
	Border:for (int i = 0; i < height; i++) {
		for (int j = 0; j < width; j++) {
		T pix_in, l_edge_pix, r_edge_pix, pix_out;
		#pragma HLS PIPELINE
		
			if (i == 0 || (i > border_width && i < height - border_width)) {
				// read a pixel out of the video stream and cache it for
				// immediate use and later replication purposes
				if (j < width - (K - 1)) {
					pix_in = *pvconv++;
					borderbuf[j] = pix_in;
				}
				
				if (j == 0) {
					l_edge_pix = pix_in;
				}
				
				if (j == width - K) {
					r_edge_pix = pix_in;
				}
			}

			// Select output value from the appropriate cache resource
			if (j <= border_width) {
				pix_out = l_edge_pix;
			} else if (j >= width - border_width - 1) {
				pix_out = r_edge_pix;
			} else {
				pix_out = borderbuf[j - border_width];
			}
			
			*dst++=pix_out;
		}
	}

+++++++++++++++++++++++++++++++++++++++
Optimal Data Access Patterns
• Minimize data input reads.
• Minimize accesses to arrays. Use small localized caches to hold results such as accumulations, and then write the final result to the array.
• Seek to perform conditional branching inside pipelined tasks rather than conditionally executing tasks, even pipelined tasks. Allowing the data from one task to flow into the next task, with the conditional performed inside the next task, will result in a higher performing system.
• Minimize output writes for the same reason as input reads: employ a coding style that promotes read-once/write-once to function arguments.
+++++++++++++++++++++++++++++++++++++++++++
pragma HLS dependence

The DEPENDENCE pragma is used to provide additional information that can overcome loop-carry dependencies and allow loops to be pipelined (or pipelined with lower intervals).

Vivado HLS automatically detects dependencies:
• Within loops (loop-independent dependence), or
• Between different iterations of a loop (loop-carry dependence).

Loop-independent dependence: The same element is accessed in the same loop iteration.

for (i = 0; i < N; i++) {
	A[i] = x;
	y = A[i];
}

Loop-carry dependence: The same element is accessed in a different loop iteration.

for (i = 0; i < N; i++) {
	A[i] = A[i-1] * 2;
}

The DEPENDENCE pragma allows you to explicitly specify the dependence and resolve a false dependence.

In the following example, use the DEPENDENCE pragma to state that there is no dependence between loop iterations (in this case, for both buff_A and buff_B).

	for (row = 0; row < rows + 1; row++) {
		for (col = 0; col < cols + 1; col++) {
		#pragma HLS PIPELINE II=1
		#pragma HLS dependence variable=buff_A inter false
		#pragma HLS dependence variable=buff_B inter false
			
			if (col < cols) {
				buff_A[2][col] = buff_A[1][col]; // read from buff_A[1][col]
				buff_A[1][col] = buff_A[0][col]; // write to buff_A[1][col]
				buff_B[1][col] = buff_B[0][col];
				temp = buff_A[0][col];
			}
		}
	}

++++++++++++++++++++++++++++++++++++++
pragma HLS interface
The INTERFACE pragma specifies how RTL ports are created from the function definition during
interface synthesis.
The default I/O protocol created depends on the type of C argument.
After the block-level protocol has been used to start the operation of the block, the port-level I/O protocols are used to sequence data into and out of the block.
Vivado HLS automatically determines the I/O protocol used by any sub-functions.

ap_none: No protocol. The interface is a data port.

ap_fifo: Implements the port with a standard FIFO interface using data input and output
ports with associated active-Low FIFO empty and full ports.

axis: Implements all ports as an AXI4-Stream interface.
s_axilite: Implements all ports as an AXI4-Lite interface.
m_axi: Implements all ports as an AXI4 interface.

ap_ctrl_hs: Implements a set of block-level control ports to start the design operation
and to indicate when the design is idle, done, and ready for new input data.

bundle=<string>: Groups function arguments into AXI interface ports. This option explicitly groups all interface ports with the same bundle=<string> into the same AXI interface port and names the RTL port the value specified by <string>.

register: An optional keyword to register the signal and any relevant protocol signals, and causes the signals to persist until at least the last cycle of the function execution.
register_mode=<forward|reverse|both|off>: The default register_mode is both.

depth=<int>: Specifies the maximum number of samples for the test bench to process. While depth is usually an option, it is required for m_axi interfaces.

offset=<string>: Controls the address offset in AXI4-Lite (s_axilite) and AXI4 (m_axi) interfaces.

num_read_outstanding=<int>: For AXI4 (m_axi) interfaces, specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls.
num_write_outstanding=<int>: For AXI4 (m_axi) interfaces, specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls.
max_read_burst_length=<int>: For AXI4 (m_axi) interfaces, specifies the maximum number of data values read during a burst transfer.
max_write_burst_length=<int>: For AXI4 (m_axi) interfaces, specifies the maximum number of data values written during a burst transfer.

name=<string>: This option is used to rename the port based on your own specification.
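A hedged sketch combining several of these options (the function, port and bundle names, and values are illustrative, not from the guide):

void read_block(int *src, int dst[1024])
{
#pragma HLS INTERFACE m_axi port=src depth=1024 offset=slave bundle=gmem num_read_outstanding=8 max_read_burst_length=64
#pragma HLS INTERFACE s_axilite port=return bundle=control
	Copy: for (int i = 0; i < 1024; i++) {
	#pragma HLS PIPELINE II=1
		dst[i] = src[i];	// burst-read from memory via the AXI4 master port
	}
}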

In this example, both function arguments are implemented using an AXI4-Stream interface:

void example(int A[50], int B[50]) {
//Set the HLS native interface types
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
	int i;
	for(i = 0; i < 50; i++){
		B[i] = A[i] + 5;
	}
}

The following turns off block-level I/O protocols, and is assigned to the function return value:

#pragma HLS interface ap_ctrl_none port=return

The function argument InData is specified to use the ap_vld interface, and also indicates the
input should be registered:

#pragma HLS interface ap_vld register port=InData

This example defines the INTERFACE standards for the ports of the top-level transpose
function. Notice the use of the bundle= option to group signals.

// TOP LEVEL - TRANSPOSE
void transpose(int* input, int* output) {
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1

#pragma HLS INTERFACE s_axilite port=input bundle=control
#pragma HLS INTERFACE s_axilite port=output bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

#pragma HLS dataflow

	... // body of the transpose function (elided in the source)
}

+++++++++++++++++++++++++++++++++++++
pragma HLS stream
By default, array variables are implemented as RAM:
• Top-level function array parameters are implemented as a RAM interface port.
• General arrays are implemented as RAMs for read-write access.
• In sub-functions involved in DATAFLOW optimizations, the array arguments are implemented
using a RAM pingpong buffer channel.
• Arrays involved in loop-based DATAFLOW optimizations are implemented as a RAM
pingpong buffer channel

When an argument of the top-level function is specified as INTERFACE type
ap_fifo, the array is automatically implemented as streaming.

variable=<variable>: Specifies the name of the array to implement as a streaming interface.

depth=<int>: Relevant only for array streaming in DATAFLOW channels. By default, the depth of the FIFO implemented in the RTL is the same size as the array specified in the C code.
For example, in a DATAFLOW region when all loops and functions are processing data at a rate of II=1, there is no need for a large FIFO because data is produced and consumed in each clock cycle.

dim=<int>: Specifies the dimension of the array to be streamed.

In this example array B is set to streaming with a FIFO depth of 12:

#pragma HLS STREAM variable=B depth=12

Reposted from blog.csdn.net/weixin_42418557/article/details/121078576