Logic levels and logic delay optimization in practice

Background

FPGA designs inevitably place combinational logic between flip-flops (FFs). How can that combinational logic be quantified and analyzed? How can it be optimized so that timing closes? And how can the likely delay be estimated from the RTL design?

A simple project is used below as a practical demonstration.

Original project

Define a 32-bit counter that times a repeating 80 s interval, assuming a 50 MHz main clock. The code is as follows:

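    module TEST_TOP(
        input                clk_sys,    // 50 MHz system clock
        input                rst  ,      // asynchronous reset, active high
        input                plus ,
        output reg  [15:0]   d
    );
        // Number of clk_sys cycles in s_n seconds at 50 MHz
        function [31:0] count_s(input [7:0] s_n);
            count_s = 50_000_000 * s_n ;
        endfunction

        // 32-bit counter that wraps once 80 s worth of cycles have elapsed
        reg [31:0]  cnt_s ;
        always@(posedge clk_sys or posedge rst)begin
            if(rst)begin
                cnt_s <= 'd0 ;
            end else if(cnt_s >= count_s(80)) begin
                cnt_s <= 'd0 ;
            end else begin
                cnt_s <= cnt_s + 1 ;
            end
        end

        // Register the plus input twice before using it
        reg  plus_d1,plus_d2;
        always@(posedge clk_sys)begin
            plus_d1    <= plus ;
            plus_d2    <= plus_d1 ;
        end

        // Accumulate plus_d2 into d each time the counter reaches its limit
        always@(posedge clk_sys)begin
            if(cnt_s >= count_s(80))
                d <= d + plus_d2 ;
        end
    endmodule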

With the code above, how do we find the delay of each path? Xilinx provides a metric called the logic level (logic_level), which is simply the number of combinational logic cells connected in series on a path. So how do we obtain the logic levels of the current design?

Open the synthesized design and run the following query command in the Tcl console to get the number of paths at each logic level in the current design.

Logic level query command

    report_design_analysis -logic_level_distribution -logic_level_dist_paths 5000 -name design_analysis_prePlace

The report shows that the highest logic level is 12, and that 4 paths have 12 logic levels.

In general, analyze the paths with the highest logic level first. The details of a specific path can be obtained from the report:

Take Path1 as an example: select the path and press the shortcut key F4 to open its schematic.

The schematic shows that the path passes through 1 LUT5 + 2 LUT6 + 8 CARRY4 = 11 logic cells between the two FFs.

First, some background:

What is a LUT? What is CARRY?

LUT: look-up table, the way an FPGA implements combinational logic. The details are covered in other posts on the underlying FPGA resources.

CARRY: carry chain. This is the carry (overflow) concept taught in university microcomputer-principles courses; it is the dedicated logic that links the stages of a wide adder or accumulator.

The decision conditions of if-else statements are implemented with LUTs, and the carries of an accumulator are implemented with CARRY cells.

By the counting used in this article, if two FFs are connected directly, logic_level = 1.

The path above has 11 logic cells inserted between the FFs, so logic_level = 12.

The total data path delay of this path is 3.015 ns, of which the logic delay is 1.46 ns.
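To make the idea of "logic cells between two FFs" concrete, here is a minimal sketch. The module name `logic_level_demo` and its ports are made up for illustration and are not part of the original project. A 4-input AND normally fits in a single LUT, but the exact mapping is up to the synthesis tool.

```verilog
// Hypothetical illustration module, not part of the original project.
module logic_level_demo (
    input  wire       clk,
    input  wire [3:0] a,
    output reg        q_direct,
    output reg        q_lut
);
    reg [3:0] a_q;

    always @(posedge clk) begin
        a_q      <= a;         // source FFs
        q_direct <= a_q[0];    // FF -> FF: no combinational cell in between
        q_lut    <= &a_q;      // FF -> 4-input AND (expected: one LUT) -> FF
    end
endmodule
```

Running the logic level query on a small design like this is a quick way to check how the tool counts levels on a direct FF-to-FF path compared with a path through one LUT.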

The first step of optimization - splitting the wide accumulator

Let's rewrite the code and split the 32-bit counter into two cascaded 16-bit counters:

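    // Number of clk_sys cycles in ms_n milliseconds at 50 MHz
    function [15:0] count_ms(input [7:0] ms_n);
        count_ms = 50_000 * ms_n ;
    endfunction

    // Number of milliseconds in s_n seconds.
    // Note: 1_000 * 80 = 80_000 exceeds the 16-bit maximum of 65_535, so
    // count_s(80) is truncated to 14_464. A true 80 s period would need a
    // 17-bit return value and a 17-bit cnt_s.
    function [15:0] count_s(input [7:0] s_n);
        count_s = 1_000 * s_n ;
    endfunction

    reg [15:0]   cnt_ms ;
    reg [15:0]   cnt_s ;

    // First stage: millisecond counter (rst is active high)
    always@(posedge clk_sys or posedge rst)begin
        if(rst)begin
            cnt_ms <= 'd0 ;
        end else if(cnt_ms >= count_ms(1)) begin
            cnt_ms <= 'd0 ;
        end else begin
            cnt_ms <= cnt_ms + 1 ;
        end
    end

    // Second stage: advances once each time the millisecond counter wraps
    always@(posedge clk_sys or posedge rst)begin
        if(rst)begin
            cnt_s <= 'd0 ;
        end else if(cnt_ms >= count_ms(1)) begin
            if(cnt_s >= count_s(80))
                cnt_s <= 'd0 ;
            else
                cnt_s <= cnt_s + 1 ;
        end
    end

    reg  plus_d1,plus_d2;
    always@(posedge clk_sys)begin
        plus_d1    <= plus ;
        plus_d2    <= plus_d1 ;
    end

    always@(posedge clk_sys)begin
        if(cnt_ms >= count_ms(1))
            d <= d + plus_d2 ;
    end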

After re-synthesis, the analysis report shows that the maximum logic level in the new design is only 7.

Open the path with logic level 7:

The path becomes 2 LUT + 4 CARRY4; compared with the original, it has 1 fewer LUT6 and 4 fewer CARRY4.

The logic delay drops from 1.46 ns to 1.16 ns, a reduction of 0.3 ns, or about 20%.

The reduction in CARRY4 cells is easy to understand: the 32-bit accumulator has been split into two 16-bit accumulators. The original single-stage accumulation becomes a two-stage accumulation, and each stage is only 16 bits wide, so the carry chain needed between the FFs of each stage is correspondingly shorter.

This step shows two things:

1. Each CARRY4 handles 4 bits. Every additional 4 bits (or part thereof) in the width of an accumulator or counter consumes one more CARRY4:

For example, in example 1 the counter is 32 bits wide and consumes 8 CARRY4 cells; in example 2, after being reduced to 16 bits, it consumes only 4.

2. Under normal circumstances, the overall routing delay and logic delay are close to 1:1. When the number of logic levels is reduced, the logic delay falls and the routing delay falls with it.
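The first observation can be written as a compile-time estimate. This is only a sketch of the rule of thumb stated above (one CARRY4 per 4 counter bits, rounded up). The module name `carry4_estimate` and the parameter `WIDTH` are hypothetical, and the real count depends on how the tool maps the adder and the comparison.

```verilog
// Hypothetical helper for simulation or elaboration, not part of the original project.
module carry4_estimate #(
    parameter WIDTH = 32    // counter bit width to evaluate
);
    // Rule of thumb from the text: one CARRY4 per 4 bits, rounded up.
    // Integer division truncates, so adding 3 first rounds the result up.
    localparam CARRY4_COUNT = (WIDTH + 3) / 4;

    initial begin
        $display("WIDTH = %0d: about %0d CARRY4 cells on the carry path",
                 WIDTH, CARRY4_COUNT);
    end
endmodule
```

With `WIDTH = 32` the estimate is 8 and with `WIDTH = 16` it is 4, which matches the two examples above.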

Looking at the detailed timing report of the path again, you can see the delay of each level of logic:

Incr is the delay added by each element, and Path is the cumulative arrival time at each intermediate node.

The second step of optimization - simplifying the if-else decision condition

The previous step showed that reducing the bit width of the accumulator greatly reduces the number of carry chain stages, and with it the logic delay and routing delay. Is there any other way to reduce the combinational logic delay? As seen above, the combinational logic delay comes mainly from two parts: 1. the carry chain; 2. the LUTs.

The number of carry chain stages is determined by the bit width of the accumulator, so what determines the number of LUTs? LUTs implement combinational logic, and a LUT has only 6 inputs. When the combinational logic is complex and the input signals are wide, more LUTs are naturally needed. In the project above we did not use any combinational assign statements, so where is the combinational logic?

The answer is the decision conditions of the if-else statements. An if-else condition that compares wide data, or that nests several conditions, increases the complexity of the combinational logic needed to implement the decision.

This is how the always block was written in the previous version:

    // rst is active high, so the sensitivity list uses posedge rst
    always@(posedge clk_sys or posedge rst)begin
        if(rst)begin
            cnt_s <= 'd0 ;
        end else if(cnt_ms >= count_ms(1)) begin
            if(cnt_s >= count_s(80))
                cnt_s <= 'd0 ;
            else
                cnt_s <= cnt_s + 1 ;
        end
    end

Let's modify it:

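    module TEST_TOP(
        input                clk_sys,    // 50 MHz system clock
        input                rst  ,      // asynchronous reset, active high
        input                plus ,
        output reg  [15:0]   d
    );

        function [15:0] count_ms(input [7:0] ms_n);
            count_ms = 50_000 * ms_n ;
        endfunction

        function [15:0] count_s(input [7:0] s_n);
            count_s = 1_000 * s_n ;
        endfunction

        // Counters are declared before their first use
        reg [15:0]   cnt_ms ;
        reg [15:0]   cnt_s ;

        // Single-bit flags, registered one cycle before each counter wraps,
        // replace the wide comparisons in the counter conditions
        reg   ms_carry_en ;
        always@(posedge clk_sys)begin
            if(cnt_ms == count_ms(1)-1)
                ms_carry_en <= 'd1 ;
            else
                ms_carry_en <= 0 ;
        end

        reg   s_carry_en ;
        always@(posedge clk_sys)begin
            if(cnt_s == count_s(80)-1)
                s_carry_en <= 'd1 ;
            else
                s_carry_en <= 0 ;
        end

        always@(posedge clk_sys or posedge rst)begin
            if(rst)begin
                cnt_ms <= 'd0 ;
            end else if(ms_carry_en) begin
                cnt_ms <= 'd0 ;
            end else begin
                cnt_ms <= cnt_ms + 1 ;
            end
        end

        always@(posedge clk_sys or posedge rst)begin
            if(rst)begin
                cnt_s <= 'd0 ;
            end else if(ms_carry_en) begin
                if(s_carry_en)
                    cnt_s <= 'd0 ;
                else
                    cnt_s <= cnt_s + 1 ;
            end
        end

        // The original listing stops here. The remaining logic is assumed to
        // match the previous version, with ms_carry_en replacing the wide compare.
        reg  plus_d1,plus_d2;
        always@(posedge clk_sys)begin
            plus_d1    <= plus ;
            plus_d2    <= plus_d1 ;
        end

        always@(posedge clk_sys)begin
            if(ms_carry_en)
                d <= d + plus_d2 ;
        end
    endmodule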

Let's take a look at the synthesis result:

The maximum logic level is only 6, one level lower than in example 2: the original LUT4 + LUT5 becomes a single LUT1 (the optimized decision condition has only a 1-bit input).

The total delay drops from 2.356 ns to 1.451 ns, a reduction of about 0.9 ns, or 38%.

The logic delay changes little from the previous 1.16 ns. Comparing the data path details shows that the saving comes mainly from removing one LUT5 and the routing between that LUT5 and its neighboring stages.

The third step of optimization - splitting the assignment expression (trading area for speed)

This one is easy to understand: split a one-step operation into several steps and build a pipeline.

For example: S = A + B + C;

Can be designed as:

S1 = A + B;

S = S1 + C.

Here S1 is registered, so each clock cycle performs only one addition. The purpose of this step is similar to that of the second step. The right-hand side of an assignment is also implemented with a combination of LUTs and carry chains, so an overly complex assignment expression leads to an overly long combinational logic cascade.

Since the original project has no particularly long assignment expression, and this situation is common and easy to understand, it is not analyzed against the project here.
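For reference, the split S1 = A + B, S = S1 + C could be sketched as a two-stage pipeline like the one below. The module name `add3_pipelined`, the port names and the `WIDTH` parameter are illustrative assumptions and do not come from the original project.

```verilog
// Hypothetical two-stage pipelined adder: s = a + b + c, two clocks of latency.
module add3_pipelined #(
    parameter WIDTH = 16
) (
    input  wire               clk,
    input  wire [WIDTH-1:0]   a,
    input  wire [WIDTH-1:0]   b,
    input  wire [WIDTH-1:0]   c,
    output reg  [WIDTH+1:0]   s       // two extra bits hold the carries
);
    reg [WIDTH:0]   s1;    // stage 1 result: a + b
    reg [WIDTH-1:0] c_d1;  // c delayed one clock so it lines up with s1

    always @(posedge clk) begin
        // Stage 1: only one adder between the input FFs and s1
        s1   <= a + b;
        c_d1 <= c;
        // Stage 2: only one adder between s1 and s
        s    <= s1 + c_d1;
    end
endmodule
```

The cost is one extra register stage (`s1` and `c_d1`) and one extra clock of latency, which is the "area for speed" trade in the section title.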

Summary

1. The data path delay between FFs consists mainly of two parts: logic delay and routing delay. As the number of logic levels increases, the number of routing nodes increases, and the routing delay increases accordingly.

2. The ratio of routing delay to logic delay should be close to 1:1.

When the logic delay is more than 50% of the total data path delay, optimize the logic delay;

When the routing delay is more than 50% of the total data path delay, optimize the routing delay (refer to UG1292).

When timing is not met, it is recommended to optimize the logic delay first, because that is what the designer can control directly.

The routing delay can mostly be optimized only through the tool's strategies. In many cases, poorly structured combinational logic causes signals of the same group to be placed in CLBs in different columns, which makes routing difficult.

3. An excessive counter bit width brings too many carry chain stages, resulting in too many logic levels.

Try to avoid wide counters; for designs up to 250 MHz, it is best not to exceed 16 bits;

Every 4 bits of width occupies one CARRY4.

4. Complex if-else decision conditions need several levels of LUTs, which also leads to too many logic levels.

Try to avoid if-else nesting and if-case nesting in the design;

Try to avoid overly wide input variables in if-else decision conditions;

Try to avoid evaluating multi-condition logic directly in the decision condition; instead, register the result one clock in advance and use it as a single-bit condition.

5. When an assignment expression is too long, consider splitting it into several pipeline stages to improve design performance.

6. A reasonable empirical limit for the number of combinational logic levels: ≤ 2N (N is the clock period of the current clock domain).

Finally, a lower logic level is not always better. Beyond a certain point, reducing logic levels costs extra resources, but when the design does not meet its performance requirements, the optimization is necessary. It is best to keep this in mind from the start of the design, to avoid failing timing closure at the end and then having to go back and fix paths one by one, which wastes time.

Origin blog.csdn.net/ypcan/article/details/129888933