Fixed-point multiplier - partial product compression (Huawei Cup)

1. Introduction

In the previous article I introduced how to use the Booth algorithm to generate partial products. This article covers how to compress those partial products with an adder tree. Adder-tree compression comes in many forms; the most common is Wallace compression, which is also the method suggested in the competition problem.
If you are interested, the other forms are worth researching on your own.

2. Wallace compression

In Wallace compression, the common building blocks are the 3:2 compressor and the 4:2 compressor; the 5:2 compressor is rarer.

The expression and block diagram of 3:2 compression are as follows.

[Figure: expression and block diagram of the 3:2 compressor]
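As a rough software model of 3:2 compression, the sketch below treats a 3:2 compressor as a row of full adders applied bitwise to three integers. The function name `compress_3_2` is made up for illustration, and this is Python rather than the hardware description used in the actual design.

```python
def compress_3_2(a, b, c):
    """Bitwise 3:2 compression (a row of full adders) on non-negative integers."""
    s = a ^ b ^ c                        # per-bit sum
    carry = (a & b) | (b & c) | (a & c)  # per-bit majority = carry
    return s, carry << 1                 # carry has weight 2, so shift it left


if __name__ == "__main__":
    a, b, c = 0b1011, 0b0110, 0b1101
    s, carry = compress_3_2(a, b, c)
    # Three rows in, two rows out, same total.
    assert s + carry == a + b + c
```

No carry travels from one bit position to the next inside the compressor; the two output rows simply have the same total as the three input rows.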

The expression and block diagram of 4:2 compression are as follows.

[Figure: expression and block diagram of the 4:2 compressor]

The 5:2 compressor is not covered here; this section focuses on the 4:2 compressor. The Booth algorithm generates exactly 8 partial products, so three 4:2 compressors are enough: two in the first level reduce 8 rows to 4, and one in the second level reduces 4 rows to 2. Using 3:2 compressors would require a deeper structure, so I did not use them in my design.

A 1-bit 4:2 compressor takes four 1-bit data inputs plus a 1-bit carry-in (Cin), and produces two 1-bit data outputs (Sum and Carry) plus a carry-out (Cout). To build a 32-bit compressor, we only need to cascade the 1-bit 4:2 compressors, feeding the Cout of each bit into the Cin of the next higher bit. In the final result, Carry must be shifted one bit to the left, because each Carry bit has twice the weight of the Sum bit in the same position. The code is implemented as follows.

[Figure: code for the 32-bit 4:2 compressor]
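The cascade described above can be modeled in software roughly as follows. This is a Python sketch, not the author's hardware code: the cell is built from two full adders, which is one common construction and an assumption here, and the names `cell_4_2` and `compress_4_2` are made up for illustration.

```python
WIDTH = 32


def cell_4_2(a, b, c, d, cin):
    """One 1-bit 4:2 cell built from two full adders (assumed construction)."""
    s1 = a ^ b ^ c
    cout = (a & b) | (b & c) | (a & c)         # does not depend on cin
    total = s1 ^ d ^ cin
    carry = (s1 & d) | (d & cin) | (s1 & cin)
    return total, carry, cout


def compress_4_2(a, b, c, d, width=WIDTH):
    """Cascade `width` cells; Cout of bit i feeds Cin of bit i + 1."""
    sum_vec = 0
    carry_vec = 0
    cin = 0                                    # lowest cell has no carry-in
    for i in range(width):
        bits = [(x >> i) & 1 for x in (a, b, c, d)]
        s, cy, cin = cell_4_2(*bits, cin)      # this Cout is the next Cin
        sum_vec |= s << i
        carry_vec |= cy << i
    mask = (1 << width) - 1
    # Carry is moved one bit left; anything beyond `width` bits is dropped.
    return sum_vec, (carry_vec << 1) & mask


if __name__ == "__main__":
    rows = (0x12345678, 0x0F0F0F0F, 0x9ABCDEF0, 0x00FF00FF)
    s, cy = compress_4_2(*rows)
    assert (s + cy) & 0xFFFFFFFF == sum(rows) & 0xFFFFFFFF
```

Each cell satisfies a + b + c + d + Cin = Sum + 2 × (Carry + Cout), which is why the two output rows add up to the same value as the four input rows, modulo 2^32.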

Each such compressor reduces 4 partial products to 2. With three of them, the 8 partial products are reduced to 2.
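A minimal Python sketch of the three-compressor arrangement is shown here. It assumes the 8 partial products have already been sign-extended and shifted to their proper weights as 32-bit values, and it uses a word-level model of the 4:2 compressor built from two full-adder stages. The names `maj`, `compress_4_2` and `reduce_8_to_2` are hypothetical, chosen only for this example.

```python
MASK = (1 << 32) - 1


def maj(x, y, z):
    """Bitwise majority of three words (the carry of a full adder)."""
    return (x & y) | (y & z) | (x & z)


def compress_4_2(a, b, c, d):
    """Word-level 4:2 compression: four 32-bit rows in, two rows out."""
    s1 = a ^ b ^ c
    cin = (maj(a, b, c) << 1) & MASK       # Cout of bit i is Cin of bit i + 1
    total = s1 ^ d ^ cin
    carry = (maj(s1, d, cin) << 1) & MASK  # Carry shifted one bit left
    return total, carry


def reduce_8_to_2(rows):
    """Two compressors in level 1, one in level 2: 8 rows -> 4 -> 2."""
    assert len(rows) == 8
    s0, c0 = compress_4_2(*rows[0:4])
    s1, c1 = compress_4_2(*rows[4:8])
    return compress_4_2(s0, c0, s1, c1)


if __name__ == "__main__":
    rows = [0x00001234, 0x00048D00, 0xFFFE0000, 0x00ABC000,
            0x01000000, 0xF0000000, 0x00000003, 0x7FFFFFFF]
    s, c = reduce_8_to_2(rows)
    # The two remaining rows still add up to the same 32-bit total.
    assert (s + c) & MASK == sum(rows) & MASK
```

The final two rows are what the 32-bit adder then adds to produce the product.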

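Why use compressors here rather than ordinary adders? The structure of the compressor answers this: Cout does not depend on Cin, so carries do not ripple along the whole word, and the delay of one compression level does not grow with the bit width.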

3. Results

Finally, a 32-bit adder adds the last two partial products. The 32-bit adder used here is built from four 8-bit carry-lookahead adders connected in series. Carry-lookahead effectively reduces the delay caused by carry propagation.
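To make the structure concrete, here is a small Python model of four 8-bit carry-lookahead blocks chained into a 32-bit adder. It is only a behavioral sketch under my own naming (`cla8`, `add32`); in hardware the carry recurrence is flattened into two-level logic on the generate and propagate signals, which a software loop cannot show.

```python
def cla8(a, b, cin):
    """8-bit carry-lookahead block: returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(8)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(8)]  # propagate
    c = [cin]
    for i in range(8):
        # c[i+1] = g[i] | p[i] & c[i]; hardware expands this recurrence so
        # that every carry is computed directly from g, p and cin.
        c.append(g[i] | (p[i] & c[i]))
    s = 0
    for i in range(8):
        s |= (p[i] ^ c[i]) << i
    return s, c[8]


def add32(a, b):
    """Four 8-bit blocks in series; the carry ripples between blocks only."""
    result, carry = 0, 0
    for k in range(4):
        sa = (a >> (8 * k)) & 0xFF
        sb = (b >> (8 * k)) & 0xFF
        s, carry = cla8(sa, sb, carry)
        result |= s << (8 * k)
    return result  # carry out of bit 31 is discarded


if __name__ == "__main__":
    assert add32(0x12345678, 0x0FEDCBA9) == 0x22222221
```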

P.S. The delay can be reduced further with a carry-select arrangement: add two extra 8-bit adders and a selector.
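The carry-select idea can be sketched in Python as below. The note does not say which blocks get the duplicate adders, so this sketch applies selection to every block above the lowest one purely for illustration. The names `add8` and `add32_carry_select` are hypothetical, and `add8` uses ordinary integer addition as a stand-in for an 8-bit adder.

```python
def add8(a, b, cin):
    """Stand-in for an 8-bit adder: returns (sum, carry_out)."""
    t = a + b + cin
    return t & 0xFF, t >> 8


def add32_carry_select(a, b):
    """Upper blocks compute both possible results, then a selector picks one."""
    result, carry = add8(a & 0xFF, b & 0xFF, 0)
    for k in range(1, 4):
        sa = (a >> (8 * k)) & 0xFF
        sb = (b >> (8 * k)) & 0xFF
        s0, c0 = add8(sa, sb, 0)  # speculative result for carry-in = 0
        s1, c1 = add8(sa, sb, 1)  # speculative result for carry-in = 1
        # Selector: the real carry from the block below chooses the result.
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << (8 * k)
    return result  # carry out of bit 31 is discarded


if __name__ == "__main__":
    assert add32_carry_select(0x00FFFFFF, 0x00000001) == 0x01000000
```

In hardware the two speculative additions run in parallel with the lower block, so the upper block only waits for the selector rather than for a full 8-bit addition.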


4. Summary

This article has briefly introduced the implementation principle of the fixed-point multiplier and its implementation process. Within this framework, optimizing the multiplier's performance generally starts from two points: the generation of partial products and the compression of partial products. Most of the related papers on CNKI (HowNet) optimize along these two lines. Later I will also try to optimize along these two lines, as well as the code itself, hoping to improve the results.

At present, the whole design consumes more than 2100 gates (converted into NAND-gate equivalents); the yosys tool can gather these statistics automatically. Synthesized with the DC tool, it uses only a little more than 800 gates, and the calculation delay is 16 ns (on the FPGA). The simulation diagram is shown below. This still does not meet the requirements of the competition, and I will share an optimized version later. Anyone interested is welcome to study it together.

Follow and reply "basic version of the fixed-point multiplier" to get the complete code.

[Figure: simulation diagram]


Origin blog.csdn.net/weixin_44678052/article/details/129890055