FPGA pure Verilog implementation of the Gzip (deflate) data compression algorithm, with project source code and technical support

1. Introduction

Hardware acceleration of data compression algorithms is one of the classic FPGA applications. Compressing images, video, and general data on an FPGA benefits from parallel execution. My earlier blog posts cover FPGA designs for image and video compression; this post covers an FPGA implementation of Gzip compression for general data. The source code of this project implements an FPGA-based streaming GZIP (deflate algorithm) compressor for general lossless data compression: it takes the original data as input and outputs the standard GZIP format, that is, the format of the common .gz / .tar.gz files.

2. FPGA compression schemes I currently offer

I currently have schemes for JPEG image decompression, JPEG-LS compression, H264 codec, H265 codec, and others, and more will follow. I am collecting them into a column that will be updated continuously. Column address:
Click to go directly

3. FPGA Gzip data compression function and performance

3.1: Pure RTL design that can be deployed on various FPGA models;
3.2: Simple streaming interface, AXI-stream input interface:
3.2.1: The data width is 8 bits; each cycle can accept 1 byte of data to be compressed;
3.2.2: An AXI-stream packet with input length ≥ 32 bytes is compressed into an independent GZIP data stream;
3.2.3: An AXI-stream packet with input length < 32 bytes is discarded by the module (not worth compressing) and produces no output;
3.3: AXI-stream output interface:
3.3.1: Each output AXI-stream packet is an independent GZIP data stream (including the GZIP file header and trailer);
3.4: Performance:
If the output interface has no back pressure, that is, o_tready is always 1, then the input interface is guaranteed to have no back pressure either, that is, i_tready is always 1 (even in the worst case). This is a deliberate design choice: when the external bandwidth is sufficient, the module runs at a deterministic, maximum performance (input throughput in bytes per second = clock frequency);
On a Xilinx Artix-7 xc7a35ticsg324-1L, the clock frequency reaches 128 MHz (input throughput of 128 MByte/s);
Resources: about 8200 LUTs and 25 BRAM36K blocks on Xilinx FPGAs;
Support for an almost complete deflate algorithm:
Written according to the deflate algorithm specification (RFC1951 [1]) and the GZIP format specification (RFC1952 [2]);
Deflate blocks:
An input AXI-stream packet of no more than 16384 bytes is treated as one deflate block;
An input AXI-stream packet larger than 16384 bytes is split into multiple deflate blocks, each no larger than 16384 bytes;
LZ77 compression:
The search distance is 16383, which covers the entire deflate block;
Matches are searched with a hash table of size 4096;
Dynamic Huffman encoding:
When the deflate block is large, a dynamic Huffman tree is built, including the literal code tree and the distance code tree;
When the deflate block is small, the static Huffman tree is used for encoding;
Thanks to the features above, the compression ratio of this design is close to that of .gz files generated by 7ZIP with the "fast compression" option, and significantly better than that of other existing open-source deflate compressors (see the comparison and evaluation);
As required by the GZIP format, the CRC32 of the original data is computed and placed at the end of the GZIP stream for verification;
Unsupported features:
No dynamic code length tree is built; a fixed code length tree is used instead, because its benefit-to-cost ratio is lower than that of the dynamic literal code tree and distance code tree;
The deflate block size is not adjusted dynamically to improve the compression ratio, in order to reduce complexity;
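
The packet handling rules above can be modeled on the host side. The following is a minimal Python sketch, not the RTL: the function name plan_deflate_blocks is hypothetical, and the greedy split into full 16384-byte blocks plus one remainder is an assumption, since the text only states that no block exceeds 16384 bytes.

```python
MIN_PACKET_LEN = 32      # packets shorter than this are discarded (from the text)
MAX_BLOCK_LEN = 16384    # upper bound on one deflate block (from the text)


def plan_deflate_blocks(packet_len):
    """Return the list of deflate block lengths for one input packet.

    Assumption: a long packet is cut greedily into full 16384-byte blocks
    followed by one shorter remainder block.
    """
    if packet_len < MIN_PACKET_LEN:
        return []                      # discarded: no GZIP stream is produced
    blocks = []
    remaining = packet_len
    while remaining > 0:
        n = min(remaining, MAX_BLOCK_LEN)
        blocks.append(n)
        remaining -= n
    return blocks


for length in (31, 32, 16384, 40000):
    print(length, plan_deflate_blocks(length))
```

An empty list means the packet produces no output at all; any non-empty list corresponds to exactly one GZIP stream containing that many deflate blocks.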

4. FPGA Gzip data compression design scheme

The block diagram of the FPGA Gzip data compression design scheme is as follows:
[Figure: block diagram of the FPGA Gzip data compression design]

I/O interface description

The input interface is a standard 8-bit AXI-stream slave;
[Figure: input interface handshake waveform]
i_tvalid and i_tready form a handshake; a byte is accepted only in a cycle where both are 1 (as shown in the figure above);
i_tdata carries 1 byte of input data;
i_tlast marks packet boundaries: i_tlast=1 means the current transfer is the last byte of a packet, and the next byte transferred is the first byte of the next packet. Each packet is compressed into an independent GZIP data stream;

The output interface is a standard 32-bit (4-byte) AXI-stream master;
o_tvalid and o_tready form a handshake; a data word is output only in a cycle where both are 1 (as on the input interface);
o_tdata carries 4 bytes of output data. Following the AXI-stream convention, o_tdata is little-endian: o_tdata[7:0] is the first byte and o_tdata[31:24] is the last byte;
o_tlast marks packet boundaries. Each packet is an independent GZIP data stream;
o_tkeep is the byte-valid signal, as follows:
o_tkeep[0]=1 means o_tdata[7:0] is valid, otherwise it is invalid;
o_tkeep[1]=1 means o_tdata[15:8] is valid, otherwise it is invalid;
o_tkeep[2]=1 means o_tdata[23:16] is valid, otherwise it is invalid;
o_tkeep[3]=1 means o_tdata[31:24] is valid, otherwise it is invalid;
When the number of bytes in an output packet is not divisible by 4, o_tkeep may be 4'b0001, 4'b0011, or 4'b0111, and only on the last beat of the packet (o_tlast=1); in all other cases o_tkeep=4'b1111;
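
To make the o_tdata / o_tkeep / o_tlast conventions concrete, here is a small Python model that packs a byte string into 32-bit beats the way the output interface is described above. It is only an illustration; pack_beats is a hypothetical helper name, not part of the project code.

```python
def pack_beats(payload):
    """Split a byte string into (tdata, tkeep, tlast) beats, 4 bytes per beat."""
    beats = []
    for off in range(0, len(payload), 4):
        chunk = payload[off:off + 4]
        tdata = int.from_bytes(chunk, "little")   # first byte lands in bits [7:0]
        tkeep = (1 << len(chunk)) - 1             # 0b0001, 0b0011, 0b0111 or 0b1111
        tlast = 1 if off + 4 >= len(payload) else 0
        beats.append((tdata, tkeep, tlast))
    return beats


# Five bytes need two beats; the second beat carries a single valid byte.
for tdata, tkeep, tlast in pack_beats(b"\x1f\x8b\x08\x00\x55"):
    print(f"tdata=0x{tdata:08x} tkeep=0b{tkeep:04b} tlast={tlast}")
```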

Data processing flow

LZ77 Compressor

See the block diagram of the design scheme for details;
The input data is first sent to the LZ77 compressor;
For the principle of the LZ77 compressor, the following blog post is recommended reading:
Click to read about the LZ77 compressor
The location of the LZ77 compressor in the code is shown below; as you can see, it is implemented in pure Verilog:
[Figure: location of the LZ77 compressor in the source code]
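
As a software illustration of hash-table match searching, the following Python sketch tokenizes a buffer into literals and (length, distance) pairs. It is a teaching model under stated assumptions, not a description of the RTL: the hash function hash3 is hypothetical, only the most recent position per hash value is kept, and positions inside a match are not inserted into the table. The table size (4096) and search distance (16383) are taken from the feature list; the match length limits 3 and 258 come from RFC 1951.

```python
HASH_SIZE = 4096                 # hash table entries (from the text)
MAX_DIST = 16383                 # maximum search distance (from the text)
MIN_MATCH, MAX_MATCH = 3, 258    # deflate match length limits (RFC 1951)


def hash3(data, i):
    """Hypothetical 12-bit hash of the 3 bytes starting at position i."""
    return ((data[i] << 8) ^ (data[i + 1] << 4) ^ data[i + 2]) & (HASH_SIZE - 1)


def lz77_tokens(data):
    """Return a list of literals (int) and matches ((length, distance) tuples)."""
    table = [-1] * HASH_SIZE     # most recent position seen for each hash value
    tokens, i = [], 0
    while i < len(data):
        cand, length = -1, 0
        if i + MIN_MATCH <= len(data):
            h = hash3(data, i)
            cand, table[h] = table[h], i
            if cand >= 0 and i - cand <= MAX_DIST:
                # Extend the candidate byte by byte (overlapping matches are allowed).
                while (length < MAX_MATCH and i + length < len(data)
                       and data[cand + length] == data[i + length]):
                    length += 1
        if length >= MIN_MATCH:
            tokens.append((length, i - cand))
            i += length
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_expand(tokens):
    """Inverse of lz77_tokens, used here only to check the round trip."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            length, dist = t
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(t)
    return bytes(out)


sample = b"abcabcabcabcxyz" * 20
print(len(sample), "bytes ->", len(lz77_tokens(sample)), "tokens")
```

A hash collision simply yields a candidate that fails the byte comparison, so the position is emitted as a literal; correctness does not depend on the quality of the hash, only the compression ratio does.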

Huffman coding

After passing through the LZ77 compressor, the data is sent along two paths: to the module that dynamically builds the Huffman tree, and to the Huffman encoding module. The Huffman encoding module internally defines a two-dimensional array as a cache. For the principle of Huffman encoding, the following blog post is recommended reading:
Click to read about Huffman coding
The locations of the dynamic Huffman tree builder and the Huffman encoding module in the code are shown below; as you can see, they are implemented in pure Verilog:
[Figure: location of the Huffman tree builder and Huffman encoding module in the source code]
The code that defines the two-dimensional array used as a cache inside the Huffman encoding module is located as follows:
[Figure: location of the two-dimensional cache array in the Huffman encoding module]
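
For readers who want to see what building a dynamic Huffman tree means in software terms, here is a minimal Python sketch of the textbook construction that derives code lengths from symbol frequencies. It is not the hardware algorithm used in this project, and it does not enforce the 15-bit code length limit that deflate requires; huffman_code_lengths is a hypothetical name.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a {symbol: count} table."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, n, [sym]) for n, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1    # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


lengths = huffman_code_lengths(Counter(b"aaaaaaaabbbbccd"))
print({chr(sym): bits for sym, bits in sorted(lengths.items())})
```

Frequent symbols receive short codes and rare symbols long ones, which is why a dynamic tree built from the actual block contents usually beats the static tree on larger blocks.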

Output cache

The output cache is essentially a FIFO with an AXIS interface; it buffers the output data and emits it in AXIS format. It is very simple. Its location in the code is shown below; as you can see, it is implemented in pure Verilog:
[Figure: location of the output cache in the source code]

Data output instructions

The data output on the AXI-stream interface conforms to the GZIP format standard. If the data of each AXI-stream packet is stored in its own .gz file, that file can be decompressed by many compression tools (7ZIP, WinRAR, etc.).
Note: .gz is the extension of GZIP-compressed files. The better-known form is .tar.gz: TAR packs multiple files into one .tar file, and GZIP then compresses that .tar file into a .tar.gz file. A single file can be compressed directly into a .gz file without TAR packaging; for example, data.txt is compressed into data.txt.gz.
For example, suppose there are 987 successful handshakes on the output AXI-stream interface, with o_tlast=1 on the last handshake, which means these 987 beats form one independent GZIP stream. If o_tkeep=4'b0001 on the last handshake, the last beat carries only 1 byte, and the GZIP stream contains 986×4+1=3945 bytes in total. To store these bytes in a .gz file:

Byte 1 of the .gz file corresponds to o_tdata[7:0] of beat 1
Byte 2 of the .gz file corresponds to o_tdata[15:8] of beat 1
Byte 3 of the .gz file corresponds to o_tdata[23:16] of beat 1
Byte 4 of the .gz file corresponds to o_tdata[31:24] of beat 1
Byte 5 of the .gz file corresponds to o_tdata[7:0] of beat 2
Byte 6 of the .gz file corresponds to o_tdata[15:8] of beat 2
Byte 7 of the .gz file corresponds to o_tdata[23:16] of beat 2
Byte 8 of the .gz file corresponds to o_tdata[31:24] of beat 2
......
Byte 3941 of the .gz file corresponds to o_tdata[7:0] of beat 986
Byte 3942 of the .gz file corresponds to o_tdata[15:8] of beat 986
Byte 3943 of the .gz file corresponds to o_tdata[23:16] of beat 986
Byte 3944 of the .gz file corresponds to o_tdata[31:24] of beat 986
Byte 3945 of the .gz file corresponds to o_tdata[7:0] of beat 987
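
The byte ordering above can be sketched in a few lines of Python. The helper beats_to_bytes is a hypothetical name, and the beat contents below are made-up placeholder values rather than real GZIP data; only the ordering and the o_tkeep handling follow the text.

```python
def beats_to_bytes(beats):
    """Concatenate the valid bytes of (tdata, tkeep) beats, o_tdata[7:0] first."""
    out = bytearray()
    for tdata, tkeep in beats:
        for lane in range(4):
            if (tkeep >> lane) & 1:                    # o_tkeep[lane] marks a valid byte
                out.append((tdata >> (8 * lane)) & 0xFF)
    return bytes(out)


# 986 full beats followed by a last beat with o_tkeep = 4'b0001.
beats = [(0x04030201, 0b1111)] * 986 + [(0x000000AA, 0b0001)]
stream = beats_to_bytes(beats)
print(len(stream))   # 986 * 4 + 1 = 3945
```

Writing the returned bytes to a file opened in binary mode would give the .gz file for that packet.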

Special Instructions

If the output interface has no back pressure, that is, o_tready is always 1, then the input interface is guaranteed to have no back pressure either, that is, i_tready is always 1 (even in the worst case). With this property, if the external bandwidth is stable enough to guarantee that o_tready is always 1, the i_tready signal can be ignored, and a byte can be input at any time by setting i_tvalid=1.
The deflate algorithm needs the entire deflate block to build the dynamic Huffman tree, so the end-to-end latency of this module is high:
When the input AXI-stream packet length is 32 to 16384 bytes, the first data of the corresponding compressed packet appears on the output AXI-stream interface only after the complete packet has been input (plus some additional time);
When the input AXI-stream packet length is > 16384 bytes, the compressed data for each 16384-byte portion becomes available on the output AXI-stream interface after that portion has been completely input (plus some additional time);
When the input AXI-stream packet length is < 32 bytes, the module discards the packet internally and generates no output data for it.
To obtain a high compression ratio, try to make the packet length > 7000 bytes; otherwise the module may not choose dynamic Huffman encoding, and the LZ77 search range will be very limited. If the data to be compressed logically consists of many small AXI-stream packets, you can add a preprocessor in front that merges them into large packets of several thousand or tens of thousands of bytes before sending them to gzip_compressor_top;
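
The merging preprocessor suggested above could behave like the following Python model. This is a sketch of one possible policy, assuming a greedy merge up to a target size; merge_packets and target_len are hypothetical names, and a real preprocessor would be implemented in RTL in front of the compressor.

```python
def merge_packets(packets, target_len=16384):
    """Greedily concatenate small packets into packets of at most target_len bytes.

    A single input packet that is already longer than target_len is passed
    through unchanged.
    """
    merged, current = [], bytearray()
    for pkt in packets:
        if current and len(current) + len(pkt) > target_len:
            merged.append(bytes(current))
            current = bytearray()
        current += pkt
    if current:
        merged.append(bytes(current))
    return merged


small = [b"a" * 10] * 2000                   # 2000 packets of 10 bytes each
print([len(p) for p in merge_packets(small)])
```

Note that merging removes the original packet boundaries from the compressed stream, so if the receiver needs them, the payload has to carry its own framing.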

5. Vivado simulation

The block diagram of the Vivado simulation design is as follows:
[Figure: Vivado simulation block diagram]
The structure of the Vivado simulation code is as follows:
[Figure: Vivado simulation code structure]
The random packet generator (tb_random_data_source.v) generates four types of packets with different characteristics (random data with uniformly distributed byte probabilities, random data with non-uniformly distributed byte probabilities, randomly and continuously varying data, and sparse data) and sends them to the module under test (helai_gzip_compressor) for compression. The tb_save_result_to_file module then saves the compression results to files, with each independent packet stored in its own .gz file.

tb_print_crc32 calculates and prints the CRC32 of the original data. Note that the module under test also calculates the CRC32 and embeds it in the GZIP data stream. The two CRC32 calculators are independent (the former is used only in simulation, to verify that the CRC32 generated by the module under test is correct). You can compare the CRC32 printed by the simulation with the CRC32 in the generated GZIP file yourself.
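
One way to do this comparison on the host is to read the GZIP trailer directly. According to RFC 1952, the last 8 bytes of a GZIP member are the CRC32 of the uncompressed data followed by its length modulo 2^32 (ISIZE), both little-endian. The sketch below uses Python's own gzip.compress as a stand-in for a .gz file written by the testbench; gzip_trailer is a hypothetical helper name.

```python
import gzip
import struct
import zlib


def gzip_trailer(gz_bytes):
    """Return (crc32, isize) from the last 8 bytes of a single-member GZIP stream."""
    crc32, isize = struct.unpack("<II", gz_bytes[-8:])
    return crc32, isize


original = b"example payload " * 100
gz_bytes = gzip.compress(original)       # stand-in for a file saved by the simulation
crc32, isize = gzip_trailer(gz_bytes)
print(f"CRC32 in trailer: 0x{crc32:08x}, ISIZE: {isize}")
print("matches zlib.crc32:", crc32 == zlib.crc32(original))
```

For a file produced by the simulation, gz_bytes would instead be read from the saved .gz file, and the printed CRC32 compared with the value printed by tb_print_crc32.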
The output printed by the Vivado simulation is as follows:
[Figure: Vivado simulation printed output]
The simulation waveform is as follows:
[Figure: simulation waveform]
The GZIP-compressed files generated by the simulation are saved in the following path:
[Figure: path of the generated .gz files]
You can then decompress them and inspect the contents.

6. Vivado project

Development board FPGA model: Xilinx xc7k325tffg676-2;
Input: test data sent by a Python program through the serial port;
Output: GZIP-compressed data returned through the serial port;
Application: GZIP data compression;

The block diagram of the Vivado project design is as follows:
[Figure: Vivado project block diagram]
The FPGA project receives data from the serial port, sends it to the GZIP compressor, and sends the resulting GZIP-compressed data stream back through the serial port (serial format: baud rate 115200, no parity bit).
A Python program runs on the host computer. It performs the following steps:
Read a file from the computer's disk (the user specifies the file name on the command line);
List all serial ports on the computer; the user selects the serial port connected to the FPGA (if only one serial port is found, it is selected automatically);
Send all bytes of the file to the FPGA through the serial port while receiving the data sent back by the FPGA;
Save the received data to a .gz file, which is equivalent to having the FPGA compress the file;
Finally, call Python's gzip library to decompress the .gz file and compare the result with the original data; an error is reported if they differ;
Since the serial port is much slower than the maximum throughput that gzip_compressor_top can achieve, this project is for demonstration only. To let the Gzip compression module reach higher performance, a faster communication interface is needed.
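
To put the gap between the serial port and the compressor in numbers, here is a rough back-of-the-envelope calculation in Python. It assumes 10 bits on the wire per byte (1 start bit, 8 data bits, 1 stop bit, no parity), takes the 128 MHz figure reported earlier for the Artix-7 build, and reads the 512kB test file size as 512 × 1024 bytes; all three are assumptions for illustration.

```python
BAUD = 115200
BITS_PER_BYTE_ON_WIRE = 10            # assumption: 1 start + 8 data + 1 stop bit
uart_bytes_per_s = BAUD / BITS_PER_BYTE_ON_WIRE

CLOCK_HZ = 128e6                      # clock frequency reported for the Artix-7 build
core_bytes_per_s = CLOCK_HZ * 1       # the compressor accepts 1 input byte per cycle

file_len = 512 * 1024                 # assumption: "512kB" means 512 * 1024 bytes
uart_seconds = file_len / uart_bytes_per_s
core_seconds = file_len / core_bytes_per_s

print(f"UART input rate:       {uart_bytes_per_s:.0f} bytes/s")
print(f"Compressor input rate: {core_bytes_per_s:.0f} bytes/s")
print(f"Time to send the file: {uart_seconds:.1f} s over UART, "
      f"{core_seconds * 1e3:.2f} ms at the compressor input")
```

Under these assumptions the serial link is roughly four orders of magnitude slower than the compressor input, which is why the demo cannot show the module's real throughput.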
The directory containing the Python code is as follows:
[Figure: Python code directory]
The project code structure is as follows:
[Figure: project code structure]
FPGA resource consumption is as follows:
[Figure: FPGA resource consumption]
The resource consumption after porting to other FPGAs is as follows:
[Figure: resource consumption on other FPGAs]

7. Board debugging and verification

FPGA development board test

After programming the FPGA with the project, open a command line in the python directory and run the following command:

python fpga_uart_gz_file.py <name of the file to compress>

For example, run the following command to compress raw.hex:

python fpga_uart_gz_file.py raw.hex

If the compression is successful, the raw.hex.gz file will be obtained, and no error message will be printed.
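
The final check performed by the script (decompress and compare) can be sketched with Python's standard gzip module as follows. This is not the code of fpga_uart_gz_file.py; check_fpga_output is a hypothetical name, and gzip.compress stands in for the bytes received from the FPGA.

```python
import gzip


def check_fpga_output(original, gz_bytes):
    """Raise ValueError if gz_bytes does not decompress back to original.

    Returns the compressed size relative to the original size.
    """
    restored = gzip.decompress(gz_bytes)
    if restored != original:
        raise ValueError("decompressed data differs from the original file")
    return len(gz_bytes) / len(original)


original = b"FPGA gzip demo line\n" * 500
received = gzip.compress(original)     # stand-in for the data returned by the FPGA
ratio = check_fpga_output(original, received)
print(f"OK, compressed size is {ratio:.1%} of the original")
```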

Comparison and evaluation of this GZIP design

To evaluate the overall performance of this design, we compare it with the following three deflate compression schemes:
1: Software compression: the 7ZIP compression software running on a computer, configured as follows: compression format = gzip, compression level = extreme compression, compression method = deflate, dictionary size = 32KB, word size = 32;

2: HDL-deflate algorithm:
An FPGA-based deflate compressor/decompressor. Its compressor uses only LZ77 + static Huffman, not dynamic Huffman, and its LZ77 does not use a hash table for searching.
In this test, the highest configuration of its compressor (which is also the default configuration) is used: LOWLUT=False, COMPRESS=True, DECOMPRESS=False, MATCH10=True, FAST=True;

3: HT-Deflate-FPGA algorithm:
An FPGA-based multi-core deflate compressor. It also uses only LZ77 + static Huffman, not dynamic Huffman. Its LZ77 uses a hash table for searching, as this design does.
HT-Deflate-FPGA is a multi-core parallel deflate compressor for the Xilinx-AWS cloud platform; it has high performance but takes up more resources. In contrast, both HDL-deflate and this design are single-core compressors intended for embedded applications.
Since HT-Deflate-FPGA does not provide a ready-to-run simulation project, it is difficult to get running quickly, so it is not included in the actual test comparison; only a qualitative comparison is made.

4: Comparison platform:
Software compression runs on a personal computer (Intel Core i7-12700H, 16GB DDR4);
This design and HDL-deflate are deployed on a Xilinx Artix-7 development board;

5: Test data:
The data to be compressed for the comparison is raw.hex in the python directory; its size is 512kB;

6: Comparison results:
The following table shows the comparison results. In the metrics, ↑ means larger is better and ↓ means smaller is better;
[Figure: comparison results table]
The compression ratio of HT-Deflate-FPGA should be higher than that of HDL-deflate but lower than that of this design. This conclusion is based on an analysis of the algorithm features it supports.

8. Bonus: obtaining the project code

The code is too large to be sent by email, so it is sent as a network disk link.
How to obtain it: use the "V" contact card at the end of the article.
The network disk information is as follows:
[Figure: network disk contents]
[Figure: network disk contents]


Origin blog.csdn.net/qq_41667729/article/details/131791639