International Conference on Field Programmable Logic and Applications (FPL), 2015-2020

19-1 Bent Routing Pattern for FPGA

Authors: Xibo Sun, Hao Zhou, Lingli Wang*
8 pages

Abstract

Placement and routing are critical early steps in FPGA design. In current unidirectional routing architectures, signals are transmitted through programmable switches, and extra switches are required wherever a net turns a corner between horizontal and vertical wires, which increases both wirelength and delay. Existing research mostly focuses on the routing topology, i.e., how to allocate the various wire segments in the horizontal and vertical directions. This paper takes a different approach and proposes a bent routing pattern: a bent wire spans both a horizontal and a vertical segment, so corner connections are completed within a single wire while the symmetry and regularity of the original architecture are preserved. Finally, a random search based on simulated annealing is performed to evaluate the pattern. The results show that mixing bent and straight wires in the FPGA routing fabric reduces critical-path delay by 9% and the area-delay product by 11%.
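The annealing-based evaluation mentioned above is standard practice in FPGA architecture exploration. As a rough illustration only (the paper's actual cost function and move set are not detailed in this summary, so `cost` and `mutate` here are hypothetical placeholders), a generic simulated-annealing search over routing-pattern candidates looks like this:

```python
import math
import random

def anneal(init_state, cost, mutate, t0=1.0, t_min=1e-3, alpha=0.95, iters=100):
    """Generic simulated-annealing loop: `cost` scores a candidate routing
    pattern (e.g. critical-path delay) and `mutate` proposes a neighbor."""
    state, state_cost = init_state, cost(init_state)
    best, best_cost = state, state_cost
    t = t0
    while t > t_min:
        for _ in range(iters):
            cand = mutate(state)
            cand_cost = cost(cand)
            # Always accept improvements; accept worse moves with
            # Boltzmann probability so the search can escape local minima.
            if cand_cost < state_cost or random.random() < math.exp((state_cost - cand_cost) / t):
                state, state_cost = cand, cand_cost
                if state_cost < best_cost:
                    best, best_cost = state, state_cost
        t *= alpha  # geometric cooling schedule
    return best
```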

19-2 A Low-Latency Multi-Version Key-Value Store Using B-tree on an FPGA-CPU Platform

Authors: Yuchen Ren 1, Jinyu Xie 1, Yunhui Qiu 1, Hankun Lv 1, Wenbo Yin 1, Lingli Wang 1*, Bowei Yu 2, Hua Chen 2, Xianjun He 2, Zhijian Liao 2, Xiaozhong Shi 2
5 pages

Abstract

In recent years, low-latency key-value store (KVS) systems have attracted wide attention. For example, to cut server-side processing time out of the network path, Remote Direct Memory Access (RDMA) transfers data over the network directly into the memory of a remote system, without involving the operating system, so that much of the host's processing is no longer needed.
However, most current KVS systems do not support access to or queries over multiple versions of the data, such as snapshots. This paper builds a low-latency multi-version in-memory KVS on an FPGA-CPU platform. It uses cuckoo hashing to build the hash table that stores key-value pairs, and keeps the keys of each key-value pair consistent with those stored in the server-side B-tree. The system supports the following operations: put, get, delete, CAS (compare-and-swap), getPredecessor, and range queries within a B-tree. All of these operations except the B-tree range query bypass the CPU.
Taking the get operation as an example, experimental results show an average latency of 8 µs in a 5-level B-tree, 9x faster than execution on the CPU.
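For concreteness, a host-side view of the operation set might look like the stub below. The names and signatures are hypothetical, not the paper's actual API; in the real system, every call except the range query is served by the FPGA without CPU involvement:

```python
from typing import List, Optional, Tuple

class MultiVersionKVS:
    """Hypothetical host-side interface mirroring the operations above."""

    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> Optional[bytes]: ...
    def delete(self, key: bytes) -> None: ...
    def cas(self, key: bytes, expected: bytes, new: bytes) -> bool:
        """Compare-and-swap: store `new` only if the current value equals `expected`."""
        ...
    def get_predecessor(self, key: bytes) -> Optional[bytes]:
        """Largest stored key strictly smaller than `key` (served via the B-tree)."""
        ...
    def range_query(self, lo: bytes, hi: bytes) -> List[Tuple[bytes, bytes]]:
        """All pairs with lo <= key <= hi; the one operation handled on the CPU."""
        ...
```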

18-1 RNA: An Accurate Residual Network Accelerator for Quantized and Reconstructed Deep Neural Networks

Authors: Cheng Luo 1, Yuhua Wang 1, Wei Cao 1, Philip H. W. Leong 2, Lingli Wang*
4 pages

Abstract

As DNN research deepens, a series of complex and sophisticated networks have been proposed. For example, the residual network ResNet shows good performance on image classification tasks. However, due to its structural complexity and huge computational cost, implementing a residual network in hardware is very difficult.
This paper first proposes a quantized and reconstructed DNN (QR-DNN) technique, which inserts batch-normalization (BN) layers during training and removes them in the hardware implementation; second, based on QR-DNN, it designs a simple and efficient Residual Network Accelerator (RNA) that represents weights in a logarithmic number system. RNA performs shift-and-accumulate operations in a systolic fashion instead of multiplications.
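Both ideas can be sketched numerically: folding a trained BN layer into the preceding convolution's weights (so the BN layer disappears from hardware), and rounding weights to signed powers of two so each multiplication becomes a shift. The sketch below is a minimal illustration of the two transforms, not the paper's implementation:

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma*(conv(x,w)+b - mean)/sqrt(var+eps) + beta
    into equivalent conv weights/bias, removing the BN layer."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    w_folded = w * scale.reshape(-1, 1, 1, 1)   # w: (out_ch, in_ch, kh, kw)
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded

def log2_quantize(w):
    """Round weights to signed powers of two so each multiply becomes
    a bit shift in hardware (emulated here with exponents)."""
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12)).astype(int)
    return sign * (2.0 ** exp)
```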
Experimental results show that QR-DNN improves accuracy by 1-2% over other current methods, and RNA achieves the best results among fixed-point accelerators. On a Xilinx Zynq XC7Z045 board the accelerator reaches 804.03 GOPS and 104.15 FPS with 91.41% top-5 accuracy on the ResNet-50 benchmark, and also achieves state-of-the-art results on AlexNet and VGG.

18-2 A Novel Low-Communication Energy-Efficient Reconfigurable CNN Acceleration Architecture

Authors: Di Wu, Jin Chen, Wei Cao, Lingli Wang*
4 pages

Abstract

Winograd is a very efficient fast-convolution algorithm that effectively reduces the number of multiplications needed for convolution. This paper proposes a matrix transformation that further reduces the computation of the Winograd algorithm itself, and a kernel-partitioning strategy that adapts Winograd to convolutions with larger kernel sizes and strides. Second, excessive off-chip communication also hinders computing efficiency: different dataflows lay data out in DRAM differently, and too many DRAM accesses hurt performance, so this paper combines dynamic configuration with on-chip shared memory to reduce DRAM accesses. Finally, the architecture can be configured with three different dataflows. Experiments on AlexNet, VGG16, and ResNet50 show final acceleration performance of 685.6 GOP/s, 1250 GOP/s, and 507 GOP/s respectively.
Experimental platform: ZC706
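For background, the classic 1-D Winograd minimal filtering algorithm F(2,3) computes two outputs of a 3-tap convolution with four multiplications instead of six; the paper's transforms and kernel partitioning build on this idea. The sketch below is the textbook algorithm, not the paper's modified version:

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap FIR over inputs d[0..3],
    using 4 multiplications instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

# Sanity check against direct convolution:
d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
assert all(abs(a - b) < 1e-9 for a, b in zip(winograd_f23(d, g), direct))
```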

18-3 Fast Adjustable NPN Classification Using Generalized Symmetries

Authors: Xuegong Zhou, Lingli Wang, Peiyi Zhao
7 pages

Abstract

NPN classification is a commonly used technique in FPGA logic synthesis and technology mapping, and classification usually requires computing the canonical form of a Boolean function. This paper proposes a new method for computing the canonical form that shrinks the search space through folding based on generalized symmetries, greatly reducing the running time; the method achieves a speedup of up to 30x.
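For context, the brute-force baseline that such folding improves on enumerates every input permutation, input negation, and output negation of a truth table and keeps the lexicographically smallest result. A naive version (without the paper's search-space folding) might look like this:

```python
from itertools import permutations

def npn_canonical(tt, n):
    """Naive NPN canonical form of an n-input truth table `tt`
    (bit i of tt is f evaluated on input assignment i).
    Tries all n! * 2^n * 2 transforms and keeps the smallest."""
    size = 1 << n
    mask = (1 << size) - 1
    best = mask
    for perm in permutations(range(n)):      # input permutations
        for neg in range(size):              # input negation mask
            t = 0
            for i in range(size):
                # Map the transformed assignment i back to an original one.
                j = 0
                for b in range(n):
                    if (i >> perm[b]) & 1:
                        j |= 1 << b
                if (tt >> (j ^ neg)) & 1:
                    t |= 1 << i
            best = min(best, t, t ^ mask)    # with/without output negation
    return best

# Example: 2-input AND and NOR fall into the same NPN class.
assert npn_canonical(0b1000, 2) == npn_canonical(0b0001, 2)
```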

17-1 Accelerating Low Bit-Width Convolutional Neural Networks With Embedded FPGA

Authors: Li Jiao, Cheng Luo, Wei Cao, Xuegong Zhou, Lingli Wang*
4 pages

Abstract

Convolutional neural networks achieve high classification accuracy, but their computation is complex. Binarized neural networks (BNNs) greatly simplify the computation by binarizing weights and activations, but inevitably incur accuracy loss. This paper first compares hardware implementations of low-bit-width CNNs, BNNs, and standard CNNs; the experimental results show that low-bit-width CNNs are better suited to embedded systems. Second, it proposes a two-stage arithmetic unit (TSAU) as the basic unit of each layer's computation, accelerating every layer of a low-bit-width CNN. Finally, DoReFa-Net, with weights and activations represented in 1 bit and 2 bits respectively, is implemented on a Zynq XC7Z020, achieving 106 FPS throughput and 73.1% top-5 accuracy on the ImageNet dataset.
This implementation reaches the state of the art among FPGA-based neural network accelerators and strikes a good trade-off among accuracy, energy, and resource efficiency.
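For reference, the DoReFa-Net quantizers are simple to state: 1-bit weights keep the sign scaled by the mean absolute value, and k-bit activations are clipped to [0, 1] and rounded to 2^k - 1 levels. A minimal sketch following the DoReFa-Net formulation (training-time straight-through gradient handling omitted):

```python
import numpy as np

def quantize_weights_1bit(w):
    """1-bit weights: sign(w) scaled by the mean absolute value."""
    return np.sign(w) * np.mean(np.abs(w))

def quantize_activation(x, k=2):
    """k-bit activations: clip to [0, 1], then round to 2^k - 1 levels."""
    levels = (1 << k) - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels
```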

17-2 FPGA Acceleration of the Scoring Process of X!TANDEM for Protein Identification

Authors: Jin Qiu, Ping Kang, Li Ding, Yipeng Yuan, Wenbo Yin, Lingli Wang
4 pages

Abstract

Tandem mass spectrometry is the main technique for protein identification. X!Tandem is a protein database search engine, but a single search over a huge search space often takes hours or even days, so an efficient search method is urgently needed. Moreover, 70%-90% of X!Tandem's search time is spent on the scoring process, which mainly generates scores by computing the ion intensities of the tandem mass-spectrum peaks under test. This paper uses an FPGA to accelerate X!Tandem's scoring process, implementing one ion-generation module and six score-generation modules on a Xilinx Virtex-7 XC7VX690T FPGA. Against the pure software implementation of X!Tandem (on a 2.5 GHz Intel i7-4870 processor with 16 GB of memory) it obtains a 26x overall speedup, with 67x for ion generation and 17x for score generation. The score-generation module also scales linearly, which is better than previous parallel acceleration strategies.
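X!Tandem's score is its "hyperscore": a dot product between measured peak intensities and the theoretical fragment-ion spectrum, weighted by factorials of the matched b- and y-ion counts. A much-simplified software sketch of that formula (the real engine works over preprocessed, binned spectra):

```python
from math import factorial

def hyperscore(measured, b_ions, y_ions):
    """Simplified X!Tandem hyperscore: `measured` maps m/z bin -> intensity,
    `b_ions`/`y_ions` are theoretical fragment m/z bins of a candidate peptide."""
    b_sum = sum(measured.get(mz, 0.0) for mz in b_ions)
    y_sum = sum(measured.get(mz, 0.0) for mz in y_ions)
    n_b = sum(1 for mz in b_ions if mz in measured)   # matched b ions
    n_y = sum(1 for mz in y_ions if mz in measured)   # matched y ions
    return (b_sum + y_sum) * factorial(n_b) * factorial(n_y)
```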

16-1 A High Performance FPGA-Based Accelerator for Large-Scale Convolutional Neural Networks

Authors: Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, Lingli Wang*

Abstract

In recent years, convolutional neural networks have been widely used in computer vision, but some large-scale CNNs place extremely high demands on computing power and memory, which limits their hardware implementation. This paper proposes an end-to-end FPGA-based neural network accelerator that maps all layers of the network onto the same chip, so that the layers can compute in parallel in a pipelined manner to increase throughput. To improve throughput and resource utilization, it also proposes a new parallelization strategy that computes the fully connected (FC) layer in a batch-based manner, which improves the utilization of the internal bandwidth; in addition, by feeding the FC layer input data in two specific sizes, the required on-chip buffering can be greatly reduced. Experimental results for AlexNet on a Xilinx VC709 show that the accelerated design reaches 565.94 GOP/s and 391 FPS at a 156 MHz clock frequency.
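The batch-based FC idea amounts to turning many matrix-vector products into one matrix-matrix product, so each weight fetched from memory is reused across the whole batch. A minimal sketch of the reuse pattern (illustrative only, not the paper's dataflow):

```python
import numpy as np

def fc_per_image(weights, vectors):
    """Matrix-vector form: the weight matrix is re-read once per image."""
    return np.stack([weights @ v for v in vectors])

def fc_batched(weights, batch):
    """Batched form: each weight element, once fetched, is reused across
    the whole batch; (batch_size, in_dim) -> (batch_size, out_dim)."""
    return batch @ weights.T
```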

16-2 Reconfigurable Architecture for Stream Applications

Authors: Xitian Fan, Huimin Li, Wei Cao*, Lingli Wang

Abstract

This paper proposes a coarse-grained reconfigurable architecture (CGRA) for object detection tasks in computer vision. The CGRA mainly optimizes the stream-processing part and includes a programming module; it is written in VHDL and synthesized with an SMIC 55 nm process library. Eight kernels were tested on the CGRA: HOG, CNN, K-means, PCA, SPM, linear SVM, softmax, and Joint Bayesian, all of them operations commonly used in face/object detection. The experimental results show that the CGRA implementation achieves a 1443x energy-efficiency improvement over a CPU (Intel i7-3770) implementation and a 7.82x energy-efficiency improvement over an FPGA implementation.

16-3 Connect On the Fly: Enhancing and Prototyping of Cycle-Reconfigurable Modules

Authors: Hao Zhou*, Xinyu Niu†, Junqi Yuan*, Lingli Wang*, Wayne Luk

Abstract

To improve the FPGA's support for operations on dynamic data, this paper proposes a cycle-reconfigurable module that only needs the size and location of the data at runtime to complete a data access. The module comprises three parts: dynamic FIFOs, dynamic caches, and dynamic shared memories. Based on an SMIC 130 nm process, an FPGA chip including the cycle-reconfigurable module is designed. The entire module contains only 39 CLBs, and its configuration completes in 1.2 ns. It is well suited to large-scale search and sparse matrix multiplication, and its configuration time is 11x faster than that of a conventional FPGA.

16-4 Memory Efficient and High Performance Key-value Store on FPGA Using Cuckoo Hashing

Authors: Wei Liang, Wenbo Yin*, Ping Kang, Lingli Wang

Abstract

Key-value stores (KVS) are a popular topic, especially for large-scale Internet applications such as search engines, game servers, and cloud computing services: ensuring high performance, high reliability, high scalability, high availability, and low cost in a massive-data environment has become the focus of system architecture design, and the throughput and latency of the KVS are the biggest challenges. Parallel computing on FPGAs can greatly improve energy efficiency, and cuckoo hashing is an efficient way to implement a KVS, improving memory utilization while keeping an almost constant worst-case access time.
This paper implements a cuckoo-hashing-based KVS on an FPGA. It reaches 81.7% memory utilization, a latency of only 40 ns for insert, search, and delete operations, and a throughput of 200 MRPS for search and delete operations, 5x the current state of the art.
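Cuckoo hashing gives every key exactly two candidate slots, so a search or delete probes at most two locations; that bounded probe count is what makes the worst-case latency nearly constant and hardware-friendly. A minimal two-table software sketch (illustrative; the FPGA version operates on parallel memory banks, and a delete would simply clear the matching slot):

```python
class CuckooHash:
    """Minimal two-table cuckoo hash: every key has one slot per table,
    so get/delete probe at most two locations."""

    def __init__(self, size=1024, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key, which):
        return hash((key, which)) % self.size  # one hash per table

    def get(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value):
        for which in (0, 1):  # overwrite if the key already exists
            i = self._slot(key, which)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return True
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(item[0], which)
            if self.tables[which][i] is None:
                self.tables[which][i] = item
                return True
            # Evict the occupant and try to re-place it in the other table.
            item, self.tables[which][i] = self.tables[which][i], item
            which ^= 1
        return False  # too many kicks: a real store would rehash/resize
```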

15-1 Greedy Approach Based Heuristics for Partitioning SpMxV on FPGAs

Authors: Jiasen Huang 1, Weina Lu 2, Junyan Ren 1

Abstract

Early work on the zero-padding problem of sparse matrix-vector multiplication (SpMxV) mainly decomposed the sparse matrix into row vectors or sub-matrices, but the sparsity still inevitably degraded performance. This paper proposes a recursive merging method that, following a greedy strategy, keeps merging non-zero vectors into row-vector sets so that each set is the current local optimum. On the public University of Florida Sparse Matrix Collection benchmark, the proposed algorithm achieves the highest mean density (96%), and SpMxV on an XC7V485T with 32 processing elements achieves a 249x speedup.
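One way to picture the greedy merging: pack rows of the sparse matrix into fixed-capacity sets so that each set's lanes are as full of non-zeros as possible. The sketch below is a first-fit-decreasing illustration of that idea, not the paper's exact recursive heuristic:

```python
def greedy_merge(rows, lane_width):
    """First-fit-decreasing packing of rows (each a list of non-zero column
    indices) into sets holding at most `lane_width` non-zeros in total.
    Assumes no single row exceeds `lane_width` non-zeros."""
    sets = []  # each entry: [total_nnz, list_of_row_ids]
    for r in sorted(range(len(rows)), key=lambda r: -len(rows[r])):
        nnz = len(rows[r])
        for s in sets:
            if s[0] + nnz <= lane_width:  # row fits into this set
                s[0] += nnz
                s[1].append(r)
                break
        else:  # no existing set has room: open a new one
            sets.append([nnz, [r]])
    density = sum(s[0] for s in sets) / (lane_width * len(sets)) if sets else 0.0
    return sets, density
```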

15-2 UniStream: A Unified Stream Architecture Combining Configuration and Data Processing

Authors: Jian Yan 1, Jifang Jin 1, Ying Wang 1, Xuegong Zhou 1, Philip Leong 2 and Lingli Wang 1*

Abstract

UniStream (unified stream) combines bitstream configuration and data-stream processing into one architecture, and provides a unified API supporting bitstream configuration, data-stream processing, and stream interconnection. The paper proposes a cost model for evaluating stream interconnection, hardware task configuration, and system-level data-stream processing, which can be used for cost estimation in the early stages of development. The flexibility and high energy efficiency of UniStream are demonstrated on Xilinx Virtex-5 and Virtex-6 FPGAs. Experimental results on bitstream configuration/readback, data encryption/decryption, and the discrete cosine transform show that adopting different streaming modes can significantly improve performance.

Origin blog.csdn.net/qq_37151108/article/details/107491962