ZKP hardware acceleration

1. Introduction

This article focuses on:

  • 1) What is hardware acceleration? Why is hardware acceleration needed?
  • 2) The key computing primitives of ZKP:
    • Multiscalar Multiplication
    • Number Theoretic Transform
    • Arithmetic Hashes
  • 3) Required hardware resources
  • 4) Limitations of acceleration
  • 5) The current status of hardware acceleration
  • 6) The future direction of hardware acceleration

2. What is hardware acceleration? Why is hardware acceleration needed?

2.1 What is hardware acceleration?

Hardware acceleration refers to the use of dedicated hardware to speed up a particular operation, making it faster and more efficient.

Hardware acceleration can include:

  • Using commercial off-the-shelf (COTS) hardware and optimizing functions and code for it
  • Developing new, custom hardware for specific tasks

Existing hardware includes CPUs, GPUs, and FPGAs, while custom hardware usually refers to ASICs.
There is a long history of using custom hardware to accelerate computationally expensive tasks such as:

  • Floating Point (FPU): Scientific Computing
  • Digital Signals (DSP): Audio and Video
  • Graphics (GPU): games, video, etc.
  • AI (TPU): Machine Learning Training and Inference
  • Networking (NPU/NIC): network processing
  • Cryptography

2.2 Hardware acceleration of cryptography

Current hardware acceleration for cryptography includes:

  • 1) Hash operation:
    • SHA-NI
    • Bitcoin Miners
    • Altcoin Miners
  • 2) Encryption and public key cryptography:
    • Encryption (like AES-NI)
    • Key exchange (TLS offload)
    • Digital signatures (Intel QuickAssist)
  • 3) Fully homomorphic encryption:
    • GPU
    • FPGA
  • 4) VDF (Verifiable Delay Functions)
  • 5) Zero-Knowledge Proofs

2.3 Why is hardware acceleration needed?

ZK (and non-ZK) proof generation has high overhead compared to direct computation, which means:

  • Compared with directly checking the witness, the total cost for the prover is roughly 1 million to 10 million times higher: a program that takes only one second to run on a laptop may take a single-threaded SNARK prover tens or hundreds of days.
  • zkEVM:
    • Scroll zkEVM: for an on-chain verifier covering 1 million gas, the corresponding prover takes about 40 minutes on a CPU to generate the ZKP. Compared with Polygon EVM's 10 million+ gas/second limit, this is an overhead of about 25,000x.
  • zkVM:
    • Risc0 zkVM executes at roughly ~50 kHz, versus ~5 GHz for modern CPUs, an overhead of 100,000x or more.

For different purposes, the design of hardware acceleration is also different:

  • 1) Throughput: increase the number of operations processed by a single system.
  • 2) Overhead: reduce the cost per operation. For example, Bitcoin mining rigs are designed to reduce capital cost ($/hash) and operating cost (watts/hash).
  • 3) Latency: reduce the latency of a single operation. For example, zkBridge achieves faster finality by reducing proof generation time.

2.4 What needs to be accelerated in ZKP

Each proof system, and its associated implementation, may have different computational requirements.
Nevertheless, the three most expensive computing operations are mainly:

  • Multiscalar Multiplication (MSM)
  • Number Theoretic Transform (NTT)
  • Arithmetic hashes (such as Poseidon)

The types of operations required by different proof systems to generate a proof are mainly determined by the commitment scheme adopted:

  • SNARKs: usually about 65% of the time is spent on MSM and NTT operations.
  • STARKs: about 65% of the time is spent on NTT and hash operations.

3. The key computing primitives of ZKP

3.1 MSM and its acceleration

Multiscalar Multiplication (MSM) is:

  • An algorithm for summing many scalar multiplications.
  • It can be regarded as a 'dot product' between a vector of elliptic curve points and a vector of scalars.
  • Because of this structure, individual (or groups of) scalar multiplications are easy to parallelize: they can be split across different hardware engines and the partial results accumulated at the end (see the sketch below).
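
The following is a minimal sketch of MSM's "dot product" structure and how it splits across workers. As a stand-in for elliptic curve points it uses integers modulo a toy prime (so group addition is just modular addition); the modulus, worker count, and test values are illustrative assumptions, not a real curve implementation.

```python
# Minimal MSM sketch: sum_i k_i * P_i, split into chunks that could run on
# separate hardware engines and then be accumulated.  Integers mod a toy prime
# stand in for elliptic curve points (this is NOT a real curve library).
from concurrent.futures import ThreadPoolExecutor

P = 2**31 - 1  # toy group modulus (hypothetical)

def scalar_mul(k, point):
    """Stand-in for elliptic curve scalar multiplication k * P."""
    return (k * point) % P

def partial_msm(scalars, points):
    """MSM over one chunk: sum_i k_i * P_i."""
    acc = 0
    for k, pt in zip(scalars, points):
        acc = (acc + scalar_mul(k, pt)) % P
    return acc

def msm(scalars, points, n_workers=4):
    """Split the MSM into chunks, compute each chunk independently,
    then accumulate the partial results."""
    chunk = max(1, (len(scalars) + n_workers - 1) // n_workers)
    jobs = [(scalars[i:i + chunk], points[i:i + chunk])
            for i in range(0, len(scalars), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        partials = ex.map(lambda job: partial_msm(*job), jobs)
    return sum(partials) % P

scalars = [3, 7, 11, 2, 5, 9, 4, 6]
points = [10, 20, 30, 40, 50, 60, 70, 80]
assert msm(scalars, points) == partial_msm(scalars, points)
```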

There are various optimization methods to reduce the amount of calculation required to calculate the MSM, such as:

  • For larger MSM sizes, the Pippenger algorithm can reduce the computational overhead from linear to roughly O(n / log(n)) (a minimal sketch follows this list).
  • Replacing point coordinate representations (such as Affine, Jacobian) and curve representations (such as Edwards) can also reduce the number of field operations required for a single curve operation.
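
Below is a minimal sketch of the Pippenger (bucket) method over the same kind of toy additive group, with integers mod a prime standing in for curve points. The window size c, modulus, and test values are illustrative assumptions, not a production implementation.

```python
# Pippenger/bucket-method sketch over a toy additive group (integers mod P
# stand in for curve points).  Each c-bit window of the scalars is handled by
# bucketing points, then buckets are combined with a running sum so that
# bucket b is effectively counted b times.
P = 2**31 - 1  # toy group modulus (hypothetical)

def pippenger_msm(scalars, points, c=4):
    max_bits = max(s.bit_length() for s in scalars)
    n_windows = (max_bits + c - 1) // c
    total = 0
    for w in range(n_windows):
        # Bucket points by the value of the w-th c-bit window of their scalar.
        buckets = [0] * (1 << c)
        for k, pt in zip(scalars, points):
            b = (k >> (w * c)) & ((1 << c) - 1)
            if b:
                buckets[b] = (buckets[b] + pt) % P
        # Running-sum trick: window_sum = sum_b b * buckets[b].
        running, window_sum = 0, 0
        for b in range(len(buckets) - 1, 0, -1):
            running = (running + buckets[b]) % P
            window_sum = (window_sum + running) % P
        # Scale by 2^(c*w) (a scalar multiplication in the toy group).
        total = (total + window_sum * pow(2, c * w, P)) % P
    return total

# Sanity check against the naive sum_i k_i * P_i in the toy group.
scalars = [300, 7, 111, 2, 55, 90, 4, 600]
points = [10, 20, 30, 40, 50, 60, 70, 80]
assert pippenger_msm(scalars, points) == sum(k * p for k, p in zip(scalars, points)) % P
```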

The problems with MSM acceleration are:

  • When offloading MSM computations from the host device, the scalars and points must first be transferred to the accelerator. The available communication bandwidth therefore limits the maximum achievable performance of the accelerator, as the rough estimate below illustrates.
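
As a rough illustration of the bandwidth concern, the transfer time alone can eat into the proving budget. All numbers below are assumptions chosen for the example: BLS12-style scalar/point sizes and a nominal PCIe 4.0 x16 link.

```python
# Back-of-the-envelope estimate of MSM offload transfer time.
# Assumed (illustrative) parameters: 2^24 scalar/point pairs, 32-byte scalars,
# 96-byte affine points, and ~32 GB/s effective PCIe 4.0 x16 bandwidth.
n_points = 2**24
bytes_per_pair = 32 + 96          # scalar + affine point
pcie_bandwidth = 32e9             # bytes/second (nominal, assumed)

transfer_bytes = n_points * bytes_per_pair
transfer_seconds = transfer_bytes / pcie_bandwidth
print(f"{transfer_bytes / 1e9:.1f} GB to move, ~{transfer_seconds:.2f} s just for the transfer")
```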

3.2 NTT and its acceleration

The Number Theoretic Transform (NTT) is an algorithm used for multiplying two polynomials:

  • It can be thought of as an FFT/DFT over finite field elements.
  • Commonly used algorithms such as Cooley-Tukey reduce the complexity from O(N^2) to O(N log N); a minimal sketch follows this list.
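
To make the O(N log N) structure concrete, here is a minimal recursive radix-2 Cooley-Tukey NTT over a toy field (p = 17, where 9 is a primitive 8th root of unity). Real systems use much larger NTT-friendly primes; the parameters here are purely illustrative.

```python
# Minimal radix-2 Cooley-Tukey NTT over a toy prime field (p = 17).
def ntt(a, omega, p):
    """NTT of a list a whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], omega * omega % p, p)
    odd = ntt(a[1::2], omega * omega % p, p)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % p
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p
        w = w * omega % p
    return out

def naive_ntt(a, omega, p):
    """O(N^2) reference: out[i] = sum_j a[j] * omega^(i*j)."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

p, omega = 17, 9   # 9 is a primitive 8th root of unity mod 17
a = [1, 2, 3, 4, 0, 0, 0, 0]
assert ntt(a, omega, p) == naive_ntt(a, omega, p)
```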

The problems with NTT acceleration are:

  • When offloading NTT operations from the host device, the field elements must also be transferred to the accelerator. The available communication bandwidth limits the maximum achievable performance of the accelerator.
  • Due to its algorithmic structure, the NTT is not easy to parallelize: each element must interact with many other elements, so the computation is hard to split across devices.
  • Additionally, all of these elements must be kept in memory during the computation, which requires a large amount of memory (see the rough estimate below).
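
As a rough illustration of the memory requirement (the NTT size and field-element width below are assumptions for the example, not the parameters of any specific system):

```python
# Back-of-the-envelope NTT memory footprint.
# Assumed parameters: a 2^28-point NTT over a 256-bit (32-byte) field.
n = 2**28
bytes_per_element = 32
print(f"~{n * bytes_per_element / 2**30:.0f} GiB just to hold the input/output elements")
# -> ~8 GiB, before any twiddle factors or intermediate buffers.
```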

3.3 Arithmetic hash and its acceleration

Arithmetic hashes are needed because:

  • Many ZKP use cases involve proving knowledge of a hash preimage, or using hashes, Merkle roots, and Merkle inclusion paths to efficiently represent data outside the circuit.
  • In ZKP proof systems, arithmetic hash functions (such as Poseidon, Rescue Prime) are often used to replace traditional hash functions (such as SHA).
  • Although arithmetic hashes are more expensive to compute natively, they are more efficient when used inside circuits, i.e. they require fewer constraints.
  • A variety of algorithm parameters can be chosen when instantiating an arithmetic hash (such as field size, prime, number of rounds, MDS matrix structure), and different choices affect the computational overhead.
  • Efficient native implementations of arithmetic hashes are dominated by modular multiplication; a simplified sketch of the round structure follows this list.
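
The sketch below shows only the generic round structure of a Poseidon-style permutation (add round constants, apply an x^5 S-box, multiply by an MDS-style matrix). The field, state width, round count, constants, and matrix are all made-up placeholders, not the real Poseidon parameters; the point is simply that the native cost is dominated by modular multiplications.

```python
# Simplified Poseidon-style permutation round structure over a toy field.
# All parameters below are illustrative placeholders, NOT real Poseidon.
P = 2**31 - 1   # toy prime field (hypothetical)
T = 3           # state width
ROUNDS = 8      # toy round count
ROUND_CONSTANTS = [[(r * T + i + 1) * 7919 % P for i in range(T)]
                   for r in range(ROUNDS)]
MDS = [[(i + j + 1) % P for j in range(T)] for i in range(T)]  # stand-in matrix

def permutation(state):
    for r in range(ROUNDS):
        # 1) add round constants
        state = [(s + c) % P for s, c in zip(state, ROUND_CONSTANTS[r])]
        # 2) S-box x -> x^5: few constraints in-circuit, but several
        #    modular multiplications natively
        state = [pow(s, 5, P) for s in state]
        # 3) mix layer: MDS-style matrix multiply (more modular multiplications)
        state = [sum(MDS[i][j] * state[j] for j in range(T)) % P
                 for i in range(T)]
    return state

def toy_hash(a, b):
    """Toy 2-to-1 compression: absorb two field elements, permute, squeeze one."""
    return permutation([a % P, b % P, 0])[0]

print(hex(toy_hash(1, 2)))
```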

4. Required hardware resources

4.1 Modular multiplication

The underlying primitives of operations such as MSM, NTT, and hashing are:

  • Finite field operations and elliptic curve (ECC) operations
  • Both finite field and curve operations are dominated by modular multiplication (ModMul)
  • Naive modular multiplication is O(N^2) in the operand bit-width. For example, a 384-bit curve operation is about (384/256)^2 ≈ 2.25 times more expensive than a 256-bit one.

The performance overhead of high-level operations depends on different characteristics, such as:
  • Number of operations
  • Field size
  • Curve point size
  • Point representation (such as Affine vs. Jacobian)
  • Prime/modulus features
  • Operational complexity (such as Poseidon's x^5 vs. x^7)
  • etc.

From these characteristics, the total number of ModMul operations required for high-level operations can usually be calculated.
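
To see where the (384/256)^2 ≈ 2.25 factor comes from, here is a sketch of schoolbook multi-limb multiplication that simply counts 64-bit word multiplications. Real libraries use Montgomery or Barrett reduction; this is only a cost model, and the operand values are arbitrary.

```python
# Rough sketch of why naive (schoolbook) modular multiplication is O(N^2)
# in the number of machine-word limbs: multiplying two N-limb operands takes
# N*N word-by-word multiplications before the reduction step.
LIMB_BITS = 64

def limbs(x, n_limbs):
    """Split an integer into n_limbs little-endian 64-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & (2**LIMB_BITS - 1) for i in range(n_limbs)]

def schoolbook_mul(a_limbs, b_limbs):
    """Schoolbook multi-limb multiplication; also counts word multiplications."""
    out = [0] * (len(a_limbs) + len(b_limbs))
    word_muls = 0
    for i, a in enumerate(a_limbs):
        carry = 0
        for j, b in enumerate(b_limbs):
            word_muls += 1
            t = out[i + j] + a * b + carry
            out[i + j] = t & (2**LIMB_BITS - 1)
            carry = t >> LIMB_BITS
        out[i + len(b_limbs)] += carry
    return out, word_muls

# 256-bit operands = 4 limbs -> 16 word multiplications;
# 384-bit operands = 6 limbs -> 36 word multiplications (2.25x more).
_, muls_256 = schoolbook_mul(limbs(2**255 - 19, 4), limbs(3, 4))
_, muls_384 = schoolbook_mul(limbs(2**383 - 187, 6), limbs(3, 6))
print(muls_256, muls_384, muls_384 / muls_256)  # 16 36 2.25
```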

4.2 Choosing the right hardware

Since the computational overhead is dominated by modular multiplication, the chosen hardware platform should be able to perform a large number of modular multiplications quickly and cheaply.

When evaluating hardware performance, mainly look at:

  • The number of hardware multipliers
  • The size of the hardware multipliers
  • The speed/frequency of each operation

For example, hardware platforms can be compared with an approximate "Mul Power" metric, where the Mul Power calculation rule is:

  • number of multipliers * multiplier size * frequency / 1000
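
For instance, a comparison using this rule might look like the following. All platform numbers below are made-up placeholders, not measured specifications.

```python
# Illustrative "Mul Power" comparison using the rule above:
#   Mul Power = number of multipliers * multiplier size (bits) * frequency (MHz) / 1000
def mul_power(n_multipliers, multiplier_bits, freq_mhz):
    return n_multipliers * multiplier_bits * freq_mhz / 1000

platforms = {
    "CPU  (hypothetical)": mul_power(n_multipliers=64, multiplier_bits=64, freq_mhz=3000),
    "GPU  (hypothetical)": mul_power(n_multipliers=5000, multiplier_bits=32, freq_mhz=1500),
    "FPGA (hypothetical)": mul_power(n_multipliers=8000, multiplier_bits=27, freq_mhz=500),
}
for name, power in platforms.items():
    print(f"{name}: Mul Power ~ {power:,.0f}")
```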

5. Limitations of acceleration

The 2 key elements of hardware acceleration are:

  • 1) Algorithm:
    • A "hardware-friendly" algorithm should be chosen. For example, off-the-shelf hardware (COTS) GPUs have thousands of cores and are suitable for highly parallelizable algorithms.
    • In addition, an efficient algorithm should aim to reduce the number of operations required (such as modular multiplication) to reduce the overall computational overhead.
  • 2) Efficient code implementation:
    • Once an efficient, "hardware-friendly" algorithm is found, it needs to be tuned to fit the hardware's capabilities.
    • To maximize performance, the actual code implementation should use as many of the hardware resources as possible. This usually requires dropping down to low-level assembly primitives.

The speedup is limited by:

  • Multiplication is not the only resource required
  • Other non-computing resources can also become bottlenecks, such as:
    • Memory (for example, NTT is sometimes limited by memory access speed):
      • Memory capacity (e.g. 12GB)
      • Fast memory (such as DDR, HBM)
    • High-speed data transfer (such as PCIe v4): for example, current GPU and FPGA NTT accelerators are often limited not by compute resources but by the data-movement capability between the host and the accelerator.
    • Other computing resources and arithmetic units

The main pitfalls in acceleration today are:

  • 1) Communication. Over the past few years, data movement has increasingly become the bottleneck in 'big data' systems: highly parallel computation is often faster than the data movement needed to feed it.
  • 2) Amdahl's Law: "When improving the performance of a part of a system, the impact on the performance of the whole system depends on: 1. how important this part is; 2. how much this part's performance has been improved."
    • If MSM/NTT/arithmetic-hash operations account for about 65% of the time, then even completely eliminating them only speeds up proof generation by about 3x (see the worked example below).
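
A quick worked example of Amdahl's Law with the ~65% figure above; the 10x accelerator speedup is an assumed illustration.

```python
# Amdahl's Law: overall speedup when a fraction p of the work is sped up by s.
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl_speedup(p=0.65, s=10))            # ~2.41x with a 10x accelerator
print(amdahl_speedup(p=0.65, s=float("inf")))  # ~2.86x even if that part were free
```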

6. The Current State of Hardware Acceleration

An example of production-grade ZKP hardware acceleration:

  • Currently Filecoin is the largest production ZKP system, processing 1 million+ ZKPs per day.
  • Filecoin uses 'Proof-of-Replication' (PoRep) ZKP proofs.
  • Each PoRep involves:
    • 470GB of Poseidon hashing: ~100 minutes on a CPU, ~1 minute on a GPU.
    • 10 Groth16 Proofs with about 13 million constraints:
      • Requires about 4.5 billion MSM operations (about 4.2 billion MSM operations in G1, and about 300 million MSM operations in G2)
      • It takes about 60 minutes to run on the CPU and 3 minutes on the GPU.

At present, various hardware acceleration libraries and experiments for GPUs and FPGAs are available online. ZPrize.io is an excellent resource for finding code and learning about optimizations and current performance.

7. Future direction of hardware acceleration

The future directions of ZKP hardware acceleration are:

  • Improved algorithms (such as better MSM algorithms)
  • Improved primitives (such as new hash functions)
  • Improved proof systems (e.g. fewer total operations)
  • Simplified proofs:
    • Reduced functionality and/or communication
    • STARKs: no MSM
    • Nova: no NTT
  • Improved code implementations
  • Custom hardware (such as ASICs)

References

[1] Kelly Olson, "ZKP MOOC Lecture 16: Hardware Acceleration of ZKP", ZKP MOOC, May 2023 (shared video).

