Baby-Step Giant-Step & Homomorphic DFT

References:

  1. [CT65] Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series[J]. Mathematics of computation, 1965, 19(90): 297-301.
  2. [Shoup95] Shoup V. A new polynomial factorization algorithm and its implementation[J]. Journal of Symbolic Computation, 1995, 20(4): 363-397.
  3. [HS14] Halevi S, Shoup V. Algorithms in helib[C]//Advances in Cryptology–CRYPTO 2014: 34th Annual Cryptology Conference, Santa Barbara, CA, USA, August 17-21, 2014, Proceedings, Part I 34. Springer Berlin Heidelberg, 2014: 554-571.
  4. [CHKKS18] Cheon J H, Han K, Kim A, et al. Bootstrapping for approximate homomorphic encryption[C]//Advances in Cryptology–EUROCRYPT 2018: 37th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tel Aviv, Israel, April 29-May 3, 2018 Proceedings, Part I 37. Springer International Publishing, 2018: 360-384.
  5. [CHH18] Cheon J H, Han K, Hhan M. Faster homomorphic discrete fourier transforms and improved fhe bootstrapping[J]. Cryptology ePrint Archive, 2018.
  6. [HHC19] Han K, Hhan M, Cheon J H. Improved homomorphic discrete fourier transforms and fhe bootstrapping[J]. IEEE Access, 2019, 7: 57361-57370.
  7. Nussbaumer Transform and Amortized FHEW bootstrapping
  8. Chimera: Hybrid RLWE-FHE scheme
  9. Quick Multiplication Techniques: Karatsuba, Toom, Good, Schonhage, Strassen, Nussbaumer
  10. Paterson-Stockmeyer polynomial evaluation algorithm

Baby-Step Giant-Step

Shoup95

[Shoup95] studies and implements a BSGS method for factoring univariate polynomials into irreducible factors. Polynomials are represented via CRT and FFT (predating the Double-CRT of GHS12), and fast polynomial multiplication, division, inversion, squaring, GCD and other operations are implemented.

Polynomial factorization can be divided into three steps:

  1. square-free factorization: decompose $f = \prod_i f_i$, where each $f_i$ is square-free
  2. distinct-degree factorization: decompose each square-free polynomial as $f_i = \prod_j f_{i,j}$, where $f_{i,j}$ is the product of the irreducible factors of degree $j$
  3. equal-degree factorization: each $f_{i,j}$ is a square-free polynomial whose irreducible factors all have the same degree; split it into these irreducible polynomials

The main work is concentrated in step 2. [Shoup95] observed the following fact: for any non-negative integers $a,b \in \mathbb Z^+$, the polynomial $h_{a,b}(x) = x^{p^a} - x^{p^b} \in GF(p)[x]$ is a multiple of every irreducible polynomial $f$ that satisfies $\deg f \mid (a-b)$.
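
A short justification of this fact, assuming $a \ge b$ and using only the Frobenius identity $(u+v)^p = u^p + v^p$ in characteristic $p$ (this derivation is a standard argument, not quoted from [Shoup95]):

$$x^{p^a} - x^{p^b} = \left(x^{p^{a-b}} - x\right)^{p^b}$$

Since $x^{p^d} - x$ is the product of all monic irreducible polynomials over $GF(p)$ whose degree divides $d$, taking $d = a-b$ shows that the irreducible factors of $h_{a,b}$ are exactly those whose degree divides $a-b$.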

For a square-free polynomial with $\deg f \le n$, its proper factors can be found by searching for irreducible factors of degree at most $n/2$. Let $f_d$, $1 \le d \le n$, denote the product of all irreducible factors of degree $d$. We can enumerate all $1\le a-b\le n$, compute $h_{a,b}(x)$, and then use $\gcd(h_{a,b},f)$ to obtain $f_d$.

[Shoup95] uses BSGS to compute the $h_{a,b}$: let $B$ be an upper bound on the degree of the proper factors and write $B=l \cdot m$. The baby steps are $\{i:1 \le i \le l\}$ and the giant steps are $\{l \cdot j:1 \le j \le m\}$.

[Figure: BSGS distinct-degree factorization algorithm of [Shoup95], using the polynomials $h_i$ and $H_j$.]

However, if $h_i, H_j$ (that is, $h_i = x^{p^i} \bmod f$ and $H_j = x^{p^{lj}} \bmod f$) are computed directly, the algorithm is still impractical. [Shoup95] computes them iteratively: $h_{i+1} = h_i(h_1) \pmod f$ and $H_{j+1} = H_j(H_1) \pmod f$. The question is now how to compute such modular compositions, of the form $g(h) \pmod f$, quickly.

[Shoup95] again uses BSGS (similar to the [PS73] polynomial evaluation algorithm): choose a parameter $t \approx \sqrt n$ and precompute the table $h^i \pmod f$, $0 \le i \le t$. Then write:
$$g(x) = \sum_{j=0}^{n/t} g_j(x) \cdot y^j,\,\, y=x^t,\,\, \deg g_j < t$$
Each $g_j(h)$ is therefore obtained from the precomputed table using only additions (and scalar multiplications),
$$g_j(h) = \sum_{i=0}^{t-1} g_{j,i} \cdot h^i \pmod f$$
Next, Horner's rule gives
$$g(h) = ((g_{n/t}(h) \cdot h^t + \cdots)\cdot h^t + g_1(h))\cdot h^t + g_0(h) \pmod f$$
All polynomial operations are carried out in Double-CRT representation, and the total complexity is $O(n^{2.5}+n \log n \log\log n\log p)$.
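
The table-plus-Horner scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: polynomials are plain coefficient lists (lowest degree first), $f$ is monic, multiplication is schoolbook rather than Double-CRT, and the helper names `poly_rem`, `poly_mulmod`, `compose_horner` and `compose_bsgs` are made up for this example.

```python
import math

def poly_rem(a, f, p):
    """Remainder of a modulo a monic polynomial f over GF(p)."""
    a = [c % p for c in a]
    n = len(f) - 1                          # n = deg f
    for i in range(len(a) - 1, n - 1, -1):
        c = a[i]
        if c:
            for j in range(n + 1):          # subtract c * x^(i-n) * f
                a[i - n + j] = (a[i - n + j] - c * f[j]) % p
    a = a[:n]
    return a + [0] * (n - len(a))           # always exactly n coefficients

def poly_mulmod(a, b, f, p):
    """(a * b) mod f over GF(p), schoolbook multiplication."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            prod[i + j] = (prod[i + j] + x * y) % p
    return poly_rem(prod, f, p)

def compose_horner(g, h, f, p):
    """Reference g(h) mod f: one full multiplication per coefficient of g."""
    h = poly_rem(h, f, p)
    acc = [0] * (len(f) - 1)
    for c in reversed(g):
        acc = poly_mulmod(acc, h, f, p)
        acc[0] = (acc[0] + c) % p
    return acc

def compose_bsgs(g, h, f, p):
    """g(h) mod f with roughly 2*sqrt(len(g)) full multiplications (g non-empty)."""
    n = len(f) - 1
    h = poly_rem(h, f, p)
    t = math.isqrt(len(g) - 1) + 1          # block size t ~ sqrt(deg g)
    # Baby steps: table of h^0, ..., h^t mod f.
    table = [poly_rem([1], f, p)]
    for _ in range(t):
        table.append(poly_mulmod(table[-1], h, f, p))
    # Split g into blocks g_j of t coefficients: g = sum_j g_j(x) * x^(t*j).
    blocks = [g[s:s + t] for s in range(0, len(g), t)]
    result = [0] * n
    for block in reversed(blocks):          # giant steps: Horner in h^t
        result = poly_mulmod(result, table[t], f, p)
        for i, c in enumerate(block):       # g_j(h) needs table lookups only
            for d in range(n):
                result[d] = (result[d] + c * table[i][d]) % p
    return result
```

For example, with `p = 7` and `f = [1, 1, 0, 0, 1]` (that is, $x^4 + x + 1$), `compose_bsgs(g, h, f, p)` agrees with `compose_horner(g, h, f, p)` while performing far fewer full modular multiplications once $\deg g$ grows.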

CT65

[CT65] gives a recursive decomposition of the DFT, which can be viewed as a BSGS version of the FFT algorithm. The DFT formula is:
$$A_j := \sum_{k=0}^{N-1} a_k \cdot \zeta^{jk}$$
Applying BSGS, factor $N=N_1 \cdot N_2$ and set the indices
$$\begin{aligned} j &:= N_1 \cdot j_1 + j_0,\,\, j_0 \in [N_1], j_1 \in [N_2]\\ k &:= N_2 \cdot k_1 + k_0,\,\, k_0 \in [N_2], k_1 \in [N_1]\\ \end{aligned}$$
Then
$$\begin{aligned} A_{j_1,j_0} &:= \sum_{k_0 \in [N_2]} \sum_{k_1\in[N_1]} a_{k_1,k_0} \cdot \zeta^{jk}\\ &= \sum_{k_0\in[N_2]}\left(\sum_{k_1\in[N_1]} a_{k_1,k_0} \cdot \zeta^{N_2jk_1}\right) \cdot \zeta^{jk_0}\\ &= \sum_{k_0 \in [N_2]} \left(\sum_{k_1 \in [N_1]} a_{k_1,k_0} \cdot \zeta^{N_2jk_1}\right) \cdot \zeta^{j_0k_0} \cdot \zeta^{N_1j_1k_0}\\ \end{aligned}$$
Since $\zeta^N = 1$, we have $\zeta^{N_2jk_1} = \zeta^{N_2j_0k_1}$, so the inner sum depends only on $j_0$ and $k_0$. That is, arrange $a_N$ in row-major order as an $N_1 \times N_2$ matrix $a_{N_1 \times N_2}$ (row index $k_1$, column index $k_0$), and proceed as follows:

  1. For each $k_0$, use the $N_1 \times N_1$ matrix
    $$W_1:=\{\zeta_{N_1}^{j_0k_1}\}_{j_0,k_1}, \quad \zeta_{N_1} := \zeta^{N_2}$$
    to compute the length-$N_1$ NTT of the column vector $a_{k_0}$ (with roots of unity $\{\zeta_{N_1}^{j_0},j_0 \in [N_1]\}$), which gives the $N_1 \times N_2$ matrix
    $$W_1 \times a_{N_1 \times N_2} = \left\{A_{j_0,k_0}' := \sum_{k_1 \in [N_1]} a_{k_1,k_0} \cdot \zeta^{N_2jk_1}\right\}_{j_0,k_0}$$

  2. Take the Hadamard product with the $N_1 \times N_2$ matrix
    $$W_2 := \{\zeta^{j_0k_0}\}_{j_0,k_0}$$
    so that the next operation is a standard NTT (otherwise the roots of unity used in the subsequent NTT would have to be twisted accordingly). The result is the $N_1 \times N_2$ matrix
    $$W_2 \odot A_{N_1 \times N_2}' = \left\{A_{j_0,k_0}'' := \zeta^{j_0k_0} \cdot\sum_{k_1 \in [N_1]} a_{k_1,k_0} \cdot \zeta^{N_2jk_1}\right\}_{j_0,k_0}$$

  3. For each $j_0$, use the $N_2 \times N_2$ matrix
    $$W_3 := \{\zeta_{N_2}^{j_1k_0}\}_{j_1,k_0}, \quad \zeta_{N_2} := \zeta^{N_1}$$
    to compute the length-$N_2$ NTT of the row vector $A_{j_0}''$ (with roots of unity $\{\zeta_{N_2}^{j_1},j_1 \in [N_2]\}$), which gives the $N_2 \times N_1$ matrix
    $$W_3 \times (A_{N_1 \times N_2}'')^T = \{A_{j_1,j_0}\}_{j_1,j_0}$$

Reading the $N_2 \times N_1$ matrix $A_{N_2 \times N_1}$ in row-major order gives $A_N = NTT(a_N)$.

In summary, with $a_N, A_N$ both arranged as matrices in row-major order (of different shapes), we have:
$$A_{N_2 \times N_1} = W_3 \times \Big(W_2 \odot \big(W_1 \times a_{N_1 \times N_2}\big)\Big)^T$$
In fact, this process can be expressed by the ring homomorphism of the Nussbaumer Transform
$$\mathbb F[x]/(x^N-1) \cong \Big(\mathbb F[y]/(y^{N_1}-1)\Big)[x]/(x^{N_2}-y) \cong \Big(\mathbb F[y]/(y^{N_1}-1)\Big)[z]/(z^{N_2}-1)$$
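
The three-step matrix form can be checked numerically. The sketch below is an illustration over the complex numbers rather than a finite field, and it assumes $\zeta = e^{-2\pi i/N}$ (any primitive $N$-th root of unity would do); the function names `dft_naive` and `dft_two_level` are made up for this example.

```python
import cmath

def dft_naive(a):
    """Direct O(N^2) evaluation of A_j = sum_k a_k * zeta^(j*k)."""
    N = len(a)
    zeta = cmath.exp(-2j * cmath.pi / N)
    return [sum(a[k] * zeta ** (j * k) for k in range(N)) for j in range(N)]

def dft_two_level(a, N1, N2):
    """A = W3 x (W2 (.) (W1 x a))^T with N = N1 * N2, all in row-major order."""
    N = N1 * N2
    zeta = cmath.exp(-2j * cmath.pi / N)
    # a_mat[k1][k0] = a[N2*k1 + k0]: the N1 x N2 row-major arrangement.
    a_mat = [[a[N2 * k1 + k0] for k0 in range(N2)] for k1 in range(N1)]
    # Step 1 (W1): length-N1 transform of every column k0, root zeta^N2.
    A1 = [[sum(zeta ** (N2 * j0 * k1) * a_mat[k1][k0] for k1 in range(N1))
           for k0 in range(N2)] for j0 in range(N1)]
    # Step 2 (W2): entrywise twiddle factors zeta^(j0*k0).
    A2 = [[zeta ** (j0 * k0) * A1[j0][k0] for k0 in range(N2)]
          for j0 in range(N1)]
    # Step 3 (W3): length-N2 transform of every row j0, root zeta^N1,
    # stored transposed as an N2 x N1 matrix indexed by (j1, j0).
    A = [[sum(zeta ** (N1 * j1 * k0) * A2[j0][k0] for k0 in range(N2))
          for j0 in range(N1)] for j1 in range(N2)]
    # Row-major read-out: position N1*j1 + j0 holds A_j.
    return [A[j1][j0] for j1 in range(N2) for j0 in range(N1)]
```

Comparing `dft_two_level(a, 3, 4)` with `dft_naive(a)` on a length-12 input shows that the two agree up to floating-point error, while the two-level version performs only length-3 and length-4 transforms.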

CHKKS18

[HS14] first proposed the diagonal method for matrix-vector multiplication: homomorphic linear maps are realized with the SIMD operations Hadamard and Rotate. [CHKKS18] adopts the BSGS trick to compute them. All indices are implicitly taken $\pmod n$. Basic notation:

  • For any linear transformation $M \in \mathbb C^{n \times n}$, write $diag_i(M) = [M_{0,i}, M_{1,i+1},\cdots,M_{n-1,i+n-1}]$ for the $i$-th diagonal, $i \in \mathbb Z$ (the index may be negative: $-i$ selects the $i$-th diagonal below the main one)
  • For any vector $v \in \mathbb C^{n}$, write $rot_i(v) = [v_i,v_{i+1},\cdots,v_{i+n-1}]$ for the circular left shift by $i \in \mathbb Z$ positions (the index may be negative: $-i$ is a circular right shift by $i$ positions)

Using BSGS with $n=l \times k$, the linear transformation can be expressed as:

[Figure: BSGS expression for $M \cdot v$.]

Choosing $k \approx \sqrt n$ is optimal. Cost: $O(\sqrt n)$ Rotate operations (on the ciphertext of $v$) and $O(n)$ Hadamard operations. For a fixed public matrix $M$, the term $rot_{-ki}(diag_{ki+j}(M))$ is a precomputed constant polynomial (encoded with InvDFT), so it requires no Rotate operation on a CKKS ciphertext.
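
With the definitions of $diag_i$ and $rot_i$ given earlier, the plain diagonal method is $M \cdot v = \sum_{i} diag_i(M) \odot rot_i(v)$, and writing each diagonal index as $ki+j$ regroups it as $M \cdot v = \sum_{i=0}^{l-1} rot_{ki}\big(\sum_{j=0}^{k-1} rot_{-ki}(diag_{ki+j}(M)) \odot rot_j(v)\big)$. The following plaintext sketch checks both forms on ordinary Python lists; it is a reconstruction from the terms named in the text, not code from [CHKKS18], and all function names are made up for this example.

```python
def rot(v, i):
    """rot_i(v) = [v_i, v_{i+1}, ...]: circular left shift (right shift if i < 0)."""
    i %= len(v)
    return v[i:] + v[:i]

def diag(M, i):
    """diag_i(M) = [M[0][i], M[1][i+1], ...], column indices taken mod n."""
    n = len(M)
    return [M[r][(r + i) % n] for r in range(n)]

def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def matvec_diagonal(M, v):
    """Plain diagonal method: n slot-wise products, n - 1 rotations of v."""
    out = [0] * len(v)
    for i in range(len(v)):
        out = add(out, hadamard(diag(M, i), rot(v, i)))
    return out

def matvec_bsgs(M, v, k):
    """BSGS variant with n = l * k: about k + l rotations of encrypted data."""
    n = len(v)
    assert n % k == 0
    l = n // k
    baby = [rot(v, j) for j in range(k)]          # baby steps, shared by all i
    out = [0] * n
    for i in range(l):
        inner = [0] * n
        for j in range(k):
            # Pre-rotated diagonal: plaintext work that can be done offline.
            d = rot(diag(M, k * i + j), -k * i)
            inner = add(inner, hadamard(d, baby[j]))
        out = add(out, rot(inner, k * i))         # one giant-step rotation
    return out
```

In a homomorphic setting only `baby` and the final `rot(inner, k * i)` would touch ciphertexts, which is where the $O(\sqrt n)$ rotation count comes from when $k \approx \sqrt n$.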

[CHKKS18] carries the linear transformation above over to homomorphic computation on slot-packed CKKS and uses it to implement coeff-to-slot for batched CKKS bootstrapping. The linear transformations involved are DFT and InvDFT; [CHKKS18] treats them as general linear transformations and implements them with this homomorphic matrix multiplication.

However, for public linear transformations it is much more efficient to use the Functional Key-Switch proposed in TFHE directly. For secret linear transformations, TFHE also proposes building a dedicated KS-Key from $M$ to support Private Functional Key-Switch. If no such KS-Key for $M$ is provided and $M$ is instead encrypted as ordinary CKKS ciphertexts, the only option is the homomorphic matrix multiplication above, computed slowly with Rotate and Hadamard.

Faster Homomorphic DFT

[CHH18] observed that the DFT matrix has a sparse decomposition (that is, the butterfly algorithm), so for this special linear transformation the complexity is reduced by a factor of $n$ compared with the general matrix multiplication of [CHKKS18]. Applied to the batched CKKS bootstrapping of [CHKKS18], it speeds up the computation by a factor of several hundred. The content of [HHC19] (published in IEEE Access) is essentially that of [CHH18] (posted on ePrint), just under a different title.

Sparse-Diagonal matrix Factorization

[CHH18] points out that, following the recursive FFT of [CT65], the DFT matrix can be sparsely decomposed as follows:

[Figure: recursive sparse decomposition of the DFT matrix.]

Continuing to decompose the former iteratively, we can finally obtain:

[Figure: resulting factorization of the DFT matrix into the sparse factors $D_{2^i}^{(n)}$.]

It is easy to see that $diag_k(D_{2^i}^{(n)}) \neq \vec 0 \iff k\in \{0,\pm \dfrac{n}{2^i}\}$: only 3 diagonals are non-zero, so the diagonal method is very efficient here,
$$D_{2^i}^{(n)} \cdot v = \sum_{k\in \{0,\pm n/2^i\}} diag_k(D_{2^i}^{(n)}) \odot rot_{k}(v)$$
Algorithm:

[Figure: algorithm listing.]

Note that $rot_0(v)=v$ needs no computation, and in the special case $i=1$ we have $rot_{n/2^i}(v)=rot_{-n/2^i}(v)$, so a single rotation suffices. Finally, $DFT \cdot v = \prod_{i=0}^{\log_2 n} D_{2^i}^{(n)} \cdot v$ has complexity $O(\log_2 n)$.
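
The three-diagonal structure can be observed on a concrete factorization. The sketch below uses the textbook radix-2 decimation-in-time factorization (a bit-reversal permutation followed by butterfly factors $I_{n/m} \otimes B_m$); this is an assumption made for illustration, so the ordering and naming of the factors may differ from the $D_{2^i}^{(n)}$ of [CHH18], and all function names are made up.

```python
import cmath

def butterfly_factor(n, m):
    """Dense n x n matrix I_{n/m} (x) B_m with B_m = [[I, W], [I, -W]]."""
    w = cmath.exp(-2j * cmath.pi / m)
    half = m // 2
    D = [[0j] * n for _ in range(n)]
    for base in range(0, n, m):
        for t in range(half):
            top, bot = base + t, base + t + half
            D[top][top], D[top][bot] = 1, w ** t
            D[bot][top], D[bot][bot] = 1, -(w ** t)
    return D

def nonzero_diagonals(D):
    """Cyclic diagonal indices k (mod n) on which D has a non-zero entry."""
    n = len(D)
    return sorted({(c - r) % n for r in range(n) for c in range(n) if D[r][c] != 0})

def apply_by_diagonals(D, v):
    """D * v as a sum over the non-zero diagonals: diag_k(D) (.) rot_k(v)."""
    n = len(v)
    out = [0j] * n
    for k in nonzero_diagonals(D):
        d = [D[r][(r + k) % n] for r in range(n)]      # diag_k(D)
        rv = v[k:] + v[:k]                             # rot_k(v), 0 <= k < n
        out = [o + x * y for o, x, y in zip(out, d, rv)]
    return out

def bit_reverse(v):
    bits = len(v).bit_length() - 1
    return [v[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(len(v))]

def dft_by_sparse_factors(v):
    """DFT of a power-of-two-length list using log2(n) sparse factors."""
    n = len(v)
    x = bit_reverse(v)
    m = 2
    while m <= n:
        x = apply_by_diagonals(butterfly_factor(n, m), x)
        m *= 2
    return x

def dft_naive(v):
    n = len(v)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(v[k] * w ** (j * k) for k in range(n)) for j in range(n)]
```

For $n = 8$ the factor with $m = 4$ has non-zero diagonals $\{0, 2, 6\}$, that is $\{0, \pm 2\}$ modulo 8, and the factor with $m = 8$ has only $\{0, 4\}$ because $+4$ and $-4$ coincide modulo 8, which is the special case mentioned above.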

For the inverse transform, the inverse of the DFT matrix is exactly its Hermitian (conjugate transpose), up to normalization:

[Figure: decomposition of the inverse DFT matrix.]

It is therefore clear that the CT butterfly, like the GS butterfly above, is also sparse-diagonal, so a similarly fast matrix multiplication exists.

Radix-r

However, although the number of operations above is very small, the depth is $\log_2(n)$: several layers of Rotate and CMult must be executed serially, which can cause noise-control problems. We can merge $k$ consecutive matrices, which reduces the depth to $\log_r n, r=2^k$, at the cost of higher computational complexity.

By the multiplication property of diagonal matrices, the product of two (single-)diagonal matrices is again a diagonal matrix; here $diag_i(a)$ denotes the matrix whose $i$-th diagonal is the vector $a$:
$$diag_i(a) \cdot diag_j(b) = diag_{i+j}(a \odot rot_i(b))$$
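
This identity is easy to check on small integer examples. In the sketch below, `diag_matrix(a, i)` is a hypothetical helper that builds the $n \times n$ matrix whose $i$-th cyclic diagonal is `a`, following the indexing $M_{r,r+i} = a_r$ used for $diag_i$ earlier.

```python
def rot(v, i):
    """Circular left shift by i positions (right shift if i is negative)."""
    i %= len(v)
    return v[i:] + v[:i]

def diag_matrix(a, i):
    """n x n matrix M with M[r][(r + i) % n] = a[r] and zeros elsewhere."""
    n = len(a)
    M = [[0] * n for _ in range(n)]
    for r in range(n):
        M[r][(r + i) % n] = a[r]
    return M

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][m] * B[m][c] for m in range(n)) for c in range(n)]
            for r in range(n)]

def diagonal_product(a, i, b, j):
    """Right-hand side of the identity: diagonal i + j carrying a (.) rot_i(b)."""
    merged = [x * y for x, y in zip(a, rot(b, i))]
    return diag_matrix(merged, i + j)
```

For instance, `matmul(diag_matrix(a, 2), diag_matrix(b, 4))` equals `diagonal_product(a, 2, b, 4)` for any two length-5 vectors, with the resulting diagonal index $6 \equiv 1 \pmod 5$.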
It can be shown that for the merge of $k$ consecutive matrices
$$D_{k,s} = D_{2^{s+k}}^{(n)} \cdots D_{2^{s+2}}^{(n)} \cdot D_{2^{s+1}}^{(n)}$$
the indices of the non-zero diagonals are
$$e_1 \cdot \dfrac{n}{2^{s+1}} + e_2 \cdot \dfrac{n}{2^{s+2}} + \cdots + e_k \cdot \dfrac{n}{2^{s+k}}$$
where $e_i \in \{0,\pm1\}$. It is easy to see that every index is a multiple of $\dfrac{n}{2^{s+k}}$ and that its absolute value is at most $\dfrac{(2^k-1)n}{2^{s+k}}$, so there are at most $2^{k+1}-1$ such indices.
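
The count $2^{k+1}-1$ can be confirmed by brute-force enumeration of the sign patterns. The helper name `merged_diagonal_indices` is made up for this sketch, and $n$ is assumed to be a power of two with $n \ge 2^{s+k}$ so that every term is an integer.

```python
from itertools import product

def merged_diagonal_indices(n, s, k):
    """All sums e_1*n/2^(s+1) + ... + e_k*n/2^(s+k) with e_i in {-1, 0, 1}."""
    indices = set()
    for signs in product((-1, 0, 1), repeat=k):
        indices.add(sum(e * (n // 2 ** (s + i))
                        for i, e in enumerate(signs, start=1)))
    return sorted(indices)
```

With `n = 64, s = 1, k = 3` the result is every multiple of $4$ from $-28$ to $28$, that is 15 indices, matching $2^{k+1}-1$ and the bound $(2^k-1)n/2^{s+k} = 28$. The indices are listed before reduction modulo $n$; some may coincide after reduction (for example when $s = 0$), which is why the count is an upper bound.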

The complexity of the DFT is then $O(r \log_r n)$ Rotate and Hadamard operations, with depth $O(\log_r n)$.

Hybrid method

Since the non-zero diagonal indices of the decomposition above form an arithmetic progression (all are multiples of $\dfrac{n}{2^{s+k}}$), the BSGS technique can be used to factor out the shared computation and further improve efficiency.

[Figure: hybrid BSGS algorithm with parameters $k_2$ and $t$.]

Choosing $k_2 \approx \sqrt t$ is optimal. The complexity of the DFT is then still $O(r \log_r n)$ Hadamard operations, but only $O(\sqrt r \log_r n)$ Rotate operations are needed (although $r$ is a constant); the depth is still $O(\log_r n)$.

Result

Execution time of the homomorphic DFT: on the order of seconds

[Figure: timing results for the homomorphic DFT.]

CKKS bootstrapping time: about 2 minutes of latency, 4 milliseconds amortized

[Figure: timing results for CKKS bootstrapping.]

Origin blog.csdn.net/weixin_44885334/article/details/134672030