Low-Rank Optimization for Efficient Deep Learning: Balancing Compact Architecture and Fast Training (Part 1)

Paper: [2303.13635] Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training (arxiv.org)

Due to space constraints, this post only covers the background of the problem and the various tensor decomposition methods together with how they decompose FC/Conv layers. Discussion is welcome!

High computational complexity and storage cost make deep learning hard to deploy on resource-constrained devices, and also make it power-hungry and environmentally unfriendly. The paper focuses on low-rank tensor optimization methods for efficient deep learning. In the spatial domain, deep neural networks are compressed by low-rank approximation of the network parameters, which directly reduces storage with fewer parameters. In the temporal domain, the network parameters can be trained within a few subspaces, enabling efficient training with fast convergence. The two are connected through low-rank optimization: the redundancy in the temporal and spatial domains turns out to share the same origin, and to speed up neural network training a balance must be struck between spatial efficiency and temporal efficiency.
An Overview of Low-Rank Tensor Optimization for Efficient Deep Learning

1. Problem background

  1. There are two main challenges in deep learning: high complexity and slow convergence.

    • High complexity means that a deep neural network contains millions of parameters, and the computation between this huge number of parameters and the inputs is expensive, which calls for efficient compression and acceleration algorithms, especially on resource-constrained devices (such as mobile phones and IoT devices). Moreover, since deeper or wider structures can lead to better performance, deep neural networks are increasingly over-parameterized, which implies substantial redundancy in DNNs and can lead to overfitting.

    • The traditional back-propagation (BP) algorithm converges slowly and requires long training time. Its convergence speed is also sensitive to the learning-rate setting and the weight initialization.

  2. Reducing the high complexity of deep neural networks falls into two categories: reducing the number of parameters, and reducing the average bit width of the data representation.

    • Methods that reduce the number of parameters: low-rank approximation (the focus of this article), pruning, weight sharing, sparsity, and knowledge distillation.

    • Methods that reduce the average bit width of the data representation: quantization and entropy coding.

  3. Low-rank optimization for model compression is divided into three categories, which differ in the way the model is trained: pre-trained methods, pre-set methods, and compression-aware methods.

    • Pre-trained methods directly decompose a pre-trained model to obtain a warm initialization in the compressed format, and then retrain the compressed model to recover its performance.

    • Pre-set methods train a network in the compact format from scratch, without any pre-training.

    • Different from the above two, compression-aware methods explicitly take compression into account during training by gradually driving the network toward a low-rank structure.

    While a discussion of low-rank optimization can also be found in [2302.09019] Tensor Networks Meet Neural Networks: A Survey and Future Perspectives (arxiv.org), this paper further examines how low-rank optimization can be combined with other compression techniques in pursuit of even lower complexity, and recommends the effective rank as the most effective metric to use in low-rank optimization.

  4. Accelerating Convergence of Deep Neural Networks: Subspace Training

    • First-order optimization methods have some inherent flaws, such as slow theoretical and empirical convergence.

    • Second-order methods can handle such problems well, but they are unsuitable for DNNs because of their heavy computation. Projecting the parameters onto a tiny subspace spanned by a small number of independent variables is an efficient way around this, since only a few variables need to be optimized; a toy sketch of the idea follows.
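
The sketch below is a minimal illustration of the subspace-training idea on a toy least-squares problem, not the algorithm from the paper: the full parameter vector is reconstructed as θ = θ₀ + Pα from a fixed random basis P, and only the low-dimensional coordinates α are updated. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares objective: minimize ||A @ theta - b||^2 over theta in R^D.
D, n_samples, d_sub = 1000, 200, 10           # full dimension, data size, subspace dimension
A = rng.normal(size=(n_samples, D))
b = rng.normal(size=n_samples)

theta0 = np.zeros(D)                          # initial full parameters
P = rng.normal(size=(D, d_sub)) / np.sqrt(D)  # fixed random subspace basis

alpha = np.zeros(d_sub)                       # only these d_sub variables are trained
lr = 0.1
for step in range(500):
    theta = theta0 + P @ alpha                # map subspace coordinates back to the full space
    residual = A @ theta - b
    grad_theta = 2 * A.T @ residual / n_samples
    grad_alpha = P.T @ grad_theta             # chain rule: project the gradient into the subspace
    alpha -= lr * grad_alpha

print("final loss:", np.mean((A @ (theta0 + P @ alpha) - b) ** 2))
```

Training d_sub = 10 variables instead of D = 1000 is what makes subspace training cheap; how good the resulting solution is depends on how well the chosen subspace covers the useful descent directions.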

2. Low-rank optimization of model compression

Compression methods such as quantization, pruning, and low-rank approximation can reduce the complexity of DNNs without significantly degrading accuracy. Among them, low-rank approximation is widely adopted because of the solid theoretical foundation of tensor decomposition.

(1) Various tensor decomposition methods for network compression

Low-rank approximation can provide ultra-high compression ratios for recurrent neural networks with little loss of accuracy. For convolutional neural networks (CNNs), however, the compression performance is not as satisfying as on RNNs: reshaping the 4D convolution kernel into a matrix and splitting it into two factors with singular value decomposition (SVD) distorts the structural information. More effective tensor decompositions have therefore attracted interest. Applying CP decomposition to split the convolutional layer into four consecutive convolutional layers significantly speeds up the CNN, while Tucker decomposition exploits channel redundancy to decompose the 4D kernel into a compact 4D kernel and two matrices. On the basis of these three classic decompositions, many more flexible methods have emerged, including HT, TT, TR, BTD, GKPD, and tensor decomposition based on the semi-tensor product (STP), which have further improved DNN compression performance. Table 1 presents the performance of common tensor decomposition methods for compressing ResNet-32 on the CIFAR-10 dataset.
Table 1: Performance of common tensor decomposition methods

The time and space complexity of the different methods for the FC layer is shown in Table 3, and for the Conv layer in Table 4.

Table 3: Time and space complexity of different methods for the FC layer
Table 4: Time and space complexity of different methods for the Conv layer

The basic notation and operations of tensors are not covered here; some learning materials are recommended instead:

Talking about Tensor Decomposition (2): The Mathematical Basis of Tensor Decomposition (zhihu.com)

Tensor Basics | Tensor Rank and Rank Tensor - Zhihu (zhihu.com)

Xinyu Chen - Zhihu (zhihu.com)

Tensor Computation for Data Analysis | SpringerLink

The following introduces the nine decomposition methods covered in the paper:

1) Singular Value Decomposition

For a given matrix $X \in \mathbb{R}^{M \times N}$, its SVD can be written as:

$$X = U\,\mathrm{diag}(s)\,V^{T}$$

Let $R$ be the rank of the matrix, $R \le \min(M, N)$. The factors $U \in \mathbb{R}^{M \times R}$ and $V \in \mathbb{R}^{N \times R}$ satisfy $U^{T}U = I$ and $V^{T}V = I$, and the elements of $s \in \mathbb{R}^{R}$ are sorted in decreasing order, i.e., $s_1 \ge s_2 \ge \cdots \ge s_R$.

Since the weight of an FC layer is naturally a matrix, SVD can be applied directly: the FC layer is approximated by two consecutive layers whose weights are $B = \mathrm{diag}(\sqrt{s})V^{T}$ and $A = U\,\mathrm{diag}(\sqrt{s})$. For a Conv layer, the 4D kernel is first reshaped into a 2D matrix. Exploiting different types of redundancy gives two schemes: channel decomposition, which reshapes the kernel into an $S \times CHW$ matrix, and spatial decomposition, which yields an $SH \times CW$ matrix. The SVD of the reshaped matrix is then computed and, as with the FC layer, the factors $B$ and $A$, reshaped back into tensors, form two Conv layers that replace the original one.
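
Below is a minimal numpy sketch of truncated SVD applied to an FC weight, keeping only the top $R$ singular values; the layer sizes and the rank are illustrative, and in practice $R$ would be chosen to balance accuracy against compression.

```python
import numpy as np

M, N, R = 512, 1024, 32                       # output dim, input dim, kept rank
rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))                   # original FC weight
x = rng.normal(size=N)                        # one input vector

U, s, Vt = np.linalg.svd(W, full_matrices=False)
B = np.diag(np.sqrt(s[:R])) @ Vt[:R]          # first small layer:  R x N
A = U[:, :R] @ np.diag(np.sqrt(s[:R]))        # second small layer: M x R

y_full = W @ x
y_lowrank = A @ (B @ x)                       # two consecutive thin layers replace W

print("compression ratio:", (M * N) / (R * (M + N)))
print("relative output error:",
      np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
```

Note that a random Gaussian $W$ is not low-rank, so the reported error here is large; trained network weights usually have a much faster-decaying spectrum, which is what makes truncation viable.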

However, each of these two schemes exploits only one type of redundancy, and there is also redundancy among the input channels. Tensor decomposition can exploit several kinds of redundancy at the same time, which helps achieve higher compression ratios.

2) CP Decomposition

Whereas SVD factorizes a matrix into a sum of rank-one matrices, CP decomposition factorizes a tensor into a sum of rank-one tensors. For an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the CP decomposition can be expressed as:

$$\mathcal{X}=\llbracket \lambda ; \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)} \rrbracket=\sum_{r=1}^{R} \lambda_{r}\, \mathbf{a}_{r}^{(1)} \circ \mathbf{a}_{r}^{(2)} \circ \cdots \circ \mathbf{a}_{r}^{(N)}$$

Each $\mathbf{a}_{r}^{(n)}$ is the $r$-th column of $\mathbf{A}^{(n)}$, and $\lambda \in \mathbb{R}^{R}$ encodes the significance of the $R$ components. The rank of the tensor $\mathcal{X}$, denoted $R$, is defined as the minimum number of rank-one tensors required in the sum.

When compressing an FC layer with CP, the weight matrix is first tensorized into a $2d$th-order tensor $\mathcal{W} \in \mathbb{R}^{O_1 \times O_2 \times \cdots \times O_d \times I_1 \times I_2 \times \cdots \times I_d}$, and the input vector $x \in \mathbb{R}^{I}$ is correspondingly expressed as a $d$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_d}$. For convolutional kernels, performing CP directly on the 4D kernel tensor approximates the layer by four consecutive convolutional layers, whose weights are given by the four factor matrices respectively, as in the sketch below.
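
The following PyTorch sketch shows how a CP-format Conv layer is commonly realized as four consecutive convolutions (a 1×1 channel projection, two one-dimensional group-wise spatial convolutions, and a final 1×1 projection). The factor matrices here are random stand-ins; in practice they would come from an actual CP decomposition of the trained kernel, with $\lambda$ absorbed into the factors. All shapes and the rank are illustrative.

```python
import torch
import torch.nn as nn

C, S, H, W, R = 64, 128, 3, 3, 16        # in/out channels, kernel size, CP rank

# CP factor matrices (random here; normally obtained by CP-decomposing the 4D kernel).
U_c = torch.randn(C, R)
U_h = torch.randn(H, R)
U_w = torch.randn(W, R)
U_s = torch.randn(S, R)

conv_in  = nn.Conv2d(C, R, kernel_size=1, bias=False)                  # mix input channels into R components
conv_h   = nn.Conv2d(R, R, kernel_size=(H, 1), padding=(H // 2, 0),
                     groups=R, bias=False)                              # vertical 1-D filter per component
conv_w   = nn.Conv2d(R, R, kernel_size=(1, W), padding=(0, W // 2),
                     groups=R, bias=False)                              # horizontal 1-D filter per component
conv_out = nn.Conv2d(R, S, kernel_size=1, bias=False)                   # mix components into output channels

with torch.no_grad():
    conv_in.weight.copy_(U_c.t().reshape(R, C, 1, 1))
    conv_h.weight.copy_(U_h.t().reshape(R, 1, H, 1))
    conv_w.weight.copy_(U_w.t().reshape(R, 1, 1, W))
    conv_out.weight.copy_(U_s.reshape(S, R, 1, 1))

x = torch.randn(1, C, 32, 32)
y = conv_out(conv_w(conv_h(conv_in(x))))
print(y.shape)                            # torch.Size([1, 128, 32, 32])
```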

3) Tucker Decomposition

Tucker decomposition can be seen as a higher-order generalization of principal component analysis (PCA): it represents an $N$th-order tensor as an $N$th-order core tensor multiplied by a factor matrix along each mode. For $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we have:

$$\mathcal{X}=\mathcal{G} \times_{1} \mathbf{A}^{(1)} \times_{2} \mathbf{A}^{(2)} \times_{3} \cdots \times_{N} \mathbf{A}^{(N)}$$

where $\mathcal{G} \in \mathbb{R}^{R_{1} \times R_{2} \times \cdots \times R_{N}}$ is called the core tensor and "$\times_{n}$" denotes the mode-$n$ product, i.e., multiplication of a tensor by a matrix along mode $n$. Element-wise, "$\times_{1}$" can be expressed as:

$$\left(\mathcal{G} \times_{1} \mathbf{A}^{(1)}\right)_{i_{1}, r_{2}, \cdots, r_{N}}=\sum^{R_{1}}_{r_1 = 1} \mathcal{G}_{r_{1}, r_{2}, \cdots, r_{N}} \mathbf{A}_{i_{1}, r_{1}}^{(1)}$$

The columns of the factor matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ can be seen as the principal components of the $n$-th mode, and the core tensor $\mathcal{G}$ can be seen as a compressed version of $\mathcal{X}$, i.e., its coefficients in a low-dimensional subspace. In this case we say that $\mathcal{X}$ is a rank-$(R_1, R_2, \cdots, R_N)$ tensor.

For compressing FC layers, similar to CP, the weights and inputs are tensorized in the same way and Tucker decomposition is applied directly to the resulting tensors. For the Conv layer, since the spatial size of the kernel is already small, only Tucker-2 is used to exploit the redundancy between filters and between input channels, producing a 1×1 convolution, an H×W convolution, and another 1×1 convolution, as sketched below.
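
The sketch below realizes the Tucker-2 compressed Conv layer in PyTorch: a 1×1 convolution projects the input channels, a small H×W convolution applies the core tensor, and a final 1×1 convolution restores the output channels. The factors are random stand-ins for an actual Tucker-2 decomposition, and all shapes and ranks are illustrative.

```python
import torch
import torch.nn as nn

C, S, H, W = 64, 128, 3, 3
Rc, Rs = 16, 24                              # Tucker-2 ranks for the input/output channel modes

# Tucker-2 factors (random here; normally obtained by decomposing the trained kernel).
U_c = torch.randn(C, Rc)                     # input-channel factor
U_s = torch.randn(S, Rs)                     # output-channel factor
core = torch.randn(Rs, Rc, H, W)             # compact H x W core kernel

conv1 = nn.Conv2d(C, Rc, kernel_size=1, bias=False)         # 1x1: project input channels
conv2 = nn.Conv2d(Rc, Rs, kernel_size=(H, W),
                  padding=(H // 2, W // 2), bias=False)      # small HxW conv with the core
conv3 = nn.Conv2d(Rs, S, kernel_size=1, bias=False)          # 1x1: restore output channels

with torch.no_grad():
    conv1.weight.copy_(U_c.t().reshape(Rc, C, 1, 1))
    conv2.weight.copy_(core)
    conv3.weight.copy_(U_s.reshape(S, Rs, 1, 1))

x = torch.randn(1, C, 32, 32)
print(conv3(conv2(conv1(x))).shape)          # torch.Size([1, 128, 32, 32])
```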

4) Block Term Decomposition

Block Term Decomposition (BTD) was introduced as a more powerful tensor decomposition that combines CP and Tucker, and is therefore more robust than either of them. CP approximates a tensor by a sum of rank-one tensors, whereas BTD approximates it by a sum of low-rank tensors in Tucker format. Alternatively, BTD can be viewed as an instance of Tucker obtained by concatenating the factor matrices along each mode and arranging the core tensors of all sub-tensors into one block-diagonal core tensor. Consider a $d$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_d}$; its BTD can be expressed as:

$$\mathcal{X}=\sum^N_{n=1}\mathcal{G}_n \times_{1} \mathbf{A}^{(1)}_n \times_{2} \mathbf{A}^{(2)}_n \times_{3} \cdots \times_{d} \mathbf{A}^{(d)}_n$$

In the above formula, $N$ denotes the CP rank, i.e., the number of block terms, and $\mathcal{G}_n \in \mathbb{R}^{R_{1} \times R_{2} \times \cdots \times R_{d}}$ is the core tensor of the $n$-th block term, whose multilinear rank is $(R_1, R_2, \cdots, R_d)$.

When BTD is applied to compress an FC layer, the resulting compact layer is called a Block-Term layer (BTL). In a BTL, the input tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_d}$ is obtained by tensorizing the original input vector $x \in \mathbb{R}^{I}$, and the original weight matrix $W$ is reshaped into $\mathcal{W} \in \mathbb{R}^{O_1 \times I_1 \times O_2 \times I_2 \times \cdots \times O_d \times I_d}$.

Then $\mathcal{W}$ is decomposed by BTD with factor tensors $\mathcal{A}^{(k)}_n \in \mathbb{R}^{O_k \times I_k \times R_k}$, $k = 1, \ldots, d$. Performing the tensor contraction between $\mathrm{BTD}(\mathcal{W})$ and $\mathcal{X}$ yields the output tensor $\mathcal{Y} \in \mathbb{R}^{O_1 \times O_2 \times \cdots \times O_d}$, which is then vectorized into the final output vector; a small sketch of this contraction follows.
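
The numpy sketch below performs the BTL forward contraction for $d = 2$ tensorized modes using `einsum`, and checks it against an explicit reconstruction of the weight tensor. The block count, ranks, and mode sizes are all illustrative, and the factors are random rather than learned.

```python
import numpy as np

# Minimal Block-Term layer (BTL) sketch with d = 2 tensorized modes.
I1, I2, O1, O2 = 8, 16, 4, 8            # input 128 = 8*16, output 32 = 4*8
N, R1, R2 = 3, 2, 2                     # number of block terms and Tucker ranks

rng = np.random.default_rng(0)
G  = rng.normal(size=(N, R1, R2))       # one small core tensor per block term
A1 = rng.normal(size=(N, O1, I1, R1))   # factor tensors for mode 1
A2 = rng.normal(size=(N, O2, I2, R2))   # factor tensors for mode 2

X = rng.normal(size=(I1, I2))           # tensorized input vector

# Contract the tensorized input directly with the BTD factors,
# never materializing the full O1 x I1 x O2 x I2 weight tensor.
Y = np.einsum('nab,ncia,ndjb,ij->cd', G, A1, A2, X)

# Sanity check against the explicit weight reconstruction.
W = np.einsum('nab,ncia,ndjb->cidj', G, A1, A2)         # shape (O1, I1, O2, I2)
print(np.allclose(Y, np.einsum('cidj,ij->cd', W, X)))   # True
y = Y.reshape(-1)                                        # output vector of length O1*O2
```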

For the Conv layer, the literature shows that reshaping the 4D kernel into a matrix $W \in \mathbb{R}^{S \times CHW}$ allows the layer to be converted into a BTL as well; specifically, the matrix is further reshaped into a tensor of size $1 \times H \times 1 \times W \times S_1 \times C_1 \times S_2 \times C_2 \times \cdots \times S_d \times C_d$.

5) Hierarchical Tucker Decomposition

Hierarchical Tucker (HT) decomposition is a hierarchical variant of Tucker decomposition that recursively represents a high-order tensor by two lower-order sub-tensors and a transfer matrix via Tucker decomposition. For a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we first split the index set $\{1, 2, \cdots, N\}$ into two subsets, $\mathbb{T} = \{t_1, t_2, \cdots, t_k\}$ and $\mathbb{S} = \{s_1, s_2, \cdots, s_{N-k}\}$.

Let $U_{12\cdots N} \in \mathbb{R}^{I_{t_1} I_{t_2}\cdots I_{t_k} I_{s_1} I_{s_2}\cdots I_{s_{N-k}} \times 1}$ denote the matrix reshaped from $\mathcal{X}$, and let $U_t \in \mathbb{R}^{I_{t_1} I_{t_2}\cdots I_{t_k} \times R_t}$ and $U_s \in \mathbb{R}^{I_{s_1} I_{s_2}\cdots I_{s_{N-k}} \times R_s}$ denote the column basis matrices of the two corresponding subspaces. Then:

$$U_{12\cdots N} = (U_t \otimes U_s)\,B_{12\cdots N}$$

where $B_{12\cdots N} \in \mathbb{R}^{R_t R_s \times 1}$ is called the transfer matrix and "$\otimes$" denotes the Kronecker product of two matrices. Next, the set $\mathbb{T}$ is split into two subsets $\mathbb{L}=\{l_1, l_2, \cdots, l_q\}$ and $\mathbb{V}=\{v_1, v_2, \cdots, v_{k-q}\}$, and with $U_l \in \mathbb{R}^{I_{l_1} I_{l_2}\cdots I_{l_q} \times R_l}$ and $U_v \in \mathbb{R}^{I_{v_1} I_{v_2}\cdots I_{v_{k-q}} \times R_v}$ we can express $U_t \in \mathbb{R}^{I_{t_1} I_{t_2}\cdots I_{t_k} \times R_t}$ as:

$$U_{t} = (U_l \otimes U_v)\,B_{t}$$

A similar factorization applies to $U_s$. Repeating this process until the index sets can no longer be partitioned finally yields the tree-structured HT format of the target tensor.

To compress an FC layer with HT, the weight matrix is tensorized into $\mathcal{W}\in\mathbb{R}^{(I_1\cdot O_1)\times(I_2\cdot O_2)\times\cdots\times(I_d\cdot O_d)}$ and the input data into $\mathcal{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$; to reduce the computational complexity, the chain computation shown in the figure is adopted. However, since there is no rule relating convolution to tensor contraction, the kernel of a Conv layer has to be recovered from the HT format. Incidentally, to keep the tensorization balanced, the 4D kernel should be tensorized as $\mathcal{W}\in\mathbb{R}^{(H\cdot W)\times(C_1\cdot S_1)\times(C_2\cdot S_2)\times\cdots\times(C_d\cdot S_d)}$.
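
The numpy sketch below performs a single HT splitting step, reconstructing the matricized parent from the two child basis matrices and the transfer matrix exactly as in the equation above; the sizes and ranks are illustrative, and the real HT format applies this step recursively down a binary tree.

```python
import numpy as np

# One HT splitting step: U_full = (U_t kron U_s) @ B.
rng = np.random.default_rng(0)
It, Is = 64, 64                       # products of the mode sizes in subsets T and S
Rt, Rs = 4, 4                         # transfer ranks of the two children

U_t = rng.normal(size=(It, Rt))       # column basis for subset T
U_s = rng.normal(size=(Is, Rs))       # column basis for subset S
B   = rng.normal(size=(Rt * Rs, 1))   # transfer matrix of this tree node

U_full = np.kron(U_t, U_s) @ B        # vectorized parent, shape (It*Is, 1)
print(U_full.shape)                   # (4096, 1)

# Storage of this node: It*Rt + Is*Rs + Rt*Rs entries vs It*Is for the dense vectorization.
print(It * Rt + Is * Rs + Rt * Rs, "vs", It * Is)   # 528 vs 4096
```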

6) Tensor Train Decomposition

Tensor Train (TT) decomposition is a special, degenerate case of HT. It decomposes a high-order tensor into a collection of third-order (second-order at the two ends) core tensors that are connected by contraction operations. Suppose we have an $N$th-order tensor $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$; element-wise, its TT format is:

$$\mathcal{X}_{i_1,i_2,\cdots ,i_N} =\sum_{r_1,r_2,\cdots ,r_{N-1}} \mathcal{G}^1_{i_1,r_1}\,\mathcal{G}^2_{r_1,i_2,r_2} \cdots \mathcal{G}^N_{r_{N-1},i_{N}}$$

Here the collection $\{\mathcal{G}^n\in\mathbb{R}^{R_{n-1}\times I_n\times R_n\}}^N_{n=1}$ with $R_0 = 1$ and $R_N = 1$ is called the set of TT cores, and the set of ranks $\{R_n\}^N_{n=0}$ is called the TT rank.
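
The numpy sketch below reconstructs a small fourth-order tensor from randomly initialized TT cores with a single `einsum`, directly following the element-wise formula above; the mode sizes and TT ranks are illustrative.

```python
import numpy as np

# Reconstruct a 4th-order tensor from its TT cores.
I = (4, 5, 6, 7)                      # mode sizes I1..I4
R = (1, 3, 3, 3, 1)                   # TT ranks, with R0 = R4 = 1

rng = np.random.default_rng(0)
cores = [rng.normal(size=(R[n], I[n], R[n + 1])) for n in range(4)]

# X[i1,i2,i3,i4] = sum_{r1,r2,r3} G1[.,i1,r1] G2[r1,i2,r2] G3[r2,i3,r3] G4[r3,i4,.]
X = np.einsum('aib,bjc,ckd,dle->ijkl', *cores)

tt_params = sum(core.size for core in cores)
print(X.shape, tt_params, "vs", np.prod(I))   # (4, 5, 6, 7), 132 TT parameters vs 840 dense entries
```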

TT can be used to compress the FC layer by reshaping the weight matrix into a high-order tensor $\mathcal{W}\in\mathbb{R}^{(I_1\cdot O_1)\times(I_2\cdot O_2)\times\cdots\times(I_d\cdot O_d)}$. After putting $\mathcal{W}$ into TT format, the resulting TT cores $\{\mathcal{G}^n\in\mathbb{R}^{R_{n-1}\times I_n\times O_n\times R_n\}}^d_{n=1}$ can be contracted directly with the tensorized input. It was shown in [141] that TT compresses Conv layers more efficiently than HT, whereas HT is more suitable for compressing FC layers, whose weight matrices prefer to be reshaped into balanced tensors.

Using TT on the Conv layer was introduced in [37], where the 4D kernel tensor is reshaped into a tensor of size $(H\cdot W)\times(C_1\cdot S_1)\times(C_2\cdot S_2)\times\cdots\times(C_d\cdot S_d)$ and the input feature map into $\mathcal{X}\in\mathbb{R}^{H\times W\times C_1\times\cdots\times C_d}$. In the feed-forward phase, the tensorized input is contracted with the TT cores one by one. Although TT can save storage significantly, its computational cost may exceed that of the original Conv layer; HODEC (higher-order decomposed convolution) was therefore proposed in [149] to reduce computation and storage simultaneously by further decomposing each TT core into two third-order tensors.

7) Tensor Ring Decomposition

Because of the non-uniformity of the boundary TT cores (i.e., $R_0 = 1$ and $R_N = 1$), how to permute the tensor dimensions to find the optimal TT format remains an open problem. To address this, Tensor Ring (TR) decomposition performs a circular multilinear product over the cores. For a given tensor $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, its TR representation can be formulated element-wise as:

$$\mathcal{X}_{i_1,i_2,\cdots ,i_N} =\sum_{r_1,r_2,\cdots ,r_N} \mathcal{G}^1_{r_1,i_1,r_2}\,\mathcal{G}^2_{r_2,i_2,r_3} \cdots \mathcal{G}^N_{r_{N},i_{N},r_1} = \mathrm{tr}\Big( \sum_{r_2,\cdots ,r_N} \mathcal{G}^1_{:,i_1,r_2}\,\mathcal{G}^2_{r_2,i_2,r_3} \cdots \mathcal{G}^N_{r_{N},i_{N},:} \Big)$$

where the collection of cores $\{\mathcal{G}^n\in\mathbb{R}^{R_{n}\times I_n\times R_{n+1}}\}^N_{n=1}$ with $R_{N+1}=R_1$ is called the set of TR cores. This form is equivalent to a sum of $R_1$ TT formats. Thanks to the circular multilinear product obtained through the trace operation, TR treats all cores equally and is more powerful and general than TT.

Furthermore, TR corrects the uneven behavior of gradients that TT suffers from due to its sequential (recurrent) structure, so TR is also well suited to compressing FC layers. TR was first used to compress DNNs in [137]. Specifically, the weight matrix of the FC layer is reshaped into a $2d$th-order tensor of size $I_1\times\cdots\times I_d\times O_1\times\cdots\times O_d$, which is then represented in TR format. In the feed-forward pass, the first $d$ cores and the last $d$ cores are merged to obtain $\mathcal{F}_1\in\mathbb{R}^{R_1\times I_1\times\cdots\times I_d\times R_{d+1}}$ and $\mathcal{F}_2\in\mathbb{R}^{R_{d+1}\times O_1\times\cdots\times O_d\times R_{1}}$. The contraction between the tensorized input $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$ and $\mathcal{F}_1$ then produces a matrix that can be contracted with $\mathcal{F}_2$, giving a final output tensor of size $O_1\times O_2\times\cdots\times O_d$. For the Conv layer, if the kernel tensor is kept fourth-order so that the spatial information is preserved, its TR format can be formulated as:

$$\mathcal{K}_{s,c,h,w} = \sum^{R_1}_{r_1=1}\sum^{R_2}_{r_2=1}\sum^{R_3}_{r_3=1}\mathcal{U}_{r_1,s,r_2}\,\mathcal{V}_{r_2,c,r_3}\,\mathcal{Q}_{r_3,h,w,r_1}$$

Therefore, the original layer can be represented by three consecutive layers whose weight tensors are $\mathcal{V}$, $\mathcal{Q}$, and $\mathcal{U}$. If a higher compression ratio is required, $\mathcal{U}$ and $\mathcal{V}$ can each be further regarded as the result of merging $d$ core tensors, at the cost of the extra computation needed for merging; a small reconstruction sketch follows.
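
The numpy sketch below builds a 4D conv kernel from three random TR cores exactly as in the formula above, only to check shapes and parameter counts; a compressed network would keep the three cores as three thin layers instead of forming the full kernel. All sizes and ranks are illustrative.

```python
import numpy as np

# TR format of a 4D conv kernel: K[s,c,h,w] = sum_{r1,r2,r3} U[r1,s,r2] V[r2,c,r3] Q[r3,h,w,r1]
S, C, H, W = 32, 16, 3, 3
R1, R2, R3 = 4, 4, 4

rng = np.random.default_rng(0)
U = rng.normal(size=(R1, S, R2))          # output-channel core
V = rng.normal(size=(R2, C, R3))          # input-channel core
Q = rng.normal(size=(R3, H, W, R1))       # spatial core (keeps H and W together)

K = np.einsum('asb,bcd,dhwa->schw', U, V, Q)     # full kernel, reconstructed only for checking
print(K.shape)                                    # (32, 16, 3, 3)

tr_params = U.size + V.size + Q.size
print(tr_params, "vs", S * C * H * W)             # 912 vs 4608
```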

8) Generalized Kronecker Product Decomposition

Kronecker product decomposition (KPD) factorizes a matrix into two smaller factor matrices connected by a Kronecker product, which is very efficient when applied to compressing RNNs. To further compress Conv layers, it is generalized to the generalized Kronecker product decomposition (GKPD), which represents a tensor as a sum of multidimensional Kronecker products between pairs of factor tensors. Formally, the multidimensional Kronecker product between $\mathcal{A}\in\mathbb{R}^{J_1\times J_2\times\cdots\times J_N}$ and $\mathcal{B}\in\mathbb{R}^{K_1\times K_2\times\cdots\times K_N}$ is defined as:

$$(\mathcal{A}\otimes \mathcal{B})_{i_1,i_2,\cdots ,i_N} = \mathcal{A}_{j_1,j_2,\cdots ,j_N}\, \mathcal{B}_{k_1,k_2,\cdots ,k_N}$$

where $j_n = \lfloor i_n/K_n \rfloor$ and $k_n = i_n \bmod K_n$. Based on this, for a given $N$th-order tensor $\mathcal{X}\in\mathbb{R}^{J_1K_1\times J_2K_2\times\cdots\times J_NK_N}$, the GKPD can be expressed as:

$$\mathcal{X}= \sum^R_{r=1} \mathcal{A}_r \otimes\mathcal{B}_r$$

where $R$ is called the Kronecker rank. For a given $R$, finding the best GKPD approximation can be transformed into finding the best rank-$R$ approximation of a matrix, which can be solved by SVD after carefully rearranging $\mathcal{X}$ into a matrix and $\mathcal{A}$ and $\mathcal{B}$ into vectors.

To compress the Conv layer with GKPD, the 4D kernel is expressed as:

$$\mathcal{W}_{s,c,h,w} = \sum^R_{r=1}(\mathcal{A}_r)_{s_1,c_1,h_1,w_1} \otimes (\mathcal{B}_r)_{s_2,c_2,h_2,w_2}$$

where $S_1S_2=S$, $C_1C_2=C$, $H_1H_2=H$, $W_1W_2=W$. The 2D convolution between each $\mathcal{A}_r\otimes\mathcal{B}_r$ and the input can be converted into a 3D convolution of depth $C_2$ followed by multiple 2D convolutions. In addition, the $R$ Kronecker product terms can be viewed as computing $R$ parallel channels. Analysis shows that larger $S_1$ and $C_2$ help achieve a greater FLOP reduction; a small reconstruction sketch follows.
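
The numpy sketch below assembles a 4D kernel as a sum of $R$ multidimensional Kronecker products (`np.kron` handles N-D arrays, matching the definition above) and compares the factor storage against the dense kernel; the factor shapes and Kronecker rank are illustrative, and the factors are random rather than fitted.

```python
import numpy as np

# GKPD sketch: a 4D kernel as a sum of R multidimensional Kronecker products.
S1, C1, H1, W1 = 8, 4, 3, 1
S2, C2, H2, W2 = 4, 8, 1, 3
R = 2                                        # Kronecker rank

rng = np.random.default_rng(0)
A = [rng.normal(size=(S1, C1, H1, W1)) for _ in range(R)]
B = [rng.normal(size=(S2, C2, H2, W2)) for _ in range(R)]

K = sum(np.kron(A[r], B[r]) for r in range(R))
print(K.shape)                               # (32, 32, 3, 3) = (S1*S2, C1*C2, H1*H2, W1*W2)

gkpd_params = R * (A[0].size + B[0].size)
print(gkpd_params, "vs", K.size)             # 384 vs 9216
```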

9) Semi-tensor Product-based Tensor Decomposition

The semi-tensor product (STP) is a generalization of the ordinary matrix product that extends multiplication from dimension-matched matrices to matrices whose dimensions are only proportional. Since tensor contraction is built on the ordinary matrix product, replacing the matrix product inside tensor contraction with the STP leads to a more general and flexible family of tensor decompositions. STP-based tensor decomposition, proposed in [152], enhances the flexibility of Tucker, TT, and TR decomposition in exactly this way and achieves much higher efficiency. Consider the special case in which the number of columns of $X\in\mathbb{R}^{M\times NP}$ is proportional to the number of rows of $W\in\mathbb{R}^{P\times Q}$; the STP can then be expressed as:
$$Y = X \ltimes W = X\,(W \otimes I_N)$$

Element-wise, this can be written as:

$$Y_{m,g(n,q)}=\sum^P_{p=1}X_{m,g(n,p)}\,W_{p,q}$$

Here $Y\in\mathbb{R}^{M\times NQ}$, and $g(n,q)=(q-1)N+n$ and $g(n,p)=(p-1)N+n$ are re-indexing functions.
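
The numpy sketch below implements this special case of the STP by reshaping the wide operand so that only the shared factor $P$ is contracted, following the element-wise definition above with 0-based indexing; the sizes are illustrative.

```python
import numpy as np

# Semi-tensor product of X in R^{M x NP} with W in R^{P x Q}, giving Y in R^{M x NQ}.
M, N, P, Q = 5, 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(M, N * P))
W = rng.normal(size=(P, Q))

X3 = X.reshape(M, P, N)                  # column index g(n, p) = p*N + n (0-based)
Y3 = np.einsum('mpn,pq->mqn', X3, W)     # contract only over the shared factor P
Y = Y3.reshape(M, Q * N)                 # column index g(n, q) = q*N + n
print(Y.shape)                           # (5, 6)

# Equivalent closed form: Y = X @ kron(W, I_N)
print(np.allclose(Y, X @ np.kron(W, np.eye(N))))   # True
```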

Taking the STP-based Tucker decomposition (STTu) as an example, it can be formulated as:

[STTu formula: the Tucker decomposition above with each mode-$n$ product replaced by its STP-based counterpart]

where $\mathcal G\in\mathbb R^{R_1\times R_2\times\cdots\times R_N}$ and $A^{(n)}\in\mathbb R^{I_n/t \times R_n/t}$. Compared with ordinary Tucker, the number of parameters is reduced from $\left(\prod^N_{n=1}R_n+ \sum^N_{n=1}I_nR_n\right)$ to $\left(\prod^N_{n=1}R_n+ \sum^N_{n=1}{I_nR_n}/t^2\right)$.

(2) Optimization methods for low-rank approximation

We have introduced various tensor decomposition methods, but how to apply them to DNNs without significant accuracy loss is itself an optimization problem. Since a lower tensor rank yields a higher compression ratio, we would like every layer of the DNN to be decomposed with a very low rank. However, as the rank decreases the approximation error increases, which can cause a drastic loss of accuracy. There is therefore a trade-off between accuracy and compression ratio, and it has been widely studied. Three families of low-rank optimization methods are used to achieve a good trade-off: pre-trained methods, pre-set methods, and compression-aware methods.

Due to space limitations, the post stops here; the remaining sections are to be continued.

1) Pre-trained methods

2) Pre-set methods

3) Compression-aware methods

(3) Effective sparsity measure

3. Combination with other compression techniques

4. Low-rank optimization of subspace training


Original post: blog.csdn.net/SmartLab307/article/details/131134669