On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Arora S, Cohen N, Hazan E, et al. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization[J]. arXiv: Learning, 2018.

Introduction

I really like this paper: the knowledge used in the proofs is not difficult, but it is applied very cleverly. My own mathematics is far too weak; how do these people have such a good nose for this kind of thing?

What the paper ultimately argues is one point, and it goes against the common perception: as a neural network gets deeper, the convergence rate of the parameter updates does not necessarily drop. I feel this ties into the many papers that discuss the importance of depth.

However, the analysis is carried out on linear neural networks. Moreover, the "Acceleration" in the title is not fully supported by theory: the authors only give a few special cases and some experimental evidence. I believe the authors did try, but proving it is clearly not easy; at the very least one would have to come up with some rate like \(O(T^{?})\).

Although the theoretical support is not enough, I still find it very impressive.

Main content

First, a confounding factor has to be ruled out: the acceleration might simply come from the two networks having different expressive power. For two networks \(N_1, N_2\), a difference in convergence speed could just be because \(N_1\) and \(N_2\) can drive the loss down to different degrees. For linear networks, however, increasing the number of layers does not change the expressive power of the network:
if \(L(W)\) is a loss function of \(W \in \R^{k \times d}\), then it expresses the same family of maps as \(L(W_N W_{N-1} \cdots W_1)\), provided \(W_N W_{N-1} \cdots W_1 \in \R^{k \times d}\).

One direction of this claim raises little doubt: suppose the latter is optimal at \((W_N^*, W_{N-1}^*, \ldots, W_1^*)\); then we can simply take \(W = W_N^* W_{N-1}^* \cdots W_1^*\), and therefore \(L(W^*) \le L(W_N^* W_{N-1}^* \cdots W_1^*)\).

The converse does not seem to hold in general: take, say, \(N = 2\) with \(W_2 \in \R^{k \times 1}, W_1 \in \R^{1 \times d}\). The setting used here, however, requires \(W_j \in \R^{d_j \times d_{j-1}}\) with every hidden width \(d_j \ge \min\{k, d\}\), and \(L\) convex in \(W\); under these conditions the two formulations can be shown to be equivalent (this seems to use a result I have seen before).
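As a small sanity check of the expressiveness claim (my own illustration, not from the paper): once the hidden width reaches \(\min\{k, d\}\), any \(W \in \R^{k \times d}\) factors exactly as a product \(W_2 W_1\), so the depth-2 linear network loses nothing.

```python
import numpy as np

# Any W in R^{k x d} factors as W_2 W_1 with hidden width min(k, d):
# take the thin SVD W = U S V^T and split the singular values.
k, d = 3, 5
rng = np.random.default_rng(0)
W = rng.standard_normal((k, d))

U, s, Vt = np.linalg.svd(W, full_matrices=False)  # U: k x r, Vt: r x d, r = min(k, d)
S_half = np.diag(np.sqrt(s))
W2 = U @ S_half      # k x r
W1 = S_half @ Vt     # r x d

assert np.allclose(W2 @ W1, W)  # the depth-2 linear net represents exactly the same map
```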

There is a fair amount of notation, so let me simplify where possible. Let \(x \in \R^d\) be a sample and \(y \in \R^k\) the output, and define
\[ \Phi^N := \{x \rightarrow W_N W_{N-1} \cdots W_1 x \,|\, W_j \in \R^{n_j \times n_{j-1}}, j = 1, \cdots, N\}, \]
where obviously \(n_N = k\) and \(n_0 = d\). Let \(L^N(\cdot)\) be the loss viewed as a function of \((W_N, W_{N-1}, \cdots, W_1)\), so that
\[ L^N(W_{N}, W_{N-1}, \ldots, W_1) = L^1(W_N W_{N-1} \cdots W_1). \]
Do not think this is superfluous; without it, things would feel disconnected later, when we get to the proofs.

The update is gradient descent; it has a bit of the flavour of momentum, but it is slightly different:

\[ W_j^{(t+1)} \leftarrow (1-\eta \lambda) W_j^{(t)} - \eta \frac{\partial L^N}{\partial W_j}(W_1^{(t)}, \ldots, W_N^{(t)}), \quad j = 1, \ldots, N. \]
Here \(\eta > 0\) is the learning rate and \(\lambda \ge 0\) is the weight decay coefficient.
Define
\[ W_e = W_N W_{N-1} \cdots W_1, \]
so that \(L^N(W_N, \ldots, W_1) = L^1(W_e)\).
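To keep the notation concrete, here is a minimal sketch of this per-layer update for a depth-3 linear network (my own toy example, assuming the squared loss \(L^1(W) = \frac{1}{2}\|WX-Y\|_F^2\); the chain rule used for \(\partial L^N/\partial W_j\) is spelled out again in the proof section below):

```python
import numpy as np

# per-layer gradient descent with weight decay on a depth-3 linear net,
# assuming the squared loss L^1(W) = 0.5 * ||W X - Y||_F^2
rng = np.random.default_rng(0)
d, n1, n2, k = 4, 4, 4, 3              # widths n_0 = d, n_1, n_2, n_3 = k
X = rng.standard_normal((d, 8))        # 8 samples as columns
Y = rng.standard_normal((k, 8))
W1 = 0.3 * rng.standard_normal((n1, d))
W2 = 0.3 * rng.standard_normal((n2, n1))
W3 = 0.3 * rng.standard_normal((k, n2))

L1 = lambda W: 0.5 * np.linalg.norm(W @ X - Y) ** 2   # L^1, a function of one k x d matrix
grad_L1 = lambda W: (W @ X - Y) @ X.T                 # dL^1/dW

eta, lam = 1e-3, 0.1
for _ in range(200):
    We = W3 @ W2 @ W1                  # end-to-end matrix W_e
    G = grad_L1(We)                    # L^N(W_3, W_2, W_1) = L^1(W_e) by construction
    # chain rule: dL^N/dW_j = (W_N ... W_{j+1})^T G (W_{j-1} ... W_1)^T
    g1 = W2.T @ W3.T @ G
    g2 = W3.T @ G @ W1.T
    g3 = G @ W1.T @ W2.T
    W1 = (1 - eta * lam) * W1 - eta * g1
    W2 = (1 - eta * lam) * W2 - eta * g2
    W3 = (1 - eta * lam) * W3 - eta * g3

print(L1(W3 @ W2 @ W1))                # the end-to-end loss after the updates
```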

Assuming the learning rate \(\eta\) is small, the update above can be viewed from the perspective of a differential equation:
\[ \dot{W}_j(t) = -\eta \lambda W_j(t) - \eta \frac{\partial L^N}{\partial W_j}(W_1^{(t)}, \ldots, W_N^{(t)}). \]
How should I put it: my understanding is that since \(\eta\) is very small, \(W_j(t)\) changes very gently, so the derivative can be identified with the change over one time step \(\Delta t = 1\)?

I have read something similar before; Oja's rule is also treated this way. My feeling is that what the authors actually mean is
\[ \dot{W}_j(t) = -\lambda W_j(t) - \frac{\partial L^N}{\partial W_j}(W_1^{(t)}, \ldots, W_N^{(t)}), \]
in which case
\[ \begin{array}{ll} W_j(t + \eta) &= W_j(t) - \Big[\lambda W_j(t) + \frac{\partial L^N}{\partial W_j}(W_1^{(t)}, \ldots, W_N^{(t)})\Big] \eta + O(\eta^2) \\ &\approx (1-\eta\lambda) W_j^{(t)} - \eta \frac{\partial L^N}{\partial W_j}(W_1^{(t)}, \ldots, W_N^{(t)}). \end{array} \]

I think it should be understood this way, but it makes no difference to the final result.
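Either way, the point is just that one gradient step is the forward-Euler discretization of the gradient flow. A tiny toy check of that reading (my own example, a one-dimensional quadratic loss, nothing from the paper):

```python
# gradient flow  dw/dt = -lam*w - dL/dw  for L(w) = 0.5*a*w^2,
# compared with a single gradient-descent step of size eta (forward Euler)
a, lam, eta = 2.0, 0.1, 0.01
w0 = 1.5
grad = lambda w: a * w

w_gd = (1 - eta * lam) * w0 - eta * grad(w0)   # one discrete update

w_ode, h = w0, eta / 1000                      # integrate the flow over time eta
for _ in range(1000):
    w_ode += h * (-lam * w_ode - grad(w_ode))

print(abs(w_gd - w_ode))                       # small discrepancy, O(eta^2)
```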

Theorem 1

Theorem 1. Suppose the weight matrices \(W_1, \ldots, W_N\) satisfy the differential equations
\[ \dot{W}_j(t) = -\eta \lambda W_j(t) - \eta \frac{\partial L^N}{\partial W_j}(W_1^{(t)}, \ldots, W_N^{(t)}), \quad j = 1, \ldots, N, \]
and
\[ W_{j+1}^T(t_0) W_{j+1}(t_0) = W_j(t_0) W_j^T(t_0), \quad j = 1, \ldots, N-1. \]
Then the end-to-end weight matrix \(W_e\) evolves according to
\[ \begin{array}{ll} \dot{W}_e(t) = & -\eta \lambda N \cdot W_e(t) \\ & - \eta \sum_{j=1}^{N} [W_e(t) W_e^T(t)]^{\frac{j-1}{N}} \cdot \frac{\mathrm{d} L^1}{\mathrm{d} W}(W_e(t)) \cdot [W_e^T(t) W_e(t)]^{\frac{N-j}{N}}, \end{array} \]
where \([\cdot]^{\frac{q}{p}}\) is defined for positive semi-definite matrices as follows: if
\[ A = VDV^T, \quad \text{then} \quad A^{\frac{q}{p}} = V D^{\frac{q}{p}} V^T, \]
where the diagonal matrix \(D^{\frac{q}{p}}\) is obtained by raising each diagonal element to that power, i.e. its entries are \(D_{ii}^{\frac{q}{p}}\).

Accordingly, the discrete-time update of \(W_e\) is approximately
\[ \begin{array}{ll} W_e^{(t+1)} = & (1 - \eta \lambda N) \cdot W_e(t) \\ & - \eta \sum_{j=1}^{N} [W_e(t) W_e^T(t)]^{\frac{j-1}{N}} \cdot \frac{\mathrm{d} L^1}{\mathrm{d} W}(W_e(t)) \cdot [W_e^T(t) W_e(t)]^{\frac{N-j}{N}}. \end{array} \]
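Here is a hedged numerical sketch of this update for the simplest case \(N = 2\), \(\lambda = 0\) (my own check, not code from the paper; it assumes the squared loss \(L^1(W) = \frac{1}{2}\|WX - Y\|_F^2\)). Starting from a balanced initialization \(W_2^T W_2 = W_1 W_1^T\), one small gradient step on the two layers moves \(W_e = W_2 W_1\) by essentially the sum above, with the fractional powers computed through the eigendecomposition exactly as defined:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, N, eta = 3, 4, 2, 1e-4
X = rng.standard_normal((d, 10))
Y = rng.standard_normal((k, 10))
grad_L1 = lambda W: (W @ X - Y) @ X.T            # dL^1/dW for L^1(W) = 0.5 * ||W X - Y||_F^2

# balanced initialization W2^T W2 = W1 W1^T: split the singular values of a random W_e
U, s, Vt = np.linalg.svd(rng.standard_normal((k, d)), full_matrices=False)
W2, W1 = U @ np.diag(np.sqrt(s)), np.diag(np.sqrt(s)) @ Vt
We = W2 @ W1

def psd_power(A, p):                             # [A]^p for a PSD matrix, via eigendecomposition
    lam, V = np.linalg.eigh(A)
    return (V * np.clip(lam, 0, None) ** p) @ V.T

G = grad_L1(We)

# (a) update the two layers and recompute the end-to-end matrix
We_layers = (W2 - eta * G @ W1.T) @ (W1 - eta * W2.T @ G)    # dL^2/dW2 = G W1^T, dL^2/dW1 = W2^T G

# (b) the end-to-end step predicted by the formula above (lambda = 0)
step = sum(psd_power(We @ We.T, (j - 1) / N) @ G @ psd_power(We.T @ We, (N - j) / N)
           for j in range(1, N + 1))
We_formula = We - eta * step

print(np.abs(We_layers - We_formula).max())      # agreement up to O(eta^2)
```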

Claim 1

Frankly, one does not see the trick just by staring at the update above, so the authors also give it in vectorized form, which shows the mystery more intuitively.
Claim 1. For any matrix \(A\), define \(vec(A)\) as the vector obtained by stacking the columns of \(A\). Then,
\[ vec(W_e^{(t+1)}) \approx (1-\eta\lambda N)\, vec(W_e^{(t)}) - \eta\, P_{W_e^{(t)}}\, vec\Big(\frac{\mathrm{d}L^1}{\mathrm{d}W}(W_e^{(t)})\Big), \]
where \(P_{W_e^{(t)}}\) is a positive semi-definite matrix that depends on \(W_e\). Suppose
\[ W_e^{(t)} = UDV^T, \]
where \(U = [u_1, u_2, \ldots, u_k] \in \R^{k \times k}\), \(V = [v_1, v_2, \ldots, v_d] \in \R^{d \times d}\), and the diagonal elements of \(D\), i.e. the singular values of \(W_e^{(t)}\), are \(\sigma_1, \sigma_2, \ldots, \sigma_{\max\{k, d\}}\) in decreasing order. Then the eigenvectors of \(P_{W_e^{(t)}}\) and the corresponding eigenvalues are:
\[ vec(u_r v_{r'}^T) \;\text{ with eigenvalue }\; \sum_{j=1}^{N} \sigma_{r}^{2\frac{j-1}{N}}\, \sigma_{r'}^{2\frac{N-j}{N}}, \qquad r = 1, \ldots, k, \quad r' = 1, \ldots, d. \]

What does this tell us? After overparameterization, the update of \(W_e^{(t+1)}\), i.e. of \(vec(W_e^{(t+1)})\), is pulled toward \(vec(u_1 v_1^{T})\). It feels somewhat similar in spirit to preconditioned gradient descent methods, in that it borrows from the results of earlier steps; moreover, through this borrowing the coordinates communicate with one another, which ordinary gradient descent does not do.
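To convince myself of this picture, a small check (my own sketch; it assumes a column-stacking \(vec\), under which \(P_{W_e}\) is the Kronecker-product sum that falls out of vectorizing the update above). The top eigenvector of \(P_{W_e}\) does come out as \(vec(u_1 v_1^T)\), so the implicit preconditioner stretches the step most along the leading singular direction and mixes coordinates, which plain gradient descent never does:

```python
import numpy as np

rng = np.random.default_rng(2)
k, d, N = 3, 4, 3
We = rng.standard_normal((k, d))

def psd_power(A, p):
    lam, V = np.linalg.eigh(A)
    return (V * np.clip(lam, 0, None) ** p) @ V.T

vec = lambda A: A.flatten(order="F")    # column-stacking vec, so vec(ABC) = (C^T kron A) vec(B)

# P_{W_e}: the vectorization of  sum_j [We We^T]^{(j-1)/N} G [We^T We]^{(N-j)/N}
P = sum(np.kron(psd_power(We.T @ We, (N - j) / N), psd_power(We @ We.T, (j - 1) / N))
        for j in range(1, N + 1))

# sanity check: P vec(G) reproduces the matrix-form sum for a random G
G = rng.standard_normal((k, d))
lhs = P @ vec(G)
rhs = vec(sum(psd_power(We @ We.T, (j - 1) / N) @ G @ psd_power(We.T @ We, (N - j) / N)
              for j in range(1, N + 1)))
assert np.allclose(lhs, rhs)

# the top eigenvector of P aligns with vec(u_1 v_1^T)
U, s, Vt = np.linalg.svd(We)
w, Q = np.linalg.eigh(P)
top = Q[:, -1]                          # eigenvector of the largest eigenvalue
target = vec(np.outer(U[:, 0], Vt[0]))  # vec(u_1 v_1^T)
print(abs(top @ target))                # ~1.0 up to sign
```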

Claim 2

(image omitted: the statement of Claim 2; judging from the code at the end of this post, it specializes the update to a single-output network, where it becomes \(W \leftarrow (1-\eta\lambda N)\,W - \eta\,\|W\|^{2-\frac{2}{N}}\big(\nabla L + (N-1)\,\tfrac{\langle W, \nabla L\rangle}{\|W\|^2}\,W\big)\): an adaptive learning rate plus a momentum-like projection term.)

Theorem 2

Theorem 2. Suppose \(\frac{\mathrm{d} L^1}{\mathrm{d} W}\) is defined at \(W = 0\) and continuous in some neighborhood of \(W = 0\). Then, for any given \(N \in \N\), \(N > 2\), define:
(image omitted: the definition of the vector field \(F(W)\); from the discussion below, it is the update direction induced by overparameterization, written for a single matrix \(W\).)
Then there exists no function of \(W\) whose gradient field is \(F\).

The meaning of Theorem 2 is that the effect of overparameterization cannot be reproduced by adding a regularization term: since \(F(W)\) has no potential function, no objective of the form
\[ L(W) + \lambda \|W\| \]
can, under ordinary gradient descent, produce the change in the update that overparameterization produces.

The idea of the proof is to construct a closed curve and show that the line integral of \(F(W)\) along it is not 0. (Slick...)

Proof

Proof of Theorem 1

First, some notation:
\[ \prod_{j=a}^{b} W_j := W_b W_{b-1} \cdots W_a \\ \prod_{j=a}^b W_j^T := W_a^TW_{a+1}^T \cdots W_b^T \]

In addition, \(diag(A_1, \ldots, A_m)\) denotes the block diagonal matrix with diagonal blocks \(A_1, \ldots, A_m\).

It is easy to show (actually it took me some effort, but I do not want to write it down, because I forget it every time; if I forget it again next time, re-deriving it will be the punishment):

\[ \frac{\partial L^N}{\partial W_j}(W_1, \ldots, W_N) = \prod_{i=j+1}^{N} W_i^T \cdot \frac{\mathrm{d} L^1}{\mathrm{d} W}(W_e) \cdot \prod_{i=1}^{j-1} W_i^T. \]

Hence
\[ \dot{W}_j(t) = -\eta \lambda W_j(t) - \eta \prod_{i=j+1}^{N} W_i^T(t) \cdot \frac{\mathrm{d} L^1}{\mathrm{d} W}(W_e(t)) \cdot \prod_{i=1}^{j-1} W_i^T(t). \]
Right-multiplying both sides of the \(j\)-th equation by \(W_j^T(t)\), and left-multiplying both sides of the \((j+1)\)-th equation by \(W_{j+1}^T(t)\), gives
\[ \begin{array}{l} \dot{W}_j(t) W_j^T(t) + \eta\lambda W_j(t)W_j^T(t) = -\eta \prod_{i=j+1}^{N} W_i^T(t) \cdot \frac{\mathrm{d} L^1}{\mathrm{d} W}(W_e(t)) \cdot \prod_{i=1}^{j} W_i^T(t), \\ W_{j+1}^T(t)\dot{W}_{j+1}(t) + \eta\lambda W_{j+1}^T(t)W_{j+1}(t) = -\eta \prod_{i=j+1}^{N} W_i^T(t) \cdot \frac{\mathrm{d} L^1}{\mathrm{d} W}(W_e(t)) \cdot \prod_{i=1}^{j} W_i^T(t), \end{array} \]
so the two left-hand sides are equal. Adding this identity to its own transpose (this is where the factor 2 comes from) yields
\[ \frac{\mathrm{d}}{\mathrm{d}t}\big(W_j W_j^T\big)(t) + 2\eta\lambda W_j(t)W_j^T(t) = \frac{\mathrm{d}}{\mathrm{d}t}\big(W_{j+1}^T W_{j+1}\big)(t) + 2\eta\lambda W_{j+1}^T(t)W_{j+1}(t). \]

Let \(C_j(t):=W_j(t)W_j^T(t)\) and \(C_j'(t):=W_j^T(t)W_j(t)\); then
\[ \dot{C}_j(t) + 2\eta\lambda C_j(t) = \dot{C}'_{j+1}(t) + 2\eta\lambda C'_{j+1}(t). \]

Note that the equation above can be rewritten equivalently as
\[ \dot{(C'_{j+1}-C_j)}(t) = -2\eta \lambda (C'_{j+1}-C_j)(t), \]
Let \(y(t):=(C'_{j+1}-C_j)(t)\); then
\[ \dot{y}(t)=-2\eta \lambda y, \]
together with the initial condition \(y(t_0)=0\) (this is exactly the assumption of the theorem).
It is easy to see that the solution of this differential equation is \(y\equiv0\).
Hence
\[ C'_{j+1}(t)=C_j(t), j=1,\ldots, N-1. \]
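This conservation law is easy to poke at numerically. A small sketch (mine, with \(\lambda = 0\), a squared loss, and plain small-step gradient descent standing in for the gradient flow): if a depth-3 linear net starts balanced, \(W_{j+1}^T W_{j+1} - W_j W_j^T\) stays essentially zero along the whole trajectory (it is conserved exactly by the continuous flow; the discrete steps only add an \(O(\eta^2)\) drift per step):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, eta, steps = 4, 3, 1e-4, 200
X = rng.standard_normal((d, 10))
Y = rng.standard_normal((k, 10))
grad_L1 = lambda W: (W @ X - Y) @ X.T                 # squared-loss gradient (assumption)

# balanced depth-3 start: split the singular values of a random end-to-end matrix
U, s, Vt = np.linalg.svd(rng.standard_normal((k, d)), full_matrices=False)
D3 = np.diag(s ** (1 / 3))
W1, W2, W3 = D3 @ Vt, D3.copy(), U @ D3               # W1 W1^T = W2^T W2, W2 W2^T = W3^T W3

for _ in range(steps):
    G = grad_L1(W3 @ W2 @ W1)
    g1, g2, g3 = W2.T @ W3.T @ G, W3.T @ G @ W1.T, G @ W1.T @ W2.T   # chain rule
    W1, W2, W3 = W1 - eta * g1, W2 - eta * g2, W3 - eta * g3          # lambda = 0

print(np.abs(W2.T @ W2 - W1 @ W1.T).max(),   # stays near zero (exactly zero for the flow;
      np.abs(W3.T @ W3 - W2 @ W2.T).max())   # discrete steps drift by O(eta^2) per step)
```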
Suppose the singular value decomposition of \(W_j(t)\) is
\[ W_j(t)=U_j \Sigma_jV_j^T. \]
and assume the diagonal elements of \(\Sigma_j\), i.e. the singular values, are arranged in decreasing order.
Then we obtain
\[ V_{j+1}\, \Sigma_{j+1}^T\Sigma_{j+1}\, V_{j+1}^T = U_j\, \Sigma_j\Sigma_j^T\, U_j^T, \quad j=1,\ldots, N-1. \]

Clearly \(\Sigma_{j+1}^T\Sigma_{j+1}=\Sigma_j \Sigma_j^T\): the eigenvalues of a matrix are fixed (once their order is fixed), while the eigenvectors are not unique, since there may be repeated eigenvalues, in which case any orthonormal basis of the corresponding eigenspace can serve as eigenvectors. In other words,
\[ \Sigma_{j+1}^T\Sigma_{j+1} = \Sigma_j\Sigma_j^T = diag(\rho_1 I_{d_1}, \ldots, \rho_m I_{d_m}), \]
\[ V_{j+1} = U_j\, diag(O_{j,1}, \ldots, O_{j,m}), \]
where \(\rho_1 > \cdots > \rho_m\) are the distinct eigenvalues with multiplicities \(d_1, \ldots, d_m\), \(I_{d_r} \in \R^{d_r \times d_r}\) is the identity matrix, and \(O_{j,r} \in \R^{d_r \times d_r}\) is an orthogonal matrix.

So, for \(j=1,\ldots, N-1\), it holds that
(equation image omitted)

For \(j=N\):
(equation image omitted)
(equation image omitted)
Note that the derivation above needs
\[ (diag(O_{j,1},\ldots, O_{j,m}))^T\, diag((\rho_1)^c I_{d_1},\ldots, (\rho_m)^c I_{d_m})\, (diag(O_{j,1},\ldots, O_{j,m})) = diag((\rho_1)^c I_{d_1},\ldots, (\rho_m)^c I_{d_m}), \]
which holds because each orthogonal block \(O_{j,r}\) commutes with the scalar block \((\rho_r)^c I_{d_r}\).
Since
(equation image omitted)
it follows that
(equation image omitted)
(equation image omitted)
The left-hand side of the expression above is \(\dot{W}_e(t)\), so
(equation image omitted)

Using the conclusions of (23) and (24) (the equation numbers from the paper) once more, we arrive at the evolution equation for \(W_e\) stated in Theorem 1, which completes the proof.

Proof of Claim 1

Kronecker product

Most people online seem to write it as \(\otimes\), but here I will follow the paper's convention and use \(\odot\) for the Kronecker product:
\[ A \odot B := \left [ \begin{array}{ccc} a_{11} \cdot B & \cdots & a_{1n_{a}} \cdot B \\ \vdots & \ddots & \vdots \\ a_{m_a1} \cdot B & \cdots & a_{m_a n_a} \cdot B \end{array} \right ] \in \R^{m_am_b \times n_an_b}, \]
where \(A \in \R^{m_a \times n_a}, B \in \R^{m_b \times n_b}\).

It is easy to show that column \(rn_b + s\) of \(A \odot B\), for \(r = 0, 1, \ldots, n_a-1\) and \(s = 0, 1, \ldots, n_b-1\), is:
\[ vec(B_{*s+1}A_{*r+1}^T), \]
where \(B_{*j}\) denotes the \(j\)-th column of \(B\), and \(vec(A)\) is, as before, the column-stacking of \(A\). Correspondingly, row \(pm_b+q\) of \(A \odot B\), for \(p=0,1,\ldots,m_a-1\) and \(q=0, 1, \ldots, m_b-1\), is:
\[ vec(B_{q+1*}^TA_{p+1*})^T, \]
where \(A_{i*}\) denotes the \(i\)-th row of \(A\).

Let \([A\odot B]_{(p,q,r,s)}\) denote the element of \([A \odot B]\) in column \(rn_b+s\) and row \(pm_b+q\); then
\[ [A\odot B]_{(p,q,r,s)} = a_{p+1,r+1}b_{q+1,s+1} \]

In addition, \(I_{d_1} \odot I_{d_2} = I_{d_1d_2}\).

Next, a few important properties:

\((A_1 \odot A_2)(B_1 \odot B_2) = (A_1 B_1) \odot (A_2B_2)\)

Suppose \(A_1 \in \R^{m_1 \times l_1}, B_1 \in \R^{l_1 \times n_1}, A_2 \in \R^{m_2 \times l_2}, B_2 \in \R^{l_2 \times n_2}\); then
\[ (A_1 \odot A_2)(B_1 \odot B_2) = (A_1 B_1) \odot (A_2B_2) \]

Compare the \((pm_2+q, rn_2+s)\) elements of the matrices on the two sides:
\[ \begin{array}{ll} [(A_1 \odot A_2)(B_1 \odot B_2)]_{(p,q,r,s)} &= (A_1 \odot A_2)_{pm_2+q*} (B_1 \odot B_2)_{*rn_2+s} \\ &= vec({A_2}_{q+1*}^T{A_1}_{p+1*})^T vec({B_2}_{*s+1}{B_1}_{*r+1}^T) \\ & = tr({A_1}_{p+1*}^T{A_2}_{q+1*}{B_2}_{*s+1}{B_1}_{*r+1}^T) \\ & = ({A_1}_{p+1*}{B_1}_{*r+1}) ({A_2}_{q+1*}{B_2}_{*s+1}) \\ & = (A_1B_1)_{p+1,r+1} (A_2B_2)_{q+1,s+1} \\ & = [(A_1 B_1) \odot (A_2B_2)]_{(p,q,r,s)}. \end{array} \]
This proves the identity. Note that the step from the trace expression to the product of the two scalars uses the cyclic property of the trace.

\((A \odot B)^T=A^T \odot B^T\)

\[ \begin{array}{ll} [(A \odot B)^T]_{(p, q, r, s)} &= [A \odot B]_{(r, s, p, q)} = a_{r+1,p+1}b_{s+1,q+1} \\ & = a^T_{p+1,r+1}b^T_{q+1,s+1}=[A^T \odot B^T]_{(p,q,r,s)}. \end{array} \]

\(A^T=A^{-1},B^T=B^{-1} \Rightarrow (A \odot B)^T = (A \odot B)^{-1}\)

\[ \begin{array}{ll} (A \odot B)^T(A \odot B) & = (A^T \odot B^T)(A \odot B) \\ &= (A^TA) \odot (B^TB) \\ &= I_{n_a} \odot I_{n_b} \\ & = I_{n_a n_b}, \end{array} \]
so \((A \odot B)^T = (A \odot B)^{-1}\).
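All of these identities, including the \(vec(ABC)\) identity used right below, are easy to sanity-check numerically; np.kron lays out the blocks exactly as in the definition of \(\odot\) above (my own quick check, not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(4)
A1, B1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
A2, B2 = rng.standard_normal((2, 5)), rng.standard_normal((5, 3))

kron = np.kron   # the paper's odot is the ordinary Kronecker product

# mixed-product property: (A1 . A2)(B1 . B2) = (A1 B1) . (A2 B2)
assert np.allclose(kron(A1, A2) @ kron(B1, B2), kron(A1 @ B1, A2 @ B2))

# transpose: (A . B)^T = A^T . B^T
assert np.allclose(kron(A1, A2).T, kron(A1.T, A2.T))

# Kronecker product of orthogonal matrices is orthogonal
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q = kron(Q1, Q2)
assert np.allclose(Q.T @ Q, np.eye(12))

# vec identity used in the Claim 1 proof: vec(ABC) = (C^T . A) vec(B), column-stacking vec
vec = lambda M: M.flatten(order="F")
A, B, C = rng.standard_normal((3, 4)), rng.standard_normal((4, 5)), rng.standard_normal((5, 2))
assert np.allclose(vec(A @ B @ C), kron(C.T, A) @ vec(B))
```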

Back to the proof of Claim 1. It is easy to show that, with the column-stacking \(vec\),
\[ vec(ABC) = (C^T \odot A)\, vec(B), \]
and therefore, applying this to each term of the \(W_e\) update (and using \((A_1 \odot A_2)(B_1 \odot B_2) = (A_1 B_1) \odot (A_2B_2)\) along the way),
\[ vec(W_e^{(t+1)}) = (1-\eta\lambda N)\, vec(W_e^{(t)}) - \eta \sum_{j=1}^{N}\Big([W_e^T W_e]^{\frac{N-j}{N}} \odot [W_e W_e^T]^{\frac{j-1}{N}}\Big)\, vec\Big(\frac{\mathrm{d}L^1}{\mathrm{d}W}(W_e^{(t)})\Big), \]
where \(W_e\) inside the sum is evaluated at time \(t\).

It only remains to show that
\[ \sum_{j=1}^{N} [W_e^T W_e]^{\frac{N-j}{N}} \odot [W_e W_e^T]^{\frac{j-1}{N}} \]
is exactly \(P_{W_e}\). Let
\[ W_e = UDV^T, \]
where \(U \in \R^{k \times k}, V \in \R^{d \times d}\).
Then
\[ \begin{array}{ll} \sum_{j=1}^{N} [W_e^T W_e]^{\frac{N-j}{N}} \odot [W_e W_e^T]^{\frac{j-1}{N}} &= \sum_{j=1}^{N} \big(V (D^TD)^{\frac{N-j}{N}} V^T\big) \odot \big(U (DD^T)^{\frac{j-1}{N}} U^T\big) \\ &= (V \odot U)\Big(\sum_{j=1}^{N} (D^TD)^{\frac{N-j}{N}} \odot (DD^T)^{\frac{j-1}{N}}\Big)(V \odot U)^T, \end{array} \]
where the last equality uses \((A_1 \odot A_2)(B_1 \odot B_2) = (A_1 B_1) \odot (A_2B_2)\) twice.

Define \(Q\) to be this sum, and set
\[ O := V \odot U, \qquad \Lambda := \sum_{j=1}^{N} (D^TD)^{\frac{N-j}{N}} \odot (DD^T)^{\frac{j-1}{N}}; \]
then

\[ Q = O \Lambda O^T. \]

What remains is routine: by the column formula for \(\odot\) above, the columns of \(O\) are exactly the vectors \(vec(u_i v_{i'}^T)\), and the corresponding diagonal elements of \(\Lambda\) are \(\sum_{j=1}^{N} \sigma_{i}^{2\frac{j-1}{N}}\, \sigma_{i'}^{2\frac{N-j}{N}}\). Since \(O\) is orthogonal (a Kronecker product of orthogonal matrices), \(Q = O \Lambda O^T\) is precisely the eigendecomposition of \(P_{W_e}\) claimed in Claim 1; the rest is just some simple bookkeeping.

Proof of Theorem 2

I am not going to post this proof here: I can just barely follow it myself, so if you want the details, better go straight to the original paper.

Code

(plots omitted: training loss curves comparing the overparameterized update ("Over") with plain SGD ("normal") on the toy problem coded below.)
Although I only experimented with a very simple example, my feeling is that this iterative scheme is quite sensitive to the initial value. As Claim 1 suggests, this descent method keeps leaning toward its earlier direction; does that mean that if it starts out wrong, it stays wrong later on?

With y1 set to 100, y2 set to 1, and lr = 0.005, we get the following (it may not even converge to 0):
(plot omitted: the loss curve for this setting.)

The way the loss drops is rather spectacular, but it does not feel stable. Of course, it may also just be that my program is badly written.


"""
On the Optimization of Deep
Net works: Implicit Acceleration by
Overparameterization
"""

import numpy as np
import torch
import torch.nn as nn
from torch.optim.optimizer import Optimizer, required



class Net(nn.Module):
    def __init__(self, d, k):
        """
        :param k:  output dimension
        :param d:  input dimension
        """
        super(Net, self).__init__()
        self.d = d
        self.dense = nn.Sequential(
            nn.Linear(d, k)
        )

    def forward(self, input):
        x = input.view(-1, self.d)
        output = self.dense(x)
        return output




class Overparameter(Optimizer):
    def __init__(self, params, N, lr=required, weight_decay=1.):
        defaults = dict(lr=lr)
        super(Overparameter, self).__init__(params, defaults)
        self.N = N                        # N: depth of the emulated (overparameterized) linear net
        self.weight_decay = weight_decay  # roughly plays the role of lambda * N in the W_e update

    def __setstate__(self, state):
        super(Overparameter, self).__setstate__(state)
        print("????")
        print(state)
        print("????")

    def step(self, closure=None):
        def calc_part2(W, dw, N):
            # single-output form of the overparameterized update:
            # ||W||^(2 - 2/N) * (dW + (N - 1) * <W, dW> * W / ||W||^2)
            dw = dw.detach().numpy()
            w = W.detach().numpy()
            norm = np.linalg.norm(w, 2)
            part2 = norm ** (2-2/N) * (
                dw +
                (N - 1) * (w @ dw.T) * w / (norm ** 2 + 1e-5)
            )
            return torch.from_numpy(part2)

        # NOTE: only the first parameter (the Linear weight) is updated; the bias keeps its initial value
        p = self.param_groups[0]['params'][0]
        if p.grad is None:
            return 0
        d_p = p.grad.data
        part1 = (self.weight_decay * p.data).float()
        part2 = (calc_part2(p, d_p, self.N)).float()
        p.data -= self.param_groups[0]['lr'] * (part1+part2)

        return 1

class L4Loss(nn.Module):
    """l4 loss ||x - y||_4, used for the run with the Overparameter update."""
    def __init__(self):
        super(L4Loss, self).__init__()

    def forward(self, x, y):
        return torch.norm(x-y, 4)

# toy data: two one-hot inputs with targets 10. and 2.
x1 = torch.tensor([1., 0])
y1 = torch.tensor(10.)
x2 = torch.tensor([0, 1.])
y2 = torch.tensor(2.)
net = Net(2, 1)
criterion = L4Loss()
opti = Overparameter(net.parameters(), 4, lr=0.01)  # N = 4: emulate a depth-4 linear net


loss_store = []
for epoch in range(500):
    running_loss = 0.0
    out1 = net(x1)
    loss1 = criterion(out1, y1)
    opti.zero_grad()
    loss1.backward()
    opti.step()
    running_loss += loss1.item()
    out2 = net(x2)
    loss2 = criterion(out2, y2)
    opti.zero_grad()
    loss2.backward()
    opti.step()
    running_loss += loss2.item()
    #print(running_loss)
    loss_store.append(running_loss)

# baseline: the same architecture trained with plain SGD on the MSE loss
net = Net(2, 1)
criterion = nn.MSELoss()
opti = torch.optim.SGD(net.parameters(), lr=0.01)
loss_store2 = []
for epoch in range(500):
    running_loss = 0.0
    out1 = net(x1)
    loss1 = criterion(out1, y1)
    opti.zero_grad()
    loss1.backward()
    opti.step()
    running_loss += loss1.item()
    out2 = net(x2)
    loss2 = criterion(out2, y2)
    opti.zero_grad()
    loss2.backward()
    opti.step()
    running_loss += loss2.item()
    #print(running_loss)
    loss_store2.append(running_loss)


import matplotlib.pyplot as plt


plt.plot(range(len(loss_store)), loss_store, color="red", label="Over")
plt.plot(range(len(loss_store2)), loss_store2, color="blue", label="normal")
plt.legend()
plt.show()

Origin www.cnblogs.com/MTandHJ/p/11701133.html