Stochastic Gradient Descent

一、Starting from the Multinomial Logistic Model

1、Multinomial Logistic       

       Let x \in R^d be a d-dimensional input vector;

            c \in \{0,1,\ldots,k-1\} an output label (k classes in total);

            \beta \in \{\beta_0,\ldots,\beta_{k-2}\}, each \beta_c \in R^d, the model parameter vectors.

The multinomial logistic model then takes the following form:

                p(c|x,\beta)=\begin{cases} \frac{e^{\beta_c \cdot x}}{Z_x} & \text{if } c<k-1\\ \frac{1}{Z_x} & \text{if } c=k-1 \end{cases}

where:

                \beta_c \cdot x=\sum\limits_{i<d}\beta_{c,i}\, x_i

                Z_x=1+\sum\limits_{c<k-1}e^{\beta_c \cdot x}

For example, when k=2 the output labels are 0 and 1, and the model reduces to:

                p(c|x,\beta)=\begin{cases} \frac{e^{\beta_0 \cdot x}}{1+e^{\beta_0 \cdot x}} & \text{if } c=0\\ \frac{1}{1+e^{\beta_0 \cdot x}} & \text{if } c=1 \end{cases}
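As a concrete illustration, here is a minimal NumPy sketch of the model above (the function name and the toy numbers are my own):

```python
import numpy as np

def multinomial_logistic_prob(x, betas):
    """p(c|x, beta): x is a (d,) input, betas stacks beta_0..beta_{k-2} as (k-1, d).
    Returns a (k,) probability vector; class k-1 is the reference class."""
    scores = np.exp(betas @ x)           # e^{beta_c . x} for each c < k-1
    z = 1.0 + scores.sum()               # Z_x = 1 + sum_{c<k-1} e^{beta_c . x}
    return np.append(scores, 1.0) / z    # last entry is 1/Z_x

# k = 2 reduces to ordinary logistic regression:
x = np.array([1.0, -2.0, 0.5])
betas = np.array([[0.3, 0.1, -0.4]])
print(multinomial_logistic_prob(x, betas))   # [p(c=0), p(c=1)]
```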

 

2、Maximum Likelihood Estimate and Maximum a Posteriori Estimate

(1)、Maximum Likelihood Estimate

        Given a data set D=\{\langle x_j,c_j\rangle\}_{j<n}, the model is commonly trained by finding the maximum-likelihood parameters:

                \hat{\beta}=\arg\max\limits_\beta\; p(D|\beta)=\arg\max\limits_\beta\; \log\prod\limits_{j<n}p(c_j|x_j,\beta)

                    =\arg\max\limits_\beta\; \sum\limits_{j<n}\log p(c_j|x_j,\beta)
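A small sketch of the corresponding loss, i.e. the negative log-likelihood we would hand to an optimizer (function and variable names are my own, under the model defined in section 1):

```python
import numpy as np

def neg_log_likelihood(betas, X, C):
    """-sum_j log p(c_j|x_j, beta). X: (n, d) inputs; C: (n,) labels in {0..k-1};
    betas: (k-1, d). The reference class k-1 has an implicit score of 0."""
    scores = X @ betas.T                               # (n, k-1): beta_c . x_j
    full = np.hstack([scores, np.zeros((len(X), 1))])  # append the zero score
    log_z = np.log(np.exp(full).sum(axis=1))           # log Z_x per sample
    return -(full[np.arange(len(X)), C] - log_z).sum()
```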

(2)、Maximum a Posteriori Estimate       

        Suppose the model parameters \beta follow a prior distribution p(\beta); then, for a given data set, the optimal parameters we seek satisfy:

                \hat{\beta}=\arg\max\limits_\beta\; p(\beta|D)

                    =\arg\max\limits_\beta\; \frac{p(D|\beta)p(\beta)}{p(D)}

                    =\arg\max\limits_\beta\; p(D|\beta)p(\beta)

Using this formula we can define a loss function and turn training into a minimization problem:

                \hat{\beta}=\arg\max\limits_\beta\; p(D|\beta)p(\beta)

                    =\arg\min\limits_\beta\; -\,p(D|\beta)p(\beta)

                    \Leftrightarrow \arg\min\limits_\beta\; -\log\big(p(D|\beta)p(\beta)\big)

                    \Leftrightarrow \arg\min\limits_\beta\; -\big[\log p(D|\beta)+\log p(\beta)\big]

                    \Leftrightarrow \arg\min\limits_\beta\; -\Big[\sum\limits_{j<n}\log p(c_j|x_j,\beta)+ \sum\limits_{j}\log p(\beta_j|\delta^2)\Big]

Personally, I think that from the statistical-learning perspective, the first term of the equation above describes the bias (the empirical risk, ERM), while the second term describes the variance (the confidence risk).

3、L1-regularized model and L2-regularized model

        For the distribution p(\beta) of the model parameters \beta, the following priors are commonly assumed:

(1)、Gaussian Prior

                p(\beta)=\prod\limits_{c<k-1}\prod\limits_{i<d}\mathrm{Norm}(0,\delta_i^2)(\beta_{c,i})

                \mathrm{Norm}(0,\delta_i^2)(\beta_{c,i})=\frac{1}{\delta_i\sqrt{2\pi}}\, e^{-\frac{\beta_{c,i}^2}{2\delta_i^2}}

(2)、Laplace Prior

                p(\beta)=\prod\limits_{c<k-1}\prod\limits_{i<d}\mathrm{Laplace}(0,\delta_i^2)(\beta_{c,i})

                \mathrm{Laplace}(0,\delta_i^2)(\beta_{c,i})=\frac{1}{\delta_i\sqrt{2}}\, e^{-\sqrt{2}\,\frac{|\beta_{c,i}|}{\delta_i}}

When \beta \sim Gaussian prior, the model is called L2-regularized:

                \beta_{MAP}=\arg\min\limits_\beta\; -\Big[\sum\limits_{j<n}\log p(c_j|x_j,\beta)- C\cdot\sum\limits_{j}\beta_{j}^2\Big]

When \beta \sim Laplace prior, the model is called L1-regularized:

                \beta_{MAP}=\arg\min\limits_\beta\; -\Big[\sum\limits_{j<n}\log p(c_j|x_j,\beta)- C\cdot\sum\limits_{j}|\beta_{j}|\Big]

Here the constant C is a regulator that controls the trade-off between bias and variance:

        ● when C is very small, the likelihood is emphasized, which will cause overfitting;

        ● when C is very large, the regularization is emphasized, which will cause underfitting.

With the same parameters, the Gaussian prior and the Laplace prior compare as follows:

[Figure 1: density curves of the two priors — red: the Laplace prior; black: the Gaussian prior]

 

4、L1-regularized model or L2-regularized model?

The mainstream methods currently tend to choose L1 regularization, including the various L-BFGS variants (e.g., OWL-QN) and the various SGD methods, mainly for the following reasons:

        ● The objective we want to optimize is:

                \hat{\beta}=\arg\min\limits_\beta\; -\Big[\sum\limits_{j<n}\log p(c_j|x_j,\beta)+ \sum\limits_{j}\log p(\beta_j|\delta^2)\Big]

           As Figure 1 shows, for \log p(\beta_j|\delta^2) to reach its maximum, the weights must lie near the prior's mean (i.e. 0); clearly, weights subject to the Laplace prior are driven toward 0 faster than weights subject to the Gaussian prior;

        ● Taking gradient descent with k=2 as an example, the weight \beta is updated as follows (a code sketch of both rules appears after this list):

           ○ Gaussian Prior:         

                \beta_{i+1}=\beta_i+\lambda_i\Big[(y_i-p_i)x_i-\frac{\beta_i}{\delta_i^2}\Big]

           ○ Laplace Prior:

                When \beta_i>0: \beta_{i+1}=\beta_i+\lambda_i\Big[(y_i-p_i)x_i-\frac{\sqrt{2}}{\delta_i}\Big];

                When \beta_i<0: \beta_{i+1}=\beta_i+\lambda_i\Big[(y_i-p_i)x_i+\frac{\sqrt{2}}{\delta_i}\Big].

                When y_i-p_i and x_i have the same sign, the sample is classified correctly and the absolute value of the weight is updated at a relatively small rate; when y_i-p_i and x_i have opposite signs, a misclassification has occurred and the absolute value of the weight is updated at a relatively large rate.

        ● View the weight update as two stages, likelihood + regularization. Ignoring the likelihood stage for the moment, after k iterations the following relations hold:

           ○ Gaussian Prior:

                \beta_i^{k}=\beta_i^{0}\prod\limits_{t<k}\Big(1-\frac{\lambda_t}{\delta_i^2}\Big)

           ○ Laplace Prior:

                When \beta_i>0: \beta_i^{k}=\beta_i^{0}-\frac{\sqrt{2}}{\delta_i}\sum\limits_{t<k}\lambda_t;

                When \beta_i<0: \beta_i^{k}=\beta_i^{0}+\frac{\sqrt{2}}{\delta_i}\sum\limits_{t<k}\lambda_t.

            As k grows, the former tends to 0 in the limit but is never exactly 0, whereas the latter subtracts a constant at each update, which means that in theory the latter can update a weight to exactly 0.

        ● L1 regularization yields sparse feature weights, so model training is performing feature selection at the same time.

        ● If the input vectors are sparse, the Laplace prior keeps their gradients sparse as well.
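To make the k=2 comparison in the list above concrete, here is a minimal sketch of the two update rules (function names are my own; the formulas are exactly the ones given in the list):

```python
import numpy as np

def step_gaussian(beta, x, y, p, lam, delta):
    # L2 / Gaussian prior: the penalty shrinks each weight proportionally,
    # so it approaches 0 in the limit but never lands on 0 exactly.
    return beta + lam * ((y - p) * x - beta / delta**2)

def step_laplace(beta, x, y, p, lam, delta):
    # L1 / Laplace prior: the penalty subtracts a constant sqrt(2)/delta,
    # so a weight can reach (and, with truncation, stay at) exactly 0.
    return beta + lam * ((y - p) * x - np.sqrt(2.0) / delta * np.sign(beta))
```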

 

二、L1-Stochastic Gradient Descent

1、Naive Stochastic Gradient Descent

        The principle of stochastic gradient descent is to estimate the gradient of the objective using a randomly chosen subset of the training set; in the extreme case the subset contains only one sample. Taking that extreme case as an example, the weight update is:

                \beta_{i}^{k+1}=\beta_i^k+\lambda_k \big[(y_i-p_i)x_i- C\cdot\mathrm{sign}(\beta_i^k)\big]

                \mathrm{sign}(x)=\begin{cases} 1 & x>0\\ 0 & x=0\\ -1 & x<0 \end{cases}
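A minimal sketch of this naive step (the helper name is my own); note that the penalty touches every component of \beta, which is exactly the first drawback listed below:

```python
import numpy as np

def naive_l1_sgd_step(beta, x, y, p, lam, C):
    beta = beta + lam * (y - p) * x          # likelihood part of the gradient
    beta = beta - lam * C * np.sign(beta)    # L1 penalty applied to *every* feature
    return beta
```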

The drawbacks of this update scheme are as follows:

        ● Every iteration applies the L1 penalty to every feature, including unused features whose value is 0;

        ● In practice, the probability that an update lands a weight exactly on 0 is tiny, which means many features still end up non-zero.

2、Lazy Stochastic Gradient Descent

        To address these problems, Carpenter made an effective improvement in his paper "Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression" (2008), updating the weights as follows:

                                                       \beta_{i}^{k+1}=\beta_i^k+\lambda_k (y_i-p_i)x_i

                if \beta_i^{k+1}>0 then

                    \beta_{i}^{k+1}=\max(0,\,\beta_i^{k+1}-C\cdot \lambda_k)

                else if \beta_i^{k+1}<0 then

                    \beta_{i}^{k+1}=\min(0,\,\beta_i^{k+1}+C\cdot \lambda_k)
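A sketch of the truncated update above (names are my own). The actual lazy variant additionally keeps a per-feature timestamp so that penalties are only applied when a feature is next touched; that bookkeeping is omitted here:

```python
import numpy as np

def lazy_l1_sgd_step(beta, x, y, p, lam, C):
    beta = beta + lam * (y - p) * x                      # gradient step
    pos, neg = beta > 0, beta < 0
    beta[pos] = np.maximum(0.0, beta[pos] - C * lam)     # truncate at zero from above
    beta[neg] = np.minimum(0.0, beta[neg] + C * lam)     # truncate at zero from below
    return beta
```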

The advantages of this update scheme are as follows:

        ● The truncation keeps the penalty term from flipping the sign of a weight, and it lets zero weights emerge naturally;

        ● The algorithm works in a lazy fashion, skipping features whose value is 0, which speeds up training.

The drawback of this scheme:

        ● Because the true gradient is estimated rather coarsely, the weight updates fluctuate, as the figure below shows:

[figure: fluctuation of a weight value across updates]

3、Stochastic Gradient Descent with Cumulative Penalty

        This method comes from "Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty" (2009) by Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou; the weights are updated as follows:

 

                \beta_{i}^{k+\frac{1}{2}}=\beta_i^k+\lambda_k (y_i-p_i)x_i

                if \beta_i^{k+\frac{1}{2}}>0 then

                    \beta_{i}^{k+1}=\max\big(0,\,\beta_i^{k+\frac{1}{2}}- (u^k+q_i^{k-1})\big)

                else if \beta_i^{k+\frac{1}{2}}<0 then

                    \beta_{i}^{k+1}=\min\big(0,\,\beta_i^{k+\frac{1}{2}}+(u^k-q_i^{k-1})\big)

where:

           u^k=C\cdot \sum\limits_{t=1}^k{\lambda_t} is the cumulative penalty each weight could in theory have received by iteration k;

           q_i^k=\sum\limits_{t=1}^k{(\beta_i^{t+1}-\beta_i^{t+\frac{1}{2}})} is the cumulative penalty the weight has actually received so far.

The algorithm is described as follows:

[figure: pseudocode of the algorithm, from the paper]
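Since the algorithm figure did not survive reproduction, here is a sketch of one update step implementing the formulas above (variable names are my own):

```python
import numpy as np

def cumulative_penalty_step(beta, q, u, x, y, p, lam, C):
    """beta: (d,) weights; q: (d,) penalty actually received so far (q_i^{k-1});
    u: scalar total penalty available so far (u^k). Returns (beta, q, u)."""
    u += C * lam                              # u^k = C * sum_t lambda_t
    beta = beta + lam * (y - p) * x           # half-step beta^{k+1/2}
    half = beta.copy()
    pos, neg = beta > 0, beta < 0
    beta[pos] = np.maximum(0.0, beta[pos] - (u + q[pos]))
    beta[neg] = np.minimum(0.0, beta[neg] + (u - q[neg]))
    q = q + (beta - half)                     # accumulate the applied penalty
    return beta, q, u
```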

As for choosing the learning rate, the traditional schedule is:

                \lambda_k=\frac{\lambda_0}{1+k}, where k is the iteration index

In practice this schedule converges rather slowly, so the paper proposes the following instead:

                \lambda_k=\lambda_0\, \alpha^{-k}, where k is the iteration index

which performs better in practice. Note that in theory it does not guarantee final convergence, but since real runs are capped by a maximum number of iterations anyway, this is not a big problem.
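The two schedules side by side, as a trivial sketch (an \alpha slightly above 1 gives the exponential decay the paper has in mind):

```python
def lr_traditional(lam0, k):
    return lam0 / (1.0 + k)      # lambda_k = lambda_0 / (1 + k)

def lr_exponential(lam0, alpha, k):
    return lam0 * alpha ** (-k)  # lambda_k = lambda_0 * alpha^{-k}
```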

        Compared with the OWL-QN method proposed by Galen Andrew and Jianfeng Gao in "Scalable training of L1-regularized log-linear models" (2007), the results are as follows:

[figures: experimental comparison with OWL-QN, from the paper]

 

4、Online Stochastic Gradient Descent

        Because the L1-regularized weight-update term is a constant independent of the weights, updating once with a batch of N samples has exactly the same effect as updating N times with one sample each; therefore this approach only needs to keep a single sample plus the model parameters in memory.

5、Parallelized Stochastic Gradient Descent

        Martin A. Zinkevich, Markus Weimer, Alex Smola, and Lihong Li describe a simple and intuitive parallelization method in "Parallelized Stochastic Gradient Descent":

[figures: the parallelized SGD algorithm and its analysis, from the paper]
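The gist of the scheme in the missing figures is shard-then-average; here is a schematic sketch (the sequential list comprehension stands in for the parallel map, and `sgd_train` stands for any single-machine SGD routine):

```python
import numpy as np

def parallel_sgd(samples, n_workers, sgd_train):
    shards = np.array_split(samples, n_workers)        # 1. partition the data
    weights = [sgd_train(shard) for shard in shards]   # 2. run SGD independently per shard
    return np.mean(weights, axis=0)                    # 3. average the weight vectors
```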

        As a next step I am considering implementing this algorithm on Spark; it will have to be tested in practice.

 

三、References

1、Galen Andrew and Jianfeng Gao. 2007. "Scalable training of L1-regularized log-linear models". In Proceedings of ICML, pages 33–40.

2、Bob Carpenter. 2008. "Lazy sparse stochastic gradient descent for regularized multinomial logistic regression". Technical report, Alias-i.

3、Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2009. "Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty".

4、Martin A. Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. "Parallelized stochastic gradient descent". Yahoo! Labs.

5、John Langford, Lihong Li, and Tong Zhang. 2009. "Sparse online learning via truncated gradient". The Journal of Machine Learning Research (JMLR), 10:777–801.

6、Charles Elkan. 2012. "Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training".

 

四、Related Open-Source Software

1、wapiti:http://wapiti.limsi.fr/ 

2、sgd2.0:http://mloss.org/revision/view/842/ 

3、scikit-learn:http://scikit-learn.org/stable/

4、Vowpal Wabbit:http://hunch.net/~vw/

5、deeplearning:http://deeplearning.net/

6、LingPipe:http://alias-i.com/lingpipe/index.html

Reproduced from: https://www.cnblogs.com/vivounicorn/archive/2012/02/24/2365328.html
