Deep learning: forty-three (training deep networks with the Hessian Free method)

At present, the mainstream way to train the weights of deep networks (Deep Nets) is gradient descent (combined with the BP algorithm); of course, unsupervised methods (such as RBMs or Autoencoders) can be used to pre-train the weights first. One disadvantage of gradient descent applied to deep networks is that the weight changes per iteration are small, so it easily converges to a poor local optimum; another is that gradient descent handles error functions with pathological curvature (such as the Rosenbrock function) badly. The Hessian-Free method (hereafter HF) described here can train a network quite well without any pre-training of the weights, it applies more widely (it can be used to learn RNNs and similar networks), and it overcomes the two disadvantages of gradient descent above. The main idea of HF is similar to Newton's method, but it never explicitly computes the Hessian matrix H of the error surface at a point; instead it computes the product Hv of H with an arbitrary vector v (by some technique this matrix-vector product can be computed directly, and this is exactly the form needed later in the optimization), hence the name "Hessian-Free". This post is a set of notes I wrote down after reading Martens' paper Deep learning via Hessian-free optimization (for details please refer to the paper itself).

  The Hessian-Free method has in fact been around for many years, so why does Martens' paper return to it? The difference lies in how the matrix H is made "free": there are many ways to compute with H only implicitly, and any of them could be called HF, so one cannot simply assume that the HF method Martens applies to Deep Learning is the same method that has existed for a long time (if it were, Deep Learning would have taken off N years ago!); they merely share a similar idea. Applying the HF idea to a DL network requires a variety of techniques, and mathematical optimization itself is largely a combination of such techniques. Martens mainly uses two big ideas plus five small tricks (the paper lists about five; the code contains many more tricks that the paper does not mention) to train the DL network.

  Idea 1: compute Hv (for an arbitrary vector v) by some method, for example the finite-difference approximation below, which computes Hv essentially exactly; this preserves far more information than the commonly used diagonal approximation of the error function's Hessian. By computing Hv implicitly one also avoids inverting H directly: first, H is far too large; second, its inverse may not even exist.

  $$Hv \;=\; \lim_{\varepsilon \to 0} \frac{\nabla f(\theta + \varepsilon v) - \nabla f(\theta)}{\varepsilon} \;\approx\; \frac{\nabla f(\theta + \varepsilon v) - \nabla f(\theta)}{\varepsilon} \quad (\varepsilon \text{ small})$$
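
  To make Idea 1 concrete, here is a minimal sketch of the finite-difference form of Hv written in JAX (my own illustration, not Martens' code; hv_finite_diff and the toy quadratic are placeholders):

import jax
import jax.numpy as jnp

def hv_finite_diff(f, theta, v, eps=1e-4):
    # Hv ~ (grad f(theta + eps*v) - grad f(theta)) / eps, exact as eps -> 0
    g = jax.grad(f)
    return (g(theta + eps * v) - g(theta)) / eps

# toy check: for f(theta) = 0.5 theta^T A theta the Hessian is A, so the result should be close to A v
A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda th: 0.5 * th @ A @ th
theta = jnp.array([1.0, -1.0])
v = jnp.array([0.5, 2.0])
print(hv_finite_diff(f, theta, v))   # approximately A @ v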

  Idea 2: approximate the objective near the current parameter value θ by the quadratic function below, and obtain the search direction p with the iterative CG method (for a brief introduction to the algorithm see the earlier blog post: Machine Learning & Data Mining notes _12 (simple understanding of the Conjugate Gradient optimization)).

  $$f(\theta + p) \;\approx\; q_{\theta}(p) \;=\; f(\theta) + \nabla f(\theta)^{T} p + \tfrac{1}{2}\, p^{T} B\, p$$

  where B is the curvature matrix at θ (the Hessian H, or, after Tip 2 below, the Gauss-Newton matrix G, plus the damping term λI from Tip 5).
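
  As an illustration of Idea 2 (again my own sketch, not the paper's implementation): given the gradient g = ∇f(θ) and a routine bprod(v) that returns Bv without forming B, plain CG minimizes the quadratic model, i.e. solves Bp = −g. Martens' version additionally uses the preconditioner and the stopping rule from Tips 3 and 5.

import jax.numpy as jnp

def cg_min_quadratic(bprod, g, max_iters=250, tol=1e-10):
    """Minimize q(p) = g^T p + 0.5 p^T B p with CG, i.e. solve B p = -g.
    bprod(v) must return B v without ever forming B."""
    p = jnp.zeros_like(g)
    r = -g                        # residual of B p = -g at p = 0
    d = r
    rs = r @ r
    for _ in range(max_iters):
        Bd = bprod(d)
        alpha = rs / (d @ Bd)
        p = p + alpha * d
        r = r - alpha * Bd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# toy usage with an explicit matrix (in HF, bprod would call the Hv/Gv routine instead)
B = jnp.array([[3.0, 1.0], [1.0, 2.0]])
g = jnp.array([1.0, -2.0])
print(cg_min_quadratic(lambda v: B @ v, g), jnp.linalg.solve(B, -g))  # the two should match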

  Tip 1: do not compute Hv by the finite-difference method directly; instead use Pearlmutter's R-operator method (I haven't fully understood it, but it has many advantages).
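
  In a modern autodiff framework the effect of the R-operator can be obtained as a forward-mode derivative taken through the gradient, which gives Hv exactly with one forward and one backward pass. A small sketch of that idea (mine, not the R{.} recursions written out by Pearlmutter):

import jax
import jax.numpy as jnp

def hvp(f, theta, v):
    # exact H v: differentiate grad(f) in the direction v (forward-over-reverse)
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

# same toy quadratic as before: the result is exactly A @ v, no eps needed
A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda th: 0.5 * th @ A @ th
print(hvp(f, jnp.array([1.0, -1.0]), jnp.array([0.5, 2.0])))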

  Tip 2: use the Gauss-Newton matrix G in place of the Hessian matrix H, so what is actually computed implicitly is Gv rather than Hv.
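
  The Gauss-Newton matrix is G = Jᵀ H_L J, where J is the Jacobian of the network outputs with respect to the parameters and H_L is the Hessian of the loss with respect to the outputs; Gv can therefore be computed matrix-free as a JVP through the network, an exact Hessian-vector product through the loss, and a VJP back through the network. A hedged sketch under those assumptions (the tiny linear "network" and all names here are mine, not taken from Martens' computeGV):

import jax
import jax.numpy as jnp

def gauss_newton_vp(net, loss, theta, v, x, y):
    """G v with G = J^T H_L J, never forming J or H_L explicitly."""
    f = lambda th: net(th, x)                     # network outputs as a function of the params
    z, Jv = jax.jvp(f, (theta,), (v,))            # forward pass plus J v
    HL_Jv = jax.jvp(jax.grad(lambda zz: loss(zz, y)), (z,), (Jv,))[1]   # H_L (J v)
    _, vjp_fn = jax.vjp(f, theta)
    return vjp_fn(HL_Jv)[0]                       # J^T H_L J v

# toy usage: a linear "network" z = W x with W stored as a flat parameter vector
def net(theta, x):
    return theta.reshape(2, 3) @ x

loss = lambda z, y: 0.5 * jnp.sum((z - y) ** 2)   # squared error, so here H_L = I and G = J^T J
theta = jnp.arange(6.0)
print(gauss_newton_vp(net, loss, theta, jnp.ones(6), jnp.ones(3), jnp.zeros(2)))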

  Tip 3: give an explicit termination condition for the CG iterations (preconditioned CG, i.e. a linear coordinate transformation is first applied to the parameters θ) used to find the search direction p, namely, terminate CG at iteration i if:

  $$i > k, \qquad \phi(x_i) < 0, \qquad \frac{\phi(x_i) - \phi(x_{i-k})}{\phi(x_i)} < k\,\varepsilon,$$

  where φ(x) = ½ xᵀ A x − bᵀ x is the quadratic that CG is minimizing, k = max(10, 0.1·i), and ε = 0.0005 in the paper.
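
  A small sketch of that stopping rule (the constants k = max(10, 0.1·i) and ε = 0.0005 are the ones I recall from the paper; treat them as defaults to check against the paper/code):

def cg_should_stop(phi_history, i, eps=5e-4):
    # phi_history[j] = phi(x_j) = 0.5 x_j^T A x_j - b^T x_j for the j-th CG iterate
    k = max(10, int(0.1 * i))
    if i <= k:
        return False
    phi_i, phi_ik = phi_history[i], phi_history[i - k]
    return phi_i < 0 and (phi_i - phi_ik) / phi_i < k * eps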

  Tip 4: when training on large datasets, the CG search does not use all of the samples but only a mini-batch, since a mini-batch already provides reasonably good information about the curvature of the error surface.

  Tip 5: the damping coefficient λ is obtained by a heuristic (Levenberg-Marquardt style) method; this coefficient also appears in the pre-conditioner used by the CG algorithm, whose preconditioning matrix M is computed as:

  $$M \;=\; \left[\mathrm{diag}\!\left(\sum_{i=1}^{D} \nabla f_i(\theta) \odot \nabla f_i(\theta)\right) + \lambda I\right]^{3/4},$$

  where ⊙ is the element-wise product, f_i is the objective on the i-th training case, and D is the number of cases.
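
  A sketch of Tip 5, assuming the standard Levenberg-Marquardt adjustment Martens describes (ρ is the ratio of the actual reduction of f to the reduction predicted by the quadratic model q; grads_per_case stacks the per-case gradients ∇f_i(θ) as rows; the function names are mine):

import jax.numpy as jnp

def update_lambda(lam, rho):
    # LM heuristic: trust the quadratic model more (smaller lambda) when rho is large
    if rho < 0.25:
        return lam * 1.5
    if rho > 0.75:
        return lam * 2.0 / 3.0
    return lam

def preconditioner_diag(grads_per_case, lam, alpha=0.75):
    # diagonal of M = [ diag( sum_i grad f_i (.) grad f_i ) + lambda I ]^alpha;
    # CG is then preconditioned by solving M z = r elementwise
    return (jnp.sum(grads_per_case ** 2, axis=0) + lam) ** alpha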

  The above seven points (2 ideas + 5 tips) are best understood by reading Martens' paper together with the corresponding code (available from his homepage: http://www.cs.toronto.edu/~jmartens/research.html ).

  Let's look at a simple flow chart of HF:

  [Flowchart of the HF training procedure; its steps are explained below.]

  Explanation of the flowchart: first the system's objective function is defined (see the second formula in this post), and a maximum number of weight updates max-epochs is set. In each loop, the gradient of the objective at the current parameters is first obtained by the BP algorithm; the value of λ is then obtained by the heuristic method; next, the preconditioned CG method is used to find the search direction p for the parameters (this step needs the matrix-vector product Hv, or rather Gv, in functional form); finally the parameters are updated and the next round begins.
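
  Putting the pieces together, the outer loop that the flowchart describes looks roughly like this (a sketch only, reusing the hvp, cg_min_quadratic and update_lambda helpers from above; nnet_train_2 does considerably more, e.g. Gv instead of Hv, preconditioning, mini-batches and CG backtracking):

import jax

def hf_train(f, theta, max_epochs=200, lam=1.0):
    grad_f = jax.grad(f)
    for epoch in range(max_epochs):
        g = grad_f(theta)                               # 1. gradient of the objective (BP)
        bprod = lambda v: hvp(f, theta, v) + lam * v    # 2. damped curvature-vector product
        p = cg_min_quadratic(bprod, g)                  # 3. search direction from CG
        q_red = g @ p + 0.5 * (p @ bprod(p))            #    model reduction q(p) - q(0)
        rho = (f(theta + p) - f(theta)) / q_red         # 4. LM ratio ...
        lam = update_lambda(lam, rho)                   #    ... adjusts the damping
        theta = theta + p                               # 5. update and start the next round
    return theta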

 

  Some brief notes on the code:

  Running the program for about 170 iterations took almost 20 hours in total; the program output looks like this:

maxiters = 250; miniters = 1
CG steps used: 250, total is: 28170
ch magnitude : 7.7864
Chose iters : 152
rho = 0.50606
Number of reductions : 0, chosen rate: 1
New lambda: 0.0001043
epoch: 169, Log likelihood: -54.7121, error rate: 0.25552
TEST Log likelihood: -87.9114, error rate: 3.5457
Error rate difference (test - train): 3.2901

  The code runs on the CURVES database, using an autoencoder network whose layer sizes are [784 400 200 100 50 25 6 25 50 100 200 400 784].

  conjgrad_1(): implements the preconditioned CG optimization; the function returns the number of CG steps used and the result of the iterative optimization (the search direction).

  computeGV(): computes the matrix-vector product Gv, combining the R-operator with the Gauss-Newton matrix.

  computeLL(): computes the log-likelihood of the sample outputs and the error value (the error function differs depending on the activation function of the output nodes).

  nnet_train_2(): this is of course the core function, used directly to train the DL network; it ultimately calls conjgrad_1(), computeGV() and computeLL(). The part that computes the derivatives of the error-surface function is the classic BP algorithm.

 

 

  References:

  Martens, J. (2010). Deep learning via Hessian-free optimization. Proceedings of the 27th International Conference on Machine Learning (ICML-10).

     Machine Learning & Data Mining notes _12 (simple understanding of the Conjugate Gradient optimization)

     http://www.cs.toronto.edu/~jmartens/research.html

     http://pillowlab.wordpress.com/2013/06/11/lab-meeting-6102013-hessian-free-optimization/
