Deep Learning (花书, the "flower book")

Chapter 4 Numerical Computation
Numerical computation generally refers to algorithms that solve mathematical problems by iteratively updating estimates of the solution, rather than analytically deriving a symbolic expression for the answer.


4.1 Overflow and Underflow
One devastating form of rounding error is underflow, which occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is exactly zero rather than a small positive number. Another damaging numerical error is overflow, which occurs when numbers of large magnitude are approximated as positive or negative infinity. Further arithmetic usually turns these infinite values into not-a-number (NaN).
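The book's running example for this section is the softmax function, whose exponentials overflow for large arguments and underflow for very negative ones. A minimal NumPy sketch (variable names are my own) of the standard stabilization trick of subtracting max(x) before exponentiating:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax.

    Subtracting max(x) does not change the result mathematically, but it
    guarantees the largest exponent is 0: exp can no longer overflow, and
    the denominator contains at least one term equal to 1, so it cannot
    underflow to 0.
    """
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1000.0, 1000.0])
# The naive version computes np.exp(1000) = inf and returns nan.
print(softmax(x))  # [1/3, 1/3, 1/3], as expected
```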


4.2 Poor Conditioning
The condition number measures how rapidly a function changes with respect to small changes in its input. Functions with a large condition number are problematic because rounding errors in the input can produce large changes in the output.
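For a matrix A, the book illustrates this with the function f(x) = A^{-1} x, whose condition number is the ratio of the largest to smallest eigenvalue magnitude. A small NumPy illustration, using an ill-conditioned 2x2 system of my own construction:

```python
import numpy as np

# A nearly singular matrix: its rows are almost parallel.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
print(np.linalg.cond(A))  # ~4e4: solving A x = b amplifies input error

b = np.array([2.0, 2.0001])
x = np.linalg.solve(A, b)                               # exact solution: [1, 1]
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-4]))
print(x, x_pert)  # a 1e-4 perturbation of b moves the solution to [0, 2]
```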


4.3 Gradient-based optimization methods
SGD can get stuck at saddle points, and saddle points are far more common than local minima in high-dimensional spaces. In recent years it has been argued that when training very large neural networks, the difficulty is more about saddle points and less about local minima. Moreover, the problem arises not only exactly at a saddle point but also near it, where the gradient is very small.
Gradient descent (steepest descent) reduces f(x) by moving x in small steps in the direction opposite the gradient: moving to x' = x - \epsilon \nabla_x f(x) decreases f(x).
A common way to choose the learning rate \epsilon is to pick a small constant. Another approach is to evaluate f(x') for several different learning rates and choose the one that yields the smallest objective function value; this strategy is called a line search.
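A minimal sketch of both strategies on a toy quadratic (the function and the candidate step sizes are my own choices):

```python
import numpy as np

def f(x):
    return 0.5 * x @ x  # simple quadratic bowl, minimum at the origin

def grad_f(x):
    return x

def gradient_descent(x, lr=0.1, steps=100):
    """Fixed small constant learning rate."""
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

def gd_line_search(x, lrs=(1.0, 0.3, 0.1, 0.03), steps=100):
    """Crude line search: evaluate f(x') for several learning rates and
    keep the candidate with the smallest objective value."""
    for _ in range(steps):
        x = min((x - lr * grad_f(x) for lr in lrs), key=f)
    return x

x0 = np.array([3.0, -2.0])
print(gradient_descent(x0), gd_line_search(x0))  # both approach [0, 0]
```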


4.4 Constrained Optimization
Sometimes we wish to find the maximum or minimum of f(x) over some subset S of values of x, rather than over all possible values. This is called constrained optimization. In constrained-optimization terminology, a point x lying within the set S is called a feasible point.
A simple approach to constrained optimization is to modify gradient descent to take the constraint into account. With a small constant step size \epsilon, we can take a single gradient-descent step and then project the result back into S. With a line search, we can either search only over feasible new points x within the range of step size \epsilon, or project each point on the line back into the constraint region. When possible, it is more efficient to project the gradient into the tangent space of the feasible region before taking the step or beginning the line search.
A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution of the original constrained problem. For example, to minimize f(x) for x in R^2 with x constrained to have exactly unit L2 norm, we can instead minimize g(\theta) = f([cos \theta, sin \theta]^T) with respect to \theta, and finally return [cos \theta, sin \theta] as the solution to the original problem. This approach requires creativity: the transformation between optimization problems must be designed specifically for each case we encounter.
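A sketch of both approaches for the unit-circle constraint described above (the linear objective, step size, and starting point are my own choices):

```python
import numpy as np

# Toy objective f(x) = c . x on R^2, constrained to the unit circle
# ||x||_2 = 1; the constrained minimum is -c / ||c||_2.
c = np.array([1.0, 2.0])

# (a) Projected gradient descent: take a gradient step, then project
#     the result back into the feasible set S.
x = np.array([1.0, 0.0])
for _ in range(500):
    x = x - 0.05 * c              # the gradient of f is the constant c
    x = x / np.linalg.norm(x)     # project back onto the unit circle

# (b) Reparametrization: minimize g(theta) = f([cos theta, sin theta])
#     over the unconstrained variable theta.
theta = 0.0
for _ in range(500):
    # chain rule: dg/dtheta = c . [-sin theta, cos theta]
    theta -= 0.05 * (c @ np.array([-np.sin(theta), np.cos(theta)]))

print(x, np.array([np.cos(theta), np.sin(theta)]))  # both ~ -c / sqrt(5)
```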
The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. To introduce it, we define the generalized Lagrangian. First we describe S in terms of equality and inequality constraints: S = {x | \forall i, g^{(i)}(x) = 0 and \forall j, h^{(j)}(x) <= 0}, where the g^{(i)} are the equality constraints and the h^{(j)} are the inequality constraints. The generalized Lagrangian is then L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x), with KKT multipliers \lambda_i and \alpha_j.
We can describe the optimal points of a constrained optimization problem with a simple set of properties called the KKT conditions:
(1) The gradient of the generalized Lagrangian is 0.
(2) All constraints on both x and the KKT multipliers are satisfied (in particular, \alpha >= 0).
(3) The inequality constraints exhibit "complementary slackness": the elementwise product \alpha ⊙ h(x) = 0, i.e., for each j either \alpha_j = 0 or h^{(j)}(x) = 0. A numeric check follows this list.
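A tiny check of these conditions on a problem instance of my own choosing: minimize f(x) = (x - 2)^2 subject to h(x) = x - 1 <= 0. The constraint is active at the solution x* = 1, with multiplier \alpha = 2.

```python
# Generalized Lagrangian: L(x, alpha) = (x - 2)**2 + alpha * (x - 1)
x_star, alpha = 1.0, 2.0

grad_L = 2 * (x_star - 2) + alpha        # (1) stationarity: dL/dx = 0
h = x_star - 1                           # inequality constraint value
print(grad_L == 0.0)                     # True
print(h <= 0.0 and alpha >= 0.0)         # (2) primal and dual feasibility
print(alpha * h == 0.0)                  # (3) complementary slackness
```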


About stochastic gradient descent:
Our gradients come from minibatches, so they can be noisy. If there is noise in the gradient estimates, vanilla SGD tends to meander around the parameter space and can take a long time to reach the minimum.

So in practice SGD has these shortcomings, and switching to the full gradient does not solve the problems present in SGD either.
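A toy sketch of that meandering (the data and model are my own choices, not from any lecture): estimating the mean of a dataset by minimizing squared error with minibatch gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # true minimum: theta = 5

theta, lr, batch = 0.0, 0.1, 8
path = []
for _ in range(300):
    xb = rng.choice(data, size=batch)   # small minibatch
    grad = theta - xb.mean()            # noisy estimate of (theta - E[x])
    theta -= lr * grad
    path.append(theta)

# theta approaches 5 but keeps jittering around it: the minibatch gradient
# is an unbiased yet noisy estimate, so vanilla SGD meanders near the minimum.
print(theta, np.std(path[-100:]))
```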


5.6 Bayesian Statistics
Idea: consider all values of \theta when making predictions.
In general, machine learning practitioners choose a fairly broad (i.e., high-entropy) prior distribution that reflects a high degree of uncertainty about the value of the parameter \theta before any data is observed.
In common Bayesian practice, one begins with a relatively uniform prior or a high-entropy Gaussian prior; observing the data usually reduces the entropy of the posterior and concentrates it around a few highly probable values of the parameters.
There are two important differences between Bayesian estimation and maximum likelihood estimation. First, unlike ML methods, which predict using a point estimate of \theta, the Bayesian approach predicts using the full distribution over \theta. Every value of \theta with positive probability density contributes to the prediction of the next sample, weighted by the posterior density itself. After observing the dataset {x^(1), ..., x^(m)}, if we are still quite uncertain about the value of \theta, this uncertainty is carried directly into any predictions we make.
The second important difference between the Bayesian approach and ML estimation comes from the Bayesian prior distribution. The prior shifts probability mass density toward regions of parameter space that are preferred a priori.
Bayesian methods typically generalize much better when limited training data is available, but usually suffer from high computational cost when the number of training examples is large.
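A conjugate coin-flip sketch of the first difference (the prior and data are my own choices): the Bayesian prediction integrates over every value of \theta weighted by its posterior density, rather than plugging in a single point estimate.

```python
# Model: x ~ Bernoulli(theta), prior theta ~ Beta(a, b).
# After observing k heads in n flips, the posterior is Beta(a + k, b + n - k),
# and the posterior predictive integrates over all theta:
#     P(next = heads | data) = E_posterior[theta] = (a + k) / (a + b + n)
a, b = 1.0, 1.0        # uniform, high-entropy prior
k, n = 9, 10           # observed: 9 heads out of 10 flips

theta_ml = k / n                        # maximum likelihood point estimate
pred_bayes = (a + k) / (a + b + n)      # full-posterior prediction

# 0.9 vs ~0.83: with little data the Bayesian prediction stays closer to
# the prior, directly reflecting the remaining uncertainty about theta.
print(theta_ml, pred_bayes)
```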


5.6.1 Maximum a posteriori (MAP) estimation
One reason to prefer point estimation is that, for most meaningful models, most computations involving the Bayesian posterior are intractable; a point estimate offers a tractable approximation.
Moreover, computing the MAP estimate requires only the log-likelihood and the log-prior: \theta_MAP = argmax_\theta p(\theta|x) = argmax_\theta [log p(x|\theta) + log p(\theta)].
As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information brought by the prior that cannot be found in the training data. This additional information helps reduce the variance of the MAP point estimate (compared with the ML estimate), at the cost of increased bias.
Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as the MAP approximation to Bayesian inference. For example, a zero-mean Gaussian prior on the weights makes the log-prior term proportional to the familiar \lambda w^T w weight-decay penalty.
MAP Bayesian inference provides an intuitive way to design complex but interpretable regularization terms.
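A sketch of the weight-decay correspondence on synthetic linear-regression data (the data and \lambda are my own choices): with a zero-mean Gaussian prior on the weights, maximizing log p(y|X, w) + log p(w) gives exactly the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)

lam = 1.0  # strength implied by the variance of the Gaussian prior

# Maximum likelihood: argmin ||X w - y||^2
w_ml = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior w ~ N(0, (sigma^2 / lam) I): the log-prior contributes
# the weight-decay term lam * ||w||^2, shrinking the estimate toward 0.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(w_ml, w_map)  # w_map is biased toward 0 but has lower variance
```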
