Reinforcement Learning: Stochastic Approximation and Stochastic Gradient Descent

Mean Estimation

  From the previous study, we know that an expectation can be approximated by averaging many samples. To compute the mean $\bar{x}$, there are two methods. The first is to collect all the samples, sum them, and divide by their number; this method is inefficient because we must wait until all samples have been collected.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The second method is an incremental (iterative) calculation: the estimate is updated as each new sample arrives, so it can be used immediately. The specific calculation is as follows:
$$w_{k+1} = w_k - \frac{1}{k}\,(w_k - x_k)$$
where $w_{k+1}$ is the average of the first $k$ samples $x_1, \dots, x_k$.
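The incremental update can be sketched in a few lines of Python (a minimal illustration; the function name, distribution, and seed are arbitrary, not from the original post):

```python
import random

# Incremental mean estimation: w_{k+1} = w_k - (1/k) * (w_k - x_k).
# After processing k samples, w equals the sample mean of x_1 .. x_k.
def incremental_mean(samples):
    w = 0.0                          # w_1; overwritten at k = 1 since a_1 = 1
    for k, x in enumerate(samples, start=1):
        w -= (1.0 / k) * (w - x)
    return w

random.seed(0)
samples = [random.gauss(5.0, 2.0) for _ in range(10_000)]
print(incremental_mean(samples))     # matches sum(samples) / len(samples)
```

The estimate after each step is usable immediately, which is exactly the advantage over the sum-then-divide method.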

Stochastic Approximation: Robbins-Monro (RM)

  Suppose we now need to solve the equation:
$$g(w) = 0$$

There are two cases: one where the expression of the function is known, and one where it is not. If the expression is unknown, how can we solve the equation? The RM algorithm can. Taking this root-finding problem as an example, we now study the RM algorithm, which is an iterative algorithm.
$$w_{k+1} = w_k - a_k\,\tilde{g}(w_k, \eta_k), \qquad k = 1, 2, 3, \dots$$
where $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is a noisy observation of $g(w_k)$.
Here $a_k$ is a positive coefficient, $\{w_k\}$ is the input sequence, and $\{\tilde{g}(w_k, \eta_k)\}$ is the noisy observation (output) sequence. For ease of understanding, consider a concrete example:
$$g(w) = \tanh(w - 1)$$
The true root of $g(w) = 0$ is $w^* = 1$. Given the initial value $w_1 = 3$, $a_k = 1/k$, and $\eta_k = 0$, the iterates converge to $w^* = 1$ as shown below:
*(figure: the iterates $w_k$ converging to $w^* = 1$)*
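This example can be reproduced with a short sketch (illustrative only; the iteration count is arbitrary):

```python
import math

# Robbins-Monro iteration for g(w) = tanh(w - 1) = 0:
#     w_{k+1} = w_k - a_k * g_tilde(w_k, eta_k),  with a_k = 1/k and eta_k = 0
w = 3.0                              # initial guess w_1 = 3
for k in range(1, 1001):
    g_tilde = math.tanh(w - 1.0)     # noise-free observation of g(w_k)
    w -= (1.0 / k) * g_tilde
print(w)                             # approaches the true root w* = 1
```

With noise ($\eta_k \ne 0$) the same loop still converges under the conditions given in the next section, just more slowly.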

RM Algorithm: Convergence Analysis

  The above analysis is intuitive but not rigorous. We now give mathematically rigorous convergence conditions.

In the Robbins-Monro algorithm, if
1) $0 < c_1 \le \nabla_w g(w) \le c_2$ for all $w$;
2) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
3) $E[\eta_k \mid \mathcal{H}_k] = 0$ and $E[\eta_k^2 \mid \mathcal{H}_k] < \infty$, where $\mathcal{H}_k = \{w_k, w_{k-1}, \dots\}$;

then $w_k$ converges with probability 1 to the root $w^*$ satisfying $g(w^*) = 0$.
  Condition 1: the function $g$ is increasing and its gradient is bounded.

  Condition 2: regarding the coefficients $a_k$.

  The sum of $a_k$ should equal infinity. Why? Writing out the RM updates one by one and summing them gives:
$$w_{\infty} - w_1 = -\sum_{k=1}^{\infty} a_k\,\tilde{g}(w_k, \eta_k)$$
This shows that if $\sum_k a_k < \infty$, then $\sum_k a_k \tilde{g}(w_k, \eta_k)$ is bounded, so an initial value $w_1$ far from $w^*$ could never reach the root; $w_1$ could not be set arbitrarily. When $\sum_k a_k = \infty$, we can safely choose any initial value $w_1$.

  The sum of $a_k^2$ should be finite, which implies that $a_k$ converges to 0. Why is this needed?
$$w_{k+1} - w_k = -a_k\,\tilde{g}(w_k, \eta_k)$$
It can be seen that if $a_k \to 0$, then $a_k \tilde{g}(w_k, \eta_k) \to 0$, i.e. $w_{k+1} - w_k \to 0$, so the iterates can settle down.

  In many reinforcement learning algorithms, $a_k$ is usually chosen as a sufficiently small constant, because $1/k$ becomes smaller and smaller and makes later updates inefficient. Although the second condition is then not satisfied, the algorithm can still work well in practice because the actual number of iterations is finite.

  Condition 3: regarding the noise $\eta_k$ — its expectation is 0 and its variance is bounded.

Stochastic Gradient Descent: SGD

  The stochastic gradient descent (SGD) algorithm is widely used in the field of machine learning and RL. SGD is a special RM algorithm, and the mean estimation algorithm is a special SGD algorithm.

  Suppose our goal is to solve the following optimization problem:
$$\min_{w} \; J(w) = E\big[f(w, X)\big]$$
$w$ is the parameter to be optimized and $X$ is a random variable; the expectation is taken with respect to $X$. Both $w$ and $X$ can be scalars or vectors, and the function $f(\cdot)$ is scalar-valued.

  Now, our goal is to find the optimal $w$ that minimizes the objective function. There are three ways to solve it:

Gradient Descent (GD)

$$w_{k+1} = w_k - a_k\,\nabla_w E\big[f(w_k, X)\big] = w_k - a_k\,E\big[\nabla_w f(w_k, X)\big]$$
The idea of the GD algorithm is to move along the negative gradient direction toward the minimum, so that the iterates approach the true solution $w^*$; $a_k$ is the step size, which controls the speed of descent. Its limitation is that the expected gradient must be known. If it is unknown, how can we proceed?

Batch Gradient Descent (BGD)

  Batch gradient descent uses data to avoid requiring the exact expected gradient. With a large number of samples, the sample average approximates the expectation of the gradient, as follows:

$$w_{k+1} = w_k - a_k\,\frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i)$$
But the problem is that for each $w_k$, many samples must be drawn.

Stochastic Gradient Descent (SGD)

$$w_{k+1} = w_k - a_k\,\nabla_w f(w_k, x_k)$$
  Compared with gradient descent, SGD replaces the true gradient $E[\nabla_w f(w_k, X)]$ with the stochastic gradient $\nabla_w f(w_k, x_k)$; compared with batch gradient descent, SGD is the case $n = 1$, i.e. only one sample is drawn per iteration.

  Consider the following optimization example:
$$\min_w \; J(w) = E\left[\frac{1}{2}\,\|w - X\|^2\right]$$
   The optimal solution is $w^* = E[X]$: a necessary condition for $J(w)$ to attain its minimum is $\nabla J(w) = 0$, i.e. $w - E[X] = 0$, which gives $w^* = E[X]$.

   Solve with gradient descent (GD):
$$w_{k+1} = w_k - a_k\,\nabla_w J(w_k) = w_k - a_k\,E\big[w_k - X\big]$$

   Solve with stochastic gradient descent (SGD):
$$w_{k+1} = w_k - a_k\,(w_k - x_k)$$
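Note that with $a_k = 1/k$ this SGD update coincides exactly with the incremental mean-estimation algorithm from the beginning. A small sketch makes the equivalence concrete (distribution and seed are arbitrary, not from the original post):

```python
import random

# SGD for J(w) = E[0.5 * (w - X)^2]: the stochastic gradient is (w - x_k).
# With a_k = 1/k the update is exactly incremental mean estimation.
random.seed(1)
xs = [random.gauss(2.0, 1.0) for _ in range(5000)]

w = 0.0
for k, x in enumerate(xs, start=1):
    w -= (1.0 / k) * (w - x)         # w_{k+1} = w_k - a_k * (w_k - x_k)

print(w)   # equals the sample mean of xs (up to floating-point rounding)
```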

SGD Algorithm: Convergence Analysis

  The basic idea behind SGD comes from GD: since the expectation $E$ is unknown, we simply drop it and use a single sample to approximate it, replacing the true gradient with a stochastic gradient. This is SGD. Clearly there is an error between the two, as follows:
$$\nabla_w f(w_k, x_k) = E\big[\nabla_w f(w_k, X)\big] + \underbrace{\nabla_w f(w_k, x_k) - E\big[\nabla_w f(w_k, X)\big]}_{\eta_k}$$

Given this error, can SGD still find the optimal solution? The answer is yes. Why? Because SGD is a special RM algorithm. The proof is as follows:

  The problem SGD solves is to minimize $J(w) = E[f(w, X)]$. This optimization problem can be converted into solving the equation $g(w) = \nabla_w J(w) = 0$, which can then be handled by the RM algorithm. So the SGD algorithm is in fact an RM algorithm for the special root-finding problem $g(w) = 0$.

$$g(w) = \nabla_w J(w) = E\big[\nabla_w f(w, X)\big]$$
The noisy observation of $g(w)$ is
$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = E\big[\nabla_w f(w, X)\big] + \underbrace{\nabla_w f(w, x) - E\big[\nabla_w f(w, X)\big]}_{\eta},$$
so the RM update $w_{k+1} = w_k - a_k\,\tilde{g}(w_k, \eta_k)$ is exactly the SGD update $w_{k+1} = w_k - a_k\,\nabla_w f(w_k, x_k)$.

  Because SGD is a special RM algorithm, the convergence results for the RM algorithm can be applied to the convergence analysis of SGD.
In the SGD algorithm, if
1) $0 < c_1 \le \nabla_w^2 f(w, X) \le c_2$;
2) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
3) $\{x_k\}_{k=1}^{\infty}$ are i.i.d. samples of $X$;

then $w_k$ converges with probability 1 to the root of $\nabla_w E[f(w, X)] = 0$.

Properties of SGD Algorithm

  SGD uses stochastic gradients instead of true gradients, and stochastic gradients are random. Does this make the convergence of SGD highly random? To answer this question, we analyze the relative error between the stochastic and batch gradients. Using the mean value theorem, we obtain the following formula:

$$\delta_k \doteq \frac{\big|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]\big|}{\big|E[\nabla_w f(w_k, X)]\big|}$$
Since $E[\nabla_w f(w^*, X)] = 0$, the mean value theorem gives, for some $\tilde{w}_k$ between $w_k$ and $w^*$,
$$E[\nabla_w f(w_k, X)] = E[\nabla_w f(w_k, X)] - E[\nabla_w f(w^*, X)] = E\big[\nabla_w^2 f(\tilde{w}_k, X)\big]\,(w_k - w^*)$$
We assume the second-order derivative of $f$ is bounded below by a positive constant, i.e. $\nabla_w^2 f \ge c > 0$, so that
$$\big|E[\nabla_w f(w_k, X)]\big| = \big|E[\nabla_w^2 f(\tilde{w}_k, X)]\,(w_k - w^*)\big| \ge c\,|w_k - w^*|$$
Substituting into the definition of $\delta_k$ yields:
$$\delta_k \le \frac{\big|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]\big|}{c\,|w_k - w^*|}$$
From this inequality we obtain an interesting convergence property of SGD: the relative error $\delta_k$ is inversely proportional to $|w_k - w^*|$. When $|w_k - w^*|$ is large, $\delta_k$ is small and SGD behaves like GD; when $w_k$ is close to $w^*$, the relative error $\delta_k$ may be large, and the convergence near $w^*$ exhibits more randomness.
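A tiny numeric sketch of this property, for the mean-estimation objective $f(w, x) = \frac{1}{2}(w - x)^2$ (here the curvature constant is $c = 1$; the sample and the chosen distances are arbitrary illustrations):

```python
import random

# For f(w, x) = 0.5*(w - x)^2: stochastic gradient = (w - x),
# true gradient = (w - E[X]), so delta = |E[X] - x| / |w - E[X]|.
random.seed(2)
mean = 0.0                           # E[X]
x = random.gauss(mean, 1.0)          # one stochastic sample

for w in (100.0, 10.0, 1.0, 0.1):    # estimates at decreasing distance from w* = 0
    delta = abs(mean - x) / abs(w - mean)
    print(f"|w - w*| = {w:6.1f}   relative error = {delta:.4f}")
# The relative error grows as w approaches w*: far from w* SGD behaves like GD,
# while near w* the noise dominates the stochastic gradient.
```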
  Now we use an example to illustrate this property. Let $X \in \mathbb{R}^2$ denote a random position in the plane, uniformly distributed over a square centered at the origin with side length 20, so that the true mean is $E[X] = 0$. We draw 100 i.i.d. samples $\{x_i\}_{i=1}^{100}$ and use the above algorithm to estimate the mean; the result is as follows:

*(figure: the SGD mean-estimation trajectory approaching $E[X] = 0$)*
It can be seen that when the initial guess of the mean is far from the true value, the SGD estimate quickly approaches a neighborhood of the true value; although the estimate shows some randomness once it is close to the true value, it still gradually converges.
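The experiment can be reproduced with a short sketch (the far-away initial guess and the seed are illustrative choices, not from the original post):

```python
import random

# 2D mean estimation by SGD: X is uniform on the square [-10, 10] x [-10, 10],
# so the true mean is E[X] = (0, 0).
random.seed(3)
samples = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(100)]

w = (50.0, 50.0)                     # an initial guess deliberately far from E[X]
for k, (x1, x2) in enumerate(samples, start=1):
    a = 1.0 / k                      # step size a_k = 1/k
    w = (w[0] - a * (w[0] - x1), w[1] - a * (w[1] - x2))

print(w)   # close to (0, 0)
```

Because $a_1 = 1$, the very first update jumps all the way to the first sample, which is why a far initial guess is corrected so quickly.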
  We may also encounter a deterministic formulation that does not involve any random variables. How can such a problem be solved? Consider:
$$\min_w \; J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i)$$
$f(w, x_i)$ is a parameterized function and $w$ is the parameter to be optimized; $\{x_i\}_{i=1}^{n}$ is a set of real numbers, where $x_i$ is not a sample of any random variable.

  So, can we use the SGD algorithm to solve it? Iterating over the $x_i$ gives a formula that looks very similar to SGD, but the difference is that no random variable is involved. To apply SGD, we manually introduce a random variable $X$ defined on the set $\{x_i\}_{i=1}^{n}$ with a uniform distribution, so each $x_i$ is taken with probability $1/n$. The problem then becomes minimizing $E[f(w, X)]$, which SGD solves naturally, as follows:
$$w_{k+1} = w_k - a_k\,\nabla_w f(w_k, x_k), \qquad x_k \sim \mathrm{Uniform}\{x_1, \dots, x_n\}$$
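A minimal sketch of this trick, again using $f(w, x) = \frac{1}{2}(w - x)^2$ so the minimizer is the mean of the fixed set (the numbers below are illustrative, not from the original post):

```python
import random

# Deterministic objective J(w) = (1/n) * sum_i f(w, x_i), f(w, x) = 0.5*(w - x)^2.
# The set {x_i} is fixed; we manually introduce X uniform over it, so that
# minimizing E[f(w, X)] is the same problem and SGD applies directly.
xs = [1.0, 4.0, 7.0, 10.0]

random.seed(4)
w = 0.0
for k in range(1, 20001):
    x = random.choice(xs)            # P(X = x_i) = 1/n, sampled with replacement
    w -= (1.0 / k) * (w - x)         # SGD step with a_k = 1/k

print(w)   # approaches the minimizer w* = mean(xs) = 5.5
```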

Comparison: BGD, MBGD, SGD

  Suppose we want to minimize $J(w) = E[f(w, X)]$, given a set of random samples $\{x_i\}_{i=1}^{n}$. We solve this problem with batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD), respectively.

$$\text{BGD:}\quad w_{k+1} = w_k - a_k\,\frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i)$$
$$\text{MBGD:}\quad w_{k+1} = w_k - a_k\,\frac{1}{m}\sum_{j \in I_k} \nabla_w f(w_k, x_j)$$
$$\text{SGD:}\quad w_{k+1} = w_k - a_k\,\nabla_w f(w_k, x_k)$$

  In the BGD algorithm, all samples are used in each iteration. When $n$ is large, $\frac{1}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i)$ is close to the true gradient $E[\nabla_w f(w_k, X)]$.

  In the MBGD algorithm, $I_k$ is a subset of $\{1, \dots, n\}$ of size $|I_k| = m$, obtained by $m$ independent samplings.

  In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^{n}$.

Summary:
  To a certain extent, MBGD includes BGD and SGD as special cases. When $m = 1$, MBGD becomes SGD. When $m$ is large, it does not exactly become BGD, because MBGD uses $m$ samples drawn randomly with replacement and may use the same value multiple times, while BGD uses each of the $n$ samples exactly once. Compared with SGD, MBGD has less randomness, because it averages over more than just one sample; compared with BGD, MBGD does not need all samples in each iteration, making it more flexible and efficient.

Example:
   Given samples $\{x_i\}$, our goal is to compute the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. This problem can be equivalently formulated as the following optimization problem:
$$\min_w \; J(w) = \frac{1}{2n}\sum_{i=1}^{n} \|w - x_i\|^2$$
*(figure: solving this example with BGD, MBGD, and SGD)*


Origin blog.csdn.net/qq_50086023/article/details/131286483