Deep understanding of reinforcement learning - Markov decision process: Monte Carlo method - [Basic knowledge]



Monte Carlo methods, also known as statistical simulation methods, are numerical computation methods based on probability and statistics. When using a Monte Carlo method, we usually draw repeated random samples and then apply statistical techniques to the samples to obtain a numerical estimate of the target quantity. A simple example is using the Monte Carlo method to compute the area of a circle. Suppose we randomly generate a number of points inside the square shown in the figure below and count how many of them fall inside the circle. The ratio of the circle's area to the square's area is then approximately equal to the ratio of the number of points inside the circle to the total number of points inside the square. The more points we randomly generate, the closer the computed area of the circle gets to the circle's true area.
Estimating the area of a circle using the Monte Carlo method
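
A minimal Python sketch of this circle-area estimate, assuming a unit circle inscribed in a square of side 2 (the radius and point count are arbitrary illustrative choices):

```python
import random

def estimate_circle_area(radius=1.0, num_points=100_000):
    """Estimate the area of a circle inscribed in a square of side 2*radius."""
    inside = 0
    for _ in range(num_points):
        # Sample a point uniformly inside the square [-radius, radius] x [-radius, radius].
        x = random.uniform(-radius, radius)
        y = random.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    square_area = (2 * radius) ** 2
    # Circle area ≈ (fraction of points inside the circle) * area of the square.
    return inside / num_points * square_area

print(estimate_circle_area())  # approaches pi * radius^2 ≈ 3.14159 as num_points grows
```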
We now introduce how to use the Monte Carlo method to estimate the state value function of a policy in a Markov decision process. Recall that the value of a state is its expected return, so a very intuitive idea is to use the policy to sample many sequences in the Markov decision process, compute the return starting from that state in each sequence, and then take the average as an estimate of the expectation:
$$V_\pi(s) = E_\pi[G_t \mid S_t = s] \approx \frac{1}{N}\sum_{i=1}^{N} G_t^{(i)}$$
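Here $G_t^{(i)}$ denotes the return of the state in the $i$-th sampled sequence. Assuming the standard discounted return for Markov decision processes with discount factor $\gamma$, it is the discounted sum of rewards from time step $t$ to the end of the sequence:

$$G_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$$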

In a given sequence, a state may not appear at all, may appear only once, or may appear many times. The Monte Carlo value-estimation method we introduce here computes the return for a state every time it appears (the every-visit method). Another option is to compute the return only once per sequence: use the cumulative reward from the first time the state appears in the sequence and ignore any later appearances (the first-visit method). Suppose we now use policy $\pi$ to sample sequences starting from state $s$ and compute the state values from them. We maintain a counter and a total return for each state. The specific procedure for computing the state values is as follows:

Monte Carlo estimation of the state values of a Markov decision process:
(1) Use policy $\pi$ to sample several sequences:
$$s_0^{(i)} \stackrel{a_0^{(i)}}{\longrightarrow} r_0^{(i)}, s_1^{(i)} \stackrel{a_1^{(i)}}{\longrightarrow} r_1^{(i)}, s_2^{(i)} \longrightarrow \cdots \longrightarrow r_{T-2}^{(i)}, s_{T-1}^{(i)} \stackrel{a_{T-1}^{(i)}}{\longrightarrow} r_{T-1}^{(i)}, s_T^{(i)}$$
(2) For each state $s$ at each time step $t$ of each sequence, update the state's counter $N(s) = N(s) + 1$ and the state's total return $M(s) = M(s) + G_t$.
(3) Estimate the value of each state as its average return: $V(s) = \frac{M(s)}{N(s)}$.
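
A minimal Python sketch of this every-visit procedure on a toy MDP; the policy, transition function, reward function, discount factor, and episode count below are illustrative assumptions rather than part of the original text:

```python
import random
from collections import defaultdict

GAMMA = 0.9  # discount factor (illustrative choice)

def sample_episode(policy, transition, reward, start_state, max_steps=100):
    """Sample one sequence s_0 -a_0-> r_0, s_1 -a_1-> r_1, ... under the policy."""
    episode, state = [], start_state
    for _ in range(max_steps):
        action = policy(state)
        episode.append((state, action, reward(state, action)))
        state = transition(state, action)
        if state == "terminal":
            break
    return episode

def mc_state_values(episodes, gamma=GAMMA):
    """Every-visit Monte Carlo estimation: V(s) = M(s) / N(s)."""
    N = defaultdict(int)    # visit counter N(s)
    M = defaultdict(float)  # total return M(s)
    for episode in episodes:
        G = 0.0
        # Walk the sequence backwards so G_t = r_t + gamma * G_{t+1} accumulates easily.
        for state, _, r in reversed(episode):
            G = r + gamma * G
            N[state] += 1    # step (2): update the counter
            M[state] += G    # step (2): update the total return
    return {s: M[s] / N[s] for s in N}  # step (3): average return

# Toy usage: two non-terminal states, a random policy, simple rewards (all hypothetical).
policy = lambda s: random.choice(["left", "right"])
transition = lambda s, a: "terminal" if random.random() < 0.3 else ("s1" if a == "left" else "s2")
reward = lambda s, a: 1.0 if a == "right" else 0.0
episodes = [sample_episode(policy, transition, reward, "s1") for _ in range(5000)]
print(mc_state_values(episodes))
```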

By the law of large numbers, as $N(s) \rightarrow \infty$ we have $V(s) \rightarrow V_\pi(s)$. When computing the expected return, instead of summing up all the returns and dividing by the number of visits, we can also use an incremental update. For each state $s$ and its corresponding return $G$, the following update can be performed:
$$V(s) = V(s) + \frac{1}{N(s)}\left(G - V(s)\right)$$
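
A minimal Python sketch of this incremental form, which avoids storing all past returns (the dictionary-based bookkeeping is an illustrative choice):

```python
from collections import defaultdict

V = defaultdict(float)  # running value estimate V(s)
N = defaultdict(int)    # visit counter N(s)

def incremental_update(state, G):
    """Apply V(s) <- V(s) + (G - V(s)) / N(s) for one newly observed return G."""
    N[state] += 1
    V[state] += (G - V[state]) / N[state]
```

After processing the same visits and returns, this yields exactly the same estimate as $M(s)/N(s)$, but only the current estimate and the counter need to be kept in memory.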


