Reinforcement Learning Study Notes-05: Monte Carlo Method

This article contains the blogger's reading notes on "Reinforcement Learning: An Introduction". It is not a translation of the book, but rather personal understanding and reflections.

The previous section introduced the dynamic programming method for solving learning problems in the Markov decision process (MDP) setting. Dynamic programming assumes that the environment is completely known, i.e., the transition probability p(s',r|s,a) between states and actions is fully available; in other words, given the current state and action, the dynamics of the next state and reward are known.

However, when the environment is unknown, the Monte Carlo method can be used instead. It obtains real rewards by sampling sequences of states and actions from the actual environment, and the value of a state-action pair can then be estimated by averaging the observed returns.

Overall, Monte Carlo learning still follows the generalized policy iteration (GPI) framework, which alternates between Policy Evaluation (Prediction) and Policy Improvement. The Policy Evaluation stage fixes the policy and estimates its value function; the Policy Improvement stage then improves the policy based on that value-function estimate. For the Monte Carlo method, the key lies in estimating the value function; once this estimate is available, improving the policy follows naturally.

1. Monte Carlo method

A. Policy Evaluation

In the previous analysis there were two kinds of value function: the state value function and the action value function. However, when the environment is unknown, knowing only the state value function is not enough to derive a policy, because without the transition model we cannot tell which next state an action leads to. Therefore, what Monte Carlo needs to estimate is the action value function q(s,a), whose estimate is the average of the returns observed after taking action a in state s:

q(s,a)=mean(G(s,a))

The Monte Carlo method divides the training process into multiple rounds, each called an episode. Each episode starts from some initial state S_0 and initial action A_0 and samples a sequence of states, actions, and rewards from the environment:

S_0,A_0,R_1,S_1,A_1,...,S_{T-1},A_{T-1},R_{T},S_{T}

where G(s,a) denotes the cumulative discounted return following the first occurrence of state s and action a, namely:

G(s=S_t,a=A_t)=\sum^{T}_{i=t+1}\gamma^{i-t-1} R_i
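As a concrete illustration, below is a minimal first-visit Monte Carlo prediction sketch in Python. The gym-style env interface (reset()/step()) and the policy(state) -> {action: probability} mapping are assumptions made for the example and do not come from the original notes.

```python
from collections import defaultdict
import random

def mc_policy_evaluation(env, policy, num_episodes, gamma=1.0):
    """First-visit Monte Carlo estimation of q(s, a) under a fixed policy.

    Assumes a gym-style env (reset() -> state, step(a) -> (state, reward, done))
    and policy(state) -> dict {action: probability}; both interfaces are
    assumptions for this sketch.
    """
    returns_sum = defaultdict(float)   # total return accumulated per (s, a)
    returns_cnt = defaultdict(int)     # number of first visits per (s, a)
    q = defaultdict(float)

    for _ in range(num_episodes):
        # Sample one episode: [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        episode = []
        state = env.reset()
        done = False
        while not done:
            probs = policy(state)
            actions = list(probs)
            action = random.choices(actions, weights=[probs[a] for a in actions])[0]
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Record the index of the first visit of each (s, a) pair
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in first_visit:
                first_visit[(s, a)] = t

        # Walk backwards, accumulating the discounted return G
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = gamma * g + r
            if first_visit[(s, a)] == t:   # update only at the first visit
                returns_sum[(s, a)] += g
                returns_cnt[(s, a)] += 1
                q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
    return q
```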

B. Policy Improvement

\pi(a|s)=\begin{cases} 1& \text{ if } a= \text{argmax}_a\ q(s,a)\\ 0& \text{ if } a\neq \text{argmax}_a\ q(s,a) \end{cases}

The policy can be computed directly from the value function q(s,a) in this greedy way. However, because the Monte Carlo method samples state-action sequences according to the policy, a \pi(a|s) that is too hard (fully deterministic) means some state-action pairs will never be sampled; in other words, the chance to explore and find the optimal solution may be lost. One remedy is the ε-greedy method below, where \Lambda(s) denotes the number of actions that can be taken in state s.

\pi(a|s)=\begin{cases} 1-\varepsilon +\frac{\varepsilon }{\Lambda(s)}& \text{ if } a= \text{argmax}_a\ q(s, a) \\ \frac{\varepsilon }{\Lambda(s)}& \text{ if } a\neq \text{argmax}_a\ q(s,a) \end{cases}
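A small sketch of this ε-greedy improvement step is shown below; the names q and actions are illustrative (q as a (state, action) -> value mapping, actions as the list of actions available in every state, so \Lambda(s) = len(actions)).

```python
import random

def epsilon_greedy_policy(q, actions, epsilon=0.1):
    """Build an epsilon-greedy policy from action-value estimates q.

    q is assumed to be a mapping (state, action) -> value and `actions`
    the list of actions available in every state; both are illustrative.
    """
    def policy(state):
        greedy = max(actions, key=lambda a: q.get((state, a), 0.0))
        n = len(actions)
        return {
            a: (1 - epsilon + epsilon / n) if a == greedy else epsilon / n
            for a in actions
        }
    return policy
```

Combined with the evaluation sketch above, alternating the two steps gives an on-policy Monte Carlo control loop in the GPI sense.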

Another method is to randomize the choice of initial state and action so that every state-action pair has some probability of being selected as a starting point; this is called exploring starts.

2. off-policy Monte Carlo method

In the previous article we discussed the trade-off between exploration and exploitation in reinforcement learning: on the one hand we want to exploit directions that already look good, and on the other we need to explore new ones. Our approach so far has been to soften the policy, e.g. with ε-greedy, so that a single policy balances both. A method in which the next round of training data is generated by the very policy being optimized is called an on-policy method.

A more intuitive alternative is to split exploration and exploitation across two policy functions: the policy being optimized is called the target policy, while a separate policy, used specifically to generate state-action pairs for exploration, is called the behavior policy. Since the next round of training data is no longer produced by the target policy, this is called an off-policy method. The off-policy approach describes a more general formulation of reinforcement learning.

A further problem is that, in the off-policy setting, the distributions induced by the target policy and the behavior policy are inconsistent, and this mismatch leads to biased estimates. Therefore, most off-policy methods introduce importance sampling to correct for the discrepancy between the target policy and the behavior policy.

A. Importance-sampling

For the Monte Carlo method, this discrepancy shows up in the probability of generating the sampled sequence under the policy \pi(a|s). Suppose that, starting from state S_t and action A_t, the policy generates the following sequence:

S_t,A_t,R_{t+1},S_{t+1},A_{t+1},...,S_{T-1},A_{T-1},R_{T},S_{T}

Its generation probability can be expressed as:

P(R_{t+1},S_{t+1},A_{t+1},...,S_{T-1},A_{T-1},R_{T},S_{T}|S_t,A_t\sim \pi)=\Pi^{T-1}_{k=t} \pi(A_k|S_k)P(S_{k+1}|S_k,A_k)

The discrepancy between the target policy \pi_\tau(a|s) and the behavior policy \pi_b(a|s) can then be expressed as the importance-sampling ratio:

\rho(A_k,S_k)=\frac{\Pi^{T-1}_{k=t} \pi_\tau (A_k|S_k)P(S_{k+1}|S_k,A_k)}{\Pi^{T-1}_{k=t} \pi_b(A_k|S_k)P(S_{k+1}|S_k,A_k)}=\frac{\Pi^{T-1}_{k=t} \pi_\tau (A_k|S_k)}{\Pi^{T-1}_{k=t} \pi_b(A_k|S_k)}

At this point we correct for the bias in cumulative returns:

G_\rho(A_k,S_k)=\rho(A_k,S_k)G(A_k,S_k)
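A minimal sketch of this correction in Python is given below, assuming each episode is a list of (state, action, reward) triples collected under the behavior policy, and that target_policy and behavior_policy are callables returning {action: probability} dicts; these names and interfaces are illustrative, not from the original notes.

```python
def importance_sampling_ratio(episode, target_policy, behavior_policy):
    """Ordinary importance-sampling ratio for one sampled episode.

    episode: list of (state, action, reward) triples generated under
    behavior_policy. The transition probabilities cancel in the ratio,
    so only the policy probabilities remain.
    """
    rho = 1.0
    for state, action, _ in episode:
        rho *= target_policy(state)[action] / behavior_policy(state)[action]
    return rho

def corrected_return(episode, target_policy, behavior_policy, gamma=1.0):
    """Off-policy correction: multiply the episode return G by rho."""
    g = 0.0
    for _, _, reward in reversed(episode):
        g = gamma * g + reward
    return importance_sampling_ratio(episode, target_policy, behavior_policy) * g
```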

B. Weighted importance sampling

With ordinary importance sampling, the original cumulative return is multiplied by a correction factor. When the target policy and the behavior policy differ substantially, this factor can become very large, making the corrected return deviate too far from the actually observed rewards and providing little help for learning the target policy. Weighted importance sampling therefore folds the correction factor into the value-function estimate itself:

q(s,a)=\frac{mean(\rho(A_k,S_k)G(A_k,S_k))}{mean(\rho(A_k,S_k))}

However, this estimator is biased: its value estimate leans toward the behavior policy. In exchange, its variance is lower and the estimate is more stable.
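The sketch below implements this weighted estimate under the same assumed interfaces as above (episodes as lists of (state, action, reward) triples, policies as callables returning {action: probability} dicts); it is an illustrative every-visit version, not the book's exact algorithm.

```python
from collections import defaultdict

def weighted_is_q(episodes, target_policy, behavior_policy, gamma=1.0):
    """Weighted importance-sampling estimate of q(s, a).

    For each time step t of each episode, the tail return G and the ratio
    rho over steps t..T-1 contribute rho * G to the numerator and rho to
    the denominator for (S_t, A_t), matching q(s,a) = mean(rho*G)/mean(rho).
    """
    num = defaultdict(float)   # sum of rho * G per (s, a)
    den = defaultdict(float)   # sum of rho per (s, a)
    q = {}

    for episode in episodes:
        g = 0.0
        rho = 1.0
        # Walk backwards so rho only covers steps from t to T-1
        for state, action, reward in reversed(episode):
            g = gamma * g + reward
            rho *= target_policy(state)[action] / behavior_policy(state)[action]
            num[(state, action)] += rho * g
            den[(state, action)] += rho
            if rho == 0.0:   # earlier steps would only add zero weight
                break
    for key in num:
        if den[key] > 0:
            q[key] = num[key] / den[key]
    return q
```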

C. Selection of the behavior policy function

In fact, for the Monte Carlo method, any soft policy can serve as the behavior policy, provided two conditions are met:

  • It ensures that the target policy converges
  • Every state-action pair has a chance of being chosen


Origin blog.csdn.net/tostq/article/details/130689211