Chapter 5: Monte Carlo Method

Unlike DP, Monte Carlo (MC) methods do not require a complete model of the environment; they only need experience, i.e. sampled or simulated episodes. The core idea is to estimate a value as the average of the returns observed over many episodes, so MC learns on an episode-by-episode basis.

1. Monte Carlo Prediction

Policy evaluation with the first-visit Monte Carlo method averages the returns that follow the first visit to a state s over many episodes; as the number of episodes grows, this average converges to the true value of s under the policy.
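As a concrete illustration, here is a minimal Python sketch of first-visit MC prediction. It assumes a hypothetical environment with reset()/step() methods and a policy given as a callable; these names are assumptions for the example, not a specific library's API.

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns over sampled episodes."""
    returns = defaultdict(list)   # state -> list of sampled returns
    V = defaultdict(float)        # state -> current value estimate

    for _ in range(num_episodes):
        # Generate one episode by following the policy being evaluated.
        episode = []                          # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: only record G at the first occurrence of s.
            if all(episode[i][0] != s for i in range(t)):
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])

    return V
```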

 

The backup diagram for DP shows all possible successor states, while MC shows only the single sampled episode; DP backs up over a one-step transition, while MC follows the trajectory all the way to the end of the episode. Put simply, DP estimates the value of the current state from the values of its successor states, whereas MC estimates each state independently of the others: DP bootstraps, MC does not.

 

 

2. Monte Carlo Control

 

Monte Carlo control as described relies on two assumptions: exploring starts for the episodes, and policy evaluation carried out over an infinite number of episodes. For the second assumption, there are two workarounds.

1. Fix a small threshold theta and stop policy evaluation once the change in the estimates falls below theta, which is enough to guarantee convergence; 2. Do not wait for policy evaluation to complete before improving the policy; instead, as in value iteration, alternate a single evaluation step with a single improvement step on each iteration (an episode-by-episode sketch follows below).
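As a rough sketch of that episode-by-episode idea (not the book's exact pseudocode), the loop below updates Q and greedily improves the policy right after every episode. The helpers generate_episode_es (which picks a random starting state-action pair and then follows the current policy) and env.actions(s) are assumptions made for illustration.

```python
from collections import defaultdict

def mc_control_exploring_starts(env, generate_episode_es, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts, improving the policy after each episode."""
    Q = defaultdict(float)          # (state, action) -> action-value estimate
    visit_count = defaultdict(int)  # (state, action) -> number of first visits
    policy = {}                     # state -> current greedy action

    for _ in range(num_episodes):
        # Exploring start: random (s0, a0), then follow the current policy.
        episode = generate_episode_es(env, policy)   # [(s, a, r), ...]

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check on the (state, action) pair.
            if all((episode[i][0], episode[i][1]) != (s, a) for i in range(t)):
                visit_count[(s, a)] += 1
                # Incremental average instead of storing every return.
                Q[(s, a)] += (G - Q[(s, a)]) / visit_count[(s, a)]
                # Policy improvement for this state immediately.
                policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])

    return Q, policy
```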

 

Because the MC method must wait until an episode ends before it can perform policy evaluation and policy improvement, it is at a disadvantage when episodes are very long or never terminate. In addition, since MC averages sampled returns for each state separately, the experience gathered for one state cannot be reused directly to estimate other states, which limits its sample efficiency.

3. Remove Exploring Starts

This addresses the other assumption from Section 2. Without exploring starts, some actions may never be selected, so their Q(s, a) values cannot be evaluated correctly.

The solution is to use an epsilon-greedy policy. Even without exploring starts, there is a probability epsilon of choosing a non-greedy action, which fixes the problem of always selecting only the action with the maximum Q(s, a).
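For illustration, epsilon-greedy action selection can be sketched as below, where Q is a dict keyed by (state, action) pairs and actions is the list of actions available in the state (both are assumptions for the example).

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick any action uniformly at random.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick the action with the largest estimated Q(s, a).
    return max(actions, key=lambda a: Q[(state, a)])
```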

 

4. Off-policy Prediction via Importance Sampling

The on-policy methods introduced above ultimately converge only to the best epsilon-greedy policy, which is still not the optimal greedy policy. To obtain the optimal greedy policy we need off-policy learning, in which evaluation and improvement use different policies, so importance sampling is needed to reweight the data sampled under one policy so that it can be used to evaluate the other. In short, two policies are involved: the target policy (the one being learned) and the behaviour policy (the one generating the data).
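For reference, the importance-sampling ratio that reweights a return sampled under the behaviour policy b so that it estimates values under the target policy π is the product of the per-step probability ratios along the trajectory:

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```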

4.1 Weighted importance sampling and ordinary importance sampling

There are two kinds of importance sampling here: ordinary importance sampling and weighted importance sampling.

The difference is that the weighted estimator is biased but has low variance, whereas the ordinary estimator is unbiased but has high variance. In practice, the weighted method is usually preferred.
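A tiny sketch of the two estimators for a single state, assuming that for each episode we have already computed the return G_i and the importance-sampling ratio rho_i from that state onward:

```python
def ordinary_importance_sampling(returns, ratios):
    # Unbiased: average of rho_i * G_i divided by the number of samples,
    # but the variance can be very large (even unbounded).
    return sum(rho * g for g, rho in zip(returns, ratios)) / len(returns)

def weighted_importance_sampling(returns, ratios):
    # Biased (the bias vanishes as data grows): a weighted average of the
    # returns using the ratios as weights, which keeps the variance much lower.
    total_weight = sum(ratios)
    if total_weight == 0:
        return 0.0
    return sum(rho * g for g, rho in zip(returns, ratios)) / total_weight
```

For example, with returns [1.0, 0.0] and ratios [10.0, 0.5], the ordinary estimate is 5.0 while the weighted estimate is about 0.95, showing how the weighted version damps extreme ratios.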

5. Off-policy MC Control

In general, data generated by the behaviour policy b is used to estimate Q(s, a), and the optimal policy π is obtained by acting greedily with respect to Q. Note that b must cover π: any action that π may select in a state must also have a nonzero probability of being selected by b.
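The sketch below follows the general shape of off-policy MC control with weighted importance sampling. The helpers generate_episode (returning [(s, a, r), ...] sampled from a soft behaviour policy b), b_prob(s, a), and the actions list are assumptions made for the example.

```python
from collections import defaultdict

def off_policy_mc_control(generate_episode, b_prob, actions, num_episodes, gamma=1.0):
    """Off-policy MC control with weighted importance sampling (incremental form)."""
    Q = defaultdict(float)       # (state, action) -> action-value estimate
    C = defaultdict(float)       # (state, action) -> cumulative sum of weights
    target_policy = {}           # state -> greedy action with respect to Q

    for _ in range(num_episodes):
        episode = generate_episode()    # sampled by following the behaviour policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            # Incremental weighted-average update of Q(s, a).
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target_policy[s] = max(actions, key=lambda act: Q[(s, act)])
            # The target policy is deterministic and greedy, so if b took a
            # different action the ratio for all earlier steps becomes zero.
            if a != target_policy[s]:
                break
            W /= b_prob(s, a)

    return Q, target_policy
```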

 

6. Summary

1. Three advantages of MC over DP

(1) No environment model required

(2) It can learn from simulated experience or a sample model

(3) It is easy and efficient to evaluate only the subset of states we care about, without computing values for the entire state space

2. Problems with MC

Since exploring starts are required, once this assumption is removed it is likely that only the actions already being sampled will be chosen, and better actions may never be tried. Therefore, the best policy obtainable by on-policy MC control is the epsilon-greedy policy rather than the fully greedy optimal policy.

 

 

 

  

 
