Reinforcement Learning Notes-08 Planning and Learning

In previous articles we introduced a key distinction: model-based vs. model-free. Model-based methods assume access to an environment model from which the state, action, and reward transition information can be obtained, as in dynamic programming. When the state-action transition probabilities are known, we can quickly obtain an estimate of the value function by recursion:

Q(s,a) = \sum_{s',r} P(s',r|s,a)\,(r+V(s')) = \sum_{s',r} P(s',r|s,a)\Big(r+ \sum_{a'} \pi (a'|s')\,Q(s',a')\Big)

One way to update the value function is to sweep over all state-action pairs. But when there are too many state-action pairs, and some states are irrelevant to our goal, sweeping over everything is very inefficient. On the other hand, the value updates of different states depend on each other, so the order of updates also affects training efficiency. Planning, in this sense, is about arranging the state update steps sensibly.

When the environment model is completely unknown, we must obtain the real return G_t by interacting with the environment and sampling, and then use it to update the value function. Such methods are called model-free; MC and TD algorithms belong to this class, which learns from sampling. Their advantage is that the sampled returns are truly unbiased, but training is much slower than with model-based methods, especially when interaction with the real environment is expensive.

Q_t(s,a)=Q_{t-1}(s,a) + \alpha (G_t(s,a) - Q_{t-1}(s,a))
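As a minimal sketch (Python; the table sizes, step size, and the `incremental_update` helper are illustrative assumptions), the update above simply moves the old estimate toward the sampled return:

```python
import numpy as np

# Hypothetical tabular setting: Q-table over 10 states and 4 actions.
Q = np.zeros((10, 4))
alpha = 0.1                                   # step size

def incremental_update(Q, s, a, G, alpha=alpha):
    """Move the old estimate Q(s, a) toward the sampled return G by a fraction alpha."""
    Q[s, a] += alpha * (G - Q[s, a])

# e.g. after one episode the return observed from state 3, action 1 was 2.5
incremental_update(Q, 3, 1, 2.5)
```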

This section therefore introduces a general way of combining the two approaches. When the environment is completely unknown, we construct a learnable environment model: on the one hand, the model itself is learned from real samples; on the other hand, the value function can also be updated from simulated samples drawn from the model.

The figure above describes the overall structure: the value function is learned directly from real environment samples, and at the same time an environment model is learned from those samples, after which the value function is learned indirectly through simulated experience from the model.

1. Dyna algorithm

The Dyna algorithm is the most basic implementation of the structure above. It assumes that environmental state transitions are deterministic rather than probabilistic, namely:

P(s',r|s,a)=1\text{ or }0

So the Dyna algorithm uses a table to represent the environment model, Model(S,A) = [S', R], where each state-action pair points to a single deterministic next state and reward.

Dyna divides training into multiple iterations. In each iteration, one round of real environment sampling is performed and the value function is updated directly; the environment model is then updated from the sampled transitions, and finally the value function is updated n more times using transitions drawn from the environment model. A sketch of this loop is given below.
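Below is a minimal tabular Dyna-Q sketch in Python. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(s', r, done)`), the table sizes, and the hyperparameter values are assumptions for illustration, not part of the original description:

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL from real steps plus n_planning simulated updates per step."""
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # (s, a) -> (r, s', done): deterministic table model

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)            # real experience
            # (a) direct RL: one-step Q-learning update from the real transition
            Q[s, a] += alpha * (r + gamma * (0 if done else np.max(Q[s2])) - Q[s, a])
            # (b) model learning: record the observed deterministic transition
            model[(s, a)] = (r, s2, done)
            # (c) planning: n simulated updates from previously seen pairs
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                Q[ps, pa] += alpha * (pr + gamma * (0 if pdone else np.max(Q[ps2])) - Q[ps, pa])
            s = s2
    return Q
```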

2. Dyna-Q+ 

The Dyna algorithm assumes that once a state-action transition has been learned, the model will not change; in other words, the model never needs updating and never becomes wrong. When the environment does change, the indirect-learning step (f) of the Dyna loop (the planning step) becomes problematic. Dyna-Q+ therefore introduces a heuristic: the environment model is represented as Model(S,A) = [S', R, T], where T records the time of the last real visit. The reward used for the value-function update grows with the time elapsed since that visit, encouraging exploration of states and actions that have not been tried for a long time.

R_t=R+k\sqrt{t-T}
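In code, the only change relative to the Dyna-Q sketch above is that the model additionally stores the last real visit time T, and planning uses a bonus-augmented reward (a sketch; the bonus weight k and the time counter t are assumed names and values):

```python
import math

def bonus_reward(r, last_visit_time, t, k=1e-3):
    """Dyna-Q+ planning reward: grows with the time since (s, a) was last tried for real.

    r: reward stored in the model; last_visit_time: the T stored in the model;
    t: current real time step; k: bonus weight (hypothetical value)."""
    return r + k * math.sqrt(t - last_visit_time)
```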

3. Prioritized Sweeping

In the Dyna algorithm, if the value function barely changes after a real sample, it is pointless to run simulated training through the model, because the value function will not be updated by it either. The idea of Prioritized Sweeping is therefore to decide whether simulated training is worthwhile based on how much the value function changes after real sampling, and, on the other hand, to only consider for update those other states that are affected by this state (its predecessors). A sketch of the planning loop follows.
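A sketch of the prioritized-sweeping planning loop in Python. Here the model is assumed to store `model[(s, a)] = (r, s')`, `predecessors[s]` is an assumed mapping from a state to the state-action pairs known to lead into it, and `theta` is the priority threshold:

```python
import heapq
import numpy as np

def planning_sweep(Q, model, predecessors, alpha, gamma, theta, n_updates):
    """Prioritized sweeping: only plan from pairs whose value would change by more than theta."""
    pq = []                                       # max-priority queue via negated priorities

    def push(s, a):
        r, s2 = model[(s, a)]
        p = abs(r + gamma * np.max(Q[s2]) - Q[s, a])
        if p > theta:
            heapq.heappush(pq, (-p, s, a))

    for (s, a) in model:                          # seed the queue from known pairs
        push(s, a)

    for _ in range(n_updates):
        if not pq:
            break
        _, s, a = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        for (ps, pa) in predecessors.get(s, ()):  # propagate to pairs that lead into s
            push(ps, pa)
```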

4. Expected vs. Sample Updates 

In simulated training through the model, there are two ways to update the value function. One is the expected update used in dynamic programming:

Q(s,a)=\sum_{s',r} p(s',r|s,a)\Big(r + \gamma \sum_{a'}\pi (a'|s')\, Q(s',a')\Big)

The other is a sample update, as in Q-learning:

Q_t(s,a)=Q_{t-1}(s,a)+\alpha \big(R+\gamma \max_{a'} Q_{t-1}(s',a')-Q_{t-1}(s,a)\big)

Compared with the expected update, a single sample update requires far less computation, but the expected update converges in fewer iterations, while the sample update must be repeated many times. However, when the branching factor (the number of possible successor states) is large, the lower cost per update usually makes sample updates converge faster overall. Both update rules are sketched below.
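The two update rules can be contrasted directly (a sketch; `P[(s, a)]` is assumed to be a dictionary mapping `(s', r)` to its probability, and `pi[s]` a vector of action probabilities):

```python
import numpy as np

def expected_update(Q, pi, P, s, a, gamma):
    """Expected (DP-style) update: sums over every possible successor,
    so its cost grows with the branching factor."""
    return sum(p * (r + gamma * np.dot(pi[s2], Q[s2]))
               for (s2, r), p in P[(s, a)].items())

def sample_update(Q, s, a, r, s2, alpha, gamma):
    """Sample (Q-learning-style) update: uses one sampled successor, constant cost per update."""
    return Q[s, a] + alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
```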

5. Trajectory Sampling

There are two ways to choose which states to sample during planning: on-policy (trajectory) sampling and direct uniform sampling. Intuitively, on-policy sampling focuses computation on states the current policy actually visits and can therefore speed up training, but on the other hand it may miss some states and thus fail to reach the optimum. Both are sketched below.

The figure above compares on-policy sampling with direct uniform sampling. On-policy sampling converges faster at first, especially as the number of states grows. In addition, the left panel shows that when the branching factor (the number of possible next states of the current state) is large, on-policy sampling may fail to reach the optimum.
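The difference between the two distributions can be sketched as two ways of choosing which state-action pairs to update during planning (assuming the tabular `Q` and the deterministic `model` table from the Dyna-Q sketch above):

```python
import random
import numpy as np

def uniform_pair(model):
    """Uniform sampling: every previously seen (s, a) pair is equally likely to be updated."""
    return random.choice(list(model.keys()))

def on_policy_trajectory(Q, model, s0, length=10, epsilon=0.1):
    """Trajectory sampling: collect pairs to update by following the current
    epsilon-greedy policy through the learned model, starting from the current state."""
    pairs, s = [], s0
    for _ in range(length):
        a = random.randrange(Q.shape[1]) if random.random() < epsilon else int(np.argmax(Q[s]))
        if (s, a) not in model:            # stop where the model has no experience yet
            break
        pairs.append((s, a))
        _, s, done = model[(s, a)]         # model[(s, a)] = (r, s', done), as in the Dyna-Q sketch
        if done:
            break
    return pairs
```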

6. Real-time Dynamic Programming

RTDP speeds up the convergence of the original DP algorithm by using the on-policy trajectory sampling described above. The original value-function update is:

V(s) =\sum_{a} \pi(a|s)\sum_{s',r} P(s',r|s,a)\,(r+ V(s'))

The RTDP value-function update, applied to the states visited along trajectories, is:

V(s) =\max_{a} \sum_{s',r} P(s',r|s,a)\,(r+ V(s'))
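A sketch of RTDP backups along one trajectory (Python; the transition model `P[(s, a)]` mapping `(s', r)` to its probability, the `terminal` set, and the step cap are assumptions for illustration):

```python
import numpy as np

def rtdp_episode(V, P, n_actions, s0, terminal, max_steps=100):
    """RTDP: apply the greedy (value-iteration) backup only to states visited
    along a trajectory generated by acting greedily with respect to V."""
    def backup(s, a):
        return sum(p * (r + V[s2]) for (s2, r), p in P[(s, a)].items())

    s = s0
    for _ in range(max_steps):
        if s in terminal:
            break
        values = [backup(s, a) for a in range(n_actions)]
        a = int(np.argmax(values))
        V[s] = values[a]                                  # greedy update on the visited state only
        # sample a successor of the greedy action from the model
        successors, probs = zip(*[(s2, p) for (s2, r), p in P[(s, a)].items()])
        probs = np.array(probs) / sum(probs)
        s = successors[np.random.choice(len(successors), p=probs)]
    return V
```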

7. Monte Carlo Tree Search

The traditional Monte Carlo method needs to cover the entire state-action space, which is very inefficient and becomes impractical when the state space is very large. Tree search therefore stores the history of visited states in a tree structure: each state is a node, and successor states are explored starting from a chosen root node. This avoids visiting large numbers of irrelevant states.

We assume that a state node in the Monte Carlo tree can be in one of three situations:

  • Endpoint: the node is a terminal state with no successors
  • Intermediate node: the node has successor nodes and has been fully explored
  • Exploration point: the node has successor nodes but has not been fully explored

Monte Carlo Tree Search starts from a chosen root node and repeats the following four steps (a code sketch follows the list):

  1. Selection: starting from the root node, descend the tree. At each intermediate node, choose a child according to the Tree Policy, until either an endpoint is reached (in which case the value function of the whole path is updated) or an exploration point is reached.
  2. Expansion: when an exploration point is reached, select one of its not-yet-explored child nodes.
  3. Simulation: from the newly expanded child node, select subsequent state-actions with the Rollout Policy until an endpoint is reached.
  4. Backup: when an endpoint is reached, update the value function of every state-action on the path.
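A compact MCTS sketch covering the four steps (Python). The generative-model interface `env.actions(s)` and `env.step(s, a) -> (s', r, done)` is an assumption for illustration, transitions are assumed deterministic, terminal states are signaled via `done`, UCB1 stands in for the Tree Policy, and uniform random action selection stands in for the Rollout Policy:

```python
import math
import random

class Node:
    """One state in the search tree."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                    # action -> child Node
        self.visits, self.total_return = 0, 0.0

def mcts(root_state, env, n_iter=1000, c=1.4, rollout_depth=50):
    """MCTS with UCB1 as the Tree Policy and uniform random rollouts as the Rollout Policy."""
    root = Node(root_state)

    def ucb1(parent, child):
        return (child.total_return / (child.visits + 1e-9)
                + c * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1e-9)))

    for _ in range(n_iter):
        node, G, done = root, 0.0, False
        # 1. Selection: while the node is fully expanded, follow the Tree Policy downward
        while not done and node.children and len(node.children) == len(env.actions(node.state)):
            a = max(node.children, key=lambda a: ucb1(node, node.children[a]))
            _, r, done = env.step(node.state, a)
            node = node.children[a]
            G += r
        if not done:
            # 2. Expansion: add one not-yet-explored child of the exploration point
            untried = [a for a in env.actions(node.state) if a not in node.children]
            a = random.choice(untried)
            s, r, done = env.step(node.state, a)
            node.children[a] = Node(s, parent=node)
            node = node.children[a]
            G += r
            # 3. Simulation: random rollout from the new node until an endpoint (or depth cap)
            for _ in range(rollout_depth):
                if done:
                    break
                s, r, done = env.step(s, random.choice(env.actions(s)))
                G += r
        # 4. Backup: propagate the return to every node on the selected path
        while node is not None:
            node.visits += 1
            node.total_return += G
            node = node.parent
    # after the search, choose the root action with the most visits
    return max(root.children, key=lambda a: root.children[a].visits)
```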

 

Origin blog.csdn.net/tostq/article/details/131007968