Reinforcement Learning Chapter VI

1. The context manager (used in combination with yield)

@contextmanager
def timer(name):
    ...

with timer('Timer PolicyEval'):
    ...

The code inside the with block is then executed within the context automatically: whatever the context manager does before and after its yield (here, timing) wraps that section of code. A runnable sketch follows below.
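A minimal runnable version of such a timer, assuming we only want to print the elapsed wall-clock time (the print format and the placeholder workload are assumptions, not the book's code):

import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    start = time.time()                                       # runs when the with block is entered
    yield                                                     # the body of the with block executes here
    print('%s cost %.4f s' % (name, time.time() - start))     # runs when the with block exits

with timer('Timer PolicyEval'):
    total = sum(x * x for x in range(10 ** 6))                # placeholder workload, for illustration only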

 

2. Policy iteration has two parts, policy evaluation and policy improvement, and every iteration passes through both. Policy evaluation measures progress with a loss that is the squared difference between the current value estimate and the previously computed one; policy improvement selects, for each state, the action with the highest value as the policy for the next iteration. Policy evaluation runs until the value function has converged to some tolerance, and policy improvement then updates the policy based on that maximal value function. When the policy no longer changes, the iteration stops. (The excerpt below shows only the evaluation step; a sketch of the improvement step follows it.)

(Policy evaluation section:)

for i in range(1, agent.s_len):  # for each state
    ac = agent.pi[i]                          # action chosen by the current policy
    transition = agent.p[ac, i, :]            # transition probabilities p(s' | s=i, a=ac)
    value_sa = np.dot(transition, agent.r + agent.gamma * agent.value_pi)
    new_value_pi[i] = value_sa                # value_sas[agent.policy[i]]
diff = np.sqrt(np.sum(np.power(agent.value_pi - new_value_pi, 2)))
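The improvement step itself is not shown in the excerpt. A minimal sketch, assuming the same agent layout as above (agent.p[a, s, :], agent.r, agent.gamma, agent.value_pi, agent.pi, agent.s_len, agent.a_len), could be:

import numpy as np

def policy_improvement(agent):
    # Greedy improvement: in every state pick the action with the highest one-step value.
    new_policy = np.zeros_like(agent.pi)
    for i in range(1, agent.s_len):
        value_sas = []
        for j in range(0, agent.a_len):
            value_sas.append(np.dot(agent.p[j, i, :],
                                    agent.r + agent.gamma * agent.value_pi))
        new_policy[i] = np.argmax(value_sas)
    policy_stable = np.all(new_policy == agent.pi)   # the outer loop stops when nothing changes
    agent.pi = new_policy
    return policy_stable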

Value iteration only drives the value function to convergence; it does not maintain a policy function during the updates. At every step it takes, for each state, the optimal value (the highest value over all actions) as the current value, and the difference from the previous value serves as the loss that drives the update. When the value function is optimal, the iteration stops, and the policy is derived from it afterwards (see the sketch after the code below). The reason it is not as fast as policy iteration is that it insists on the "optimal" value function rather than stopping as soon as the policy stabilizes.

(Value iteration:)

for i in range(1, agent.s_len):  # for each state
    value_sas = []
    for j in range(0, agent.a_len):  # for each action
        value_sa = np.dot(agent.p[j, i, :], agent.r + agent.gamma * agent.value_pi)
        value_sas.append(value_sa)
    new_value_pi[i] = max(value_sas)
diff = np.sqrt(np.sum(np.power(agent.value_pi - new_value_pi, 2)))
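Once diff falls below a tolerance, the value function is considered converged and the policy is read off greedily. A minimal sketch under the same assumptions about the agent object as above:

import numpy as np

def extract_policy(agent):
    # Read off the greedy policy from the converged value function.
    policy = np.zeros(agent.s_len, dtype=int)
    for i in range(1, agent.s_len):
        q = [np.dot(agent.p[j, i, :], agent.r + agent.gamma * agent.value_pi)
             for j in range(0, agent.a_len)]
        policy[i] = int(np.argmax(q))
    return policy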

There are two good diagrams on page 167 of the book that capture the essence of reinforcement learning.

There is also generalized policy iteration. Policy iteration and value iteration have several things in common: (1) both end up with a policy function and a value function; (2) both obtain the optimal policy through convergence of the value function; (3) in both, the value function converges via the Bellman equation. Their emphases differ: the core of policy iteration is the policy, and for the sake of improving the policy its value function may be less accurate; the core of value iteration is the value function, and its core loop contains no policy at all.
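Written out, the two update rules iterated by the code above are the corresponding Bellman backups (the reward depends only on the next state, since agent.r is indexed by s'):

$$ v_{k+1}(s) = \sum_{s'} p(s' \mid s, \pi(s)) \left[ r(s') + \gamma\, v_k(s') \right] \qquad \text{(policy evaluation)} $$

$$ v_{k+1}(s) = \max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s') + \gamma\, v_k(s') \right] \qquad \text{(value iteration)} $$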

Generalized iteration in the broad sense: run policy iteration during the early rounds of optimization, then switch to value iteration, as illustrated by the two figures 6-10 and 6-11 on page 170 of the book. A rough sketch of that schedule follows.
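A rough sketch of such a schedule, assuming hypothetical wrappers policy_evaluation and value_iteration_sweep around the two loops shown earlier, plus the policy_improvement and extract_policy sketches above (these helper names are illustrative assumptions, not the book's API):

def generalized_policy_iteration(agent, policy_rounds=10, tol=1e-6):
    # Early rounds: policy iteration (evaluate the current policy, then improve it greedily).
    for _ in range(policy_rounds):
        policy_evaluation(agent)              # assumed wrapper around the evaluation loop above
        if policy_improvement(agent):         # greedy improvement sketched earlier; True = policy unchanged
            break
    # Later rounds: plain value iteration until the value function stops moving.
    while True:
        diff = value_iteration_sweep(agent)   # assumed wrapper around the max-backup loop above
        if diff < tol:
            break
    return extract_policy(agent)              # greedy policy from the converged values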

 
