Online Learning (On-Policy): the agent "does what it says" — it personally takes part in the training process, learning from the actions it actually executes rather than from a simulated (e.g., purely greedy) selection.
Sarsa algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each training episode):
    Initialize state s
    Choose action a from s using the policy derived from Q (e.g., ε-greedy with ε=0.9: take the best action 90% of the time, a random action 10% of the time)
    Repeat (for each step of the episode):
        Take action a, observe reward r and next state s'
        Choose action a' from s' using the policy derived from Q (same ε-greedy rule)
        Update Q(s, a): Q(s, a) <- Q(s, a) + α * [r + γ * Q(s', a') - Q(s, a)]
        (α: learning rate, i.e., how much of the TD error is absorbed in one update; γ: discount factor; Q target: r + γ * Q(s', a'); Q estimate: Q(s, a) — note Sarsa's target uses Q(s', a'), not max over a')
        Set s <- s', a <- a' (the next state and action become the current ones)
    Until s is terminal
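
A minimal runnable Python sketch of the pseudocode above, assuming a hypothetical 1-D corridor environment (the names N_STATES, step, choose_action and the reward scheme are illustrative, not from the original text); ε=0.9 here follows the notes' convention of 90% greedy / 10% random:

import random
from collections import defaultdict

# Hypothetical 1-D corridor: states 0..5, action 0 = left, 1 = right;
# reaching state 5 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 6, [0, 1]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.9  # learning rate, discount, greedy probability

def step(s, a):
    """Environment transition: returns (next_state, reward, done)."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    if s2 == N_STATES - 1:
        return s2, 1.0, True
    return s2, 0.0, False

def choose_action(Q, s):
    """ε-greedy in the notes' convention: greedy with probability EPSILON (0.9)."""
    if random.random() < EPSILON:
        best = max(Q[(s, a)] for a in ACTIONS)
        return random.choice([a for a in ACTIONS if Q[(s, a)] == best])  # break ties randomly
    return random.choice(ACTIONS)

Q = defaultdict(float)  # Q(s, a), all pairs initialized to 0
for episode in range(100):
    s = 0
    a = choose_action(Q, s)        # choose a before the step loop (Sarsa)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = choose_action(Q, s2)  # the action that WILL actually be taken next
        # Sarsa target uses Q(s', a'); do not bootstrap from a terminal state
        target = r if done else r + GAMMA * Q[(s2, a2)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s, a = s2, a2              # commit to a' — this is what makes it on-policy

for s in range(N_STATES):
    print(s, [round(Q[(s, a)], 3) for a in ACTIONS])

The on-policy detail is the last line of the inner loop: the a' used in the update is exactly the action executed at the next step.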
The difference between Sarsa and Q-Learning lies in the update target. Sarsa first picks the next action a' and, after updating the Q value, definitely executes that action — the action used in the update is the action taken. Q-Learning instead updates with the largest Q value in the next state, max Q(s', a'), but when it actually acts in s' it still follows ε-greedy, so it only takes the greedy action with 90% probability; the action used in the update is not necessarily the action executed.
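
In code the contrast is a single line — the TD target. A hedged sketch, reusing Q, GAMMA, and ACTIONS from the example above:

def sarsa_target(Q, r, s2, a2):
    # On-policy: bootstraps from the action a' the agent will actually execute.
    return r + GAMMA * Q[(s2, a2)]

def q_learning_target(Q, r, s2):
    # Off-policy: bootstraps from the greedy action's value, even though the
    # ε-greedy behavior policy only executes that action 90% of the time.
    return r + GAMMA * max(Q[(s2, a)] for a in ACTIONS)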
The Sarsa algorithm is therefore somewhat timid and conservative: because it learns from the actions it actually executes, including exploratory mistakes, the model learns to steer clear of risky states rather than taking the risky but most direct path. In the classic cliff-walking gridworld, for example, Sarsa learns the longer, safer route away from the cliff, while Q-Learning learns the shorter route along its edge.