Reinforcement learning study notes (4)

Sarsa

  • Online learning (on-policy): the agent learns from what it actually does. It takes part in the interaction itself, and the action used in the update is the action that is really executed, not a hypothetical action used only for the calculation.

Sarsa algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each training episode):
	Initialize state s
	Choose action a from s using a policy derived from Q (e.g. ε-greedy with ε = 0.9: act greedily 90% of the time, act randomly 10% of the time)
	Repeat (for each step of the episode):
		Take action a, observe reward r and next state s'
		Choose action a' from s' using a policy derived from Q (e.g. ε-greedy with ε = 0.9: act greedily 90% of the time, act randomly 10% of the time)
		Update Q(s, a): Q(s, a) <- Q(s, a) + α * [r + γ * Q(s', a') - Q(s, a)]
		(α: learning rate, the fraction of the error learned in one update; γ: discount factor; Q target ("Q reality"): r + γ * Q(s', a'); Q estimate: Q(s, a))
		s <- s', a <- a'
	until s is terminal
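
A minimal Python sketch of the pseudocode above, assuming a gym-style environment with hypothetical env.reset() / env.step(a) methods and a discrete action space; alpha, gamma and epsilon correspond to α, γ and ε in the notes.

import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.9):
    # Tabular Sarsa; env is assumed to follow the gym-style API:
    # reset() -> state, step(a) -> (next_state, reward, done, info)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))   # Q(s, a), initialised to zeros

    def choose_action(s):
        # epsilon-greedy, with epsilon = probability of acting greedily (as in the notes)
        if np.random.rand() < epsilon:
            return int(np.argmax(Q[s]))
        return np.random.randint(n_actions)

    for _ in range(n_episodes):
        s = env.reset()
        a = choose_action(s)                       # choose a from s before the step loop
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)       # take a, observe r and s'
            a_next = choose_action(s_next)         # choose a' from s' (on-policy)
            # Sarsa target uses the action that will actually be taken: r + gamma * Q(s', a')
            target = r + gamma * Q[s_next][a_next] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next                  # the chosen a' really is executed next
    return Q
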
  • The difference between Sarsa and Q-Learning: when Sarsa updates Q(s, a), it has already chosen the next action a', and after the update it will definitely execute that action. Q-Learning's update instead uses the action with the largest Q value in the next state, but under ε-greedy it only takes that greedy action with (for example) 90% probability, so the action actually executed next is not necessarily the one used in the update (see the sketch after this list).
  • Sarsa is somewhat timid and conservative: because it learns from the actions it actually executes, the model learns to steer clear of risky states rather than heading straight down the risky but most direct path.
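
To make the contrast in the two bullets above concrete, the only line that differs between the two updates is the target. The sketch below reuses the same hypothetical Q, s_next and a_next names as the Sarsa code above:

# Sarsa (on-policy): the target uses the action a' that will actually be executed next
sarsa_target = r + gamma * Q[s_next][a_next]

# Q-Learning (off-policy): the target uses the greedy action in s', even though the
# epsilon-greedy behaviour policy may end up executing a different action there
q_learning_target = r + gamma * np.max(Q[s_next])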

Origin blog.csdn.net/weixin_40042498/article/details/113884918