Reinforcement Learning Cheatsheet

MDPs

How do MDPs differ from k-armed bandit problems?

Whereas in bandit problems we estimated the value \(q_*(a)\) of each action \(a\), in MDPs we estimate the value \(q_*(s,a)\) of each action \(a\) in each state \(s\), or we estimate the value \(v_*(s)\) of each state given optimal action selections.
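To make the contrast concrete, here is a minimal sketch of the two kinds of value tables. The sizes and numbers are made up purely for illustration and do not come from any real environment. The sketch also relies on the standard identity \(v_*(s) = \max_a q_*(s,a)\), applied here to estimates rather than to the true optimal values.

```python
# Hypothetical problem sizes, chosen only for illustration.
n_states, n_actions = 3, 2

# Bandit setting: one value estimate per action, q(a).
bandit_q = [0.0] * n_actions

# MDP setting: one value estimate per (state, action) pair, q(s, a).
# These numbers are made up; in practice they would be learned.
mdp_q = [
    [1.0, 2.5],   # state 0
    [0.3, 0.1],   # state 1
    [-1.0, 4.0],  # state 2
]


def state_values(q):
    """Return v(s) = max_a q(s, a) for every state s."""
    return [max(row) for row in q]


def greedy_policy(q):
    """Return, for every state, the index of the action with the highest q(s, a)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


print(state_values(mdp_q))
print(greedy_policy(mdp_q))
```

The bandit table has one entry per action, while the MDP table has one row per state, so the best action can change from state to state.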

Reposted from www.cnblogs.com/DianeSoHungry/p/11444968.html