Policy learning means learning the optimal policy function $\pi(a|s)$, or an approximation of it (such as a policy network), by solving an optimization problem.
1. Policy Network
In applications such as Atari games and Go, the state is a tensor (e.g., an image), so the input should first be processed by a convolutional network, as shown in Figure 7.1. In applications such as robot control, the state $s$ is a vector whose elements are readings from multiple sensors, so the convolutional network should be replaced by a fully connected network.
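A minimal sketch of the two architectures, assuming PyTorch and a discrete action space (the layer sizes and the 4-frame 84×84 input are assumptions borrowed from common Atari setups, not from the original text):

```python
import torch
import torch.nn as nn

class ConvPolicyNet(nn.Module):
    """Policy network for tensor states (e.g., stacked Atari frames)."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(num_actions)   # logits over actions

    def forward(self, s):                        # s: (batch, 4, 84, 84)
        return torch.softmax(self.head(self.features(s)), dim=-1)

class MlpPolicyNet(nn.Module):
    """Policy network for vector states (e.g., sensor readings)."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, s):                        # s: (batch, state_dim)
        return torch.softmax(self.net(s), dim=-1)
```

Either way, the output is a probability distribution $\pi(\cdot|s;\theta)$ over the actions.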
2. The objective function of policy learning
- The state value $V_\pi(s_t)$ depends both on the current state $s_t$ and on the parameters $\theta$ of the policy network $\pi$.
- The objective function for policy learning is therefore the state value averaged over states, $J(\theta) = \mathbb{E}_S\left[V_\pi(S)\right]$; training seeks the $\theta$ that maximizes $J(\theta)$.
3. Policy Gradient Theorem
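The policy gradient theorem gives the gradient of the objective:

$$\frac{\partial J(\theta)}{\partial \theta} = \mathbb{E}_S\left[\mathbb{E}_{A\sim\pi(\cdot|S;\theta)}\left[\frac{\partial \ln \pi(A|S;\theta)}{\partial \theta}\cdot Q_\pi(S,A)\right]\right].$$

Below is a minimal sketch of one stochastic gradient step based on this theorem, sampling $A \sim \pi(\cdot|s;\theta)$ and using `q_value` as a stand-in for $Q_\pi(s,a)$; `policy_net` and `q_value` are illustrative names, not from the original text.

```python
import torch

def policy_gradient_step(policy_net, optimizer, state, q_value):
    """One gradient-ascent step on J(theta) from a single sampled action."""
    probs = policy_net(state)                    # pi(.|s; theta), shape (1, A)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()                       # A ~ pi(.|s; theta)
    # Maximizing q * ln pi(a|s; theta) == minimizing its negative.
    loss = -q_value * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action.item()
```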
四、Actor-Critic
1. Value Network
Actor-critic methods use a neural network to approximate the action-value function $Q_\pi(s,a)$. This neural network is called the "value network," denoted $q(s,a;\mathbf{w})$.
Note the difference from the DQN: the DQN approximates the optimal action-value function $Q_\star(s,a)$ and is trained by Q-learning, whereas the value network here approximates the action-value function $Q_\pi(s,a)$ of the current policy $\pi$ and is trained by SARSA.
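A minimal sketch of such a value network for a discrete action space, again assuming PyTorch (the layer sizes are assumptions): structurally it looks like a DQN, taking the state and outputting one score per action, but the outputs are read as $Q_\pi$, not $Q_\star$.

```python
import torch.nn as nn

class ValueNet(nn.Module):
    """q(s, a; w): maps a state to one score per action.

    The structure resembles a DQN, but the outputs approximate
    Q_pi(s, a) for the current policy pi, not the optimal Q*(s, a).
    """
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, s):        # s: (batch, state_dim)
        return self.net(s)       # (batch, num_actions); column a is q(s, a; w)
```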
2. Actor-Critic
The policy network $\pi(a|s;\theta)$ is equivalent to an actor: it takes action $a$ based on state $s$. The value network $q(s,a;\mathbf{w})$ is equivalent to a judge: it gives the actor's performance a score, evaluating how good or bad action $a$ is in state $s$.
Note:
- Training the policy network (actor) requires the return $U$, not the reward $R$. The value network (judge) can estimate the expected return $U$ and thus helps train the policy network (actor).
(1) Training the policy network (actor)
The judge's score $q(s,a;\mathbf{w})$ replaces the unknown action value $Q_\pi(s,a)$ in the policy gradient, giving the approximate gradient-ascent update:

$$\theta \leftarrow \theta + \beta \cdot q(s,a;\mathbf{w}) \cdot \frac{\partial \ln \pi(a|s;\theta)}{\partial \theta},$$

where $\beta$ is the learning rate.
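A minimal sketch of this actor update, continuing the PyTorch sketches above (`policy_net`, `value_net`, and `actor_optimizer` are assumed names; `state` is a batch of one):

```python
import torch

def update_actor(policy_net, actor_optimizer, value_net, state, action):
    """Gradient-ascent step on q(s, a; w) * ln pi(a|s; theta)."""
    q = value_net(state)[0, action].detach()   # judge's score; no grad into w
    log_pi = torch.log(policy_net(state)[0, action])
    loss = -q * log_pi                         # minimize the negative
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
```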
(2) Training the value network (judge)
Update $\mathbf{w}$ with the SARSA algorithm to improve the judge's skill. Each time a reward $r$ is observed from the environment, $r$ is regarded as ground truth and used to calibrate the judge's scoring.
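A sketch of this SARSA update for $\mathbf{w}$ (the discount rate `gamma` and the helper names are assumptions); the TD target $r + \gamma\, q(s_{t+1}, a_{t+1}; \mathbf{w})$ treats the observed reward $r$ as ground truth:

```python
import torch

def update_critic(value_net, critic_optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One SARSA step: regress q(s, a; w) toward r + gamma * q(s', a'; w)."""
    q_sa = value_net(s)[0, a]                  # current score q(s_t, a_t; w)
    with torch.no_grad():                      # TD target is held fixed
        td_target = r + gamma * value_net(s_next)[0, a_next]
    loss = (q_sa - td_target) ** 2             # squared TD error
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
```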
Overall training steps:
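One possible way to put the two updates together into a single training step, reusing the sketches above (the environment is assumed to follow Gym's classic `step()` convention; terminal-state handling is omitted for brevity):

```python
import torch

def train_step(env, state, policy_net, value_net, actor_opt, critic_opt, gamma=0.99):
    """One actor-critic step: act, observe r and s', update judge, update actor."""
    action = torch.distributions.Categorical(policy_net(state)).sample().item()
    obs, reward, done, _ = env.step(action)
    next_state = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
    # SARSA also needs the next action, sampled from the current policy.
    next_action = torch.distributions.Categorical(policy_net(next_state)).sample().item()
    update_critic(value_net, critic_opt, state, action, reward, next_state, next_action, gamma)
    update_actor(policy_net, actor_opt, value_net, state, action)
    return next_state, done
```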