Reinforcement Learning (DRL) -- Policy Learning (Actor-Critic)

Policy learning means learning the optimal policy function $\pi(a|s)$, or an approximation of it (such as a policy network), by solving an optimization problem.

1. Policy Network

[Figure 7.1: policy network architecture]
In applications such as Atari games and Go, the state is a tensor (e.g., an image), so the input should first be processed by a convolutional network, as shown in Figure 7.1. In applications such as robot control, the state s is a vector whose elements are the readings of multiple sensors, so the convolutional network should be replaced by a fully connected network.
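As a concrete illustration of the fully connected case, here is a minimal PyTorch sketch of a policy network for a vector-valued state; the class name `PolicyNet`, the layer sizes, and `state_dim`/`action_dim` are illustrative assumptions rather than details from the original post. For image-shaped states, the first layers would be convolutional instead.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network pi(a|s; theta): maps a state vector to a
    probability distribution over discrete actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into action probabilities summing to 1.
        return torch.softmax(self.net(state), dim=-1)
```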

2. The Objective Function of Policy Learning

  • The state value $V_\pi(s_t)$ depends both on the current state $s_t$ and on the parameter $\theta$ of the policy network $\pi$.
  • The objective function of policy learning takes the expectation of the state value over the state $S$, which removes the dependence on any particular $s_t$:
    $$J(\theta) = \mathbb{E}_S\big[V_\pi(S)\big].$$
    Policy learning adjusts $\theta$ to maximize $J(\theta)$: the larger $J(\theta)$, the better the policy.

3. Policy Gradient Theorem
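In its standard textbook form (stated here for reference; the exact presentation in the original figures may differ), the policy gradient theorem expresses the gradient of the objective in terms of the policy's log-probabilities and the action value:

$$\nabla_\theta J(\theta) = \mathbb{E}_S\,\mathbb{E}_{A \sim \pi(\cdot\mid S;\theta)}\Big[\nabla_\theta \ln \pi(A\mid S;\theta)\cdot Q_\pi(S,A)\Big].$$

Replacing the expectations with a single sampled pair $(s_t, a_t)$ gives the stochastic policy gradient used by the actor update in the Actor-Critic section below.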

4. Actor-Critic


1. Value Network

Actor-critic methods use a neural network to approximate the action-value function $Q_\pi(s,a)$. This neural network is called the "value network" and is denoted $q(s,a;\mathbf{w})$, where $\mathbf{w}$ denotes its trainable parameters.
[Figure: structure of the value network $q(s,a;\mathbf{w})$]
Note the difference from the DQN network: the value network here approximates $Q_\pi(s,a)$, the action value of the current policy $\pi$, and is trained with SARSA, whereas DQN approximates the optimal action value $Q_\star(s,a)$ and is trained with Q-learning.
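As a companion to the policy-network sketch above, here is a minimal PyTorch sketch of a value network; `ValueNet` and its layer sizes are illustrative assumptions. This design outputs one score per discrete action and indexes the chosen one; feeding a one-hot encoding of the action into the network is an equivalent alternative.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Value network q(s, a; w): scores how good action a is in state s."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # one score per discrete action
        )

    def forward(self, state: torch.Tensor, action: int) -> torch.Tensor:
        # Pick out the scalar score q(s, a; w) of the chosen action.
        return self.net(state)[..., action]
```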

2. Actor-Critic

The policy network $\pi(a|s;\theta)$ plays the role of an actor: it makes the action $a$ based on the state $s$. The value network $q(s,a;\mathbf{w})$ plays the role of a judge (the critic): it gives the actor's performance a score, evaluating how good or bad the action $a$ is in the state $s$.
Note:

  • Training the policy network (actor) requires the return $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$, not just the single reward $R_t$. The value network (critic) estimates the expected return, $Q_\pi(s_t,a_t) = \mathbb{E}\big[U_t \mid s_t, a_t\big]$, and thus helps train the policy network (actor).

(1) Training the policy network (actor)

Then perform the algorithm update. Using the value network's score $q(s_t, a_t;\mathbf{w})$ in place of the unknown action value $Q_\pi(s_t, a_t)$, the approximate policy gradient for the sampled action $a_t$ is
$$g(a_t;\theta) = q(s_t, a_t;\mathbf{w})\,\nabla_\theta \ln \pi(a_t \mid s_t;\theta),$$
and the actor is updated by gradient ascent with learning rate $\beta$:
$$\theta \leftarrow \theta + \beta \cdot g(a_t;\theta)$$
(see the combined code sketch after the overall training steps below).

(2) Training the value network (critic)

Update $\mathbf{w}$ with the SARSA algorithm to improve the judge's (critic's) accuracy. Each time a reward $r_t$ is observed from the environment, $r_t$ is regarded as ground truth and used to calibrate the critic's scoring. With $a_{t+1} \sim \pi(\cdot \mid s_{t+1};\theta)$ sampled for the target:
$$y_t = r_t + \gamma\, q(s_{t+1}, a_{t+1};\mathbf{w}) \quad \text{(TD target)},$$
$$\delta_t = q(s_t, a_t;\mathbf{w}) - y_t \quad \text{(TD error)},$$
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \nabla_{\mathbf{w}}\, q(s_t, a_t;\mathbf{w}),$$
where $\alpha$ is the critic's learning rate.
Overall training steps:
One iteration of actor-critic training proceeds as follows:

1. Observe the current state $s_t$.
2. Sample an action $a_t \sim \pi(\cdot \mid s_t;\theta)$ and execute it; the environment returns the reward $r_t$ and the next state $s_{t+1}$.
3. Sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1};\theta)$ for the TD target only; $\tilde{a}_{t+1}$ is not executed.
4. Evaluate the value network: $\hat{q}_t = q(s_t, a_t;\mathbf{w})$ and $\hat{q}_{t+1} = q(s_{t+1}, \tilde{a}_{t+1};\mathbf{w})$.
5. Compute the TD error $\delta_t = \hat{q}_t - (r_t + \gamma\,\hat{q}_{t+1})$.
6. Update the value network: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\,\delta_t\,\nabla_{\mathbf{w}}\, q(s_t, a_t;\mathbf{w})$.
7. Update the policy network: $\theta \leftarrow \theta + \beta\,\hat{q}_t\,\nabla_\theta \ln \pi(a_t \mid s_t;\theta)$.
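The following is a minimal PyTorch sketch of one such training step, assuming the hypothetical `PolicyNet` and `ValueNet` classes from the earlier sketches and a classic Gym-style environment whose `step` returns `(obs, reward, done, info)`; the dimensions and hyperparameters (`gamma`, the two learning rates) are illustrative, not from the original post.

```python
import torch

gamma = 0.99
policy = PolicyNet(state_dim=4, action_dim=2)  # illustrative sizes
critic = ValueNet(state_dim=4, action_dim=2)
actor_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)   # beta
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)  # alpha

def train_step(s_t, env):
    """Run steps 1-7 above for one transition; returns (s_{t+1}, done)."""
    s = torch.as_tensor(s_t, dtype=torch.float32)

    # Steps 1-2: sample a_t ~ pi(.|s_t; theta) and execute it.
    a_t = torch.multinomial(policy(s), 1).item()
    s_next, r_t, done, _ = env.step(a_t)  # assumes classic Gym step API
    s_next_tensor = torch.as_tensor(s_next, dtype=torch.float32)

    # Steps 3-4: sample a~_{t+1} for the TD target only (never executed)
    # and evaluate the value network at (s_{t+1}, a~_{t+1}).
    with torch.no_grad():
        a_next = torch.multinomial(policy(s_next_tensor), 1).item()
        q_next = critic(s_next_tensor, a_next)
        td_target = r_t + gamma * q_next * (0.0 if done else 1.0)

    # Steps 5-6: TD error, then SARSA update of the value network (critic).
    q_t = critic(s, a_t)
    critic_loss = (q_t - td_target) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 7: policy gradient ascent on theta, scored by the (detached) q_t.
    log_prob = torch.log(policy(s)[a_t])
    actor_loss = -log_prob * q_t.detach()  # minimizing -J is ascent on J
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return s_next, done
```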

5. Policy Gradient Method with Baseline

Origin: blog.csdn.net/qq_45889056/article/details/129695893