Li Hongyi Reinforcement Learning (Mandarin) Course (2018) Notes (1): Policy Gradient (Review)

Li Hongyi Reinforcement Learning (Mandarin) Course (2018)

 https://www.bilibili.com/video/BV1MW411w79n?spm_id_from=333.337.search-card.all.click&vd_source=a4c529a804be1b8a88658c292d9065f9

PPO is a variant of Policy Gradient: take Policy Gradient, change it from on-policy to off-policy, then add some constraints, and it becomes PPO.

Reinforcement learning has three components: the Actor, the Environment, and the Reward Function.

The Actor's behavior is controlled by a Policy, and the Policy is usually a neural network.
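As a rough illustration (my own sketch, not code from the lecture), such a policy network could look like the following, assuming PyTorch; the observation size, hidden width, and number of actions are made-up values:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    # Maps an observation (e.g. a flattened game screen) to a probability distribution over actions.
    def __init__(self, obs_dim=128, n_actions=3):  # sizes are illustrative only
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)  # p_theta(a|s): probability of each action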

As an example of how an Actor interacts with the Environment, consider playing a video game: the observation is the game screen the machine sees, the Actor outputs an action, and the Environment returns the next screen together with a reward.

Note: one play of the game from start to finish is called an Episode, and the sum of all rewards obtained in an Episode is the Total Reward. The Actor's goal is to maximize the Total Reward it can obtain.
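Written out in the lecture's notation, the Total Reward of an episode of length T is:

R(\tau) = \sum_{t=1}^{T} r_{t}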

Within one game, the sequence of state-action pairs (s, a) taken in order forms a Trajectory τ. Given the parameters θ of the network controlling the Actor, the probability of a particular Trajectory occurring can be calculated.
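Concretely, a trajectory and its probability under the Actor with parameters θ are:

\tau = \{s_{1}, a_{1}, s_{2}, a_{2}, ..., s_{T}, a_{T}\}

p_{\theta}(\tau) = p(s_{1}) \prod_{t=1}^{T} p_{\theta}(a_{t}|s_{t})\, p(s_{t+1}|s_{t}, a_{t})

where p(s_{1}) and p(s_{t+1}|s_{t}, a_{t}) come from the Environment, and only p_{\theta}(a_{t}|s_{t}) depends on the Actor.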

Since the Actor's action in a given state is stochastic, and the next state the Environment produces after a given action is also stochastic, R(\tau) is a random variable. What can be computed is its expected value given θ, written \bar{R}_{\theta}.

To compute this expected value exactly, one would enumerate every possible trajectory τ, weight the Total Reward R(τ) of each by its probability p_{\theta}(\tau), and sum. In practice the trajectories cannot be enumerated, so a practical way to estimate the expectation is needed: sample N trajectories from p_{\theta}(\tau) and average.
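In formulas:

\bar{R}_{\theta} = \sum_{\tau} R(\tau)\, p_{\theta}(\tau) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)] \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})

where the τ^{n} are trajectories obtained by letting the Actor with parameters θ play N games.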

Policy Gradient maximizes the expected reward \bar{R}_{\theta} by gradient ascent: compute the gradient of \bar{R}_{\theta} with respect to θ and take a step in that direction.
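Using the identity \nabla p_{\theta}(\tau) = p_{\theta}(\tau) \nabla \log p_{\theta}(\tau), and noting that the Environment terms in p_{\theta}(\tau) do not depend on θ and therefore drop out of the gradient, the update is:

\nabla \bar{R}_{\theta} = E_{\tau \sim p_{\theta}(\tau)}[R(\tau) \nabla \log p_{\theta}(\tau)] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} R(\tau^{n})\, \nabla \log p_{\theta}(a_{t}^{n}|s_{t}^{n})

\theta \leftarrow \theta + \eta \nabla \bar{R}_{\theta}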

When the Policy is a neural network, computing \nabla \log p_{\theta}(a_{t}^{n}|s_{t}^{n}) is just backpropagation, exactly as in supervised learning: it is the gradient of a classification (cross-entropy) objective in which the sampled action a_{t}^{n} plays the role of the ground-truth label, weighted by R(\tau^{n}).
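A rough sketch (my own, not from the lecture) of this weighted-cross-entropy view of the update, assuming PyTorch, the PolicyNetwork sketch above, and batches of sampled states, actions, and per-trajectory returns:

def policy_gradient_loss(policy, states, actions, returns):
    # REINFORCE-style surrogate loss: weighted negative log-likelihood of the sampled actions.
    #   states:  tensor of shape (B, obs_dim)  - sampled states s_t^n
    #   actions: tensor of shape (B,)          - sampled action indices a_t^n
    #   returns: tensor of shape (B,)          - weight for each sample, e.g. R(tau^n)
    probs = policy(states)                                                    # p_theta(a|s), shape (B, n_actions)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))   # log p_theta(a_t|s_t)
    # Minimizing this loss by gradient descent is the same as gradient ascent on R_bar(theta).
    return -(returns * log_probs).mean()

# usage sketch:
# loss = policy_gradient_loss(policy, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()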

 

If the reward is always positive, then every action that happens to be sampled has its probability increased. Since the action probabilities must sum to 1, the actions that are not sampled have their probabilities pushed down, even if they are good actions; with limited sampling this way of adjusting the probabilities is not sound. The fix from the lecture is to subtract a baseline b from the reward, so that the weight R(\tau) - b can be negative.
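With the baseline, the gradient estimate becomes:

\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} \left( R(\tau^{n}) - b \right) \nabla \log p_{\theta}(a_{t}^{n}|s_{t}^{n}), \quad b \approx E[R(\tau)]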

 

 

Within the same game, some actions may be good and some may be bad. If the final result of the game is good, it does not follow that every action taken along the way was good; if the result is bad, it does not follow that every action was bad. With enough samples this would average out, but in practice there are never enough samples, so each state-action pair should be given its own reasonable credit for how much it contributes to the score: instead of weighting every action by the whole game's reward, only count the rewards obtained after that action is executed.
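In other words, the weight on \nabla \log p_{\theta}(a_{t}^{n}|s_{t}^{n}) changes from the whole-episode reward to the reward obtained from time t onward:

\sum_{t'=t}^{T_{n}} r_{t'}^{n} - b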

Going one step further, what we care about is not how good an action is in absolute terms, but how much better it is to execute that action in a given state compared with the other actions that could have been taken there; a relative measure rather than an absolute one.

 

In addition, add a discount factor γ (with 0 < γ < 1), so that rewards obtained further in the future contribute less to the credit assigned to the current action.
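With the discount factor, the weight becomes the discounted reward-to-go minus the baseline; in the lecture this whole weight is eventually generalized to the advantage function A^{\theta}(s_{t}, a_{t}):

\sum_{t'=t}^{T_{n}} \gamma^{t'-t}\, r_{t'}^{n} - b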

 


Original post: blog.csdn.net/qq_22749225/article/details/125474814