Understanding the Deep Reinforcement Learning (DRL) Behind ChatGPT

Reference: Part 1: Key Concepts in RL — Spinning Up documentation

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

RL is short for Reinforcement Learning. DRL is short for Deep Reinforcement Learning, the combination of deep learning and reinforcement learning.

Reinforcement learning studies the continuous trial-and-error process of an agent: by repeatedly rewarding or punishing the agent, it learns to repeat beneficial behaviors in the future and to give up unfavorable ones.

The two main roles in RL are the agent and the environment. The environment is the world the agent lives in and interacts with. At each step of the interaction, the agent looks at an observation of the environment's state and decides on an action to take. When the action is carried out, the environment changes, and the agent's situation changes along with it.

   The agent also receives a reward signal from the environment, a single number that tells it how good or bad the current state of the environment is. The agent's goal is to maximize its cumulative reward, which is called the return. RL methods are ways for the agent to learn behaviors that achieve this goal.
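
This interaction loop can be sketched in a few lines. The sketch below is only an illustration: it assumes the Gymnasium library and its CartPole-v1 environment (neither is named in the original text) and uses a purely random policy in place of a learned one.

```python
import gymnasium as gym

# Assumed example environment; any Gymnasium env exposes the same reset/step interface.
env = gym.make("CartPole-v1")

obs, info = env.reset()
episode_return = 0.0
done = False

while not done:
    # The "policy" here is just uniform random sampling from the action space.
    action = env.action_space.sample()
    # The environment changes in response to the action and emits a reward signal.
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print("cumulative reward (return) for this episode:", episode_return)
env.close()
```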

Main concepts: 

States and Observations

A state s is a complete description of the state of the environment. An observation o is a partial description of the state, which may omit some information.

An environment is said to be fully observed when the agent can observe the complete state of the environment.

An environment is said to be partially observed when the agent can only see a partial observation of the state.

Action Spaces

Different environments allow different kinds of actions. The set of all actions that are valid in a given environment is called the action space.

Depending on the environment, the action space may be discrete: a discrete action space contains only a finite number of distinct actions available to the agent.

Other environments have continuous action spaces, where actions are real-valued vectors, for example the control of a robot's movement. This distinction has profound consequences for DRL methods: different types of action spaces call for different families of DRL algorithms.
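
A small sketch of the two kinds of action spaces, using Gymnasium's space classes (the library and the specific sizes are assumptions, not taken from the original text):

```python
from gymnasium.spaces import Discrete, Box
import numpy as np

# Discrete action space: a finite set of actions, here {0, 1, 2, 3}.
discrete_space = Discrete(4)
print(discrete_space.sample())        # e.g. 2

# Continuous action space: actions are real-valued vectors,
# e.g. 3 joint torques, each bounded to [-1, 1].
continuous_space = Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
print(continuous_space.sample())      # e.g. [ 0.12 -0.87  0.45]
```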

Policies

  A policy is the rule the agent uses to decide which actions to take. The policy is what we optimize to maximize the agent's return.

 Parameterized policies: policies whose outputs are computable functions that depend on a set of parameters (for example, the weights of a neural network), so that the behavior can be changed by adjusting the parameters with an optimization algorithm.

Deterministic policies: a deterministic policy always outputs the same action for a given state, usually written a_t = \mu(s_t).

Stochastic policies: a stochastic policy outputs a probability distribution over actions and samples an action from it, usually written a_t \sim \pi(\cdot \mid s_t).

The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies.

Categorical policies are used in discrete action spaces, while diagonal Gaussian policies are used in continuous action spaces.
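
A minimal sketch of both kinds of stochastic policy in PyTorch. The network sizes and names below are illustrative assumptions, not something specified in the original text.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

obs_dim, n_actions, act_dim = 8, 4, 2   # illustrative sizes

# Categorical policy for a discrete action space:
# the network outputs one logit per action.
logits_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

# Diagonal Gaussian policy for a continuous action space:
# the network outputs the mean; the log-stds are standalone parameters.
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

obs = torch.randn(obs_dim)

discrete_dist = Categorical(logits=logits_net(obs))
discrete_action = discrete_dist.sample()              # an integer action

gaussian_dist = Normal(mu_net(obs), log_std.exp())    # independent Gaussian per action dimension
continuous_action = gaussian_dist.sample()            # a real-valued action vector
```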

Trajectories

A trajectory \tau is a sequence of states and actions in the environment: \tau = (s_0, a_0, s_1, a_1, \ldots).

Trajectories are also frequently called episodes or rollouts.

Reward and Return

Infinite-horizon discounted return: the cumulative reward over an infinite time window, where rewards are discounted according to how far in the future they are obtained.

Finite-horizon undiscounted return: the cumulative reward obtained within a fixed window of steps, without any discounting.
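
Written out in Spinning Up's notation, with per-step rewards r_t and a discount factor \gamma \in (0, 1):

R(\tau) = \sum_{t=0}^{T} r_t \qquad \text{(finite-horizon undiscounted return)}

R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t \qquad \text{(infinite-horizon discounted return)}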

Value Functions

Knowing the value of a state, or of a state-action pair, is very useful. A value function gives the expected return if you start in a given state (or state-action pair) and then act according to a particular policy. Value functions are used, in one way or another, in almost every RL algorithm.

There are 4 main categories of value functions: 

1. On-Policy Value Function V^{\pi}(s): the expected return if you start in state s and always act according to policy \pi.

 2. On-Policy Action-Value Function Q^{\pi}(s, a): the expected return if you start in state s, take an arbitrary action a (which may not come from the policy), and then forever after act according to policy \pi.

3. Optimal Value Function V^{*}(s): the expected return if you start in state s and always act according to the optimal policy.

4. Optimal Action-Value Function Q^{*}(s, a): the expected return if you start in state s, take an arbitrary action a, and then forever after act according to the optimal policy.
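
For reference, the four definitions written out in Spinning Up's notation, where \tau \sim \pi means trajectories obtained by acting according to \pi and R(\tau) is the return of the trajectory:

V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]

V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]

Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]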

   

The Optimal Q-Function and the Optimal Action

The optimal Q-function Q^{*}(s, a) gives the expected return from starting in state s, taking an arbitrary action a, and then acting according to the optimal policy. As a result, if we know Q^{*}, the optimal action in state s can be obtained directly as a^{*}(s) = \arg\max_{a} Q^{*}(s, a).
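
A tiny sketch of that last step, assuming we already have Q-value estimates for each discrete action in the current state (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical Q*(s, a) estimates for 4 discrete actions in the current state s.
q_values = np.array([1.2, 3.5, -0.7, 2.1])

# The optimal action is the one with the highest optimal action-value.
optimal_action = int(np.argmax(q_values))
print(optimal_action)   # -> 1
```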

Bellman Equations

 The basic idea behind the Bellman equations is this: the value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

The crucial difference between the Bellman equations for the on-policy value functions and those for the optimal value functions is whether a max is taken over actions. It reflects the fact that whenever the agent gets to choose its action, in order to act optimally, it has to pick the action that leads to the highest value.
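
The equations themselves, in Spinning Up's notation, where s' \sim P is shorthand for s' \sim P(\cdot \mid s, a):

V^{\pi}(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[ r(s, a) + \gamma V^{\pi}(s') \right]

Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P}\left[ r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi}\left[ Q^{\pi}(s', a') \right] \right]

V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P}\left[ r(s, a) + \gamma V^{*}(s') \right]

Q^{*}(s, a) = \mathbb{E}_{s' \sim P}\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]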

The term "Bellman backup" comes up frequently in the RL literature. The Bellman backup for a state, or state-action pair, is the right-hand side of the Bellman equation: the reward plus the next value.

Advantage Functions

The advantage function A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) describes how much better it is to take action a in state s than to act according to \pi on average; in other words, it gives the relative advantage of that action.

Advantage functions are crucial for policy gradient methods.
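
A minimal sketch of a simple one-step advantage estimate of the kind policy gradient methods rely on, assuming we already have per-step rewards and value predictions V(s_t); the arrays below are purely illustrative:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 1.0, 1.0])        # r_t from one short rollout (illustrative)
values  = np.array([2.5, 2.0, 1.8, 1.0, 0.0])   # V(s_t) for t = 0..T (last entry bootstraps s_T)

# One-step TD-residual advantage estimate: A_t ≈ r_t + γ V(s_{t+1}) − V(s_t).
advantages = rewards + gamma * values[1:] - values[:-1]
print(advantages)
```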

Formalism

So far we have discussed the agent-environment interaction in an informal way, but if you try to go deeper into the literature, you are likely to run into the standard mathematical formalism for these settings: Markov decision processes (MDPs). An MDP is a 5-tuple \langle S, A, R, P, \rho_0 \rangle, where

S is the set of all valid states,

A is the set of all valid actions,

R : S \times A \times S \to \mathbb{R} is the reward function, with r_t = R(s_t, a_t, s_{t+1}),

P is the transition probability function, where P(s' \mid s, a) is the probability of transitioning into state s' if you start in state s and take action a,

\rho_0 is the starting state distribution.

The name Markov decision process refers to the fact that the system obeys the Markov property: transitions depend only on the most recent state and action, not on the earlier history.
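
As a toy illustration, the 5-tuple can be written down as plain Python data; the two-state, two-action MDP below is entirely made up for the sake of the example:

```python
import numpy as np

states  = ["s0", "s1"]            # S: set of valid states
actions = ["left", "right"]       # A: set of valid actions
rho_0   = {"s0": 1.0, "s1": 0.0}  # starting state distribution

# P[s][a] is a distribution over next states; R[(s, a, s')] is the reward.
P = {
    "s0": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s0": 0.1, "s1": 0.9}},
    "s1": {"left": {"s0": 0.5, "s1": 0.5}, "right": {"s0": 0.0, "s1": 1.0}},
}
R = {("s0", "right", "s1"): 1.0}  # all other transitions give reward 0

def step(s, a, rng=np.random.default_rng()):
    """Sample s' ~ P(.|s, a) and return (s', r). Markov property: only (s, a) matter."""
    next_states, probs = zip(*P[s][a].items())
    s_next = rng.choice(next_states, p=probs)
    return s_next, R.get((s, a, s_next), 0.0)

print(step("s0", "right"))
```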


Origin blog.csdn.net/gridlayout/article/details/129498374