[Reinforcement Learning] 01 - Introduction to Reinforcement Learning

Two types of machine learning

[Figure: two types of machine learning]

The relationship between supervised learning, unsupervised learning, and reinforcement learning within machine learning:
  • Prediction
    • Predict the desired output from the data (supervised learning): $P(y \mid x)$
    • Generate data instances (unsupervised learning): $P(x, y)$
  • Decision making
    • Take actions in a dynamic environment (reinforcement learning)
      • Transition to a new state
      • Receive an immediate reward
      • Maximize cumulative reward over time

The difference between prediction and decision-making is whether the action causes the environment to change.

How reinforcement learning differs from other machine learning:

  • No supervision, only a reward signal;
  • Feedback is delayed;
  • Data is sequential, with correlations and dependencies between samples (non-i.i.d. data);
  • The agent's actions affect the subsequent data it receives.

In reinforcement learning, data is obtained while the agent interacts with the environment. If the agent never takes a particular action, the data corresponding to that action can never be observed, so the current agent's training data comes from the decisions made by earlier versions of the agent. Consequently, the data distribution generated by the agent's interaction with the environment depends on the agent's policy.


Reinforcement learning has a concept for this data distribution, called the occupancy measure. The normalized occupancy measure is the probability distribution over state-action pairs sampled while an agent's policy interacts with a dynamic environment.

The occupancy measure has a very important property: given two policies and the occupancy measures each induces by interacting with the same dynamic environment, the two policies are identical if and only if the two occupancy measures are identical. In other words, if an agent's policy changes, the occupancy measure induced by its interaction with the environment changes as well.
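To make this concrete, here is a minimal Python sketch (not from the original text) that estimates the normalized occupancy measure of a policy by Monte Carlo rollouts on a made-up two-state, two-action MDP. The dynamics, policies, and discount factor are all invented for illustration; the point is only that different policies induce different state-action distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical dynamics: P[s][a] = probability vector over next states.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.7, 0.3], 1: [0.1, 0.9]},
}

def occupancy(policy, episodes=2000, horizon=50):
    """policy[s] = probability vector over actions; returns a 2x2 estimate."""
    occ = np.zeros((2, 2))
    for _ in range(episodes):
        s = 0                              # fixed start state
        for t in range(horizon):
            a = rng.choice(2, p=policy[s])
            occ[s, a] += gamma ** t        # discount-weighted visitation count
            s = rng.choice(2, p=P[s][a])
    return occ * (1 - gamma) / episodes    # normalize to (approximately) a distribution

pi_1 = {0: [0.9, 0.1], 1: [0.9, 0.1]}      # mostly takes action 0
pi_2 = {0: [0.1, 0.9], 1: [0.1, 0.9]}      # mostly takes action 1

print(occupancy(pi_1))                     # different policies induce
print(occupancy(pi_2))                     # different occupancy measures
```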

Reinforcement learning definition

Reinforcement learning uses the concept of an agent to represent the machine that makes decisions. Compared with the "model" in supervised learning, the "agent" in reinforcement learning emphasizes that the machine not only perceives information about its surrounding environment but also directly changes the environment through its decisions, rather than merely producing prediction signals.

Reinforcement learning: computational methods for achieving goals by learning from interaction.

  • Perception: Being aware of one’s surroundings to some extent
  • Action: Taking action to affect a state or achieve a goal
  • Goal: Maximize rewards over time

Reinforcement learning interaction process

[Figure: agent-environment interaction loop]

At each step $t$, the agent:

  • Receives observation $O_t$
  • Receives reward $R_t$
  • Executes action $A_t$

At each step $t$, the environment:

  • Receives action $A_t$
  • Emits observation $O_{t+1}$
  • Emits reward $R_{t+1}$

After the environment step, $t$ advances to $t+1$.
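The loop above can be written directly in code. Below is a minimal sketch using the Gymnasium library with a random policy standing in for the agent; the library and the CartPole-v1 environment are assumptions chosen for illustration, not something the original text specifies.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)             # initial observation O_1

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()    # agent: choose A_t (random policy here)
    obs, reward, terminated, truncated, info = env.step(action)  # env: O_{t+1}, R_{t+1}
    episode_return += reward              # cumulative reward the agent tries to maximize
    done = terminated or truncated

env.close()
print("episode return:", episode_return)
```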

Reinforcement learning system elements

History

The sequence of past observations, rewards, and actions $O_i, R_i, A_i$:
$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

  • All observable variables up to time $t$
  • What happens next is determined from the history (Agent: $A_i$; Env: $O_{i+1}, R_{i+1}$)

State

Used to determine what happens next ($O$, $R$, $A$)

  • The state is a function of the history: $S_t = f(H_t)$; in some cases $f(H_t)$ is difficult to obtain directly (POMDP)

Policy

  • The agent's behavior: a mapping from state to action (see the sketch after this list)
  • Deterministic policy: $a = \pi(s)$
  • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
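As a quick illustration (a toy sketch with made-up states and actions, not part of the original), a deterministic policy can be a plain lookup from state to action, while a stochastic policy stores a distribution over actions for each state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy a = pi(s): a plain state -> action lookup.
pi_det = {"s0": "left", "s1": "right"}
action = pi_det["s0"]

# Stochastic policy pi(a|s): a distribution over actions for each state.
pi_stoch = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.2, "right": 0.8},
}

def sample_action(pi, s):
    actions = list(pi[s])            # action names
    probs = list(pi[s].values())     # P[A_t = a | S_t = s]
    return rng.choice(actions, p=probs)

action = sample_action(pi_stoch, "s0")
print(action)
```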

Reward

  • A scalar signal that defines the reinforcement learning goal and evaluates how good a state is

Value Function

  • A prediction of future cumulative reward
  • Used to evaluate how good a state is under a given policy (see the sketch after this list)
    $$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$
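For intuition, the quantity inside the expectation is just a discounted sum of future rewards. A minimal sketch with made-up numbers:

```python
# Discounted return of one sampled reward sequence (numbers are illustrative).
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]             # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)   # v_pi(s) is the average of such returns over many episodes starting in s
```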

Model

  • Used to predict what the environment will do next (a tabular sketch follows this list)
  • Predicts the next state: $\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • Predicts the next (immediate) reward: $\mathcal{R}_s^a = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$
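A tabular model can be represented directly as these two mappings. The sketch below is illustrative only; the states, actions, and numbers are invented:

```python
# Tabular model: next-state distribution P and expected immediate reward R,
# indexed by (state, action).
P = {
    ("s0", "a0"): {"s0": 0.8, "s1": 0.2},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s1": 1.0},
}
R = {
    ("s0", "a0"): 0.0,
    ("s0", "a1"): 1.0,
    ("s1", "a0"): 0.0,
    ("s1", "a1"): 2.0,
}

def predict(s, a):
    """Model prediction: distribution over S_{t+1} and expected R_{t+1}."""
    return P[(s, a)], R[(s, a)]

print(predict("s0", "a1"))
```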

Maze example

Find the shortest path.
[Figure: maze grid world]

  • Reward $R$: $-1$ per step;
  • Actions $A$: N, E, S, W;
  • State $S$: the agent's location;
  • Arrows represent the policy $\pi(s)$ in each state;
  • Numbers represent the value $v_\pi(s)$ of each state (minus the number of cells to the goal), as the sketch below illustrates.
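Because every step costs $-1$, the value of a state under the shortest-path policy equals minus the number of steps to the goal. A minimal sketch on a made-up grid (not the maze from the figure), using breadth-first search from the goal:

```python
from collections import deque

# Toy grid ('#' = wall, 'G' = goal).
grid = [
    "....#",
    ".##.#",
    "....G",
]
rows, cols = len(grid), len(grid[0])
goal = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "G")

# BFS from the goal gives each cell's number of steps to the goal.
dist = {goal: 0}
queue = deque([goal])
while queue:
    r, c = queue.popleft()
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # N, S, W, E
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in dist:
            dist[(nr, nc)] = dist[(r, c)] + 1
            queue.append((nr, nc))

# With a reward of -1 per step, the value under the shortest-path policy
# is minus the distance to the goal.
value = {s: -d for s, d in dist.items()}
print(value[(0, 0)])
```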

Reinforcement learning agent classification

  • Model-based reinforcement learning
    • Policy and/or value function
    • Environment model
    • Examples: maze, Go
  • Model-free reinforcement learning (usually we do not know the environment model exactly)
    • Policy and/or value function
    • No environment model
    • Example: Atari

Atari Example
[Figure: Atari game screen]

  • Rules unknown
  • Learning from interaction (the environment is a black box)
  • Select joystick actions and observe the score and the screen pixels

Other types

  • Value-based
    • No explicit policy (implicit in the value function)
    • Value function
  • Policy-based
    • Policy
    • No value function
  • Actor-Critic
    • Policy
    • Value function


Relationship between types

The essential way of thinking in reinforcement learning:

The policy in reinforcement learning is continuously updated during training, and the corresponding data distribution (i.e., the occupancy measure) changes with it. A major difficulty of reinforcement learning is therefore that the data distribution the agent sees keeps changing as the agent learns.

Since rewards are defined on state-action pairs, the value of a policy is the expectation of the reward under its occupancy measure. Finding the optimal policy therefore corresponds to finding the optimal occupancy measure.
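One common way to make this precise (a standard identity under the discounted formulation, not spelled out in the original text) is:

$$\rho^\pi(s, a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}(S_t = s, A_t = a), \qquad J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t} r(S_t, A_t)\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim\rho^\pi}\big[r(s, a)\big],$$

so maximizing the expected discounted return is equivalent to maximizing the expected reward under the occupancy measure, up to the constant factor $\frac{1}{1-\gamma}$.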

Reinforcement learning focuses on finding a policy for the agent that produces the optimal data distribution while interacting with a dynamic environment, that is, the distribution under which the expectation of the given reward function is maximized.

