Introduction to Reinforcement Learning and Markov Decision Processes

1. What is reinforcement learning

  Reinforcement learning (RL) is a concept that many people have been talking about in recent years. So what exactly is reinforcement learning?

  Reinforcement learning is a branch of machine learning, standing alongside supervised learning and unsupervised learning.

  Reference [1] gives the following definition:

Reinforcement learning is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal.

  In other words, reinforcement learning is about learning a policy that maps environment states to actions so as to maximize the cumulative reward.

  For example [2], in the game Flappy Bird we want to design a strategy that achieves a high score, but we do not know the bird's dynamics model. This is where reinforcement learning comes in: we let the agent play the game by itself; if the bird hits a pipe, it receives a negative reward, otherwise the reward is 0 (alternatively, it can receive a reward of +1 for every step it keeps flying without hitting a pipe, and nothing when it crashes). Through this continuous feedback, we end up with a bird that has superb flying skills.
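  As a rough sketch of the reward scheme just described (the function name and reward values below are invented for illustration, not taken from the game or the article):

```python
def flappy_bird_reward(hit_pipe: bool) -> float:
    """Negative reward when the bird hits a pipe, zero otherwise.
    (An alternative scheme rewards +1 for every step survived.)"""
    return -1.0 if hit_pipe else 0.0
```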

  From this example, we can see several characteristics of reinforcement learning [3]:

  1. There are no labels, only a reward signal.
  2. The reward signal is not necessarily real-time; it may well be delayed.
  3. The current behavior affects the data received subsequently.
  4. Time (sequence) is an important factor.

2. Modeling reinforcement learning

(Figure: the agent-environment interaction loop)

  The brain in the figure above represents our agent, and the Earth represents the environment we need to study, which has its own state model. The agent interacts with the environment by selecting an appropriate action \(A_t\); once the action \(A_t\) is taken, the environment's state \(S_t\) changes and becomes \(S_{t+1}\), and at the same time the agent obtains a delayed reward \(R_t\) for taking action \(A_t\). The agent then selects the next appropriate action, the environment's state changes again, and so on. This is the basic idea of reinforcement learning.
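  To make this loop concrete, here is a minimal sketch in Python. The toy `Environment` class, its states, and its reward values are all invented for illustration; they are not taken from the article or from any particular library.

```python
import random

class Environment:
    """A toy environment: states 0..4, with a single reward at the terminal state."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves one step toward the goal state 4, action 0 stays put.
        self.state = min(self.state + action, 4)
        reward = 1.0 if self.state == 4 else 0.0   # reward is delayed until the goal
        done = (self.state == 4)
        return self.state, reward, done

env = Environment()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])                 # the agent chooses an action A_t
    next_state, reward, done = env.step(action)    # the environment returns S_{t+1} and the reward
    state = next_state
```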

  From this idea of reinforcement learning, we can sort out the following elements [4]:

  (1) The environment's state \(S\): the state of the environment at time \(t\), denoted \(S_t\), is one state from its set of states;

  (2) The agent's action \(A\): the action taken by the agent at time \(t\), denoted \(A_t\), is one action from its set of actions;

  (3) The environment's reward \(R\): the reward \(R_{t+1}\) corresponding to the action \(A_t\) taken by the agent in state \(S_t\) at time \(t\) is obtained at time \(t+1\);

In addition, there are more complex model elements:

  (4) The agent's policy \(\pi\), which is the basis on which the agent chooses actions; that is, the agent selects actions according to the policy \(\pi\). The most common way to express a policy is as a conditional probability distribution \(\pi(a|s)\), i.e. the probability of taking action \(a\) when in state \(s\): \(\pi(a|s) = P(A_t = a | S_t = s)\). The larger the probability, the more likely the action is to be chosen;

  (5) The value \(v_\pi(s)\) of the agent in state \(s\) under policy \(\pi\). The value is generally defined as an expectation. Although taking an action yields a delayed reward \(R_{t+1}\), we cannot look only at this immediate delayed reward: a high reward now does not mean that the rewards at the subsequent times \(t+1, t+2, \dots\) will also be high. In chess, for example, a certain move may capture the opponent's rook, so its delayed reward is high, but we may then lose the game a few moves later. In that case the rook-capturing move has a high reward but not a high value. Therefore we should consider both the current delayed reward and the subsequent delayed rewards. \(v_\pi(s)\) is generally expressed as:
\[ v_\pi(s) = E(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s) \]
  (6) The reward discount factor \(\gamma\) in the formula above, which lies in \([0,1]\). If it is 0, the method is greedy, i.e. the value is determined only by the current delayed reward. If it is 1, all subsequent rewards are weighted the same as the current reward. Most of the time a value between 0 and 1 is chosen (see the sketch after this list);

  (7) The environment's state transition model, which can be understood as a probabilistic state machine. It can be expressed as a probability model, namely the probability of moving to the next state \(s'\) when action \(a\) is taken in state \(s\), written \(P_{ss'}^{a}\);

  (8) The exploration rate \(\epsilon\), which is mainly used in the iterative training process of reinforcement learning. We usually choose the action with the largest value in the current iteration, but this means other actions that are good but not yet well estimated would be missed. So when choosing actions during training, with some probability \(\epsilon\) we do not choose the action with the largest value in the current iteration and instead choose another action, as sketched below.
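  To make elements (6) and (8) concrete, here is a small sketch of a discounted return and of \(\epsilon\)-greedy action selection. The numbers and function names are made up for illustration.

```python
import random

def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # approximately 0.81
print(epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1))   # usually 1, occasionally another index
```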

3. Markov Decision Process (MDP)

  The environment's state transition model is a probability model \(P_{ss'}^{a}\): the probability of moving to the next state \(s'\) when action \(a\) is taken in state \(s\). In a real environment, the probability of transitioning to the next state \(s'\) depends not only on the current state \(s\) but also on the previous state, the state before that, and so on. Such a transition model would make our environment very complex, so complex that it is difficult to model.

  Therefore, in reinforcement learning we need to simplify the environment's transition model. The simplification is to assume that state transitions are Markov: the probability of transitioning to the next state \(s'\) depends only on the current state \(s\) (and the action taken), not on earlier states. In formula form:
\[ P_{ss'}^{a} = P(S_{t+1} = s' | S_t = s, A_t = a) \]
  At the same time, for the fourth element, the policy \(\pi\), we also make a Markov assumption: the probability of taking action \(a\) in state \(s\) depends only on the current state \(s\) and is independent of other factors:
\[ \pi(a|s) = P(A_t = a | S_t = s) \]
  Under the Markov assumption, the value function \(v_\pi(s)\) becomes:

\[ v_\pi(s)=E(G_t|S_t=s)=E_\pi(R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\dots|S_t=s) \]

\(G_t\) denotes the return: the discounted sum of all rewards obtained by sampling the MDP from state \(S_t\) onwards until termination.

  Deriving the recursion satisfied by the value function, it is easy to obtain the following equation:
\[ v_\pi(s) = E_\pi(R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s) \]
This formula is generally called the Bellman equation. It says that the value of a state is composed of the immediate reward and the discounted value of the subsequent state.
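  A convenient way to see this recursion numerically is to compute the returns of one sampled episode backwards, using \(G_t = R_{t+1} + \gamma G_{t+1}\). Here is a minimal sketch with made-up rewards:

```python
def returns_from_rewards(rewards, gamma=0.9):
    """Compute G_t for every step of one episode via G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(returns_from_rewards([0.0, 0.0, 1.0], gamma=0.9))  # approximately [0.81, 0.9, 1.0]
```

Averaging such sampled returns over many episodes that pass through a state \(s\) gives a Monte Carlo estimate of \(v_\pi(s)\).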

4. Action value function and the Bellman equation

  For a Markov decision process, we notice that the value function \(v_\pi(s)\) does not consider the action; it only represents the value of a state under a given policy. We now consider the effect brought by taking an action, which gives the action value function:
\[ q_\pi(s,a) = E(G_t | S_t = s, A_t = a) = E_\pi(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s, A_t = a) \]
  The Bellman equation for the action value function \(q_\pi(s,a)\) is:
\[ q_\pi(s,a) = E_\pi(R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a) \]
  From the definitions, it is easy to obtain the relationship between the action value function \(q_\pi(s,a)\) and the state value function \(v_\pi(s)\):
\[ v_\pi(s) = \sum_{a \in A} \pi(a|s) q_\pi(s,a) \]
In other words, the state value function is the expectation of the action value function over all actions, weighted by the policy \(\pi\).

  Meanwhile, using the Bellman equation, we can express the action value function \(q_\pi(s,a)\) in terms of the state value function \(v_\pi(s)\):
\[ q_\pi(s,a) = E_\pi(R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a) \]

\[ =E_\pi(R_{t+1}|S_t=s,A_t=a)+\gamma E_\pi(q_\pi(S_{t+1},A_{t+1})|S_t=s,A_t=a) \]

\[ =R_s^a+\gamma \sum_{s'}P_{ss'}^{a}\sum_{a'}\pi(a'|s')q_\pi(s',a') \]

\[ = R_s^a + \gamma \sum_{s'} P_{ss'}^{a} v_\pi(s') \]

  Combining the two relationships above, we obtain the following two formulas:
\[ v_\pi(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s'} P_{ss'}^{a} v_\pi(s') \right) \]

\[ q_\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^{a} v_\pi(s') \]
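  These two formulas translate directly into iterative policy evaluation on a small tabular MDP. The following sketch uses an invented two-state, two-action MDP; the transition tensor `P`, reward table `R`, and the uniform random policy are made-up numbers for illustration.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s']: transition probabilities P_ss'^a; R[s, a]: expected reward R_s^a (invented numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.full((n_states, n_actions), 0.5)   # uniform random policy pi(a|s)

v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * P.dot(v)      # q_pi(s,a) = R_s^a + gamma * sum_s' P_ss'^a v_pi(s')
    v = (pi * q).sum(axis=1)      # v_pi(s)   = sum_a pi(a|s) q_pi(s,a)

print(v, q)  # converge to the value functions of the random policy
```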

5. Optimal value functions

  Solving a reinforcement learning problem means finding an optimal policy, one that always obtains more return than any other policy during the interaction with the environment; we denote the optimal policy by \(\pi^*\). Once we find the optimal policy \(\pi^*\), we have solved the reinforcement learning problem. In general, it is rather difficult to find an optimal policy, but we can determine a better policy by comparing the merits of several different policies, i.e. a locally optimal solution.

  How do we compare the merits of policies? Generally by comparing the corresponding value functions. The optimal state value function is:
\[ v_{*}(s) = \max_{\pi} v_{\pi}(s) = \max_{\pi} \sum_a \pi(a|s) q_{\pi}(s,a) = \max_{a} q_{*}(s,a) \]
  Similarly, the optimal action value function is:
\[ q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a) \]

\[ = R_s^a + \gamma \sum_{s'} P_{ss'}^{a} v_{*}(s') \]

  The optimal state value function \(v_*\) describes the long-term optimal value of a state: that is, the best long-term value obtainable from this state when, among all possible subsequent actions, the optimal action is always selected.

  The optimal action value function \(q_*\) describes the best long-term value of taking a particular action in a particular state: that is, the long-term value obtained by executing that action in this state and then always choosing the optimal action in every subsequent state.

  Based on the action value function, the optimal policy can be defined as:
\[ \pi_{*}(a|s) = \left\{ \begin{array}{ll} 1 & \text{if } a = \arg\max_{a \in A} q_{*}(s,a) \\ 0 & \text{else} \end{array} \right. \]
  If we find the optimal state value function or action value function, the corresponding policy \(\pi^*\) is the solution to our reinforcement learning problem.
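  Using the optimal Bellman relations above, a minimal value-iteration sketch (reusing the same invented two-state MDP format as in the previous snippet) finds \(v_*\), \(q_*\), and the greedy optimal policy:

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# Invented MDP for illustration: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * P.dot(v)   # q_*(s,a) = R_s^a + gamma * sum_s' P_ss'^a v_*(s')
    v = q.max(axis=1)          # v_*(s)   = max_a q_*(s,a)

pi_star = q.argmax(axis=1)     # deterministic optimal policy: pi_*(s) = argmax_a q_*(s,a)
print(v, pi_star)
```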

6. Examples of reinforcement learning

  For concrete examples of reinforcement learning, see [4] and [5], which are very well written.

7. Thoughts

  In many articles, the model trained by reinforcement learning is called an "agent". Why? Because its learning is very similar to how we humans learn and think:

  Without any labeled samples, the model explores on its own, receives (delayed) feedback from the environment, reflects on that feedback, and optimizes its policy and actions, eventually learning to become a powerful agent.

  Of course, reinforcement learning also has some disadvantages [6]:

  1. Sample efficiency is low; training requires a huge number of samples, and training is sometimes very slow (far slower than humans).

  2. The reward function is hard to design. Most rewards are zero, which makes the signal too sparse.

  3. It easily falls into local optima. [6] gives an example in which, with a reward function based on running speed, a simulated horse-like agent learned to flip onto its back and "run" with all four legs in the air.

  4. Overfitting to the environment. Often a single model cannot be used across multiple environments.

  5. Instability, which is disastrous for a model. A change in a single hyperparameter can cause the model to collapse.

  Of course, we should neither affirm it blindly nor dismiss it blindly. The successful applications of reinforcement learning in AutoML and AlphaGo show that, despite the many difficulties, it is an enlightening direction worth exploring.

[1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction, 1998

[2] https://www.cnblogs.com/jinxulin/p/3511298.html

[3] https://zhuanlan.zhihu.com/p/28084904

[4] https://www.cnblogs.com/pinard/p/9385570.html

[5] https://www.cnblogs.com/pinard/p/9426283.html

[6] https://www.alexirpan.com/2018/02/14/rl-hard.html

This is an original article; for reprinting, please contact the author via private message on Zhihu: @AndyChanCD
