[Andrew Ng Machine Learning Course Notes] Week 4: Reinforcement Learning

Reinforcement Learning Definition

Reinforcement learning (RL), also known as reward learning or evaluative learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning, while interacting with its environment, a policy that maximizes cumulative reward or achieves a specific goal.

In the reinforcement learning framework, we simply provide our algorithm with a reward function that tells the learning agent when it is doing well and when it is doing poorly. The job of the learning algorithm is then to figure out how to choose actions over time that lead to a large reward.

A reinforcement learning system generally includes four elements: policy, reward, value, and environment (model).

Policy
The policy defines the behavior of the agent in a given state; in other words, it is a mapping from states to actions. Strictly speaking, the state includes both the state of the environment and the state of the agent, but here we take the agent's point of view, i.e., the state as perceived by the agent. The policy is the core of a reinforcement learning system, because the policy alone fully determines the behavior in every state. We summarize the characteristics of the policy in the following three points:

  1. The policy defines the behavior of the agent
  2. It is a mapping from states to actions
  3. The policy itself can be a deterministic mapping or a probability distribution over actions
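
To make point 3 concrete, here is a minimal Python sketch (not from the course; the states "low"/"high" and the actions "wait"/"recharge" are made up for illustration) contrasting a deterministic policy with a stochastic one:

```python
import random

# Deterministic policy: a plain mapping from state to action.
deterministic_policy = {"low": "recharge", "high": "wait"}

# Stochastic policy: for each state, a probability distribution over actions.
stochastic_policy = {
    "low":  {"recharge": 0.9, "wait": 0.1},
    "high": {"recharge": 0.2, "wait": 0.8},
}

def act(state, policy):
    """Return an action for the given state under either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):              # stochastic: sample from the distribution
        actions = list(choice)
        weights = list(choice.values())
        return random.choices(actions, weights=weights, k=1)[0]
    return choice                             # deterministic: just look it up

print(act("low", deterministic_policy))  # always "recharge"
print(act("low", stochastic_policy))     # "recharge" about 90% of the time
```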

Reward
The reward signal defines the goal of the reinforcement learning problem. At each time step, the environment sends the agent a scalar value, the reward, which measures how well the agent is doing, much as pleasure or pain does for humans. The reward signal is therefore the main factor driving changes to the policy. We summarize the characteristics of the reward in the following three points:

  1. The reward is a scalar feedback signal
  2. It indicates how well the agent performs at a given step
  3. The agent's task is to maximize the total reward accumulated over a period of time
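
As a tiny illustration of point 3, the agent cares not about any single reward but about the sum accumulated over an episode; the reward values below are made up purely for illustration:

```python
# One scalar reward per time step (illustrative values only).
rewards = [1.0, 0.0, -0.5, 2.0, 1.5]

# The agent's objective is the total reward accumulated over the episode.
total_reward = sum(rewards)
print(total_reward)  # 4.0
```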

Value
Next, let us talk about value, or the value function, which is a very important concept in reinforcement learning. Unlike the immediate nature of rewards, the value function measures long-term benefit. There is a saying, "keep your feet on the ground, but look up at the starry sky"; the value function is the "looking up at the starry sky" part, judging the benefit of the current behavior from a long-term perspective rather than only staring at the reward in front of us. Given the purpose of reinforcement learning, the importance of the value function is clear; in fact, for a long time, research in reinforcement learning has centered on estimating value. We summarize the characteristics of the value function in the following three points:

  1. The value function is a prediction of future rewards
  2. It can evaluate whether the state is good or bad
  3. The calculation of the value function requires the analysis of transitions between states
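
A minimal sketch of points 1 and 3, assuming a made-up three-state chain s0 → s1 → s2 in which only the final state gives reward: the value of each state is the discounted reward still to come, computed by working backwards through the transitions:

```python
gamma = 0.9                                   # discount factor (assumed value)
reward = {"s0": 0.0, "s1": 0.0, "s2": 1.0}    # reward received in each state
next_state = {"s0": "s1", "s1": "s2"}         # deterministic transitions; s2 is terminal

# Work backwards: V(s) = R(s) + gamma * V(next state).
V = {"s2": reward["s2"]}
for s in ["s1", "s0"]:
    V[s] = reward[s] + gamma * V[next_state[s]]

print(V)  # V(s2) = 1.0, V(s1) = 0.9, V(s0) ≈ 0.81
```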

Environment (Model)
The model, also called the environment model, is a simulation of the environment: given a state and an action, the model predicts the next state and the corresponding reward. Note, however, that not every reinforcement learning system needs a model, so there are two families of methods, model-based and model-free; a model-free method learns purely through the policy and the value function. We summarize the characteristics of the model in the following two points:

  1. The model can predict how the environment will behave next
  2. Concretely, this takes the form of predicted next states and rewards
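
A minimal sketch of what "model" means here, using made-up dynamics for a five-cell corridor: the model is just a function that, given a state and an action, predicts the next state and the reward, so a model-based method can plan by querying it without touching the real environment:

```python
def model(state, action):
    """Predict (next_state, reward) for a 1-D corridor of cells 0..4 (illustrative only)."""
    x = state
    if action == "right":
        x = min(x + 1, 4)
    elif action == "left":
        x = max(x - 1, 0)
    reward = 1.0 if x == 4 else 0.0   # reaching the right end pays off
    return x, reward

# Planning with the model: predict the outcome of an action without executing it.
print(model(3, "right"))  # (4, 1.0)
print(model(3, "left"))   # (2, 0.0)
```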

https://blog.csdn.net/weixin_45560318/article/details/112981006

Markov Decision Processes (MDP)

A Markov decision process (MDP), named after Andrey Markov, provides decision makers with a mathematical framework for modeling decision making in situations where the outcome is partly random and partly under the decision maker's control. MDPs are useful for a wide range of optimization problems solved via dynamic programming and reinforcement learning.

A Markov decision process is a tuple (S, A, {Psa}, γ, R), where:

  • S is a set of states. (For example, in autonomous helicopter flight, S might be the set of all possible positions and orientations of the helicopter.)
  • A is a set of actions. (For example, the set of all possible directions in which the helicopter control stick can be pushed.)
  • Psa are the state transition probabilities. For each state s ∈ S and action a ∈ A, Psa is a distribution over the state space. Simply put, Psa gives the distribution over which state we will transition to if we take action a in state s.
  • γ ∈ [0, 1) is called the discount factor.
  • R : S × A → ℝ is the reward function. (The reward is also sometimes written as a function of the state alone, in which case we would have R : S → ℝ.)
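
As a minimal sketch, the tuple (S, A, {Psa}, γ, R) can be written out directly as data; the two-state, two-action MDP below is made up purely for illustration:

```python
# States and actions.
S = ["s0", "s1"]
A = ["stay", "go"]

# State transition probabilities: P[s][a] is a distribution over successor states.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.5, "s1": 0.5}},
}

# Discount factor.
gamma = 0.9

# Reward written as a function of the state only, R : S -> R.
R = {"s0": 0.0, "s1": 1.0}

mdp = (S, A, P, gamma, R)
```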

The dynamics of an MDP proceed as follows: we start in some state s0 and choose some action a0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s1, drawn according to s1 ∼ Ps0a0. Then we pick another action a1. As a result of this action, the state transitions again, now to some s2 ∼ Ps1a1. We then pick a2, and so on.
We can represent this process pictorially as follows:

s0 --(a0)--> s1 --(a1)--> s2 --(a2)--> s3 --(a3)--> ...
Upon visiting the sequence of states s0, s1, ... with actions a0, a1, ..., our total payoff is given by

R(s0, a0) + γR(s1, a1) + γ²R(s2, a2) + ⋯

Or, when we write rewards as a function of the states only, this becomes

R(s0) + γR(s1) + γ²R(s2) + ⋯
In most of our development we will use the simpler state reward R(s), although generalization to state-action reward R(s, a) presents no special difficulty.
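
A minimal sketch of the total payoff with state-only rewards, for one made-up trajectory of rewards:

```python
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.5]   # R(s0), R(s1), R(s2), R(s3) along one trajectory

# Total payoff: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
total_payoff = sum(gamma ** t * r for t, r in enumerate(rewards))
print(total_payoff)  # 0.9**2 * 1.0 + 0.9**3 * 0.5 ≈ 1.17
```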

Our goal in reinforcement learning is to choose actions over time that maximize the expected value of total payoff:
E[R(s0) + γR(s1) + γ²R(s2) + ⋯]
A policy is any function π : S → A mapping from the states to the actions. We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s). We also define the value function for a policy π:

Vπ(s) = E[R(s0) + γR(s1) + γ²R(s2) + ⋯ | s0 = s, π]

That is, Vπ(s) is the expected sum of discounted rewards upon starting in state s and taking actions according to π.
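
Below is a Monte Carlo sketch of Vπ(s), again using a made-up two-state MDP: run many episodes that start in s, follow π, and average the discounted payoffs. This is just one simple way to approximate the expectation, not the method developed in the course:

```python
import random

P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.5, "s1": 0.5}},
}
R = {"s0": 0.0, "s1": 1.0}        # state-only rewards
gamma = 0.9
pi = {"s0": "go", "s1": "stay"}   # a fixed policy to evaluate

def sample_next(s, a):
    """Draw the successor state from Psa."""
    dist = P[s][a]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def estimate_value(s, episodes=5000, horizon=50):
    """Monte Carlo estimate of V_pi(s): average discounted payoff over many rollouts."""
    total = 0.0
    for _ in range(episodes):
        state, payoff = s, 0.0
        for t in range(horizon):
            payoff += gamma ** t * R[state]
            state = sample_next(state, pi[state])
        total += payoff
    return total / episodes

print(estimate_value("s0"))  # approximately 8.7 under these made-up dynamics
```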

value function

Bellman equation

value iteration

policy iteration

Origin blog.csdn.net/mossfan/article/details/125460294