Reinforcement learning-Basics of Reinforcement Learning

1. Basic Definitions

Comparison of RL with supervised learning and unsupervised learning:
  (1) Supervised learning learns from a labeled training set. The features of each training sample can be regarded as a description of a situation, and its label can be regarded as the correct action to take in that situation. However, supervised learning cannot handle interactive problems, because obtaining samples of the desired behavior is usually impractical in interactive settings; the agent can only learn from its own experience, and the behavior in that experience is not necessarily optimal. RL is well suited here, because it does not rely on examples of correct behavior for guidance, but instead uses the available training information to evaluate the actions that are taken.
  (2) Because RL does not rely on examples of correct actions, it does look somewhat similar to unsupervised learning, but the two are still different. The purpose of unsupervised learning is to find hidden structure in a collection of unlabeled samples, whereas the purpose of RL is to maximize a reward signal.
  (3) In general, RL differs from other machine learning approaches in the following ways: there is no supervisor, only a reward signal; feedback is delayed rather than immediate; time matters, since the data are sequential; and the agent's actions affect the subsequent data it receives.

The key elements of reinforcement learning are: environment, reward, action and state. With these elements we can build a reinforcement learning model.
The problem solved by reinforcement learning is to obtain an optimal policy for a specific task, so that the reward obtained under this policy is maximized. In this sense, a policy corresponds to a series of actions, that is, sequential data.

Reinforcement learning can be described by the figure below. From the task to be completed, we extract an environment and abstract the state, the action, and the immediate reward obtained for performing that action.
[Figure: the reinforcement learning loop of agent, environment, state, action, and reward]

  • reward
    The reward is usually denoted $R_t$, the reward value received at time step $t$. All of reinforcement learning is based on the reward hypothesis, and the reward is a scalar.
  • Action
    The action is chosen from the action space. At each step, the agent uses its current state and the reward from the previous step to decide which action to execute, with the aim of maximizing the expected cumulative reward until the algorithm finally converges; the resulting policy is the sequence of actions taken.
  • state
    refers to the current situation of the agent in the environment
  • policy

A policy defines the agent's behavior: it is a mapping from states to actions. Policies are divided into deterministic policies and stochastic policies. A deterministic policy specifies a definite action for each state, while a stochastic policy is described by probabilities, i.e., the probability of executing each action in a given state, as sketched below.
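To make the distinction concrete, here is a minimal Python sketch; the states and actions are hypothetical and chosen purely for illustration. A deterministic policy returns exactly one action per state, while a stochastic policy samples an action from a per-state distribution.

```python
import random

# Deterministic policy: each state maps to exactly one action
# (states and actions here are hypothetical, for illustration only).
deterministic_policy = {
    "low_battery": "recharge",
    "high_battery": "search",
}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.8, "wait": 0.2},
    "high_battery": {"search": 0.7, "wait": 0.3},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("high_battery"))  # always "search"
print(act_stochastic("high_battery"))     # "search" with prob 0.7, "wait" with prob 0.3
```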

  • value function

Reinforcement learning can basically be summarized as obtaining an optimal policy by maximizing reward. However, if only the instantaneous reward were maximized, the agent would simply choose the action with the largest immediate reward from the action space at every step, which is just the simplest greedy policy. What we really want is for the value from the current moment onward to be the largest, i.e., for the total reward accumulated from the current moment until the state reaches the goal to be maximized. A value function is therefore constructed to describe this quantity. Its expression is as follows:
$$v(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \mid S_t = s\right]$$
Here $\gamma$ is the discount factor, which reduces the impact of future rewards on the current action; the value function is then maximized by selecting an appropriate policy. The famous Bellman equation that follows from this definition is the source of many reinforcement learning algorithms (e.g., value iteration, policy iteration, Q-learning).
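As an illustration of how the Bellman equation underlies these algorithms, here is a minimal value-iteration sketch on a made-up two-state MDP; the states, actions, transition probabilities, and rewards are all hypothetical. Each sweep applies the Bellman optimality backup $V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V(s')]$.

```python
GAMMA = 0.9

# Hypothetical model: (state, action) -> list of (probability, next_state, reward).
P = {
    ("s0", "a"): [(1.0, "s1", 0.0)],
    ("s0", "b"): [(1.0, "s0", 1.0)],
    ("s1", "a"): [(0.5, "s1", 2.0), (0.5, "s0", 0.0)],
    ("s1", "b"): [(1.0, "s0", 0.0)],
}
states = ["s0", "s1"]
actions = ["a", "b"]

V = {s: 0.0 for s in states}
for _ in range(100):  # repeated sweeps; 100 is plenty for this toy problem
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        )
        for s in states
    }

print(V)  # approximately optimal state values for the toy model
```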

  • model

The model is used to predict what the environment will do next, that is, which state will be reached when an action is performed in the current state, and what reward will be obtained for that action. A model is therefore described by the state transition probabilities and the expected rewards, with the standard formulation:
$$\mathcal{P}_{ss'}^{a} = \Pr\left[S_{t+1} = s' \mid S_t = s, A_t = a\right], \qquad \mathcal{R}_{s}^{a} = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$$
In reinforcement learning, the agent is placed in some environment. In the example of Go, the player is the agent and the board is the environment. At any time, the environment is in some state drawn from a set of possible states; for Go, the state is the layout of the board. The agent can choose from a set of possible actions (the legal moves of the pieces). Once an action is selected and performed, the state changes accordingly. Solving the problem requires executing a whole sequence of actions, and the feedback, in the form of a reward, often arrives rarely, usually only after the complete sequence of actions has been performed.
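As a concrete sketch of what a model provides, the snippet below stores an estimated distribution over (next state, reward) pairs for each state-action pair and samples a prediction from it; all states, actions, and numbers are hypothetical.

```python
import random

# Hypothetical model: (state, action) -> list of (probability, next_state, reward),
# i.e. an estimate of p(s', r | s, a).
model = {
    ("s0", "search"): [(0.9, "s0", 1.0), (0.1, "s1", 1.0)],
    ("s1", "search"): [(0.8, "s1", 1.0), (0.2, "s0", -3.0)],
    ("s1", "recharge"): [(1.0, "s0", 0.0)],
}

def predict(state, action):
    """Sample what the model thinks the environment will do next."""
    outcomes = model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

print(predict("s0", "search"))  # e.g. ("s0", 1.0)
```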


2. Finite Markov Decision Processes

An MDP describes a cyclical process in which an agent (Agent) takes an action (Action), the state (State) changes and a reward (Reward) is obtained, and the agent continues to interact with the environment (Environment).

In an MDP, the policy depends entirely on the current state ("only the present matters"), which is a manifestation of the Markov property.

2.1. The Agent–Environment Interface

The reinforcement learning problem is a simple framework for learning from interaction to achieve a goal. The learner and decision maker is called the agent, and everything that interacts with the agent is called the environment. The interaction is continual: the agent chooses actions to perform, and the environment reacts to those actions, presenting the agent with a new situation. At the same time, the environment generates rewards, which the agent tries to maximize over time.

At each time step $t$, the agent observes the state of the environment $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states. Based on this state, the agent chooses an action $A_t \in \mathcal{A}(s)$, where $\mathcal{A}(s)$ is the set of all actions that can be performed in state $s$. At the next time step, as a consequence of its action, the agent receives a numerical reward $R_{t+1}$ and finds itself in a new state $S_{t+1}$. The figure below shows the interaction process between the agent and the environment.
[Figure: agent-environment interaction in a Markov decision process]
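This interaction can be written as a simple loop. The sketch below assumes a hypothetical `env` object with `reset()`/`step()` methods and a `policy` callable; neither is defined in the original post.

```python
def run_episode(env, policy, max_steps=1000):
    """Run one episode of agent-environment interaction and return the total reward."""
    state = env.reset()                          # observe S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # choose A_t from the current state
        state, reward, done = env.step(action)   # environment returns S_{t+1}, R_{t+1}, termination flag
        total_reward += reward
        if done:
            break
    return total_reward
```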
Under the joint action of the MDP and the agent, a sequence is generated, also known as the trajectory:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots$$
Given the state $s$ and action $a$ at time $t-1$, the probability that state $s'$ and reward $r$ occur at time $t$ is
$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
The state transition probability is
$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$
The expected reward of state–action is
$$r(s, a) \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
The expected reward of a state-action-next-state triple is
$$r(s, a, s') \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'\right] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$
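The quantities above can be computed directly from a four-argument dynamics table. The sketch below uses a small hypothetical $p(s', r \mid s, a)$ table; all numbers are made up for illustration.

```python
# p[(s, a)] is a list of (next_state, reward, probability) entries, i.e. p(s', r | s, a).
p = {
    ("s0", "a0"): [("s0", 1.0, 0.7), ("s1", 0.0, 0.3)],
    ("s1", "a0"): [("s0", 5.0, 0.4), ("s1", -1.0, 0.6)],
}

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)"""
    return sum(prob for s2, r, prob in p[(s, a)] if s2 == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)"""
    return sum(r * prob for s2, r, prob in p[(s, a)])

def expected_reward_given_next(s, a, s_next):
    """r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)"""
    return sum(r * prob for s2, r, prob in p[(s, a)] if s2 == s_next) / transition_prob(s, a, s_next)

print(transition_prob("s0", "a0", "s1"))             # 0.3
print(expected_reward("s0", "a0"))                   # 0.7
print(expected_reward_given_next("s1", "a0", "s0"))  # 5.0
```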

Example:
A recycling robot collects empty soda cans in an office. It runs on a rechargeable battery and is equipped with sensors to detect cans and a gripper arm to pick them up. The decision of how to search for cans is made by a reinforcement learning agent based on the current battery level. The agent can make the robot do the following things:

  • (1) Actively search for cans for a period of time;
  • (2) Wait for a period of time, in case someone brings a can to it;
  • (3) Go back to the charging station to recharge.

The agent therefore has three actions, and the state is determined by the battery level. When the robot picks up a can it receives a positive reward, and when the battery runs out it receives a large negative reward. Assume the dynamics are as follows: the best way to find cans is to search actively, but searching consumes power, whereas waiting in place does not. When the battery is low, performing a search may drain the battery completely, in which case the robot must shut down and wait to be rescued.

The action performed by the agent depends only on the battery level, so the state set is {high, low}, and the agent's actions are wait, search, and recharge. If the battery level is high, a search will generally not drain the battery: with probability $\alpha$ the level remains high after the search, and with probability $1-\alpha$ it drops to low. When the battery level is low, the level remains low after a search with probability $1-\beta$, and with probability $\beta$ the battery is depleted; the robot must then be rescued and recharged, so the state returns to high. Each active search yields reward $r_{\text{search}}$ and waiting yields reward $r_{\text{wait}}$ (with $r_{\text{search}} > r_{\text{wait}}$), while the robot receives a large negative reward whenever it has to be rescued. A minimal code sketch of these dynamics follows the figure below.
[Figure: transition probabilities and expected rewards for the recycling-robot MDP]
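Here is the sketch of the recycling-robot dynamics as a $p(s', r \mid s, a)$ table. The numeric values of $\alpha$, $\beta$, and the rewards are placeholders, and the rescue penalty of -3 is an assumption consistent with the "large negative reward" described above.

```python
ALPHA, BETA = 0.8, 0.6                       # placeholder probabilities
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0  # placeholder rewards (rescue penalty assumed)

# dynamics[(state, action)] -> list of (probability, next_state, reward)
dynamics = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("low", "search"):   [(1 - BETA, "low", R_SEARCH), (BETA, "high", R_RESCUE)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: probabilities for every (state, action) pair sum to 1.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```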

2.2. Goals and Rewards

In reinforcement learning, the agent's goal is expressed through the reward passed to it by the environment. At each time step the reward is a single scalar value, and the agent's goal is to maximize the total reward it receives. This means maximizing not the immediate reward, but the cumulative reward over time.

2.3. Returns and Episodes

Section 2.2 described the goal of reinforcement learning informally; here it is expressed in mathematical form. After time step $t$, the agent receives the reward sequence $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$. We typically seek to maximize the expected return $G_t$, which is defined as some function of this reward sequence. The simplest form of the return is the direct sum of the rewards:
$$G_{t} \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{T}$$
Here $T$ is the final time step. A run of the task that stops after a finite number of steps, i.e., that eventually enters a terminal state, constitutes a single episode; such tasks are called episodic tasks, playing a game for example.
In some cases, the interaction between the agent and the environment does not stop. Such tasks are called continuing tasks, for example many control tasks. In this case the return defined above may tend to infinity, for instance when the agent obtains a reward of +1 at every step.
This motivates the concept of discounting: the agent chooses a series of actions to maximize the sum of future discounted rewards, that is, to maximize the expected discounted return
$$G_{t} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$
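A short sketch of computing the discounted return from an observed finite reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for a finite sequence of rewards R_{t+1}, R_{t+2}, ..."""
    g = 0.0
    for r in reversed(rewards):  # iterate backwards so that G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]          # hypothetical R_{t+1} ... R_{t+4}
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```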

2.4. Policies and Value Functions

Almost all reinforcement learning algorithms involve computing a value function: a function of the state, or of a state-action pair, that estimates how good it is for the agent to be in a given state (or to perform a given action in a given state) in terms of expected future reward. Obviously, future rewards depend on the actions taken, so value functions are defined with respect to specific policies: for the same reinforcement learning problem, different policies have different value functions.
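A minimal sketch of estimating the state value $v_\pi(s)$ for a particular policy by averaging returns over Monte Carlo rollouts; the `env` object and its `reset(state)`/`step(action)` interface are hypothetical placeholders, not something defined in the original post.

```python
def mc_state_value(env, policy, start_state, gamma=0.9, n_episodes=1000, max_steps=200):
    """Estimate v_pi(start_state) as the average discounted return over sampled episodes."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset(start_state)
        rewards = []
        for _ in range(max_steps):
            state, reward, done = env.step(policy(state))
            rewards.append(reward)
            if done:
                break
        g = 0.0
        for r in reversed(rewards):  # G = R + gamma * G'
            g = r + gamma * g
        total += g
    return total / n_episodes
```

Running this with two different policies on the same environment would in general give different estimates, which is exactly the sense in which a value function is defined for a specific policy.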

Uses: reinforcement learning for solving dynamic programming problems, the theory of extreme learning machines (research on the selection of measurement points), and reinforcement learning as a feature selection method.

Origin blog.csdn.net/weixin_45521594/article/details/127567879