Reinforcement Learning (1): Introduction - What is Reinforcement Learning?



This article introduces the basic meaning of reinforcement learning: what reinforcement learning is, its core concepts and basic framework, and the common types of problems it addresses.

What is reinforcement learning?

Reinforcement Learning (RL), also known as evaluative learning or trial-and-error learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning, through interaction with its environment, a policy that maximizes return or achieves a specific goal.

The above is Baidu Encyclopedia's description of reinforcement learning. From this one sentence we can already extract several pieces of information:

  1. Reinforcement learning is a machine learning method
  2. Reinforcement learning focuses on the interaction between the agent and the environment
  3. The goal of reinforcement learning is generally to pursue maximum return

In other words, reinforcement learning is a learning mechanism for learning how to map states to actions so as to maximize reward. The agent must continuously experiment in the environment and use the feedback (reward) the environment gives it to keep optimizing the state-to-action mapping. Repeated experimentation (trial and error) and delayed reward are therefore the two most important characteristics of reinforcement learning.
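To make this concrete, here is a minimal, hypothetical sketch of trial-and-error learning of a state-to-action mapping. The two-state world and its reward table are invented purely for illustration; this is not a specific algorithm from the literature, just the bare mechanism of nudging a mapping with reward feedback.

```python
import random

# A hypothetical two-state, two-action world; the reward table is invented for illustration.
REWARDS = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
           ("s1", "left"): 1.0, ("s1", "right"): 0.0}
ACTIONS = ["left", "right"]

# The learned state-to-action mapping is implicit in this preference table:
# in each state the agent prefers the action with the highest estimated reward.
prefs = {key: 0.0 for key in REWARDS}

for trial in range(1000):
    state = random.choice(["s0", "s1"])
    # Trial and error: occasionally act at random, otherwise follow current preferences.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: prefs[(state, a)])
    reward = REWARDS[(state, action)]
    # The environment's feedback nudges the preference for that state-action pair.
    prefs[(state, action)] += 0.1 * (reward - prefs[(state, action)])

print({k: round(v, 2) for k, v in prefs.items()})
```

After enough trials the preferences settle on "right" in s0 and "left" in s1, which is exactly a learned mapping from state to behavior driven only by reward feedback.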

Differences from other machine learning methods

The "other machine learning methods" here are mainly supervised learning and unsupervised learning, which are also what reinforcement learning is most easily confused with.

Supervised learning is the most studied and most mature method in machine learning. In a supervised training set, every sample carries a label, which ideally is the correct answer. The task of supervised learning is to have the system infer, from the labels of the training samples, a mapping that produces results as correct as possible on samples whose labels are unknown; the familiar classification and regression problems are examples. In the interaction problems of reinforcement learning there is no such universally correct "label": the agent can only learn from its own experience.

However, reinforcement learning, which has no labels, is still not the same as unsupervised learning. Unsupervised learning looks for hidden structure in unlabeled data sets; clustering is a typical example. The goal of reinforcement learning, by contrast, is to maximize reward rather than to find hidden structure in the data. Although unsupervised methods for uncovering the internal structure of data can help with reinforcement learning tasks, they do not in themselves solve the reward-maximization problem.

Therefore, reinforcement learning is a third machine learning paradigm alongside supervised learning and unsupervised learning.


Note: There is also semi-supervised learning, which lets the learner automatically exploit unlabeled samples to improve performance without relying on external interaction. It, too, is essentially different from reinforcement learning.

Reinforcement learning characteristics

Based on the previous introduction, we summarize the characteristics of reinforcement learning into the following four points:

  • There is no supervisor, only a reward signal
  • Feedback is delayed rather than immediate
  • Time matters: the data is sequential, not independent samples
  • The agent's actions affect the subsequent data it receives

Elements and structure of reinforcement learning

Four basic elements

Reinforcement learning systems generally include four elements: a policy, a reward signal, a value function, and an environment (model). We introduce each of these four elements below.

Policy

The policy defines the agent's behavior in a given state; in other words, it is a mapping from states to actions. Strictly speaking, "state" covers both the state of the environment and the state of the agent, but here we take the agent's point of view, so it means the state as perceived by the agent. The policy is the core of a reinforcement learning system, because once the policy is fixed we can determine the behavior in every state. We summarize its characteristics in the following three points, with a small code sketch after the list:

  • The policy defines the agent's behavior
  • It is a mapping from states to actions
  • The policy itself can be a deterministic mapping or a probability distribution over actions
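As a small sketch of the last point (the state and action names are invented for illustration), a policy can be represented either as a deterministic mapping from state to action or as a probability distribution over actions that is sampled at decision time:

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def act(state, policy, stochastic=False):
    """Return the action the policy chooses in the given state."""
    if not stochastic:
        return policy[state]
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("low_battery", deterministic_policy))
print(act("high_battery", stochastic_policy, stochastic=True))
```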

Reward

The reward signal defines the goal of the reinforcement learning problem. At each time step the environment sends the agent a single scalar value: the reward. It tells the agent how well it is doing, much as pleasure and pain do for humans, and it is therefore the main factor driving changes to the policy. We summarize its characteristics in the following three points, and sketch the accumulated return right after the list:

  • The reward is a scalar feedback signal
  • It indicates how well the agent is doing at a given step
  • The agent's task is to maximize the total reward accumulated over time
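The "total reward accumulated over time" is usually formalized as the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$. A minimal sketch, with an invented reward sequence, might look like this:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma per step into the future."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Invented reward sequence: nothing for three steps, then a delayed payoff of 10.
print(discounted_return([0, 0, 0, 10]))  # 10 * 0.9**3 = 7.29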

Value

Next, the value, or value function, which is a very important concept in reinforcement learning. Unlike the immediacy of the reward, the value function measures long-term benefit. As the saying goes, "keep your feet on the ground, but look up at the stars": evaluating the value function is the "looking up at the stars", judging the benefit of current behavior from a long-term perspective rather than staring only at the immediate reward. Given the purpose of reinforcement learning, the importance of the value function is clear; in fact, for a long period reinforcement learning research focused on estimating value. We summarize its characteristics in the following three points, with a small sketch after the list:

  • The value function is a prediction of future reward
  • It can be used to evaluate how good a state is
  • Computing the value function requires analyzing the transitions between states
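As the last point says, computing a value function means reasoning about transitions between states. Here is a minimal sketch of iterative policy evaluation on a made-up two-state chain; the states, transitions, rewards, and discount factor are all invented for illustration:

```python
GAMMA = 0.9

# Hypothetical MDP under a fixed policy: each state leads to one next state and reward.
# "T" is a terminal state whose value stays 0.
transitions = {"A": ("B", 0.0), "B": ("T", 1.0)}
values = {"A": 0.0, "B": 0.0, "T": 0.0}

# Iterative policy evaluation: repeatedly back up reward + discounted next-state value.
for _ in range(100):
    for state, (next_state, reward) in transitions.items():
        values[state] = reward + GAMMA * values[next_state]

print(values)  # roughly {'A': 0.9, 'B': 1.0, 'T': 0.0}
```

State A itself yields no immediate reward, yet its value is high because it leads, one step later, to the reward in B. This is exactly the "long-term benefit" the value function is meant to capture.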

Environment (model)

Finally, the model, which corresponds to the external environment: it is the agent's simulation of the environment. For example, given a state and an action, the model predicts the next state and the corresponding reward. Note, however, that not every reinforcement learning system needs a model, which gives rise to two families of methods: model-based and model-free. Model-free methods learn purely through the policy and the value function, without modeling the environment. We summarize the model's characteristics in the following two points, with a small sketch after the list:

  • The model can predict what the environment will do next
  • It does so by predicting the next state and the next reward
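In the simplest tabular case a model can be just a lookup table from (state, action) to a predicted next state and reward, which a model-based agent can query instead of acting in the real environment. A sketch with invented entries:

```python
# Hypothetical learned model: (state, action) -> (predicted next state, predicted reward).
# All entries are invented for illustration.
model = {
    ("corridor", "forward"): ("door", 0.0),
    ("corridor", "back"):    ("corridor", -1.0),
    ("door", "open"):        ("room", 10.0),
}

def predict(state, action):
    """Use the model to predict the environment's next step without acting in it."""
    return model[(state, action)]

next_state, reward = predict("corridor", "forward")
print(next_state, reward)  # door 0.0
```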

Reinforcement learning architecture

[Figure: the agent-environment interaction loop]
We can understand the overall architecture of reinforcement learning through this picture: the brain represents the agent and the earth represents the environment. Starting from its current state $S^a_t$, the agent takes an action $A_t$, which affects the environment. The environment then feeds back a reward signal $R_t$ together with an observation, which we denote $O_t$; from this the agent extracts information, enters a new state, and takes a new action, and the cycle repeats. The basic process of reinforcement learning follows this structure.
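Below is a minimal sketch of this loop in code. The toy environment, its states, and its rewards are all invented for illustration; a real project would typically use an environment library instead, whose exact API is not assumed here.

```python
import random

class ToyEnv:
    """A made-up environment: walk along a short line and reach the goal at position 3."""
    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                      # initial observation

    def step(self, action):
        move = 1 if action == "right" else -1
        self.position = max(0, self.position + move)
        done = self.position == 3
        reward = 1.0 if done else 0.0             # reward signal R_t
        return self.position, reward, done        # observation O_t, reward R_t, termination flag

env = ToyEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice(["left", "right"])     # the agent's (here random) behavior A_t
    obs, reward, done = env.step(action)          # the environment feeds back O_t and R_t
    total_reward += reward
print("episode finished, return =", total_reward)
```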

The problem of reinforcement learning

The basic problems of reinforcement learning can be classified according to two criteria.

  • Classification by policy and value function, which gives three categories:
    • Value-based: no explicit policy, only a value function
    • Policy-based: an explicit policy, but no value function
    • Actor-Critic: both a policy and a value function
  • Classification by whether the environment is modeled, which gives two categories:
    • Model-free: a policy and/or value function, but no model of the environment
    • Model-based: a policy and/or value function, plus a model of the environment

We use the following Venn diagram to clearly demonstrate these methods:
[Figure: Venn diagram of reinforcement learning method categories]

Exploration and Exploitation

Finally, the problem of exploration versus exploitation in reinforcement learning. Reinforcement learning theory is inspired by behaviorist psychology. It focuses on online learning and tries to keep a balance between exploration and exploitation: it does not require any data to be given in advance, but instead learns by receiving reward feedback from the environment for its actions and using that information to update its model parameters.

On the one hand, to gain as much knowledge of the environment as possible, we want the agent to explore; on the other hand, to obtain larger rewards, we want the agent to exploit what it already knows. We cannot have it both ways: exploration and exploitation cannot both be fully optimized at the same time. An important challenge in reinforcement learning is therefore how to trade off exploration against exploitation.
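A common minimal illustration of this trade-off is the epsilon-greedy rule on a multi-armed bandit: with probability epsilon the agent explores a random arm, otherwise it exploits the arm it currently estimates to be best. The arm payout probabilities below are invented for illustration:

```python
import random

# Hypothetical bandit: three arms with payout probabilities unknown to the agent.
TRUE_PROBS = [0.2, 0.5, 0.8]
EPSILON = 0.1

estimates = [0.0, 0.0, 0.0]   # the agent's current estimate of each arm's value
counts = [0, 0, 0]

for t in range(5000):
    if random.random() < EPSILON:
        arm = random.randrange(3)                        # explore: try a random arm
    else:
        arm = max(range(3), key=lambda a: estimates[a])  # exploit: pick the best-looking arm
    reward = 1.0 if random.random() < TRUE_PROBS[arm] else 0.0
    counts[arm] += 1
    # Incremental average keeps the estimate equal to the mean observed reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print([round(e, 2) for e in estimates])  # close to the true probabilities for well-tried arms
```

With epsilon set to 0 the agent may lock onto a mediocre arm forever; with epsilon too large it wastes pulls on arms it already knows are poor, which is the trade-off described above.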

Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It emphasizes that an agent learns through direct interaction with its environment, without requiring supervision or a complete model of the environment.

Reinforcement learning can be regarded as the first field to tackle head-on the problem of learning from interaction with an environment in order to achieve long-term goals. Among all forms of machine learning, it is the paradigm closest to how humans and other animals learn, and it is currently the approach that comes nearest to the ultimate goal of artificial intelligence.

This is the first blog post I have written; mistakes are inevitable, and I would be very grateful to any reader willing to point them out.
In later posts we will continue to share basic knowledge of reinforcement learning and other useful content.

Please indicate the source and original author if you reprint or quote the content of this article

Origin blog.csdn.net/weixin_45560318/article/details/112981006