A Preliminary Exploration of Reinforcement Learning

1. Introduction

        Life is full of choices, and every choice is a decision. Each decision leads us to the next stage of life. Looking back, we are often struck by the decisions made at certain moments: "Fortunately, I chose to go to graduate school back then, and after graduating I found a job I like; now I can live a stable life." Through such reflections we may come to understand certain truths, become wiser and more mature, and face future choices and growth with a more positive spirit.

2. Introduction to Reinforcement Learning

        Reinforcement learning is a computational approach in which a machine interacts with its environment in order to achieve a goal. One round of interaction between the machine and the environment works as follows: the machine makes an action decision based on the current state of the environment and applies this action to the environment; the environment changes accordingly and sends the corresponding reward feedback and the next state back to the machine. This interaction repeats round after round, and the machine's goal is to maximize the expected cumulative reward obtained over many rounds of interaction.

Reinforcement learning uses the term agent to refer to the decision-making machine; the specific interaction between the agent and the environment is shown in the figure.

        In each round of interaction, the agent perceives the current state of the environment, computes the action for this round, and applies it to the environment. After receiving the agent's action, the environment produces a corresponding immediate reward signal and undergoes the corresponding state transition. The agent then perceives the new state of the environment in the next round of interaction, and so on.
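To make this loop concrete, here is a minimal sketch in Python. The environment, its `reset()`/`step()` interface, and the random decision rule are all invented for illustration; they are not taken from the text above.

```python
import random

class CoinMatchEnv:
    """A toy two-state environment, invented purely for illustration."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment transitions to a new state...
        self.state = random.choice([0, 1])
        # ...and emits an immediate scalar reward for the action just taken.
        reward = 1.0 if action == self.state else 0.0
        return self.state, reward

def decide(state):
    """The agent's decision: map the perceived state to an action (here, at random)."""
    return random.choice([0, 1])

env = CoinMatchEnv()
state = env.reset()                      # the agent perceives the initial state
total_reward = 0.0
for t in range(10):                      # ten rounds of interaction
    action = decide(state)               # decide based on the perceived state
    state, reward = env.step(action)     # environment transitions and rewards
    total_reward += reward               # accumulate reward across rounds
print("cumulative reward over 10 rounds:", total_reward)
```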

The agent has three key elements: perception, decision-making, and reward.

  • Perception. The agent perceives the state of the environment to some extent, so as to know the situation it is currently in.
  • Decision-making. The process by which the agent computes, based on the current state, the action needed to achieve its goal is called decision-making.
  • Reward. Based on the state and the action taken by the agent, the environment produces a scalar signal as reward feedback, and this scalar measures the quality of the agent's action in that round. Maximizing the expected cumulative reward is the goal toward which the agent improves its policy, and it is also the key criterion for judging the quality of the agent's policy.

3. Reinforcement learning environment

        The reinforcement learning agent carries out sequential decision-making through its interaction with a dynamic environment. Saying that an environment is dynamic means that it keeps evolving as certain factors change, which is usually described in mathematics and physics as a stochastic process. For a stochastic process, the most critical elements are the state and the conditional probability distribution of state transitions.

        If an external factor, namely the agent's action, is added to the stochastic process of the environment's own evolution, then the probability distribution of the environment's state at the next moment is determined jointly by the current state and the agent's action. In the simplest mathematical form,
$$s_{t+1} \sim P(\,\cdot \mid s_t, a_t\,),$$
where $s_t$ is the current state, $a_t$ is the agent's action, and $s_{t+1}$ is the state at the next moment.

According to the formula above, the action produced by the agent's decision acts on the environment and causes a corresponding change in the environment's state, and the agent then needs to make a further decision in the new state.

        From this we can see that the environment in which the agent carries out its sequential decision-making task is a dynamic stochastic process: the distribution of its future state is determined jointly by the current state and the action decided by the agent, and each round of state transition involves two sources of randomness. One is the randomness of the agent's decision on its action; the other is the randomness of the environment's sampling of the next state given the current state and the agent's action.
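The short sketch below illustrates these two sources of randomness: a stochastic policy samples the action, and the environment samples the next state conditioned on the current state and that action. The transition table and policy are made up purely for illustration.

```python
import random

# A made-up transition model P(s' | s, a): for each (state, action),
# a list of (next_state, probability) pairs.
P = {
    ("s0", "a0"): [("s0", 0.7), ("s1", 0.3)],
    ("s0", "a1"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "a0"): [("s0", 0.5), ("s1", 0.5)],
    ("s1", "a1"): [("s0", 0.9), ("s1", 0.1)],
}

# A made-up stochastic policy pi(a | s).
pi = {
    "s0": [("a0", 0.5), ("a1", 0.5)],
    "s1": [("a0", 0.1), ("a1", 0.9)],
}

def sample(pairs):
    """Sample one item from a list of (item, probability) pairs."""
    items, probs = zip(*pairs)
    return random.choices(items, weights=probs, k=1)[0]

state = "s0"
for t in range(5):
    action = sample(pi[state])                # randomness 1: the agent's stochastic decision
    next_state = sample(P[(state, action)])   # randomness 2: the environment's transition
    print(t, state, action, "->", next_state)
    state = next_state
```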

4. Objectives of Reinforcement Learning

        Every time the agent interacts with the environment, the environment produces a corresponding reward signal, usually represented as a real-valued scalar. This reward signal is generally an immediate feedback signal that evaluates the current state or action. Accumulating the reward signals obtained in every round of the whole interaction gives the agent's overall return. From the dynamics of the environment we know that even if the environment, the agent's policy, and the agent's initial state all stay unchanged, the results of the agent's interaction with the environment can still differ, and so can the rewards obtained. Therefore, in reinforcement learning we focus on the expectation of the return and define it as the value, which is the optimization objective of the agent's learning.
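As a sketch of why the expectation matters: even with a fixed policy and a fixed initial state, different episodes yield different returns, so the value can be estimated by averaging returns over many episodes. The toy environment and policy below are invented only for illustration.

```python
import random

def run_episode(horizon=5):
    """Run one episode with a fixed random policy in a toy stochastic environment.

    Returns the episode return, i.e. the sum of the rewards of all rounds.
    """
    episode_return = 0.0
    for _ in range(horizon):
        action = random.choice([0, 1])                       # fixed stochastic policy
        state = random.choice([0, 1])                        # stochastic state transition
        episode_return += 1.0 if action == state else 0.0    # immediate reward
    return episode_return

returns = [run_episode() for _ in range(10_000)]
value_estimate = sum(returns) / len(returns)                 # value = expected return
print("single-episode returns vary:", returns[:5])
print("estimated value (expected return):", value_estimate)
```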

5. Data in Reinforcement Learning

        In reinforcement learning, data is obtained through the agent's interaction with the environment. If the agent never takes a particular decision or action, the data corresponding to that action will never be observed, so the training data of the current agent comes from the decision results of previous agents. Consequently, different agent policies lead to different distributions of data generated by interaction with the environment, as shown in the figure.

        Specifically, there is a concept in reinforcement learning concerning the data distribution called the occupancy measure. The normalized occupancy measure describes, over the course of an agent's decision-making interaction with a dynamic environment, the probability distribution of sampling a specific state-action pair.
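The text states the concept only informally, but for reference, a commonly used form of the normalized (discounted) occupancy measure of a policy $\pi$ is

$$\nu^{\pi}(s, a) = (1 - \gamma)\sum_{t=0}^{\infty} \gamma^{t}\, P\big(s_t = s,\ a_t = a \mid \pi\big),$$

that is, the normalized, discounted probability of visiting the state-action pair $(s, a)$ when the agent follows $\pi$ in the environment. The discount factor $\gamma \in (0, 1)$ is an assumption introduced here; it does not appear in the text above.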

        The occupancy measure has an important property: given two policies and the two occupancy measures generated by their interactions with a dynamic environment, the two policies are the same if and only if the two occupancy measures are the same. In other words, if an agent's policy changes, the occupancy measure generated by its interaction with the environment changes accordingly.
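A small simulation, using a made-up two-state environment, illustrates this property: two different policies are rolled out in the same environment, and their empirical state-action visit frequencies, which approximate the occupancy measures, come out different.

```python
import random
from collections import Counter

def step(state, action):
    """Toy dynamics, invented for illustration: 'move' flips the state, 'stay' keeps it."""
    return state if action == "stay" else 1 - state

def rollout(policy, episodes=2000, horizon=10):
    """Return empirical state-action visit frequencies under a policy."""
    counts = Counter()
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            action = policy(state)
            counts[(state, action)] += 1
            state = step(state, action)
    total = sum(counts.values())
    return {sa: n / total for sa, n in counts.items()}

policy_a = lambda s: random.choice(["stay", "move"])   # uniform random policy
policy_b = lambda s: "move" if s == 0 else "stay"      # deterministic policy

print("policy A occupancy (empirical):", rollout(policy_a))
print("policy B occupancy (empirical):", rollout(policy_b))
```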

        Based on this important property of the occupancy measure, we can understand the way of thinking that captures the essence of reinforcement learning:

  • The policy in reinforcement learning is continually updated during training, and the corresponding data distribution (i.e., the occupancy measure) changes accordingly.
  • Since the reward is defined on state-action pairs, the value of a policy is in fact the expectation of the reward under the corresponding occupancy measure, so finding the optimal policy corresponds to finding the optimal occupancy measure.

6. The difference between reinforcement learning and general supervised learning

        For a general supervised learning task, our goal is to find an optimal model function that minimizes a given loss function on the training dataset. Under the assumption that the training data are independent and identically distributed, this optimization objective amounts to minimizing the model's generalization error over the entire data distribution, which can be summarized by the brief formula
$$\text{optimal model} = \arg\min_{\text{model}} \ \mathbb{E}_{(x, y)\sim \mathcal{D}}\big[\,L\big(y, \text{model}(x)\big)\,\big],$$
where $\mathcal{D}$ is the data distribution over feature-label pairs $(x, y)$ and $L$ is the loss function.

In contrast, the ultimate optimization objective of a reinforcement learning task is to maximize the value of the agent's policy in its interaction with the dynamic environment, and the value of the policy can be equivalently rewritten as the expectation of the reward function under the policy's occupancy measure, namely
$$\text{optimal policy} = \arg\max_{\pi} \ \mathbb{E}_{(s, a)\sim \nu^{\pi}}\big[\,r(s, a)\,\big],$$
where $\nu^{\pi}$ is the occupancy measure of policy $\pi$ and $r$ is the reward function.

To sum up, the main differences between general supervised learning and reinforcement learning are:

  • Their optimization objectives are different. General supervised learning looks for a model that minimizes the expectation of the loss function under a given data distribution, whereas reinforcement learning looks for an agent policy that produces the optimal data distribution through interaction with the dynamic environment, i.e., one that maximizes the expectation of a given reward function under that distribution.
  • Their optimization approaches are different. Supervised learning optimizes the objective directly by optimizing the model's output on the data features; that is, it modifies the objective (through the model) while the data distribution remains unchanged. Reinforcement learning adjusts the distribution of the data from the agent-environment interaction by changing the policy and thereby optimizes the objective; that is, it modifies the data distribution while the objective (reward) function remains unchanged.
