Reinforcement Learning: An Introduction (translation of Section 1.3)

1.3 Elements of Reinforcement Learning

In addition to the agent and the environment, one can identify four main sub-elements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, a policy may be stochastic, specifying probabilities for each action.
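
To make the idea of a policy concrete, here is a minimal Python sketch (not from the book; the state and action names are hypothetical) of a stochastic policy represented as a lookup table from states to action probabilities:

```python
import random

# Illustrative only: a stochastic policy as a lookup table mapping each
# perceived state to a probability distribution over actions.
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def select_action(state):
    """Sample an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action("s0"))  # "left" about 80% of the time
```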

A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends the reinforcement learning agent a single number called the reward. The agent's sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent. In a biological system, we might think of rewards as analogous to experiences of pleasure or pain. They are the immediate and defining features of the problem faced by the agent. The reward signal is the primary basis for altering the policy; if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
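
As a purely illustrative sketch, assuming a made-up state and action space, a reward signal can be viewed as a scalar-valued, possibly stochastic function of the state and the action taken:

```python
import random

# Illustrative only: a reward signal as a (possibly stochastic) scalar
# function of the current state and the action taken.
def reward(state, action):
    """Return a single number; the added noise makes it a stochastic function."""
    base = 1.0 if (state, action) == ("s1", "right") else 0.0
    return base + random.gauss(0.0, 0.1)  # small random perturbation

print(reward("s1", "right"))  # roughly 1.0, varying from call to call
```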

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards; the reverse could also be true. To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.
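
A tiny worked example (with hypothetical states and numbers, not taken from the book) illustrates how a state with a low immediate reward can nonetheless have a higher value than a state with a higher immediate reward:

```python
# Hypothetical two-state comparison: state A yields immediate reward 0 but
# leads to a state worth 10; state C yields immediate reward 1 but leads to
# a terminal state worth 0. With discounting, A has the higher value.
gamma = 0.9                    # discount factor (illustrative choice)

value_A = 0.0 + gamma * 10.0   # low immediate reward, high follow-on value
value_C = 1.0 + gamma * 0.0    # higher immediate reward, nothing afterwards

print(value_A, value_C)        # 9.0 1.0
```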

In a sense, rewards are primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments: we seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. Unfortunately, it is much harder to determine values than it is to determine rewards. Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all the reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing learned about reinforcement learning over the last six decades.
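
The following minimal sketch shows one common way values can be estimated and re-estimated from experience, using an incremental, temporal-difference style update; the step size, discount factor, and state names are illustrative assumptions rather than the book's specification:

```python
# Minimal sketch of estimating values from experience with an incremental,
# temporal-difference style update (step size and discount are assumptions).
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Move V[state] toward the observed reward plus the discounted
    current estimate of the next state's value."""
    target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
    return V

V = {}                                   # value estimates, revised over a lifetime
V = td_update(V, "s0", reward=1.0, next_state="s1")
print(V)                                 # {'s0': 0.1}
```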

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners, viewed as almost the opposite of planning. In Chapter 8 we explore reinforcement learning systems that simultaneously learn by trial and error, learn a model of the environment, and use the model for planning. Modern reinforcement learning spans the spectrum from low-level, trial-and-error learning to high-level, deliberative planning.
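
As a hedged illustration of a model used for planning (the transition table, states, and values are hypothetical), a model can be queried for predicted next states and rewards, and a simple one-step planner can choose actions by looking ahead with it instead of acting in the real environment:

```python
# Hypothetical learned model: (state, action) -> (predicted next state, reward).
model = {
    ("s0", "left"):  ("s1", 0.0),
    ("s0", "right"): ("s2", 1.0),
    ("s1", "left"):  ("s2", 10.0),
    ("s1", "right"): ("s0", 0.0),
}

def plan_one_step(state, V, gamma=0.9):
    """Pick the action whose model-predicted next state and reward look best,
    without actually taking the action in the environment."""
    best_action, best_value = None, float("-inf")
    for (s, a), (next_s, r) in model.items():
        if s != state:
            continue
        candidate = r + gamma * V.get(next_s, 0.0)
        if candidate > best_value:
            best_action, best_value = a, candidate
    return best_action

print(plan_one_step("s0", V={"s1": 9.0, "s2": 0.0}))  # "left": 0 + 0.9*9.0 beats 1.0
```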


Source: blog.csdn.net/wangyifan123456zz/article/details/107381015