Reinforcement Learning: Getting Started, Chapter 1 Reading Notes


Introduction

  Infants learn from interaction with their environment from the very beginning. Learning from interaction is the foundational idea underlying nearly all theories of learning and intelligence. In this book, the authors take the perspective of artificial intelligence researchers and engineers and use computational methods to study the effectiveness of different learning approaches. Compared with other kinds of machine learning, reinforcement learning is more focused on goal-directed learning.

1.1 Reinforcement Learning

  Reinforcement learning is about learning how to map states to actions so as to maximize a numerical reward signal. Its two most distinctive features are trial-and-error search and delayed reward.

  It is important to be clear about what the term covers. "Reinforcement learning" refers simultaneously to three things: a class of problems, the class of solution methods that work well on those problems, and the field that studies these problems and their solution methods.

  Markov decision processes: the basic idea is to capture the most important aspects of the real problem facing a learning agent that interacts with its environment over time to achieve a goal, and to formalize the reinforcement learning problem, using ideas from dynamical systems theory, as the optimal control of incompletely known Markov decision processes.

  Difference from supervised learning: supervised learning, the kind studied in most current machine learning research, trains on a dataset of labeled examples and then generalizes its responses to new situations. This is an important kind of learning, but by itself it is not adequate for learning from interaction: in interactive problems it is often impractical to obtain correct examples for every situation the agent will encounter, so the agent must be able to learn from its own experience.

  Difference from unsupervised learning: unsupervised learning is about finding structure hidden in collections of unlabeled data. Reinforcement learning differs because its goal is to maximize a reward signal rather than to uncover hidden structure. Reinforcement learning can therefore be considered a third machine learning paradigm, alongside supervised and unsupervised learning.

  One of the challenges unique to reinforcement learning is the trade-off between exploration and exploitation: the agent must balance exploring new actions to gather more information against exploiting what it already knows in order to obtain reward. (This dilemma does not arise in supervised or unsupervised learning.)

  Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. Unlike approaches that study isolated subproblems, reinforcement learning starts with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environment, and can take actions that influence the environment, and it is usually assumed from the start that the agent must act despite significant uncertainty about the environment it faces. Supervised learning may still be used within a reinforcement learning system, but for specific reasons, for example to evaluate or process feedback signals. Isolated subproblems can also be studied, but they should be subproblems that play clear roles in the complete, interactive, goal-seeking agent.

  The agent in a reinforcement learning problem need not be a complete organism or robot; it can also be a component of a larger system, in which case it interacts directly with the rest of that system and indirectly with the larger system's environment. Keeping this in mind helps in grasping what reinforcement learning is really about.

  One of the most exciting aspects of modern reinforcement learning is its substantive and fruitful interaction with other engineering and scientific disciplines. Of all the forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and other animals do.

  Finally, reinforcement learning is part of a larger trend in artificial intelligence back toward simple general principles. In the late 1960s, many AI researchers assumed that there were no general principles to be discovered and that intelligence instead comes from possessing a vast amount of special-purpose knowledge. This view is still common today, but it no longer dominates. In the authors' view it was simply premature: too little effort had gone into the search for general principles to conclude that there were none. Modern AI includes a great deal of research aimed at general principles of learning, search, and decision making, alongside work that incorporates large amounts of domain knowledge. Time will tell how far the move back toward general principles will go.

1.2 Examples

  All of these examples involve an agent interacting with its environment, making decisions and seeking to achieve a goal despite uncertainty about the environment.

  In these cases, the decisions made now affect, to a greater or lesser extent, the situations and choices available later, so making good decisions requires planning and foresight.

  At the same time, the effects of actions cannot be fully predicted in advance, so the agent must monitor its environment frequently and react appropriately. Over time and with experience, the agent can improve its performance.

1.3 Components of Reinforcement Learning

  Beyond the agent and the environment, a reinforcement learning system has four main components: a policy, a reward signal, a value function, and, optionally, a model of the environment.

  A policy defines the agent's way of behaving at a given time. In plain terms, a policy is a mapping from perceived states of the environment to the actions to be taken in those states.

  The reward signal defines the goal of the reinforcement learning problem. The reward is the primary basis for changing the policy.

  A value function specifies what is good in the long run. In plain terms, the value of a state is the total amount of reward the agent can expect to accumulate over the future, starting from that state.

  Rewards are the basis of values. In fact, the most important component of almost all reinforcement learning algorithms is a method for efficiently estimating values. The central role of value estimation is arguably the most important lesson from roughly sixty years of reinforcement learning research.
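  Written out explicitly (the notation below anticipates what the book introduces in Chapter 3, so it is a preview rather than something defined in this chapter), the value of a state s under a policy π is the expected sum of the (possibly discounted) future rewards:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right],
\qquad 0 \le \gamma \le 1 .
```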

  A model of the environment mimics the behavior of the environment: given a state and an action, the model can predict the next state (and, typically, the next reward).
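  To make the four components concrete, here is a minimal sketch in Python. The two-state problem, the `tiny_model` transition table, and the reward values are invented here purely for illustration (they are not from the book); the sketch only shows how a policy, a reward signal, a value table, and an environment model fit together.

```python
# A made-up two-state problem, used only to illustrate the four components.
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]

# Policy: a mapping from states to actions (here a simple lookup table).
policy = {"A": "move", "B": "stay"}

# Value function: the agent's current estimate of long-run reward for each state.
values = {s: 0.0 for s in STATES}

# Model of the environment: predicts (next_state, reward) for a (state, action) pair.
tiny_model = {
    ("A", "stay"): ("A", 0.0),
    ("A", "move"): ("B", 1.0),   # reward signal: +1 for reaching state B
    ("B", "stay"): ("B", 0.0),
    ("B", "move"): ("A", 0.0),
}

def step(state, action):
    """Use the model to simulate one step of interaction with the environment."""
    return tiny_model[(state, action)]

# One short simulated episode: follow the policy and accumulate reward.
state, total_reward = "A", 0.0
for _ in range(3):
    action = policy[state]
    state, reward = step(state, action)
    total_reward += reward
print(total_reward)   # 1.0 for this policy and model
```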

1.4 Limitations and Scope

  Reinforcement learning relies heavily on the concept of state, but the concern of this book is not with designing the state signal; it is with deciding what action to take as a function of whatever state signal is available. Most of the methods in this book are structured around estimating value functions, but estimating value functions is not strictly necessary for solving reinforcement learning problems. For example, methods such as genetic algorithms, genetic programming, and simulated annealing never estimate value functions. These evolutionary methods apply multiple static policies, each interacting with a separate instance of the environment, and carry the policies that obtain the most reward (and random variations of them) over to the next generation. Evolutionary methods can be effective when the policy space is sufficiently small, or can be structured so that good policies are common or easy to find, or when plenty of time is available for the search.

  But the heart of reinforcement learning is learning while interacting with the environment, and evolutionary methods ignore much of this structure: they do not exploit the fact that the policy being searched for is a function from states to actions, nor do they notice which states an individual passes through or which actions it selects. Although evolution and learning share many features, evolutionary methods by themselves are not considered especially well suited to reinforcement learning problems, so this book does not cover them.

 

1.5 An Extended Example: Tic-Tac-Toe

  Rules: two players take turns marking a three-by-three board, and the first to get three in a row wins. Assume that draws and losses are equally bad for us, and that we are playing against an imperfect opponent whose play is sometimes wrong. (Against a perfect player the best we could achieve is a draw.)

  Classical optimal methods cannot be applied directly here because of uncertainty about the opponent: they require a complete specification of how the opponent plays, which we do not have in advance.

  Evolutionary approach: search the space of possible policies directly, estimate each candidate policy's probability of winning by playing many games against the opponent, and use those estimates to decide which policies to try next, eventually keeping the best one found.

  Value-function approach: assign to each possible state of the game a number estimating the probability of winning from that state. The table of all these numbers is the learned value function. Initialize the table so that states in which we already have three in a row have value 1, states in which the opponent has three in a row (or the board is full) have value 0, and all other states have value 0.5. Then play many games against the opponent: most of the time select moves greedily, choosing the move that leads to the state with the highest current value, and occasionally select a random exploratory move instead. After each greedy move, correct the value of the earlier state by moving it a fraction of the way toward the value of the later state, a simple temporal-difference update.
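  The Python sketch below shows the core of this value-table method. The board encoding, the helper functions, and the purely random opponent are simplifications invented here for illustration; what follows the chapter's description is the initialization of the table (1, 0, or 0.5), the mostly-greedy move selection with occasional exploratory moves, and the update that moves an earlier state's value a fraction of the way toward a later state's value.

```python
import random

ALPHA = 0.1      # step-size for the value updates
EPSILON = 0.1    # fraction of exploratory (random) moves

values = {}      # board state (string of 'X', 'O', '.') -> estimated win probability

def winner(state):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if state[a] != '.' and state[a] == state[b] == state[c]:
            return state[a]
    return None

def value(state):
    """Look up (or lazily initialize) the value of a state, from X's point of view."""
    if state not in values:
        w = winner(state)
        if w == 'X':
            values[state] = 1.0          # we already have three in a row
        elif w == 'O' or '.' not in state:
            values[state] = 0.0          # a loss or a draw: equally bad for us
        else:
            values[state] = 0.5          # every other state starts at 0.5
    return values[state]

def moves(state):
    return [i for i, c in enumerate(state) if c == '.']

def place(state, i, mark):
    return state[:i] + mark + state[i + 1:]

def play_one_game():
    state = '.' * 9
    prev = None                          # state reached after our previous move
    while True:
        if random.random() < EPSILON:    # exploratory move: chosen at random, no update
            after = place(state, random.choice(moves(state)), 'X')
        else:                            # greedy move: go to the highest-valued state
            after = max((place(state, m, 'X') for m in moves(state)), key=value)
            if prev is not None:
                # Temporal-difference update: move the earlier state's value
                # a fraction of the way toward the later state's value.
                values[prev] = value(prev) + ALPHA * (value(after) - value(prev))
        prev, state = after, after
        if winner(state) or '.' not in state:
            return
        # The imperfect opponent simply plays uniformly at random.
        state = place(state, random.choice(moves(state)), 'O')
        if winner(state) == 'O' or '.' not in state:
            # The game ended on the opponent's move: back up the terminal value too.
            values[prev] = value(prev) + ALPHA * (value(state) - value(prev))
            return

for _ in range(20000):
    play_one_game()

# Estimated probability of winning after the best learned opening move.
best_opening = max((place('.' * 9, m, 'X') for m in range(9)), key=value)
print(best_opening, round(value(best_opening), 3))
```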

  This example illustrates the difference between evolutionary methods and methods that learn value functions. An evolutionary method uses only the final outcome of each game, ignoring what happens during play; a value-function method, in contrast, evaluates individual states and so can make use of the information available during the game.

  This example illustrates key features of reinforcement learning: first, learning takes place while interacting with an environment (here, the opponent); second, there is a clear goal, and correct behavior requires planning and foresight that take into account the delayed effects of one's choices. A striking feature of this solution is that it achieves the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions.

  Reinforcement learning applies well beyond this example. It applies when there is no opponent, or when the "opponent" is nature, and it applies to continuing problems, although these cases are more complicated. It also applies to problems with very large state sets; how well a reinforcement learning system performs on such large-scale problems is closely tied to how effectively it can generalize from past experience.

  Reinforcement learning does not require a model of the environment. When constructing a sufficiently accurate environment model is the bottleneck, model-free methods have a clear advantage.

1.6 Summary

  What most distinguishes reinforcement learning from other computational approaches is its emphasis on goal-directed learning from direct interaction with the environment, without requiring exemplary supervision or a complete model of the environment. This makes it well suited to problems in which an agent must interact with its environment in order to achieve long-term goals.

  Reinforcement learning uses the formal framework of Markov decision processes to define the interaction between an agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing the essential features of the artificial intelligence problem.
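  As a preview of that framework (the notation is the standard form the book develops in Chapter 3, so it is an anticipation rather than part of this chapter), the agent and environment interact in a sequence of states, actions, and rewards, and the environment's dynamics are summarized by a single transition probability function:

```latex
S_0, A_0, R_1,\ S_1, A_1, R_2,\ S_2, A_2, R_3,\ \dots
\qquad
p(s', r \mid s, a) \doteq \Pr\{ S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a \}
```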

1.7 Early History of Reinforcement Learning

  Early reinforcement learning had two main threads that were pursued independently before intertwining in modern reinforcement learning. The first is learning by trial and error, which originated in the psychology of animal learning and runs through some of the earliest work in artificial intelligence. The second is the problem of optimal control and its solution using value functions and dynamic programming; for the most part this thread did not involve learning. A third, less distinct thread concerns temporal-difference methods. All three came together in the 1980s to produce the modern field of reinforcement learning.

  Second thread: in the mid-1950s, Richard Bellman and others developed an approach to the optimal control problem that uses the concepts of a dynamical system's state and of a value function, or "optimal return function." The class of methods for solving optimal control problems in this way came to be known as dynamic programming. Bellman also introduced the discrete stochastic version of the optimal control problem known as Markov decision processes (MDPs), and Ronald Howard devised the policy iteration method for MDPs. These ideas are essential elements of modern reinforcement learning.

  Dynamic programming is widely regarded as the only feasible way of solving general stochastic optimal control problems, but it suffers from the curse of dimensionality: its computational requirements grow exponentially with the number of state variables. It has been extensively developed since the 1950s, including extensions to partially observable MDPs (Lovejoy, 1991), many applications, approximation methods, and asynchronous methods.

  The connection between dynamic programming, optimal control, and reinforcement learning: dynamic programming provides solution methods for optimal control problems, and optimal control problems can, to a large extent, be regarded as reinforcement learning problems.

  First thread: trial-and-error learning. It has its roots in the psychology of animal learning and was taken up early in artificial intelligence, including by Turing, among others.

  Trial-and-error learning has had a long-lasting influence. Its essence is learning from evaluative feedback that does not rely on knowledge of what the correct behavior should be.

  Third thread: temporal-difference learning. Temporal-difference methods are distinctive in being driven by the difference between temporally successive estimates of the same quantity, and they play a particularly important role in reinforcement learning. Sutton developed Klopf's ideas further, especially their links to theories of animal learning, describing learning rules driven by changes in temporally successive predictions.

  The temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins's development of Q-learning, work that extended and integrated all three threads of reinforcement learning research. By that time there had already been substantial growth in reinforcement learning research, primarily within machine learning but also in neural networks and artificial intelligence more broadly. In 1992, the remarkable success of Gerry Tesauro's backgammon program, TD-Gammon, brought additional attention to the field.

  Since the first edition of the book was published, a flourishing subfield of neuroscience has developed that focuses on the relationship between reinforcement learning algorithms and reinforcement learning in the nervous system. Its growth was driven largely by researchers pointing out an uncanny similarity between the behavior of temporal-difference algorithms and the activity of dopamine-producing neurons in the brain.

 
