Reinforcement Learning: An Introduction (translation), Section 1.7

1.7 Early History of Reinforcement Learning

The early history of reinforcement learning has two main threads, both long and rich, that were pursued independently before becoming intertwined in modern reinforcement learning. One thread concerns learning by trial and error, which originated in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The second thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. The two threads were largely independent, but became interrelated to some extent around a third, less distinct thread concerning temporal-difference methods, such as the one used in the tic-tac-toe example in this chapter.

All three threads came together in the late 1980s to produce the modern field of reinforcement learning as presented in this book. In this brief history, the thread focusing on trial-and-error learning is the one with which we are most familiar and about which we have the most to say. Before that, however, let us briefly discuss the optimal control thread.

The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize or maximize a measure of a dynamical system's behavior over time. In the mid-1950s, Richard Bellman and others developed an approach to this problem by extending a nineteenth-century theory of Hamilton and Jacobi. This approach uses the concepts of a dynamical system's state and of a value function, or "optimal return function," to define a functional equation, now often called the Bellman equation. The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a). Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markov decision processes (MDPs). Ronald Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.
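For reference, a common modern statement of the functional equation described above, given here only as a preview in the notation developed later in this book (Chapter 3) for a finite MDP with discount factor γ, is the Bellman optimality equation:

```latex
% Bellman optimality equation for the optimal state-value function of a finite MDP
% (notation as developed in Chapter 3 of this book)
v_*(s) \;=\; \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
```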

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. It suffers from what Bellman called "the curse of dimensionality," meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other general method. Dynamic programming has been extensively developed since the late 1950s, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983). Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 2005, 2012; Puterman, 1994; Ross, 1983; Whittle, 1982, 1983). Bryson (1996) provides an authoritative history of optimal control.

The connections between optimal control and dynamic programming, on the one hand, and learning, on the other, were slow to be recognized. We cannot be sure what accounted for this separation, but its main cause was probably the separation between the disciplines involved and their different goals. Also contributing may have been the prevalent view of dynamic programming as an off-line computation depending essentially on accurate system models and analytic solutions to the Bellman equation. Further, the simplest form of dynamic programming is a computation that proceeds backwards in time, making it difficult to see how it could be involved in a learning process that must proceed in a forward direction. Some of the earliest work in dynamic programming, such as that by Bellman and Dreyfus (1959), might now be classified as following a learning approach. Witten's (1977) work (discussed below) certainly qualifies as a combination of learning and dynamic-programming ideas. Werbos (1987) argued explicitly for a greater interrelation of dynamic programming and learning methods and for its relevance to understanding neural and cognitive mechanisms. For us, the full integration of dynamic programming methods with on-line learning did not occur until the work of Chris Watkins in 1989, whose treatment of reinforcement learning using the MDP formalism has been widely adopted. Since then these relationships have been extensively developed by many researchers, most particularly by Dimitri Bertsekas and John Tsitsiklis (1996), who coined the term "neurodynamic programming" to refer to the combination of dynamic programming and artificial neural networks. Another term currently in use is "approximate dynamic programming." These various approaches emphasize different aspects of the subject, but they all share with reinforcement learning an interest in circumventing the classical shortcomings of dynamic programming.

We consider all of the work in optimal control also to be, in a sense, work in reinforcement learning. We define a reinforcement learning method as any effective way of solving reinforcement learning problems, and it is now clear that these problems are closely related to optimal control problems, particularly stochastic optimal control problems such as those formulated as MDPs. Accordingly, we must consider the solution methods of optimal control, such as dynamic programming, also to be reinforcement learning methods. Because almost all of the conventional methods require complete knowledge of the system to be controlled, it feels a little unnatural to say that they are part of reinforcement learning. On the other hand, many dynamic programming algorithms are incremental and iterative. Like learning methods, they gradually reach the correct answer through successive approximations. As we show in the rest of this book, these similarities are far more than superficial. The theories and solution methods for the cases of complete and incomplete knowledge are so closely related that we feel they must be considered together as parts of the same subject matter.

Let us now return to the other major thread leading to the modern field of reinforcement learning, that centered on the idea of trial-and-error learning. Here we touch only on the major points of contact; Section 14.3 discusses this topic in more detail. According to the American psychologist R. S. Woodworth (1938), the idea of trial-and-error learning goes back as far as Alexander Bain's discussion in the 1850s of learning by "groping and experiment," and more explicitly to the British ethologist and psychologist Conway Lloyd Morgan, who in 1894 used the term to describe his observations of animal behavior. Edward Thorndike was perhaps the first to express the essence of trial-and-error learning succinctly as a principle of learning:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)

Thorndike called this the "Law of Effect" because it describes the effect of reinforcing events on the tendency to select actions. Thorndike later modified the law to better account for subsequent data on animal learning (such as differences between the effects of rewards and punishments), and the law in its various forms has generated considerable controversy among learning theorists (e.g., see Gallistel, 2005; Herrnstein, 1970; Kimble, 1961, 1967; Mazur, 1994). Despite this, the Law of Effect, in one form or another, is widely regarded as a basic principle underlying much behavior (e.g., Hilgard and Bower, 1975; Dennett, 1978; Campbell, 1960; Cziko, 1995). It is the basis of Clark Hull's (1943, 1952) influential learning theory and of Skinner's (1938) influential experimental methods.

The term "reinforcement" in the context of animal learning came into use well after Thorndike's statement of the Law of Effect; to the best of our knowledge, it first appeared in this context in the 1927 English translation of Pavlov's monograph on conditioned reflexes. Pavlov described reinforcement as the strengthening of a pattern of behavior due to an animal receiving a stimulus, a reinforcer, in an appropriate temporal relationship with another stimulus or with a response. Some psychologists extended the idea of reinforcement to include the weakening as well as the strengthening of behavior, and extended the idea of a reinforcer to possibly include the omission or termination of a stimulus. To be considered a reinforcer, the strengthening or weakening must persist after the reinforcer is withdrawn; a stimulus that merely attracts an animal's attention or that energizes its behavior without producing lasting changes would not be considered a reinforcer.

The idea of implementing trial-and-error learning in a computer appeared among the earliest thoughts about the possibility of artificial intelligence. In a 1948 report, Alan Turing described the design of a "pleasure-pain system" that worked along the lines of the Law of Effect:

When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description, tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent. (Turing, 1948)

Many ingenious electro-mechanical machines were built that demonstrated trial-and-error learning. The earliest may have been a machine built by Thomas Ross (1933) that was able to find its way through a simple maze and remember the path through the settings of switches. In 1951, W. Grey Walter built a version of his "mechanical tortoise" (Walter, 1950) capable of a simple form of learning. In 1952, Claude Shannon demonstrated a maze-running mouse named Theseus that used trial and error to find its way through a maze, with the maze itself remembering the successful directions via magnets and relays under its floor (see also Shannon, 1951). J. A. Deutsch (1954) described a maze-solving machine based on his behavior theory (Deutsch, 1953) that has some properties in common with model-based reinforcement learning (Chapter 8). In his Ph.D. dissertation, Marvin Minsky (1954) discussed computational models of reinforcement learning and described his construction of an analog machine composed of components he called SNARCs (Stochastic Neural-Analog Reinforcement Calculators), intended to resemble modifiable synaptic connections in the brain (Chapter 15). The web site cyberneticzoo.com contains a wealth of information on these and many other electro-mechanical learning machines.

Building electro-mechanical learning machines gave way to programming digital computers to perform various types of learning, some of which implemented trial-and-error learning. Farley and Clark (1954) described a digital simulation of a neural-network learning machine that learned by trial and error. But their interests soon shifted from trial-and-error learning to generalization and pattern recognition, that is, from reinforcement learning to supervised learning (Clark and Farley, 1955). This began a pattern of confusion about the relationship between these types of learning. Many researchers seemed to believe they were studying reinforcement learning when they were actually studying supervised learning. For example, artificial neural network pioneers such as Rosenblatt (1962) and Widrow and Hoff (1960) were clearly motivated by reinforcement learning; they used the language of rewards and punishments, but the systems they studied were supervised learning systems suitable for pattern recognition and perceptual learning. Even today, some researchers and textbooks minimize or blur the distinction between these types of learning. For example, some textbooks have used the term "trial-and-error" to describe artificial neural networks that learn from training examples. This is an understandable confusion, because these networks use error information to update their connection weights, but it misses the essential character of trial-and-error learning: selecting actions on the basis of evaluative feedback rather than on knowledge of what the correct action should be.

Partly as a result of these confusions, research on genuine trial-and-error learning became rare in the 1960s and 1970s, although there were notable exceptions. In the 1960s, the terms "reinforcement" and "reinforcement learning" were used in the engineering literature for the first time to describe engineering uses of trial-and-error learning (e.g., Waltz and Fu, 1965; Mendel, 1966; Fu, 1970; Mendel and McClaren, 1970). Particularly influential was Minsky's paper "Steps Toward Artificial Intelligence" (Minsky, 1961), which discussed several issues relevant to trial-and-error learning, including prediction, expectation, and what he called the basic credit-assignment problem for complex reinforcement learning systems: how do you distribute credit for success among the many decisions that may have been involved in producing it? In a sense, all of the methods we discuss in this book are directed toward solving this problem. Minsky's paper is well worth reading today.

In the next few paragraphs we discuss some of the other exceptions and partial exceptions to the relative neglect of computational and theoretical study of genuine trial-and-error learning in the 1960s and 1970s.

One exception was the work of the New Zealand researcher John Andreae, who developed a system called STeLLA that learned by trial and error in interaction with its environment. This system included an internal model of the world and, later, an "internal monologue" to deal with problems of hidden state (Andreae, 1963, 1969; Andreae and Cashin, 1969). Andreae's later work (1977) placed more emphasis on learning from a teacher, but still included trial-and-error learning, with the generation of novel events being one of the system's goals. A feature of this work was a "leakback process," elaborated more fully in Andreae (1998), which implemented a credit-assignment mechanism similar to the backing-up update operations that we describe. Unfortunately, his pioneering research was not well known and did not greatly impact subsequent reinforcement learning research. Recent summaries are available (Andreae, 2017a, b).

More influential was the work of Donald Michie. In 1961 and 1963 he described a simple trial-and-error learning system for learning how to play tic-tac-toe (or noughts and crosses) called MENACE (Matchbox Educable Noughts and Crosses Engine). It consisted of a matchbox for each possible game position, each matchbox containing a number of colored beads, with a different color for each possible move from that position. By drawing a bead at random from the matchbox corresponding to the current game position, one could determine MENACE's move. When a game was over, beads were added to or removed from the boxes used during play to reward or punish MENACE's decisions. Michie and Chambers (1968) described another tic-tac-toe reinforcement learner called GLEE (Game Learning Expectimaxing Engine) and a reinforcement learning controller called BOXES. They applied BOXES to the task of learning to balance a pole hinged to a movable cart on the basis of a failure signal occurring only when the pole fell or the cart reached the end of the track. This task was adapted from the earlier work of Widrow and Smith (1964), who used supervised learning methods, assuming instruction from a teacher already able to balance the pole. Michie and Chambers's version of pole-balancing is one of the best early examples of a reinforcement learning task under conditions of incomplete knowledge. It influenced much later work in reinforcement learning, beginning with some of our own studies (Barto, Sutton, and Anderson, 1983; Sutton, 1984). Michie (1974) consistently emphasized the role of trial and error and learning as essential aspects of artificial intelligence.
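The bead-and-matchbox mechanics described above can be sketched in a few lines of code. The following is a minimal, hypothetical rendering, not Michie's original specification; the starting bead counts, the reinforcement amounts, and all names are illustrative assumptions:

```python
import random
from collections import defaultdict

# One "matchbox" per game position: a mapping from legal moves to bead counts.
# Starting with 3 beads per move is an illustrative assumption.
boxes = defaultdict(lambda: defaultdict(lambda: 3))

def choose_move(position, legal_moves):
    """Draw a bead at random: moves backed by more beads are proportionally more likely."""
    box = boxes[position]
    moves = list(legal_moves)
    weights = [box[m] for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]

def reinforce(history, won):
    """After a game, add a bead for every (position, move) played if the learner won,
    or remove one (keeping at least one) if it lost."""
    for position, move in history:
        if won:
            boxes[position][move] += 1
        else:
            boxes[position][move] = max(1, boxes[position][move] - 1)
```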

Widrow, Gupta, and Maitra (1973) modified the Least-Mean-Square (LMS) algorithm of Widrow and Hoff (1960) to produce a reinforcement learning rule that could learn from success and failure signals instead of from training examples. They called this form of learning "selective bootstrap adaptation" and described it as "learning with a critic" instead of "learning with a teacher." They analyzed this rule and showed how it could learn to play blackjack. This was an isolated foray into reinforcement learning by Widrow, whose contributions to supervised learning were much more influential. Our use of the term "critic" is derived from Widrow, Gupta, and Maitra's paper. Buchanan, Mitchell, Smith, and Johnson (1978) independently used the term "critic" in the context of machine learning (see also Dietterich and Buchanan, 1984), but for them a critic is an expert system able to do more than evaluate performance.

Research on learning automata had a more direct influence on the trial-and-error thread leading to modern reinforcement learning. These are methods for solving a nonassociative, purely selectional learning problem known as the k-armed bandit, by analogy to a slot machine, or "one-armed bandit," except with k levers (see Chapter 2). Learning automata are simple, low-memory machines for improving the probability of reward in these problems. Learning automata originated with the work in the 1960s of the Russian mathematician and physicist M. L. Tsetlin and his colleagues (published posthumously in Tsetlin, 1973) and have been extensively developed since then within engineering (see Narendra and Thathachar, 1974, 1989). These developments included the study of stochastic learning automata, which are methods for updating action probabilities on the basis of reward signals. Although not developed in the tradition of stochastic learning automata, Harth and Tzanakou's (1974) Alopex algorithm (for ALgorithm Of Pattern EXtraction) is a stochastic method for detecting correlations between actions and reinforcement that influenced some of our early research (Barto, Sutton, and Brouwer, 1981). Stochastic learning automata were foreshadowed by earlier work in psychology, beginning with William Estes' (1950) effort toward a statistical theory of learning, which was further developed by others (e.g., Bush and Mosteller, 1955; Sternberg, 1963).
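As one concrete illustration of how such an automaton can update action probabilities from a reward signal, here is the classical linear reward-inaction scheme, stated in generic notation rather than that of any particular paper cited above. If action a_i is selected at step t and is rewarded, the probabilities are updated as follows; if it is not rewarded, they are left unchanged:

```latex
% Linear reward-inaction scheme (generic notation), with step size 0 < \alpha < 1
p_i(t+1) = p_i(t) + \alpha \bigl( 1 - p_i(t) \bigr), \qquad
p_j(t+1) = (1 - \alpha)\, p_j(t) \quad \text{for } j \neq i.
```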

The statistical learning theories developed in psychology were adopted by researchers in economics, leading to a thread of research in that field devoted to reinforcement learning. This work began in 1973 with the application of Bush and Mosteller's learning theory to a collection of classical economic models (Cross, 1973). One goal of this research was to study artificial agents that act more like real people than do traditional idealized economic agents (Arthur, 1991). This approach expanded to the study of reinforcement learning in the context of game theory. Reinforcement learning in economics developed largely independently of the early work on reinforcement learning in artificial intelligence, though game theory remains a topic of interest in both fields (a topic beyond the scope of this book). Camerer (2011) discusses the reinforcement learning tradition in economics, and Nowé, Vrancx, and De Hauwere (2012) provide an overview from the point of view of multi-agent extensions of the approach introduced in this book. Reinforcement learning in the context of game theory is a much different subject from reinforcement learning used in programs that play tic-tac-toe, checkers, and other recreational games. See Szita (2012) for an overview of this aspect of reinforcement learning and games.

John Holland (1975) outlined a general theory of adaptive systems based on selectional principles. His early work concerned trial and error primarily in its nonassociative form, as in evolutionary methods and the k-armed bandit. In 1976, and more fully in 1986, he introduced classifier systems, true reinforcement learning systems that include association and value functions. A key component of Holland's classifier systems was the "bucket-brigade algorithm" for credit assignment, which is closely related to the temporal-difference algorithm used in our tic-tac-toe example and discussed in Chapter 6. Another key component was a genetic algorithm, an evolutionary method whose role was to evolve useful representations. Classifier systems have been extensively developed by many researchers to form a major branch of reinforcement learning research (reviewed by Urbanowicz and Moore, 2009), but genetic algorithms, which by themselves we do not consider to be reinforcement learning systems, have received much more attention, as have other evolutionary methods (e.g., Fogel, Owens, and Walsh, 1966; Koza, 1992).

The individual most responsible for reviving the trial-and-error thread to reinforcement learning within artificial intelligence was Harry Klopf (1972, 1975, 1982). Klopf recognized that essential aspects of adaptive behavior were being lost as learning researchers came to focus almost exclusively on supervised learning. What was missing, according to Klopf, were the hedonic aspects of behavior: the drive to achieve some result from the environment, to control the environment toward desired ends and away from undesired ends (see Section 15.9). This is the essential idea of trial-and-error learning. Klopf's ideas were especially influential on the authors because our assessment of them (Barto and Sutton, 1981a) led to our appreciation of the distinction between supervised learning and reinforcement learning, and to our eventual focus on reinforcement learning. Much of the early work that we and our colleagues did was directed toward showing that reinforcement learning and supervised learning were indeed different (Barto, Sutton, and Brouwer, 1981; Barto and Sutton, 1981b; Barto and Anandan, 1985). Other studies showed how reinforcement learning could address important problems in artificial neural network learning, in particular how it could produce learning algorithms for multilayer networks (Barto, Anderson, and Sutton, 1982; Barto and Anderson, 1985; Barto, 1985, 1986; Barto and Jordan, 1987; see Section 15.10).

We turn now to the third thread in the history of reinforcement learning, that concerning temporal-difference learning. Temporal-difference learning methods are distinctive in being driven by the difference between temporally successive estimates of the same quantity, for example, the probability of winning in the tic-tac-toe example. This thread is smaller and less distinct than the other two, but it has played a particularly important role in the field, in part because temporal-difference methods seem to be new and unique to reinforcement learning.

The origins of temporal-difference learning lie partly in animal learning psychology, in particular in the notion of secondary reinforcers. A secondary reinforcer is a stimulus that has been paired with a primary reinforcer, such as food or pain, and as a result has come to take on similar reinforcing properties. Minsky (1954) may have been the first to realize that this psychological principle could be important for artificial learning systems. Arthur Samuel (1959) was the first to propose and implement a learning method that included temporal-difference ideas, as part of his celebrated checkers-playing program (Section 16.2).

Samuel made no reference to Minsky's work or to possible connections to animal learning. His inspiration apparently came from Claude Shannon's (1950) suggestion that a computer could be programmed to use an evaluation function to play chess and might be able to improve its play by modifying this function on-line. (It is possible that these ideas of Shannon's also influenced Bellman, but we know of no evidence for this.) Minsky (1961) extensively discussed Samuel's work in his "Steps" paper, suggesting the connection to secondary reinforcement theories, both natural and artificial.

As we have discussed, in the decade following the work of Minsky and Samuel, little computational work was done on trial-and-error learning, and apparently no computational work at all was done on temporal-difference learning. In 1972, Klopf brought trial-and-error learning together with an important component of temporal-difference learning. Klopf was interested in principles that would scale to learning in large systems, and was therefore intrigued by notions of local reinforcement, whereby subcomponents of an overall learning system could reinforce one another. He developed the idea of "generalized reinforcement," whereby every component (nominally, every neuron) views all of its inputs in reinforcement terms: excitatory inputs as rewards and inhibitory inputs as punishments. This is not the same idea as what we now know as temporal-difference learning, and in retrospect it is farther from it than was Samuel's work. On the other hand, Klopf linked the idea with trial-and-error learning and related it to the massive empirical database of animal learning psychology.

Sutton (1978a, b, c) developed Klopf's ideas further, particularly the links to animal learning theories, describing learning rules driven by changes in temporally successive predictions. He and Barto refined these ideas and developed a psychological model of classical conditioning based on temporal-difference learning (Sutton and Barto, 1981a; Barto and Sutton, 1982). There followed several other influential psychological models of classical conditioning based on temporal-difference learning (e.g., Klopf, 1988; Moore et al., 1986; Sutton and Barto, 1987, 1990). Some neuroscience models developed at this time are well interpreted in terms of temporal-difference learning (Hawkins and Kandel, 1984; Byrne, Gingrich, and Baxter, 1990; Gelperin, Hopfield, and Tank, 1985; Tesauro, 1986; Friston et al., 1994), although in most cases there was no historical connection.

Our early work on temporal-difference learning was strongly influenced by animal learning theories and by Klopf's work. Relationships to Minsky's "Steps" paper and to Samuel's checkers players were recognized only afterward. By 1981, however, we were fully aware of all of the prior work described above as part of the temporal-difference and trial-and-error threads. At this time we developed a method combining temporal-difference learning with trial-and-error learning, known as the actor-critic architecture, and applied it to Michie and Chambers's pole-balancing problem (Barto, Sutton, and Anderson, 1983). This method was extensively studied in Sutton's (1984) Ph.D. dissertation and extended to use backpropagation neural networks in Anderson's (1986) Ph.D. dissertation. Around this time, Holland (1986) incorporated temporal-difference ideas explicitly into his classifier systems in the form of his bucket-brigade algorithm. A key step was taken by Sutton (1988) in separating temporal-difference learning from control, treating it as a general prediction method. That paper also introduced the TD(λ) algorithm and proved some of its convergence properties.

As we were finalizing our work on the actor-critic architecture in 1981, we discovered a paper by Ian Witten (1977, 1976a) that appears to be the earliest publication of a temporal-difference learning rule. He proposed the method that we now call tabular TD(0) for use as part of an adaptive controller for solving MDPs. This work was first submitted for journal publication in 1974 and also appeared in Witten's 1976 Ph.D. dissertation. Witten's work was a descendant of Andreae's early experiments with STeLLA and other trial-and-error learning systems. Thus, Witten's 1977 paper spanned both major threads of reinforcement learning research, trial-and-error learning and optimal control, while making a distinct early contribution to temporal-difference learning.
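For the reader who wants to see what such a rule looks like, the tabular TD(0) update, in the notation developed in Chapter 6 of this book, adjusts the value estimate of the state just visited toward the reward received plus the discounted estimate for the next state:

```latex
% Tabular TD(0) update, with step size \alpha and discount factor \gamma (notation of Chapter 6)
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
```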

The temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins's development of Q-learning. This work extended and integrated prior work in all three threads of reinforcement learning research. Paul Werbos (1987) had contributed to this integration by arguing for the convergence of trial-and-error learning and dynamic programming since 1977. By the time of Watkins's work there had been tremendous growth in reinforcement learning research, primarily in the machine learning subfield of artificial intelligence, but also in artificial neural networks and artificial intelligence more broadly. In 1992, the remarkable success of Gerry Tesauro's backgammon playing program, TD-Gammon, brought additional attention to the field.
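For completeness, the one-step Q-learning update that embodies this integration, again stated in the notation developed later in this book (Chapter 6), is:

```latex
% One-step Q-learning update (notation of Chapter 6)
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \Bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr]
```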

In the time since publication of the first edition of this book, a flourishing subfield of neuroscience has developed that focuses on the relationship between reinforcement learning algorithms and reinforcement learning in the nervous system. Most responsible for this is an uncanny similarity between the behavior of temporal-difference algorithms and the activity of dopamine-producing neurons in the brain, as pointed out by a number of researchers (Friston et al., 1994; Barto, 1995a; Houk, Adams, and Barto, 1995; Montague, Dayan, and Sejnowski, 1996; Schultz, Dayan, and Montague, 1997). Chapter 15 provides an introduction to this exciting aspect of reinforcement learning. The other important contributions made in the recent history of reinforcement learning are too numerous to mention in this brief account; we cite many of them at the end of the chapters in which they arise.

Bibliographical Remarks

For additional general coverage of reinforcement learning, we refer the reader to the books by Szepesvári (2010), Bertsekas and Tsitsiklis (1996), Kaelbling (1993a), and Sugiyama, Hachiya, and Morimura (2013). Books that take a control or operations research perspective include those of Si, Barto, Powell, and Wunsch (2004), Powell (2011), Lewis and Liu (2012), and Bertsekas (2012). Cao's (2009) review places reinforcement learning in the context of other approaches to learning and optimization of stochastic dynamic systems. Three special issues of the journal Machine Learning focus on reinforcement learning: Sutton (1992a), Kaelbling (1996), and Singh (2002). Useful surveys are provided by Barto (1995b); Kaelbling, Littman, and Moore (1996); and Keerthi and Ravindran (1997). The volume edited by Wiering and van Otterlo (2012) provides an excellent overview of recent developments.

1.2 The example of Phil's breakfast in this chapter was inspired by Agre (1988).

1.5 The temporal-difference method used in the tic-tac-toe example is developed in Chapter 6.

Part I: Tabular Solution Methods

In this part of the book we describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: those in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case the methods can often find exact solutions, that is, they can often find exactly the optimal value function and the optimal policy. This contrasts with the approximate methods described in the next part of the book, which find only approximate solutions but which in return can be applied effectively to much larger problems.
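As a minimal illustration of what "tabular" means here (a sketch with hypothetical state names, not an algorithm from this book), a tabular value function is simply one stored number per state, updated in place:

```python
# A tabular state-value function: one stored estimate per state.
# The state names and the step size are hypothetical placeholders.
values = {"state_A": 0.0, "state_B": 0.0, "state_C": 0.0}

def update(state, target, step_size=0.1):
    """Move the stored estimate for `state` a fraction of the way toward `target`."""
    values[state] += step_size * (target - values[state])

update("state_A", target=1.0)   # e.g., after observing a win from state_A
print(values["state_A"])        # 0.1
```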

The first chapter of this part of the book describes solution methods for the special case of the reinforcement learning problem in which there is only a single state, called bandit problems. The second chapter describes the general problem formulation that we treat throughout the rest of the book, finite Markov decision processes, and its main ideas, including Bellman equations and value functions.

The next three chapters describe three fundamental classes of methods for solving finite Markov decision problems: dynamic programming, Monte Carlo methods, and temporal-difference learning. Each class of methods has its strengths and weaknesses. Dynamic programming methods are well developed mathematically, but require a complete and accurate model of the environment. Monte Carlo methods do not require a model and are conceptually simple, but are not well suited for step-by-step incremental computation. Finally, temporal-difference methods require no model and are fully incremental, but are more complex to analyze. The methods also differ in their efficiency and speed of convergence.

The remaining two chapters describe how these three classes of methods can be combined to obtain the best features of each. In one chapter we describe how the strengths of Monte Carlo methods can be combined with the strengths of temporal-difference methods via multi-step bootstrapping methods. In the final chapter of this part of the book we show how temporal-difference learning methods can be combined with model learning and planning methods (such as dynamic programming) for a complete and unified solution to the tabular reinforcement learning problem.

Origin blog.csdn.net/wangyifan123456zz/article/details/107381096