Chapter 2 Reinforcement Learning and Deep Reinforcement Learning

Preface

This chapter provides a concise explanation of the basic terms and concepts of reinforcement learning. It will give you a good understanding of the basic reinforcement learning framework for developing artificial intelligence agents, introduce deep reinforcement learning, and show you the kinds of advanced problems these algorithms can help you solve. You will find mathematical expressions and equations used in several places. Although there is enough theory behind reinforcement learning and deep reinforcement learning to fill an entire book, the key concepts discussed in this chapter are the ones that are actually implemented, so that when we implement the algorithms in Python to train our agents, you can clearly understand the logic behind them. If you cannot absorb it all on the first read, that is fine; come back to this chapter and refresh your understanding whenever you need to.
In this chapter we will discuss the following topics:

  • What is reinforcement learning
  • Markov decision process
  • Reinforcement learning framework
  • What is deep reinforcement learning
  • How deep reinforcement learning agents work in practice

What is reinforcement learning

If you are new to artificial intelligence (AI) or machine learning, you might be wondering what reinforcement learning is all about. Simply put, it is learning through reinforcement. Reinforcement, as you know from everyday English or from psychology, means strengthening the tendency to choose a particular action in response to something, because taking that action leads to a higher reward. We humans are good at learning through reinforcement from a very young age. Those who have children probably make use of this fact to teach them good habits, and the rest of us can still relate, because we all went through that stage of life not so long ago! For example, if a child completes their homework on time after school every day, the parents might reward the child with chocolate. The child learns that completing homework every day earns chocolate (the reward), which reinforces their determination to finish their homework every day. This process of strengthening the choice of a particular action, motivated by the reward obtained for taking it, is called reinforcement learning.

You might think, "OK, this kind of human psychology is familiar to me, but what does it have to do with machine learning or artificial intelligence?" Good question. The concept of reinforcement learning is in fact inspired by behavioral psychology. It sits at the intersection of several research fields, most notably computer science, mathematics, neuroscience, and psychology, as shown in the figure below:
[Figure: reinforcement learning at the intersection of computer science, mathematics, neuroscience, and psychology]
We will soon see that reinforcement learning is one of the most promising approaches in machine learning, and one that will lead the development of artificial intelligence. If these terms are new to you, don't worry! Starting from the next paragraph, we will review them so that you understand how they relate to one another and feel comfortable with them. If you already know these terms, reading about them from a different perspective will serve as a refresher.

Understanding the meaning and scope of AI in an intuitive way

The intelligence displayed by humans and animals is called natural intelligence, and the intelligence displayed by machines is, for obvious reasons, called artificial intelligence. We humans develop the algorithms and techniques that provide machines with intelligence. Some of the biggest advances in this area have come from machine learning, artificial neural networks, and deep learning, fields that have jointly driven the development of artificial intelligence. So far, three main machine learning paradigms have developed to a reasonable degree of maturity. They are:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

The chart below gives an intuitive picture of the AI field. As you can see, these learning paradigms are subsets of the field of machine learning, and machine learning itself is a subset/branch of AI:
[Figure: the three learning paradigms as subsets of machine learning, which is itself a branch of AI]

Supervised learning

Supervised learning is similar to how we teach children to identify someone or something by name. We provide an input and a name/class label (label for short) associated with that input, and expect the machine to learn the mapping from inputs to labels. This sounds simple if we only want the machine to learn such mappings for a handful of objects (as in object-recognition tasks) or people (as in face/voice/person-recognition tasks), but what if we want a machine to learn thousands of categories, where each category may come with several different variations of the input? For example, consider the task of recognizing a person's face among 1,000 other input images. Even for an adult, this task can be complicated, because the input images of the same person's face may vary considerably: the person may wear glasses in one image, a hat in another, or show a completely different facial expression. For a machine, looking at an input image and recognizing the face in it is a much harder task. With recent advances in deep learning, however, supervised classification tasks like this are no longer difficult for machines. Machines can now recognize faces, and many other things, with unprecedented accuracy. For example, the DeepFace system (https://research.fb.com/wp-content/uploads/2016/11/deepface-closing-the-gap-to-human-level-performance-in-face-verification.pdf), developed by Facebook's artificial intelligence research laboratory, achieves 97.45% accuracy on the Labeled Faces in the Wild (LFW) dataset.

Unsupervised learning

Unlike the supervised learning paradigm, unsupervised learning provides the learning algorithm with inputs but no labels. These algorithms are typically used to find patterns in the input data and to cluster similar data together. A recent development in deep learning introduced a new form of learning called Generative Adversarial Networks, which were very popular while this book was being written. If you are interested, you can learn more about Generative Adversarial Networks from this video: https://www.packtpub.com/big-data-and-business-intelligence/learning-generating-adversarial-network-video.

Reinforcement learning

Compared with supervised and unsupervised learning, reinforcement learning is a hybrid way of learning. As we saw at the beginning of this section, reinforcement learning is driven by a reward signal. In the example of the child with the homework problem, the reward signal was the chocolate from the parents. In the machine learning world, chocolate may not tempt a computer (well, we could program a computer to want chocolate, but why would we? Isn't it enough that the kids want it?!), so a scalar value (a number) will do instead! The reward signal is still specified by a human to some extent, and it indicates the desired goal of the task. For example, when using reinforcement learning to train an agent to play Atari games, the score in the game can be used as the reward signal. This makes reinforcement learning much easier (for the humans, not the machines!), because we do not need to label which buttons to press at every point in the game to teach the machine how to play. Instead, we simply let the machine learn by itself to maximize its score. We can make a machine learn on its own how to play a game, how to control a car, or how to do homework, and all we have to do is tell it how its actions are scored. Doesn't that sound interesting? That is why we study it in this chapter, and in the coming chapters you will develop some cool machines yourself.

Reinforcement learning practice

Now that you have an intuitive understanding of what artificial intelligence really means and of the various algorithms that drive its development, we will focus on the practical aspects of building a reinforcement learning machine.
The following are the core concepts you need to pay attention to when developing a reinforcement learning system:

  • Agent
  • Rewards
  • Environment
  • State
  • Value function
  • Policy

Agent

In the reinforcement learning world, the machine is run or instructed by a (software) agent. The agent is the part of the machine that possesses the intelligence and decides what to do next. You will encounter the term agent many times as we go deeper into reinforcement learning. Reinforcement learning is based on the reward hypothesis, which states that any goal can be described by the maximization of the expected cumulative reward. So what exactly is this reward? That is the question we discuss next.

Rewards

The reward (denoted $R_t$) is usually a scalar quantity that is provided to the agent as feedback to drive its learning. The agent's goal is to maximize the total reward, and this signal indicates how well the agent is performing at time step $t$. The following examples of reward signals for different tasks may help you understand this more intuitively (a short code sketch follows the list):

  • For the Atari games we discussed earlier, or for any ordinary computer game, the reward signal can be +1 for every increase in the score and -1 for every decrease in the score.
  • For stock trading, the reward signal can be +1 for every dollar gained and -1 for every dollar lost.
  • For driving a car in a simulation, the reward signal can be +1 for every mile driven and -100 for every collision.
  • Sometimes the reward signal may be sparse. For example, in chess or Go, the reward signal could be +1 if the agent wins the game and -1 if it loses. The reward is sparse because the agent receives the reward signal only after completing an entire game and does not know how good each individual move was.
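To make the idea of a scalar reward signal concrete, here is a minimal sketch of a dense reward (based on a change in game score) and a sparse reward (only given when a full game ends). The function names and signatures are purely illustrative and not taken from the book's code:

```python
# Purely illustrative sketches: a dense reward based on the change in game
# score, and a sparse reward that is only non-zero when a full game ends.

def dense_reward(previous_score, current_score):
    """+1 when the score increases, -1 when it decreases, 0 otherwise."""
    if current_score > previous_score:
        return 1.0
    if current_score < previous_score:
        return -1.0
    return 0.0

def sparse_reward(game_over, agent_won):
    """Zero until the game ends; +1 for a win, -1 for a loss (e.g., chess or Go)."""
    if not game_over:
        return 0.0
    return 1.0 if agent_won else -1.0
```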

Environment

In the first chapter, we studied the different environments provided by the OpenAI Gym toolkit. You may be wondering why they are called environments rather than problems, tasks, or something else. Now that you have read this far in the chapter, does anything come to mind?

The environment is the platform that represents the problem or task we are interested in, and it is what the agent interacts with. The following figure shows the general reinforcement learning paradigm at the highest level of abstraction:
[Figure: the agent-environment interaction loop]
At each time step, denoted $t$, the agent receives an observation $O_t$ from the environment and then performs an action $A_t$, for which it receives a scalar reward $R_{t+1}$ from the environment, together with the next observation $O_{t+1}$. This process repeats until a terminal state is reached. What exactly are observations and states? Let's look at that next.
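A minimal sketch of this loop, written with the OpenAI Gym interface from Chapter 1, looks like the following. The environment name is an arbitrary choice, and the classic Gym API (where step() returns four values) is assumed:

```python
import gym  # the OpenAI Gym toolkit introduced in Chapter 1

# Assumes the classic Gym API, where step() returns (observation, reward, done, info).
# The random action here is just a stand-in for a real agent's decision.
env = gym.make("CartPole-v0")      # arbitrary example environment
observation = env.reset()          # initial observation O_1
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                   # the agent picks an action A_t
    observation, reward, done, info = env.step(action)   # environment returns R_{t+1} and O_{t+1}
    total_reward += reward
print("Episode finished with a total reward of", total_reward)
```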

State

When the agent interacts with the environment, the process gives rise to a sequence of observations, actions, and rewards, as described earlier. At any time step $t$, what the agent knows so far is the sequence of observations, actions, and rewards it has seen up to that time step. It makes intuitive sense to call this the history:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

What happens next depends on this history. Formally, the information used to determine what happens next is called the state. Because it depends on the history up to that time step, it can be expressed as:

$$S_t = f(H_t)$$

Here, $f$ denotes some function.

Before we continue, there is a subtle but important point for you to understand. Let's take another look at the general representation of a reinforcement learning system:
[Figure: the agent-environment interaction loop, showing the environment state and the agent state]
You will notice that the two main entities in the system, the agent and the environment, each have their own representation of state. The environment state (sometimes denoted $S^e_t$) is the environment's own (private) representation, which the environment uses to pick the next observation and reward; this state is usually not visible or available to the agent. Likewise, the agent has its own internal representation of state (sometimes denoted $S^a_t$), which is the information the agent bases its actions on. Because this representation is internal to the agent, it can be any function of the history; usually it is some function of the history the agent has observed so far, that is, $S^a_t = f(H_t)$. A related notion is the Markov state, which is a representation of the state containing all the useful information from the history. By definition, a state is Markov, or Markovian, if and only if, given the present, the future is independent of the past. In other words, such a state is a sufficient statistic of the future: once the state is known, the history can be thrown away. The environment state $S^e_t$ and the history $H_t$ both satisfy the Markov property.
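Formally, the Markov property states that the next state depends only on the current state and not on the rest of the history:

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \ldots, S_t]$$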

In some cases, the environment may make its internal state directly visible to the agent. Such an environment is called a fully observable environment. In cases where the agent cannot directly observe the environment state, the agent must construct its own state representation from the observations it receives; such an environment is called a partially observable environment. For example, an agent playing poker can only observe the community cards, not the cards held by the other players, so this is a partially observable environment. Similarly, a self-driving car with a single camera does not know its absolute position in the environment, which also makes the environment only partially observable.

Model

The model is the agent's representation of the environment, similar to the mental models we have of the people and things around us. The agent uses its model of the environment to predict what will happen next. A model has two key parts:

  • State transition model/probability
  • Reward model

The state transition model is a probability distribution or a function that predicts the probability of ending up in state $s'$ at the next time step, given the state $s$ and the action $a$ taken at time step $t$. Mathematically, it can be expressed as:

$$\mathcal{P}^a_{ss'} = P\left[S_{t+1} = s' \mid S_t = s, A_t = a\right]$$

The agent uses the reward model to predict the immediate reward it will receive if it takes action $a$ in state $s$ at time step $t$. This expectation of the next reward can be expressed mathematically as:

$$\mathcal{R}^a_s = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$$
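To make these two quantities concrete, here is a minimal sketch of a tabular model for a tiny, made-up environment. The state and action names are purely illustrative and not from the book:

```python
import random

# A purely illustrative tabular model of a tiny environment.
# transition_model[(s, a)] is a distribution over next states s', matching the
# transition probability above, and reward_model[(s, a)] is the expected
# immediate reward for taking action a in state s.
transition_model = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s1", "right"): {"s2": 1.0},
}
reward_model = {
    ("s0", "right"): 0.0,
    ("s1", "right"): 1.0,   # reaching the goal state from s1 pays off
}

def sample_next_state(state, action):
    """Sample s' according to P[S_{t+1} = s' | S_t = s, A_t = a]."""
    next_states = transition_model[(state, action)]
    states, probs = zip(*next_states.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state("s0", "right"))   # usually prints "s1"
print(reward_model[("s1", "right")])      # expected immediate reward: 1.0
```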

Value function

The value function represents the agent's prediction of future rewards. There are two types of value functions: the state-value function and the action-value function.

State-value function

The state-value function, denoted $V^\pi(s)$, represents how good it is for the agent to be in state $s$ at time step $t$; it is often simply called the value function. It represents the agent's prediction of the future reward it will obtain from state $s$ onward. Mathematically, it can be expressed as:

$$V^\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$

This expression says that the value of a state under policy $\pi$ is the expected sum of discounted future rewards, where $\gamma$ is the discount factor, a real number in the range [0, 1]. In practice, the discount factor is usually set in the range [0.95, 0.99]. The other new term here, $\pi$, is the agent's policy, which we discuss shortly.
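As a concrete, purely illustrative example, the discounted sum that $V^\pi(s)$ takes the expectation of can be computed for a single sequence of future rewards as follows:

```python
# Illustrative only: the discounted return that V(s) is an expectation of,
# computed for one concrete sequence of future rewards.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * R_{t+1+k} over the remaining time steps."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 1.0, -1.0], gamma=0.95))  # approximately 1.05
```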

Action-value function

The action-value function, denoted $Q^\pi(s, a)$, represents the agent's estimate of how good it is to take action $a$ in state $s$. Its relationship to the state-value function is as follows:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\left[Q^\pi(s, a)\right]$$

Policy

The policy, denoted $\pi$, prescribes which action to take in each state. It can be seen as a function that maps states to actions. There are two main types of policies: deterministic policies and stochastic policies.

A deterministic policy specifies exactly one action for a given state:

$$\pi(s) = a$$

A stochastic policy specifies a distribution over actions for a given state; that is, there are several possible actions, each with a probability of being chosen:

$$\pi(a \mid s) = P\left[A_t = a \mid S_t = s\right]$$

Agents that follow different policies may exhibit different behaviors in the same environment.
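A minimal sketch of the two policy types in Python, using made-up state and action names, might look like this:

```python
import random

# Illustrative sketches of the two policy types (state and action names are made up).

def deterministic_policy(state):
    """pi(s) -> a: exactly one action per state."""
    mapping = {"s0": "right", "s1": "left"}
    return mapping[state]

def stochastic_policy(state):
    """pi(a|s): a probability distribution over actions; sample an action from it."""
    distribution = {"s0": {"right": 0.8, "left": 0.2},
                    "s1": {"right": 0.5, "left": 0.5}}[state]
    actions, probs = zip(*distribution.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy("s0"))   # always "right"
print(stochastic_policy("s0"))      # "right" about 80% of the time
```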

Deep reinforcement learning

Deep reinforcement learning emerged naturally as people made progress in deep learning and applied it to reinforcement learning. We have learned about the state-value function, the action-value function, and the policy. Let us briefly look at how they can be represented mathematically or implemented in computer code. The state-value function is a real-valued function that takes the current state as input and outputs a real number (such as 4.57). This number is the agent's prediction of how good it is to be in that state, and the agent keeps updating its value function based on the new experience it gains. Similarly, the action-value function is also a real-valued function, which takes an action as input in addition to the state and outputs a real number. One way to represent these functions is to use neural networks, because neural networks are universal function approximators capable of representing complex nonlinear functions. For an agent that tries to play an Atari game by just looking at the image on the screen (like we do), the state can be the pixel values of the screen image. In this case, we can use a deep neural network with convolutional layers to extract visual features from the state/image, followed by several fully connected layers that finally output $V(s)$ or $Q(s, a)$, depending on which function we want to approximate.
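As an illustration of this idea, here is a minimal sketch of such a network. It assumes PyTorch as the framework and a stack of four 84×84 grayscale frames as the state; both are our own choices for the example rather than anything prescribed by this chapter:

```python
import torch
import torch.nn as nn

# A minimal, illustrative sketch: a convolutional network that maps a stack of
# screen pixels to one action value Q(s, a) per discrete action.
class DeepQNetwork(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4 stacked 84x84 grayscale frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),   # 64 * 7 * 7 matches 84x84 inputs
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per action
        )

    def forward(self, screen_pixels):
        return self.head(self.features(screen_pixels))

if __name__ == "__main__":
    net = DeepQNetwork(num_actions=4)
    fake_screens = torch.zeros(1, 4, 84, 84)   # a batch of one stacked observation
    print(net(fake_screens).shape)             # torch.Size([1, 4])
```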

If we do that, then we are doing deep reinforcement learning! Easy enough to follow? I hope so. Let's look at some other ways in which we can use deep learning in reinforcement learning.

Recall that a policy can be deterministic, written $\pi(s)$, or stochastic, written $\pi(a \mid s)$. The actions can be discrete (such as "move left," "move right," or "move forward") or continuous values (such as an acceleration of 0.05, a steering value of 0.67, and so on), and they can be single- or multi-dimensional. A policy can therefore be a very complex function! It may have to take a multi-dimensional state (such as an image) as input and output a multi-dimensional vector of probabilities (in the case of a stochastic policy). That looks like a monstrous function, doesn't it? It is, and this is where deep neural networks come to the rescue! We can use a deep neural network to approximate an agent's policy and then learn and update that policy directly (by updating the parameters of the deep neural network). This is called policy-optimization-based deep reinforcement learning, and it has proven very effective in solving several challenging control problems, especially in robotics.
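A minimal sketch of such a policy network for discrete actions, again assuming PyTorch and an arbitrary architecture of our own choosing, could look like this:

```python
import torch
import torch.nn as nn

# A minimal, illustrative stochastic policy network: it maps a state vector
# to a probability distribution over discrete actions, i.e. pi(a|s).
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
            nn.Softmax(dim=-1),   # outputs a probability for each action
        )

    def forward(self, state):
        return self.net(state)

if __name__ == "__main__":
    policy = PolicyNetwork(state_dim=4, num_actions=2)
    state = torch.tensor([[0.1, -0.2, 0.05, 0.3]])   # a made-up 4-dimensional state
    print(policy(state))                             # e.g. tensor([[0.48, 0.52]], ...)
```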

In summary, deep reinforcement learning is the application of deep learning to reinforcement learning. So far, researchers have successfully applied deep learning to reinforcement learning in two ways: one is to use a deep neural network to approximate the value function, and the other is to use a deep neural network to represent the policy.

These ideas have been known since as early as 2005, when researchers tried to use neural networks as value function approximators. They have become popular only recently because, although neural networks and other nonlinear value function approximators can better represent the complex values of environment states and actions, they are prone to instability and often converge to suboptimal value functions. Only recently have researchers such as Volodymyr Mnih and his colleagues at DeepMind (now part of Google) discovered techniques for stabilizing learning with deep, nonlinear function approximators and for training agents whose learning converges toward the optimal value function. In later chapters of this book, we will actually reproduce some of their groundbreaking results, which surpassed human ability at playing Atari games.

Applications of reinforcement learning and deep reinforcement learning

Until recently, the practical applications of reinforcement learning and deep reinforcement learning were limited because of sample complexity and instability. However, these algorithms have proven to be very powerful at solving some very difficult practical problems. Some of them are listed here to give you an idea:

  • Learning to play video games better than humans: The news has probably reached you already. Researchers at DeepMind and elsewhere have developed a series of algorithms, starting with DeepMind's Deep Q-Network (DQN for short), which reached human-level performance at playing Atari games. We will implement this algorithm in a later chapter of this book! In essence, it is a deep variant of the Q-learning algorithm briefly introduced in this chapter, with some changes that improve the speed and stability of learning. After playing a number of games, it was able to reach human-level scores. What is even more impressive is that the same algorithm achieved this level of play without any game-specific tuning or changes.
  • Mastering Go: Go is a Chinese board game that challenged artificial intelligence for decades. It is played on a full-size 19×19 board and is more complex than chess because of the far greater number of possible board positions. Until recently, no artificial intelligence algorithm or software could reach human level at this game. AlphaGo, an artificial intelligence program from DeepMind that uses deep reinforcement learning together with Monte Carlo tree search, changed all of this and defeated the human world champions Lee Sedol (4-1) and Fan Hui (5-0). DeepMind later released more advanced versions of its agent, named AlphaGo Zero (which uses zero human knowledge and learned to play entirely on its own!) and AlphaZero (which can play Go, chess, and shogi!), all of which use deep reinforcement learning as the core algorithm.
  • Helping AI win at Jeopardy!: Watson, an artificial intelligence system developed by IBM, became famous for defeating humans on the quiz show Jeopardy!. An extension of TD learning was used to create its Daily Double wagering strategy, which helped it defeat the human champions.
  • Robot locomotion and manipulation: Reinforcement learning and deep reinforcement learning have made it possible to control complex robot motion and navigation. Several recent studies by researchers at the University of California, Berkeley have shown how deep reinforcement learning can be used to train policies that provide vision and control for robotic manipulation tasks, and to generate joint actuations that make complex bipedal robots walk and run.

Summary

In this chapter, we discussed how an agent interacts with an environment by taking actions based on the observations it receives, and how the environment responds to the agent's actions with an (optional) reward and the next observation.

Building on this concise tour of the basics of reinforcement learning, we went a step further into what deep reinforcement learning is and saw that we can use deep neural networks to represent value functions and policies. Although this chapter contains a lot of notation and definitions, I hope it lays a solid foundation for developing some cool agents in the following chapters. In the next chapter, we will consolidate what we learned in the first two chapters and use it to train an agent to solve some interesting problems.
