Reinforcement Learning - Initial Understanding

Foreword

The concept of reinforcement learning entered the public eye in 2017, when AlphaGo defeated Ke Jie, then ranked No. 1 in the world. Later, as reinforcement learning was applied to major games such as Honor of Kings, it became more and more familiar. The Honor of Kings AI team even published a paper on applying reinforcement learning to the game at AAAI, a top AI conference.

What is Reinforcement Learning

Reinforcement Learning (RL), also known as evaluative learning or enhanced learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem in which an agent, through interaction with its environment, learns a strategy that maximizes its return or achieves a specific goal.

Reinforcement learning: act based on feedback from the environment, and through continuous interaction and trial and error, eventually achieve a specific goal or maximize the overall benefit of the actions taken. Reinforcement learning does not require labels on the training data, but it does require feedback from the environment for each action, whether incentive or punishment. The feedback must be quantifiable, and the behavior of the training object is continually adjusted based on it.

The above are two different definitions, one from Baidu and one from a certain blog.

Take AlphaGo playing Go as an example: AlphaGo is the training object of reinforcement learning. No single move AlphaGo makes is strictly "right" or "wrong"; rather, moves are "good" or "bad". On the current board, a good placement is a good move, and a bad placement is a bad move. The basis for training is that the environment can give clear feedback on every action: is it "good" or "bad", and by how much? The degree of good and bad can be quantified. The ultimate training goal of reinforcement learning in the AlphaGo scenario is to have the stones occupy more of the board and win the final victory.

To use a rough analogy, it is a bit like training monkeys in a circus. The trainer strikes a gong and trains the monkey to stand and salute. The monkey is our training object. If the monkey completes the stand-and-salute action, it gets a food reward; if it fails or performs it wrong, there is no food reward, perhaps even the whip. After a long time, whenever the trainer strikes the gong, the monkey naturally knows to stand and salute, because that action gains the most benefit in the current environment, while any other action earns no food and may even earn the whip.

Below is an example built around the core concepts: Agent, environment, action, observation (state), and reward.

For example, when the agent observes a glass of water, the action it takes is to knock the water over, and the reward the environment then gives it is "Don't do that".

Because reinforcement learning is sequential, the agent next observes the spilled water, the action it takes is to clean it up, and the reward the environment gives it is "Thank you".

The goal of the agent is to learn from the rewards of its past actions so as to maximize the total reward it receives.
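To make this loop concrete, below is a minimal Python sketch of the interaction just described. The `Environment` and `Agent` classes, the string observations, and the numeric rewards are all hypothetical stand-ins for illustration, not any real RL library's API.

```python
# A minimal sketch of the agent-environment loop, assuming hypothetical
# Environment and Agent classes (not a real RL library API).

class Environment:
    def reset(self):
        return "a glass of water on the table"       # initial observation

    def step(self, action):
        # The environment maps an action to the next observation and a reward.
        if action == "knock over the water":
            return "water spilled on the floor", -1  # "Don't do that"
        if action == "clean it up":
            return "the floor is clean", +1          # "Thank you"
        return "nothing changed", 0

class Agent:
    def act(self, observation):
        # A trained agent would pick the action its policy values most highly;
        # here the two steps of the story are hard-coded for illustration.
        return "clean it up" if "spilled" in observation else "knock over the water"

env, agent = Environment(), Agent()
obs = env.reset()
total_reward = 0
for _ in range(2):                    # the two steps of the water example
    action = agent.act(obs)
    obs, reward = env.step(action)
    total_reward += reward            # the quantity the agent tries to maximize
print(total_reward)                   # -1 + 1 = 0
```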

Another example: AlphaGo

At the beginning, what AlphaGo observes is the board. It needs to take an action: placing a stone. The environment here is the opponent, and the stone you place affects how the opponent responds.

Next, the opponent plays a white stone.

After the machine sees this observation, it takes an action and plays another black stone.

In most cases, the reward is 0, because usually nothing decisive happens after a single move. The reward is 1 only if you win, and -1 if you lose.
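This is a sparse-reward setting: a single terminal signal must inform every move that led to it. A standard way of spreading that signal backward is the discounted return; here is a small sketch, in which the discount factor `gamma = 0.9` and the 5-move game are made-up illustration values.

```python
# Sketch: spreading a sparse terminal reward back over the moves of a game.
# Every move pays 0 except the last (+1 win, -1 loss), as described above.
# gamma is the standard discount factor, assumed to be 0.9 here.

gamma = 0.9
rewards = [0, 0, 0, 0, 1]          # a 5-move game that ends in a win

# Compute the discounted return G_t = r_t + gamma * G_{t+1}, working backward.
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()
print(returns)   # [0.6561, 0.729, 0.81, 0.9, 1.0] -- earlier moves get credit too
```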

Difference Between Supervised Learning and Reinforcement Learning

In supervised learning, every training sample carries a correct label, and samples are independent of one another; the model learns to reproduce the labels. In reinforcement learning there are no labels, only quantifiable environmental feedback (reward or punishment) that may arrive with a delay, and the data the agent sees depends on the actions it takes.

Inspiration for Reinforcement Learning

Reinforcement learning is inspired by behaviorist theory in psychology:

  • All learning is the process of establishing a direct link between stimulus and response through conditioning.

  • Reinforcement plays an important role in establishing the stimulus-response link. Within that link, what the individual learns is a habit, and the habit is the result of repeated practice and reinforcement.

  • Once a habit is formed, the acquired habitual response appears automatically whenever the original or a similar stimulus situation occurs.

Based on the above theory, reinforcement learning trains the object so that, under the stimulus of rewards or punishments given by the environment, it gradually forms an expectation of the stimulus and produces the habitual behavior that yields the greatest benefit.

Key Features of Reinforcement Learning

  • Trial-and-error learning: reinforcement learning requires the training object to interact with the environment continuously and, through trial and error, work out the best behavioral decision at each step. There is no guidance in the whole process, only cold feedback; all learning is based on environmental feedback, with the training object adjusting its behavioral decisions accordingly (a minimal sketch of this trial-and-error process appears after this list).

  • Delayed feedback: during reinforcement learning training, the "trial and error" behavior of the training object obtains feedback from the environment, but sometimes that feedback only arrives once the whole episode is over, such as Game Over or Win. In this case we usually decompose the feedback during training and try to attribute part of it to each individual step.

  • Time is an important factor in reinforcement learning: the sequence of environment-state changes and environmental feedback is strongly tied to time. The entire training process unfolds over time, with state and feedback constantly changing, so time is an essential dimension of reinforcement learning.

  • Current behavior affects the data received later: this feature is called out separately to distinguish reinforcement learning from supervised and semi-supervised learning, where each piece of training data is independent of the others. In reinforcement learning this is not the case: the current state and the action taken affect the state observed at the next step, so successive data points are correlated with one another.
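As promised in the first bullet above, here is a minimal sketch of trial-and-error learning: an epsilon-greedy agent on a 3-armed bandit. The payout probabilities are invented and hidden from the agent, which must estimate them purely from the cold scalar feedback each pull returns.

```python
import random

# Trial-and-error sketch: epsilon-greedy on a made-up 3-armed bandit.
random.seed(0)
true_payout = [0.2, 0.5, 0.8]        # hidden from the agent
value_estimate = [0.0, 0.0, 0.0]     # the agent's running estimate per action
counts = [0, 0, 0]
epsilon = 0.1                        # fraction of the time we explore at random

for step in range(5000):
    if random.random() < epsilon:
        a = random.randrange(3)                            # explore: trial and error
    else:
        a = value_estimate.index(max(value_estimate))      # exploit what we know
    reward = 1 if random.random() < true_payout[a] else 0  # cold scalar feedback
    counts[a] += 1
    # Incremental mean: nudge the estimate for the chosen action toward the reward.
    value_estimate[a] += (reward - value_estimate[a]) / counts[a]

print([round(v, 2) for v in value_estimate])  # estimates approach [0.2, 0.5, 0.8]
```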

Difficulties in Practical Development of Reinforcement Learning

When we actually apply reinforcement learning in training, we often encounter various problems. Although reinforcement learning is very powerful, many problems are hard to solve in practice.

Reward setting: how to design the reward function, and how to quantify the environment's feedback, is a very difficult problem. For example, in AlphaGo, measuring the "goodness" or "badness" of each move and then quantifying it is extremely hard, and in some scenarios a reward function is difficult to define at all.
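To see why this is hard, compare two toy reward functions for a Go-like game below. Everything here is an invented illustration (in particular, the 0.01 territory weight is arbitrary); it does not reproduce how AlphaGo's reward is actually defined.

```python
# Sketch: two candidate reward functions for a Go-like game (illustrative only).

def sparse_reward(game_over, won):
    # Only the final outcome is scored; every intermediate move gets 0.
    if not game_over:
        return 0
    return 1 if won else -1

def shaped_reward(game_over, won, territory_gained):
    # Try to quantify "good" and "bad" moves by crediting territory each move.
    # The 0.01 weight is an arbitrary assumption; tuning it is exactly the hard part.
    r = 0.01 * territory_gained
    if game_over:
        r += 1 if won else -1
    return r

print(sparse_reward(False, False))       # 0: mid-game moves get no signal
print(shaped_reward(False, False, 3))    # 0.03: some signal, but is the weight right?
```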

Sampling and training take too long, making industrial application difficult: reinforcement learning needs to explore as many actions in as many states as possible, and then learn from them. In some scenarios this is an enormous number, and the computing cost of training is very high. In many cases another algorithm can achieve the same effect with far less training time and computing cost. The ceiling of reinforcement learning is very high, but if training falls short, its floor is often very low.

Easy to fall into a local optimum: in some scenarios, the action taken by the agent may be the current local optimum rather than the global optimum. Screenshots often circulate online of the Honor of Kings AI farming minions when pushing the tower or the crystal is obviously the most reasonable move, because the AI has adopted a locally optimal behavior. No matter how reasonable the reward function is, the agent may still fall into a local optimum.

Other

What are model-based algorithms, model-free algorithms, and online learning?

Model-based algorithm: it is like standing at a God's-eye view with everything laid open. Without ever trying a step myself, I can work out which way to go purely by deduction.

Model-free algorithm: there is no God's-eye view. I am inside the maze, and I can only find out what the maze is like by trying things myself.

Online learning: I keep learning while inside the maze, taking one step at a time.
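The maze intuition maps directly onto tabular Q-learning, a classic model-free, online algorithm: the agent never sees the transition rules and updates its estimates one step at a time. Below is a small runnable sketch on a 1-D corridor; the corridor environment and all hyperparameters are invented for illustration.

```python
import random

# Model-free, online sketch: tabular Q-learning on a 1-D corridor.
# States 0..4; reaching state 4 pays +1. The agent never sees step()'s
# rules (no model); it learns purely by stepping and updating online.

random.seed(0)
N, GOAL = 5, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}   # action: step left/right

def step(s, a):
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.choice((-1, +1)) if random.random() < epsilon \
            else max((-1, +1), key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # Online update: learn one step at a time, from this step's feedback only.
        best_next = 0.0 if done else max(Q[(s2, -1)], Q[(s2, +1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print(round(Q[(0, +1)], 2))   # roughly gamma**3 = 0.73: "go right" is learned
```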

Origin blog.csdn.net/keeppractice/article/details/130791920