Reinforcement Learning

What is reinforcement learning

In supervised learning, the network's output is compared with a given standard answer, and the error is fed back to update the network parameters. In reinforcement learning, the agent interacts continuously with the environment, receives the environment's feedback, and keeps updating and optimizing its behavior. The environment does not provide a standard answer; it only gives a score for each output, so the computer must keep exploring the rules and gradually find a way to earn a high score.

MDP

The mathematical model underlying reinforcement learning is the MDP (Markov Decision Process), with three basic elements: state, action, and reward.
Through training, the machine observes the current state of the environment each time it changes, chooses a corresponding action based on that observation, the action changes the state, and the environment gives the machine a reward.

For example, in the classic game Tetris:
State: the current stack of dropped blocks
Action: how the falling block is rotated and where it is dropped
Reward: points awarded when stacking completes and clears one or more lines
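
To make the state/action/reward loop concrete, here is a minimal sketch of one round of agent-environment interaction; the env and agent objects and their method names (reset, step, choose_action, learn) are illustrative assumptions, loosely modeled on the common Gym-style interface, not part of the original text.

    # Minimal sketch of the MDP loop (env/agent are hypothetical objects).
    # Each step: observe state -> choose action -> environment returns the
    # next state and a reward -> the agent updates itself from that feedback.
    def run_episode(env, agent, max_steps=1000):
        state = env.reset()                                 # observe the initial state
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.choose_action(state)             # act on the observation
            next_state, reward, done = env.step(action)     # environment reacts
            agent.learn(state, action, reward, next_state)  # learn from the reward
            total_reward += reward
            state = next_state
            if done:                                        # e.g. the Tetris board is full
                break
        return total_reward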

Two key modules

Value function: estimates the long-term benefit of performing an action in a given state
Policy (decision): decides which action to perform based on the value function

The goal of RL is to learn a policy whose decisions lead to the best total reward in the long run.
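
As a concrete sketch of these two modules, the snippet below pairs a tabular value function (a Q-table) with a greedy policy that reads from it; the table sizes and the numpy-based encoding are assumptions made for illustration.

    import numpy as np

    # Value function: Q[state, action] estimates the long-term return of
    # taking `action` in `state` (the sizes here are arbitrary).
    n_states, n_actions = 16, 4
    Q = np.zeros((n_states, n_actions))

    def greedy_policy(state):
        # Policy: consult the value function and pick the action with the
        # highest estimated long-term value.
        return int(np.argmax(Q[state]))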

Classification

Understands the environment (model-based): builds a virtual environment from experience. It anticipates what will happen next by simulating future scenarios, chooses the best one, and then takes the next step.
Does not understand the environment (model-free): does not care about the structure of the real environment, only about the score, not the reasons behind it. It acts, waits for feedback from the real environment, and then takes the next action.
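
As a rough sketch of the difference: a model-free agent updates its value estimate directly from a real transition, while a model-based agent first asks a learned model of the environment what each action would lead to and picks the best imagined outcome. The names used here (Q, model.predict, value, candidate_actions) are hypothetical.

    # Model-free: learn directly from a real transition, with no model
    # of how the environment works (Q is a dict of dicts of values).
    def model_free_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

    # Model-based: use a learned model of the environment to imagine the
    # outcome of each candidate action, then pick the best imagined one.
    def model_based_plan(model, value, s, candidate_actions):
        def imagined_return(a):
            s_next, r = model.predict(s, a)   # simulated, not real, step
            return r + value(s_next)
        return max(candidate_actions, key=imagined_return)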

Probability-based (Policy-Based RL): uses probabilities to measure how likely each possible action is to be taken next. The action with the highest probability is not necessarily chosen; every action has some chance of being selected, just with different probabilities. For actions represented by continuous values, a probability distribution can be used to select them.
Value-based (Value-Based RL): uses scores (values) to rate the possible actions and always selects the one with the highest value, so the decision is more deterministic; however, value-based methods cannot handle actions represented by continuous values.
Actor-Critic: the actor chooses actions based on probabilities, and the critic assigns values to the actions taken, which speeds up learning compared with plain policy gradients.
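
A small sketch of the two selection rules described above; the probabilities and values are made-up numbers for illustration.

    import numpy as np

    def policy_based_select(action_probs):
        # Policy-based: sample from the probability distribution, so even
        # low-probability actions can occasionally be chosen.
        return int(np.random.choice(len(action_probs), p=action_probs))

    def value_based_select(action_values):
        # Value-based: always take the action with the highest estimated
        # value, so the decision is deterministic given the values.
        return int(np.argmax(action_values))

    probs  = np.array([0.2, 0.5, 0.3])   # three discrete actions
    values = np.array([1.2, 3.4, 0.7])
    a1 = policy_based_select(probs)      # stochastic choice
    a2 = value_based_select(values)      # always action 1 here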

Monte-Carlo update: wait until a full round of the game is completed (from start to finish), then summarize the experience from the result and update the behavior.
Temporal-Difference update: "learn as you play", updating from the effect of every single step in the game. Single-step updates are more efficient, so most current RL methods are based on them.
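
A compact sketch of the two update styles for a state-value table V; the learning rate, discount factor, and the episode/transition formats are assumptions chosen for illustration.

    # Monte-Carlo: only after the episode ends, push V(s) toward the actual
    # return G observed from that state to the end of the game.
    def mc_update(V, episode, alpha=0.1, gamma=0.9):
        # episode = [(s0, r1), (s1, r2), ...]: reward received after each state
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G            # return from this state onward
            V[state] += alpha * (G - V[state])

    # Temporal-Difference: after every single step, push V(s) toward the
    # bootstrapped target r + gamma * V(s_next).
    def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        V[s] += alpha * (r + gamma * V[s_next] - V[s])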

Online learning (On-Policy): the agent must "play and learn by itself", learning only from the experience it is generating right now.
Off-Policy: the agent can also gain experience by watching "someone else play", or "play during the day and learn at night" by storing the feedback collected during the day and replaying it from memory later for focused learning and updating.
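
The "play during the day, learn at night" idea behind off-policy learning is often implemented with an experience replay buffer; a minimal sketch, with the buffer and batch sizes chosen arbitrarily.

    import random
    from collections import deque

    # Transitions collected while playing are stored and sampled later for
    # learning, so the data need not come from the policy being improved.
    buffer = deque(maxlen=10_000)

    def store(s, a, r, s_next, done):
        buffer.append((s, a, r, s_next, done))

    def sample_batch(batch_size=32):
        # random.sample works on the deque because it supports len/indexing
        return random.sample(buffer, min(batch_size, len(buffer)))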
