Hongyi Li: Deep Learning
1. Deep Reinforcement Learning (RL)
What is RL? (three steps)
Labeling data is challenging in some tasks; RL can be applied to tasks without labeled data.
RL involves two components: an actor and an environment. The environment produces an observation; the actor takes that observation as input and outputs an action; the environment then continuously gives rewards as feedback, judging whether the actor's actions are good or bad.
Find a policy that maximizes total reward
Endgame (Space Invaders): all aliens are killed, or your spaceship is destroyed.
The input of the neural network: the machine's observation, represented as a vector or matrix.
The output of the neural network: each action corresponds to a neuron in the output layer.
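A minimal sketch of such a network in PyTorch (the sizes are illustrative assumptions: a flattened game screen as input and three actions, e.g. "left", "right", "fire"):

import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 84 * 84, 3   # hypothetical sizes

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, N_ACTIONS),    # one output neuron per action
)

obs = torch.rand(OBS_DIM)                                  # observation as a vector
probs = torch.softmax(policy(obs), dim=-1)                 # distribution over actions
action = torch.distributions.Categorical(probs).sample()   # sample an action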
The reward depends not only on a but also on s.
The third step is optimization: find a set of parameters such that the total reward R is as large as possible.
Because the actions a are sampled, the process is stochastic: the same network does not necessarily produce the same result every time.
The environment and the reward are black boxes; their internal processes are unknown.
The environment is also random, so the usual training methods cannot be applied directly. How to optimize under these conditions is the main challenge of RL.
RL has very high variance: test results can differ greatly from run to run.
Calculate loss
Control an actor: make it take (or not take) a particular action, given a particular observation.
This is the same as training a classifier with supervised learning!
Calculate A
Generalize the binary take/don't-take setup by adding weights: each pair (s, a) gets a weight A. The difficulty is how to obtain these pairs and their weights A.
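A sketch of the resulting weighted loss, assuming a recorded batch of observations obs, the actions taken acts, and per-pair weights A (all hypothetical names): each pair contributes its cross-entropy en scaled by An, so positive An pushes the actor toward that action and negative An pushes it away.

import torch.nn.functional as F

def actor_loss(policy, obs, acts, A):
    # obs: (N, OBS_DIM), acts: (N,) action indices, A: (N,) weights
    e = F.cross_entropy(policy(obs), acts, reduction="none")  # en for each (s, a)
    return (A * e).sum()                                      # L = sum_n An * en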
version 0
Not a good version: it sets An = rn, looking only at the immediate reward and not what follows. An action affects subsequent observations, which in turn affect subsequent rewards; a1 may affect r2.
Reward delay: the actor sometimes has to sacrifice immediate reward to gain a larger long-term reward.
In Space Invaders, only "firing" yields a positive reward, so version 0 learns an actor that always fires.
version 1
How good a1 is is determined by all the rewards that follow it: the cumulative reward G1 = r1 + r2 + ... + rN, and in general At = Gt.
Question: if the game is long, is a reward received much later really caused by an action taken much earlier?
version 2
Add a discount factor γ < 1 so that the influence of later rewards decays: Gt' = rt + γrt+1 + γ²rt+2 + ...
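The discounted return obeys Gt' = rt + γGt+1', so it can be computed in one backward pass over the episode's rewards; a sketch:

def discounted_returns(rewards, gamma=0.99):
    G = 0.0
    out = []
    for r in reversed(rewards):       # compute back to front
        G = r + gamma * G             # Gt' = rt + gamma * Gt+1'
        out.append(G)
    return out[::-1]

# e.g. discounted_returns([1, 0, 2], gamma=0.9) -> [2.62, 1.8, 2.0]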
version 3
Scores for earlier actions accumulate more terms, since their sums run over a longer tail of the game.
Should G be standardized? Rewards are relative!
Good or bad rewards are "relative": if all rn ≥ 10, then rn = 10 is effectively a bad outcome. Subtracting a baseline b gives G' = G - b, which takes both positive and negative values.
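One common concrete choice for b (an assumption here; the notes leave b open) is the batch mean, often combined with dividing by the standard deviation:

import torch

def standardize(G, eps=1e-8):
    # G' = G - b with b = mean(G); dividing by std also normalizes the scale
    G = torch.as_tensor(G, dtype=torch.float32)
    return (G - G.mean()) / (G.std() + eps)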
2. Policy Gradient
Note that collecting the data {s, a} sits inside the training loop, so it must be repeated many times: every time the model parameters are updated, the entire training set has to be collected again, and each collected batch supports only a single parameter update.
In fact, the data we collect reflects only the current parameters; it is not necessarily suitable for the parameters that come after the update!
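The on-policy loop therefore looks like this sketch (collect_episode is a hypothetical helper that plays one game with the current actor and returns the recorded pairs and their weights):

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for i in range(1000):                         # hypothetical training budget
    obs, acts, A = collect_episode(policy)    # re-collect with current parameters
    loss = actor_loss(policy, obs, acts, A)   # one single update per collection
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()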
Off-policy: the actor being trained is different from the actor collecting the data; this way, we don't have to re-collect data after every update.
Gathering training data: exploration
The actor needs randomness during data collection; this is the main reason we sample actions rather than always taking the most probable one.
Amplify output entropy
Add noise to the parameters
Suppose the actor always goes "left": we would never find out what happens if it "fires".
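Two sketches of injecting that randomness, under the same hypothetical policy network as above: enlarging the output entropy with a temperature, and perturbing the parameters with Gaussian noise.

# (a) enlarge output entropy: divide the logits by a temperature > 1
probs = torch.softmax(policy(obs) / 2.0, dim=-1)
action = torch.distributions.Categorical(probs).sample()

# (b) add noise to the parameters before collecting an episode
with torch.no_grad():
    for p in policy.parameters():
        p.add_(0.01 * torch.randn_like(p))   # noise scale is an assumption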
3. Actor-Critic
Critic: given an actor, evaluates how good it is when observing s (and optionally taking action a).
Value function Vθ(s): when using actor θ, the expected discounted cumulative reward received after seeing s.
MC (Monte Carlo)
Update only after playing a complete game, using the observed cumulative reward G' as the training target for Vθ(s).
TD (Temporal Difference)
Update the parameters after every step, without waiting for the game to end, using the relation Vθ(st) = rt + γVθ(st+1).
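A sketch of one TD update for the critic (value_net is an assumed network mapping an observation to a scalar value); the bootstrapped target rt + γVθ(st+1) is detached so no gradient flows through it:

value_net = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
v_optim = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def td_update(s_t, r_t, s_next, gamma=0.99):
    target = r_t + gamma * value_net(s_next).detach()   # Vθ(st) should match rt + γVθ(st+1)
    loss = (value_net(s_t) - target).pow(2).mean()
    v_optim.zero_grad()
    loss.backward()
    v_optim.step()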
version 3.5
Version 3.5 uses the critic as the baseline: At = Gt' - Vθ(st). Vθ(st) is an expectation, averaged over the actions randomly sampled from the policy's distribution; Gt' is the return actually obtained after executing the particular action at.
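A sketch of this advantage computation, reusing the discounted_returns and value_net pieces from the earlier sketches:

def advantages(obs_seq, rewards, gamma=0.99):
    G = torch.tensor(discounted_returns(rewards, gamma))   # actually obtained Gt'
    V = value_net(obs_seq).squeeze(-1).detach()            # expected return Vθ(st)
    return G - V                                           # At = Gt' - Vθ(st)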