【Learning】Deep Reinforcement Learning


1. Deep Reinforcement Learning (RL)

What is RL? (three steps)

Labeling data is difficult in some tasks, so supervised learning is hard to apply there; RL can be used when such labels are not available.
There are two components in RL: the actor and the environment. The environment gives the actor an observation as input, the actor outputs an action after receiving that observation, and the environment continuously gives rewards as feedback that judge whether the actor's actions are good or not.
Find a policy that maximizes total reward
The game ends when all aliens are killed or your spaceship is destroyed.
Input of the neural network: the machine's observation, represented as a vector or matrix.
Output of the neural network: each action corresponds to one neuron in the output layer.
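As a rough sketch of such a network (the layer sizes and the action set here are my own assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

# Minimal policy-network sketch (layer sizes and action set are assumptions):
# observation vector in, one probability per action neuron out.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)  # probability for each action

# Example: an 8-dimensional observation and 3 actions ("left", "right", "fire").
policy = PolicyNet(obs_dim=8, n_actions=3)
probs = policy(torch.randn(1, 8))
action = torch.multinomial(probs, num_samples=1)  # the action is sampled, not argmax-ed
```

Sampling the action from the output distribution, instead of always taking the argmax, is what makes the actor's behaviour stochastic.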
The reward depends not only on the action a but also on the observation s.
The third step is optimization: find a set of network parameters that makes the total reward R as large as possible.
Because the action a is sampled, the process is stochastic, so the result is not necessarily the same every time.
The environment and the reward function are black boxes; their internal processes are unknown.
The environment itself is also random, so the usual optimization methods cannot be applied directly. How to perform this optimization is the main challenge of RL.
RL involves a great deal of randomness, so results can differ widely from one run to the next.

Calculate loss

Controlling an actor means making it take (or not take) a particular action when given a particular observation.
This is the same as training a classifier with supervised learning!

Calculate A

Extend the binary classification formulation by adding weights: each collected pair {s, a} gets a weight A that says how strongly the action should be encouraged or discouraged. The difficulty is how to obtain these pairs and their A values.
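A minimal sketch of this weighted, classification-style loss (the function name and tensor shapes are assumptions for illustration):

```python
import torch

# Sketch: observations (N, obs_dim), actions (N,) integer action indices, A (N,) weights.
def actor_loss(policy, observations, actions, A):
    probs = policy(observations)                                   # (N, n_actions)
    log_probs = torch.log(probs + 1e-8)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_n | s_n)
    # A_n > 0 encourages taking a_n in s_n; A_n < 0 discourages it.
    return -(A * chosen).mean()
```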

version 0

Version 0 (using the immediate reward as the weight) is not a good version: it is short-sighted, looking only at the nearby reward and not at later ones. An action affects the subsequent observations, which in turn affect the subsequent rewards; a1 may influence r2.
Reward delay: the actor sometimes has to sacrifice immediate reward to gain a larger long-term reward.
In Space Invaders, only "firing" yields a positive reward, so version 0 learns an actor that always "fires".

version 1

How good a1 is is determined by all the rewards that follow it: version 1 uses the cumulative reward G_t = r_t + r_{t+1} + … + r_N as the weight.
Problem: if the game is very long, should rewards received far in the future really be credited to actions taken much earlier?

version 2

Introduce a discount factor γ < 1 so that rewards further in the future have less influence: G'_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ….
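A small sketch of how the version-1 and version-2 weights can be computed from a recorded reward sequence (the gamma value here is arbitrary):

```python
# Cumulative reward (version 1) and discounted cumulative reward (version 2).
# gamma = 1.0 reproduces version 1; gamma < 1 reduces the influence of distant rewards.
def cumulative_rewards(rewards, gamma=0.99):
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G'_t = r_t + gamma * G'_{t+1}
        G[t] = running
    return G

print(cumulative_rewards([1, 0, 2], gamma=0.9))  # [2.62, 1.8, 2.0]
```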

version 3

Earlier actions accumulate more reward terms, so their G values tend to be larger.
Should G be standardized? Reward is relative!
Whether a reward is good or bad is relative: if every reward satisfies r_n ≥ 10, then a reward of 10 is actually a poor outcome. Subtracting a baseline b makes the resulting G' take both positive and negative values.
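One simple choice for the baseline (an assumption here; the lecture only says some baseline b should be subtracted) is the mean of the returns, optionally followed by dividing by the standard deviation:

```python
import numpy as np

# Sketch: make the weights relative by subtracting a baseline b.
def normalize_returns(G):
    G = np.asarray(G, dtype=np.float32)
    A = G - G.mean()                 # now some weights are positive, some negative
    return A / (G.std() + 1e-8)      # optional extra normalization
```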

2. Policy gradient

Note that the data {s, a} is collected inside the training loop, so it has to be collected many times over the course of training.
Each time the model parameters are updated, the training data must be collected all over again; one round of collected data is used for only a single parameter update.
In fact, the experience we collect is only suitable for the current parameters, not necessarily for the parameters after the update!
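Putting the pieces together, a sketch of this on-policy loop (it reuses the PolicyNet, actor_loss, cumulative_rewards and normalize_returns sketches above, and assumes a hypothetical environment `env` with reset() returning an observation and step(action) returning (next_obs, reward, done)):

```python
import numpy as np
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for iteration in range(1000):
    # 1. collect one episode with the CURRENT parameters theta_i
    obs_list, act_list, rew_list = [], [], []
    obs, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
        action = torch.multinomial(probs, num_samples=1).item()   # sample, do not argmax
        obs_list.append(obs)
        act_list.append(action)
        obs, reward, done = env.step(action)
        rew_list.append(reward)

    # 2. turn the rewards into weights A_t (version 2 plus the baseline)
    A = torch.as_tensor(normalize_returns(cumulative_rewards(rew_list, gamma=0.99)))

    # 3. one update: theta_i -> theta_{i+1}; the data above is then thrown away
    loss = actor_loss(policy,
                      torch.as_tensor(np.array(obs_list), dtype=torch.float32),
                      torch.as_tensor(act_list),
                      A)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```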
Off-policy learning: the actor that interacts with the environment differs from the actor being trained, so we do not have to collect new data after every update.


Gathering Training Data: Explore

The actor needs randomness during data collection; this is a major reason actions are sampled rather than chosen greedily.
Enlarge the entropy of the output distribution.
Add noise to the parameters.
Suppose the actor always chooses "left": we would never find out what happens if it "fires".
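A rough sketch of the two exploration tricks listed above (the coefficients are arbitrary choices of mine):

```python
import torch

# (1) Keep sampling actions from the output distribution, and add an entropy term to the
#     loss so that the distribution does not collapse onto a single action.
def entropy_term(probs, coef=0.01):
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    return -coef * entropy   # add to the loss: minimizing it pushes the entropy up

# (2) Add noise directly to the actor's parameters before collecting data.
def perturb_parameters(policy, std=0.01):
    with torch.no_grad():
        for p in policy.parameters():
            p.add_(torch.randn_like(p) * std)
```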

3. Actor-critic

Critic: given an actor, how good is it when observing s (and possibly taking action a)?
Value function V^θ(s): when using actor θ, the expected discounted cumulative reward obtained after seeing s.

MC (Monte Carlo)

The critic can only be updated after playing a complete game, because the full return is needed as the training target.

TD (Temporal Difference)

Parameters can be updated after playing only a few steps, without waiting for the episode to end.
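A sketch of how a critic network (assumed here to map an observation to a scalar V(s)) could be trained with the MC and TD targets:

```python
import torch
import torch.nn.functional as F

# MC: after a whole episode, regress V(s_t) toward the observed discounted return G'_t.
def mc_critic_loss(critic, observations, returns):
    v = critic(observations).squeeze(-1)                    # (N,)
    return F.mse_loss(v, returns)

# TD: after a single step, use r_t + gamma * V(s_{t+1}) as the target for V(s_t).
def td_critic_loss(critic, obs_t, r_t, obs_t1, gamma=0.99):
    with torch.no_grad():
        target = r_t + gamma * critic(obs_t1).squeeze(-1)   # bootstrap from the next state
    return F.mse_loss(critic(obs_t).squeeze(-1), target)
```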

version 3.5

V(s_t) is an expectation over actions randomly sampled from the actor's distribution.
r_t + V(s_{t+1}) is what is obtained after actually executing a particular action a_t; the difference between the two is used as the weight A_t.
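A sketch of this comparison as an advantage computation (the discount factor is carried over from version 2; the function and variable names are assumptions):

```python
import torch

def advantage(critic, obs_t, r_t, obs_t1, gamma=0.99):
    with torch.no_grad():
        v_t = critic(obs_t).squeeze(-1)    # expectation when actions are sampled at s_t
        v_t1 = critic(obs_t1).squeeze(-1)  # value of the state reached after executing a_t
    return r_t + gamma * v_t1 - v_t        # A_t > 0: a_t was better than the average action
```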


Origin blog.csdn.net/Raphael9900/article/details/128531719