【Learning】Deep reinforcement learning, model compression


1. Deep Reinforcement Learning

Reinforcement Learning Scenario
Comparison between supervised learning and reinforcement learning: supervised learning learns from a teacher's labels, while reinforcement learning learns from the reward obtained by interacting with the environment.
Training a chatbot with reinforcement learning: let two agents talk to each other (sometimes they produce good dialogues, sometimes bad ones).
With this approach we can generate a large number of dialogues, and then use some pre-defined rules to evaluate how good each dialogue is.
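A rough sketch of this self-play idea is below; `agent_reply` and `rule_score` are hypothetical placeholders, not functions from the lecture.

```python
# Sketch: let two copies of a chatbot talk to each other, then score the dialogue with rules.
# agent_reply and rule_score are hypothetical stand-ins for a real model / rule set.

def agent_reply(history):
    """Return the next utterance given the dialogue history (placeholder)."""
    return "..."  # a real chatbot model would generate text here

def rule_score(dialogue):
    """Score a finished dialogue with pre-defined rules (placeholder: penalize repetition)."""
    penalty = sum(1 for a, b in zip(dialogue, dialogue[1:]) if a == b)
    return -penalty

def self_play(turns=6):
    """Generate one dialogue by letting the agent talk to itself, and score it."""
    dialogue = ["Hello"]
    for _ in range(turns):
        dialogue.append(agent_reply(dialogue))
    return dialogue, rule_score(dialogue)

# Generating many dialogues this way gives (dialogue, reward) pairs that can train the agents with RL.
```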

Interactive Retrieval
Reward delay: in Space Invaders, only "fire" obtains a reward, yet the movements before firing matter. In the game of Go, it can be better to sacrifice immediate reward for a larger long-term reward. Moreover, the agent's behavior affects the data it subsequently receives.

Deep reinforcement learning methods can be divided into policy-based and value-based approaches.

Policy-based Approach: Learning an Actor


Neural Networks as Actors

Input of the neural network: the machine's observation, represented as a vector or matrix.
Output of the neural network: each action corresponds to one neuron in the output layer.
In practice the action is sampled stochastically rather than fixed, so the action with the largest output value is not necessarily the one that is taken.
What is the benefit of using a network instead of a lookup table? Generalization: even observations that were never seen before can still yield reasonable actions.
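A minimal PyTorch sketch of such an actor network; the observation size, hidden size, and number of actions are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=128, n_actions=3):   # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),                # one output neuron per action
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)         # probability of taking each action

actor = Actor()
obs = torch.randn(1, 128)                            # the observation, represented as a vector
probs = actor(obs)
action = torch.distributions.Categorical(probs).sample()   # the action is sampled, not argmax
```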

Note that each log-probability is weighted by the total reward of the whole trajectory τ, not by the single-step reward r, because an action influences not only the immediate reward but also all the rewards that follow it.
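Written out in the standard policy-gradient form (notation assumed here: θ are the actor's parameters, τⁿ the n-th sampled trajectory, R(τⁿ) its total reward):

$$
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^{n})\,\nabla \log p_\theta(\tau^{n})
= \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^{n})\,\nabla \log p_\theta(a_t^{n}\mid s_t^{n})
$$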
Dividing by the probability p(a|s) acts as a normalization: without it, actions that happen to be sampled more often would have their probabilities pushed up simply because they appear in more terms of the sum, even when their rewards are low.
If the reward is always positive, the probabilities of all sampled actions get pushed up, so the probability of actions that were never sampled is pushed down. The fix is to subtract a baseline b from the reward, so that the weighting term can be either positive or negative.
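With the baseline, the standard update weight becomes R(τⁿ) − b:

$$
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\bigl(R(\tau^{n})-b\bigr)\,\nabla \log p_\theta(a_t^{n}\mid s_t^{n})
$$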
The critic is a function that evaluates a given actor π, and it is itself represented by a neural network. The state-value function Vπ(s) is the expected cumulative reward obtained after seeing observation (state) s when actor π is subsequently used.
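A minimal sketch of training such a critic by Monte-Carlo estimation: run the actor π, record the cumulative reward obtained after each state, and regress the network onto those returns. The state dimension, network sizes, and discount factor below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Critic: a network that maps a state to the expected cumulative reward under actor pi.
critic = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))  # sizes are illustrative
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)

def mc_returns(rewards, gamma=0.99):
    """Cumulative (discounted) reward following each time step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def train_critic(states, rewards):
    """states: tensor [T, 128] visited by actor pi; rewards: list of per-step rewards."""
    targets = torch.tensor(mc_returns(rewards)).unsqueeze(1)
    loss = nn.functional.mse_loss(critic(states), targets)   # regress V(s) onto observed returns
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```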

2. Small Model

Model Compression

Networks can be pruned

Networks are often over-parameterized (they contain lots of redundant weights or neurons), so we can prune them!
Importance of a weight: its absolute value, the importance estimated as in lifelong learning, ...
Importance of a neuron: the number of times it is non-zero on a given dataset, ...
After pruning, accuracy drops (hopefully not too much).
To recover, fine-tune on the training data. Don't prune too many parameters at once, or the network won't recover.
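A minimal sketch of magnitude-based weight pruning in PyTorch; the layer, the 30% pruning ratio, and the single-layer setting are illustrative assumptions, and a real setup would alternate pruning with fine-tuning.

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 256)                      # one layer of some trained network (illustrative)

# Rank weights by absolute value and zero out the smallest 30%.
w = layer.weight.data
threshold = w.abs().flatten().kthvalue(int(0.3 * w.numel())).values
mask = (w.abs() > threshold).float()
layer.weight.data *= mask

# After this step, fine-tune on the training data; the mask must be re-applied after each
# update so pruned weights stay at zero. Prune a little, fine-tune, and repeat, rather than
# pruning a large fraction at once.
```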
In practice, pruning individual weights leaves an irregular (sparse) network that is hard to accelerate on a GPU, since the pruned weights usually just get set to zero, so it does not necessarily run any faster.
After pruning whole neurons, the network stays regular, so it is easy to accelerate.
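A minimal sketch of neuron (structured) pruning, which keeps the network dense: drop whole neurons of a hidden layer and shrink the adjacent weight matrices accordingly. The layer sizes, the L1-norm importance measure, and keeping half of the neurons are illustrative assumptions.

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(128, 64)      # hidden layer whose neurons we prune (illustrative sizes)
fc2 = nn.Linear(64, 10)

# Keep the neurons whose incoming + outgoing weights have the largest L1 norm.
importance = fc1.weight.abs().sum(dim=1) + fc2.weight.abs().sum(dim=0)
keep = importance.topk(32).indices                 # keep half of the 64 neurons

new_fc1 = nn.Linear(128, 32)
new_fc2 = nn.Linear(32, 10)
new_fc1.weight.data = fc1.weight.data[keep]        # rows of fc1 = kept neurons
new_fc1.bias.data = fc1.bias.data[keep]
new_fc2.weight.data = fc2.weight.data[:, keep]     # columns of fc2 = kept neurons
new_fc2.bias.data = fc2.bias.data
# new_fc1/new_fc2 form a smaller but still regular network: ordinary dense matrix
# multiplications, which a GPU accelerates without any sparse tricks.
```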
Why not simply train a smaller network from scratch? It is widely observed that smaller networks are harder to train successfully; larger networks seem easier to optimize.
A large network can be seen as containing many sub-networks; as long as one of those sub-networks trains successfully, the large network trains successfully.
Training the small network directly from a random initialization fails, but the same small architecture obtained by pruning a trained large network (and keeping the original initialization of the surviving weights) can be trained successfully.
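A rough sketch of that experiment, in the spirit of the Lottery Ticket Hypothesis: train a large network, prune it by weight magnitude, rewind the surviving weights to their original initialization, and retrain. The `train` function, pruning ratio, and masking details are illustrative assumptions (a full implementation would also re-apply the mask after every update).

```python
import copy
import torch
import torch.nn as nn

def lottery_ticket(model, train, prune_ratio=0.8):
    """Prune a trained large network, then retrain it from its ORIGINAL initialization."""
    init_state = copy.deepcopy(model.state_dict())    # remember the random initialization
    train(model)                                      # train the large network

    # Build a mask keeping the largest-magnitude weights of each weight matrix.
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() == 2:                              # weight matrices only
            k = int(prune_ratio * p.numel())
            thr = p.data.abs().flatten().kthvalue(k).values
            masks[name] = (p.data.abs() > thr).float()

    model.load_state_dict(init_state)                 # rewind to the original initialization
    for name, p in model.named_parameters():          # apply the mask: this is the "winning ticket"
        if name in masks:
            p.data *= masks[name]
    train(model)                                      # the pruned sub-network now trains well
    return model
```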

Origin blog.csdn.net/Raphael9900/article/details/128564915