Highway-env Intersection
This article continues to explore the implementation of the DQN-related algorithms in rl-agents, using the intersection environment as the running example. First, we introduce intersection-v1 in Highway-env. The related Highway-env documentation is at http://highway-env.farama.org/environments/intersection/.
The environment in highway-env can be modified through the configuration file. Information such as observations, actions, dynamics and rewards are all stored in the configuration file in the form of dictionaries.
PS: For the principles of the DQN, Double DQN and Dueling DQN algorithms, please refer to [Reinforcement Learning] 10 - DQN Algorithm and [Reinforcement Learning] 11 - Double DQN Algorithm and Dueling DQN Algorithm.
import gymnasium as gym
import pprint
from matplotlib import pyplot as plt
env = gym.make("intersection-v1", render_mode='rgb_array')
pprint.pprint(env.unwrapped.config)
Printing the config shows the following information:
{
'action': {
'dynamical': True,
'lateral': True,
'longitudinal': True,
'steering_range': [-1.0471975511965976, 1.0471975511965976],
'type': 'ContinuousAction'},
'arrived_reward': 1,
'centering_position': [0.5, 0.6],
'collision_reward': -5,
'controlled_vehicles': 1,
'destination': 'o1',
'duration': 13,
'high_speed_reward': 1,
'initial_vehicle_count': 10,
'manual_control': False,
'normalize_reward': False,
'observation': {
'features': ['presence',
'x',
'y',
'vx',
'vy',
'long_off',
'lat_off',
'ang_off'],
'type': 'Kinematics',
'vehicles_count': 5},
'offroad_terminal': False,
'offscreen_rendering': False,
'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
'policy_frequency': 1,
'real_time_rendering': False,
'render_agent': True,
'reward_speed_range': [7.0, 9.0],
'scaling': 7.15,
'screen_height': 600,
'screen_width': 600,
'show_trajectories': False,
'simulation_frequency': 15,
'spawn_probability': 0.6}
The image can then be output via the following code:
plt.imshow(env.render())
plt.show()
Printing the observation shows that it is a 5*8 array:
[[ 1.0000000e+00 9.9999998e-03 1.0000000e+00 0.0000000e+00
-1.2500000e-01 6.3297665e+01 0.0000000e+00 0.0000000e+00]
[ 1.0000000e+00 1.3849856e-01 -1.0000000e+00 -9.9416278e-02
1.2500000e-01 8.1300293e+01 1.0361128e-15 0.0000000e+00]
[ 1.0000000e+00 -2.0000000e-02 -1.0000000e+00 0.0000000e+00
2.2993930e-01 6.5756187e+01 2.8473811e-15 0.0000000e+00]
[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]]
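The observation is interpreted as follows: each row describes one vehicle, and each column is one of the features listed in the configuration (presence, x, y, vx, vy, long_off, lat_off, ang_off). Rows whose presence is 0 are padding for empty vehicle slots.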
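As a small illustration of reading such an array, the sketch below attaches the feature names to each row and drops the padding rows. It assumes the columns follow the order of the `features` list in the config; `label_rows` is a hypothetical helper, not part of highway-env.

```python
# Feature order copied from the 'observation' -> 'features' entry of the config.
FEATURES = ["presence", "x", "y", "vx", "vy", "long_off", "lat_off", "ang_off"]


def label_rows(observation):
    """Turn each observation row into a {feature: value} dict, skipping padding rows."""
    labelled = []
    for row in observation:
        named = dict(zip(FEATURES, row))
        if named["presence"] == 0:  # all-zero padding row: no vehicle in this slot
            continue
        labelled.append(named)
    return labelled


# First and fourth rows of the observation printed above.
sample = [
    [1.0, 0.01, 1.0, 0.0, -0.125, 63.297665, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
vehicles = label_rows(sample)
print(len(vehicles), vehicles[0]["vy"])
```

Only the first row survives, because the second has presence equal to 0.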
Through the following code, the type of action can be changed into a discrete space.
env.unwrapped.configure({
"action": {
'longitudinal': True,
"type": "DiscreteMetaAction"
}
})
DQN of rl-agents
A neural-network model is used to estimate the state-action value function and produce a greedy optimal policy.
Implemented variants:
- Double DQN
- Dueling architecture
- N-step targets
References:
Playing Atari with Deep Reinforcement Learning, Mnih V. et al (2013).
Deep Reinforcement Learning with Double Q-learning, van Hasselt H. et al. (2015).
Dueling Network Architectures for Deep Reinforcement Learning, Wang Z. et al. (2015).
Query agent for actions sequence
As we know from the previous section, agent training is performed by calling the run_episodes function. It calls the step function, which in turn executes self.agent.plan(self.observation). For DQNAgent, plan is implemented by the AbstractAgent class, and the plan function calls the act function:
def step(self):
    """
        Plan a sequence of actions according to the agent policy, and step the environment accordingly.
    """
    # Query agent for actions sequence
    actions = self.agent.plan(self.observation)
# rl_agents/agents/common/abstract.py
class AbstractAgent(Configurable, ABC):

    def __init__(self, config=None):
        super(AbstractAgent, self).__init__(config)
        self.writer = None  # Tensorboard writer
        self.directory = None  # Run directory

    @abstractmethod
    def act(self, state):
        """
            Pick an action
        :param state: s, the current state of the agent
        :return: a, the action to perform
        """
        raise NotImplementedError()

    def plan(self, state):
        """
            Plan an optimal trajectory from an initial state.
        :param state: s, the initial state of the agent
        :return: [a0, a1, a2...], a sequence of actions to perform
        """
        return [self.act(state)]
The DQN abstract class AbstractDQNAgent inherits from AbstractStochasticAgent, which in turn inherits from AbstractAgent. AbstractDQNAgent overrides the act function:
def act(self, state, step_exploration_time=True):
    """
        Act according to the state-action value model and an exploration policy
    :param state: current state
    :param step_exploration_time: step the exploration schedule
    :return: an action
    """
    self.previous_state = state
    if step_exploration_time:
        self.exploration_policy.step_time()
    # Handle multi-agent observations
    # TODO: it would be more efficient to forward a batch of states
    if isinstance(state, tuple):
        return tuple(self.act(agent_state, step_exploration_time=False) for agent_state in state)
    # Single-agent setting
    values = self.get_state_action_values(state)
    self.exploration_policy.update(values)
    return self.exploration_policy.sample()
Exploration strategies
First let's look at how exploration_policy is created:
self.exploration_policy = exploration_factory(self.config["exploration"], self.env.action_space)
The section of the configuration file loaded by the exploration policy:
"exploration": {
"method": "EpsilonGreedy",
"tau": 15000,
"temperature": 1.0,
"final_temperature": 0.05
}
Jumping to exploration_factory, you can see that three types of exploration strategies are implemented. The details are introduced in a later part:
- Greedy
- $\epsilon$-Greedy
- Boltzmann
def exploration_factory(exploration_config, action_space):
    """
        Handles creation of exploration policies
    :param exploration_config: configuration dictionary of the policy, must contain a "method" key
    :param action_space: the environment action space
    :return: a new exploration policy
    """
    from rl_agents.agents.common.exploration.boltzmann import Boltzmann
    from rl_agents.agents.common.exploration.epsilon_greedy import EpsilonGreedy
    from rl_agents.agents.common.exploration.greedy import Greedy

    if exploration_config['method'] == 'Greedy':
        return Greedy(action_space, exploration_config)
    elif exploration_config['method'] == 'EpsilonGreedy':
        return EpsilonGreedy(action_space, exploration_config)
    elif exploration_config['method'] == 'Boltzmann':
        return Boltzmann(action_space, exploration_config)
    else:
        raise ValueError("Unknown exploration method")
Neural network implementation
Then the $Q(s,a)$ values are obtained:
def get_state_action_values(self, state):
    """
    :param state: s, an environment state
    :return: [Q(a1,s), ..., Q(an,s)] the array of its action-values for each actions
    """
    return self.get_batch_state_action_values([state])[0]
This calls the abstract method get_batch_state_action_values:
@abstractmethod
def get_batch_state_action_values(self, states):
    """
    Get the state-action values of several states
    :param states: [s1; ...; sN] an array of states
    :return: values:[[Q11, ..., Q1n]; ...] the array of all action values for each state
    """
    raise NotImplementedError
Next, let’s look at the specific implementation in DQNAgent:
class DQNAgent(AbstractDQNAgent):
    def __init__(self, env, config=None):
        super(DQNAgent, self).__init__(env, config)
        size_model_config(self.env, self.config["model"])
        self.value_net = model_factory(self.config["model"])
        self.target_net = model_factory(self.config["model"])
        self.target_net.load_state_dict(self.value_net.state_dict())
        self.target_net.eval()
        logger.debug("Number of trainable parameters: {}".format(trainable_parameters(self.value_net)))
        self.device = choose_device(self.config["device"])
        self.value_net.to(self.device)
        self.target_net.to(self.device)
        self.loss_function = loss_function_factory(self.config["loss_function"])
        self.optimizer = optimizer_factory(self.config["optimizer"]["type"],
                                           self.value_net.parameters(),
                                           **self.config["optimizer"])
        self.steps = 0

    def get_batch_state_action_values(self, states):
        return self.value_net(torch.tensor(states, dtype=torch.float).to(self.device)).data.cpu().numpy()
The implementation of value_net depends on model_factory, and the relevant part of the configuration file is as follows:
"model": {
"type": "MultiLayerPerceptron",
"layers": [128, 128]
},
Going into model_factory, you can see that four types of networks are implemented:
- MultiLayerPerceptron
- DuelingNetwork
- ConvolutionalNetwork
- EgoAttentionNetwork
Here we will first analyze the multi-layer perceptron (i.e. ordinary DQN).
# rl_agents/agents/common/models.py
def model_factory(config: dict) -> nn.Module:
    if config["type"] == "MultiLayerPerceptron":
        return MultiLayerPerceptron(config)
    elif config["type"] == "DuelingNetwork":
        return DuelingNetwork(config)
    elif config["type"] == "ConvolutionalNetwork":
        return ConvolutionalNetwork(config)
    elif config["type"] == "EgoAttentionNetwork":
        return EgoAttentionNetwork(config)
    else:
        raise ValueError("Unknown model type")
The MultiLayerPerceptron class inherits from BaseModule, and BaseModule inherits from torch.nn.Module. According to the configuration file baseline.json, the layer sizes of the MultiLayerPerceptron class are [128, 128]; the activation function is not set there, so it takes the default value RELU. Note the reshape operation in the network implementation: the input state is a 5*8 matrix, which reshape flattens into a 40-dimensional vector. The final network structure is shown in the figure below.
class MultiLayerPerceptron(BaseModule, Configurable):
    def __init__(self, config):
        super().__init__()
        Configurable.__init__(self, config)
        sizes = [self.config["in"]] + self.config["layers"]
        self.activation = activation_factory(self.config["activation"])
        layers_list = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
        self.layers = nn.ModuleList(layers_list)
        if self.config.get("out", None):
            self.predict = nn.Linear(sizes[-1], self.config["out"])

    @classmethod
    def default_config(cls):
        return {
            "in": None,
            "layers": [64, 64],
            "activation": "RELU",
            "reshape": "True",
            "out": None}

    def forward(self, x):
        if self.config["reshape"]:
            x = x.reshape(x.shape[0], -1)  # We expect a batch of vectors
        for layer in self.layers:
            x = self.activation(layer(x))
        if self.config.get("out", None):
            x = self.predict(x)
        return x
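To make the flattening and the layer sizes concrete, here is a minimal sketch in plain Python that reproduces the bookkeeping of the constructor without PyTorch. It assumes the input size equals the flattened 5*8 observation, and the number of actions (3) is a placeholder chosen for illustration; `mlp_shapes` and `count_parameters` are hypothetical helpers.

```python
def mlp_shapes(in_size, hidden, out_size):
    """Return the (inputs, outputs) pair of every fully connected layer."""
    sizes = [in_size] + hidden  # mirrors: sizes = [config["in"]] + config["layers"]
    shapes = [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
    shapes.append((sizes[-1], out_size))  # the final "predict" layer
    return shapes


def count_parameters(shapes):
    """Weights plus biases of each fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


rows, cols = 5, 8   # observation shape shown earlier
n_actions = 3       # assumption: placeholder number of discrete actions
shapes = mlp_shapes(rows * cols, [128, 128], n_actions)
print(shapes)
print(count_parameters(shapes))
```

Under these assumptions the network maps 40 inputs through two hidden layers of 128 units to one Q-value per action.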
After obtaining $Q$, the exploration policy is updated and an action is sampled. Taking $\epsilon$-Greedy as an example: EpsilonGreedy inherits from DiscreteDistribution, so the main focus is on the related implementation in DiscreteDistribution.
def act(self, state, step_exploration_time=True):
    ...
    self.exploration_policy.update(values)
    return self.exploration_policy.sample()

# rl_agents/agents/common/exploration/epsilon_greedy.py
def update(self, values):
    """
        Update the action distribution parameters
    :param values: the state-action values
    :param step_time: whether to update epsilon schedule
    """
    self.optimal_action = np.argmax(values)
    self.epsilon = self.config['final_temperature'] + \
        (self.config['temperature'] - self.config['final_temperature']) * \
        np.exp(- self.time / self.config['tau'])
    if self.writer:
        self.writer.add_scalar('exploration/epsilon', self.epsilon, self.time)

class DiscreteDistribution(Configurable, ABC):
    def __init__(self, config=None, **kwargs):
        super(DiscreteDistribution, self).__init__(config)
        self.np_random = None

    @abstractmethod
    def get_distribution(self):
        """
        :return: a distribution over actions {action:probability}
        """
        raise NotImplementedError()

    def sample(self):
        """
        :return: an action sampled from the distribution
        """
        distribution = self.get_distribution()
        return self.np_random.choice(list(distribution.keys()), 1, p=np.array(list(distribution.values())))[0]
You can see that a distribution over actions must first be obtained. In $\epsilon$-Greedy this is implemented as:
def get_distribution(self):
    distribution = {
        action: self.epsilon / self.action_space.n for action in range(self.action_space.n)}
    distribution[self.optimal_action] += 1 - self.epsilon
    return distribution
The get_distribution function returns a dictionary describing the probability distribution over actions. Each key is an action, and each value is the probability of that action being selected. The distribution is computed as follows: every action receives a base probability self.epsilon / self.action_space.n, where self.action_space.n is the total number of actions, so that every action has an equal chance of being selected. This is the exploration part. In addition, the optimal action self.optimal_action receives an extra probability 1 - self.epsilon. This is the exploitation part, that is, using the best action known so far.
The sample function samples from the action distribution returned by get_distribution and returns one action. Specifically, it calls np_random.choice with the list of actions and the list of corresponding probabilities, and returns an action drawn at random according to that distribution.
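The two functions can be checked on a toy case. The sketch below is a standard-library re-implementation for illustration only (it uses `random.Random.choices` instead of `np_random.choice`), and the Q-values and the value of epsilon are made up.

```python
import random


def epsilon_greedy_distribution(values, epsilon):
    """Base probability epsilon / n for every action, plus 1 - epsilon for the best one."""
    n = len(values)
    optimal = max(range(n), key=lambda a: values[a])
    distribution = {a: epsilon / n for a in range(n)}
    distribution[optimal] += 1 - epsilon
    return distribution


def sample_action(distribution, rng):
    """Draw one action according to the given probabilities."""
    actions = list(distribution.keys())
    return rng.choices(actions, weights=list(distribution.values()), k=1)[0]


q_values = [0.2, 1.5, -0.3]  # made-up Q-values for three actions
dist = epsilon_greedy_distribution(q_values, epsilon=0.3)
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_action(dist, rng)] += 1
print(dist)
print(counts)
```

With epsilon = 0.3 and three actions, the best action gets 0.1 + 0.7 = 0.8 and each of the others gets 0.1, so the sampled counts are dominated by action 1.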
Summary 1
At this point, the act function returns an action to be executed. The block diagram of this part is as follows:
The next steps have been discussed in the previous lecture: http://t.csdnimg.cn/ddpVJ.
    # Forward the actions to the environment viewer
    try:
        self.env.unwrapped.viewer.set_agent_action_sequence(actions)
    except AttributeError:
        pass
    # Step the environment
    previous_observation, action = self.observation, actions[0]
    transition = self.wrapped_env.step(action)
    self.observation, reward, done, truncated, info = transition
    terminal = done or truncated
    # Call callback
    if self.step_callback_fn is not None:
        self.step_callback_fn(self.episode, self.wrapped_env, self.agent, transition, self.writer)
Record the experience
Now only one step is left in the step function. Let's look at its implementation.
    # Record the experience.
    try:
        self.agent.record(previous_observation, action, reward, self.observation, done, info)
    except NotImplementedError:
        pass
Jump directly to the AbstractDQNAgent class to see the related implementation:
def record(self, state, action, reward, next_state, done, info):
    """
        Record a transition by performing a Deep Q-Network iteration
        - push the transition into memory
        - sample a minibatch
        - compute the bellman residual loss over the minibatch
        - perform one gradient descent step
        - slowly track the policy network with the target network
    :param state: a state
    :param action: an action
    :param reward: a reward
    :param next_state: a next state
    :param done: whether state is terminal
    """
    if not self.training:
        return
    if isinstance(state, tuple) and isinstance(action, tuple):  # Multi-agent setting
        [self.memory.push(agent_state, agent_action, reward, agent_next_state, done, info)
         for agent_state, agent_action, agent_next_state in zip(state, action, next_state)]
    else:  # Single-agent setting
        self.memory.push(state, action, reward, next_state, done, info)
    batch = self.sample_minibatch()
    if batch:
        loss, _, _ = self.compute_bellman_residual(batch)
        self.step_optimizer(loss)
        self.update_target_network()
Replaybuffer
self.memory is an implementation of a replay buffer:
self.memory = ReplayMemory(self.config)
- The push function overwrites the slot at self.position in a fixed-size list instead of appending and popping, which is faster.
- In reinforcement learning, a batch of data is usually sampled from the experience replay buffer (here self.memory) to update the model. n-step sampling is a commonly used technique: instead of a single transition, n consecutive transitions are collapsed into one, accumulating their discounted rewards and keeping the last next state. When n is 1, this is the usual single-step transition; when n is greater than 1, this is n-step sampling.
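The overwrite behaviour of push can be seen in a stripped-down sketch. `RingBuffer` is a hypothetical class written for this illustration; it keeps only the circular-write logic and omits the configuration handling and the shrink branch of ReplayMemory.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entry once full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.position = 0

    def push(self, item):
        if len(self.items) < self.capacity:
            self.items.append(None)  # grow until the buffer is full
        self.items[self.position] = item  # overwrite in place, no pop needed
        self.position = (self.position + 1) % self.capacity


buffer = RingBuffer(capacity=3)
for step in range(5):
    buffer.push(step)
print(buffer.items)  # [3, 4, 2]
```

After five pushes into a buffer of capacity 3, the two oldest entries (0 and 1) have been overwritten by 3 and 4.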
# rl_agents/agents/common/memory.py
class ReplayMemory(Configurable):
    """
        Container that stores and samples transitions.
    """
    def __init__(self, config=None, transition_type=Transition):
        super(ReplayMemory, self).__init__(config)
        self.capacity = int(self.config['memory_capacity'])
        self.transition_type = transition_type
        self.memory = []
        self.position = 0

    @classmethod
    def default_config(cls):
        return dict(memory_capacity=10000,
                    n_steps=1,
                    gamma=0.99)

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
            self.position = len(self.memory) - 1
        elif len(self.memory) > self.capacity:
            self.memory = self.memory[:self.capacity]
        # Faster than append and pop
        self.memory[self.position] = self.transition_type(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size, collapsed=True):
        """
            Sample a batch of transitions.
            If n_steps is greater than one, the batch will be composed of lists of successive transitions.
        :param batch_size: size of the batch
        :param collapsed: whether successive transitions must be collapsed into one n-step transition.
        :return: the sampled batch
        """
        # TODO: use agent's np_random for seeding
        if self.config["n_steps"] == 1:
            # Directly sample transitions
            return random.sample(self.memory, batch_size)
        else:
            # Sample initial transition indexes
            indexes = random.sample(range(len(self.memory)), batch_size)
            # Get the batch of n-consecutive-transitions starting from sampled indexes
            all_transitions = [self.memory[i:i+self.config["n_steps"]] for i in indexes]
            # Collapse transitions
            return map(self.collapse_n_steps, all_transitions) if collapsed else all_transitions

    def collapse_n_steps(self, transitions):
        """
            Collapse n transitions <s,a,r,s',t> of a trajectory into one transition <s0, a0, Sum(r_i), sp, tp>.
            We start from the initial state, perform the first action, and then the return estimate is formed by
            accumulating the discounted rewards along the trajectory until a terminal state or the end of the
            trajectory is reached.
        :param transitions: A list of n successive transitions
        :return: The corresponding n-step transition
        """
        state, action, cumulated_reward, next_state, done, info = transitions[0]
        discount = 1
        for transition in transitions[1:]:
            if done:
                break
            else:
                _, _, reward, next_state, done, info = transition
                discount *= self.config['gamma']
                cumulated_reward += discount*reward
        return state, action, cumulated_reward, next_state, done, info

    def __len__(self):
        return len(self.memory)

    def is_full(self):
        return len(self.memory) == self.capacity

    def is_empty(self):
        return len(self.memory) == 0
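A worked example of collapse_n_steps may help. The function below is a standalone sketch of the same logic, with the info field dropped and gamma passed as an argument; the states, actions and rewards are made up.

```python
def collapse_n_steps(transitions, gamma):
    """Collapse successive (s, a, r, s', done) tuples into one n-step transition."""
    state, action, cumulated_reward, next_state, done = transitions[0]
    discount = 1.0
    for _, _, reward, following_state, following_done in transitions[1:]:
        if done:  # stop accumulating once the trajectory has terminated
            break
        discount *= gamma
        cumulated_reward += discount * reward
        next_state, done = following_state, following_done
    return state, action, cumulated_reward, next_state, done


three_steps = [
    ("s0", "a0", 1.0, "s1", False),
    ("s1", "a1", 1.0, "s2", False),
    ("s2", "a2", 1.0, "s3", False),
]
collapsed = collapse_n_steps(three_steps, gamma=0.5)
print(collapsed)  # reward is 1 + 0.5 + 0.25 = 1.75, next state is s3

early_stop = [
    ("s0", "a0", 1.0, "s1", True),
    ("s1", "a1", 1.0, "s2", False),
]
stopped = collapse_n_steps(early_stop, gamma=0.5)
print(stopped)  # terminal first transition: nothing after it is accumulated
```

The first case shows the discounted sum over three steps; the second shows that rewards after a terminal transition are ignored.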
Back in the record code: the new transition is first pushed into the replay buffer. Once the buffer holds at least batch_size transitions, a minibatch is sampled from it.
def sample_minibatch(self):
    if len(self.memory) < self.config["batch_size"]:
        return None
    transitions = self.memory.sample(self.config["batch_size"])
    return Transition(*zip(*transitions))
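The expression Transition(*zip(*transitions)) turns a list of transitions into a single transition whose fields are tuples, one entry per sample. A minimal sketch, assuming Transition is a namedtuple with the fields that compute_bellman_residual reads (state, action, reward, next_state, terminal, info); the sample values are made up.

```python
from collections import namedtuple

# Assumed field layout, reconstructed from the attributes used in the article's code.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "terminal", "info"])

transitions = [
    Transition("s0", 0, 1.0, "s1", False, {}),
    Transition("s1", 2, 0.0, "s2", True, {}),
]

# zip(*transitions) groups the i-th field of every transition together.
batch = Transition(*zip(*transitions))
print(batch.action)    # (0, 2)
print(batch.terminal)  # (False, True)
```

Each field of the batch can then be converted to a tensor in one call.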
compute_bellman_residual
The Bellman equation is then used to compute the loss for the update:
loss, _, _ = self.compute_bellman_residual(batch)
def compute_bellman_residual(self, batch, target_state_action_value=None):
    # Compute concatenate the batch elements
    if not isinstance(batch.state, torch.Tensor):
        # logger.info("Casting the batch to torch.tensor")
        state = torch.cat(tuple(torch.tensor([batch.state], dtype=torch.float))).to(self.device)
        action = torch.tensor(batch.action, dtype=torch.long).to(self.device)
        reward = torch.tensor(batch.reward, dtype=torch.float).to(self.device)
        next_state = torch.cat(tuple(torch.tensor([batch.next_state], dtype=torch.float))).to(self.device)
        terminal = torch.tensor(batch.terminal, dtype=torch.bool).to(self.device)
        batch = Transition(state, action, reward, next_state, terminal, batch.info)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken
    state_action_values = self.value_net(batch.state)
    state_action_values = state_action_values.gather(1, batch.action.unsqueeze(1)).squeeze(1)

    if target_state_action_value is None:
        with torch.no_grad():
            # Compute V(s_{t+1}) for all next states.
            next_state_values = torch.zeros(batch.reward.shape).to(self.device)
            if self.config["double"]:
                # Double Q-learning: pick best actions from policy network
                _, best_actions = self.value_net(batch.next_state).max(1)
                # Double Q-learning: estimate action values from target network
                best_values = self.target_net(batch.next_state).gather(1, best_actions.unsqueeze(1)).squeeze(1)
            else:
                best_values, _ = self.target_net(batch.next_state).max(1)
            next_state_values[~batch.terminal] = best_values[~batch.terminal]
            # Compute the expected Q values
            target_state_action_value = batch.reward + self.config["gamma"] * next_state_values

    # Compute loss
    loss = self.loss_function(state_action_values, target_state_action_value)
    return loss, target_state_action_value, batch
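- with torch.no_grad(): disables gradient computation within its scope, so the target values are treated as constants.
- Double DQN is implemented: when self.config["double"] is set, the best next action is selected by value_net and its value is estimated by target_net.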
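The difference between the two targets can be seen on a single transition. The sketch below uses plain lists in place of network outputs; the numbers are made up and the function names are hypothetical.

```python
def dqn_target(reward, terminal, target_q_next, gamma):
    """Standard DQN: the target network both selects and evaluates the next action."""
    next_value = 0.0 if terminal else max(target_q_next)
    return reward + gamma * next_value


def double_dqn_target(reward, terminal, value_q_next, target_q_next, gamma):
    """Double DQN: the value network selects the action, the target network evaluates it."""
    if terminal:
        return reward
    best_action = max(range(len(value_q_next)), key=lambda a: value_q_next[a])
    return reward + gamma * target_q_next[best_action]


value_q_next = [1.0, 3.0, 2.0]   # made-up value_net outputs for the next state
target_q_next = [4.0, 2.0, 1.0]  # made-up target_net outputs for the next state

plain = dqn_target(1.0, False, target_q_next, gamma=0.5)
double = double_dqn_target(1.0, False, value_q_next, target_q_next, gamma=0.5)
print(plain)   # 1 + 0.5 * 4 = 3.0
print(double)  # 1 + 0.5 * 2 = 2.0
```

Here the plain target uses the largest target-network value (4.0), while the double target uses the target-network value of the action preferred by the value network (2.0), which is the mechanism meant to reduce overestimation. For a terminal transition both reduce to the reward.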
self.loss_function = loss_function_factory(self.config["loss_function"])
The loss function includes the following:
def loss_function_factory(loss_function):
    if loss_function == "l2":
        return F.mse_loss
    elif loss_function == "l1":
        return F.l1_loss
    elif loss_function == "smooth_l1":
        return F.smooth_l1_loss
    elif loss_function == "bce":
        return F.binary_cross_entropy
    else:
        raise ValueError("Unknown loss function : {}".format(loss_function))
step_optimizer
Each gradient element is clamped to [-1, 1] before the optimizer step:
def step_optimizer(self, loss):
    # Optimize the model
    self.optimizer.zero_grad()
    loss.backward()
    for param in self.value_net.parameters():
        param.grad.data.clamp_(-1, 1)
    self.optimizer.step()
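The effect of clamp_(-1, 1) is element-wise: each gradient component is limited independently, rather than rescaling the whole gradient by its norm. A plain-Python sketch of that behaviour, with made-up gradient values and a hypothetical helper name:

```python
def clamp_gradients(gradients, low=-1.0, high=1.0):
    """Limit every component to [low, high], leaving in-range components untouched."""
    return [min(max(g, low), high) for g in gradients]


clipped = clamp_gradients([0.3, -4.0, 2.5])
print(clipped)  # [0.3, -1.0, 1.0]
```

Small components pass through unchanged, while large ones are cut to the bound, so the direction of the gradient can change.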
update_target_network
The target network is updated by copying the weights of value_net every target_update steps:
def update_target_network(self):
    self.steps += 1
    if self.steps % self.config["target_update"] == 0:
        self.target_net.load_state_dict(self.value_net.state_dict())
Summary 2
At this point, the entire DQN algorithm has been implemented. The block diagram of the record part is as follows:
exploration_policy
This part mainly implements three strategies:
- Greedy
- $\epsilon$-Greedy
- Boltzmann
For background, you can refer to [Reinforcement Learning] 02 - Exploration and Exploitation.
Greedy
The greedy strategy always chooses the action with the highest estimated value: $a_t=\arg\max_{a\in\mathcal{A}} Q(s,a)$
class Greedy(DiscreteDistribution):
    """
        Always use the optimal action
    """
    def __init__(self, action_space, config=None):
        super(Greedy, self).__init__(config)
        self.action_space = action_space
        if isinstance(self.action_space, spaces.Tuple):
            self.action_space = self.action_space.spaces[0]
        if not isinstance(self.action_space, spaces.Discrete):
            raise TypeError("The action space should be discrete")
        self.values = None
        self.seed()

    def get_distribution(self):
        optimal_action = np.argmax(self.values)
        return {
            action: 1 if action == optimal_action else 0 for action in range(self.action_space.n)}

    def update(self, values):
        self.values = values
$\epsilon$-Greedy
The $\epsilon$-Greedy formula is as follows:
$$a_t=\begin{cases}\arg\max_{a\in\mathcal{A}}\hat{Q}(a), & \text{with probability } 1-\epsilon\\ \text{an action chosen at random from } \mathcal{A}, & \text{with probability } \epsilon\end{cases}$$
What is implemented here is actually a decaying $\epsilon$-greedy strategy, and the decay curve is shown in the figure below.
$$\epsilon = \text{final-temperature}+(\text{temperature}-\text{final-temperature})\cdot e^{-t/\tau}$$
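To get a feel for the schedule, the sketch below evaluates the decay formula with the values from the exploration section of the configuration shown earlier (temperature 1.0, final_temperature 0.05, tau 15000). `epsilon_at` is a hypothetical helper written for this illustration.

```python
import math


def epsilon_at(t, temperature=1.0, final_temperature=0.05, tau=15000):
    """Exponential decay from temperature towards final_temperature with time constant tau."""
    return final_temperature + (temperature - final_temperature) * math.exp(-t / tau)


for t in (0, 15000, 45000):
    print(t, round(epsilon_at(t), 4))
```

At t = 0 the value equals temperature, after tau steps the gap to final_temperature has shrunk by a factor of e, and for large t the value approaches final_temperature without going below it.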
class EpsilonGreedy(DiscreteDistribution):
    """
        Uniform distribution with probability epsilon, and optimal action with probability 1-epsilon
    """
    def __init__(self, action_space, config=None):
        super(EpsilonGreedy, self).__init__(config)
        self.action_space = action_space
        if isinstance(self.action_space, spaces.Tuple):
            self.action_space = self.action_space.spaces[0]
        if not isinstance(self.action_space, spaces.Discrete):
            raise TypeError("The action space should be discrete")
        self.config['final_temperature'] = min(self.config['temperature'], self.config['final_temperature'])
        self.optimal_action = None
        self.epsilon = 0
        self.time = 0
        self.writer = None
        self.seed()

    @classmethod
    def default_config(cls):
        return dict(temperature=1.0,
                    final_temperature=0.1,
                    tau=5000)

    def get_distribution(self):
        distribution = {
            action: self.epsilon / self.action_space.n for action in range(self.action_space.n)}
        distribution[self.optimal_action] += 1 - self.epsilon
        return distribution

    def update(self, values):
        """
            Update the action distribution parameters
        :param values: the state-action values
        :param step_time: whether to update epsilon schedule
        """
        self.optimal_action = np.argmax(values)
        self.epsilon = self.config['final_temperature'] + \
            (self.config['temperature'] - self.config['final_temperature']) * \
            np.exp(- self.time / self.config['tau'])
        if self.writer:
            self.writer.add_scalar('exploration/epsilon', self.epsilon, self.time)

    def step_time(self):
        self.time += 1

    def set_time(self, time):
        self.time = time

    def set_writer(self, writer):
        self.writer = writer
Boltzmann
The Boltzmann distribution is a probability distribution function that describes the distribution of molecules at thermodynamic equilibrium. It shows that in a given energy state, the probability of different microstates appearing is different and conforms to an exponential function form.
In thermodynamics, any substance will have certain thermal motion at a certain temperature. These thermal motion states can be described by molecular internal energy or kinetic energy. The Boltzmann distribution shows the distribution probability of molecules among all possible states at the same temperature. Its expression is:
$$P(E_i) = \frac{e^{-E_i/kT}}{\sum_{j} e^{-E_j/kT}}$$
where $P(E_i)$ is the probability that a molecule has energy $E_i$, $k$ is Boltzmann's constant, $T$ is the temperature, and $E_j$ ranges over all attainable energy states.
It can be seen that the occurrence probability of each energy state in the Boltzmann distribution has a negative exponential relationship with its energy, so states with smaller energy are more probable. This is consistent with the trend of increasing entropy, that is, the more ordered the state, the smaller the probability of occurrence. In the exploration policy below, the action value plays the role of $-E_i$ and the configured temperature plays the role of $kT$, so actions with larger values are sampled more often.
class Boltzmann(DiscreteDistribution):
    """
        Uniform distribution with probability epsilon, and optimal action with probability 1-epsilon
    """
    def __init__(self, action_space, config=None):
        super(Boltzmann, self).__init__(config)
        self.action_space = action_space
        if not isinstance(self.action_space, spaces.Discrete):
            raise TypeError("The action space should be discrete")
        self.values = None
        self.seed()

    @classmethod
    def default_config(cls):
        return dict(temperature=0.5)

    def get_distribution(self):
        actions = range(self.action_space.n)
        if self.config['temperature'] > 0:
            weights = np.exp(self.values / self.config['temperature'])
        else:
            weights = np.zeros((len(actions),))
            weights[np.argmax(self.values)] = 1
        return {
            action: weights[action] / np.sum(weights) for action in actions}

    def update(self, values):
        self.values = values
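The influence of the temperature can be checked numerically. The following is a standard-library sketch of the same computation as get_distribution, with made-up Q-values; `boltzmann_distribution` is a hypothetical standalone function.

```python
import math


def boltzmann_distribution(values, temperature):
    """Softmax over values / temperature; falls back to the greedy choice at temperature 0."""
    if temperature > 0:
        weights = [math.exp(v / temperature) for v in values]
    else:
        best = max(range(len(values)), key=lambda a: values[a])
        weights = [1.0 if a == best else 0.0 for a in range(len(values))]
    total = sum(weights)
    return {a: w / total for a, w in enumerate(weights)}


q_values = [1.0, 2.0, 3.0]  # made-up Q-values
cold = boltzmann_distribution(q_values, temperature=0.5)
hot = boltzmann_distribution(q_values, temperature=5.0)
greedy = boltzmann_distribution(q_values, temperature=0)
for name, dist in (("T=0.5", cold), ("T=5.0", hot), ("T=0", greedy)):
    print(name, dist)
```

A low temperature concentrates the probability on the best action, a high temperature pushes the distribution towards uniform, and a temperature of 0 reduces to the greedy strategy.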
Results
The running commands and methods have been introduced in the previous lecture, [rl-agents code learning] 01 - Overall framework.
The hyperparameters use the default settings, and the DQN algorithm is run for 4000 steps and 20000 steps respectively. Use Tensorboard to view the results:
tensorboard --logdir C:\Users\16413\Desktop\rl-agents-master\scripts\out\IntersectionEnv\DQNAgent\baseline_20231113-123234_7944\
4000 steps
You can see that the final episode reward is around 3.
20000 steps
You can see that the final episode reward is around 3.