[rl-agents code learning] 02 - DQN Algorithm

Highway-env Intersection

This article continues exploring the DQN-family algorithms implemented in rl-agents, using the intersection environment as a running example. We first introduce intersection-v1 from Highway-env; the related documentation is at http://highway-env.farama.org/environments/intersection/.

Environments in highway-env can be modified through their configuration: observations, actions, dynamics and rewards are all specified as a dictionary.

PS: For the principles of the DQN and Dueling DQN algorithms, please refer to [Reinforcement Learning] 10 - DQN Algorithm and [Reinforcement Learning] 11 - Double DQN Algorithm and Dueling DQN Algorithm.

import gymnasium as gym
import pprint
from matplotlib import pyplot as plt

env = gym.make("intersection-v1", render_mode='rgb_array')
pprint.pprint(env.unwrapped.config)

Printing the config shows the following information:

{'action': {'dynamical': True,
            'lateral': True,
            'longitudinal': True,
            'steering_range': [-1.0471975511965976, 1.0471975511965976],
            'type': 'ContinuousAction'},
 'arrived_reward': 1,
 'centering_position': [0.5, 0.6],
 'collision_reward': -5,
 'controlled_vehicles': 1,
 'destination': 'o1',
 'duration': 13,
 'high_speed_reward': 1,
 'initial_vehicle_count': 10,
 'manual_control': False,
 'normalize_reward': False,
 'observation': {'features': ['presence',
                              'x',
                              'y',
                              'vx',
                              'vy',
                              'long_off',
                              'lat_off',
                              'ang_off'],
                 'type': 'Kinematics',
                 'vehicles_count': 5},
 'offroad_terminal': False,
 'offscreen_rendering': False,
 'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
 'policy_frequency': 1,
 'real_time_rendering': False,
 'render_agent': True,
 'reward_speed_range': [7.0, 9.0],
 'scaling': 7.15,
 'screen_height': 600,
 'screen_width': 600,
 'show_trajectories': False,
 'simulation_frequency': 15,
 'spawn_probability': 0.6}

The scene can then be rendered and displayed with the following code:

plt.imshow(env.render())
plt.show()

(figure: rendered intersection-v1 scene)
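The observation printed below was obtained by resetting the environment. A minimal sketch (assuming the env object created above; the exact values will differ between runs):

# Sketch: reset the environment to obtain an initial observation
obs, info = env.reset()
print(obs.shape)  # expected (5, 8): vehicles_count=5 rows, 8 Kinematics features per row
print(obs)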
Printing the observation, you can see that it is a 5x8 array:

[[ 1.0000000e+00  9.9999998e-03  1.0000000e+00  0.0000000e+00
  -1.2500000e-01  6.3297665e+01  0.0000000e+00  0.0000000e+00]
 [ 1.0000000e+00  1.3849856e-01 -1.0000000e+00 -9.9416278e-02
   1.2500000e-01  8.1300293e+01  1.0361128e-15  0.0000000e+00]
 [ 1.0000000e+00 -2.0000000e-02 -1.0000000e+00  0.0000000e+00
   2.2993930e-01  6.5756187e+01  2.8473811e-15  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00]]

The observation features are explained as follows:
(figure: description of the Kinematics observation features)
The action type can be changed to a discrete space with the following code:

env.unwrapped.configure({
    "action": {
        'longitudinal': True,
        "type": "DiscreteMetaAction"
    }
})
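As a quick check (a sketch, not from the original code), resetting after configure() applies the new action type:

# Sketch: reset so the reconfigured action type takes effect, then inspect the action space
obs, info = env.reset()
print(env.action_space)  # expected: a small Discrete(n) space of meta-actions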

DQN in rl-agents

A neural-network model is used to estimate the state-action value function and produce a greedy optimal policy.

Implemented variants:

  • Double DQN
  • Dueling architecture
  • N-step targets

References:

Playing Atari with Deep Reinforcement Learning, Mnih V. et al (2013).
Deep Reinforcement Learning with Double Q-learning, van Hasselt H. et al. (2015).
Dueling Network Architectures for Deep Reinforcement Learning, Wang Z. et al. (2015).

Query agent for actions sequence

As we know from the previous section, agent training is driven by the run_episodes function, which calls step, where self.agent.plan(self.observation) is executed. For DQNAgent, plan is provided by the AbstractAgent base class, and plan in turn calls the act function:

    def step(self):
        """
            Plan a sequence of actions according to the agent policy, and step the environment accordingly.
        """
        # Query agent for actions sequence
        actions = self.agent.plan(self.observation)

# rl_agents/agents/common/abstract.py
class AbstractAgent(Configurable, ABC):

    def __init__(self, config=None):
        super(AbstractAgent, self).__init__(config)
        self.writer = None  # Tensorboard writer
        self.directory = None  # Run directory
        
    @abstractmethod
    def act(self, state):
        """
            Pick an action

        :param state: s, the current state of the agent
        :return: a, the action to perform
        """
        raise NotImplementedError()

    def plan(self, state):
        """
            Plan an optimal trajectory from an initial state.

        :param state: s, the initial state of the agent
        :return: [a0, a1, a2...], a sequence of actions to perform
        """
        return [self.act(state)]

The DQN abstract class AbstractDQNAgent inherits from AbstractStochasticAgent, which in turn inherits from AbstractAgent. AbstractDQNAgent overrides the act function:

    def act(self, state, step_exploration_time=True):
        """
            Act according to the state-action value model and an exploration policy
        :param state: current state
        :param step_exploration_time: step the exploration schedule
        :return: an action
        """
        self.previous_state = state
        if step_exploration_time:
            self.exploration_policy.step_time()
        # Handle multi-agent observations
        # TODO: it would be more efficient to forward a batch of states
        if isinstance(state, tuple):
            return tuple(self.act(agent_state, step_exploration_time=False) for agent_state in state)

        # Single-agent setting
        values = self.get_state_action_values(state)
        self.exploration_policy.update(values)
        return self.exploration_policy.sample()

Exploration strategies

First, let's look at the implementation of exploration_policy:

        self.exploration_policy = exploration_factory(self.config["exploration"], self.env.action_space)

The exploration policy is built from this section of the configuration file:

"exploration": {
    
    
 "method": "EpsilonGreedy",
    "tau": 15000,
    "temperature": 1.0,
    "final_temperature": 0.05
}

Jumping to exploration_factory, you can see that three exploration strategies are implemented; their details are covered later in this article:

  • Greedy
  • ε-Greedy
  • Boltzmann
def exploration_factory(exploration_config, action_space):
    """
        Handles creation of exploration policies
    :param exploration_config: configuration dictionary of the policy, must contain a "method" key
    :param action_space: the environment action space
    :return: a new exploration policy
    """
    from rl_agents.agents.common.exploration.boltzmann import Boltzmann
    from rl_agents.agents.common.exploration.epsilon_greedy import EpsilonGreedy
    from rl_agents.agents.common.exploration.greedy import Greedy

    if exploration_config['method'] == 'Greedy':
        return Greedy(action_space, exploration_config)
    elif exploration_config['method'] == 'EpsilonGreedy':
        return EpsilonGreedy(action_space, exploration_config)
    elif exploration_config['method'] == 'Boltzmann':
        return Boltzmann(action_space, exploration_config)
    else:
        raise ValueError("Unknown exploration method")
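A minimal usage sketch (assuming exploration_factory as defined above is in scope and using the baseline configuration values; the Discrete(3) action space is an arbitrary example):

from gymnasium import spaces

# Sketch: build an epsilon-greedy exploration policy for a 3-action discrete space
policy = exploration_factory(
    {"method": "EpsilonGreedy", "tau": 15000, "temperature": 1.0, "final_temperature": 0.05},
    spaces.Discrete(3),
)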

Neural network implementation

Then $Q(s,a)$ is obtained:

    def get_state_action_values(self, state):
        """
        :param state: s, an environment state
        :return: [Q(a1,s), ..., Q(an,s)] the array of its action-values for each actions
        """
        return self.get_batch_state_action_values([state])[0]

which calls the abstract method get_batch_state_action_values:

    @abstractmethod
    def get_batch_state_action_values(self, states):
        """
        Get the state-action values of several states
        :param states: [s1; ...; sN] an array of states
        :return: values:[[Q11, ..., Q1n]; ...] the array of all action values for each state
        """
        raise NotImplementedError

Next, let’s look at the specific implementation in DQNAgent:

class DQNAgent(AbstractDQNAgent):
    def __init__(self, env, config=None):
        super(DQNAgent, self).__init__(env, config)
        size_model_config(self.env, self.config["model"])
        self.value_net = model_factory(self.config["model"])
        self.target_net = model_factory(self.config["model"])
        self.target_net.load_state_dict(self.value_net.state_dict())
        self.target_net.eval()
        logger.debug("Number of trainable parameters: {}".format(trainable_parameters(self.value_net)))
        self.device = choose_device(self.config["device"])
        self.value_net.to(self.device)
        self.target_net.to(self.device)
        self.loss_function = loss_function_factory(self.config["loss_function"])
        self.optimizer = optimizer_factory(self.config["optimizer"]["type"],
                                           self.value_net.parameters(),
                                           **self.config["optimizer"])
        self.steps = 0
        
    def get_batch_state_action_values(self, states):
        return self.value_net(torch.tensor(states, dtype=torch.float).to(self.device)).data.cpu().numpy()

The implementation of value_net depends on model_factory; the corresponding part of the configuration file is as follows:

    "model": {
    
    
        "type": "MultiLayerPerceptron",
        "layers": [128, 128]
    },

Moving into model_factory, four types of networks are implemented:

  • MultiLayerPerceptron
  • DuelingNetwork
  • ConvolutionalNetwork
  • EgoAttentionNetwork

Here we will first analyze the multi-layer perceptron (i.e. ordinary DQN).

# rl_agents/agents/common/models.py
def model_factory(config: dict) -> nn.Module:
    if config["type"] == "MultiLayerPerceptron":
        return MultiLayerPerceptron(config)
    elif config["type"] == "DuelingNetwork":
        return DuelingNetwork(config)
    elif config["type"] == "ConvolutionalNetwork":
        return ConvolutionalNetwork(config)
    elif config["type"] == "EgoAttentionNetwork":
        return EgoAttentionNetwork(config)
    else:
        raise ValueError("Unknown model type")

The MultiLayerPerceptron class inherits from BaseModule, which inherits from torch.nn.Module. According to the configuration file baseline.json, the hidden layer sizes of the MultiLayerPerceptron are [128, 128] and the activation function is ReLU. Note the reshape operation in the forward pass: since the state is a 5x8 matrix, reshape flattens it into a 40-dimensional vector before it is fed to the linear layers. The final network structure is shown in the figure below.

(figure: MultiLayerPerceptron network structure)

class MultiLayerPerceptron(BaseModule, Configurable):
    def __init__(self, config):
        super().__init__()
        Configurable.__init__(self, config)
        sizes = [self.config["in"]] + self.config["layers"] 
        self.activation = activation_factory(self.config["activation"])
        layers_list = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
        self.layers = nn.ModuleList(layers_list)
        if self.config.get("out", None):
            self.predict = nn.Linear(sizes[-1], self.config["out"])

    @classmethod
    def default_config(cls):
        return {"in": None,
                "layers": [64, 64],
                "activation": "RELU",
                "reshape": "True",
                "out": None}

    def forward(self, x):
        if self.config["reshape"]:
            x = x.reshape(x.shape[0], -1)  # We expect a batch of vectors
        for layer in self.layers:
            x = self.activation(layer(x))
        if self.config.get("out", None):
            x = self.predict(x)
        return x
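To see the reshape in action, here is a small sketch. The "in"=40 (5x8 flattened) and "out"=3 values are illustrative assumptions; in rl-agents they are filled in by size_model_config from the environment's observation and action spaces:

import torch
from rl_agents.agents.common.models import MultiLayerPerceptron

# Sketch: "in"=40 and "out"=3 are assumed values for illustration only
net = MultiLayerPerceptron({"in": 40, "layers": [128, 128], "out": 3})

obs = torch.rand(1, 5, 8)   # a batch containing one 5x8 Kinematics observation
q = net(obs)                # forward() reshapes the input to shape (1, 40)
print(q.shape)              # torch.Size([1, 3]), one Q-value per action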

After obtaining $Q$, the exploration policy is updated and an action is sampled. Taking ε-Greedy as an example: EpsilonGreedy inherits from DiscreteDistribution, so the main focus is on the related implementation in DiscreteDistribution.

    def act(self, state, step_exploration_time=True):
    ...
        self.exploration_policy.update(values)
        return self.exploration_policy.sample()

# rl_agents/agents/common/exploration/epsilon_greedy.py
    def update(self, values):
        """
            Update the action distribution parameters
        :param values: the state-action values
        :param step_time: whether to update epsilon schedule
        """
        self.optimal_action = np.argmax(values)
        self.epsilon = self.config['final_temperature'] + \
            (self.config['temperature'] - self.config['final_temperature']) * \
            np.exp(- self.time / self.config['tau'])
        if self.writer:
            self.writer.add_scalar('exploration/epsilon', self.epsilon, self.time)

class DiscreteDistribution(Configurable, ABC):
    def __init__(self, config=None, **kwargs):
        super(DiscreteDistribution, self).__init__(config)
        self.np_random = None
        
    @abstractmethod
    def get_distribution(self):
        """
        :return: a distribution over actions {action:probability}
        """
        raise NotImplementedError()

    def sample(self):
        """
        :return: an action sampled from the distribution
        """
        distribution = self.get_distribution()
        return self.np_random.choice(list(distribution.keys()), 1, p=np.array(list(distribution.values())))[0]

You can see that a distribution over actions must be obtained first. The ε-Greedy implementation of get_distribution is:

    def get_distribution(self):
        distribution = {action: self.epsilon / self.action_space.n for action in range(self.action_space.n)}
        distribution[self.optimal_action] += 1 - self.epsilon
        return distribution

The get_distribution function returns a dictionary describing the action probability distribution: its keys are actions and its values are the probabilities of selecting them. Each action receives a base probability of self.epsilon / self.action_space.n, where self.action_space.n is the total number of actions, so every action has an equal chance of being chosen; this is the exploration part. The optimal action self.optimal_action then receives an additional probability of 1 - self.epsilon; this is the exploitation part, i.e. using the currently known best action.

The sample function draws an action according to the probability distribution returned by get_distribution. Specifically, it uses np_random.choice, which takes the list of actions and the corresponding list of probabilities and returns an action sampled from that distribution.
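As a concrete illustration (a sketch with made-up numbers, not part of rl-agents): with 3 actions, epsilon = 0.2 and optimal action 1, each action gets a base probability of about 0.067 and action 1 receives an extra 0.8:

import numpy as np

# Sketch with made-up numbers: 3 actions, epsilon = 0.2, optimal action index 1
n, epsilon, optimal_action = 3, 0.2, 1
distribution = {a: epsilon / n for a in range(n)}
distribution[optimal_action] += 1 - epsilon
print(distribution)  # approximately {0: 0.067, 1: 0.867, 2: 0.067}

# Sampling mirrors DiscreteDistribution.sample()
rng = np.random.default_rng(0)
action = rng.choice(list(distribution.keys()), p=list(distribution.values()))
print(action)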

Summary 1

At this point, the act function has returned an action to be executed. The block diagram of this part is as follows:

(figure: block diagram of the action-selection path)
The next steps have already been discussed in the previous lecture: http://t.csdnimg.cn/ddpVJ.

        # Forward the actions to the environment viewer
        try:
            self.env.unwrapped.viewer.set_agent_action_sequence(actions)
        except AttributeError:
            pass
            
        # Step the environment
        previous_observation, action = self.observation, actions[0]
        transition = self.wrapped_env.step(action)
        self.observation, reward, done, truncated, info = transition
        terminal = done or truncated

        # Call callback
        if self.step_callback_fn is not None:
            self.step_callback_fn(self.episode, self.wrapped_env, self.agent, transition, self.writer)

Record the experience

Now only one step remains in the step function; let's look at its implementation.

        # Record the experience.
        try:
            self.agent.record(previous_observation, action, reward, self.observation, done, info)
        except NotImplementedError:
            pass

Jump directly to the AbstractDQNAgent class to view the related implementation:

    def record(self, state, action, reward, next_state, done, info):
        """
            Record a transition by performing a Deep Q-Network iteration

            - push the transition into memory
            - sample a minibatch
            - compute the bellman residual loss over the minibatch
            - perform one gradient descent step
            - slowly track the policy network with the target network
        :param state: a state
        :param action: an action
        :param reward: a reward
        :param next_state: a next state
        :param done: whether state is terminal
        """
        if not self.training:
            return
        if isinstance(state, tuple) and isinstance(action, tuple):  # Multi-agent setting
            [self.memory.push(agent_state, agent_action, reward, agent_next_state, done, info)
             for agent_state, agent_action, agent_next_state in zip(state, action, next_state)]
        else:  # Single-agent setting
            self.memory.push(state, action, reward, next_state, done, info)
        batch = self.sample_minibatch()
        if batch:
            loss, _, _ = self.compute_bellman_residual(batch)
            self.step_optimizer(loss)
            self.update_target_network()

Replay buffer

self.memory is an instance of the replay buffer:

  self.memory = ReplayMemory(self.config)

  • The push implementation overwrites entries in a preallocated slot, which (as the code comment notes) is faster than append-and-pop.
  • In reinforcement learning, a batch of data is regularly sampled from the experience replay buffer (here: self.memory) to update the model. The n-step mechanism is a commonly used technique: instead of using only the current transition, the next n-1 transitions are also used when forming the target. When n is 1 this is the usual single-step transition; when n is greater than 1 this is n-step sampling. A small usage sketch is given after the ReplayMemory code below.
# rl_agents/agents/common/memory.py
class ReplayMemory(Configurable):
    """
        Container that stores and samples transitions.
    """
    def __init__(self, config=None, transition_type=Transition):
        super(ReplayMemory, self).__init__(config)
        self.capacity = int(self.config['memory_capacity'])
        self.transition_type = transition_type
        self.memory = []
        self.position = 0

    @classmethod
    def default_config(cls):
        return dict(memory_capacity=10000,
                    n_steps=1,
                    gamma=0.99)

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
            self.position = len(self.memory) - 1
        elif len(self.memory) > self.capacity:
            self.memory = self.memory[:self.capacity]
        # Faster than append and pop
        self.memory[self.position] = self.transition_type(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size, collapsed=True):
        """
            Sample a batch of transitions.

            If n_steps is greater than one, the batch will be composed of lists of successive transitions.
        :param batch_size: size of the batch
        :param collapsed: whether successive transitions must be collapsed into one n-step transition.
        :return: the sampled batch
        """
        # TODO: use agent's np_random for seeding
        if self.config["n_steps"] == 1:
            # Directly sample transitions
            return random.sample(self.memory, batch_size)
        else:
            # Sample initial transition indexes
            indexes = random.sample(range(len(self.memory)), batch_size)
            # Get the batch of n-consecutive-transitions starting from sampled indexes
            all_transitions = [self.memory[i:i+self.config["n_steps"]] for i in indexes]
            # Collapse transitions
            return map(self.collapse_n_steps, all_transitions) if collapsed else all_transitions

    def collapse_n_steps(self, transitions):
        """
            Collapse n transitions <s,a,r,s',t> of a trajectory into one transition <s0, a0, Sum(r_i), sp, tp>.

            We start from the initial state, perform the first action, and then the return estimate is formed by
            accumulating the discounted rewards along the trajectory until a terminal state or the end of the
            trajectory is reached.
        :param transitions: A list of n successive transitions
        :return: The corresponding n-step transition
        """
        state, action, cumulated_reward, next_state, done, info = transitions[0]
        discount = 1
        for transition in transitions[1:]:
            if done:
                break
            else:
                _, _, reward, next_state, done, info = transition
                discount *= self.config['gamma']
                cumulated_reward += discount*reward
        return state, action, cumulated_reward, next_state, done, info

    def __len__(self):
        return len(self.memory)

    def is_full(self):
        return len(self.memory) == self.capacity

    def is_empty(self):
        return len(self.memory) == 0
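As announced above, here is a small usage sketch of ReplayMemory (the transition values are made up; it assumes the rl_agents package is installed):

from rl_agents.agents.common.memory import ReplayMemory

# Sketch: push a few fake transitions (state, action, reward, next_state, done, info)
memory = ReplayMemory({"memory_capacity": 100, "n_steps": 1, "gamma": 0.99})
for t in range(10):
    memory.push([t], 0, 1.0, [t + 1], False, {})

batch = memory.sample(batch_size=4)
print(len(batch))  # 4 sampled transitions (single-step, since n_steps=1)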

Back to the record code: the transition is first pushed into the replay buffer, and once the buffer holds at least batch_size transitions, a minibatch is sampled from it.

    def sample_minibatch(self):
        if len(self.memory) < self.config["batch_size"]:
            return None
        transitions = self.memory.sample(self.config["batch_size"])
        return Transition(*zip(*transitions))
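The Transition(*zip(*transitions)) idiom transposes a list of transitions into a single Transition whose fields are tuples of batched values. A small illustration with a simplified, made-up namedtuple:

from collections import namedtuple

# Simplified namedtuple, only for illustrating the transpose idiom
T = namedtuple("T", ["state", "action", "reward"])
transitions = [T(1, 0, 0.5), T(2, 1, 1.0), T(3, 0, 0.0)]

batch = T(*zip(*transitions))
print(batch.state)   # (1, 2, 3)
print(batch.action)  # (0, 1, 0)
print(batch.reward)  # (0.5, 1.0, 0.0)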

compute_bellman_residual

Then the Bellman equation is used to compute the loss:
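In symbols, the target computed below is (writing $d_t$ for the terminal indicator; the second line is the Double-DQN variant selected by the "double" flag):

$$y_t = r_t + \gamma\,(1-d_t)\max_{a'} Q_{\text{target}}(s_{t+1}, a')$$

$$y_t^{\text{double}} = r_t + \gamma\,(1-d_t)\,Q_{\text{target}}\big(s_{t+1}, \arg\max_{a'} Q_{\text{value}}(s_{t+1}, a')\big)$$

The loss is then the configured loss function applied to $Q_{\text{value}}(s_t, a_t)$ and $y_t$.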

loss, _, _ = self.compute_bellman_residual(batch)

    def compute_bellman_residual(self, batch, target_state_action_value=None):
        # Compute concatenate the batch elements
        if not isinstance(batch.state, torch.Tensor):
            # logger.info("Casting the batch to torch.tensor")
            state = torch.cat(tuple(torch.tensor([batch.state], dtype=torch.float))).to(self.device)
            action = torch.tensor(batch.action, dtype=torch.long).to(self.device)
            reward = torch.tensor(batch.reward, dtype=torch.float).to(self.device)
            next_state = torch.cat(tuple(torch.tensor([batch.next_state], dtype=torch.float))).to(self.device)
            terminal = torch.tensor(batch.terminal, dtype=torch.bool).to(self.device)
            batch = Transition(state, action, reward, next_state, terminal, batch.info)

        # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
        # columns of actions taken
        state_action_values = self.value_net(batch.state)
        state_action_values = state_action_values.gather(1, batch.action.unsqueeze(1)).squeeze(1)

        if target_state_action_value is None:
            with torch.no_grad():
                # Compute V(s_{t+1}) for all next states.
                next_state_values = torch.zeros(batch.reward.shape).to(self.device)
                if self.config["double"]:
                    # Double Q-learning: pick best actions from policy network
                    _, best_actions = self.value_net(batch.next_state).max(1)
                    # Double Q-learning: estimate action values from target network
                    best_values = self.target_net(batch.next_state).gather(1, best_actions.unsqueeze(1)).squeeze(1)
                else:
                    best_values, _ = self.target_net(batch.next_state).max(1)
                next_state_values[~batch.terminal] = best_values[~batch.terminal]
                # Compute the expected Q values
                target_state_action_value = batch.reward + self.config["gamma"] * next_state_values

        # Compute loss
        loss = self.loss_function(state_action_values, target_state_action_value)
        return loss, target_state_action_value, batch
  • with torch.no_grad(): disables gradient computation within its scope.
  • Double DQN is implemented via the "double" config flag: the value network selects the best next action, and the target network evaluates it.
  • self.loss_function = loss_function_factory(self.config["loss_function"]) selects the loss function from the following options:
def loss_function_factory(loss_function):
    if loss_function == "l2":
        return F.mse_loss
    elif loss_function == "l1":
        return F.l1_loss
    elif loss_function == "smooth_l1":
        return F.smooth_l1_loss
    elif loss_function == "bce":
        return F.binary_cross_entropy
    else:
        raise ValueError("Unknown loss function : {}".format(loss_function))

step_optimizer

The gradients are clipped to [-1, 1] before the optimizer step:

    def step_optimizer(self, loss):
        # Optimize the model
        self.optimizer.zero_grad()
        loss.backward()
        for param in self.value_net.parameters():
            param.grad.data.clamp_(-1, 1)
        self.optimizer.step()

update_target_network

The target network is synchronized with the value network every target_update steps:

    def update_target_network(self):
        self.steps += 1
        if self.steps % self.config["target_update"] == 0:
            self.target_net.load_state_dict(self.value_net.state_dict())

Summary 2

At this point, the entire DQN algorithm has been implemented. The block diagram of the record part is as follows:

(figure: block diagram of the record/training path)

exploration_policy

This part mainly implements three strategies:

  • Greedy
  • ε-Greedy
  • Boltzmann

You can refer to [Reinforcement Learning] 02 - Exploration and Exploitation for background on this part.

Greedy

The Greedy strategy always chooses the optimal action $a_t=\arg\max_{a\in\mathcal{A}} Q(s,a)$:

class Greedy(DiscreteDistribution):
    """
        Always use the optimal action
    """

    def __init__(self, action_space, config=None):
        super(Greedy, self).__init__(config)
        self.action_space = action_space
        if isinstance(self.action_space, spaces.Tuple):
            self.action_space = self.action_space.spaces[0]
        if not isinstance(self.action_space, spaces.Discrete):
            raise TypeError("The action space should be discrete")
        self.values = None
        self.seed()

    def get_distribution(self):
        optimal_action = np.argmax(self.values)
        return {action: 1 if action == optimal_action else 0 for action in range(self.action_space.n)}

    def update(self, values):
        self.values = values

ε-Greedy

The ε-Greedy policy is:

$$a_t=\begin{cases}\arg\max_{a\in\mathcal{A}}\hat{Q}(a), & \text{with probability } 1-\epsilon\\ \text{an action drawn uniformly from } \mathcal{A}, & \text{with probability } \epsilon\end{cases}$$

What is implemented here is actually a decaying ε-greedy strategy, whose schedule is

$$\epsilon = \text{final\_temperature}+(\text{temperature}-\text{final\_temperature})\,e^{-t/\tau}$$

and whose decay curve is shown in the figure below.
(figure: decayed epsilon schedule)
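With the baseline configuration above (temperature 1.0, final_temperature 0.05, tau 15000), the schedule can be reproduced with a few lines (a sketch, not part of rl-agents):

import numpy as np
from matplotlib import pyplot as plt

# Sketch: reproduce the decayed-epsilon schedule with the baseline config values
temperature, final_temperature, tau = 1.0, 0.05, 15000
t = np.arange(0, 60000)
epsilon = final_temperature + (temperature - final_temperature) * np.exp(-t / tau)
plt.plot(t, epsilon)
plt.xlabel("time step")
plt.ylabel("epsilon")
plt.show()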

class EpsilonGreedy(DiscreteDistribution):
    """
        Uniform distribution with probability epsilon, and optimal action with probability 1-epsilon
    """

    def __init__(self, action_space, config=None):
        super(EpsilonGreedy, self).__init__(config)
        self.action_space = action_space
        if isinstance(self.action_space, spaces.Tuple):
            self.action_space = self.action_space.spaces[0]
        if not isinstance(self.action_space, spaces.Discrete):
            raise TypeError("The action space should be discrete")
        self.config['final_temperature'] = min(self.config['temperature'], self.config['final_temperature'])
        self.optimal_action = None
        self.epsilon = 0
        self.time = 0
        self.writer = None
        self.seed()

    @classmethod
    def default_config(cls):
        return dict(temperature=1.0,
                    final_temperature=0.1,
                    tau=5000)

    def get_distribution(self):
        distribution = {action: self.epsilon / self.action_space.n for action in range(self.action_space.n)}
        distribution[self.optimal_action] += 1 - self.epsilon
        return distribution

    def update(self, values):
        """
            Update the action distribution parameters
        :param values: the state-action values
        :param step_time: whether to update epsilon schedule
        """
        self.optimal_action = np.argmax(values)
        self.epsilon = self.config['final_temperature'] + \
            (self.config['temperature'] - self.config['final_temperature']) * \
            np.exp(- self.time / self.config['tau'])
        if self.writer:
            self.writer.add_scalar('exploration/epsilon', self.epsilon, self.time)

    def step_time(self):
        self.time += 1

    def set_time(self, time):
        self.time = time

    def set_writer(self, writer):
        self.writer = writer

Boltzmann

The Boltzmann distribution is a probability distribution that describes how molecules are distributed over energy states at thermodynamic equilibrium: the probability of occupying a state decreases exponentially with its energy.

In thermodynamics, any substance will have certain thermal motion at a certain temperature. These thermal motion states can be described by molecular internal energy or kinetic energy. The Boltzmann distribution shows the distribution probability of molecules among all possible states at the same temperature. Its expression is:

$$P(E_i) = \frac{e^{-E_i/kT}}{\sum_{j} e^{-E_j/kT}}$$

where $P(E_i)$ is the probability that a molecule occupies energy state $E_i$, $k$ is Boltzmann's constant, $T$ is the temperature, and the sum runs over all attainable energy states $E_j$.

The probability of each energy state therefore has a negative exponential relationship with its energy, so lower-energy states are more likely to occur; this is consistent with the tendency toward higher entropy, where more ordered states are less probable. In the exploration setting the same form is used with action values in place of (negative) energies: actions with higher Q-values receive exponentially larger probabilities, and the temperature controls how peaked the distribution is.

class Boltzmann(DiscreteDistribution):
    """
        Uniform distribution with probability epsilon, and optimal action with probability 1-epsilon
    """

    def __init__(self, action_space, config=None):
        super(Boltzmann, self).__init__(config)
        self.action_space = action_space
        if not isinstance(self.action_space, spaces.Discrete):
            raise TypeError("The action space should be discrete")
        self.values = None
        self.seed()

    @classmethod
    def default_config(cls):
        return dict(temperature=0.5)

    def get_distribution(self):
        actions = range(self.action_space.n)
        if self.config['temperature'] > 0:
            weights = np.exp(self.values / self.config['temperature'])
        else:
            weights = np.zeros((len(actions),))
            weights[np.argmax(self.values)] = 1
        return {action: weights[action] / np.sum(weights) for action in actions}

    def update(self, values):
        self.values = values
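A small worked sketch (made-up Q-values) showing how the temperature shapes the Boltzmann action distribution:

import numpy as np

# Sketch: softmax over made-up action values at two temperatures
values = np.array([1.0, 2.0, 3.0])
for temperature in (0.5, 5.0):
    weights = np.exp(values / temperature)
    probs = weights / weights.sum()
    print(temperature, np.round(probs, 3))
# Lower temperature concentrates probability on the highest-value action.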

Results

The running commands and procedure were introduced in the previous lecture, [rl-agents code learning] 01 - Overall framework.

The default hyperparameter settings are used, and the DQN algorithm is run for 4000 steps and 20000 steps respectively. TensorBoard is used to view the results:

 tensorboard --logdir C:\Users\16413\Desktop\rl-agents-master\scripts\out\IntersectionEnv\DQNAgent\baseline_20231113-123234_7944\

4000 steps

(figures: TensorBoard training curves for the 4000-step run)
You can see that the final episode reward is roughly around 3.

20000 steps

(figure: TensorBoard training curves for the 20000-step run)
You can see that the final episode reward is roughly around 3.

Origin: blog.csdn.net/sinat_52032317/article/details/134378227