Machine learning framework Ray -- 1.4 Basic use of Ray RLlib

1. What is RLlib

RLlib is an industry-grade reinforcement learning (RL) library built on Ray. RLlib provides a highly scalable and unified API suitable for a variety of industry and research applications.

Next, create the Ray RLlib environment in Anaconda.

conda create -n RayRLlib python=3.7 
conda activate RayRLlib 
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow-probability -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ipykernel -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pyarrow
pip install gputil
pip install "ray[rllib]" -i https://pypi.tuna.tsinghua.edu.cn/simple 

Select the RayRLlib environment created above as the interpreter, then import the gymnasium library and RLlib's PPO configuration. The PPO algorithm will be trained on a customized gym environment.

import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig

2. Define a small gym-style game class

A custom gym environment called SimpleCorridor is defined. In this environment, the agent must learn to move right through the corridor to reach the exit. S represents the starting point, G represents the goal, and the length of the corridor is configurable. The actions the agent can choose from are 0 (left) and 1 (right). The observation is a floating-point number representing the index of the current position. The reward for each step is -0.1, except when the goal position is reached (reward +1.0).

Original English docstring:
    Corridor in which an agent must learn to move right to reach the exit.

    ---------------------
    | S | 1 | 2 | 3 | G |   S=start; G=goal; corridor_length=5
    ---------------------

    Possible actions to chose from are: 0=left; 1=right
    Observations are floats indicating the current field index, e.g. 0.0 for
    starting position, 1.0 for the field next to the starting position, etc..
    Rewards are -0.1 for all steps, except when reaching the goal (+1.0).

The class is defined as follows:

# Define your problem using python and Farama-Foundation's gymnasium API:

class SimpleCorridor(gym.Env):

    def __init__(self, config):
        # Initialize the environment: set the end position, the current position,
        # the action space (two discrete actions: left and right) and the observation space.
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = gym.spaces.Discrete(2)  # left and right
        self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))

    def reset(self, *, seed=None, options=None):
        # Reset the environment: set the current position to 0 and return the initial observation.
        """Resets the episode.

        Returns:
           Initial observation of the new episode and an info dict.
        """
        self.cur_pos = 0
        # Return initial observation.
        return [self.cur_pos], {}

    def step(self, action):
        # Take one step in the environment given the action, updating the agent's
        # position according to the action and the current position.
        # The `terminated` flag is set when the end of the corridor (the goal) is reached.
        # The reward is +1.0 when the goal is reached, otherwise -0.1.
        # Return the new observation, reward, terminated flag, truncated flag and an info dict.
        
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated-flag, truncated-flag, info-dict (empty).
        """
        # Walk left.
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        # Walk right.
        elif action == 1:
            self.cur_pos += 1
        # Set `terminated` flag when end of corridor (goal) reached.
        terminated = self.cur_pos >= self.end_pos
        truncated = False
        # +1.0 when the goal is reached, otherwise -0.1.
        reward = 1.0 if terminated else -0.1
        return [self.cur_pos], reward, terminated, truncated, {}
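
Before training, the environment can be exercised with random actions to confirm that it follows the gymnasium API (reset returns an observation and an info dict, step returns a 5-tuple). This is only a quick sanity check and not part of the original example:

# Quick sanity check of SimpleCorridor with a random policy.
env = SimpleCorridor({"corridor_length": 5})
obs, info = env.reset()
terminated = truncated = False
total_reward = 0.0
while not terminated and not truncated:
    action = env.action_space.sample()  # randomly pick 0 (left) or 1 (right)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
print(f"Random policy total reward: {total_reward}")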

3. PPO-based reinforcement learning training

The following code creates a PPOConfig object through Ray RLlib and uses the SimpleCorridor environment. The environment configuration sets the corridor length to 28, environment rollouts are parallelized across 10 workers via num_rollout_workers, and the actual PPO algorithm object is then built from the configuration.

config = (
    PPOConfig().environment(
        # Env class to use (here: our gym.Env sub-class from above).
        env=SimpleCorridor,
        # Config dict to be passed to our custom env's constructor.
        # Use a corridor with 28 fields (including S and G).
        env_config={"corridor_length": 28},
    )
    # Parallelize environment rollouts.
    .rollouts(num_rollout_workers=10)
)
# Construct the actual (PPO) algorithm object from the config.
algo = config.build()

# Train the PPO algorithm for 20 iterations, printing the mean episode reward of each iteration.
for i in range(20):
    results = algo.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")

Reinforcement learning training is performed through the above code, with 10 parallel rollout workers and 20 training iterations.
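
As a variant (not required for this example), the custom environment can also be registered under a string name with ray.tune.registry.register_env and referenced by that name in the config; a minimal sketch, where the name "corridor-v0" is arbitrary:

from ray.tune.registry import register_env

# Register the custom environment under an (arbitrary) string name.
register_env("corridor-v0", lambda env_config: SimpleCorridor(env_config))

# The PPO configuration can then refer to the environment by that name.
config = (
    PPOConfig()
    .environment(env="corridor-v0", env_config={"corridor_length": 28})
    .rollouts(num_rollout_workers=10)
)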

The average reward output during training is as follows.

(RolloutWorker pid=334231) /home/yaoyao/anaconda3/envs/RayRLlib/lib/python3.7/site-packages/gymnasium/spaces/box.py:227: UserWarning: WARN: Casting input x to numpy array.
...
Iter: 0; avg. reward=-24.700000000000117
Iter: 1; avg. reward=-29.840909090909282
...
Iter: 18; avg. reward=-1.7286713286713296
Iter: 19; avg. reward=-1.7269503546099298

4. Validate the model

The following code executes a complete episode in the corridor environment. Starting from the initial observation, the trained algorithm computes an action, the action is applied to the environment, and the new observation, reward, terminated flag and truncated flag are returned. Rewards are accumulated and the total reward is printed when the episode ends.

# After training, use the trained algorithm for inference in a new corridor environment (length 10).
env = SimpleCorridor({"corridor_length": 10})
# First reset the environment and obtain the initial observation.
obs, info = env.reset()
terminated = truncated = False
total_reward = 0.0
# Play one episode.
while not terminated and not truncated:
    # Compute a single action, given the current observation
    # from the environment.
    action = algo.compute_single_action(obs)
    # Apply the computed action in the environment.
    obs, reward, terminated, truncated, info = env.step(action)
    # Sum up rewards for reporting purposes.
    total_reward += reward
# Print the result.
print(f"Played 1 episode; total-reward={total_reward}")

The trained model is validated in the specified environment and achieves a final reward of +0.1, a significant improvement over the average reward of roughly -24 at the start of training.

Played 1 episode; total-reward=0.10000000000000009

Since the corridor length in this case is 10 and the agent follows the optimal policy (always walking to the right), the maximum obtainable return is $R_{\max} = 9 \times (-0.1) + (+1.0) = +0.1$.

This shows that the agent has learned the optimal policy through the PPO algorithm.
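
This figure can also be reproduced directly by rolling out the always-right policy in the SimpleCorridor environment defined above (a small verification sketch, not part of the original example):

# Verify the optimal return by always walking right in a corridor of length 10.
env = SimpleCorridor({"corridor_length": 10})
obs, info = env.reset()
terminated = truncated = False
total_reward = 0.0
while not terminated and not truncated:
    obs, reward, terminated, truncated, info = env.step(1)  # always move right
    total_reward += reward
print(total_reward)  # approximately +0.1, i.e. 9 * (-0.1) + 1.0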
