Machine Learning Framework Ray -- 1.4 Basic Usage of Ray RLlib

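1. What is RLlib

RLlib is an industry-grade reinforcement learning (RL) library built on top of Ray. It offers high scalability and a unified API for a wide range of industry and research applications.

First, create an environment for Ray RLlib in Anaconda.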

conda create -n RayRLlib python=3.7 
conda activate RayRLlib 
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow-probability -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ipykernel -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pyarrow
pip install gputil
pip install "ray[rllib]" -i https://pypi.tuna.tsinghua.edu.cn/simple
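
Before moving on, it can help to confirm that the packages are visible from the activated environment. The snippet below is a minimal sketch, not part of the original setup: the helper name find_missing and the REQUIRED list are assumptions for illustration, and the list uses import names, which can differ from the install names (pytorch is imported as torch, gputil as GPUtil).

```python
import importlib.util

# Top-level import names for the packages installed above (assumed list;
# adjust it to match what you actually installed).
REQUIRED = [
    "torch",
    "tensorflow",
    "tensorflow_probability",
    "pyarrow",
    "GPUtil",
    "ray",
    "gymnasium",
]


def find_missing(names):
    """Return the names that cannot be located in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    print("missing packages:", missing if missing else "none")
```

The check only locates the packages without importing them, so it runs quickly and does not initialize any GPU libraries.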

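Select the RayRLlib environment created above as the interpreter, then import gymnasium and the PPO configuration class from Ray RLlib. This example uses the PPO algorithm with a custom gym environment.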

import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig

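2. Defining a gym-style mini-game class

We define a custom gym environment named SimpleCorridor, in which the agent must learn to move right to reach the exit of a corridor. S marks the start, G marks the goal, and the corridor length is configurable. The agent can choose between two actions: 0 (left) and 1 (right). The observation is a float giving the index of the current position. Every step yields a reward of -0.1, except the step that reaches the goal, which yields +1.0.

Original English description: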
Corridor in which an agent must learn to move right to reach the exit.

---------------------
| S | 1 | 2 | 3 | G | S=start; G=goal; corridor_length=5
---------------------

Possible actions to choose from are: 0=left; 1=right
Observations are floats indicating the current field index, e.g. 0.0 for
starting position, 1.0 for the field next to the starting position, etc.
Rewards are -0.1 for all steps, except when reaching the goal (+1.0).

The class is defined as follows:

# Define your problem using python and Farama-Foundation's gymnasium API:

class SimpleCorridor(gym.Env):

    def __init__(self, config):
        # Initialize the environment: end position, current position, action space
        # (two discrete actions: left and right), and observation space.
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = gym.spaces.Discrete(2)  # left and right
        self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))

    def reset(self, *, seed=None, options=None):
        # Reset the environment: set the current position to 0 and return the initial observation.
        """Resets the episode.

        Returns:
           Initial observation of the new episode and an info dict.
        """
        self.cur_pos = 0
        # Return initial observation.
        return [self.cur_pos], {}

    def step(self, action):
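        # Execute one step in the environment: update the agent's position from the action and the current position.
        # Set the terminated flag when the end of the corridor (the goal) is reached.
        # The reward is +1.0 when the goal is reached, otherwise -0.1.
        # Return the new observation, reward, terminated flag, truncated flag, and info dict.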
        
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated-flag, truncated-flag, info-dict (empty).
        """
        # Walk left.
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        # Walk right.
        elif action == 1:
            self.cur_pos += 1
        # Set `terminated` flag when end of corridor (goal) reached.
        terminated = self.cur_pos >= self.end_pos
        truncated = False
        # +1.0 when goal reached, otherwise -0.1.
        reward = 1.0 if terminated else -0.1
        return [self.cur_pos], reward, terminated, truncated, {}
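
To sanity-check the reward rules without installing anything, the same dynamics can be replayed in plain Python. This is a minimal sketch that mirrors the step logic above; the function name replay and the two constants are introduced here for illustration and are not part of RLlib or gymnasium.

```python
STEP_REWARD = -0.1  # reward for every step that does not reach the goal
GOAL_REWARD = 1.0   # reward for the step that reaches the goal


def replay(actions, corridor_length):
    """Replay a list of actions (0=left, 1=right).

    Returns (total_reward, reached_goal), using the same rules as
    SimpleCorridor.step: moving left at position 0 is a no-op.
    """
    pos = 0
    total = 0.0
    for action in actions:
        if action == 0 and pos > 0:
            pos -= 1
        elif action == 1:
            pos += 1
        if pos >= corridor_length:
            return total + GOAL_REWARD, True
        total += STEP_REWARD
    return total, False


total, done = replay([1, 1, 1, 1, 1], corridor_length=5)
print(round(total, 2), done)  # 0.6 True
```

With corridor_length=5 the goal is reached on the fifth move to the right, so the episode collects four step penalties and one goal reward. Any wasted step, such as moving left at the start, lowers the total by 0.1.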

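3. Reinforcement learning training with PPO

The following code creates a PPOConfig object with Ray RLlib and uses the SimpleCorridor environment. The environment config sets the corridor length to 28, and num_rollout_workers=10 parallelizes environment rollouts across 10 workers. The PPO algorithm object is then built from the config.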

config = (
    PPOConfig().environment(
        # Env class to use (here: our gym.Env sub-class from above).
        env=SimpleCorridor,
        # Config dict to be passed to our custom env's constructor.
        # Use corridor with 28 fields (including S and G).
        env_config={"corridor_length": 28},
    )
    # Parallelize environment rollouts.
    .rollouts(num_rollout_workers=10)
)
# Construct the actual (PPO) algorithm object from the config.
algo = config.build()

# Train PPO for 20 iterations and print the mean episode reward after each one.
for i in range(20):
    results = algo.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")

The code above runs the training with 10 parallel rollout workers for 20 training iterations.

The average reward printed during training is shown below.

(RolloutWorker pid=334231) /home/yaoyao/anaconda3/envs/RayRLlib/lib/python3.7/site-packages/gymnasium/spaces/box.py:227: UserWarning: WARN: Casting input x to numpy array.
...
Iter: 0; avg. reward=-24.700000000000117
Iter: 1; avg. reward=-29.840909090909282
...
Iter: 18; avg. reward=-1.7286713286713296
Iter: 19; avg. reward=-1.7269503546099298
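
The log flattens out near -1.73 in the last iterations, which suggests that a fixed count of 20 iterations could be replaced by a stopping rule based on the mean reward. The following is a minimal sketch under that assumption: train_until is a hypothetical helper, not an RLlib function, and the stand-in reward values are made up so that the example runs without Ray.

```python
def train_until(train_step, target, max_iters):
    """Call train_step() until its mean reward reaches target or max_iters is hit.

    train_step is any zero-argument callable that returns a dict containing
    the key 'episode_reward_mean', as algo.train() does in the loop above.
    Returns the list of mean rewards seen, one per iteration.
    """
    history = []
    for i in range(max_iters):
        mean_reward = train_step()["episode_reward_mean"]
        history.append(mean_reward)
        print(f"Iter: {i}; avg. reward={mean_reward}")
        if mean_reward >= target:
            break
    return history


# Stand-in for algo.train (illustrative values only, not real training output).
fake_rewards = iter([-24.0, -10.0, -3.0, -1.75, -1.72])
history = train_until(
    lambda: {"episode_reward_mean": next(fake_rewards)},
    target=-1.8,
    max_iters=20,
)
```

With the real algorithm object, the call would presumably be train_until(algo.train, target=-1.8, max_iters=20). The target of -1.8 is a judgment call based on where the log above levels off, and the result key is the one used in this tutorial's Ray version.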

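4. Validating the model

Run one complete episode in the corridor environment. Starting from the initial observation, use the algorithm to compute an action, apply it to the environment, and receive the new observation, reward, terminated flag, and truncated flag. Rewards are accumulated and the total is printed when the loop ends.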

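# After training, run inference with the trained algorithm in a new corridor environment (length 10).
env = SimpleCorridor({"corridor_length": 10})
# Reset the environment to obtain the initial observation.
obs, info = env.reset()
terminated = truncated = False
total_reward = 0.0
# Play one episode.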
while not terminated and not truncated:
    # Compute a single action, given the current observation
    # from the environment.
    action = algo.compute_single_action(obs)
    # Apply the computed action in the environment.
    obs, reward, terminated, truncated, info = env.step(action)
    # Sum up rewards for reporting purposes.
    total_reward += reward
# Print the result.
print(f"Played 1 episode; total-reward={total_reward}")

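Validated in the specified environment, the trained model obtains a total reward of +0.1, a clear improvement over the average of roughly -24.7 in the first training iteration (which was measured on the longer 28-field training corridor).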

Played 1 episode; total-reward=0.10000000000000009

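In this case the corridor length is 10, so an agent following the optimal policy (always moving right) obtains the maximum reward $R_{\max} = 9 \times (-0.1) + (+1.0) = +0.1$.

This matches the total reward obtained above, which indicates that the agent has learned the optimal policy for this environment with PPO.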
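
The same reasoning extends to any corridor length. As a short worked derivation, let $n$ denote corridor_length (the symbol $n$ is introduced here, assuming $n \ge 1$). An always-right policy needs $n$ steps, of which the first $n-1$ each cost $0.1$ and the last earns $+1.0$:

$$
R_{\max}(n) = (n-1)\times(-0.1) + 1.0 = 1.1 - 0.1\,n
$$

$$
R_{\max}(10) = 1.1 - 1.0 = +0.1, \qquad R_{\max}(28) = 1.1 - 2.8 = -1.7
$$

The value for $n=28$ is close to the average reward of about -1.73 reported in the final training iterations, so the training run also ended near the best achievable return on the training corridor.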