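1. What is RLlib
RLlib is an industry-grade reinforcement learning (RL) library built on top of Ray. It provides high scalability and a unified API for a wide range of industry and research applications.
The commands below create a Ray RLlib environment in Anaconda.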
conda create -n RayRLlib python=3.7
conda activate RayRLlib
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow-probability -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ipykernel -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pyarrow
pip install gputil
pip install "ray[rllib]" -i https://pypi.tuna.tsinghua.edu.cn/simple
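Before moving on, it can help to confirm that the packages installed above are importable in the active environment. The following is a minimal sketch and not part of the original walkthrough: the helper name `installed_version` and the list of import names are assumptions chosen for illustration (gymnasium is included because the walkthrough imports it later), and a package that does not expose `__version__` is reported as "unknown".

```python
import importlib


def installed_version(module_name):
    """Return the module's version string, "unknown", or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # The package is missing or fails to import in this environment.
        return None
    return str(getattr(module, "__version__", "unknown"))


def main():
    # Import names of the packages installed above (hypothetical check list).
    names = ["torch", "tensorflow", "tensorflow_probability", "pyarrow", "ray", "gymnasium"]
    for name in names:
        version = installed_version(name)
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")


if __name__ == "__main__":
    main()
```

Any line reporting NOT INSTALLED points at the corresponding install command to rerun.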
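Select the RayRLlib environment created above as the interpreter, then import gymnasium and the PPO configuration class from RLlib. This walkthrough uses the PPO algorithm with a custom gym environment.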
import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig
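2. Defining a gym-style mini-game class
This section defines a custom gym environment named SimpleCorridor, in which the agent must learn to move right to reach the exit of a corridor. S marks the start, G marks the goal, and the corridor length is configurable. The agent can choose between two actions: 0 (left) and 1 (right). The observation is a float giving the index of the current position. Every step yields a reward of -0.1, except the step that reaches the goal, which yields +1.0.
Original English description: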
Corridor in which an agent must learn to move right to reach the exit.
---------------------
| S | 1 | 2 | 3 | G | S=start; G=goal; corridor_length=5
---------------------
Possible actions to choose from are: 0=left; 1=right
Observations are floats indicating the current field index, e.g. 0.0 for
starting position, 1.0 for the field next to the starting position, etc.
Rewards are -0.1 for all steps, except when reaching the goal (+1.0).
The class is defined as follows:
# Define your problem using python and Farama-Foundation's gymnasium API:
class SimpleCorridor(gym.Env):
    def __init__(self, config):
        # Initialize the environment: end position, current position, action
        # space (two discrete actions: left and right), and observation space.
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = gym.spaces.Discrete(2)  # left and right
        self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))

    def reset(self, *, seed=None, options=None):
        """Resets the episode.

        Returns:
            Initial observation of the new episode and an info dict.
        """
        # Reset the current position to 0.
        self.cur_pos = 0
        # Return initial observation.
        return [self.cur_pos], {}

    def step(self, action):
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated-flag, truncated-flag, info-dict (empty).
        """
        # Walk left.
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        # Walk right.
        elif action == 1:
            self.cur_pos += 1
        # Set `terminated` flag when end of corridor (goal) reached.
        terminated = self.cur_pos >= self.end_pos
        truncated = False
        # +1.0 when goal reached, otherwise -0.1.
        reward = 1.0 if terminated else -0.1
        return [self.cur_pos], reward, terminated, truncated, {}
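To make the dynamics above concrete, the sketch below replays the same transition and reward rules in plain Python, without gymnasium, so it can be run anywhere. The function `simulate_corridor` is a hypothetical helper written for this illustration and is not part of the original example. It mirrors the `step` logic: a left move is blocked at position 0, every non-goal step costs 0.1, and reaching `corridor_length` ends the episode.

```python
def simulate_corridor(corridor_length, actions):
    """Replay a fixed action sequence using the SimpleCorridor rules.

    Returns the visited positions, the summed reward, and whether the goal was reached.
    """
    cur_pos = 0
    positions = [cur_pos]
    total_reward = 0.0
    terminated = False
    for action in actions:
        if action == 0 and cur_pos > 0:
            cur_pos -= 1  # walk left, unless already at the start
        elif action == 1:
            cur_pos += 1  # walk right
        terminated = cur_pos >= corridor_length
        total_reward += 1.0 if terminated else -0.1
        positions.append(cur_pos)
        if terminated:
            break  # the episode ends at the goal
    return positions, total_reward, terminated


# Two wasted "left" moves at the start, then straight to the goal.
positions, total_reward, terminated = simulate_corridor(3, [0, 0, 1, 1, 1])
print(positions, round(total_reward, 2), terminated)
# prints: [0, 0, 0, 1, 2, 3] 0.6 True
```

The two left moves at position 0 leave the position unchanged but still cost 0.1 each, which is why wandering lowers the return.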
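3. Training with PPO
The code below uses Ray RLlib to create a PPOConfig object that uses the SimpleCorridor environment. The environment config sets the corridor length to 28, and num_rollout_workers=10 parallelizes environment rollouts across 10 workers. The PPO algorithm object is then built from this config.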
config = (
    PPOConfig().environment(
        # Env class to use (here: our gym.Env sub-class from above).
        env=SimpleCorridor,
        # Config dict to be passed to our custom env's constructor.
        # Use a corridor with corridor_length=28.
        env_config={"corridor_length": 28},
    )
    # Parallelize environment rollouts.
    .rollouts(num_rollout_workers=10)
)
# Construct the actual (PPO) algorithm object from the config.
algo = config.build()
# Train PPO for 20 iterations and print the mean episode reward after each one.
for i in range(20):
    results = algo.train()
    print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")
The code above runs reinforcement learning training with 10 parallel rollout workers for 20 iterations.
The average reward printed during training is shown below.
(RolloutWorker pid=334231) /home/yaoyao/anaconda3/envs/RayRLlib/lib/python3.7/site-packages/gymnasium/spaces/box.py:227: UserWarning: WARN: Casting input x to numpy array.
...
Iter: 0; avg. reward=-24.700000000000117
Iter: 1; avg. reward=-29.840909090909282
...
Iter: 18; avg. reward=-1.7286713286713296
Iter: 19; avg. reward=-1.7269503546099298
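To judge how close these numbers are to the best achievable return, the return of the always-right policy can be computed in closed form: it takes `corridor_length` steps, of which all but the last cost 0.1 and the last yields +1.0. The sketch below encodes that arithmetic. The helper name `optimal_return` is hypothetical, and the comparison uses the iteration-19 value copied from the log above.

```python
def optimal_return(corridor_length, step_penalty=-0.1, goal_reward=1.0):
    """Return of the always-right policy in SimpleCorridor."""
    # corridor_length - 1 penalized steps, then one step that reaches the goal.
    return (corridor_length - 1) * step_penalty + goal_reward


best = optimal_return(28)
logged = -1.7269503546099298  # avg. reward at Iter 19 in the log above
print(f"optimal={best:.4f}; logged={logged:.4f}; gap={best - logged:.4f}")
# prints: optimal=-1.7000; logged=-1.7270; gap=0.0270
```

A gap of about 0.03 is consistent with a near-optimal policy that still takes an occasional exploratory detour during training.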
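4. Validating the model
Run one complete episode in the corridor environment. Starting from the initial observation, use the algorithm to compute an action, apply it to the environment, and receive the new observation, reward, terminated flag, and truncated flag. Rewards are accumulated and the total is printed when the loop ends.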
# After training, run inference with the trained algorithm in a new corridor environment (length 10).
env = SimpleCorridor({"corridor_length": 10})
# First reset the environment to get the initial observation.
obs, info = env.reset()
terminated = truncated = False
total_reward = 0.0
# Play one episode.
while not terminated and not truncated:
    # Compute a single action, given the current observation
    # from the environment.
    action = algo.compute_single_action(obs)
    # Apply the computed action in the environment.
    obs, reward, terminated, truncated, info = env.step(action)
    # Sum up rewards for reporting purposes.
    total_reward += reward
# Print the result.
print(f"Played 1 episode; total-reward={total_reward}")
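Validated in the specified environment, the trained model obtains a final reward of +0.1, a clear improvement over the average reward of about -24.7 at the start of training (note that training used a corridor of length 28).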
Played 1 episode; total-reward=0.10000000000000009
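In this case the corridor length is 10, so an agent following the optimal policy (always walking right) reaches the goal in 10 steps, and the maximum reward it can obtain is 9 × (-0.1) + 1.0 = +0.1.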
This indicates that the agent has learned the optimal policy through the PPO algorithm.