[Reinforcement Learning] "Easy RL" - Q-learning - CliffWalking (cliff walking) code interpretation

0. Preface

The code in this post comes from the CliffWalking hands-on section of the Q-learning chapter of the "mushroom book", Easy RL.
Easy-RL github: https://github.com/datawhalechina/easy-rl
Note: this is the v.1.0.3 branch.
This part of the code has two core files:

  • qlearning.py
  • task0.py

Let's start with task0.py.

1. Hyperparameters

A machine learning model generally involves two kinds of parameters. The first kind is learned and estimated from data; these are the model parameters (Parameter), i.e. the parameters of the model itself. The second kind has to be set manually to tune the learning algorithm; these are the hyperparameters (Hyperparameter).

class Config:
    """Hyperparameters
    """

    def __init__(self):
        ################################ environment hyperparameters ################################
        self.algo_name = 'Q-learning'  # name of the algorithm: we use Q-learning
        self.env_name = 'CliffWalking-v0'  # name of the environment: cliff walking
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu")  # detect a GPU; fall back to the CPU if CUDA is not installed
        self.seed = 10  # random seed; 0 means no seed is set. Fixing the seed makes the random values during learning reproducible
        self.train_eps = 400  # number of training episodes
        self.test_eps = 30  # number of test episodes
        ################################################################################

        ################################ algorithm hyperparameters ################################
        self.gamma = 0.90  # discount factor in reinforcement learning
        self.epsilon_start = 0.95  # initial epsilon of the ε-greedy policy; lowering it reduces random exploration at the start of learning
        self.epsilon_end = 0.01  # final epsilon of the ε-greedy policy; the smaller it is, the more greedy the final behaviour
        self.epsilon_decay = 300  # decay constant of epsilon in the ε-greedy policy; the larger it is, the more slowly epsilon decays
        self.lr = 0.1  # learning rate
        ################################################################################

        ################################ parameters for saving results ################################
        self.result_path = curr_path + "/outputs/" + self.env_name + \
                           '/' + curr_time + '/results/'  # path for saving results
        self.model_path = curr_path + "/outputs/" + self.env_name + \
                          '/' + curr_time + '/models/'  # path for saving models
        self.save_fig = True  # whether to save figures (note the attribute is named save_fig here)
        ################################################################################
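Note that result_path and model_path refer to curr_path and curr_time, which are defined outside the class. A minimal sketch of how they are typically set near the top of task0.py (assumed here, since the excerpt does not show them):

import os
import datetime

curr_path = os.path.dirname(os.path.abspath(__file__))         # directory containing the current script
curr_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")  # timestamp used to name the output folder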

2. Training

def train(cfg, env, agent):
    print('Start training!')
    print(f'Environment: {cfg.env_name}, algorithm: {cfg.algo_name}, device: {cfg.device}')
    rewards = []  # record the reward of every episode, used to track and analyse how the reward changes
    ma_rewards = []  # the raw rewards may oscillate, so we also keep a moving average that shows the trend

    # episode loop
    for i_ep in range(cfg.train_eps):
        ep_reward = 0  # reward accumulated in the current episode
        state = env.reset()  # reset the environment and start a new episode

        # walk through the current episode until the terminal state is reached
        while True:
            action = agent.choose_action(state)  # choose an action according to the algorithm
            next_state, reward, done, _ = env.step(action)  # interact with the environment once
            agent.update(state, action, reward, next_state, done)  # Q-learning update
            state = next_state  # move to the next state
            ep_reward += reward
            if done:
                break
        rewards.append(ep_reward)
        if ma_rewards:
            ma_rewards.append(ma_rewards[-1] * 0.9 + ep_reward * 0.1)
        else:
            ma_rewards.append(ep_reward)
        print("Episode: {}/{}, reward: {:.1f}".format(i_ep + 1, cfg.train_eps, ep_reward))
    print('Training finished!')
    return rewards, ma_rewards

2.1 Initialize the environment and agent

def env_agent_config(cfg, seed=1):
    """Create the environment and the agent
    Args:
        cfg ([type]): [description]
        seed (int, optional): random seed. Defaults to 1.
    Returns:
        env [type]: the environment
        agent : the agent
    """
    env = gym.make(cfg.env_name)
    env = CliffWalkingWapper(env)  # wrap the environment with a custom wrapper
    env.seed(seed)  # set the random seed; each seed corresponds to one fixed random outcome, used only for exact reproducibility and can usually be removed
    n_states = env.observation_space.n  # size of the state space, i.e. 48 states
    n_actions = env.action_space.n  # size of the action space, i.e. 4 actions
    agent = QLearning(n_states, n_actions, cfg)  # configure the agent
    return env, agent
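For reference, a minimal sketch of how task0.py might glue these pieces together (the exact main block in the repository may differ in detail):

cfg = Config()                                     # hyperparameters
env, agent = env_agent_config(cfg, seed=cfg.seed)  # environment and agent
rewards, ma_rewards = train(cfg, env, agent)       # training loop shown above
make_dir(cfg.result_path, cfg.model_path)          # create the output folders
agent.save(path=cfg.model_path)                    # persist the learned Q-table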

2.2 The agent selects an action

action = agent.choose_action(state)
The choose_action method called above is implemented as follows:

    def choose_action(self, state):
        self.sample_count += 1
        self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
                       math.exp(-1. * self.sample_count / self.epsilon_decay)  # epsilon decays over time; here an exponential decay is used
        # ε-greedy policy
        if np.random.uniform(0, 1) > self.epsilon:
            action = np.argmax(self.Q_table[str(state)])  # choose the action with the largest Q(s,a)
        else:
            action = np.random.choice(self.n_actions)  # choose a random action
        return action

The epsilon used by the ε-greedy policy decays according to:

epsilon = epsilon_end + (epsilon_start − epsilon_end) · exp(−sample_count / epsilon_decay)

As learning proceeds, epsilon decays exponentially from epsilon_start toward epsilon_end. When the uniformly drawn random number is greater than epsilon, i.e. with probability 1 − epsilon, the agent exploits by choosing the action with the largest Q(s,a); otherwise it explores with a random action.
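To make the decay concrete, here is a small standalone snippet (using the hyperparameter values from Config above) that evaluates the same formula at a few sample counts:

import math

epsilon_start, epsilon_end, epsilon_decay = 0.95, 0.01, 300
for sample_count in (1, 100, 300, 1000, 3000):
    epsilon = epsilon_end + (epsilon_start - epsilon_end) * \
              math.exp(-1. * sample_count / epsilon_decay)
    print(f"sample_count = {sample_count:4d}, epsilon = {epsilon:.3f}")
# epsilon falls from about 0.95 toward 0.01 as more actions are sampled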
Now let's try printing the Q-values of the current state with print(self.Q_table[str(state)]).
An example output is: [ -7.45800334 -78.37958986 -7.46127197 -7.48193639]
The four values in the array are the estimated values of taking each of the four actions in that state.

2.3 The environment receives the action and returns the next state and reward

After the action is selected, we use this action to interact with the environment once:

next_state, reward, done, _ = env.step(action)

Given an action, we can get the next state and reward from the map.

  • For example, if the action UP=0 is executed at the starting grid 36, the next state is 24 and the reward is −1;
  • We also need to handle the boundaries of the map: for example, executing the action LEFT=3 at the starting point leaves the next state at 36, and the reward is −1;
  • If the action RIGHT=1 is executed, the agent steps into the cliff: the next state is reset to 36 and the reward is −100.

The concrete transition logic can be inspected in gym's source, e.g. C:\Python310\Lib\site-packages\gym\envs\toy_text\cliffwalking.py.
The done flag indicates whether the goal has been reached, i.e. whether the episode has finished.
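To verify these transitions by hand, you can step through the raw environment yourself (a quick sketch; it assumes the same older gym API used in this post, where env.step returns a 4-tuple):

import gym

env = gym.make('CliffWalking-v0')
state = env.reset()                         # starting grid, state 36
next_state, reward, done, _ = env.step(0)   # action UP
print(next_state, reward, done)             # expected: 24 -1 False
env.close()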

2.4 Agent performs policy update (learning)

Now that we have the current state, chosen action, reward, and next state, we can update the Q-table inside the agent using the Q-learning algorithm:

agent.update(state, action, reward, next_state, done)  # Q学习算法更新

The method is implemented as follows:

    def update(self, state, action, reward, next_state, done):
        Q_predict = self.Q_table[str(state)][action]  # current estimate Q(s,a)
        if done:  # terminal-state check
            Q_target = reward  # in the terminal state there is no next action, so the target is just the reward
        else:
            Q_target = reward + self.gamma * np.max(self.Q_table[str(next_state)])
        self.Q_table[str(state)][action] += self.lr * (Q_target - Q_predict)

The code implements the incremental Q-learning update rule from the book:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ]

where α is the learning rate lr and γ is the discount factor gamma. In this way the value of the action taken in the current state is updated; this is the policy update.
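For context, the Q-table is a dictionary keyed by the string form of the state, with one value per action, which is why the code indexes it with str(state). A sketch of the agent's constructor along these lines (the version in qlearning.py may differ slightly):

from collections import defaultdict
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, cfg):
        self.n_actions = n_actions
        self.lr = cfg.lr                      # learning rate alpha
        self.gamma = cfg.gamma                # discount factor
        self.epsilon_start = cfg.epsilon_start
        self.epsilon_end = cfg.epsilon_end
        self.epsilon_decay = cfg.epsilon_decay
        self.sample_count = 0                 # number of actions sampled so far
        # every unseen state maps to a zero vector with one entry per action
        self.Q_table = defaultdict(lambda: np.zeros(n_actions))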

3. Result processing

The above completes one episode of learning. After each episode we record the episode reward for later visualization:

        rewards.append(ep_reward)
        if ma_rewards:
            ma_rewards.append(ma_rewards[-1] * 0.9 + ep_reward * 0.1)
        else:
            ma_rewards.append(ep_reward)

Since the raw episode rewards may oscillate, we use a moving average to show the trend: each new entry mixes the previous average with the newest episode reward and is appended to the list.
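In other words, ma_rewards is an exponential moving average with weight 0.9 on the history and 0.1 on the newest episode. A tiny standalone illustration with made-up reward values:

rewards = [-100, -60, -40, -30, -25]  # hypothetical episode rewards
ma_rewards = []
for r in rewards:
    if ma_rewards:
        ma_rewards.append(ma_rewards[-1] * 0.9 + r * 0.1)
    else:
        ma_rewards.append(r)
print([round(m, 2) for m in ma_rewards])  # [-100, -96.0, -90.4, -84.36, -78.42]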

3.1 Model saving

After all rounds are executed, save the trained model:

make_dir(cfg.result_path, cfg.model_path)  # create the folders for saving results and models
agent.save(path=cfg.model_path)  # save the model
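make_dir is a small helper from the repository's common utilities; a minimal sketch of what it does (assumed here, since the excerpt does not show it):

from pathlib import Path

def make_dir(*paths):
    """Create each given directory (and its parents) if it does not exist yet."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)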

The implementation of save:

    def save(self, path):
        import dill
        torch.save(
            obj=self.Q_table,
            f=path + "Qlearning_model.pkl",
            pickle_module=dill
        )
        print("Model saved successfully!")

The dill module: https://pypi.org/project/dill/
dill extends Python's pickle module for serializing and de-serializing Python objects to the majority of the built-in Python types. Serialization is the process of converting an object to a byte stream, and the inverse is converting a byte stream back to a Python object hierarchy.
dill provides the user the same interface as the pickle module, and also includes some additional features. In addition to pickling Python objects, dill provides the ability to save the state of an interpreter session in a single command. Hence, it would be feasible to save an interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the 'saved' state of the original interpreter session.

We save the trained model, i.e. the Q-table, as a .pkl file (a convenient format for persisting intermediate Python objects such as strings, lists and dictionaries), and use dill as the pickling module.

torch.save() saves a serialized object to disk. The function uses Python's pickle utility for serialization; models, tensors and dictionaries can all be saved with it.

3.2 Model loading

    def load(self, path):
        import dill
        self.Q_table = torch.load(f=path + 'Qlearning_model.pkl', pickle_module=dill)
        print("Model loaded successfully!")

Mirroring model saving, torch.load() reads the saved file and restores the trained Q-table.

3.3 Model testing

Model testing works almost the same as training; the only difference is that the Q-table is no longer updated, i.e. the following line is removed:

agent.update(state, action, reward, next_state, done)  # Q学习算法更新
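For completeness, here is a minimal test loop along those lines, to be run after agent.load(path=cfg.model_path) (a sketch mirroring train() above; the repository's test() may differ in details such as logging):

def test(cfg, env, agent):
    print('Start testing!')
    rewards = []
    for i_ep in range(cfg.test_eps):
        ep_reward = 0
        state = env.reset()
        while True:
            action = agent.choose_action(state)             # act with the learned Q-table
            next_state, reward, done, _ = env.step(action)  # no agent.update() here
            state = next_state
            ep_reward += reward
            if done:
                break
        rewards.append(ep_reward)
        print("Episode: {}/{}, reward: {:.1f}".format(i_ep + 1, cfg.test_eps, ep_reward))
    print('Testing finished!')
    return rewards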

Origin blog.csdn.net/qq_43557907/article/details/126196776