REINFORCE algorithm principle and TensorFlow code implementation

       Both Q-learning and DQN are value-based reinforcement learning methods: they select actions by first computing Q values. The other major category in reinforcement learning is policy gradient methods. Policy gradient is a family of methods that optimize the policy directly by performing gradient ascent on the expected return. This type of method avoids some of the difficulties faced by traditional value-based methods, such as the lack of an accurate value function, or intractability caused by continuous state and action spaces and uncertainty in the state information. The best-known methods in this family are the policy gradient algorithms, which can be divided into two categories according to how they perform updates:

Monte Carlo update: the REINFORCE algorithm (updates once per episode)

Temporal-difference update: the Actor-Critic algorithm (updates once per step)

Review of the Monte Carlo method and the temporal-difference method

       The Monte Carlo method can be understood as follows: after the algorithm completes an episode, it uses the data of that episode to learn and performs one update. Because we have the data of the whole episode, including the reward of every step, we can easily compute the total future reward of each step, $G_t$. $G_t$ is the total future reward, i.e. the sum of the rewards we can obtain from this step onward; $G_1$ is the total reward obtainable from the first step, and $G_2$ the total reward obtainable from the second step.
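Written out (with discount factor $\gamma$, and following the reward indexing used in the gradient formulas below), the return from step $t$ is the discounted sum of the subsequent rewards:

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=t}^{\infty} \gamma^{\,k-t} R_k$$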

       Compared with the Monte Carlo method, which updates once per episode, the temporal-difference method updates once per step, so its update frequency is higher. The temporal-difference method uses the Q function to approximate the total future reward $G_t$.

Principle of the REINFORCE Algorithm

       REINFORCE uses the Monte Carlo method to estimate the expected return of taking actions in each state, and then uses these estimates to compute policy gradients and update the policy parameters. Because REINFORCE is a model-free algorithm, it does not need to build a model of the environment, nor does it require intermediate steps such as estimating a value function, which makes it simpler and more direct than many other reinforcement learning algorithms.

       Intuitively, the REINFORCE algorithm gradually improves the performance of the policy through gradient ascent in the policy's parameter space:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t'=0}^{\infty} \nabla_\theta \log \pi_\theta(A_{t'} \mid S_{t'})\, \gamma^{t'} \sum_{t=t'}^{\infty} \gamma^{t-t'} R_t\right]$$

       Using a discount factor also helps reduce the large variance of the gradient estimates, since it gives lower weight to rewards that lie further in the future. In practice, the $\gamma^{t'}$ factor is often dropped to avoid over-emphasizing the early states of a trajectory.
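Dropping that factor gives the form of the gradient estimate that is typically implemented:

$$\nabla J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t'=0}^{\infty} \nabla_\theta \log \pi_\theta(A_{t'} \mid S_{t'}) \sum_{t=t'}^{\infty} \gamma^{t-t'} R_t\right]$$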

       Although REINFORCE is simple and intuitive, one of its disadvantages is that its gradient estimates have large variance. For a trajectory of length $L$, the randomness of the rewards $R_t$ may grow exponentially with $L$. To alleviate the excessive variance of the estimates, a common method is to introduce a baseline function $b(S_{t'})$. The only requirement on $b(S_{t'})$ is that it be a function of the state $S_{t'}$ alone (more precisely, it must not depend on the action $A_{t'}$). With the baseline function $b(S_{t'})$, the gradient of the reinforcement learning objective $\nabla J(\theta)$ can be expressed as:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t'=0}^{\infty} \nabla_\theta \log \pi_\theta(A_{t'} \mid S_{t'}) \left(\sum_{t=t'}^{\infty} \gamma^{t-t'} R_t - b(S_{t'})\right)\right]$$
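A common choice for the baseline is the state-value function, $b(S_{t'}) = V^{\pi}(S_{t'})$; learning such a baseline alongside the policy leads to the Actor-Critic methods mentioned above. The code below does not learn a baseline; instead it normalizes the discounted returns by their mean and standard deviation, which serves a similar variance-reduction purpose.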

Code Implementation of the REINFORCE Algorithm

Algorithm pseudocode:
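In outline, REINFORCE (Monte Carlo policy gradient) proceeds roughly as follows:

1. Initialize the policy parameters $\theta$.
2. Run the current policy $\pi_\theta$ for one episode and record the states, actions and rewards.
3. For every step $t$ of the episode, compute the discounted return $G_t$.
4. For each step, update the parameters with $\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$ (the $\gamma^{t}$ weighting is usually dropped, as noted above).
5. Repeat from step 2 until the policy converges.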

Detailed code:

Consider putting the entire algorithm into a class and writing each part of the code into a corresponding function. This makes the code more concise and readable. The structure of the PolicyGradient class is as follows:

 
 

```python
class PolicyGradient:
    def __init__(self, state_dim, action_num, learning_rate=0.02, gamma=0.99):
        ...

    def get_action(self, s, greedy=False):
        # choose an action based on the action distribution
        ...

    def store_transition(self, s, a, r):
        # store interaction data sampled from the environment
        ...

    def learn(self):
        # learn and update the policy using the stored data
        ...

    def _discount_and_norm_rewards(self):
        # compute discounted returns and normalize them
        ...

    def save(self):
        # save the model
        ...

    def load(self):
        # load the model
        ...
```

The initialization function creates the episode buffers and the policy network, and selects Adam as the optimizer. As the code shows, the policy network here has only one hidden layer.

 
 

```python
def __init__(self, state_dim, action_num, learning_rate=0.02, gamma=0.99):
    self.gamma = gamma
    # episode buffers for states, actions and rewards
    self.ep_obs, self.ep_as, self.ep_rs = [], [], []

    input_layer = tl.layers.Input([None, state_dim], tf.float32)
    layer = tl.layers.Dense(
        n_units=30, act=tf.nn.tanh,
        W_init=tf.random_normal_initializer(mean=0, stddev=0.3),
        b_init=tf.constant_initializer(0.1))(input_layer)
    all_act = tl.layers.Dense(
        n_units=action_num, act=None,
        W_init=tf.random_normal_initializer(mean=0, stddev=0.3),
        b_init=tf.constant_initializer(0.1))(layer)

    self.model = tl.models.Model(inputs=input_layer, outputs=all_act)
    self.model.train()
    self.optimizer = tf.optimizers.Adam(learning_rate)
```

After initializing the policy network, we can compute the probability of each action in a given state through the get_action() function. By setting greedy=True, the action with the highest probability is returned directly.

 
 

```python
def get_action(self, s, greedy=False):
    _logits = self.model(np.array([s], np.float32))
    _probs = tf.nn.softmax(_logits).numpy()
    if greedy:
        return np.argmax(_probs.ravel())
    return tl.rein.choice_action_by_probs(_probs.ravel())
```
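The sampling step relies on TensorLayer's tl.rein.choice_action_by_probs helper. If you want to avoid that dependency, an equivalent sampler can be written with NumPy alone; below is a minimal sketch (the name sample_action is just an illustrative stand-in):

```python
import numpy as np

def sample_action(probs):
    # draw an action index according to a 1-D probability vector,
    # equivalent in effect to tl.rein.choice_action_by_probs(probs)
    return np.random.choice(len(probs), p=probs)
```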

At this point, the actions we choose may not yet be good; only through continuous learning can the network make better and better decisions. Each learning step is performed by the learn() function. We update the model using the normalized discounted returns and a cross-entropy loss. After each update, the stored transition data is discarded.

 
 

```python
def learn(self):
    discounted_ep_rs_norm = self._discount_and_norm_rewards()

    with tf.GradientTape() as tape:
        _logits = self.model(np.vstack(self.ep_obs))
        neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=_logits, labels=np.array(self.ep_as))
        loss = tf.reduce_mean(neg_log_prob * discounted_ep_rs_norm)

    grad = tape.gradient(loss, self.model.trainable_weights)
    self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights))

    self.ep_obs, self.ep_as, self.ep_rs = [], [], []  # clear the episode data
    return discounted_ep_rs_norm
```
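To see why this implements the policy gradient: tf.nn.sparse_softmax_cross_entropy_with_logits returns $-\log \pi_\theta(A_t \mid S_t)$ for each stored step, so the quantity being minimized is

$$L(\theta) = -\frac{1}{T}\sum_{t=0}^{T-1} \log \pi_\theta(A_t \mid S_t)\,\hat{G}_t,$$

where $\hat{G}_t$ is the normalized discounted return and $T$ is the episode length. Minimizing $L(\theta)$ by gradient descent is therefore the same as ascending the policy gradient estimate given earlier.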

The learn() function needs the data sampled while the agent interacts with the environment, so we use store_transition() to store every state, action and reward during the interaction.

 
 

```python
def store_transition(self, s, a, r):
    self.ep_obs.append(np.array([s], np.float32))
    self.ep_as.append(a)
    self.ep_rs.append(r)
```

The policy gradient algorithm uses the Monte Carlo method, so we need to compute the discounted return of each step; normalizing the returns also helps learning.

 
 

```python
def _discount_and_norm_rewards(self):
    # compute the discounted return of each step, working backwards
    discounted_ep_rs = np.zeros_like(self.ep_rs)
    running_add = 0
    for t in reversed(range(0, len(self.ep_rs))):
        running_add = running_add * self.gamma + self.ep_rs[t]
        discounted_ep_rs[t] = running_add

    # normalize the episode rewards
    discounted_ep_rs -= np.mean(discounted_ep_rs)
    discounted_ep_rs /= np.std(discounted_ep_rs)
    return discounted_ep_rs
```
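As a quick sanity check of the discounting loop, take the default $\gamma = 0.99$ and a hypothetical three-step episode with rewards $[1, 1, 1]$: the loop yields $1.0$ at $t = 2$, $1 + 0.99 \times 1.0 = 1.99$ at $t = 1$, and $1 + 0.99 \times 1.99 = 2.9701$ at $t = 0$, so the discounted returns are $[2.9701, 1.99, 1.0]$ before normalization.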

Prepare the environment and algorithm first. After creating the environment, we create an instance of the PolicyGradient class named agent.

 
 

```python
env = gym.make(ENV_ID).unwrapped
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
env.seed(RANDOM_SEED)

agent = PolicyGradient(
    action_num=env.action_space.n,
    state_dim=env.observation_space.shape[0],
)

t0 = time.time()
```
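The script-level snippets assume the usual imports and a few constants defined near the top of the file. A minimal sketch is given below; the environment name, seed, flags and episode/step counts are placeholder values, not the author's settings:

```python
import argparse
import os
import time

import gym
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorlayer as tl

# assumed command-line switches that populate args.train / args.test
parser = argparse.ArgumentParser()
parser.add_argument('--train', action='store_true', default=False)
parser.add_argument('--test', action='store_true', default=False)
args = parser.parse_args()

ENV_ID = 'CartPole-v0'   # assumed environment with a discrete action space
RANDOM_SEED = 1          # seed for reproducibility
RENDER = False           # whether to render during training
TRAIN_EPISODES = 200     # number of training episodes (placeholder)
TEST_EPISODES = 10       # number of test episodes (placeholder)
MAX_STEPS = 500          # maximum steps per episode (placeholder)
```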

In training mode, the actions output by the model are used to interact with the environment, after which the transition data is stored and the policy is updated once per episode. To keep the code simple, the agent is updated directly at the end of each episode.

 
 

```python
if args.train:
    all_episode_reward = []
    for episode in range(TRAIN_EPISODES):
        state = env.reset()
        episode_reward = 0
        for step in range(MAX_STEPS):
            if RENDER:
                env.render()
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)
            agent.store_transition(state, action, reward)
            state = next_state
            episode_reward += reward
            if done:
                break
        agent.learn()
        print('Training | Episode: {}/{} | Episode Reward: {:.0f} | Running Time: {:.4f}'.format(
            episode + 1, TRAIN_EPISODES, episode_reward, time.time() - t0))
        # running reward: sliding average of episode rewards, kept for plotting
        if episode == 0:
            all_episode_reward.append(episode_reward)
        else:
            all_episode_reward.append(all_episode_reward[-1] * 0.9 + episode_reward * 0.1)
```

Add some code at the end of each episode to better display the training process: we print the total reward of each episode and record a running reward computed as a sliding average. The running rewards can then be plotted to observe the training trend more clearly. Finally, store the trained model.

 
 

```python
    # still inside the `if args.train:` branch, after the training loop
    agent.save()

    plt.plot(all_episode_reward)
    if not os.path.exists('image'):
        os.makedirs('image')
    plt.savefig(os.path.join('image', 'pg.png'))
```

If we use test mode, the process is simpler: just load the pre-trained model and use it to interact with the environment.

 
 

```python
if args.test:
    agent.load()
    for episode in range(TEST_EPISODES):
        state = env.reset()
        episode_reward = 0
        for step in range(MAX_STEPS):
            env.render()
            state, reward, done, info = env.step(agent.get_action(state, True))
            episode_reward += reward
            if done:
                break
        print('Testing | Episode: {}/{} | Episode Reward: {:.0f} | Running Time: {:.4f}'.format(
            episode + 1, TEST_EPISODES, episode_reward, time.time() - t0))
```


Origin blog.csdn.net/bruce__ray/article/details/131144423