Reinforcement Learning - Solutions to the Sparse Reward Problem

1. Exploration and Exploitation - ICM Based on Intrinsic Motivation

Pathak et al. [73] used an inverse dynamics model inside the ICM (Intrinsic Curiosity Module) to obtain a feature representation of the state, removing the parts of the state that are irrelevant to the agent's actions and thereby improving the quality of the intrinsic motivation signal. Specifically, the ICM is built alongside the environment model: its inverse model takes the states $s_t$ and $s_{t+1}$ as input and outputs the action $a_t$. By learning to predict the action, the ICM keeps only the state features that are relevant to the agent's actions, and the forward environment model is then built in this feature space, which filters out environmental noise.
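
The following is a minimal PyTorch sketch of this idea (illustrative only, not Pathak et al.'s exact architecture or hyperparameters): an encoder maps states into a feature space, an inverse model is trained to predict $a_t$ from the features of $s_t$ and $s_{t+1}$ so that the features keep only action-relevant information, and the forward model's prediction error in that feature space serves as the intrinsic reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Minimal intrinsic-curiosity sketch for discrete actions (illustrative)."""
    def __init__(self, obs_dim, n_actions, feat_dim=64):
        super().__init__()
        self.n_actions = n_actions
        # Encoder phi: maps raw states into a learned feature space.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        # Inverse model: predicts a_t from phi(s_t) and phi(s_{t+1});
        # training it forces phi to keep only action-relevant features.
        self.inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                           nn.Linear(128, n_actions))
        # Forward model: predicts phi(s_{t+1}) from phi(s_t) and a_t.
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))

    def forward(self, s_t, s_next, a_t):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = F.one_hot(a_t, num_classes=self.n_actions).float()

        # Inverse loss: classify which action produced the transition.
        logits = self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, a_t)

        # Forward loss in feature space (the encoder is trained via the inverse loss only).
        phi_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        forward_loss = F.mse_loss(phi_pred, phi_next.detach())

        # The feature-space prediction error is the intrinsic (curiosity) reward.
        intrinsic_reward = 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1).detach()
        return intrinsic_reward, inverse_loss, forward_loss
```

In training, the two losses would typically be combined as `(1 - beta) * inverse_loss + beta * forward_loss`, and the agent would be trained on `r_extrinsic + eta * intrinsic_reward`, where `beta` and `eta` are hyperparameters.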

1. Domain knowledge

  • Domain knowledge refers to knowledge of a particular field, profession, or activity, as opposed to general knowledge. In other words, domain knowledge is the knowledge of experts or professionals in a certain field. For example, in software engineering, domain knowledge can refer to specific knowledge of the environment in which a target system operates.
  • Having domain knowledge helps you better understand and communicate an industry's mode of operation and needs, and can also improve the efficiency and quality of your work.
  • Domain knowledge can be improved through research, education, training, networking, and asking questions.

2. Curriculum learning

  • Curriculum learning is a training strategy that imitates the human learning process: the model learns from easy samples first and gradually advances to more complex samples and knowledge.
  • Curriculum learning can improve the performance and convergence rate of models without additional computational cost.
  • The core issues of curriculum learning are how to rank samples or tasks by difficulty, and how to adjust the training schedule according to that difficulty.
  • Curriculum learning can be divided into pre-defined methods and automatic methods. The former require a hand-designed difficulty measurer and training scheduler (a minimal sketch of this follows the list), while the latter design them automatically in a data-driven manner.
  • Curriculum learning has a wide range of applications and research value across many fields and tasks.
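
As a concrete illustration of a pre-defined method, here is a minimal sketch of a "baby steps" scheduler that trains on progressively larger, harder subsets of the data; the `difficulty` measurer and the `train_one_epoch` callback are assumed placeholders, not part of any particular framework.

```python
def difficulty(sample):
    # Placeholder difficulty measurer, e.g. sequence length or the loss of a pretrained model.
    return len(sample["tokens"])

def baby_steps_curriculum(dataset, train_one_epoch, num_stages=5, epochs_per_stage=2):
    """Pre-defined curriculum: train on progressively larger, harder subsets."""
    ordered = sorted(dataset, key=difficulty)            # easy -> hard
    for stage in range(1, num_stages + 1):
        cutoff = int(len(ordered) * stage / num_stages)  # expand the training pool each stage
        subset = ordered[:cutoff]
        for _ in range(epochs_per_stage):
            train_one_epoch(subset)                      # user-supplied training step
```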

2. Multi-goal learning

1. Learning framework

To address reward sparsity, Andrychowicz et al. [80] proposed a multi-goal learning [7] algorithm in which the agent can obtain rewards for locations it has actually reached. During training, the algorithm substitutes virtual goals for the original goal, so the agent receives rewards quickly even in the early stages of training, which greatly accelerates learning.

2. Hindsight Experience Replay (HER)

Replacing the original goal with an already reached goal enables the agent to get the reward faster. The HER algorithm is suitable for off-policy reinforcement learning algorithms.
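
Below is a minimal sketch of hindsight relabeling with the "future" strategy; the transition format and the `compute_reward` function are assumptions for illustration, not any particular library's API.

```python
import numpy as np

def compute_reward(achieved_goal, goal, tol=0.05):
    # Sparse reward assumed here: 0 if the goal is reached, -1 otherwise.
    dist = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal))
    return 0.0 if dist < tol else -1.0

def her_relabel(episode, k=4, rng=np.random):
    """Return original transitions plus hindsight copies with substituted goals.

    `episode` is a list of dicts with keys: obs, action, next_obs,
    achieved_goal, goal.  Relabeled transitions were not generated under the
    current goal, so this only works with off-policy algorithms.
    """
    relabeled = []
    for t, tr in enumerate(episode):
        relabeled.append({**tr, "reward": compute_reward(tr["achieved_goal"], tr["goal"])})
        # "future" strategy: pick k goals that were actually achieved later in the episode.
        future_idx = rng.randint(t, len(episode), size=k)
        for idx in future_idx:
            new_goal = episode[idx]["achieved_goal"]
            relabeled.append({**tr,
                              "goal": new_goal,
                              "reward": compute_reward(tr["achieved_goal"], new_goal)})
    return relabeled
```

The relabeled transitions are simply appended to the replay buffer of an off-policy algorithm such as DQN or DDPG.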

3. Auxiliary tasks

In the case of sparse rewards, when the original task is difficult to complete, learning can often be accelerated by setting up auxiliary tasks. There are two main types of such methods.

  • The first type of approach is "curriculum-style" reinforcement learning. When the original task is hard to complete, rewards are hard to obtain; the agent can instead start by learning simple, related tasks and then tackle tasks of increasing difficulty, gradually working up to the more complex original task.
  • The second type of method adds parallel auxiliary tasks alongside the original task, and the original task and the auxiliary tasks are learned jointly (a minimal sketch of this joint training follows the list). The advantages of such auxiliary tasks are:
    • 1) When the original task's reward is sparse, the agent can still obtain rewards from the auxiliary tasks, alleviating the problems caused by reward sparsity;
    • 2) By training on auxiliary tasks, the agent can master certain skills that help it complete the original task;
    • 3) The auxiliary tasks and the original task share part of the network's representation, so training the auxiliary tasks also improves the representation used by the original task.
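
A minimal sketch of the second type of method, assuming a shared encoder, a main policy head, and one auxiliary head that predicts the next observation; the names and the choice of auxiliary task are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAuxNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Shared representation used by both the main task and the auxiliary task.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # main RL head
        self.aux_head = nn.Linear(hidden, obs_dim)        # auxiliary head: predict next observation

    def forward(self, obs):
        h = self.encoder(obs)
        return self.policy_head(h), self.aux_head(h)

def joint_loss(net, obs, next_obs, main_rl_loss, aux_weight=0.1):
    # main_rl_loss comes from whatever RL algorithm is used (e.g. a policy-gradient loss).
    _, next_obs_pred = net(obs)
    aux_loss = F.mse_loss(next_obs_pred, next_obs)
    # Gradients from the auxiliary loss also update the shared encoder,
    # which is advantage 3) in the list above.
    return main_rl_loss + aux_weight * aux_loss
```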

4. Reward setting

(1) Positive and negative rewards

According to these principles, we can design different reward functions for different tasks. For example:

  • For a maze navigation task, we can give a positive reward for reaching the exit, a negative reward for hitting a wall, and a small negative reward for each step taken, to encourage the agent to find the exit as quickly as possible (a minimal sketch of this scheme follows the list).
  • For a chess-playing task, we can give a positive reward for winning, a negative reward for losing, zero reward for a draw, and a potential-based reward at each move that evaluates how favorable the current board position is.
  • For an autonomous driving task, we can give a positive reward for reaching the destination safely, a negative reward for an accident or traffic violation, and intermediate rewards for every fixed distance driven, evaluated from factors such as vehicle speed, road conditions, and fuel consumption to reflect the current driving efficiency.
  • Positive rewards usually indicate that the agent has taken actions that help achieve the goal, and negative rewards usually indicate actions that hinder it. Depending on the problem and context, rewards can take different forms and magnitudes.
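
A minimal sketch of the maze reward scheme from the first bullet, assuming the environment reports whether the exit was reached or a wall was hit:

```python
def maze_reward(reached_goal, hit_wall):
    """Maze example: +1 at the exit, -1 for hitting a wall,
    and a small per-step penalty to encourage short paths."""
    if reached_goal:
        return 1.0
    if hit_wall:
        return -1.0
    return -0.01   # small negative step cost
```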

There are some methods that can help to design a reasonable reward function, such as:

  • Reward shaping: this is a method of adding an additional reward signal on top of the original reward function to speed up learning or improve the learned behavior. For example, in a maze pathfinding problem, in addition to a positive reward for reaching the exit and a negative reward for each step, you can give a positive reward for moving closer to the exit or a negative reward for moving away from it, guiding the agent to find the correct path faster. However, reward shaping needs to follow certain principles to avoid changing the optimal policy or introducing unwanted side effects (see the potential-based shaping sketch after this list).

  • Hierarchical reinforcement learning: this is a method of decomposing a complex problem into multiple sub-problems and designing a reward function and policy for each sub-problem separately. For example, in a game where an agent must travel from one place to another, the task can be split into several stages, each with its own sub-goal and sub-reward. This reduces the sparse reward problem and also improves learning efficiency and interpretability.

  • If the reward is set too densely and too uniformly positive, the agent may keep collecting those rewards, get stuck in a local behavior, and never actually complete the level.
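
One standard way to add such guiding rewards without changing the optimal policy is potential-based shaping, where the shaped reward is $r + \gamma\Phi(s') - \Phi(s)$ for some potential function $\Phi$. Below is a minimal sketch, assuming states and goals are coordinate vectors and using the negative distance to the goal as the potential.

```python
import numpy as np

GAMMA = 0.99

def potential(state, goal):
    # Potential function: higher (less negative) when closer to the goal.
    return -np.linalg.norm(np.asarray(state) - np.asarray(goal))

def shaped_reward(reward, state, next_state, goal):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).

    A classical result guarantees that shaping of this exact form leaves the
    optimal policy unchanged, which is why it is a safe way to densify rewards.
    """
    return reward + GAMMA * potential(next_state, goal) - potential(state, goal)
```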

Here are the answers from ChatGPT:

Handling constraints is an important consideration when designing reward functions. Constraints can be restrictions on states or actions, forcing the agent to meet certain conditions during task execution.

A common approach is to introduce a penalty function into the reward function to punish the agent for violating the constraints. Specifically, when the agent violates the constraints, a negative penalty term can be added to the reward function, so that the agent is more inclined to obey the constraints during the optimization process.

The following is a basic reward function design approach for handling constraints:

  • Positive rewards: provide positive rewards when the agent reaches a good state or completes the goal of the task. These rewards encourage the agent to follow the constraints while still achieving good task performance.

  • Penalty functions: provide negative rewards when the agent violates the constraints. These penalty terms can be set according to the severity of the constraint and the degree of violation; when a constraint is violated, the penalty value can grow linearly or nonlinearly with the degree of violation.

  • Constraint penalty weighting: the weight of the constraint penalty within the reward function can be adjusted. By tuning this weight, positive rewards and constraint penalties can be balanced, ensuring that the agent takes both the task objective and the constraints into account during optimization (a minimal sketch combining these three elements follows the list).
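
A minimal sketch combining these three elements, where `violation` is an assumed non-negative measure of how badly a constraint is broken:

```python
def constrained_reward(task_reward, violation, penalty_weight=10.0):
    """Combine the task reward with a weighted penalty for constraint violation.

    `violation` is 0 when the constraint is satisfied and grows with the degree
    of violation; `penalty_weight` balances task progress against constraint
    satisfaction.  A nonlinear penalty (e.g. violation ** 2) can be substituted
    if larger violations should be punished disproportionately.
    """
    penalty = penalty_weight * max(0.0, violation)
    return task_reward - penalty
```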

The forms, weights, and coefficients of reward functions and penalty functions need to be designed according to specific constraints and task situations. At the same time, it is necessary to continuously optimize the reward function during the training process so that the agent can learn an appropriate behavior strategy while satisfying the constraints.

In addition, the processing of constraints can also be combined with other techniques, such as projection in optimization algorithms, constraint optimization algorithms, etc., to more directly ensure that the agent satisfies the constraints during execution. These methods can be selected and tuned according to the requirements of specific problems.

(2) Sparse rewards

  • The problem with sparse rewards is that the agent may stumble around for a long time, trying different actions and passing through many different states without receiving any reward along the way, and therefore without learning anything.
  • Sparse rewards can be mitigated by reward shaping, i.e. providing smaller intermediate rewards that guide the agent along the right path.
  • However, reward shaping has its own problems. If we give an optimization algorithm a shortcut, the algorithm will take it. Shortcuts are already hidden within the reward function, and they become more likely to appear once we start shaping rewards. A poorly designed reward function can cause the agent to converge to an undesirable solution, even if that solution yields the most reward for the agent. It may look as if our intermediate rewards will guide the robot toward the 10-meter goal, but the optimal solution may not lie in walking toward that first reward: the robot might instead stumble awkwardly toward it, collect the reward, and thereby reinforce that behavior, and then crawl slowly along the ground to collect the remaining rewards. For the agent this is a perfectly reasonable high-reward solution, but for the designer it is clearly not the intended outcome.

(3) When a negative reward is obtained in reinforcement learning, should the episode be ended?

Whether you should end the episode when a negative reward is received depends on your task goals and the design of the reward function. Generally speaking, there are the following situations:

  1. If the negative reward represents a failure or termination condition of the task, such as the robot falling over or the car hitting a wall, then ending the episode is reasonable, because the agent cannot continue acting from the current state.
  2. If the negative reward represents a penalty or cost within the task, such as the robot expending energy or the car drifting off course, then ending the episode may not be a good idea, since the agent may still be able to recover or improve from the current state.
  3. If the negative reward corresponds to a pseudo-terminal state caused by artificially truncating the episode length, such as the Mountain Car environment ending the episode when the agent fails to reach the hilltop within 200 steps, then ending the episode may introduce unnecessary bias or instability. In this case, the following methods can be used to avoid or mitigate the impact of pseudo-terminal states:
    • Include the current step count or time as part of the state observation, so the agent knows how far it is from termination.
    • Handle the last sample specially, to avoid using incorrect Q-value estimates or target values (see the sketch after this list).
    • Give the pseudo-terminal state a very low reward, so the agent tries to avoid entering it.
    • Treat the final state cyclically, i.e. add the value function of the initial state or of a special state to the target value.
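
A minimal sketch of the "handle the last sample specially" point: when an episode ends only because of a step limit (a pseudo-terminal state), the TD target should still bootstrap from the next state's value, whereas a genuine terminal state should not. Modern Gym-style environments expose this distinction as separate `terminated` and `truncated` flags; the function below assumes such flags are available.

```python
def td_target(reward, next_q_value, terminated, truncated, gamma=0.99):
    """Q-learning style target that distinguishes true termination from truncation.

    - terminated: the environment really ended (e.g. the car reached the hilltop).
    - truncated:  the episode was cut off by a step limit (pseudo-terminal state);
      we must still bootstrap, otherwise the time limit leaks into the value
      estimates and biases learning.
    """
    if terminated and not truncated:
        return reward                      # no future value after a true terminal state
    return reward + gamma * next_q_value   # keep bootstrapping on truncation
```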

(4) Reward setting notes

  • The reward function should be related to the target task. The reward function should encourage the agent to achieve the goal of the task, such as achieving a high score in a game or reaching a target location in robot control.

  • Reward functions should be measurable. The reward function should be able to calculate the corresponding reward value according to the behavior of the agent and the state of the environment. This enables the agent to learn how to maximize the reward.

  • The reward function should be simple. The design of the reward function should be as simple as possible, which can reduce the complexity of the algorithm and reduce possible problems. For example, avoid overly complex calculations or nested conditions.

  • The reward function should be robust. The reward function should be somewhat robust to possible noise or anomalies in the environment to ensure that the training and performance of the algorithm will not be disturbed.

  • The reward function should avoid reward sparsity. The reward function should try to avoid providing rewards only when the task is completed, but should provide appropriate rewards during task execution to encourage the agent to explore and learn.

  • The reward function should avoid side effects. The design of the reward function should avoid unnecessary side effects, such as encouraging the agent to take potentially harmful actions, or discouraging the agent from taking potentially beneficial actions.

To sum up, the design of the reward function should be carried out according to the specific task and environment. It is necessary to consider various possible situations during the design, and ensure that the design of the reward function is reasonable, simple, robust, and avoids sparsity and side effects.

(5) The sign and size of rewards

  1. Positive and negative options
    Rewards can be positive, negative or zero. A positive reward indicates that the agent performed an action that is beneficial to the task, a negative reward indicates that the action is harmful, and a zero reward indicates that the action has no effect. In general, there should be more positive rewards than negative rewards, since the goal of the task is usually to get a good outcome. However, in some cases, negative rewards can be used to punish certain bad behaviors or to encourage agents to avoid certain behaviors.

  2. Size Selection
The size of the reward should be chosen according to the task and the environment. A reward that is too small may not provide enough signal for the agent to learn the task, while a reward that is too large may cause the agent to overfit to it or produce meaningless behavior.

    In general, the size of the reward should be as close as possible to the actual goal of the task. For example, in a task of controlling a robot, if the goal is for the robot to move to a specific location, the reward size can be set to be the negative of the robot's distance from the goal location, so that the agent can learn how to get there as quickly as possible.

    In addition to this, the sparsity and robustness of rewards also need to be considered. Sparse rewards may cause the agent to fail to learn the correct behavior, while robust rewards can help the agent overcome noise and anomalies in the environment, improving the robustness of training and performance.
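
A minimal sketch of the distance-based reward size mentioned above, assuming positions are coordinate arrays; the `scale` factor is an illustrative knob for tuning the reward magnitude.

```python
import numpy as np

def reach_goal_reward(position, goal_position, scale=1.0):
    """Dense reward whose magnitude tracks the actual objective:
    the closer the robot is to the goal, the higher (less negative) the reward."""
    dist = np.linalg.norm(np.asarray(position) - np.asarray(goal_position))
    return -scale * float(dist)
```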

In general, the sign and size of the reward need to be selected according to the specific task and environment, and the sparsity and robustness of the reward need to be considered.

5. Problems

1. Large action space

In reinforcement learning, if the action space is too large, it creates many difficulties for learning, such as high computational complexity, poor generalization ability, and low exploration efficiency. There are several solutions to this problem, such as:

2. Constraint problem

(1)

When using reinforcement learning methods, if the model is subject to constraints such as safety, diversity, or budget, then special algorithms are needed to handle these constraints so that they are not violated. The literature offers several methods for dealing with such constraints, for example:

(2) Details

Reinforcement learning is a method by which an agent learns an optimal policy through interaction with the environment. However, in some practical scenarios, the agent's behavior may need to satisfy certain constraints, such as safety, feasibility, or cost. In that case, the agent's goal is not only to maximize reward but also to avoid violating the constraints.

There are some methods to deal with constrained problems in reinforcement learning, such as:

(3) Dual Gradient

  • In reinforcement learning, the dual gradient method can be used to optimize constrained objective functions, such as maximizing cumulative reward or minimizing trajectory cost. The basic idea of the dual gradient method is to transform the original problem into an unconstrained Lagrangian dual problem, then use gradient ascent to update the Lagrange multipliers while using another optimization method to update the primal variables. The two updates alternate until the solution converges.
  • In reinforcement learning, a common application scenario is the trajectory optimization problem: given an initial state and a goal state, find an optimal trajectory that minimizes the trajectory cost. The trajectory cost can be the trajectory length, energy consumption, collision risk, and so on. The problem can usually be written as
    $$\min_{x_{1:T},\,u_{1:T}} \sum_{t=1}^{T-1} c(x_t, u_t) + c_T(x_T) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t),\; x_1 = x_{init},\; x_T = x_{goal},$$
    where $x_t$ is the state variable, $u_t$ is the control variable, $c(x_t, u_t)$ is the per-step cost function, $c_T(x_T)$ is the terminal cost function, $f(x_t, u_t)$ is the state transition function, $T$ is the trajectory length, and $x_{init}$ and $x_{goal}$ are the initial and goal states.
    To solve this problem with the dual gradient method, we introduce Lagrange multipliers $\lambda_t$ for the equality constraints and define the Lagrangian
    $$\mathcal{L}(x, u, \lambda) = \sum_{t=1}^{T-1} c(x_t, u_t) + c_T(x_T) + \sum_{t=1}^{T-1} \lambda_t^{\top}\big(f(x_t, u_t) - x_{t+1}\big).$$
    The dual function $g(\lambda) = \min_{x,u} \mathcal{L}(x, u, \lambda)$ is then maximized by gradient ascent on $\lambda$, alternating with minimization of the Lagrangian over $x$ and $u$ (a numerical sketch follows this list).
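
A minimal numerical sketch of the alternating updates, written for a generic equality-constrained problem $\min_x c(x)$ s.t. $h(x)=0$ with a single scalar constraint for simplicity (the trajectory problem above is a stacked instance of this); the inner minimizer is plain gradient descent and all step sizes are illustrative.

```python
import numpy as np

def dual_gradient_ascent(constraint, grad_cost, grad_constraint,
                         x0, lam0=0.0, inner_steps=50, outer_steps=100,
                         alpha=0.01, beta=0.1):
    """Minimize c(x) subject to h(x) = 0 via the Lagrangian L = c(x) + lam * h(x).

    Inner loop: gradient descent on x, using dL/dx = grad_c(x) + lam * grad_h(x).
    Outer loop: gradient ascent on the multiplier, lam <- lam + beta * h(x),
    since dL/dlam = h(x).  The two updates alternate until convergence.
    """
    x, lam = np.asarray(x0, dtype=float), float(lam0)
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            x = x - alpha * (grad_cost(x) + lam * grad_constraint(x))
        lam = lam + beta * constraint(x)      # ascend on the dual variable
    return x, lam

# Toy usage: minimize x1^2 + x2^2 subject to x1 + x2 - 1 = 0
# (optimum: x = [0.5, 0.5], lam = -1).
x_opt, lam_opt = dual_gradient_ascent(
    constraint=lambda x: x[0] + x[1] - 1.0,
    grad_cost=lambda x: 2 * x,
    grad_constraint=lambda x: np.array([1.0, 1.0]),
    x0=[0.0, 0.0])
```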


To address these issues, some researchers have proposed the concept of dual strategies, which use two policies simultaneously for learning and exploration: a main policy and an auxiliary policy. The main policy is used to optimize the objective function, while the auxiliary policy provides the exploration signal. Dual strategies can exploit structural information about the action space, such as similarity, hierarchy, or decomposability, to improve learning efficiency and exploration. They can also borrow the idea of dual learning, achieving mutual promotion and mutual constraint by minimizing the distance or divergence between the main policy and the auxiliary policy.
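
One possible reading of "minimizing the distance or divergence between the main strategy and the auxiliary strategy" is a KL regularizer coupling two policy heads. The sketch below only illustrates that coupling under assumed names; it is not a specific published algorithm.

```python
import torch.nn.functional as F

def dual_policy_loss(main_logits, aux_logits, main_rl_loss, kl_weight=0.1):
    """Couple a main policy and an exploratory auxiliary policy with a KL term.

    The main policy is trained on the RL objective; the KL term keeps the two
    policies close, so the auxiliary policy's exploration signal can guide and
    constrain the main policy.
    """
    main_logp = F.log_softmax(main_logits, dim=-1)
    aux_p = F.softmax(aux_logits, dim=-1).detach()          # treat the auxiliary policy as a target here
    kl = F.kl_div(main_logp, aux_p, reduction="batchmean")  # KL(aux || main) in torch's convention
    return main_rl_loss + kl_weight * kl
```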

There are many specific ways to implement the dual strategy, such as:

(4) Learning rate

If the average constraint value is greater than the target constraint, the policy is too random and the learning rate needs to be reduced.


Origin blog.csdn.net/qq_45889056/article/details/130403035