开悟 optimization guide for the intermediate track


Table of contents

Understanding the problem

Reference scheme (by module)


Understanding the problem

The model controls a single hero in a mirrored 1v1 battle

The Actor cluster resource is a 64-core CPU

Problem characteristics: a single fair-confrontation scenario (a mirror match with the same hero), low sample production per unit time, and relatively abundant cumulative training resources.

Difficulties of the intermediate track:

  1. Guaranteeing RL convergence under low-concurrency, long-duration training: with limited resources and a long training period, how to maintain sample diversity so that the training effect is comparable to that of high-concurrency training over a shorter period.

  2. Customized optimization for the competition hero: neither the environment nor the baseline has been optimized for the specific hero. Participating teams can study the hero's design and gameplay mechanics, and improve related designs such as features, rules, action space, and rewards, so as to better represent the environment state, incorporate known priors, reduce the difficulty of policy learning, and raise the upper limit of a single hero's ability.

  3. In a single fixed fair-confrontation scenario, more details need attention and refinement in order to obtain a more complete strategy.

In addition, participating teams can:

  1. Eliminate various bugs (such as unexpected movement or positioning behavior)

  2. Implement multiple, diverse strategy styles that support online adjustment or scheduling, and further optimize against the opponent's behavior, including modeling and predicting it, so as to strike preemptively or evade in time to resolve a crisis.

  3. Refine the reward scheme for different phases within a game (such as dynamic weights), and the reward weights at different stages of curriculum learning (such as process-oriented vs. result-oriented)

Reference scheme (by module)


The network uses a (CNN+MLP)-LSTM Encoder-Decoder structure with a Multi-Head Value: grouping reward sub-items into separate value heads improves value estimation and reduces variance, and global perfect information is used to assist value estimation during training.
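
A minimal sketch of such a structure, assuming hypothetical layer sizes, input shapes, and number of value heads (none of these are the baseline's actual configuration):

```python
import torch
import torch.nn as nn

class MultiHeadValueNet(nn.Module):
    """(CNN+MLP) encoder -> LSTM -> policy head + multi-head value (illustrative only)."""

    def __init__(self, img_channels=6, vec_dim=128, hidden=256,
                 num_actions=79, num_value_heads=5):
        super().__init__()
        # CNN branch for spatial (image-like) features
        self.cnn = nn.Sequential(
            nn.Conv2d(img_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP branch for vector (unit/state) features
        self.mlp = nn.Sequential(nn.Linear(vec_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(32 + hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)
        # One value head per reward group; heads are combined into a total value
        self.value_heads = nn.Linear(hidden, num_value_heads)

    def forward(self, img, vec, hidden_state=None):
        # img: [B, T, C, H, W], vec: [B, T, vec_dim]
        B, T = img.shape[:2]
        img_feat = self.cnn(img.flatten(0, 1)).view(B, T, -1)
        vec_feat = self.mlp(vec)
        out, hidden_state = self.lstm(torch.cat([img_feat, vec_feat], dim=-1), hidden_state)
        logits = self.policy_head(out)
        values = self.value_heads(out)    # per-group value estimates
        total_value = values.sum(dim=-1)  # combined value used for advantage estimation
        return logits, values, total_value, hidden_state
```

Estimating each reward group with its own head reduces the variance of the combined value estimate, and perfect (global) information can be fed to the value branch only, since the value network is not needed at inference time.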


Custom Features:

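
As a purely illustrative sketch (the feature names and normalization constants below are assumptions, not the environment's actual observation layout), custom features are typically hand-crafted quantities that are normalized and concatenated onto the model's vector input:

```python
import numpy as np

# Hypothetical custom features appended to the vector observation.
# All field names and normalization constants are placeholders.
def build_custom_features(frame):
    hp_ratio   = frame["self_hp"] / max(frame["self_max_hp"], 1)
    enemy_hp   = frame["enemy_hp"] / max(frame["enemy_max_hp"], 1)
    hp_diff    = hp_ratio - enemy_hp                        # who is winning the trade
    dist_enemy = min(frame["dist_to_enemy"] / 10000.0, 1.0)
    skill_cd   = [min(cd / 10.0, 1.0) for cd in frame["skill_cooldowns"]]
    return np.asarray([hp_ratio, enemy_hp, hp_diff, dist_enemy] + skill_cd,
                      dtype=np.float32)

# vec_obs = np.concatenate([base_vec_obs, build_custom_features(frame)])
```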

Post-rules (processing model output)

For details, see "Development Guide"

By developing post-rules, a secondary mapping from model output to action can be supported. Under certain implementations, from the perspective of the RL agent this is equivalent to a change in the environment.
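
As a hedged sketch (the function name, action encoding, and threshold are hypothetical; the real interface is described in the "Development Guide"), a post-rule is simply a deterministic remapping applied to the model's output before it is sent to the game:

```python
# Hypothetical post-rule: remap the model's chosen action before execution.
# The action ids and observation fields are illustrative placeholders.
def apply_post_rules(action, obs):
    LOW_HP_THRESHOLD = 0.15   # assumed normalized HP feature
    RETREAT_ACTION = 0        # placeholder action id
    # Example rule: if HP is critically low, override the action with a retreat command.
    if obs.get("self_hp_ratio", 1.0) < LOW_HP_THRESHOLD:
        return RETREAT_ACTION
    return action
```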

Reward system

For specific operations, refer to the "Development Guide" and "Environment Introduction"; the reward designs of previously promoted teams are also recommended references

Possible optimization directions include:

  1. Adjust reward sub-item weights (see the sketch after this list)

    1. Static weights

    2. In-game dynamic weights (e.g., reward decay within a game)

  2. New reward sub-items
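
As a minimal sketch (the sub-item names, base weights, and decay schedule are assumptions, not the baseline's actual reward configuration), in-game dynamic weighting can be as simple as scaling a dense sub-item by the elapsed game time:

```python
# Hypothetical reward weighting: static base weights plus an in-game decay
# applied to a dense shaping term. Sub-item names and constants are placeholders.
BASE_WEIGHTS = {"hp_delta": 1.0, "gold": 0.5, "tower_damage": 2.0, "win_loss": 10.0}

def weighted_reward(sub_rewards, game_time_s, decay_half_life_s=300.0):
    total = 0.0
    for name, value in sub_rewards.items():
        w = BASE_WEIGHTS.get(name, 0.0)
        if name == "hp_delta":
            # The dense shaping term decays as the game progresses,
            # shifting emphasis toward outcome-related rewards.
            w *= 0.5 ** (game_time_s / decay_half_life_s)
        total += w * value
    return total
```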

Opponent model

For specific operations, refer to the "Development Guide"; the designs of previously promoted teams are also recommended references

Corresponding location: actor/actor.py

Possible optimization directions include:

  1. Ratio of new to old models when sampling opponents (see the sketch after this list)

  2. Opponent Model Pool/League

  3. Custom Evaluation Matches
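
A minimal sketch of opponent sampling, assuming a hypothetical model-pool interface (the paths, the 0.8 ratio, and the recency weighting are placeholders): with probability `latest_ratio` the Actor fights the newest model, otherwise a historical snapshot is drawn from the pool.

```python
import random

# Hypothetical opponent sampling for self-play: mix the latest model with
# historical snapshots from a model pool.
def sample_opponent(latest_model_path, history_paths, latest_ratio=0.8):
    if not history_paths or random.random() < latest_ratio:
        return latest_model_path
    # Slightly prefer more recent snapshots by weighting by their index.
    weights = [i + 1 for i in range(len(history_paths))]
    return random.choices(history_paths, weights=weights, k=1)[0]
```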

Learning strategy

Curriculum learning: decompose complex and comprehensive learning objectives into multi-stage, smaller-scale learning tasks.

Through pre-trained models plus multi-stage training, the overall learning goal is achieved step by step.

Example: realize the transition from dense rewards to sparse rewards through reward decay across training stages (a sketch follows this example)

In the early stage of training, focus on dense rewards related to specific behaviors, and guide the agent to learn basic operations

In the middle of training, enhance the dense rewards that are strongly related to the outcome of the game, and guide the agent to establish an advantage in a single game

In the later stage of training, increase the sparse reward weight to guide the agent to directly focus on the final outcome
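
As an illustrative sketch only (the stage boundaries and weights are assumptions), such a schedule can be keyed on the global training step:

```python
# Hypothetical curriculum schedule: reward-group weights driven by training progress.
# Stage boundaries (in training steps) and weights are placeholders.
def reward_weights(train_step):
    if train_step < 1_000_000:        # early: dense, behavior-level shaping dominates
        return {"dense_behavior": 1.0, "dense_outcome": 0.3, "sparse_win": 1.0}
    elif train_step < 3_000_000:      # middle: emphasize outcome-related dense rewards
        return {"dense_behavior": 0.5, "dense_outcome": 1.0, "sparse_win": 3.0}
    else:                             # late: mostly the sparse win/loss signal
        return {"dense_behavior": 0.1, "dense_outcome": 0.5, "sparse_win": 10.0}
```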

Knowledge distillation: a joint reinforcement learning + supervised learning optimization scheme

Example: one of the internal solutions to the multi-hero capability-improvement problem

Single-hero teacher training: reinforcement learning with single-hero mirror self-play

Distillation (single-hero teacher -> multi-hero student): store the teacher's logits in the Actor samples

The Learner adds a supervised distillation loss

Multi-hero student training: reinforcement learning with multi-hero mixed battles

Note: this can be used as a reference for the underlying idea; limited by the scale of available resources, the cost-benefit ratio of actually replicating this practice in the competition may not be outstanding.
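
For illustration only, the distillation term is typically a KL divergence between the stored teacher logits and the student's policy, added to the RL loss with a tunable coefficient (all names and the coefficient below are assumptions):

```python
import torch.nn.functional as F

# Hypothetical distillation loss: KL(teacher || student) over action logits that
# were stored in the Actor samples, added to the usual RL loss.
def distillation_loss(student_logits, teacher_logits, coef=0.5):
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return coef * kl

# total_loss = rl_loss + distillation_loss(student_logits, teacher_logits)
```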

System optimization perspective

Learner efficiency

  1. effective throughput

    1. Given the compute ratio of this competition scenario, the Actor sample throughput per unit time is far below what the Learner can consume, so it is generally unnecessary to further increase the Learner's sample throughput

      Batch computations in the graph should be merged as much as possible to increase computational density, which corresponds to moving the underlying computation from GEMV (memory-bound) to GEMM (compute-bound); a brief sketch appears at the end of this subsection

      Use operator implementations that perform better on the GCU (this can be determined together with the Tunnel-Development Guide and actual profiling results)

    2. The MemPool (sample pool) implementation in the current framework version is not optimal, and a certain amount of samples may be wasted. Improving the MemPool data structure and its random read/write scheme can reduce the number of samples that are overwritten before being consumed.

  2. Sample efficiency and convergence speed

    Reinforcement learning algorithm optimization: with other conditions unchanged, the more sample-efficient the RL algorithm, the shorter the time needed to converge to the same level of ability

    1. Dual-Clip PPO / Value Clip: apply a second clip to the policy loss to prevent advantage outliers from destabilizing convergence; similarly, the value loss can be clipped so that overly large single-step updates do not harm value-network convergence (a sketch appears at the end of this subsection)

    2. Value Normalization


      For policy-gradient methods, the learning target of the value network may change drastically over the course of RL training, which hurts value estimation and therefore the overall convergence and stability. Experiments show that introducing value normalization may improve the sample efficiency of PPO-style algorithms.

    3. PPG algorithm


    4. SAC algorithm


      PPO carries an on-policy assumption, and in distributed reinforcement learning the degree of on-policyness is compromised in order to achieve more efficient parallel computation. SAC is an off-policy RL algorithm with a stochastic policy: it reuses samples in a way similar to DDPG, but differs in producing a stochastic policy, and it shows better sample efficiency than DDPG on some benchmarks.

      Considering that replacing the PPO algorithm involves a large amount of development and debugging work, it is recommended that most teams prioritize optimizations built on PPO; teams with sufficient spare capacity can try switching algorithms directly.
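
As a hedged sketch of the clipping ideas in item 1 above (tensor names and coefficients are placeholders, not the baseline's actual loss implementation):

```python
import torch

# Hypothetical Dual-Clip PPO policy loss plus a clipped value loss.
# eps / dual_clip / value_clip are placeholder hyperparameters.
def dual_clip_ppo_losses(log_prob, old_log_prob, advantage,
                         value, old_value, value_target,
                         eps=0.2, dual_clip=3.0, value_clip=0.2):
    ratio = torch.exp(log_prob - old_log_prob)
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    clipped = torch.min(surr1, surr2)
    # Dual clip: for negative advantages, bound how large the (negative) objective can get.
    policy_obj = torch.where(advantage < 0,
                             torch.max(clipped, dual_clip * advantage),
                             clipped)
    policy_loss = -policy_obj.mean()

    # Value clip: limit the per-step change of the value prediction.
    value_clipped = old_value + torch.clamp(value - old_value, -value_clip, value_clip)
    value_loss = torch.max((value - value_target) ** 2,
                           (value_clipped - value_target) ** 2).mean()
    return policy_loss, value_loss
```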

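
For the GEMV-to-GEMM point under effective throughput above, a minimal PyTorch illustration (shapes are arbitrary): merging many matrix-vector products into a single matrix-matrix product lets the underlying kernel run in a compute-dense fashion.

```python
import torch

W = torch.randn(512, 256)                    # shared weight matrix
xs = [torch.randn(256) for _ in range(64)]   # 64 independent input vectors

# GEMV style: one matrix-vector product per input (memory-bound)
ys_gemv = [W @ x for x in xs]

# GEMM style: stack the inputs and issue a single matrix-matrix product (compute-bound)
X = torch.stack(xs, dim=1)                   # [256, 64]
Y = W @ X                                    # [512, 64], one dense kernel call

assert torch.allclose(torch.stack(ys_gemv, dim=1), Y, atol=1e-5)
```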

Actor efficiency

Throughput optimization

  1. CPU inference optimization: Actor processes run in a multi-machine CPU environment, so model inference performance can be optimized for the CPU platform (see the sketch after this list)

    1. PyTorch JIT optimization

    2. Integrate third-party inference frameworks (e.g., onnxruntime)
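
A minimal sketch of the PyTorch JIT route (the tiny model and input shape are placeholders; exporting to ONNX and running with onnxruntime follows a similar pattern):

```python
import torch
import torch.nn as nn

# Hypothetical CPU inference optimization for the Actor: trace the policy network once
# and reuse the compiled module for every inference call.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 79)).eval()
example_input = torch.randn(1, 128)

with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
    traced = torch.jit.freeze(traced)        # fold constants, strip training-only code
    torch.jit.save(traced, "actor_model_traced.pt")

# In the Actor process:
compiled = torch.jit.load("actor_model_traced.pt")
with torch.no_grad():
    logits = compiled(example_input)
```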

Improve sample distribution

  1. You can try to reduce the generation of invalid or homogeneous samples and improve sample diversity through intrinsic rewards and similar methods (see the sketch after this list)

  2. You can try to steer sample generation toward certain conditions, for example biasing the sample distribution toward a specific opponent model
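
As a rough sketch of one intrinsic-reward approach (a simple count-based novelty bonus over coarsely discretized states; the chosen features, discretization, and coefficient are all assumptions):

```python
from collections import defaultdict

# Hypothetical count-based intrinsic reward: rarely visited (discretized) states earn
# a small bonus, encouraging more diverse samples. The state key is a placeholder.
state_counts = defaultdict(int)

def intrinsic_reward(obs, coef=0.01):
    key = (round(obs["self_x"], -1),             # coarse position bucket
           round(obs["self_y"], -1),
           int(obs["self_hp_ratio"] * 10))       # coarse HP bucket
    state_counts[key] += 1
    return coef / (state_counts[key] ** 0.5)

# total_reward = env_reward + intrinsic_reward(obs)
```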

ABS tool usage


Source: blog.csdn.net/Williamtym/article/details/132477420