ml-agents project practice (1)

This article first appeared in: Walker AI

Reinforcement learning is a class of problems in machine learning and artificial intelligence that studies how to reach a specific goal through a sequence of decisions. It is a family of algorithms that let a computer start out knowing nothing and, through continuous trial and error, gradually discover the rules and learn how to reach the goal; that whole process is reinforcement learning. The figure below gives a more intuitive picture.

[Figure: the agent–environment interaction loop]

The Agent is our algorithm, and it plays the role of the player in the game. Following some policy, the agent outputs an action that acts on the environment; the environment then returns its state after the action, namely the observation (Observation) and the reward value (Reward) in the figure. After returning the reward to the agent, the environment updates its own state, and the agent receives a new Observation.
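To make the loop concrete, here is a minimal C# sketch of the cycle described above. The IEnvironment and IAgent interfaces are illustrative assumptions only; ml-agents implements this loop for you.

// A minimal sketch of the observe-act-reward loop (hypothetical interfaces, not ml-agents API).
public interface IEnvironment
{
    float[] Reset();
    (float[] observation, float reward, bool done) Step(int action);
}

public interface IAgent
{
    int Act(float[] observation);
    void Learn(float[] observation, int action, float reward, float[] nextObservation, bool done);
}

public static class RlLoop
{
    public static void Run(IAgent agent, IEnvironment env, int episodes)
    {
        for (int e = 0; e < episodes; e++)
        {
            var obs = env.Reset();
            var done = false;
            while (!done)
            {
                var action = agent.Act(obs);                        // the policy outputs an action
                var (nextObs, reward, isDone) = env.Step(action);   // the environment returns Observation and Reward
                agent.Learn(obs, action, reward, nextObs, isDone);  // the agent updates from this feedback
                obs = nextObs;
                done = isDone;
            }
        }
    }
}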

1. ml-agents

1.1 Introduction

Unity games today are often large-scale, the engine is mature, and training environments are easy to build. Because Unity is cross-platform, a model trained under Windows or Linux can later be exported, for example to WebGL, and published on the web. ml-agents is an open-source Unity plug-in that lets developers train inside the Unity environment without writing any Python-side code and without needing to understand algorithms such as PPO and SAC; as long as the parameters are configured, developers can easily train their own models with reinforcement learning.

If you are interested in the algorithms themselves, follow the linked references to learn more about PPO and SAC.


1.2 Installing Anaconda, TensorFlow, and TensorBoard

The ml-agents setup introduced in this article communicates with TensorFlow through Python. During training, Observation, Action, Reward, Done and other information is collected on the Unity side of ml-agents and passed to TensorFlow for training, and the model's decisions are then passed back to Unity. Therefore, before installing ml-agents, you need to install TensorFlow by following the link below.

TensorBoard makes it easy to visualize the data and to analyze whether the model meets expectations.

Click for installation details

1.3 ml-agents installation steps

(1) Go to GitHub and download ml-agents (this example uses the release 6 version).


[Figure: ml-agents release downloads on GitHub]

(2) Unzip the package, copy com.unity.ml-agents and com.unity.ml-agents.extensions into Unity's Packages directory (create the directory if it does not exist), and add the two packages to manifest.json.
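As a rough sketch, the dependencies section of Packages/manifest.json might end up looking like the following. The file: paths and the exact entries are illustrative assumptions and depend on where you placed the folders; packages dropped directly into Packages/ are usually also detected as embedded packages.

{
  "dependencies": {
    "com.unity.ml-agents": "file:com.unity.ml-agents",
    "com.unity.ml-agents.extensions": "file:com.unity.ml-agents.extensions"
  }
}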

[Figure: the Packages directory containing com.unity.ml-agents and com.unity.ml-agents.extensions]

(3) After the packages have been imported into the project, create a new script and add the following references to verify that the installation succeeded:

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Policies;

public class MyAgent : Agent
{
}

2. ml-agents training example

2.1 Overview and project

[Figure: agent–environment interaction in a Markov decision process]

The environment is usually described as a Markov process. The agent follows some policy to generate an Action, interacts with the Environment, and receives a Reward; it then adjusts and optimizes its current policy according to that reward.

The project in this example follows standard match-3 rules: lining up three blocks of the same color scores. To keep the environment design simple, this example removes the extra rewards for matching four in a row and for chain combos.

[Figure: the match-3 game board]

The example project can be downloaded from the linked page.

For the Unity project export/build step, please refer to the official documentation.

The following sections share the project practice from four perspectives: interface extraction, algorithm selection, environment design, and parameter tuning.

2.2 Extracting the AI interface from the game framework

Extract from the game the interfaces needed for the project's Observation and Action: one to read the current game state and one to execute game actions.

static List<ML_Unit> states = new List<ML_Unit>();

public class ML_Unit
{
    public int color = (int)CodeColor.ColorType.MaxNum;
    public int widthIndex = -1;
    public int heightIndex = -1;
}
// Get the information of every block on the current board: x position (width index), y position (height index), and color (the coordinate origin is at the top left)
public static List<ML_Unit> GetStates()
{
    states.Clear();
    var xx = GameMgr.Instance.GetGameStates();
    for(int i = 0; i < num_widthMax;i++)
    {
        for(int j = 0; j < num_heightMax; j++)
        {
            ML_Unit tempUnit = new ML_Unit();
            try
            {
                tempUnit.color = (int)xx[i, j].getColorComponent.getColor;
            }
            catch
            {
                Debug.LogError($"GetStates i:{i} j:{j}");
            }
            tempUnit.widthIndex = xx[i, j].X;
            tempUnit.heightIndex = xx[i, j].Y;
            states.Add(tempUnit);
        }
    }
    return states;
}

public enum MoveDir
{
    up,
    right,
    down,
    left,
}

public static bool CheckMoveValid(int widthIndex, int heigtIndex, int dir)
{
    var valid = true;
    if (widthIndex == 0 && dir == (int)MoveDir.left)
    {
        valid = false;
    }
    if (widthIndex == num_widthMax - 1 && dir == (int)MoveDir.right)
    {
        valid = false;
    }

    if (heigtIndex == 0 && dir == (int)MoveDir.up)
    {
        valid = false;
    }

    if (heigtIndex == num_heightMax - 1 && dir == (int)MoveDir.down)
    {
        valid = false;
    }
    return valid;
}

// Interface for executing an action: given the block position and move direction, call the game logic to move the block. widthIndex 0-13, heigtIndex 0-6, dir 0-3 (0 up, 1 right, 2 down, 3 left)
public static void SetAction(int widthIndex,int heigtIndex,int dir,bool immediately)
{
    if (CheckMoveValid(widthIndex, heigtIndex, dir))
    {
        GameMgr.Instance.ExcuteAction(widthIndex, heigtIndex, dir, immediately);
    }
}

2.3 Game AI algorithm selection

Stepping into the first real topic of a reinforcement learning project: faced with many algorithms, choosing a suitable one gets twice the result with half the effort. If you are not familiar with the characteristics of the various algorithms, you can simply use the PPO and SAC implementations that ship with ml-agents.

The author first used the PPO algorithm in this example and tried many adjustments, but on average it still took about 9 moves to make one correct match, which is a rather poor result.

Later we analyzed the game environment more carefully. Because this is a match-3 game, the board is completely different every time, the result of one step has little influence on the next, and the dependence on the Markov chain is weak. Since PPO is an on-policy, policy-based algorithm, each policy update is very cautious, which makes the result hard to converge (I tried XX steps and it still did not converge).

DQN, by contrast, is an off-policy, value-based algorithm: it can collect a large amount of environment data to build up its Q-value table and gradually find the action with the maximum value for each state.

Put simply, PPO is online learning: run a few hundred steps, go back and learn what was done right and wrong in those steps, update, run another few hundred steps, and so on. Not only is the learning efficiency low, it is also hard to find the global optimum.

DQN is offline learning: it can run hundreds of millions of steps, then go back and learn from everything it has visited, which makes it much easier to find the global optimum.
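The contrast can be sketched in a few lines of C#. This is only an illustration of the off-policy idea (an experience replay buffer like the one DQN uses), not code from ml-agents or from this project:

using System.Collections.Generic;

// Off-policy DQN keeps transitions in a replay buffer and keeps re-sampling them;
// on-policy PPO trains only on the freshly collected rollout and then discards it.
public struct Transition
{
    public float[] State;
    public int Action;
    public float Reward;
    public float[] NextState;
    public bool Done;
}

public class ReplayBuffer
{
    private readonly List<Transition> buffer = new List<Transition>();
    private readonly int capacity;
    private readonly System.Random rng = new System.Random();

    public ReplayBuffer(int capacity) { this.capacity = capacity; }

    public void Add(Transition t)
    {
        if (buffer.Count >= capacity) buffer.RemoveAt(0); // drop the oldest transition when full
        buffer.Add(t);
    }

    // Sample a random minibatch of old experience to learn from.
    public List<Transition> Sample(int batchSize)
    {
        var batch = new List<Transition>(batchSize);
        for (int i = 0; i < batchSize; i++)
            batch.Add(buffer[rng.Next(buffer.Count)]);
        return batch;
    }
}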

(This example still uses PPO for the demonstration. A follow-up article will cover using an external algorithm with ml-agents, training with DQN through the external tool stable_baselines3.)

2.4 Designing the game AI environment

Once the algorithm framework is determined, how we design the Observation, Action and Reward becomes the decisive factor in the training results. In this game the environment has two main variables: the position of each block and its color.

--Observation:

As the picture above shows, the board in this example is 14 blocks long, 7 blocks high, and uses 6 colors.

ml-agents uses swish as its activation function, so it can accept moderately sized floating-point inputs (-10f to 10f). However, to give the agent a cleaner environment and get better training results, we still need to encode the environment.

In this example the author uses one-hot encoding for the environment, with the coordinate origin in the upper-left corner. The cyan block in the upper-left corner is then encoded as length [0,0,0,0,0,0,0,0,0,0,0,0,0,1], height [0,0,0,0,0,0,1], and, with colors following a fixed enumeration (yellow, green, purple, pink, blue, red), color [0,0,0,0,1,0].

The observation therefore contains (14 + 7 + 6) × 14 × 7 = 2646 values.

Code example:

public class MyAgent : Agent
{
    static List<ML_Unit> states = new List<ML_Unit>();
    public class ML_Unit
    {
        public int color = (int)CodeColor.ColorType.MaxNum;
        public int widthIndex = -1;
        public int heightIndex = -1;
    }

    public static List<ML_Unit> GetStates()
    {
        states.Clear();
        var xx = GameMgr.Instance.GetGameStates();
        for(int i = 0; i < num_widthMax;i++)
        {
            for(int j = 0; j < num_heightMax; j++)
            {
                ML_Unit tempUnit = new ML_Unit();
                try
                {
                    tempUnit.color = (int)xx[i, j].getColorComponent.getColor;
                }
                catch
                {
                    Debug.LogError($"GetStates i:{i} j:{j}");
                }
                tempUnit.widthIndex = xx[i, j].X;
                tempUnit.heightIndex = xx[i, j].Y;
                states.Add(tempUnit);
            }
        }
        return states;
    }

    List<ML_Unit> curStates = new List<ML_Unit>();
    public override void CollectObservations(VectorSensor sensor)
    {
        // Check that block movement and board settlement have both finished
        var receiveReward = GameMgr.Instance.CanGetState();
        var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
        if (!codeMoveOver || !receiveReward)
        {
            return;
        }

        // Get the current state information of the environment
        curStates = MlagentsMgr.GetStates();
        for (int i = 0; i < curStates.Count; i++)
        {
            sensor.AddOneHotObservation(curStates[i].widthIndex, MlagentsMgr.num_widthMax);
            sensor.AddOneHotObservation(curStates[i].heightIndex, MlagentsMgr.num_heightMax);
            sensor.AddOneHotObservation(curStates[i].color, (int)CodeColor.ColorType.MaxNum);
        }
    }
}

--Action:

Each block can move up, down, left, and right. The minimum information we need covers 14 × 7 blocks, each of which can move in 4 directions; in this example the directions are enumerated as (up, right, down, left).

With the upper-left corner as the origin, the cyan block in the upper-left corner occupies the first four actions: (move the upper-left cyan block up, move it right, move it down, move it left).

The action space therefore contains 14 × 7 × 4 = 392 actions.
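For concreteness, here is a small sketch of the indexing this implies; it mirrors the DecomposeAction method shown below, with num_heightMax = 7 and num_dirMax = 4 taken from the project:

// Encode (widthIndex, heightIndex, dir) into a single discrete action id.
public static int ComposeAction(int widthIndex, int heightIndex, int dir)
{
    const int num_heightMax = 7;
    const int num_dirMax = 4;
    return widthIndex * (num_heightMax * num_dirMax) + heightIndex * num_dirMax + dir;
}
// Example: the upper-left block (0, 0) moving right (dir = 1) is action id 1,
// and the block at width 1, height 0 moving up (dir = 0) is action id 28.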

Attentive readers may notice that the cyan block in the upper-left corner cannot actually move up or left. For these cases we need to set an Actionmask to block the actions that the rules forbid.

Code example:

public class MyAgent : Agent
{
    public enum MoveDir
    {
        up,
        right,
        down,
        left,
    }

    public void DecomposeAction(int actionId,out int width,out int height,out int dir)
    {
        width = actionId / (num_heightMax * num_dirMax);
        height = actionId % (num_heightMax * num_dirMax) / num_dirMax;
        dir = actionId % (num_heightMax * num_dirMax) % num_dirMax;
    }

    // Execute the action and collect the reward it produces
    public override void OnActionReceived(float[] vectorAction)
    {
        // Check that block movement and board settlement have both finished
        var receiveReward = GameMgr.Instance.CanGetState();
        var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
        if (!codeMoveOver || !receiveReward)
        {
            Debug.LogError($"OnActionReceived CanGetState = {GameMgr.Instance.CanGetState()}");
            return;
        }

        if (invalidNums.Contains((int)vectorAction[0]))
        {
            // Board settlement call; a reward is applied here (a penalty in this case, because the action is masked.
            // During training every action may still be sampled; outside training this branch is never reached)
            GameMgr.Instance.OnGirdChangeOver?.Invoke(true, -5, false, false);
        }
        DecomposeAction((int)vectorAction[0], out int widthIndex, out int heightIndex, out int dirIndex);
        // Execute the action: move the corresponding block in the corresponding direction.
        // A reward is obtained after execution, and the scene is reset when necessary
        MlagentsMgr.SetAction(widthIndex, heightIndex, dirIndex, false);
    }

    // Called after MlagentsMgr.SetAction has finished executing the action
    public void RewardShape(int score)
    {
        // Compute the reward obtained
        var reward = (float)score * rewardScaler;
        AddReward(reward);
        // Log the data to TensorBoard for statistical analysis
        Mlstatistics.AddCumulativeReward(StatisticsType.action, reward);
        // A small penalty on every step improves exploration efficiency
        var punish = -1f / MaxStep * punishScaler;
        AddReward(punish);
        // Log the data to TensorBoard for statistical analysis
        Mlstatistics.AddCumulativeReward(StatisticsType.punishment, punish);
    }

    // Set the action mask to block invalid actions
    public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
    {
        // Mask the necessary actions if selected by the user.
        checkinfo.Clear();
        invalidNums.Clear();
        int invalidNumber = -1;
        // The masked action id must use the same encoding that DecomposeAction decodes:
        // actionId = widthIndex * (num_heightMax * num_dirMax) + heightIndex * num_dirMax + dir
        for (int i = 0; i < MlagentsMgr.num_widthMax; i++)
        {
            for (int j = 0; j < MlagentsMgr.num_heightMax; j++)
            {
                if (i == 0)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.left;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }
                if (i == num_widthMax - 1)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.right;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }

                if (j == 0)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.up;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }

                if (j == num_heightMax - 1)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.down;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }
            }
        }
    }
}

The original project's elimination flow uses a large number of coroutines and introduces long delays. For training we need to squeeze this waiting time out.

To avoid touching the main game logic, the fillTime in yield return new WaitForSeconds(fillTime) inside the coroutines is reduced to 0.001f under normal circumstances, so that after an Action is chosen the model receives its Reward as quickly as possible without extensive changes to the game logic. A sketch of this change follows.
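The sketch below illustrates the kind of change described; GridFiller, fillTime, b_isTrain and the coroutine body are placeholders standing in for the original game code, not part of ml-agents.

using System.Collections;
using UnityEngine;

public class GridFiller : MonoBehaviour
{
    public float fillTime = 0.1f;   // delay used by the original game
    public bool b_isTrain = true;   // assumed training flag

    private IEnumerator FillStep()
    {
        // Original: yield return new WaitForSeconds(fillTime);
        // During training, use a near-zero delay so the Reward arrives almost immediately.
        yield return new WaitForSeconds(b_isTrain ? 0.001f : fillTime);
        // ...the original fill/settlement logic continues here...
    }
}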

public class MyAgent : Agent
{
    private void FixedUpdate()
    {
        var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
        var receiveReward = GameMgr.Instance.CanGetState();
        if (!codeMoveOver || !receiveReward /*||!MlagentsMgr.b_isTrain*/)
        {       
            return;
        }
        // Coroutines in the game need time to finish, and the decision must not be requested until the Reward has been produced, so the built-in ml-agents DecisionRequester cannot be used here
        RequestDecision();
    }
}

2.5 Parameter adjustment

After designing the model, we first run a preliminary version to see how much the result differs from our design expectations.

First configure the yaml file to initialize the parameters of the network:

behaviors:
  SanXiaoAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0005
      beta: 0.005
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
      memory: null
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: null
    keep_checkpoints: 25
    checkpoint_interval: 100000
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 1000
    threaded: true
    self_play: null
    behavioral_cloning: null
    framework: tensorflow

For the training command, please refer to the official documentation. This example uses the release 6 version; the command is as follows:

mlagents-learn config/ppo/sanxiao.yaml --env=G:\mylab\ml-agent-buildprojects\sanxiao\windows\display\121001display\fangkuaixiaoxiaole --run-id=121001xxl --train --width 800 --height 600 --num-envs 2 --force --initialize-from=121001

After training, open Anaconda, run tensorboard --logdir=results --port=6006 in the root directory of the ml-agents project, then copy http://PS20190711FUOV:6006/ into a browser to see the training results.

(mlagents) PS G:\mylab\ml-agents-release_6> tensorboard --logdir=results --port=6006
TensorBoard 1.14.0 at http://PS20190711FUOV:6006/ (Press CTRL+C to quit)

The training effect chart is as follows:

[Figure: TensorBoard training curves]

Move count is the average number of moves needed to eliminate blocks once. It takes about 9 moves to find a correct move; when the Actionmask is used, a match is made roughly every 6 moves.

--Reward:

Check the average reward shown above against the reward design; I like to keep it between 0.5 and 2. If it is too large or too small, adjust rewardScaler.

// Called after MlagentsMgr.SetAction has finished executing the action
public void RewardShape(int score)
{
    // Compute the reward obtained
    var reward = (float)score * rewardScaler;
    AddReward(reward);
    // Log the data to TensorBoard for statistical analysis
    Mlstatistics.AddCumulativeReward(StatisticsType.action, reward);
    // A small penalty on every step improves exploration efficiency
    var punish = -1f / MaxStep * punishScaler;
    AddReward(punish);
    // Log the data to TensorBoard for statistical analysis
    Mlstatistics.AddCumulativeReward(StatisticsType.punishment, punish);
}
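Mlstatistics above is the project's own statistics wrapper. As an aside, ml-agents (release 6) also ships a built-in StatsRecorder that can push custom scalars such as these to TensorBoard; the sketch below assumes that API, and the metric names are just examples.

using Unity.MLAgents;

public static class CustomStats
{
    // Record custom scalars that show up in TensorBoard alongside the built-in metrics.
    public static void RecordReward(float reward)
    {
        Academy.Instance.StatsRecorder.Add("Custom/StepReward", reward, StatAggregationMethod.Average);
    }

    public static void RecordMoveCount(float moveCount)
    {
        Academy.Instance.StatsRecorder.Add("Custom/MoveCount", moveCount, StatAggregationMethod.Average);
    }
}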

3. Summary and discussion

The current official ml-agents practice uses imitation learning, training the network with expert data.

The author tried PPO in this example and it does work. However, PPO is currently hard to train on match-3: it is difficult to converge and difficult to reach the global optimum.

Setting up the environment and Reward requires rigorous testing; otherwise the results can be badly off and hard to troubleshoot.

Reinforcement learning is iterating quickly at the moment. If there are any errors above, please point them out so we can make progress together.

Due to limited space, the full project code cannot be published here. If you are interested, leave a message below and I can send the complete project to you by email.

In a follow-up article, we will share ml-agents with an external algorithm, using the external tool stable_baselines3 and the DQN algorithm for training.


PS: For more technical content, follow the public account [xingzhe_ai] and come discuss with us!


Origin blog.51cto.com/15063587/2585343