Install
Official documentation
Platform: macOS. Version: release_19.
git clone --branch release_19 https://github.com/Unity-Technologies/ml-agents.git
Cloning failed twice because the network was too slow, so download the zip directly from the GitHub website and unzip it instead:
unzip ml-agents-release_19.zip
This project contains:
- com.unity.ml-agents — Unity package
- com.unity.ml-agents.extensions — Unity package; experimental, optional, depends on com.unity.ml-agents
- mlagents — Python library for training agents; depends on mlagents_envs
- mlagents_envs — a low-level Python library
- gym-unity — a Python library that supports OpenAI Gym
- Project — a Unity project with demo examples
The Unity packages are imported into the Unity project; the Python libraries are installed separately into a Python environment. They can be installed from the cloned source, but pip installation is more convenient:
conda create -n ml-agents python=3.6
conda activate ml-agents
python -m pip install mlagents==0.28.0
Demo
Official documents
Open the project directly from Unity Hub: ml-agents-release_19/Project. As shown in the figure below, it contains many examples.
Open the 3DBall/Scenes/3DBall scene and run it directly to see the Agents in action.
Train
Go to the Project folder and run:
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun
This starts the Python training process. Preset training configuration files for the projects in Examples are under the config folder, and run-id names the training run. Then click Play in Unity to provide the training environment, and training begins. To stop training, press Ctrl+C; to resume training, run:
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --resume
During training, a results folder is created in the current directory to store the logs and model checkpoints, similar to a TF/PyTorch workflow:
results
└── first3DBallRun
├── 3DBall
│ ├── 3DBall-151000.onnx
│ ├── 3DBall-151000.pt
│ ├── checkpoint.pt
│ └── events.out.tfevents.1653732428.bogon.89658.0
├── 3DBall.onnx
├── configuration.yaml
└── run_logs
├── timers.json
└── training_status.json
You can use TensorBoard to view the metrics during training:
tensorboard --logdir results
Enter localhost:6006 in the browser to see curves such as reward and loss.
The 3DBall scene contains 12 Agents. These 12 Agents are independent of each other but share one model; during training each of them contributes to the model parameter updates, which is equivalent to running 12 training threads, or a batch of 12. In short, it speeds up training roughly 12-fold.
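The effect of running 12 agents against one shared model can be sketched numerically. This is a toy illustration of the throughput argument, not ML-Agents internals: if each agent contributes one experience per environment step, the shared experience buffer fills 12 times faster.

```python
def steps_to_fill_buffer(buffer_size: int, num_agents: int) -> int:
    """Environment steps needed to collect buffer_size experiences
    when num_agents agents each add one experience per step."""
    # Ceiling division: the final step may slightly overfill the buffer.
    return -(-buffer_size // num_agents)

single = steps_to_fill_buffer(12000, 1)    # one agent
parallel = steps_to_fill_buffer(12000, 12) # twelve agents sharing a model
print(single, parallel, single // parallel)
```

With one agent it takes 12000 steps to gather 12000 experiences; with 12 agents it takes 1000, hence the 12x speedup claim above.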
So far we have walked through the ML-Agents demo and run it, but the details inside are still unclear. Next, we build a complete ML-Agents system from scratch.
Build from scratch
1. Create a new 3D Project and name it "RollBall"
2. Import the unity package com.unity.ml-agents
Window->Package Manager->Add package from disk, then select ml-agents-release_19/com.unity.ml-agents/package.json; the package is imported.
3. Create a new object
As shown in the figure, the Agent is a small ball (RollerAgent) whose goal is to reach the block (Target). Add a Rigidbody component to the RollerAgent so that if it leaves the plane, it falls under gravity and the episode fails.
4. Write Agent script
Create a new script RollerAgent.cs. This is the core: the state, reward, and action are all defined here.
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;

public class RollerAgent : Agent
{
    Rigidbody rBody;
    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public Transform Target;
    public override void OnEpisodeBegin()
    {
        // If the Agent fell, zero its momentum
        if (this.transform.localPosition.y < 0)
        {
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
            this.transform.localPosition = new Vector3(0, 0.5f, 0);
        }

        // Move the target to a new spot
        Target.localPosition = new Vector3(Random.value * 8 - 4, 0.5f, Random.value * 8 - 4);
    }

    // The observation data will be fed into a neural network as a feature vector
    public override void CollectObservations(VectorSensor sensor)
    {
        // Target and Agent positions
        sensor.AddObservation(Target.localPosition);
        sensor.AddObservation(this.transform.localPosition);

        // Agent velocity
        sensor.AddObservation(rBody.velocity.x);
        sensor.AddObservation(rBody.velocity.z);
    }

    public float forceMultiplier = 10;
    public override void OnActionReceived(ActionBuffers actionBuffers)
    {
        // Actions, size = 2
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = actionBuffers.ContinuousActions[0];
        controlSignal.z = actionBuffers.ContinuousActions[1];
        rBody.AddForce(controlSignal * forceMultiplier);

        // Rewards
        float distanceToTarget = Vector3.Distance(this.transform.localPosition, Target.localPosition);

        // Reached target
        if (distanceToTarget < 1.42f)
        {
            SetReward(1.0f);
            EndEpisode();
        }
        // Fell off platform
        else if (this.transform.localPosition.y < 0)
        {
            EndEpisode();
        }
    }

    // Manual control, used for testing the environment before training
    public override void Heuristic(in ActionBuffers actionsOut)
    {
        var continuousActionsOut = actionsOut.ContinuousActions;
        continuousActionsOut[0] = Input.GetAxis("Horizontal");
        continuousActionsOut[1] = Input.GetAxis("Vertical");
    }
}
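The observations collected above form an 8-dimensional feature vector: two 3-D positions plus two velocity components. A small Python sketch makes the layout explicit (the function name and tuple inputs are illustrative, not the ML-Agents API):

```python
def build_observation(target_pos, agent_pos, vel_x, vel_z):
    """Mirror of CollectObservations: concatenate the target position (3),
    the agent position (3), and the agent's x/z velocity (2) into one
    flat feature vector of length 8."""
    return list(target_pos) + list(agent_pos) + [vel_x, vel_z]

obs = build_observation((3.0, 0.5, -2.0), (0.0, 0.5, 0.0), 1.5, -0.5)
assert len(obs) == 8  # this is the vector observation size the model sees
```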
Reinforcement learning consists mainly of the agent (Agent), the environment (Environment), states (State), actions (Action), and rewards (Reward). An episode runs from the start until task success, task failure, or timeout; the goal is to maximize the reward accumulated within an episode.
OnEpisodeBegin
Initializes the state when an episode begins.
CollectObservations
Collects the state; the state data is passed into the model, and the model outputs an action based on the current state.
OnActionReceived
Applies the action's effect on the environment; the resulting state change is computed by Unity's physics. The reward is also assigned here.
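The three callbacks map onto a generic RL episode loop. A minimal Python sketch (a toy one-dimensional environment, not the ML-Agents API) shows where each hook fires:

```python
def run_episode(policy, max_steps=100):
    """Generic RL episode: reset (OnEpisodeBegin), observe
    (CollectObservations), act and assign reward (OnActionReceived)."""
    state = 0.0                        # OnEpisodeBegin: reset the state
    total_reward = 0.0
    for _ in range(max_steps):         # timeout ends the episode
        observation = [state]          # CollectObservations
        action = policy(observation)   # model maps state -> action
        state += action                # OnActionReceived: apply the action
        if state >= 1.0:               # reached the "target": success
            total_reward += 1.0        # SetReward(1.0)
            break                      # EndEpisode
    return total_reward

# A trivial policy that always moves toward the target succeeds:
print(run_episode(lambda obs: 0.2))
```

In ML-Agents the loop itself is driven by the Unity runtime and the trainer; the script only fills in the three hooks.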
5. Add Agent-related components
Add the following components to the Agent and modify some parameters:
RollerAgent
The script written above
DecisionRequester
"Requests decisions on its own at regular intervals." Without it, the agent script would need to call RequestDecision() manually.
BehaviorParameters
Model parameter configuration, including the state (observation) vector dimension, the action dimensions, the model file, etc.
6. Environmental testing
So far the environment and Agent are set up, but there is no Model yet. Before training, test manually via the Heuristic function (in the script above) and control the ball's movement with the arrow keys; this is equivalent to a person standing in for the Model behind the Agent. It also verifies that the environment is built correctly.
7. Training
Create a new model training configuration file Config/rollerball_config.yaml under the Assets directory:
behaviors:
  RollerBall:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 2000
For parameter descriptions, see Config parameters.
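As a rough illustration of the `linear` schedules in the config above, a linear schedule anneals a value toward zero as training approaches max_steps (a sketch of linear decay in general; ML-Agents' exact implementation may differ):

```python
def linear_schedule(initial: float, step: int, max_steps: int) -> float:
    """Linearly anneal `initial` down to 0 as step approaches max_steps."""
    return initial * max(0.0, 1.0 - step / max_steps)

# With learning_rate: 3.0e-4 and max_steps: 500000 from the config:
print(linear_schedule(3.0e-4, 0, 500_000))        # full rate at the start
print(linear_schedule(3.0e-4, 250_000, 500_000))  # half the rate at halfway
```

A `constant` schedule (used for beta above) simply keeps the initial value throughout training.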
Run in the Assets directory:
mlagents-learn Config/rollerball_config.yaml --run-id=RollerBall
and run the Unity project to start training. When the reward plateaus, press Ctrl+C to terminate training:
…
[INFO] RollerBall. Step: 46000. Time Elapsed: 126.200 s. Mean Reward: 0.908. Std of Reward: 0.289. Training.
[INFO] RollerBall. Step: 48000. Time Elapsed: 131.253 s. Mean Reward: 0.862. Std of Reward: 0.345. Training.
[INFO] RollerBall. Step: 50000. Time Elapsed: 136.253 s. Mean Reward: 0.878. Std of Reward: 0.328. Training.
[INFO] RollerBall. Step: 52000. Time Elapsed: 141.376 s. Mean Reward: 0.915. Std of Reward: 0.279. Training.
[INFO] RollerBall. Step: 54000. Time Elapsed: 146.467 s. Mean Reward: 0.879. Std of Reward: 0.327. Training.
The training logs and results are in the results directory under Assets:
results
├── RollerBall
│ ├── RollerBall
│ │ ├── RollerBall-55804.onnx
│ │ ├── RollerBall-55804.onnx.meta
│ │ ├── RollerBall-55804.pt
│ │ ├── RollerBall-55804.pt.meta
│ │ ├── checkpoint.pt
│ │ ├── checkpoint.pt.meta
│ │ ├── events.out.tfevents.1653813243.bogon.91531.0
│ │ └── events.out.tfevents.1653813243.bogon.91531.0.meta
│ ├── RollerBall.meta
│ ├── RollerBall.onnx
│ ├── RollerBall.onnx.meta
│ ├── configuration.yaml
│ └── configuration.yaml.meta
└── RollerBall.meta
Use tensorboard to view the training process indicator curve:
tensorboard --logdir results
Enter localhost:6006 in the browser to see:
8. Agent test
Assign the trained model results/RollerBall/RollerBall.onnx to the Model parameter of the Behavior Parameters component, then run Unity to see the effect.
9. Parallel acceleration
Two ways:
- One is to duplicate the TrainingArea multiple times within the scene. During training, data is collected according to the BehaviorName (a Behavior Parameters setting), so multiple agents train simultaneously and all contribute to the model parameters, which is equivalent to multi-threaded or larger-batch training.
- The other is to launch multiple environment instances:
mlagents-learn config/rollerball_config.yaml --run-id=RollerBall --num-envs=2
The first method was used in this test; the second has not been tried yet.