Install
Official documentation
Platform: macOS. Version: release_19.
git clone --branch release_19 https://github.com/Unity-Technologies/ml-agents.git
Cloning failed twice because the network was too slow, so download the zip directly from the GitHub website and unzip it instead:
unzip ml-agents-release_19.zip
This project contains:
- com.unity.ml-agents — Unity package
- com.unity.ml-agents.extensions — Unity package; experimental, optional, depends on com.unity.ml-agents
- mlagents — Python library for training agents; depends on mlagents_envs
- mlagents_envs — a low-level Python library
- gym-unity — a Python library that supports OpenAI Gym
- Project — a Unity project with demo examples
The Unity packages are imported into the Unity project; the Python libraries are installed separately into a Python environment. They can be installed from the cloned source, but pip installation is more convenient:
conda create -n ml-agents python=3.6
conda activate ml-agents
python -m pip install mlagents==0.28.0
Demo
Official documents
Open the project directly from Unity Hub: ml-agents-release_19/Project. As shown in the figure below, it contains many examples.
Open the 3DBall/Scenes/3DBall scene and run it directly to see the Agents in action.
Train
Go to the Project folder and run:
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun
This starts the Python training process. Preset training configuration files for the projects in Examples are under the config folder, and run-id names the training run. Then click Play in Unity to provide the training environment, and training begins. To stop training, press Ctrl+C; to resume training, run:
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --resume
During training, a results folder is created in the current directory to store the logs and model checkpoints, similar to a TF/PyTorch workflow:
results
└── first3DBallRun
├── 3DBall
│ ├── 3DBall-151000.onnx
│ ├── 3DBall-151000.pt
│ ├── checkpoint.pt
│ └── events.out.tfevents.1653732428.bogon.89658.0
├── 3DBall.onnx
├── configuration.yaml
└── run_logs
├── timers.json
└── training_status.json
You can use TensorBoard to view the metrics during training:
tensorboard --logdir results
Enter localhost:6006 in the browser to see curves such as reward and loss.
The 3DBall scene contains 12 Agents. These 12 Agents are independent of each other but share one model; during training each of them contributes to the model parameter updates, which is equivalent to running 12 training threads, or a batch of 12. In short, it speeds up training roughly 12-fold.
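The effect of running 12 agents against one shared model can be sketched numerically. This is a toy illustration of the throughput argument, not ML-Agents internals: if each agent contributes one experience per environment step, the shared experience buffer fills 12 times faster.

```python
def steps_to_fill_buffer(buffer_size: int, num_agents: int) -> int:
    """Environment steps needed to collect buffer_size experiences
    when num_agents agents each add one experience per step."""
    # Ceiling division: the final step may slightly overfill the buffer.
    return -(-buffer_size // num_agents)

single = steps_to_fill_buffer(12000, 1)    # one agent
parallel = steps_to_fill_buffer(12000, 12) # twelve agents sharing a model
print(single, parallel, single // parallel)
```

With one agent it takes 12000 steps to gather 12000 experiences; with 12 agents it takes 1000, hence the 12x speedup claim above.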
So far we have walked through the ML-Agents demo and run it, but the details inside are still unclear. Next, we build a complete ML-Agents system from scratch.
Build from scratch
1. Create a new 3D Project and name it "RollBall"
2. Import the unity package com.unity.ml-agents
Window->Package Manager->Add package from disk, then select ml-agents-release_19/com.unity.ml-agents/package.json; the package is imported.
3. Create a new object
As shown in the figure, the Agent is a small ball (RollerAgent) whose goal is to reach the block (Target). Add a Rigidbody component to the RollerAgent so that if it leaves the plane, it falls under gravity and the episode fails.
4. Write Agent script
Create a new script RollerAgent.cs. This is the core: the state, reward, and action are all defined here.
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;

public class RollerAgent : Agent
{
    Rigidbody rBody;
    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public Transform Target;
    public override void OnEpisodeBegin()
    {
        // If the Agent fell, zero its momentum
        if (this.transform.localPosition.y < 0)
        {
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
            this.transform.localPosition = new Vector3(0, 0.5f, 0);
        }

        // Move the target to a new spot
        Target.localPosition = new Vector3(Random.value * 8 - 4, 0.5f, Random.value * 8 - 4);
    }

    // The observation data will be fed into a neural network as a feature vector
    public override void CollectObservations(VectorSensor sensor)
    {
        // Target and Agent positions
        sensor.AddObservation(Target.localPosition);
        sensor.AddObservation(this.transform.localPosition);

        // Agent velocity
        sensor.AddObservation(rBody.velocity.x);
        sensor.AddObservation(rBody.velocity.z);
    }

    public float forceMultiplier = 10;
    public override void OnActionReceived(ActionBuffers actionBuffers)
    {
        // Actions, size = 2
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = actionBuffers.ContinuousActions[0];
        controlSignal.z = actionBuffers.ContinuousActions[1];
        rBody.AddForce(controlSignal * forceMultiplier);

        // Rewards
        float distanceToTarget = Vector3.Distance(this.transform.localPosition, Target.localPosition);

        // Reached target
        if (distanceToTarget < 1.42f)
        {
            SetReward(1.0f);
            EndEpisode();
        }
        // Fell off platform
        else if (this.transform.localPosition.y < 0)
        {
            EndEpisode();
        }
    }

    // Manual control, used for testing the environment before training
    public override void Heuristic(in ActionBuffers actionsOut)
    {
        var continuousActionsOut = actionsOut.ContinuousActions;
        continuousActionsOut[0] = Input.GetAxis("Horizontal");
        continuousActionsOut[1] = Input.GetAxis("Vertical");
    }
}
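The observations collected above form an 8-dimensional feature vector: two 3-D positions plus two velocity components. A small Python sketch makes the layout explicit (the function name and tuple inputs are illustrative, not the ML-Agents API):

```python
def build_observation(target_pos, agent_pos, vel_x, vel_z):
    """Mirror of CollectObservations: concatenate the target position (3),
    the agent position (3), and the agent's x/z velocity (2) into one
    flat feature vector of length 8."""
    return list(target_pos) + list(agent_pos) + [vel_x, vel_z]

obs = build_observation((3.0, 0.5, -2.0), (0.0, 0.5, 0.0), 1.5, -0.5)
assert len(obs) == 8  # this is the vector observation size the model sees
```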
Reinforcement learning consists mainly of the agent (Agent), the environment (Environment), states (State), actions (Action), and rewards (Reward). An episode runs from the start until task success, task failure, or timeout; the goal is to maximize the reward accumulated within an episode.
OnEpisodeBegin
Initializes the state when an episode begins.
CollectObservations
Collects the state; the state data is passed into the model, and the model outputs an action based on the current state.
OnActionReceived
Applies the action's effect on the environment; the resulting state change is computed by Unity's physics. The reward is also assigned here.
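The three callbacks map onto a generic RL episode loop. A minimal Python sketch (a toy one-dimensional environment, not the ML-Agents API) shows where each hook fires:

```python
def run_episode(policy, max_steps=100):
    """Generic RL episode: reset (OnEpisodeBegin), observe
    (CollectObservations), act and assign reward (OnActionReceived)."""
    state = 0.0                        # OnEpisodeBegin: reset the state
    total_reward = 0.0
    for _ in range(max_steps):         # timeout ends the episode
        observation = [state]          # CollectObservations
        action = policy(observation)   # model maps state -> action
        state += action                # OnActionReceived: apply the action
        if state >= 1.0:               # reached the "target": success
            total_reward += 1.0        # SetReward(1.0)
            break                      # EndEpisode
    return total_reward

# A trivial policy that always moves toward the target succeeds:
print(run_episode(lambda obs: 0.2))
```

In ML-Agents the loop itself is driven by the Unity runtime and the trainer; the script only fills in the three hooks.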
5. Add Agent-related components
Add the following components to the Agent and modify some parameters:
RollerAgent
The script written above
DecisionRequester
"Requests decisions on its own at regular intervals." Without it, the agent script would need to call RequestDecision() manually.
BehaviorParameters
Model parameter configuration, including the state (observation) vector dimension, the action dimensions, the model file, etc.
6. Environmental testing
So far the environment and Agent are set up, but there is no Model yet. Before training, test manually via the Heuristic function (in the script above) and control the ball's movement with the arrow keys; this is equivalent to a person standing in for the Model behind the Agent. It also verifies that the environment is built correctly.
7. Training
Create a new model training configuration file Config/rollerball_config.yaml under the Assets directory:
behaviors:
  RollerBall:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 2000
For parameter descriptions, see Config parameters.
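As a rough illustration of the `linear` schedules in the config above, a linear schedule anneals a value toward zero as training approaches max_steps (a sketch of linear decay in general; ML-Agents' exact implementation may differ):

```python
def linear_schedule(initial: float, step: int, max_steps: int) -> float:
    """Linearly anneal `initial` down to 0 as step approaches max_steps."""
    return initial * max(0.0, 1.0 - step / max_steps)

# With learning_rate: 3.0e-4 and max_steps: 500000 from the config:
print(linear_schedule(3.0e-4, 0, 500_000))        # full rate at the start
print(linear_schedule(3.0e-4, 250_000, 500_000))  # half the rate at halfway
```

A `constant` schedule (used for beta above) simply keeps the initial value throughout training.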
Run in the Assets directory:
mlagents-learn Config/rollerball_config.yaml --run-id=RollerBall
and run the Unity project to start training. When the reward plateaus, press Ctrl+C to terminate training:
…
[INFO] RollerBall. Step: 46000. Time Elapsed: 126.200 s. Mean Reward: 0.908. Std of Reward: 0.289. Training.
[INFO] RollerBall. Step: 48000. Time Elapsed: 131.253 s. Mean Reward: 0.862. Std of Reward: 0.345. Training.
[INFO] RollerBall. Step: 50000. Time Elapsed: 136.253 s. Mean Reward: 0.878. Std of Reward: 0.328. Training.
[INFO] RollerBall. Step: 52000. Time Elapsed: 141.376 s. Mean Reward: 0.915. Std of Reward: 0.279. Training.
[INFO] RollerBall. Step: 54000. Time Elapsed: 146.467 s. Mean Reward: 0.879. Std of Reward: 0.327. Training.
The training logs and results are in the results directory under Assets:
results
├── RollerBall
│ ├── RollerBall
│ │ ├── RollerBall-55804.onnx
│ │ ├── RollerBall-55804.onnx.meta
│ │ ├── RollerBall-55804.pt
│ │ ├── RollerBall-55804.pt.meta
│ │ ├── checkpoint.pt
│ │ ├── checkpoint.pt.meta
│ │ ├── events.out.tfevents.1653813243.bogon.91531.0
│ │ └── events.out.tfevents.1653813243.bogon.91531.0.meta
│ ├── RollerBall.meta
│ ├── RollerBall.onnx
│ ├── RollerBall.onnx.meta
│ ├── configuration.yaml
│ └── configuration.yaml.meta
└── RollerBall.meta
Use tensorboard to view the training process indicator curve:
tensorboard --logdir results
Enter localhost:6006 in the browser to see:
8. Agent test
Assign the trained model results/RollerBall/RollerBall.onnx to the Model parameter of the Behavior Parameters component, then run Unity to see the effect.
9. Parallel acceleration
Two ways:
- One is to duplicate the TrainingArea multiple times within the scene. During training, data is collected according to the BehaviorName (a Behavior Parameters setting), so multiple agents train simultaneously and all contribute to the model parameters, which is equivalent to multi-threaded or larger-batch training.
- The other is to launch multiple environment instances:
mlagents-learn config/rollerball_config.yaml --run-id=RollerBall --num-envs=2
The first method was used in this test; the second has not been tried yet.