ML-Agents案例之机器人学走路

本案例源自ML-Agents官方的示例，Github地址：https://github.com/Unity-Technologies/ml-agents，本文是详细的配套讲解。

本文基于我前面发的两篇文章，需要对ML-Agents有一定的了解，详情请见：Unity强化学习之ML-Agents的使用、ML-Agents命令及配置大全。

我前面的相关文章有：

Unity人工智能之不断自我进化的五人足球赛

ML-Agents案例之地牢逃脱

ML-Agents案例之金字塔

扫描二维码关注公众号，回复： 13550630 查看本文章

ML-Agents案例之蠕虫

在这里插入图片描述

环境说明

在阅读本文之前，可以参考ML-Agents案例之Crawler、ML-Agents案例之蠕虫这两个案例。它们的目标都是一致的：吃掉绿色的方块，并且都是仿生机器人，区别是仿生的对象不一样，之前是仿爬虫和蠕虫，这一次是直接仿人。因此，本案例是之前两个案例的高配版，机器人的关节要比以往多得多，这将导致输入和输出维度上的急剧增加，这对训练来说是一项巨大的挑战，导致找到一个好的策略称为一项艰难的任务。

现在我们看看机器人的构造：

在这里插入图片描述

可以看到，相比于现实中的人，Unity中的机器人已经在关节上做到尽量简化了，例如：锁死了手腕的关节，没有手指脚趾，脊椎当成一个关节等。但是剩余的关节数目仍然是一个比爬虫大得多的数目。

我们查看其物体列表，发现整个人以臀部作为所有部位的父物体。然后臀部上挂的子物体有左腿、右腿、躯干，它们的关节链接的都是臀部。如下腰部连接臀部：

在这里插入图片描述

可以看到腰部的关节的x，y，z三个轴都是可以转动的，因为人的腰在生理构造上是既可以前后倾，也能左右倾，还能左右扭腰。我们可以通过Edit Angular Limits可视化修改其可以转动的幅度。

左右腿同理，连接臀部，但是自由度z轴旋转被锁定，这是因为人腿部并不能做到像腰一样左右扭动。

在这里插入图片描述

每一条腿除了和臀部连接的关节外，还有膝盖和脚踝两个关节。

膝盖只有一个x轴能够旋转，即一个自由度，连接的是大腿部。

在这里插入图片描述

同样的，脚踝三个轴都能旋转，连接的是小腿部。

在这里插入图片描述

现在我们再来看看躯干部分的各个关节，整个躯干的父物体是腹部。

首先是脊椎，也是胸部的关节，连接的是腹部，x，y，z轴皆可旋转。

在这里插入图片描述

然后链接胸部的有三个部位，分别是左手，右手，头部。

先看看头部，头部是两个轴的自由度。

在这里插入图片描述

然后是手部，左手右手一样，胳膊上有一个关节，两个轴的自由度（个人认为应该有三个轴的自由度），连接胸部。手肘上有一个关节，一个自由度，连接上臂。

在这里插入图片描述

到此为止，上面讲解的就是我们接下来要控制的关节了。同样，想要控制各个关节的运动，我们需要在智能体上挂载一个JointDriveController.cs的脚本，这个脚本不会自己运作，只有在别的脚本的调用下才会起作用。关于这个脚本的代码说明参考ML-Agents案例之Crawler。

我们在进行深度强化学习训练时，智能体接收的状态输入为：目标速度与身体平均速度的距离，一维。指向方块坐标下的身体速度和目标速度，六维。头部方向到目标方向的旋转，四维。整个身体面朝方向到目标方向的旋转，四维。目标方块相对于智能体的坐标，三维。每节躯体的移动速度、角速度、是否接触地面、相对于臀部的位置，16 * 10维。每个关节的力度、局部坐标的旋转，13 * 5维。一共是243维。

动作输出：我们一共有13个关节需要控制，加起来是26个轴的自由度，再加上控制13个轴的力度，所以一共是39维的连续输出。

代码讲解

参数设置：

private float m_TargetWalkingSpeed = 10;
public float MTargetWalkingSpeed
{
    
    
    get {
    
     return m_TargetWalkingSpeed; }
    set {
    
     m_TargetWalkingSpeed = Mathf.Clamp(value, .1f, m_maxWalkingSpeed); }
}

const float m_maxWalkingSpeed = 10; 
public bool randomizeWalkSpeedEachEpisode;
private Vector3 m_WorldDirToWalk = Vector3.right;
// 目标方块
[Header("Target To Walk Towards")] public Transform target; 
// 身体的各个部位
[Header("Body Parts")] public Transform hips;
public Transform chest;
public Transform spine;
public Transform head;
public Transform thighL;
public Transform shinL;
public Transform footL;
public Transform thighR;
public Transform shinR;
public Transform footR;
public Transform armL;
public Transform forearmL;
public Transform handL;
public Transform armR;
public Transform forearmR;
public Transform handR;

// 指向方块的设置为一个稳定的空间参考点，可以提高学习效果
OrientationCubeController m_OrientationCube;
// 脚下的箭头指示器的脚本
DirectionIndicator m_DirectionIndicator;
// 关节控制脚本
JointDriveController m_JdController;
// 环境参数，可从配置文件中获取
EnvironmentParameters m_ResetParams;

初始化方法Initialize()：

public override void Initialize()
{
    
    
    // 获取三个脚本
    m_OrientationCube = GetComponentInChildren<OrientationCubeController>();
    m_DirectionIndicator = GetComponentInChildren<DirectionIndicator>();
    m_JdController = GetComponent<JointDriveController>();
    // 初始化16个身体部位
    m_JdController.SetupBodyPart(hips);
    m_JdController.SetupBodyPart(chest);
    m_JdController.SetupBodyPart(spine);
    m_JdController.SetupBodyPart(head);
    m_JdController.SetupBodyPart(thighL);
    m_JdController.SetupBodyPart(shinL);
    m_JdController.SetupBodyPart(footL);
    m_JdController.SetupBodyPart(thighR);
    m_JdController.SetupBodyPart(shinR);
    m_JdController.SetupBodyPart(footR);
    m_JdController.SetupBodyPart(armL);
    m_JdController.SetupBodyPart(forearmL);
    m_JdController.SetupBodyPart(handL);
    m_JdController.SetupBodyPart(armR);
    m_JdController.SetupBodyPart(forearmR);
    m_JdController.SetupBodyPart(handR);
	// 获取配置文件中的配置数据
    m_ResetParams = Academy.Instance.EnvironmentParameters;
    SetResetParameters();
}
// 从配置文件中获取胸部，躯干，臀部的质量数据，没有的话全部设为8
public void SetTorsoMass()
{
    
    
    m_JdController.bodyPartsDict[chest].rb.mass = m_ResetParams.GetWithDefault("chest_mass", 8);
    m_JdController.bodyPartsDict[spine].rb.mass = m_ResetParams.GetWithDefault("spine_mass", 8);
    m_JdController.bodyPartsDict[hips].rb.mass = m_ResetParams.GetWithDefault("hip_mass", 8);
}
public void SetResetParameters()
{
    
    
    SetTorsoMass();
}

状态输入CollectObservations方法：

// 每部分躯体的输入
public void CollectObservationBodyPart(BodyPart bp, VectorSensor sensor)
{
    
    
    // 是否触地
    sensor.AddObservation(bp.groundContact.touchingGround); 
    // 获取指向方块空间中的刚体的速度和角速度
    sensor.AddObservation(m_OrientationCube.transform.InverseTransformDirection(bp.rb.velocity));
    sensor.AddObservation(m_OrientationCube.transform.InverseTransformDirection(bp.rb.angularVelocity));
    // 获取指向方块空间中的该躯体相对于臀部的位置
    sensor.AddObservation(m_OrientationCube.transform.InverseTransformDirection(bp.rb.position - hips.position));
	// 如果该躯体包含关节，那么获取私有空间中的旋转和关节的作用力
    if (bp.rb.transform != hips && bp.rb.transform != handL && bp.rb.transform != handR)
    {
    
    
        sensor.AddObservation(bp.rb.transform.localRotation);
        sensor.AddObservation(bp.currentStrength / m_JdController.maxJointForceLimit);
    }
}

public override void CollectObservations(VectorSensor sensor)
{
    
    
    var cubeForward = m_OrientationCube.transform.forward;
    var velGoal = cubeForward * MTargetWalkingSpeed;
    // 对所有躯体的速度取平均，获取整个身躯的平均速度
    var avgVel = GetAvgVelocity();
    // 输入目标速度和身躯速度的距离
    sensor.AddObservation(Vector3.Distance(velGoal, avgVel));
    // 输入指向方块空间中的身躯平均速度
    sensor.AddObservation(m_OrientationCube.transform.InverseTransformDirection(avgVel));
    // 输入指向方块空间中的目标速度
    sensor.AddObservation(m_OrientationCube.transform.InverseTransformDirection(velGoal));

    // 输入身躯和头部的朝向相对于指向方块朝向的四元数
    sensor.AddObservation(Quaternion.FromToRotation(hips.forward, cubeForward));
    sensor.AddObservation(Quaternion.FromToRotation(head.forward, cubeForward));

    // 输入指向方块空间中的目标方块坐标
    sensor.AddObservation(m_OrientationCube.transform.InverseTransformPoint(target.transform.position));
	// 输入每个躯体的信息
    foreach (var bodyPart in m_JdController.bodyPartsList)
    {
    
    
        CollectObservationBodyPart(bodyPart, sensor);
    }
}
// 获取身躯的平均速度
Vector3 GetAvgVelocity()
{
    
    
    Vector3 velSum = Vector3.zero;
    int numOfRb = 0;
    foreach (var item in m_JdController.bodyPartsList)
    {
    
    
        numOfRb++;
        velSum += item.rb.velocity;
    }
    var avgVel = velSum / numOfRb;
    return avgVel;
}

动作输出方法OnActionReceived：

public override void OnActionReceived(ActionBuffers actionBuffers)
{
    
    
    // 获取关于躯体的字典
    var bpDict = m_JdController.bodyPartsDict;
    var i = -1;
	// 获取神经网络输出的列表
    var continuousActions = actionBuffers.ContinuousActions;
    // 设置13个关节的旋转目标
    bpDict[chest].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], continuousActions[++i]);
    bpDict[spine].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], continuousActions[++i]);

    bpDict[thighL].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], 0);
    bpDict[thighR].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], 0);
    bpDict[shinL].SetJointTargetRotation(continuousActions[++i], 0, 0);
    bpDict[shinR].SetJointTargetRotation(continuousActions[++i], 0, 0);
    bpDict[footR].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], continuousActions[++i]);
    bpDict[footL].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], continuousActions[++i]);

    bpDict[armL].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], 0);
    bpDict[armR].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], 0);
    bpDict[forearmL].SetJointTargetRotation(continuousActions[++i], 0, 0);
    bpDict[forearmR].SetJointTargetRotation(continuousActions[++i], 0, 0);
    bpDict[head].SetJointTargetRotation(continuousActions[++i], continuousActions[++i], 0);

    // 设置13个关节的旋转力度
    bpDict[chest].SetJointStrength(continuousActions[++i]);
    bpDict[spine].SetJointStrength(continuousActions[++i]);
    bpDict[head].SetJointStrength(continuousActions[++i]);
    bpDict[thighL].SetJointStrength(continuousActions[++i]);
    bpDict[shinL].SetJointStrength(continuousActions[++i]);
    bpDict[footL].SetJointStrength(continuousActions[++i]);
    bpDict[thighR].SetJointStrength(continuousActions[++i]);
    bpDict[shinR].SetJointStrength(continuousActions[++i]);
    bpDict[footR].SetJointStrength(continuousActions[++i]);
    bpDict[armL].SetJointStrength(continuousActions[++i]);
    bpDict[forearmL].SetJointStrength(continuousActions[++i]);
    bpDict[armR].SetJointStrength(continuousActions[++i]);
    bpDict[forearmR].SetJointStrength(continuousActions[++i]);
}

每0.02秒执行一次的FixedUpdate：

 void FixedUpdate()
 {
    
    
     // 更新指向方块和指向箭头
     UpdateOrientationObjects();
     var cubeForward = m_OrientationCube.transform.forward;

     // 当实际移动速度和目标速度越接近，获得的奖励越高，完全相同时奖励为1
     var matchSpeedReward = GetMatchingVelocityReward(cubeForward * MTargetWalkingSpeed, GetAvgVelocity());
     // 检测异常
     if (float.IsNaN(matchSpeedReward))
     {
    
    
         throw new ArgumentException(
             "NaN in moveTowardsTargetReward.\n" +
             $" cubeForward: {
      
      cubeForward}\n" +
             $" hips.velocity: {
      
      m_JdController.bodyPartsDict[hips].rb.velocity}\n" +
             $" maximumWalkingSpeed: {
      
      m_maxWalkingSpeed}"
         );
     }

     // 头部的朝向与身上指向方块的朝向的点积，当两个向量的方向越接近，这个奖励值越高
     var lookAtTargetReward = (Vector3.Dot(cubeForward, head.forward) + 1) * .5F;
     // 检测异常
     if (float.IsNaN(lookAtTargetReward))
     {
    
    
         throw new ArgumentException(
             "NaN in lookAtTargetReward.\n" +
             $" cubeForward: {
      
      cubeForward}\n" +
             $" head.forward: {
      
      head.forward}"
         );
     }
	// 头部朝向的奖励和移动速度的奖励相乘得到总奖励
     AddReward(matchSpeedReward * lookAtTargetReward);
 }
// 获取速度奖励，身躯速度接近目标速度时，奖励接近于1
public float GetMatchingVelocityReward(Vector3 velocityGoal, Vector3 actualVelocity)
{
    
    
    var velDeltaMagnitude = Mathf.Clamp(Vector3.Distance(actualVelocity, velocityGoal), 0, MTargetWalkingSpeed);
    return Mathf.Pow(1 - Mathf.Pow(velDeltaMagnitude / MTargetWalkingSpeed, 2), 2);
}

void UpdateOrientationObjects()
{
    
    
    m_WorldDirToWalk = target.position - hips.position;
    // 更新指向方块
    m_OrientationCube.UpdateOrientation(hips, target);
    // 更新指向箭头
    if (m_DirectionIndicator)
    {
    
    
        m_DirectionIndicator.MatchOrientation(m_OrientationCube.transform);
    }
}

OrientationCubeController.cs中有：

public void UpdateOrientation(Transform rootBP, Transform target)
{
    
    
    var dirVector = target.position - transform.position;
    dirVector.y = 0; //flatten dir on the y. this will only work on level, uneven surfaces
    var lookRot =
        dirVector == Vector3.zero
        ? Quaternion.identity
        : Quaternion.LookRotation(dirVector); //get our look rot to the target

    //UPDATE ORIENTATION CUBE POS & ROT
    transform.SetPositionAndRotation(rootBP.position, lookRot);
}

每一个episode（回合）开始时执行的方法OnEpisodeBegin：

public override void OnEpisodeBegin()
{
    
    
    // 重新初始化各个身躯体的参数
    foreach (var bodyPart in m_JdController.bodyPartsDict.Values)
    {
    
    
        bodyPart.Reset(bodyPart);
    }
    // 随机旋转，使得智能体初始时面朝的方向随机
    hips.rotation = Quaternion.Euler(0, Random.Range(0.0f, 360.0f), 0);
	// 更新指向方块和指向箭头
    UpdateOrientationObjects();

    // 根据参数设置随机行走速度
    MTargetWalkingSpeed =
        randomizeWalkSpeedEachEpisode ? Random.Range(0.1f, m_maxWalkingSpeed) : MTargetWalkingSpeed;
	// 根据配置文件重新设置对应的参数（例如躯体的质量等）
    SetResetParameters();
}

最后是挂载在目标方块的脚本，只需要在机器人触碰时调用奖励函数，然后位置随机就行了：

// 每当激活时，随机位置
void OnEnable()
{
    
    
    m_startingPos = transform.position;
    if (respawnIfTouched)
    {
    
    
        MoveTargetToRandomPosition();
    }
}

// 当掉下平台时，重新随机位置
void Update()
{
    
    
    if (respawnIfFallsOffPlatform)
    {
    
    
        if (transform.position.y < m_startingPos.y - fallDistance)
        {
    
    
            Debug.Log($"{
      
      transform.name} Fell Off Platform");
            MoveTargetToRandomPosition();
        }
    }
}
// 设置随机位置的方法
public void MoveTargetToRandomPosition()
{
    
    
    var newTargetPos = m_startingPos + (Random.insideUnitSphere * spawnRadius);
    newTargetPos.y = m_startingPos.y;
    transform.position = newTargetPos;
}

当智能体触碰到目标方块时，触发对应的事件（时间可以在编辑器内添加订阅），并重置方块的位置：

private void OnCollisionEnter(Collision col)
{
    
    
    if (col.transform.CompareTag(tagToDetect))
    {
    
    
        onCollisionEnterEvent.Invoke(col);
        if (respawnIfTouched)
        {
    
    
            MoveTargetToRandomPosition();
        }
    }
}

配置文件

PPO算法：

behaviors:
  Walker:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 30000000
    time_horizon: 1000
    summary_freq: 30000

SAC算法：

behaviors:
  Walker:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 1024
      buffer_size: 2000000
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 30.0
      save_replay_buffer: false
      init_entcoef: 1.0
      reward_signal_steps_per_update: 30.0
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 15000000
    time_horizon: 1000
    summary_freq: 30000

效果演示

在这里插入图片描述

后记

本案例是继爬虫机器人和蠕虫机器人之后的人形机器人案例，相比于之前的两个案例，关节更多，训练更为艰难，但是训练出来的效果也更为可观，在机器人能够走路的基础上，我们可以把训练好的会走路的机器人作为预训练的模型，然后训练机器人做更多更好玩的事情，比如说，我们可以在这基础上训练机器人踢足球，也许我们可以把这个案例和Unity人工智能之不断自我进化的五人足球赛中的方法结合起来，最终期待能训练出一个精彩的5V5仿人的机器人踢的足球赛！