3D gesture tracking based on MediaPipeUnityPlugin

(Demo video: VID_20230421_005228)

Brief description

At present there are several kinds of gesture tracking that are fairly easy to use, but many of them are AR-based and require dedicated hardware.

Google's MediaPipe has a relatively low barrier to entry and can run with an ordinary computer camera.
Meanwhile, MediaPipeUnityPlugin has already ported it into Unity for us.

But MediaPipe is based on image recognition: when my hand moves forward and backward, it merely appears larger or smaller in the image. The data we get lies on a single plane, and without hardware support there is no real depth information.
Fortunately, even on a flat image, the mighty Google at least gives us the relative depth of the bones.

So, what this article does, as shown in the demo video above, is simulate the depth of the shot so that our hand can really move in three-dimensional space.

If you don't know MediaPipeUnityPlugin yet, first read my previous article on the basic use of MediaPipeUnityPlugin.
If you already know MediaPipeUnityPlugin's HandTracking, you can skip directly to section 5, Get coordinates.


References

Assets\MediaPipeUnity\Samples\Scenes\Hand Tracking\Hand Tracking.unity

Although the plug-in already provides a complete sample, since we want to add extra functionality on top of it, we will still imitate it and build our own Graph and Solution, removing unused content to simplify the code.

1. Declare a MyGraph

public class MyHandTrackingGraph : GraphRunner
{
    public override void StartRun(ImageSource imageSource)
    {
        // Start the output streams and the CalculatorGraph (StartRun) here
    }

    protected override IList<WaitForResult> RequestDependentAssets()
    {
        // Load the data files we use here
    }
}
  • First, declare our output streams and their stream names

For the output streams we use the OutputStream<TPacket, TValue> provided by the plug-in.
Hand tracking has 6 output streams:
palm detection, the rectangular region based on palm detection, the hand landmark coordinates, the hand landmark world coordinates, the rectangular region based on the landmarks, and left/right hand detection (handedness).

They are as follows:

// Palm detection
OutputStream<DetectionVectorPacket, List<Detection>> _palmDetectionsStream;
const string _PalmDetectionsStreamName = "palm_detections";
// Rectangular region based on palm detection
OutputStream<NormalizedRectVectorPacket, List<NormalizedRect>> _handRectsFromPalmDetectionsStream;
const string _HandRectsFromPalmDetectionsStreamName = "hand_rects_from_palm_detections";
// Hand landmark coordinates
OutputStream<NormalizedLandmarkListVectorPacket, List<NormalizedLandmarkList>> _handLandmarksStream;
const string _HandLandmarksStreamName = "hand_landmarks";
// Hand landmark world coordinates
OutputStream<LandmarkListVectorPacket, List<LandmarkList>> _handWorldLandmarksStream;
const string _HandWorldLandmarksStreamName = "hand_world_landmarks";
// Rectangular region based on the landmarks
OutputStream<NormalizedRectVectorPacket, List<NormalizedRect>> _handRectsFromLandmarksStream;
const string _HandRectsFromLandmarksStreamName = "hand_rects_from_landmarks";
// Left/right hand detection (handedness)
OutputStream<ClassificationListVectorPacket, List<ClassificationList>> _handednessStream;
const string _HandednessStreamName = "handedness";

and the name of our input stream

const string _InputStreamName = "input_video";

This time we only use the hand landmark coordinates, so we only need

const string _InputStreamName = "input_video";

OutputStream<NormalizedLandmarkListVectorPacket, List<NormalizedLandmarkList>> _handLandmarksStream;
const string _HandLandmarksStreamName = "hand_landmarks";
  • Then initialize and configure it in ConfigureCalculatorGraph()
protected override Status ConfigureCalculatorGraph(CalculatorGraphConfig config)
{
    _handLandmarksStream = new OutputStream<NormalizedLandmarkListVectorPacket, List<NormalizedLandmarkList>>(
        calculatorGraph, _HandLandmarksStreamName, config.AddPacketPresenceCalculator(_HandLandmarksStreamName), timeoutMicrosec);

    return base.ConfigureCalculatorGraph(config);
}

The constructor of OutputStream is as follows

/// <summary>
///   Instantiates an OutputStream class.
///   The graph must have a PacketPresenceCalculator node, used to compute whether the stream has output.
/// </summary>
/// <remarks>
///   This is useful when you want to fetch the output synchronously, but do not want to block the thread while waiting for it.
/// </remarks>
/// <param name="calculatorGraph"> The owner of the stream </param>
/// <param name="streamName"> The name of the output stream </param>
/// <param name="presenceStreamName"> The name of the stream that outputs true when the output is present </param>
/// <param name="timeoutMicrosec"> If the output packet is empty, the OutputStream instance drops packets until the time specified here has elapsed </param>
public OutputStream(CalculatorGraph calculatorGraph, string streamName, string presenceStreamName, long timeoutMicrosec = 0) : this(calculatorGraph, streamName, false, timeoutMicrosec)
{
    this.presenceStreamName = presenceStreamName;
}
  • Then define an event for asynchronous output listening and an interface for fetching the synchronous output
public event EventHandler<OutputEventArgs<List<NormalizedLandmarkList>>> OnHandLandmarksOutput
{
    add => _handLandmarksStream.AddListener(value);
    remove => _handLandmarksStream.RemoveListener(value);
}

public bool TryGetNext(out List<NormalizedLandmarkList> handLandmarks, bool allowBlock = true)
{
    var currentTimestampMicrosec = GetCurrentTimestampMicrosec();
    return TryGetNext(_handLandmarksStream, out handLandmarks, allowBlock, currentTimestampMicrosec);
}
  • Start and release in StartRun() and Stop()
public override void StartRun(ImageSource imageSource)
{
    _handLandmarksStream.StartPolling().AssertOk();
    calculatorGraph.StartRun().AssertOk();
}

public override void Stop()
{
    _handLandmarksStream?.Close();
    _handLandmarksStream = null;
    base.Stop();
}
  • Then define an input interface for passing the input source to the input stream
public void AddTextureFrameToInputStream(TextureFrame textureFrame)
{
    AddTextureFrameToInputStream(_InputStreamName, textureFrame);
}
  • So far, the input and output have been written, and our Graph should look like this
public class MyHandTrackingGraph : GraphRunner
{
    const string _InputStreamName = "input_video";

    OutputStream<NormalizedLandmarkListVectorPacket, List<NormalizedLandmarkList>> _handLandmarksStream;
    const string _HandLandmarksStreamName = "hand_landmarks";

    public event EventHandler<OutputEventArgs<List<NormalizedLandmarkList>>> OnHandLandmarksOutput
    {
        add => _handLandmarksStream.AddListener(value);
        remove => _handLandmarksStream.RemoveListener(value);
    }

    public bool TryGetNext(out List<NormalizedLandmarkList> handLandmarks, bool allowBlock = true)
    {
        var currentTimestampMicrosec = GetCurrentTimestampMicrosec();
        return TryGetNext(_handLandmarksStream, out handLandmarks, allowBlock, currentTimestampMicrosec);
    }

    public override void StartRun(ImageSource imageSource)
    {
        _handLandmarksStream.StartPolling().AssertOk();
        calculatorGraph.StartRun().AssertOk();
    }

    public override void Stop()
    {
        _handLandmarksStream?.Close();
        _handLandmarksStream = null;
        base.Stop();
    }

    public void AddTextureFrameToInputStream(TextureFrame textureFrame)
    {
        AddTextureFrameToInputStream(_InputStreamName, textureFrame);
    }

    protected override Status ConfigureCalculatorGraph(CalculatorGraphConfig config)
    {
        _handLandmarksStream = new OutputStream<NormalizedLandmarkListVectorPacket, List<NormalizedLandmarkList>>(
            calculatorGraph, _HandLandmarksStreamName, config.AddPacketPresenceCalculator(_HandLandmarksStreamName), timeoutMicrosec);

        return base.ConfigureCalculatorGraph(config);
    }

    protected override IList<WaitForResult> RequestDependentAssets()
    {
        // filled in below: loads the hand landmark model asset
    }
}

There is still RequestDependentAssets() left: it loads the data file "hand_landmark_full.bytes" used by hand landmark detection, and it finds this file through the loading method we selected in Bootstrap.

protected override IList<WaitForResult> RequestDependentAssets()
{
    return new List<WaitForResult>()
    {
        WaitForAsset("hand_landmark_full.bytes"),
    };
}
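A note on loading: WaitForAsset() resolves the file through the Asset Loader Type chosen on the Bootstrap component (the StreamingAssets loader in the sample setup, if I remember correctly), so "hand_landmark_full.bytes" must be available to whichever loader you picked; otherwise this request fails at startup.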

2. Declare a MySolution

public class MyHandTrackingSolution : ImageSourceSolution<MyHandTrackingGraph>
{
    protected override void OnStartRun()
    {
        // Bind the graph's output listeners here
    }

    protected override void AddTextureFrameToInputStream(TextureFrame textureFrame)
    {
        // Feed the graph's input interface here
    }

    protected override IEnumerator WaitForNextValue()
    {
        // Fetch the graph's synchronous output here
    }
}
  • First we declare a callback, bound to graph's OnHandLandmarksOutput
    (the graphRunner here is our MyHandTrackingGraph)
protected override void OnStartRun()
{
    graphRunner.OnHandLandmarksOutput += OnHandLandmarksOutput;
}

void OnHandLandmarksOutput(object stream, OutputEventArgs<List<NormalizedLandmarkList>> eventArgs)
{
    var handLandmarks = eventArgs.value;
    // Process the received data here
}
  • Then pass the image input to the graph
protected override void AddTextureFrameToInputStream(TextureFrame textureFrame)
{
    graphRunner.AddTextureFrameToInputStream(textureFrame);
}
  • Then get the synchronous output of the graph

Here we distinguish two synchronization modes, blocking sync and non-blocking sync, determined by the runningMode parameter (defined in ImageSourceSolution):

protected override IEnumerator WaitForNextValue()
{
    List<NormalizedLandmarkList> handLandmarks = null;
    if (runningMode == RunningMode.Sync)
    {
        graphRunner.TryGetNext(out handLandmarks, true);
    }
    else if (runningMode == RunningMode.NonBlockingSync)
    {
        yield return new WaitUntil(() => graphRunner.TryGetNext(out handLandmarks, false));
    }

    // Process the received data here
}

And that's our Solution done. How to actually use the obtained data will be covered later.


3. Scene construction

Create a new scene and add an empty node to hold the scripts; call it Main.
Attach our Graph and Solution to it and configure them.
The 3 Configs here can use the presets directly; they are located in Assets\MediaPipeUnity\Samples\Scenes\Hand Tracking.

You can see that we are still missing Bootstrap, Screen, and TextureFramePool.

Bootstrap and TextureFramePool can also be attached directly under Main.

For Screen, create a new UI/RawImage, hang the Screen component underneath it, and simply set its Rect.

Because we choose a real webcam as the input source, we also have to add a WebCamSource, which can likewise be attached under Main.
Default Width is the resolution of the camera image. Increasing it will cause lag, and the default 1280 is enough.
But our Screen would then be zoomed to 1280, so we add AutoFit to the Screen to make it fill the screen.

  • At this point, we can try to run

4. MyGraph II

A nice picture comes out, but as expected, an error is reported. Looking at the source, it happens at calculatorGraph.StartRun().AssertOk() in the Graph's StartRun.

Looking at the error again: the config we used has SidePackets, so we have to provide one
(I mentioned this when talking about CalculatorGraph in the previous article).

var sidePacket = new SidePacket();
calculatorGraph.StartRun(sidePacket).AssertOk();

How do we configure it? There is an interface:

sidePacket.Emplace("packet name", new xxxPacket());

Where do the names come from? Open hand_tracking_cpu.txt and take a look; we find our side packets there. Of course, we could also read them straight out of the error report, or simply copy them over from the sample. Among them, SetImageTransformationOptions() is an interface provided by GraphRunner that wraps the first 3 image-transformation SidePacket entries, and GraphRunner also provides a StartRun overload that takes a SidePacket:

var sidePacket = new SidePacket();
SetImageTransformationOptions(sidePacket, imageSource, true); // interface provided by GraphRunner
sidePacket.Emplace("model_complexity", new IntPacket(1));
sidePacket.Emplace("num_hands", new IntPacket(2));
//calculatorGraph.StartRun(sidePacket).AssertOk();
StartRun(sidePacket);
  • OK, now no error is reported, but the image looks mirrored, so let's flip the Screen


  • Try to reach out and get the output in the asynchronous callback

To get output asynchronously, select Async in the solution's runningMode.

void OnHandLandmarksOutput(object stream, OutputEventArgs<List<NormalizedLandmarkList>> eventArgs)
{
    if (eventArgs.value != null) Debug.Log("--- Recv ---");
}

Of course, synchronous mode also works:

protected override IEnumerator WaitForNextValue()
{
    List<NormalizedLandmarkList> handLandmarks = null;
    if (runningMode == RunningMode.Sync)
    {
        graphRunner.TryGetNext(out handLandmarks, true);
    }
    else if (runningMode == RunningMode.NonBlockingSync)
    {
        yield return new WaitUntil(() => graphRunner.TryGetNext(out handLandmarks, false));
    }

    if (handLandmarks != null) Debug.Log("--- Recv ---");
}


Very good, you can proceed to the next step


5. Get coordinates

It is known that MediaPipe hand tracking produces 21 bone nodes (landmarks) per hand.

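In case the figure does not render, the standard MediaPipe 21-point hand layout is: index 0 is the wrist, and each finger then has four points from its base to its tip. A small reference listing (the enum and member names here are just illustrative; the index values follow MediaPipe's standard hand model):

// Illustrative names; the values follow MediaPipe's 21-point hand landmark model.
public enum HandLandmarkIndex
{
    Wrist = 0,
    ThumbCmc = 1,  ThumbMcp = 2,   ThumbIp = 3,    ThumbTip = 4,
    IndexMcp = 5,  IndexPip = 6,   IndexDip = 7,   IndexTip = 8,
    MiddleMcp = 9, MiddlePip = 10, MiddleDip = 11, MiddleTip = 12,
    RingMcp = 13,  RingPip = 14,   RingDip = 15,   RingTip = 16,
    PinkyMcp = 17, PinkyPip = 18,  PinkyDip = 19,  PinkyTip = 20,
}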

First of all, what we get is a List<NormalizedLandmarkList>. Its count ranges from 0 to 2, i.e. the number of recognized hands, one NormalizedLandmarkList per hand.
Each of these has a Landmark property, which is again a collection: 21 NormalizedLandmark entries, one per bone node of the hand. Each entry has three attributes, X, Y and Z, which are obviously the (normalized) coordinates. From this we can get:

if (eventArgs.value != null)
{
    var handLandmarks = eventArgs.value;
    foreach (var normalizedLandmarkList in handLandmarks)
    {
        var landmarks = normalizedLandmarkList.Landmark;
        foreach (var landmark in landmarks)
        {
            var pos = new Vector3(landmark.X, landmark.Y, landmark.Z);
        }
    }
}

6. Visualization

We create a small sphere prefab and instantiate 21 of them at runtime, aligned to our coordinate points. While we're at it, we add a collider and a rigidbody so they can interact with scene objects (remember to turn off gravity on the rigidbody).
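If you prefer to enforce this setup from code rather than on the prefab, a minimal sketch might look like the helper below (a hypothetical method of mine, not part of the original; it assumes the prefab may or may not already carry the components):

// Hypothetical helper: instantiate one bone sphere and make sure it has the
// physics setup described above (a collider plus a rigidbody with gravity off).
GameObject CreateBoneSphere(GameObject prefab, Transform parent)
{
    var go = Instantiate(prefab, parent);

    var rb = go.GetComponent<Rigidbody>();
    if (rb == null) rb = go.AddComponent<Rigidbody>();
    rb.useGravity = false; // the hand drives the spheres, not gravity

    if (go.GetComponent<Collider>() == null) go.AddComponent<SphereCollider>();

    return go;
}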

Create a new script to control it, call it MyView

Bind our prefab and generate the spheres ahead of time in Start:

public Transform objRoot; // parent node that holds the spheres
public GameObject boneObj;
List<GameObject> m_boneObjList;

void Start()
{
    m_boneObjList = new List<GameObject>();
    while (m_boneObjList.Count < 21)
    {
        m_boneObjList.Add(Instantiate(boneObj, objRoot));
    }
}

Then we can assign the coordinates in Update

void Update()
{
    var landmarks = ... // the landmarks obtained above
    for (int i = 0; i < landmarks.Count; i++)
    {
        var mark = landmarks[i];
        var pos = new Vector3(mark.X, mark.Y, mark.Z);
        m_boneObjList[i].transform.localPosition = pos;
    }
}

Hang MyView in the scene and run it


It seems to be reversed, let's flip x and y

var pos = new Vector3(-mark.X, -mark.Y, mark.Z);


Okay, no problem. It may look a little small; we can multiply pos by a factor to enlarge it.

We now move the coordinate handling from MySolution into MyView and encapsulate it simply:

public Transform objRoot; // parent node that holds the spheres
public GameObject boneObj;
List<GameObject> m_boneObjList;

List<NormalizedLandmarkList> m_currList;
bool m_newLandMark = false;

void Start()
{
    m_boneObjList = new List<GameObject>();
    while (m_boneObjList.Count < 21)
    {
        m_boneObjList.Add(Instantiate(boneObj, objRoot));
    }
}

public void DrawLater(List<NormalizedLandmarkList> list)
{
    m_currList = list;
    m_newLandMark = true;
}

public void DrawNow(List<NormalizedLandmarkList> list)
{
    if (list.Count == 0) return;
    var landmarks = list[0].Landmark; // ignore multiple hands for now, only handle the first hand
    if (landmarks.Count <= 0) return;

    for (int i = 0; i < landmarks.Count; i++)
    {
        var mark = landmarks[i];
        var pos = new Vector3(mark.X, mark.Y, mark.Z);
        m_boneObjList[i].transform.localPosition = pos;
    }
}

void LateUpdate()
{
    if (m_newLandMark) UpdateDraw();
}

void UpdateDraw()
{
    m_newLandMark = false;
    DrawNow(m_currList);
}

Synchronous and asynchronous handling are distinguished here: the synchronous output is fed straight into DrawNow(), while the asynchronous callback goes through DrawLater(), which is processed later from LateUpdate, because Unity objects cannot be manipulated from the asynchronous callback thread.
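As a rough illustration of that wiring (the view field, and how MySolution gets hold of it, are my assumptions; the original does not show this part), the MySolution side might look something like:

// Hypothetical wiring inside MyHandTrackingSolution; `view` is an assumed
// serialized reference to the MyView component in the scene.
[SerializeField] MyView view;

void OnHandLandmarksOutput(object stream, OutputEventArgs<List<NormalizedLandmarkList>> eventArgs)
{
    // Asynchronous callback (worker thread): only hand the data over.
    if (eventArgs.value != null) view.DrawLater(eventArgs.value);
}

protected override IEnumerator WaitForNextValue()
{
    List<NormalizedLandmarkList> handLandmarks = null;
    if (runningMode == RunningMode.Sync)
    {
        graphRunner.TryGetNext(out handLandmarks, true);
    }
    else if (runningMode == RunningMode.NonBlockingSync)
    {
        yield return new WaitUntil(() => graphRunner.TryGetNext(out handLandmarks, false));
    }

    // Synchronous path (main thread): draw immediately.
    if (handLandmarks != null) view.DrawNow(handLandmarks);
}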

Having only a handful of bone points still looks unintuitive, so we use LineRenderer to add a few lines to the hand. It turns out 5 lines are enough, each following the bone points of one finger.

Let's record these bone point indices and draw the lines:

public LineRenderer[] lines;

readonly int[][] m_connections = {
    new []{ 0, 1, 2, 3, 4 },
    new []{ 0, 5, 6, 7, 8 },
    new []{ 9, 10, 11, 12 },
    new []{ 13, 14, 15, 16 },
    new []{ 0, 17, 18, 19, 20 },
};

void DrawLine()
{
    for (int i = 0; i < m_connections.Length; i++)
    {
        var connections = m_connections[i];
        var pos = new Vector3[connections.Length];
        for (int j = 0; j < connections.Length; j++)
        {
            pos[j] = m_boneObjList[connections[j]].transform.position;
        }

        lines[i].positionCount = pos.Length;
        lines[i].SetPositions(pos);
    }
}

Remember to bind the references in the Inspector, then take a look at the effect.


7. Depth simulation

After all that preparatory work, we finally arrive at the core part.

Let me explain the idea first. We mainly need to do two things:
– ① Find the depth of the hand, that is, how far it has moved toward or away from the camera
– ② Restore the "enlarged/shrunk" hand back to its original size

The implementation is simple and crude. We are not after exact values, only a rough estimate (see the worked example after this list):
1. We pick two adjacent bone points near the palm, such as [0] and [1], because the relative positions of points there are fairly stable.
2. We take their distance; the value at some initial moment is the base value [d0], and the value measured later at the current position is [d1].
3. This distance grows and shrinks with the distance to the camera, so their ratio [d0] / [d1] gives us the scaling factor.
4. That already solves ②: multiply the subsequent coordinates by this scaling factor and the original size is naturally restored.
5. With the scaling factor in hand, ① is also easy: we simply define a coefficient and multiply the scaling factor by it to obtain the depth value.
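A quick worked example with made-up numbers: suppose the base distance d0 was measured as 0.5 and, after the hand moves toward the camera, the current distance d1 is 1.0. The scaling factor is then d0 / d1 = 0.5, so the landmarks are drawn at half size to undo the apparent growth, and with a depth coefficient of 30 the hand is placed at depth 0.5 × 30 = 15, i.e. nearer than a hand at the base distance, whose depth would be 1 × 30 = 30.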

Does it sound too simple and crude? Well, there is no way around it: the data we have at hand only allows this much.
After carefully tuning the initial value and the coefficient, we can still get a quite convincing effect.

First, we define two constants as the initial distance and the depth coefficient:

static float m_baseDist = 0.5f;
static float m_depthRatio = 30f;

Then we process it as described above to get the scaling factor scale and the depth value depth:

void GetDepthByLandmark(NormalizedLandmark mark1, NormalizedLandmark mark2, out float depth, out float scale)
{
    var pos1 = new Vector3(mark1.X, mark1.Y, mark1.Z);
    var pos2 = new Vector3(mark2.X, mark2.Y, mark2.Z);
    var length = Vector3.Distance(pos1, pos2);
    scale = m_baseDist / length;
    depth = scale * m_depthRatio;
}

Then we give the obtained value directly to objRoot

GetDepthByLandmark(landmarks[0], landmarks[1], out var depth, out var scale);

var rootPos = objRoot.localPosition;
objRoot.localPosition = new Vector3(rootPos.x, rootPos.y, depth);
objRoot.localScale = new Vector3(scale, scale, scale);

At this point, our MyView should look like this

public class MyView : MonoBehaviour
{
    public Transform objRoot; // parent node that holds the spheres
    public GameObject boneObj;
    List<GameObject> m_boneObjList;

    public LineRenderer[] lines;

    List<NormalizedLandmarkList> m_currList;
    bool m_newLandMark = false;

    void Start()
    {
        m_boneObjList = new List<GameObject>();
        while (m_boneObjList.Count < 21)
        {
            m_boneObjList.Add(Instantiate(boneObj, objRoot));
        }
    }

    public void DrawLater(List<NormalizedLandmarkList> list)
    {
        m_currList = list;
        m_newLandMark = true;
    }

    public void DrawNow(List<NormalizedLandmarkList> list)
    {
        if (list.Count == 0) return;
        var landmarks = list[0].Landmark; // ignore multiple hands for now, only handle the first hand
        if (landmarks.Count <= 0) return;

        GetDepthByLandmark(landmarks[0], landmarks[1], out var depth, out var scale);
        var rootPos = objRoot.localPosition;
        objRoot.localPosition = new Vector3(rootPos.x, rootPos.y, depth);
        objRoot.localScale = new Vector3(scale, scale, scale);

        for (int i = 0; i < landmarks.Count; i++)
        {
            var mark = landmarks[i];
            var pos = new Vector3(mark.X, mark.Y, mark.Z);
            m_boneObjList[i].transform.localPosition = pos * 14; // I need a factor of 14 here so it doesn't look too small
        }
    }

    void LateUpdate()
    {
        if (m_newLandMark) UpdateDraw();
    }

    void UpdateDraw()
    {
        m_newLandMark = false;
        DrawNow(m_currList);
    }

    readonly int[][] m_connections = {
        new []{ 0, 1, 2, 3, 4 },
        new []{ 0, 5, 6, 7, 8 },
        new []{ 9, 10, 11, 12 },
        new []{ 13, 14, 15, 16 },
        new []{ 0, 17, 18, 19, 20 },
    };

    void DrawLine()
    {
        for (int i = 0; i < m_connections.Length; i++)
        {
            var connections = m_connections[i];
            var pos = new Vector3[connections.Length];
            for (int j = 0; j < connections.Length; j++)
            {
                pos[j] = m_boneObjList[connections[j]].transform.position;
            }

            lines[i].positionCount = pos.Length;
            lines[i].SetPositions(pos);
        }
    }

    static float m_baseDist = 0.5f;
    static float m_depthRatio = 30f;

    void GetDepthByLandmark(NormalizedLandmark mark1, NormalizedLandmark mark2, out float depth, out float scale)
    {
        var pos1 = new Vector3(mark1.X, mark1.Y, mark1.Z);
        var pos2 = new Vector3(mark2.X, mark2.Y, mark2.Z);
        var length = Vector3.Distance(pos1, pos2);
        scale = m_baseDist / length;
        depth = scale * m_depthRatio;
    }
}

Run it:

There we go. If it looks a bit off-center, you can also adjust the position of objRoot.


8. Add interaction

We add some objects to the scene, and give the objects we want to interact with colliders and rigidbodies.

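Any object with a collider and a rigidbody will already be pushed around by the finger spheres without extra code. If you also want to react to the touch in script, a minimal sketch of an interactable object might look like this (the class name and log message are mine, not from the original):

// Hypothetical example: attach to an interactable scene object that has a
// Collider and a Rigidbody, so we can react when a finger sphere touches it.
using UnityEngine;

public class MyInteractable : MonoBehaviour
{
    void OnCollisionEnter(Collision collision)
    {
        // The bone spheres carry rigidbodies, so they raise collision events.
        Debug.Log($"Touched by {collision.gameObject.name}");
    }
}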

So far, we have achieved the effect in the video

(Demo video: VID_20230421_005228)


So far we have only handled the single-hand case. If we need two hands, we can do the same with a second set of spheres and lines (a rough sketch follows).
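A minimal sketch of how DrawNow might be extended for two hands (this is my illustration, not from the original; it assumes m_boneObjList holds 42 spheres and omits the per-hand depth/scale handling for brevity):

// Hypothetical extension: iterate over every detected hand instead of only the first.
public void DrawNow(List<NormalizedLandmarkList> list)
{
    for (int hand = 0; hand < list.Count && hand < 2; hand++)
    {
        var landmarks = list[hand].Landmark;
        if (landmarks.Count <= 0) continue;

        for (int i = 0; i < landmarks.Count; i++)
        {
            var mark = landmarks[i];
            var pos = new Vector3(mark.X, mark.Y, mark.Z);
            // Spheres 0..20 belong to the first hand, 21..41 to the second.
            m_boneObjList[hand * 21 + i].transform.localPosition = pos * 14;
        }
    }
}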

Ideas for improvement:
If conditions allow, it would be better to bring in some hardware support, such as ARCore's Depth API, which can give us the depth at a given screen coordinate and is included in Unity's AR packages.
I am currently trying it; if it works out, I will write a follow-up article...


Original article: blog.csdn.net/EdmundRF/article/details/130126940