Quick and easy: calculating object distance with monocular vision

1. Description

        For the past few days I've been rummaging through various internet resources looking for a way to calculate the distance of an object using monocular vision. Along the way I came across deep-learning-based monocular depth estimation models and some landmark-based distance approximation methods. Given my stubbornness about finding a low-resource solution to this problem, the approach below turned out to be the only real choice.

2. Distance measurement

        Typically, depth or distance is calculated using stereo vision. Stereoscopic vision is a powerful technique, inspired by human vision, that uses binocular disparity to accurately approximate the distance of objects from the camera. However, it requires a stereo camera, which estimates depth by simultaneously capturing two slightly offset images of the scene. I lack that luxury, so I'm focusing my efforts on landmark-based distance approximation methods.
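        For reference, binocular disparity ties depth to the horizontal pixel shift of a point between the two views: depth = focal_length * baseline / disparity. Here is a minimal sketch of that relationship, with made-up camera values, just to ground the idea; it is not part of the pipeline built below.

#Stereo depth from disparity: Z = f * B / d
#All three values below are made-up examples, not measurements
focal_length_px = 700.0  # focal length in pixels
baseline_m = 0.06        # distance between the two cameras, in meters
disparity_px = 42.0      # pixel shift of the object between left and right views

depth_m = focal_length_px * baseline_m / disparity_px
print("Estimated depth:", round(depth_m, 2), "m")  # 1.0 m for these numbers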

2.1 Importing the libraries

        That's the background; without wasting another minute, let's walk through the code:

import mediapipe as mp  # pose estimation
import cv2              # OpenCV, for video I/O and drawing
import numpy as np      # numeric helpers for formatting the output

mp_pose = mp.solutions.pose
pose = mp_pose.Pose()

2.2 Detecting and drawing landmarks

        Instantiate the MediaPipe pose class mp_pose.Pose() and assign it to the variable pose.
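        As a side note, the Pose constructor also accepts a few optional parameters if you want more control. The values below are illustrative choices that match the library defaults, not requirements:

pose = mp_pose.Pose(
    static_image_mode=False,       # treat input as a video stream, not unrelated images
    model_complexity=1,            # 0, 1 or 2; higher is more accurate but slower
    min_detection_confidence=0.5,  # minimum confidence to report a person detection
    min_tracking_confidence=0.5    # minimum confidence to keep tracking across frames
)

        Next, we open the video and run the detector on each frame: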

cap = cv2.VideoCapture('distance.mp4')
while cap.isOpened():
  ret, img = cap.read()
  if not ret:  # end of video or failed read
    break

#Converting to RGB, since cv2 loads frames in BGR and MediaPipe expects RGB
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  results = pose.process(img)

#Check to see if body landmarks are being detected
  if results.pose_landmarks is not None:
    mp_drawing = mp.solutions.drawing_utils
    mp_drawing.draw_landmarks(img, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)

    cv2.imshow('ImgWindow', img)

  if cv2.waitKey(1) & 0xFF == ord('q'):
    break

cap.release()
cv2.destroyAllWindows()

        We open a video capture with OpenCV and convert each frame to RGB, since cv2 loads images in BGR format while MediaPipe expects RGB. Next, we draw the extracted landmarks on the image as a sanity check. Here are the results so far:

Fig: 02 — Landmark Detection

2.3 Extracting landmark coordinates

        The next step is to extract the landmark coordinates, which we will use as reference points for estimating distance. We will use the nose landmark as a reference; even though the body is only visible from behind, the model is still able to infer the position of the nose keypoint.

#Extracting the nose landmark
landmarks = []
for landmark in results.pose_landmarks.landmark:
    landmarks.append((landmark.x, landmark.y, landmark.z))

nose_landmark = landmarks[mp_pose.PoseLandmark.NOSE.value]
_, _, nose_z = nose_landmark

        We appended every detected landmark to a landmarks list. From there we extract nose_landmark, which holds three values for the x, y and z axes. The x and y values give the position of the point in the image plane, while the z value provides depth information relative to the other keypoints. Using this depth information, we can estimate the object's distance from the camera. Any keypoint can serve as the reference; the image below, from the official MediaPipe website, lists all keypoints with their names and numbers to help with your selection.

Fig: 03 — List of Landmarks (Credits)
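        One practical note: the x and y values are normalized to the range [0, 1] relative to the image width and height. If you ever need pixel coordinates, for example to draw your own marker, you can scale them yourself. A small sketch, reusing the img and nose_landmark variables from the snippets above:

#Convert the normalized nose coordinates to pixel coordinates
h, w = img.shape[:2]
nose_x, nose_y, nose_z = nose_landmark
nose_px = (int(nose_x * w), int(nose_y * h))
cv2.circle(img, nose_px, 5, (0, 255, 0), -1)  # draw a dot on the nose as a sanity check

        With the reference point chosen, we can now convert its z value into a distance: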

#Calculating distance from the z-axis
#Set the depth_scale to 1
depth_scale = 1

def depth_to_distance(depth_value, depth_scale):
    return -1.0 / (depth_value * depth_scale)

distance = depth_to_distance(nose_z, depth_scale)
cv2.putText(img, "Depth in unit: " + str(np.format_float_positional(distance, precision=2)),
            (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 3)
cv2.imshow('Video', img)

2.4 Measuring distance 

        The function depth_to_distance converts z-axis values into distance values. The parameter depth_value is the value obtained from the nose landmark, and depth_scale adjusts the depth value to the desired unit of measurement; this scale is normally supplied by whichever depth algorithm you use. In this case the scale is in meters, as described in the model's documentation. The constant -1.0 inverts the depth values, since the z values coming from the model are usually negative.
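        To make the inversion concrete, here is a quick worked example with a made-up z value:

nose_z = -0.5  # hypothetical z value, just for illustration
print(depth_to_distance(nose_z, depth_scale))  # -1.0 / (-0.5 * 1) = 2.0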

Fig: 04 — Distance Measurement

        The depth values above fluctuate rapidly, and there are occasional random spikes. The spikes occur because the camera is not stationary, which causes occasional glitches in keypoint detection and produces negative or very large values. To stabilize the fluctuations, we can apply an exponential moving average (EMA) filter, which smooths the values toward a more stable state. A full description of filtering techniques is beyond the scope of this article, so we'll save it for another time and another application. Below is the code that applies this filter and gives better results.

#Tweak the alpha value to suit your needs
alpha = 0.6
previous_depth = 0.0

def apply_ema_filter(current_depth):
    global previous_depth
    filtered_depth = alpha * current_depth + (1 - alpha) * previous_depth
    previous_depth = filtered_depth  # Update the previous depth value
    return filtered_depth

filtered_depth = apply_ema_filter(nose_z)
distance = depth_to_distance(filtered_depth, depth_scale)

        Here is the video result after applying the filter. The value is a bit more stable than before, but as mentioned, there are still some random negative and positive spikes due to the camera's movement. A stationary camera will give better results.
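        If the remaining spikes are a problem, one simple option, which this article won't explore further, is to gate out implausible jumps before they reach the EMA filter. Below is a rough sketch: the reject_spikes helper and its MAX_JUMP threshold are my own illustrative additions, and you would need to tune the threshold for your own footage.

#Reject z values that jump implausibly far in a single frame
MAX_JUMP = 0.3    # hypothetical threshold; tune for your own footage
last_raw_z = None

def reject_spikes(z):
    global last_raw_z
    if last_raw_z is not None and abs(z - last_raw_z) > MAX_JUMP:
        z = last_raw_z  # ignore the spike and reuse the last plausible value
    last_raw_z = z
    return z

#Use it just before the EMA filter
filtered_depth = apply_ema_filter(reject_spikes(nose_z))
distance = depth_to_distance(filtered_depth, depth_scale)

        Note that this simple gate will lag behind a subject that genuinely moves a long way between frames; it is only meant to absorb one-frame glitches.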

3. The complete code

import mediapipe as mp
import cv2
import numpy as np

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(static_image_mode=False)

#Tweak this parameter to suit your own needs
alpha = 0.6
previous_depth = 0.0

def apply_ema_filter(current_depth):
    global previous_depth
    filtered_depth = alpha * current_depth + (1 - alpha) * previous_depth
    previous_depth = filtered_depth  # Update the previous depth value
    return filtered_depth

#Play with the depth_scale value and vary it to check what suits your needs
def depth_to_distance(depth_value, depth_scale):
  return -1.0 / (depth_value * depth_scale)

#Change it to your own camera feed
cap = cv2.VideoCapture('distance.mp4')
while cap.isOpened():
  ret, frame = cap.read()
  if not ret:  # end of video or failed read
    break

#Converting to RGB for MediaPipe
  img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
  results = pose.process(img)

#Check to see if body landmarks are being detected
  if results.pose_landmarks is not None:
    landmarks = []
    for landmark in results.pose_landmarks.landmark:
      landmarks.append((landmark.x, landmark.y, landmark.z))

    nose_landmark = landmarks[mp_pose.PoseLandmark.NOSE.value]
    _, _, nose_z = nose_landmark

#Converting back to BGR for display
    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    filtered_depth = apply_ema_filter(nose_z)
    distance = depth_to_distance(filtered_depth, 1)
#Convert the distance to your own requirement
    cv2.putText(img, "Depth in unit: " + str(np.format_float_positional(distance, precision=2)),
                (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 3)
    cv2.imshow('ImgWindow', img)

  if cv2.waitKey(1) & 0xFF == ord('q'):
    break

cap.release()
cv2.destroyAllWindows()

4. Conclusion

        Although this monocular method of finding distance is an effective way to play around with the concept, dedicated stereo vision equipment is recommended for applications requiring high accuracy and precision. I hope this satisfies your monocular-vision-based distance estimation needs.

Origin blog.csdn.net/gongdiwudu/article/details/132146287