Real Time pose estimation of a textured object (solvePnP)

Nowadays, augmented reality is one of the top research topics in the computer vision and robotics fields. The most elemental problem in augmented reality is estimating the camera pose with respect to an object: in computer vision, to do some 3D rendering afterwards; in robotics, to obtain the object pose in order to grasp it and do some manipulation. However, this is not a trivial problem to solve, because the most common issue in image processing is the computational cost of the many algorithms and mathematical operations needed to solve a problem that is basic and immediate for humans.

Goal

This tutorial explains how to build a real-time application that estimates the camera pose in order to track a textured object with six degrees of freedom, given a 2D image and its 3D textured model.

The application will have the following parts:

  • Read 3D textured object model and object mesh.
  • Take input from Camera or Video.
  • Extract ORB features and descriptors from the scene.
  • Match scene descriptors with model descriptors using Flann matcher.
  • Pose estimation using PnP + Ransac.
  • Linear Kalman Filter for bad poses rejection.

Theory

In computer vision, estimating the camera pose from n 3D-to-2D point correspondences is a fundamental and well-understood problem. The most general version of the problem requires estimating the six degrees of freedom of the pose and five calibration parameters: focal length, principal point, aspect ratio and skew. It can be solved with a minimum of 6 correspondences, using the well-known Direct Linear Transform (DLT) algorithm. There are, though, several simplifications to the problem which turn into an extensive list of different algorithms that improve the accuracy of the DLT.

The most common simplification is to assume known calibration parameters; this is the so-called Perspective-n-Point problem:

Perspective-n-Point problem scheme

Problem Formulation: Given a set of correspondences between 3D points p_i expressed in a world reference frame, and their 2D projections u_i onto the image, we seek to retrieve the pose (R and t) of the camera w.r.t. the world.

OpenCV provides four different approaches to solve the Perspective-n-Point problem which return R and t. Then, using the following formula it’s possible to project 3D points into the image plane:

s\ \left [ \begin{matrix}   u \\   v \\  1 \end{matrix} \right ] = \left [ \begin{matrix}   f_x & 0 & c_x \\  0 & f_y & c_y \\   0 & 0 & 1 \end{matrix} \right ] \left [ \begin{matrix}  r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\  r_{31} & r_{32} & r_{33} & t_3 \end{matrix} \right ] \left [ \begin{matrix}  X \\  Y \\   Z\\ 1 \end{matrix} \right ]

The complete documentation on how to work with these equations can be found in Camera Calibration and 3D Reconstruction.
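The projection formula above can be checked numerically with a few lines of plain numpy; the intrinsics, pose and 3D point below are arbitrary example values, with an identity rotation for simplicity.

```python
import numpy as np

# Intrinsic matrix K built from example focal lengths and principal point.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics [R|t]: identity rotation, camera 4 units away along Z.
R = np.eye(3)
t = np.array([[0.0], [0.0], [4.0]])
Rt = np.hstack([R, t])                 # 3x4 matrix [R|t]

# Homogeneous 3D point [X, Y, Z, 1]^T in the world frame.
X = np.array([1.0, 0.5, 0.0, 1.0])

# s * [u, v, 1]^T = K [R|t] [X, Y, Z, 1]^T
suv = K @ Rt @ X
u, v = suv[0] / suv[2], suv[1] / suv[2]   # u = 520.0, v = 340.0
```

Dividing by the third component removes the scale factor s, which here equals the point's depth Z = 4 in the camera frame.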

Source code

You can find the source code of this tutorial in the samples/cpp/tutorial_code/calib3d/real_time_pose_estimation/ folder of the OpenCV source library.

https://docs.opencv.org/3.0-beta/doc/tutorials/calib3d/real_time_pose/real_time_pose.html#realtimeposeestimation

Reposted from blog.csdn.net/hankerbit/article/details/82020414