Preliminary study on 3D computer vision (3D reconstruction) - camera model + binocular system + point cloud model

These notes have finally reached 3D computer vision. This article mainly covers the basic concepts and applications of 3D computer vision; in follow-up posts we will move on to 3D reconstruction proper, hiahiahia~

Let me talk about stereo vision first. Stereo vision is a computer vision technique whose purpose is to infer the depth of every pixel in an image from two or more images. It has many application fields, including robotics, autonomous driving and drones. Stereo vision borrows the "parallax" principle of the human eyes: the left eye and the right eye observe the same real-world object from slightly different positions, and the brain uses exactly this difference between the two views to judge how far away the object is (this difference is the parallax).

1. Camera model

As usual, let's first talk about what a camera model is. The concept is actually very simple: not only a camera counts here; any tool that can capture images, or any image collector, can be described by a camera model. The point of studying the camera model is to describe how a point P in the real world is mapped to its image point P' on the imaging plane, and this mapping involves four coordinate systems.

1.1 The coordinate system in the camera model

There are 4 coordinate systems in the pinhole camera model: the world coordinate system, the camera coordinate system, the image physical coordinate system and the image pixel coordinate system.

  • World coordinate system: the absolute coordinate system of the objective three-dimensional world, also called the objective coordinate system; it gives the coordinates of objects in the real world. It is chosen according to the size and position of the objects in the scene, and its unit is a length unit
  • Camera coordinate system: the optical center of the camera is the origin, the x-axis and y-axis are parallel to the image plane, the z-axis is parallel to the optical axis, and x, y, z are mutually perpendicular; its unit is a length unit
  • Image physical coordinate system: the intersection of the principal optical axis with the image plane is the origin, the x' and y' axes are parallel to the camera's x and y axes, and its unit is a length unit
  • Image pixel coordinate system: the corner (vertex) of the image is the origin, the u and v axes are parallel to x' and y', and its unit is the pixel

These coordinate systems give us a mathematical model with which to understand the imaging process of the camera.

1.2 Conversion between four coordinate systems

World Coordinate System -> Camera Coordinate System

From the introduction above, we know that both the world coordinate system and the camera coordinate system are three-dimensional, and a rigid object in three-dimensional space has only two kinds of pose change: translation and rotation. So we use the Euclidean (rigid-body) transformation: write the original coordinates in homogeneous form and multiply them by the Euclidean transformation matrix to obtain the result (the Euclidean transformation boils down to a matrix multiplication, and homogeneous coordinates are there to make that multiplication convenient).
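
In formula form (a standard way to write it; the rotation matrix R, translation vector t and the subscripts w/c are my own notation, since the original figure is not reproduced here):

$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
=
\begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix},
\qquad R \in \mathbb{R}^{3\times 3},\; t \in \mathbb{R}^{3\times 1}
$$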


But in many cases, translation and rotation are performed more than once. Consider consecutive transformations: if we apply two Euclidean transformations to a vector a, with rotations and translations R1, t1 and R2, t2 respectively, the two can be composed into a single Euclidean transformation.
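
Writing this out (a standard result; the vectors a, b, c are my own notation):

$$
b = R_1 a + t_1, \qquad c = R_2 b + t_2 = R_2 R_1 a + (R_2 t_1 + t_2)
$$

In homogeneous form the composition is simply the product of the two 4x4 transformation matrices, which is exactly why homogeneous coordinates are convenient here.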


Camera coordinate system -> image physical coordinate system

To go from camera coordinates to image physical coordinates, we use the principle of similar triangles and write the corresponding relation as a matrix equation to get the result we need. However, camera coordinates are 3-dimensional while image physical coordinates are 2-dimensional, so we again resort to homogeneous coordinates.
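
Concretely (standard pinhole relations; f denotes the focal length, notation mine):

$$
x' = f\,\frac{X_c}{Z_c}, \qquad y' = f\,\frac{Y_c}{Z_c}
\quad\Longleftrightarrow\quad
Z_c \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
$$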


Image physical coordinate system -> image pixel coordinate system

The conversion between the image physical coordinate system and the image pixel coordinate system is really just a change of unit and a shift of origin: one uses the principal point as origin and a length unit, the other uses the upper-left corner as origin and the pixel as unit.
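
In formulas (with dx, dy the physical size of one pixel and (u0, v0) the pixel coordinates of the principal point; notation mine):

$$
u = \frac{x'}{dx} + u_0, \qquad v = \frac{y'}{dy} + v_0
\quad\Longleftrightarrow\quad
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} 1/dx & 0 & u_0 \\ 0 & 1/dy & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
$$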


Camera imaging principle

Chaining all of these coordinate-system transformations together is precisely the imaging principle of the camera model.
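
To make the chain concrete, here is a minimal numpy sketch of the whole pipeline; all the numbers (pose, focal length, pixel size, principal point, test point) are made-up illustrative values, not taken from the article:

```python
import numpy as np

# Minimal sketch of the whole chain (world -> camera -> image plane -> pixel).
def project_point(P_w, R, t, f, dx, dy, u0, v0):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    P_c = R @ P_w + t                 # world -> camera (Euclidean transform)
    x = f * P_c[0] / P_c[2]           # camera -> image plane
    y = f * P_c[1] / P_c[2]           #   (similar triangles)
    u = x / dx + u0                   # image plane -> pixel
    v = y / dy + v0
    return u, v

R = np.eye(3)                         # no rotation, for simplicity
t = np.array([0.0, 0.0, 5.0])         # world origin sits 5 length units in front of the camera
u, v = project_point(np.array([0.1, -0.05, 0.0]), R, t,
                     f=0.035, dx=1e-5, dy=1e-5, u0=320, v0=240)
print(u, v)                           # -> roughly (390.0, 205.0)
```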


1.3 Lens distortion and perspective transformation

The cause of lens distortion is that deviations in manufacturing precision and in the assembly process introduce distortion, which warps the original image.

Radial distortion is the distortion caused by the shape of the lens itself; it is divided into pincushion distortion and barrel distortion. Tangential distortion arises when the lens is not mounted parallel to the camera sensor plane (imaging plane); it is mostly caused by installation deviations when the lens is glued onto the lens module.
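
As a hedged sketch, once the camera has been calibrated, OpenCV can correct both kinds of distortion; the camera matrix and distortion coefficients below are placeholders (in practice they come from a calibration step such as cv2.calibrateCamera):

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients (k1, k2, p1, p2, k3).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.25, 0.08, 0.001, 0.001, 0.0])   # k1, k2, k3 radial; p1, p2 tangential

img = cv2.imread("distorted.jpg")                   # hypothetical input image
undistorted = cv2.undistort(img, K, dist)           # remaps pixels to undo the lens distortion
cv2.imwrite("undistorted.jpg", undistorted)
```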

Perspective transformation projects an image onto a new viewing plane (Viewing Plane), and is also called projective mapping (Projective Mapping). The affine transformation we often talk about is a special case of perspective transformation. One purpose of perspective transformation is correction: objects that are straight lines in reality may appear as slanted lines in the picture, and the perspective transformation maps them back to straight lines.

Affine transformation (Affine Transformation or Affine Map), also known as affine mapping, refers geometrically to the process in which one vector space is mapped to another by a linear transformation followed by a translation; applied to an image, it is a linear transformation plus a translation.
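
A minimal OpenCV sketch of both transforms; the corner coordinates and file name are hypothetical:

```python
import cv2
import numpy as np

# The four source corners are hypothetical; in practice they would be picked
# by hand or detected (e.g. the corners of a document or a road region).
src = np.float32([[120, 80], [480, 100], [520, 400], [90, 380]])
dst = np.float32([[0, 0], [400, 0], [400, 300], [0, 300]])

img = cv2.imread("scene.jpg")                     # hypothetical input image
M = cv2.getPerspectiveTransform(src, dst)         # 3x3 projective matrix, needs 4 point pairs
warped = cv2.warpPerspective(img, M, (400, 300))  # slanted lines become straight again

# For comparison, an affine transform is determined by only 3 point pairs:
A = cv2.getAffineTransform(src[:3], dst[:3])      # 2x3 matrix (linear part + translation)
affine = cv2.warpAffine(img, A, (400, 300))
```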

2. Binocular system

Before looking at the binocular system, let's look at the monocular system. The camera model described above is in fact the most typical monocular system: its final output is a 2D image, so it cannot measure the distance of real objects, i.e. it cannot infer the depth of each pixel in the image. That is why, taking inspiration from the human eyes, we introduce the binocular system.

Since a binocular system has two cameras, it produces two images, and from two (or more) images we can infer the depth of each pixel. There are 4 key terms in the system: the epipolar plane, the epipoles, the baseline and the epipolar lines. The binocular system also infers depth through similar-triangle relations, which connect the real-world point P, its two image points, and the distance from the point to the cameras (see the formula below).
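
For an ideal rectified stereo rig the similar-triangle argument reduces to the standard relation (f is the focal length, B the baseline, x_l and x_r the horizontal image coordinates of the same point in the left and right views; notation mine):

$$
d = x_l - x_r, \qquad Z = \frac{f \cdot B}{d}
$$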


Parallax (Disparity): for the same physical point in space, the positions of its image points in different images are different; this difference is the disparity (and the image formed by all these differences is the disparity map). Disparity is inversely proportional to our depth information (distance).
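
A minimal OpenCV sketch of computing a disparity map and turning it into depth, assuming the image pair is already rectified; the file names, matcher parameters, focal length and baseline are placeholders:

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disp = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

f, B = 700.0, 0.12                   # focal length in pixels and baseline in metres (made up)
valid = disp > 0                     # pixels where matching failed must be masked out
depth = np.zeros_like(disp)
depth[valid] = f * B / disp[valid]   # depth is inversely proportional to disparity
```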

3. Point cloud model

3.1 Introduction to Point Cloud

A three-dimensional image is a special form of information expression, characterized by data along three spatial dimensions. Compared with two-dimensional images, three-dimensional images can decouple natural objects from the background with the help of the information in the third dimension. For visual measurement, the two-dimensional appearance of an object often changes with the projection method, while its three-dimensional features are much more consistent across different measurement methods.

Unlike a photo, a "3D image" is a general term for a type of information, and that information still needs a concrete form of expression, which includes: the depth map (expressing the distance between object and camera as grayscale), the geometric model (created with CAD software), and the point cloud model (reverse-engineering equipment samples objects into point clouds). The point cloud model is the most common and most basic 3D model.

A point cloud is a massive collection of points that expresses the spatial distribution of a target and the characteristics of its surface under a common spatial reference system: after the spatial coordinates of each sampling point on the object's surface have been obtained, the resulting set of points is called a "point cloud" (Point Cloud). Content of a point cloud: combining the principles of laser measurement and photogrammetry, a point cloud can contain three-dimensional coordinates (XYZ), laser reflection intensity (Intensity) and color information (RGB).
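
As a tiny illustration of that content, a point cloud can simply be stored as an array in which each row is one sampled point (all values below are made up):

```python
import numpy as np

# Each row is one sampled point: X, Y, Z (length units), Intensity, R, G, B.
cloud = np.array([
    [1.02, 0.30, 2.15, 0.87, 200, 180, 160],
    [1.03, 0.31, 2.14, 0.91, 198, 179, 158],
    [1.05, 0.28, 2.20, 0.64, 120, 130, 140],
])
xyz, intensity, rgb = cloud[:, :3], cloud[:, 3], cloud[:, 4:]
```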

3.2 Three levels of point cloud processing

Low-level processing methods: the most representative filtering methods

  1. Filtering methods: bilateral filtering, Gaussian filtering, conditional filtering, pass-through filtering, random sample consensus (RANSAC) filtering (a small filtering sketch follows this list)
  2. Key points: ISS3D, Harris3D, NARF, SIFT3D
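
As mentioned in item 1, here is a hedged sketch of two closely related low-level operations, written with the Open3D library rather than the exact PCL filters named above; the file name and parameter values are placeholders:

```python
import open3d as o3d

# "cloud.pcd" is a placeholder file name.
pcd = o3d.io.read_point_cloud("cloud.pcd")

# Voxel down-sampling: keep one representative point per 1 cm voxel.
down = pcd.voxel_down_sample(voxel_size=0.01)

# Statistical outlier removal: drop points that lie far from their 20 nearest neighbours.
filtered, kept_idx = down.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
```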

Middle-level processing methods: feature description methods, etc.

  1. Feature description: calculation of normal and curvature, eigenvalue analysis, SHOT, PFH, FPFH, 3D Shape Context, Spin Image
  2. Segmentation and Classification

Segmentation: region growing, RANSAC line and plane extraction, globally optimized plane extraction, K-Means, Normalized Cut (context based), 3D Hough Transform (line and plane extraction), connectivity analysis

Classification: point-based classification, segmentation-based classification, deep learning-based classification (PointNet, OctNet)

High-level processing methods: registration, SLAM graph optimization, 3D reconstruction, etc.

  1. Registration: point cloud registration is divided into two stages, coarse registration (Coarse Registration) and fine registration (Fine Registration). Coarse registration is performed when the relative pose of the point clouds is completely unknown, and it provides a good initial value for fine registration. The purpose of fine registration is then to minimize the remaining spatial difference between the point clouds, starting from the coarse result (a minimal ICP sketch is given after this list)

**Registration algorithm based on exhaustive search:** traverse the entire transformation space and select the transformation that minimizes the error function, or the one that satisfies the largest number of point-pair correspondences

**Registration algorithm based on feature matching:** construct matching correspondences between the point clouds from the morphological characteristics of the measured object itself, and then use a suitable algorithm to estimate the transformation

  2. SLAM graph optimization and 3D reconstruction, etc.
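
As referenced in item 1, here is a hedged fine-registration sketch using point-to-point ICP (a standard fine-registration algorithm) with the Open3D library; the file names, the distance threshold and the identity initial guess are placeholders:

```python
import numpy as np
import open3d as o3d

# In a real pipeline the initial transform would come from coarse registration.
source = o3d.io.read_point_cloud("scan_source.pcd")
target = o3d.io.read_point_cloud("scan_target.pcd")

init = np.eye(4)       # pretend coarse registration already roughly aligned the clouds
threshold = 0.02       # maximum correspondence distance, in the unit of the clouds

result = o3d.pipelines.registration.registration_icp(
    source, target, threshold, init,
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

print(result.transformation)            # 4x4 rigid transform mapping source onto target
source.transform(result.transformation)
```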

3.3 Spin Image (3D -> 2D)

The spin image is very important: it is the most classic feature description method based on the spatial distribution of a point cloud. Its idea is to convert the point cloud distribution within a certain region into a two-dimensional spin image, and then measure the similarity between the spin images of the scene and those of the model.

3.3.1 Generation steps of the Spin Image

  1. Define an oriented point, and generate a cylindrical coordinate system whose axis passes through this oriented point along its normal
  2. Define the parameters of the spin image. Since it is a two-dimensional image with a certain size and resolution, it has a fixed number of rows and columns and a fixed 2D grid (bin) size
  3. Project the three-dimensional points inside the cylinder onto the two-dimensional spin image. This process can be pictured as the spin-image plane rotating 360 degrees around the normal vector n; the 3D points it sweeps over fall into the cells of the two-dimensional grid


  4. Calculate the intensity of each grid cell from the points that fall into it
  5. Once all the points have fallen into the spin image, the image is displayed according to the intensity of each cell (pixel). The most direct method is to count the number of points falling into each cell, but that is too crude. To get a better display, reduce the sensitivity to position, lessen the impact of noise and increase stability, Johnson proposed using bilinear interpolation to spread each point over four neighbouring pixels.


Through the above steps we obtain the value of every pixel of the spin image, i.e. the converted two-dimensional image can be displayed.
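
A minimal numpy sketch of steps 1-4 for a single oriented point; the grid size, bin size and random test points are made up, and for brevity it uses plain bin counting rather than Johnson's bilinear spreading from step 5:

```python
import numpy as np

def spin_image(points, p, n, rows=16, cols=16, bin_size=0.05):
    """points: (N, 3) neighbours; p: position of the oriented point; n: its normal."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                                     # height along the normal axis
    alpha = np.sqrt(np.maximum((d * d).sum(axis=1) - beta**2, 0.0))  # radial distance to the axis
    img, _, _ = np.histogram2d(
        beta, alpha, bins=[rows, cols],
        range=[[-rows * bin_size / 2, rows * bin_size / 2], [0, cols * bin_size]])
    return img                      # each cell holds the number of points that fell into it

# Toy usage with random neighbours (illustrative only):
pts = np.random.rand(500, 3)
si = spin_image(pts, p=np.array([0.5, 0.5, 0.5]), n=np.array([0.0, 0.0, 1.0]))
```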


3.3.2 Three key parameters of the spin image

A very important step in generating a spin image is to define the size (number of rows and columns) and the resolution (the size of a 2D grid cell) of the generated image; these correspond to the first two of the three key parameters discussed in this part.

  • Size, that is, the number of rows and columns of the spin image; generally the two values are set equal
  • Resolution, that is, the physical size represented by one cell of the 2D grid. A size close to the resolution of the original 3D mesh is most appropriate, so the average edge length of the 3D mesh is usually used as this size
  • Support angle, that is, the limit on the angle between normal vectors: the angle between the normal vector of a vertex in space and the normal vector of the oriented point chosen to create the cylindrical coordinate system

Sometimes we do not need to put every point into the two-dimensional grid. Adjusting the support angle keeps the main information while reducing the amount of subsequent computation; it is generally set to 60~90 degrees.


3.4 Introduction to 3D reconstruction

3.4.1 Development status of 3D reconstruction

3D reconstruction includes three main approaches: structure from motion (SFM), deep-learning-based depth estimation and structure reconstruction, and 3D reconstruction based on RGB-D depth cameras.

Let me introduce SFM (Structure From Motion). It is mainly based on the principles of multi-view geometry and realizes 3D reconstruction from motion, i.e. it infers three-dimensional structure from 2D images that need not form a time series. It is an important branch of computer vision and is widely applied in AR/VR, autonomous driving and other fields. Although SFM is mainly built on multi-view geometry, with the continual progress of CNNs on two-dimensional images, many CNN-based 2D depth estimation methods have achieved good results, and using CNNs to explore 3D reconstruction has become an increasingly deep research topic.

However, although deep learning methods are on the rise, enthusiasm for the traditional multi-view geometry methods has not subsided. Practical applications are still dominated by multi-view geometry; the deep learning methods are still being explored and remain some distance away from real practical use.


3.4.2 SFM and OpenMVG

Recovering 3D scene structure from 2D images is a central task of computer vision, widely used in 3D navigation, virtual games and other fields. SFM is the problem of estimating the camera parameters and 3D point positions. According to the topology with which images are added during the SfM process, SfM methods can be divided into incremental, global, hybrid, hierarchical, semantic-based and deep-learning-based SFM.

openMVG (Open Multiple View Geometry) is an open-source multi-view stereo geometry library, well known in the CV world. It believes in "simple and maintainable" code and provides a set of powerful interfaces; every module is tested and strives to offer a consistent and reliable experience.

Some classic applications implemented by OpenMVG

  • Solve the problem of precise matching of multi-view stereo geometry;

  • Provide a series of feature extraction and matching methods that SFM needs to use;

  • Complete SFM tool chain (calibration, estimation, reconstruction, surface treatment, etc.);

  • OpenMVG tries its best to provide highly readable code to facilitate secondary development. The core functionality is kept as streamlined as possible, so you may need other libraries to complete your own system

Disadvantages of SFM: although SfM has achieved remarkable results and applications in computer vision, most SfM pipelines rest on the assumption that the surrounding environment is static: the camera moves, but the target does not. When facing moving objects, the reconstruction quality of the overall system drops significantly; in other words, traditional SFM assumes the target is a static rigid body.


Copyright statement: the above learning content and pictures come from or refer to Badou Artificial Intelligence, Wang Xiaotian.
If the article is helpful to you, remember to support it with one click~
