Robot and Computer Vision Hands-On Series (4): Moving Object Detection and Camera Motion State Estimation Based on Differences Between Consecutive Frames

This is the fourth part of the robot and computer vision hands-on series. It introduces how to detect moving objects and estimate the camera's motion state from the difference between consecutive image frames. This is a simple and effective method that can be used in scenarios such as autonomous driving, drones, and video surveillance to extract dynamic information from images and determine whether the camera is stationary, translating, or rotating.
Moving object detection example

Basic idea and core steps

The basic idea of moving object detection and camera motion state estimation is to use the optical flow between two consecutive frames to compute a motion vector for each pixel, then split the image into static and dynamic regions according to the magnitude and direction of those vectors, and finally infer the camera's motion mode. Specifically, there are the following steps (a minimal code sketch follows the list):

  1. Calculate optical flow. Optical flow is the per-pixel displacement between consecutive frames over time; it can be computed with classic algorithms such as the Lucas-Kanade method and the Horn-Schunck method. Here we use OpenCV's cv2.calcOpticalFlowFarneback() to compute dense optical flow and obtain the horizontal and vertical motion of each pixel.
  2. Compute motion vectors. Each pixel's motion vector has horizontal and vertical components u and v, and its magnitude is m = (u^2 + v^2)^0.5. The magnitude reflects the motion speed of the pixel: the larger it is, the faster the pixel moves.
  3. Divide the image into static and dynamic regions. The static region is the part of the image that does not change significantly, such as the background or the ground; the dynamic region is the part that changes significantly, such as pedestrians or vehicles. We can use a threshold T to decide whether each pixel belongs to the static or the dynamic region: if m < T it is static, otherwise it is dynamic. The threshold T can be tuned to the actual scene and is usually a small value such as 5 or 10.
  4. Estimate the camera motion state. The camera motion state describes how the camera moves in space; here we distinguish three cases: static, translation, and rotation. We can use the following rules to judge the camera's motion state:
  • If all pixels belong to the static area, the camera is in a static state;
  • If all pixels belong to the dynamic area, the camera is in a translation state;
  • If some pixels belong to the static area and some pixels belong to the dynamic area, the camera is in a rotating state.
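
Below is a minimal sketch of these four steps built on OpenCV's dense optical flow. The 5-pixel threshold and the 5%/95% dynamic-pixel ratios used to approximate "all pixels are static/dynamic" are illustrative assumptions, not values prescribed by the method itself.

import cv2
import numpy as np

def estimate_camera_state(prev_gray, gray, thresh=5.0, low=0.05, high=0.95):
    # Step 1: dense optical flow between the two grayscale frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Step 2: per-pixel motion magnitude m = (u^2 + v^2)^0.5
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Step 3: pixels with magnitude below the threshold count as static
    dynamic_ratio = np.count_nonzero(mag >= thresh) / mag.size
    # Step 4: crude camera-state decision following the rules above
    if dynamic_ratio < low:
        return "static"
    elif dynamic_ratio > high:
        return "translation"
    else:
        return "rotation"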

Advantages and disadvantages

The advantage of this method is that it is simple and easy to implement, and requires no complex models or training data; the disadvantage is that it depends on accurate optical flow and is easily affected by noise, occlusion, and illumination changes.

Example code

Detect the position and contour of moving objects

# Moving object detection based on the difference between consecutive frames, using Python and OpenCV
# First, import the required modules: opencv and numpy
import cv2
import numpy as np

# Create a video capture object; 0 means the default camera, or pass a video file path instead
camera = cv2.VideoCapture(0)

# Check whether the camera opened successfully; exit if it did not
if not camera.isOpened():
    print("Camera is not opened")
    exit()

# Get the width and height of the video and print them
width = int(camera.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(camera.get(cv2.CAP_PROP_FRAME_HEIGHT))
print("Video width:", width)
print("Video height:", height)

# Create a structuring element for morphological operations: an ellipse of size 9x4
es = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 4))

# Variable holding the background image, initially None
background = None

# Loop over the video frames
while True:
    # Grab one frame from the camera; the first return value says whether the read
    # succeeded, the second is the image matrix
    ret, frame = camera.read()

    # Stop the loop if the read failed
    if not ret:
        break

    # Preprocess the frame: convert to grayscale, denoise with a Gaussian blur, and downscale
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    gray = cv2.resize(gray, (width // 4, height // 4))

    # If the background image is empty, use the current frame as the background
    if background is None:
        background = gray
        continue

    # Compute the difference between the current frame and the background,
    # then threshold it to obtain a binary image
    diff = cv2.absdiff(background, gray)
    diff = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)[1]

    # Dilate the binary image to fill holes and gaps, then find its contours
    diff = cv2.dilate(diff, es, iterations=2)
    contours, hierarchy = cv2.findContours(diff.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Go through the contours and filter out regions that are too small
    for c in contours:
        if cv2.contourArea(c) < 1500:
            continue

        # Compute the bounding rectangle of the contour and draw a green box on the original frame
        x, y, w, h = cv2.boundingRect(c)
        x *= 4  # map the coordinates back to the original resolution
        y *= 4
        w *= 4
        h *= 4
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # Show the original frame and the difference image
    cv2.imshow("frame", frame)
    cv2.imshow("diff", diff)

    # Press "q" to quit
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

# Release the camera and close all windows
camera.release()
cv2.destroyAllWindows()

A Python and OpenCV code example that estimates the camera motion state from the difference between consecutive image frames

Estimating the camera motion state is an important problem in computer vision and can be used in applications such as video stabilization, 3D reconstruction, and augmented reality. A commonly used method is to calculate the translation and rotation of the camera from the difference between consecutive image frames, also known as optical flow. This section introduces how to implement this method with Python and OpenCV and gives a code example.

First, we read two frames from the video and convert them to grayscale. Then we use OpenCV's cv2.calcOpticalFlowFarneback() to compute the dense optical flow between the two frames. The function returns an array of the same size as the image, in which each element is the displacement vector of the corresponding pixel between the two frames. We can then use cv2.cartToPolar() to convert these vectors to polar coordinates and obtain the displacement magnitude and direction of each pixel.

Next, we need to estimate the translation and rotation of the camera from the magnitude and direction of the optical flow. A simple method is to average the displacement magnitude and direction of all pixels to obtain a global displacement vector and rotation angle. However, this method is easily affected by noise and outliers, leading to inaccurate estimates. Therefore, we can adopt a more robust approach: the RANSAC (Random Sample Consensus) algorithm. Its basic idea is to randomly select a small subset of the data points, fit a model to them, and compute the error between the model and all data points; points whose error is below a threshold are added to the inlier set. This process is repeated many times, and the model supported by the most inliers is kept as the final result.

In this example, we use the RANSAC algorithm to fit an affine transformation matrix that represents the camera's translation and rotation in the two-dimensional image plane. We randomly sample two pixels and their corresponding optical flow vectors, compute the affine transformation matrix from these two pairs of points, and use this matrix to transform the positions of all pixels. We then compute the Euclidean distance between each transformed position and the position given by its optical flow vector, and treat points whose distance is below a threshold (e.g., 5 pixels) as inliers. We repeat this process 100 times and choose the affine transformation matrix with the largest number of inliers as the final result. Finally, we extract the translation vector and rotation angle from the affine transformation matrix and output them to the screen.
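
The following is a minimal sketch of this affine-RANSAC idea. Instead of hand-rolling the sampling loop, it uses OpenCV's built-in cv2.estimateAffinePartial2D(), which runs RANSAC internally; the grid step and the 5-pixel threshold are illustrative choices, and the complete feature-based example further below uses a different, homography-based pipeline.

import cv2
import numpy as np

def estimate_planar_motion(prev_gray, gray, step=16):
    # Dense optical flow between the two frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Sample a sparse grid of pixel positions and their flow-displaced counterparts
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    dst = src + flow[ys.ravel(), xs.ravel()].astype(np.float32)

    # RANSAC-fit a 2D rotation + translation (partial affine); points farther than
    # 5 pixels from the fitted model are treated as outliers
    M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                             ransacReprojThreshold=5.0)
    if M is None:
        return None

    # Extract the rotation angle and translation vector from the 2x3 matrix
    angle = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
    tx, ty = M[0, 2], M[1, 2]
    return angle, (tx, ty)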

Here is a complete code example:

import cv2
import numpy as np

# Open the video file
cap = cv2.VideoCapture("video.mp4")

# Read the first frame and convert it to grayscale
ret, frame1 = cap.read()
gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)

# Camera intrinsic matrix; modify according to your actual camera
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])

# Create the feature detector and matcher once, outside the loop
orb = cv2.ORB_create()
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

# Loop over the remaining frames
while True:
    # Read the next frame
    ret, frame2 = cap.read()
    if not ret:
        break
    # Convert it to grayscale
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    # Detect corner features and extract descriptors
    kp1, des1 = orb.detectAndCompute(gray1, None)
    kp2, des2 = orb.detectAndCompute(gray2, None)
    if des1 is None or des2 is None:
        gray1 = gray2
        continue

    # Match features between the two frames and keep the best matches
    matches = bf.match(des1, des2)
    matches = sorted(matches, key=lambda x: x.distance)
    good_matches = matches[:20]
    if len(good_matches) < 4:  # findHomography needs at least 4 point pairs
        gray1 = gray2
        continue

    # Estimate the homography matrix between the two frames with RANSAC
    src_pts = np.float32([kp1[m.queryIdx].pt for m in good_matches]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in good_matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC)
    if H is None:
        gray1 = gray2
        continue

    # Decompose the homography into candidate rotation matrices and translation vectors.
    # decomposeHomographyMat returns up to four solutions; we take the first one as a
    # rough estimate here (a real system would disambiguate the candidates).
    num, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
    R, t = Rs[0], ts[0]
    print("R =", R)
    print("t =", t.ravel())

    # Based on the rotation matrix and translation vector, judge whether the camera
    # moved forward, backward, left, or right (see the next section)

    # The current frame becomes the reference frame for the next iteration
    gray1 = gray2

cap.release()


According to the rotation matrix and translation vector, determine whether the camera is moving forward, backward, left, or right

In this article, we will introduce how to use the opencv python library to determine whether the camera is moving forward, backward, left, or right based on the rotation matrix and translation vector. This is a useful technique that can be used for camera pose estimation in computer vision and robotics.

First, we need to understand what a rotation matrix and a translation vector are. A rotation matrix is a 3x3 matrix that represents the rotation of one coordinate system relative to another. A translation vector is a 3x1 vector that represents the translation of one coordinate system relative to another. For example, if we have two camera coordinate systems C1 and C2, then the rotation matrix R and the translation vector t can represent the transformation of C2 relative to C1:

C2 = R * C1 + t

where * denotes matrix multiplication. This means, if we know the coordinates of a point P in C1, then we can calculate its coordinates in C2 by the following formula:

P’ = R * P + t

where P' is the coordinate of P in C2. Conversely, if we know the coordinates of P' in C2, then we can calculate its coordinates in C1 by the following formula:

P = R^(-1) * (P’ - t)

where R^(-1) is the inverse matrix of R. Note that it is assumed here that the origins of the two camera coordinate systems are at the optical center of the camera, and the axes of the two camera coordinate systems are right-handed.
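
As a small numeric check of these two formulas (the rotation, translation, and point P below are made-up example values):

import numpy as np

# Example pose: 30-degree rotation about the z-axis plus a translation (made-up values)
theta = np.radians(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, 0.0, 1.0])

P = np.array([1.0, 2.0, 3.0])               # coordinates of a point in C1
P_prime = R @ P + t                          # its coordinates in C2: P' = R * P + t
P_back = np.linalg.inv(R) @ (P_prime - t)    # back to C1: P = R^(-1) * (P' - t)
print(P_prime, P_back)                       # P_back recovers the original P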

So, how do we judge whether the camera is moving forward, backward, left, or right from R and t? A simple approach is this: assuming the camera's position in C1 is O(0, 0, 0), its position in C2 is O'(t_x, t_y, t_z), where t_x, t_y, and t_z are the three components of t. We can then judge the camera's moving direction from the position of O' in C2 (a small sketch follows the list below):

  • If t_z > 0, then the camera is moving forward.
  • If t_z < 0, then the camera is moving backwards.
  • If t_x > 0, then the camera is moving to the right.
  • If t_x < 0, then the camera is moving to the left.
  • If t_y > 0, then the camera is moving up.
  • If t_y < 0, then the camera is moving down.
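
Here is a minimal sketch of these rules, assuming t is the translation vector recovered by the pose estimation above. Reporting only the dominant component and using a small dead-band for "no significant motion" are illustrative simplifications of the bullet rules.

import numpy as np

def classify_direction(t, eps=1e-3):
    # t is the 3x1 translation vector from the pose estimate;
    # the component with the largest magnitude decides the reported direction
    tx, ty, tz = np.ravel(t)
    axis = int(np.argmax(np.abs([tx, ty, tz])))
    value = [tx, ty, tz][axis]
    if abs(value) < eps:
        return "static"
    if axis == 2:
        return "forward" if value > 0 else "backward"
    if axis == 0:
        return "right" if value > 0 else "left"
    return "up" if value > 0 else "down"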

Of course, this method can only give a rough direction, and cannot accurately describe the trajectory of the camera. If we want to describe the motion trajectory of the camera more accurately, we need to consider the influence of the rotation matrix R on the translation direction. For example, if R represents a transformation that rotates 90 degrees counterclockwise about the y-axis, then the original x-axis becomes the z-axis, and the original z-axis becomes the -x-axis. At this time, we cannot directly judge whether the camera is moving forward or backward, left or right based on t_x and t_z.

We will continue to practice robotics and computer vision together in the next tutorial. You can follow me or subscribe to my column.
