【Tracking】Real-Time Camera Tracking: When is High Frame-Rate Best?

paper:

Real-Time Camera Tracking: When is High Frame-Rate Best?

0. Summary:

Higher frame rates promise better tracking of rapid motion, yet advanced real-time vision systems rarely exceed the standard 10-60Hz range because the required computation is considered too great. In fact, in trackers that take advantage of prediction, increasing the frame rate reduces the computational cost per frame. Furthermore, when the physics of image formation are taken into account, a higher frame rate imposes a lower upper limit on shutter time, giving reduced motion blur but increased noise. Taking these factors into consideration, how should application-relevant performance requirements (accuracy, robustness and computational cost) be optimized as the frame rate changes? Taking 3D camera tracking as our test problem, we open an avenue of systematic investigation by carefully synthesizing realistic videos based on ray tracing of detailed 3D scenes, experimentally obtained photometric response and noise models, and fast camera motions. Our multi-frame-rate, multi-resolution, multi-light-level dataset is based on tens of thousands of hours of CPU rendering time. Our experiments lead to quantitative conclusions about frame-rate selection and highlight the critical role of fully considering physical image formation in tracking performance.

1 Introduction

High frame rate footage of bursting balloons and similar phenomena is impressive because it offers the chance to observe very fast motion. Researchers at the University of Tokyo demonstrated the potential of real-time tracking at extremely high frame rates in earlier studies (e.g. [1]). They developed a custom vision chip that runs at 1000Hz and, using simple algorithms (related to those used in optical mice), tracked balls as they were thrown, bounced and shaken, coupling this with slave control of a robotic camera platform and manipulator. While there are now commercial cameras, and even cameras on common mobile devices, that can provide video above 100Hz, advanced real-time visual tracking algorithms rarely exceed 60Hz.

This paper experimentally analyzes when and to what extent an increase in tracking frame rate is beneficial given these trade-offs, providing a rigorous procedure for quantifying the advantages of raising it. Our first contribution is a framework for rendering realistic videos of agile motion in realistically lit 3D scenes, with fully controllable settings and complete ground truth, enabling systematic investigation of the factors that strongly affect tracking performance. By accurately modeling physical image formation, and using parameters taken from real cameras and usage scenarios, we generated a large set of realistic video sequences. Our second contribution is a set of experiments on these data: we analyze the performance of a fundamental dense alignment method, investigate the limits of camera tracking and how they depend on frame rate, and give quantitative results on frame-rate optimization.

1.1 Models, Predictions and Active Processing
Our scope is real-time model-based tracking, i.e. the continuous estimation of parameters describing motion, required "in the loop" for driving applications in fields such as robotics or human-computer interaction (see Figure 1(a)). In particular, we focus on tracking camera ego-motion through known, mostly rigid scenes. The 3D scene model required for this tracking can come from prior modeling, or, in a full SLAM system, from an interleaved reconstruction process as in PTAM [2] or DTAM [3]. The fact that in these and other state-of-the-art SLAM systems reconstruction (often called map building) operates as a separable, lower-frequency process leads us to believe that, even in SLAM, the frame-rate properties of tracking can be analyzed separately from the reconstruction process that generates the tracking model.
Figure 1. a) Real-time visual tracking applications and the frequency of the control signals they require. b) Experimental tracking configuration and evaluation. In each frame of our synthetic sequences, after tracking, we record the translational distance between the estimated pose T̂_t,w and the ground truth T_t,w; the average of these distances forms our accuracy measure. We then use the ground-truth pose as the starting point for tracking the next frame.

A model consists of a geometric part, ranging from abstract point and line features to non-parametric depth maps, meshes or implicit surfaces, and a photometric description used to establish image correspondence, such as local descriptors or full RGB texture. For some types of model it has been demonstrated that real-time "tracking as detection" can be achieved with efficient keypoint matching methods (such as [4]). However, while such methods play an important role in vision applications for relocalization after tracking loss, they do not benefit from the main advantage that increasing the frame rate offers to the system's main tracking loop: ever better frame-to-frame prediction, enabling guided "active" processing.

Many trackers based on keypoint or edge feature matching implement explicit prediction and achieve real-time efficiency by performing exhaustive search only over small, probabilistically bounded windows (e.g. [5]). Dense alignment methods do not use features but aim to find a correspondence between every pixel of the input video frame and a textured 3D model; initialized from a better starting point, they require fewer iterations to optimize their cost function. In camera tracking, modern parallel processing resources now allow dense alignment to be more accurate and robust than feature-based methods (e.g. [6], [3]); and since the performance of multi-view stereo techniques is already very good, we firmly believe that in vision problems involving moving cameras a dense surface scene model can be assumed by default (especially given the option of augmenting monocular video with commodity depth camera sensors, e.g. [7]).

1.2 Tracking via dense image-to-model alignment
Whole-image alignment tracking [6] [8] [3] operates directly on dense models, evaluating the gradient of a similarity function and descending iteratively toward a minimum, as in the original Lucas-Kanade algorithm [9]. In 3D camera tracking, for each live video frame, the optimal alignment against a textured 3D surface model is found by minimizing a cost function with respect to the 6DOF parameters ψ of the generalized rigid camera-scene motion. As in [3], this cost function is the sum of squared image brightness differences over all pixels of the live frame with a valid view of the textured model:

F(ψ) = ½ Σ_u ( I_l( π( K T_lr(ψ) π⁻¹(u, ξ_r(u)) ) ) − I_r(u) )²,

where I_l is the current live video frame, I_r the reference image obtained by projecting the textured scene model from the last tracked pose, ξ_r the corresponding depth map, T_lr(ψ) the rigid live-to-reference transform parameterized by ψ, π the perspective projection, and K the fixed camera intrinsic matrix. This cost function is non-convex, but under the assumption of small frame-to-frame motion it can be linearized and optimized iteratively by gradient descent. The size of its convergence basin depends on the scale of the dominant structures in the image and texture model.
Tracking algorithms based on dense gradient descent implicitly implement active processing, because they require fewer iterations to converge to the minimum when started from a well-predicted initial pose. If the frame rate increases, we expect the optimization to converge faster from one frame to the next, and the probability of gross failures to drop, because inter-frame motion is reduced and the linearization at the heart of Lucas-Kanade tracking becomes increasingly valid. Beyond the growing adoption of dense tracking methods, we therefore chose such a framework for our experimental study: image alignment with iterative gradient-based optimization pursues optimal tracking performance in a direct and pure way (it aims to align all of the image data with the scene model), and its per-frame behaviour (computation, accuracy and robustness) adapts automatically. Had we chosen a feature-based approach instead, an abstraction (feature selection and description) would sit between the image data and tracking performance, and since that abstraction differs for each feature type, it would be unclear whether we were discovering fundamental properties of tracking performance or merely properties of the features. Feature algorithms also have discrete tuning parameters and thresholds; for example, beyond a certain degree of motion blur most detectors will fail to find any features with strong gradients, whereas dense tracking still operates in some quantifiable way (as demonstrated by DTAM tracking through camera defocus). Furthermore, as we will see in our experiments, there is a complex interaction between frame rate and the physical image-formation effects of blur and noise. The dense tracking framework allows these degraded images to be analyzed without changing algorithm parameters, which would likely be necessary for feature-based methods.
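To make the iterative dense-alignment idea concrete, here is a minimal sketch (our own, not the paper's DTAM implementation) of Lucas-Kanade style Gauss-Newton alignment for the simplest possible warp, a pure 2D image translation; the full system optimizes the same kind of photometric cost over the 6DOF pose ψ instead. All names are ours.

```python
import numpy as np

def dense_align_translation(I_ref, I_live, p0=(0.0, 0.0), n_iters=20, tol=1e-4):
    """Estimate a 2D translation p = (dx, dy) aligning I_live to I_ref by minimizing the
    sum of squared brightness differences with Gauss-Newton iterations (Lucas-Kanade)."""
    p = np.array(p0, dtype=np.float64)
    gy, gx = np.gradient(I_live.astype(np.float64))           # image gradients of the live frame
    ys, xs = np.mgrid[0:I_ref.shape[0], 0:I_ref.shape[1]]
    for _ in range(n_iters):
        # Warp the live image by the current translation (nearest-neighbour for brevity).
        xw = np.clip(np.round(xs + p[0]).astype(int), 0, I_live.shape[1] - 1)
        yw = np.clip(np.round(ys + p[1]).astype(int), 0, I_live.shape[0] - 1)
        r = I_live[yw, xw].astype(np.float64) - I_ref.astype(np.float64)   # photometric residuals
        J = np.stack([gx[yw, xw].ravel(), gy[yw, xw].ravel()], axis=1)     # Jacobian wrt (dx, dy)
        dp = np.linalg.solve(J.T @ J, -J.T @ r.ravel())                    # Gauss-Newton step
        p += dp
        if np.linalg.norm(dp) < tol:   # a good prediction means very few iterations are needed
            break
    return p
```

The point relevant to this paper is the stopping test: the better the predicted starting point (i.e. the higher the frame rate and the smaller the inter-frame motion), the fewer iterations are spent per frame.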

2 Experimental Evaluation of Dense 3D Tracking
The analysis in this paper is intended to answer questions of the following kind: in a tuned tracking system, a new processor appears one day with twice the computing speed of the previous model at the same cost and power consumption. We can use this new processor to improve tracking performance, but how is the algorithm best modified? Possible options include: increasing the resolution; increasing the frame rate and algorithm update frequency; or increasing other parameters, such as the number of optimization iterations. Clearly, tracking performance must first be defined (see also the work of Sturm et al.). If we consider the distribution, over a trajectory, of the distance in model parameter space between the estimated and true poses, the performance of most trackers can be captured by their average accuracy during essentially normal operation, and by a "robustness" measure of how often they stray from this regime into gross failure. When computational cost is also considered, tracking performance is described by at least three objectives, and the importance of each will vary with the specific application. The theoretically optimal performance of a tracker is therefore not a single point in objective space but a Pareto front of possible operating points. With this understanding, the question at the start of this section can be answered by sliding along the computational cost axis and weighing the possible gains in accuracy and robustness. In our experiments we analyze data for which the motion makes severe tracking failures rare; further research into tracking robustness would require large amounts of data to obtain meaningful statistics on failures. We hope to revisit this in the future, as one of the main advantages of high frame rates may be improved per-second robustness. In our results we therefore concentrate on two metrics: accuracy of normal tracking, and computational cost. Our main results are presented as bi-objective plots, with the Pareto front suggesting the application-relevant operating points a user might choose; a helper for extracting such a front is sketched below.
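As a small illustration (ours, not from the paper), the Pareto front over candidate operating points, each summarized as a (computational load, mean error) pair, can be extracted by keeping only the configurations that no cheaper configuration beats on accuracy; the numbers in the example are made up.

```python
def pareto_front(points):
    """points: list of (load, error) pairs, one per candidate configuration
    (frame rate, resolution, iteration budget, ...). Returns the non-dominated
    subset: no kept point has another point that is both cheaper and more accurate."""
    front = []
    for load, err in sorted(points):           # sweep by increasing load
        if not front or err < front[-1][1]:    # strictly better error than anything cheaper
            front.append((load, err))
    return front

# Hypothetical operating points: (processing load, mean translation error in cm).
candidates = [(0.3, 1.8), (0.5, 1.2), (0.6, 1.4), (1.0, 0.9), (1.6, 0.9)]
print(pareto_front(candidates))                # [(0.3, 1.8), (0.5, 1.2), (1.0, 0.9)]
```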
We use custom photorealistic video generation as the basis for our experiments, and there are several reasons for choosing this approach over the equivalent real-camera experiments. First, we have full control over parameters (such as a continuously variable frame rate, resolution and shutter setting, and precisely repeatable motion), which is difficult to match in a real experimental environment where real cameras offer only discrete settings. Second, we have perfect fidelity of camera motion and scene geometry, whereas a fully real setup is subject to experimental error. Perhaps most importantly, our experiments highlight the critical importance of correctly accounting for scene lighting when evaluating tracking; even a complex and expensive experimental setup, with motion capture or a robot to repeat camera motions and laser scanning to capture scene geometry, would still need controlled lighting and a calibrated photometric setup to achieve repeatability across camera settings. As detailed in the next section, we put great effort into producing experimental videos that are not only as realistic as possible but also based on real camera, motion and typical 3D scene parameters; this is emphasized in the submitted video, which shows samples from our dataset. None of this means we are uninterested in pursuing real experiments in future work. The known scene geometry allows DTAM (the dense tracking component of a dense monocular SLAM system) to be used as the tracker in our experiments. DTAM is a pyramidal GPGPU implementation of the whole-image alignment method explained in Section 1.2. It uses a conventional image pyramid ranging from a low resolution of 80×60 up to the full 640×480, with a fixed strategy for deciding when to move from one pyramid level to the next: we switch when an iteration reduces the pose error by no more than 0.0001cm. Note that this implies some knowledge of ground truth during tracking, which we believe is reasonable in an experimental evaluation, but a different convergence test would be needed in a real tracker.
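A minimal sketch of that coarse-to-fine loop and its level-switching rule, under the assumptions just stated (ground truth available, error measured as translational distance); align_step is a hypothetical placeholder for one dense-alignment iteration at a given pyramid level, and all names are ours.

```python
import numpy as np

def run_pyramid_tracking(levels, t_gt, align_step, switch_threshold=1e-4):
    """Coarse-to-fine tracking with the level-switching rule described above.
    levels: pyramid resolutions from coarse to fine, e.g. [(80, 60), (320, 240), (640, 480)].
    t_gt: ground-truth camera translation for this frame (experimental evaluation only).
    align_step(level): performs one alignment iteration at that level and returns the
    current translation estimate. We move to the next, finer level once an iteration
    improves the error by less than switch_threshold (0.0001 in the scene's length units)."""
    prev_err = np.inf
    t_est = None
    for level in levels:
        while True:
            t_est = align_step(level)
            err = float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))
            if prev_err - err < switch_threshold:
                break                      # negligible improvement: go to the finer level
            prev_err = err
    return t_est
```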
For a given camera setting, we multiply the average execution time the tracker needs to converge to its error minimum by the frame rate in use. This gives a dimensionless measure of how heavily the tracker loads the processor: a value of 1 indicates full occupancy for real-time operation, and larger values indicate that multiple processing cards would be needed (entirely feasible given the extreme parallelism of DTAM's tracking implementation) to reach that performance. To quantify accuracy, we use an error metric based simply on the average, over the whole trajectory, of the Euclidean distance between the estimated camera position t_est and the true position t_gt. See Figure 1(b).
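The two metrics reduce to a couple of lines; this sketch (our own naming) simply restates them, with camera positions given as per-frame 3D points.

```python
import numpy as np

def processing_load(mean_convergence_time_s, frame_rate_hz):
    # Dimensionless load: 1.0 means the processor is exactly saturated at real-time rate.
    return mean_convergence_time_s * frame_rate_hz

def mean_translation_error(t_est, t_gt):
    # t_est, t_gt: (N, 3) arrays of estimated / ground-truth camera positions over the trajectory.
    return float(np.mean(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt), axis=1)))

# Example: 4 ms average convergence time at 200Hz gives a load of 0.8.
print(processing_load(0.004, 200))
```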
Figure 3. Left: examples of images taken from a real camera using different exposure times (microseconds). The red rectangular boxes mark the pixels manually selected to evenly sample the scene irradiance; their brightness values are captured and used as input for CRF estimation. All images were taken with zero gain and gamma turned off. Right: camera response function (CRF) for each R, G and B color channel of our Basler piA640-210gc camera, experimentally determined using the method of [14]. This camera has a very linear CRF before saturation; image brightness can be considered proportional to irradiance over most of the range. Note that the irradiance values determined by this method are scaled and not absolute.


3.1 Calibrating the real camera model

Determining the camera response function. We use the chart-free calibration method proposed in [14]: multiple images of a static scene are taken over a known range of exposure times (see Figure 3, left), recording the changing brightness values at a number of selected pixel positions that sample the scene irradiance. Since f is monotonic and invertible, a given brightness value and exposure time can be mapped back to irradiance via the inverse CRF f⁻¹. For each pixel i at exposure time Δt_j, we take the logarithm of the noise-free image formation model: log f⁻¹(B_ij) = log E_i + log Δt_j. (4) Using measurements of 35 image pixels at 15 different exposure times, we solve for a parametric form of f⁻¹ under the L2 error norm with a second-order smoothness prior. Figure 3 (right) shows the result for our Basler piA640-210gc camera, which has a very linear CRF (gamma disabled).
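For concreteness, here is a rough sketch of such a log-domain, smoothness-regularized inverse-CRF fit (in the spirit of the classic Debevec-Malik formulation); it is our own illustration rather than the calibration code used in the paper, and all names are ours.

```python
import numpy as np

def estimate_inverse_crf(B, exposures, smooth_lambda=100.0, n_levels=256):
    """Estimate g(b) = log f^-1(b) for 8-bit brightness levels.
    B[i, j]: brightness of pixel i in the image taken with exposure time exposures[j].
    Solves g(B_ij) - log E_i = log dt_j in least squares, with a second-order
    smoothness prior on g and the scale fixed by g(128) = 0."""
    n_pix, n_exp = B.shape
    n_unknowns = n_levels + n_pix                      # g(0..255) plus log E_i per pixel
    rows = n_pix * n_exp + (n_levels - 2) + 1
    A = np.zeros((rows, n_unknowns))
    b = np.zeros(rows)
    k = 0
    for i in range(n_pix):
        for j in range(n_exp):
            A[k, int(B[i, j])] = 1.0                   # g(B_ij)
            A[k, n_levels + i] = -1.0                  # -log E_i
            b[k] = np.log(exposures[j])                # log dt_j
            k += 1
    for z in range(1, n_levels - 1):                   # smoothness: g(z-1) - 2 g(z) + g(z+1) = 0
        A[k, z - 1], A[k, z], A[k, z + 1] = smooth_lambda, -2 * smooth_lambda, smooth_lambda
        k += 1
    A[k, n_levels // 2] = 1.0                          # fix the arbitrary scale
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    return solution[:n_levels]                         # exp() of this maps brightness to scaled irradiance
```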


Noise level function calibration. To obtain the noise level function (NLF), multiple images of the same static scene were taken over the same range of exposure times. For selected pixel locations, the mean brightness and the standard deviation at each exposure setting were recorded (see Figure 4), separately for each color channel. The lower envelope of this scatter plot gives the observed NLF, which we parameterize by σ_s and σ_c. Following a model very similar to that used in [13], the theoretical standard deviation of a measured brightness B_n is
τ(B_n) = √(EΔt·σ_s² + σ_c²),
where E is the irradiance corresponding to brightness level B_n, easily obtained via the inverse CRF found above. We use the Matlab optimization toolbox, with the standard functions fmincon (under the constraints σ_s ≥ 0 and σ_c ≥ 0) and fminunc, to obtain the optimal values. The fitted NLF is shown in Figure 4, overlaid on the observed NLF for each channel.
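The same fit can also be written as a tiny least-squares problem, because the model is linear in the variances σ_s² and σ_c²; the sketch below (ours, using SciPy's non-negative least squares in place of fmincon/fminunc) assumes the E·Δt values have already been recovered through the inverse CRF.

```python
import numpy as np
from scipy.optimize import nnls

def fit_nlf(E_dt, observed_std):
    """Fit tau(B) = sqrt(E*dt*sigma_s^2 + sigma_c^2) to measured brightness standard deviations.
    E_dt: array of E*dt values (scaled irradiance times exposure) for the sampled pixels.
    observed_std: the corresponding measured standard deviations of brightness.
    Working with variances makes the model linear; nnls enforces sigma_s^2, sigma_c^2 >= 0."""
    A = np.column_stack([np.asarray(E_dt, dtype=float), np.ones(len(E_dt))])
    var = np.asarray(observed_std, dtype=float) ** 2
    (s2, c2), _ = nnls(A, var)
    return float(np.sqrt(s2)), float(np.sqrt(c2))      # sigma_s, sigma_c
```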

[Figure 4. Mean brightness versus standard deviation at each exposure setting, per color channel, with the fitted noise level function overlaid.]


3.2 Injecting a real camera trajectory
Converting from still images to video requires a camera trajectory moving through the scene. After initially trying various ways of synthesizing trajectories and finding them unsatisfactory, we decided to capture a trajectory from a real camera tracking experiment. Using DTAM [3] with a 30Hz camera, we tracked an extremely jittery handheld motion at the limit of DTAM's state-of-the-art tracking capability. The recorded poses were then transformed into POV-Ray's reference coordinate frame with a single similarity transformation, so that the synthetic scene is traversed in the same way. To obtain images at an arbitrary frame rate we interpolate the poses, using cubic interpolation for translations and spherical linear interpolation (slerp) for rotations. At our lowest experimental frame rate of 20Hz, the maximum horizontal frame-to-frame motion we observed was 260 pixels, almost half the width of the 640-pixel-wide image, indicating very fast motion.
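A minimal sketch of that resampling step, assuming the captured 30Hz poses are given as timestamps, translations and quaternions (SciPy provides both the cubic spline and the piecewise slerp); the function name and argument layout are ours.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation, Slerp

def resample_trajectory(times_s, translations, quaternions_xyzw, target_hz):
    """Resample a captured pose trajectory at an arbitrary frame rate:
    cubic interpolation for translations, slerp for rotations."""
    times_s = np.asarray(times_s, dtype=float)
    t_new = np.arange(times_s[0], times_s[-1], 1.0 / target_hz)   # stays inside the captured interval
    trans_spline = CubicSpline(times_s, np.asarray(translations, dtype=float), axis=0)
    rot_slerp = Slerp(times_s, Rotation.from_quat(quaternions_xyzw))
    return t_new, trans_spline(t_new), rot_slerp(t_new).as_quat()
```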

3.3 Rendering realistic synthetic sequences
We follow previous work such as [15] and combine the "perfect" images generated by POV-Ray with our physical camera model to produce realistic video sequences. Since we do not have an absolute scale for the irradiance values obtained from the CRF, the scene illumination cannot be defined in absolute terms; instead a new constant α represents the overall scene brightness: B = (αE)Δt. (6) In our experiments α takes the values {1, 10, 40}, with higher scene illumination giving a better image signal-to-noise ratio. The reference irradiance E of each pixel is taken from the base pixel value produced by POV-Ray.

To account for camera motion we model motion blur. For a given shutter time Δt, the N POV-Ray irradiance images E_i rendered at interpolated poses along the camera trajectory during the shutter interval are averaged, and the noise level function is then applied:
B = (αΔt/N) Σ_i E_i + n,  n ~ N(0, τ(B)²).

Finally, we quantize the resulting brightness values into a color image with 8 bits per channel. Figure 5 shows examples of similar motion at 20Hz and 100Hz.
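Putting the pieces of Section 3.3 together, a sketch of how one synthetic frame might be assembled (our own code, assuming the near-linear CRF of Figure 3 so that brightness is simply proportional to irradiance times exposure, and using the calibrated NLF parameters σ_s, σ_c):

```python
import numpy as np

def synthesize_frame(subframe_irradiance, alpha, shutter_s, sigma_s, sigma_c, rng=None):
    """subframe_irradiance: stack of 'perfect' POV-Ray images rendered at interpolated
    poses spanning one shutter interval. Averages them for motion blur (in irradiance
    space, so saturation is handled correctly), scales by the scene brightness alpha and
    shutter time, adds NLF noise, and quantizes to 8 bits."""
    rng = rng or np.random.default_rng()
    E_blur = alpha * np.mean(np.asarray(subframe_irradiance, dtype=float), axis=0)  # motion blur
    B = E_blur * shutter_s                                                          # eq. (6), linear CRF
    tau = np.sqrt(E_blur * shutter_s * sigma_s**2 + sigma_c**2)                     # noise level function
    B_noisy = B + rng.normal(size=B.shape) * tau
    return np.clip(np.round(B_noisy), 0, 255).astype(np.uint8)                      # saturate and quantize
```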

Figure 5. Synthetic photorealistic images with the shutter time set to half its maximum at each frame rate. Left: at 100Hz the image shows little blur but is dark and noisy; its brightness values have been rescaled here for display. Right: at 20Hz motion blur dominates. The inset highlights our correct handling of image saturation by computing the blur average in irradiance space.

4.1 Tracking Analysis and Results
For our first experiment we assume perfect lighting conditions and use the framework above to synthesize 5-second videos of fast camera motion at frame rates from 20 to 200Hz; to push the frame rate even further, we also synthesized 400Hz and 800Hz sequences. Although for clarity of interpretation we only show results from this sequence, we have other dataset sections with different scenes and motions, in which we find very similar behaviour of tracking performance as a function of frame rate.

Figure 6. Left: error as a function of the available computational budget for different frame rates, under perfect lighting. Points of high curvature indicate a switch from one pyramid level to another. Right: the Pareto front, marking the optimal frame rate for each available budget; each optimal frame rate is labelled with its budget value. The objectives are minimum error and minimum processing load.

4.2 Experiments under realistic lighting settings
We now extend our experiments to account for the shutter-time-dependent noise and blur artifacts modeled in Section 3.1, which affect most real-world lighting conditions. We present a set of results for different global illumination levels.

We use the same main frame-rate range of 20-200Hz as for the perfect images, and the same 5-second motion sequence. For each frame rate a shutter time must be chosen; although we carried out some preliminary experiments on shutter-time optimization (not presented here), in these results we always set the shutter time to half the frame period. To generate each synthetic video frame, we render multiple ray-traced images at interpolated camera poses for blur averaging, spaced 1.25 milliseconds apart across the chosen shutter time. To generate a 20Hz sequence with a 25ms shutter time we therefore need to render 5 × 20 × 20 = 2000 ray-traced frames. In practice this number is the same for every frame rate (higher frame rates simply use fewer sub-frames in each blur average), so in total 20000 rendered images are required (some of them duplicates).
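The arithmetic above generalizes as follows; a quick sanity-check sketch (ours) under the stated convention of a half-period shutter and 1.25ms sub-frame spacing:

```python
def render_budget(frame_rates_hz, duration_s=5.0, subframe_spacing_s=0.00125):
    """Ray-traced sub-frames needed for each frame rate, with the shutter set to half the
    frame period and sub-frames spaced 1.25 ms apart along the interpolated trajectory."""
    budget = {}
    for fps in frame_rates_hz:
        shutter_s = 0.5 / fps
        subframes_per_frame = int(round(shutter_s / subframe_spacing_s))
        budget[fps] = int(duration_s * fps) * subframes_per_frame
    return budget

print(render_budget([20, 100, 200]))   # {20: 2000, 100: 2000, 200: 2000}
```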

An important detail is that dense matching is improved by pre-blurring the reprojected model template to the level of blur expected for the shutter time in use, which we do; the depth map is blurred to the same extent. We performed some initial characterization experiments to confirm the correct functioning of our photorealistic synthesis framework, including this template blurring (Figure 7).
Figure 7. Experiments used to confirm the properties of our realistic synthetic videos.
Left: for a single motion at a 200Hz frame rate, we vary only the shutter time to confirm that lowering the illumination indeed results in a lower signal-to-noise ratio and worse tracking performance; the signal-to-noise ratio is quantified directly as the number of bits per color channel.
Right: an experiment clearly showing that tracking quality improves when an intentionally blurred prediction is matched against the blurred live image.

4.3 High Illumination
Figure 8(a) shows the Pareto front for α = 40, where the scene illumination is high but not perfect. The images have a good signal-to-noise ratio, but the higher frame-rate images are darker; the 200Hz images are almost 5 times darker than the corresponding images under perfect lighting, and as the frame rate increases the images become darker and noisier. For clarity, 400Hz and 800Hz are omitted from the figure. We observe that 200Hz at low resolution is still the best choice at low budgets: the image gradient information is still strong enough to guide matching, and again, because the inter-frame baseline is short enough to aid accurate matching, a few iterations at 200Hz suffice. Moving further up the budget, we find 160Hz to be the best option at the higher resolution of 320×240; this is where the error curves of the different frame rates cross, and increasing resolution rather than frame rate gives better results. As the processing load increases further, higher frame rates are preferred again, and the pattern repeats at 640×480. However, the transition to the highest resolution occurs later than in the perfect-sequence results of Figure 6(b): the highest resolution evidently pays off more when the signal-to-noise ratio is high.
Figure 8. Pareto front for illumination levels (a) α = 40, (b) α = 10 and (c) α = 1. The numbers on the curve represent the frame rate that can be used for a given computational budget to achieve the desired error.

4.4 Medium Light
In Figure 8(b) we reduce α to 10, which represents a typical indoor lighting environment. 200Hz is still the best choice when the processing load is very low, prediction is very powerful, and only one or two alignment iterations are needed. As the processing load increases slightly, however, the optimal frame rate shifts towards 100Hz even without a change in resolution; in contrast to the perfect-lighting and high-illumination conditions, high frame rates are worthwhile here only at very low processing loads. The reason is that at very high frame rates there is too much noise to make multiple iterations worthwhile; under medium lighting it is better to run more iterations at lower frame rates and take advantage of the improved signal-to-noise ratio. When we move to 160×120 resolution, the optimal frame rate shifts to 100-140Hz, compared with 200Hz under high illumination. The higher resolutions of 320×240 and 640×480 follow a similar trend.

4.5 Low Light
**Figure 8(c) shows a similar curve for scene illumination α = 1, where the image is very dark.** Our main observation is that the Pareto front does not include frame rates above 80Hz, even under heavy processing loads: the signal-to-noise ratio of these images is so low that essentially all tracking information is destroyed at high frame rates. The overall quality of the tracking results is also much lower here (the error curves sit higher on the scale), for the reasons we expected.

Conclusion
Our experiments provide insight into the trade-offs involved in high frame-rate tracking. **Under perfect lighting and a virtually unlimited signal-to-noise ratio, the highest accuracy is achieved by combining high frame rates with high resolution, limited only by the available computing budget.** However, with a real camera model there is an optimal frame rate for a given light level, due to the trade-off between signal-to-noise ratio and motion blur; even if the budget allows it, the frame rate cannot be increased arbitrarily, because image degradation takes over. Lowering the scene lighting shifts the optimal frame rate towards slightly lower values, which give a higher signal-to-noise ratio at the price of slightly more motion blur; and overall, increasing the resolution improves accuracy more significantly than increasing the frame rate. Hasinoff et al. also studied such time-bounded analysis, but only for signal-to-noise-based image quality assessment with static cameras or simple planar motion.
Our dataset, the rendering script that generated it, and other materials are available at http://www.doc.ic.ac.uk/~ahanda/HighFrameRateTracking/. We hope this will stimulate further applied and theoretical research into high frame rate tracking relevant to practical vision system design. Our dataset can also be used to analyze many other 3D vision problems.


Origin blog.csdn.net/Darlingqiang/article/details/133141639