[Paper] Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks

Paper Link
Paper Code (MXNet)

The authors sidestep the data-scarcity problem of previous depth-estimation datasets by using frames from 3D movies as training data: the left view serves as input, and an end-to-end supervised neural network (with a VGG16 backbone) is trained to predict the right view.
The middle of the network outputs a disparity map between the left and right views, but this map is supervised only indirectly, by minimizing the MAE between the predicted right view and the ground-truth right view. The resulting “disparity map” is therefore neither “real” nor guaranteed to be “accurate”.
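Concretely, the paper renders the right view through a probabilistic “selection” layer: the network predicts, for every pixel, a softmax distribution over a set of candidate disparities, and the right view is the probability-weighted sum of horizontally shifted copies of the left view. Below is a minimal NumPy sketch of that rendering step; the single-channel image, wraparound shifts, and the function name `render_right_view` are my simplifications, not the paper's MXNet code:

```python
import numpy as np

def render_right_view(left, disp_logits):
    """Deep3D-style selection layer (simplified sketch).

    left:        (H, W) single-channel left view.
    disp_logits: (D, H, W) per-pixel scores over D candidate disparities.
    """
    # Softmax across the disparity axis -> per-pixel shift probabilities.
    e = np.exp(disp_logits - disp_logits.max(axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)

    # Right view = probability-weighted sum of shifted left views.
    right = np.zeros_like(left)
    for d in range(disp_logits.shape[0]):
        shifted = np.roll(left, -d, axis=1)  # horizontal shift by disparity d
        right += prob[d] * shifted
    return right

# Toy usage: random data, shapes only.
left = np.random.rand(64, 128)
logits = np.random.randn(8, 64, 128)  # 8 candidate disparities
right = render_right_view(left, logits)
assert right.shape == left.shape
```

Because every operation here is differentiable, the MAE loss on the rendered right view back-propagates into the disparity probabilities, which is why a disparity map emerges without ever being directly supervised.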

For evaluation, human scoring is used in addition to MAE.
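For reference, MAE here is simply the mean absolute pixel error between the synthesized right view and the ground-truth right view; a one-function sketch (the array inputs are my assumption):

```python
import numpy as np

def mae(pred_right, gt_right):
    # Mean absolute pixel error between predicted and ground-truth right views.
    return np.abs(pred_right - gt_right).mean()
```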

It is worth noting that, intuitively, exploiting the temporal dependency between consecutive video frames should improve the model's predictions, yet the MAE results show the opposite: adding 5 frames of optical flow as input increases MAE. The authors attribute this to the complexity of the temporal-dependency embedding model. I suspect it may also be due to changes in shooting angle and to the dilution of useful information when multiple inputs interfere with one another.

Source: blog.csdn.net/yaoyao_chen/article/details/130471834