Visual Deprojection: Probabilistic Recovery of Collapsed Dimensions (ICCV 2019 Paper Interpretation)


Visual Deprojection: Probabilistic Recovery of Collapsed Dimensions

 

 

Paper link:

http://openaccess.thecvf.com/content_ICCV_2019/papers/Balakrishnan_Visual_Deprojection_Probabilistic_Recovery_of_Collapsed_Dimensions_ICCV_2019_paper.pdf

Abstract

We introduce visual deprojection: the task of recovering an image or video that has been collapsed along a dimension. Projections arise in several contexts, such as long-exposure photography, where a dynamic scene is collapsed in time to produce a motion-blurred image, and corner cameras, where light reflected from a scene is collapsed along a spatial dimension by an edge occluder to produce a 1D video. Deprojection is ill-posed: there are usually many plausible solutions for a given input. We first propose a probabilistic model to capture the ambiguity of the task. We then present a variational inference strategy using convolutional neural networks as functional approximators. Sampling from the inference network at test time produces plausible candidate signals whose projections match the given input. We evaluate the method on multiple datasets. We first show that the method can recover human gait videos and face images from spatial projections, and then demonstrate that it can recover videos of moving digits from dramatically motion-blurred images obtained via temporal projection.

1.       Introduction

Captured visual data are often projections of a high-dimensional signal "folded" along some dimension. For example, a long-exposure, motion-blurred photograph is produced by projecting a moving scene along the time dimension [11, 25]. The recent "corner camera" uses a vertical edge occluder to project light from a hidden scene into a 1D video [4]. A medical x-ray machine is a spatial projection radiography system, in which x-rays produced by a generator pass through anatomy and strike a detector to form the captured signal [26]. Given projection data, can we synthesize the original signal?

In this work, we propose an algorithm to perform this synthesis. We focus on recovering images and videos from spatial projections, and on recovering videos from long-exposure images obtained by temporal projection. Inverting a projection of a high-dimensional signal is ill-posed, so without the true signal or prior constraints the task is infeasible. The ambiguity includes the position of objects along the projected spatial direction and, for temporal projections, the "arrow of time" [43] (Fig. 1). We exploit the fact that, because of shared structure within a given domain, the effective dimensionality of most natural images is much lower than their pixel representation. We handle this ambiguity by building a probabilistic model of signal generation given a projection. The components of the model are implemented as parametric functions using convolutional neural networks (CNNs). Using variational inference, we obtain an intuitive objective function. At test time, sampling from the deprojection network yields plausible example signals consistent with the input projection. The computer vision literature on restoring high-dimensional data from observations is rich: single-image super-resolution [15], image colorization [46], and motion-blur removal [14] are special cases. Here, we focus on projections that completely remove a spatial or temporal dimension, which can cause a severe loss of information. To our knowledge, ours is the first general method for recovering a collapsed dimension.

Building on insights from related problems, we develop the first solution that extrapolates appearance and (in the case of video) motion cues to the unseen dimension. In particular, we draw on recent advances in neural-network-based synthesis and stochastic prediction tasks [2, 17, 44]. We evaluate our work both quantitatively and qualitatively. We show that our method can recover distributions of gait videos from 2D spatial projections and face images from 1D projections. We also demonstrate that our method can model distributions of videos conditioned on motion-blurred images using the Moving MNIST dataset [37].

 

 

2.       Related Work

2.1.  Corner Cameras

Projection plays a central role in computer vision, starting from the initial stage of image formation, in which light emitted from the three-dimensional world is projected onto a two-dimensional plane. We are concerned with a particular kind of projection, in which the signal of interest is collapsed along one dimension to create the observed data.

A corner camera exploits light from a hidden scene that is reflected near, and partially occluded by, the edge of an obstruction to "see around corners" [4]. Light reflected from the scene at the same angular position relative to the corner is integrated vertically, producing a 1D video (one spatial dimension plus time). That study used temporal gradients of the 1D video to roughly indicate the angular positions of people relative to the corner, but did not reconstruct the hidden scene. As a first step toward this difficult reconstruction task, we show that videos and images can be recovered after one spatial dimension has been collapsed.

2.2.  Compressed Sensing

Compressed sensing techniques reconstruct signals from limited observations by finding solutions to underdetermined linear systems [8, 12]. This is possible because of appropriate redundancy in the nature of the signal. Several works show that even when measurements are sampled randomly, signals (with dimensionality in the thousands) can be exactly reconstructed from a small amount of data via convex optimization [6, 7, 16]. We tackle an extreme variant in which one dimension of the signal is lost entirely. We also take a learning-based approach to the problem, which produces a distribution over plausible signals rather than a single estimate.

2.3.  Conditional Image/Video Synthesis and Future Frame Prediction

Neural-network-based image and video synthesis has received widespread attention. In conditional image synthesis, an image is synthesized from other information, such as a label or another image of the same dimensionality (image-to-image translation) [5, 17, 29, 38, 42, 47]. In contrast to our work, most of these studies condition on data of the same dimensionality as the output. Video synthesis algorithms have focused on unconditional generation [33, 39, 40] or video-to-video translation [9, 34, 41]. In future frame prediction, frames are synthesized from one or more past frames. Some video generation algorithms treat this as a stochastic problem [2, 24, 44], using variational autoencoder (VAE)-style frameworks [23]. The input and output modalities of these problems are similar to ours, but the information content of the inputs differs. We draw on the stochastic formulations of these studies.

2.4.  Inverting a Motion-blurred Image to Video

We explore an application that synthesizes video from images blurred by dramatic motion, created by aggregating photons from the scene over a long exposure. Two recent studies propose methods for recovering a deterministic video sequence from a single motion-blurred image [18, 30]. We propose a general framework that includes, but is not limited to, deprojection along the time dimension. In addition, our framework models a probability distribution to capture signal variability, rather than producing a single deterministic output (see Fig. 1).

3.       Methods

Our goal is to capture the distribution p(y | x) over signals y of a particular domain given a projection x. We propose a probabilistic model based on a conditional VAE (CVAE) [36] (Fig. 2).
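For reference, a minimal sketch of the conditional latent-variable factorization that a CVAE of this kind assumes (standard CVAE notation; the paper's exact symbols and parameterization may differ):

$$p_\theta(y \mid x) \;=\; \int p_\theta(y \mid x, z)\, p(z)\, dz, \qquad p(z) = \mathcal{N}(0, I)$$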

 

 

3.1.  Variational Inference and Loss Function

Directly computing this integral is difficult, because the likelihood depends on a complex parametric function and the posterior over the latent variable, p(z | y), is hard to estimate. Instead, we use variational inference to bound the likelihood and optimize it with stochastic gradient descent [20, 23].
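As a concrete reference, a hedged sketch of the resulting conditional-VAE evidence lower bound (the paper's exact form and weighting may differ; β is the KL weight discussed in Section 4.1):

$$\log p_\theta(y \mid x) \;\geq\; \mathbb{E}_{q_\phi(z \mid y)}\!\left[\log p_\theta(y \mid x, z)\right] \;-\; \beta\, \mathrm{KL}\!\left(q_\phi(z \mid y)\,\|\,p(z)\right)$$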

3.2.  Network Architectures

Figure 3 shows the architecture for the 2D-to-3D temporal deprojection task. Our 2D-to-3D spatial deprojection architecture is nearly identical, differing only in the ordering of the dimensions of x and in the reshaping operators. We use lower-dimensional convolutions and reshaping operators to handle 1D-to-2D deprojections. The number of convolutional layers and parameters varies with the complexity of the dataset.

The second stage applies a series of 2D convolutions and upsampling operations to synthesize an image with the same spatial dimensions as x but more channels. Activations from the first stage are concatenated with activations in the second stage via skip connections to propagate learned features. We then expand the resulting image along the collapsed dimension to generate a 3D volume. To do this, we use 2D convolutions to generate an image with T·F channels, where T is the size of the collapsed dimension (time in this example) and F is a number of features. Finally, we reshape this image into a 3D volume and apply a few 3D convolutions to refine it and produce the signal estimate.
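To make this expansion step concrete, here is a minimal Keras sketch of going from 2D feature maps to a T-frame volume (T, spatial sizes, and layer widths are illustrative assumptions, not the paper's exact architecture):

```python
# Sketch of the 2D-to-3D expansion: 2D conv producing T*F channels,
# reshape into a (T, H, W, F) volume, then a few refining 3D convs.
import tensorflow as tf
from tensorflow.keras import layers

T, H, W, F = 10, 64, 64, 8  # collapsed-dimension size, spatial size, feature count (assumed)

inp = layers.Input(shape=(H, W, 32))          # 2D feature maps from the second stage
x = layers.Conv2D(T * F, 3, padding="same", activation="relu")(inp)
x = layers.Reshape((H, W, T, F))(x)           # split channels into (time, features)
x = layers.Permute((3, 1, 2, 4))(x)           # reorder to a (T, H, W, F) volume
x = layers.Conv3D(F, 3, padding="same", activation="relu")(x)
out = layers.Conv3D(1, 3, padding="same")(x)  # refined T-frame signal estimate

model = tf.keras.Model(inp, out)
model.summary()
```

The design choice mirrored here is that the channel axis of a 2D convolution is reinterpreted as (collapsed dimension × features) before refinement with 3D convolutions.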


4.       Experiments and Results

We first evaluate our approach on 1D-to-2D spatial deprojections of faces using FacePlace [31]. We then present results for 2D-to-3D spatial deprojections using an internally collected dataset of human gait videos. Finally, we demonstrate 2D-to-3D temporal deprojections using the Moving MNIST dataset [37]. In all experiments, we specialize the projection to an average of pixels along one dimension. For all experiments, we split the data into non-overlapping training/validation/test sets.
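As a point of reference, here is a minimal NumPy sketch of this averaging projection (array shapes and variable names are illustrative assumptions, not the paper's code):

```python
# Projection by averaging pixels along the collapsed dimension.
import numpy as np

video = np.random.rand(10, 64, 64)        # (time, height, width) clip

temporal_proj = video.mean(axis=0)        # "long-exposure" image, shape (64, 64)
vertical_proj = video.mean(axis=1)        # height collapsed: 1D-per-frame video, shape (10, 64)

image = np.random.rand(128, 128)          # a single face image
proj_over_width = image.mean(axis=1)      # width collapsed: 1D signal of length 128
```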

4.1.  Implementation

We implement our models in Keras [10] with a TensorFlow [1] backend. We use the ADAM optimizer [22] with a learning rate of 1e-4. We train a separate model for each experiment. We select the regularization parameter β separately for each dataset, such that the KL term on validation data lies roughly between [5, 15], obtaining adequate reconstructions while avoiding mode collapse. We set the dimensionality of z to 10 for all experiments.
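A small sketch of the optimizer and loss weighting described here (the reconstruction and KL terms are placeholders, and the β value is illustrative; this is not the released training code):

```python
# Training setup sketch: Adam with lr 1e-4 and a beta-weighted KL term.
import tensorflow as tf

beta = 10.0       # illustrative; tuned per dataset so the validation KL term lies roughly in [5, 15]
latent_dim = 10   # dimensionality of z used in all experiments

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def total_loss(reconstruction_loss, kl_loss):
    # Reconstruction term plus beta-weighted KL divergence (Section 3.1).
    return reconstruction_loss + beta * kl_loss
```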

4.2.  Spatial Deprojections with FacePlace

FacePlace consists of more than 5,000 images of 236 people. There is substantial variation, including different ethnicities, multiple viewpoints, facial expressions, and props. We randomly selected all images of 30 individuals to form the test set. We scaled the images to 128 × 128 pixels and augmented the data with translations, scaling, and saturation changes. We compare our approach with the following baselines:

1. k-nearest-neighbor selector (k-NN): selects the k images from the training set whose projections are closest to the test projection in mean squared error (a minimal sketch of this baseline appears after the list).

2. A deterministic model (DET) with the same architecture as our generator network gθ(x, z), but without the latent variable z.

3. A linear minimum mean squared error (LMMSE) estimator, which estimates y from x assuming they are jointly distributed.
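As noted in baseline 1, here is a minimal sketch of how such a k-NN selector could be implemented (function and variable names are assumptions for illustration, not the authors' code):

```python
# k-NN baseline: return the k training signals whose projections are closest
# (in mean squared error) to the test projection.
import numpy as np

def knn_baseline(test_proj, train_projs, train_signals, k=5):
    """test_proj: (D,); train_projs: (N, D); train_signals: (N, ...)."""
    mse = np.mean((train_projs - test_proj) ** 2, axis=1)  # MSE to each training projection
    nearest = np.argsort(mse)[:k]                          # indices of the k closest projections
    return train_signals[nearest]                          # their corresponding signals
```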

For our method and DET, we use a perceptual loss. Fig. 4 presents visualized results, including several randomly selected samples from our method. 1-NN performance varies across test samples, sometimes producing a face from the wrong person. LMMSE produces very blurry outputs, suggesting the highly nonlinear nature of the task. DET produces less blurry outputs, but still often merges different plausible faces. Our approach captures uncertainty in head direction and variation in appearance, such as hair color and facial structure. The ambiguity in head direction is more apparent with horizontal projections, because changes in pose affect that dimension most.

Compared with LMMSE and DET, the outputs of the proposed method are sharper, and they are more realistic than 1-NN. We also evaluate the models quantitatively, using peak signal-to-noise ratio (PSNR) to measure the quality of reconstructed images. For each test projection, we draw k signal estimates from each model (DET always returns the same estimate) and record the best PSNR between any estimate and the ground-truth image. For each estimate, we also re-project it and record its average PSNR relative to the input (test) projection. Fig. 5 shows results over 100 test projections for varying k. As the number of samples k increases, the signal PSNR of our approach increases, highlighting the advantage of our probabilistic method. The best k-NN reconstruction also improves with k, but the decreasing k-NN projection PSNR curve shows that many k-NN estimates are poor.
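To make the evaluation protocol concrete, here is a hedged sketch of the best-of-k PSNR computation described above (names and the peak value are illustrative assumptions):

```python
# Best-of-k signal PSNR: draw k candidate signals for a test projection and
# keep the highest PSNR against the ground-truth image.
import numpy as np

def psnr(a, b, max_val=1.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def best_of_k_psnr(candidates, ground_truth):
    """candidates: iterable of k estimated signals for one test projection."""
    return max(psnr(c, ground_truth) for c in candidates)
```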

LMMSE has a perfect projection PSNR, since by construction it exactly captures the linear relationship between signal and projection. With a single sample, DET has a higher signal PSNR because it effectively averages over plausible images, which our method does not. After a few samples, our proposed method surpasses DET.

4.3.  Spatial Deprojections with Walking Video

We qualitatively assess our approach on reconstructing gait videos from vertical spatial projections. This scenario has practical significance for the corner camera described in Section 2.1. We collected 35 videos of 30 subjects walking in a designated area. The subjects varied in dress, height (5 feet 2 inches to 6 feet 5 inches), age (18-60 years), and sex (18 male / 12 female). Subjects were not instructed to walk in any particular way, and many walked in idiosyncratic ways. All videos share the same background.

We downsampled the videos to 5 frames per second, resized each frame to 256 × 224 pixels, and augmented the data by applying horizontal translations to each video. We held out six subjects to form the test set. We predict sequences of 24 frames (about 5 seconds of real time).

Fig. 6 shows several reconstructed samples obtained using the mean of the distribution over z. Our approach recovers many details from the vertical projection alone. The background is easy to synthesize, because it is consistent across all videos in the dataset. Notably, many details of the subjects' appearance and gestures are restored. The foreground projection contains the trajectory of the foreground signal along the collapsed dimension, along with subtle variations in pixel intensity that provide shape cues. For example, the method appears to have learned that a trace that grows darker and wider over time likely corresponds to a person approaching the camera. In the third illustrative example, our approach separates the white shirt from the black trousers, even though they are not salient in the projection. The expected level of detail, combined with a learned pattern that shirts are usually lighter in color than trousers, likely makes this recovery possible.

Finally, the method can struggle with patterns rarely seen in the training data, such as the long stride of the fourth subject in the fifth frame. In addition to these experiments, we trained a separate model on the DGait dataset [3], which consists of more subjects (53) but simpler walking patterns. We obtained results of similar quality, as shown in the corresponding figure.

4.4.  Temporal Deprojections with Moving MNIST

The Moving MNIST dataset consists of 10,000 video sequences, each showing two moving handwritten digits. The digits can occlude one another and bounce off the edges of the frame. Given 64 × 64 × 10 video clips from this dataset, we generate each projection x by averaging the frames over time, similarly to other studies that generate large-scale motion-blurred images [18, 21, 27, 28].

Although the appearance and dynamics of the dataset are simple, capturing digit appearance and synthesizing a plausible trajectory for each digit is challenging. Fig. 9 shows outputs of our method for three test examples. To illustrate the temporal ambiguity our method has learned, we draw 10 sample sequences from our model for each projection, and present the closest (in MSE) samples running forward and backward in time relative to the ground-truth clip.

Our approach infers digit shapes from input images blurred by significant motion, i.e., images that are difficult for humans to interpret. In addition, our method captures the multimodal dynamics of the dataset, which we illustrate with the two presented motion sequences: the first sequence matches the ground-truth direction of time, while the second matches the reverse direction.

We quantify accuracy using PSNR curves, similarly to the first experiment, as shown in Figure 8. Because forming the full joint covariance matrix is computationally too expensive, we do not compute LMMSE in this experiment. DET produces blurry sequences by merging different plausible temporal sequences.

As in the first experiment, this averaging means the DET output achieves the best expected signal PSNR only when k = 1. For k > 1, our method's signal PSNR is considerably better than DET's. DET performs better in projection PSNR, because in this experiment averaging all plausible sequences produces a very accurate projection. Compared with the FacePlace experiment, k-NN performs relatively poorly here, because it is difficult to find good nearest neighbors in high dimensions.


5.       Conclusion

In this work, we introduced the new problem of visual deprojection: recovering an image or video that has been collapsed along one dimension into a lower-dimensional observation. We proposed the first general method for handling images and videos projected along an arbitrary dimension of the data. We first addressed the ambiguity of the task by introducing a probabilistic model that captures the distribution of original signals conditioned on a projection. We implemented the parametric functions of the model with CNNs, which learn the structure shared within each image domain to synthesize accurate signals.

Although the information from the collapsed dimension often does not appear recoverable from a projection to the naked eye, our results show that much of the "missing" information is recoverable. We demonstrated this by reconstructing videos of human motion and images of faces, with subtle details, from spatial projections alone. Finally, using the Moving MNIST video dataset, we showed that videos can be reconstructed from images blurred by dramatic motion, even when multiple motion paths are plausible. This work demonstrates promising results on a new and ambitious imaging task, and opens exciting possibilities for revealing the unseen in future applications.

 
