【论文翻译】中英对照翻译--(Attentive Generative Adversarial Network for Raindrop Removal from A Single Image)

【开始时间】2018.10.08

【完成时间】2018.10.09

【论文翻译】Attentive GAN论文中英对照翻译--(Attentive Generative Adversarial Network for Raindrop Removal from A Single Image)

【中文译名】用于单幅图像雨滴去除的专注生成对抗网络

【论文链接】https://arxiv.org/abs/1711.10098

【补充】

1)论文的发表时间是:6 May 2018,是在CVPR2018上发表的论文

2)文章解读可参考:https://blog.csdn.net/gentlelu/article/details/80672490

【声明】本文是本人根据原论文进行翻译,有些地方加上了自己的理解,有些专有名词用了最常用的译法,时间匆忙,如有遗漏及错误,望各位包涵

                                                         题目:用于单幅图像雨滴去除的专注生成对抗网络

Abstract(摘要)

        Raindrops adhered to a glass window or camera lens can severely hamper the visibility of a background scene and degrade an image considerably. In this paper, we address the problem by visually removing raindrops, and thus transforming a raindrop degraded image into a clean one. The problem is intractable, since first the regions occluded by raindrops are not given. Second, the information about the background scene of the occluded regions is completely lost for most part. To resolve the problem, we apply an attentive generative network using adversarial training. Our main idea is to inject visual attention into both the generative and discriminative networks. During the training, our visual attention learns about raindrop regions and their surroundings. Hence, by injecting this information, the generative network will pay more attention to the raindrop regions and the surrounding structures, and the discriminative network will be able to assess the local consistency of the restored regions. This injection of visual attention to both generative and discriminative networks is the main contribution of this paper. Our experiments show the effectiveness of our approach, which outperforms the state of the art methods quantitatively and qualitatively.

      雨滴粘附在玻璃窗或相机镜头上会严重影响背景图像的可见性,并大大降低图像的质量。在本文中,我们通过视觉上去除雨滴解决了这个问题,将雨滴退化图像转化为清晰的图像。这个问题很难解决,因为第一,没有给出雨滴遮挡的区域;第二,遮挡区域的背景场景信息大部分是完全丢失的。为了解决这一问题,我们利用对抗性训练建立了一个专注生成网络( an attentive generative network)。我们的主要想法是将视觉注意力(visual attention)注入到生成网络和判别网络中。在训练中,我们的视觉注意力学习雨滴区域及其周围环境。因此,通过注入这些信息,生成网络将更多地关注雨滴区域和周围的结构,而判别网络将能够评估恢复区域的局部一致性(the local consistency)。将视觉注意力注入生成网络和判别网络是本文的主要贡献。实验证明了该方法的有效性,在定量和定性上都优于最先进的方法。

1. Introduction(介绍)

     Raindrops attached to a glass window, windscreen or lens can hamper the visibility of a background scene and degrade an image. Principally, the degradation occurs because raindrop regions contain different imageries from those without raindrops. Unlike non-raindrop regions, raindrop regions are formed by rays of reflected light from a wider environment, due to the shape of raindrops, which is similar to that of a fish-eye lens. Moreover, in most cases, the focus of the camera is on the background scene, making the appearance of raindrops blur.

    附着在玻璃窗口、挡风玻璃或镜头上的雨滴会阻碍背景场景的可见度并降低图像的质量。这主要是由于雨滴区域与没有雨滴的区域包含不同的图像信息。与非雨滴区域不同,雨滴区域是由来自更广阔环境的反射光形成的,因为雨滴的形状类似于鱼眼透镜的形状。此外,在大多数情况下,相机的焦点是在背景场景,这使得雨滴的外观模糊。

   In this paper, we address this visibility degradation problem. Given an image impaired by raindrops, our goal is to remove the raindrops and produce a clean background as shown in Fig. 1. Our method is fully automatic. We consider that it will benefit image processing and computer vision applications, particularly for those suffering from raindrops, dirt, or similar artifacts.

   在本文中,我们解决了可见性退化问题。给出一个被雨滴破坏的图像,我们的目标是移除雨滴,并产生一个干净的背景,如图1所示。我们的方法是全自动的。我们认为,这将有利于图像处理和计算机视觉应用,特别是对于那些受雨滴、污垢或类似伪影(artifacts)影响的应用而言。

图1.演示我们的雨滴清除方法。左:输入因雨滴而退化的图像。右:我们的结果,大部分雨滴被移除,结构细节得到恢复。放大图像可以更清楚地观察恢复质量。

    A few methods have been proposed to tackle the raindrop detection and removal problems. Methods such as  [17, 18, 12] are dedicated to detecting raindrops but not removing them. Other methods are introduced to detect and remove raindrops using stereo [20], video [22, 25], or specifically designed optical shutter [6], and thus are not applicable for a single input image taken by a normal camera. A method by Eigen et al. [1] has a similar setup to ours. It attempts to remove raindrops or dirt using a single image via deep learning method. However, it can only handle small raindrops, and produce blurry outputs [25]. In our experimental results (Sec. 6), we will find that the method fails to handle relatively large and dense raindrops.

    针对雨滴检测和去除问题,之前已经提出了几种解决方法。诸如[17,18,12]等方法专门用于检测雨滴,但不去除雨滴。其他方法采用立体视觉(stereo)[20]、视频[22、25]或专门设计的光学快门[6]检测和去除雨滴,因此不适用于由普通照相机拍摄的单个输入图像。Eigen等人提出的一种方法[1],其设定与我们的相似。它试图通过深度学习的方法,用一幅图像去除雨滴或污垢。然而,它只能处理较小的雨滴,并产生模糊的输出[25]。在我们的实验结果中(第6节),我们会发现,这种方法不能处理较大和密集的雨滴。

     In contrast to [1], we intend to deal with substantial presence of raindrops, like the ones shown in Fig. 1. Generally, the raindrop-removal problem is intractable, since first the regions which are occluded by raindrops are not given. Second, the information about the background scene of the occluded regions is completely lost for most part. The problem gets worse when the raindrops are relatively large and distributed densely across the input image. To resolve the problem, we use a generative adversarial network, where our generated outputs will be assessed by our discriminative network to ensure that our outputs look like real images. To deal with the complexity of the problem, our generative network first attempts to produce an attention map. This attention map is the most critical part of our network, since it will guide the next process in the generative network to focus on raindrop regions. This map is produced by a recurrent network consisting of deep residual networks (ResNets) [8] combined with a convolutional LSTM [21] and a few standard convolutional layers. We call this attentive-recurrent network.

    与[1]相反,我们打算处理大量的雨滴,如图1中所示。一般情况下,雨滴清除问题是难以解决的,因为第一,没有给出被雨滴遮挡的区域;第二,关于被遮挡区域的背景场景的信息大部分是完全丢失的。当雨滴相对较大并且密集分布在输入图像上时,问题就变得更严重了。为了解决这个问题,我们使用了一个生成对抗网络(generative adversarial network),在该网络中,我们生成的输出将由判别网络(discriminative network)进行评估,以确保我们的输出看起来像真实的图像。为了应对这个问题的复杂性,我们的生成网络首先尝试生成一张注意力图(attention map)。这张注意力图是我们网络中最关键的部分,因为它将引导生成网络的下一个过程集中在雨滴区域。该注意力图是由深度残差网络(ResNet)[8]、卷积LSTM[21]以及几个标准卷积层组成的递归网络生成的。我们称之为关注-循环网络(attentive-recurrent network)。

    The second part of our generative network is an autoencoder, which takes both the input image and the attention map as the input. To obtain wider contextual information, in the decoder side of the autoencoder, we apply multi-scale losses. Each of these losses compares the difference between the output of the convolutional layers and the corresponding ground truth that has been downscaled accordingly. The input of the convolutional layers is the features from a decoder layer. Besides these losses, for the final output of the autoencoder, we apply a perceptual loss to obtain a more global similarity to the ground truth. This final output is also the output of our generative network.

    生成网络的第二部分是以输入图像和注意力图为输入的自动编码器。为了获得更广泛的上下文信息,在自动编码器的解码器端,我们采用了多尺度损失( multi-scale losses)。这些损失中的每一个都比较了卷积层的输出与经过相应比例缩小的地面真相( ground truth)之间的差异。卷积层的输入是来自解码器层的特征。除了这些损失,对于自动编码器的最终输出,我们应用一个感知损失( perceptual loss),以获得与地面真相在全局上更高的相似性。这个最终输出也是我们生成网络的输出。

    

    Having obtained the generative image output, our discriminative network will check if it is real enough. Like in a few inpainting methods (e.g. [9, 13]), our discriminative network validates the image both globally and locally. However, unlike the case of inpainting, in our problem and particularly in the testing stage, the target raindrop regions are not given. Thus, there is no information on the local regions that the discriminative network can focus on. To address this problem, we utilize our attention map to guide the discriminative network toward local target regions.

    在获得生成图像输出后,我们的判别网络将检查它是否足够真实。就像一些修复方法(例如,[9,13])一样,我们的判别网络对图像进行全局和局部验证。然而,与修复的情况不同,在我们的问题中,特别是在测试阶段,没有给出目标雨滴区域。因此,没有关于局部区域的信息可供 判别网络关注。为了解决这一问题,我们利用我们的注意力图来引导判别网络趋向于局部目标区域。

    Overall, besides introducing a novel method of raindrop removal, our other main contribution is the injection of the attention map into both generative and discriminative networks, which is novel and works effectively in removing raindrops, as shown in our experiments in Sec. 6. We will release our code and dataset.

    总之,除了引入一种新的雨滴去除方法外,我们的另一个主要贡献是将注意力图注入生成网络和判别网络,这是一种新颖的、能有效去除雨滴的做法,如我们的实验(第6节)中所示。我们将发布代码和数据集。

   The rest of the paper is organized as follows. Section 2 discusses the related work in the fields of raindrop detection and removal, and in the fields of the CNN-based image inpainting. Section 3 explains the raindrop model in an image, which is the basis of our method. Section 4 describes our method, which is based on the generative adversarial network. Section 5 discusses how we obtain our synthetic and real images used for training our network. Section 6 shows our evaluations quantitatively and qualitatively. Finally, Section 7 concludes our paper.

     论文的其余部分组织如下。第二节讨论了雨滴检测和去除领域以及基于CNN的图像修复领域的相关工作。第三节解释了图像中的雨滴模型,这是我们方法的基础。第四节介绍了基于生成对抗网络的方法。第五节讨论了我们如何获得用于训练网络的合成图像和真实图像。第六节从定量和定性两个方面展示了我们的评价。第七节总结了我们的论文。

2. Related Work(相关工作)

      There are a few papers dealing with bad weather visibility enhancement, which mostly tackle haze or fog (e.g. [19, 7, 16]), and rain streaks (e.g. [3, 2, 14, 24]). Unfortunately, we cannot apply these methods directly to raindrop removal, since the image formation and the constraints of raindrops attached to a glass window or lens are different from haze, fog, or rain streaks.

     有几篇论文研究恶劣天气下的能见度增强,它们主要处理雾霾或雾(例如[19,7,16])和雨纹( rain streaks)(例如[3,2,14,24])。不幸的是,我们不能直接应用这些方法来去除雨滴,因为附着在玻璃窗口或镜片上的雨滴的成像过程和约束不同于雾霾、雾或雨纹。

    A number of methods have been proposed to detect raindrops. Kurihata et al.'s [12] learns the shape of raindrops using PCA, and attempts to match a region in the test image, with those of the learned raindrops. However, since raindrops are transparent and have various shapes, it is unclear how large the number of raindrops needs to be learned, how to guarantee that PCA can model the various appearance of raindrops, and how to prevent other regions locally similar to raindrops to be detected as raindrops. Roser and Geiger’s [17] proposes a method that compares a synthetically generated raindrop with a patch that potentially has a raindrop. The synthetic raindrops are assumed to be a sphere section, and later assumed to be inclined sphere sections [18]. These assumptions might work in some cases, yet cannot be generalized to handle all raindrops, since raindrops can have various shapes and sizes.

    之前已经提出了许多检测雨滴的方法。Kurihata等人[12]使用PCA来学习雨滴的形状,并尝试将测试图像中的一个区域与所学雨滴的区域相匹配。然而,由于雨滴是透明的,形状各异,尚不清楚需要学习的雨滴数量有多大,如何保证PCA能够建模雨滴的各种外观,以及如何防止局部类似雨滴的其他区域被检测为雨滴。Roser和Geiger [17]提出了一种方法,将一个合成生成的雨滴与可能含有雨滴的图像块(patch)进行比较。合成的雨滴被假定为球面截面( a sphere section),后来又被假定为倾斜的球面截面[18]。这些假设在某些情况下可能是可行的,但却不能推广用于处理所有的雨滴,因为雨滴可以有不同的形状和大小。

    Yamashita et al.’s [23] uses a stereo system to detect and remove raindrops. It detects raindrops by comparing the disparities measured by the stereo with the distance between the stereo cameras and glass surface. It then removes raindrops by replacing the raindrop regions with the textures of the corresponding image regions, assuming the other image does not have raindrops that occlude the same background scene. A similar method using an image sequence, instead of stereo, is proposed in Yamashita et al.’s [22]. Recently, You et al.’s [25] introduces a motion based method for detecting raindrops, and video completion to remove detected raindrops. While these methods work in removing raindrops to some extent, they cannot be applied directly to a single image.

   Yamashita等人[23]使用立体视觉系统(stereo system)来检测和去除雨滴。它通过比较立体视觉测量到的视差(disparities)与立体相机和玻璃表面之间的距离来检测雨滴。然后,在假设另一幅图像中没有遮挡相同背景场景的雨滴的前提下,用对应图像区域的纹理替换雨滴区域,从而去除雨滴。Yamashita等人在[22]中提出了一种类似的方法,用图像序列代替立体视觉。最近, You等人在[25]中介绍了一种基于运动的雨滴检测方法,并通过视频补全(video completion)来去除检测到的雨滴。虽然这些方法在一定程度上能够去除雨滴,但它们不能直接应用于单幅图像。

    Eigen et al.’s [1] tackles single-image raindrop removal, which to our knowledge, is the only method in the literature dedicated to the problem. The basic idea of the method is to train a convolutional neural network with pairs of raindrop-degraded images and the corresponding raindrop-free images. Its CNN consists of 3 layers, where each has 512 neurons. While the method works, particularly for relatively sparse and small droplets as well as dirt, it cannot produce clean results for large and dense raindrops. Moreover, the output images are somehow blur. We suspect that all these are due to the limited capacity of the network and the deficiency in providing enough constraints through its losses. Sec. 6 shows the comparison between our results with this method’s.

    Eigen等人的方法[1]处理单幅图像的雨滴去除,据我们所知,这是文献中唯一专门针对该问题的方法。该方法的基本思想是使用雨滴退化图像和相应的无雨滴图像的图像对来训练卷积神经网络。其CNN由3层组成,每层有512个神经元。虽然这种方法有效,特别是对于相对稀疏、较小的液滴以及污垢,但它不能对大而密集的雨滴产生干净的结果。此外,输出图像也有些模糊。我们猜测,这些问题都是由于网络容量有限,以及其损失函数不足以提供足够的约束所致。第6节将我们的结果与该方法的结果进行了比较。

   In our method, we utilize a GAN [4] as the backbone of our network, which is recently popular in dealing with the image inpainting or completion problem (e.g. [9, 13]). Like in our method, [9] uses global and local assessment in its discriminative network. However, in contrast to our method, in the image inpainting, the target regions are given, so that the local assessment (whether local regions are sufficiently real) can be carried out. Hence, we cannot apply the existing image inpainting methods directly to our problem. Another similar architecture is Pix2Pix [10], which translates one image to another image. It proposes a conditional GAN that not only learns the mapping from input image to output image, but also learns a loss function to the train the mapping. This method is a general mapping, and not proposed specifically to handle raindrop removal. In Sec. 6, we will show some evaluations between our method and Pix2Pix.

     在我们的方法中,我们使用一个GAN[4]作为网络的骨干,这是最近在处理图像修复或补全问题(例如[9,13])时流行的方法。与我们的方法一样,[9]在其判别网络中使用了全局和局部评估。然而,与我们的方法不同的是,在图像修复中,目标区域是给定的,从而可以进行局部评估(即局部区域是否足够真实)。因此,我们不能将现有的图像修复方法直接应用于我们的问题。另一个类似的架构是 Pix2Pix[10],它将一幅图像转换成另一幅图像。它提出了一种条件GAN,不仅学习从输入图像到输出图像的映射,还学习用于训练该映射的损失函数。这种方法是一种通用的映射方法,并不是专门针对雨滴去除提出的。在第6节中,我们将展示我们的方法和 Pix2Pix之间的一些评估。

3. Raindrop Image Formation(雨滴图像生成)

     We model a raindrop degraded image as the combination of a background image and effect of the raindrops:

    我们将雨滴退化图像建模为背景图像和雨滴效应的结合:

     where I is the colored input image and M is the binary mask. In the mask, M(x) = 1 means the pixel x is part of a raindrop region, and otherwise means it is part of background regions. B is the background image and R is the effect brought by the raindrops, representing the complex mixture of the background information and the light reflected by the environment and passing through the raindrops adhered to a lens or windscreen. Operator ⊙ means element-wise multiplication.

    式中,I是输入彩色图像;M是二值掩码(binary mask)。在掩码中,M (x)=1表示像素 x 是雨滴区域的一部分,否则认为是背景区域的一部分;B是背景图像,R是雨滴带来的影响,代表背景信息与环境反射光穿过附着在镜头或挡风玻璃上的雨滴后形成的复杂混合。算子⊙代表逐元素相乘。
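【译者补充】上述模型(即下文提到的式(1))按描述可写作 I = (1 − M) ⊙ B + R;式(1)在原文中以图片形式给出,此写法是译者根据上下文的理解。下面是一个基于 numpy 的最小示意,数组尺寸与数值均为随意假设,仅用于说明逐元素运算:

```python
import numpy as np

# 假设的图像尺寸:H x W x 3,取值范围 [0, 1]
H, W = 4, 4
B = np.random.rand(H, W, 3)          # 背景图像 B
R = np.random.rand(H, W, 3) * 0.5    # 雨滴带来的影响 R(背景与环境反射光的复杂混合)
M = np.zeros((H, W, 3))              # 二值掩码 M:1 表示雨滴区域
M[1:3, 1:3, :] = 1.0

# 雨滴退化图像 = 非雨滴区域保留背景 + 雨滴效应,"*" 即逐元素相乘 ⊙
I = (1.0 - M) * B + R
print(I.shape)
```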

   Raindrops are in fact transparent. However, due to their shapes and refractive index, a pixel in a raindrop region is not only influenced by one point in the real world but by the whole environment [25], making most part of raindrops seem to have their own imagery different from the background scene. Moreover, since our camera is assumed to focus on the background scene, this imagery inside a raindrop region is mostly blur. Some parts of the raindrops, particularly at the periphery and transparent regions, convey some information about the background. We notice that the information can be revealed and used by our network.

   雨滴实际上是透明的。然而,由于雨滴的形状和折射率( refractive index),雨滴区域的像素不仅受现实世界中某一点的影响,而且还受整个环境的影响[25],使得大部分雨滴似乎都有不同于背景场景的图像。此外,由于我们的相机被假定聚焦于背景场景,雨滴区域内的图像大多是模糊的。雨滴的某些部分,特别是在边缘和透明区域,传达了一些关于背景的信息。我们注意到我们的网络可以显示和使用这些信息。

     Based on the model (Eq. (1)), our goal is to obtain the background image B from a given input I. To accomplish this, we create an attention map guided by the binary mask M. Note that, for our training data, as shown in Fig. 5, to obtain the mask we simply subtract the image degraded by raindrops I with its corresponding clean image B. We use a threshold to determine whether a pixel is part of a raindrop region. In practice, we set the threshold to 30 for all images in our training dataset. This simple thresholding is sufficient for our purpose of generating the attention map.

    基于模型(Eq.1),我们的目标是从给定的输入I中获取背景图像B。为了实现这一点,我们创建了一个由二值掩码M引导的注意力图(attention map)。注意,对于我们的训练数据,如图5所示,为了获得掩码(mask),我们只需用雨滴退化图像I减去与其对应的干净图像B,再使用阈值来确定像素是否是雨滴区域的一部分。在实践中,我们将训练数据集中所有图像的阈值设置为30。这个简单的阈值处理(simple thresholding)对于我们生成注意力图的目的已经足够。
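【译者补充】下面用 numpy 给出“退化图像与干净图像作差并取阈值得到掩码”这一步骤的最小示意。其中函数名以及取逐像素最大通道差作比较等细节均为译者的假设,阈值 30 对应 0–255 的像素取值范围:

```python
import numpy as np

def make_raindrop_mask(degraded, clean, threshold=30):
    """degraded/clean: uint8 的 H x W x 3 图像;返回 H x W 的 0/1 掩码。"""
    # 逐像素作差,取三个通道中最大的绝对差
    diff = np.abs(degraded.astype(np.int16) - clean.astype(np.int16)).max(axis=2)
    return (diff > threshold).astype(np.uint8)
```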

    

4. Raindrop Removal using Attentive GAN(使用注意GAN去除雨滴)

      Fig. 2 shows the overall architecture of our proposed network. Following the idea of generative adversarial networks[4], there are two main parts in our network: the generative and discriminative networks. Given an input image degraded by raindrops, our generative network attempts to produce an image as real as possible and free from raindrops. The discriminative network will validate whether the image produced by the generative network looks real.

      图2展示了我们提出的网络的总体架构。遵循生成对抗网络的思想[4],在我们的网络中有两个主要部分:生成网络和判别网络。如果输入的图像因雨滴而退化,我们的生成网络试图生成尽可能真实、不受雨滴影响的图像。判别网络将验证生成网络产生的图像是否真实。

    Our generative adversarial loss can be expressed as:

    我们的生成性对抗性损失可以表示为:

        where G represents the generative network, and D represents the discriminative network. I is a sample drawn from our pool of images degraded by raindrops, which is the input of our generative network. R is a sample from a pool of clean natural images.

       其中G表示生成网络,D表示判别网络。I是从被雨滴退化的图像集合( images degraded by raindrops)中抽取的样本,这是我们生成网络的输入。R是从干净的自然图像集合中抽取的样本。
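【译者补充】该对抗损失在原文中以公式图片给出,形式上即标准 GAN 的 min-max 目标。下面是一个 PyTorch 示意,其中 G、D 作为可调用模块传入;这一接口以及数值稳定项 eps 均为译者的假设:

```python
import torch

def adversarial_losses(D, G, I, R):
    """I: 雨滴退化图像 batch;R: 干净真实图像 batch。返回 (判别器损失, 生成器对抗损失)。"""
    O = G(I)                                   # 生成的去雨滴图像
    eps = 1e-7
    # 判别器:让 D(R) 趋近 1、D(G(I)) 趋近 0
    d_loss = -(torch.log(D(R) + eps).mean() +
               torch.log(1 - D(O.detach()) + eps).mean())
    # 生成器:最小化 log(1 - D(G(I)))
    g_loss = torch.log(1 - D(O) + eps).mean()
    return d_loss, g_loss
```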

图2.我们提出的专注的GAN的结构。该生成器由一个注意力递归网络和一个具有跳过连接的上下文自动编码器组成。该判别器由一系列卷积层组成,并由注意力图引导。(Best viewed in color,彩色查看效果最佳。)

4.1. Generative Network(生成网络)

    As shown in Fig. 2, our generative network consists of two sub-networks: an attentive-recurrent network and a contextual autoencoder. The purpose of the attentive-recurrent network is to find regions in the input image that need to get attention. These regions are mainly the raindrop regions and their surrounding structures that are necessary for the contextual autoencoder to focus on, so that it can generate better local image restoration, and for the discriminative network to focus the assessment on.

    如图2所示。我们的生成网络由两个子网络组成:注意力递归网络( an attentive-recurrent network )和上下文自动编码器(a contextual autoencoder)。注意-递归网络的目的是在输入图像中寻找需要引起注意的区域。这些区域主要是雨滴区域及其周围的结构,是上下文自动编码器必须关注的区域,这样才能产生更好的局部图像恢复,并使判别网络集中于评估( assessment )。

    Attentive-Recurrent Network. Visual attention models have been applied to localizing targeted regions in an image to capture features of the regions. The idea has been utilized for visual recognition and classification (e.g. [26, 15, 5]). In a similar way, we consider visual attention to be important for generating raindrop-free background images, since it allows the network to know where the removal/restoration should be focused on. As shown in our architecture in Fig. 2, we employ a recurrent network to generate our visual attention. Each block (of each time step) in our recurrent network comprises of five layers of ResNet [8] that help extract features from the input image and the mask of the previous block, a convolutional LSTM unit [21] and convolutional layers for generating the 2D attention maps.

    关注-循环网络。视觉注意模型(Visual attention models)已被应用于定位图像中的目标区域,以捕捉区域的特征。这一思想已被用于视觉识别和分类(例如[26、15、5])。同样,我们认为视觉注意力对于生成无雨滴背景图像是非常重要的,因为它使网络知道移除/恢复应该集中在哪里。如图2中我们的架构所示,我们使用一个递归网络来产生视觉注意力。递归网络中的每一块(对应每个时间步)包含5层ResNet[8](用于从输入图像和前一块产生的掩码中提取特征)、一个卷积LSTM单元[21],以及用于生成2D注意力图的卷积层。

    Our attention map, which is learned at each time step, is a matrix ranging from 0 to 1, where the greater the value, the greater attention it suggests, as shown in the visualization in Fig. 3. Unlike the binary mask, M, the attention map is a non-binary map, and represents the increasing attention from non-raindrop regions to raindrop regions, and the values vary even inside raindrop regions. This increasing attention makes sense to have, since the surrounding regions of raindrops also needs the attention, and the transparency of a raindrop area in fact varies (some parts do not totally occlude the background, and thus convey some background information).

    我们的注意力图,是在每个时间步骤中学习的,是一个从0到1的矩阵,其中值越大,它所表示的注意力就越多,如图3中的可视化所示。与二值掩码M不同,注意映射是一种非二值映射,它代表着从非雨滴区域到雨滴区域的注意力的增加,雨滴区域内部的关注度也是不同的。这种注意力的增加是有意义的,因为雨滴周围的区域也需要注意,而雨滴区域的透明度实际上是不同的(有些部分并不完全遮住背景,从而传达了一些背景信息)。

图3.注意力图学习过程的可视化。这里可视化的是最终的注意力图 A_N。可以看到,我们的关注-循环网络在训练过程中越来越关注雨滴区域和相关结构。

   Our convolution LSTM unit consists of an input gate i_t, a forget gate f_t, an output gate o_t as well as a cell state c_t. The interaction between states and gates along the time dimension is defined as follows:

    我们的卷积LSTM单元包括一个输入门 i_t、一个忘记门 f_t、一个输出门 o_t 以及一个单元状态 c_t。状态与门沿时间维度的相互作用定义如下:

    where X_t is the features generated by ResNet. C_t encodes the cell state that will be fed to the next LSTM. H_t represents the output features of the LSTM unit. Operator ∗ represents the convolution operation. The LSTM’s output feature is then fed into the convolutional layers, which generate a 2D attention map. In the training process, we initialize the values of the attention map to 0.5. In each time step, we concatenate the current attention map with the input image and then feed them into the next block of our recurrent network.

    其中,X_t 是由ResNet生成的特征;C_t 编码将要传递到下一个LSTM的单元状态;H_t 代表LSTM单元的输出特征;运算符 ∗ 表示卷积运算。LSTM的输出特征随后被输入到卷积层,生成一个2D注意力图。在训练过程中,我们将注意力图的值初始化为0.5。在每个时间步中,我们将当前的注意力图与输入图像连接起来,然后将它们输入到递归网络的下一个块中。
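【译者补充】上文提到的状态与门的更新公式在原文中以图片形式给出,即标准的卷积 LSTM 公式。下面是一个 PyTorch 卷积 LSTM 单元的最小示意;通道数、卷积核大小为假设,并采用了不含 C_{t−1} peephole 项的常见简化写法:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """卷积 LSTM 单元:输入门 i_t、忘记门 f_t、输出门 o_t 和单元状态 c_t。"""
    def __init__(self, in_ch=32, hid_ch=32, k=3):
        super().__init__()
        # 一次卷积同时产生 4 组门的预激活值
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, h, c):
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)      # 更新单元状态 C_t
        h_next = o * torch.tanh(c_next)         # 输出特征 H_t
        return h_next, c_next
```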

   

     In training the generative network, we use pairs of images with and without raindrops that contain exactly the same background scene. The loss function in each recurrent block is defined as the mean squared error (MSE) between the output attention map at time step t, or A_t, and the binary mask, M. We apply this process N time steps. The earlier attention maps have smaller values and get larger when approaching the N-th time step indicating the increase in confidence. The loss function is expressed as:

    在训练生成网络时,我们使用包含和不包含雨滴的具有完全相同背景场景的图像对。每个循环块中的损失函数定义为在时间步 t 的输出注意力图 A_t 与二值掩码 M 之间的均方误差(MSE)。我们在 N 个时间步中应用这个过程。较早的注意力图取值较小,并随着接近第 N 个时间步而变大,这说明置信度在增加。损失函数表示为:

    where A_t is the attention map produced by the attentive-recurrent network at time step t. A_t = ATT_t(F_{t−1}, H_{t−1}, C_{t−1}), where F_{t−1} is the concatenation of the input image and the attention map from the previous time step. When t = 1, F_{t−1} is the input image concatenated with an initial attention map with values of 0.5. Function ATT_t represents the attentive-recurrent network at time step t. We set N to 4 and θ to 0.8. We expect a higher N will produce a better attention map, but it also requires larger memory.

    其中 A_t 是由注意递归网络在时间步 t 生成的注意力图。A_t = ATT_t(F_{t−1}, H_{t−1}, C_{t−1}),其中 F_{t−1} 是输入图像与前一时间步注意力图的连接。当 t=1 时,F_{t−1} 是输入图像与取值为0.5的初始注意力图的连接。函数 ATT_t 表示时间步 t 处的注意递归网络。我们将 N 设为4,θ 设为0.8。我们预计更大的 N 会产生更好的注意力图,但也需要更大的内存。
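【译者补充】按上面的描述,这个损失是各时间步注意力图 A_t 与掩码 M 的 MSE 按 θ^{N−t} 加权求和(N=4,θ=0.8)。下面是一个 PyTorch 示意;将 attention_maps 作为各时间步输出的列表传入属于译者假设的接口:

```python
import torch
import torch.nn.functional as F

def attention_loss(attention_maps, M, theta=0.8):
    """attention_maps: 长度为 N 的列表,每个元素是 A_t;M: 二值掩码。"""
    N = len(attention_maps)
    loss = 0.0
    for t, A_t in enumerate(attention_maps, start=1):
        # 越靠后的时间步权重 θ^(N-t) 越大
        loss = loss + (theta ** (N - t)) * F.mse_loss(A_t, M)
    return loss
```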

    Fig. 3 shows some examples of attention maps generated by our network in the training procedure. As can be seen, our network attempts to find not only the raindrop regions but also some structures surrounding the regions. And Fig. 8 shows the effect of the attentive-recurrent network in the testing stage. With the increasing of time step, our network focuses more and more on the raindrop regions and relevant structures.

   图3给出了在训练过程中由我们的网络生成的注意力图的一些例子。可以看到,我们的网络不仅试图找到雨滴区域,而且还试图找到这些区域周围的一些结构。图8显示了关注-循环网络在测试阶段的效果。随着时间步的增加,我们的网络越来越专注于雨滴区域及其相关结构。

     Contextual Autoencoder. The purpose of our contextual autoencoder is to generate an image that is free from raindrops. The input of the autoencoder is the concatenation of the input image and the final attention map from the attentive-recurrent network. Our deep autoencoder has 16 conv-relu blocks, and skip connections are added to prevent blurred outputs. Fig. 4 illustrates the architecture of our contextual autoencoder.

    上下文自动编码器。我们的上下文自动编码器的目的是生成一个不受雨滴影响的图像。自动编码器的输入是输入图像与来自注意力递归网络的最终注意力图的拼接。我们的深层自动编码器有16个conv-relu块,并增加了跳过连接( skip connections)以防止模糊输出。图4阐述了上下文自动编码器的体系结构。

    

    图4.我们的上下文自动编码器的架构。利用多尺度损失和感知损失来帮助训练自动编码器。

   As shown in the figure, there are two loss functions in our autoencoder: multi-scale losses and perceptual loss. For the multi-scale losses, we extract features from different decoder layers to form outputs in different sizes. By adopting this, we intend to capture more contextual information from different scales. This is also the reason why we call it contextual autoencoder.

   如图4所示,在我们的自动编码器中有两个损失函数:多尺度损失( multi-scale losses)和感知损失( perceptual loss)。对于多尺度损失,我们从不同的解码器层中提取特征,形成不同大小的输出。通过采用这种方法,我们打算从不同的尺度上获取更多的上下文信息。这也是为什么我们称它为上下文自动编码器。

    We define the loss function as:

    我们将损失函数定义为:

     where S_i indicates the i-th output extracted from the decoder layers, and T_i indicates the ground truth that has the same scale as that of S_i. The λ_i are the weights for different scales. We put more weight at the larger scale. To be more specific, the outputs of the last 1st, 3rd and 5th layers are used, whose sizes are 1/4, 1/2 and 1 of the original size, respectively. Smaller layers are not used since the information is insignificant. We set λ’s to 0.6, 0.8, 1.0.

    其中 S_i 表示从解码器层提取的第 i 个输出,T_i 表示与 S_i 具有相同尺度的地面真相(ground truth),λ_i 是不同尺度的权重。我们在较大的尺度上放置更大的权重。更具体地说,使用倒数第1、第3和第5层的输出,它们的尺寸分别为原尺寸的1/4、1/2和1。由于更小尺度的层所含信息不重要,因此不再使用。我们将 λ 设置为0.6、0.8、1.0。
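【译者补充】按上面的描述,多尺度损失即各尺度输出 S_i 与按相同比例缩放后的真值 T_i 之间 MSE 的加权和。下面是一个 PyTorch 示意;各尺度输出如何从解码器引出此处不展开,仅演示损失的组合,权重取文中的 0.6、0.8、1.0:

```python
import torch
import torch.nn.functional as F

def multiscale_loss(outputs, gt, weights=(0.6, 0.8, 1.0)):
    """outputs: 解码器不同层得到的输出列表(尺寸约为原图的 1/4、1/2、1);gt: 原尺寸真值图像。"""
    loss = 0.0
    for S_i, w in zip(outputs, weights):
        # 将真值缩放到与该尺度输出相同的大小,得到 T_i
        T_i = F.interpolate(gt, size=S_i.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + w * F.mse_loss(S_i, T_i)
    return loss
```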

    Aside from the multi-scale losses, which are based on a pixel-by-pixel operation, we also add a perceptual loss [11] that measures the global discrepancy between the features of the autoencoder’s output and those of the corresponding ground-truth clean image. These features can be extracted from a well-trained CNN, e.g. VGG16 pretrained on ImageNet dataset. Our perceptual loss function is expressed as:

    除了基于逐像素操作的多尺度损失之外,我们还添加了一个感知损失[11],它度量自动编码器输出的特征与相应的地面真实干净图像特征之间的全局差异(global discrepancy)。这些特征可以从预训练好的CNN中提取,例如在ImageNet数据集上预训练的VGG16。我们的感知损失函数表示为:

    where VGG is a pretrained CNN, and produces features from a given input image. O is the output image of the autoencoder or, in fact, of the whole generative network: O = G(I). T is the ground-truth image that is free from raindrops.

    其中VGG是预先训练的CNN,从给定的输入图像生成特征。O是自动编码器的输出图像,实际上也是整个生成网络的输出图像:O = G(I)。T是不受雨滴影响的地面真实图像。
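【译者补充】按上面的描述,感知损失即对预训练 VGG 提取的特征求 MSE:L_P = MSE(VGG(O), VGG(T))。下面是一个基于 torchvision 预训练 VGG16 的示意;具体取 VGG 的哪一层特征原文未说明,这里截取前 16 层属于译者的假设:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# 取 VGG16 的前若干层作为固定的特征提取器(截取层数为假设)
vgg_features = vgg16(pretrained=True).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def perceptual_loss(O, T):
    """O: 生成网络输出;T: 无雨滴真值图像(均为归一化后的 NCHW 张量)。"""
    return F.mse_loss(vgg_features(O), vgg_features(T))
```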

    Overall, the loss of our generative network can be written as:

    总的说来,我们的生成网络的损失可以写成:
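【译者补充】该总损失公式在原文中以图片给出。按论文的描述,它是对抗项与注意力损失、多尺度损失、感知损失之和;下面的组合方式以及对抗项权重 0.01 是译者根据对原论文的理解给出的示意,并非正文中明确写出的数值:

```python
def generator_loss(g_adv, l_att, l_m, l_p, adv_weight=0.01):
    """g_adv: 对抗损失;l_att: 注意力损失;l_m: 多尺度损失;l_p: 感知损失。"""
    return adv_weight * g_adv + l_att + l_m + l_p
```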

4.2. Discriminative Network(判别网络)

    To differentiate fake images from real ones, a few GAN-based methods adopt global and local image-content consistency in the discriminative part (e.g. [9, 13]). The global discriminator looks at the whole image to check if there is any inconsistency, while the local discriminator looks at small specific regions. The strategy of a local discriminator is particularly useful if we know the regions that are likely to be fake (like in the case of image inpainting, where the regions to be restored are given). Unfortunately, in our problem, particularly in our testing stage, we do not know where the regions degraded by raindrops are, and the information is not given. Hence, the local discriminator must try to find those regions by itself.

    为了区分假图像和真实图像,一些基于GAN的方法在判别部分(例如[9,13])采用了全局和局部图像内容一致性。全局鉴别器查看整个图像以检查是否有任何不一致,而局部鉴别器则查看小的特定区域。如果我们知道那些可能是假的区域(就像图像修复的情形,其中待恢复的区域是给定的),那么局部鉴别器的策略就特别有用。不幸的是,在我们的问题中,特别是在测试阶段,我们不知道哪些区域因雨滴而退化,这一信息没有给出。因此,局部鉴别器必须设法自己找到这些区域。

    To resolve this problem, our idea is to use an attentive discriminator. For this, we employ the attention map generated by our attentive-recurrent network. Specifically, we extract the features from the interior layers of the discriminator, and feed them to a CNN. We define a loss function based on the CNN’s output and the attention map. Moreover, we use the CNN’s output and multiply it with the original features from the discriminative network, before feeding them into the next layers. Our underlying idea of doing this is to guide our discriminator to focus on regions indicated by the attention map. Finally, at the end layer we use a fully connected layer to decide whether the input image is fake or real. The right part of Fig. 2 illustrates our discriminative architecture.

   为了解决这个问题,我们的想法是使用一个专注的鉴别器( an attentive discriminator)。为此,我们采用了由关注-循环网络产生的注意力图( the attention map)。具体来说,我们从鉴别器(discriminator)的内部层中提取特征,并将它们输入一个CNN。我们根据该CNN的输出和注意力图定义了一个损失函数。此外,我们将该CNN的输出与判别网络中的原始特征相乘,然后再将它们输入下一层。我们这样做的基本想法是引导判别器专注于由注意力图所指示的区域。最后,在最后一层,我们使用一个全连接层来判断输入图像是假的还是真实的。图2的右边部分阐述了我们的判别器结构。

   

   The whole loss function of the discriminator can be expressed as:

   判别器的整个损失函数可以表示为:

  

        where L_map is the loss between the features extracted from interior layers of the discriminator and the final attention map:

       其中 L_map 是从鉴别器内部层提取的特征与最终注意力图之间的损失:

        where D_map represents the process of producing a 2D map by the discriminative network. γ is set to 0.05. R is a sample image drawn from a pool of real and clean images. 0 represents a map containing only 0 values. Thus, the second term of Eq. (9) implies that for R, there is no specific region necessary to focus on.

      其中,D_map 表示由判别网络生成二维映射图的过程。γ 设置为0.05。R 是从真实干净图像集合中提取的样本图像。0 表示仅包含0值的映射图。因此,式(9)的第二项意味着对于 R,没有需要关注的特定区域。
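【译者补充】按上面的描述,式(9)的 L_map 是 D_map(O) 与最终注意力图之间、以及 D_map(R) 与全零图之间的 MSE 之和;判别器的总损失再加上标准的对抗项,其中 γ=0.05 作为 L_map 的权重是译者对上下文的理解。下面是一个 PyTorch 示意,其中“D 同时返回真假概率与二维映射图”这一接口为假设:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, O, R, A_N, gamma=0.05):
    """O: 生成图像;R: 真实干净图像;A_N: 最终注意力图。
    假设 D(x) 返回 (prob, d_map):真假概率与判别器内部产生的二维映射图。"""
    prob_R, map_R = D(R)
    prob_O, map_O = D(O)
    # 式(9):对生成图像,映射应贴近注意力图;对真实图像,映射应接近全零
    l_map = F.mse_loss(map_O, A_N) + F.mse_loss(map_R, torch.zeros_like(map_R))
    eps = 1e-7
    l_adv = -(torch.log(prob_R + eps).mean() + torch.log(1 - prob_O + eps).mean())
    return l_adv + gamma * l_map
```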

     Our discriminative network contains 7 convolution layers with the kernel of (3, 3), a fully connected layer of 1024 and a single neuron with a sigmoid activation function. We extract the features from the last third convolution layers and multiply back in element-wise.

     我们的判别网络包含7个卷积核为(3, 3)的卷积层、一个1024维的全连接层,以及一个使用sigmoid激活函数的单个输出神经元。我们从倒数第三个卷积层提取特征,并将其逐元素(element-wise)乘回原特征。
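【译者补充】下面按上一段的文字描述给出判别网络的一个 PyTorch 结构示意:7 个 3×3 卷积层、一个 1024 维全连接层和一个带 sigmoid 的输出神经元,并用注意力分支产生的映射图逐元素乘回中间特征。具体通道数、从哪一层引出注意力分支、池化方式等细节均为译者的假设:

```python
import torch
import torch.nn as nn

class AttentiveDiscriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # 前 4 个 3x3 卷积层提取特征
        self.front = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # 注意力分支:由内部特征生成单通道二维映射图 D_map
        self.attention = nn.Conv2d(ch, 1, 3, padding=1)
        # 后 3 个 3x3 卷积层(共 7 个)
        self.back = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(ch * 4 * 4, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.front(x)
        d_map = torch.sigmoid(self.attention(feat))
        feat = feat * d_map              # 用映射图引导判别器关注雨滴相关区域
        prob = self.fc(self.back(feat))
        return prob, d_map
```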

5. Raindrop Dataset(雨滴数据集)

    Similar to current deep learning methods, our method  requires relatively a large amount of data with groundtruths for training. However, since there is no such dataset for raindrops attached to a glass window or lens, we create our own. For our case, we need a set of image pairs, where each pair contains exactly the same background scene, yet one is degraded by raindrops and the other one is free from raindrops. To obtain this, we use two pieces of exactly the same glass: one sprayed with water, and the other is left clean. Using two pieces of glass allows us to avoid misalignment, as glass has a refractive index that is different from air, and thus refracts light rays. In general, we also need to manage any other causes of misalignment, such as camera motion, when taking the two images; and, ensure that the atmospheric conditions (e.g., sunlight, clouds, etc.) as well as the background objects to be static during the acquisition process.

    与目前的深度学习方法类似,我们的方法需要相对大量的带有真值(ground truth)的数据来进行训练。然而,由于没有针对附着在玻璃窗口或镜头上的雨滴的这类数据集,我们创建了自己的数据集。对于我们的情况,我们需要一组图像对,其中每对包含完全相同的背景场景,但一幅被雨滴退化,另一幅没有雨滴。为了获得这样的数据,我们使用了两块完全相同的玻璃:一块喷上水,另一块保持干净。使用两块玻璃可以使我们避免不对齐的情况( misalignment),因为玻璃的折射率不同于空气,因此会折射光线。一般情况下,在拍摄这两幅图像时,我们还需要处理任何其他导致不对齐的原因,例如摄像机的运动;并确保大气条件(如阳光、云层等)以及背景物体在采集过程中保持静止。

     In total, we captured 1119 pairs of images, with various background scenes and raindrops. We used Sony A6000 and Canon EOS 60 for the image acquisition. Our glass slabs have the thickness of 3 mm and attached to the camera lens. We set the distance between the glass and the camera varying from 2 to 5 cm to generate diverse raindrop images, and to minimize the reflection effect of the glass. Fig. 5 shows some samples of our data.

     我们总共拍摄了1119对图像,有各种背景场景和雨滴。我们使用索尼A6000和佳能Eos 60进行图像采集。我们的玻璃板厚度为3毫米,附在照相机镜头上。我们设定玻璃和照相机之间的距离从2厘米到5厘米,以产生不同的雨滴图像,并尽量减少玻璃的反射效果。图5展示了我们的数据样本。

   图5.我们数据集的样本。上:因雨滴而退化的图像。下:对应的地面真实图像。

6. Experimental Results(实验结果)

     Quantitative Evaluation. Table 1 shows the quantitative comparisons between our method and other existing methods: Eigen13 [1], Pix2Pix [10]. As shown in the table, compared to these two, our PSNR and SSIM values are higher. This indicates that our method can generate results more similar to the groundtruths.

     定量评价。表1显示了我们的方法与其他现有方法的定量比较:Eigen13 [1] 和 Pix2Pix [10]。如表中所示,与这两种方法相比,我们的PSNR和SSIM值更高。这表明,我们的方法可以产生更接近地面真相(ground truth)的结果。

表1.定量评价结果。A是我们单独的上下文自动编码器。A+D是自动编码器加鉴别器。A+AD是自动编码器加注意鉴别器。AA+AD是我们的完整架构:专注的自动编码器加注意力鉴别器。

     We also compare our whole attentive GAN with some parts of our own network: A (autoencoder alone without the attention map), A+D (non-attentive autoencoder plus non-attentive discriminator), A+AD (non-attentive autoencoder plus attentive discriminator). Our whole attentive GAN is indicated by AA+AD (attentive autoencoder plus attentive discriminator). As shown in the evaluation table, AA+AD performs better than the other possible configurations. This is the quantitative evidence that the attentive map is needed by both the generative and discriminative networks.

     我们还比较了我们完整的注意力GAN与我们自己网络的部分配置:A(没有注意力图的自动编码器)、A+D(非注意自动编码器加非注意鉴别器)、A+AD(非注意自动编码器加注意鉴别器)。我们完整的注意力GAN记为AA+AD(注意自动编码器加注意鉴别器)。如评估表所示,AA+AD的性能优于其他可能的配置。这定量地证明了生成网络和判别网络都需要注意力图。

     Qualitative Evaluation. Fig. 6 shows the results of Eigen13 [1] and Pix2Pix [10] in comparison to our results. As can be seen, our method is considerably more effective in removing raindrops compared to Eigen13 and Pix2Pix. In Fig. 7, we also compare our whole network (AA+AD) with other possible configurations from our architectures (A, A+D, A+AD). Although A+D is qualitatively better than A, and A+AD is better than A+D, our overall network is more effective than A+AD. This is the qualitative evidence that, again, the attentive map is needed by both the generative and discriminative networks.

     定性评价。图6给出了Eigen13 [1]和Pix2Pix [10]的结果,并与我们的结果进行了比较。可以看出,与Eigen13和Pix2Pix相比,我们的方法在去除雨滴方面要有效得多。在图7中,我们还比较了我们的整个网络(AA+AD)和我们架构的其他可能配置(A,A+D,A+AD)。虽然A+D在定性上优于A,A+AD优于A+D,但我们的整体网络比A+AD更有效。这是定性的证据(qualitative evidence),再次证明生成网络和判别网络都需要注意力图。

图6.比较几种不同方法的结果。从左到右:地面真实图像,雨滴图像(输入),Eigen13 [1]、Pix2Pix [10]和我们的方法。几乎所有的雨滴都被我们的方法去除了,尽管它们的颜色、形状和透明度是多种多样的。

  图7.比较我们网络架构的某些部分。从左到右:输入, A, A+D, A+AD,我们的完整架构(AA+AD)。

图8.由我们新颖的注意力递归网络生成的注意力图的可视化。随着时间的推移,我们的网络越来越关注雨滴区域和相关结构。

图9.我们的输出与Pix2Pix输出之间对比的特写。我们的输出伪影更少,结构恢复得更好。

      

      Application. To provide further evidence that our visibility enhancement could be useful for computer vision applications, we employ Google Vision API (https://cloud.google.com/vision/) to test whether using our outputs can improve the recognition performance. The results are shown in Fig. 10. As can be seen, using our output, the general recognition is better than without our visibility enhancement process. Furthermore, we perform evaluation on our test dataset, and Fig. 11 shows statistically that using our visibility enhancement outputs significantly outperform those without visibility enhancement, both in terms of the average score of identifying the main object in the input image, and the number of object labels recognized.

     应用。为了进一步证明我们的能见度增强对于计算机视觉应用是有用的,我们使用Google Vision API(https://cloud.google.com/vision/)来测试使用我们的输出是否可以提高识别性能。结果如图10所示。可以看出,使用我们的输出后,总体识别效果优于未经能见度增强处理的情形。此外,我们还在测试数据集上进行了评估,图11从统计上显示,无论是识别输入图像中主要对象的平均得分,还是识别出的对象标签数量,使用我们的能见度增强输出都明显优于未经能见度增强的情况。

图10.一个改进Google Vision API结果的示例。我们的方法增加了主目标检测的分数和识别对象的分数。

图11.基于Google Vision API的改进总结:(A)在输入图像中识别主要对象的平均得分。(B)已识别的物体标签的数目。该方法将识别成绩提高10%,目标识别率提高100%。

7. Conclusion(总结)

    We have proposed a single-image based raindrop removal method. The method utilizes a generative adversarial network, where the generative network produces the attention map via an attentive-recurrent network and applies this map along with the input image to generate a raindrop-free image through a contextual autoencoder. Our discriminative network then assesses the validity of the generated output globally and locally. To be able to validate locally, we inject the attention map into the network. Our novelty lies on the use of the attention map in both generative and discriminative network. We also consider that our method is the first method that can handle relatively severe presence of raindrops, which the state of the art methods in raindrop removal fail to handle.

    我们提出了一种基于单幅图像的雨滴去除方法。该方法利用生成对抗网络,其中生成网络通过注意力递归网络生成注意力图,并将该注意力图与输入图像一起送入上下文自动编码器,以生成无雨滴的图像。然后,我们的判别网络在全局和局部评估生成输出的有效性(validity)。为了能够进行局部验证,我们将注意力图注入该网络。我们的新颖之处在于在生成网络和判别网络中都使用了注意力图。我们还认为,我们的方法是第一种能够处理相对严重的雨滴存在的方法,而目前最先进的雨滴清除方法无法处理这种情况。

References(参考文献)

[1] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the IEEE International Conference on Computer Vision, pages 633–640, 2013. 2, 3, 6, 7

[2] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017. 2

[3] K. Garg and S. K. Nayar. Vision and rain. International Journal of Computer Vision, 75(1):3, 2007. 2

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. 3

[5] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015. 4

[6] T. Hara, H. Saito, and T. Kanade. Removal of glare caused by water droplets. In Visual Media Production, 2009. CVMP ’09. Conference for, pages 144–151. IEEE, 2009. 2

[7] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE transactions on pattern analysis and machine intelligence, 33(12):2341–2353, 2011. 2

[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2, 4

[9] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017. 2, 3, 5

[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016. 3, 6, 7

[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016. 5

[12] H. Kurihata, T. Takahashi, I. Ide, Y. Mekada, H. Murase, Y. Tamatsu, and T. Miyahara. Rainy weather recognition from in-vehicle camera images for driver assistance. In Intelligent Vehicles Symposium, 2005. Proceedings. IEEE, pages 205–210. IEEE, 2005. 1, 2

[13] Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. arXiv preprint arXiv:1704.05838, 2017. 2, 3, 5

[14] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown. Single image rain streak separation using layer priors. IEEE Transactions on Image Processing, 2017. 2

[15] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014. 4

[16] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang. Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, pages 154–169. Springer, 2016. 2

[17] M. Roser and A. Geiger. Video-based raindrop detection for improved image registration. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 570–577. IEEE, 2009. 1, 2

[18] M. Roser, J. Kurz, and A. Geiger. Realistic modeling of water droplets for monocular adherent raindrop recognition using bezier curves. In Asian Conference on Computer Vision, pages 235–244. Springer, 2010. 1, 2

[19] R. T. Tan. Visibility in bad weather from a single image. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008. 2

[20] Y. Tanaka, A. Yamashita, T. Kaneko, and K. T. Miura. Removal of adherent waterdrops from images acquired with a stereo camera system. IEICE TRANSACTIONS on Information and Systems, 89(7):2021–2027, 2006. 2

[21] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015. 2, 4

[22] A. Yamashita, I. Fukuchi, and T. Kaneko. Noises removal from image sequences acquired with moving camera by estimating camera motion from spatio-temporal information. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 3794–3801. IEEE, 2009. 2

[23] A. Yamashita, Y. Tanaka, and T. Kaneko. Removal of adherent waterdrops from images acquired with stereo camera. In Intelligent Robots and Systems, 2005 (IROS 2005). 2005 IEEE/RSJ International Conference on, pages 400–405. IEEE, 2005. 2

[24] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1357–1366, 2017. 2

[25] S. You, R. T. Tan, R. Kawakami, Y. Mukaigawa, and K. Ikeuchi. Adherent raindrop modeling, detection and removal in video. IEEE transactions on pattern analysis and machine intelligence, 38(9):1721–1733, 2016. 2, 3

[26] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan. Diversified visual attention networks for fine-grained object classification. arXiv preprint arXiv:1606.08572, 2016. 4
