Paper Reading Notes (26): Group Normalization

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems — BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform or compete with its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.

Batch Normalization (Batch Norm or BN) [26] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [58, 20] and beyond [53]. BN normalizes the features by the mean and variance computed within a (mini-)batch. This has been shown by many practices to ease optimization and enable very deep networks to converge. The stochastic uncertainty of the batch statistics also acts as a regularizer that can benefit generalization. BN has been a foundation of many state-of-the-art computer vision algorithms.
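
As a rough illustration of the computation described above (a minimal NumPy sketch assuming NCHW feature maps, not code from the paper), BN reduces over the batch and spatial axes and keeps a separate mean and variance for each channel:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: features of shape (N, C, H, W); gamma, beta: shape (1, C, 1, 1)
    # Statistics are computed over the batch and spatial axes, per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learned scale and shift
```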

Despite its great success, BN exhibits drawbacks that are also caused by its distinct behavior of normalizing along the batch dimension. In particular, it is required for BN to work with a sufficiently large batch size (e.g., 32 per worker [26, 58, 20]). A small batch leads to inaccurate estimation of the batch statistics, and reducing BN’s batch size increases the model error dramatically (Figure 1). As a result, many recent models [58, 20, 56, 24, 62] are trained with non-trivial batch sizes that are memory-consuming. The heavy reliance on BN’s effectiveness to train models in turn prohibits people from exploring higher-capacity models that would be limited by memory.

The restriction on batch sizes is more demanding in computer vision tasks including detection [12, 46, 18], segmentation [37, 18], video recognition [59, 6], and other highlevel systems built on them. For example, the Fast/er and Mask R-CNN frameworks [12, 46, 18] use a batch size of 1 or 2 images because of higher resolution, where BN is “frozen” by transforming to a linear layer [20]; in video classification with 3D convolutions [59, 6], the presence of spatial-temporal features introduces a trade-off between the temporal length and batch size. The usage of BN often requires these systems to compromise between the model design and batch sizes.

This paper presents Group Normalization (GN) as a simple alternative to BN. We notice that many classical features like SIFT [38] and HOG [9] are group-wise features and involve group-wise normalization. For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram. Analogously, we propose GN as a layer that divides channels into groups and normalizes the features within each group (Figure 2). GN does not exploit the batch dimension, and its computation is independent of batch sizes.
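
Because GN only reshapes the channels into groups and normalizes each group per sample, it fits in a few lines of code. Below is a minimal NumPy sketch (NCHW layout assumed, group count G must divide C, G = 32 being the paper's default; this is an illustration, not the paper's reference implementation). Frameworks such as PyTorch also provide a built-in layer, torch.nn.GroupNorm.

```python
import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    # x: features of shape (N, C, H, W); gamma, beta: shape (1, C, 1, 1)
    N, C, H, W = x.shape
    assert C % G == 0, "the group count must divide the channel count"
    x = x.reshape(N, G, C // G, H, W)
    # Mean and variance are computed per sample and per group of channels,
    # so nothing here depends on the batch axis.
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return gamma * x.reshape(N, C, H, W) + beta  # learned per-channel scale/shift
```

Note that none of the statistics above involve the batch axis, which is why the computation is independent of batch size.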

GN behaves very stably over a wide range of batch sizes (Figure 1). With a batch size of 2 samples, GN has 10.6% lower error than its BN counterpart for ResNet-50 [20] in ImageNet [49]. With a regular batch size, GN is comparably good as BN (with a gap of ∼0.5%) and outperforms other normalization variants [3, 60, 50]. Moreover, although the batch size may change, GN can naturally transfer from pretraining to fine-tuning. GN shows improved or comparable results vs. its BN counterpart on Mask R-CNN for COCO object detection and segmentation [36], and on 3D convolutional networks for Kinetics video classification [30]. The effectiveness of GN in ImageNet, COCO, and Kinetics demonstrates that GN is a competitive alternative to BN that has been dominant in these tasks.

There have been existing methods, such as Layer Normalization (LN) [3] and Instance Normalization (IN) [60] (Figure 2), that also avoid normalizing along the batch dimension. These methods are effective for training sequential models (RNN/LSTM [48, 22]) or generative models (GANs [15, 27]). But as we will show by experiments, both LN and IN have limited success in visual recognition, for which GN presents better results. Conversely, GN could be used in place of LN and IN and thus is applicable for sequential or generative models. This is beyond the focus of this paper, but it is suggestive for future research.

Normalization. It is well-known that normalizing the input data makes training faster [33]. To normalize hidden features, initialization methods [33, 14, 19] have been derived based on strong assumptions of feature distributions, which can become invalid when training evolves.

Normalization layers in deep networks had been widely used before the development of BN. Local Response Normalization (LRN) [39, 28, 32] was a component in AlexNet [32] and following models [63, 52, 57]. Unlike recent methods [26, 3, 60], LRN computes the statistics in a small neighborhood for each pixel.
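
For concreteness, here is a hedged sketch of the AlexNet-style cross-channel form of LRN (the constants n, k, alpha, beta follow commonly quoted AlexNet defaults and are illustrative only): each response is divided by a term accumulated from a few neighboring channels at the same pixel, rather than from any global statistics.

```python
import numpy as np

def local_response_norm(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    # x: features of shape (N, C, H, W).
    # Each activation is normalized by the summed squares of up to n
    # neighboring channels at the same spatial position.
    N, C, H, W = x.shape
    out = np.empty_like(x)
    for c in range(C):
        lo, hi = max(0, c - n // 2), min(C, c + n // 2 + 1)
        denom = (k + alpha * np.square(x[:, lo:hi]).sum(axis=1)) ** beta
        out[:, c] = x[:, c] / denom
    return out
```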

Batch Normalization [26] performs more global normalization along the batch dimension (and as importantly, it suggests to do this for all layers). But the concept of “batch” is not always present, or it may change from time to time. For example, batch-wise normalization is not legitimate at inference time, so the mean and variance are pre-computed from the training set [26], often by running average; consequently, there is no normalization performed when testing. The pre-computed statistics may also change when the target data distribution changes [44]. These issues lead to inconsistency at training, transferring, and testing time. In addition, as aforementioned, reducing the batch size can have dramatic impact on the estimated batch statistics.
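
A minimal sketch of the bookkeeping this paragraph refers to (the momentum value and NCHW layout are assumptions for illustration): batch statistics are folded into running averages during training, and at test time those frozen statistics turn BN into a fixed per-channel linear transform.

```python
import numpy as np

def update_running_stats(x, running_mean, running_var, momentum=0.1):
    # Training time: fold the current batch statistics into running averages.
    # running_mean, running_var: shape (C,)
    mean = x.mean(axis=(0, 2, 3))
    var = x.var(axis=(0, 2, 3))
    running_mean = (1.0 - momentum) * running_mean + momentum * mean
    running_var = (1.0 - momentum) * running_var + momentum * var
    return running_mean, running_var

def batch_norm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # Test time: no statistics are computed from the data; the pre-computed
    # running mean/variance are used, so BN acts as a fixed linear layer.
    rm = running_mean.reshape(1, -1, 1, 1)
    rv = running_var.reshape(1, -1, 1, 1)
    return gamma * (x - rm) / np.sqrt(rv + eps) + beta
```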

Several normalization methods [3, 60, 50, 2, 45] have been proposed to avoid exploiting the batch dimension. Layer Normalization (LN) [3] operates along the channel dimension, and Instance Normalization (IN) [60] performs BN-like computation but only for each sample (Figure 2). Instead of operating on features, Weight Normalization (WN) [50] proposes to normalize the filter weights. These methods do not suffer from the issues caused by the batch dimension, but they have not been able to approach BN’s accuracy in many visual recognition tasks. We provide comparisons with these methods in context of the remaining sections.
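
The difference among these feature-normalization layers is essentially which axes the mean and variance are computed over; the compact sketch below (NumPy, NCHW layout, scale/shift omitted, an illustration rather than any reference code) makes the comparison explicit. WN is not included because it normalizes the filter weights rather than the features.

```python
import numpy as np

def normalize(x, method, G=32, eps=1e-5):
    # x: features of shape (N, C, H, W). Scale/shift parameters are omitted.
    N, C, H, W = x.shape
    if method == "BN":    # over (N, H, W): shared across the batch, per channel
        axes = (0, 2, 3)
    elif method == "LN":  # over (C, H, W): per sample, all channels together
        axes = (1, 2, 3)
    elif method == "IN":  # over (H, W): per sample and per channel
        axes = (2, 3)
    elif method == "GN":  # per sample, over each group of C // G channels
        x = x.reshape(N, G, C // G, H, W)
        axes = (2, 3, 4)
    else:
        raise ValueError(f"unknown method: {method}")
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return ((x - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)
```

With G = 1 the GN branch reduces over the same axes as LN, and with G = C it matches IN, which is why GN can be used in their place.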

Addressing small batches. Ioffe [25] proposes Batch Renormalization (BR) that alleviates BN’s issue involving small batches. BR introduces two extra parameters that constrain the estimated mean and variance of BN within a certain range, reducing their drift when the batch size is small. BR has better accuracy than BN in the small-batch regime. But BR is also batch-dependent, and when the batch size decreases its accuracy still degrades [25].
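
As a hedged sketch of the correction described in [25] (the clipping bounds r_max and d_max and the moving-statistic bookkeeping are simplified here, and r and d are treated as constants with respect to gradients), the two extra terms tie the batch statistics to the running statistics so that the normalized output drifts less when the batch estimates are noisy:

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, moving_mean, moving_std,
                       r_max=3.0, d_max=5.0, eps=1e-5):
    # x: (N, C, H, W); moving_mean, moving_std: shape (1, C, 1, 1).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    std = np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)
    # The two extra correction terms, constrained to a certain range.
    r = np.clip(std / moving_std, 1.0 / r_max, r_max)
    d = np.clip((mean - moving_mean) / moving_std, -d_max, d_max)
    x_hat = (x - mean) / std * r + d
    return gamma * x_hat + beta
```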

There are also attempts to avoid using small batches. The object detector in [42] performs synchronized BN whose mean and variance are computed across multiple GPUs. However, this method does not solve the problem of small batches; instead, it migrates the algorithm problem to engineering and hardware demands, using a number of GPUs proportional to BN’s requirements. Moreover, the synchronized BN computation prevents using asynchronous solvers (ASGD [10]), a practical solution to large-scale training widely used in industry. These issues can limit the scope of using synchronized BN.

Instead of addressing the batch statistics computation (e.g., [25, 42]), our normalization method inherently avoids this computation.

Group-wise computation. Group convolutions have been presented by AlexNet [32] for distributing a model into two GPUs. The concept of groups as a dimension for model design has been more widely studied recently. The work of ResNeXt [62] investigates the trade-off between depth, width, and groups, and it suggests that a larger number of groups can improve accuracy under similar computational cost.

MobileNet [23] and Xception [7] exploit channel-wise (also called “depth-wise”) convolutions, which are group convolutions with a group number equal to the channel number. ShuffleNet [64] proposes a channel shuffle operation that permutes the axes of grouped features. These methods all involve dividing the channel dimension into groups. Despite the relation to these methods, GN does not require group convolutions. GN is a generic layer, as we evaluate in standard ResNets [20].
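
A short PyTorch sketch of this relationship (the channel count and group counts are arbitrary choices for illustration): a depth-wise convolution is simply a group convolution whose group number equals the channel number.

```python
import torch
import torch.nn as nn

C = 64  # channel count, chosen only for illustration
x = torch.randn(2, C, 56, 56)

# Group convolution: channels are split into 8 groups, and each output group
# is computed only from its corresponding input group.
grouped = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=8)

# Depth-wise ("channel-wise") convolution as in MobileNet/Xception:
# the number of groups equals the number of channels.
depthwise = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)

print(grouped(x).shape, depthwise(x).shape)  # both (2, 64, 56, 56)
```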

The channels of visual representations are not entirely independent. Classical features of SIFT [38], HOG [9], and GIST [40] are group-wise representations by design, where each group of channels is constructed by some kind of histogram. These features are often processed by groupwise normalization over each histogram or each orientation. Higher-level features such as VLAD [29] and Fisher Vectors (FV) [43] are also group-wise features where a group can be thought of as the sub-vector computed with respect to a cluster.
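
For example, the HOG-style block normalization mentioned above amounts to something like the following (a simplified NumPy sketch; real HOG pipelines add clipping and other details):

```python
import numpy as np

def normalize_blocks(block_histograms, eps=1e-6):
    # block_histograms: shape (num_blocks, D), one concatenated orientation
    # histogram per block; each block (group) is scaled by its own L2 norm.
    norms = np.linalg.norm(block_histograms, axis=1, keepdims=True)
    return block_histograms / (norms + eps)
```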

Analogously, it is not necessary to think of deep neural network features as unstructured vectors. For example, for conv1 (the first convolutional layer) of a network, it is reasonable to expect a filter and its horizontal flipping to exhibit similar distributions of filter responses on natural images. If conv1 happens to approximately learn this pair of filters, or if the horizontal flipping (or other transformations) is made into the architectures by design [11, 8], then the corresponding channels of these filters can be normalized together.

The higher-level layers are more abstract and their behaviors are not as intuitive. However, in addition to orientations (SIFT [38], HOG [9], or [11, 8]), there are many factors that could lead to grouping, e.g., frequency, shapes, illumination, textures. Their coefficients can be interdependent. In fact, a well-accepted computational model in neuroscience is to normalize across the cell responses [21, 51, 54, 5], “with various receptive-field centers (covering the visual field) and with various spatiotemporal frequency tunings” (p183, [21]); this can happen not only in the primary visual cortex, but also “throughout the visual system” [5]. Motivated by these works, we propose new generic group-wise normalization for deep neural networks.

We have presented GN as an effective normalization layer without exploiting the batch dimension. We have evaluated GN’s behaviors in a variety of applications. We note, however, that BN has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, which may not be optimal for GN-based models. It is possible that re-designing the systems or searching new hyper-parameters for GN will give better results.
In addition, we have shown that GN is related to LN and IN, two normalization methods that are particularly successful in training recurrent (RNN/LSTM) or generative (GAN) models. This suggests studying GN in those areas in the future. We will also investigate GN’s performance on learning representations for reinforcement learning (RL) tasks, e.g., [53], where BN is playing an important role for training very deep models [20].

Figure 2. Normalization methods. Each subplot shows a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W ) as the spatial axes. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels.

Reposted from blog.csdn.net/sunshine_010/article/details/80027083