Paper: EfficientNetV2: Smaller Models and Faster Training

Official implementation: automl/efficientnetv2 at master · google/automl · GitHub

Third-party implementation: mmpretrain/efficientnet_v2.py at main · open-mmlab/mmpretrain · GitHub

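Preface

EfficientNetV2 is an improved version of EfficientNet v1. By systematically studying the training bottlenecks of v1, the authors identified the following problems:

Training with large input resolutions is slow.
Depthwise convolutions are slow in the shallow (early) layers of the network.
Scaling every stage of the network by the same factor is suboptimal.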

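Based on these observations, the authors design a search space that includes additional operations such as Fused-MBConv, and apply training-aware NAS and scaling to jointly optimize model accuracy, training speed, and parameter size. The resulting family of networks is EfficientNetV2.

The paper also proposes an improved progressive learning method: early in training it uses smaller inputs and weak regularization, then gradually increases both the input resolution and the regularization strength as training proceeds. This speeds up training without causing a drop in accuracy.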

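Contributions

The paper proposes EfficientNetV2, a family of smaller and faster models found with training-aware NAS and scaling. EfficientNetV2 outperforms previous models in both training speed and parameter efficiency.
The paper proposes an improved progressive training method that adaptively adjusts regularization together with input size, and shows experimentally that it both speeds up training and improves accuracy.
Combining EfficientNetV2 with the improved progressive training method yields up to 11x faster training and up to 6.8x better parameter efficiency than previous models on the ImageNet, CIFAR, Cars, and Flowers datasets.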

Method

Understanding Training Efficiency

The authors first study the training bottlenecks of v1 and propose a few simple techniques to improve training efficiency.

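Training with very large image sizes is slow

In v1, larger inputs increase memory usage. Because total memory is fixed, the batch size has to be reduced, which slows training down. A simple improvement is to apply FixRes, that is, to train with smaller images than those used for inference. As Table 2 shows, smaller input images require less computation and allow a larger batch size, which speeds up training by up to 2.2x.

The paper goes further and proposes a more advanced training method that progressively adjusts the input image size and the regularization during training. It is described in detail in the Progressive Learning section.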

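Depthwise convolutions are slow in early layers but effective in later stages

Another training bottleneck in v1 is its extensive use of depthwise convolutions. Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully exploit modern accelerators (current hardware and training frameworks apply special optimizations to regular convolutions, but not to depthwise convolutions). The recently proposed Fused-MBConv makes better use of mobile and server accelerators. As Figure 2 shows, it replaces the 3x3 depthwise convolution and the 1x1 expansion convolution in MBConv with a single regular 3x3 convolution.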
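To make the trade-off concrete, the following sketch counts convolution weights for the part of the block that Fused-MBConv changes. It is a back-of-the-envelope illustration, not code from either implementation: the function names are made up for this example, and biases, normalization layers, SE, and the final 1x1 projection (which both blocks share) are ignored.

```python
def mbconv_expand_dw_params(in_ch: int, expand_ratio: int, k: int = 3) -> int:
    """Weights in MBConv's 1x1 expansion conv plus its k x k depthwise conv."""
    mid_ch = in_ch * expand_ratio
    expand_1x1 = in_ch * mid_ch   # 1x1 conv: in_ch -> mid_ch
    depthwise = mid_ch * k * k    # one k x k filter per expanded channel
    return expand_1x1 + depthwise


def fused_conv_params(in_ch: int, expand_ratio: int, k: int = 3) -> int:
    """Weights in the single regular k x k conv that Fused-MBConv uses instead."""
    mid_ch = in_ch * expand_ratio
    return in_ch * mid_ch * k * k


# Compare a narrow early stage with a wide late stage (expand_ratio = 4).
for ch in (24, 256):
    mb = mbconv_expand_dw_params(ch, 4)
    fused = fused_conv_params(ch, 4)
    print(f"in_ch={ch}: MBConv={mb}, Fused={fused}, extra={fused - mb}")
```

With 24 input channels the fused variant adds fewer than 20 thousand weights, while with 256 input channels it adds roughly 2 million. This is consistent with the observation that fusing is cheap in the early, narrow stages and expensive in the later, wide ones.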

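To compare the two blocks systematically, the authors gradually replace the MBConv blocks in EfficientNet-B4 with Fused-MBConv, as shown in Table 3. Replacing stages 1 to 3 improves both accuracy and training speed at the cost of a small increase in parameters and FLOPs. Replacing all MBConv blocks in stages 1 to 7, however, greatly increases parameters and FLOPs while reducing both accuracy and training speed. The right combination of the two building blocks is therefore not obvious, and the authors use NAS to search for it automatically.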

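Scaling up every stage equally is suboptimal

v1 scales up all stages with a simple compound scaling rule. For example, with a depth coefficient of 2, the number of layers in every stage is doubled. However, different stages do not contribute equally to training speed and parameter efficiency. The authors therefore adopt a non-uniform scaling strategy that gradually adds more layers to the later stages of the network. In addition, v1 scales up the input size aggressively, which increases memory usage and slows training. To address this, the paper slightly modifies the scaling rule and restricts the maximum input size to a smaller value.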
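The difference between the two strategies can be sketched in a few lines. This is only an illustration: the base block counts are the repeat values of EfficientNetV2-S from the code analysis section, rounding up is an assumption about how fractional depths are handled, and the per-stage coefficients are hypothetical values chosen for the example, not the ones used in the paper.

```python
import math

# Per-stage block counts of the base network (final 1x1 conv excluded).
BASE_REPEATS = [2, 4, 4, 6, 9, 15]


def scale_uniform(repeats, depth_coeff):
    """v1-style depth scaling: every stage grows by the same factor."""
    return [math.ceil(depth_coeff * r) for r in repeats]


def scale_non_uniform(repeats, stage_coeffs):
    """Non-uniform scaling: each stage has its own (hypothetical) factor."""
    if len(repeats) != len(stage_coeffs):
        raise ValueError("need exactly one coefficient per stage")
    return [math.ceil(c * r) for r, c in zip(repeats, stage_coeffs)]


# Hypothetical coefficients: leave early stages alone, deepen later ones.
HYPOTHETICAL_STAGE_COEFFS = [1.0, 1.0, 1.0, 1.0, 1.5, 2.0]

print(scale_uniform(BASE_REPEATS, 2.0))
print(scale_non_uniform(BASE_REPEATS, HYPOTHETICAL_STAGE_COEFFS))
```

Uniform scaling also doubles the early stages, which operate on high-resolution feature maps, whereas the non-uniform variant spends the extra depth only on the later stages.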

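EfficientNetV2 Architecture

Compared with v1, some of the NAS settings were changed; they are not covered in detail here. The architecture of the searched EfficientNetV2-S is shown in Table 4.

Compared with v1, several differences stand out:

v2 makes extensive use of both MBConv and the newly added Fused-MBConv in the shallow layers of the network.
v2 prefers smaller expansion ratios for MBConv, because smaller expansion ratios incur less memory-access overhead.
v2 prefers smaller 3x3 kernels, and adds more layers to compensate for the reduced receptive field that comes with the smaller kernel size.
v2 removes the last stride-1 stage of v1, probably because of its large parameter count and memory overhead.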
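The point about compensating for smaller kernels with more layers follows from standard receptive-field arithmetic. The sketch below uses the usual formula for a stack of stride-1 convolutions; the function names are made up for this example.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of num_layers stacked stride-1 k x k convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def weights_per_channel_pair(kernel_size: int, num_layers: int) -> int:
    """Spatial weights per (input channel, output channel) pair in the stack."""
    return num_layers * kernel_size * kernel_size


# Two stacked 3x3 convs see the same 5x5 window as a single 5x5 conv ...
print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))
# ... while using fewer spatial weights per channel pair (18 versus 25).
print(weights_per_channel_pair(3, 2), weights_per_channel_pair(5, 1))
```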

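EfficientNetV2 Scaling

The authors scale up EfficientNetV2-S with a compound scaling rule similar to that of v1 to obtain EfficientNetV2-M/L, with a few additional optimizations:

The maximum inference image size is restricted to 480, because very large inputs cause high memory overhead and slow down training.
More layers are gradually added to the later stages of the network, in order to increase network capacity without adding much runtime overhead.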

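Progressive Learning

Image size plays an important role in training efficiency. Besides FixRes, many other works also change the input size dynamically during training, but this usually leads to a drop in accuracy. The authors hypothesize that the drop comes from unbalanced regularization: when training with different image sizes, the regularization strength should be adjusted accordingly. In general, larger models need stronger regularization to combat overfitting. By the same reasoning, even for a fixed network, a smaller input means smaller effective network capacity and therefore needs only weaker regularization. Conversely, a larger input image means more computation and larger capacity, which makes the network more prone to overfitting. To validate this hypothesis, the authors train the same network with different input sizes and different regularization strengths. The results are shown in Table 5: as the input size grows, the regularization strength needed to reach the best accuracy grows as well.

Figure 4 illustrates the training process of the improved progressive learning proposed in the paper. In the early epochs, the network is trained with smaller inputs and weak regularization, so that it can learn simple representations easily and quickly. The input size and the regularization strength are then gradually increased, which makes learning progressively harder.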

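Concretely, suppose training has \(N\) steps in total, the final input size is \(S_{e}\), and the final list of regularization strengths is \(\Phi_{e}=\left\{\phi^{k}_{e}\right\}\), where \(k\) indexes a type of regularization, such as the dropout rate or the mixup rate. Training is divided into \(M\) stages. In each stage \(1\le i\le M\), the model is trained with input size \(S_{i}\) and regularization strength \(\Phi_{i}=\left\{\phi^{k}_{i}\right\}\). The final stage \(M\) uses input size \(S_{e}\) and regularization strength \(\Phi_{e}\). For simplicity, the initial input size is taken to be \(S_{0}\) and the initial regularization strength \(\Phi_{0}\), and the values for each stage are determined by linear interpolation between the initial and final values.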
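The interpolation described above can be sketched as follows. With the 1-based stage index used in the text, linear interpolation that starts at \(S_{0}\) and reaches \(S_{e}\) in stage \(M\) is \(S_{i}=S_{0}+(S_{e}-S_{0})\cdot\frac{i-1}{M-1}\), and likewise for each \(\phi^{k}_{i}\). The code uses a 0-based index for convenience. The function name, the rounding of image sizes to integers, and the example values are assumptions made for this illustration; they are not taken from the official implementation or quoted from the paper's tables.

```python
def progressive_schedule(num_stages, size_start, size_end, reg_start, reg_end):
    """Return one (image_size, regularization dict) pair per training stage.

    Both quantities are linearly interpolated from their initial values
    (first stage) to their final values (last stage).
    """
    schedule = []
    for i in range(num_stages):
        # Fraction of the way from the initial to the final setting.
        t = i / (num_stages - 1) if num_stages > 1 else 1.0
        size = round(size_start + (size_end - size_start) * t)
        reg = {
            name: reg_start[name] + (reg_end[name] - reg_start[name]) * t
            for name in reg_start
        }
        schedule.append((size, reg))
    return schedule


# Illustrative values only: 4 stages, image size 128 -> 300, dropout 0.1 -> 0.3.
stages = progressive_schedule(4, 128, 300, {"dropout": 0.1}, {"dropout": 0.3})
for idx, (size, reg) in enumerate(stages, start=1):
    print(f"stage {idx}: size={size}, dropout={reg['dropout']:.3f}")
# image sizes per stage: 128, 185, 243, 300
```

Each stage would then train for roughly \(N/M\) steps with its own image size and regularization settings.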

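Experimental Results

The authors run experiments on ImageNet. The progressive learning settings are shown in Table 6. Training runs for 350 epochs in total, divided into 4 stages of about 87 epochs each. In the table, min and max denote the minimum and maximum values of the input size and of the regularization strength. The authors also follow the practice of FixEfficientNet: the maximum input size during training is about 20% smaller than the one used for inference. Unlike that work, however, no layers are fine-tuned after training.

The full ImageNet results are shown in Table 7. EfficientNetV2 is very fast at inference, and it achieves better accuracy and parameter efficiency than previous ConvNet and Transformer models.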

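Code Analysis

This section briefly walks through the implementation, using the one in mmpretrain as an example. The first piece is the network structure. The settings for S can be checked against Table 4. There are also M, L, and other variants, which are not covered in detail here. The list below contains 7 sub-lists, which correspond to layers 1 to 7 of Table 4, that is, every layer except layer 0. Note also that the expand_ratio values of MBConv are not listed in Table 4; the concrete values are given in arch_settings.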

# Parameters to build layers. From left to right:
# - repeat (int): The repeat number of the block in the layer
# - kernel_size (int): The kernel size of the layer
# - stride (int): The stride of the first block of the layer
# - expand_ratio (int, float): The expand_ratio of the mid_channels
# - in_channel (int): The number of in_channels of the layer
# - out_channel (int): The number of out_channels of the layer
# - se_ratio (float): The squeeze ratio of SELayer.
# - block_type (int): -2: ConvModule, -1: EnhancedConvModule,
#                      0: FusedMBConv, 1: MBConv
arch_settings = {
    **dict.fromkeys(['small', 's'], [[2, 3, 1, 1, 24, 24, 0.0, -1],
                                     [4, 3, 2, 4, 24, 48, 0.0, 0],
                                     [4, 3, 2, 4, 48, 64, 0.0, 0],
                                     [6, 3, 2, 4, 64, 128, 0.25, 1],
                                     [9, 3, 1, 6, 128, 160, 0.25, 1],
                                     [15, 3, 2, 6, 160, 256, 0.25, 1],
                                     [1, 1, 1, 1, 256, 1280, 0.0, -2]])
}
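Positional lists like these are easy to misread, so the sketch below decodes each row into named fields, following the comment block above arch_settings. LayerCfg and BLOCK_NAMES are helper names invented for this illustration and are not part of mmpretrain. The stem stride of 2 is taken from the layer 0 code that follows in this article.

```python
from typing import NamedTuple


class LayerCfg(NamedTuple):
    repeat: int
    kernel_size: int
    stride: int
    expand_ratio: float
    in_channels: int
    out_channels: int
    se_ratio: float
    block_type: int


BLOCK_NAMES = {-2: "ConvModule", -1: "EnhancedConvModule",
               0: "FusedMBConv", 1: "MBConv"}

# The 'small' settings, copied from arch_settings above.
SMALL = [[2, 3, 1, 1, 24, 24, 0.0, -1],
         [4, 3, 2, 4, 24, 48, 0.0, 0],
         [4, 3, 2, 4, 48, 64, 0.0, 0],
         [6, 3, 2, 4, 64, 128, 0.25, 1],
         [9, 3, 1, 6, 128, 160, 0.25, 1],
         [15, 3, 2, 6, 160, 256, 0.25, 1],
         [1, 1, 1, 1, 256, 1280, 0.0, -2]]

layers = [LayerCfg(*row) for row in SMALL]

total_stride = 2  # layer 0 (the stem conv) already downsamples by 2
for idx, cfg in enumerate(layers, start=1):
    total_stride *= cfg.stride
    print(f"layer {idx}: {cfg.repeat} x {BLOCK_NAMES[cfg.block_type]}, "
          f"{cfg.in_channels}->{cfg.out_channels}, stride {cfg.stride}")

# Blocks that take part in the stochastic depth schedule (last conv excluded).
num_blocks = sum(cfg.repeat for cfg in layers[:-1])
print("total downsampling:", total_stride, "| blocks:", num_blocks)
```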

Next, the network layers are built. The first is layer 0, an ordinary convolution layer. The code is as follows:

self.layers.append(
    ConvModule(
        in_channels=self.in_channels,
        out_channels=self.arch[0][4],
        kernel_size=3,
        stride=2,
        conv_cfg=self.conv_cfg,
        norm_cfg=self.norm_cfg,
        act_cfg=self.act_cfg))

Layers 1 to 7 are then built according to arch_settings. The code is as follows:

in_channels = self.arch[0][4]
layer_setting = self.arch[:-1]

total_num_blocks = sum([x[0] for x in layer_setting])
block_idx = 0
dpr = [
    x.item()
    for x in torch.linspace(0, self.drop_path_rate, total_num_blocks)
]  # stochastic depth decay rule

for layer_cfg in layer_setting:
    layer = []
    (repeat, kernel_size, stride, expand_ratio, _, out_channels,
     se_ratio, block_type) = layer_cfg
    for i in range(repeat):
        stride = stride if i == 0 else 1
        if block_type == -1:
            has_skip = stride == 1 and in_channels == out_channels
            droppath_rate = dpr[block_idx] if has_skip else 0.0
            layer.append(
                EnhancedConvModule(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=kernel_size,
                    has_skip=has_skip,
                    drop_path_rate=droppath_rate,
                    stride=stride,
                    padding=1,
                    conv_cfg=None,
                    norm_cfg=self.norm_cfg,
                    act_cfg=self.act_cfg))
            in_channels = out_channels
        else:
            mid_channels = int(in_channels * expand_ratio)
            se_cfg = None
            if block_type != 0 and se_ratio > 0:
                se_cfg = dict(
                    channels=mid_channels,
                    ratio=expand_ratio * (1.0 / se_ratio),
                    divisor=1,
                    act_cfg=(self.act_cfg, dict(type='Sigmoid')))
            block = FusedMBConv if block_type == 0 else MBConv
            conv_cfg = self.conv_cfg if stride == 2 else None
            layer.append(
                block(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    mid_channels=mid_channels,
                    kernel_size=kernel_size,
                    stride=stride,
                    se_cfg=se_cfg,
                    conv_cfg=conv_cfg,
                    norm_cfg=self.norm_cfg,
                    act_cfg=self.act_cfg,
                    drop_path_rate=dpr[block_idx],
                    with_cp=self.with_cp))
            in_channels = out_channels
        block_idx += 1
    self.layers.append(Sequential(*layer))

# make the last layer
self.layers.append(
    ConvModule(
        in_channels=in_channels,
        out_channels=self.out_channels,
        kernel_size=self.arch[-1][1],
        stride=self.arch[-1][2],
        conv_cfg=self.conv_cfg,
        norm_cfg=self.norm_cfg,
        act_cfg=self.act_cfg))
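The stochastic depth decay rule in the loop above assigns each block a drop-path rate that grows linearly with its position in the network. The sketch below reproduces that schedule in plain Python so it can be inspected without PyTorch. The function name and the maximum rate of 0.2 are assumptions for this example; the block counts are the repeat values of the 'small' settings.

```python
def drop_path_rates(repeats, max_rate):
    """Linearly increasing drop-path rates, grouped by stage.

    Mirrors linspace(0, max_rate, total_blocks): the first block gets 0
    and the last block gets max_rate.
    """
    total = sum(repeats)
    flat = [max_rate * (i / (total - 1)) if total > 1 else 0.0
            for i in range(total)]
    per_stage, start = [], 0
    for r in repeats:
        per_stage.append(flat[start:start + r])
        start += r
    return per_stage


# Repeat counts of layers 1 to 6 of the 'small' settings; 0.2 is a made-up rate.
rates = drop_path_rates([2, 4, 4, 6, 9, 15], max_rate=0.2)
for stage, stage_rates in enumerate(rates, start=1):
    print(f"stage {stage}: {stage_rates[0]:.3f} .. {stage_rates[-1]:.3f}")
```

Note that in the loop above a block only uses its rate when it has a residual path; an EnhancedConvModule without a skip connection gets a rate of 0.0.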

Note that in Table 4 the first layer is Fused-MBConv, whereas in the code the first layer is EnhancedConvModule, that is, an ordinary convolution with an added shortcut, as follows:

class EnhancedConvModule(ConvModule):
    """ConvModule with short-cut and droppath.

    Args:
        in_channels (int): Number of channels in the input feature map.
            Same as that in ``nn._ConvNd``.
        out_channels (int): Number of channels produced by the convolution.
            Same as that in ``nn._ConvNd``.
        kernel_size (int | tuple[int]): Size of the convolving kernel.
            Same as that in ``nn._ConvNd``.
        stride (int | tuple[int]): Stride of the convolution.
            Same as that in ``nn._ConvNd``.
        has_skip (bool): Whether there is short-cut. Defaults to False.
        drop_path_rate (float): Stochastic depth rate. Default 0.0.
        padding (int | tuple[int]): Zero-padding added to both sides of
            the input. Same as that in ``nn._ConvNd``.
        dilation (int | tuple[int]): Spacing between kernel elements.
            Same as that in ``nn._ConvNd``.
        groups (int): Number of blocked connections from input channels to
            output channels. Same as that in ``nn._ConvNd``.
        bias (bool | str): If specified as `auto`, it will be decided by the
            norm_cfg. Bias will be set as True if `norm_cfg` is None, otherwise
            False. Default: "auto".
        conv_cfg (dict): Config dict for convolution layer. Default: None,
            which means using conv2d.
        norm_cfg (dict): Config dict for normalization layer. Default: None.
        act_cfg (dict): Config dict for activation layer.
            Default: dict(type='ReLU').
        inplace (bool): Whether to use inplace mode for activation.
            Default: True.
        with_spectral_norm (bool): Whether use spectral norm in conv module.
            Default: False.
        padding_mode (str): If the `padding_mode` has not been supported by
            current `Conv2d` in PyTorch, we will use our own padding layer
            instead. Currently, we support ['zeros', 'circular'] with official
            implementation and ['reflect'] with our own implementation.
            Default: 'zeros'.
        order (tuple[str]): The order of conv/norm/activation layers. It is a
            sequence of "conv", "norm" and "act". Common examples are
            ("conv", "norm", "act") and ("act", "conv", "norm").
            Default: ('conv', 'norm', 'act').
    """

    def __init__(self, *args, has_skip=False, drop_path_rate=0, **kwargs):
        super().__init__(*args, **kwargs)
        self.has_skip = has_skip
        if self.has_skip and (self.in_channels != self.out_channels
                              or self.stride != (1, 1)):
            raise ValueError('the stride must be 1 and the `in_channels` and'
                             ' `out_channels` must be the same , when '
                             '`has_skip` is True in `EnhancedConvModule` .')
        self.drop_path = DropPath(
            drop_path_rate) if drop_path_rate else nn.Identity()

    def forward(self, x: torch.Tensor, **kwargs) -> torch.Tensor:
        short_cut = x
        x = super().forward(x, **kwargs)
        if self.has_skip:
            x = self.drop_path(x) + short_cut
        return x
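The forward pass above adds the input back to the convolution output after passing that output through DropPath. The framework-free sketch below shows the idea on plain Python lists. It is a simplified stand-in rather than the library's DropPath: it drops the whole branch for a single sample and assumes the common convention of rescaling the surviving branch by 1 / (1 - drop_prob). All names are invented for this example.

```python
import random


def drop_path(branch, drop_prob, training, rng):
    """Zero the whole branch with probability drop_prob; rescale survivors."""
    if not training or drop_prob == 0.0:
        return list(branch)
    if rng.random() < drop_prob:
        return [0.0 for _ in branch]
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob for v in branch]


def residual_forward(x, conv, drop_prob, training, rng):
    """Shortcut plus (possibly dropped) conv branch, as in has_skip=True."""
    out = conv(x)
    out = drop_path(out, drop_prob, training, rng)
    return [o + s for o, s in zip(out, x)]


def double(values):
    """Toy stand-in for the convolution branch."""
    return [2.0 * v for v in values]


rng = random.Random(0)
# Evaluation mode: the branch is always kept and never rescaled.
print(residual_forward([1.0, 2.0], double, 0.5, False, rng))
# Training mode: the result is either the bare shortcut or shortcut + rescaled branch.
print(residual_forward([1.0, 2.0], double, 0.5, True, rng))
```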

The remaining layers are built exactly according to the structure in Table 4. The code for MBConv is shown below; it corresponds to the InvertedResidual block from MobileNet.

class InvertedResidual(BaseModule):
    """Inverted Residual Block.

    Args:
        in_channels (int): The input channels of this module.
        out_channels (int): The output channels of this module.
        mid_channels (int): The input channels of the depthwise convolution.
        kernel_size (int): The kernel size of the depthwise convolution.
            Defaults to 3.
        stride (int): The stride of the depthwise convolution. Defaults to 1.
        se_cfg (dict, optional): Config dict for se layer. Defaults to None,
            which means no se layer.
        conv_cfg (dict): Config dict for convolution layer. Defaults to None,
            which means using conv2d.
        norm_cfg (dict): Config dict for normalization layer.
            Defaults to ``dict(type='BN')``.
        act_cfg (dict): Config dict for activation layer.
            Defaults to ``dict(type='ReLU')``.
        drop_path_rate (float): stochastic depth rate. Defaults to 0.
        with_cp (bool): Use checkpoint or not. Using checkpoint will save some
            memory while slowing down the training speed. Defaults to False.
        init_cfg (dict | list[dict], optional): Initialization config dict.
    """

    def __init__(self,
                 in_channels,
                 out_channels,
                 mid_channels,
                 kernel_size=3,
                 stride=1,
                 se_cfg=None,
                 conv_cfg=None,
                 norm_cfg=dict(type='BN'),
                 act_cfg=dict(type='ReLU'),
                 drop_path_rate=0.,
                 with_cp=False,
                 init_cfg=None):
        super(InvertedResidual, self).__init__(init_cfg)
        self.with_res_shortcut = (stride == 1 and in_channels == out_channels)
        assert stride in [1, 2]
        self.with_cp = with_cp
        self.drop_path = DropPath(
            drop_path_rate) if drop_path_rate > 0 else nn.Identity()
        self.with_se = se_cfg is not None
        self.with_expand_conv = (mid_channels != in_channels)

        if self.with_se:
            assert isinstance(se_cfg, dict)

        if self.with_expand_conv:
            self.expand_conv = ConvModule(
                in_channels=in_channels,
                out_channels=mid_channels,
                kernel_size=1,
                stride=1,
                padding=0,
                conv_cfg=conv_cfg,
                norm_cfg=norm_cfg,
                act_cfg=act_cfg)
        self.depthwise_conv = ConvModule(
            in_channels=mid_channels,
            out_channels=mid_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
            groups=mid_channels,
            conv_cfg=conv_cfg,
            norm_cfg=norm_cfg,
            act_cfg=act_cfg)
        if self.with_se:
            self.se = SELayer(**se_cfg)
        self.linear_conv = ConvModule(
            in_channels=mid_channels,
            out_channels=out_channels,
            kernel_size=1,
            stride=1,
            padding=0,
            conv_cfg=conv_cfg,
            norm_cfg=norm_cfg,
            act_cfg=None)

    def forward(self, x):
        """Forward function.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The output tensor.
        """

        def _inner_forward(x):
            out = x

            if self.with_expand_conv:
                out = self.expand_conv(out)

            out = self.depthwise_conv(out)

            if self.with_se:
                out = self.se(out)

            out = self.linear_conv(out)

            if self.with_res_shortcut:
                return x + self.drop_path(out)
            else:
                return out

        if self.with_cp and x.requires_grad:
            out = cp.checkpoint(_inner_forward, x)
        else:
            out = _inner_forward(x)

        return out
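The se_cfg passed to this block is built in the layer-construction loop with channels=mid_channels and ratio=expand_ratio * (1.0 / se_ratio). The sketch below works through that arithmetic for the three MBConv layers of the 'small' settings. It assumes that the SE layer squeezes to channels / ratio channels, which is the usual meaning of such a ratio but should be checked against SELayer itself; the function name is made up for this example.

```python
def se_squeeze_channels(in_channels, expand_ratio, se_ratio):
    """Return (mid_channels, SE ratio, squeezed channels) for an MBConv block."""
    mid_channels = int(in_channels * expand_ratio)
    ratio = expand_ratio * (1.0 / se_ratio)
    # Assumption: the SE bottleneck has channels / ratio channels.
    squeezed = int(mid_channels / ratio)
    return mid_channels, ratio, squeezed


# (in_channels, expand_ratio, se_ratio) of the first block of layers 4, 5, and 6.
for in_ch, expand, se in [(64, 4, 0.25), (128, 6, 0.25), (160, 6, 0.25)]:
    mid, ratio, squeezed = se_squeeze_channels(in_ch, expand, se)
    print(f"in={in_ch}: mid={mid}, ratio={ratio}, squeezed={squeezed}")
```

Under that assumption the squeezed width equals in_channels * se_ratio. In other words, se_ratio is expressed relative to the block's input channels rather than to the expanded channels, which is why the code multiplies by expand_ratio.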

The code for Fused-MBConv is shown below. It replaces the 3x3 depthwise convolution and the 1x1 expansion convolution of MBConv with a single regular 3x3 convolution.

class EdgeResidual(BaseModule):
    """Edge Residual Block.

    Args:
        in_channels (int): The input channels of this module.
        out_channels (int): The output channels of this module.
        mid_channels (int): The input channels of the second convolution.
        kernel_size (int): The kernel size of the first convolution.
            Defaults to 3.
        stride (int): The stride of the first convolution. Defaults to 1.
        se_cfg (dict, optional): Config dict for se layer. Defaults to None,
            which means no se layer.
        with_residual (bool): Use residual connection. Defaults to True.
        conv_cfg (dict, optional): Config dict for convolution layer.
            Defaults to None, which means using conv2d.
        norm_cfg (dict): Config dict for normalization layer.
            Defaults to ``dict(type='BN')``.
        act_cfg (dict): Config dict for activation layer.
            Defaults to ``dict(type='ReLU')``.
        drop_path_rate (float): stochastic depth rate. Defaults to 0.
        with_cp (bool): Use checkpoint or not. Using checkpoint will save some
            memory while slowing down the training speed. Defaults to False.
        init_cfg (dict | list[dict], optional): Initialization config dict.
    """

    def __init__(self,
                 in_channels,
                 out_channels,
                 mid_channels,
                 kernel_size=3,
                 stride=1,
                 se_cfg=None,
                 with_residual=True,
                 conv_cfg=None,
                 norm_cfg=dict(type='BN'),
                 act_cfg=dict(type='ReLU'),
                 drop_path_rate=0.,
                 with_cp=False,
                 init_cfg=None):
        super(EdgeResidual, self).__init__(init_cfg=init_cfg)
        assert stride in [1, 2]
        self.with_cp = with_cp
        self.drop_path = DropPath(
            drop_path_rate) if drop_path_rate > 0 else nn.Identity()
        self.with_se = se_cfg is not None
        self.with_residual = (
            stride == 1 and in_channels == out_channels and with_residual)

        if self.with_se:
            assert isinstance(se_cfg, dict)

        self.conv1 = ConvModule(
            in_channels=in_channels,
            out_channels=mid_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
            conv_cfg=conv_cfg,
            norm_cfg=norm_cfg,
            act_cfg=act_cfg)

        if self.with_se:
            self.se = SELayer(**se_cfg)

        self.conv2 = ConvModule(
            in_channels=mid_channels,
            out_channels=out_channels,
            kernel_size=1,
            stride=1,
            padding=0,
            conv_cfg=None,
            norm_cfg=norm_cfg,
            act_cfg=None)

    def forward(self, x):

        def _inner_forward(x):
            out = x
            out = self.conv1(out)

            if self.with_se:
                out = self.se(out)

            out = self.conv2(out)

            if self.with_residual:
                return x + self.drop_path(out)
            else:
                return out

        if self.with_cp and x.requires_grad:
            out = cp.checkpoint(_inner_forward, x)
        else:
            out = _inner_forward(x)

        return out

EfficientNet V2 (ICML 2021): Principles and Code Analysis
