I covered the principles of U-net in an earlier post; this one mainly records the code implementation.
Earlier U-net post: https://blog.csdn.net/qq_19841133/article/details/126927383

An attention-based U-net (the attention used is self-attention)
The code comes from the IDDPM project:
https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/unet.py

Contents

U-net
conv_nd
TimestepEmbedSequential: the emb-passing layer
Downsample: downsampling layer
Upsample: upsampling layer
AttentionBlock: attention layer
QKVAttention
ResBlock
Closing remarks

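The neural network in IDDPM is an attention-based U-net.

The U-net itself is familiar. Besides its two halves, the encoder and the decoder (the input and output blocks), it has a middle block built from ResBlock and MHSA (multi-head self-attention) blocks.

input block: Res (takes the input x and emb; the timestep is encoded as an embedding, and so is the condition), MHSA (pixel-to-pixel attention), Downsample
middle block: Res, MHSA, Res
output block: Res (its input is concatenated with the output of the corresponding input-block layer), MHSA, Upsample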

U-net

The first module, time_embed, transforms the embedding of the incoming timestep:

        time_embed_dim = model_channels * 4
        self.time_embed = nn.Sequential(
            linear(model_channels, time_embed_dim),
            SiLU(),
            linear(time_embed_dim, time_embed_dim),
        )

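For class-conditional generation there is also a label_emb: the embedding of the condition, which is added to the timestep embedding and fed in alongside x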

        if self.num_classes is not None:
            self.label_emb = nn.Embedding(num_classes, time_embed_dim)

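input_blocks is the left half of the U-net. All modules go into an nn.ModuleList, which is normally constructed from a list whose elements are each a Module. The first module is a TimestepEmbedSequential, so at this point the list holds a single module; the rest are appended later. That TimestepEmbedSequential wraps a conv_nd: dims=1 gives a 1-D convolution and dims=2 a 2-D one, here with kernel size 3 and padding=1. (TimestepEmbedSequential is covered below; it is just a wrapper around nn.Sequential that decides, layer by layer, whether to pass emb. conv_nd does not take it; only ResBlock does.)

        self.input_blocks = nn.ModuleList(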
            [
                TimestepEmbedSequential(
                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
                )
            ]
        )

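Next we build the left half of the U-net. The outer loop iterates over channel_mult (the channel multipliers), whose length defines how many levels the U-net has. The multipliers usually grow level by level, and the output channel count at each level is mult * model_channels, so the channels widen as we go down.

Each level contains num_res_blocks ResBlocks, which the inner loop iterates over.
ResBlock inherits from TimestepBlock, so it takes emb as an extra input; here the embedding width, time_embed_dim, is passed to its constructor.

        for level, mult in enumerate(channel_mult):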
            for _ in range(num_res_blocks):
                layers = [
                    ResBlock(
                        ch,
                        time_embed_dim,
                        dropout,
                        out_channels=mult * model_channels,
                        dims=dims,
                        use_checkpoint=use_checkpoint,
                        use_scale_shift_norm=use_scale_shift_norm,
                    )
                ]

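We are still inside the num_res_blocks loop. Next comes the AttentionBlock: ds is the current downsampling factor, and if ds is in the attention_resolutions list we append an AttentionBlock, so where attention is inserted depends entirely on attention_resolutions.

                ch = mult * model_channels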
                if ds in attention_resolutions:
                    layers.append(
                        AttentionBlock(
                            ch, use_checkpoint=use_checkpoint, num_heads=num_heads
                        )
                    )
                self.input_blocks.append(TimestepEmbedSequential(*layers))
                input_block_chans.append(ch)

Once the num_res_blocks loop is done, a downsampling layer follows (at every level except the last)

            if level != len(channel_mult) - 1:
                self.input_blocks.append(
                    TimestepEmbedSequential(Downsample(ch, conv_resample, dims=dims))
                )
                input_block_chans.append(ch)
                ds *= 2
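
To make the bookkeeping concrete, here is a small framework-free sketch that replays only the channel and ds arithmetic of the loops above. The hyperparameter values (model_channels=32, channel_mult=(1, 2, 4), num_res_blocks=2, attention_resolutions={2}) are made up for illustration, plan_input_blocks is a hypothetical helper of mine, and the starting values ch = model_channels, ds = 1, input_block_chans = [model_channels] are my assumption, since the excerpt does not show where they are initialized.

```python
def plan_input_blocks(model_channels, channel_mult, num_res_blocks, attention_resolutions):
    """Replay the channel/ds arithmetic of the input-block loops (no tensors involved)."""
    plan = [("conv", model_channels, 1)]        # the initial conv_nd block
    input_block_chans = [model_channels]        # assumed starting state
    ch, ds = model_channels, 1                  # assumed starting state
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = mult * model_channels
            kind = "res+attn" if ds in attention_resolutions else "res"
            plan.append((kind, ch, ds))
            input_block_chans.append(ch)
        if level != len(channel_mult) - 1:      # no Downsample after the last level
            plan.append(("down", ch, ds))
            input_block_chans.append(ch)
            ds *= 2
    return plan, input_block_chans, ds

plan, chans, final_ds = plan_input_blocks(32, (1, 2, 4), 2, {2})
for kind, ch, ds in plan:
    print(f"{kind:9s} out_channels={ch:4d} ds={ds}")
print(chans)
```

With these made-up values the recorded widths come out as [32, 32, 32, 32, 64, 64, 64, 128, 128], and attention lands only on the two ResBlocks at ds=2.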

The middle of the U-net consists of two ResBlocks with an AttentionBlock between them.
Neither the channel count nor the spatial size changes here.

        self.middle_block = TimestepEmbedSequential(
            ResBlock(
                ch,
                time_embed_dim,
                dropout,
                dims=dims,
                use_checkpoint=use_checkpoint,
                use_scale_shift_norm=use_scale_shift_norm,
            ),
            AttentionBlock(ch, use_checkpoint=use_checkpoint, num_heads=num_heads),
            ResBlock(
                ch,
                time_embed_dim,
                dropout,
                dims=dims,
                use_checkpoint=use_checkpoint,
                use_scale_shift_norm=use_scale_shift_norm,
            ),
        )

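The right half of the U-net mirrors the left; it is built from ResBlock, AttentionBlock, and Upsample.

Note that each ResBlock's input channel count is ch + input_block_chans.pop(): the left and right halves are joined by channel-wise concatenation, so the input width is the sum of the two.

        self.output_blocks = nn.ModuleList([])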
        for level, mult in list(enumerate(channel_mult))[::-1]:
            for i in range(num_res_blocks + 1):
                layers = [
                    ResBlock(
                        ch + input_block_chans.pop(),
                        time_embed_dim,
                        dropout,
                        out_channels=model_channels * mult,
                        dims=dims,
                        use_checkpoint=use_checkpoint,
                        use_scale_shift_norm=use_scale_shift_norm,
                    )
                ]
                ch = model_channels * mult
                if ds in attention_resolutions:
                    layers.append(
                        AttentionBlock(
                            ch,
                            use_checkpoint=use_checkpoint,
                            num_heads=num_heads_upsample,
                        )
                    )
                if level and i == num_res_blocks:
                    layers.append(Upsample(ch, conv_resample, dims=dims))
                    ds //= 2
                self.output_blocks.append(TimestepEmbedSequential(*layers))
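
Why num_res_blocks + 1 on this side? A quick way to convince yourself is to replay the stack arithmetic without any tensors. The sketch below is a hypothetical helper (plan_unet_channels) with made-up hyperparameters; it assumes the input side starts from ch = model_channels with one entry already on the stack for the initial conv, which the excerpt does not show.

```python
def plan_unet_channels(model_channels, channel_mult, num_res_blocks):
    """Replay the skip-connection channel arithmetic of both halves (no tensors)."""
    # Input side: every block pushes its output width onto the stack.
    skip_chans = [model_channels]               # assumed: entry for the initial conv
    ch = model_channels
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = mult * model_channels
            skip_chans.append(ch)
        if level != len(channel_mult) - 1:
            skip_chans.append(ch)               # the Downsample block
    # Output side: every ResBlock pops one width and concatenates it with ch.
    res_in_out = []
    for level, mult in reversed(list(enumerate(channel_mult))):
        for i in range(num_res_blocks + 1):
            in_ch = ch + skip_chans.pop()
            ch = model_channels * mult
            upsample = bool(level) and i == num_res_blocks
            res_in_out.append((in_ch, ch, upsample))
    return res_in_out, skip_chans

blocks, leftover = plan_unet_channels(32, (1, 2, 4), 2)
for in_ch, out_ch, up in blocks:
    print(f"ResBlock {in_ch:4d} -> {out_ch:4d}" + ("  + Upsample" if up else ""))
print("left on the stack:", leftover)
```

Each level on the input side pushes num_res_blocks + 1 entries (the ResBlocks plus either the initial conv or a Downsample), so popping num_res_blocks + 1 per level on the way up empties the stack exactly.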

Last is the output module: a final normalization, activation, and convolution that produces the output

        self.out = nn.Sequential(
            normalization(ch),
            SiLU(),
            zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),
        )

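The forward function takes x, timesteps, and y; y holds the class labels for conditional generation.

timesteps goes through timestep_embedding to get its emb representation. A sinusoidal (sine/cosine) embedding is used here; all that matters is that different timesteps get distinguishable representations.

y is also turned into an embedding (via label_emb) and added to emb.

Then we iterate over input_blocks, each module taking the previous module's output; next comes middle_block, and finally output_blocks. The hs list exists because the input-block outputs must be saved for the output blocks to consume.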

    def forward(self, x, timesteps, y=None):
        """
        Apply the model to an input batch.

        :param x: an [N x C x ...] Tensor of inputs.
        :param timesteps: a 1-D batch of timesteps.
        :param y: an [N] Tensor of labels, if class-conditional.
        :return: an [N x C x ...] Tensor of outputs.
        """
        assert (y is not None) == (
            self.num_classes is not None
        ), "must specify y if and only if the model is class-conditional"

        hs = []
        emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))

        if self.num_classes is not None:
            assert y.shape == (x.shape[0],)
            emb = emb + self.label_emb(y)

        h = x.type(self.inner_dtype)
        for module in self.input_blocks:
            h = module(h, emb)
            hs.append(h)
        h = self.middle_block(h, emb)
        for module in self.output_blocks:
            cat_in = th.cat([h, hs.pop()], dim=1)
            h = module(cat_in, emb)
        h = h.type(x.dtype)
        return self.out(h)
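
The excerpt calls timestep_embedding without showing it, so here is a minimal pure-Python sketch of a sinusoidal embedding for a single timestep. The layout (cosines first, then sines) and max_period=10000 are my assumptions about the usual convention, not something confirmed by the code above; sinusoidal_embedding is a hypothetical name.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000):
    """Embed a scalar timestep as cos/sin features at geometrically spaced frequencies."""
    half = dim // 2
    # Frequencies fall from 1 towards 1/max_period as i grows.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

e0 = sinusoidal_embedding(0, 8)
e10 = sinusoidal_embedding(10, 8)
e11 = sinusoidal_embedding(11, 8)
print(e0)
print([round(a - b, 4) for a, b in zip(e10, e11)])
```

Neighbouring timesteps differ mostly in the high-frequency components, while the low-frequency ones change slowly; that is what lets the later linear layers tell timesteps apart.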

conv_nd

This is just a thin wrapper around nn.Conv1d / nn.Conv2d / nn.Conv3d

def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")

TimestepEmbedSequential: the emb-passing layer

TimestepEmbedSequential passes emb, in addition to x, to those children that are subclasses of TimestepBlock

class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
    """
    A sequential module that passes timestep embeddings to the children that
    support it as an extra input.
    """

    def forward(self, x, emb):
        for layer in self:
            if isinstance(layer, TimestepBlock):
                x = layer(x, emb)
            else:
                x = layer(x)
        return x

So which classes subclass TimestepBlock? Only ResBlock does, which means only ResBlock receives emb; the upsampling and downsampling layers do not use it.
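
The dispatch is easy to see in isolation. Below is a framework-free sketch with stand-in classes of my own (EmbSequential, AddEmb, Double); they only mimic the isinstance check and are not the PyTorch modules from the project.

```python
class TimestepBlock:
    """Marker base class: subclasses are called with (x, emb) instead of just x."""

class EmbSequential(list):
    """Stand-in for TimestepEmbedSequential: passes emb only where it is accepted."""
    def __call__(self, x, emb):
        for layer in self:
            x = layer(x, emb) if isinstance(layer, TimestepBlock) else layer(x)
        return x

class AddEmb(TimestepBlock):    # plays the role of ResBlock
    def __call__(self, x, emb):
        return x + emb

class Double:                   # plays the role of a conv or Downsample layer
    def __call__(self, x):
        return 2 * x

block = EmbSequential([Double(), AddEmb(), Double()])
result = block(3, emb=10)
print(result)  # 3 -> 6 -> 16 -> 32
```

The caller always passes emb; the container decides which children actually see it, so the U-net's forward can treat every block the same way.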

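Downsample: downsampling layer

The downsampling layer simply calls self.op, which is either a strided convolution or plain average pooling. For 2-D images stride=2 (for 3-D, stride=(1, 2, 2)); a stride of 2 halves the spatial size, i.e. h, w become h/2, w/2.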

class Downsample(nn.Module):
    """
    A downsampling layer with an optional convolution.

    :param channels: channels in the inputs and outputs.
    :param use_conv: a bool determining if a convolution is applied.
    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
                 downsampling occurs in the inner-two dimensions.
    """

    def __init__(self, channels, use_conv, dims=2):
        super().__init__()
        self.channels = channels
        self.use_conv = use_conv
        self.dims = dims
        stride = 2 if dims != 3 else (1, 2, 2)
        if use_conv:
            self.op = conv_nd(dims, channels, channels, 3, stride=stride, padding=1)
        else:
            self.op = avg_pool_nd(stride)

    def forward(self, x):
        assert x.shape[1] == self.channels
        return self.op(x)
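
The halving follows from the usual output-size formula for a convolution or pooling window, out = floor((n + 2·padding − kernel) / stride) + 1 (dilation 1). A small arithmetic sketch; conv_out_size is a hypothetical helper, and the pooling row assumes a 2-wide window with stride 2, which the excerpt does not spell out.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

sizes = {}
for n in (64, 7):
    same = conv_out_size(n, kernel=3, stride=1, padding=1)       # conv_nd(..., 3, padding=1)
    down_conv = conv_out_size(n, kernel=3, stride=2, padding=1)  # Downsample, use_conv=True
    down_pool = conv_out_size(n, kernel=2, stride=2, padding=0)  # assumed 2-wide average pool
    sizes[n] = (same, down_conv, down_pool)
    print(n, same, down_conv, down_pool)
```

For even sizes both variants give exactly n/2. For odd sizes they disagree (7 becomes 4 with the strided conv but 3 with the assumed pooling), which is one reason to keep the image size a multiple of 2 to the power of the number of downsampling steps.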

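Upsample: upsampling layer

Nearest-neighbor interpolation (F.interpolate) doubles the spatial size h, w; if use_conv is set, an ordinary 3×3 convolution that keeps the channel count follows (it is not a transposed convolution)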

class Upsample(nn.Module):
    """
    An upsampling layer with an optional convolution.

    :param channels: channels in the inputs and outputs.
    :param use_conv: a bool determining if a convolution is applied.
    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
                 upsampling occurs in the inner-two dimensions.
    """

    def __init__(self, channels, use_conv, dims=2):
        super().__init__()
        self.channels = channels
        self.use_conv = use_conv
        self.dims = dims
        if use_conv:
            self.conv = conv_nd(dims, channels, channels, 3, padding=1)

    def forward(self, x):
        assert x.shape[1] == self.channels
        if self.dims == 3:
            x = F.interpolate(
                x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
            )
        else:
            x = F.interpolate(x, scale_factor=2, mode="nearest")
        if self.use_conv:
            x = self.conv(x)
        return x
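
For intuition, nearest-neighbor upsampling by a factor of 2 just repeats every pixel twice along each axis. A tiny pure-Python sketch on a nested list (upsample_nearest_2x is a hypothetical helper, not the F.interpolate call itself):

```python
def upsample_nearest_2x(image):
    """Double height and width of a 2-D grid (list of rows) by repeating each pixel."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(2)]  # repeat along the width
        out.append(wide)
        out.append(list(wide))                             # repeat along the height
    return out

up = upsample_nearest_2x([[1, 2], [3, 4]])
for row in up:
    print(row)
```

Because the interpolation has no parameters, the optional convolution afterwards is the only learnable part of the layer.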

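AttentionBlock: attention layer

Go straight to _forward. First x is flattened to 3-D, [batch_size, channels, -1]; it is normalized with norm and passed through the qkv convolution, which gives q, k, and v stacked along the channel axis.
qkv is then reshaped to [batch_size × num_heads, -1, qkv.shape[2]]: the heads are folded into the batch axis, the middle axis holds each head's q, k, and v channels (3 × channels / num_heads), and qkv.shape[2] is the sequence length T, i.e. the number of spatial positions.
qkv is fed to the QKVAttention class to get h, the result after attention. h is reshaped back, passed through the projection layer, and added to x, so this is attention with a residual connection.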

class AttentionBlock(nn.Module):
    """
    An attention block that allows spatial positions to attend to each other.

    Originally ported from here, but adapted to the N-d case.
    https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
    """

    def __init__(self, channels, num_heads=1, use_checkpoint=False):
        super().__init__()
        self.channels = channels
        self.num_heads = num_heads
        self.use_checkpoint = use_checkpoint

        self.norm = normalization(channels)
        self.qkv = conv_nd(1, channels, channels * 3, 1)
        self.attention = QKVAttention()
        self.proj_out = zero_module(conv_nd(1, channels, channels, 1))

    def forward(self, x):
        return checkpoint(self._forward, (x,), self.parameters(), self.use_checkpoint)

    def _forward(self, x):
        b, c, *spatial = x.shape
        x = x.reshape(b, c, -1)
        qkv = self.qkv(self.norm(x))
        qkv = qkv.reshape(b * self.num_heads, -1, qkv.shape[2])
        h = self.attention(qkv)
        h = h.reshape(b, -1, h.shape[-1])
        h = self.proj_out(h)
        return (x + h).reshape(b, c, *spatial)
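
The reshapes are the easiest part to get lost in, so here is a shape-only trace of _forward together with QKVAttention.forward. attention_shapes is a hypothetical helper and the sizes in the call are made up; it assumes channels is divisible by num_heads.

```python
def attention_shapes(b, c, spatial, num_heads):
    """Trace tensor shapes through AttentionBlock._forward and QKVAttention.forward."""
    t = 1
    for s in spatial:
        t *= s                                   # T = number of spatial positions
    head_ch = 3 * c // num_heads // 3            # `ch` inside QKVAttention
    return {
        "x flattened": (b, c, t),
        "qkv conv output": (b, 3 * c, t),
        "qkv per head": (b * num_heads, 3 * c // num_heads, t),
        "attention weights": (b * num_heads, t, t),
        "attention output": (b * num_heads, head_ch, t),
        "heads merged": (b, c, t),
    }

shapes = attention_shapes(b=4, c=64, spatial=(16, 16), num_heads=4)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

The attention-weight tensor grows with T squared, which is why attention is only inserted at the coarser resolutions listed in attention_resolutions.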

QKVAttention

This is just standard scaled dot-product attention

class QKVAttention(nn.Module):
    """
    A module which performs QKV attention.
    """

    def forward(self, qkv):
        """
        Apply QKV attention.

        :param qkv: an [N x (C * 3) x T] tensor of Qs, Ks, and Vs.
        :return: an [N x C x T] tensor after attention.
        """
        ch = qkv.shape[1] // 3
        q, k, v = th.split(qkv, ch, dim=1)
        scale = 1 / math.sqrt(math.sqrt(ch))
        weight = th.einsum(
            "bct,bcs->bts", q * scale, k * scale
        )  # More stable with f16 than dividing afterwards
        weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
        return th.einsum("bts,bcs->bct", weight, v)

    @staticmethod
    def count_flops(model, _x, y):
        """
        A counter for the `thop` package to count the operations in an
        attention operation.

        Meant to be used like:

            macs, params = thop.profile(
                model,
                inputs=(inputs, timestamps),
                custom_ops={QKVAttention: QKVAttention.count_flops},
            )

        """
        b, c, *spatial = y[0].shape
        num_spatial = int(np.prod(spatial))
        # We perform two matmuls with the same number of ops.
        # The first computes the weight matrix, the second computes
        # the combination of the value vectors.
        matmul_ops = 2 * b * (num_spatial ** 2) * c
        model.total_ops += th.DoubleTensor([matmul_ops])
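
One detail in forward is worth spelling out: q and k are each scaled by ch^(-1/4), which multiplies their product by ch^(-1/2), the usual 1/sqrt(d) factor. The sketch below checks that and re-implements the two einsums for a single batch item on plain nested lists; qkv_attention is my own illustrative function, not the module above.

```python
import math

def qkv_attention(q, k, v):
    """Attention on [C x T] nested lists, following the two einsums for one batch item."""
    ch, t = len(q), len(q[0])
    scale = 1 / math.sqrt(math.sqrt(ch))         # ch ** -0.25, applied to q and to k
    out = [[0.0] * t for _ in range(ch)]
    for i in range(t):                           # query position
        logits = [sum(q[c][i] * scale * k[c][j] * scale for c in range(ch))
                  for j in range(t)]             # "bct,bcs->bts"
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        weights = [e / sum(exps) for e in exps]  # softmax over key positions
        for c in range(ch):                      # "bts,bcs->bct"
            out[c][i] = sum(weights[j] * v[c][j] for j in range(t))
    return out

combined = (1 / math.sqrt(math.sqrt(16))) ** 2   # scaling q and k separately ...
print(combined, 1 / math.sqrt(16))               # ... equals dividing q.k by sqrt(ch)

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 20.0], [30.0, 40.0]]
out = qkv_attention(q, k, v)
print(out)
```

Each output column is a weighted average of the value columns, with more weight on the position whose key matches the query. Splitting the scale across q and k keeps the intermediate products smaller, which is what the f16 comment in the code refers to.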

ResBlock

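ResBlock has several parts: in_layers, emb_layers, out_layers, and a skip_connection. If the input and output channel counts match, the skip connection is the identity; if they differ, either a size-preserving 3×3 convolution or a 1×1 convolution changes the channel dimension.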

        self.in_layers = nn.Sequential(
            normalization(channels),
            SiLU(),
            conv_nd(dims, channels, self.out_channels, 3, padding=1),
        )
        self.emb_layers = nn.Sequential(
            SiLU(),
            linear(
                emb_channels,
                2 * self.out_channels if use_scale_shift_norm else self.out_channels,
            ),
        )
        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
            ),
        )

        if self.out_channels == channels:
            self.skip_connection = nn.Identity()
        elif use_conv:
            self.skip_connection = conv_nd(
                dims, channels, self.out_channels, 3, padding=1
            )
        else:
            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)

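The forward function takes x and emb. x goes through in_layers to give h, and emb goes through emb_layers to give emb_out. Without use_scale_shift_norm, h + emb_out is fed to out_layers; with it, emb_out is split into scale and shift, which modulate the normalized h before the rest of out_layers runs. Finally skip_connection(x) is added to h, so overall this is a residual connection between x and h.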

    def _forward(self, x, emb):
        h = self.in_layers(x)
        emb_out = self.emb_layers(emb).type(h.dtype)
        while len(emb_out.shape) < len(h.shape):
            emb_out = emb_out[..., None]
        if self.use_scale_shift_norm:
            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
            scale, shift = th.chunk(emb_out, 2, dim=1)
            h = out_norm(h) * (1 + scale) + shift
            h = out_rest(h)
        else:
            h = h + emb_out
            h = self.out_layers(h)
        return self.skip_connection(x) + h
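
The use_scale_shift_norm branch is the less obvious one, so here is a toy version on a plain list standing in for the values of one channel. normalize is a simplified stand-in of mine for the normalization layer (whose definition the excerpt does not show), and scale_shift is a hypothetical helper mirroring out_norm(h) * (1 + scale) + shift.

```python
import math

def normalize(h, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list (simplified stand-in for the norm layer)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def scale_shift(h, scale, shift):
    """The use_scale_shift_norm=True branch: norm(h) * (1 + scale) + shift."""
    return [x * (1 + scale) + shift for x in normalize(h)]

h = [1.0, 2.0, 3.0, 4.0]
plain = scale_shift(h, scale=0.0, shift=0.0)      # reduces to plain normalization
modulated = scale_shift(h, scale=1.0, shift=0.5)  # twice the spread, mean moved to 0.5
print([round(x, 3) for x in plain])
print([round(x, 3) for x in modulated])
```

Writing the factor as 1 + scale means that scale = 0 and shift = 0 leave the normalized activations untouched, so the timestep can stretch and move them rather than only adding an offset.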

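Closing remarks

The idea behind the implementation is actually simple; it only looks somewhat complicated once it is written as modules. That is exactly where there is a lot to learn, and it is worth re-implementing this along the same lines when time allows…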

(The Road to Advanced PyTorch) Implementing an Attention-based U-net