d2l study notes, Chapter 6: Building networks

x.1 layer and block

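A network model is built from layers and blocks. Several layers grouped together form a block. Both layers and blocks are subclasses of nn.Module, and net(x) actually invokes net.__call__(x), which in turn runs forward.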
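To make the call chain concrete, here is a minimal sketch. The Doubler class is a hypothetical toy block, not something from d2l; it only exists to show that net(X), net.__call__(X) and net.forward(X) give the same result, and that prebuilt layers and containers are nn.Module subclasses.

```python
import torch
from torch import nn


class Doubler(nn.Module):  # hypothetical toy block, for illustration only
    def forward(self, X):
        return 2 * X


net = Doubler()
X = torch.ones(2, 3)

out_call = net(X)             # the usual way to run a module
out_dunder = net.__call__(X)  # what net(X) resolves to
out_forward = net.forward(X)  # the method __call__ ends up running

print(torch.equal(out_call, out_dunder), torch.equal(out_call, out_forward))

# A single layer and a container of layers are both nn.Module instances
print(isinstance(nn.Linear(3, 1), nn.Module), isinstance(nn.Sequential(), nn.Module))
```

Calling net(X) rather than net.forward(X) is still the preferred habit, because __call__ also runs any registered hooks.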

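Whether you write a custom block or a custom layer, you need to define the __init__ and forward methods. nn.Sequential stacks several layers in order; for how it differs from nn.ModuleList, see the note with examples at https://blog.csdn.net/qq_43369406/article/details/129998217. torch.nn.Module.add_module() registers a child module, and torch.nn.Module.children() iterates over the child modules registered on a module.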

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)  # each child needs a unique string name, just like parameter names

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X
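A short sketch of add_module() and children() on a stock container may help here. The child names "hidden", "act" and "out" are arbitrary choices for this example. It also shows the practical difference from nn.ModuleList, which stores modules but defines no forward of its own.

```python
import torch
from torch import nn

net = nn.Sequential()
net.add_module("hidden", nn.Linear(4, 8))
net.add_module("act", nn.ReLU())
net.add_module("out", nn.Linear(8, 1))

names = [name for name, _ in net.named_children()]
print(names)  # ['hidden', 'act', 'out']

# Registering under an existing name replaces that child instead of adding one
net.add_module("act", nn.Tanh())
children = list(net.children())
print(len(children), type(net.act).__name__)

X = torch.rand(2, 4)
print(net(X).shape)

# nn.ModuleList only holds modules; calling it like a network fails
layers = nn.ModuleList([nn.Linear(4, 8), nn.ReLU()])
try:
    layers(X)
    module_list_callable = True
except NotImplementedError:
    module_list_callable = False
print(module_list_callable)
```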

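Note that if you define a weight without going through the nn module, for example without nn.Parameter() or nn.Linear(), its requires_grad attribute defaults to False. **The weight is then not a model parameter and is never updated by backpropagation: when partial derivatives are taken it does not appear as a variable and is always treated as a constant in the chain rule.** See the note on torch.tensor vs. torch.nn.Parameter: https://blog.csdn.net/qq_43369406/article/details/131234557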

        self.rand_weight = torch.rand((20, 20))	# requires_grad == False
        self.linear = nn.LazyLinear(20)			# its parameters have requires_grad == True
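The difference can be checked directly. The sketch below uses a hypothetical Mixed module (the name and the layer sizes are made up for this example) that holds one plain tensor and one nn.Parameter, then inspects what named_parameters() reports and which tensors receive a gradient.

```python
import torch
from torch import nn


class Mixed(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.rand_weight = torch.rand((20, 20))           # plain tensor: a constant
        self.weight = nn.Parameter(torch.rand((20, 20)))  # registered model parameter
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        return self.linear(X @ self.rand_weight @ self.weight)


net = Mixed()
param_names = [name for name, _ in net.named_parameters()]
print(net.rand_weight.requires_grad)  # False
print(param_names)                    # ['weight', 'linear.weight', 'linear.bias']

net(torch.rand(2, 20)).sum().backward()
print(net.rand_weight.grad is None)   # True: no gradient is tracked for the constant
print(net.weight.grad.shape)          # torch.Size([20, 20])
```

Since rand_weight never shows up in named_parameters(), an optimizer built from net.parameters() will not touch it either.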

You can even skip prebuilt layers such as nn.Linear and write a custom layer, as below; see again the note on torch.tensor vs. torch.nn.Parameter: https://blog.csdn.net/qq_43369406/article/details/131234557

class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
        # Use the Parameters themselves (not .data) so autograd can reach weight and bias
        linear = torch.matmul(X, self.weight) + self.bias
        return F.relu(linear)

x.2 parameters

First, model parameters are usually split into weights and biases, and every model parameter is of type torch.nn.parameter.Parameter.

net[0].weight gives the weight parameter (a Parameter is a compound object that holds the value, the gradient and extra information), net[0].weight.data gives its value, and net[0].weight.grad gives its gradient (which is None until loss.backward() has run).
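A minimal sketch of these three accessors, assuming a small network built from nn.Linear layers with arbitrarily chosen sizes and a dummy loss (the sum of the outputs):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
w = net[0].weight

print(type(w))          # the Parameter object itself
print(w.data.shape)     # its value, a plain tensor
grad_before = w.grad
print(grad_before)      # None: backward() has not run yet

loss = net(torch.rand(2, 4)).sum()  # dummy loss, for illustration only
loss.backward()
print(w.grad.shape)     # the gradient now has the same shape as the weight
```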

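We can also read the parameters as key-value pairs with net[0].state_dict(). net.named_parameters() returns a generator over (name, param) pairs, and next(net.named_parameters()) yields the first pair. Note that torch.save(model.state_dict(), path) stores only the parameter values (weights and biases), not the gradients. torch.save(model) pickles the whole module object instead of just its state_dict, but as far as I know a tensor's .grad is not serialized that way either, so save the param.grad tensors explicitly if you need them. Example: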

net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4)) # 2 is the number of samples; a Linear layer only changes the size of the last dimension.
Y = net(X)

# type
print(type(net.state_dict()))
print(type(next(net.named_parameters())))
print(type(net.named_parameters()))

# value
print(net.state_dict())
print([(name, param.shape) for name, param in net.named_parameters()])
print(net.state_dict()['2.bias'].data)

'''
output is ::

<class 'collections.OrderedDict'>
<class 'tuple'>
<class 'generator'>
OrderedDict([('0.weight', tensor([[ 0.0387,  0.0717,  0.0398,  0.3735],
        [ 0.1486, -0.3150, -0.3813,  0.3046],
        [ 0.3214, -0.0327,  0.3165, -0.3807],
        [-0.4867, -0.4954,  0.3559,  0.1104],
        [ 0.0462, -0.2147,  0.3825,  0.1375],
        [ 0.3976, -0.2493,  0.1906, -0.4750],
        [ 0.1445,  0.0632, -0.1869, -0.0446],
        [-0.2179,  0.1317, -0.4357,  0.1538]])), ('0.bias', tensor([ 0.4576, -0.0437, -0.0475,  0.0576, -0.1058,  0.4215, -0.1188,  0.1157])), ('2.weight', tensor([[ 0.2158,  0.0116,  0.0566,  0.3138, -0.0825, -0.1470, -0.1105, -0.2869]])), ('2.bias', tensor([-0.2899]))])
[('0.weight', torch.Size([8, 4])), ('0.bias', torch.Size([8])), ('2.weight', torch.Size([1, 8])), ('2.bias', torch.Size([1]))]
tensor([-0.2899])
'''

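Each item produced by net.named_parameters() is a tuple of the form (parameter name, Parameter), and the Parameter itself carries the value, the gradient and extra information. Whether it is a weight or a bias, the entry looks like the tuples below. Parameter names are unique, just as layer names are unique when registered with nn.Module.add_module(layer name, module). Because each tuple holds a tensor, this is also how we can later check which device a model lives on: read the tensor's Tensor.device attribute.

[(parameter name, tensor())]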
[('weight', tensor([[ 0.2336, -0.2090,  0.0561,  0.1469,  0.1116, -0.0786,  0.1960,  0.0033]])), ('bias', tensor([-0.3357]))]
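The tuple structure and the device lookup can be sketched as follows. The network sizes are arbitrary, and the example runs on the CPU so that it does not assume a GPU is present.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

first = next(net.named_parameters())
name, param = first
print(type(first), name, type(param))

# The tensor inside the tuple tells us where the model lives
print(param.device)

# state_dict() exposes the same values under the same names,
# but as detached tensors rather than Parameters
state = net.state_dict()
keys = list(state.keys())
print(keys)
print(state["0.weight"].requires_grad)  # False
```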

x.2 init parameters

For weight initialization, see the note on PyTorch weight/parameter initialization: https://blog.csdn.net/qq_43369406/article/details/131342983

x.2.1 Initialization with built-in methods

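Model initialization means assigning values to the parameters before training. It is done by combining model.apply(func) with the functions in nn.init, such as nn.init.normal_. Calling apply on the whole network recursively applies the same function to every child (children here means submodules that inherit from nn.Module); calling apply on a single layer initializes only that layer. The built-in initializers are listed below: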

"""

6.3.1. Built-in Initialization
using built-in func to init.

- `nn.init.normal_(module.weight, mean=0, std=0.01)`
- `nn.init.zeros_(module.bias)`
- `nn.init.constant_(module.weight, 1)`
- `nn.init.zeros_(module.bias)`
- `nn.init.xavier_uniform_(module.weight)`
- `nn.init.kaiming_uniform_(module.weight)` # the scheme nn.Linear uses by default; its nonlinearity argument defaults to 'leaky_relu'
- `nn.init.uniform_(module.weight, -10, 10)`
"""
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
print(net[0].weight.data[0]) 
print(net[0].bias.data[0])


def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
print(net[0].weight.data[0]); print(net[0].bias.data[0])

def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier); net[2].apply(init_42)
print(net[0].weight.data[0]); print(net[2].weight.data)
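To see the recursion of apply in isolation, the sketch below swaps the initializer for a hypothetical record function that only logs which modules it is handed. The nested network is made up for this example.

```python
from torch import nn

visited = []


def record(module):  # hypothetical stand-in for an init function
    visited.append(type(module).__name__)


net = nn.Sequential(
    nn.Sequential(nn.Linear(4, 8), nn.ReLU()),
    nn.Linear(8, 1),
)

net.apply(record)
print(visited)  # ['Linear', 'ReLU', 'Sequential', 'Linear', 'Sequential']

# Applying to a single layer touches only that layer
visited_single = []
net[1].apply(lambda m: visited_single.append(type(m).__name__))
print(visited_single)  # ['Linear']
```

Children are visited before their parent, which is why the containers appear after the layers they hold. It also explains the `type(module) == nn.Linear` guards in the init functions: apply hands over every module, including ReLU and the containers.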

x.2.2 Initialization with a custom function

def _weights_init(m):
    """
    intro:
        weights init.
        finish these:
            - torch.nn.Linear
    >>> version 1.0.0
    if type(m) == nn.Linear:
        print("Init", *[(name, param.shape) for name, param in m.named_parameters()][0])    # linear - param - weight
        nn.init.trunc_normal_(m.weight, std=.01)
        if m.bias is not None:
            print("Init", *[(name, param.shape) for name, param in m.named_parameters()][1])    # linear - param - bias
            nn.init.zeros_(m.bias)
    
    args:
        :param nn.Module m: the module whose parameters are initialized
    """
    classname = m.__class__.__name__

    if type(m) == nn.Linear:
        print("Init", *[(name, param.shape) for name, param in m.named_parameters()][0])    # linear - param - weight
        nn.init.trunc_normal_(m.weight, std=.01)
        if m.bias is not None:
            print("Init", *[(name, param.shape) for name, param in m.named_parameters()][1])    # linear - param - bias
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LayerNorm):
        nn.init.zeros_(m.bias)
        nn.init.ones_(m.weight)
    elif classname.startswith('Conv'):
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)
         
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X)  # run one forward pass first so the lazy layers are materialized as nn.Linear
net.apply(_weights_init)

x.3 Lazy init

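Every layer with a Lazy prefix, such as nn.LazyLinear, uses lazy (deferred) initialization: the parameter shapes are only determined once a tensor has been passed through the model.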
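A small sketch of what deferred initialization looks like in practice, assuming an input with 4 features (an arbitrary choice). Before the first forward pass the weight is an uninitialized placeholder; afterwards its shape has been inferred from the input.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

layer = nn.LazyLinear(8)

# Before any data is seen, the weight has no shape yet
was_lazy = isinstance(layer.weight, UninitializedParameter)
print(was_lazy)  # True

# The first forward pass infers in_features from the last input dimension
layer(torch.rand(2, 4))
print(type(layer).__name__)
print(layer.in_features, layer.weight.shape)
```

This is why an init function that filters on nn.Linear should run only after one forward pass through the lazy layers.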

x.4 Interacting with external storage / saving and loading

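When the model and tensors are on the CPU, the CPU works together with main memory; when they are on a GPU, that GPU works together with its own video memory. Both kinds of memory lose their contents when power is cut, so to keep data permanently it must be written to external storage. PyTorch wraps writing and reading in torch.save and torch.load.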

x.4.1 Saving and loading tensors

x = torch.arange(4)
torch.save(x, 'x-file')

x2 = torch.load('x-file')
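The same two calls also handle containers of tensors. The sketch below is a round trip through a temporary directory (used only so the example leaves no files behind); the file name and dictionary keys are arbitrary.

```python
import os
import tempfile

import torch

x = torch.arange(4)
y = torch.zeros(4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xy-file")
    # A dict (or list) of tensors can be saved in one call
    torch.save({"x": x, "y": y}, path)
    loaded = torch.load(path)

print(loaded["x"], loaded["y"])
print(torch.equal(loaded["x"], x), torch.equal(loaded["y"], y))
```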

x.4.2 Saving and loading models

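This relies on the nn.Module.state_dict() and nn.Module.load_state_dict() methods. Because only the parameters are stored, not the architecture, we must instantiate the model first and then load the parameters into it.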

torch.save(net.state_dict(), 'mlp.params')

net_clone = MLP()
net_clone.load_state_dict(torch.load('mlp.params'))
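A self-contained version of this round trip might look like the following. The MLP class here is a hypothetical stand-in with made-up layer sizes, since the snippet above does not show its definition; the temporary directory is only there to keep the example free of side effects.

```python
import os
import tempfile

import torch
from torch import nn
from torch.nn import functional as F


class MLP(nn.Module):  # hypothetical stand-in for the MLP used above
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.output = nn.Linear(256, 10)

    def forward(self, X):
        return self.output(F.relu(self.hidden(X)))


net = MLP()
X = torch.randn(2, 20)
Y = net(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlp.params")
    torch.save(net.state_dict(), path)

    # The file holds parameters only, so the architecture is rebuilt first
    net_clone = MLP()
    net_clone.load_state_dict(torch.load(path))

net_clone.eval()
same = torch.allclose(net_clone(X), Y)
print(same)  # the clone reproduces the original outputs
```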

When loading, you can choose which device to load onto, for example:

checkpoint = torch.load(load_path, map_location="cpu")
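As a hedged illustration of map_location, the sketch below saves a tensor from whichever device is available and forces it onto the CPU when loading. The path and the source_device variable are introduced just for this example.

```python
import os
import tempfile

import torch

# Save from the GPU when there is one, otherwise from the CPU
source_device = "cuda:0" if torch.cuda.is_available() else "cpu"
x = torch.arange(4, device=source_device)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x-file")
    torch.save(x, path)
    # map_location remaps the stored tensors while they are being loaded
    loaded = torch.load(path, map_location="cpu")

print(loaded.device)
```

Loading with map_location="cpu" is a common way to open a checkpoint trained on a GPU on a machine that has none.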

x.5 Devices

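Devices are either GPUs or CPUs. The snippets below show how to place a tensor or a model on a device, move it to another device, and check which device it is on. See also the note on common GPU device operations (specifying a cuda device, checking where a model lives): https://blog.csdn.net/qq_43369406/article/details/129816988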

# tensor
X = torch.zeros((2, 3), device=torch.device("cuda:1"))
X = X.to(device=torch.device("cuda:0"))
print(X.device)

# model
net = nn.Sequential(nn.LazyLinear(1))
net = net.to(device=torch.device("cuda:0"))
print(net[0].weight.data.device)

List the available devices:

print(torch.device("cpu"))
print(torch.device("cuda:1"))
print(torch.cuda.is_available())
print(torch.cuda.device_count())
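Hard-coded names such as "cuda:1" fail on machines with fewer GPUs. One possible way around that is a small fallback helper; try_gpu below is a helper defined here for illustration (not a torch function) that returns the requested GPU if it exists and the CPU otherwise.

```python
import torch
from torch import nn


def try_gpu(i=0):
    """Return cuda:i if that GPU exists, otherwise the CPU (illustrative helper)."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f"cuda:{i}")
    return torch.device("cpu")


device = try_gpu()

# Tensor and model must be on the same device before they can be combined
X = torch.zeros((2, 3), device=device)
net = nn.Sequential(nn.Linear(3, 1)).to(device)

model_device = next(net.parameters()).device
print(X.device, model_device)

out = net(X)
print(out.shape)
```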