Tensor axis transforms, axis or dim (transpose, permute, view, reshape, einsum)

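Categories of operations:

Reordering dimensions: transpose, swapaxes, and permute reorder the dimensions without changing any dimension's size.
Regrouping dimensions: view and reshape regroup the original dimensions and can change their sizes, as long as the total number of elements stays the same.
General-purpose operation: einsum expresses a matrix operation by labeling the indices (dim/axis) of its operands.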

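dim and axis

Transforming a tensor's dim (dimension) or axis is one of the most important operations in deep learning with PyTorch (torch mostly says dim, numpy mostly says axis). transpose, permute, and view do not change the physical storage in memory; they only change the tensor's view, that is, the order and grouping of dimensions through which the same data is read. reshape returns such a view when it can and copies the data otherwise, and einsum generally computes a new tensor. For a contiguous tensor, the later a dimension is, the closer together its elements sit in memory (the last dimension has stride 1). Each dimension carries a concrete meaning. Use tensor.shape to inspect a tensor's dimensions.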
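To make the "same storage, different view" idea concrete, here is a minimal sketch that inspects strides and the data pointer before and after a transpose. The tensor `x` and its small shape are chosen only for illustration, and everything runs on the CPU.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)   # contiguous, row-major layout
y = x.transpose(0, 2)                    # shape (4, 3, 2), no data is copied

print(x.stride())   # (12, 4, 1): the last dim has stride 1
print(y.stride())   # (1, 4, 12): same buffer, strides swapped

# Both tensors point at the same memory; only y's reading order differs.
same_storage = x.data_ptr() == y.data_ptr()
print(same_storage, x.is_contiguous(), y.is_contiguous())
```

The stride of a dimension is how many elements to skip in memory to move one step along it, which is why swapping two dimensions only needs to swap two strides.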

For example, after loading image data, [32, 3, 64, 64] can be read as [batch_size, channel, height, width]; in self-attention, [16, 8, 32, 128] can be read as [batch_size, heads, seq_len, head_dim].

dim indices start at 0: for a tensor of shape [10, 3, 64, 64], dim takes the values 0, 1, 2, 3.

For example:

import torch
tensor = torch.randn(10, 3, 64, 64).to("cuda")
tensor.shape  # torch.Size([10, 3, 64, 64])

tensor[i] is equivalent to tensor[i, :, :, :], and its shape is [3, 64, 64];
tensor[i, j] is equivalent to tensor[i, j, :, :], and its shape is [64, 64].
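The indexing rules can be checked directly. The sketch below uses arbitrary example indices `i` and `j`, and also shows two things not mentioned above: the Ellipsis shorthand and negative dim indices, which count from the end.

```python
import torch

tensor = torch.randn(10, 3, 64, 64)   # CPU is enough for shape checks
i, j = 4, 1                            # arbitrary example indices

a = tensor[i]          # same as tensor[i, :, :, :]
b = tensor[i, j]       # same as tensor[i, j, :, :]
c = tensor[..., 0]     # Ellipsis stands for all leading dims

print(a.shape, b.shape, c.shape)
print(tensor.size(-1) == tensor.size(3))   # dim -1 is the last dim
```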

transpose: reorder dimensions

Usage: torch.transpose(tensor, dim1, dim2) swaps the two dimensions dim1 and dim2 of tensor.

import torch
tensor = torch.randn(16, 8, 32, 128).to("cuda")
# torch.Size([16, 8, 32, 128])
trans = torch.transpose(tensor, 2, 3).contiguous()
# torch.Size([16, 8, 128, 32])

Also, swapaxes is simply an alias of transpose: torch.swapaxes(tensor, dim1, dim2) has the same effect as the transpose above.

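permute: reorder dimensions

Usage: transpose and swapaxes can only swap two dimensions, whereas permute can reorder all axes at once. In tensor.permute(dim0, dim1, dim2...), which can also be written torch.permute(tensor, (dim0, dim1, dim2...)), each argument is the index of an original dimension, and its position in the argument list is that dimension's new position.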

import torch
tensor = torch.randn(16, 8, 32, 128).to("cuda")
# torch.Size([16, 8, 32, 128])
tensor = tensor.permute(0, 2, 1, 3)  # swap dims 1 and 2
# torch.Size([16, 32, 8, 128])
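As a rough illustration of how permute relates to transpose, the sketch below checks that a two-dimension swap gives the same result either way, and that a full rotation of the axes needs either one permute or a chain of transposes. The names `p`, `t`, `q`, and `q2` are only for this example.

```python
import torch

x = torch.randn(16, 8, 32, 128)    # (batch_size, heads, seq_len, head_dim)

p = x.permute(0, 2, 1, 3)          # new dim k takes old dim dims[k]
t = x.transpose(1, 2)              # swapping two dims is a special case

q = x.permute(3, 0, 1, 2)          # move the last dim to the front
q2 = x.transpose(2, 3).transpose(1, 2).transpose(0, 1)   # same rotation, three swaps

print(p.shape, q.shape)
print(torch.equal(p, t), torch.equal(q, q2))
```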

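view: regroup dimensions

Usage: tensor.contiguous().view(dim0, dim1, dim2...) changes the tensor's shape to (dim0, dim1, dim2...). The number of dims may be smaller or larger than in the original tensor, but the product of all N dimension sizes, $\prod_{i=0}^{N-1}{dim_i}$, must stay the same; therefore at most one dim may be given as -1, and its size is computed automatically.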

import torch
tensor = torch.randn(16, 8, 32, 16).to("cuda")
# torch.Size([16, 8, 32, 16])  (batch_size, heads, seq_len, head_dim)
tensor = tensor.transpose(1, 2).contiguous().view(16, 32, -1)  # merge heads: put heads next to head_dim first, then flatten them
# torch.Size([16, 32, 128])  (batch_size, seq_len, dim)

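contiguous: transpose and permute do not change the physical storage in memory; they only change the strides. A contiguous tensor is one whose elements are laid out in row-major order for its current shape, with later dimensions closer together in memory, so after these operations the tensor is generally no longer contiguous with respect to its new dimension order. view needs the requested shape to be expressible on the existing memory layout, which in practice usually means the tensor must be contiguous, so call tensor.contiguous() first to copy the data into a contiguous tensor. PyTorch 0.4 added torch.reshape(), which is roughly equivalent to tensor.contiguous().view() and removes the need to call contiguous() before view(). The recommendation is therefore to use reshape by default; keep in mind that reshape may return either a view or a copy, so use view when the result must share memory with the original.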
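The failure mode described above can be reproduced in a few lines. This is a minimal sketch on the CPU; the flag `view_failed` is just a name used here to record whether view raised an error.

```python
import torch

x = torch.randn(16, 8, 32, 16)
y = x.transpose(1, 2)                  # (16, 32, 8, 16), no longer contiguous

try:
    y.view(16, 32, -1)
    view_failed = False
except RuntimeError:
    view_failed = True                 # this layout cannot be expressed with strides alone

a = y.contiguous().view(16, 32, -1)    # explicit copy, then view
b = y.reshape(16, 32, -1)              # reshape makes the copy for us

print(view_failed, y.is_contiguous(), torch.equal(a, b))
```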

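reshape: regroup dimensions

Usage: tensor.reshape() behaves like tensor.contiguous().view(). tensor.reshape(dim0, dim1, dim2...) changes the tensor's shape to (dim0, dim1, dim2...). The number of dims may be smaller or larger than in the original tensor, but the product of all N dimension sizes, $\prod_{i=0}^{N-1}{dim_i}$, must stay the same; therefore at most one dim may be given as -1, and its size is computed automatically.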

import torch
tensor = torch.randn(16, 8, 32, 16).to("cuda")
# torch.Size([16, 8, 32, 16])  (batch_size, heads, seq_len, head_dim)
tensor = tensor.transpose(1, 2).reshape(16, 32, -1)  # merge heads: put heads next to head_dim first, then flatten them
# torch.Size([16, 32, 128])  (batch_size, seq_len, dim)
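Two points about reshape are easy to verify: it returns a view when the layout allows and a copy otherwise, and getting the right shape does not by itself mean the data is grouped as intended. The sketch below checks both for the merge-heads case; the position `(b, s)` is an arbitrary example and `expected` is built by hand for comparison.

```python
import torch

x = torch.randn(16, 8, 32, 16)                    # (batch_size, heads, seq_len, head_dim)

flat = x.reshape(16, 8, -1)                       # contiguous input: a view, no copy
shares = flat.data_ptr() == x.data_ptr()

merged = x.transpose(1, 2).reshape(16, 32, -1)    # non-contiguous input: a copy
copied = merged.data_ptr() != x.data_ptr()

# A token's merged vector should be its per-head vectors laid end to end.
b, s = 3, 5                                       # arbitrary example position
expected = torch.cat([x[b, h, s] for h in range(8)])
naive = x.reshape(16, 32, -1)                     # right shape, but rows mix different tokens

print(shares, copied)
print(torch.equal(merged[b, s], expected), torch.equal(naive[b, s], expected))
```

Moving heads next to head_dim before flattening is what makes the last dimension hold all heads of one token.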

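einsum: general-purpose operation

Usage: Einstein summation notation expresses a matrix operation by labeling the indices (dim/axis) of its operands. Unlike the operations above, torch.einsum can not only reorder or reduce the dimensions of a single matrix but also perform summation, matrix multiplication, element-wise multiplication, and similar operations across several matrices.

The left side of -> gives the index labels (one letter per dimension) of each input matrix; the right side gives the index labels of the output matrix.

permute (reordering): a single input matrix; both sides of -> have the same dimensions and only their order changes. For example, ij->ji swaps dimensions i and j.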

import torch
tensor = torch.randn(16, 8, 32, 16).to("cuda")
# torch.Size([16, 8, 32, 16])
tensor = torch.einsum("bhsd->bhds", tensor)
# torch.Size([16, 8, 16, 32])

sum: a single input matrix; every dimension missing from the right side of -> is summed over. For example, ij->i sums over dimension j.

import torch
tensor = torch.randn(16, 8, 32, 16).to("cuda")
# torch.Size([16, 8, 32, 16])
tensor = torch.einsum("bhsd->bh", tensor)
# torch.Size([16, 8])
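The single-input forms of einsum can be compared against the ordinary tensor methods they correspond to. This sketch uses float64 on the CPU so that the comparison of sums is not affected by float32 rounding; that dtype choice is an assumption of the example, not a requirement of einsum.

```python
import torch

x = torch.randn(16, 8, 32, 16, dtype=torch.float64)

perm = torch.einsum("bhsd->bhds", x)    # reorder only
total = torch.einsum("bhsd->bh", x)     # s and d are summed out

print(torch.equal(perm, x.transpose(2, 3)))
print(torch.allclose(total, x.sum(dim=(2, 3))))
```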

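matrix multiplication: on the left of ->, several input matrices separated by commas; on the right of ->, a single matrix. Values are multiplied and summed along the dimension that appears in both inputs on the left and is absent on the right. For example, ij,jk->ik is a matrix multiplication along dimension j.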

tensor1 = torch.randn(2, 3).to("cuda")
tensor2 = torch.randn(3, 5).to("cuda")
tensor = torch.einsum("ij, jk -> ik", tensor1, tensor2)
# (2,3) @ (3,5) = (2,5)

Combined operation: first a matrix multiplication along dimension j, then a sum over dimension k:

tensor1 = torch.randn(2, 3).to("cuda")
tensor2 = torch.randn(3, 5).to("cuda")
tensor = torch.einsum("ij, jk -> i", tensor1, tensor2)
# ((2,3) @ (3,5)).sum(dim=1): the (2,5) product is reduced to shape (2,)
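The combined form can be checked against its two-step equivalent. In this sketch `ref` and `alt` are illustrative names: `ref` multiplies first and then sums over k, while `alt` sums k first, which gives the same values because both j and k are summed out.

```python
import torch

a = torch.randn(2, 3, dtype=torch.float64)
b = torch.randn(3, 5, dtype=torch.float64)

out = torch.einsum("ij,jk->i", a, b)    # j and k are both absent from the output
ref = (a @ b).sum(dim=1)                # matmul, then sum over k
alt = a @ b.sum(dim=1)                  # sum over k first, then contract j

print(out.shape)
print(torch.allclose(out, ref), torch.allclose(out, alt))
```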

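A more complex combined operation: simulating the attention score. einsum implicitly transposes key and then multiplies over the last two dimensions. Both inputs have a seq_len dimension, but the two cannot both be labeled s: a letter shared by both inputs denotes one and the same index, and a letter may not appear twice in the output. Hence i and j are used.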

import torch
# query and key are both [batch_size, heads, seq_len, head_dim]
query = torch.randn(16, 8, 32, 16).to("cuda")
key = torch.randn(16, 8, 32, 16).to("cuda")
attention_score = torch.einsum("bhid, bhjd -> bhij", query, key)  # bhid, bhjd -> bhid, bhdj -> bhij
# torch.Size([16, 8, 32, 32])

# equivalent operations
attention_score = query @ key.transpose(-2, -1)
attention_score = torch.matmul(query, key.transpose(-2, -1))
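The equivalence can be verified numerically, and the same notation extends to the rest of the attention computation. In the sketch below, `value`, the 1/sqrt(head_dim) scaling, and the softmax are assumptions taken from standard scaled dot-product attention rather than from the text above.

```python
import math
import torch

torch.manual_seed(0)
query = torch.randn(16, 8, 32, 16, dtype=torch.float64)
key = torch.randn(16, 8, 32, 16, dtype=torch.float64)
value = torch.randn(16, 8, 32, 16, dtype=torch.float64)   # assumed, not in the text above

scores = torch.einsum("bhid,bhjd->bhij", query, key)
ref = query @ key.transpose(-2, -1)
print(torch.allclose(scores, ref))

# Scale, normalize over the key positions j, then take the weighted sum of values.
weights = torch.softmax(scores / math.sqrt(query.size(-1)), dim=-1)
context = torch.einsum("bhij,bhjd->bhid", weights, value)
print(context.shape)
```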

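element-wise multiplication: on the left of ->, several matrices of the same shape; on the right, a single matrix with the same shape as the inputs. Corresponding elements are multiplied; this is also called the Hadamard product.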

import torch
tensor1 = torch.randn(16, 8, 32, 16).to("cuda")
tensor2 = torch.randn(16, 8, 32, 16).to("cuda")

tensor = torch.einsum("bhsd,bhsd->bhsd", tensor1, tensor2)
# torch.Size([16, 8, 32, 16])

# equivalent operation
tensor = tensor1 * tensor2

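dot product: on the left of ->, several matrices of the same shape; the right of -> is empty (sum over everything). That is, multiply element-wise first, then sum all elements.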

import torch
tensor1 = torch.randn(16, 8, 32, 16).to("cuda")
tensor2 = torch.randn(16, 8, 32, 16).to("cuda")

tensor = torch.einsum("bhsd,bhsd-> ", tensor1, tensor2)
# tensor is a 0-dim tensor holding a single value

# equivalent operation
tensor = (tensor1 * tensor2).sum()
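A quick check of the full reduction is sketched below, with illustrative names `t1`, `t2`, `dot`, `ref`, and `builtin`. It also shows why the tensor method `.sum()` and Python's built-in `sum` are not interchangeable here: the built-in iterates over dim 0 only, so it does not reduce to a single value.

```python
import torch

t1 = torch.randn(16, 8, 32, 16, dtype=torch.float64)
t2 = torch.randn(16, 8, 32, 16, dtype=torch.float64)

dot = torch.einsum("bhsd,bhsd->", t1, t2)   # 0-dim tensor
ref = (t1 * t2).sum()                        # reduces over every dim
builtin = sum(t1 * t2)                       # Python sum: adds slices along dim 0 only

print(dot.shape, ref.shape, builtin.shape)
print(torch.allclose(dot, ref))
```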