[Megatron-DeepSpeed] Detailed Explanation of Tensor Parallel Tool Code mpu (3): Implementation and Testing of Tensor Parallel Layer

Related Blog
[Megatron-DeepSpeed] Tensor parallel tool code mpu detailed explanation (4): Implementation and testing of the tensor parallel version of the embedding layer and cross entropy
[Megatron-DeepSpeed] Tensor parallel tool code mpu detailed explanation (3): Implementation and testing of tensor parallel layers
[Megatron-DeepSpeed] Tensor parallel tool code mpu detailed explanation (1): Parallel environment initialization
[Megatron-DeepSpeed] Tensor parallel tool code mpu detailed explanation (2): Collective communication operation encapsulation (mappings)
[Deep learning] [Distributed training] DeepSpeed: AllReduce and ZeRO-DP
[Deep learning] Mixed precision training and memory analysis
[Deep learning] [Distributed training] Collective communication operations and PyTorch examples
[Natural language processing] [Large model] BLOOM: a large language model inference tool test
[Natural language processing] [Large model] GLM-130B: an open-source bilingual pre-trained language model
[Natural language processing] [Large model] Introduction to 8-bit matrix multiplication for large Transformers

Megatron-DeepSpeed: Implementation and Testing of Tensor Parallelism

Megatron-DeepSpeed is the DeepSpeed version of NVIDIA's Megatron-LM. Mainstream large models such as BLOOM and GLM-130B are developed on top of Megatron-DeepSpeed. Here we take the BLOOM version of Megatron-DeepSpeed as an example to introduce the details of its model parallel code mpu (located under megatron/mpu).

Understanding this part of the code requires some familiarity with the principles of model parallelism and collective communication. The related articles listed above are strongly recommended reading; otherwise parts of this article may be hard to follow.

Reading suggestions:

  1. This article analyzes only the core code and does not cover every line;
  2. Test scripts are provided to demonstrate the functionality of each part of the code;
  3. Hands-on practice is recommended to deepen understanding;
  4. Some familiarity with collective communication and distributed model training is recommended before reading this article.

1. Overview

​ The core files in the mpu directory are:

  • initialize.py: initialization of the data parallel, tensor parallel and pipeline parallel groups, plus helpers for querying information about these groups;
  • data.py: data broadcast used in tensor parallelism;
  • cross_entropy.py: tensor parallel version of cross entropy;
  • layers.py: tensor parallel version of the Embedding layer, as well as the column parallel and row parallel linear layers;
  • mappings.py: communication operations used in tensor parallelism.

2. 1D Tensor Parallel Principle

The tensor parallelism in Megatron-DeepSpeed is 1D tensor parallelism. Here is a brief introduction to the principle; for a more in-depth and comprehensive treatment of parallelization techniques, see the article on hundred-billion-parameter model training techniques referenced above.
[Figure: column parallelism and row parallelism of the linear layer Y = XA]

Take the fully connected layer $Y = XA$ as an example to introduce 1D tensor parallelism, where $X$ and $Y$ are the input and output and $A$ is the weight matrix. 1D tensor parallelism comes in two flavors, column parallelism and row parallelism (named after how the weight matrix is partitioned). The figure above shows both.

  • column parallel

    Partition the weight matrix by columns into $n$ parts (not necessarily of equal size), written as $A = [A_1, A_2, \dots, A_n]$. The matrix multiplication then becomes
    $$XA = X[A_1, A_2, \dots, A_n] = [XA_1, XA_2, \dots, XA_n]$$
    Obviously, only the weights need to be partitioned; the input $X$ stays whole.

  • row parallel

    In addition to partitioning the weights, the input matrix must also be partitioned. Assuming $A$ is split row-wise into $n$ parts, the input matrix $X$ must be split column-wise into $n$ parts, and the matrix multiplication becomes
    $$XA = [X_1, X_2, \dots, X_n] \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix} = X_1 A_1 + X_2 A_2 + \dots + X_n A_n$$
    A minimal sketch after this list verifies both decompositions numerically.
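
The sketch below is a minimal single-process check in plain PyTorch (sizes chosen arbitrarily, no distributed setup) verifying that both decompositions reproduce the unpartitioned product $XA$:

import torch

# Single-process check: splitting A by columns or by rows reproduces XA.
torch.manual_seed(0)
X = torch.randn(6, 8)
A = torch.randn(8, 4)

# Column parallelism: A = [A_1, A_2] along dim 1, concatenate the partial outputs.
A1, A2 = A.chunk(2, dim=1)
Y_col = torch.cat([X @ A1, X @ A2], dim=1)

# Row parallelism: A split along dim 0, X split along dim 1, sum the partial products.
A1r, A2r = A.chunk(2, dim=0)
X1, X2 = X.chunk(2, dim=1)
Y_row = X1 @ A1r + X2 @ A2r

assert torch.allclose(Y_col, X @ A, atol=1e-6)
assert torch.allclose(Y_row, X @ A, atol=1e-6)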

3. Implementation and testing of tensor parallelism

1. Column Parallel

During the forward pass of a column parallel layer, each process in the tensor parallel group can compute independently. Assuming a tensor parallel degree of 2, the forward pass of the network can be written as
$$\text{loss} = f(Y) = f([Y_1, Y_2]) = f([XA_1, XA_2])$$
During the backward pass, the gradient of $\text{loss}$ with respect to the input $X$ is
$$\frac{\partial f}{\partial X} = \frac{\partial f}{\partial Y_1}\frac{\partial Y_1}{\partial X} + \frac{\partial f}{\partial Y_2}\frac{\partial Y_2}{\partial X} = \frac{\partial f}{\partial Y_1} A_1^T + \frac{\partial f}{\partial Y_2} A_2^T$$
so the input gradients computed independently on each rank of the tensor parallel group must be summed.
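
In the source code below, this identity-forward / all-reduce-backward behavior is provided by copy_to_tensor_model_parallel_region from mappings.py (covered in part (2) of this series). A simplified, illustrative sketch of such an autograd function (the real implementation differs in details) could look like:

import torch
import megatron.mpu as mpu

class _CopyToTensorParallelRegionSketch(torch.autograd.Function):
    """Sketch only: identity in the forward pass, all-reduce of the gradient
    across the tensor parallel group in the backward pass."""

    @staticmethod
    def forward(ctx, input_):
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # Sum the input gradients computed independently by each rank of the group
        torch.distributed.all_reduce(
            grad_output, group=mpu.get_tensor_model_parallel_group())
        return grad_output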

source code

class ColumnParallelLinear(torch.nn.Module):
    """
    Column parallel linear layer.
    The linear layer is defined as Y=XA+b. A is parallelized along its second dimension, A = [A_1, ..., A_p].

    Arguments:
        input_size: first dimension of matrix A.
        output_size: second dimension of matrix A.
        bias: add bias if true.
        gather_output: if true, call all-gather on the output so that Y is accessible to all GPUs.
        init_method: random initialization method.
        stride: strided linear layer.
    """

    def __init__(self, input_size, output_size, bias=True, gather_output=True,
                 init_method=init.xavier_normal_, stride=1,
                 keep_master_weight_for_test=False,
                 skip_bias_add=False):
        super(ColumnParallelLinear, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.gather_output = gather_output
        # Get the world_size of the tensor parallel group
        world_size = get_tensor_model_parallel_world_size()
        # Partition the output dimension by the tensor parallel degree (world_size)
        self.output_size_per_partition = divide(output_size, world_size)
        self.skip_bias_add = skip_bias_add

        # Parameters.
        # Note: torch.nn.functional.linear performs XA^T+b
        args = get_args()
        if args.use_cpu_initialization:
            # Initialize the tensor. If the full weight matrix A is n*m and the tensor parallel
            # degree is k, the tensor initialized here is n*(m/k);
            # i.e. each process in the tensor parallel group initializes only its own partition.
            self.weight = Parameter(torch.empty(self.output_size_per_partition,
                                                self.input_size,
                                                dtype=args.params_dtype))
            # Randomly initialize the weight matrix self.weight with init_method (CPU version).
            # self.master_weight is only used in tests and can be ignored here.
            self.master_weight = _initialize_affine_weight_cpu(
                self.weight, self.output_size, self.input_size,
                self.output_size_per_partition, 0, init_method,
                stride=stride, return_master_weight=keep_master_weight_for_test)
        else:
            self.weight = Parameter(torch.empty(
                self.output_size_per_partition, self.input_size,
                device=torch.cuda.current_device(), dtype=args.params_dtype))
            # Randomly initialize the weight matrix self.weight with init_method (GPU version)
            _initialize_affine_weight_gpu(self.weight, init_method,
                                          partition_dim=0, stride=stride)

        if bias:
            # Instantiate a bias
            if args.use_cpu_initialization:
                self.bias = Parameter(torch.empty(
                    self.output_size_per_partition, dtype=args.params_dtype))
            else:
                self.bias = Parameter(torch.empty(
                    self.output_size_per_partition,
                    device=torch.cuda.current_device(),
                    dtype=args.params_dtype))
            # Attach the tensor parallel attributes to self.bias
            set_tensor_model_parallel_attributes(self.bias, True, 0, stride)
            # Initialize the bias to zero
            with torch.no_grad():
                self.bias.zero_()
        else:
            self.register_parameter('bias', None)

    def forward(self, input_):
        # In the forward pass input_parallel is simply input_;
        # in the backward pass the gradients are all-reduced within the tensor parallel group
        input_parallel = copy_to_tensor_model_parallel_region(input_)
        bias = self.bias if not self.skip_bias_add else None
        output_parallel = F.linear(input_parallel, self.weight, bias)
        if self.gather_output:
            # Gather and concatenate the outputs within the tensor parallel group.
            # output is then the same as the forward output without tensor parallelism,
            # and every process in the tensor parallel group holds an identical output.
            output = gather_from_tensor_model_parallel_region(output_parallel)
        else:
            # output is the tensor parallel forward output;
            # processes in the tensor parallel group hold different outputs.
            output = output_parallel
        output_bias = self.bias if self.skip_bias_add else None
        return output, output_bias
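
One detail worth noting is skip_bias_add: when set, the layer returns the bias separately instead of adding it inside F.linear, so that the caller can, for example, fuse the bias addition with a later elementwise operation. A self-contained sketch of this pattern, using a plain nn.Linear as a stand-in for the parallel layer (illustrative only, not mpu code):

import torch
import torch.nn.functional as F

linear = torch.nn.Linear(8, 4)          # stand-in for a layer built with skip_bias_add=True
x = torch.randn(6, 8)

output = F.linear(x, linear.weight)     # matmul only, bias skipped
output = F.gelu(output + linear.bias)   # bias added later, together with the activation

reference = F.gelu(linear(x))           # ordinary path with the bias inside the linear
assert torch.allclose(output, reference, atol=1e-6)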

test code

The test follows the setup from [Megatron-DeepSpeed] Tensor parallel tool code mpu detailed explanation (1): Parallel environment initialization, with a tensor parallel degree of 2 and a pipeline parallel degree of 2.

def test_column_parallel_linear():
    global_rank = torch.distributed.get_rank()
    tensor_model_parallel_size = mpu.get_tensor_model_parallel_world_size()
    # Set the random seed
    seed = 12345
    set_random_seed(seed)
    # input_size of the tensor held by each process in the tensor parallel group
    input_size_coeff = 4
    input_size = input_size_coeff * tensor_model_parallel_size
    # output_size of the tensor held by each process in the tensor parallel group
    output_size_coeff = 2
    output_size = output_size_coeff * tensor_model_parallel_size
    # Initialize a mock network that produces a 2D tensor; the input tensor is (batch_size, input_size)
    batch_size = 6
    identity_layer = IdentityLayer2D(batch_size, input_size).cuda()
    # Initialize a column parallel linear layer
    linear_layer = mpu.ColumnParallelLinear(
        input_size, output_size, keep_master_weight_for_test=True, gather_output=False).cuda()
    # Randomly initialize a loss weight,
    # mainly to obtain a scalar loss and so verify that the gradients are correct
    loss_weight = torch.randn([batch_size, output_size]).cuda()
    ## Forward pass
    input_ = identity_layer()
    # Each process in the tensor parallel group now holds only part of the full output tensor
    output = linear_layer(input_)[0]

    if torch.distributed.get_rank() == 0:
        print(f"> Output size without tensor parallel is ({batch_size},{output_size})")
    torch.distributed.barrier()
    info = "*" * 20 + \
           f"\n> global_rank={global_rank}\n" + \
           f"> output size={output.size()}\n"
    print(info, end="")

Test Results

[Figure: test output showing the per-rank output size for each global_rank]

As expected, the output without tensor parallelism would have shape (6, 4); with a tensor parallel degree of 2, the output held by each rank has shape (6, 2).

2. Row Parallel

During the forward pass of a row parallel layer, each process in the tensor parallel group holds not only part of the weights but also part of the input tensor. The forward pass can be written as
$$\text{loss} = f(Y) = f(XA) = f\left([X_1, X_2] \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}\right) = f(X_1 A_1 + X_2 A_2)$$
Rank 0 in the tensor parallel group holds $X_1$ and $A_1$, and Rank 1 holds $X_2$ and $A_2$; each completes its matrix multiplication on its own GPU, and the partial results are then summed.

During the backward pass, since every rank holds the full gradient $\frac{\partial f}{\partial Y}$ after the forward-pass reduction, each rank can compute the gradient of its own input shard locally:
$$\frac{\partial f}{\partial X_i} = \frac{\partial f}{\partial Y} A_i^T$$
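
In the source code below, the summation of the partial products is performed by reduce_from_tensor_model_parallel_region, which is conceptually the mirror image of the copy operator sketched earlier: all-reduce in the forward pass, identity in the backward pass. A simplified, illustrative sketch (the real mappings.py implementation differs in details):

import torch
import megatron.mpu as mpu

class _ReduceFromTensorParallelRegionSketch(torch.autograd.Function):
    """Sketch only: all-reduce the partial results X_i A_i in the forward pass;
    the backward pass is the identity, since after the forward all-reduce every
    rank already holds the full dLoss/dY."""

    @staticmethod
    def forward(ctx, input_):
        torch.distributed.all_reduce(
            input_, group=mpu.get_tensor_model_parallel_group())
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output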

source code

class RowParallelLinear(torch.nn.Module):
    """
    Row parallel linear layer.
    The linear layer is defined as Y = XA + b.
    A is parallelized along its first dimension and X along its second dimension, i.e.
               -   -
              | A_1 |
              | .   |
          A = | .   |        X = [X_1, ..., X_p]
              | .   |
              | A_p |
               -   -
    Arguments:
        input_size: first dimension of matrix A.
        output_size: second dimension of matrix A.
        bias: add bias if true.
        input_is_parallel: if true, the input is assumed to already be split across the GPUs
                           and no further splitting is performed.
        init_method: random initialization method.
        stride: strided linear layer.
    """

    def __init__(self, input_size, output_size, bias=True,
                 input_is_parallel=False,
                 init_method=init.xavier_normal_, stride=1,
                 keep_master_weight_for_test=False,
                 skip_bias_add=False):
        super(RowParallelLinear, self).__init__()

        self.input_size = input_size
        self.output_size = output_size
        self.input_is_parallel = input_is_parallel
        # Get the world_size of the tensor parallel group
        world_size = get_tensor_model_parallel_world_size()
        # Partition the input dimension by the tensor parallel degree (world_size)
        self.input_size_per_partition = divide(input_size, world_size)
        self.skip_bias_add = skip_bias_add

        # Parameters.
        # Note: torch.nn.functional.linear performs XA^T+b
        args = get_args()
        if args.use_cpu_initialization:
            # Initialize the tensor. If the full weight matrix A is n*m and the tensor parallel
            # degree is k, the tensor initialized here is (n/k)*m;
            # i.e. each process in the tensor parallel group initializes only its own partition.
            self.weight = Parameter(torch.empty(self.output_size,
                                                self.input_size_per_partition,
                                                dtype=args.params_dtype))
            # Randomly initialize the weight matrix self.weight with init_method (CPU version).
            # self.master_weight is only used in tests and can be ignored here.
            self.master_weight = _initialize_affine_weight_cpu(
                self.weight, self.output_size, self.input_size,
                self.input_size_per_partition, 1, init_method,
                stride=stride, return_master_weight=keep_master_weight_for_test)
        else:
            self.weight = Parameter(torch.empty(
                self.output_size, self.input_size_per_partition,
                device=torch.cuda.current_device(), dtype=args.params_dtype))
            # Randomly initialize the weight matrix self.weight with init_method (GPU version)
            _initialize_affine_weight_gpu(self.weight, init_method,
                                          partition_dim=1, stride=stride)
        if bias:
            # Instantiate a bias
            if args.use_cpu_initialization:
                self.bias = Parameter(torch.empty(self.output_size,
                                                  dtype=args.params_dtype))
            else:
                self.bias = Parameter(torch.empty(
                    self.output_size, device=torch.cuda.current_device(),
                    dtype=args.params_dtype))
            # Always initialize bias to zero.
            with torch.no_grad():
                self.bias.zero_()
        else:
            self.register_parameter('bias', None)

        self.bias_tp_auto_sync = args.sync_tp_duplicated_parameters

    def forward(self, input_):
        if self.input_is_parallel:
            input_parallel = input_
        else:
            # In the forward pass, scatter input_ across the processes of the tensor parallel group;
            # in the backward pass, gather the partial input_ gradients into the full gradient.
            # Here input_ is the full input tensor and input_parallel is its shard, i.e. input_parallel != input_
            input_parallel = scatter_to_tensor_model_parallel_region(input_)

        output_parallel = F.linear(input_parallel, self.weight)
        # All-reduce the outputs within the tensor parallel group, i.e. compute X1*A1+X2*A2
        output_ = reduce_from_tensor_model_parallel_region(output_parallel)

        if self.bias_tp_auto_sync:
            torch.distributed.all_reduce(self.bias, op=torch.distributed.ReduceOp.AVG, group=mpu.get_tensor_model_parallel_group())

        if not self.skip_bias_add:
            output = output_ + self.bias if self.bias is not None else output_
            output_bias = None
        else:
            output = output_
            output_bias = self.bias
        return output, output_bias

test code

Since the row parallel layer RowParallelLinear completely hides the parallel details internally, its execution cannot be understood from the inputs and outputs alone. The test here therefore overrides its forward method to expose those details.

class MyRowParallelLinear(mpu.RowParallelLinear):
    def forward(self, input_):
        global_rank = torch.distributed.get_rank()
        # Shapes of the input X, the weight A and the output Y
        X_size = list(input_.size())
        A_size = [self.input_size, self.output_size]
        Y_size = [X_size[0], A_size[1]]
        if self.input_is_parallel:
            input_parallel = input_
        else:
            input_parallel = mpu.scatter_to_tensor_model_parallel_region(input_)
        Xi_size = list(input_parallel.size())
        Ai_size = list(self.weight.T.size())

        info = "*" * 20 + \
               f"\n> global_rank={global_rank}\n" + \
               f"> size of X={X_size}\n" + \
               f"> size of A={A_size}\n" + \
               f"> size of Y={Y_size}\n" + \
               f"> size of Xi={Xi_size}\n" + \
               f"> size of Ai={Ai_size}\n"

        output_parallel = F.linear(input_parallel, self.weight)
        # Add global_rank to output_parallel so that different ranks hold different
        # output_parallel values, which makes the subsequent results easier to observe
        output_parallel = output_parallel + global_rank
        Yi_size = list(output_parallel.size())
        info += f"> size of Yi={Yi_size}\n" + \
                f"> Yi={output_parallel}\n"
        output_ = mpu.reduce_from_tensor_model_parallel_region(output_parallel)
        info += f"> Y={output_}"

        if self.bias_tp_auto_sync:
            torch.distributed.all_reduce(self.bias, op=torch.distributed.ReduceOp.AVG, group=mpu.get_tensor_model_parallel_group())

        if not self.skip_bias_add:
            output = output_ + self.bias if self.bias is not None else output_
            output_bias = None
        else:
            output = output_
            output_bias = self.bias
        print(info)
        return output, output_bias

def test_row_parallel_linear():
    global_rank = torch.distributed.get_rank()
    tensor_model_parallel_size = mpu.get_tensor_model_parallel_world_size()
    # Set the random seed
    seed = 12345
    set_random_seed(seed)
    # input_size of the tensor held by each process in the tensor parallel group
    input_size_coeff = 4
    input_size = input_size_coeff * tensor_model_parallel_size
    # output_size of the tensor held by each process in the tensor parallel group
    output_size_coeff = 2
    output_size = output_size_coeff * tensor_model_parallel_size
    # Initialize a mock network that produces a 2D tensor; the input tensor is (batch_size, input_size)
    batch_size = 6
    identity_layer = IdentityLayer2D(batch_size, input_size).cuda()
    # Initialize a row parallel linear layer
    linear_layer = MyRowParallelLinear(
        input_size, output_size, keep_master_weight_for_test=True).cuda()

    # Forward pass
    input_ = identity_layer()
    output = linear_layer(input_)

Test Results

[Figure: per-rank test output showing the sizes of X, A, Y, Xi, Ai and Yi, together with the per-rank Yi and the all-reduced Y]

4. Complete test code

# test_layers.py
import sys
sys.path.append("..")

import os
import torch.nn.functional as F
from megatron import get_args
from megatron.mpu import layers
from megatron.initialize import _initialize_distributed
from megatron.global_vars import set_global_variables
from commons import set_random_seed
from commons import print_separator
from commons import initialize_distributed
import megatron.mpu as mpu
import torch.nn.init as init
from torch.nn.parameter import Parameter
import torch
import random

class IdentityLayer2D(torch.nn.Module):
    """
    Mock network that produces a 2D tensor (used as the input in the tests)
    """
    def __init__(self, m, n):
        super(IdentityLayer2D, self).__init__()
        self.weight = Parameter(torch.Tensor(m, n))
        torch.nn.init.xavier_normal_(self.weight)

    def forward(self):
        return self.weight
    
def test_column_parallel_linear():
    global_rank = torch.distributed.get_rank()
    tensor_model_parallel_size = mpu.get_tensor_model_parallel_world_size()
    # Set the random seed
    seed = 12345
    set_random_seed(seed)
    # input_size of the tensor held by each process in the tensor parallel group
    input_size_coeff = 4
    input_size = input_size_coeff * tensor_model_parallel_size
    # output_size of the tensor held by each process in the tensor parallel group
    output_size_coeff = 2
    output_size = output_size_coeff * tensor_model_parallel_size
    # Initialize a mock network that produces a 2D tensor; the input tensor is (batch_size, input_size)
    batch_size = 6
    identity_layer = IdentityLayer2D(batch_size, input_size).cuda()
    # Initialize a column parallel linear layer
    linear_layer = mpu.ColumnParallelLinear(
        input_size, output_size, keep_master_weight_for_test=True, gather_output=False).cuda()
    # Randomly initialize a loss weight,
    # mainly to obtain a scalar loss and so verify that the gradients are correct
    loss_weight = torch.randn([batch_size, output_size]).cuda()
    ## Forward pass
    input_ = identity_layer()
    # Each process in the tensor parallel group now holds only part of the full output tensor
    output = linear_layer(input_)[0]

    if torch.distributed.get_rank() == 0:
        print(f"> Output size without tensor parallel is ({batch_size},{output_size})")
    torch.distributed.barrier()
    info = "*" * 20 + \
           f"\n> global_rank={global_rank}\n" + \
           f"> output size={output.size()}\n"
    print(info, end="")
    
class MyRowParallelLinear(mpu.RowParallelLinear):
    def forward(self, input_):
        global_rank = torch.distributed.get_rank()
        # Shapes of the input X, the weight A and the output Y
        X_size = list(input_.size())
        A_size = [self.input_size, self.output_size]
        Y_size = [X_size[0], A_size[1]]
        if self.input_is_parallel:
            input_parallel = input_
        else:
            input_parallel = mpu.scatter_to_tensor_model_parallel_region(input_)
        Xi_size = list(input_parallel.size())
        Ai_size = list(self.weight.T.size())

        info = "*" * 20 + \
               f"\n> global_rank={global_rank}\n" + \
               f"> size of X={X_size}\n" + \
               f"> size of A={A_size}\n" + \
               f"> size of Y={Y_size}\n" + \
               f"> size of Xi={Xi_size}\n" + \
               f"> size of Ai={Ai_size}\n"

        output_parallel = F.linear(input_parallel, self.weight)
        # Add global_rank to output_parallel so that different ranks hold different
        # output_parallel values, which makes the subsequent results easier to observe
        output_parallel = output_parallel + global_rank
        Yi_size = list(output_parallel.size())
        info += f"> size of Yi={Yi_size}\n" + \
                f"> Yi={output_parallel}\n"
        output_ = mpu.reduce_from_tensor_model_parallel_region(output_parallel)
        info += f"> Y={output_}"

        if self.bias_tp_auto_sync:
            torch.distributed.all_reduce(self.bias, op=torch.distributed.ReduceOp.AVG, group=mpu.get_tensor_model_parallel_group())

        if not self.skip_bias_add:
            output = output_ + self.bias if self.bias is not None else output_
            output_bias = None
        else:
            output = output_
            output_bias = self.bias
        print(info)
        return output, output_bias
    
def test_row_parallel_linear():
    global_rank = torch.distributed.get_rank()
    tensor_model_parallel_size = mpu.get_tensor_model_parallel_world_size()
    # Set the random seed
    seed = 12345
    set_random_seed(seed)
    # input_size of the tensor held by each process in the tensor parallel group
    input_size_coeff = 4
    input_size = input_size_coeff * tensor_model_parallel_size
    # output_size of the tensor held by each process in the tensor parallel group
    output_size_coeff = 2
    output_size = output_size_coeff * tensor_model_parallel_size
    # Initialize a mock network that produces a 2D tensor; the input tensor is (batch_size, input_size)
    batch_size = 6
    identity_layer = IdentityLayer2D(batch_size, input_size).cuda()
    # Initialize a row parallel linear layer
    linear_layer = MyRowParallelLinear(
        input_size, output_size, keep_master_weight_for_test=True).cuda()

    # Forward pass
    input_ = identity_layer()
    output = linear_layer(input_)

def main():
    set_global_variables(ignore_unknown_args=True)
    _initialize_distributed()
    world_size = torch.distributed.get_world_size()

    print_separator('Test test_column_parallel_linear')
    test_column_parallel_linear()

    print_separator('Test test_row_parallel_linear')
    test_row_parallel_linear()


if __name__ == '__main__':
    main()

startup script

# Apart from tensor-model-parallel-size and pipeline-model-parallel-size,
# the remaining arguments exist only for compatibility with the original code, to ensure it runs without errors.
options=" \
        --tensor-model-parallel-size 2 \
        --pipeline-model-parallel-size 2 \
        --num-layers 10 \
        --hidden-size 768 \
        --micro-batch-size 2 \
        --num-attention-heads 32 \
        --seq-length 512 \
        --max-position-embeddings 512 \
        --use_cpu_initialization True
        "

cmd="deepspeed test_layers.py $@ ${options}"

eval ${cmd}
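
Note: with --tensor-model-parallel-size 2 and --pipeline-model-parallel-size 2, this launch needs a world size of at least 4. By default the deepspeed launcher uses all GPUs visible on the node, so restrict it if more GPUs are available (for example, deepspeed --num_gpus 4 test_layers.py ${options}).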


Source: blog.csdn.net/bqw18744018044/article/details/132135532