A detailed understanding of the GPT2 model structure and its training process—GPT series training and deployment

        This article is an original article by the blogger and may not be reproduced without the blogger's permission.

        This article is part of the column "Python AIGC large model training and reasoning from scratch"; the column address is "https://blog.csdn.net/suiyingy/article/details/130169592".

        For setting up and debugging the GPT2 model environment, please refer to the blog posts "GPT Series Training and Deployment—GPT2 Environment Configuration and Model Training" and "ColossalAI GPT2 Distributed Training and Debugging Configuration—GPT Series Training and Deployment", at "https://blog.csdn.net/suiyingy/article/details/128711444" and "https://blog.csdn.net/suiyingy/article/details/128806531". This section introduces the structure of the GPT2 model and its training and debugging process in detail.

        In addition, for updates to this column you can follow the official account below the article or follow the column itself. All related articles will be collected in "Python AIGC Large Model Training and Reasoning from Scratch" at "https://blog.csdn.net/suiyingy/article/details/130169592". Deployed demos of all the AIGC models covered here will also be released in the RdFast mini program.

1 Transformer main structure

        The Transformer model is a neural network architecture for NLP, proposed by Google researchers in 2017 in the paper "Attention Is All You Need". Its core is the self-attention mechanism, which avoids the vanishing-gradient problem that RNNs face on long sequences and captures the relationships between sequence elements more effectively. The Transformer is a revolutionary architecture that has achieved remarkable results in natural language processing and has become a key foundation of the field.

        The overall structure of the Transformer model is shown in the figure below. It is a typical encoder-decoder architecture, with the encoder on the left and the decoder on the right.

Figure 1 Transformer model structure

1.1 Self-attention mechanism (self-attention)

        The self-attention mechanism means that, when extracting features, the model attends to the correlations within its own input. Attention can be understood as a weight describing how much focus each element receives. For example, given a sequence composed of A, B, C, and D, the degree to which A is influenced by A, B, C, and D corresponds to the attention that A pays to each of them when A's features are computed.

        To measure this attention, the Transformer introduces the QKV (Query, Key, Value) representation. Q, K, and V are representations of the original data in new spaces. Suppose the input sequence has feature dimension (N_seq, C_embedding), where N_seq is the sequence length and C_embedding is the feature dimension of each element. After a fully connected layer Linear(C_embedding, C_qkv), the output feature dimension is (N_seq, C_qkv). The Transformer uses three such fully connected projections to convert the original sequence into Q, K, and V, each with dimension (N_seq, C_qkv).

 Figure 2 Schematic diagram of QKV process

        This can be understood as representing the original data in the Q, K, and V spaces. Q acts as a query index over the data, roughly analogous to an element's position in the original sequence, while (K, V) represents the data's content. The advantage of this key-value representation is that the values are not involved in computing the attention weights, i.e., the values are decoupled from the attention scores.

        The size of the attention is given by the correlation between Q and K: the greater the correlation, the larger the attention weight. As shown in the figure below, the correlation is computed by matrix multiplication, Q x K^T, so the output dimension after multiplication is (N_seq, N_seq). Each row represents the correlation of the current element with all elements. To keep the distribution of the output stable, the product is divided by sqrt(dk), where dk is the C_qkv above. Dividing by sqrt(dk) clearly does not change the relative sizes of the attention values; it only adjusts the overall magnitude.

 Figure 3 Attention calculation

        We want the attention weights to sum to 1, with each weight acting as a proportion, so the multiplied scores are passed through Softmax to perform this conversion. Assuming the input sequence has length 3 with scores (x1, x2, x3), the Softmax outputs are exp(x1)/(exp(x1)+exp(x2)+exp(x3)), exp(x2)/(exp(x1)+exp(x2)+exp(x3)), and exp(x3)/(exp(x1)+exp(x2)+exp(x3)).

Figure 4 Softmax

        The attention weights are then multiplied by V, i.e., a weighted sum, to obtain the output, whose dimension is (N_seq, C_qkv). This converts the input data from dimension (N_seq, C_embedding) to (N_seq, C_qkv), which is equivalent to representing the original sequence with a new sequence that takes the correlations between elements into account.
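        To make the Q, K, V flow above concrete, here is a minimal standalone sketch of single-head self-attention in PyTorch. The sizes (a length-4 sequence, C_embedding = 512, C_qkv = 64) and the variable names are chosen purely for illustration; this is not code from any particular framework.

import math
import torch
import torch.nn as nn

n_seq, c_embedding, c_qkv = 4, 512, 64                 # example sequence length and feature sizes
x = torch.randn(n_seq, c_embedding)                    # input sequence, (N_seq, C_embedding)
w_q = nn.Linear(c_embedding, c_qkv)                    # three independent projections produce Q, K and V
w_k = nn.Linear(c_embedding, c_qkv)
w_v = nn.Linear(c_embedding, c_qkv)
q, k, v = w_q(x), w_k(x), w_v(x)                       # each (N_seq, C_qkv)
scores = q @ k.transpose(-1, -2) / math.sqrt(c_qkv)    # (N_seq, N_seq), correlation of each element with every element
weights = torch.softmax(scores, dim=-1)                # each row sums to 1
out = weights @ v                                      # weighted sum of V, (N_seq, C_qkv)
print(out.shape)                                       # torch.Size([4, 64])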

1.2 Self-attention mechanism and convolutional neural network (CNN)

        In a conventional convolutional or fully connected network, feature extraction is roughly equivalent to computing only V: once training is finished, the parameters of the fully connected layer are fixed, and every input is transformed with the same weights. The attention mechanism instead weights V through Q and K. Because the attention weights obtained from Q and K differ for different inputs, it is as if the layer's effective parameters change with the input. As a result, a model with self-attention extracts features in a more input-dependent, diverse way and can capture richer and more useful features, at the cost of additional parameters.

        In the field of computer vision, researchers have likewise designed many convolution operations whose behavior depends on the input, such as PAConv and Deformable Conv.

1.3 Multi-head Attention

        The multi-head attention mechanism is equivalent to applying the self-attention mechanism several times in parallel and then concatenating and fusing the results. If we compare the self-attention process above to a convolution, multi-head attention is analogous to increasing the number of output channels. Suppose a single self-attention module outputs (N_seq, C_qkv) and there are MA such modules (i.e., MA heads); the stacked output then has dimension MA x N_seq x C_qkv, and after concatenation the feature dimension is N_seq x (MA x C_qkv), which amounts to a feature fusion. In convolutional networks, concatenated features are usually fused further by another convolution or fully connected layer; the Transformer does the same with its Feed-Forward network. In addition, the Transformer repeatedly uses residual connections to fuse in the input features.

        To keep the attention features consistent with the input features, MA x C_qkv should equal C_embedding. If C_embedding is 512 and C_qkv is 64, the number of heads MA should be 8; if C_embedding is 768 and C_qkv is 64, the number of heads MA should be 12.
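        The following short sketch (with made-up tensors) just illustrates the bookkeeping described above: with a per-head dimension of 64, a 768-dimensional embedding yields 12 heads, and concatenating the per-head outputs restores the original feature width so that blocks can be stacked.

import torch

n_seq, c_embedding, c_qkv = 4, 768, 64
num_heads = c_embedding // c_qkv                       # 768 / 64 = 12 heads
head_outputs = [torch.randn(n_seq, c_qkv) for _ in range(num_heads)]  # pretend each head produced its own output
fused = torch.cat(head_outputs, dim=-1)                # concatenation along the feature dimension, (4, 768)
assert fused.shape[-1] == c_embedding                  # MA x C_qkv equals C_embedding, so the block can be stacked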

1.4 Multi-layer stacking

        If the output dimension of the attention module matches its input dimension, the module can be stacked over multiple layers, i.e., the output (N_seq, C_embedding) of one module serves as the input (N_seq, C_embedding) of the next, as in the GPT2 model structure introduced below. Common GPT2 configurations are listed in the table below.

Table 1 GPT2 layers and parameters

Parameters    Layers    Hidden size (vector length)
117M          12        768
345M          24        1024
762M          36        1280
1542M         48        1600
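        Purely as an illustration, the table can be written down as a small Python dictionary (the key names here are invented for this sketch and do not correspond to any framework's configuration format). Assuming a fixed per-head dimension of 64, as in the examples above, the head count follows directly from the hidden size.

GPT2_SIZES = {                                         # sizes from Table 1
    '117M':  {'num_layers': 12, 'hidden_size': 768},
    '345M':  {'num_layers': 24, 'hidden_size': 1024},
    '762M':  {'num_layers': 36, 'hidden_size': 1280},
    '1542M': {'num_layers': 48, 'hidden_size': 1600},
}
for name, cfg in GPT2_SIZES.items():
    # per-head dimension of 64 assumed for the head count
    print(name, cfg['num_layers'], cfg['hidden_size'], cfg['hidden_size'] // 64)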

2 GPT2 model structure

        The GPT series models use only one half of the Transformer: a stack of identical self-attention blocks with causal (masked) self-attention, i.e., decoder-style blocks without the encoder-decoder cross-attention. The figure below is a schematic of the 12-layer GPT2 structure, where each layer is one attention module.

Figure 5 12-layer GPT2

3 GPT2 training procedure

        The GPT2 program introduced next comes from the Colossal-AI framework, and the address is "https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt". Its environment construction and debugging methods are described in detail in previous articles of this column.

3.1 Input data and word segmentation

3.1.1 Input data

        The data used in the test is OpenWebText. For the download and preprocessing steps, see "GPT Series Training and Deployment—GPT2 Environment Configuration and Model Training" at "https://blog.csdn.net/suiyingy/article/details/128711444". After processing, the original data is saved in train.json with the structure {'text': text, 'url': unique_url}, as shown below.

{"text": "The space station looks like an airplane or a very bright star moving across the sky, except it doesn't have flashing lights or change direction. It will also be moving considerably faster than a typical airplane (airplanes generally fly at about 600 miles per hour; the space station flies at 17,500 miles per hour).\n\nBelow is a time-lapse photo of the space station moving across the sky.\n\nThe International Space Station is seen in this 30 second exposure as it flies over Elkton, VA early in the morning, Saturday, August 1, 2015. Photo Credit: NASA/Bill Ingalls\n\nVisit the NASA Johnson Flickr Photostream", "url": "http://spotthestation.nasa.gov/sightings/view.cfm?country=United_States®ion=Arizona&city=Phoenix#.UvPTWWSwLpM"}

        The model input is the text field. The maximum number of tokens allowed per text is set to 1024, i.e., the input sequence length N_seq of the GPT2 model is 1024.

3.1.2 Word segmentation

        Word segmentation (tokenization) in natural language processing refers to splitting a text into tokens. It is an important basic task in NLP; its main purpose is to split continuous natural language text into lexical units with semantic meaning for further semantic analysis.

        In languages such as Chinese, words are not separated by spaces as they are in English, and must instead be identified through word segmentation. Word segmentation matters greatly for machine translation, information retrieval, sentiment analysis, and other tasks, because a single character or letter cannot express a complete meaning; only by combining characters into appropriate words can natural language text be understood and processed well.

        During word segmentation we use a dictionary (vocabulary) that records all tokens and their index numbers. Here the vocabulary is that of GPT2Tokenizer, with 50257 tokens in total. "<|endoftext|>" is its last token and also serves as the unk token; the index of this last token is 50256. "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')" loads the whole vocabulary. The mapping from token to index is tokenizer.encoder, e.g. tokenizer.encoder['bot']; the mapping from index to token is tokenizer.decoder, e.g. tokenizer.decoder[13645]. Note that this vocabulary does not contain Chinese, so different natural language processing tasks may use different vocabularies.

        Word segmentation ultimately converts a piece of text into a sequence of token index numbers, i.e., input_ids in the program. Since the maximum sequence length N_seq is set to 1024, a text whose tokenized length exceeds 1024 keeps only the first 1024 tokens and the rest are discarded; a text whose tokenized length is below 1024 is padded up to 1024 tokens. The program pads with <|endoftext|>, so when input_ids is shorter than 1024 it is padded with 50256.

        The other output of word segmentation in the program is attention_mask, which marks the valid tokens: positions holding valid tokens take the value 1, and padded positions take the value 0.

        The key part of the GPT2 word segmentation code is shown below; the program is located in ColossalAI-Examples/language/gpt/dataset/webtext.py. Users can set breakpoints at the relevant positions in this file for debugging.

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')  # vocabulary of 50257 tokens
tokenizer.pad_token = tokenizer.unk_token  # '<|endoftext|>'
encoded_data = tokenizer(raw_data, padding=True, truncation=True, max_length=seq_len, return_tensors='pt')
self.data = encoded_data['input_ids']  # sequences longer than 1024 keep the first 1024 tokens and the rest are dropped; shorter sequences are padded with 50256
self.attention_mask = encoded_data['attention_mask']  # 1 for valid tokens, 0 for padded positions
torch.save((seq_len, self.data, self.attention_mask), encoded_data_cache_path)  # save the result on the first run so it can be loaded directly later
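        As a quick sanity check of the behaviour described above, the Hugging Face transformers tokenizer can be exercised on a short string. The snippet below is a standalone sketch (it assumes the transformers package is installed); padding='max_length' with max_length=8 is used here instead of the program's padding=True so that the padding is visible on a single short sentence.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.unk_token              # '<|endoftext|>', index 50256
enc = tokenizer('Hello world', padding='max_length', truncation=True, max_length=8, return_tensors='pt')
print(enc['input_ids'])                                # valid token ids followed by 50256 padding
print(enc['attention_mask'])                           # 1 for valid tokens, 0 for padded positions
print(tokenizer.decoder[13645])                        # look up a token by its index, as mentioned above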

3.2 Main structure

        In the walkthrough of the GPT2 training program below, the training batch size is set to 1. The inputs are the token sequence input_ids and its attention_mask, both of length 1024, so both have dimension 1x1024, where 1 is the batch size. The program is located in "titans/layer/block/gpt_block.py"; you can set a breakpoint at the corresponding location for debugging.

# count the number of valid input tokens
torch.where(input_ids != 50256)[0].size()
# the sum of attention_mask equals the number of valid input tokens
attention_mask.sum()
x = self.embed(input_ids)
# positions where attention_mask is 1 become 0, and positions where it is 0 become -10000;
# large negative values yield self-attention weights close to 0 after Softmax
if attention_mask is not None:
    batch_size = input_ids.shape[0]
    attention_mask = attention_mask.view(batch_size, -1)
    attention_mask = col_nn.partition_batch(attention_mask)
    attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
    attention_mask = attention_mask.to(dtype=x.dtype)    # fp16 compatibility
    attention_mask = (1.0 - attention_mask) * -10000.0
for block in self.blocks:  # minimal configuration: 12 layers, hidden size 768
    x, attention_mask = block(x, attention_mask)
x = self.head(self.norm(x))  # 1x1024x50304

3.2.1 Word vector

        The data processing above converts the input text into a sequence of length 1024. The word-vector step then represents each token with a feature vector whose length is the hidden size (HIDDEN_SIZE). The entry point of the word-vector processing is "x = self.embed(input_ids)"; the input dimension is BATCH_SIZE x SEQ_LEN (1x1024) and the output dimension is BATCH_SIZE x SEQ_LEN x HIDDEN_SIZE.

        There are many approaches to producing word vectors, and a body of research is devoted to this direction. The simplest is to maintain an embedding table of word vectors. In this program the table has 50304 entries (slightly larger than the 50257-token vocabulary), and each token is represented by a 768-dimensional (HIDDEN_SIZE) vector. The table can be randomly initialized, with each vector normalized to zero mean and unit variance. Finally, the word vectors of the 1024 tokens are obtained by looking up the table with their index numbers.

        Since the elements of a sequence are ordered, each position can be encoded by its index, and a second table of position vectors is used to represent these positions. Since the maximum sequence length is 1024, the position embedding table has dimension 1024x768. The final word vector is the sum of the token embedding and the position embedding, with dimension 1024x768.

        The key program of word vector processing is as follows, the program is located in "titans/layer/embedding/gpt_embedding.py", you can set a breakpoint at the corresponding position for debugging, and the output dimension is 1x1024x768.

seq_length = input_ids.size(1)
if position_ids is None:
    bs = input_ids.size(0)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=get_current_device()).unsqueeze(0)
    position_ids = position_ids.repeat(bs, 1)
# the size of input_ids is (BATCH_SIZE, SEQ_LEN)
# the size of x after word_embeddings is (BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE)
x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)  # token embedding plus position embedding, 1x1024x768
if self.tokentype_embeddings is not None and tokentype_ids is not None:
    x = x + self.tokentype_embeddings(tokentype_ids)
x = self.dropout(x)  # adds a slight perturbation for regularization
return x
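        As a cross-check on the shapes above, here is a minimal standalone sketch of the same token-plus-position embedding using plain nn.Embedding layers. The sizes mirror the program (vocabulary 50304, sequence length 1024, hidden size 768), but the code is illustrative only and is not the ColossalAI implementation.

import torch
import torch.nn as nn

vocab_size, max_seq_len, hidden_size = 50304, 1024, 768
word_embeddings = nn.Embedding(vocab_size, hidden_size)        # token embedding table, 50304x768
position_embeddings = nn.Embedding(max_seq_len, hidden_size)   # position embedding table, 1024x768
input_ids = torch.randint(0, vocab_size, (1, max_seq_len))     # BATCH_SIZE x SEQ_LEN = 1x1024
position_ids = torch.arange(max_seq_len).unsqueeze(0)          # 1x1024
x = word_embeddings(input_ids) + position_embeddings(position_ids)
print(x.shape)                                                 # torch.Size([1, 1024, 768])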

3.2.2 Attention module

        As described above, the GPT2 program stacks 12 identical attention modules. Each module consists of two residual sub-modules, the self-attention sub-module and the Feed-Forward sub-module, and the input to each is first normalized to zero mean and unit variance. The final output of the 12-layer attention stack is 1x1024x768.

        (a) Self-attention module

        The first residual sub-module is shown in the figure below; its feature extraction uses multi-head self-attention. The number of heads is 12 and the per-head QKV feature dimension is 64, so after concatenation the output feature dimension is still 768, i.e., 12x64.

Figure 6 The first residual module

        The corresponding program is as follows, the program is located in "titans/layer/block/gpt_block.py", you can set a breakpoint at the corresponding position for debugging, and the output dimension is 1x1024x768. The place where the attention_mask takes a value of 1 is converted to 0, and the place where the value is 0 is converted to -10000. A large negative number is close to 0 when solving the self-attention weight through Softmax.

if not self.apply_post_layernorm:
    residual = x
x = self.norm1(x)  # after dropout the data is no longer normalized, so it is re-normalized: zero mean and unit variance along the feature dimension
residual = x  # 1x1024x768
x = residual + self.attn(x, attention_mask)  # 1x1024x768

        The key part of the program is the self-attention feature extraction, namely self.attn(x, attention_mask). The specific program is located in "titans/layer/attention/gpt_attention.py", you can set a breakpoint at the corresponding position for debugging, and the output dimension is 1x1024x768. The main steps are as follows:

        1) Compute QKV. The input dimension is 1x1024x768. The 768-dimensional features are projected for Q, K, and V, each into 64 dimensions, i.e., 64*3 outputs per head. Since the number of heads is 12, the total projected feature dimension is 64*3*12 = 2304, so the fully connected layer VanillaLinear(768, 2304) produces a 1x1024x2304 output. After a dimension rearrangement, the program extracts the Q, K, and V components, each with dimension 1x12x1024x64.

        2) Compute the self-attention weight (score) matrix. Q is multiplied by K^T and divided by sqrt(64) to obtain a 1x12x1024x1024 preliminary weight matrix. The program sets the weights after the current token to large-magnitude negative values, because the prediction of the next token may depend only on the preceding content; adding attention_mask to the weight matrix likewise sets the weights of invalid (padded) positions to large-magnitude negative values. After the Softmax operation the attention weights of each token sum to 1, and the positions that previously held large negative values receive weights close to 0. The output dimension is 1x12x1024x1024.

        3) Weighted summation. V is weighted and summed, and the result is still 1x12x1024x64.

        4) Multi-head feature concatenation. The 1x12x1024x64 features are concatenated to obtain new 1x1024x768 features.

        5) Feature fusion. The concatenated 1x1024x768 features are fused again through the fully connected layer VanillaLinear(768, 768), and dropout perturbation is added. The final output dimension is 1x1024x768.

        The key program analysis of self-attention feature extraction is as follows.

# the size of x is (BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE)
# the size of qkv is (BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE*3)
qkv = self.query_key_value(x)  # compute QKV, 1x1024x2304
all_head_size = qkv.shape[-1] // 3  # total hidden size across heads, 768
num_attention_heads = divide(all_head_size, self.attention_head_size)  # number of heads; each head has 64 hidden dimensions, so 768/64 = 12
new_qkv_shape = qkv.shape[:-1] + (num_attention_heads, 3 * self.attention_head_size)  # 1x1024x12x192; q, k and v each have 64 feature dimensions
qkv = qkv.view(new_qkv_shape)  # 1x1024x12x192
qkv = qkv.permute((0, 2, 1, 3))  # 1x12x1024x192
# the size of q is (BATCH_SIZE, NUM_HEADS, SEQ_LEN, HIDDEN_SIZE//NUM_HEADS)
q, k, v = torch.chunk(qkv, 3, dim=-1)  # q, k, v components, each 1x12x1024x64
# the size of x after matmul is (BATCH_SIZE, NUM_HEADS, SEQ_LEN, SEQ_LEN)
x = torch.matmul(q, k.transpose(-1, -2))  # self-attention weight score matrix, 1x12x1024x1024
x = x / math.sqrt(self.attention_head_size)  # x / sqrt(64), 1x12x1024x1024
q_len, k_len = q.size(-2), k.size(-2)  # 1024, 1024
causal_mask = torch.tril(torch.ones((q_len, k_len), dtype=torch.uint8, device=get_current_device())).view(1, 1, q_len, k_len).bool()  # lower-triangular matrix: each token attends only to itself and earlier tokens; each row is one set of weights, 1x1x1024x1024
x = torch.where(causal_mask, x, torch.tensor(-1e4, dtype=x.dtype, device=get_current_device()))  # keep x where True, use -1e4 where False, 1x12x1024x1024
x = x + attention_mask  # invalid (padded) positions are all set to large negative values so their weights are essentially 0 after softmax, 1x12x1024x1024
x = self.softmax(x)  # convert to weight probabilities, 1x12x1024x1024
x = self.attention_dropout(x)  # add perturbation to the data, 1x12x1024x1024
# the size of x after matmul is (BATCH_SIZE, NUM_HEADS, SEQ_LEN, HIDDEN_SIZE//NUM_HEADS)
x = torch.matmul(x, v)  # weighted sum over v, 1x12x1024x64
x = x.transpose(1, 2)  # 1x1024x12x64
new_context_layer_shape = x.size()[:-2] + (all_head_size,)  # 1x1024x768
# the size of x after reshape is (BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE)
x = x.reshape(new_context_layer_shape)  # concatenate the heads, 1x1024x768
# the size of x after dense is (BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE)
x = self.dense(x)  # Linear(768, 768), another feature fusion, 1x1024x768
x = self.dropout(x)  # add perturbation to the data, 1x1024x768
return x  # 1x1024x768
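        To make the causal mask in step 2 concrete, the toy example below builds the lower-triangular mask for a length-4 sequence with equal raw scores; after masking and Softmax, each position only receives weight from itself and earlier positions. This is a standalone illustration, not part of the ColossalAI code.

import torch

q_len = 4
scores = torch.zeros(q_len, q_len)                             # pretend all raw scores are equal
causal_mask = torch.tril(torch.ones(q_len, q_len)).bool()      # lower-triangular: True where attention is allowed
scores = torch.where(causal_mask, scores, torch.tensor(-1e4))  # masked positions get a large negative value
weights = torch.softmax(scores, dim=-1)
print(weights)                                                 # row i is non-zero only for columns 0..i; the last row is [0.25, 0.25, 0.25, 0.25]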

(b) Feed-Forward module

        The second residual module is shown in the figure below.

 Figure 7 The second residual module

        The corresponding program is as follows, the program is located in "titans/layer/block/gpt_block.py", you can set a breakpoint at the corresponding position for debugging, and the output dimension is 1x1024x768. The feedforward process is mainly completed by two layers of fully connected layers VanillaLinear(768, 3072) and VanillaLinear(3072, 768), and the final output dimension is still 1x1024x768.

residual = x
x = self.norm2(x)  # after dropout the data is no longer normalized, so it is re-normalized: zero mean and unit variance along the feature dimension
x = residual + self.mlp(x)  # VanillaLinear(768, 3072), VanillaLinear(3072, 768) and dropout; another feature fusion

        The final output of each attention module is 1x1024x768.
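        For reference, here is a minimal sketch of how one full GPT2 block wires the two residual sub-modules together. nn.LayerNorm, nn.Linear and nn.MultiheadAttention stand in for the ColossalAI layers, and the GELU activation is assumed as in standard GPT2; this is an illustrative skeleton, not the titans implementation.

import torch
import torch.nn as nn

class ToyGPT2Block(nn.Module):
    # illustrative stand-in for one GPT2 layer: hidden size 768, 12 heads, MLP width 3072
    def __init__(self, hidden_size=768, num_heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(                      # Linear(768, 3072) -> GELU -> Linear(3072, 768) -> dropout
            nn.Linear(hidden_size, mlp_ratio * hidden_size),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_size, hidden_size),
            nn.Dropout(dropout),
        )

    def forward(self, x, causal_mask):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)   # masked multi-head self-attention
        x = x + attn_out                                          # first residual connection
        x = x + self.mlp(self.norm2(x))                           # second residual connection
        return x

seq_len = 1024
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # True marks positions that may not be attended to
block = ToyGPT2Block()
print(block(torch.randn(1, seq_len, 768), causal_mask).shape)     # torch.Size([1, 1024, 768])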

3.2.3 Head

        The role of the Head is to map the features back to the space of the target task. The function entry is "x = self.head(self.norm(x))". After the attention modules, the extracted feature dimension is 1x1024x768. From each 768-dimensional feature we need to determine which token it corresponds to, i.e., classify the feature; the word vector dictionary has 50304 classes in total.

        GPT2 Head consists of a LayerNorm normalization layer and a fully connected layer (768, 50304), with an output dimension of 1x1024x50304.
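        A minimal sketch of the head, with standard PyTorch layers standing in for the ColossalAI ones:

import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 50304
head = nn.Sequential(
    nn.LayerNorm(hidden_size),                         # final normalization
    nn.Linear(hidden_size, vocab_size),                # map each 768-dimensional feature to 50304 vocabulary logits
)
x = torch.randn(1, 1024, hidden_size)                  # output of the 12 attention layers
print(head(x).shape)                                   # torch.Size([1, 1024, 50304])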

3.3 Loss function

        GPT2 is trained in an unsupervised (self-supervised) manner: the current sequence is used to predict the next token, one position at a time. The labels are therefore simply the input token indices. The feature extracted at the Nth token depends on tokens 1 through N and is used to predict the index of the (N+1)th token, which is known from the input sequence itself.

        The corresponding program is as follows, the program is located in "tians/loss/lm_loss/gpt_lmloss.py", you can set a breakpoint at the corresponding location for debugging, and the output dimension is 1x1024x768. The loss function is the cross loss quotient function (CrossEntropyLoss), which is also the most commonly used classification loss function. The prediction input feature dimension of the loss function is 1023x50304, and the label dimension is 1023.

class GPTLMLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss = col_nn.CrossEntropyLoss()

    def forward(self, logits, labels):
        shift_logits = logits[..., :-1, :].contiguous()  # predictions
        shift_labels = labels[..., 1:].contiguous()  # the next token is the label for the corresponding prediction
        # Flatten the tokens
        # shift_logits.view(-1, shift_logits.size(-1)).shape: 1023x50304
        # shift_labels.view(-1).shape: 1023
        return self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
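        A small usage sketch of the same shift-by-one cross-entropy computation on dummy tensors, with torch.nn.CrossEntropyLoss standing in for the ColossalAI wrapper:

import torch
import torch.nn as nn

vocab_size, seq_len = 50304, 1024
logits = torch.randn(1, seq_len, vocab_size)           # model output, 1x1024x50304
labels = torch.randint(0, vocab_size, (1, seq_len))    # the input token ids serve as the labels
shift_logits = logits[..., :-1, :].contiguous()        # predictions for positions 1..1023
shift_labels = labels[..., 1:].contiguous()            # each label is the next token in the input
loss = nn.CrossEntropyLoss()(shift_logits.view(-1, vocab_size), shift_labels.view(-1))  # 1023 predictions vs 1023 labels
print(loss.item())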

4 Training commands and results

        For setting up and debugging the GPT2 model environment, please refer to the blog posts "GPT Series Training and Deployment—GPT2 Environment Configuration and Model Training" and "ColossalAI GPT2 Distributed Training and Debugging Configuration—GPT Series Training and Deployment", the addresses are "https://blog.csdn.net/suiyingy/article/details/128711444" and "https://blog.csdn.net/suiyingy/article/details/128806531".

        The GPT training command for ColossalAI-Examples is "colossalai run --nproc_per_node=2 train_gpt.py --config=gpt2_configs/gpt2_vanilla.py --from_torch". The running result is shown in the figure below.

 Figure 8 Schematic diagram of training results

5 Model saving and loading

5.1 Model saving

        In the ColossalAI-Examples/language/gpt/train_gpt.py file, uncomment hooks.SaveCheckpointHook(checkpoint_dir='./ckpt') in the hook_list to save the training model. By default the model is saved to ckpt under the current directory. When running in debug mode, the checkpoint may instead be saved under the user's home directory; it can be located with the command "find ~ -name ckpt".

5.2 Model loading

        In the ColossalAI-Examples/language/gpt/train_gpt.py file, there is no interface for loading a pre-trained model by default, so we need to add the loading code ourselves, as shown below. The new code is inserted after line 114, "logger.info('Build optimizer', ranks=[0])", so it starts at line 115.

if os.path.exists('ckpt'):
    logger.info('Loading pretrained model from ckpt', ranks=[0])
    from collections import OrderedDict
    new_state_dict = OrderedDict()
    state_dict = torch.load('ckpt')
    for k, v in state_dict['model'].items():
        name = k[6:]  # remove `module.`
        new_state_dict[name] = v
    model.load_state_dict(new_state_dict)

