This code is a training script for a transformer sequence-to-sequence model. Here is an explanation of each part of the code:
```python
import logging  # used below by logging.getLogger / logging.basicConfig
import os       # used below for paths when saving results
import sys      # used below for the stdout log handler

import transformers
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    Seq2SeqTrainingArguments,
    set_seed,
)
from trainer_seq2seq import Seq2SeqTrainer
```
The code above imports the required libraries. `transformers` is the Hugging Face library for working with transformer models. `AutoConfig`, `AutoModel` and `AutoTokenizer` are classes that automatically load the configuration, model and tokenizer for a given checkpoint. `DataCollatorForSeq2Seq` collates multiple data samples into a batch for seq2seq training or evaluation. `HfArgumentParser` parses command-line arguments. `Seq2SeqTrainingArguments` is the class that holds the training parameters. `set_seed` sets the random seed so that experiments are reproducible. `Seq2SeqTrainer`, imported from the local `trainer_seq2seq` module, is the trainer used later.
```python
from arguments import ModelArguments, DataTrainingArguments
```
This line of code imports classes that define model parameters and data training parameters.
```python
logger = logging.getLogger(__name__)
```
This line of code creates a logger that logs important information during execution.
```python
def main():
```
This line of code defines the main function.
```python
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
```
This line creates a parser that converts the command-line arguments it receives into objects of three types: `ModelArguments`, `DataTrainingArguments` and `Seq2SeqTrainingArguments`.
The next step parses the command-line arguments, or reads them from a JSON file.
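That parsing snippet is not quoted above; the usual Hugging Face pattern looks roughly like the sketch below (an assumption, not copied from the script; it relies on `sys` and `os` being imported):

```python
# Sketch: read all arguments from a single JSON file if that is the only
# command-line argument, otherwise parse normal --flag style arguments.
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
    model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
```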
```python
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
```
This call sets the basic configuration of the logging module: the message format, the date format, and a handler that writes log output to stdout.
The following snippets mainly set the log level and format.
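Those snippets are not quoted here either; a typical arrangement, assumed rather than copied from the script (it also relies on the `datasets` package being imported), is:

```python
# Sketch (assumed): align the log levels of this logger, the datasets library
# and transformers with what the training arguments request.
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
```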
```python
set_seed(training_args.seed)
```
This line of code sets the random number seed.
The following code snippet is used to load the dataset.
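The loading snippet is not quoted in this walkthrough; a plausible sketch, assuming `DataTrainingArguments` exposes `train_file`, `validation_file` and `test_file` fields (illustrative names, not confirmed by the quoted code), is:

```python
# Sketch (assumed): load JSON train/validation/test files with the datasets library.
from datasets import load_dataset

data_files = {}
if data_args.train_file is not None:
    data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
    data_files["validation"] = data_args.validation_file
if data_args.test_file is not None:
    data_files["test"] = data_args.test_file

raw_datasets = load_dataset("json", data_files=data_files)
```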
```python
config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
config.pre_seq_len = model_args.pre_seq_len
config.prefix_projection = model_args.prefix_projection

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
```
This code loads the configuration and tokenizer of the pre-trained model and adds two extra settings to the configuration: `pre_seq_len` and `prefix_projection`, which control the prefix used for P-tuning v2.
The following code snippet is used to load the model.
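The actual loading line is omitted from the quoted code (the next snippet already assumes `model` exists); a minimal sketch of what it presumably looks like:

```python
# Sketch (assumed): instantiate the model from the checkpoint, reusing the
# configuration patched above; trust_remote_code allows custom model code.
model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    trust_remote_code=True,
)
```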
```python
if model_args.quantization_bit is not None:
    print(f"Quantized to {model_args.quantization_bit} bit")
    model = model.quantize(model_args.quantization_bit)

if model_args.pre_seq_len is not None:
    # P-tuning v2
    model = model.half()
    model.transformer.prefix_encoder.float()
else:
    # Finetune
    model = model.float()
```
This part of the code applies some additional setup to the model: optional quantization, and the choice of parameter data types. For P-tuning v2 the model is cast to half precision while the prefix encoder is kept in float32; for full fine-tuning the whole model stays in float32.
```python
prefix = data_args.source_prefix if data_args.source_prefix is not None else ""
```
This line sets the prefix that is prepended to every source text, defaulting to an empty string when `source_prefix` is not given.
The following code snippets are used to preprocess the data.
```python
if training_args.do_train:
    column_names = raw_datasets["train"].column_names
elif training_args.do_eval:
    column_names = raw_datasets["validation"].column_names
elif training_args.do_predict:
    column_names = raw_datasets["test"].column_names
else:
    logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.")
    return
```
This part of the code checks which of the `do_train`, `do_eval` and `do_predict` flags are set in the training arguments; they indicate whether to train, evaluate or predict, respectively. The column names of the corresponding dataset split are assigned to `column_names`; if there is no task to run, a message is logged and the function returns.
The script then defines two preprocessing functions; they are shown in full below and explained afterwards.
```python
def preprocess_function_eval(examples):
    inputs, targets = [], []
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            query = examples[prompt_column][i]
            if history_column is None or len(examples[history_column][i]) == 0:
                prompt = query
            else:
                prompt = ""
                history = examples[history_column][i]
                for turn_idx, (old_query, response) in enumerate(history):
                    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
            inputs.append(prompt)
            targets.append(examples[response_column][i])

    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True)
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    if data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs


def preprocess_function_train(examples):
    max_seq_length = data_args.max_source_length + data_args.max_target_length

    model_inputs = {
        "input_ids": [],
        "labels": [],
    }
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            query, answer = examples[prompt_column][i], examples[response_column][i]

            if history_column is None:
                prompt = query
            else:
                prompt = ""
                history = examples[history_column][i]
                for turn_idx, (old_query, response) in enumerate(history):
                    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

            prompt = prefix + prompt
            a_ids = tokenizer.encode(text=prompt, add_special_tokens=False)
            b_ids = tokenizer.encode(text=answer, add_special_tokens=False)

            if len(a_ids) > data_args.max_source_length - 1:
                a_ids = a_ids[: data_args.max_source_length - 1]

            if len(b_ids) > data_args.max_target_length - 2:
                b_ids = b_ids[: data_args.max_target_length - 2]

            input_ids = tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)

            context_length = input_ids.index(tokenizer.bos_token_id)
            mask_position = context_length - 1
            labels = [-100] * context_length + input_ids[mask_position+1:]

            pad_len = max_seq_length - len(input_ids)
            input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
            labels = labels + [tokenizer.pad_token_id] * pad_len
            if data_args.ignore_pad_token_for_loss:
                labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

            model_inputs["input_ids"].append(input_ids)
            model_inputs["labels"].append(labels)

    return model_inputs
```
This code defines two functions, `preprocess_function_eval` and `preprocess_function_train`, which preprocess the data for evaluation and for training. First, let's look at `preprocess_function_eval`:
```python
def preprocess_function_eval(examples):
    inputs, targets = [], []
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            ...
```
The function receives a batch of examples as its parameter. Two lists, `inputs` and `targets`, are defined inside the function to store the input and target texts. The function then iterates over the examples in the batch; an example is processed only if both its prompt and its response are present.
The following code builds the dialogue prompt. If there is no history column, or the history is empty, the prompt is simply the current query. Otherwise the prompt contains the queries and responses of the previous dialogue rounds, with the current query as the final round.
```python
if history_column is None or len(examples[history_column][i]) == 0:
    prompt = query
else:
    ...
inputs.append(prompt)
targets.append(examples[response_column][i])
```
After the prompt is generated, it is appended to the `inputs` list, and the corresponding answer is appended to the `targets` list.
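For illustration, here is what the prompt template produces for a hypothetical example with one turn of history (the texts are made up; the formatting logic is copied from the function above):

```python
# Hypothetical conversation: one previous turn plus the current query.
history = [("你好", "你好！有什么可以帮你？")]
query = "今天天气怎么样？"

prompt = ""
for turn_idx, (old_query, response) in enumerate(history):
    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

print(prompt)
# [Round 0]
# 问:你好
# 答:你好！有什么可以帮你？
# [Round 1]
# 问:今天天气怎么样？
# 答:
```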
```python
inputs = [prefix + inp for inp in inputs]
model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True)
labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
```
This part of the code first prepends the prefix to each input, then runs the tokenizer over the inputs with truncation and padding to obtain the model inputs. The target texts are tokenized in the same way to obtain the labels.
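For reference, a Hugging Face tokenizer called this way returns a dict-like object containing `input_ids` and usually an `attention_mask`. A small illustration with a generic stand-in tokenizer (not the one the script loads):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in tokenizer for illustration
out = tok(["hello world", "hi"], max_length=8, truncation=True, padding=True)
print(out["input_ids"])       # one list of token ids per input text
print(out["attention_mask"])  # 1 for real tokens, 0 for padding
```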
```python
if data_args.ignore_pad_token_for_loss:
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
    ]
model_inputs["labels"] = labels["input_ids"]
return model_inputs
```
This part of the code uses the `ignore_pad_token_for_loss` parameter to decide whether padding tokens should be ignored when the loss is computed. If so, every padding token id in the labels is replaced with -100. Finally, the processed labels are attached to the model inputs, which are then returned.
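The value -100 matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labelled -100 contribute nothing to the loss. A minimal illustration (not part of the script):

```python
import torch
import torch.nn.functional as F

# Two target positions over a vocabulary of size 4; the second position is
# labelled -100 and is therefore ignored by the loss.
logits = torch.randn(2, 4)
labels = torch.tensor([3, -100])
loss = F.cross_entropy(logits, labels)  # ignore_index defaults to -100
```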
Next is `preprocess_function_train`, which does something similar to `preprocess_function_eval` but handles the data a little differently:
```python
def preprocess_function_train(examples):
    max_seq_length = data_args.max_source_length + data_args.max_target_length

    model_inputs = {
        "input_ids": [],
        "labels": [],
    }
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            query, answer = examples[prompt_column][i], examples[response_column][i]

            if history_column is None:
                prompt = query
            else:
                prompt = ""
                history = examples[history_column][i]
                for turn_idx, (old_query, response) in enumerate(history):
                    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

            prompt = prefix + prompt
            a_ids = tokenizer.encode(text=prompt, add_special_tokens=False)
            b_ids = tokenizer.encode(text=answer, add_special_tokens=False)

            if len(a_ids) > data_args.max_source_length - 1:
                a_ids = a_ids[: data_args.max_source_length - 1]

            if len(b_ids) > data_args.max_target_length - 2:
                b_ids = b_ids[: data_args.max_target_length - 2]

            input_ids = tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)

            context_length = input_ids.index(tokenizer.bos_token_id)
            mask_position = context_length - 1
            labels = [-100] * context_length + input_ids[mask_position+1:]

            pad_len = max_seq_length - len(input_ids)
            input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
            labels = labels + [tokenizer.pad_token_id] * pad_len
            if data_args.ignore_pad_token_for_loss:
                labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

            model_inputs["input_ids"].append(input_ids)
            model_inputs["labels"].append(labels)

    return model_inputs
```
First, the function computes the maximum sequence length `max_seq_length`, which is the sum of the maximum source length and the maximum target length. It then creates a dictionary `model_inputs` to store the model's input data.
The following code processes each example, but only when both its prompt and its response are present.
```python
for i in range(len(examples[prompt_column])):
    if examples[prompt_column][i] and examples[response_column][i]:
        query, answer = examples[prompt_column][i], examples[response_column][i]
        ...
```
The prompt is generated in the same way as in `preprocess_function_eval`. The prompt and the answer are then tokenized separately, and each is truncated if it exceeds its maximum length.
```python
a_ids = tokenizer.encode(text=prompt, add_special_tokens=False)
b_ids = tokenizer.encode(text=answer, add_special_tokens=False)

if len(a_ids) > data_args.max_source_length - 1:
    a_ids = a_ids[: data_args.max_source_length - 1]

if len(b_ids) > data_args.max_target_length - 2:
    b_ids = b_ids[: data_args.max_target_length - 2]
```
Next, the tokenizer's `build_inputs_with_special_tokens` method concatenates the token ids of the prompt and the answer, inserting the model's special tokens, to form the model input.
```python
input_ids = tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)
```
Then the length of the context (everything up to the `bos` token) is computed and the labels are generated: -100 for every context position, followed by the token ids of the answer part.
```python
context_length = input_ids.index(tokenizer.bos_token_id)
mask_position = context_length - 1
labels = [-100] * context_length + input_ids[mask_position+1:]
```
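As a hypothetical illustration of the resulting layout (all token ids below are made up, including the `bos` id):

```python
# Four context tokens, then the bos token (id 5 here), then three answer tokens.
bos_id = 5
input_ids = [11, 12, 13, 14, bos_id, 21, 22, 23]

context_length = input_ids.index(bos_id)   # 4
mask_position = context_length - 1         # 3
labels = [-100] * context_length + input_ids[mask_position + 1:]
# labels == [-100, -100, -100, -100, 5, 21, 22, 23]
```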
The following code pads the input ids and the labels up to the maximum sequence length by appending padding tokens.
```python
pad_len = max_seq_length - len(input_ids)
input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
labels = labels + [tokenizer.pad_token_id] * pad_len
```
Then, depending on the value of `ignore_pad_token_for_loss`, it is decided whether padding tokens are ignored when the loss is computed; if so, every padding token id in the labels is replaced with -100.
```python
if data_args.ignore_pad_token_for_loss:
    labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
```
Finally, the processed input and labels are added to the model's input data and returned.
```python
model_inputs["input_ids"].append(input_ids)
model_inputs["labels"].append(labels)
return model_inputs
```
Generally speaking, both functions preprocess the data, but in slightly different ways: the evaluation function keeps the inputs and the labels as separate sequences, while the training function concatenates the prompt and the answer into one sequence and masks the prompt positions in the labels with -100. The next part of the script applies these functions to the dataset splits:
```python
if training_args.do_train:
    if "train" not in raw_datasets:
        raise ValueError("--do_train requires a train dataset")
    train_dataset = raw_datasets["train"]
    if data_args.max_train_samples is not None:
        train_dataset = train_dataset.select(range(data_args.max_train_samples))
    train_dataset = train_dataset.map(
        preprocess_function_train,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
    )

if training_args.do_eval:
    if "validation" not in raw_datasets:
        raise ValueError("--do_eval requires a validation dataset")
    eval_dataset = raw_datasets["validation"]
    if data_args.max_eval_samples is not None:
        eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))
    eval_dataset = eval_dataset.map(
        preprocess_function_eval,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
    )
```
This code handles dataset selection and preprocessing. If `do_train` is set, it takes the training split from `raw_datasets` and, if `max_train_samples` is given, keeps at most that many examples. It then maps `preprocess_function_train` over the training dataset; the number of parallel worker processes is controlled by `preprocessing_num_workers`, and the original columns are removed once preprocessing is complete.
Likewise, if `do_eval` is set, the validation split is taken from `raw_datasets`, optionally limited to at most `max_eval_samples` examples, and processed with `preprocess_function_eval`.
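Note that because `batched=True` is passed, `datasets` calls the preprocessing function with a dictionary of column lists rather than a single example, which is why the functions above loop over `range(len(examples[prompt_column]))`. A minimal, self-contained illustration with made-up column names:

```python
from datasets import Dataset

# Tiny made-up dataset with a prompt/response column layout.
demo = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "response": ["4"],
})

def add_length(batch):
    # batch is a dict of lists, e.g. {"prompt": [...], "response": [...]}
    return {"prompt_length": [len(p) for p in batch["prompt"]]}

demo = demo.map(add_length, batched=True)
```

With the datasets prepared, the script then initializes the trainer: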
```python
# Initialize our Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
```
This code initializes the trainer. `Seq2SeqTrainer` is a trainer tailored to sequence-to-sequence models; it takes the model, the training arguments, the training and evaluation datasets, the tokenizer, and the data collator as parameters.
```python
# Training
if training_args.do_train:
    train_result = trainer.train(
        model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    )
    trainer.save_model()  # Saves the tokenizer too for easy upload

    output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
    if trainer.is_world_process_zero():
        with open(output_train_file, "w") as writer:
            logger.info("***** Train results *****")
            for key, value in sorted(train_result.metrics.items()):
                logger.info(f"  {key} = {value}")
                writer.write(f"{key} = {value}\n")

        # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
        trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
```
This code performs the actual training. If `do_train` is set, the trainer starts training; after training finishes, the model and tokenizer are saved, the training metrics are written to a text file, and the trainer state is saved to `trainer_state.json`.
The above code snippets train the model, process the training and validation data, and save the training results.
Let's go through the metrics-writing loop line by line:
```python
for key, value in sorted(train_result.metrics.items()):
```
This line iterates over every key-value pair in the `train_result.metrics` dictionary, which holds the metrics produced during training, such as the training loss and runtime statistics. The `sorted()` call ensures that the entries are processed in alphabetical order by key.
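For context, `train_result.metrics` is a plain dictionary; a hypothetical example of what the loop produces (the keys shown are typical `Trainer` metrics, and the values are made up):

```python
# Hypothetical metrics dict; real keys and values depend on the actual run.
metrics = {"epoch": 3.0, "train_loss": 2.41, "train_runtime": 1234.5}

for key, value in sorted(metrics.items()):
    print(f"{key} = {value}")
# epoch = 3.0
# train_loss = 2.41
# train_runtime = 1234.5
```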
```python
logger.info(f" {key} = {value}")
```
This line uses `logger.info` to write the name (key) and the value of each metric to the log, so that the values can be seen in the terminal or in a log file while the script runs.
```python
writer.write(f"{key} = {value}\n")
```
This line of code writes the name and corresponding value of each indicator to a file. The advantage of this is that after the training process is over, you can directly open this file to view the training results without re-running the entire script.
In general, the purpose of this code is to record the key metric values from the training process so that the model's performance can be analysed afterwards.