This code is a training script for a transformer sequence-to-sequence model. Here is an explanation of each part of the code:
```python
import logging  # used below by logging.getLogger / logging.basicConfig
import os       # used below for paths when saving results
import sys      # used below for the stdout log handler

import transformers
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    Seq2SeqTrainingArguments,
    set_seed,
)
from trainer_seq2seq import Seq2SeqTrainer
```
The code above imports the required libraries. `transformers` is the Hugging Face library for working with transformer models. `AutoConfig`, `AutoModel` and `AutoTokenizer` are classes that automatically load the configuration, model and tokenizer for a given checkpoint. `DataCollatorForSeq2Seq` collates multiple data samples into a batch for seq2seq training or evaluation. `HfArgumentParser` parses command-line arguments. `Seq2SeqTrainingArguments` is the class that holds the training parameters. `set_seed` sets the random seed so that experiments are reproducible. `Seq2SeqTrainer`, imported from the local `trainer_seq2seq` module, is the trainer used later.
```python
from arguments import ModelArguments, DataTrainingArguments
```
This line of code imports classes that define model parameters and data training parameters.
```python
logger = logging.getLogger(__name__)
```
This line of code creates a logger that logs important information during execution.
```python
def main():
```
This line of code defines the main function.
```python
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
```
This line creates a parser that converts the command-line arguments it receives into objects of three types: `ModelArguments`, `DataTrainingArguments` and `Seq2SeqTrainingArguments`.
The next step parses the command-line arguments, or reads them from a JSON file.
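That parsing snippet is not quoted above; the usual Hugging Face pattern looks roughly like the sketch below (an assumption, not copied from the script; it relies on `sys` and `os` being imported):

```python
# Sketch: read all arguments from a single JSON file if that is the only
# command-line argument, otherwise parse normal --flag style arguments.
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
    model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
```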
```python
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
```
This call sets the basic configuration of the logging module: the message format, the date format, and a handler that writes log output to stdout.
The following snippets mainly set the log level and format.
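Those snippets are not quoted here either; a typical arrangement, assumed rather than copied from the script (it also relies on the `datasets` package being imported), is:

```python
# Sketch (assumed): align the log levels of this logger, the datasets library
# and transformers with what the training arguments request.
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
```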
```python
set_seed(training_args.seed)
```
This line of code sets the random number seed.
The following code snippet is used to load the dataset.
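The loading snippet is not quoted in this walkthrough; a plausible sketch, assuming `DataTrainingArguments` exposes `train_file`, `validation_file` and `test_file` fields (illustrative names, not confirmed by the quoted code), is:

```python
# Sketch (assumed): load JSON train/validation/test files with the datasets library.
from datasets import load_dataset

data_files = {}
if data_args.train_file is not None:
    data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
    data_files["validation"] = data_args.validation_file
if data_args.test_file is not None:
    data_files["test"] = data_args.test_file

raw_datasets = load_dataset("json", data_files=data_files)
```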
```python
config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
config.pre_seq_len = model_args.pre_seq_len
config.prefix_projection = model_args.prefix_projection

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
```
This code loads the configuration and tokenizer of the pre-trained model and adds two extra settings to the configuration: `pre_seq_len` and `prefix_projection`, which control the prefix used for P-tuning v2.
The following code snippet is used to load the model.
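The actual loading line is omitted from the quoted code (the next snippet already assumes `model` exists); a minimal sketch of what it presumably looks like:

```python
# Sketch (assumed): instantiate the model from the checkpoint, reusing the
# configuration patched above; trust_remote_code allows custom model code.
model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    trust_remote_code=True,
)
```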
```python
if model_args.quantization_bit is not None:
    print(f"Quantized to {model_args.quantization_bit} bit")
    model = model.quantize(model_args.quantization_bit)

if model_args.pre_seq_len is not None:
    # P-tuning v2
    model = model.half()
    model.transformer.prefix_encoder.float()
else:
    # Finetune
    model = model.float()
```
This part of the code applies some additional setup to the model: optional quantization, and the choice of parameter data types. For P-tuning v2 the model is cast to half precision while the prefix encoder is kept in float32; for full fine-tuning the whole model stays in float32.
```python
prefix = data_args.source_prefix if data_args.source_prefix is not None else ""
```
This line sets the prefix that is prepended to every source text, defaulting to an empty string when `source_prefix` is not given.
The following code snippets are used to preprocess the data.
```python
if training_args.do_train:
    column_names = raw_datasets["train"].column_names
elif training_args.do_eval:
    column_names = raw_datasets["validation"].column_names
elif training_args.do_predict:
    column_names = raw_datasets["test"].column_names
else:
    logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.")
    return
```
This part of the code checks which of the `do_train`, `do_eval` and `do_predict` flags are set in the training arguments; they indicate whether to train, evaluate or predict, respectively. The column names of the corresponding dataset split are assigned to `column_names`; if there is no task to run, a message is logged and the function returns.
The script then defines two preprocessing functions; they are shown in full below and explained afterwards.
```python
def preprocess_function_eval(examples):
    inputs, targets = [], []
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            query = examples[prompt_column][i]
            if history_column is None or len(examples[history_column][i]) == 0:
                prompt = query
            else:
                prompt = ""
                history = examples[history_column][i]
                for turn_idx, (old_query, response) in enumerate(history):
                    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
            inputs.append(prompt)
            targets.append(examples[response_column][i])

    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True)
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    if data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs


def preprocess_function_train(examples):
    max_seq_length = data_args.max_source_length + data_args.max_target_length

    model_inputs = {
        "input_ids": [],
        "labels": [],
    }
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            query, answer = examples[prompt_column][i], examples[response_column][i]

            if history_column is None:
                prompt = query
            else:
                prompt = ""
                history = examples[history_column][i]
                for turn_idx, (old_query, response) in enumerate(history):
                    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

            prompt = prefix + prompt
            a_ids = tokenizer.encode(text=prompt, add_special_tokens=False)
            b_ids = tokenizer.encode(text=answer, add_special_tokens=False)

            if len(a_ids) > data_args.max_source_length - 1:
                a_ids = a_ids[: data_args.max_source_length - 1]

            if len(b_ids) > data_args.max_target_length - 2:
                b_ids = b_ids[: data_args.max_target_length - 2]

            input_ids = tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)

            context_length = input_ids.index(tokenizer.bos_token_id)
            mask_position = context_length - 1
            labels = [-100] * context_length + input_ids[mask_position+1:]

            pad_len = max_seq_length - len(input_ids)
            input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
            labels = labels + [tokenizer.pad_token_id] * pad_len
            if data_args.ignore_pad_token_for_loss:
                labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

            model_inputs["input_ids"].append(input_ids)
            model_inputs["labels"].append(labels)

    return model_inputs
```
This code defines two functions, `preprocess_function_eval` and `preprocess_function_train`, which preprocess the data for evaluation and for training. First, let's look at `preprocess_function_eval`:
```python
def preprocess_function_eval(examples):
    inputs, targets = [], []
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            ...
```
The function receives a batch of examples as its parameter. Two lists, `inputs` and `targets`, are defined inside the function to store the input and target texts. The function then iterates over the examples in the batch; an example is processed only if both its prompt and its response are present.
The following code builds the dialogue prompt. If there is no history column, or the history is empty, the prompt is simply the current query. Otherwise the prompt contains the queries and responses of the previous dialogue rounds, with the current query as the final round.
```python
if history_column is None or len(examples[history_column][i]) == 0:
    prompt = query
else:
    ...
inputs.append(prompt)
targets.append(examples[response_column][i])
```
After the prompt is generated, it is appended to the `inputs` list, and the corresponding answer is appended to the `targets` list.
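For illustration, here is what the prompt template produces for a hypothetical example with one turn of history (the texts are made up; the formatting logic is copied from the function above):

```python
# Hypothetical conversation: one previous turn plus the current query.
history = [("你好", "你好！有什么可以帮你？")]
query = "今天天气怎么样？"

prompt = ""
for turn_idx, (old_query, response) in enumerate(history):
    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

print(prompt)
# [Round 0]
# 问:你好
# 答:你好！有什么可以帮你？
# [Round 1]
# 问:今天天气怎么样？
# 答:
```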
```python
inputs = [prefix + inp for inp in inputs]
model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True)
labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
```
This part of the code first prepends the prefix to each input, then runs the tokenizer over the inputs with truncation and padding to obtain the model inputs. The target texts are tokenized in the same way to obtain the labels.
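For reference, a Hugging Face tokenizer called this way returns a dict-like object containing `input_ids` and usually an `attention_mask`. A small illustration with a generic stand-in tokenizer (not the one the script loads):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in tokenizer for illustration
out = tok(["hello world", "hi"], max_length=8, truncation=True, padding=True)
print(out["input_ids"])       # one list of token ids per input text
print(out["attention_mask"])  # 1 for real tokens, 0 for padding
```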
```python
if data_args.ignore_pad_token_for_loss:
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
    ]
model_inputs["labels"] = labels["input_ids"]
return model_inputs
```
This part of the code uses the `ignore_pad_token_for_loss` parameter to decide whether padding tokens should be ignored when the loss is computed. If so, every padding token id in the labels is replaced with -100. Finally, the processed labels are attached to the model inputs, which are then returned.
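The value -100 matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labelled -100 contribute nothing to the loss. A minimal illustration (not part of the script):

```python
import torch
import torch.nn.functional as F

# Two target positions over a vocabulary of size 4; the second position is
# labelled -100 and is therefore ignored by the loss.
logits = torch.randn(2, 4)
labels = torch.tensor([3, -100])
loss = F.cross_entropy(logits, labels)  # ignore_index defaults to -100
```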
Next is `preprocess_function_train`, which does something similar to `preprocess_function_eval` but handles the data a little differently:
```python
def preprocess_function_train(examples):
    max_seq_length = data_args.max_source_length + data_args.max_target_length

    model_inputs = {
        "input_ids": [],
        "labels": [],
    }
    for i in range(len(examples[prompt_column])):
        if examples[prompt_column][i] and examples[response_column][i]:
            query, answer = examples[prompt_column][i], examples[response_column][i]

            if history_column is None:
                prompt = query
            else:
                prompt = ""
                history = examples[history_column][i]
                for turn_idx, (old_query, response) in enumerate(history):
                    prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

            prompt = prefix + prompt
            a_ids = tokenizer.encode(text=prompt, add_special_tokens=False)
            b_ids = tokenizer.encode(text=answer, add_special_tokens=False)

            if len(a_ids) > data_args.max_source_length - 1:
                a_ids = a_ids[: data_args.max_source_length - 1]

            if len(b_ids) > data_args.max_target_length - 2:
                b_ids = b_ids[: data_args.max_target_length - 2]

            input_ids = tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)

            context_length = input_ids.index(tokenizer.bos_token_id)
            mask_position = context_length - 1
            labels = [-100] * context_length + input_ids[mask_position+1:]

            pad_len = max_seq_length - len(input_ids)
            input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
            labels = labels + [tokenizer.pad_token_id] * pad_len
            if data_args.ignore_pad_token_for_loss:
                labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

            model_inputs["input_ids"].append(input_ids)
            model_inputs["labels"].append(labels)

    return model_inputs
```
First, the function computes the maximum sequence length `max_seq_length`, which is the sum of the maximum source length and the maximum target length. It then creates a dictionary `model_inputs` to store the model's input data.
The following code processes each example, but only when both its prompt and its response are present.
```python
for i in range(len(examples[prompt_column])):
    if examples[prompt_column][i] and examples[response_column][i]:
        query, answer = examples[prompt_column][i], examples[response_column][i]
        ...
```
The prompt is generated in the same way as in `preprocess_function_eval`. The prompt and the answer are then tokenized separately, and each is truncated if it exceeds its maximum length.
```python
a_ids = tokenizer.encode(text=prompt, add_special_tokens=False)
b_ids = tokenizer.encode(text=answer, add_special_tokens=False)

if len(a_ids) > data_args.max_source_length - 1:
    a_ids = a_ids[: data_args.max_source_length - 1]

if len(b_ids) > data_args.max_target_length - 2:
    b_ids = b_ids[: data_args.max_target_length - 2]
```
Next, the tokenizer's `build_inputs_with_special_tokens` method concatenates the token ids of the prompt and the answer, inserting the model's special tokens, to form the model input.
```python
input_ids = tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)
```
Then the length of the context (everything up to the `bos` token) is computed and the labels are generated: -100 for every context position, followed by the token ids of the answer part.
```python
context_length = input_ids.index(tokenizer.bos_token_id)
mask_position = context_length - 1
labels = [-100] * context_length + input_ids[mask_position+1:]
```
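As a hypothetical illustration of the resulting layout (all token ids below are made up, including the `bos` id):

```python
# Four context tokens, then the bos token (id 5 here), then three answer tokens.
bos_id = 5
input_ids = [11, 12, 13, 14, bos_id, 21, 22, 23]

context_length = input_ids.index(bos_id)   # 4
mask_position = context_length - 1         # 3
labels = [-100] * context_length + input_ids[mask_position + 1:]
# labels == [-100, -100, -100, -100, 5, 21, 22, 23]
```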
The following code pads the input ids and the labels up to the maximum sequence length by appending padding tokens.
```python
pad_len = max_seq_length - len(input_ids)
input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
labels = labels + [tokenizer.pad_token_id] * pad_len
```
Then, depending on the value of `ignore_pad_token_for_loss`, it is decided whether padding tokens are ignored when the loss is computed; if so, every padding token id in the labels is replaced with -100.
```python
if data_args.ignore_pad_token_for_loss:
    labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
```
Finally, the processed input and labels are added to the model's input data and returned.
```python
model_inputs["input_ids"].append(input_ids)
model_inputs["labels"].append(labels)
return model_inputs
```
Generally speaking, both functions preprocess the data, but in slightly different ways: the evaluation function keeps the inputs and the labels as separate sequences, while the training function concatenates the prompt and the answer into one sequence and masks the prompt positions in the labels with -100. The next part of the script applies these functions to the dataset splits:
```python
if training_args.do_train:
    if "train" not in raw_datasets:
        raise ValueError("--do_train requires a train dataset")
    train_dataset = raw_datasets["train"]
    if data_args.max_train_samples is not None:
        train_dataset = train_dataset.select(range(data_args.max_train_samples))
    train_dataset = train_dataset.map(
        preprocess_function_train,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
    )

if training_args.do_eval:
    if "validation" not in raw_datasets:
        raise ValueError("--do_eval requires a validation dataset")
    eval_dataset = raw_datasets["validation"]
    if data_args.max_eval_samples is not None:
        eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))
    eval_dataset = eval_dataset.map(
        preprocess_function_eval,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
    )
```
This code handles dataset selection and preprocessing. If `do_train` is set, it takes the training split from `raw_datasets` and, if `max_train_samples` is given, keeps at most that many examples. It then maps `preprocess_function_train` over the training dataset; the number of parallel worker processes is controlled by `preprocessing_num_workers`, and the original columns are removed once preprocessing is complete.
Likewise, if `do_eval` is set, the validation split is taken from `raw_datasets`, optionally limited to at most `max_eval_samples` examples, and processed with `preprocess_function_eval`.
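Note that because `batched=True` is passed, `datasets` calls the preprocessing function with a dictionary of column lists rather than a single example, which is why the functions above loop over `range(len(examples[prompt_column]))`. A minimal, self-contained illustration with made-up column names:

```python
from datasets import Dataset

# Tiny made-up dataset with a prompt/response column layout.
demo = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "response": ["4"],
})

def add_length(batch):
    # batch is a dict of lists, e.g. {"prompt": [...], "response": [...]}
    return {"prompt_length": [len(p) for p in batch["prompt"]]}

demo = demo.map(add_length, batched=True)
```

With the datasets prepared, the script then initializes the trainer: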
```python
# Initialize our Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
```
This code initializes the trainer. `Seq2SeqTrainer` is a trainer tailored to sequence-to-sequence models; it takes the model, the training arguments, the training and evaluation datasets, the tokenizer, and the data collator as parameters.
```python
# Training
if training_args.do_train:
    train_result = trainer.train(
        model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    )
    trainer.save_model()  # Saves the tokenizer too for easy upload

    output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
    if trainer.is_world_process_zero():
        with open(output_train_file, "w") as writer:
            logger.info("***** Train results *****")
            for key, value in sorted(train_result.metrics.items()):
                logger.info(f"  {key} = {value}")
                writer.write(f"{key} = {value}\n")

        # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
        trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
```
This code performs the actual training. If `do_train` is set, the trainer starts training; after training finishes, the model and tokenizer are saved, the training metrics are written to a text file, and the trainer state is saved to `trainer_state.json`.
The above code snippets train the model, process the training and validation data, and save the training results.
Let's go through the metrics-writing loop line by line:
```python
for key, value in sorted(train_result.metrics.items()):
```
This line iterates over every key-value pair in the `train_result.metrics` dictionary, which holds the metrics produced during training, such as the training loss and runtime statistics. The `sorted()` call ensures that the entries are processed in alphabetical order by key.
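For context, `train_result.metrics` is a plain dictionary; a hypothetical example of what the loop produces (the keys shown are typical `Trainer` metrics, and the values are made up):

```python
# Hypothetical metrics dict; real keys and values depend on the actual run.
metrics = {"epoch": 3.0, "train_loss": 2.41, "train_runtime": 1234.5}

for key, value in sorted(metrics.items()):
    print(f"{key} = {value}")
# epoch = 3.0
# train_loss = 2.41
# train_runtime = 1234.5
```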
```python
logger.info(f" {key} = {value}")
```
This line uses `logger.info` to write the name (key) and the value of each metric to the log, so that the values can be seen in the terminal or in a log file while the script runs.
```python
writer.write(f"{key} = {value}\n")
```
This line of code writes the name and corresponding value of each indicator to a file. The advantage of this is that after the training process is over, you can directly open this file to view the training results without re-running the entire script.
In general, the purpose of this code is to record the key metric values from the training process so that the model's performance can be analysed afterwards.