LLMs: Colossal-LLaMA-2 source code walkthrough. init_tokenizer.py extends the source vocabulary (the new entries are Chinese tokens) to support continual pre-training; init_model.py expands the model's embedding layers to fit the new vocabulary by initializing each new row from a computed mean, then saves the expanded model; prepare_pretrain_dataset.py processes and slices the raw dataset and saves it in JSONL and Arrow formats
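
The mean-based embedding expansion named in the title can be illustrated with a small, dependency-free sketch. This is not the code from init_model.py: the function name `mean_init_embeddings`, the toy character-level vocabulary, and the sample vectors are all hypothetical, and the real script works on a model's tensors and a real tokenizer. The idea shown is only that each new token is split into pieces the old vocabulary already knows, and its new embedding row is set to the mean of those pieces' rows.

```python
from typing import Dict, List


def mean_init_embeddings(
    old_embeddings: List[List[float]],
    old_vocab: Dict[str, int],
    new_tokens: List[str],
) -> List[List[float]]:
    """Return a copy of the table with one mean-initialized row per new token.

    Hypothetical helper for illustration only; not taken from Colossal-LLaMA-2.
    """
    dim = len(old_embeddings[0])
    # Copy the old rows so the original table is left untouched.
    expanded = [row[:] for row in old_embeddings]
    for token in new_tokens:
        # Assumption: a toy stand-in for the source tokenizer that splits
        # the new token into single characters known to the old vocabulary.
        piece_ids = [old_vocab[ch] for ch in token if ch in old_vocab]
        if not piece_ids:
            raise ValueError(f"no known pieces for token {token!r}")
        # New row = element-wise mean of the rows of the token's pieces.
        mean_row = [
            sum(old_embeddings[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        expanded.append(mean_row)
    return expanded


# Toy data: three old tokens with 2-dimensional embeddings.
old_vocab = {"a": 0, "b": 1, "c": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
expanded = mean_init_embeddings(old_embeddings, old_vocab, ["ab", "abc"])
print(expanded[3])  # row for "ab"
print(expanded[4])  # row for "abc"
```

Starting new rows from the mean of existing rows, rather than from random values, keeps the new tokens close to representations the model already uses, which is the usual motivation for this kind of initialization before continual pre-training.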

Origin blog.csdn.net/qq_41185868/article/details/133365330