Large models: how to use an old tokenizer to train a new one?

Background:

When we use ChatGPT or Stable Diffusion (SD), we find that prompts written in English usually give much better results than prompts written in Chinese. Why? A component called the tokenizer is at work.
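A small illustrative check (not from the original post, and the prompts are made up) makes this concrete: the GPT-2 tokenizer, whose vocabulary was learned mostly from English text, needs far more tokens to represent a Chinese sentence than an English one of similar meaning.

from transformers import AutoTokenizer

# Rough sketch: compare how many tokens GPT-2's English-centric BPE vocabulary
# needs for an English prompt vs. a Chinese prompt of similar meaning.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

english_prompt = "Please draw a cat sitting on a sofa."
chinese_prompt = "请画一只坐在沙发上的猫。"

print(len(tokenizer.tokenize(english_prompt)))  # relatively few tokens
print(len(tokenizer.tokenize(chinese_prompt)))  # noticeably more tokens for the same meaning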

Training a suitable tokenizer is the foundation for training a large model. We can either train a brand-new tokenizer from scratch, or start from an old tokenizer and train a new one. Today, let's look at how to use an old tokenizer to train a new one.

Step 1: Data Preparation

Whether we are training a large model or a tokenizer, the first step is to prepare the dataset:

from datasets import load_dataset

# Load the dataset
raw_datasets = load_dataset("code_search_net", "python")

# A generator function that yields the data in batches,
# so a very large dataset does not overflow memory
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )


training_corpus = get_training_corpus()
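
As a quick sanity check (illustrative only), we can pull one batch from the generator: each batch is a list of up to 1,000 function strings, so the full dataset never has to sit in memory at once.

# Illustrative check: inspect the first batch yielded by the generator
first_batch = next(get_training_corpus())
print(len(first_batch))       # up to 1000 function strings
print(first_batch[0][:200])   # beginning of the first function in the batch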

Step 2: Training

from transformers import AutoTokenizer

# Load the old (GPT-2) tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Train a new tokenizer on our corpus, with a target vocabulary size of 52,000
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
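
To see the effect of retraining, here is a small comparison (the sample function below is made up for illustration): the new tokenizer should usually split typical Python source into fewer, more meaningful pieces than the original GPT-2 tokenizer.

# Illustrative comparison on a small, made-up Python snippet
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

print(len(old_tokenizer.tokenize(example)))  # original GPT-2 tokenizer
print(len(tokenizer.tokenize(example)))      # retrained tokenizer, typically fewer tokens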

Step 3: Save

tokenizer.save_pretrained("code-search-net-tokenizer")
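
If you also want to load the tokenizer by a Hub repository id, as in Step 4 below, you can additionally push it to the Hugging Face Hub. This step is optional and requires you to be logged in; the repository name here is just an example.

# Optional: push the trained tokenizer to the Hugging Face Hub
# (requires `huggingface-cli login` or an access token)
tokenizer.push_to_hub("code-search-net-tokenizer")

The id huggingface-course/code-search-net-tokenizer used in Step 4 appears to be the copy published for the Hugging Face course; with your own account you would load your-username/code-search-net-tokenizer instead.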

Step 4: Use

tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

Summary:

1. Using train_new_from_iterator() on an existing fast tokenizer (loaded with AutoTokenizer), we can easily train a new tokenizer on our own dataset while keeping the old tokenizer's algorithm and configuration.

2. If there is no large language model available for the language we need, or the data we want to work with is very different from the data the existing large language model was trained on, we need to retrain the model from scratch using a tokenizer suited to our own data.
