Code completion snack tutorial (4) - training the language model

A powerful language model is a good foundation for other tasks. A pre-trained model gives us a strong starting point, and on that basis we can fine-tune the model to meet our specific needs.
We will do the hands-on part first and explain the theory afterwards.

Code Data Preparation

Strictly speaking, preparing the code data should also include deduplicating the code; we will come back to this when we discuss the relevant papers (a minimal deduplication sketch is also shown after the script below).
For now we take the simplest approach and just concatenate the code files together.

We write a small script that reads every Python file in the transformers library and writes them into a single file:

import os


def walkPrograms(dir, datafile, wildcard):
    # Walk dir recursively and append every file whose name matches wildcard to datafile
    exts = wildcard.split(" ")
    for root, subdirs, files in os.walk(dir):
        for name in files:
            for ext in exts:
                if name.endswith(ext):
                    filename = os.path.join(root, name)
                    print(filename)
                    try:
                        # Skip files that are not valid UTF-8
                        with open(filename, 'r', encoding='utf-8') as f1:
                            datafile.writelines(f1.readlines())
                    except UnicodeDecodeError:
                        continue
                    break


# Concatenate all .py files in the transformers checkout into a single data file
outfile = open('transformer.data', 'w', encoding='utf-8')
wildcard = '.py'
walkPrograms('/home/xulun/github/transformers/', outfile, wildcard)
outfile.close()

This produces a file named transformer.data, which is the concatenation of all the Python files.
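
As mentioned above, a stricter data pipeline would also deduplicate the code. The following is only a minimal sketch of file-level deduplication by content hash; it is not part of the original script, and walk_unique and its parameters are hypothetical names:

import hashlib
import os


def walk_unique(dir, datafile, ext='.py'):
    # Write each distinct file's content only once, skipping exact duplicates
    seen = set()
    for root, subdirs, files in os.walk(dir):
        for name in files:
            if not name.endswith(ext):
                continue
            filename = os.path.join(root, name)
            try:
                with open(filename, 'r', encoding='utf-8') as f:
                    content = f.read()
            except UnicodeDecodeError:
                continue
            digest = hashlib.md5(content.encode('utf-8')).hexdigest()
            if digest in seen:
                continue  # exact duplicate of a file we already wrote
            seen.add(digest)
            datafile.write(content)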

Fine-tuning the language model

Before training, we first install the transformers library: cd into the directory where transformers was downloaded and run

pip3 install -e . --user
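
To check that the editable install worked, you can, for instance, print the installed version from Python (a quick sanity check, not part of the original tutorial):

import transformers

# If the editable install succeeded, this prints the version of the checked-out source tree
print(transformers.__version__)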

Once the installation succeeds, we can use the run_lm_finetuning.py script from the transformers examples to fine-tune the model:

python3 run_lm_finetuning.py \
    --output_dir=/home/xulun/out_trans \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --per_gpu_train_batch_size=1 \
    --do_train \
    --train_data_file=/home/xulun/github/lusinga/localcomplete/server/transformer.data \
    --block_size=512 --save_steps=500 --overwrite_output_dir

Let's examine the meaning of these parameters:

  • output_dir: the directory where the fine-tuned weights will be saved
  • model_type: the model category, such as gpt2 or another architecture
  • model_name_or_path: the specific model variant, such as gpt2-medium, gpt2-large, gpt2-xl, and so on
  • per_gpu_train_batch_size: the batch size per GPU when training on one or more GPUs
  • do_train: training is only performed when this flag is specified
  • train_data_file: the file containing the training data
  • block_size: the block size; the more GPU memory you have, the larger a value you can choose. I am using an NVIDIA 2060 GPU with relatively little memory, so I picked a fairly small value (see the sketch after this list for a rough way to estimate how many blocks your data file yields)
  • save_steps: how many training steps between checkpoints; the default is 50, which feels a bit small to me, so I changed it to 500
  • overwrite_output_dir: the output directory is not empty, so overwrite it to save storage space
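
Before starting training, it can help to estimate how many 512-token blocks the data file yields, which tells you roughly how many optimizer steps one epoch takes with batch size 1. A minimal sketch, not part of the original tutorial; the file path is the one produced above:

from transformers import GPT2Tokenizer

# Tokenize the concatenated code file with the same GPT-2 vocabulary used for training
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
with open('/home/xulun/github/lusinga/localcomplete/server/transformer.data', 'r', encoding='utf-8') as f:
    text = f.read()

tokens = tokenizer.encode(text)  # may warn that the sequence exceeds the model's max length; that is fine here
print('tokens:', len(tokens))
print('blocks of 512:', len(tokens) // 512)  # with batch size 1, roughly this many steps per epoch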

Verifying the result

Let's test the effect with a completion example, using the same code as before. First we try the stock gpt2 model:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# MODEL = '/home/xulun/out_trans/'
MODEL = 'gpt2'

# Load the vocabulary
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)

# The text we want to complete
text = '    indexed_tokens = tokenizer.'
predicted_text = text

# Load the pre-trained weights into the model
model = GPT2LMHeadModel.from_pretrained(MODEL)

# Switch to eval mode so the dropout used during training is disabled
model.eval()
#model.to('cuda')

# Each step only predicts one token, so completing a statement takes many iterations; 30 is an arbitrary choice
for i in range(0, 30):

    # Feed the previous prediction back in as the next input, i.e. autoregression
    indexed_tokens = tokenizer.encode(predicted_text)

    # Convert the token indices into a PyTorch tensor
    tokens_tensor = torch.tensor([indexed_tokens])

    # Use the GPU for acceleration; honestly, it is not that fast
    #tokens_tensor = tokens_tensor.to('cuda')

    # Run inference
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    # Take the predicted next subword (greedy argmax)
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    # Decode back into readable text
    predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
    # Print the completion so far
    print(predicted_text)

Output is as follows:

indexed_tokens = tokenizer.get_tokenizer_id(tokenizer.get_tokenizer_id(), tokenizer.get_tokenizer_id(), tokenizer.

Now we switch to the model we have just fine-tuned by changing MODEL from gpt2 to the output directory of our training run:

MODEL = '/home/xulun/out_trans/'

For completeness, here is the full code:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

MODEL = '/home/xulun/out_trans/'
# MODEL = 'gpt2'

# Load the vocabulary
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)

# The text we want to complete
#text = 'function walk(dir, fn) { if (fs.existsSync(dir)) { let stat ='
#text = 'if (stat.isDirectory()) {fs.readdirSync(dir).'
#text = 'mediaFileText.color ='
#text = 'mediaFileText.top ='
text = '    indexed_tokens = tokenizer.'
predicted_text = text

# Load the fine-tuned weights into the model
model = GPT2LMHeadModel.from_pretrained(MODEL)

# Switch to eval mode so the dropout used during training is disabled
model.eval()
#model.to('cuda')

# Each step only predicts one token, so completing a statement takes many iterations; 30 is an arbitrary choice
for i in range(0, 30):

    # Feed the previous prediction back in as the next input, i.e. autoregression
    indexed_tokens = tokenizer.encode(predicted_text)

    # Convert the token indices into a PyTorch tensor
    tokens_tensor = torch.tensor([indexed_tokens])

    # Use the GPU for acceleration; honestly, it is not that fast
    #tokens_tensor = tokens_tensor.to('cuda')

    # Run inference
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    # Take the predicted next subword (greedy argmax)
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    # Decode back into readable text
    predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
    # Print the completion so far
    print(predicted_text)

Output:

indexed_tokens = tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)

It clearly knows the transformers library better than the original model. If we train on more code, the model should get even better at writing Python.
To support other programming languages, simply switch the training set to code written in those languages.
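
As an aside, newer releases of transformers provide a generate() method on the model, which can replace the manual greedy loop above. A minimal sketch, assuming a version of the library that already has generate() and reusing the fine-tuned model directory from before:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

MODEL = '/home/xulun/out_trans/'
tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
model = GPT2LMHeadModel.from_pretrained(MODEL)
model.eval()

text = '    indexed_tokens = tokenizer.'
input_ids = torch.tensor([tokenizer.encode(text)])

# Greedy decoding of up to 30 new tokens in one call instead of a Python loop
with torch.no_grad():
    output = model.generate(input_ids,
                            max_length=input_ids.shape[1] + 30,
                            do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))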
