Easily playing with the open-source large language model bloom (4)

Foreword

The previous articles in this series all dealt with decoding strategies for language models. Today we move on to a more advanced topic: what do you do when tweaking the decoding strategy only gets you so far and rewriting the prompt still doesn't give satisfactory results? At that point you need to fine-tune the large language model. The large models we use are generally trained by someone else on big general-purpose datasets, or have already been fine-tuned for a particular domain, so they may not fit the kind of content we currently need the model to generate.
This article takes the bloom-1b1 model as an example and uses the xturing library for fine-tuning. Since fine-tuning is particularly hungry for GPU memory, readers without a large-memory GPU can do what I do and use Google's Colab service to get 16 GiB or more. You can check the available memory by typing nvidia-smi on the command line; as shown in the figure below, I have 40960 MiB, i.e. 40 GiB.
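On Colab you can run the same check directly in a notebook cell:

!nvidia-smi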
[Figure: nvidia-smi output showing 40960 MiB of GPU memory]

Dataset preparation

First, prepare a JSON file containing one big list of dicts. Each dict has the format {"instruction": xxx, "input": "", "output": xxx}: instruction is the question or instruction, input is any additional input (a math problem, for example, may need to supply variable values), and output is the target text the model should generate.
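For example, a single entry might look like this (the question and answer here are made up for illustration; roughly "Which is the world's highest mountain?" / "Mount Everest, at about 8,848 m"):

[
    {
        "instruction": "世界上最高的山峰是哪座?",
        "input": "",
        "output": "珠穆朗玛峰,海拔约8848米。"
    }
]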

In this example I used a Chinese question-answering dataset that I scraped myself; the process was described in a previous article, so try it yourself if you are interested. I collected a little over 500 question-answer pairs as the fine-tuning sample.

Once the JSON is ready, install the datasets library with pip, then convert the file into a dataset in the required format with the following code:

import json

from datasets import Dataset, DatasetDict

def preprocess_alpaca_json_data(alpaca_dataset_path: str):
    # load the list of {"instruction", "input", "output"} dicts
    with open(alpaca_dataset_path) as f:
        alpaca_data = json.load(f)

    instructions = []
    inputs = []
    outputs = []

    for data in alpaca_data:
        instructions.append(data["instruction"])
        inputs.append(data["input"])
        outputs.append(data["output"])

    # the column names ("instruction", "text", "target") are what
    # xturing's InstructionDataset expects
    data_dict = {
        "train": {
            "instruction": instructions,
            "text": inputs,
            "target": outputs,
        }
    }

    dataset = DatasetDict()
    # build a Dataset for each split
    for k, v in data_dict.items():
        dataset[k] = Dataset.from_dict(v)

    dataset.save_to_disk("./alpaca_data")

preprocess_alpaca_json_data("your_dataset.json")  # path to your JSON file

Calling the function produces a folder whose contents are shown in the figure:
[Figure: contents of the generated alpaca_data folder]
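As a quick sanity check, you can load the saved dataset back with the datasets library:

from datasets import load_from_disk

dataset = load_from_disk("./alpaca_data")
print(dataset)  # should show a DatasetDict with a "train" split and the three columns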

The code

First of all, if you are running in Colab, make sure you have selected the right runtime:
[Figure: Colab runtime type selection]
Select a GPU runtime, standard or premium. Compute units are consumed every hour; I have 100 compute units because I subscribed to Colab Pro last month. Users without a membership can only freeload on the free standard tier.

Then install the necessary libraries. If you are not using Colab, I recommend running under Linux, unless you are confident enough to get things compiling with Visual Studio on Windows:

!pip install accelerate
!pip install xturing --upgrade

With everything installed, the code itself is only a few lines:

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models.base import BaseModel

instruction_dataset = InstructionDataset("/content/alpaca_data")
model = BaseModel.create("bloom_lora")

Fill in the path of the folder you just generated, and pass "bloom_lora" to the .create method: this selects the bloom-1b1 model by default and uses LoRA to make training cheaper and faster.
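xturing handles the LoRA machinery internally, but as a rough illustration of the idea (a toy sketch, not xturing's actual implementation): LoRA freezes the pretrained weight matrix and trains only a small low-rank update on top of it, which is why it cuts the memory cost of fine-tuning so drastically.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen base weights plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        # only these two small matrices are trained: (r x in) and (out x r)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: training starts from the base model
        self.scaling = alpha / r

    def forward(self, x):
        # base output plus the scaled low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling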

# Fine-tune the model
model.finetune(dataset=instruction_dataset)

This starts training; the default is three epochs, and fine-tuning on this example dataset took about six minutes.
After training, let's try generation. First change generation_config, i.e. the decoding strategy:

generation_config = model.generation_config()
generation_config.do_sample = True  # sample instead of greedy decoding
generation_config.top_k = 50  # consider only the 50 most likely tokens
generation_config.top_p = 0.8  # nucleus sampling threshold
generation_config.max_new_tokens = 512  # cap on the length of the generated text

Let's try a question from today's trending searches, roughly "What is it like to enter the eye of a storm in first person?":

# Once the model has been finetuned, you can start doing inferences
output = model.generate(texts=["第一视角进入风暴中心,是种什么体验?"])
print("Generated output by the model: {}".format(output))

The result is as follows (a loosely coherent Chinese description of the wind and clouds at the center of the storm):

登陆后第一时间发现风有多狂,能感觉空气怎么突然就凉爽了。
在云层中翻滚,能感受到海面下风的肆虐,云层下也有很多小冰块在打滚。风会从上向下刮,就像是在刮雪一样,刮的特别猛烈,在云层下翻滚。 
风暴中心,是风暴的终点,是整个风暴的中心,风向是逆着往复,风向逆着风向,风暴中心就会变成云层,然后刮的特别快,在云层下翻滚,最后消失。 
风暴中心,是风暴的最顶端,风向是顺着往复,风向顺着风向,风暴中心就会变成云层,然后刮的特别快,在云层下翻滚,最后消失。

For comparison, feed the same question with the same generation settings to the default (not fine-tuned) model:
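Reproducing this should only require creating the base model without the LoRA fine-tune. I believe "bloom" is the plain model key in xturing, but treat that as an assumption and check its model registry:

model_base = BaseModel.create("bloom")  # plain bloom-1b1, no fine-tuning ("bloom" key is an assumption)
output = model_base.generate(texts=["第一视角进入风暴中心,是种什么体验?"])
print("Generated output by the base model: {}".format(output))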

一视角进入风暴中心,是种什么体验? 
影片中,马特所在的直升机,在进入风暴中心之前,需要先从高空中观察四周环境。
在空中,除了气象台、直升机、救援队伍,还有各种各样的设备,包括直升机吊舱、直升机降落伞、直升机加油机、直升机导航仪等。
这些设备中,最为核心的部分就是直升机吊舱,
因为直升机吊舱可以搭载直升机和飞机,直升机吊舱的重量相当于一架直升机的载荷,因此可以搭载更多直升机。

Notice that the default model's output wanders off into helicopters, completely off topic, while the fine-tuned model is clearly more on point.

Then change the prompt to see how the fine-tuned model handles a different question; the output below weighs the value of Google's Pixel phones and Android tablets:
[Figure: the new prompt, shown as a screenshot in the original post]

但已经对它的价值产生了足够的怀疑。
首先,我们无法证明谷歌Pixel手机和Android平板电脑的性能有多强。
第二,由于谷歌没有在手机上开发过操作系统,因此谷歌无法从Android平板电脑获得与谷歌Pixel类似的安全性能,即使谷歌Pixel手机使用安卓系统,但Android平板电脑的设备也可能使用非谷歌Android设备,例如,Windows系统。在Android平板电脑上,Android设备运行的是非谷歌Android设备。
第三,谷歌Android平板电脑的电池是无法在谷歌Android平板电脑上长期使用的。

The default model, by contrast, gives a generic product-review style answer and ends with a literal </s> end-of-sequence token:

虽然我还没有实际体验到产品,所以不敢给一个准确的答案。
但是从实际体验来说,这款产品确实是十分不错的,我个人觉得颜值和外观设计都是十分不错的,而且售价也确实是不错。
不过我还是建议大家在购买前先了解下这款产品的信息。</s>

The comparisons above show that fine-tuning really does have an effect. So how do you save the fine-tuned model? In fact there is already a saved_model folder in the working directory; just drag it to your Google Drive to package and download it.
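If you prefer saving and reloading explicitly, xturing models also expose save and load methods; to the best of my knowledge this works as below, but check the xturing documentation:

model.save("./saved_model")  # write weights and config to the folder
model = BaseModel.load("./saved_model")  # reload later for inference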

Origin: blog.csdn.net/weixin_43945848/article/details/130079701