LLM - batch load datasets and merge them

Table of contents

I. Introduction

II. Dataset generation

1. Data style

2. Batch loading

◆ Main function call

◆ Basic variable definition

◆ Loading multiple data sets

3. Data set merging

◆ Concat

◆ Interleave

◆ stopping_strategy

◆ interleave_probs

III. Summary


I. Introduction

The LLM model is trained on top of the transformer architecture. A dataset has to be generated first, and that dataset is then used to produce the corresponding input_ids, label_ids, etc. according to the task requirements. This article introduces the method of generating the dataset, that is, reading multiple files and merging them into a single dataset. The transformation of the dataset for different task requirements will be introduced afterwards.

Tips:

The data set and code in this article are mainly taken from the GitHub project LLaMA-Efficient-Tuning.

II. Dataset generation

1. Data style

◆ alpaca_data_zh_51k.json

◆ alpaca_gpt4_data_zh.json

The data set is a JSON file, where each record contains 3 keys:

- instruction: can be understood as the prompt

- input: the input, i.e. what we usually call the Question

- output: the output, i.e. the Answer corresponding to the Question

The three keys above can also be simplified. As mentioned in the earlier article LLM - Baichuan7B Tokenizer generates training data, only two fields, q and a, are used there. It does not actually matter what the field names are, as long as the data needed to build the final input_ids can be told apart.
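Schematically, a single record in these files has the following shape (the values below are placeholders, not taken from the actual files):

{
    "instruction": "...prompt text...",
    "input": "...the Question...",
    "output": "...the Answer..."
}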

2. Batch loading

def getBatchDataSet(_base_path, _data_files, _strategy):
    max_samples = 9999

    # support multiple datasets
    all_datasets: List[Union["Dataset", "IterableDataset"]] = []

    for input_path in _data_files:

        # map the file extension to a datasets builder type ("json", "csv", "text")
        data_path = EXT2TYPE.get(input_path.split(".")[-1], None)

        dataset = load_dataset(
            data_path,
            data_files=[os.path.join(_base_path, input_path)],
            split="train",
            cache_dir=None,
            streaming=False,
            use_auth_token=True
        )

        # truncate each dataset to at most max_samples records
        if max_samples is not None:
            max_samples_temp = min(len(dataset), max_samples)
            dataset = dataset.select(range(max_samples_temp))

        print(dataset.features)
        all_datasets.append(dataset)

    if len(all_datasets) == 1:
        return all_datasets[0]
    elif _strategy == "concat":
        return concatenate_datasets(all_datasets)
    elif _strategy == "interleave":
        # alternative: "all_exhausted"
        stopping_strategy = "first_exhausted"
        interleave_probs = [0.5, 0.5]
        return interleave_datasets(all_datasets, interleave_probs, stopping_strategy=stopping_strategy)
    else:
        raise ValueError("Unknown mixing strategy")

Let’s break down the code step by step:

Main function call

import os.path
from datasets import load_dataset, concatenate_datasets, interleave_datasets
from typing import TYPE_CHECKING, Any, Dict, Generator, List, Literal, Union, Tuple
from transformers import GPT2Tokenizer
from itertools import chain
import tiktoken


if __name__ == '__main__':
    # base directory containing the data files
    base_path = "/Users/LLaMA-Efficient-Tuning-main/data"
    data_files = ['alpaca_data_zh_51k.json', 'alpaca_gpt4_data_zh.json']
    strategy = 'concat'
    train_dataset = getBatchDataSet(base_path, data_files, strategy)

Here we specify the two JSON files to traverse and the corresponding merge strategy. The strategies are discussed later.
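Once the call returns, train_dataset behaves like any regular datasets.Dataset object and can be inspected directly, for example:

print(len(train_dataset))   # total number of records after merging
print(train_dataset[0])     # first record, with instruction / input / output fields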

Basic variable definition

EXT2TYPE = {
    "csv": "csv",
    "json": "json",
    "jsonl": "json",
    "txt": "text"
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

max_samples = 9999

# support multiple datasets
all_datasets: List[Union["Dataset", "IterableDataset"]] = []

EXT2TYPE is a map from file extension to the dataset format name. For the tokenizer we simply use the GPT-2 tokenizer shipped with transformers for demonstration. max_samples defines where each data set is truncated, and all_datasets is used to collect the multiple data sets.
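The tokenizer is not used during loading itself. As a rough sketch of where it comes in later (the record and the way prompt and answer are concatenated below are only assumptions for illustration, not the article's actual preprocessing):

sample = {"instruction": "...prompt text...", "input": "...the Question...", "output": "...the Answer..."}

# build the prompt from instruction + input, then tokenize prompt and answer separately
prompt = sample["instruction"] + "\n" + sample["input"]
input_ids = tokenizer.encode(prompt)            # token ids fed to the model
label_ids = tokenizer.encode(sample["output"])  # token ids used as labels
print(len(input_ids), len(label_ids))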

Loading multiple data sets

    for input_path in _data_files:

        # map the file extension to a datasets builder type ("json", "csv", "text")
        data_path = EXT2TYPE.get(input_path.split(".")[-1], None)

        dataset = load_dataset(
            data_path,
            data_files=[os.path.join(_base_path, input_path)],
            split="train",
            cache_dir=None,
            streaming=False,
            use_auth_token=True
        )

        # truncate each dataset to at most max_samples records
        if max_samples is not None:
            max_samples_temp = min(len(dataset), max_samples)
            dataset = dataset.select(range(max_samples_temp))

        print(dataset.features)
        all_datasets.append(dataset)

The loop traverses the file list, uses each file's extension to look up the dataset format in EXT2TYPE, loads the corresponding data set with load_dataset, truncates it with max_samples together with select, and finally appends it to all_datasets. Here dataset.features is similar to the schema of a dataframe and describes the basic information of each column:

{'instruction': Value(dtype='string', id=None), 
 'input': Value(dtype='string', id=None), 
 'output': Value(dtype='string', id=None)}

Running the loader prints a log for each of the two data sets; since they were cached in an earlier run, the Arrow files are read directly.
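To see what a single pass of the loop does in isolation, the call for one file is equivalent to the following (the path is the one used in the main function above):

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files=["/Users/LLaMA-Efficient-Tuning-main/data/alpaca_data_zh_51k.json"],
    split="train"
)
print(ds.features)   # same schema as shown above
print(len(ds))       # number of records before truncation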

3. Data set merging

    if len(all_datasets) == 1:
        return all_datasets[0]
    elif _strategy == "concat":
        return concatenate_datasets(all_datasets)
    elif _strategy == "interleave":
        # alternative: "all_exhausted"
        stopping_strategy = "first_exhausted"
        interleave_probs = [0.5, 0.5]
        return interleave_datasets(all_datasets, interleave_probs, stopping_strategy=stopping_strategy)
    else:
        raise ValueError("Unknown mixing strategy")

Since training only needs a single dataset, the data sets read from multiple files have to be merged into one. The code above shows the different merging strategies. The length == 1 case needs no explanation; beyond that, there are two strategies for merging multiple data sets:

Concat

The concat strategy simply splices multiple data sets together sequentially:

dataset-1 => A,B,C
dataset-2 => D,E,F
concat(dataset-1, dataset-2) => A,B,C,D,E,F
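A tiny self-contained illustration with toy data (not the alpaca files) shows the behaviour:

from datasets import Dataset, concatenate_datasets

d1 = Dataset.from_dict({"text": ["A", "B", "C"]})
d2 = Dataset.from_dict({"text": ["D", "E", "F"]})

merged = concatenate_datasets([d1, d2])
print(merged["text"])   # ['A', 'B', 'C', 'D', 'E', 'F']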

Interleave

The interleave strategy mixes examples from two or more data sets into a new data set, which helps prevent overfitting. Because the model no longer always sees data from one source in the same order during training, its generalization ability can improve.

dataset-1 => A,B,C
dataset-2 => D,E,F
interleave(dataset-1, dataset-2) => A,E,B,C,D,F  (one possible mixed order)
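With the same toy data, the default call (without probabilities) simply cycles through the sources one example at a time; the random-looking order in the diagram above corresponds to the probabilistic mode used in getBatchDataSet:

from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"text": ["A", "B", "C"]})
d2 = Dataset.from_dict({"text": ["D", "E", "F"]})

mixed = interleave_datasets([d1, d2])
print(mixed["text"])    # ['A', 'D', 'B', 'E', 'C', 'F'] -- strict alternation when no probabilities are given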

stopping_strategy

stopping_strategy defines when the interleaving stops. There are two stopping strategies, first_exhausted and all_exhausted:

- first_exhausted (first exhausted strategy)

The data sets are processed in the order they were passed to interleave_datasets, and generation stops as soon as one data set has been fully traversed. This is suitable when you want iteration to end once the first (i.e. smallest) data set is exhausted.

- all_exhausted (all exhausted strategy)

The data sets are processed in the same order, but generation only stops once every data set has been fully traversed; data sets that run out early are reused (oversampled) until the largest one is exhausted. This is suitable when you want every example from every data set to appear.

The main difference between the two strategies is therefore when iteration stops: first_exhausted stops after the first data set is exhausted, while all_exhausted stops only after all data sets are exhausted. Which one to choose depends on your specific needs and the characteristics of your data sets.
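The difference is easiest to see with two toy data sets of different lengths:

from datasets import Dataset, interleave_datasets

short = Dataset.from_dict({"text": ["A", "B"]})
long = Dataset.from_dict({"text": ["C", "D", "E", "F"]})

first = interleave_datasets([short, long], stopping_strategy="first_exhausted")
print(len(first))   # 4 -- stops once the shorter data set runs out

both = interleave_datasets([short, long], stopping_strategy="all_exhausted")
print(len(both))    # 8 -- the shorter data set is reused until the longer one finishes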

interleave_probs

In the code above, interleave_probs is passed to the optional probabilities parameter of interleave_datasets. When interleaving multiple data sets you can give each one a sampling probability, which determines how often examples from that data set are drawn while the interleaved data set is being generated.

For example, suppose you have two data sets A and B, and you set interleave_probs=[0.5, 0.5]. This means that when generating interleaved data sets, A and B are both selected with probability 0.5.

If you set interleave_probs=[0.3, 0.7], then the probability of A being selected is 0.3, and the probability of B being selected is 0.7.

This parameter lets you weight the data sets as needed, so that some sources are sampled more often than others during interleaving.
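A hedged sketch (the 3:7 split and the seed are arbitrary choices for illustration): the probabilities go through the probabilities argument, and a seed makes the random mixing reproducible:

from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"source": ["A"] * 100})
d2 = Dataset.from_dict({"source": ["B"] * 100})

# sample from d1 with probability 0.3 and from d2 with probability 0.7
mixed = interleave_datasets([d1, d2], probabilities=[0.3, 0.7], seed=42)
print(mixed["source"][:10])   # a random mix, biased toward 'B' roughly 7:3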

III. Summary

For large LLM models we spend most of our time calling frameworks and ready-made models for fine-tuning, so being familiar with the underlying tools makes it easier to modify the different parts involved in tuning. This article mainly covered loading the raw data to build a dataset; next, the dataset obtained above will be used to generate the datasets required for different tasks.
