Downloading the T0 instruction fine-tuning dataset (P3)

The T0 benchmark dataset (also known as P3, the Public Pool of Prompts) is a large-scale, manually annotated instruction-tuning dataset proposed in the T0 paper (ICLR 2022). It collects multi-task data from the Hugging Face Hub and equips each task with human-written prompts from PromptSource.

The P3 dataset is hosted on the Hugging Face Hub: https://huggingface.co/datasets/bigscience/P3

However, after downloading it, we find that all the data files have the .tfrecord extension, and their contents look very strange when opened.

In fact, all the downloaded data files are Git LFS (Large File Storage) files. These can be understood simply as pointers: each one records the storage address of the data, but the real data has not been downloaded yet and still lives on the remote server that the pointer references.
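For reference, an LFS pointer file is just a small text file; opening one shows something like the following (the oid and size values here are purely illustrative):

version https://git-lfs.github.com/spec/v1
oid sha256:4665a5ea423c2713d436b5ee50593199b258b087e011d8f23e5b331b2e4742c8
size 478544213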

To fully download P3, follow these steps:

1. Install git-lfs

First, install git-lfs. The official git-lfs website is: https://git-lfs.com/

On macOS, for example, you can install it directly with brew install git-lfs.

On Linux, sudo permission is generally required. So far, the author has not found a reliable way to install git-lfs from source on Linux; for details, see: https://askubuntu.com/questions/799341/how-to-install-git-lfs-on-ubuntu-16-04
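For example, on Debian/Ubuntu (assuming the distribution packages git-lfs and sudo access is available), the packaged version can typically be installed with:

sudo apt-get install git-lfs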

2. Clone the repository

Next, clone P3 directly from the Hugging Face Hub:

git clone https://huggingface.co/datasets/bigscience/P3

After cloning, you will find that the entire repository is actually very small. As mentioned above, the real data has not been downloaded at this point; only the LFS pointer files have been.
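Note: if the git-lfs hooks are already active globally, a plain git clone may start downloading the large files right away. To make sure only the small pointer files are fetched at this stage, you can disable the smudge filter during the clone (GIT_LFS_SKIP_SMUDGE is a standard git-lfs environment variable):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/bigscience/P3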

3. Restore the LFS files

Next, enter the P3 root directory, then use git-lfs to download all the remote files referenced by the LFS pointers:

git lfs install  # initialize git-lfs; add --force if there are any errors
git lfs pull     # download all files referenced by the LFS pointers
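Tip: you can also list which files are managed by LFS, which is a handy sanity check of what will be downloaded:

git lfs ls-files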

Then comes the long wait... The complete P3 dataset is on the order of several hundred GB.

4. [Optional] Select an evaluation subset

Since the entire dataset is so large, we can choose to download only the part we actually need.

For example, the author only wants to download T0's held-out evaluation set (the test tasks). There is no need to download the entire huge dataset; instead, you can use the following Python script to delete all the unneeded task directories before pulling (readers should modify the code to fit their own needs):

# keep only the ANLI R1-R3, CB, COPA and RTE tasks
import os
import shutil

def get_directories(path):
    directories = []
    for entry in os.scandir(path):
        if entry.is_dir():
            directories.append(entry.name)
    return directories

def target_task_dir(directory):
    ''' only return true when facing with the target task directory. '''
    directory = directory.lower()
    if "anli" in directory and ("r1" in directory or "r2" in directory or "r3" in directory):
        return True
    elif "cb" in directory:
        # super_glue CB
        return True
    elif "copa" in directory:
        # super_glue COPA
        return True
    elif "rte" in directory:
        # super_glue RTE
        return True
    else:
        return False

path = "./data"
directories = get_directories(path)

for directory in directories:
    if not target_task_dir(directory):
        # delete this directory (including all files in it)
        shutil.rmtree(os.path.join(path, directory))

Put the above script in the P3 root directory and run it with Python.
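For example (the filename filter_tasks.py is just a placeholder; use whatever name you saved the script under):

python filter_tasks.py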

After deleting, remember to commit the change with git:

git add -A
git commit -m "del unused tasks"

Then run git lfs pull again; only the remaining tasks will be downloaded.
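As an alternative to deleting directories, git lfs pull also accepts include filters (comma-separated glob patterns; the patterns below are illustrative and should be adapted to the actual task directory names you want):

git lfs pull --include="data/*anli*,data/*cb*,data/*copa*,data/*rte*"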

5. Process the tfrecord files

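P3 stores each split as TFRecord files. As a minimal sketch (not the author's original code), here is one way to peek at the records with TensorFlow; the file path below is hypothetical, and the exact feature names vary by task:

import tensorflow as tf

# hypothetical path; point this at any .tfrecord file you actually downloaded
path = "data/super_glue_cb_GPT_3_style/validation.tfrecord-00000-of-00001"

# iterate over the raw serialized records and parse each one as a tf.train.Example
for raw in tf.data.TFRecordDataset(path).take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    print(list(example.features.feature.keys()))  # e.g. inputs/targets-style fields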
