The T0 benchmark (also called P3, the Public Pool of Prompts) is a large-scale, manually annotated instruction-tuning dataset introduced with the T0 paper (ICLR 2022). It collects multi-task data from the Hugging Face Hub and pairs each task with manually written prompts from PromptSource.
The P3 dataset is hosted on the Hugging Face Hub: https://huggingface.co/datasets/bigscience/P3
However, after downloading it, you will find that all the data files have the .tfrecord extension, and their contents look very strange when opened:
As shown in the figure above, all the downloaded data files are Git LFS (Large File Storage) files. These can be understood as pointers: they record the storage address of the data, but the real data has not actually been downloaded and still lives on the remote server the pointer refers to.
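A Git LFS pointer is just a small text file whose first line names the LFS spec version, so you can quickly check whether a given file is still a pointer or already real data. A minimal sketch (the helper name is mine):

```python
def is_lfs_pointer(path):
    """Heuristically check whether a file is a Git LFS pointer
    rather than real data: pointers are tiny text files whose
    first line names the Git LFS spec."""
    try:
        with open(path, "rb") as f:
            first = f.readline()
    except OSError:
        return False
    return first.startswith(b"version https://git-lfs.github.com/spec/")
```

Running this over a freshly cloned repository (before `git lfs pull`) should report every .tfrecord file as a pointer.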
To fully download P3, follow these steps:
1. Install Git LFS
First, install git-lfs; the official site is https://git-lfs.com/
On macOS, for example, brew install git-lfs installs it directly.
On Linux, sudo permission is required. The author has not yet found a reliable way to install git-lfs from source on Linux; for details, see: https://askubuntu.com/questions/799341/how-to-install-git-lfs-on-ubuntu-16-04
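On Debian/Ubuntu specifically, a packaged version is usually available via apt (this assumes sudo access and a distribution recent enough to ship the git-lfs package):

```shell
# Install git-lfs from the distribution's package repository
sudo apt-get update
sudo apt-get install git-lfs
```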
2. Clone the repository
Next, clone P3 directly from the Hugging Face Hub:
git clone https://huggingface.co/datasets/bigscience/P3
After cloning, you will find that the whole repository is actually very small. As mentioned above, the real data has not been downloaded at this point; only the LFS pointer files have been.
3. Restore the LFS files
Next, enter the root directory of P3, then use git-lfs to download all the remote files that the LFS pointers refer to:
git lfs install # initialize git-lfs; add `--force` if there are any errors
git lfs pull # download all files pointed to by the LFS pointers
Then comes the long wait... The complete P3 data should be several hundred GB in total.
4. [Optional] Select an evaluation subset
Since the full dataset is very large, you can choose to download only the parts you need.
For example, the author only wanted T0's held-out evaluation set (the test set), so there was no need to download the entire dataset. The following Python script deletes all the unneeded tasks before downloading (modify the code to suit your own needs):
# keep only the ANLI R1-R3, CB, COPA and RTE tasks
import os
import shutil

def get_directories(path):
    directories = []
    for entry in os.scandir(path):
        if entry.is_dir():
            directories.append(entry.name)
    return directories

def target_task_dir(directory):
    '''Return True only for a target task directory.'''
    directory = directory.lower()
    if "anli" in directory and ("r1" in directory or "r2" in directory or "r3" in directory):
        return True
    elif "cb" in directory:
        # super_glue CB
        return True
    elif "copa" in directory:
        # super_glue COPA
        return True
    elif "rte" in directory:
        # super_glue RTE
        return True
    else:
        return False

path = "./data"
directories = get_directories(path)
for directory in directories:
    if not target_task_dir(directory):
        # delete this directory (including all files in it)
        shutil.rmtree(os.path.join(path, directory))
Put the above script in the P3 root directory and run it with python.
After deleting, remember to commit the change with git:
git add -A
git commit -m "del unused tasks"
and then run git lfs pull.
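As an alternative to deleting directories, git lfs pull also accepts --include/--exclude path filters, so you can fetch only the tasks you need in a single step (the patterns below are illustrative; adjust them to the task directories you want):

```shell
# Fetch only the LFS objects whose paths match these
# comma-separated patterns (example patterns):
git lfs pull --include "data/super_glue_copa*,data/super_glue_rte*"
```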
5. Processing tfrecord files
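Each .tfrecord file is a sequence of length-prefixed records: a little-endian uint64 length, a 4-byte CRC of the length, the record bytes, and a 4-byte CRC of the data. If you do not want to pull in TensorFlow, the raw record bytes can be extracted with a minimal pure-Python sketch (CRC validation is skipped here, and the function name is mine):

```python
import struct

def read_tfrecords(path):
    """Return the raw serialized records from a TFRecord file.
    Per-record layout: uint64 length (little-endian), uint32
    length-CRC, `length` data bytes, uint32 data-CRC.
    CRC checks are skipped in this sketch."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            f.read(4)                      # skip length CRC
            records.append(f.read(length)) # raw record bytes
            f.read(4)                      # skip data CRC
    return records
```

For the P3 files, each returned byte string would typically still need to be parsed as a serialized tf.train.Example protobuf to recover the actual fields.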