How Hugging Face loads local datasets for large model training

Background:

Generally, when we use Hugging Face, we load a dataset directly from its hub for training. The code looks like this:

from datasets import load_dataset

food = load_dataset("food101")

After running the above code, the load_dataset function automatically downloads the dataset from the Hugging Face Hub and caches it locally.

But how do we load our own dataset? This matters especially for image datasets, because the model requires data in the following format:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}

As you can see from the above, image is a picture in PIL format, while our local images are plain image files. How do we convert a local image file into a PIL image?
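The conversion is a single call to PIL's Image.open(). A minimal, self-contained sketch (the file name example.jpg is made up for the demo; in practice you would open a file under your own dataset directory, e.g. dataset/image/1.jpg):

```python
from PIL import Image

# Create a tiny throwaway JPEG so this sketch is self-contained;
# in practice you would open one of your own image files instead.
Image.new("RGB", (64, 64), color="red").save("example.jpg")

image = Image.open("example.jpg")  # an image file becomes a PIL image object
print(type(image).__name__, image.mode, image.size)
```

The object returned by Image.open() is exactly the kind of PIL image the model's data format expects.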

Let's walk through how to load a local dataset.

Dataset directory structure

dataset
	--image
		--1.jpg
		--2.jpg
		--……
	--image.json
	--label.json

Directory structure description

1. image folder

Stores all the pictures.

2. image.json

A JSON file storing a map. Each key is an image file name, and each value is the label id of that image.

{
	"1.jpg": 0,
	"2.jpg": 0,
	"3.jpg": 1,
	"4.jpg": 1,
	"5.jpg": 4,
	"6.jpg": 4,
	"7.jpg": 2,
	"8.jpg": 2,
	"9.jpg": 3,
	"10.jpg": 3
}

3. label.json

A JSON file storing a map. Each key is a label name, and each value is that label's id.

{
	"apple": 0,
	"pear": 1,
	"strawberry": 2,
	"peach": 3,
	"chestnut": 4
}

Code

How do we convert data stored in the above directory structure into the format the model requires?

Without further ado, let's get straight to the code.

import json
import os
from PIL import Image
from datasets import Dataset

path = 'D:/项目/AI/数据集/image/vit_dataset'


def gen(path):
    image_json = os.path.join(path, "image.json")
    with open(image_json, 'r') as f:
        # read the JSON data
        data = json.load(f)
    for key, value in data.items():
        image_path = os.path.join(path, "image", key)
        image = Image.open(image_path)
        yield {'image': image, 'label': value}


ds = Dataset.from_generator(gen, gen_kwargs={"path": path})
ds = ds.train_test_split(test_size=0.2)

def get_label(path):
    label_json = os.path.join(path, "label.json")
    with open(label_json, 'r') as f:
        # read the JSON data
        data = json.load(f)
    label2id, id2label = dict(), dict()
    for key, value in data.items():
        label2id[key] = str(value)
        id2label[str(value)] = key
    return label2id, id2label


print(ds)
print(ds['train'][0])
label2id, id2label = get_label(path)
print(label2id)
print(id2label)

Output:

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 8
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 2
    })
})
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=332x332 at 0x1ED6A84C690>, 'label': 0}
{'apple': '0', 'pear': '1', 'strawberry': '2', 'peach': '3', 'chestnut': '4'}
{'0': 'apple', '1': 'pear', '2': 'strawberry', '3': 'peach', '4': 'chestnut'}

From the output, we can see that we have obtained the data structure the model requires. Pay particular attention to the third line from the end, which shows a single training sample in the expected {'image': ..., 'label': ...} format.
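The label2id and id2label maps printed above are plain dicts, so turning a model's predicted class id back into a readable name is just a lookup. A small sketch (the predicted id 2 is chosen purely for illustration):

```python
# The id2label map from the output above; keys are string ids.
id2label = {'0': 'apple', '1': 'pear', '2': 'strawberry', '3': 'peach', '4': 'chestnut'}

predicted_id = 2  # a hypothetical model prediction
print(id2label[str(predicted_id)])
```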

Summary:

With the Dataset.from_generator() function, by defining a generator we can convert our local, custom data into whatever format the model requires.


Origin blog.csdn.net/duzm200542901104/article/details/134391256