How to load the training dataset

Work with local and remote datasets

Datasets provides loading scripts that handle both local and remote datasets. It supports several common data formats, such as:

Data format        | Loading script | Example
CSV & TSV          | csv            | load_dataset("csv", data_files="my_file.csv")
Text files         | text           | load_dataset("text", data_files="my_file.txt")
JSON & JSON Lines  | json           | load_dataset("json", data_files="my_file.jsonl")
Pickled DataFrames | pandas         | load_dataset("pandas", data_files="my_dataframe.pkl")

As the table shows, for each data format we only need to pass the type of loading script to load_dataset(), along with a data_files argument that specifies one or more file paths. Let's start by loading a dataset from local files; later we'll see how to do the same with remote files.
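As a side note on the table above, the csv loading script covers TSV as well because the only difference between the two formats is the delimiter. Here is a stdlib-only sketch of that distinction (a toy file with a hypothetical name, not using the Datasets library itself):

```python
import csv
import os
import tempfile

# Write a tiny tab-separated file (toy data, hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "my_file.tsv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["title", "year"])
    writer.writerow(["Terremoto del Sichuan", "2008"])

# Reading it back only requires switching the delimiter from "," to "\t".
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

print(rows)
```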

1. Loading a local dataset

For this example we will use the SQuAD-it dataset, a large-scale dataset for question answering in Italian.

To load a JSON file with load_dataset(), we just need to know whether we're dealing with plain JSON (similar to a nested dictionary) or JSON Lines (line-delimited JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a data field. This means we can load the dataset by specifying the field argument like this:

from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
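To make the plain-JSON vs. JSON-Lines distinction concrete, here is a stdlib-only sketch (toy records and hypothetical file names, not the actual SQuAD-it files) that writes the same records in both formats and reads them back:

```python
import json
import os
import tempfile

records = [{"id": 1, "question": "q1"}, {"id": 2, "question": "q2"}]
tmp = tempfile.mkdtemp()

# Plain JSON: one document, with the records nested under a single field.
plain_path = os.path.join(tmp, "plain.json")
with open(plain_path, "w", encoding="utf-8") as f:
    json.dump({"data": records}, f)

# JSON Lines: one complete JSON object per line, no enclosing document.
jsonl_path = os.path.join(tmp, "lines.jsonl")
with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading them back requires two different strategies.
with open(plain_path, encoding="utf-8") as f:
    plain = json.load(f)["data"]              # parse once, index into "data"
with open(jsonl_path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]  # parse line by line

assert plain == lines == records
```

The field="data" argument in the call above plays exactly the role of the ["data"] lookup in this sketch: it tells the json script where the records live inside the enclosing document.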

By default, loading local files creates a DatasetDict object with a train split. We can see this by inspecting the squad_it_dataset object:

squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

This shows us the number of rows and the column names of the training set. We can look at one of the examples by indexing into the train split as follows:

squad_it_dataset["train"][0]

{
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                },
                ...
            ],
        },
        ...
    ],
}
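To get a feel for the nested layout above, here is a stdlib-only sketch that walks a record shaped like the printed example and collects each question with its answer text (the record below is abridged by hand from the printout, not loaded from disk):

```python
# A hand-copied, abridged record in the SQuAD-it layout shown above.
record = {
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                }
            ],
        }
    ],
}

# Each title owns a list of paragraphs; each paragraph owns a context
# plus a list of question/answer pairs under "qas".
pairs = []
for paragraph in record["paragraphs"]:
    for qa in paragraph["qas"]:
        pairs.append((qa["question"], qa["answers"][0]["text"]))

print(pairs)
```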

To include both the train and test splits in a single DatasetDict object, we can pass a dictionary that maps each split name to a file associated with that split:

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

load_dataset() also supports automatic decompression of the input files, so we can load the compressed datasets directly without unzipping them first:

data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
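The automatic decompression is conceptually just transparent gzip handling; here is a stdlib-only sketch of the round trip (a toy payload and a hypothetical file name, not the real SQuAD_it-train.json.gz):

```python
import gzip
import json
import os
import tempfile

payload = {"data": [{"title": "Terremoto del Sichuan del 2008"}]}

# Write the JSON document gzip-compressed, as in SQuAD_it-train.json.gz.
path = os.path.join(tempfile.mkdtemp(), "train.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

# gzip.open transparently decompresses on read, which is essentially the
# step the loading script performs on a .gz input before parsing the JSON.
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = json.load(f)

assert restored == payload
```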

2. Loading a remote dataset

We can point data_files directly at the URLs of the SQuAD_it-*.json.gz files as follows:

# Use the "raw" GitHub URLs so that the request returns the files themselves
# rather than the repository's HTML pages:
# https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
# https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Origin blog.csdn.net/qwer123456u/article/details/130399725