Work with local and remote datasets
Datasets provides loading scripts that handle the loading of local and remote datasets. It supports several common data formats, such as:
| Data format | Loading script | Example |
|---|---|---|
| CSV & TSV | `csv` | `load_dataset("csv", data_files="my_file.csv")` |
| Text files | `text` | `load_dataset("text", data_files="my_file.txt")` |
| JSON & JSON Lines | `json` | `load_dataset("json", data_files="my_file.jsonl")` |
| Pickled DataFrames | `pandas` | `load_dataset("pandas", data_files="my_dataframe.pkl")` |
As shown in the table, for each data format we only need to specify the type of loading script in the `load_dataset()` function, along with a `data_files` argument that specifies the path to one or more files. Let's start by loading a dataset from local files; we'll see how to do the same with remote files later.
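Note that `data_files` is quite flexible: besides a single path, it also accepts a list of paths or a dictionary that maps split names to files. Here's a quick sketch using hypothetical CSV file names:

```python
from datasets import load_dataset

# A single file becomes a single "train" split by default
dataset = load_dataset("csv", data_files="my_file.csv")

# Several files can be combined into one split...
dataset = load_dataset("csv", data_files=["file_1.csv", "file_2.csv"])

# ...or mapped to named splits explicitly
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
```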
1. Load a local dataset
For this example we will use the SQuAD-it dataset, a large-scale dataset for question answering in Italian. The train and test splits are hosted on GitHub as the compressed files SQuAD_it-train.json.gz and SQuAD_it-test.json.gz, so download and decompress them first (e.g. with gzip -dkv SQuAD_it-*.json.gz) to obtain SQuAD_it-train.json and SQuAD_it-test.json.
To load a JSON file with `load_dataset()`, we just need to know whether we're dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-delimited JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a single `data` field. This means we can load the dataset by specifying the `field` argument like this:
```python
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
```
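For comparison, if the file had been JSON Lines (one JSON object per line), no `field` argument would be needed; a quick sketch with a hypothetical file name:

```python
# JSON Lines files are read row by row, so there is no
# wrapping "data" field to point at
jsonl_dataset = load_dataset("json", data_files="my_file.jsonl")
```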
By default, loading local files creates a `DatasetDict` object with a `train` split. We can see this by inspecting the `squad_it_dataset` object:
```python
squad_it_dataset
```

```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
```
This shows us the number of rows and the column names associated with the training set. We can look at one of the examples by indexing into the `train` split as follows:
squad_it_dataset["train"][0]:
{
"title": "Terremoto del Sichuan del 2008",
"paragraphs": [
{
"context": "Il terremoto del Sichuan del 2008 o il terremoto...",
"qas": [
{
"answers": [{"answer_start": 29, "text": "2008"}],
"id": "56cdca7862d2951400fa6826",
"question": "In quale anno si è verificato il terremoto nel Sichuan?",
},
...
],
},
...
],
}
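To make the nested structure concrete, here is a small inspection sketch (not part of the loading API) that flattens the first article into question/answer pairs:

```python
# Walk the nested SQuAD-style structure of the first article
article = squad_it_dataset["train"][0]
for paragraph in article["paragraphs"]:
    for qa in paragraph["qas"]:
        # Each question may carry one or more annotated answers
        first_answer = qa["answers"][0]["text"] if qa["answers"] else None
        print(f"{qa['question']} -> {first_answer}")
```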
To put the training and test sets into a single `DatasetDict` object, we can pass a dictionary to `data_files` that maps each split name to its file:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
DatasetDict({
train: Dataset({
features: ['title', 'paragraphs'],
num_rows: 442
})
test: Dataset({
features: ['title', 'paragraphs'],
num_rows: 48
})
})
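Having both splits in a single `DatasetDict` is convenient because methods like `Dataset.map()` are then applied to every split at once. As a small sketch, this adds a column counting the paragraphs in each article of both the train and test sets:

```python
# map() on a DatasetDict runs the function over every split
squad_it_dataset = squad_it_dataset.map(
    lambda example: {"num_paragraphs": len(example["paragraphs"])}
)
```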
You don't actually need the decompression step at all: Datasets supports automatic decompression of GZIP files (among other common compression formats), so you can point `data_files` directly at the compressed files:

```python
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
2. Load a remote dataset
Since SQuAD-it is hosted on GitHub, we can point `data_files` directly at the URLs of the SQuAD_it-*.json.gz files. Note that we need the raw file URLs (with `raw/master` in the path), not the HTML blob pages:

```python
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```