Training dataset processing

The raw data can be downloaded from:

https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
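
Before loading, the zip archive needs to be downloaded and extracted. In a notebook, one way to do this (a quick sketch, assuming wget and unzip are available on your system) is:

!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip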

Since TSV is just a variation of CSV that uses tabs instead of commas as delimiters, we can load these files with the csv loading script by specifying the delimiter argument in the load_dataset() function, as follows:

from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

When doing any kind of data analysis, it's good practice to take a small random sample to get a quick feel for the type of data you're dealing with. In Datasets, we can create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together:

drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I'm a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than an elevated blood pressure.  I had severe knee and ankle pain which completely went away after taking Mobic.  I attempted to stop the medication however pain returned after a few days."'],
 'rating': [9.0, 3.0, 10.0],
 'date': ['September 2, 2015', 'November 7, 2011', 'June 5, 2013'],
 'usefulCount': [36, 13, 128]}

1. Clean the training data.

The training data needs to be cleaned up in a few ways:

  • The Unnamed: 0 column looks a lot like an anonymized ID for each patient.
  • The condition column contains a mix of uppercase and lowercase labels.
  • The reviews vary in length and contain a mix of Python line separators (\r\n) as well as HTML character codes such as &#039;.

a. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split:

for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

To clean up the dataset a bit, we can rename the Unnamed: 0 column to something more interpretable with the DatasetDict.rename_column() function:

drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

b. The condition column contains a mix of uppercase and lowercase labels, so we convert it to lowercase:

def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset.map(lowercase_condition)

Running this raises the following error:

AttributeError: 'NoneType' object has no attribute 'lower'

For some rows the condition column is None, so we first need to filter out those empty entries. We can use a lambda function to define a simple filtering operation:

drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

After removing the None entries, we can normalize the condition column and check the result by looking at the first three entries:

drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]
['left ventricular dysfunction', 'adhd', 'birth control']

2. Add new columns to the training data

For example, suppose we want to add a column to the training set that records the length of each user's review.

Define a simple function to count the number of words in each review:

def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

compute_review_length() returns a dictionary whose key does not correspond to any of the column names in the dataset. In this case, when compute_review_length() is passed to Dataset.map(), it is applied to every row in the dataset to create a new review_length column:

drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]
{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

As expected, we can see that a review_length column has been added to our training set. We can sort by this new column with Dataset.sort() to see what the extreme values look like:

drug_dataset["train"].sort("review_length")[:3]
{'patient_id': [103488, 23627, 20558],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'condition': ['birth control', 'muscle spasm', 'pain'],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'rating': [10.0, 1.0, 6.0],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'usefulCount': [5, 2, 10],
 'review_length': [1, 1, 1]}
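
To look at the longest reviews instead, you can sort in descending order (a minimal sketch, assuming the reverse flag of Dataset.sort()):

# Sort by review_length in descending order and inspect the longest lengths
longest = drug_dataset["train"].sort("review_length", reverse=True)
longest["review_length"][:3]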

Very short reviews are not useful for our training, so we use the Dataset.filter() function to remove reviews that contain fewer than 30 words. Similar to what we did for the condition column, we can filter out very short reviews by requiring them to be longer than this threshold:

drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

The number of rows remaining in the training and test sets:

{'train': 138514, 'test': 46108}

3. Another problem is that the reviews contain HTML character codes.

These characters can be unescaped with Python's html module as follows:

import html

text = "I'm a transformer called BERT"
html.unescape(text)
"I'm a transformer called BERT"

We'll use Dataset.map() together with html.unescape() to unescape all the HTML characters in the corpus:

drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

4. Batched processing with the map() method

The Dataset.map() method takes a batched parameter which, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable, but defaults to 1,000). For example, the map function that unescaped all the HTML above takes some time to run (you can read the elapsed time from the progress bar). We can speed this up by processing multiple elements at once with a list comprehension.

When you specify batched=True, the function receives a dictionary with the fields of the dataset, but each value is now a list of values instead of a single value. The return value of the function passed to Dataset.map() should have the same form: a dictionary with the fields we want to update or add to the dataset, where each value is a list. For example, here's another way to unescape all the HTML characters, but using batched=True:

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
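
The batch size itself can be tuned through the batch_size argument of Dataset.map(); for example (a sketch, the value 2000 is arbitrary):

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
    batch_size=2000,  # the default batch size is 1000
)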

Using Dataset.map() with batching is essential to unlock the speed of the "fast" tokenizers, which can tokenize large lists of texts very quickly. For example, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

We can pass one or more examples to the tokenizer, so we can use this function with or without batched=True. Let's take this opportunity to compare the performance of the different options. In a notebook, you can time a single instruction by adding %time before the line of code you want to measure:

%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Execute the same instruction with and without batched=True, then try it with the slow tokenizer (add use_fast=False to AutoTokenizer.from_pretrained()), so you can see the numbers you get on your hardware:

Options          Fast tokenizer   Slow tokenizer
batched=True     10.8s            4min 41s
batched=False    59.2s            5min 3s

This means that using the fast tokenizer with batched=True is 30 times faster than the slow tokenizer without batching, which is pretty amazing! This is the main reason why fast tokenizers are the default when using AutoTokenizer (and why they are called "fast"). They achieve this speedup because, behind the scenes, the tokenization code is executed in Rust, a language that makes it easy to parallelize execution.

Parallelization is also the reason the fast tokenizer achieves a nearly 6x speedup with batching: you can't parallelize a single tokenization operation, but when you want to tokenize lots of texts at the same time, you can split the work across several threads, each responsible for its own texts.

Dataset.map() also has some parallelization capabilities of its own. Since it isn't backed by Rust, it won't let a slow tokenizer catch up with a fast one, but it can still be useful (especially if you're using a tokenizer that doesn't have a fast version). To enable multiprocessing, use the num_proc argument to specify the number of processes to use in the Dataset.map() call:

slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

You can experiment a bit with the timings to determine the optimal number of processes to use; in our case, 8 seemed to give the best speed gain. Here are the numbers we got with and without multiprocessing:

Options                      Fast tokenizer   Slow tokenizer
batched=True                 10.8s            4min 41s
batched=False                59.2s            5min 3s
batched=True, num_proc=8     6.52s            41.3s
batched=False, num_proc=8    9.49s            45.2s
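
To find a good num_proc value on your own machine, you can time a few settings yourself. Here is a rough sketch using Python's time module (passing load_from_cache_file=False forces recomputation so every run is actually measured):

import time

for n in (1, 2, 4, 8):
    start = time.time()
    drug_dataset.map(
        slow_tokenize_function, batched=True, num_proc=n, load_from_cache_file=False
    )
    print(f"num_proc={n}: {time.time() - start:.1f}s")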

5. Splitting one example into multiple features with truncation

In machine learning, an example is usually defined as the set of features that we feed to the model. In some cases these features will be the set of columns in a Dataset, but in other cases (as here, and for question answering) multiple features can be extracted from a single example and belong to a single column. In other words, a single review may be cut into several chunks because of the maximum length set for the tokenizer.

We will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all chunks of the text (all segments), not just the first one. This can be done with return_overflowing_tokens=True:

def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Let's test this on one example before using Dataset.map() on the whole dataset:

result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

Because the maximum length is set to 128, this input is split into 2 chunks (2 features), of lengths 128 and 49:

[128, 49]
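
To see where the text was split, you can decode each chunk back to a string (a quick sketch using tokenizer.decode()):

# Decode each chunk of the first example to inspect the split point
for chunk in result["input_ids"]:
    print(tokenizer.decode(chunk)[:80], "...")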

Do this for all elements of the dataset:

tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

The following error occurs, because the batch of 1,000 examples produces 1,463 new features, so the new columns are longer than the old ones:

ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
  • writer_batch_size (int, defaults to 1000) — the number of rows per write operation for the cache file writer in Dataset.map(). This value is a good trade-off between memory usage and processing speed: higher values make the processing do fewer lookups, and lower values consume less temporary memory while map() runs.
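
One way to resolve this mismatch (a sketch, not the approach we follow below) is to drop the old columns so that only the new, longer tokenizer columns remain:

# Remove the original columns so only the (longer) tokenizer outputs are kept
tokenized_dataset = drug_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)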

Alternatively, we can handle the length mismatch by making the old columns the same size as the new ones. To do this, we need the overflow_to_sample_mapping field that the tokenizer returns when we set return_overflowing_tokens=True. It gives us a mapping from each new feature index to the index of the sample it came from. Using this, we can associate each key in the original dataset with a list of values of the correct size, by repeating each example's values as many times as it generated new features:

def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'condition', 'date', 'drugName', 'input_ids', 'patient_id', 'rating', 'review', 'review_length', 'token_type_ids', 'usefulCount'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['attention_mask', 'condition', 'date', 'drugName', 'input_ids', 'patient_id', 'rating', 'review', 'review_length', 'token_type_ids', 'usefulCount'],
        num_rows: 68876
    })
})

6. Create a validation set

Datasets provides a Dataset.train_test_split() function that is based on the well-known functionality from scikit-learn. Let's use it to split our training set into train and validation splits (we set the seed argument for reproducibility):

drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

7. Save the dataset

As shown in the table below, Datasets provides three main functions to save your datasets in different formats:

Data format Function
Arrow Dataset.save_to_disk()
CSV Dataset.to_csv()
JSON Dataset.to_json()

For example, to save in the Arrow format:

drug_dataset_clean.save_to_disk("drug-reviews")

This will create a directory with the following structure:
 

drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json

After saving the dataset, we can reload it with the load_from_disk() function:

from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is to iterate over the keys and values in the DatasetDict object:

for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")
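
Saving to CSV works the same way; here is a minimal sketch using Dataset.to_csv() (the index=False keyword is forwarded to pandas):

for split, dataset in drug_dataset_clean.items():
    dataset.to_csv(f"drug-reviews-{split}.csv", index=False)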

Load the JSON file as follows:

data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Origin blog.csdn.net/qwer123456u/article/details/130404158