Data set loading and data preprocessing

Author: Zen and the Art of Computer Programming

1. Introduction

Data set loading and preprocessing have long been central topics in machine learning. Whether the task is supervised or unsupervised, raw data must first be converted into a form that a learning algorithm can consume, such as feature vectors, label vectors, or other structured representations. This article introduces this stage of the pipeline in detail.

First, data sets usually come from a variety of sources, such as files, databases, and network interfaces. The raw data must then go through a series of preprocessing steps, such as cleaning, conversion, standardization, and normalization, to produce a training set that is easier to work with. Along the way, data partitioning, stratified sampling, missing value completion, and feature engineering also need to be considered.

In addition, deep learning frameworks provide many convenient utility functions and classes for data loading and preprocessing, for example:

  • DataLoader: loads a data set in batches and can fetch data asynchronously with multiple worker processes or threads.
  • Dataset: a custom data set class; inherit from the base class to implement your own loading logic (a minimal sketch follows this list).
  • DataAugmentation: data augmentation tools used to expand the training set.
  • FeatureExtractor: feature extraction tools used to turn images, text, and other data into feature vectors.
  • LabelEncoder: a label encoder used to convert categorical variables into integer form.
  • Tokenizer: segments text data and converts text sequences into vector representations.
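As an illustration of the Dataset/DataLoader pattern, here is a minimal PyTorch-style sketch. It assumes PyTorch is available, and the tab-separated "label<TAB>text" file format is purely hypothetical:

from torch.utils.data import Dataset, DataLoader

class TextLineDataset(Dataset):
    """Minimal custom Dataset: each item is one line of a text file (hypothetical format)."""
    def __init__(self, file_path):
        with open(file_path, encoding='utf-8') as f:
            self.lines = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Parse one raw line; real parsing depends on the actual data format.
        label, text = self.lines[idx].split('\t', 1)
        return text, int(label)

# num_workers > 0 lets the DataLoader fetch batches asynchronously in worker processes.
loader = DataLoader(TextLineDataset('data/train.txt'), batch_size=32,
                    shuffle=True, num_workers=2)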

This article mainly discusses the principles and usage of the above technologies.

2. Dataset loading

2.1 File path matching

During data set loading, the simplest case is reading files from a local directory. Just specify the list of file paths directly, for example:

file_list = ['data/train.txt', 'data/test.txt']  # assuming these files already exist

2.2 Folder path matching

Another approach is to treat all files inside a folder as part of the data set. In this case, you can use the walk() function of the os module to traverse the directory recursively and collect every file path, then decide, according to your needs, whether to keep the folder name or other information (such as the file name or a category label). For example:

import os

def find_files(path):
    """Recursively collect all file paths under the given directory."""
    files = []
    for root, dirs, file_names in os.walk(path):
        # os.walk descends into the subdirectories listed in dirs automatically.
        for name in file_names:
            files.append(os.path.join(root, name))
    return files

file_list = find_files('/path/to/dataset')
print(len(file_list), 'files found.')

2.3 HDF5 format

HDF5 is a general-purpose, disk-based storage format suitable for many kinds of data, including multi-dimensional arrays, tables, images, and metadata. For machine learning data sets with a complex organization, it is more efficient than plain text formats and makes operations such as slicing and compressing the data set easy.

When the data set is large, the HDF5 format can be very advantageous. You can convert the data set to HDF5 once and then read back only what you need, which saves memory. With pandas, a DataFrame is written with its to_hdf() method and read back with the read_hdf() function (both rely on the PyTables package).

Here is an example:

import pandas as pd

# Save the data set in HDF5 format
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df.to_hdf('my_dataset.h5', key='df', mode='w')

# Read the data set back from HDF5 format
df = pd.read_hdf('my_dataset.h5', key='df')

The above example shows how to save and read a dataset in HDF5 format.
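Beyond pandas, the h5py library (an assumption here, not mentioned above) makes it easy to exploit HDF5 slicing and compression, since only the requested rows are read from disk. A minimal sketch with a hypothetical dataset name and shape:

import numpy as np
import h5py

# Write a compressed (gzip) dataset to an HDF5 file.
with h5py.File('features.h5', 'w') as f:
    f.create_dataset('X', data=np.random.rand(10000, 128), compression='gzip')

# Read only a slice; h5py loads just these 256 rows from disk.
with h5py.File('features.h5', 'r') as f:
    batch = f['X'][:256]

print(batch.shape)  # (256, 128)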

3. Data preprocessing

Data preprocessing refers to operations such as cleaning, conversion, standardization, and normalization applied to raw data so that the data becomes easier to understand, analyze, and process. This section briefly introduces some common preprocessing techniques.

3.1 Data partitioning

Data partitioning means dividing the original data set into a training set, a validation set, and a test set in a certain proportion. A common choice is a 7:1:2 split, i.e., 70% training, 10% validation, and 20% test. This supports stable model training as well as more reliable hyperparameter tuning and model comparison. In addition, the training set can be further divided into multiple subsets for different purposes, such as k-fold cross-validation or transfer learning.

There are two methods of data partitioning:

  1. Random partitioning: the simplest method is to shuffle the rows of the data set randomly, then take the first 70% as the training set, the next 10% as the validation set, and the remaining 20% as the test set (a sketch follows this list).
  2. Partitioning by attribute: the data set can also be split into subsets according to a particular feature (such as a timestamp) or the target variable. For example, it can be split by year, with each year forming a subset, or by user ID, with each user forming a subset. This helps keep the distribution of each subset as balanced as possible and prevents some subsets from being heavily biased toward one category.
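As a sketch of random partitioning with the 7:1:2 ratio mentioned above (the sample count of 1000 is hypothetical), one possible NumPy implementation:

import numpy as np

def random_split(n_samples, train=0.7, val=0.1, seed=42):
    """Shuffle sample indices and split them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(train * n_samples)
    n_val = int(val * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200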

3.2 Stratified sampling

Stratified sampling is a commonly used technique for ensuring that enough samples of every category end up in each split. Ideally the splits would preserve the class distribution of the full data set, but real-world data is often imbalanced, and a purely random split can leave rare classes under-represented; stratified sampling helps address this.

The basic idea of stratified sampling is to first group the original data set into several subsets, each containing samples of a single category, and then perform ordinary random sampling within each subset so that every category is represented in a sufficient proportion. This keeps the class distribution of the training set and the validation set similar. There are many stratified sampling methods; common ones are listed below (a scikit-learn sketch of a stratified split follows the list):

  1. Uneven stratified sampling: a classic strategy. The original data set is first divided into K subsets in proportion, and each subset is then sampled with uniform probability so that every category reaches a large enough share. The advantage is that no skew is introduced between the generated subsets; the disadvantage is that the individual subsets cannot independently reflect the true data distribution.
  2. Consistent stratified sampling: a more involved method. Each category in the original data set is sampled so that its proportion in each of the K subsets matches its proportion in the original data set. This avoids the problems of uneven stratified sampling, but because every subset has the same composition, a single subset does not reveal how the data distribution varies.
  3. Monotone stratified sampling: a newer method. The original data set is first divided into K subsets in proportion, and the samples of each category are then sorted and distributed so that the per-class counts across the subsets differ by at most 1.
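A simple way to obtain a stratified split in practice is scikit-learn's train_test_split with the stratify argument (scikit-learn is an assumption; the toy labels below are hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)   # imbalanced labels: 80% class 0, 20% class 1

# stratify=y keeps the class proportions (roughly) identical in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(np.bincount(y_train))  # [12  3]
print(np.bincount(y_val))    # [4 1]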

3.3 Missing value completion

Data sets often contain missing values. In that case, missing value completion (imputation) can be used to fill them in. There are three common approaches:

  1. Delete samples with missing values: the simplest and most direct method, but if many samples are affected, dropping them reduces the size and quality of the data set.
  2. Mode filling: for categorical variables or variables with few outliers, missing values can be filled with the mode, i.e., the value that occurs most frequently.
  3. Interpolation: for continuous variables, missing values can be completed by interpolation, for example linear interpolation or nearest-neighbor interpolation (a pandas sketch follows this list).
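Here is a small pandas sketch covering the last two approaches (the toy DataFrame is hypothetical; dropping rows with missing values would simply be df.dropna()):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, 31.0],
                   'city': ['NY', 'LA', None, 'NY']})

# Mode filling for the categorical column.
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Linear interpolation for the continuous column.
df['age'] = df['age'].interpolate()

print(df)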

3.4 Feature Engineering

Feature engineering refers to extracting effective features from raw data to serve as input to a machine learning model. Commonly used feature engineering techniques include the following:

  1. Combined features: combining multiple low-dimensional variables into new variables can enrich the representation and improve model performance, for example the sum, product, difference, or ratio of two variables.
  2. Polynomial features: numerical variables can be expanded into several derived features through a set of polynomial functions, for example the square or cube of a variable.
  3. Interaction features: combining different features forms a richer feature space, for example the product, quotient, or remainder of two variables (see the sketch after this list).
  4. Text features: for text data, effective features can be extracted through statistics, classification, clustering, and other means, for example word frequency, TF-IDF, keywords, or topic models.
  5. Image features: for image data, effective features can be extracted with different feature extraction methods, for example edges, textures, or HOG descriptors.
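As one concrete example, scikit-learn's PolynomialFeatures generates both polynomial and interaction terms (scikit-learn and the toy matrix are assumptions; get_feature_names_out requires a reasonably recent version):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [1.0, 5.0]])

# degree=2 adds the squares of each column and their pairwise product (an interaction term).
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_poly)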

3.5 Normalization and standardization

Normalization and standardization are both important data preprocessing steps used to map features onto a common scale or range. Common normalization methods include min-max scaling (e.g. MinMaxScaler); common standardization methods include Z-score scaling (e.g. StandardScaler).

The difference between the two is that standardization transforms the data to have zero mean and unit standard deviation, while normalization only rescales values into a specific interval (typically [0, 1]) and does not change the shape of the distribution. In practice, several methods can be combined to achieve better results.
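A short scikit-learn sketch of both operations (the toy column is hypothetical):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

# Min-max normalization: rescales the feature to the [0, 1] interval.
print(MinMaxScaler().fit_transform(X).ravel())

# Z-score standardization: zero mean and unit standard deviation.
print(StandardScaler().fit_transform(X).ravel())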
