Batch Training with Ray Core

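Batch training and tuning are common tasks in simple machine learning use cases such as time series forecasting. They require fitting simple models on multiple data batches, where each batch corresponds to a distinct ID in the data.

This notebook shows how to use Ray Core and stateless Ray tasks for batch training on the NYC Taxi dataset. Batch training creates the same model for different, independent datasets or for subsets of a single dataset. The work is trivially parallelizable and scales easily with Ray.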

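0. Overview

This tutorial walks through the following steps:

Reading Parquet data
Using Ray tasks to preprocess, train on, and evaluate data batches
Splitting the data into batches and spawning one Ray task per batch to run in parallel
Optimizing the runtime with centralized data loading

We want to analyze the relationship between the dropoff location and the trip duration. This relationship differs considerably between pickup locations, so we need a separate model for each location. The relationship also changes over time. Our task is therefore to create a separate model for every combination of pickup location and month. The dataset is already partitioned by month (each file corresponds to one month), and we can use the pickup location ID column to group each file into data batches. We then fit the models on each batch and select the best one.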
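The idea of one model per batch can be sketched without Ray. The following is a minimal, sequential illustration on synthetic data (the values and the 60-seconds-per-ID relationship are made up, and LinearRegression is just a stand-in model); it is not the tutorial's pipeline, only the grouping pattern that the later sections parallelize.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the taxi data; all values are made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "month": rng.choice([5, 6], size=200),
        "pickup_location_id": rng.choice([1, 2, 3], size=200),
        "dropoff_location_id": rng.integers(1, 50, size=200),
    }
)
df["trip_duration"] = 60 * df["dropoff_location_id"] + rng.normal(0, 30, size=200)

# Fit one independent model per (month, pickup location) batch.
models = {}
for (month, loc_id), batch in df.groupby(["month", "pickup_location_id"]):
    model = LinearRegression()
    model.fit(batch[["dropoff_location_id"]], batch["trip_duration"])
    models[(month, loc_id)] = model

print(f"Fitted {len(models)} models")
```

Because no batch depends on any other, each loop iteration could run as its own task, which is what makes the problem a good fit for Ray.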

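1. Environment Setup

The training environment is the RayAIR environment created in "Machine Learning Framework Ray -- 1.3 Basic Usage of Ray Clusters and Ray AIR". The scikit-learn library is also required.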

conda activate RayAIR
pip install scikit-learn

The original NYC Taxi dataset is available at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

2. Initializing Ray

from typing import Callable, Optional, List, Union, Tuple, Iterable
import time
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import pyarrow as pa
from pyarrow import fs
from pyarrow import dataset as ds
from pyarrow import parquet as pq
import pyarrow.compute as pc

Start Ray:

import ray
ray.init(ignore_reinit_error=True)

For benchmarking purposes, we can print the time taken by the various operations. To reduce clutter in the output, this is turned off by default.

PRINT_TIMES = False
def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)

To speed things up, only a small subset of the dataset is used: the last two monthly files from 2019. Set the SMOKE_TEST variable to False to use the full 2018-2019 dataset.

SMOKE_TEST = True

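3. Reading the Data

Read the Parquet data. The read_data function reads a Parquet file and uses a push-down predicate to extract the data batch to fit a model on, based on the provided index (here, the pickup location ID). Because each task reads the data and extracts its own batch, no task has to load an entire file into memory first. Once loaded, the batch is converted to pandas so that it can be used for training with scikit-learn.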

def read_data(file: str, pickup_location_id: int) -> pd.DataFrame:
    return pq.read_table(
        file,
        filters=[("pickup_location_id", "=", pickup_location_id)],
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
        ],
    ).to_pandas()

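Creating Ray tasks to preprocess, train on, and evaluate data batches

Define a simple batch transformation function that sets the correct data types, computes the trip duration, and fills in missing values.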

def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
    df["pickup_at"] = pd.to_datetime(
        df["pickup_at"], format="%Y-%m-%d %H:%M:%S"
    )
    df["dropoff_at"] = pd.to_datetime(
        df["dropoff_at"], format="%Y-%m-%d %H:%M:%S"
    )
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df["pickup_location_id"] = df["pickup_location_id"].fillna(-1)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df
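One detail of transform_batch worth knowing: .dt.seconds returns only the seconds component of each timedelta, so any trip lasting a day or longer would wrap around. Whether such trips occur in this dataset is not stated here; the sketch below only illustrates the difference from .dt.total_seconds() on two made-up trips.

```python
import pandas as pd

# Two made-up trips: a 15-minute one and one lasting one day and 10 minutes.
df = pd.DataFrame(
    {
        "pickup_at": pd.to_datetime(["2019-06-01 08:00:00", "2019-06-01 23:50:00"]),
        "dropoff_at": pd.to_datetime(["2019-06-01 08:15:00", "2019-06-03 00:00:00"]),
    }
)
delta = df["dropoff_at"] - df["pickup_at"]

# .dt.seconds is only the seconds component (0 to 86399) of each timedelta.
seconds_component = delta.dt.seconds.tolist()

# .dt.total_seconds() is the full elapsed time, including whole days.
total_seconds = delta.dt.total_seconds().tolist()

print(seconds_component, total_seconds)
```

For trips shorter than 24 hours the two agree, so the choice only matters if long or erroneous records are present.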

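We will fit scikit-learn models on the data batches. Define a Ray task, fit_and_score_sklearn, that fits a model and computes the mean absolute error on the validation set.

We treat this as a simple linear regression problem that models the relationship between the dropoff location and the trip duration.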

# Ray task to fit and score a scikit-learn model.
@ray.remote
def fit_and_score_sklearn(
    train: pd.DataFrame, test: pd.DataFrame, model: BaseEstimator
) -> Tuple[BaseEstimator, float]:
    train_X = train[["dropoff_location_id"]]
    train_y = train["trip_duration"]
    test_X = test[["dropoff_location_id"]]
    test_y = test["trip_duration"]

    # Start training.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    error = mean_absolute_error(test_y, pred_y)
    return model, error
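The mean absolute error is the average of the absolute differences between true and predicted values. As a small worked example on made-up durations, the manual computation and scikit-learn's mean_absolute_error give the same number:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up trip durations in seconds.
y_true = np.array([600.0, 900.0, 1200.0])
y_pred = np.array([660.0, 900.0, 1050.0])

# Absolute errors are 60, 0 and 150, so the mean is 210 / 3.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)
print(manual_mae, sklearn_mae)
```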

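Next, define a train_and_evaluate Ray task that contains all the logic needed to load a data batch, transform it, split it into training and test sets, and fit and evaluate models on it. The task returns the file name and location ID so that the fitted models can be mapped back to them. Data loading and processing use the read_data and transform_batch functions defined earlier.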

def train_and_evaluate_internal(
    df: pd.DataFrame, models: List[BaseEstimator], pickup_location_id: int = 0
) -> Optional[List[Tuple[BaseEstimator, float]]]:
    # We need at least 4 rows to create a train / test split.
    if len(df) < 4:
        print(
            f"Dataframe for LocID: {pickup_location_id} is empty or smaller than 4"
        )
        return None

    # Train / test split.
    train, test = train_test_split(df)

    # We put the train & test dataframes into Ray object store
    # so that they can be reused by all models fitted here.
    # https://docs.ray.io/en/master/ray-core/patterns/pass-large-arg-by-value.html
    train_ref = ray.put(train)
    test_ref = ray.put(test)

    # Launch a fit and score task for each model.
    results = ray.get(
        [
            fit_and_score_sklearn.remote(train_ref, test_ref, model)
            for model in models
        ]
    )
    results.sort(key=lambda x: x[1])  # sort by error
    return results


@ray.remote
def train_and_evaluate(
    file_name: str,
    pickup_location_id: int,
    models: List[BaseEstimator],
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    start_time = time.time()
    data = read_data(file_name, pickup_location_id)
    data_loading_time = time.time() - start_time
    print_time(
        f"Data loading time for LocID: {pickup_location_id}: {data_loading_time}"
    )

    # Perform transformation
    start_time = time.time()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(
        f"Data transform time for LocID: {pickup_location_id}: {transform_time}"
    )

    # Perform training & evaluation for each model
    start_time = time.time()
    results = train_and_evaluate_internal(data, models, pickup_location_id)
    training_time = time.time() - start_time
    print_time(
        f"Training time for LocID: {pickup_location_id}: {training_time}"
    )

    return (
        file_name,
        pickup_location_id,
        results,
    )

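4. Dispatching Batches

Split the data into batches and spawn one Ray task per batch to run in parallel.

The run_batch_training driver function spawns tasks for each Parquet file it receives (each file corresponds to one month), one task per pickup location found in the file. The function accepts a list of models so that they can be evaluated and the best one selected for each batch. When the function reaches ray.get(), it blocks until the tasks return their results.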

def run_batch_training(files: List[str], models: List[BaseEstimator]):
    print("Starting run...")
    start = time.time()

    # Store task references
    task_refs = []
    for file in files:
        try:
            locdf = pq.read_table(file, columns=["pickup_location_id"])
        except Exception:
            continue
        pickup_location_ids = locdf["pickup_location_id"].unique()

        for pickup_location_id in pickup_location_ids:
            # Cast PyArrow scalar to Python if needed.
            try:
                pickup_location_id = pickup_location_id.as_py()
            except Exception:
                pass
            task_refs.append(
                train_and_evaluate.remote(file, pickup_location_id, models)
            )

    # Block to obtain results from each task
    results = ray.get(task_refs)

    taken = time.time() - start
    count = len(results)
    # If result is None, then it means there weren't enough records to train
    results_not_none = [x for x in results if x is not None]
    count_not_none = len(results_not_none)

    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(f"Number of pickup locations: {count}")
    print(
        f"Number of pickup locations with enough records to train: {count_not_none}"
    )
    print(f"Number of models trained: {count_not_none * len(models)}")
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results

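5. Starting Batch Training

Obtain the partitions of the dataset from the S3 bucket so that we can pass them to run_batch_training. The dataset is partitioned by year and month, so each file represents one month.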

# Obtain the dataset. Each month is a separate file.
dataset = ds.dataset(
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/",
    partitioning=["year", "month"],
)
starting_idx = -2 if SMOKE_TEST else 0
files = [f"s3://anonymous@{file}" for file in dataset.files][starting_idx:]

print(f"Obtained {len(files)} files!")

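Optimizing runtime with centralized data loading

To make sure the data always fits in memory, each task so far has read its file independently and extracted only the batch it needs. This hurts the runtime, however. If the Ray cluster has enough memory, we can instead load each partition once, extract the batches, and keep them in the Ray object store, which greatly reduces the time required at the cost of higher memory usage. In other words, we use the Ray object store to perform centralized data loading instead of distributed data loading.

Note that we do not call ray.get() on the batch references produced by read_into_object_store. Instead, we pass the references themselves as arguments to train_and_evaluate.remote, so the data stays in the object store until it is actually needed. This avoids loading all of the data into the memory of the process that calls ray.get().

The following code uses Ray to read a set of Parquet files from an S3 bucket. The files contain NYC taxi trip data. The goal is to train, for each pickup location (pickup_location_id), every model in the given list (here, only a linear regression model). The code consists of the following parts:

The train_and_evaluate function, which transforms the given data, trains the models, and evaluates their performance.
The read_into_object_store function, which reads the data from the S3 bucket and saves it to the Ray object store for distributed training.
The run_batch_training_with_object_store function, which contains the main logic for running the distributed training.
The results variable, which holds the training and evaluation results for each pickup location (pickup_location_id).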

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Redefine the train_and_evaluate task to use in-memory data.
# We still keep file_name and pickup_location_id for identification purposes.
@ray.remote
def train_and_evaluate(
    pickup_location_id_and_data: Tuple[int, pd.DataFrame],
    file_name: str,
    models: List[BaseEstimator],
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    pickup_location_id, data = pickup_location_id_and_data
    # Perform transformation
    start_time = time.time()
    # The underlying numpy arrays are stored in the Ray object
    # store for efficient access, making them immutable. We therefore
    # copy the DataFrame to obtain a mutable copy we can transform.
    data = data.copy()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(
        f"Data transform time for LocID: {pickup_location_id}: {transform_time}"
    )

    return (
        file_name,
        pickup_location_id,
        train_and_evaluate_internal(data, models, pickup_location_id),
    )


# This allows us to create a Ray Task that is also a generator, returning object references.
@ray.remote(num_returns="dynamic")
def read_into_object_store(file: str) -> ray.ObjectRefGenerator:
    print(f"Loading {file}")
    # Read the entire file into memory.
    try:
        locdf = pq.read_table(
            file,
            columns=[
                "pickup_at",
                "dropoff_at",
                "pickup_location_id",
                "dropoff_location_id",
            ],
        )
    except Exception:
        return []

    pickup_location_ids = locdf["pickup_location_id"].unique()

    for pickup_location_id in pickup_location_ids:
        # Each id-data batch tuple will be put as a separate object into the Ray object store.

        # Cast PyArrow scalar to Python if needed.
        try:
            pickup_location_id = pickup_location_id.as_py()
        except Exception:
            pass

        yield (
            pickup_location_id,
            locdf.filter(
                pc.equal(locdf["pickup_location_id"], pickup_location_id)
            ).to_pandas(),
        )


def run_batch_training_with_object_store(
    files: List[str], models: List[BaseEstimator]
):
    print("Starting run...")
    start = time.time()

    # Store task references
    task_refs = []

    # Use a SPREAD scheduling strategy to load each
    # file on a separate node as an OOM safeguard.
    # This is not foolproof though! We can also specify a resource
    # requirement for memory, if we know what is the maximum
    # memory requirement for a single file.
    read_into_object_store_spread = read_into_object_store.options(
        scheduling_strategy="SPREAD"
    )

    # Dictionary of references to read tasks with file names as keys
    read_tasks_by_file = {
        files[file_id]: read_into_object_store_spread.remote(file)
        for file_id, file in enumerate(files)
    }

    for file, read_task_ref in read_tasks_by_file.items():
        # We iterate over references and pass them to the tasks directly
        for pickup_location_id_and_data_batch_ref in iter(ray.get(read_task_ref)):
            task_refs.append(
                train_and_evaluate.remote(
                    pickup_location_id_and_data_batch_ref, file, models
                )
            )

    # Block to obtain results from each task
    results = ray.get(task_refs)

    taken = time.time() - start
    count = len(results)
    # If result is None, then it means there weren't enough records to train
    results_not_none = [x for x in results if x is not None]
    count_not_none = len(results_not_none)

    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(f"Number of pickup locations: {count}")
    print(
        f"Number of pickup locations with enough records to train: {count_not_none}"
    )
    print(f"Number of models trained: {count_not_none * len(models)}")
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results

6. Running the Computation

results = run_batch_training_with_object_store(
    files, models=[LinearRegression()]
)
print(results[:10])

Training output for the linear regression model:

Starting run...
(read_into_object_store pid=327120) Loading s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet
(read_into_object_store pid=327121) Loading s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet
(train_and_evaluate pid=327115) Dataframe for LocID: 214 is empty or smaller than 4
(train_and_evaluate pid=330272) Dataframe for LocID: 176 is empty or smaller than 4
(train_and_evaluate pid=330234) Dataframe for LocID: 204 is empty or smaller than 4

Number of pickup locations: 522
Number of pickup locations with enough records to train: 522
Number of models trained: 522
TOTAL TIME TAKEN: 93.02 seconds

Interpreting the results

For example, in the following result entry,

's3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, [(LinearRegression(), 930.7871354247037)]

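the values 145 and 930.7871354247037 mean the following:

145 is the pickup location ID (pickup_location_id). Each ID is unique and identifies a specific zone of New York City. This entry is the training result for pickup location 145.

930.7871354247037 is the model's score on the validation set. The metric is the mean absolute error (MAE), a common regression metric that measures the average absolute difference between predicted and true values. Because trip_duration is computed in seconds, this corresponds to an average error of roughly 931 seconds.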
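When more than one model is passed in, each entry holds several (model, error) pairs, and the best model for a batch is the one with the lowest MAE. A minimal sketch of that selection, assuming each result has the shape (file name, pickup location ID, list of pairs or None); the helper best_model_per_batch and the sample values are hypothetical, with strings standing in for fitted models.

```python
from typing import Any, Dict, List, Optional, Tuple

Result = Tuple[str, int, Optional[List[Tuple[Any, float]]]]


def best_model_per_batch(
    results: List[Result],
) -> Dict[Tuple[str, int], Tuple[Any, float]]:
    """Map (file name, pickup location ID) to the lowest-error (model, error) pair."""
    best = {}
    for file_name, loc_id, scored in results:
        if not scored:
            # None or empty: the batch had too few rows to train on.
            continue
        best[(file_name, loc_id)] = min(scored, key=lambda pair: pair[1])
    return best


# Made-up results for illustration only.
sample_results = [
    ("2019/05/data.parquet", 145, [("linear", 120.0), ("tree", 95.5)]),
    ("2019/05/data.parquet", 214, None),
]
best = best_model_per_batch(sample_results)
print(best)
```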

Machine Learning Framework Ray -- 2.6 Batch Training with Ray Core