Batch Training with Ray Core
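Batch training and tuning are common tasks in simple machine learning use cases such as time series forecasting. They require fitting simple models on multiple data batches, where each batch corresponds to a different data ID.
This notebook shows how to use Ray Core and stateless Ray tasks to do batch training on the NYC Taxi dataset. Batch training creates the same model for different, independent datasets or subsets of one dataset. The task is trivially parallelizable and scales easily with Ray.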
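0. Overview
This tutorial walks through the following steps:
- Reading Parquet data
- Using Ray tasks to preprocess, train and evaluate data batches
- Dividing the data into batches and spawning one Ray task per batch to run in parallel
- Optimizing the runtime with centralized data loading
We want to analyze the relationship between the dropoff location and the trip duration. This relationship differs considerably between pickup locations, so we need a separate model for each one. It also changes over time. Our task is therefore to create a separate model for each pickup location-month combination. The dataset we use is already partitioned by month (each file corresponds to one month), and we can use the pickup_location_id column to group it into data batches. We then fit models for each batch and choose the best one.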
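To make the batching concrete, here is a minimal sketch on a tiny made-up DataFrame. The values and the `month` column are illustrative assumptions (in the tutorial the month comes from the file a row lives in, not from a column). It enumerates the (month, pickup location) combinations that would each get their own model.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the NYC Taxi data.
trips = pd.DataFrame(
    {
        "month": [5, 5, 5, 6, 6],
        "pickup_location_id": [145, 145, 7, 145, 7],
        "dropoff_location_id": [10, 20, 30, 40, 50],
    }
)

# Each (month, pickup_location_id) group is one data batch, i.e. one model to fit.
batches = {
    key: group for key, group in trips.groupby(["month", "pickup_location_id"])
}

for (month, pickup_id), batch in batches.items():
    print(f"month={month} pickup={pickup_id} rows={len(batch)}")
```

Two months and two pickup locations give four batches here, so four independent fitting jobs.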
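1. Environment setup
The training environment is the RayAIR conda environment created in the earlier post "机器学习框架Ray -- 1.3 Ray Clusters与Ray AIR的基本使用" (Machine Learning Framework Ray, 1.3: Basic Use of Ray Clusters and Ray AIR). scikit-learn is also required.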
conda activate RayAIR
pip install scikit-learn
The original NYC Taxi dataset is available at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
2. Initializing Ray
from typing import Callable, Optional, List, Union, Tuple, Iterable
import time
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import pyarrow as pa
from pyarrow import fs
from pyarrow import dataset as ds
from pyarrow import parquet as pq
import pyarrow.compute as pc
Start Ray
import ray
ray.init(ignore_reinit_error=True)
For benchmarking purposes, we can print the time taken by various operations. To reduce clutter in the output, this is set to False by default.
PRINT_TIMES = False
def print_time(msg: str):
    if PRINT_TIMES:
        print(msg)
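To speed things up, we use only a small subset of the data: the last two monthly files of the dataset, both from 2019. Set SMOKE_TEST to False to use the full 2018-2019 dataset instead.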
SMOKE_TEST = True
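3. Reading the data
Read the Parquet data. The read_data function reads one Parquet file and uses a push-down predicate to extract the data batch to fit a model on, selected by the given pickup location ID. Having each task read the data and extract its own batch keeps memory use low, because no task has to load a whole file into memory first. Once loaded, the batch is converted to pandas so that it can be used for training with scikit-learn.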
def read_data(file: str, pickup_location_id: int) -> pd.DataFrame:
    return pq.read_table(
        file,
        filters=[("pickup_location_id", "=", pickup_location_id)],
        columns=[
            "pickup_at",
            "dropoff_at",
            "pickup_location_id",
            "dropoff_location_id",
        ],
    ).to_pandas()
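Creating Ray tasks to preprocess, train and evaluate data batches
Define a simple batch transformation function that sets the correct data types, computes the trip duration and fills missing values.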
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
    df["pickup_at"] = pd.to_datetime(
        df["pickup_at"], format="%Y-%m-%d %H:%M:%S"
    )
    df["dropoff_at"] = pd.to_datetime(
        df["dropoff_at"], format="%Y-%m-%d %H:%M:%S"
    )
    # Note: .dt.seconds is the seconds component only; whole days are not included.
    df["trip_duration"] = (df["dropoff_at"] - df["pickup_at"]).dt.seconds
    df["pickup_location_id"] = df["pickup_location_id"].fillna(-1)
    df["dropoff_location_id"] = df["dropoff_location_id"].fillna(-1)
    return df
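transform_batch computes the duration with `.dt.seconds`, which in pandas returns only the seconds component of each timedelta and drops whole days. For trips shorter than a day this equals the full duration. The sketch below, on made-up timestamps, contrasts it with `.dt.total_seconds()` in case longer trips matter for your data.

```python
import pandas as pd

# Made-up timestamps: one 15-minute trip and one lasting 1 day and 1 hour.
df = pd.DataFrame(
    {
        "pickup_at": pd.to_datetime(["2019-06-01 10:00:00", "2019-06-01 10:00:00"]),
        "dropoff_at": pd.to_datetime(["2019-06-01 10:15:00", "2019-06-02 11:00:00"]),
    }
)
delta = df["dropoff_at"] - df["pickup_at"]

seconds_component = delta.dt.seconds      # seconds within the day, days dropped
total_seconds = delta.dt.total_seconds()  # full duration as float seconds

print(seconds_component.tolist())  # [900, 3600]
print(total_seconds.tolist())      # [900.0, 90000.0]
```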
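We fit scikit-learn models on the data batches. Define a Ray task, fit_and_score_sklearn, that fits a model and computes the mean absolute error on the validation set.
We treat this as a simple linear regression problem: predicting the relationship between the dropoff location and the trip duration.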
# Ray task to fit and score a scikit-learn model.
@ray.remote
def fit_and_score_sklearn(
    train: pd.DataFrame, test: pd.DataFrame, model: BaseEstimator
) -> Tuple[BaseEstimator, float]:
    train_X = train[["dropoff_location_id"]]
    train_y = train["trip_duration"]
    test_X = test[["dropoff_location_id"]]
    test_y = test["trip_duration"]
    # Start training.
    model = model.fit(train_X, train_y)
    pred_y = model.predict(test_X)
    error = mean_absolute_error(test_y, pred_y)
    return model, error
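To see what the returned error means, here is a small worked example of the same fit, predict and score sequence without Ray. The data is made up so that the arithmetic can be checked by hand; it is a sketch of the metric, not of the taxi model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up training data with y = 2 * x exactly, so the linear fit is perfect.
train_X = np.array([[1.0], [2.0], [3.0]])
train_y = np.array([2.0, 4.0, 6.0])

# Made-up test targets that deliberately differ from 2 * x (8 and 10).
test_X = np.array([[4.0], [5.0]])
test_y = np.array([9.0, 9.0])

model = LinearRegression().fit(train_X, train_y)
pred_y = model.predict(test_X)  # approximately [8.0, 10.0]

# MAE = (|9 - 8| + |9 - 10|) / 2 = 1.0
error = mean_absolute_error(test_y, pred_y)
print(round(error, 6))
```

The error is in the same unit as the target, so in the tutorial an MAE is a number of seconds of trip duration.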
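Next, define the train_and_evaluate Ray task, which contains all the logic needed to load a data batch, transform it, split it into train and test sets, and fit and evaluate models on it. It returns the file name and location ID so that the fitted models can be mapped back to them. Data loading and processing use the read_data and transform_batch functions defined earlier; the helper train_and_evaluate_internal holds the split, fit and score logic.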
def train_and_evaluate_internal(
    df: pd.DataFrame, models: List[BaseEstimator], pickup_location_id: int = 0
) -> Optional[List[Tuple[BaseEstimator, float]]]:
    # We need at least 4 rows to create a train / test split.
    if len(df) < 4:
        print(
            f"Dataframe for LocID: {pickup_location_id} is empty or smaller than 4"
        )
        return None
    # Train / test split.
    train, test = train_test_split(df)
    # We put the train & test dataframes into Ray object store
    # so that they can be reused by all models fitted here.
    # https://docs.ray.io/en/master/ray-core/patterns/pass-large-arg-by-value.html
    train_ref = ray.put(train)
    test_ref = ray.put(test)
    # Launch a fit and score task for each model.
    results = ray.get(
        [
            fit_and_score_sklearn.remote(train_ref, test_ref, model)
            for model in models
        ]
    )
    results.sort(key=lambda x: x[1])  # sort by error
    return results


@ray.remote
def train_and_evaluate(
    file_name: str,
    pickup_location_id: int,
    models: List[BaseEstimator],
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    start_time = time.time()
    data = read_data(file_name, pickup_location_id)
    data_loading_time = time.time() - start_time
    print_time(
        f"Data loading time for LocID: {pickup_location_id}: {data_loading_time}"
    )
    # Perform transformation
    start_time = time.time()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(
        f"Data transform time for LocID: {pickup_location_id}: {transform_time}"
    )
    # Perform training & evaluation for each model
    start_time = time.time()
    # No trailing comma here: it would wrap the results in a 1-tuple.
    results = train_and_evaluate_internal(data, models, pickup_location_id)
    training_time = time.time() - start_time
    print_time(
        f"Training time for LocID: {pickup_location_id}: {training_time}"
    )
    return (
        file_name,
        pickup_location_id,
        results,
    )
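train_and_evaluate_internal calls train_test_split with default arguments. The sketch below shows what that default split yields at the 4-row threshold, using made-up rows. The threshold itself is the tutorial's choice; the example only illustrates the resulting sizes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "dropoff_location_id": [1, 2, 3, 4],
        "trip_duration": [60, 120, 180, 240],
    }
)

# With default arguments, train_test_split holds out 25% of the rows for testing.
# random_state is fixed here only to make the sketch reproducible.
train, test = train_test_split(df, random_state=0)
print(len(train), len(test))  # 3 1
```

With so few rows the error is computed on a single held-out trip, so MAE values for sparse pickup locations are very noisy.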
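4. Assigning batches
Divide the data into batches and spawn one Ray task per batch to run in parallel.
The run_batch_training driver function spawns tasks for each Parquet file it receives (each file corresponds to one month). It accepts a list of models so that they can all be evaluated and the best one chosen for each batch. When the function reaches ray.get(), it blocks and waits for the tasks to return their results.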
def run_batch_training(files: List[str], models: List[BaseEstimator]):
    print("Starting run...")
    start = time.time()
    # Store task references
    task_refs = []
    for file in files:
        try:
            locdf = pq.read_table(file, columns=["pickup_location_id"])
        except Exception:
            continue
        pickup_location_ids = locdf["pickup_location_id"].unique()
        for pickup_location_id in pickup_location_ids:
            # Cast PyArrow scalar to Python if needed.
            try:
                pickup_location_id = pickup_location_id.as_py()
            except Exception:
                pass
            task_refs.append(
                train_and_evaluate.remote(file, pickup_location_id, models)
            )
    # Block to obtain results from each task
    results = ray.get(task_refs)
    taken = time.time() - start
    count = len(results)
    # Each task returns (file_name, pickup_location_id, results); the last
    # element is None when there weren't enough records to train.
    results_not_none = [x for x in results if x[2] is not None]
    count_not_none = len(results_not_none)
    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(f"Number of pickup locations: {count}")
    print(
        f"Number of pickup locations with enough records to train: {count_not_none}"
    )
    print(f"Number of models trained: {count_not_none * len(models)}")
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results
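5. Running batch training
Fetch the partitions of the dataset from an S3 bucket so that we can pass them to the run function. The dataset is partitioned by year and month, so each file represents one month.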
# Obtain the dataset. Each month is a separate file.
dataset = ds.dataset(
    "s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/",
    partitioning=["year", "month"],
)
starting_idx = -2 if SMOKE_TEST else 0
files = [f"s3://anonymous@{file}" for file in dataset.files][starting_idx:]
print(f"Obtained {len(files)} files!")
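Optimizing runtime with centralized data loading
To make sure the data always fits in memory, each task so far reads its file independently and extracts only the batch it needs. This hurts the runtime, however. If the Ray cluster has enough memory, we can instead load each partition once, extract the batches, and keep them in the Ray object store, which greatly reduces the time needed at the cost of higher memory use. In other words, we use the Ray object store for centralized data loading instead of distributed data loading.
Note that we do not call ray.get() on the references produced by read_into_object_store. Instead, we pass the references themselves as arguments to train_and_evaluate.remote, so the data stays in the object store until it is actually needed. This avoids loading all the data into the memory of the process that calls ray.get().
The following code uses Ray, a distributed computing library, to read a set of Parquet files from an S3 bucket. The files contain NYC taxi trip data. The goal is to train a model for each pickup location (pickup_location_id) using a given list of models (here only a linear regression model). The code consists of the following parts:
- The train_and_evaluate function transforms the features of the given data, trains the models and evaluates them.
- The read_into_object_store function reads the data from the S3 bucket and stores it in the Ray object store for distributed training.
- The run_batch_training_with_object_store function holds the main logic that drives the distributed training.
- The results variable holds the training and evaluation results, with one entry per pickup location (pickup_location_id).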
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
# Redefine the train_and_evaluate task to use in-memory data.
# We still keep file_name and pickup_location_id for identification purposes.
@ray.remote
def train_and_evaluate(
    pickup_location_id_and_data: Tuple[int, pd.DataFrame],
    file_name: str,
    models: List[BaseEstimator],
) -> Tuple[str, int, Optional[List[Tuple[BaseEstimator, float]]]]:
    pickup_location_id, data = pickup_location_id_and_data
    # Perform transformation
    start_time = time.time()
    # The underlying numpy arrays are stored in the Ray object
    # store for efficient access, making them immutable. We therefore
    # copy the DataFrame to obtain a mutable copy we can transform.
    data = data.copy()
    data = transform_batch(data)
    transform_time = time.time() - start_time
    print_time(
        f"Data transform time for LocID: {pickup_location_id}: {transform_time}"
    )
    return (
        file_name,
        pickup_location_id,
        train_and_evaluate_internal(data, models, pickup_location_id),
    )
# This allows us to create a Ray Task that is also a generator, returning object references.
@ray.remote(num_returns="dynamic")
def read_into_object_store(file: str) -> ray.ObjectRefGenerator:
    print(f"Loading {file}")
    # Read the entire file into memory.
    try:
        locdf = pq.read_table(
            file,
            columns=[
                "pickup_at",
                "dropoff_at",
                "pickup_location_id",
                "dropoff_location_id",
            ],
        )
    except Exception:
        return []
    pickup_location_ids = locdf["pickup_location_id"].unique()
    for pickup_location_id in pickup_location_ids:
        # Each id-data batch tuple will be put as a separate object into the Ray object store.
        # Cast PyArrow scalar to Python if needed.
        try:
            pickup_location_id = pickup_location_id.as_py()
        except Exception:
            pass
        yield (
            pickup_location_id,
            locdf.filter(
                pc.equal(locdf["pickup_location_id"], pickup_location_id)
            ).to_pandas(),
        )
def run_batch_training_with_object_store(
    files: List[str], models: List[BaseEstimator]
):
    print("Starting run...")
    start = time.time()
    # Store task references
    task_refs = []
    # Use a SPREAD scheduling strategy to load each
    # file on a separate node as an OOM safeguard.
    # This is not foolproof though! We can also specify a resource
    # requirement for memory, if we know what is the maximum
    # memory requirement for a single file.
    read_into_object_store_spread = read_into_object_store.options(
        scheduling_strategy="SPREAD"
    )
    # Dictionary of references to read tasks with file names as keys
    read_tasks_by_file = {
        file: read_into_object_store_spread.remote(file) for file in files
    }
    for file, read_task_ref in read_tasks_by_file.items():
        # We iterate over references and pass them to the tasks directly
        for pickup_location_id_and_data_batch_ref in iter(ray.get(read_task_ref)):
            task_refs.append(
                train_and_evaluate.remote(
                    pickup_location_id_and_data_batch_ref, file, models
                )
            )
    # Block to obtain results from each task
    results = ray.get(task_refs)
    taken = time.time() - start
    count = len(results)
    # Each task returns (file_name, pickup_location_id, results); the last
    # element is None when there weren't enough records to train.
    results_not_none = [x for x in results if x[2] is not None]
    count_not_none = len(results_not_none)
    # Sleep a moment for nicer output
    time.sleep(1)
    print("", flush=True)
    print(f"Number of pickup locations: {count}")
    print(
        f"Number of pickup locations with enough records to train: {count_not_none}"
    )
    print(f"Number of models trained: {count_not_none * len(models)}")
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    return results
6. Launching the computation
results = run_batch_training_with_object_store(
    files, models=[LinearRegression()]
)
print(results[:10])
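Training output with the linear regression model (log reproduced as recorded). At least three locations were skipped for having fewer than 4 rows, so the count of locations "with enough records" in this log should be read as the number of task results returned: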
Starting run...
(read_into_object_store pid=327120) Loading s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/06/data.parquet/ab5b9d2b8cc94be19346e260b543ec35_000000.parquet
(read_into_object_store pid=327121) Loading s3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet
(train_and_evaluate pid=327115) Dataframe for LocID: 214 is empty or smaller than 4
(train_and_evaluate pid=330272) Dataframe for LocID: 176 is empty or smaller than 4
(train_and_evaluate pid=330234) Dataframe for LocID: 204 is empty or smaller than 4
Number of pickup locations: 522
Number of pickup locations with enough records to train: 522
Number of models trained: 522
TOTAL TIME TAKEN: 93.02 seconds
Interpreting the results
For example, take the following result entry:
's3://anonymous@air-example-data/ursa-labs-taxi-data/by_year/2019/05/data.parquet/359c21b3e28f40328e68cf66f7ba40e2_000000.parquet', 145, [(LinearRegression(), 930.7871354247037)]
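Here 145 and 930.7871354247037 mean the following:
- 145 is the pickup location ID (pickup_location_id). Each ID is unique and identifies a specific zone of New York City, so this entry is the training result for pickup location 145.
- 930.7871354247037 is the model's score on the validation set, measured as Mean Absolute Error (MAE). MAE is a common metric for regression tasks: the average absolute difference between predicted and true values, here in seconds of trip duration.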
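Once several models are passed in, the best one per batch can be picked from the returned list. The sketch below uses hypothetical results in the same (file_name, pickup_location_id, model_results) shape; file names, model names (plain strings here) and error values are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical results; strings stand in for fitted models, errors are made up.
Result = Tuple[str, int, Optional[List[Tuple[str, float]]]]
example_results: List[Result] = [
    ("month_a.parquet", 1, [("LinearRegression", 800.0), ("DecisionTreeRegressor", 950.0)]),
    ("month_a.parquet", 2, None),  # skipped: too few rows to train
    ("month_b.parquet", 1, [("DecisionTreeRegressor", 700.0), ("LinearRegression", 720.0)]),
]

best_by_batch: Dict[Tuple[str, int], Tuple[str, float]] = {}
for file_name, pickup_location_id, model_results in example_results:
    if not model_results:
        continue  # nothing was trained for this batch
    # Pick the entry with the lowest error (MAE) for this batch.
    best_by_batch[(file_name, pickup_location_id)] = min(
        model_results, key=lambda pair: pair[1]
    )

print(best_by_batch)
```

Keying the dictionary by (file_name, pickup_location_id) keeps one winner per month-location combination, which matches how the batches were formed.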