1. Overview
AutoML has been showing up everywhere in recent machine learning work and Kaggle competitions, and it is clearly a trend: automated machine learning provides methods and workflows that make machine learning usable by non-experts, improve its efficiency, and accelerate research.
FLAML is a new, efficient, lightweight automated machine learning framework recently released by Microsoft.
FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently, and economically. It frees users from selecting learners and hyperparameters for each learner.
For common tasks such as classification and regression, it can quickly find quality models for user-provided data at low computational cost. It supports both classical machine learning models and deep neural networks.
It is easy to customize or extend. Users can pick the degree of customization they want from a smooth range: minimal customization (just a computational resource budget), medium customization (e.g., scikit-learn-style learners, search spaces, and metrics), or full customization (arbitrary training and evaluation code).
It supports fast automatic tuning and can handle complex constraints, guidance, and early stopping. FLAML is powered by a new, cost-effective hyperparameter optimization and learner selection method invented by Microsoft Research.
FLAML also has a .NET implementation, used by ML.NET Model Builder in Visual Studio 2022.
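The "cost-effective" idea can be illustrated, very loosely, with a toy greedy search that starts from a cheap configuration and only moves to neighbors that improve the loss. This is not FLAML's actual CFO algorithm, just a sketch of the low-cost-first principle; the loss function and all numbers below are made up.

```python
import random

def toy_cost_frugal_search(loss, low_cost_init, neighbors, iters=200, seed=0):
    # Toy sketch: start from the cheapest configuration and accept a
    # neighbor only when it improves the loss. NOT FLAML's real CFO.
    rng = random.Random(seed)
    best, best_loss = low_cost_init, loss(low_cost_init)
    for _ in range(iters):
        cand = neighbors(best, rng)
        cand_loss = loss(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss

# Hypothetical 1-D "hyperparameter": number of trees.
# Error shrinks toward n = 120, but every tree adds training cost.
loss = lambda n: (n - 120) ** 2 / 1e4 + 0.01 * n
neighbors = lambda n, rng: max(1, n + rng.choice([-8, -4, 4, 8]))
best_n, best_loss = toy_cost_frugal_search(loss, low_cost_init=4, neighbors=neighbors)
print(best_n, best_loss)
```

Because the search starts at the cheap end, the expensive configurations are only evaluated when the cheaper ones stop improving.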
2. Installation and basic usage
2.1 Installation
pip install flaml
2.2 Quick start
FLAML advertises that training takes just three lines of code, in a style consistent with scikit-learn.
from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
You can also restrict the search to fast hyperparameter tuning of specific learners such as XGBoost, LightGBM, or random forests, or plug in a custom learner.
automl.fit(X_train, y_train, task="classification", estimator_list=["lgbm"])
See the official GitHub repository for details:
https://github.com/microsoft/FLAML
3. A worked example
The test below is based on Kaggle's Tabular Playground Series competition from March 2022.
from flaml import AutoML
import numpy as np
import pandas as pd
import pickle
# Read the training and test data
train = pd.read_csv('child/train_action.csv', index_col='row_id')
test = pd.read_csv('child/test_action.csv', index_col='row_id')
# Some basic feature engineering
def feature_engineering(data):
    data['time'] = pd.to_datetime(data['time'])
    data['month'] = data['time'].dt.month
    data['weekday'] = data['time'].dt.weekday
    data['hour'] = data['time'].dt.hour
    data['minute'] = data['time'].dt.minute
    data['is_month_start'] = data['time'].dt.is_month_start.astype('int')
    data['is_month_end'] = data['time'].dt.is_month_end.astype('int')
    data['hour+minute'] = data['time'].dt.hour * 60 + data['time'].dt.minute
    data['is_weekend'] = (data['time'].dt.dayofweek > 4).astype('int')
    data['is_afternoon'] = (data['time'].dt.hour > 12).astype('int')
    # Cross features built by concatenating string-cast columns
    data['x+y'] = data['x'].astype('str') + data['y'].astype('str')
    data['x+y+direction'] = data['x'].astype('str') + data['y'].astype('str') + data['direction'].astype('str')
    data['hour+direction'] = data['hour'].astype('str') + data['direction'].astype('str')
    data['hour+x+y'] = data['hour'].astype('str') + data['x'].astype('str') + data['y'].astype('str')
    data['hour+direction+x'] = data['hour'].astype('str') + data['direction'].astype('str') + data['x'].astype('str')
    data['hour+direction+y'] = data['hour'].astype('str') + data['direction'].astype('str') + data['y'].astype('str')
    data['hour+direction+x+y'] = data['hour'].astype('str') + data['direction'].astype('str') + data['x'].astype('str') + data['y'].astype('str')
    data['hour+x'] = data['hour'].astype('str') + data['x'].astype('str')
    data['hour+y'] = data['hour'].astype('str') + data['y'].astype('str')
    data = data.drop(['time'], axis=1)
    return data
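As a quick sanity check of the kind of columns this derives, here is a toy frame with the same input columns (the values are made up, and only a few of the derived columns are reproduced):

```python
import pandas as pd

# Toy frame mimicking the competition columns (x, y, direction, time);
# values are hypothetical.
df = pd.DataFrame({
    'time': ['1991-04-01 08:30:00', '1991-04-06 14:05:00'],
    'x': [0, 1], 'y': [2, 3], 'direction': ['EB', 'NB'],
})
df['time'] = pd.to_datetime(df['time'])
df['hour'] = df['time'].dt.hour
df['is_weekend'] = (df['time'].dt.dayofweek > 4).astype('int')
df['hour+x+y'] = df['hour'].astype('str') + df['x'].astype('str') + df['y'].astype('str')
print(df[['hour', 'is_weekend', 'hour+x+y']])
```

The string-concatenation crosses like 'hour+x+y' end up as object columns, which is why the memory-reduction pass below converts them to category dtype.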
# Downcast column dtypes to reduce memory usage
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object and col != 'time':
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
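A small sanity check of the integer-downcasting logic (only the int8/int32 branches are reproduced, with made-up data). One caveat worth noting: float16 keeps only about three significant decimal digits, so the float downcasting above can cost precision.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'small': np.array([0, 100], dtype=np.int64),     # fits in int8
    'big': np.array([0, 100_000], dtype=np.int64),   # needs int32
})
# Same range test as in reduce_mem_usage, for two of the branches
for col in df.columns:
    c_min, c_max = df[col].min(), df[col].max()
    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
        df[col] = df[col].astype(np.int8)
    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
        df[col] = df[col].astype(np.int32)
print(df.dtypes)
```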
# Prepare the data: downcast, engineer features, then downcast again so
# the new string-combination columns become category dtype
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
train = feature_engineering(train)
test = feature_engineering(test)
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
# Separate the target and drop columns not used for training
y = train['congestion']
train = train.drop(columns=['congestion'])
test = test.drop(columns=['pre'])
# Instantiate AutoML and train
automl = AutoML()
automl.fit(train, y, task="regression")  # optionally pass time_budget (in seconds); -1 means unlimited
# Save the model
with open("automl_v2.pkl", "wb") as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)
# Load the model and predict
with open("automl_v2.pkl", "rb") as f:
    automl = pickle.load(f)
predictions = automl.predict(test)
# Save the predictions to CSV, keyed by the test index (row_id)
res = pd.DataFrame({'congestion': predictions}, index=test.index)
res.to_csv("faml_v2.csv")
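Kaggle submissions for this competition are keyed by row_id, so the index that to_csv writes matters. A quick check, with hypothetical values, of what the file header looks like when the index carries that name:

```python
import io
import pandas as pd

# Hypothetical predictions; in the real script the index comes from the test frame.
res = pd.DataFrame({'congestion': [42.0, 57.5, 63.1]},
                   index=pd.Index([0, 1, 2], name='row_id'))
buf = io.StringIO()
res.to_csv(buf)
header = buf.getvalue().splitlines()[0]
print(header)
```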
Using automl.fit(train, y, task="regression") with all default parameters, training finished in about a minute, and submitting the result to Kaggle scored 5.688. The ranking is nothing special, but given the time spent, the score holds up well against a single model such as a random forest.
Next, change the call to automl.fit(train, y, task="regression", time_budget=600), which runs the search for 10 minutes. (A time_budget of -1 means unlimited, which is not recommended, since the run can then take a very long time.) The output after 10 minutes is shown below, heavily truncated; you can see that several kinds of learners were searched.
[flaml.automl: 03-16 16:47:29] {2068} INFO - task = regression
[flaml.automl: 03-16 16:47:29] {2070} INFO - Data split method: uniform
[flaml.automl: 03-16 16:47:29] {2074} INFO - Evaluation method: holdout
[flaml.automl: 03-16 16:47:29] {2155} INFO - Minimizing error metric: 1-r2
[flaml.automl: 03-16 16:47:30] {2248} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 0, current learner lgbm
[flaml.automl: 03-16 16:47:30] {2617} INFO - Estimated sufficient time budget=50419s. Estimated necessary time budget=431s.
[flaml.automl: 03-16 16:47:30] {2669} INFO - at 3.5s, estimator lgbm's best error=0.7083, best estimator lgbm's best error=0.7083
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 1, current learner lgbm
[flaml.automl: 03-16 16:47:30] {2669} INFO - at 3.5s, estimator lgbm's best error=0.7083, best estimator lgbm's best error=0.7083
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 2, current learner lgbm
[flaml.automl: 03-16 16:47:30] {2669} INFO - at 3.6s, estimator lgbm's best error=0.4913, best estimator lgbm's best error=0.4913
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 3, current learner xgboost......
......
......
[flaml.automl: 03-16 16:55:37] {2669} INFO - at 490.4s, estimator lgbm's best error=0.2688, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:37] {2501} INFO - iteration 140, current learner extra_tree
[flaml.automl: 03-16 16:55:37] {2669} INFO - at 490.9s, estimator extra_tree's best error=0.3131, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:37] {2501} INFO - iteration 141, current learner xgboost
[flaml.automl: 03-16 16:55:42] {2669} INFO - at 496.0s, estimator xgboost's best error=0.2914, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:42] {2501} INFO - iteration 142, current learner extra_tree
[flaml.automl: 03-16 16:55:43] {2669} INFO - at 496.6s, estimator extra_tree's best error=0.3131, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:43] {2501} INFO - iteration 143, current learner extra_tree
[flaml.automl: 03-16 16:55:44] {2669} INFO - at 497.3s, estimator extra_tree's best error=0.3113, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:44] {2501} INFO - iteration 144, current learner extra_tree
[flaml.automl: 03-16 16:55:44] {2669} INFO - at 497.8s, estimator extra_tree's best error=0.3113, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:44] {2501} INFO - iteration 145, current learner extra_tree
[flaml.automl: 03-16 16:55:45] {2669} INFO - at 498.5s, estimator extra_tree's best error=0.3043, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:45] {2501} INFO - iteration 146, current learner extra_tree
[flaml.automl: 03-16 16:55:46] {2669} INFO - at 499.2s, estimator extra_tree's best error=0.3043, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:46] {2501} INFO - iteration 147, current learner lgbm
[flaml.automl: 03-16 16:55:49] {2669} INFO - at 502.8s, estimator lgbm's best error=0.2688, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:49] {2501} INFO - iteration 148, current learner lgbm
[flaml.automl: 03-16 16:56:47] {2669} INFO - at 561.0s, estimator lgbm's best error=0.2688, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:56:47] {2501} INFO - iteration 149, current learner lgbm
[flaml.automl: 03-16 16:56:51] {2669} INFO - at 564.2s, estimator lgbm's best error=0.2688, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:56:51] {2501} INFO - iteration 150, current learner lgbm
[flaml.automl: 03-16 16:57:12] {2669} INFO - at 586.0s, estimator lgbm's best error=0.2688, best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:57:12] {2501} INFO - iteration 151, current learner lgbm
[flaml.automl: 03-16 16:57:25] {2669} INFO - at 598.9s, estimator lgbm's best error=0.2673, best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:25] {2501} INFO - iteration 152, current learner extra_tree
[flaml.automl: 03-16 16:57:25] {2669} INFO - at 599.0s, estimator extra_tree's best error=0.3043, best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:25] {2501} INFO - iteration 153, current learner extra_tree
[flaml.automl: 03-16 16:57:26] {2669} INFO - at 599.2s, estimator extra_tree's best error=0.3043, best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:26] {2501} INFO - iteration 154, current learner extra_tree
[flaml.automl: 03-16 16:57:26] {2669} INFO - at 599.3s, estimator extra_tree's best error=0.3043, best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:26] {2501} INFO - iteration 155, current learner rf
[flaml.automl: 03-16 16:57:26] {2669} INFO - at 599.5s, estimator rf's best error=0.3408, best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:39] {2895} INFO - retrain lgbm for 12.7s
[flaml.automl: 03-16 16:57:39] {2900} INFO - retrained model: LGBMRegressor(colsample_bytree=0.7603311183328791,
learning_rate=0.07676228628554725, max_bin=1023,
min_child_samples=12, n_estimators=233, num_leaves=634,
reg_alpha=0.17131266959954505, reg_lambda=0.0009765625,
verbose=-1)
[flaml.automl: 03-16 16:57:39] {2277} INFO - fit succeeded
[flaml.automl: 03-16 16:57:39] {2279} INFO - Time taken to find the best model: 598.8882813453674
[flaml.automl: 03-16 16:57:39] {2293} WARNING - Time taken to find the best model is 100% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
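The log line "Minimizing error metric: 1-r2" means FLAML turns the r2 score (where higher is better) into an error to minimize. A quick illustration with made-up values:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
error = 1 - r2_score(y_true, y_pred)  # the quantity FLAML minimizes in this run
print(error)
```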
When training finishes, FLAML reports the model that achieved the best result.
LGBMRegressor(colsample_bytree=0.7603311183328791,
learning_rate=0.07676228628554725, max_bin=1023,
min_child_samples=12, n_estimators=233, num_leaves=634,
reg_alpha=0.17131266959954505, reg_lambda=0.0009765625,
verbose=-1)
Submitting this result to Kaggle scored 5.304, an improvement over the default run.