Machine Learning Notes - A First Look at the AutoML Framework FLAML

I. Overview

        AutoML has been showing up everywhere in machine learning work and Kaggle competitions in recent years and is clearly a trend. Automated machine learning provides methods and processes that make machine learning accessible to non-experts, improve its efficiency, and accelerate research.

        FLAML is a new, efficient, and lightweight automated machine learning framework recently promoted by Microsoft.

        FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently, and economically. It frees users from having to select learners and hyperparameters themselves.

        For common machine learning tasks such as classification and regression, it quickly finds quality models for user-provided data while using few computational resources. It supports both classical machine learning models and deep neural networks.

        It is easy to customize or extend. Users can pick the degree of customization they want from a smooth range: minimal customization (just a computational resource budget), medium customization (e.g., scikit-style learners, search spaces, and metrics), or full customization (arbitrary training and evaluation code).

        It supports fast automatic tuning and can handle complex constraints, guidance, and early stopping. FLAML is powered by a new, cost-effective method for hyperparameter optimization and learner selection invented by Microsoft Research.
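        At the fully-customized end of that range, FLAML also exposes a low-level tuning API, flaml.tune, for tuning arbitrary training and evaluation code. Below is a minimal sketch: the objective function and search space are invented purely for illustration, and the keyword arguments should be verified against the FLAML version you have installed.

from flaml import tune

# Toy evaluation function: FLAML calls it with a sampled config
# and expects a dict of metric values back.
def evaluate(config):
    loss = (config["lr"] - 0.01) ** 2 + (config["n_layers"] - 3) ** 2
    return {"loss": loss}

analysis = tune.run(
    evaluate,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),  # continuous, log scale
        "n_layers": tune.randint(1, 8),     # integer range
    },
    metric="loss",
    mode="min",        # minimize the reported metric
    num_samples=50,    # number of configurations to try
    time_budget_s=60,  # hard stop after 60 seconds
)
print(analysis.best_config)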

        In addition, FLAML has a .NET implementation, used by the ML.NET Model Builder in Visual Studio 2022.

II. Installation and Basic Usage

1. Installation

pip install flaml

2. Basic Usage

        FLAML claims that a model can be trained with just three lines of code, and its API style is consistent with scikit-learn.

from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification")
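        Once fit returns, the tuned model can be used and inspected directly; prediction works scikit-learn style. A quick sketch, assuming an X_test split prepared alongside X_train (the attribute names follow FLAML's documented AutoML API):

# Predict with the best model found
y_pred = automl.predict(X_test)

# Inspect what the search settled on
print(automl.best_estimator)   # e.g. "lgbm"
print(automl.best_config)      # its tuned hyperparameters
print(automl.model.estimator)  # the underlying fitted model object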

        You can also restrict the fast hyperparameter tuning to specific learners such as XGBoost, LightGBM, or random forest, or plug in a custom learner.

automl.fit(X_train, y_train, task="classification", estimator_list=["lgbm"])
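        fit accepts further keyword arguments to steer the search. A hedged sketch combining a few commonly used ones (the parameter names match the FLAML documentation; verify against your installed version):

automl.fit(
    X_train, y_train,
    task="classification",
    estimator_list=["lgbm", "xgboost"],  # restrict the search to these learners
    metric="accuracy",                   # metric to optimize
    time_budget=60,                      # search time budget in seconds
    log_file_name="flaml.log",           # keep a log of the run
)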

        See the official GitHub repository for details:

https://github.com/microsoft/FLAML

III. A Worked Example

        The test below is based on Kaggle's Tabular Playground Series competition of March 2022.

from flaml import AutoML
import numpy as np
import pandas as pd
import pickle

# Read the training and test data
train = pd.read_csv('child/train_action.csv', index_col='row_id')
test = pd.read_csv('child/test_action.csv', index_col='row_id')

# Basic feature engineering on the timestamp and coordinate columns
def feature_engineering(data):
    data['time'] = pd.to_datetime(data['time'])
    data['month'] = data['time'].dt.month
    data['weekday'] = data['time'].dt.weekday
    data['hour'] = data['time'].dt.hour
    data['minute'] = data['time'].dt.minute
    data['is_month_start'] = data['time'].dt.is_month_start.astype('int')
    data['is_month_end'] = data['time'].dt.is_month_end.astype('int')
    data['hour+minute'] = data['time'].dt.hour * 60 + data['time'].dt.minute
    data['is_weekend'] = (data['time'].dt.dayofweek > 4).astype('int')
    data['is_afternoon'] = (data['time'].dt.hour > 12).astype('int')
    data['x+y'] = data['x'].astype('str') + data['y'].astype('str')
    data['x+y+direction'] = data['x'].astype('str') + data['y'].astype('str') + data['direction'].astype('str')
    data['hour+direction'] = data['hour'].astype('str') + data['direction'].astype('str')
    data['hour+x+y'] = data['hour'].astype('str') + data['x'].astype('str') + data['y'].astype('str')
    data['hour+direction+x'] = data['hour'].astype('str') + data['direction'].astype('str') + data['x'].astype('str')
    data['hour+direction+y'] = data['hour'].astype('str') + data['direction'].astype('str') + data['y'].astype('str')
    data['hour+direction+x+y'] = data['hour'].astype('str') + data['direction'].astype('str') + data['x'].astype('str') + data['y'].astype('str')
    data['hour+x'] = data['hour'].astype('str') + data['x'].astype('str')
    data['hour+y'] = data['hour'].astype('str') + data['y'].astype('str')
    data = data.drop(['time'], axis=1)
    return data

# Downcast column dtypes to reduce memory usage
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object and col != 'time':
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object columns (including the raw 'time' strings) become category
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

# Organize the data: shrink dtypes first, then engineer features.
# Note the assignment: feature_engineering drops 'time' and returns a
# new DataFrame, so a bare "for data in [train, test]" loop would lose it.
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

train = feature_engineering(train)
test = feature_engineering(test)

# Split off the target; 'time' was already dropped inside feature_engineering
y = train['congestion']
del train['congestion']

del test['pre']

# Create the AutoML instance and train
automl = AutoML()
automl.fit(train, y, task="regression")  # an optional time_budget (seconds) caps the search; -1 means unlimited

# Save the model
with open("automl_v2.pkl", "wb") as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

# Load the model and predict
with open("automl_v2.pkl", "rb") as f:
    automl = pickle.load(f)
predictions = automl.predict(test)

# Save predictions to CSV, keeping the row_id index from the test set
res = pd.DataFrame({'congestion': predictions}, index=test.index)
res.to_csv("faml_v2.csv")

        Training with all default parameters via automl.fit(train, y, task="regression") finished in about a minute, and submitting the result to Kaggle scored 5.688. The ranking is nothing special, but in terms of both speed and result it compares quite well with a standalone model such as a random forest.

        Next, change the call to automl.fit(train, y, task="regression", time_budget=600), which lets the search run for 10 minutes (-1 would mean no time limit, which is not recommended, since the run could then take a very long time). Part of the output after 10 minutes is shown below (heavily truncated, as the full log is long); you can see that several kinds of learners were searched.

[flaml.automl: 03-16 16:47:29] {2068} INFO - task = regression
[flaml.automl: 03-16 16:47:29] {2070} INFO - Data split method: uniform
[flaml.automl: 03-16 16:47:29] {2074} INFO - Evaluation method: holdout
[flaml.automl: 03-16 16:47:29] {2155} INFO - Minimizing error metric: 1-r2
[flaml.automl: 03-16 16:47:30] {2248} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 0, current learner lgbm
[flaml.automl: 03-16 16:47:30] {2617} INFO - Estimated sufficient time budget=50419s. Estimated necessary time budget=431s.
[flaml.automl: 03-16 16:47:30] {2669} INFO -  at 3.5s,    estimator lgbm's best error=0.7083,    best estimator lgbm's best error=0.7083
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 1, current learner lgbm
[flaml.automl: 03-16 16:47:30] {2669} INFO -  at 3.5s,    estimator lgbm's best error=0.7083,    best estimator lgbm's best error=0.7083
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 2, current learner lgbm
[flaml.automl: 03-16 16:47:30] {2669} INFO -  at 3.6s,    estimator lgbm's best error=0.4913,    best estimator lgbm's best error=0.4913
[flaml.automl: 03-16 16:47:30] {2501} INFO - iteration 3, current learner xgboost

......

......

......
[flaml.automl: 03-16 16:55:37] {2669} INFO -  at 490.4s,    estimator lgbm's best error=0.2688,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:37] {2501} INFO - iteration 140, current learner extra_tree
[flaml.automl: 03-16 16:55:37] {2669} INFO -  at 490.9s,    estimator extra_tree's best error=0.3131,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:37] {2501} INFO - iteration 141, current learner xgboost
[flaml.automl: 03-16 16:55:42] {2669} INFO -  at 496.0s,    estimator xgboost's best error=0.2914,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:42] {2501} INFO - iteration 142, current learner extra_tree
[flaml.automl: 03-16 16:55:43] {2669} INFO -  at 496.6s,    estimator extra_tree's best error=0.3131,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:43] {2501} INFO - iteration 143, current learner extra_tree
[flaml.automl: 03-16 16:55:44] {2669} INFO -  at 497.3s,    estimator extra_tree's best error=0.3113,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:44] {2501} INFO - iteration 144, current learner extra_tree
[flaml.automl: 03-16 16:55:44] {2669} INFO -  at 497.8s,    estimator extra_tree's best error=0.3113,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:44] {2501} INFO - iteration 145, current learner extra_tree
[flaml.automl: 03-16 16:55:45] {2669} INFO -  at 498.5s,    estimator extra_tree's best error=0.3043,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:45] {2501} INFO - iteration 146, current learner extra_tree
[flaml.automl: 03-16 16:55:46] {2669} INFO -  at 499.2s,    estimator extra_tree's best error=0.3043,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:46] {2501} INFO - iteration 147, current learner lgbm
[flaml.automl: 03-16 16:55:49] {2669} INFO -  at 502.8s,    estimator lgbm's best error=0.2688,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:55:49] {2501} INFO - iteration 148, current learner lgbm
[flaml.automl: 03-16 16:56:47] {2669} INFO -  at 561.0s,    estimator lgbm's best error=0.2688,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:56:47] {2501} INFO - iteration 149, current learner lgbm
[flaml.automl: 03-16 16:56:51] {2669} INFO -  at 564.2s,    estimator lgbm's best error=0.2688,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:56:51] {2501} INFO - iteration 150, current learner lgbm
[flaml.automl: 03-16 16:57:12] {2669} INFO -  at 586.0s,    estimator lgbm's best error=0.2688,    best estimator lgbm's best error=0.2688
[flaml.automl: 03-16 16:57:12] {2501} INFO - iteration 151, current learner lgbm
[flaml.automl: 03-16 16:57:25] {2669} INFO -  at 598.9s,    estimator lgbm's best error=0.2673,    best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:25] {2501} INFO - iteration 152, current learner extra_tree
[flaml.automl: 03-16 16:57:25] {2669} INFO -  at 599.0s,    estimator extra_tree's best error=0.3043,    best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:25] {2501} INFO - iteration 153, current learner extra_tree
[flaml.automl: 03-16 16:57:26] {2669} INFO -  at 599.2s,    estimator extra_tree's best error=0.3043,    best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:26] {2501} INFO - iteration 154, current learner extra_tree
[flaml.automl: 03-16 16:57:26] {2669} INFO -  at 599.3s,    estimator extra_tree's best error=0.3043,    best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:26] {2501} INFO - iteration 155, current learner rf
[flaml.automl: 03-16 16:57:26] {2669} INFO -  at 599.5s,    estimator rf's best error=0.3408,    best estimator lgbm's best error=0.2673
[flaml.automl: 03-16 16:57:39] {2895} INFO - retrain lgbm for 12.7s
[flaml.automl: 03-16 16:57:39] {2900} INFO - retrained model: LGBMRegressor(colsample_bytree=0.7603311183328791,
              learning_rate=0.07676228628554725, max_bin=1023,
              min_child_samples=12, n_estimators=233, num_leaves=634,
              reg_alpha=0.17131266959954505, reg_lambda=0.0009765625,
              verbose=-1)
[flaml.automl: 03-16 16:57:39] {2277} INFO - fit succeeded
[flaml.automl: 03-16 16:57:39] {2279} INFO - Time taken to find the best model: 598.8882813453674
[flaml.automl: 03-16 16:57:39] {2293} WARNING - Time taken to find the best model is 100% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.

        When training completes, FLAML reports the model with the best result found so far.

LGBMRegressor(colsample_bytree=0.7603311183328791,
              learning_rate=0.07676228628554725, max_bin=1023,
              min_child_samples=12, n_estimators=233, num_leaves=634,
              reg_alpha=0.17131266959954505, reg_lambda=0.0009765625,
              verbose=-1)
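        The same information is available programmatically on the fitted AutoML object, which is handier than scrolling the log. A short sketch using FLAML's documented attributes:

print(automl.best_estimator)   # "lgbm"
print(automl.best_config)      # the winner's tuned hyperparameters
print(automl.best_loss)        # best validation loss (here 1 - r2)
print(automl.model.estimator)  # the retrained LGBMRegressor shown above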

        Submitting this result to Kaggle scored 5.304, an improvement over the default run.
