Machine Learning Notes - First Experience with the AutoML Framework AutoKeras

I. Overview

        AutoKeras is a Keras-based AutoML system developed by the DATA Lab at Texas A&M University. Its goal is to make machine learning accessible to everyone.

        It provides a simple and effective way to automatically find the best-performing model for a variety of predictive modeling tasks, including classification and regression on tabular (so-called structured) datasets. Automated Machine Learning, or AutoML for short, refers to automatically finding the best combination of data preparation, model, and model hyperparameters for a predictive modeling problem. The benefit of AutoML is that it lets machine learning practitioners handle predictive modeling tasks quickly and efficiently with very little input.

        In the spirit of Keras, AutoKeras provides an easy-to-use interface for different tasks, such as image classification, structured data classification or regression, etc. The user only needs to specify the location of the data and the number of models to try, and a model that achieves the best performance (within the configured constraints) on that dataset is returned.

        Official website: AutoKeras https://autokeras.com/

        AutoKeras is installed using the following command

pip install autokeras

        AutoKeras is often described as using Efficient Neural Architecture Search (ENAS), although the AutoKeras paper itself describes a search in which Bayesian optimization guides network morphism. ENAS applies a concept similar to transfer learning: parameters learned for a specific model on a specific task can be reused by other models on other tasks. ENAS therefore forces all generated submodels to share weights, deliberately avoiding training each submodel from scratch. The authors of the ENAS paper show that sharing parameters between submodels is not only possible but also achieves very strong performance.
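
        The weight-sharing idea can be illustrated with a toy sketch. This is not AutoKeras or ENAS code: the `SharedBank` class, the layer names, and the random sampling that stands in for the controller are all hypothetical (real ENAS trains its controller with reinforcement learning). It only shows that two sampled submodels which pick the same layer reuse the same parameter object instead of re-initializing it.

```python
import random


class SharedBank:
    """Hypothetical store of layer weights shared by all sampled submodels."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._weights = {}

    def get(self, layer_name, size):
        # Create the weights the first time a layer is requested;
        # afterwards every submodel gets the same list object back.
        if layer_name not in self._weights:
            self._weights[layer_name] = [
                self._rng.gauss(0.0, 0.1) for _ in range(size)
            ]
        return self._weights[layer_name]


def sample_submodel(bank, candidate_layers, depth, rng):
    # Stand-in for the controller: pick `depth` distinct layers at random.
    chosen = rng.sample(candidate_layers, depth)
    return {name: bank.get(name, size=4) for name in chosen}


bank = SharedBank(seed=0)
rng = random.Random(1)
layers = ["dense_a", "dense_b", "dense_c"]

model_1 = sample_submodel(bank, layers, depth=2, rng=rng)
model_2 = sample_submodel(bank, layers, depth=2, rng=rng)

# With 3 candidates and depth 2, the two submodels always overlap.
common = set(model_1) & set(model_2)
for name in common:
    model_1[name][0] += 1.0                       # a "training step" on model_1
    assert model_2[name][0] == model_1[name][0]   # is visible in model_2
print("shared layers:", sorted(common))
```

        Because the second submodel starts from weights the first one already updated, it does not pay the cost of training those layers from scratch, which is the saving the paragraph above describes.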

II. AutoKeras for regression

1. Dataset

        The dataset is from the Kaggle Tabular Playground Series - Jan 2021 competition, available at the following address

Tabular Playground Series - Jan 2021 | Kaggle https://www.kaggle.com/c/tabular-playground-series-jan-2021

        It has a training set of 300,000 rows and a test set of 200,000 rows. The first few rows of the training set look like this:

id cont1 cont2 cont3 cont4 cont5 cont6 cont7 cont8 cont9 cont10 cont11 cont12 cont13 cont14 target
1 0.67039 0.8113 0.643968 0.291791 0.284117 0.855953 0.8907 0.285542 0.558245 0.779418 0.921832 0.866772 0.878733 0.305411 7.243043
3 0.388053 0.621104 0.686102 0.501149 0.64379 0.449805 0.510824 0.580748 0.418335 0.432632 0.439872 0.434971 0.369957 0.369484 8.203331
4 0.83495 0.227436 0.301584 0.293408 0.606839 0.829175 0.506143 0.558771 0.587603 0.823312 0.567007 0.677708 0.882938 0.303047 7.776091
5 0.820708 0.160155 0.546887 0.726104 0.282444 0.785108 0.752758 0.823267 0.574466 0.580843 0.769594 0.818143 0.914281 0.279528 6.957716
8 0.935278 0.421235 0.303801 0.880214 0.66561 0.830131 0.487113 0.604157 0.874658 0.863427 0.983575 0.900464 0.935918 0.435772 7.951046
9 0.352623 0.258867 0.327373 0.802627 0.284219 0.296886 0.209743 0.27371 0.308018 0.235851 0.27876 0.251406 0.339135 0.293129 7.346874
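
        As a quick sanity check of this layout, the sketch below parses the first two sample rows from an in-memory string instead of the real `train.csv`. Writing them as comma-separated text is an assumption about the file format, and the `header`, `rows`, and `feature_cols` names are only illustrative.

```python
import io

import pandas as pd

# Header with 14 feature columns (cont1..cont14) plus id and target.
header = "id," + ",".join(f"cont{i}" for i in range(1, 15)) + ",target"

# First two sample rows from the listing above, rewritten as CSV.
rows = [
    "1,0.67039,0.8113,0.643968,0.291791,0.284117,0.855953,0.8907,"
    "0.285542,0.558245,0.779418,0.921832,0.866772,0.878733,0.305411,7.243043",
    "3,0.388053,0.621104,0.686102,0.501149,0.64379,0.449805,0.510824,"
    "0.580748,0.418335,0.432632,0.439872,0.434971,0.369957,0.369484,8.203331",
]

df = pd.read_csv(io.StringIO("\n".join([header] + rows)), index_col="id")
feature_cols = [c for c in df.columns if c.startswith("cont")]

print(df.shape)           # (number of rows, 14 features + target)
print(len(feature_cols))  # number of feature columns
print(df["target"].tolist())
```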

2. Using AutoKeras

        Reference Code

# use autokeras to find a regression model for the Kaggle tabular dataset
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from autokeras import StructuredDataRegressor

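# load the training data
train = pd.read_csv('train.csv', index_col='id')
# load the test data (used for the Kaggle submission)
test = pd.read_csv('test.csv', index_col='id')

# helper that downcasts column dtypes to save memory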
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object and col!= 'time':
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

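# separate the label from the features
y = train['target']
del train['target']
# del test['cont9']

# add row-wise summary features to the training set
# (each statistic is computed over all columns present at that point,
# including the derived columns added on the lines above it)
train['max'] = train.max(axis=1)
train['min'] = train.min(axis=1)
train['mean'] = train.mean(axis=1)
train['sum'] = train.sum(axis=1)
train['cha'] = train.max(axis=1) - train.min(axis=1)  # "cha": range
train['zhong'] = (train.max(axis=1) + train.min(axis=1))/2  # "zhong": midpoint

# add the same features to the test set, which is used for the submission
test['max'] = test.max(axis=1)
test['min'] = test.min(axis=1)
test['mean'] = test.mean(axis=1)
test['sum'] = test.sum(axis=1)
test['cha'] = test.max(axis=1) - test.min(axis=1)
test['zhong'] = (test.max(axis=1) + test.min(axis=1))/2

# downcast dtypes to reduce memory
# (float columns in this value range become float16, which keeps only
# about 3 significant decimal digits)
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)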


def train_1():
    # hold out part of the training data for evaluation
    X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.33, random_state=1)
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

    # define the search
    search = StructuredDataRegressor(max_trials=15, loss='mean_absolute_error')

    # run the search
    search.fit(x=X_train, y=y_train, verbose=1)

    # evaluate on the held-out split
    mae, _ = search.evaluate(X_test, y_test, verbose=0)
    print('MAE: %.3f' % mae)

    # export the best model found
    model = search.export_model()
    # print the model structure
    model.summary()

    # predict on the Kaggle test set
    predictions = model.predict(test)
    preds = [pred[0] for pred in predictions]

    # keep the original ids as the index so the file has id,target columns
    res = pd.DataFrame({'target': preds}, index=test.index)
    res.to_csv("predict_test_v1.csv")



train_1()
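
        One detail of the feature-engineering step is worth checking. Because each new column is added to the frame before the next statistic is computed, later statistics such as `mean` also take the earlier derived columns into account. The sketch below uses a tiny made-up frame (the values and the `base_cols` name are illustrative assumptions, not part of the original script) to show how this differs from computing every statistic over the original columns only.

```python
import pandas as pd

# Tiny made-up frame standing in for the cont1..cont14 columns.
raw = pd.DataFrame({
    "cont1": [0.2, 0.5],
    "cont2": [0.4, 0.1],
    "cont3": [0.9, 0.2],
})

# Variant 1: sequential assignment, as in the script above.
seq = raw.copy()
seq["max"] = seq.max(axis=1)
seq["min"] = seq.min(axis=1)
seq["mean"] = seq.mean(axis=1)  # averages cont1..cont3 AND max, min

# Variant 2: every statistic computed over the original columns only.
base_cols = list(raw.columns)
fixed = raw.copy()
fixed["max"] = raw[base_cols].max(axis=1)
fixed["min"] = raw[base_cols].min(axis=1)
fixed["mean"] = raw[base_cols].mean(axis=1)

# max and min agree between the variants, but mean does not.
print(seq["mean"].tolist())
print(fixed["mean"].tolist())
```

        Which variant is preferable depends on the intent. The sequential form is the one that produced the results reported in this post, so switching to the second form would change them.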

3. Running results

        You can see the structure of the final model.

MAE: 0.622
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 20)]              0         
                                                                 
 multi_category_encoding (Mu  (None, 20)               0         
 ltiCategoryEncoding)                                            
                                                                 
 normalization (Normalizatio  (None, 20)               41        
 n)                                                              
                                                                 
 dense (Dense)               (None, 128)               2688      
                                                                 
 re_lu (ReLU)                (None, 128)               0         
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 re_lu_1 (ReLU)              (None, 64)                0         
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 regression_head_1 (Dense)   (None, 1)                 65        
                                                                 
=================================================================
Total params: 11,050
Trainable params: 11,009
Non-trainable params: 41
_________________________________________________________________

Process finished with exit code 0
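
        The parameter counts in the summary can be reproduced by hand. A dense layer with n inputs and m units has n*m weights plus m biases. The 41 non-trainable parameters are consistent with a normalization layer that stores a mean and a variance per feature plus one counter; that breakdown is my reading of the number, not something the summary itself states.

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units


n_features = 20  # 14 original columns + 6 derived ones

dense = dense_params(n_features, 128)
dense_1 = dense_params(128, 64)
head = dense_params(64, 1)

# Assumed breakdown: per-feature mean and variance, plus one count value.
normalization = 2 * n_features + 1

trainable = dense + dense_1 + head
total = trainable + normalization

print(dense, dense_1, head, normalization)  # 2688 8256 65 41
print(trainable, total)                     # 11009 11050
```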

4. Submitting the predictions

        Upload the CSV with the test-set predictions to Kaggle. Because the competition dates from 2021 and has already ended, the score is returned immediately; the Private Score is 0.74792.

III. AutoKeras for classification

        I did not test AutoKeras on a classification task; please refer to the official demo.

Structured Data Classification - AutoKeras https://autokeras.com/tutorial/structured_data_classification/

IV. Experience of use

        On the Kaggle tabular-playground-series-jan-2021 competition, the AutoKeras score is still not as good as traditional machine learning ensemble approaches (I had previously tested many bagging, k-fold, and stacking variants). It is still useful as a reference score, and it suggests a reasonably good model structure that you can take as a starting point and tune yourself later.

        Automated machine learning techniques and frameworks such as Google AutoML and AutoKeras should not be relied on too heavily. Expertise in the relevant domain remains critical to improving model accuracy.

Origin blog.csdn.net/bashendixie5/article/details/123454174