Automated machine learning library: TPOT study notes

1 Introduction to TPOT

The Tree-based Pipeline Optimization Tool (TPOT) is an open-source library for performing AutoML in Python.

TPOT uses a tree-based structure to represent model pipelines for predictive modeling problems, including the data preparation and modeling algorithms and the model hyperparameters. It leverages the popular Scikit-Learn machine learning library for data transformations and machine learning algorithms, and uses a genetic programming stochastic global search process to efficiently discover the best-performing model pipeline for a given dataset.

An optimization process is then performed to find the tree structure that performs best on a given dataset: specifically, a genetic programming algorithm designed to perform stochastic global optimization over programs represented as trees.
The figure below, taken from the TPOT paper, shows the elements involved in pipeline search, including data cleaning, feature selection, feature processing, feature construction, model selection, and hyperparameter optimization.

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
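
For intuition, each candidate pipeline in this search space can be written out as an ordinary nested scikit-learn pipeline. The sketch below is a hand-written illustration of that tree shape, not an actual TPOT output; the operator choices are assumptions:

# illustrative only: a tree of operators TPOT might assemble automatically
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    StandardScaler(),                          # feature processing node
    PCA(n_components=5),                       # feature construction node
    RandomForestClassifier(n_estimators=100),  # modeling node with hyperparameters
)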

2 Install and use TPOT

2.1 Install the TPOT library

pip install tpot

Once installed, import the library and print the version number to confirm it installed successfully:

# check tpot version
import tpot
print('tpot: %s' % tpot.__version__)

2.2 Using the TPOT library

Process:

(1) Create an instance of the TPOTRegressor or TPOTClassifier class and configure the search.
(2) Export the best-performing model pipeline found on the dataset.

(1) Configuring the class

There are two main elements involved.
① How to evaluate the model

For example:
To use neg_mean_absolute_error as the regression metric, select RepeatedKFold for regression cross-validation.

# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the search
model = TPOTRegressor(... scoring='neg_mean_absolute_error', cv=cv)

Or, to use accuracy as the evaluation metric for a classification model, choose RepeatedStratifiedKFold for classification cross-validation.

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the search
model = TPOTClassifier(... scoring='accuracy', cv=cv)

② Parameter configuration for the evolutionary computation
As an evolutionary algorithm, TPOT involves more complex configuration settings, such as the population size, the number of generations to run, and, potentially, the crossover and mutation rates. The first two largely control the extent of the search; if you are not familiar with evolutionary search algorithms, you can leave the crossover and mutation rates at their defaults.

For example, a modest population size of 100 and 5 or 10 generations is a good starting point.

# define the search
model = TPOTClassifier(generations=5, population_size=50, ...)

(2) Export the best-performing pipeline found at the end of the search

The pipeline producing the best model can be exported as a .py file, which can later be copied and pasted into your own projects.

# export the best model
model.export('tpot_model.py')

3 Examples of TPOT classification

The sonar dataset is a standard machine learning dataset consisting of 208 rows of data with 60 numeric input variables and a target variable with two class values, i.e., a binary classification problem.
The task is to predict whether sonar returns indicate rocks or mines.

① Load data

from pandas import read_csv
# load the dataset (placeholder path; substitute your local path or URL)
dataframe = read_csv('sonar.csv', header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

② Define the model evaluation procedure using RepeatedStratifiedKFold cross-validation.

# define the model evaluation procedure
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, 
                             n_repeats=3, 
                             random_state=1)

③ Define a search with a population size of 50 run for 5 generations, and set n_jobs=-1 to use all cores on the system.

# define the search
from tpot import TPOTClassifier
model = TPOTClassifier(generations=5, 
                       population_size=50, cv=cv,
                       scoring='accuracy', verbosity=2,
                       random_state=1, n_jobs=-1)

④ Start the search, making sure to save the best-performing model at the end of the run.

# run the search
model.fit(X, y)
# export the best model
model.export('tpot_sonar_best_model.py')

Finally, the best-performing pipeline is saved to a file called "tpot_sonar_best_model.py".

Generic code for loading the dataset and fitting the exported pipeline:

# example of fitting the final model on the sonar dataset and making a prediction
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
# load the dataset (placeholder path; substitute your local path or URL)
dataframe = read_csv('sonar.csv', header=None)
# split into input and output variables
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimize the dataset's memory footprint
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# average cross-validation score on the training set: 0.8667
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianNB()),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=7, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=10, n_estimators=100, subsample=0.9000000000000001)
)
# fix the random state of all steps in the exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)
# train the model
exported_pipeline.fit(X, y)
# make a prediction on a new row of data
# (a real row must contain all 60 input values; truncated here for brevity)
row = [0.0200,0.0371,0.0428,0.0207,0.0954,0.0986]
yhat = exported_pipeline.predict([row])
print('Predicted: %.3f' % yhat[0])
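
As a usage note, the cross-validation score quoted in the comment above can be reproduced (exactly only with the same data and seed) by re-scoring the exported pipeline with the same evaluation procedure used during the search; a minimal sketch:

# sketch: re-evaluate the exported pipeline with the same cross-validation scheme
from numpy import mean
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(exported_pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean accuracy: %.4f' % mean(scores))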

4 TPOT regression

The Auto Insurance dataset is a standard machine learning dataset consisting of 63 rows of data, with one numeric input variable and one numeric target variable.
The process is similar to classification.
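
The section below only shows the exported pipeline, so here is a sketch of the search step itself for the regression case; the CSV path is a placeholder, and the settings mirror the classification example:

# sketch of the TPOT search for the regression case
from pandas import read_csv
from sklearn.model_selection import RepeatedKFold
from tpot import TPOTRegressor

# load the dataset (placeholder path; substitute your local path or URL)
dataframe = read_csv('auto-insurance.csv', header=None)
data = dataframe.values.astype('float32')
X, y = data[:, :-1], data[:, -1]
# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define and run the search, then export the best pipeline
model = TPOTRegressor(generations=5, population_size=50, cv=cv,
                      scoring='neg_mean_absolute_error', verbosity=2,
                      random_state=1, n_jobs=-1)
model.fit(X, y)
model.export('tpot_insurance_best_model.py')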

Generic code for loading the dataset and fitting the exported pipeline:

# example of fitting the final model and making a prediction on the insurance dataset
from pandas import read_csv
from sklearn.svm import LinearSVR
# load the dataset (placeholder path; substitute your local path or URL)
dataframe = read_csv('auto-insurance.csv', header=None)
# split into input and output variables
data = dataframe.values
# minimize the dataset's memory footprint
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# average cross-validation score on the training set: -29.1476
exported_pipeline = LinearSVR(C=1.0, dual=False, epsilon=0.0001, loss="squared_epsilon_insensitive", tol=0.001)
# fix the random state of the exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)
# train the model
exported_pipeline.fit(X, y)
# make a prediction on a new row of data
row = [108]
yhat = exported_pipeline.predict([row])
print('Predicted: %.3f' % yhat[0])

5 Practical Cases

Using the Pima Indians Diabetes dataset to predict the onset of diabetes within 5 years.

# import the AutoML package after installing tpot
import tpot
# import the other necessary packages
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier
# load the data
file_path = './pima-indians-diabetes.data.csv'
# you can substitute your own dataset's .csv file name here
df = read_csv(file_path, header=None)
# split the dataframe values into input and output features
data = df.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# (768, 8) (768,)
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model evaluation: 10-fold StratifiedKFold
cv = StratifiedKFold(n_splits=10)
# define the TPOTClassifier
model = TPOTClassifier(generations=5, population_size=50,
                       cv=cv, scoring='accuracy',
                       verbosity=2, random_state=1,
                       n_jobs=-1)
# run the search for the best pipeline
model.fit(X, y)
# export the best model
model.export('tpot_data.py')

6 TPOT other parameters and configuration

6.1 Parameters

generations : int or None, optional (default=100)
The number of iterations to run the pipeline optimization process. Must be a positive number or None.

population_size : int, optional (default=100)
Number of individuals retained in the genetic programming population every generation. Must be a positive number.

offspring_size : int, optional (default=None)
Number of offspring produced in each genetic programming generation. Must be a positive number; by default, the number of offspring equals the population size.

mutation_rate : float, optional (default=0.9)
The mutation rate for the genetic programming algorithm, in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. The default is recommended unless you understand how the mutation rate affects the GP algorithm.

crossover_rate : float, optional (default=0.1)
The crossover rate for the genetic programming algorithm, in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to "breed" every generation. mutation_rate + crossover_rate cannot exceed 1.0. The default is recommended unless you understand how the crossover rate affects the GP algorithm.

scoring : string or callable, optional (default='accuracy')
Function used to evaluate the quality of a given pipeline for a classification problem. The following built-in scoring functions can be used:
'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'precision' etc. (suffixes apply as with 'f1'), 'recall' etc. (suffixes apply as with 'f1'), 'jaccard' etc. (suffixes apply as with 'f1'), 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted'
If you want to use a custom scorer, you can pass a callable with the signature scorer(estimator, X, y), as in the sketch below.
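
For example, a metric not on the built-in list can be wrapped with scikit-learn's make_scorer, which produces exactly such a scorer(estimator, X, y) callable; a minimal sketch:

# sketch: pass a custom scorer built with sklearn's make_scorer
from sklearn.metrics import make_scorer, f1_score
from tpot import TPOTClassifier

custom_scorer = make_scorer(f1_score, average='weighted')
model = TPOTClassifier(generations=5, population_size=50, scoring=custom_scorer)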

cv : int, cross-validation generator, or iterable, optional (default=5)
Cross-validation strategy to use when evaluating the pipeline.

subsample: float, optional (default=1.0)
Fraction of training samples used during TPOT optimization. Must be in the range (0.0, 1.0]. Setting subsample=0.5 tells TPOT to use a random subsample of half the training data, which will remain constant throughout the pipeline optimization.

n_jobs : integer, optional (default=1)
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.
Setting n_jobs=-1 uses as many cores as are available. For n_jobs below -1, (n_cpus + 1 + n_jobs) processes are used; thus for n_jobs=-2, all CPUs but one are used. Note that using multiple processes on the same machine may cause memory issues for large datasets.

max_time_mins : integer or None, optional (default=None)
How many minutes TPOT has to optimize the pipeline. If not None, TPOT will run until max_time_mins minutes have elapsed and then stop; it will stop earlier if generations is set and that number of generations has already been executed.

max_eval_time_mins : float, optional (default=5)
How many minutes TPOT has to evaluate a single pipeline.
Setting this parameter to a higher value allows TPOT to evaluate more complex pipelines, but also lets TPOT run longer. Keeping it low helps prevent TPOT from wasting time on overly time-consuming pipelines.

random_state : integer or None, optional (default=None)
The seed for the pseudo-random number generator used in TPOT. Using this parameter ensures that TPOT will give the same results every time you run it on the same dataset with that seed.

config_dict : Python dictionary, string, or None, optional (default=None)
A configuration dictionary for customizing the operators and parameters that TPOT searches over during optimization.
Possible inputs are:
a Python dictionary, in which case TPOT will use your custom configuration;
the string 'TPOT light', in which case TPOT will use a built-in configuration with only fast models and preprocessors;
the string 'TPOT MDR', in which case TPOT will use a built-in configuration specialized for genomics research;
the string 'TPOT sparse', in which case TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices; or
None, in which case TPOT will use the default TPOTClassifier configuration.

warm_start : boolean, optional (default=False)
Flag indicating whether the TPOT instance will reuse the population from previous calls to fit().
Setting warm_start=True can be useful for running TPOT on a dataset for a short time, checking the results, then resuming TPOT from where it left off.
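
A sketch of that pattern, assuming X and y are already loaded: run a short search, inspect the results, then call fit() again to continue from the existing population:

# sketch: resume a short TPOT run via warm_start (X and y assumed loaded)
from tpot import TPOTClassifier

model = TPOTClassifier(generations=2, population_size=50,
                       warm_start=True, verbosity=2)
model.fit(X, y)  # first short run
# ...inspect the results, then continue from the previous population:
model.fit(X, y)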

memory : a joblib.Memory object or string, optional (default=None)
If supplied, the pipeline caches each transformer after calling fit. This feature is used to avoid recomputing a fitted transformer within a pipeline when its parameters and input data are identical to another fitted pipeline during optimization. (More details on memory caching are in the scikit-learn documentation.)
Possible inputs are:
the string 'auto', in which case TPOT uses memory caching with a temporary directory and cleans it up on shutdown;
a path to a cache directory, in which case TPOT uses memory caching with the provided directory and does not clean it up on shutdown;
a joblib.Memory object, in which case TPOT uses that instance for memory caching and does not clean up the cache directory on shutdown; or
None, in which case TPOT does not use memory caching.

use_dask : boolean, optional (default=False)
Whether to use Dask-ML's pipeline optimizations. This avoids refitting the same estimator multiple times on the same split of data. It also provides more detailed diagnostics when using Dask's distributed scheduler.

periodic_checkpoint_folder : path string, optional (default=None)
If supplied, a folder into which TPOT periodically saves the best pipelines found so far while optimizing. Currently this happens once per generation, but no more often than once every 30 seconds. It is useful in several situations: a run may die unexpectedly, and periodically saved pipelines are not lost; it allows tracking progress; and it lets you fetch a pipeline while TPOT is still optimizing.

early_stop : integer, optional (default=None)
How many generations TPOT checks for an improvement in the optimization process. TPOT ends the optimization process if there is no improvement within the given number of generations.

verbosity : integer, optional (default = 0)
How much information TPOT conveys at runtime.
Possible inputs are:
0, TPOT will print nothing,
1, TPOT will print minimal information,
2, TPOT will print more information and provide a progress bar, or
3, TPOT will print everything and provide a progress bar.

disable_update_check: Boolean, optional (default=False)
Flag indicating whether TPOT version checker should be disabled. The update checker will tell you when a new version of TPOT has been released.

log_file : io.TextIOWrapper or io.StringIO, optional (default: sys.stdout)
Saves progress output to a file.

6.2 Methods

fit(features, target)
Runs the TPOT optimization process on the given training data.
Uses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. The pipeline optimization process uses internal k-fold cross-validation to avoid overfitting the provided data. At the end of the optimization process, the best pipeline is trained on the entire set of provided samples.

predict(features)
Uses the optimized pipeline to predict the classes for a feature set.

predict_proba(features)
Uses the optimized pipeline to estimate the class probabilities for a feature set.
Note: this function is only available for pipelines whose final classifier supports predict_proba; otherwise, TPOT raises an error.

score(testing_features, testing_classes)
Returns the score of the optimized pipeline on the given testing data, using a user-specified scoring function. The default scoring function for TPOTClassifier is "accuracy".

export(output_file_name, data_file_path)
Exports the optimized pipeline as Python code.
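
A sketch tying these methods together on a held-out test split (X and y assumed loaded):

# sketch: typical fit / score / predict / export sequence
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = TPOTClassifier(generations=5, population_size=50,
                       verbosity=2, random_state=1)
model.fit(X_train, y_train)            # run the optimization on the training data
print(model.score(X_test, y_test))     # accuracy of the best pipeline on held-out data
yhat = model.predict(X_test)           # class predictions from the best pipeline
model.export('tpot_best_pipeline.py')  # write the best pipeline as Python code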

6.3 Configuration

The examples above all used TPOT's default configuration, but TPOT also ships several built-in configuration variants, listed below (a selection example follows the list):

  • TPOT light: if you want simple operators in your pipeline. This configuration also ensures that the operators are fast to execute.
  • TPOT MDR: if your problem is in the field of bioinformatics research; this configuration is well suited for genome-wide association studies.
  • TPOT sparse: if you need a configuration suitable for sparse matrices.
  • TPOT NN: if you want to use TPOT's neural network estimators on top of the default configuration; these estimators are written in PyTorch.
  • TPOT cuML: if your dataset is medium or large and you want to leverage GPU-accelerated estimators to search for the best pipeline within a restricted configuration.
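
Each built-in variant is selected by passing its name as the config_dict string, for example:

# sketch: select a built-in configuration by name
from tpot import TPOTClassifier

model = TPOTClassifier(generations=5, population_size=50,
                       config_dict='TPOT light', verbosity=2)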

7 TPOT Disadvantages: Time-consuming

With a large number of parameters, automatically searching for a pipeline architecture is quite time-consuming, and this is TPOT's main weakness. With the default TPOT settings (100 generations with a population size of 100), TPOT evaluates 10,000 pipeline configurations before completing. To put that in context, consider a grid search over 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search would take. Each of these 10,000 models is evaluated with 10-fold cross-validation, which means that roughly 100,000 models are fitted and evaluated on the training data. This is a time-consuming process, even for simple models like decision trees.
A typical TPOT run takes hours to days to complete (unless the dataset is small), but you can interrupt a run and inspect the best results so far. TPOT also provides the warm_start parameter, which can restart a previous TPOT run from where it was interrupted.
Possible solution:
When working on a data mining problem, you can sample a small part of the data and run TPOT on it after data cleaning to get a baseline; the result may already be good. A sketch follows:
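
A sketch of that workflow, assuming X and y hold the cleaned data: draw a small stratified sample and run a reduced search on it (alternatively, TPOT's own subsample parameter does the subsampling internally):

# sketch: run TPOT on a stratified 10% sample for a quick baseline
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_small, _, y_small, _ = train_test_split(X, y, train_size=0.1,
                                          stratify=y, random_state=1)
model = TPOTClassifier(generations=5, population_size=20,
                       verbosity=2, random_state=1, n_jobs=-1)
# alternatively: TPOTClassifier(subsample=0.1, ...) subsamples internally
model.fit(X_small, y_small)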
