Let machine learning do the modeling for us: 4 Python libraries that will open your eyes

Automated machine learning (usually abbreviated AutoML) is an emerging field in which the process of building machine learning models to model data is automated. AutoML makes modeling easier and more accessible to everyone.

 

 

If you are interested in AutoML, these four Python libraries are well worth a look!

 

1. auto-sklearn

auto-sklearn is an automated machine learning toolkit that integrates seamlessly with the standard sklearn interface many practitioners already know. Using recent methods such as Bayesian optimization, the library navigates the space of possible models and learns to infer whether a particular configuration will perform well on a given task.

The library was created by Matthias Feurer and colleagues, and its technical details are described in the paper "Efficient and Robust Automated Machine Learning". Feurer et al. write:

…we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters).

auto-sklearn may be the best library for getting started with AutoML. In addition to discovering the data preparation and model selection for a data set, it learns from models that performed well on similar data sets and can combine the best-performing models into an ensemble.
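To make the idea of combining the best-performing models more concrete, here is a minimal, library-free sketch of greedy ensemble selection, the general technique the auto-sklearn paper describes for building its final ensemble. This is not auto-sklearn's own code: the model names, the validation probabilities, and the helper functions are all made up for illustration.

```python
from collections import Counter

def accuracy(probs, labels):
    """Fraction of samples whose thresholded probability matches the 0/1 label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

def average(members, predictions):
    """Average the class-1 probabilities of the chosen models, sample by sample."""
    columns = zip(*(predictions[name] for name in members))
    return [sum(column) / len(members) for column in columns]

def greedy_ensemble(predictions, labels, ensemble_size):
    """Add, with replacement, whichever model most improves validation accuracy."""
    members = []
    for _ in range(ensemble_size):
        best = max(
            sorted(predictions),  # sorted so that ties are broken deterministically
            key=lambda name: accuracy(average(members + [name], predictions), labels),
        )
        members.append(best)
    return Counter(members)  # how often each model was picked, i.e. its weight

# Hypothetical validation-set probabilities for class 1 from three fitted models.
val_labels = [1, 0, 1, 1, 0, 0]
val_predictions = {
    "random_forest": [0.9, 0.2, 0.35, 0.8, 0.3, 0.65],
    "svm": [0.6, 0.4, 0.7, 0.45, 0.1, 0.3],
    "knn": [0.3, 0.95, 0.6, 0.7, 0.7, 0.2],
}

weights = greedy_ensemble(val_predictions, val_labels, ensemble_size=3)
print(weights)
```

In this toy run the search picks svm twice and random_forest once, so svm effectively gets twice the weight, while knn, the weakest model here, is left out. The averaged ensemble classifies all six validation samples correctly even though no single model does.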

Beyond its efficient implementation, auto-sklearn requires minimal user interaction. Install the library with pip install auto-sklearn.

The main classes are AutoSklearnClassifier and AutoSklearnRegressor, for classification and regression tasks respectively. Both take the same user-specified parameters, the most important of which are the time limits and the ensemble size.

 
  import autosklearn.classification  # the submodule must be imported explicitly
  # For regression tasks, use autosklearn.regression.AutoSklearnRegressor() instead.
  # X_train, y_train and X_test are assumed to be prepared already.
  model = autosklearn.classification.AutoSklearnClassifier(
      ensemble_size=10,             # size of the final ensemble (minimum is 1)
      time_left_for_this_task=120,  # total number of seconds the search may run
      per_run_time_limit=30,        # maximum number of seconds for each model
  )
  model.fit(X_train, y_train)            # start the model search
  print(model.sprint_statistics())       # print search statistics
  y_predictions = model.predict(X_test)  # get predictions from the fitted model

auto-sklearn official documentation: https://automl.github.io/auto-sklearn/master/

 

 

2. TPOT

TPOT is another Python library for automated modeling, but it places more emphasis on data preparation, alongside modeling algorithms and model hyperparameters. It automates feature selection, preprocessing, and construction through a tree-based structure evolved with genetic programming. The name stands for "Tree-based Pipeline Optimization Tool", a tool that automatically designs and optimizes machine learning pipelines.

Each program, or pipeline, is represented as a tree. Genetic programming selects and evolves these programs to maximize the final score of each automated machine learning pipeline.
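The select-and-evolve loop can be sketched in a few lines of plain Python. This is a heavily simplified illustration and not TPOT's implementation: pipelines are flat lists rather than trees, only mutation is used (no crossover), and `toy_score` is a made-up stand-in for the cross-validated score a real system would compute by fitting each pipeline. The step names and all function names are hypothetical; `generations` and `population_size` simply mirror the TPOT parameters of the same name.

```python
import random

PREPROCESSORS = ["scale", "pca", "select_k_best", "polynomial_features"]
MODELS = ["logistic_regression", "random_forest", "gradient_boosting"]

def random_pipeline(rng):
    """A pipeline is zero to three preprocessing steps followed by one model."""
    steps = [rng.choice(PREPROCESSORS) for _ in range(rng.randint(0, 3))]
    return steps + [rng.choice(MODELS)]

def mutate(pipeline, rng):
    """Return a copy with the model swapped, or a preprocessing step added or dropped."""
    child = list(pipeline)
    move = rng.choice(["swap_model", "add_step", "drop_step"])
    if move == "swap_model":
        child[-1] = rng.choice(MODELS)
    elif move == "add_step":
        # Insert before the model so the model always stays last.
        child.insert(rng.randrange(len(child)), rng.choice(PREPROCESSORS))
    elif len(child) > 1:  # drop_step, only if there is a preprocessing step to drop
        del child[rng.randrange(len(child) - 1)]
    return child

def toy_score(pipeline):
    """Made-up fitness; a real system would fit the pipeline and cross-validate it."""
    model_score = {"logistic_regression": 0.70, "random_forest": 0.80, "gradient_boosting": 0.82}
    score = model_score[pipeline[-1]]
    score += 0.05 * ("scale" in pipeline[:-1])
    score += 0.03 * ("select_k_best" in pipeline[:-1])
    score -= 0.01 * len(pipeline)  # mild penalty for long pipelines
    return score

def evolve(generations=5, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the better half, so the best pipeline is never lost.
        survivors = sorted(population, key=toy_score, reverse=True)[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rng) for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=toy_score)

best = evolve()
print(best, round(toy_score(best), 3))
```

Because the better half of each generation survives unchanged, the best score can only stay the same or improve from one generation to the next, which is why longer runs tend to help.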

As Pedro Domingos said, "A dumb algorithm with lots of data beats a clever one with limited data." This is indeed the case: TPOT can generate complex data preprocessing pipelines.

Like many AutoML algorithms, the TPOT pipeline optimizer can take several hours to produce good results (unless the data set is small). You can run these long jobs in Kaggle commits or Google Colab.

 
  import tpot
  # X_train, y_train, X_test and y_test are assumed to be prepared already.
  pipeline_optimizer = tpot.TPOTClassifier(
      generations=5,       # number of iterations of the optimization process
      population_size=20,  # number of pipelines kept in each generation
      cv=5,                # number of folds for StratifiedKFold cross-validation
  )
  pipeline_optimizer.fit(X_train, y_train)                # run the optimizer (may take a long time)
  print(pipeline_optimizer.score(X_test, y_test))         # print the pipeline score
  pipeline_optimizer.export('tpot_exported_pipeline.py')  # export the best pipeline as Python code

One of TPOT's best features is that it can export the model as a Python code file for later use.

TPOT official documentation: https://epistasislab.github.io/tpot/

TPOT examples: https://epistasislab.github.io/tpot/examples/

 

 

3. HyperOpt

HyperOpt is a Python library for Bayesian optimization developed by James Bergstra. It is designed for large-scale optimization of models with hundreds of parameters, is explicitly aimed at optimizing machine learning pipelines, and offers options for scaling the optimization process across multiple cores and machines.
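As a rough idea of what working with HyperOpt directly involves, the sketch below defines a search space and an objective and hands both to `fmin` with the TPE algorithm. The objective is a made-up toy loss, not a real model, and the parameter names, ranges, and function names are assumptions chosen for illustration. The hyperopt import sits inside `run_search` so that the objective can be tried without the library installed.

```python
import math

def objective(params):
    """Toy loss with its minimum at learning_rate=0.01 and max_depth=6 (made up)."""
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    depth_term = ((params["max_depth"] - 6) / 10) ** 2
    return lr_term + depth_term

def run_search(max_evals=50):
    """Minimize the objective with HyperOpt's TPE algorithm (needs `pip install hyperopt`)."""
    from hyperopt import fmin, hp, tpe  # imported lazily, only when a search is run

    # Every hyperparameter needs a label and an explicit prior distribution.
    space = {
        "learning_rate": hp.loguniform("learning_rate", math.log(1e-4), math.log(1.0)),
        "max_depth": hp.quniform("max_depth", 2, 12, 1),
    }
    # fmin returns a dict with the best value found for each label.
    return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals)

# The objective alone can be evaluated without hyperopt:
print(objective({"learning_rate": 0.01, "max_depth": 6}))
print(objective({"learning_rate": 0.1, "max_depth": 3}))
```

In a real project the objective would train a model with the sampled parameters and return its validation loss, and `run_search()` would be called to drive the optimization.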

However, HyperOpt is difficult to use directly because it is very technical and requires careful specification of optimization procedures and parameters. Instead, I suggest using HyperOpt-sklearn, a wrapper around HyperOpt for the sklearn library.

Specifically, although HyperOpt does support preprocessing, its focus is still on the many hyperparameters of a specific model. If we look closely at one HyperOpt-sklearn search result, we find that it settled on a gradient boosting classifier with no preprocessing:

 
  {'learner': GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                             learning_rate=0.009132299586303643, loss='deviance',
                             max_depth=None, max_features='sqrt',
                             max_leaf_nodes=None, min_impurity_decrease=0.0,
                             min_impurity_split=None, min_samples_leaf=1,
                             min_samples_split=2, min_weight_fraction_leaf=0.0,
                             n_estimators=342, n_iter_no_change=None,
                             presort='auto', random_state=2,
                             subsample=0.6844206624548879, tol=0.0001,
                             validation_fraction=0.1, verbose=0,
                             warm_start=False), 'preprocs': (), 'ex_preprocs': ()}
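A search that produces a result dictionary like the one above could look roughly like this. The sketch follows the usage shown in the HyperOpt-sklearn project's own README (`HyperoptEstimator`, `any_classifier`, `any_preprocessing`, `best_model`), which the article itself does not show. The wrapper function, its name, and its default values are assumptions for illustration, and the exact API may differ between hpsklearn versions.

```python
def search_best_model(X_train, y_train, max_evals=50, trial_timeout=60):
    """Search classifiers and preprocessing steps with HyperOpt-sklearn.

    Requires the `hpsklearn` and `hyperopt` packages, imported lazily here.
    X_train and y_train are assumed to be prepared already.
    """
    from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
    from hyperopt import tpe

    estimator = HyperoptEstimator(
        classifier=any_classifier("clf"),        # search over the supported classifiers
        preprocessing=any_preprocessing("pre"),  # search over the supported preprocessing steps
        algo=tpe.suggest,                        # HyperOpt's TPE search algorithm
        max_evals=max_evals,                     # number of configurations to try
        trial_timeout=trial_timeout,             # seconds allowed per configuration
    )
    estimator.fit(X_train, y_train)
    # Dictionary with the keys 'learner', 'preprocs' and 'ex_preprocs'.
    return estimator.best_model()
```

Calling `search_best_model(X_train, y_train)` with a small `max_evals` is a cheap way to check that everything is wired up before committing to a long search.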

Documentation for building HyperOpt-sklearn models: http://hyperopt.github.io/hyperopt-sklearn/

Although HyperOpt is considerably more complicated than auto-sklearn and TPOT, it is worth the effort if hyperparameters matter a great deal to your model.

4. AutoKeras

Compared with standard machine learning models, neural networks and deep learning are more powerful and therefore harder to automate.

With AutoKeras, a neural architecture search algorithm finds the best architecture: the number of neurons in a layer, the number of layers, which layers to include, and layer-specific parameters such as filter size or the dropout rate (the fraction of neurons dropped). Once the search is complete, the model can be used like a normal TensorFlow/Keras model.

With AutoKeras we can build models containing complex elements, such as embeddings and spatial reduction, that are otherwise not easily accessible to people who are still learning deep learning.

When AutoKeras creates a model for you, much of the preprocessing, such as vectorizing or cleaning text data, is done and optimized for you.

Starting and training a search takes only two lines of code. AutoKeras has a Keras-like interface, so it is easy to remember and use.

AutoKeras supports text, image, and structured data, and offers interfaces both for beginners and for those who want more technical control. AutoKeras uses neural architecture search to take the heavy lifting and guesswork off your hands.

Although AutoKeras takes a long time to run, many user-specified parameters are available to control the running time, the number of models explored, the size of the search space, and so on.
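A minimal sketch of the two lines mentioned above, together with two of the parameters that bound the running time, might look like this. It assumes the `autokeras` package and its `TextClassifier` class. The wrapper function, its name, and the small default values are chosen for illustration only, and details can vary between AutoKeras versions.

```python
def search_text_classifier(x_train, y_train, max_trials=3, epochs=2):
    """Run a small AutoKeras search and return the best model as a Keras model.

    Requires the `autokeras` package, imported lazily here.
    x_train is assumed to be an array of raw text strings, y_train the labels.
    """
    import autokeras as ak

    # The two lines that start and train the search:
    clf = ak.TextClassifier(max_trials=max_trials, overwrite=True)  # max_trials caps the number of models tried
    clf.fit(x_train, y_train, epochs=epochs)                        # epochs caps the training of each model

    # Export the best model found so it can be used like any other Keras model.
    return clf.export_model()
```

Keeping `max_trials` and `epochs` small gives a quick first pass; raising them trades running time for a wider search.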

Consider this architecture, generated by AutoKeras for a text classification task:

 
  Hyperparameter                                             | Value       | Best Value So Far
  text_block_1/block_type                                    | transformer | transformer
  classification_head_1/dropout                              | 0           | 0
  optimizer                                                  | adam        | adam
  learning_rate                                              | 0.001       | 0.001
  text_block_1/max_tokens                                    | 20000       | 20000
  text_block_1/text_to_int_sequence_1/output_sequence_length | 200         | 200
  text_block_1/transformer_1/pretraining                     | none        | none
  text_block_1/transformer_1/embedding_dim                   | 32          | 32
  text_block_1/transformer_1/num_heads                       | 2           | 2
  text_block_1/transformer_1/dense_dim                       | 32          | 32
  text_block_1/transformer_1/dropout                         | 0.25        | 0.25
  text_block_1/spatial_reduction_1/reduction_type            | global_avg  | global_avg
  text_block_1/dense_block_1/num_layers                      | 1           | 1
  text_block_1/dense_block_1/use_batchnorm                   | False       | False
  text_block_1/dense_block_1/dropout                         | 0.5         | 0.5
  text_block_1/dense_block_1/units_0                         | 20          | 20

AutoKeras tutorial: https://towardsdatascience.com/automl-creating-top-performing-neural-networks-without-defining-architectures-c7d3b08cddc

AutoKeras official documentation: https://autokeras.com/

 

 

Comparison: Which AutoML library should you use?

If your top priority is a clean, simple interface and relatively fast results, choose auto-sklearn. It also integrates naturally with sklearn, works with commonly used models and methods, and gives you a lot of control over the running time.

If your top priority is high accuracy, regardless of the long training time that may be required, use TPOT. It emphasizes advanced preprocessing methods, made possible by representing pipelines as tree structures. Bonus: TPOT can output Python code for the best model.

If your top priority is high accuracy, regardless of the long training time that may be required, you can also use HyperOpt-sklearn. It emphasizes model hyperparameter optimization, which may or may not pay off, depending on the data set and the algorithm.

If your problem calls for a neural network, especially when the data comes as text or images, use AutoKeras. It does require long training times, but there are many options for controlling the running time and the size of the search space.

 

 

Origin blog.csdn.net/weixin_42137700/article/details/108659723