Getting started with automated machine learning (AutoML) the easy way: H2O Flow

Think about how tiring it is to write many lines of code every time we create a machine learning model! Even with a universal template for building machine learning models, it is still tedious.

Ever wondered how easy and efficient it would be if we could build machine learning models with the click of a mouse? H2O Flow has a solution for all such problems!

Introduction to H2O Flow

H2O is an open-source machine learning and artificial intelligence platform. It ships with a web-based interface called Flow. H2O Flow can be used to create various types of machine learning models without writing any code: we can build machine learning pipelines with simple clicks. H2O also has API support for R, Python, and Scala.

AutoML (Automated Machine Learning) automates the modeling process, which allows data scientists to focus on other key aspects of the machine learning pipeline, such as feature engineering and model deployment.

H2O Flow installation


Download the latest version of the software from the official H2O Download page [1]. First, make sure the server has a Java environment installed, because H2O itself is written in Java.

With Java in place, get the latest H2O installation package (which includes Flow) from the download page above. Of the five offerings shown on that page, all except Driverless AI are open source. Then copy the package to the server with scp, unzip it, and start it directly from the command line:

unzip h2o-3.34.0.7.zip
cd h2o-3.34.0.7/
java -jar h2o.jar


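If everything goes well, the last lines of the log print an address, here http://localhost:54323 (the default port is 54321; H2O moves to a higher port when that one is already in use). Open this address in a browser to reach the H2O Flow page directly; no password is required.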
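Besides opening the address in a browser, we can check from a script that the server is reachable. The following is a minimal sketch using only the Python standard library. It assumes the H2O REST endpoint /3/Cloud, which reports cluster status; the helper names flow_url and cluster_status are made up for this illustration.

```python
import json
import urllib.request


def flow_url(host: str = "localhost", port: int = 54321) -> str:
    """Build the base URL of an H2O instance (54321 is H2O's default port)."""
    return f"http://{host}:{port}"


def cluster_status(base_url: str, timeout: float = 5.0):
    """Return the parsed /3/Cloud response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/3/Cloud", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection and HTTP errors, ValueError covers bad JSON.
        return None
```

For the server started above, cluster_status(flow_url(port=54323)) should return a dictionary describing the cluster when H2O is running, and None otherwise.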


Once the page loads in the browser, the first impression is that Flow is designed very much like a Jupyter notebook. The right panel is the help section, which is useful for beginners.


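The Assistance panel offers the following routines:

importFiles (import datasets)
importSqlTables (import SQL tables)
getFrames (list the datasets that have been imported)
splitFrame (split one dataset into several datasets)
mergeFrame (combine two datasets by columns or by rows)
getModels (list all trained models)
getGrids (view grid search results)
getPredictions (view model prediction results)
getJobs (view the current model training jobs)
runAutoML (build models automatically)
buildModel (build a model manually)
importModel (load a model from local storage)
predict (make predictions with a model)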

These routines follow the normal modeling workflow, so they have a natural order. For example, if no dataset has been imported and no model has been trained, clicking predict straight away leads nowhere: the drop-down lists for choosing a model and a dataset are empty. Using this web interface well therefore takes some learning; the more detailed official README [2] is a manual you can consult.

Data loading

We will use a freely available dataset [3]. The data concerns a bank's direct marketing campaigns, and based on several features we need to predict whether a customer will subscribe.

Now let's start creating our own Flow notebook. Notice the "+" button on the top toolbar; we can use it to insert cells. Just like in Jupyter notebooks, we can include markdown cells for any text we want to write.

Click the Import file option and specify the location of the data file and start importing. We can also import files from other sources like HDFS and S3 buckets.

There are several ways to import data in H2O Flow:

  • Click the Assist Me button in the Help panel, then click the importFiles link. Enter the file path in the autocomplete search field and press Enter. Select the file from the search results and click the "Add All" link to confirm.

[Figure: Flow_Import_AutoSuggest]

  • In a blank cell, select the CS format and enter importFiles ["path/filename.format"], where path/filename.format is the full path of the file, including the full file name. The path can be a local file path or a website address.
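When many files need to be imported, the text of such a CS cell can be generated instead of typed. Here is a small sketch in Python; the helper name import_files_cell and the path /data/bank-full.csv are hypothetical and only serve the illustration.

```python
import json


def import_files_cell(paths):
    """Return the text of a Flow CS cell that imports the given paths."""
    paths = list(paths)
    if not paths:
        raise ValueError("at least one path is required")
    # json.dumps produces a double-quoted, escaped array literal,
    # which is also a valid CoffeeScript array.
    return "importFiles " + json.dumps(paths)


print(import_files_cell(["/data/bank-full.csv"]))
# importFiles ["/data/bank-full.csv"]
```

The printed line can be pasted into a blank cell set to the CS format.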

Data parsing

Parsing the data means defining its schema. The parse guesser detects the schema for us automatically, and we are free to change any column as needed, for example switching a column between the numeric and categorical (enum) types. Here we change the "day" column to enum because there are only 7 days in a week.

We predict whether customers will sign up for term deposits, so this is a binary classification problem.

Usually one-hot encoding is applied to categorical data before building a machine learning model, but Flow handles one-hot encoding for us automatically. Click Parse, and once parsing is complete we can view the parsed data, including its size, columns, and rows.
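For readers who have not met one-hot encoding before, the idea can be sketched in a few lines of plain Python. This only illustrates the concept; it is not how H2O implements it internally, and the sample values are made up.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (levels, rows): levels is the sorted list of distinct values,
    and each row holds a 1 at the position of its level and 0 elsewhere.
    """
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[index[value]] = 1
        rows.append(row)
    return levels, rows


levels, rows = one_hot(["mon", "tue", "mon", "fri"])
print(levels)  # ['fri', 'mon', 'tue']
print(rows)    # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```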

Data exploration

When exploring and visualizing the data, you can select columns and visualize them individually. You get distributions for numeric columns and frequency counts for categorical columns.

We can see the characteristics and summary of the "age" column along with its frequency distribution.

Here you can see the distribution of the "y" column. Visualizing the target column shows that there is a high degree of class imbalance.

Likewise, we can check other columns as well.
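The class imbalance seen in the "y" column can also be put into numbers. The sketch below computes the share of each class and the ratio between the largest and smallest class; the function names are hypothetical and the ten labels are made up for illustration, not taken from the dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows belonging to each class."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the smallest one."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


toy_labels = ["no"] * 9 + ["yes"]   # made-up labels
print(class_shares(toy_labels))     # {'no': 0.9, 'yes': 0.1}
print(imbalance_ratio(toy_labels))  # 9.0
```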

Flow provides functions for imputing data. This is useful when a linear model is to be fitted to data with missing values. Several imputation methods are provided; the default method is mean.
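Mean imputation itself is simple: every missing entry in a column is replaced by the mean of the observed entries. A minimal sketch in plain Python, with missing values represented as None (an assumption of this example, not of Flow):

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


print(impute_mean([30, None, 50, 40]))  # [30, 40.0, 50, 40]
```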


Split train and test sets

Before we start training the model, we need to split the data into training and test sets. We can do this by navigating to Data -> Split Frame.

Note that the default split ratio for the training and test sets is 75:25. This can be modified as needed. Rename the splits to "training_set" and "test_set".

[Figure: split]
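To make the 75:25 split concrete, here is a sketch of a seeded random split in plain Python. It is an illustration only: H2O's own frame splitting is probabilistic, so the sizes it produces are approximate rather than exact. The function name split_rows is hypothetical.

```python
import random


def split_rows(rows, train_ratio=0.75, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


training_set, test_set = split_rows(range(100))
print(len(training_set), len(test_set))  # 75 25
```

Fixing the seed makes the split reproducible, which helps when comparing models trained at different times.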

Each dataset can be inspected individually by selecting its frame.

[Figure: trainset]

Building models with AutoML

AutoML trains various types of models, including GLM, distributed random forest, extremely randomized trees, deep learning, XGBoost, and stacked ensemble models. It also provides a leaderboard where all models are sorted by a chosen metric.

[Figure: AutoML]

The training_frame, response_column, and validation_frame fields can be set to "training_set", "y", and "test_set", respectively. We can ignore the other options, as they only add advanced functionality.

[Figure: Frame]

The number of cross-validation folds defaults to 5. Since our classes are imbalanced, we can enable the balance_classes option.

If we know some models are irrelevant, we can also exclude them. Change max_runtime_seconds to 300 seconds. AutoML keeps training models until max_runtime_seconds have elapsed and then stops. The default setting is 3600 seconds.

[Figure: max_runtime_seconds]

Finally, we can start building the models by clicking the "Build Models" option. While training runs, we can watch live updates of the models and view the scoring history as a graph in real time.

[Figure: build models]
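For comparison, roughly the same AutoML run can be scripted with H2O's Python API. This is a sketch under assumptions, not the article's method: it assumes the h2o package is installed, that the time limit the article calls max_runtime_seconds is the max_runtime_secs argument of the Python API, and that the file path passed in is your own. The helper names automl_settings and run_automl are hypothetical.

```python
def automl_settings(max_runtime_secs=300, nfolds=5, balance_classes=True, seed=1):
    """Collect the AutoML options chosen in the Flow form above."""
    return {
        "max_runtime_secs": max_runtime_secs,  # stop after five minutes
        "nfolds": nfolds,                      # cross-validation folds
        "balance_classes": balance_classes,    # counter the class imbalance
        "seed": seed,
    }


def run_automl(path, response="y", train_ratio=0.75, **overrides):
    """Import a file, split it, run AutoML, and return the leaderboard."""
    # Imported lazily so that automl_settings works without h2o installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # connects to a local H2O instance or starts one
    frame = h2o.import_file(path)
    frame[response] = frame[response].asfactor()  # treat target as categorical
    train, test = frame.split_frame(ratios=[train_ratio], seed=1)

    aml = H2OAutoML(**{**automl_settings(), **overrides})
    # Mirrors the Flow form: the test split is passed as validation_frame.
    aml.train(y=response, training_frame=train, validation_frame=test)
    return aml.leaderboard
```

Calling run_automl with the path to the bank marketing CSV returns the leaderboard as an H2O frame; keyword overrides such as max_runtime_secs=60 shorten the run.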

Model exploration

Since we set max_runtime_seconds to 300, the training process takes about five minutes. After the job is complete, we can navigate to the model leaderboard.

All models trained by AutoML are displayed in sorted order based on performance. In this case, the XGBoost model is the best-performing model.

[Figure: Leaderboard]

AutoML provides visualizations of various metrics that can be used for model exploration. We can click on the curve for more details.

[Figure: visualization]

We can also check the importance of variables. We can see that the duration variable is highly predictive and is used by this model.

[Figure: importance]

Confusion matrices are also important because various evaluation metrics, such as accuracy, precision, and recall, are derived from them. We can also highlight specific entries and view them.

[Figure: confusion matrix]
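Common metrics derived from a binary confusion matrix can be computed directly from its four counts. The sketch below uses the standard definitions; the function name binary_metrics is hypothetical and the counts in the example are made up, not results from the model above.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
print(metrics["accuracy"], metrics["precision"])  # 0.85 0.8
```

With imbalanced classes, precision and recall on the minority class are usually more informative than accuracy alone.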

Prediction

After we are satisfied with the selected model, we can move on to making predictions.

First select the Predict option from the toolbar, then select the model to use, then select the validation frame. Clicking the Predict button then shows the predicted values along with various evaluation metrics such as mean squared error.
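As a reminder of what mean squared error measures, here is a plain-Python sketch. For a binary classifier it is assumed here that the error is taken between the 0/1 label and the predicted probability of the positive class; the numbers are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up labels and predicted probabilities.
print(mean_squared_error([1, 0, 0, 1], [0.5, 0.5, 0.0, 1.0]))  # 0.125
```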


We can see the confusion matrix and analyze our results along with various metrics and graphical representations.


At this point, we have learned to use H2O through H2O Flow, its simple web-based UI, and to train and visualize models without writing any code. Hopefully this article makes it easier to understand and use AutoML for machine learning modeling. If you want to dig deeper into H2O Flow, I recommend the official documentation.


Origin blog.csdn.net/weixin_38037405/article/details/124291856