H2O automated machine learning framework: introduction and setup notes

introduction

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces such as R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies such as Hadoop and Spark. H2O provides implementations of many popular algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks, Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

GitHub link: https://github.com/h2oai/h2o-3

Download address: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

This article installs and tests the Python interface on Ubuntu 18.04. H2O can currently be used in three ways: from R, from Python, and on Hadoop. Only the Python route is described below.

H2O introduction and installation

h2o installation

Installation address: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html#install-in-python

Installing the Python package is straightforward. Following the official instructions, install the dependencies and then the precompiled release directly with pip:

pip install requests
pip install tabulate
pip install future

pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
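To confirm that these packages really ended up in the active environment, a small check like the one below can help. It is only a sketch: the helper name `installed_version` is my own, and it relies on `importlib.metadata` from the standard library, which needs Python 3.8 or newer (newer than the Python 3.6 environment used later in this article).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip commands above.
for name in ("requests", "tabulate", "future", "h2o"):
    print(name, installed_version(name) or "NOT INSTALLED")
```

If `h2o` shows up as not installed, the pip command was most likely run in a different environment from the one the interpreter belongs to.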

Even when pip finishes without errors, a Java environment must also be installed: the H2O backend is written in Java, and without it the demo cannot run. If the test environment has no Java, h2o.init() fails as follows:

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Traceback (most recent call last):
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/h2o.py", line 263, in init
    strict_version_check=svc)
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/backend/connection.py", line 385, in open
    conn._cluster = conn._test_connection(retries, messages=msgs)
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/backend/connection.py", line 682, in _test_connection
    % (self._base_url, max_retries, "\n".join(errors)))
h2o.exceptions.H2OConnectionError: Could not establish link to the H2O cloud http://localhost:54321 after 5 retries
[21:59.43] H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Metadata/schemas/CloudV3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90c982c198>: Failed to establish a new connection: [Errno 111] Connection refused',))
[21:59.63] H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Metadata/schemas/CloudV3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90c987c048>: Failed to establish a new connection: [Errno 111] Connection refused',))
[21:59.84] H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Metadata/schemas/CloudV3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90c982c940>: Failed to establish a new connection: [Errno 111] Connection refused',))
[22:00.04] H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Metadata/schemas/CloudV3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90c95b10b8>: Failed to establish a new connection: [Errno 111] Connection refused',))
[22:00.25] H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Metadata/schemas/CloudV3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90c95b17f0>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/h2o.py", line 279, in init
    bind_to_localhost=bind_to_localhost)
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/backend/server.py", line 142, in start
    bind_to_localhost=bind_to_localhost, log_dir=log_dir, log_level=log_level, max_log_file_size=max_log_file_size)
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/backend/server.py", line 272, in _launch_server
    java = self._find_java()
  File "/home/anaconda3/envs/py_h2o/lib/python3.6/site-packages/h2o/backend/server.py", line 444, in _find_java
    raise H2OStartupError("Cannot find Java. Please install the latest JRE from\n"
h2o.exceptions.H2OStartupError: Cannot find Java. Please install the latest JRE from
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html#java-requirements
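The long traceback above can be avoided by checking for Java before calling h2o.init(). The sketch below uses only the standard library and assumes that a `java` executable on PATH is what matters; H2O's own lookup may also check other locations, so treat this as a quick sanity check rather than a copy of its logic. The function names are my own.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the text printed by `java -version` (normally written to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return (result.stderr or result.stdout).strip()


java = find_java()
if java is None:
    print("Java not found on PATH; h2o.init() cannot start a local server.")
else:
    print(java_version_banner(java))
```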

Once Java is installed, test the setup from the Python interactive shell or a Jupyter notebook:

import h2o
h2o.init()
h2o.demo("glm")

Of the three lines above, the second, h2o.init(), connects to a local H2O instance (starting one if necessary) and prints the cluster status; my detailed output is shown further below. One caveat concerns XGBoost: H2O looks for it at startup, so the precompiled version (the .so file pulled down by pip install) may need to be downloaded in advance, and an error may be reported if it is not found. I had already installed it, so I did not verify this; if it is missing, it should have no effect on the rest.

The third line loads the GLM demo bundled with the h2o package. It cannot be run in a Jupyter notebook, because the demo is interactive: you have to keep pressing a key for it to enter the next piece of code. The output below shows the effect; it is a fairly complete demo.

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h2o
>>> h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
--------------------------  ------------------------------------------------------------------
H2O_cluster_uptime:         55 secs
H2O_cluster_timezone:       Asia/Shanghai
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:        3.34.0.7
H2O_cluster_version_age:    6 days
H2O_cluster_name:           H2O_from_python_root_6wla5i
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    10.90 Gb
H2O_cluster_total_cores:    96
H2O_cluster_allowed_cores:  96
H2O_cluster_status:         locked, healthy
H2O_connection_url:         http://localhost:54321
H2O_connection_proxy:       {"http": null, "https": null}
H2O_internal_security:      False
H2O_API_Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version:             3.6.5 final
--------------------------  ------------------------------------------------------------------
>>> h2o.demo("glm")
-------------------------------------------------------------------------------
Demo of H2O's Generalized Linear Estimator.

This demo uploads a dataset to h2o, parses it, and shows a description.
Then it divides the dataset into training and test sets, builds a GLM
from the training set, and makes predictions for the test set.
Finally, default performance metrics are displayed.
-------------------------------------------------------------------------------

>>> # Connect to H2O
>>> h2o.init()

All you need to do here is keep pressing Enter; the terminal then steps through the code and its printed output, from loading the data to producing the final predictions. My overall impression is very good.
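For reference, the steps the demo banner lists (upload, describe, split, fit a GLM, predict, report metrics) can be sketched non-interactively as follows. This is my own approximation, not the code the demo actually runs: the function name `glm_walkthrough` and its parameters are hypothetical, the 70/30 split ratio is an assumption, and it requires the h2o package plus a Java runtime, which is why the imports sit inside the function.

```python
def glm_walkthrough(csv_path, response, predictors, seed=1234):
    """Upload a CSV, split it, fit a binomial GLM and score the held-out part."""
    # Imported lazily so that defining this function does not require h2o.
    import h2o
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    h2o.init()                          # connect to, or start, a local cluster
    frame = h2o.upload_file(csv_path)   # upload and parse the dataset
    frame.describe()                    # print a description of the columns

    # A binomial GLM needs a categorical (factor) response column.
    frame[response] = frame[response].asfactor()

    # Assumed 70/30 split into training and test sets.
    train, test = frame.split_frame(ratios=[0.7], seed=seed)

    glm = H2OGeneralizedLinearEstimator(family="binomial")
    glm.train(x=predictors, y=response, training_frame=train)

    predictions = glm.predict(test)             # predictions for the test set
    performance = glm.model_performance(test)   # default performance metrics
    return predictions, performance
```

Calling it with the path to a CSV file that has a binary target column, the name of that column, and a list of feature column names returns the prediction frame and a metrics object.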

h2o test demo

As before, the Kaggle Titanic data serves as the example. I previously wrote up a Titanic analysis and modeling example that details the process and steps of the analysis. The link is:

Kaggle (1): Random Forest and Titanic

I won't repeat the details here and will go straight to modeling the Titanic data. First, import the data:

import h2o
import pandas as pd
h2o.init()

# Import with h2o
# train_1 = h2o.import_file("/home/data/train.csv")
# test_1 = h2o.import_file("/home/data/test.csv")

# Import with pandas
train = pd.read_csv("/home/data/train.csv")
test = pd.read_csv("/home/data/test.csv")

Originally I wanted to do everything in h2o, but I could not find clear official documentation that maps its operations to their pandas equivalents. H2O does provide many of the functions found in pandas and sklearn, but I did not study them in depth and instead followed a Kaggle notebook. If you are a beginner just starting out with data processing, I would still recommend h2o: its API feels more intuitive to me, although long-time pandas users like me may find it unfamiliar.

Then we can do the relevant analysis:

train_indexs = train.index
test_indexs = test.index
print(len(train_indexs), len(test_indexs))
# 891 418

df = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)
df = df.drop('PassengerId', axis=1)
train = df.loc[train_indexs]
test = df[len(train_indexs):]
test = test.drop(labels=["Survived"], axis=1)
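To see what this concat-then-split pattern does, here is a toy version with made-up rows (the values are hypothetical, not taken from the Kaggle files). It splits by position with `iloc`, which is my own variation on the `loc`-based code above; both give the same rows after `reset_index(drop=True)`.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values).
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})
test = pd.DataFrame({"PassengerId": [4, 5], "Age": [35.0, 54.0]})

n_train = len(train)

# Stack both frames so that any feature processing is applied to them consistently.
df = pd.concat([train, test], axis=0).reset_index(drop=True)
df = df.drop("PassengerId", axis=1)

# Split back by position: the first n_train rows are the training set.
train_part = df.iloc[:n_train]
test_part = df.iloc[n_train:].drop("Survived", axis=1)

print(train_part.shape, test_part.shape)
print(df["Survived"].isna().sum())  # the test rows have no label
print(df["Survived"].dtype)         # becomes float because of the missing labels
```

One side effect worth knowing: because the test rows have no label, Survived turns into a float column after the concat, which is one reason to convert it to a categorical column before training a classifier.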

With the data prepared, we can move on to modeling with h2o:

from h2o.automl import H2OAutoML

hf = h2o.H2OFrame(train)
test_hf = h2o.H2OFrame(test)
hf.head()

# Select the predictors and the target
hf['Survived'] = hf['Survived'].asfactor()
predictors = hf.drop('Survived').columns
response = 'Survived'

# Split the dataset, set the maximum number of models and the maximum runtime as stopping criteria, then train
train_hf, valid_hf = hf.split_frame(ratios=[.8], seed=1234)
aml = H2OAutoML(
    max_models=20,
    max_runtime_secs=300,
    seed=1234,
)

# Note: training uses the full frame hf, so the rows in valid_hf are also seen during training.
aml.train(x=predictors,
        y=response,
        training_frame=hf,
)
"""
......

ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.052432901181573156
RMSE: 0.22898231630755497
LogLoss: 0.19414205126074563
Null degrees of freedom: 890
Residual degrees of freedom: 885
Null deviance: 1186.6551368246774
Residual deviance: 345.9611353466487
AIC: 357.9611353466487
AUC: 0.9815107745076109
AUCPR: 0.9762820080059409
Gini: 0.9630215490152219
......

"""

lb = aml.leaderboard
lb.head(rows=5)
"""
model_id	auc	logloss	aucpr	mean_per_class_error	rmse	mse
StackedEnsemble_BestOfFamily_4_AutoML_1_20211228_171936	0.880591	0.389491	0.868648	0.170573	0.346074	0.119767
GBM_5_AutoML_1_20211228_171936	0.880452	0.392546	0.87024	0.172323	0.347309	0.120624
StackedEnsemble_BestOfFamily_5_AutoML_1_20211228_171936	0.879773	0.391389	0.867679	0.175702	0.347747	0.120928
StackedEnsemble_AllModels_2_AutoML_1_20211228_171936	0.879339	0.395392	0.866278	0.175774	0.348615	0.121533
StackedEnsemble_BestOfFamily_3_AutoML_1_20211228_171936	0.878607	0.394689	0.86662	0.178339	0.348864	0.121706

"""

from sklearn import metrics

valid_pred = aml.leader.predict(valid_hf)
metrics.accuracy_score(valid_hf.as_data_frame()['Survived'], valid_pred.as_data_frame()['predict'])
"""
0.9441340782122905
"""

These are the results of my run. The accuracy is even higher than that of the random forest in my previous notes, although it was measured on valid_hf, whose rows were part of the training frame hf, so it is optimistic compared with the cross-validated leaderboard metrics. h2o does have its advantages and greatly simplifies the code. The performance is very good, except that in my experience it runs a little slowly.
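Since the AutoML call above was given the full frame `hf`, the rows in `valid_hf` were not really held out. A stricter variant, sketched below, trains on `train_hf` only and scores the leader on `valid_hf`. The function names `accuracy` and `holdout_accuracy` are my own, the AutoML settings simply repeat the ones used above, and I have not rerun the experiment, so no new numbers are claimed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty input")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def holdout_accuracy(hf, response="Survived", seed=1234):
    """Train AutoML on 80% of an H2OFrame and return accuracy on the other 20%."""
    # Imported lazily: needs the h2o package and a running cluster (h2o.init()).
    from h2o.automl import H2OAutoML

    train_hf, valid_hf = hf.split_frame(ratios=[0.8], seed=seed)
    predictors = [c for c in hf.columns if c != response]

    aml = H2OAutoML(max_models=20, max_runtime_secs=300, seed=seed)
    # Only the training split is passed in, so valid_hf stays unseen.
    aml.train(x=predictors, y=response, training_frame=train_hf)

    predicted = aml.leader.predict(valid_hf).as_data_frame()["predict"]
    truth = valid_hf.as_data_frame()[response]
    return accuracy(truth, predicted)
```

With `hf` prepared as above (including the `asfactor()` conversion), `holdout_accuracy(hf)` should give a number that is comparable to the leaderboard metrics rather than to the training metrics.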

H2O flow

H2O's products are roughly:

H2O Flow: an open source, web-based interface to the distributed machine learning framework, for building models quickly in the browser;
Deep Water: H2O's deep learning integration, with TensorFlow, MXNet and Caffe as backends;
Sparkling Water: combines H2O's scalable machine learning algorithms with Spark. With Sparkling Water, users can drive computation from Scala/R/Python and use the H2O Flow UI, giving application developers an ideal machine learning platform;
Steam: a solution for real-time intelligent machine learning applications;
Driverless AI: an automatic machine learning platform.

Quoted from the download page: https://www.h2o.ai/download/

The Python route described above still requires writing code, as do R and Hadoop, whereas H2O Flow is essentially a complete web application that needs almost no code.

H2O flow installation

First, make sure the server has a Java environment, because the h2o backend is written in Java:

root@R740:/home/program# java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

With Java in place, download the latest h2o package, which includes Flow, from the download link above. Of the five products listed above, all except Driverless AI are open source. Then scp the archive to the server, unzip it, and start it directly from the command line:

>> unzip h2o-3.34.0.7.zip
>> cd h2o-3.34.0.7/
>> java -jar h2o.jar

If all goes well, the last lines of the log give an address, here http://localhost:54323. Opening it in a browser takes you straight to the h2o Flow page, with no password verification.
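Whether the server is really up can also be checked from a script before opening the browser. The sketch below queries the `/3/Cloud` REST endpoint with the standard library only. The helper name `h2o_cloud_status` is my own, the port is the one from my log, and the `version` field in the response is an assumption on my part, which is why it is read with `.get`.

```python
import http.client
import json
import urllib.request


def h2o_cloud_status(base_url, timeout=3.0):
    """Return the parsed /3/Cloud response of an H2O server, or None if unreachable."""
    url = base_url.rstrip("/") + "/3/Cloud"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a malformed URL or a non-JSON response.
        return None


status = h2o_cloud_status("http://localhost:54323")
if status is None:
    print("H2O is not reachable at this address")
else:
    # "version" is assumed to be present in the response.
    print("H2O is up, version:", status.get("version"))
```

Once the server answers, a Python session can attach to it with `h2o.connect(url="http://localhost:54323")` instead of starting a second instance with `h2o.init()`.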

The items listed under Assistance are:

  • importFiles (read dataset)
  • importSqlTables (read SQL tables)
  • getFrames (view already read datasets)
  • splitFrame (divide a dataset into multiple datasets)
  • mergeFrames (column or row combination of two datasets)
  • getModels (view all trained models)
  • getGrids (view grid search results)
  • getPredictions (view model prediction results)
  • getJobs (view the tasks currently trained by the model)
  • runAutoML (automatic modeling)
  • buildModel (build the model manually)
  • importModel (read model from local)
  • predict (use the model to predict)

These follow the usual modeling workflow and have to be used in a certain order. For example, if no dataset has been imported, clicking predict directly finds no records or models: the drop-down lists for choosing a model and a dataset are empty. Getting the most out of this web interface therefore takes some effort to learn. The official README is a more detailed and accessible manual:

https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/README.md

To learn by doing, the Help tab on the far right of the page offers packs of example flows from GitHub, which we can load, run, and observe directly:

For example, after loading xgboost_example.flow and running the whole flow, you can see all the charts it produces.


references:

[1]. https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/README.md

[2]. A preliminary study on h2o flow

[3]. In-depth analysis of AutoML framework - H2O: an automatic machine learning platform that Xiaobai can also use

[4]. Automated Modeling | Introduction to H2O Open Source Tools

[5]. https://cs.uwaterloo.ca/~tozsu/courses/CS848/W19/projects/Singla,%20Rao%20-%20SparkMLvsSparklingWater.pdf

[6]. The H2O Python Module

[7]. https://www.kaggle.com/andreshg/titanic-dicaprio-s-safety-guide-h2o-automl


Origin blog.csdn.net/submarineas/article/details/122177936