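Cross-Platform Machine Learning in Practice: A Summary (Part 1)

I. Where the problem comes from:

How can a Node web service call a trained sklearn model to make real-time predictions?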

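II. Problem analysis:

1. A trained sklearn model can be stored in several ways:

(1) pickle.dumps: the serialized result is held in memory in a variable

The pickle documentation: https://docs.python.org/2/library/pickle.html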

>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])

(2) joblib.dump: persists the model to a binary .pkl file, which makes it easier to reuse

>>> # newer scikit-learn releases removed sklearn.externals; use "import joblib" there
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')

At prediction time, another Python process can run:

>>> clf2 = joblib.load('filename.pkl')
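The dump-then-load pattern above can be tried without a trained estimator. The sketch below is only a stand-in for the sklearn workflow: it assumes a hypothetical linear model stored as a plain dict (`toy_model`) and uses the standard-library `pickle.dump`/`pickle.load` in place of `joblib`, writing to a temporary file. The helper names are made up for this illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator: a plain dict of parameters.
toy_model = {"coef": [0.5, 2.0], "intercept": 1.0}


def predict(model, x):
    """Linear prediction: intercept + sum(coef_i * x_i)."""
    return model["intercept"] + sum(c * v for c, v in zip(model["coef"], x))


def save_model(model, path):
    """Persist the model to a binary file (pickle stands in for joblib here)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Rebuild the model from the file, as a separate process would."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    save_model(toy_model, model_path)
    restored = load_model(model_path)

original_prediction = predict(toy_model, [2.0, 3.0])
restored_prediction = predict(restored, [2.0, 3.0])
print(original_prediction, restored_prediction)
```

The restored object is an independent copy with the same parameters, so both calls give the same prediction.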

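(3) A PMML file. PMML is a language standard for describing models in XML

The most useful reference is http://dmg.org/pmml/v4-1/GeneralStructure.html , which explains the structure of a PMML file and the meaning of its tags fairly clearly.

First, how to generate one, using sklearn's GradientBoostingRegressor as an example: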
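Because PMML is plain XML, any language with an XML parser can inspect it. The sketch below parses a hand-written, heavily simplified PMML document with Python's standard library. The document content (field names, coefficients) is invented for illustration and is not the output of any tool; only the overall layout (Header, DataDictionary, then a model element) follows the general structure described in the reference above.

```python
import xml.etree.ElementTree as ET

# Hand-written toy PMML document (illustrative only, not generated by a tool).
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x1" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

root = ET.fromstring(PMML_TEXT)

# Tags come back as "{namespace}Name"; strip the namespace for display.
top_level_tags = [child.tag.split("}")[1] for child in root]

# Names of the fields declared in the data dictionary.
field_names = [
    field.get("name")
    for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
]

print(top_level_tags)
print(field_names)
```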


# Written for older scikit-learn releases: load_boston and loss='ls'
# were removed in scikit-learn 1.2.
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
from sklearn import datasets
from sklearn.utils import shuffle

# Load data
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)

# Hold out the last 10% of the shuffled rows for testing
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

# 500 boosted trees, each with a maximum depth of 4
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
pipeline = PMMLPipeline([
    ("classifier", GradientBoostingRegressor(**params))
])
pipeline.fit(X_train, y_train)

# Export the fitted pipeline to a PMML file
sklearn2pmml(pipeline, "GradientBoostingRegressor.pmml", with_repr = True)

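The scikit-learn source tree contains many examples and datasets that are worth consulting.

In short, to call a model across platforms and predict in real time, the only option is to work from the persisted result. Otherwise every prediction would first have to retrain the model to get it back into memory, and there is no telling how long the user would wait.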

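2. How to use pkl and PMML in a web service

(1) First, consider pkl

A .pkl file is a Python object that has been serialized and persisted. First, serialization:

The pickle module serializes Python objects into a byte stream and deserializes them back.

Note: "serialization", "marshalling" and "flattening" are commonly used for the same thing as pickling, that is, the serialization process.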

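However, pickle provides no protection when loading serialized data: if malicious code has been inserted into it, the pickle module cannot detect that. The check has to happen at this stage, by loading only trusted data.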
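To make the warning above concrete, here is a harmless sketch of why only trusted data should be unpickled. `Payload`, `NoGlobalsUnpickler` and `restricted_loads` are names made up for this illustration. The `find_class` override follows the pattern the Python pickle documentation describes for restricting globals, and it is only one possible mitigation, not a complete defence.

```python
import io
import pickle


class Payload:
    """Harmless demo: the sender chooses an expression that runs at load time."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" instead of the object's state.
        return (eval, ("6 * 7",))


untrusted_bytes = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it calls eval and returns its result.
loaded = pickle.loads(untrusted_bytes)


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global lookup (functions and classes)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but only plain built-in containers and scalars load."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


try:
    restricted_loads(untrusted_bytes)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Data made only of dicts, lists, strings and numbers still loads.
plain_data = restricted_loads(pickle.dumps({"coef": [0.5, 2.0]}))

print(loaded, blocked, plain_data)
```

A real attacker would replace the arithmetic expression with something destructive, which is why a restricted unpickler cannot load a pickled sklearn estimator either: rebuilding the estimator needs exactly the global lookups it forbids.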

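Python also has a marshal module for serialization, but it has drawbacks. For one, it does not keep track of objects it has already serialized, so serializing a recursive object may crash the program.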
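The difference can be seen from the pickle side with a short sketch. It shows that pickle preserves shared and self-referential objects, and that marshal rejects a type it does not support. How marshal behaves on recursive objects depends on the Python version, so that case is deliberately not exercised here. All variable names are illustrative.

```python
import marshal
import pickle

# One list referenced twice, and a list that contains itself.
shared = [1, 2, 3]
container = {"a": shared, "b": shared}
recursive = []
recursive.append(recursive)

restored_container = pickle.loads(pickle.dumps(container))
restored_recursive = pickle.loads(pickle.dumps(recursive))

# pickle remembers objects it has already written, so identity survives.
shares_reference = restored_container["a"] is restored_container["b"]
is_self_referential = restored_recursive[0] is restored_recursive

# marshal only supports a fixed set of built-in types.
try:
    marshal.dumps(object())
    marshal_rejected = False
except ValueError:
    marshal_rejected = True

print(shares_reference, is_self_referential, marshal_rejected)
```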

shelve can persist pickled objects to a dbm file and unpickle them again on access.

According to the official documentation:
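A minimal sketch of the shelve idea, assuming a temporary directory and a made-up key (`toy_model`): the shelf behaves like a dict whose values are pickled into a dbm-backed file.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, "models")

    # First open: store a value; shelve pickles it behind the scenes.
    with shelve.open(shelf_path) as shelf:
        shelf["toy_model"] = {"coef": [0.5, 2.0], "intercept": 1.0}

    # Second open: the value is read back from disk and unpickled.
    with shelve.open(shelf_path) as shelf:
        stored_keys = sorted(shelf.keys())
        restored = shelf["toy_model"]

print(stored_keys, restored)
```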

The data format used by pickle is Python-specific. This has the advantage that there are no restrictions imposed by external standards such as XDR (which can’t represent pointer sharing); however it means that non-Python programs may not be able to reconstruct pickled Python objects.

By default, the pickle data format uses a printable ASCII representation. This is slightly more voluminous than a binary representation. The big advantage of using printable ASCII (and of some other characteristics of pickle’s representation) is that for debugging or recovery purposes it is possible for a human to read the pickled file with a standard text editor.

There are currently 3 different protocols which can be used for pickling.

Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is the old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.
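The quoted passage describes Python 2, where protocol 0 was the default. Python 3 added further protocols and defaults to a binary one, but protocols 0 and 2 can still be requested explicitly. The sketch below, using a made-up `sample` dict, checks that protocol 0 output is plain ASCII and that both protocols restore the same object.

```python
import pickle

# Illustrative data made only of built-in types.
sample = {"coef": [0.5, 2.0], "intercept": 1.0, "name": "toy"}

ascii_pickle = pickle.dumps(sample, protocol=0)   # original text protocol
binary_pickle = pickle.dumps(sample, protocol=2)  # binary protocol from Python 2.3

# Protocol 0 output decodes as ASCII, so it can be opened in a text editor.
ascii_text = ascii_pickle.decode("ascii")

from_ascii = pickle.loads(ascii_pickle)
from_binary = pickle.loads(binary_pickle)

print(len(ascii_pickle), len(binary_pickle))
print(from_ascii == from_binary == sample)
```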

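The pkl format is specific to Python's pickle module, so other languages have no standard way to parse the file.

Using it across platforms is therefore not a realistic option.

(2) Convert to PMML

A PMML API is available for Java, namely JPMML: https://github.com/jpmml

JPMML wraps the class libraries for handling PMML files, is convenient to use, and supports Spark, R, sklearn and xgboost.

Its sklearn support, for example, works together with sklearn2pmml on the Python side.

For testing, rather than using the example module shipped with pmml-evaluator, it is more convenient to write your own pom.xml and run a simple instance.

The JPMML source code can be downloaded from the Maven repository: http://mvnrepository.com/artifact/org.jpmml/pmml-evaluator/1.4.3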

III. Open question

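In the sklearn2pmml sample, GradientBoostingRegressor on the Boston housing data produces 500 trees of depth 4. For the same trained model, Tree 1 in the Segmentation of the PMML file does not match the dot file generated by the Python graphing library: the PMML describes far fewer nodes. The two nevertheless give essentially the same predictions, differing only in default numeric precision.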
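One way to put numbers on this discrepancy is to count the nodes of the first tree on the PMML side. The sketch below does that on a hand-written toy fragment. The fragment, its field names and its scores are invented for illustration and are not taken from the real GradientBoostingRegressor.pmml, and the helper functions are hypothetical. For reference, a full binary tree of depth 4 has at most 31 nodes and 16 leaves, and on the sklearn side each fitted tree exposes its size as `tree_.node_count`.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment shaped like a segmented tree ensemble.
PMML_FRAGMENT = """
<MiningModel xmlns="http://www.dmg.org/PMML-4_1" functionName="regression">
  <Segmentation multipleModelMethod="sum">
    <Segment id="1">
      <True/>
      <TreeModel functionName="regression">
        <Node id="1">
          <True/>
          <Node id="2" score="-1.0">
            <SimplePredicate field="x1" operator="lessOrEqual" value="0.5"/>
          </Node>
          <Node id="3">
            <SimplePredicate field="x1" operator="greaterThan" value="0.5"/>
            <Node id="4" score="0.2">
              <SimplePredicate field="x2" operator="lessOrEqual" value="3.0"/>
            </Node>
            <Node id="5" score="1.4">
              <SimplePredicate field="x2" operator="greaterThan" value="3.0"/>
            </Node>
          </Node>
        </Node>
      </TreeModel>
    </Segment>
  </Segmentation>
</MiningModel>
"""

NS = "{http://www.dmg.org/PMML-4_1}"


def first_tree_root(mining_model):
    """Return the root Node of the tree inside the first Segment."""
    segment = mining_model.find(f"{NS}Segmentation/{NS}Segment")
    return segment.find(f"{NS}TreeModel/{NS}Node")


def count_nodes(node):
    """Count this node plus all nested Node elements."""
    return 1 + sum(count_nodes(child) for child in node.findall(f"{NS}Node"))


def count_leaves(node):
    """Count nodes that have no child Node."""
    children = node.findall(f"{NS}Node")
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)


def tree_depth(node):
    """Depth of the tree, with the root at depth 0."""
    children = node.findall(f"{NS}Node")
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)


tree_root = first_tree_root(ET.fromstring(PMML_FRAGMENT))
node_total = count_nodes(tree_root)
leaf_total = count_leaves(tree_root)
depth = tree_depth(tree_root)
print(node_total, leaf_total, depth)
```

Running the same counts against the real PMML file and comparing them with the sklearn tree's node count would show whether whole subtrees are missing or only the representation differs.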

IV. Working out a solution

To be continued
