Mirror: Architecture and Practice of 58's Visual Data Intelligence Platform

Background

Mirror is a visual data intelligence platform for developing data products on top of big data. For people without a data science background, the traditional machine learning and data modeling process has a high barrier to entry, which shows up mainly in the following areas:

1. Machine learning concepts are abstract

Concepts such as training set, validation set, test set, features, dimensions, label leakage, underfitting, overfitting, learning curves, validation curves, ROC curves, and confusion matrices must not only be understood in themselves; one also needs to know the scenarios in which they apply and how to use them.

2. The machine learning modeling process is complex

The process spans data preparation, data preprocessing, statistical analysis, feature engineering, model training, and model evaluation and comparison. Statistical analysis and feature engineering in particular involve a large amount of feature generation, feature transformation, and feature selection, and they take up most of the time in data mining.

3. It demands mathematical derivation and engineering skills

Beyond familiarity with Python and R, big data environments also make heavy use of Java and Scala. Scenarios that require implementing custom algorithms to fit the business involve extensive formula derivation, as well as higher-order skills such as parallel model optimization.

 

Goals

We define non-expert users as people who understand the business well and are sensitive to policies and rules, but relatively lack algorithm engineering skills, such as business staff, operations staff, and data product managers. We define expert users as people with rich modeling experience and strong algorithm engineering skills, but a weaker understanding of the business.

Mirror serves both types of users. For them we designed a visual user interface, a rich set of algorithm components, convenient parameter tuning, and detailed evaluation and comparison reports.

Mirror aims to lower the barrier to machine learning for non-expert users and give them a data mining tool that accelerates their business exploration, while also supporting expert users by speeding up their end-to-end modeling process.

 

Overall architecture

[Figure: overall architecture of Mirror]

 

 

1. User system

The platform integrates the company's SSO and BSP systems, using the OA account as the sole login account for a unified user system.

2. Security system

The platform is integrated with the multi-tenant big data platform: when a command is executed, the departmental Hadoop account is resolved from the OA account, and from it the account's privileges on Hive tables, HDFS paths, and the corresponding resource queues are obtained.

3. Resource layer

Currently, source data and result data are stored in Hive, and model files are stored in HDFS. The computation engine is Spark: data preprocessing and statistical analysis logic are implemented in Scala, most model algorithms come from Spark MLlib, and some algorithms (XGBoost, LightGBM, FM, etc.) are integrated from third parties.

4. Logical layer

The logical layer currently covers six categories of components: data source/destination, data preprocessing, statistical analysis, feature engineering, machine learning, and tools, for a total of about 70 components.

5. Application layer

The application layer provides project management and experiment management, with complete access control. Data management is integrated with DP, so users can directly read the Hive tables on the big data platform that they have permission to access. Model management currently targets binary classification models and provides comprehensive model comparison, model publishing, and other functions.

Mirror also provides a complete scheduling feature that automatically resolves the dependencies among the multiple components of an experiment, along with a variety of flexible scheduling policies.

6. Service layer

Offline scheduling: a model trained offline can be integrated into the scheduler to form an offline scheduled service that regularly makes predictions on new data with the trained model. The following figure shows the offline scheduling flow:

 

[Figure: offline scheduling flow]

 

 

Online prediction: combined with model publishing, a model trained offline can be published as a service, giving users the ability to make real-time predictions via an HTTP interface.

 

Scheduling dependencies

Given the dependencies among the components below, how can the scheduler quickly determine the order in which they should be executed?

 

[Figure: example component dependency graph]

 

 

The traditional approach resolves dependencies recursively. In Mirror we instead use an approach similar to topological sorting: the dependencies are converted into a two-dimensional array, and any node whose row elements are all 0 can immediately be identified as the next task to execute. This reduces the amount of computation and speeds up scheduling:

 

[Figure: dependency matrix representation]
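The matrix-based scheduling idea can be sketched as follows. This is a minimal illustration, not Mirror's actual implementation; the component names and example graph are made up:

```python
# Sketch of matrix-based dependency scheduling: a component whose entire row
# is zero has no unmet dependencies and is ready to run; finishing a
# component clears its column.

def schedule(components, deps):
    """components: list of names; deps: dict mapping a component to the
    components it depends on. Returns a valid execution order."""
    n = len(components)
    idx = {c: i for i, c in enumerate(components)}
    # matrix[i][j] == 1 means components[i] depends on components[j]
    matrix = [[0] * n for _ in range(n)]
    for c, requires in deps.items():
        for r in requires:
            matrix[idx[c]][idx[r]] = 1

    done = [False] * n
    order = []
    while len(order) < n:
        # a component is runnable when its whole row is zero
        ready = [i for i in range(n)
                 if not done[i] and all(v == 0 for v in matrix[i])]
        if not ready:
            raise ValueError("cyclic dependency detected")
        for i in ready:
            done[i] = True
            order.append(components[i])
            for row in matrix:      # completing i clears column i
                row[i] = 0
    return order

# Example: B and C depend on A; D depends on both B and C
print(schedule(["A", "B", "C", "D"],
               {"B": ["A"], "C": ["A"], "D": ["B", "C"]}))
```

Note that all components whose rows are simultaneously zero can be dispatched in the same round, which is what allows independent components to run in parallel.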

 

 

Feature generation

Feature engineering takes up most of the time in data mining, and within feature engineering, feature generation takes up most of the time. Reducing feature generation time is therefore a key concern for Mirror.

Expert users commonly implement feature generation with third-party Python libraries. FeatureTools is a good example: once the entities, entity relations, and the aggregation and transformation methods are specified, it can automatically generate features with good explainability.

However, such libraries run either on a single machine or within a Python-based parallel framework; they cannot be integrated directly with DP, and they have no way to use the company's cluster resources. Against this background, Mirror made the following design:

 

[Figure: distributed feature generation design]

 

 

The core idea is: data-driven computation.

1. Data Definition

Define the main feature table and each feature sub-table; the main table and sub-tables are linked through association fields.

2. Data Splitting

Split the main-table data and the associated sub-table data by the main table's primary key, and write the splits into designated HDFS directories according to the splitting rules.

3. Distributed Execution

Use Spark to define a sequence and pass each sequence number as a parameter into the Python feature-generation function. The data under each path is processed and stored by its own dedicated Python process.

4. Result Aggregation

The Spark driver process aggregates the execution results of the Python processes and produces the final output.

Summary: the cluster's distributed computing power speeds up feature generation, while Python's third-party libraries extend the available functionality.
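The four steps above can be sketched locally as follows. This sketch runs sequentially in one process; in Mirror the buckets would live in HDFS directories and each bucket would be handled by a Python process launched from a Spark task. The table contents and column names are made-up examples:

```python
# Split / distribute / aggregate flow for feature generation.

def split_by_key(main_rows, child_rows, key, n_buckets):
    """Partition main-table rows and their associated child rows by the
    hash of the join key, mimicking the per-directory HDFS layout."""
    buckets = [{"main": [], "child": []} for _ in range(n_buckets)]
    for row in main_rows:
        buckets[hash(row[key]) % n_buckets]["main"].append(row)
    for row in child_rows:
        buckets[hash(row[key]) % n_buckets]["child"].append(row)
    return buckets

def generate_features(bucket, key):
    """Stand-in for the per-process feature generation (e.g. FeatureTools):
    here we just count and sum each user's child records."""
    out = []
    for m in bucket["main"]:
        related = [c for c in bucket["child"] if c[key] == m[key]]
        out.append({key: m[key],
                    "n_orders": len(related),
                    "amount_sum": sum(c["amount"] for c in related)})
    return out

def run(main_rows, child_rows, key="user_id", n_buckets=4):
    results = []
    for bucket in split_by_key(main_rows, child_rows, key, n_buckets):
        # in Mirror, each bucket runs in its own Python process; the Spark
        # driver then collects the partial results
        results.extend(generate_features(bucket, key))
    return sorted(results, key=lambda r: r[key])

users = [{"user_id": 1}, {"user_id": 2}]
orders = [{"user_id": 1, "amount": 10}, {"user_id": 1, "amount": 5},
          {"user_id": 2, "amount": 7}]
print(run(users, orders))
```

Because rows with the same key always land in the same bucket, each Python process can compute its features without talking to any other process.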

 

Automatic modeling (binary classification)

Industry support for automated machine learning keeps growing: domestically Alibaba PAI and 4Paradigm, and abroad H2O, TransmogrifAI, and others have all gone deep in this area.

DataRobot goes even further, offering non-expert users one-click operation: it integrates feature engineering steps such as data preprocessing, feature generation, feature transformation, and feature selection; integrates many model algorithms across Spark, Python, R, and more; runs algorithms in parallel and automatically selects the best one; and supports one-click model deployment along with post-deployment effect tracking.

Mirror has made a similar attempt, with the goal of generating baseline models for non-expert users.

The overall flow is as follows:

 

[Figure: automatic modeling pipeline]

 

 

1. Data preprocessing

Compute statistics over the feature dimensions and drop features with more than 90% missing values. Split the remaining data 60% / 20% / 20% into training, validation, and test sets respectively.
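A minimal sketch of this preprocessing step, using the thresholds from the text. The toy data and the convention of None as a missing value are assumptions for illustration:

```python
# Drop features that are more than 90% missing, then split rows
# 60/20/20 into train / validation / test sets.

def drop_sparse_features(rows, max_missing=0.9):
    keys = rows[0].keys()
    keep = [k for k in keys
            if sum(r[k] is None for r in rows) / len(rows) <= max_missing]
    return [{k: r[k] for k in keep} for r in rows]

def split_60_20_20(rows):
    n = len(rows)
    a, b = int(n * 0.6), int(n * 0.8)
    return rows[:a], rows[a:b], rows[b:]   # train, validation, test

rows = [{"x": i, "junk": None} for i in range(10)]   # "junk" is 100% missing
rows = drop_sparse_features(rows)
train, valid, test = split_60_20_20(rows)
print(len(train), len(valid), len(test), list(rows[0].keys()))
```

In practice the split would be randomized or time-based rather than positional, but the 60/20/20 proportions are the ones Mirror uses.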

2. Feature engineering

Numeric features and non-numeric features, as well as tree models and non-tree models, are each handled differently.

3. Model training

Four classes of algorithms are currently integrated (RF, GBDT, LR, XGBoost). A default set of parameter combinations is provided for each algorithm, and Spark's distributed capability is used to run every parameter combination of every algorithm (as a grid) in parallel on the 60% training set and 20% validation set.

4. Evaluation report

The results of the models are ranked by evaluation metric, and the evaluation report shows each algorithm's specific performance. The training parameters of the best model are extracted, the model is retrained on the first 80% of the data, and its final performance is evaluated on the remaining 20% test set.
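Steps 3 and 4 can be sketched as a grid search with a ranked report. The scoring function here is a toy stand-in, not Mirror's actual RF/GBDT/LR/XGBoost integrations, and in Mirror the evaluations would be distributed via Spark rather than looped sequentially:

```python
# Expand per-algorithm parameter grids, evaluate every combination on the
# validation set, rank the results, and pick the best model for retraining.
from itertools import product

def param_combinations(grid):
    """Expand {"a": [1, 2], "b": [3]} into [{"a": 1, "b": 3}, {"a": 2, "b": 3}]."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))]

def select_best(grids, evaluate, train, valid):
    """Score every (algorithm, params) pair; return results ranked by metric."""
    results = [(evaluate(algo, p, train, valid), algo, p)
               for algo, grid in grids.items()
               for p in param_combinations(grid)]
    results.sort(key=lambda t: t[0], reverse=True)   # the evaluation report
    return results

# toy scorer: pretend deeper RF models score higher on the validation set
def toy_eval(algo, params, train, valid):
    return params.get("max_depth", 0) / 10 + (0.05 if algo == "RF" else 0.0)

grids = {"RF": {"n_trees": [50, 100], "max_depth": [5, 10]},
         "LR": {"reg": [0.01, 0.1]}}
report = select_best(grids, toy_eval, train=None, valid=None)
best_score, best_algo, best_params = report[0]
print(best_algo, best_params)   # winner is then retrained on 80% of the data
```

Because each (algorithm, params) evaluation is independent, distributing the grid across Spark executors is straightforward; only the ranking step needs to see all the scores.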

 

Outlook

Future work will focus on integration with 云窗 scheduling, high-dimensional feature support, Python model support, and online model prediction.


Origin www.cnblogs.com/cuiyubo/p/11297310.html