Building a One-Stop Machine Learning Platform in Practice

This article is compiled from a talk given by Zheng Yanwei, senior technical expert at Meituan Delivery, at SACC 2019 (China System Architect Conference). It shares the experience and lessons learned by the Meituan Delivery technical team in building a one-stop machine learning platform, in the hope of helping readers working in this field.

0. Foreword

AI is the hot "star" of the Internet industry: established giants and traffic upstarts alike are eager to use AI technology to empower their businesses. Delivery is a key link in the closed loop of the takeaway ordering chain, and delivery efficiency and user experience are the core competitiveness of the delivery business. As order volume rises, riders multiply, and delivery scenarios grow more complex, the algorithms behind the various delivery scenarios face growing challenges on three fronts: faster (algorithms must iterate and go online rapidly), better (the business increasingly depends on machine learning algorithms to produce positive results), and more accurate (prediction algorithms such as estimated time of arrival must closely approximate the true value). For an algorithm to go from research to producing value online, a series of engineering development and integration steps is required, which raises new questions: how do we define the boundary between algorithms and engineering, so that each side focuses on its own duties and strengths? How do we improve the speed and efficiency of algorithm iteration and release? How do we evaluate algorithm results quickly and accurately? This article shares some of the Meituan Delivery technical team's experience in building a one-stop machine learning platform, and we hope it helps or inspires you.

1. Business Background

In July 2019, Meituan's daily takeaway order volume exceeded 30 million, holding a leading market share. Around users, merchants, and riders, Meituan has built a world-leading real-time delivery network and an industry-leading intelligent dispatch system, forming the world's largest takeaway delivery platform.

Making this delivery network run more efficiently, with a better user experience, is a very difficult challenge. We need to solve a large number of complex machine learning and operations research optimization problems, spanning fields such as ETA prediction, intelligent dispatch, map optimization, dynamic pricing, situational awareness, and intelligent operations. At the same time, we must strike a balance among experience, efficiency, and cost.

2. Evolution of the Meituan Delivery machine learning platform

2.1 Why build a one-stop machine learning platform

To solve machine learning problems at this scale, we need a powerful and easy-to-use machine learning platform that frees R&D engineers from complicated engineering work, so they can focus their limited energy on iterating algorithm strategies.

There are many good machine learning platforms in the industry today: commercial products developed by large companies, such as Microsoft Azure, Amazon SageMaker, Alibaba's PAI, Baidu's PaddlePaddle, and Tencent's TI platform, as well as many open-source products, such as UC Berkeley's Caffe, Google's TensorFlow, Facebook's PyTorch, and Apache Spark MLlib. The open-source offerings are mostly basic machine learning or deep learning computing frameworks, focused on training models. The commercial products are typically built as a second layer on top of such frameworks and provide one-stop ecosystem services, supporting users through the whole workflow of data preprocessing, model training, model evaluation, and online model prediction, in order to lower the barrier to entry for algorithm engineers.

The goals and positioning of these company-level one-stop machine learning platforms coincide with our own need for a machine learning platform: provide users with end-to-end one-stop services, free them from complicated engineering work, and let them focus their limited energy on iterating algorithm strategies. With that in mind, Meituan Delivery's one-stop machine learning platform came into being.

The evolution of the Meituan Delivery machine learning platform can be divided into two stages:

  • MVP stage: stay flexible, fail fast, and build rapid iteration capability.

  • Platform stage: with the business growing exponentially and machine learning applied in more and more scenarios, how do we support business growth while also solving problems of system availability, scalability, and R&D efficiency?

2.2 MVP stage

In the initial stage, it was not clear what the machine learning platform should grow into, and many things had not been thought through. But to support business growth, we had to go online fast and fail fast. At this stage, each business line therefore built its own machine learning tools, iterating independently according to its specific business needs and quickly landing machine learning algorithms in its own scenarios: the so-called "chimney mode". This mode is very flexible and can quickly support the individual needs of each business, buying time to seize market opportunities. But as business scale grew, its drawbacks became prominent, mainly in two respects:

  • Reinventing the wheel: feature engineering, model training, and online model prediction were each developed from scratch by every team, so algorithm iteration was inefficient.

  • Inconsistent feature definitions: each business team developed features independently, and the same statistical feature was computed inconsistently across teams, making it difficult for algorithms to work together.

2.3 Platform stage

To stop every department from reinventing the wheel, improve R&D efficiency, unify the computation of business metrics and features, and standardize the delivery-side data system, the Meituan Delivery R&D team set up a dedicated algorithm engineering team to consolidate the machine learning tools of each business line into a unified machine learning platform. The requirements covered the following aspects:

  • At the bottom, rely on the company's internal Hadoop/YARN resource scheduling and management; integrate the Spark ML, XGBoost, and TensorFlow machine learning frameworks; and stay extensible so other frameworks can be plugged in easily, such as Meituan's in-house MLX (an ultra-large-scale machine learning platform tailored for search, recommendation, and advertising scenarios, supporting tens of billions of features and streaming updates).

  • On top of Spark ML, XGBoost, and TensorFlow, build a visual offline training platform that generates DAGs by drag and drop, masks the differences among the training frameworks, unifies model training and resource allocation, and lowers the entry barrier for algorithm engineers.

  • A model management platform that provides unified solutions for model registration, discovery, deployment, switching, and rollback, and offers highly available real-time online prediction services for machine learning and deep learning models.

  • An offline feature platform that collects and processes online logs and refines them into the features algorithms need, for use both offline and online.

  • A real-time feature platform that collects online data in real time, computes and extracts the features algorithms need, and pushes them online in real time for use.

  • A version management platform that manages algorithm versions, as well as the model, features, and parameters used by each algorithm version.

  • An AB experiment platform that validates algorithm results faster and better through scientific traffic splitting and evaluation methods.

3. The Turing platform

In the platform stage, we positioned the Meituan Delivery machine learning platform as a one-stop machine learning platform: provide algorithm engineers with one-stop services covering the whole process of research, development, launch, and effect evaluation, including data processing, feature production, sample production, model training, model evaluation, model release, online prediction, and effect assessment. Accordingly, we gave the platform a bold name: Turing. It may be a bit "audacious", but it also serves as an encouragement for our team.

1) In the data acquisition and processing stage, both online and offline levels are supported, for real-time and offline feature production respectively, with sampling, filtering, normalization, and standardization applied; features are then pushed to the online feature store for use by online services.

2) In the model training stage, classification, regression, clustering, deep learning, and other models are supported, including custom loss functions.

3) In the model evaluation stage, a variety of evaluation metrics are supported, such as AUC, MSE, MAE, and F1.

4) In the model release stage, one-click deployment is provided, with support for local and remote modes: the model is deployed either inside the business service cluster or on a dedicated online prediction cluster, respectively.

5) In the online prediction stage, AB experiments are supported, with flexible gray-release traffic control and unified logging of AB experiment metrics for effect evaluation.
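Stage 2) above mentions support for custom loss functions. As an illustration only (not the platform's actual API): XGBoost-style frameworks accept a custom objective as a function returning per-sample gradients and Hessians. The asymmetric squared loss below, which penalizes underestimated predictions (e.g. an ETA promised too early) more heavily, is a made-up example of that shape:

```python
import numpy as np

def asymmetric_squared_loss(preds, labels, under_weight=2.0):
    """Gradient and Hessian of a squared loss that penalizes predictions
    below the label `under_weight` times more heavily. This (grad, hess)
    pair is the form XGBoost-style custom objectives return."""
    residual = preds - labels
    grad = np.where(residual < 0, under_weight * residual, residual)
    hess = np.where(residual < 0, under_weight, 1.0)
    return grad, hess
```

With `under_weight=1.0` this reduces to the ordinary squared loss; larger values push the model toward over- rather than under-predicting.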

3.1 Offline training platform

The goal of the offline training platform is: build a visual training platform that masks the differences among multiple training frameworks and lowers the entry barrier for algorithm engineers.

To lower the barrier for algorithm engineers entering the machine learning field, we developed an offline training platform with a visual interface: components are combined into a DAG by drag and drop to form a complete machine learning training job.

The currently supported components fall roughly into these categories: input, output, feature preprocessing, dataset processing, machine learning models, and deep learning models. Within each category we developed a number of different components, each supporting different application scenarios. Meanwhile, so as not to lose flexibility, we put a lot of thought into capabilities such as custom parameters, automatic hyperparameter tuning, and custom loss functions, to flexibly meet the varied needs of engineers working in different algorithm directions.

In addition to the model file itself, our offline training platform also outputs an MLDL (Machine Learning Definition Language) file: the information of all the model's preprocessing modules is written into the MLDL file, which is saved in the same directory as the model. When the model is released, the model file and its associated MLDL file are published online together as a whole. At online computation time, the MLDL preprocessing logic is executed automatically before the model computation. MLDL connects online prediction with offline training across the machine learning platform, so that a single set of feature preprocessing code serves both offline and online, guaranteeing offline/online consistency of feature processing.
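The actual MLDL format is not described in the talk, but the idea can be sketched as follows: preprocessing steps are recorded as data at training time and replayed verbatim at prediction time. All field names, operations, and values below are hypothetical.

```python
# Hypothetical MLDL-style descriptor: preprocessing recorded alongside
# the model at training time, replayed before online prediction.
mldl = {
    "model": "eta_gbdt_v3",  # illustrative model name
    "preprocess": [
        {"op": "fill_na", "feature": "rider_load", "value": 0.0},
        {"op": "normalize", "feature": "distance_m", "mean": 1200.0, "std": 400.0},
    ],
}

def apply_preprocess(row, steps):
    """Replay the recorded preprocessing steps on one feature row,
    exactly as they were applied during offline training."""
    row = dict(row)
    for step in steps:
        f = step["feature"]
        if step["op"] == "fill_na" and row.get(f) is None:
            row[f] = step["value"]
        elif step["op"] == "normalize":
            row[f] = (row[f] - step["mean"]) / step["std"]
    return row

row = apply_preprocess({"rider_load": None, "distance_m": 1600.0}, mldl["preprocess"])
```

Because the same recorded steps drive both offline sample production and online serving, the two paths cannot drift apart, which is the consistency guarantee the paragraph above describes.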

To simplify obtaining features when a model is published, we also provide a model-feature binding function that lets users associate features and parameters with the model, so that the model's features are fetched automatically at online prediction time. This greatly reduces the work algorithm engineers spend constructing model inputs.

3.2 Model management platform

As introduced earlier, our Turing platform integrates three underlying training frameworks: Spark ML, XGBoost, and TensorFlow. On this basis, the training platform outputs many kinds of machine learning models: simple ones such as LR and SVM, tree models such as GBDT, RF, and XGB, and deep learning models such as RNN, DNN, LSTM, and DeepFM. The goal of our model management platform is to provide unified solutions for model registration, discovery, deployment, switching, and rollback, and to provide highly available online prediction services for machine learning and deep learning models.

The model management platform supports two deployment modes, local and remote:

  • Local: the model and its MLDL file are pushed to the business service nodes, and the Turing platform provides a Java library embedded in the business-side application; the business side invokes model computation through local interface calls.

  • Remote: the Turing platform maintains a dedicated online computing cluster; models and MLDL files are deployed to this cluster uniformly, and business applications invoke model computation through RPC calls to the online computing service.
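The key point of the two modes above is that the business side sees one prediction interface regardless of where the model runs. A minimal sketch of that idea, with entirely illustrative class names and a toy linear model standing in for the real SDK:

```python
class LocalModel:
    """Model embedded in the business service process ("local" mode);
    a weighted sum stands in for real model computation."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

class RemoteModelClient:
    """Stand-in for an RPC client to the shared online prediction
    cluster ("remote" mode); `rpc_call` abstracts the transport."""
    def __init__(self, rpc_call):
        self.rpc_call = rpc_call

    def predict(self, features):
        return self.rpc_call("model.predict", features)

def get_predictor(mode, **kwargs):
    """Both modes expose the same predict() interface to callers."""
    if mode == "local":
        return LocalModel(kwargs["weights"])
    return RemoteModelClient(kwargs["rpc_call"])
```

The local mode trades deployment coupling for call latency; the remote mode isolates model resources at the cost of an RPC hop, which is why both are offered.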

For very large models that cannot be loaded on a single machine, model sharding is needed. Given the characteristics of Meituan's delivery business, models can be trained partitioned by delivery city/region, producing a small model per city or region; these partition models are then deployed spread across multiple nodes, solving the problem that a single node cannot load a large model. Partitioned models require us to provide a model routing function, so that the business side can accurately locate the node where the corresponding partition model is deployed.

Meanwhile, the model management platform also collects heartbeat information reported by each service node, maintains model state and version switching, and ensures that the model version is consistent across all nodes.
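The heartbeat-based consistency check can be sketched as follows: each node reports the model versions it serves, and the manager flags any model whose version differs across nodes. Field names and values are illustrative, not the platform's actual protocol:

```python
# Illustrative heartbeat payloads reported by prediction nodes.
heartbeats = [
    {"node": "predict-node-01", "model": "eta_gbdt", "version": "v3"},
    {"node": "predict-node-02", "model": "eta_gbdt", "version": "v3"},
    {"node": "predict-node-03", "model": "eta_gbdt", "version": "v2"},
]

def inconsistent_models(beats):
    """Return the set of models served at more than one version,
    i.e. candidates for a forced version switch or rollback."""
    versions = {}
    for hb in beats:
        versions.setdefault(hb["model"], set()).add(hb["version"])
    return {model for model, vs in versions.items() if len(vs) > 1}
```

Here `predict-node-03` still serves `v2`, so `eta_gbdt` would be flagged until the manager brings it in line with the other nodes.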

3.3 Offline feature platform

The online delivery business records large amounts of data every day across rider, merchant, user, and other dimensions. Offline features are produced from this data through ETL processing; algorithm engineers use these offline features to train models, and online models use them for prediction. The job of the offline feature platform is to bring the offline feature data stored in Hive tables to the online serving side, providing online services with the ability to access offline features, and supporting the high concurrency of the various delivery businesses and fast algorithm iteration.

The simplest solution is to store the offline features directly in a database, with the online service reading feature values straight from the DB. But DB reads are heavy-weight, and this solution clearly cannot handle high-concurrency Internet-scale scenarios; it was rejected outright.

The second solution stores each offline feature in Redis as an individual KV pair, with the online service reading feature values directly from Redis by key. This takes advantage of Redis as a high-performance in-memory KV store and at first glance seems to meet the business's needs, but in actual use it also ran into serious performance problems.

A typical business scenario: suppose we need model predictions for 20 delivery merchants, and each merchant needs 100 features; then the model computation requires 20 * 100 = 2,000 features, i.e. 2,000 KV reads. Fetching them one by one cannot meet the business side's performance requirements. Using Redis's batch MGET interface, fetching 100 KVs per call, requires 20 MGETs. One cached MGET has a TP99 of about 5 ms, so 20 MGETs approach a TP99 of 100 ms, which still cannot meet the business side's performance needs (the upstream service timeout is about 50 ms).

We therefore needed to optimize offline feature storage and retrieval. We proposed the concept of feature groups: features of the same dimension are aggregated into a single KV according to the feature-group structure, greatly reducing the number of keys; and we provide fairly complete management capabilities, supporting dynamic adjustment of feature groups (merging, splitting, and so on).
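The feature-group idea can be sketched in a few lines: pack all of one entity's features into a single value, so the 20-merchants-by-100-features scenario above costs 20 reads instead of 2,000. An in-memory dict stands in for Redis here, and the key naming and feature names are illustrative:

```python
import json

kv_store = {}  # stands in for Redis

def write_feature_group(merchant_id, features):
    """Pack all of one merchant's features into a single KV value."""
    kv_store[f"merchant_fg:{merchant_id}"] = json.dumps(features)

def read_feature_groups(merchant_ids):
    """One batched read (the MGET equivalent) returns every merchant's
    whole feature set: one key per merchant, not one per feature."""
    return {
        mid: json.loads(kv_store[f"merchant_fg:{mid}"])
        for mid in merchant_ids
    }

write_feature_group("m1001", {"avg_prep_time": 8.5, "order_cnt_7d": 320})
write_feature_group("m1002", {"avg_prep_time": 6.0, "order_cnt_7d": 150})
groups = read_feature_groups(["m1001", "m1002"])
```

The trade-off is that every read returns the whole group even if only some features are needed, which is why the platform also supports splitting and merging groups as access patterns change.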

3.4 Real-time feature platform

Compared with traditional delivery, instant delivery has very high real-time requirements: rider locations and loads, current road network conditions, and merchant situations such as meal preparation all change rapidly. For machine learning algorithms to take effect online immediately, we need to collect all kinds of online business data in real time, compute and refine it into the features algorithms need, and update those features in real time.

3.5 AB experiment platform

AB testing is not a new concept. Google engineers have used the method in their Internet products since 2000, and AB experiments have since become increasingly popular at home and abroad, an important hallmark of refined Internet product operations. Briefly, the AB experiment method for product optimization works as follows: before an official release, develop two (or more) variants toward the same goal, divide user traffic into corresponding groups while ensuring the groups have the same user profile, and show each group a different variant; then, based on real user feedback data, scientifically help the product make decisions.

Common AB experiments in the Internet field mostly split traffic on the C end, for example by random assignment or by hashing a user identifier (the registered user's UID, or a device identifier such as a mobile device's IMEI or a PC user's Cookie). Such schemes are widely used in search, recommendation, advertising, and other areas with personalized, "a thousand faces for a thousand people" characteristics. They are simple, assume requests are independent and identically distributed, and decisions across traffic splits are independent and do not interfere with one another. This kind of AB experiment works because C-end traffic is relatively large, samples are sufficient, and there is no mutual interference between users: as long as the split is sufficiently random, requests can essentially be guaranteed to be i.i.d.

In instant delivery, however, AB experiments revolve around users, merchants, and riders, and these three parties are no longer independent of one another: they interact with and constrain each other. In such a scenario, the existing traffic-splitting schemes would cause different strategies to interfere with each other, making it impossible to effectively evaluate the merits of each strategy.

For these problems, we divide delivery-side AB testing into three stages: AA grouping beforehand, AB traffic splitting during the experiment, and effect evaluation afterwards.

  • AA grouping: divide the candidate traffic into control and experimental groups according to pre-established rules, and use mathematical statistics to ensure the control and experimental groups show no significant difference on the business metrics of interest.

  • AB splitting: assign online requests in real time to the control or experimental version.

  • Effect evaluation: evaluate and compare the AB experiment results based on the data of the control and experimental groups.
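The AA-grouping step can be sketched with a standard balance check: before the real experiment, compute a two-sample statistic on the metric of interest and confirm the groups do not differ significantly. The Welch t-statistic and the 1.96 threshold (roughly 95% confidence for large samples) below are conventional choices, not necessarily the platform's actual test:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch two-sample t-statistic: difference of means scaled by the
    combined standard error, without assuming equal variances."""
    va, vb = variance(sample_a), variance(sample_b)
    na, nb = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(va / na + vb / nb)

def aa_groups_balanced(sample_a, sample_b, threshold=1.96):
    """A statistic near zero suggests the split is balanced on this metric."""
    return abs(welch_t(sample_a, sample_b)) < threshold

# Illustrative per-group metric samples (e.g. average delivery minutes).
a = [30.1, 29.8, 30.5, 30.0, 29.9, 30.2]
b = [30.0, 30.3, 29.7, 30.1, 30.2, 29.8]
```

Only once the candidate groups pass this check does the AB split proceed, so any post-experiment difference can be attributed to the strategy rather than to a lopsided split.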

Because instant delivery scenarios are rather special, for example when AB experiments are conducted by delivery region or city, the limited sample space makes it hard to find control and experimental groups with no difference. So we devised a time-slice rotation AB splitting method: it supports slicing by day, hour, or minute and rotating through multiple time slices, with different strategies alternating across different regions and different time slots, minimizing the impact of online confounding factors and ensuring the experiment is scientific and fair.
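The time-slice rotation above can be sketched as follows: within one region, the strategy alternates every time slice, so the region serves as its own control. The hourly slice length, per-region offset, and two-arm setup below are illustrative choices:

```python
from datetime import datetime

def region_offset(region_id):
    """Deterministic per-region offset so neighboring regions do not
    switch strategies in lockstep (an illustrative scheme)."""
    return sum(ord(c) for c in region_id)

def strategy_for(region_id, ts, strategies=("A", "B"), slice_hours=1):
    """Pick the strategy for a region at time `ts`: strategies rotate
    every `slice_hours`, so each region alternately runs both arms."""
    slice_index = int(ts.timestamp()) // (slice_hours * 3600)
    return strategies[(slice_index + region_offset(region_id)) % len(strategies)]
```

Effect evaluation then compares a region's metrics under strategy A slices against the same region's metrics under strategy B slices, sidestepping the cross-region differences that make region-level splits unbalanced.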

4. Summary and outlook

The Turing platform currently supports offline algorithm training, online prediction, and AB experiments for Meituan Delivery, Xiaoxiang, the LBS platform, and other BUs, letting algorithm engineers focus more on iterating and optimizing the algorithm strategy itself and significantly improving their efficiency. In the future we will continue to explore the following aspects in depth:

1) Strengthen deep learning support.

  • Strengthen deep learning infrastructure, fully support deep learning, make the deep learning components as complete as the machine learning ones, and allow any combination of components in the visual interface.

  • Support offline training of more commonly used deep learning models.

  • Support writing Python code directly to define custom deep learning models.

2) Build an online prediction platform, further decoupling algorithms and engineering.

  • Simplify the Turing platform SDK, separate out the main computation logic, and build the online prediction platform.

  • Have the online prediction platform dynamically load algorithm packages, decoupling the algorithm side, the business engineering side, and the Turing platform.

About the Author

Yan Wei, senior technical expert on the Meituan Delivery technical team.

Job Offers

If you want to experience the charm of the Turing platform up close, welcome to join us. The Meituan Delivery technical team is recruiting technical experts and architects in dispatch performance, LBS, machine learning platform, and algorithm engineering directions, to build the industry's largest real-time order dispatch platform and delivery network, take on the challenges of complex business and high-concurrency traffic, and usher in the era of fully intelligent delivery. Interested candidates can send a resume to: [email protected] (email subject: Meituan Delivery technical team).

Origin juejin.im/post/5e3d321ce51d4526e74fd765