How Mafengwo's Ranking Algorithm Platform Enables Fast Model Iteration in Recommendations

(Original article by Mafengwo Technology, WeChat ID: mfwtech)

Part.1 The Mafengwo Recommendation System

The Mafengwo recommendation system consists mainly of recall (Match), ranking (Rank), and reranking (Rerank) stages; the overall architecture is shown below:

In the recall stage, a candidate set matching the user's preferences (on the order of hundreds to thousands of items) is filtered out of a massive content library. In the ranking stage, each item in the candidate set is then scored precisely against a specific optimization goal (such as click-through rate), so that a small set of high-quality content the user is most interested in can be selected from the hundreds or thousands of candidates.

In this article we focus on one of the core components of the Mafengwo recommendation system, the ranking algorithm platform: what its overall architecture looks like, and what role it plays in supporting fast, efficient model iteration so that users are presented with more accurate recommendations.

Part.2 Evolution of the Ranking Algorithm Platform

2.1 Overall Architecture

Currently, the online ranking model of the Mafengwo ranking algorithm platform is built from three modules: a general data processing module, a model production module, and a monitoring and analysis module. The structure of each module and the overall platform workflow are shown below:

2.1.1 Module Functions

(1) General data processing module

Its core function is to construct features and training samples, and it is the most fundamental and critical part of the entire ranking pipeline. Its data sources include impression and click logs, user profiles, content profiles, and the like; the underlying data processing relies on Spark for offline batch processing and Flink for real-time streaming.

(2) Model production module

Primarily responsible for constructing the training set, training the model, generating the online configuration, and seamlessly synchronizing the model online.

(3) monitoring and analysis module

It mainly covers monitoring of upstream data dependencies, monitoring of the recommendation pool, feature monitoring and analysis, and visual analysis of models.

The functions of each module, and the interactions between them, are wired together through JSON configuration files. Training a model and bringing it online only requires modifying the configuration, which greatly improves development efficiency and lays a solid foundation for fast iteration of the ranking algorithm.

2.1.2 Main Configuration File Types

The configuration files fall into four categories: TrainConfig, MergeConfig, OnlineConfig, and CtrConfig. Their roles are:

(1)TrainConfig

The training configuration, comprising the training set configuration and the model configuration:

  • The training set configuration specifies which features are used for training, which time period of training data is used, and the scene, page, channel, and so on;

  • The model configuration includes the model parameters, the training set path, the test set path, the model save path, and so on.
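
To make the configuration-driven workflow concrete, here is a minimal sketch of what a TrainConfig might contain; every field name below is hypothetical, not the platform's actual schema:

```python
import json

# Hypothetical TrainConfig sketch -- field names are illustrative only.
train_config = {
    "features": ["user_ctr_7d", "article_ctr_7d", "ui_cosine_70"],  # features used for training
    "date_range": {"start": "2019-10-01", "end": "2019-10-07"},     # training data period
    "scene": "homepage",
    "page": "feed",
    "channel": "recommend",
    "model": {
        "params": {"max_depth": 6, "eta": 0.1},  # XGBoost hyperparameters
        "train_path": "/data/rank/train/",
        "test_path": "/data/rank/test/",
        "save_path": "/models/rank/xgb/",
    },
}
print(json.dumps(train_config, indent=2))
```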

(2)MergeConfig

The feature configuration, covering context features, user features, article features, and cross features.

Here, the computation of cross features is also implemented through configuration. For example, some user features are embedding vectors, and some content features are embedding vectors as well. When we want to feed the cosine similarity or the Euclidean distance of two such vectors to the model as a cross feature, the selection and computation of that cross feature can be expressed directly in configuration, and the configuration is synchronized online for serving.
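
A minimal sketch of such a configuration-driven cross feature, assuming the user and item vectors are plain NumPy arrays (the function and config names are illustrative):

```python
import numpy as np

def cross_feature(user_vec: np.ndarray, item_vec: np.ndarray, op: str) -> float:
    """Compute a configured cross feature between a user and an item vector."""
    if op == "cosine":
        denom = np.linalg.norm(user_vec) * np.linalg.norm(item_vec)
        return float(user_vec @ item_vec / denom) if denom else 0.0
    if op == "euclidean":
        return float(np.linalg.norm(user_vec - item_vec))
    raise ValueError(f"unknown cross-feature op: {op}")

# `op` would come from MergeConfig, e.g. {"name": "ui_cosine", "op": "cosine"},
# so the same definition drives both offline training and online serving.
u = np.array([0.1, 0.7, 0.2])
v = np.array([0.2, 0.6, 0.1])
print(cross_feature(u, v, "cosine"))
```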

(3)OnlineConfig

The online configuration, generated automatically during construction of the training data, for use online. It includes the feature configuration (context features, user features, content features, cross features), the model release path, and the model's features.

(4)CtrConfig

The CTR configuration, used to apply smoothing to user and content CTR features.
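
The article does not spell out the smoothing formula; a common choice, shown here purely as an assumption, is additive (Bayesian) smoothing with prior pseudo-counts:

```python
def smoothed_ctr(clicks: int, impressions: int,
                 alpha: float = 5.0, beta: float = 95.0) -> float:
    """CTR with additive (Bayesian) smoothing.

    alpha and beta act as prior pseudo clicks / pseudo non-clicks,
    pulling low-exposure items toward the prior CTR alpha / (alpha + beta).
    """
    return (clicks + alpha) / (impressions + alpha + beta)

print(smoothed_ctr(1, 2))        # sparse item: ~0.059, stays near the prior 0.05
print(smoothed_ctr(500, 10000))  # well-exposed item: ~0.05, near its empirical CTR
```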

2.1.3 Feature Engineering

From an application perspective, there are three main feature categories: user features (User Feature), content features (Article Feature), and context features (Context Feature).

By the manner in which they are obtained, features can be divided into:

  • Statistical features (Statistics Feature): click counts / impression counts / CTR of users, content, and traffic slots over specific time windows, etc.;

  • Embedding features (Embedding Feature): trained with Word2Vec on label information such as destinations, or on users' historical click behavior, etc. (see the sketch after this list);

  • Cross features (Cross Feature): user vectors and item vectors are constructed from label or destination vectors, from which user-article similarity features and the like are derived.
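
As a minimal sketch of how such embedding features can be trained, assuming gensim's Word2Vec and treating each user's click history as a "sentence" of item IDs (the data below is hypothetical):

```python
from gensim.models import Word2Vec

# Each user's click history as one "sentence" of item IDs (hypothetical data).
click_sessions = [
    ["poi_12", "poi_7", "poi_33"],
    ["poi_7", "poi_33", "poi_5"],
    ["poi_12", "poi_5", "poi_7"],
]

# Train item embeddings; a user vector can then be built, e.g., by averaging
# the vectors of the items the user has clicked.
model = Word2Vec(click_sessions, vector_size=32, window=5, min_count=1, epochs=10)
print(model.wv["poi_7"].shape)  # (32,)
```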

2.2 Ranking Algorithm Platform V1

In the V1 stage of the ranking algorithm platform, simple JSON configuration was enough for the platform to handle feature selection, training set selection, per-scene XGBoost model training, offline AUC evaluation of the XGBoost models, generation of the online configuration, and automatic synchronization of features online.
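
A minimal sketch of what the V1 train-and-evaluate step could look like with the standard xgboost and scikit-learn APIs (the data and paths below are illustrative stand-ins for the configured training and test sets):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Illustrative random data standing in for the configured train/test sets.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)
X_test, y_test = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1, "eval_metric": "auc"}
model = xgb.train(params, dtrain, num_boost_round=100)

# Offline AUC evaluation, as the platform performs per scene.
print("offline AUC:", roc_auc_score(y_test, model.predict(dtest)))
model.save_model("xgb_scene_homepage.model")  # save path is illustrative
```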

2.3 Ranking Algorithm Platform V2

To address the problems that surfaced in the V1 stage, we added data validation and model explanation capabilities to the monitoring and analysis module of the ranking algorithm platform, providing a more scientific and accurate basis for continued model iteration and optimization.

2.3.1 Data Validation (DataVerification)

In the V1 stage of the platform, when a model's offline results (AUC) looked good but its online results fell short of expectations, it was difficult to locate and troubleshoot the problem, which hindered model iteration.

Through investigation and analysis, we found that a very important cause of the gap between offline and online performance was that the model's training set was built from click and impression tables aggregated daily by position. Due to data reporting delays and other reasons, some contextual features in the offline click/impression tables could differ from the real-time click and impression behavior, producing inconsistencies between offline and online features.

In view of this, we added a data validation capability: the features printed in the real-time online logs are compared, dimension by dimension, against the training set built offline.

Concretely, on top of the real-time click/impression logs (which contain the model used, the features used for prediction, and the predicted score), we attach a unique ID to each click/impression record in real time, and we keep this unique ID in the offline aggregated click/impression table. In this way, for any click/impression record, we can associate the features in the offline training set with the features actually used online, and compare online versus offline model AUC, predictions, and features, thereby surfacing problems.
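
A minimal sketch of that comparison, assuming the offline training set and the online logs have been loaded as pandas DataFrames keyed by the unique ID (all column names are hypothetical):

```python
import pandas as pd

# Offline training-set features vs. features logged online, keyed by the unique ID.
offline = pd.DataFrame({"uid": ["a1", "a2", "a3"],
                        "article_ctr_7d": [0.050, 0.031, 0.020],
                        "score": [0.095, 0.073, 0.066]})
online = pd.DataFrame({"uid": ["a1", "a2", "a3"],
                       "article_ctr_7d": [0.050, 0.028, 0.020],
                       "score": [0.095, 0.081, 0.066]})

joined = offline.merge(online, on="uid", suffixes=("_off", "_on"))

# Flag records whose features or predicted scores diverge beyond a tolerance;
# these are the samples to inspect first.
bad = ((joined["article_ctr_7d_off"] - joined["article_ctr_7d_on"]).abs() > 1e-6) | \
      ((joined["score_off"] - joined["score_on"]).abs() > 1e-6)
print(joined[bad])
```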

For example, in a previous model iteration, the offline AUC was high but the online effect did not improve. Through data validation, we first compared online and offline model AUC and confirmed they were inconsistent; we then compared the online and offline predicted scores, found the TopK samples with the largest online/offline prediction differences, and analyzed their offline and online features. In the end we found two causes: some contextual features were inconsistent between offline and online due to data reporting delays, and the missing-value parameter chosen when constructing the XGBoost DMatrix differed between offline and online. Both led to deviations between online and offline model predictions. After fixing these issues, online UV CTR rose 16.79% and PV CTR rose 19.10%.
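
The missing-value pitfall is worth making concrete: in the xgboost API, the DMatrix `missing` parameter declares which sentinel value should be treated as absent, so if offline training and online serving choose different sentinels, the same record is scored differently. A sketch under that assumption:

```python
import numpy as np
import xgboost as xgb

X = np.array([[0.05, -999.0],   # -999.0 as the offline missing-value sentinel
              [0.02, 0.9]])

# Offline: -999.0 is declared missing, so XGBoost routes it down the
# learned default branch at each split.
d_offline = xgb.DMatrix(X, missing=-999.0)

# Online (the bug): `missing` defaults to NaN, so -999.0 is treated as a real
# feature value, and the two sides can score the same record differently.
d_online = xgb.DMatrix(X)

# The fix: read the sentinel from one shared config on both sides.
```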

With the data validation capability and the accompanying resolution strategies, we can quickly locate the root cause of problems, accelerating algorithm iteration and improving online application results.

2.3.2 Model Explanation (ModelExplain)

Model explanation opens the black box of machine learning models: it increases our confidence in the model's decisions, helps us understand how the model decides, and provides inspiration for improving the model. For the underlying concepts we recommend two papers: ""Why Should I Trust You?": Explaining the Predictions of Any Classifier" and "A Unified Approach to Interpreting Model Predictions".

In practice, we are always trading off model accuracy against model interpretability. Simple models are easy to interpret but less accurate; complex models improve accuracy at the expense of interpretability. Using a simple model to explain a complex one is one of the core approaches of current model explanation methods.

Currently, our online ranking model is XGBoost. For XGBoost, the conventional explanation method is based on feature importance, which only provides a global measure of each feature's importance; it does not support local explanation of the model's output, i.e., explaining the prediction on a single sample. Against this background, our model explanation module adopts the newer explanation methods SHAP and LIME, which not only support feature importance but also support local explanation, so that for a single sample we can see to what degree each feature value pushes the model's output positively or negatively.
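
A minimal sketch of how SHAP is typically applied to an XGBoost model via the `shap` library's TreeExplainer (the training data here is an illustrative stand-in):

```python
import numpy as np
import shap
import xgboost as xgb

# Illustrative stand-in for the platform's training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, 500)
model = xgb.train({"objective": "binary:logistic"}, xgb.DMatrix(X, label=y), 50)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one additive contribution per feature per sample

# Local explanation of a single sample: which feature values push the
# (margin-space) output up or down, and by how much.
print(explainer.expected_value)  # base value: the model's expected margin output
print(shap_values[0])            # per-feature contributions for sample 0
```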

Below, we introduce the core functions of model explanation through a simplified example from an actual scenario. First, the meanings of the several features involved:

Our model explanation gives the following single-sample analyses:

  • U0-I1

  • U0-I2

  • U0-I3

As shown, the single-sample predictions for U0-I1, U0-I2, and U0-I3 are 0.094930, 0.073473, and 0.066176 respectively. How much positive or negative effect each feature value has on a single sample's prediction can be read from the length of the feature's bar in the figure: red represents a positive effect, blue a negative effect. These magnitudes are determined by the shap_value column in the table:

Here, logit_output_value = 1.0 / (1 + np.exp(-margin_output_value)) and logit_base_value = 1.0 / (1 + np.exp(-margin_base_value)), where margin_output_value is the XGBoost model's raw (margin) output; margin_base_value is the model's expected output, approximately equal to the mean of the model's predictions over the entire training set; and shap_value measures the positive or negative role a feature plays in the prediction.
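
These quantities are tied together by SHAP's additivity property: the base value plus the per-feature shap values reproduces the model's margin output, which the sigmoid then maps to the displayed logit value. A small self-contained sketch with illustrative numbers:

```python
import numpy as np

def sigmoid(m: float) -> float:
    return 1.0 / (1.0 + np.exp(-m))

# Illustrative numbers: a base margin and three per-feature shap values.
margin_base_value = -2.5
shap_values = np.array([0.062029, 0.188769, -0.05])

# SHAP additivity: margin output = base value + sum of shap values.
margin_output_value = margin_base_value + shap_values.sum()

print(sigmoid(margin_base_value))    # logit_base_value, ~0.076
print(sigmoid(margin_output_value))  # logit_output_value, ~0.091
```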

The predicted logit_output_values satisfy 0.094930 > 0.073473 > 0.066176, so the ranking result is I1 > I2 > I3. For U0-I1, with predicted value 0.094930, the feature doubleFlow_article_ctr_7_v1 = I1_ctr contributed a positive effect of 0.062029, pushing the predicted value up relative to the base value. Similarly, ui_cosine_70 = 0.894006 contributed a positive effect of 0.188769.

Intuitively, we can see that the higher the content's 7-day click-through rate and the higher the user-content similarity, the higher the predicted value, which is in line with expectations. In the actual scenario, we have many more features.

The core strength of SHAP-based model explanation is its support for local, single-sample analysis; of course, it also supports global analysis, such as feature importance, the positive/negative role of features, and feature interactions. Below is an analysis of the feature doubleFlow_article_ctr_7_v1: when the content's 7-day CTR is below a threshold it plays a negative role in the model's prediction, and above the threshold a positive role.
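
These global views are what the `shap` library's summary and dependence plots produce; a self-contained sketch in which column 0 plays the role of doubleFlow_article_ctr_7_v1 (the data is an illustrative stand-in):

```python
import numpy as np
import shap
import xgboost as xgb

# Illustrative data: column 0 mimics a 7-day CTR feature with a threshold effect.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 0.2, size=(500, 3))
y = (X[:, 0] + 0.05 * rng.normal(size=500) > 0.1).astype(int)
model = xgb.train({"objective": "binary:logistic"}, xgb.DMatrix(X, label=y), 50)

shap_values = shap.TreeExplainer(model).shap_values(X)

# Global feature importance (mean |shap value|) and the per-feature dependence
# curve, which reveals the threshold where the CTR feature's effect flips sign.
shap.summary_plot(shap_values, X, plot_type="bar", show=False)
shap.dependence_plot(0, shap_values, X, show=False)
```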

Part.3 Near-Term Plans

In the near term, the ranking algorithm platform will continue to improve the online performance of its trained models, prioritizing real-time features so that the models reflect rapid changes in online behavior.

The advantage of the current XGBoost model is that the ranking algorithm platform needs relatively little feature engineering, e.g., for handling missing feature values, discretizing continuous features, and constructing cross features. But it also has clear shortcomings, including:

  1. It has difficulty handling high-dimensional sparse features;

  2. The complete training data set must be loaded into memory, and the algorithm does not support online learning, making real-time model updates difficult.

To solve these problems, we will later build Wide & Deep, DeepFM, and other deep models, as shown below:

Furthermore, the current model predicts a score for each Item individually and then sorts the scores to produce one refresh of results (Learning to Rank, pointwise). We hope later to recommend an entire refresh of results to the user as a whole (Learning to Rank, listwise), giving users more real-time, accurate recommendations.

Authors: Xia and Wang Lei, R&D engineers on Mafengwo's recommendation algorithm platform.


Origin: juejin.im/post/5dc8fc00f265da4d0d0dd4de