Recommendation Algorithm Engineer Technology Stack Series: Recommender System Data and Evaluation Metrics

Basic conditions for putting a recommendation algorithm online

Before a new recommendation algorithm finally goes online, it needs to pass the following three experiments:
(1) First, it must prove offline that it is superior to the existing algorithms on multiple offline metrics;
(2) Then, a user survey (manual or internal evaluation) is needed to confirm that its user satisfaction is not lower than that of the existing algorithms;
(3) Finally, an online A/B test determines whether it beats the existing algorithms on the metrics we care about.

A/B testing

(1) The benefit of an A/B test is obvious: it gives a fair measure of how different algorithms actually perform online;
(2) Like user surveys, A/B tests need to account for random assignment: list as many factors correlated with the final metrics as possible. In short, traffic segmentation is the key to A/B testing (a minimal hash-split sketch follows this list);
(3) A significant drawback of A/B testing is the long experiment cycle; to obtain reliable results, an A/B test should therefore not be run for every algorithm, but only for algorithms that already performed well in offline tests and user surveys;
(4) A user tag library, if available, is a great help for online experiments.
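
As a concrete illustration of the traffic-segmentation point above, here is a minimal sketch of hash-based bucketing in Python; the experiment name, bucket count, and 50/50 split are made-up values for illustration, not something prescribed by this post.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    """Deterministically map a user to one of n_buckets for a given experiment.

    Salting the hash with the experiment name keeps different experiments'
    splits independent of each other (a simple form of layered hashing).
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

def assign_group(user_id: str, experiment: str, treatment_buckets: int = 50) -> str:
    """Send the first `treatment_buckets` (out of 100) buckets to treatment."""
    return "treatment" if assign_bucket(user_id, experiment) < treatment_buckets else "control"

# Example: a hypothetical "new_ranker" experiment with a 50/50 split.
print(assign_group("user_42", "new_ranker"))
```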

Feature list

Feature | Description
--- | ---
Traffic splitting | Supports splitting traffic for recall, targeting, and ranking, e.g. by tail-number hash or layered (stratified) hash
Whitelist | Before an experiment starts, users on a preset whitelist are assigned to the experiment group
Experiment management | Supports creating experiments, listing experiments, modifying experiments, etc.
Metric management | Definition and calculation of metrics
Performance charts | diff, AA diff, and the corresponding charts
Reliability analysis | Confidence intervals for the experiment groups (95% confidence interval) and p-values (a result is significant when p <= 0.05), used to judge whether an experiment can be trusted
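
To make the reliability-analysis row concrete, below is a minimal sketch of a two-proportion z-test comparing CTR between a control and a treatment group, together with a 95% confidence interval for the lift; the click and impression counts are invented for illustration.

```python
import math
from statistics import NormalDist

def ctr_ab_test(clicks_a: int, views_a: int, clicks_b: int, views_b: int):
    """Two-sided two-proportion z-test on CTR, plus a 95% CI for the difference."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    # Pooled proportion and standard error under the null hypothesis of no difference.
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval of the lift.
    se = math.sqrt(p_a * (1 - p_a) / views_a + p_b * (1 - p_b) / views_b)
    ci = (p_b - p_a - 1.96 * se, p_b - p_a + 1.96 * se)
    return p_value, ci

p_value, ci = ctr_ab_test(clicks_a=480, views_a=10_000, clicks_b=545, views_b=10_000)
print(f"p = {p_value:.4f}, 95% CI for CTR lift: {ci}")  # significant when p <= 0.05
```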

Data metrics

The online metrics here generally include click-through rate (CTR), interest rate, viewing duration, MAU/DAU, and so on;
offline, metrics such as coverage (feature/recall coverage), AUC, gAUC, and relevance & accuracy (manual evaluation) are generally used.

Coverage

Feature coverage: \( coverage = \frac{N_1}{N} \), where \( N_1 \) is the number of samples in which feature \( i \) is non-null and \( N \) is the total number of samples. The corresponding recall coverage takes \( N_1 \) as the number of users with recall results (or the number of items recalled) and \( N \) as the total number of users (or the total number of candidate items).
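
A minimal sketch of both readings of this formula; the sample dict layout and the "age" feature are assumptions for illustration.

```python
def feature_coverage(samples, feature):
    """Fraction of samples in which `feature` is non-null."""
    non_null = sum(1 for s in samples if s.get(feature) is not None)
    return non_null / len(samples)

def recall_coverage(recall_results, all_users):
    """Fraction of users for whom the recall stage returned at least one item."""
    covered = sum(1 for u in all_users if recall_results.get(u))
    return covered / len(all_users)

samples = [{"age": 23}, {"age": None}, {"age": 31}]
print(feature_coverage(samples, "age"))                          # 2/3
print(recall_coverage({"u1": ["i1"], "u2": []}, ["u1", "u2"]))   # 1/2
```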

Item coverage, as defined in "Recommender System Practice", also deserves a mention: within a recommender system it measures the ability to surface long-tail items (i.e. how dispersed item exposure is; the more concentrated the exposure, the more severe the head/Matthew effect):

\[ Coverage = \frac{|\cup_{u \in U} R(u)|}{|I|} \]
where \( U \) is the set of users, \( I \) is the set of items, and \( R(u) \) is the list of N items recommended to user \( u \).
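
A minimal sketch of this catalog-coverage formula, assuming recommendations are kept as a dict from user to recommended item list:

```python
def catalog_coverage(recommendations, all_items):
    """|union of items recommended to any user| / |item catalog|."""
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended) / len(all_items)

recs = {"u1": ["i1", "i2"], "u2": ["i2", "i3"]}
print(catalog_coverage(recs, all_items={"i1", "i2", "i3", "i4", "i5"}))  # 0.6
```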

A more scientific coverage metric for a recommender system is the Gini coefficient (Gini index), because it takes into account whether the number of times each item is recommended is evenly distributed: the larger the coefficient, the more uneven the distribution; the smaller the coefficient, the more even. The Gini coefficient was originally used to measure income inequality; the Lorenz curve and the Gini coefficient are described below:

In 1905 the statistician Lorenz proposed the Lorenz curve. Order the whole population by income from low to high and divide it into 10 groups of equal size (each 10% of the population), then compute each group's share of total income. Plotting the cumulative percentage of the population on the horizontal axis against the cumulative percentage of income on the vertical axis yields a curve that reflects the income-distribution gap: this is the Lorenz curve.

\[ Gini = \frac{1}{n-1} \sum^n_{j=1} (2j-n-1) p(j) \]
where items are sorted in ascending order of how often they are recommended, and \( p(j) \) is the j-th item's share of all recommendations, i.e. \( p(j) = \frac{\text{number of recommendations of item } j}{\sum^n_{j=1} \text{number of recommendations of item } j} \).
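
A minimal sketch of this Gini formula applied to per-item recommendation counts (the counts themselves are made up):

```python
def gini_index(recommend_counts):
    """Gini coefficient of how recommendations are spread across items.

    recommend_counts[j] is how many times item j was recommended; items are
    sorted in ascending order of popularity and p(j) is item j's share of
    all recommendations, as in the formula above.
    """
    counts = sorted(recommend_counts)
    n, total = len(counts), sum(counts)
    return sum((2 * j - n - 1) * c / total for j, c in enumerate(counts, start=1)) / (n - 1)

print(gini_index([25, 25, 25, 25]))   # 0.0  -> exposure is perfectly even
print(gini_index([0, 0, 0, 100]))     # 1.0  -> all exposure goes to one item
```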

AUC and gAUC

AUC (Area Under the Curve) is a commonly used evaluation metric for binary classification in machine learning. Its direct meaning is the area under the ROC curve; an equivalent interpretation is that, if we randomly draw one positive sample and one negative sample, AUC is the probability that the trained classifier assigns the positive sample a higher predicted score than the negative sample.

ROC curve
The horizontal axis of the ROC curve is the false positive rate (False Positive Rate, FPR); the vertical axis is the true positive rate (True Positive Rate, TPR).
\[ TPR = \frac{TP}{P}; \quad FPR = \frac{FP}{N} \]
where \( TP \) is the number of truly positive samples predicted as positive, \( FP \) is the number of truly negative samples predicted as positive, \( P \) is the total number of truly positive samples, and \( N \) is the total number of truly negative samples.
\[ AUC = \frac{\sum_{i \in positiveClass} rank_i - \frac{M(1+M)}{2}}{M \times N} \]
where \( rank_i \) is the rank of sample \( i \) when all samples are sorted by predicted probability in ascending order (the sample with the highest probability has rank \( M+N \), the second highest \( M+N-1 \), and so on); \( M \) is the number of positive samples and \( N \) is the number of negative samples.
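
A minimal sketch of this rank-based AUC formula (labels and scores are made up, and ties are ignored for simplicity):

```python
def auc_from_ranks(labels, scores):
    """AUC via (sum of positive ranks - M(M+1)/2) / (M * N), ignoring ties."""
    # Rank 1 = lowest predicted probability, rank M+N = highest.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    m = sum(labels)              # number of positive samples
    n = len(labels) - m          # number of negative samples
    pos_rank_sum = sum(rank[i] for i, y in enumerate(labels) if y == 1)
    return (pos_rank_sum - m * (m + 1) / 2) / (m * n)

labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
print(auc_from_ranks(labels, scores))  # 1.0: every positive outscores every negative
```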

AUC reflects ranking ability over the whole sample set, but recommendation results are personalized per user, and what we care about more is the ability to rank different items for the same user. gAUC (group AUC) computes the AUC for each user separately and then takes a weighted average, which removes the (meaningless) effect of comparing ranking results across different users.

\[ gAUC = \frac{\sum_{(u,p)} w_{(u,p)} * AUC_{(u,p)}}{\sum_{(u,p)} w_{(u,p)}} \]

In practice, the weight is usually set to each user's number of impressions or clicks, and when computing gAUC, users whose samples are all positive or all negative are generally filtered out.
AUC is still usually the primary metric, but when AUC fails to reflect model quality well (for example, AUC improves a lot while the actual online effect gets worse), it is worth looking at gAUC.
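
A minimal sketch of gAUC along these lines: per-user AUC computed pairwise, weighted by the user's number of impressions, with all-positive / all-negative users filtered out. The flat (user_id, label, score) layout is an assumption for illustration.

```python
from collections import defaultdict
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def group_auc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC (gAUC)."""
    per_user = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        per_user[u][0].append(y)
        per_user[u][1].append(s)

    weighted_sum, weight_total = 0.0, 0
    for ys, ss in per_user.values():
        if len(set(ys)) < 2:      # skip users whose samples are all positive or all negative
            continue
        w = len(ys)               # weight = this user's number of impressions
        weighted_sum += w * pairwise_auc(ys, ss)
        weight_total += w
    return weighted_sum / weight_total

users  = ["u1", "u1", "u1", "u2", "u2", "u3"]
labels = [1, 0, 0, 0, 1, 1]
scores = [0.8, 0.3, 0.5, 0.6, 0.7, 0.9]
print(group_auc(users, labels, scores))  # u3 has only positive samples and is skipped
```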

Displaying metrics

This is generally supported by data-platform products; alternatively, one can simply write SQL to pull the numbers. We will not go into detail here.

Monitoring metrics

This generally relies on the data platform's monitoring and alerting; the business side can also write its own monitoring scripts. We will not go into detail here.

Manual evaluation

Data metrics can diverge from the real user experience; collecting feedback on bad cases is a very effective way to optimize a recommender system.

Source: www.cnblogs.com/arachis/p/REC_DEV_data_index.html