How to provide a credible AB testing solution

This article describes how to build a reliable AB testing solution, drawing on our concrete practice in the fulfillment scenario. On the one hand, from the perspective of experimental methods, it discusses the statistical traps that are easily overlooked during experimentation and gives concrete solutions; on the other hand, from the perspective of platform construction, it explains why the platform should output complete experimental solutions rather than leave the choice of method entirely to the user: a small difference in experimental method can send the results a thousand miles apart. We hope it brings you some help or inspiration.

  • 1 Background

  • 2 Understanding AB testing

    • 2.1 Overview of AB testing

    • 2.2 Key issues of AB testing

    • 2.3 Platform construction based on a set of core abstractions is difficult to adapt to all business scenarios

  • 3 How we conduct AB testing in fulfillment

    • 3.1 AB testing difficulties faced in the multilateral business model

    • 3.2 Organization and process of AB testing

    • 3.3 Introduction to AB Testing Platform

  • 4 Summary and Outlook

1 Background

While the statistical foundations of AB testing (AB experiments) are centuries old, building a correct and reliable AB testing platform at scale remains a formidable challenge. It means dealing with the double challenge of spillover effects and small samples; balancing experimental bias and variance to determine an appropriate experimental unit, grouping method, and analysis method, and producing a sound experimental design; and handling statistical pitfalls such as variance calculation, P-value calculation, multiple comparisons, confounding factors, and false negatives (the strategy actually has an effect, but the test shows none). Obtaining high-quality results therefore requires an expert understanding of experiments and statistics, which raises the bar for experimentation and makes it hard to guarantee that anyone who runs an experiment can draw credible conclusions.

This article introduces, from the two perspectives of experimental methods and platform construction, how to use statistical methods correctly to avoid statistical traps, and what platform capabilities we provide, so that anyone using the platform can draw credible conclusions. Along the way we have also accumulated experience on how to run better experiments and how to use experiments to make better decisions. We hope this is helpful to colleagues engaged in related work, and we sincerely welcome feedback and suggestions so we can keep improving.

2 Understanding AB testing

Which online option is better? We often need to make this choice. When we want to decide between two strategies, the ideal approach would be for the same group of users to experience the original strategy A in parallel universe 1 and the new strategy B in parallel universe 2, and then compare the observed facts to determine which strategy wins. In the real world, however, there are no two parallel universes: for the same user we can only observe the effect of accepting either strategy A or strategy B. That is, the counterfactual outcome is unobservable.

Therefore, in the real world we usually take an experimental approach to decision-making. We assign users to different groups; users within the same group receive the same strategy during the experiment, and users in different groups receive different strategies. Meanwhile, the logging system tags users according to the experiment system to record their behavior; we then compute metric differences from the tagged logs and perform statistical analysis to rule out differences caused by noise. Experimenters use these metrics to understand and analyze how the different strategies affect users and whether the experiment's prior hypotheses hold.

Figure 1 Ideal and Realistic Policy Evaluation

2.1 Overview of AB testing

As argued above, since we cannot simultaneously observe the two potential outcomes of the same group under different strategies, we cannot directly decide which strategy wins. We need to construct a counterfactual to represent the potential outcome, under strategy A, of the group that actually receives strategy B.

Specifically, we construct a control group whose feature means do not differ from those of the experimental group, and use its observed outcome to stand in for the potential outcome of the experimental group under strategy A. The mean difference between the two outcomes is then the size of the strategy effect. Since the conclusion is drawn from observed sample data, it must pass a significance test to show that it is statistically meaningful. This is the complete path of strategy evaluation.

According to whether the assignment of strategies can be controlled before the experiment, we divide experiments into AB experiments and observational studies (Observational Studies), and AB experiments further into randomized experiments (Randomized Experiments) and quasi-experiments (Quasi Experiments). Different experiment types use different grouping methods, which to some extent shape the form of the post-experiment analysis data. Choosing an analysis method that matches the experiment type is particularly important, as it directly determines whether we can draw statistically sound conclusions. The specific classification is as follows:

Figure 2 Three types of experiments under contract fulfillment business

In most experimental scenarios, we can control, before the experiment, which strategies are assigned to which experimental subjects. In some scenarios, however, we cannot. For example: ① when testing the impact of an online concert event on a short-video platform, fairness to users requires that the event strategy be applied to all users; ② when testing the impact of different marketing-email strategies, we cannot control which users will ultimately accept the strategy. When we can control neither the assignment of strategies nor whether the strategy takes effect in the target population, we can only use observational studies: observe and record the characteristics of the research subjects in their natural state, and describe and analyze the results.

In scenarios where we can control the strategy imposed on experimental subjects, such as ① testing the impact of different product UIs on users and then deciding which UI to adopt, or ② quickly verifying the impact of home-page product image material on conversion rate, there are not only large numbers of users but also no interaction between the behavior of users in the experimental and control groups, so homogeneous and independent experimental and control groups can be obtained through random grouping. This type of experiment is called a randomized controlled trial and is the industry's gold standard for measuring a strategy's effect.

However, in Meituan's fulfillment business scenarios, such as dispatch scheduling, we need to test the impact of different scheduling strategies on user experience in a region. Regions differ greatly from one another (in riders, merchants, and consumers), so it is difficult to obtain homogeneous experimental and control groups by random grouping. Moreover, since delivery capacity can be shared between regions, experimental and control groups running different strategies affect each other, violating the requirement that experimental units be independent. In this scenario we cannot assign experimental subjects randomly; we can only assign the experimental and control groups selectively. An experiment in which strategy assignment can be controlled but cannot be randomized is called a quasi-experiment; a commonly used quasi-experimental method is difference-in-differences.

Randomized controlled experiments are the industry's gold standard for measuring strategy effects because they ensure that the feature means of the experimental and control groups are equal, so group differences do not interfere with measuring the true effect. To measure strategy effects in business scenarios that do not satisfy the constraints of a randomized controlled experiment, we adopt quasi-experimental methods: eliminate differences in observable features between the experimental and control groups by improving the grouping method, or keep those differences constant, and in the analysis stage use analysis methods adapted to quasi-experimental scenarios.

If, due to scenario constraints, conclusions can only be drawn from data collected after the fact, then only methods suited to observational studies can be used. Although quasi-experiments and observational studies are not the gold standard for measuring strategy effects, when used properly they can also yield relatively scientific and credible conclusions. In academia, the confidence levels of the three experiment types are as follows:

Figure 3 Confidence levels used to assess the quality of AB testing

2.2 Key issues of AB testing

Whatever its type, an AB experiment follows the basic process of traffic splitting -> experiment -> data analysis -> decision-making, and must satisfy the three basic elements of an AB experiment. Traffic splitting is the top-level design of the experimental platform: it regulates and constrains how different experimenters can independently run their own experiments on the platform without interfering with each other. Running an experiment looks simple, but the prerequisite for successfully running each type of experiment is that the experimental scenario satisfies its theoretical assumptions.

An AB experiment infers the behavior of the whole population from an observed sample; its conclusions are inferential, and the data analysis involves a great deal of statistical theory. One careless step and it is easy to fall into a statistical trap; an error anywhere in the above process can lead to a wrong conclusion. So while it is easy to compute a number in an AB experiment, it is not easy to obtain a reliable, credible statistical conclusion.

Figure 4 Key elements of building a credible AB test

2.2.1 Traffic-splitting framework for AB testing

On the fulfillment technology platform, we use experiments to measure real user responses and determine the effects of new product features. Being unable to run multiple experiments in parallel would greatly slow down iteration, so scaling up the number of simultaneously running experiments is essential. To increase experimental parallelism and allow multiple mutually exclusive experiments to run at the same time, two traffic-splitting frameworks have emerged in the industry: companies with one-sided business models, such as Google, Microsoft, and Facebook, adopt an overlapping framework based on nested layers and domains; companies with multi-sided business models, such as Uber and DoorDash, adopt a constraint-based framework. See the figure below:

Figure 5 Two popular traffic-splitting frameworks in the industry

Overlapping framework based on nested layers and domains: the characteristic of this framework is that traffic is randomly scattered in advance into numbered buckets, and the purpose of each bucket is planned in advance. As shown above, national traffic is divided into 10 equal parts identified by bucket numbers 1-10; traffic in buckets 1-6 is used for short-term strategy verification, and buckets 7-10 for long-term verification. To support running multiple mutually exclusive experiments at the same time and improve iteration efficiency, orthogonal buckets and mutually exclusive buckets are further distinguished within buckets 1-6 and buckets 7-10. Traffic falling in orthogonal buckets can enter multiple experiments simultaneously; before entering each experiment it is re-scattered so that residual effects of one experiment do not contaminate the next, enabling multiple mutually exclusive experiments to run in parallel. Traffic falling in a mutually exclusive bucket can enter only one experiment at a time and is used for experiments that do not satisfy the random-splitting conditions. A set of bucket-numbered traffic reserved for a specific purpose is called a domain; the different types of experiments that the same traffic enters are called layers.
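
To make the mechanism concrete, here is a minimal sketch of salted-hash bucketing, the core of the layer/domain framework. The salts, bucket counts, and domain plan are hypothetical illustrations, not any platform's actual implementation:

```python
import hashlib

def bucket(unit_id: str, salt: str, n_buckets: int = 10) -> int:
    """Deterministically map an experimental unit into a bucket.
    Hashing (salt + unit_id) re-scatters the same traffic independently
    per layer, which is what makes layers orthogonal to each other."""
    digest = hashlib.md5(f"{salt}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

user_id = "u-42"

# Hypothetical domain plan (cf. Figure 5): buckets 0-5 for short-term
# verification, buckets 6-9 for long-term verification.
domain = "short-term" if bucket(user_id, "domain-v1") <= 5 else "long-term"

# Within the orthogonal sub-domain, each layer re-hashes with its own salt,
# so assignment in the "ui" layer is independent of the "ranking" layer.
in_ui_treatment = bucket(user_id, "layer-ui") < 5
in_ranking_treatment = bucket(user_id, "layer-ranking") < 5
```

The key design point is that the same user lands in a fixed bucket within a layer (consistency) while the per-layer salt decorrelates assignments across layers (orthogonality).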

The advantage of this framework is that it reuses traffic to expand experimental parallelism while easily avoiding the bad user experience that potentially interacting experiments could cause: by introducing the concept of layers, system parameters are divided into multiple layers, and experiments whose combination could harm user experience are constrained to sit in the same layer, where they access traffic mutually exclusively.

The disadvantages: first, a major premise of this framework is that traffic is scattered in advance, which is acceptable in a one-sided scenario with heavy traffic but hard to make work in a multi-sided scenario with light traffic. In the multi-sided scenario, spillover effects mean we cannot split directly on one-sided entities; instead, clustering is used to aggregate interacting multi-sided entities into a large entity, and splitting is done on these large entities. Given the limited number of such entities, scattering them uniformly in advance is difficult. Second, domains fix traffic usage in advance; this up-front isolation reduces traffic utilization and cannot meet power requirements under light traffic. For example, even when the mutually exclusive domain holds no experiment, its traffic cannot be used to run other orthogonal experiments. Third, a framework that pre-plans traffic usage is not flexible: if the domain setup later proves unreasonable, changing the domain configuration comes at a high price.

Framework based on conflict detection: the characteristic of this framework is that experimenters declare constraints, and the platform uses those constraints to ensure that experiments whose potential interactions cannot be avoided are never exposed to users at the same time. At companies such as Microsoft and Uber, the experimentation platforms have integrated automated interaction-detection systems to avoid potential interaction effects between experiments. Taking Uber as an example, a strategy is treated as a set of independent parameters; the parameters dedicated to a strategy and those shared with other strategies are declared in advance. When an experiment is configured, the platform detects whether two experiments affecting the same parameters overlap on traffic; experiment create or update operations are allowed only if they do not.

The advantage of this framework is flexibility and maximal traffic reuse. Unlike the overlapping framework, it is not restricted to testing within pre-divided domains while traffic in other domains sits idle: as long as the conditions for parallel experiments are met, any traffic can be circled for an experiment. The disadvantage is that the platform must build the capability to detect interactions automatically.
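
As an illustration of the Uber-style scheme described above, here is a minimal sketch of parameter-overlap conflict detection; the experiment records, parameter names, and region sets are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    params: set      # parameters the strategy reads or writes
    regions: set     # traffic this experiment circles

def conflicts(a: Experiment, b: Experiment) -> bool:
    # Two experiments conflict only if they touch a common parameter AND
    # could be exposed to the same traffic at the same time.
    return bool(a.params & b.params) and bool(a.regions & b.regions)

running = [Experiment("eta-model-v2", {"eta.model"}, {"beijing", "tianjin"})]
candidate = Experiment("eta-model-v3", {"eta.model"}, {"tianjin"})

# Platform-side gate before allowing a create/update operation:
blocked = any(conflicts(candidate, e) for e in running)
print(blocked)  # True: both experiments touch eta.model on tianjin traffic
```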

2.2.2 Basic elements of AB testing

When running an AB experiment, three basic elements must be satisfied: ① the experimental and control groups receiving different strategies must be comparable, that is, their feature means are equal before the experiment, or differ by a fixed amount, so that after the experiment we can attribute differences to the strategies; ② the strategies must not interfere with each other and the groups must be independent, that is, when comparing strategy A and strategy B, the behavior of users receiving strategy A is not affected by the behavior of users receiving strategy B; ③ the number of experimental units must be large enough to meet power requirements and avoid false-negative results, that is, cases where the strategy actually has an effect but it goes undetected for lack of sample size.

If the first element is not satisfied, it is hard to tell after the experiment whether the difference between groups was caused by the strategy or by the grouping, and hard to measure the strategy's true effect accurately. If the second is not satisfied, the strategy effect may be overestimated. For example, in the fulfillment delivery-range experiment (orange denotes the experimental group), expanding merchant A's delivery range shifts user demand from merchant B to merchant A, so the experimental group shows more orders than the control group, and the experiment concludes that expanding the delivery range increases total orders. Yet when the strategy is rolled out nationwide, orders do not increase significantly, because the lift observed during the experiment was merely order volume transferred from the control group to the experimental group. If the third element is not satisfied, it is hard to tell whether "no effect" is truly no effect or simply an effect undetected due to insufficient sample size.

Figure 6 Example of estimation bias caused by spillover effects

2.2.3 Statistical pitfalls that cannot be ignored

An AB experiment infers the behavior of the whole population from an observed sample; its conclusions are inferential and involve a great deal of statistical theory. One careless step and it is easy to fall into a statistical trap, making a reliable statistical conclusion hard to reach.

Whether the difference between the experimental and control groups is real or noise is judged with the help of significance testing. Drawing that conclusion involves variance, the choice of test, and P-value calculation, and these links are riddled with statistical traps: the slightest carelessness leads hypothesis testing to the wrong conclusion. The sampling method, distributional characteristics, and sample size determine the test and the specific P-value calculation to adopt; the experimental unit, the analysis unit, and the differences between the experimental and control groups determine the variance calculation; and the variance, as an input to the P-value calculation, directly affects the resulting P value. Ignoring any of these factors leads to a wrongly computed P value, and hence to wrong conclusions from hypothesis testing.

Figure 7 Factors affecting the conclusion of the experiment in the analysis link

Easily overlooked variance-calculation traps: if the variance cannot be estimated correctly, then both the P value and the confidence interval will be wrong, and these mistakes lead hypothesis testing to wrong conclusions. Overestimated variance causes false negatives, while underestimated variance causes false positives. Below are a few common mistakes made when estimating variance.

Treating dependent samples as independent. For example, in a rotation experiment that alternates by day, once the first day's assignment to the experimental or control group is fixed, the assignment of every subsequent day is determined in turn. The samples of the experimental and control groups are then not independent, and computing the variance with the independence formula misestimates it. In fact, the sample's assignment mechanism affects its variance calculation. In an AB test we split traffic into experimental and control groups, apply a strategy to the experimental group, compute the absolute lift or relative lift rate of a metric in the experimental group relative to the control group, and test whether the difference is statistically significant, thereby judging whether the experimental strategy really works.
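
The formula image from the original article is not preserved here; a standard form consistent with the text, for the difference of group means \(\bar{Y}_T\) and \(\bar{Y}_C\), is:

Var(\bar{Y}_T - \bar{Y}_C) = Var(\bar{Y}_T) + Var(\bar{Y}_C) - 2\,Cov(\bar{Y}_T, \bar{Y}_C)

The covariance term vanishes only when the assignment mechanism makes the two groups independent; in an alternating rotation experiment it is generally non-zero and must be estimated rather than dropped.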

As the formula above shows, the variance calculation depends on the assignment mechanism; ignoring the assignment mechanism leads to a wrong variance.

When evaluating relative lift, or when the experimental unit is inconsistent with the analysis unit, a wrong variance calculation easily underestimates the true variance and produces false positives. The relative lift rate of a metric is computed as shown in the following formula:
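
The formula image is likewise not preserved; the standard Delta-method approximation it refers to, for the relative lift \(\bar{Y}_T/\bar{Y}_C - 1\), is:

Var(\bar{Y}_T / \bar{Y}_C) \approx \frac{Var(\bar{Y}_T)}{\bar{Y}_C^2} + \frac{\bar{Y}_T^2}{\bar{Y}_C^4}\,Var(\bar{Y}_C) - \frac{2\,\bar{Y}_T}{\bar{Y}_C^3}\,Cov(\bar{Y}_T, \bar{Y}_C)

A ratio of two sample means is exactly the case where naive per-row variance formulas understate the true variance, which is why the Delta method (or an equivalent) is needed here.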

P-value pitfalls caused by easily overlooked test methods: statistics has no definitive answer to how large a sample must be before asymptotics can be trusted, and not every large-sample metric distribution satisfies the normality assumption, so applying a default normal-distribution test to skewed samples introduces bias. The Welch t test, a commonly used parametric test, essentially assumes that the sample means of the experimental and control groups are asymptotically normal; this rests on the central limit theorem in the large-sample regime. Statistics gives no complete answer to how large a sample must be for the central limit theorem to hold; in practice it depends on how far the underlying distribution deviates from the normal.

As a rule of thumb, a sample size greater than 30 may suffice when the sample deviates only slightly from a normal population. For skewed samples, however, Ron Kohavi et al. (2014) point out an empirical criterion: when the sample skewness s is at least 1, the central limit theorem can be considered to hold only when the number of observations used to compute the sample mean exceeds roughly 355×s². We examined an actual campaign experiment with a sample size of 13,832: the sampling distribution of the difference between the experimental and control groups was right-skewed and did not follow the normal distribution, as shown below:

Figure 8 Skewness example of data distribution

If the normal-distribution test is used by default to compute the P value in every scenario, the P value will easily be computed wrongly.
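
To make the rule concrete, here is a small sketch of the skewness check and the resulting choice of test. The simulated metric and the bootstrap fallback are illustrative assumptions; the 355×s² threshold follows the Kohavi et al. rule of thumb cited above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treat = rng.lognormal(0.05, 1.0, size=13832)    # right-skewed metric, as in Figure 8
control = rng.lognormal(0.00, 1.0, size=13832)

s = max(abs(stats.skew(treat)), abs(stats.skew(control)))
needed = 355 * s ** 2      # rule-of-thumb sample size for the CLT to apply

if min(len(treat), len(control)) >= needed:
    # Sample means are close enough to normal: Welch's t-test is safe.
    _, p = stats.ttest_ind(treat, control, equal_var=False)
else:
    # Otherwise, read significance off a bootstrap interval of the mean difference.
    diffs = [rng.choice(treat, treat.size).mean() - rng.choice(control, control.size).mean()
             for _ in range(2000)]
    lo, hi = np.percentile(diffs, [2.5, 97.5])  # significant if 0 lies outside [lo, hi]
```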

2.3 Platform construction based on a set of core abstractions is difficult to adapt to all business scenarios

The entire AB experiment process involves a great deal of statistical knowledge, and the textbook theory applies only when the actual business scenario satisfies its assumptions. In reality, many scenarios do not. In such cases, obtaining high-quality results requires an expert understanding of experiments and statistics, plus a lot of pipeline work: experimental design, configuration, metric processing, custom analysis, and so on. A failure in any link wastes a great deal of work.

Exposing method capabilities to experimenters through a single set of core abstractions makes it hard to avoid the credibility problems caused by improper use of those methods: a small deviation anywhere in the design makes the experimental and control groups incomparable and thus distorts the results. Case 1, variance estimation error: a mistake often made in experimental analysis is to compute the variance as if the samples were independent and identically distributed regardless of whether the grouping was actually random, leading to overconfidence in the estimate's precision, an underestimated variance, and a tendency toward false-positive errors.

As an extreme example, suppose we randomly select 100 students to estimate the average grade. If the 100 draws all land on the same student, their grades reflect only that one student; for the purpose of estimating the average grade of all students, the information content is equivalent to that provided by a single student.

If we nevertheless treat the draws as independent, the standard error of the sample mean is plainly wrong: we become overconfident in the precision of the estimate, that is, the standard error of the estimate is computed too small.

Case 2, the business scenario does not satisfy the theoretical constraints: difference-in-differences (DID) is an analysis model we commonly use in quasi-experiments. Its computation is very simple: the before-after difference in the experimental group's mean minus the before-after difference in the control group's mean. Depending on the business scenario, one can choose the traditional DID model or a fixed-effects DID model, but which model fits requires further study: in the current scenario, which model satisfies the parallel-trends assumption, that is, that without intervention the mean difference in the metric between the experimental and control groups stays constant over time? And among the models that satisfy parallel trends, which is better? Without rigorous testing, the estimates will be biased. The following is a case from our specific scenario:

Figure 9 Factors to be considered in choosing a double-difference model in quasi-experimental scenarios

Although a DID model can be shortlisted from general scenario characteristics, the choice must be verified against actual data. As the figure above shows, in the current business scenario the traditional DID model does not satisfy the parallel-trends assumption, and using it rashly would bias the estimate. The time-effect DID model and the individual-plus-time-effect DID model both satisfy parallel trends, but judging from the actual confidence intervals, the latter accounts for how the strategy differs across individuals, fluctuates less, and estimates closer to the true value, so the latter should be used.
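
For concreteness, here is a minimal difference-in-differences sketch using statsmodels on a synthetic panel; the column names and data are invented, and the two models correspond to the traditional DID and the individual-plus-time fixed-effect DID discussed above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic region-by-day panel: `treated` marks experimental regions,
# `post` marks days after the strategy intervention; true effect = 1.0.
rng = np.random.default_rng(1)
df = pd.DataFrame({"region": np.repeat(np.arange(20), 30),
                   "day": np.tile(np.arange(30), 20)})
df["treated"] = (df["region"] < 10).astype(int)
df["post"] = (df["day"] >= 15).astype(int)
df["y"] = 1.0 * df["treated"] * df["post"] + rng.normal(0, 1, len(df))

# Traditional 2x2 DID: the interaction coefficient is the effect estimate.
m1 = smf.ols("y ~ treated * post", data=df).fit()

# Two-way ("individual + time") fixed-effect DID: absorbs stable
# region-level and day-level differences before estimating the effect.
m2 = smf.ols("y ~ treated:post + C(region) + C(day)", data=df).fit()

print(m1.params["treated:post"], m2.params["treated:post"])
```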

3 How we conduct AB testing in fulfillment

3.1 AB testing difficulties faced in the multilateral business model

Spillover effects and small samples are the biggest challenges for experiments in our current business scenario; the fairness constraints that strategies impose on experimental grouping are another challenge we must face. Any one of these factors alone makes a confident experimental conclusion hard to reach, and in the fulfillment scenario they arrive combined, compounding the challenge.

Our real-time delivery logistics system plays the role of a transaction intermediary in a multi-sided market, matching the needs of users, riders, and merchants through the platform, and the platform optimizes this matching process through product strategies. Any one matching affects other matchings in the same time and space, producing strong spillover effects. Under spillover, the outcome of an experimental unit depends not only on the unit itself but also on other experimental units. Such network effects violate the independence principle of experimental units and bias the experimental results.

For example, in the merchant delivery-range experiment under the fulfillment business, the experimental treatment (the drawing of the delivery range) directly determines whether users can order from a given merchant, and different merchants share the same users in the same time and space. Users who would otherwise have ordered from merchant B now order from merchant A, transferring orders from the control group to the experimental group. Although the experiment shows the strategy lifting order volume, after full rollout the effect falls short of the expectation formed during the experiment, or even vanishes, because the observed lift may be nothing but order transfer: it does not mean the strategy really produced an improvement.

The LBS nature of the fulfillment business means most strategies act on regions (mainly delivery areas). Because delivery regions are limited in number and differ from one another, it is difficult to gather enough samples to detect small strategy improvements. The scheduling experiment, for instance, is constrained by its business form and spatial dimension: the minimum unit the scheduling algorithm acts on is a region or region group, so the experiment must split traffic at region granularity or coarser. Most cities have few regions or region groups, and the differences among a city's regions are often quite significant, which shows up in the data as sharp fluctuations in inter-region metrics.

In this scenario, severe small-sample problems and significant inter-region differences lead to low statistical power, making the small improvements of a strategy hard to detect. Under random splitting they also produce large gaps between the experimental and control groups in the distribution of covariates related to the response variable, amplifying the business-level heterogeneity of the two groups and casting doubt on the experimental results.

Worse still, under the mixed scheduling mode in this scenario, overlapping areas of different capacity types can share capacity and dispatch orders to each other, and an area can recall capacity from nearby areas. The spillover effects brought by these characteristics make the experimental effect estimate imprecise and can even introduce significant estimation bias. Overcoming the dual constraints of small samples and spillover effects in scheduling-type experiments is no small challenge.

Figure 10 AB testing difficulties faced in the contract fulfillment business model

3.2 Organization and process of AB testing

As the fulfillment business grows, we increasingly rely on good strategies to drive rapid growth in business scale and continuous optimization of efficiency, experience, and cost. AB testing is the most scientific way to assess the impact of strategy changes and map out clear cause-and-effect relationships; we quantify impact through AB testing and ultimately help the team make decisions. We bring people, process, and platform closer together, the essential elements of a successful experimentation ecosystem.

On the people side, we organically combine algorithm engineers (the experiment users), algorithm platform engineers, and data scientists (hereinafter, DS) into a virtual team. DS colleagues join the discussion of the algorithm team's annual goals at the start of strategy iteration, work with the algorithm team to formulate overall evaluation criteria that quantify strategy quality, select the experimental method suited to each scenario's characteristics, and complete the experimental design for that scenario. Algorithm platform engineers are responsible for integrating new methods into the experimentation platform and offering them to users as shared capabilities.

With the organization and platform in place, building an efficient, data-driven workflow around AB experiments is the key to achieving product goals through them. We divide the whole process into three stages: building ideas, verifying ideas through AB experiments, and distilling a knowledge base to form experimental memory.

Building ideas is the input stage of experimentation, and the quality of the ideas directly determines the effect of the experiments. If the ideas built at this stage are not good enough, the AB experiment stage can only serve to catch mistakes and reduce their probability; it cannot deliver gains.

Verifying ideas is the practice of the AB experiment itself, which divides into five links: experimental hypothesis, experimental design, experiment operation, experimental analysis, and experimental decision-making. The hypothesis link forms the experimental goals and constructs the overall evaluation criteria; the design link chooses the appropriate experimental method given the scenario's constraints.

Finally, the experimental decision is made through a Launch Review, and both successes and failures are written down to form experimental memory, which not only helps us discover the generality of a strategy but also helps us find opportunities in failure.

Figure 11 The fulfillment AB testing process

3.3 Introduction to AB Testing Platform

3.3.1 Platform overview

For AB experiments to be widely used and promoted in engineering, two properties are indispensable: parallelism (multiple experiments can run in parallel) and being prior (effect evaluation is obtained in advance on small traffic). The traffic-splitting framework directly determines the degree of experimental parallelism, and an experimental solution matched to the scenario directly determines the credibility of the prior conclusion.

To increase experiment parallelism and allow multiple mutually exclusive experiments to run simultaneously, we built a constraint-based traffic-splitting framework that regulates and constrains how different experiments share and use traffic. To ensure the platform delivers reliable results, for experimental design the platform directly outputs solutions rather than raw capabilities, and experimental analysis is fully automated, adaptively selecting the matching method based on the experimental design and data characteristics. The goal is that anyone, whether an expert in experimentation and statistics or an ordinary user with no statistical or experimental background, can trust the experiment's results.

To improve experimental parallelism, the industry offers the layer-and-domain nested overlapping framework and the constraint-based framework, the former represented by Google and the latter by Uber and Microsoft. The overlapping framework requires traffic to be evenly scattered and its usage planned in advance. This demands both heavy traffic, to guarantee even scattering, and accurate prediction of how the business will evolve, to guarantee a reasonable division of traffic usage. With light traffic the scattering cannot be even, and when the division of usage is unreasonable, domains allotted heavy traffic run few experiments and waste it while domains allotted light traffic run too many, leaving traffic short and experiments queuing. The result is failed online experiments, new strategies that cannot be promoted properly, and long-term experiments that cannot be carried out.

On the fulfillment technology platform, the splitting unit is often a region, a region group, or even a city, and the sample size is limited, failing the sample-size requirement for even scattering. Moreover, the fulfillment business keeps evolving, so planning traffic usage in advance from business forecasts is difficult. For these two reasons, fulfillment adopts the constraint-based traffic-splitting framework.

In experimentation it is easy to use statistical methods to compute a number, but not easy to ensure that the statistical methods are properly matched so as to reach reliable conclusions, especially in a platform-economy business model that connects users, riders, and merchants. Each experiment must balance the two goals of reducing network effects and improving experimental power, formulate an experimental method matched to the scenario, and reach a confident conclusion.

Doing this requires an expert understanding of experimentation and statistics. To lower the threshold and safeguard experimental credibility, the platform directly outputs solutions, rather than raw capabilities, for experimental design, and fully automates experimental analysis, sparing experimenters from spending large amounts of energy justifying their scheme and from the credibility problems that human factors introduce.

3.3.2 A constraint-based traffic-splitting framework adapted to fulfillment business scenarios

The traffic-splitting framework is like the laws and regulations that govern everyone's daily behavior and keep society orderly: it regulates and constrains how different experiments share and use traffic without affecting one another, and it is the top-level design of an experimentation platform. In a constraint-based framework, the experimenter specifies constraints, and the platform's conflict detection judges from those constraints whether to allow the experiment. Before going further, we introduce three concepts: algorithm Key, scenario, and experiment template.

An algorithm Key represents a group of functions that can be tested independently, expressed technically as an independent set of parameters. A scenario is the collection of experiments that share the same algorithm Key (multiple algorithm Keys for a joint experiment, one otherwise) and the same experiment template. An experiment template is a set of configurations sharing the same experiment type, experimental unit, grouping method, and evaluation method.

Two considerations: ① under the same algorithm Key, different experiments test different versions of the same function, so they should be mutually exclusive; ② across different algorithm Keys, as long as the corresponding functions have no potential interaction, the experiments are naturally orthogonal and can safely reuse traffic; if a potential interaction exists, then as long as the traffic can be randomly scattered, the influence of the interaction on the experimental conclusion can be eliminated.

Therefore, for parallel experiments the preliminary constraints are: ① any two experiments under the same algorithm Key cannot reuse traffic and conflict; ② two experiments with potential interactions under different algorithm Keys may reuse traffic as long as one of them is a randomized controlled experiment. Constraint ② not only avoids the risk of potential interaction between strategy experiments in a full-factorial traffic framework, but also avoids the low traffic reuse caused by isolating traffic into different domains in the overlapping framework, especially when quasi-experiments and observational studies are numerous: because they sit in different domains from randomized controlled experiments, traffic cannot be reused between the two.

Considering that different experiments under the same algorithm Key may target different traffic or verify iterations of a function, whether an experiment under one algorithm Key conflicts with an experiment under another depends on the functions being tested and the experimental methods used. We therefore introduce scenarios to describe each algorithm Key's function and its experimental method, build a business impact matrix between scenarios from business experience, and from the impact matrix, the scenarios' experimental methods, and the parallel-experiment constraints generate a scenario experiment conflict matrix, on which conflict detection between experiments under different algorithm Keys is based.

Figure 12 Traffic-splitting framework of the fulfillment experimentation platform

A joint experiment spanning scenarios conflicts with all scenario experiments under its algorithm Keys, and its relationship with scenario experiments under other algorithm Keys is judged by constraint ②. To keep conflicting experiments from overlapping on traffic, the platform provides the ability to define traffic ranges with expressions and detects coverage overlap between expressions. The constraint-based framework needs no advance planning of traffic usage and has no complex concepts of layers and domains; traffic is circled on demand for each experiment, and the experiment can launch as soon as it passes conflict detection. This lowers the threshold for users, increases the platform's flexibility, and fits the fulfillment business scenario.
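
A minimal sketch of expression-coverage detection, under the simplifying assumption that each traffic expression is a conjunction of attribute-in-set clauses; real expressions are richer, but the overlap rule is the same:

```python
def ranges_overlap(expr_a: dict, expr_b: dict) -> bool:
    """Each expression is a conjunction of `attribute in {values}` clauses,
    e.g. {"city": {"beijing"}} means: city in {beijing}."""
    for attr in expr_a.keys() & expr_b.keys():
        if not expr_a[attr] & expr_b[attr]:
            return False   # disjoint on a shared attribute: cannot overlap
    # Every shared attribute intersects (or none are shared): may overlap.
    return True

a = {"city": {"beijing", "tianjin"}}
b = {"city": {"tianjin"}, "area_type": {"mixed"}}
c = {"city": {"shanghai"}}

print(ranges_overlap(a, b))  # True  -> conflicting experiments must not share this traffic
print(ranges_overlap(a, c))  # False -> safe to run in parallel
```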

3.3.3 Packaging and outputting experimental designs to lower the experimental threshold and ensure experiment quality

Constrained by fulfillment's spillover effects, small samples, and fairness factors, experimental design is a process of balancing variance and bias across multiple objectives: reducing spillover effects, improving experimental power, and attending to experimental fairness.

Although the grouping methods a scenario allows (whether grouping is allowed, and if so whether random grouping is allowed) narrow the choice among "ordinary randomized controlled experiment", "random rotation experiment", "quasi-experiment", and "observational study", which experimental method to use, and which splitting unit to use within it, is the balanced result of comprehensively weighing spillover effects, experimental power, fairness, and other factors.

For example, in the waybill experiment, waybills in the experimental and control groups can originate from the same area, and waybills in the same area share riders, so the waybills are not independent and spillover arises between the groups. A rotation experiment is one way to solve this, provided we balance the following two conflicting goals:

  • We want to carve out more experimental units to increase the sample size; this pushes us to make each unit small enough that we obtain more units, and thus enough samples to meet the experiment's sensitivity requirements.

  • We want each experimental unit to be large enough that interacting individuals are contained within a single unit, eliminating the influence of spillover effects on the results.

Figure 13 Rotation experiment - balance between sample size and spillover effect

With limited samples, simple random grouping not only leaves some metrics of the experimental and control groups deviating before the experiment but also fails to detect small strategy improvements for lack of sample size. We identify the covariates that drive the metric differences and improve the grouping method to strike a balance between bias and variance, or allow a pre-experiment deviation and compensate by correcting for it after the experiment. A reasonable experimental scheme is formulated by jointly weighing the grouping scheme and the matching analysis method against the data.

We provide data scientists with a series of experimental-design tools that assist them in the key balancing of variance and bias in experimental design and in outputting solutions matched to the scenario. Because users, merchants, and riders form complex interactions through the fulfillment platform, running an AB experiment in fulfillment requires weighing spillover effects against experimental power to produce a sound design, which demands expert-level understanding and substantial time. The platform therefore provides experimental-design capability: one can select the experiment type, experimental unit, grouping method, analysis method, and evaluation metrics as needed, complete the design, and verify its feasibility.

In the fulfillment technology department, data scientists are responsible for experimental design, formulating matched experimental plans for each scenario and freeing algorithm engineers to spend more time thinking about how to iterate strategies. To meet the experimental needs of fulfillment's various scenarios, the platform provides the following types of experiment templates:

Figure 14 Experiment design templates the platform recommends to data scientists

To help data scientists further determine, under the chosen experiment template, the experimental unit, the specific grouping method, and the analysis method, based on the metric to be detected, its expected lift, and the available samples, and so formulate an experimental plan matched to the scenario, the platform provides experimental-design toolkits: a grouping toolkit, a variance-reduction toolkit, a significance-analysis toolkit, and an experimental-design report toolkit covering MDE analysis, homogeneity testing, sample-size estimation, and more.

Figure 15 Design of experiments toolkit
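
As a rough sketch of what the sample-size estimation and MDE tools compute, using statsmodels' power routines; the baseline metric, its standard deviation, and the targets are made-up numbers:

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Sample size: how many units per group to detect a 2% relative lift on a
# metric with mean 10 and standard deviation 4 (Cohen's d = 0.2 / 4 = 0.05)?
n_per_group = power.solve_power(effect_size=0.05, alpha=0.05, power=0.8)

# MDE, the inverse question: with only 400 regions per group, what is the
# smallest standardized effect detectable at 80% power?
mde = power.solve_power(nobs1=400, alpha=0.05, power=0.8)

print(round(n_per_group), round(mde, 3))
```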

The platform directly outputs an experimental design matched to each experimenter's scenario, so experimenters need not worry about credibility problems caused by improper use of experimental methods. In the platform's scenario management module, the scenario's concrete experimental scheme is configured: experiment type, experimental unit, grouping method, evaluation metrics (target metrics, guardrail metrics, and driver metrics), and variance-reduction method. When configuring an experiment, the experimenter is given sample-size estimation and MDE analysis tools to complete traffic selection and determine the experiment period; the platform then outputs the experimental design report, the grouping results, and the MDE and homogeneity test reports. Once the checks pass, the experimenter configures the strategy parameters for the experimental and control groups, completes the final configuration, and launches the experiment.

Figure 16 The platform determines the experimental plan for the specific experiment based on the experimental design plan and the experimental configuration of the experimenter

3.3.4 Build an analysis engine suitable for different experimental methods and standardize experimental analysis

Reliable data and scientific analysis methods are the keys to a credible analysis. Experimental analysis involves a great deal of statistical theory, and one careless step drops you into a statistical trap. Considering that most experimenters lack statistical training and that self-service analysis costs time, we built a unified analysis engine and standardized the analysis process: it provides matching analysis methods for different experimental designs, verifies the statistical significance of the relevant metrics, estimates the strategy effect, and helps us make data-driven decisions from the results. The analysis generally includes the following steps:

  1. Through data diagnosis, ensure the reliability of the analysis data;

  2. Based on the grouping method and analysis method, estimate the experimental effect;

  3. Based on the grouping method, the data type, the relationship between the experimental unit and the analysis unit, and the analysis method, select the appropriate variance calculation, and reduce variance to improve experimental sensitivity and avoid false negatives;

  4. Based on the grouping method and the data's distributional characteristics, select the appropriate test method to compute the variance and the P value, and verify the statistical significance of the relevant metrics to give a statistical conclusion;

  5. Based on the diagnosis and analysis, output the experiment report.

Figure 17 Statistical engine of the fulfillment experimentation platform

Data diagnostics in the analysis phase are designed to alert experimenters to possible violations of experimental assumptions. Many people assume experiments always run as designed, when in fact this assumption fails far more often than expected, and the conclusions of failed experiments are usually seriously biased, some outright wrong. Before the significance report is produced, guardrail metrics are checked to ensure the business is not harmed by the strategy iteration; group homogeneity checks and the SRM (sample ratio mismatch) check verify that the experiment executed as expected, protecting the experiment's own credibility; and the sampling-distribution check provides the basis for choosing an appropriate significance test afterwards.
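
The SRM check itself reduces to a one-line chi-square test; the counts below are illustrative:

```python
from scipy.stats import chisquare

observed = [50_621, 49_377]                    # units that actually landed in each group
ratio = [0.5, 0.5]                             # configured treatment/control split
expected = [sum(observed) * r for r in ratio]

stat, p = chisquare(observed, f_exp=expected)
if p < 0.001:
    print("SRM detected: investigate the experiment's execution before reading any metric")
```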

In the experimental-analysis step, the engine automatically selects the analysis method that matches the data and the experimental design, avoiding statistical traps. According to the grouping method it provides two effect estimators, the difference method and difference-in-differences. Variance calculation and P-value calculation are where statistical traps concentrate, so the engine provides a decision procedure for each.

First, variance calculation is divided into independent-sample and non-independent-sample cases according to whether sampling was random. For independent samples, depending on whether the metric is an absolute or relative lift and whether the splitting unit coincides with the analysis unit, the engine chooses between direct calculation and the Delta method, avoiding the variance-calculation traps; for non-independent samples, it gives an accurate variance by simulating the actual distribution of the data.

Second, for ultra-small samples (fewer than 30), the non-parametric Fisher test is used to meet power requirements; for very large samples (around 10,000 or more), the asymptotic normality of the test statistic is accepted and the Welch t test is used; when the sample size is above 30 but below about 10,000, the actual distribution of the sample is examined further: if the statistic's asymptotic normality holds, the Welch t test is used, and if not, statistical inference is based on the Bootstrap-estimated distribution.
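
A condensed sketch of these dispatch rules follows; the thresholds mirror the text, scipy's permutation test stands in for the Fisher randomization test, and the normality probe (resampling the mean and testing it) is one plausible reading of "examine the actual distribution":

```python
import numpy as np
from scipy import stats

def compare(treat, control, rng=None):
    rng = rng or np.random.default_rng(0)
    n = min(len(treat), len(control))
    diff = lambda x, y: np.mean(x) - np.mean(y)
    if n < 30:
        # Fisher-style randomization test: exact under the sharp null.
        res = stats.permutation_test((treat, control), diff, n_resamples=10_000)
        return res.pvalue
    if n >= 10_000:
        # Large sample: accept asymptotic normality, use Welch's t-test.
        return stats.ttest_ind(treat, control, equal_var=False).pvalue
    # In between: probe whether the resampled mean looks normal enough.
    means = [rng.choice(treat, n).mean() for _ in range(1000)]
    if stats.shapiro(means).pvalue > 0.05:
        return stats.ttest_ind(treat, control, equal_var=False).pvalue
    # Otherwise infer from the bootstrap distribution of the difference.
    boot = [diff(rng.choice(treat, len(treat)), rng.choice(control, len(control)))
            for _ in range(2000)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return (lo, hi)   # significant if 0 falls outside this interval
```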

3.3.5 Platform construction must not only be rooted in business scenarios, but also implement strict quality control to ensure credible results

The construction of the fulfillment experimentation platform rejects the simple stacking of features; instead we designed a flexible, scalable experimental pipeline architecture and put the bulk of our energy into the experimental design and analysis schemes for business scenarios, with strict quality control to ensure credible results. Throughout, data scientists played a dual role. As members of the platform-building effort, they stayed half a step ahead of engineering, going deep into the business, defining new problems and finding answers, and working with engineers to build each new capability.

At the same time, data scientists, as a special group of users, are responsible for the platform's quality control and product usability. Before the platform releases a new experimental method, it must pass their AA simulation: hundreds of simulated AA experiments are run to check whether the P values of the metrics of interest are uniformly distributed between 0 and 1. If verification passes and the release conditions for the new capability are met, it is released; otherwise the analysis continues until the problem is found. The following table shows the simulation verification performed when the Fisher and Neyman tests were introduced for the random rotation experiment.

(Table: AA simulation verification of the Fisher and Neyman tests for the random rotation experiment)
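
The AA gate can itself be expressed in a few lines; the simulation count and the uniformity check below are one plausible realization, not the platform's actual code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
pvals = []
for _ in range(500):                     # hundreds of simulated AA experiments
    pool = rng.normal(0, 1, 2_000)
    rng.shuffle(pool)                    # AA: both "groups" receive the same strategy
    a, b = pool[:1_000], pool[1_000:]
    pvals.append(stats.ttest_ind(a, b, equal_var=False).pvalue)

# A correct method yields Uniform(0, 1) p-values under AA.
uniformity_p = stats.kstest(pvals, "uniform").pvalue
false_positive_rate = np.mean(np.array(pvals) < 0.05)   # should sit near 5%
print(uniformity_p, false_positive_rate)
```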

4 Summary and Outlook

Fulfillment algorithm and business colleagues run tens of thousands of experiments each year, covering every aspect of the fulfillment business, and we have accumulated knowledge on how to run better experiments and how to use experiments to make better decisions. This article introduced fulfillment's practice in building credible experiments from the perspective of general experimental knowledge, hoping to be helpful to experimenters. The fulfillment problem space presents unique challenges due to its size, its reach, and the distinctive nature of its multi-sided business model: spillover effects, small samples, strategy fairness, and other combined factors constrain our ability to run credible experiments. In solving these problems we have accumulated a series of practices, and we will publish related articles in the future.

Depending on factors such as metric type, sample size, and sample distribution characteristics, different methods such as linear models, the Delta method, and the Bootstrap can be applied to compute P values and standard errors; and in experiments whose experimental unit is a city, region, or station, the appropriate method can be selected automatically to adjust the standard error, avoiding false positives due to data clustering. This flexible analysis capability matters for rapid strategy iteration and is popular with users, who can focus their energy and time on the other key aspects of their experiments.
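
Where analysis rows (orders) are finer than the experimental unit (region), the clustering adjustment can be done with cluster-robust covariance, as in this illustrative sketch on synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"region": np.repeat(np.arange(40), 250)})   # 40 regions x 250 orders
df["treated"] = (df["region"] % 2).astype(int)                 # assignment at region level
df["y"] = (0.1 * df["treated"]
           + rng.normal(0, 1, 40)[df["region"]]                # shared region-level noise
           + rng.normal(0, 1, len(df)))

naive = smf.ols("y ~ treated", data=df).fit()
clustered = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["region"]})

# The order-level SE ignores that randomization happened at region level and
# is far too small; clustering restores an honest standard error.
print(naive.bse["treated"], clustered.bse["treated"])
```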

On this basis we built a unified analysis engine that standardizes the core experiment frameworks, such as ordinary randomized controlled trials, random rotation experiments, covariate-adaptive experiments, quasi-experiments, and observational studies, along with cutting-edge experimental evaluation techniques from industry and academia, such as sample-size estimation, variance reduction, MDE analysis, data correction, and carryover-effect estimation for rotation experiments, to reduce the time spent on analysis. In the future we will open these capabilities further to serve more users.

5 Authors

Wang Peng, Yong Bin, Zhong Feng, and others, all from the Fulfillment Platform Technology Department of Meituan's Daojia R&D Platform.

----------  END  ----------

 Job Offers 

Fulfillment Platform Technology Department - Data Science Engineer

Candidates are expected to build statistical models, apply machine learning techniques, analyze delivery business data, and use these modeling techniques to construct relevant metrics that support key business decisions.

  1. Work with algorithm or business teams on various experiments to ensure their scientific rigor and efficiency; take charge of the design and evaluation of complex experiments and provide recommendations for business decisions through experimental analysis; own the planning and evolution of the scientific experiment evaluation platform.

  2. Work closely with the business, understand the business and products deeply, translate business and product problems into data and technical problems, and design reasonable solutions: for example, proactively explore and mine data to help the business automatically identify false complaints and drive improvements to control rules, or use systematic causal inference to explain fluctuations in core daily business metrics and find problems in time.

  3. Track technology trends and strengthen industry benchmarking research on causal inference, statistical inference, anomaly detection, and other data science methods applicable to the delivery business, and apply these methods to real business problems.

  4. Guide team members and data analysts, help them grow rapidly, and cultivate data analysis talent for the team.

Welcome to join us, please send your resume to: [email protected].

 recommended reading 

  |  Systematic Modeling of Data Governance Integration Practice

  |  Meituan Distribution Data Governance Practice

  |  Systematic thinking and practice of business data governance
