Volcano Engine DataTester: 8 Common Mistakes in A/B Experiments

Volcano Engine DataTester is a scientific, credible A/B testing and intelligent optimization platform built on ByteDance's long-term experience. It couples deeply with industry scenarios such as recommendation, advertising, search, UI, and product features, and provides a scientific basis for decisions across business growth and conversion, product iteration, and operational efficiency, making the business genuinely data-driven. DataTester currently serves hundreds of benchmark customers, including Midea, Get, and Kaishu Storytelling, bringing mature "data-driven growth" experience to a wide range of industries.

To truly master A/B experiments, you need to know not only what you should do but, more importantly, what you should not do. This article summarizes 8 common mistakes in A/B experiments; let's walk through them together.

No.1 Blaming the traffic-splitting service or the statistics whenever an A/A experiment is significant

Generally speaking, it is reasonable to use A/A experiments on an A/B testing platform to verify that the traffic-splitting service is working properly. However, insisting that the traffic-splitting service or the statistics must be broken whenever an A/A experiment turns out significant reflects a misunderstanding.

As mentioned when explaining the significance level, hypothesis testing can produce a Type I error: my strategy does not work, yet the experimental result says it does. At a significance level of 5% (a 95% confidence level), the probability of this error is 5%. In other words, if we run 100 A/A experiments and look at a given metric, roughly 5 of them may show a significant result. This is due to unavoidable sampling error.

Therefore, a statistically significant difference in an A/A experiment is simply a matter of probability. Far from being broken, hypothesis testing uses sampling error to quantify the probability of being wrong and keeps it within 5% (at a 95% confidence level): when we observe a significant A/B result, the p-value tells us how likely such a result would be if the strategy actually had no effect. In short, significant A/A results are a normal phenomenon.
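To see this concretely, here is a minimal simulation sketch (purely illustrative, not part of DataTester): run many A/A "experiments" in which both groups are drawn from the same distribution, and count how often an ordinary two-sample test calls the difference significant at the 5% level.

```python
# Simulate A/A experiments: both groups come from the same distribution,
# yet roughly 5% of them will look "significant" at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 1000      # number of simulated A/A experiments
n_per_group = 5000        # users per group (illustrative)

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=10.0, scale=3.0, size=n_per_group)   # same metric,
    b = rng.normal(loc=10.0, scale=3.0, size=n_per_group)   # same distribution
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1

print(f"Significant A/A results: {false_positives / n_experiments:.1%}")
# Expected output: around 5%, driven entirely by sampling error.
```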

No.2 Ignoring over-exposure

What is over-exposure? It happens when the experiment is set up so that a large number of users who never experienced the experimental feature are still counted in the denominator of the experiment's metrics, diluting the metric values. (The feature under test may sit at a deep entry point: a user opens the app but never actually reaches that feature, yet is still counted as a group member and included in the metric calculation.)

The impact of this "metric dilution" on experimental analysis is twofold: the effective sample size is lower than the number of users actually counted in the group, and noise is introduced into the data, so the experiment takes longer to reach statistical significance; at the same time, because the metric is diluted, the confidence interval of the lift also carries a certain amount of statistical error.
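As a rough illustration of what dilution costs (all numbers below are hypothetical), the sketch uses the standard two-proportion sample-size formula to compare an experiment that counts only users who actually reached the feature with one whose metric is diluted over every bucketed user.

```python
# Hypothetical scenario: only 30% of bucketed users ever reach the feature under
# test, so the conversion metric computed over all bucketed users is diluted.
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Normal-approximation sample size for detecting p1 vs p2, per group."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

exposure = 0.30                      # share of bucketed users who see the feature
p_ctrl, p_test = 0.050, 0.055        # conversion among exposed users (+10% relative)

print(f"Counting only exposed users : {n_per_group(p_ctrl, p_test):,.0f} users per group")
print(f"Counting all bucketed users : {n_per_group(p_ctrl * exposure, p_test * exposure):,.0f} users per group")
# ~31k vs ~108k users per group: dilution shrinks the absolute lift from
# 0.50pp to 0.15pp, so far more counted users are needed to reach significance.
```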

No.3 The multiple comparisons problem

Multiple comparisons raise the probability that a statistical conclusion is wrong. An A/B experiment based on hypothesis testing applies to comparing two groups, A and B, and can help you choose the better of strategy A and strategy B: at a 95% confidence level, assuming the new strategy is useless, a single comparison carries a 5% probability of a Type I error (my strategy is useless, but the experiment concludes it is useful). However, if the experiment is an AABB experiment, an ABCD experiment, an ABCDEFG experiment, and so on, the situation is completely different: with more than two groups we face the multiple comparisons problem, and the probability of making a mistake becomes much larger than 5%.

Take an ABCD experiment as an example. Suppose there is actually no significant difference among strategies A, B, C, and D. Comparing them pairwise gives 6 combinations, so 6 comparisons are needed. If even one of those 6 comparisons goes wrong, our conclusion is wrong, so the error probability for each statistical metric becomes 1 - (1 - 5%)^6 ≈ 26.5%, far greater than 5%.
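The arithmetic above is easy to verify in a few lines; the Bonferroni threshold at the end is one common, conservative remedy for multiple comparisons, shown here purely as an illustration rather than as DataTester's built-in approach.

```python
# Family-wise error rate for an ABCD experiment compared pairwise.
from math import comb

alpha = 0.05
groups = 4
comparisons = comb(groups, 2)                     # C(4, 2) = 6 pairwise comparisons

family_wise_error = 1 - (1 - alpha) ** comparisons
print(f"P(at least one false positive) = {family_wise_error:.1%}")        # ~26.5%

# One common (conservative) remedy: Bonferroni correction, i.e. test each
# pairwise comparison at alpha / comparisons instead of alpha.
print(f"Bonferroni per-comparison threshold = {alpha / comparisons:.4f}")  # 0.0083
```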

Another point to note: hypothesis testing takes "there is no significant difference between A and B" as the null hypothesis, and the p-value of B relative to A does not mean that the probability of B being better than A is 1 minus the p-value. Precisely because of this, when there are multiple comparisons (especially when the groups' strategies are evenly matched, with no obvious winner), hypothesis testing struggles to provide a criterion for judging which strategy is best. Together, these two problems greatly increase both the difficulty of deciding which of A, B, C, and D is better and the risk of getting it wrong.

No.4 Chasing significance for its own sake

What does chasing significance for its own sake look like? In practice, this mistake mainly shows up in two situations:

  • Watching too many metrics and declaring the strategy effective as long as any one of them turns out significant.

We have repeatedly emphasized that an experiment must have clear goals: decide in advance which metrics truly measure the effect of the experiment and set those as the core metrics to observe. If we watch too many metrics, it is entirely normal for some metrics that should not be significant to come out significant by chance, and the experimenter can easily be misled by this appearance into believing that the strategy worked.

  • Drilling down on a core metric across many dimensions and declaring the strategy effective as long as the metric is significant in some dimension.

In the experiment report, some metrics carry an "M"-like symbol, which means that although the metric is not significant overall, it is significant in some dimension when drilled down across multiple dimensions.

When analyzing results, some experimenters reason: under the new strategy, if the metric is significant in some dimension, then my strategy must be effective. In practice, this reasoning does not hold.

For example, suppose an app has users in 5 countries across 3 client types. Combining country and client type gives 15 drill-down segments. How likely is it that a metric comes out significant in at least one of these segments purely by chance?

A quick calculation shows the probability exceeds 50%. Using significance in a single segment to validate the strategy's effect is therefore unreasonable.
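Here is the calculation behind that estimate, treating the 15 segments as roughly independent (an approximation, but enough to show the scale of the problem).

```python
# Chance that at least one of 15 drill-down segments looks significant by accident.
alpha = 0.05
segments = 5 * 3      # 5 countries x 3 client types

p_some_segment_significant = 1 - (1 - alpha) ** segments
print(f"P(some segment is 'significant' by chance) = {p_some_segment_significant:.1%}")
# ~53.7%, i.e. more likely than not even when the strategy has no effect at all.
```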

In summary: stick to the goals and evaluation criteria set in the experiment design stage, do not swap in other, weakly causal metrics in place of the original core metrics just to obtain a statistically significant conclusion, and do not over-segment the data. If analysis genuinely suggests that the new strategy has a special impact on a particular group of users, the recommendation is to update the experiment goal and run a targeted A/B experiment on that group as a second evaluation.

No.5 Stopping the experiment as soon as it turns significant

ByteDance's data analysts have a popular saying: "Don't read the experiment report too early." What does it mean? Before the estimated sample size is reached (which can also be understood as reaching the preset number of experiment days), don't look at the results prematurely. The results may already look significant at that point, and it is tempting to stop the experiment and take the current "significant" result as the conclusion, but that is the wrong approach.

For experiments with no real difference between groups (that is, the new strategy is ineffective), the metrics can easily look significant when observed early in the experiment. We call this a false positive. Using hypothesis testing to quantify sampling error rests on the premise of a sufficient sample size; when the sample is too small, sampling error has a larger impact on the metric. As the experiment runs longer, the sample size keeps growing and the p-value keeps changing. By the time the cumulative number of users in the group reaches the estimated sample size, an early false positive may well have turned into an insignificant result.
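A small simulation (illustrative only, not DataTester's internal logic) makes the danger of peeking visible: in A/A experiments, checking the p-value every day and stopping at the first significant reading produces far more than 5% "significant" conclusions, while a single test at the planned sample size stays near 5%.

```python
# Compare "peek every day and stop when significant" against
# "test once at the planned sample size", using A/A data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, days, users_per_day = 0.05, 14, 500
n_experiments = 500

stopped_early, significant_at_end = 0, 0
for _ in range(n_experiments):
    a = rng.normal(10, 3, size=days * users_per_day)   # identical groups
    b = rng.normal(10, 3, size=days * users_per_day)
    for d in range(1, days + 1):
        n = d * users_per_day
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < alpha:                  # would have stopped "significant" on day d
            stopped_early += 1
            break
    _, p_final = stats.ttest_ind(a, b)
    significant_at_end += (p_final < alpha)

print(f"Declared significant when peeking daily : {stopped_early / n_experiments:.1%}")
print(f"Significant only at planned sample size : {significant_at_end / n_experiments:.1%}")
# Peeking inflates the false-positive rate well above the nominal 5%.
```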

Look at the example in the figure below: in this A/A experiment, the estimated sample size is 5,000. The figure shows the experiment reaching significance in the middle of its run (confidence level above 95%); as the sample size grew, the conclusion finally settled at not significant.

Therefore, before the experiment reaches its estimated sample size, the result may fluctuate between significant and not significant, and a conclusion drawn too early is unreliable. ByteDance's own A/B testing platform, DataTester, recommends observing metrics as multi-day cumulative metrics. From a business point of view, multi-day cumulative metrics fluctuate from day to day, and behavior differs greatly between weekends and weekdays, so it is recommended to run the experiment for a whole number of natural weeks before making a decision.

No.6 Refusing to stop an experiment that is not significant

Contrary to mistake No.5, in this case the experimenter keeps the experiment running until it becomes significant.

In an A/B experiment, no matter how similar strategies A and B are, they are never exactly the same. In theory, as long as the sample is large enough (say, approaching infinity), any difference between the treatment and control strategies will eventually make the result statistically significant. For example, if an experiment ran for 10 years and the new strategy lifted the metric by 0.001%, the result might be statistically significant, yet this significance has no practical meaning.
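A back-of-the-envelope calculation (hypothetical numbers, using a standard two-proportion z-test) shows how statistical significance and practical significance come apart at very large sample sizes.

```python
# A 0.01-percentage-point lift on a 5% baseline: irrelevant in practice,
# yet statistically significant once the sample is large enough.
from math import sqrt
from scipy.stats import norm

def two_prop_p_value(p1, p2, n):
    """Two-sided z-test for two proportions with n users per group (pooled variance)."""
    p_pool = (p1 + p2) / 2
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = abs(p1 - p2) / se
    return 2 * (1 - norm.cdf(z))

p_ctrl, p_test = 0.0500, 0.0501            # +0.01pp, i.e. +0.2% relative
for n in (100_000, 100_000_000):
    print(f"n per group = {n:>11,}: p-value = {two_prop_p_value(p_ctrl, p_test, n):.4f}")
# At 100k users per group the difference is nowhere near significant; at 100M it is.
# Whether a 0.01pp lift is worth shipping is a business question, not a statistical one.
```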

Therefore, follow the experiment design. If the experiment has reached the required sample size within its planned running period and the target metric still shows no significant change, there is no need to keep it running. Stop the experiment and try a different direction.

No.7 Assuming the lift after full launch will equal the lift in the experiment

Suppose an experiment is opened to optimize the purchase rate of a product page, and the treatment group using new strategy B shows a purchase-rate lift of 3%, with a confident (statistically significant) result. Does that mean that after strategy B is fully launched, the page's purchase rate will rise by exactly 3%? Not necessarily. An A/B experiment samples a small slice of traffic, and the sample cannot fully represent the population.

The correct approach is to use the hypothesis test together with the significance level to estimate a range for the lift, known as the confidence interval. Suppose that in the example above the calculated confidence interval is [1.5%, 4.5%]. Then after strategy B actually launches, the interval [1.5%, 4.5%] has a 95% probability of covering the real purchase-rate lift (at a significance level of 0.05).
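As a sketch of how such an interval can be computed, the snippet below builds a Wald-style 95% confidence interval for the difference of two proportions; the counts are hypothetical and merely chosen so the result lands near the [1.5%, 4.5%] example (DataTester's own computation may differ in detail).

```python
# Confidence interval for the lift in purchase rate between control and treatment.
from math import sqrt
from scipy.stats import norm

n_ctrl = n_test = 3500
conv_ctrl, conv_test = 350, 455            # 10.0% vs 13.0% purchase rate (hypothetical)

p_c, p_t = conv_ctrl / n_ctrl, conv_test / n_test
diff = p_t - p_c
se = sqrt(p_c * (1 - p_c) / n_ctrl + p_t * (1 - p_t) / n_test)
z = norm.ppf(0.975)                        # 95% confidence level

low, high = diff - z * se, diff + z * se
print(f"Observed lift: {diff:.1%}, 95% CI: [{low:.1%}, {high:.1%}]")
# -> roughly [1.5%, 4.5%]: what matters after full launch is this range,
#    not the single point estimate of 3%.
```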

In short, if you want to know how a metric may change after the new strategy goes live, refer to the confidence interval.

No.8 Blindly putting data first

We advocate letting data speak rather than relying on subjective assumptions, and when evaluating an experiment we should look not only at the metric lift but also at its confidence to judge how reliable the data is. In some cases, however, the data only conveys one side of the story, and we need to make causal inferences about the facts behind it to ensure there is a sound causal relationship between the data used as evidence and the claim to be proved. Only then is the data an effective weapon in our argument; otherwise we merely possess the data.

In experiments, we need to design the experiment sensibly and set expectations based on our own business judgment; when the results of an A/B experiment run counter to business intuition, we should stay skeptical.
