In Depth | Applying Bayesian Structural Time Series Models to Full-Rollout Marketing Effect Evaluation

About the Author

Yiwen is a data analyst at Ctrip focusing on user growth, causal inference, and data science.

1. Background

Scientifically inferring the effect of a product strategy on observed metrics is essential: it helps product and operations teams measure the strategy's value accurately and steer subsequent iterations in the right direction.

Within the causal inference framework, the gold standard for effect evaluation is the A/B experiment, because its traffic split is fully random and uniform; the incremental value of an intervention can then be read directly from the difference in metrics between the treatment and control groups. In many scenarios, however, a strict A/B experiment is infeasible: hotel pricing or cash reward distribution, for example, are not suited to showing different content to different groups of users. For such problems, we turn to causal inference methods to evaluate the strategy's effect.

This article introduces the principle of the BSTS model and its implementation via CausalImpact, aiming to estimate causal effects more accurately and scientifically for data with periodic structural characteristics. We first briefly explain the model's principle, then demonstrate the code logic on simulated data, and finally apply it to a concrete business scenario.

2. Existing methods and potential problems

When operations and product teams evaluate a fully launched strategy, the most common approach is to compare metrics before and after launch. The biggest problem with this method is its premise: it assumes the launched feature is the only variable affecting the metric (i.e., there are no other intervening or confounding variables), an assumption that rarely holds in practice.

One alternative is PSM (Propensity Score Matching): among all users outside the treatment group, find a set whose characteristics closely resemble the treated users, then compare their metrics (such as order conversion rate and order revenue) against the treatment group's to measure the impact of the intervention. However, this method depends heavily on the chosen user features and the quality of the final match.

Another is SCM (Synthetic Control Method), which combines several untreated regions into a "synthetic virtual region" for overall comparison against the regions where the strategy launched. This, too, rests on a key assumption: that regions with highly synchronized long-term trends can be found for comparison, a condition that is often hard to meet.

Building on traditional SCM, we adopt an approach akin to ensemble learning: take multiple untreated control series as inputs and, combined with the treatment group's own long-term time series fluctuations, fit a virtual untreated control group. This relaxes the strong assumption that "the control and treatment groups are highly synchronized" into a weaker one. The BSTS model introduced in this article is the data model used to describe such long-term time series fluctuations, and CausalImpact estimates the causal effect for this kind of data. We introduce both tools in detail below.

3. Model introduction

BSTS is short for "Bayesian Structural Time Series". As the name suggests, its main features are:

  • Suitable for time series data with structural characteristics

  • Use Bayesian ideas to estimate parameters

Structured time series data is common in daily life, especially in OTA businesses like Ctrip: orders on the platform follow clear temporal patterns, with weekends and holidays as order peaks and midweek as the off-peak period. Bayesian thinking, in turn, means holding some "prior knowledge" about the parameters before seeing the sample data, and then combining that knowledge with the data to obtain the posterior distribution, as in the formula below:

$$ P(\theta \mid Y) = \frac{P(Y \mid \theta)\,P(\theta)}{P(Y)} \propto P(Y \mid \theta)\,P(\theta) $$

Therefore, the BSTS model fits and forecasts structured time series data, applying Bayesian priors during fitting. Its advantage is that it yields confidence intervals for predicted values, making the predictions more scientifically credible. These ideas are introduced one by one below.

3.1 State space model

Structured time series data means that behind the observed data lie distinct states that evolve over time: there is a mapping between observations and states, and a transition relationship between states at successive moments. We generally describe these two mappings with the following state space model:

$$ y_t = Z_t^{\top}\alpha_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, H_t) \tag{1} $$
$$ \alpha_{t+1} = T_t\alpha_t + R_t\eta_t, \qquad \eta_t \sim N(0, Q_t) \tag{2} $$

Equation (1) is the observation equation, relating the observed value to the hidden state behind it; equation (2) is the state equation, describing how the state transitions over time. $Z_t$, $T_t$ and $R_t$ are the mapping matrices between the variables; $\varepsilon_t$ and $\eta_t$ are noise terms, independent of the other variables and normally distributed. The "structure" in the data mainly comprises:

  • Local linear trend: monotonic movement (a steady rise or fall) within a certain period

  • Seasonality: a repeating pattern of fixed length, similar to temperature changes over the year

  • Cyclicality: fluctuations similar to seasonality but with variable duration and frequency


Figure 3-1: Observed data and its structural elements. The first panel shows the fluctuation of the raw data; the second shows the seasonal component; the third shows the local trend.
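As a worked instance of equations (1)-(2), the local linear trend keeps a state $\alpha_t = (\mu_t, \delta_t)^{\top}$ (level and slope), with $y_t = \mu_t + \varepsilon_t$, $\mu_{t+1} = \mu_t + \delta_t + \eta_{\mu,t}$ and $\delta_{t+1} = \delta_t + \eta_{\delta,t}$. The sketch below (an illustration, not the article's original code) composes such structural components with TensorFlow Probability's sts module, assuming daily data with weekly seasonality:

import tensorflow_probability as tfp

def build_structural_model(observed_time_series):
    # Local linear trend: captures a monotonic rise or fall within a window
    trend = tfp.sts.LocalLinearTrend(observed_time_series=observed_time_series)
    # Seasonality: a repeating pattern of fixed length (here, 7 days)
    seasonal = tfp.sts.Seasonal(num_seasons=7,
                                observed_time_series=observed_time_series)
    # The observed series is modeled as the sum of its structural components
    return tfp.sts.Sum([trend, seasonal],
                       observed_time_series=observed_time_series)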

To add covariates X to the mapping, equation (1) can be extended to:

$$ y_t = Z_t^{\top}\alpha_t + \beta^{\top}x_t + \varepsilon_t \tag{3} $$

where $\beta$ captures the relationship between the covariates $x_t$ and the observation $y_t$. The parameters in the three equations above are estimated in the next section.

3.2 Bayesian estimation and MCMC (Markov chain Monte Carlo)

Let $\alpha = (\alpha_1, \dots, \alpha_n)$ denote the state sequence from state equation (2) and $\theta$ denote all parameters in the model. We now want to estimate $\theta$. The core steps are as follows:

  • Set a prior distribution $p(\theta)$ for $\theta$ and a distribution $p(\alpha_0)$ for the initial state

  • Construct a Markov chain and use the MCMC method to draw samples from $p(\alpha, \theta \mid Y)$

  • Compute the posterior distribution of the parameters via Bayes' formula

The following is a brief description of the methods used in each step:

1) Bayesian estimation: A major feature of the BSTS model is its use of Bayesian estimation: a prior distribution is assigned to each parameter before estimation, and the posterior distribution is then derived from the sample data. Different types of parameters have commonly used priors: means typically use a normal distribution, variances an inverse-Gamma distribution, and covariance matrices an inverse-Wishart (IW) distribution, among others. Note that the choice of prior affects MCMC convergence and the accuracy of the posterior to some extent, so the prior cannot be set arbitrarily: derive it from actual data where possible, or compare the posterior and likelihood values under several candidate priors.

2) MCMC method: We construct a Markov chain (a special sequence in which the state at the current moment depends only on the state at the previous moment, and which converges to a stationary distribution) whose stationary distribution is exactly the posterior of the parameters. This can be realized via Gibbs sampling: after setting the priors, starting from an initial state $(\alpha^{(0)}, \theta^{(0)})$, alternately fix $\alpha$ and sample $\theta$, then fix $\theta$ and sample $\alpha$, updating the two blocks back and forth. The resulting chain can be shown to converge to the stationary distribution $p(\alpha, \theta \mid Y)$, where $Y$ denotes all observed data.

3) Prediction: After obtaining $p(\alpha, \theta \mid Y)$, we sample $(\alpha, \theta)$ from this distribution and substitute the draws into observation equation (1) to predict $y$, obtaining $p(\tilde{y} \mid Y)$, where $\tilde{y}$ denotes the predicted values of $y$ after time point $n$.

Figure 3-2: A structured time series and the state transitions behind it. The state $\alpha$ includes the local trend and the local level (the mean of the local trend) together with the covariates $x$; $y$ denotes all observed data, and the predicted data are obtained from the state model. The model parameters, represented by their respective standard deviations, are estimated by MCMC.
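As a concrete sketch of steps 2) and 3), the snippet below draws posterior parameter samples and builds the forecast distribution with TensorFlow Probability (note that tfp.sts fits via HMC rather than the Gibbs sampler described above; `model` and `series` are assumed to come from a composition like the one in 3.1):

import tensorflow_probability as tfp

# Draw posterior samples of the parameters theta given the observations Y
samples, _ = tfp.sts.fit_with_hmc(model, observed_time_series=series)
# Substitute the samples into the model to approximate p(y_tilde | Y)
forecast_dist = tfp.sts.forecast(model, observed_time_series=series,
                                 parameter_samples=samples,
                                 num_steps_forecast=30)
forecast_mean = forecast_dist.mean()      # counterfactual point forecast
forecast_scale = forecast_dist.stddev()   # spread, for credible intervals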

4. Model application and code implementation

Above we gave a brief theoretical sketch of the BSTS model and the MCMC method, whose core purpose is to predict the observation $y$. Next we introduce how to apply the BSTS model in causal inference scenarios.

In policy effect evaluation, what we fundamentally want is the "counterfactual value" for the observed object, e.g. "what would this user's browsing have looked like had the ad not been served?". Compared with traditional PSM or SCM, BSTS is better suited to evaluating effects on time series data; at the same time, Bayesian estimation yields a prediction of the counterfactual $y$ together with a confidence interval, which dampens the volatility of the counterfactual prediction and improves the accuracy and stability of the effect evaluation.

In practice, the BSTS model can be used through Google's open source CausalImpact package, which is available in both Python and R. For detailed code, see references [7][8].


Figure 4-1: The three steps when executing the code: train the BSTS model; predict counterfactual values; compute the causal effect value, including the point estimate and confidence interval of the effect.

4.1 Code implementation

The code is demonstrated below on simulated data.

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_probability as tfp
from causalimpact import CausalImpact

tfd = tfp.distributions

# Helper: decompose the fitted model into its structural components
def plot_time_series_components(ci):
    component_dists = tfp.sts.decompose_by_component(
        ci.model, ci.observed_time_series, ci.model_samples)
    num_components = len(component_dists)
    mu, sig = ci.mu_sig if ci.mu_sig is not None else (0.0, 1.0)
    for i, (component, component_dist) in enumerate(component_dists.items()):
        component_mean = component_dist.mean().numpy()
        component_stddev = component_dist.stddev().numpy()
        # ... (plotting code omitted)

# Helper: decompose the forecast into its structural components
def plot_forecast_components(ci):
    component_forecasts = tfp.sts.decompose_forecast_by_component(
        ci.model, ci.posterior_dist, ci.model_samples)
    num_components = len(component_forecasts)
    mu, sig = ci.mu_sig if ci.mu_sig is not None else (0.0, 1.0)
    for i, (component, component_dist) in enumerate(component_forecasts.items()):
        component_mean = component_dist.mean().numpy()
        component_stddev = component_dist.stddev().numpy()
        # ... (plotting code omitted)

# Generate simulated data: one treated series (with intervention)
# and two control series (without intervention)
observed_stddev = tf.convert_to_tensor(value=1., dtype=tf.float32)
observed_initial = tf.convert_to_tensor(value=0., dtype=tf.float32)
level_scale_prior = tfd.LogNormal(
    loc=tf.math.log(0.05 * observed_stddev), scale=1.,
    name='level_scale_prior')  # prior distribution for the level scale
initial_state_prior = tfd.MultivariateNormalDiag(
    loc=observed_initial[..., tf.newaxis],
    scale_diag=(tf.abs(observed_initial) + observed_stddev)[..., tf.newaxis],
    name='initial_level_prior')  # prior distribution for the initial state
ll_ssm = tfp.sts.LocalLevelStateSpaceModel(
    100, initial_state_prior=initial_state_prior,
    level_scale=level_scale_prior.sample())  # local-level state space model
ll_ssm_sample = np.squeeze(ll_ssm.sample().numpy())

# Assemble the data
x0 = 100 * np.random.rand(100)    # control series 1
x1 = 90 * np.random.rand(100)     # control series 2
y = 1.2 * x0 + 0.9 * x1 + ll_ssm_sample    # generate the true values y
y[70:] += 10     # inject the intervention at t = 70
data = pd.DataFrame({'x0': x0, 'x1': x1, 'y': y}, columns=['y', 'x0', 'x1'])


Figure 4-2: The simulated data. The dotted line marks the time point of the intervention; the blue line is the treated observation series; the yellow and green lines are the two untreated control series.

# Call the model:
pre_period = [0, 69]    # time window before the intervention
post_period = [70, 99]  # time window after the intervention
ci = CausalImpact(data, pre_period, post_period)  # call CausalImpact
# CausalImpact needs three core arguments: the observed data,
# the pre-intervention window, and the post-intervention window.
# Output the results:
ci.plot()
ci.summary()


Figure 4-3: The result plots output by CausalImpact. Panel 1 shows the observed values against the model's fitted values; panel 2 shows the difference between observed and predicted values at each moment, with the orange shaded band giving the confidence interval; panel 3 shows the cumulative difference between observed and predicted values.

Table 3-1: The results table output by CausalImpact, quantifying the effect estimate and its confidence interval and indicating whether the effect is significant (post-intervention averages):

Actual                    107.71
Prediction (s.d.)         96.25 (3.28)
95% CI                    [89.77, 102.64]
Absolute effect           11.46
95% CI                    [5.07, 17.94]

Since the confidence interval of the absolute effect (the gap between actual and predicted values) lies entirely to the right of 0, the intervention produced a significant lift.

4.2 Model verification

For the fitted model, we should perform an "AA validation" analogous to an A/B experiment. In general, use the second panel of the result plots to check whether the confidence interval of the difference between observed and predicted values before the intervention contains 0: if so, the check passes and the model fit is good. In the figure above, the pre-intervention confidence intervals all contain 0, indicating the model is usable.
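As a sketch of this idea on the simulated data above, one can also run a placebo analysis entirely inside the pre-intervention window, where no effect is known to exist (the split at t = 50 is an arbitrary illustrative choice):

# Placebo ("AA") check: pretend an intervention happened at t = 50 within the
# pre-period; a well-fitted model should report an effect CI containing 0.
placebo_ci = CausalImpact(data.iloc[:70], [0, 49], [50, 69])
print(placebo_ci.summary())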

4.3 Model adjustment

  • Component decomposition: We can use decomposition to view the various structural elements in the time series model, including periodicity/seasonality, e.g. with tfp.sts.decompose_by_component in TensorFlow Probability (as in the helper functions above), or quickly with statsmodels' seasonal_decompose:

from statsmodels.tsa.seasonal import seasonal_decompose
seasonal_decompose(data['y'], period=24).plot()  # 'period' is an illustrative choice matching the series' cycle


Figure 4-4: The state elements behind the generated data. The first panel shows the trend of the raw data; the second shows the local trend component; the third shows the seasonal component. The data exhibit a seasonal structure and a monotonic upward trend.

  • Custom parameters: We can customize the priors of the parameters, the number of iterations, the length of the seasonal window, and so on. These adjustments affect the output: a well-chosen prior makes the results more accurate, and more iterations let MCMC converge more stably (though the model takes longer to run). The most important is the seasonal window length, which must correctly reflect the periodicity of the observed data: for yearly data at weekly granularity, set nseasons=52; for daily data at hourly granularity, set nseasons=24; and so on.

CausalImpact(...,  model.args = list(niter = 20000, nseasons = 24))
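The line above uses the R interface (model.args); in the Python port tfcausalimpact, the same options are passed as a model_args dict. A minimal sketch, assuming hourly data with a daily cycle:

ci = CausalImpact(data, pre_period, post_period,
                  model_args={'niter': 20000, 'nseasons': 24})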


Figure 4-5: The parameters of the CausalImpact package, their meanings, and their default values.

  • Custom time series model: The CausalImpact package trains a BSTS model by default. We can swap in other time series models, with the caveat that the data must be standardized first. (Standardization is not strictly required when using the default BSTS.)

import numpy as np
import tensorflow_probability as tfp
from causalimpact.misc import standardize

normed_data, _ = standardize(data.astype(np.float32))  # standardize the data
obs_data = normed_data.iloc[:70, 0]
# Train the series with other models from tfp.sts
linear_level = tfp.sts.LocalLinearTrend(observed_time_series=obs_data)
linear_reg = tfp.sts.LinearRegression(
    design_matrix=normed_data.iloc[:, 1:].values.reshape(
        -1, normed_data.shape[1] - 1))
model = tfp.sts.Sum([linear_level, linear_reg], observed_time_series=obs_data)
# Pass the custom time series model into the CausalImpact package
ci = CausalImpact(data, pre_period, post_period, model=model)

5. Business scenario practice

User marketing is an important lever for retention and conversion, and message-based outreach is a core method, especially pushing to users during the peak ticket-purchasing window around holidays. Channels include in-app push, Mini Program subscription messages in the WeChat ecosystem, and official accounts or WeCom (Enterprise WeChat). The goals include, among others, reminding users to buy tickets, promoting product features, and issuing coupons to drive conversion. After the holidays, we want to evaluate the effectiveness of this marketing outreach.

This is a typical scenario unsuited to A/B experiments. First, holidays are peak traffic periods: strictly holding out 50% of users from outreach could sacrifice a batch of potential conversions, while reserving only a very small control group (say control:treatment = 1:9) undermines the scientific validity of the subsequent conversion comparison. Second, holiday push strategies are highly refined, with dozens of messages in total, making it hard to keep control group users "pure": users may be cross-reached.

Given these problems, it is difficult to evaluate the conversion effect of push with traditional A/B methods, so we consider causal inference instead. The general options and their potential issues are as follows:

  • With PSM, we would need to find users similar to the pushed group but never pushed to serve as the control. However, holiday pushes generally follow a blanket-coverage strategy that reaches over 95% of platform users, so it is hard to find qualifying unpushed users for comparison.

  • With SCM, it is hard to find suitable control groups to synthesize from. For example, when evaluating the push effect for the vacation BU, we can hardly combine product lines such as trains, flights, and hotels into a "virtual vacation BU", because user needs differ across product lines; comparing vacation order conversion rates against such a synthetic control is not scientifically sound.

  • The same holds for DID: it is difficult to find a control group that satisfies the parallel trends assumption and operates in a similar business context for a before/after comparison.

In summary, even where these traditional causal inference methods are technically feasible, they fall short in business interpretability. Moreover, none of the three accounts for the "temporal periodicity" of users' ticket-purchasing behavior, so even a synthesized control group may fail to match the true structural characteristics of the treatment group, biasing the effect estimate. We therefore decided to first verify the periodicity of user ticket-purchase data and, once the periodic pattern was located, use the BSTS model with CausalImpact to predict counterfactual values. Below we take the 2022 Dragon Boat Festival train ticket marketing push as a case study.
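As a sketch of the periodicity check, the autocorrelation of the hourly series can be inspected before committing to a seasonal window (here `orders` is a hypothetical pandas Series of hourly purchaser counts):

from statsmodels.graphics.tsaplots import plot_acf

# Peaks at lags 24, 48, 72 would confirm a daily (24-hour) cycle,
# supporting nseasons = 24 in the CausalImpact call later on.
plot_acf(orders, lags=72)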


Figure 5-1: The different push strategies used to reach users during the Dragon Boat Festival.

5.1 Data selection

  • We use the hour as the period window; a simple plot shows that market-wide train ticket orders do follow a fairly fixed pattern over time.


Figure 5-2: The number of train ticket purchasers per hour during the selected Dragon Boat Festival window (10 days before and after the festival).

  • Considering the natural traffic growth the Dragon Boat Festival brings on its own, the increase in purchasers cannot be entirely attributed to push. Therefore, when training the time series model, we selected the Dragon Boat Festival data from 2019 to 2021 (10 days before and after the festival each year) as input to the BSTS model, learning the structural state specific to the festival window; this structural model was then applied to the 2022 data to predict the number of conversions after the 2022 push.

  • Finally, the difference between the actual number of conversions and the predicted number reflects the effect of this marketing push.

5.2 R-code implementation

# Select the Dragon Boat Festival window for each year 2019-2022, split by hour: 960 points in total
y_hour = c(x1, x2, x3, x4)   # x1..x4 hold each year's hourly series
x_time_hour = c(1:960)
data_hour <- cbind(y_hour, x_time_hour)
pre.period <- c(1, 808)   # the 2022 push occurred at time point 808, which is taken as the intervention point
post.period <- c(809, 960)
# nseasons = 24, niter = 20000; fit the model
impact_hour <- CausalImpact(data_hour, pre.period, post.period, model.args = list(niter = 20000, nseasons = 24))
summary(impact_hour)
plot(impact_hour)


Figure 5-3: The result plots returned by CausalImpact. The first panel shows the actual and predicted numbers of purchasers; the second shows the difference between actual and predicted values with its confidence interval; the third shows the cumulative difference and its confidence interval.

The plots show that the model passes the AA check, so the model is valid. After the intervention point the actual values exceed the predictions, but the confidence interval of the lift contains 0, so the lift does not reach significance. This indicates that the 2022 Dragon Boat Festival marketing strategy had some effect on conversions, but the effect was not statistically significant.

6. Advantages and Disadvantages of Methods

Compared with traditional causal inference methods, the BSTS model has two main advantages:

  • It can identify the structural features behind the data and therefore make better predictions;

  • It uses Bayesian estimation to obtain the posterior distribution of the parameters, so a confidence interval can accompany the effect estimate.

The second point, however, is a "double-edged sword" for the BSTS model: a poorly chosen prior affects the speed and direction of MCMC convergence and even the final posterior, so the prior must be selected with care.

7. Method expansion

The structural time series model introduced in this article splits the periodic characteristics of the data into a trend term, a seasonal term, a cyclical term, and so on, examining each element in turn. Going further, we can split the series by period length into a long-period term (using a large sliding window), a short-period term (using a small sliding window), a seasonal term, etc. This prevents the periodic patterns inside small windows from being smoothed away by long-period information and better reflects the data's periodic characteristics at different scales. The equation can be split into the following form:

$$ y_t = \alpha_t^{\mathrm{long}} + \alpha_t^{\mathrm{short}} + \alpha_t^{\mathrm{seasonal}} + \alpha_t^{\mathrm{corr}}(x_t) + \varepsilon_t, \qquad \alpha_t^{(i)} \sim N\!\left(\mu^{(i)}, \sigma^{(i)2}\right) $$

where the $\alpha_t^{(i)}$ are the state values at different time points; the four modules are the long-period term, short-period term, seasonal term, and serial-correlation term (with covariates X), respectively; each structural module has its own mean and standard deviation.


Figure 7-1: The components behind a time series, from top to bottom: the raw data; the seasonal component; the short-period term; the correlation term; and the long-period term. Fluctuations are more pronounced within short periods and less so over long periods, so the short-period structure must be modeled explicitly to avoid being smoothed out by the long-period data.

Next, the four modules above are predicted separately. The long-period and seasonal terms change little over a short horizon, so the $\mu$ and $\sigma$ of the corresponding equations can be used directly for prediction; the short-period and correlation terms can be predicted with other machine learning methods. The module-level predictions are then combined to give the overall forecast. Reference [4] gives more specific prediction methods and comparisons with traditional methods.


Figure 7-2: Different forecasting methods for the long and short periods: the long-period and seasonal terms are predicted directly via their $\mu$; the short-period and covariate-related terms are predicted with a custom machine learning model.
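A schematic sketch of this hybrid forecast, under stated assumptions: the long-period and seasonal terms are summarized by their estimated means, while a generic regressor (a stand-in for whatever ML model is chosen) handles the short-period and covariate-related terms; all names below are illustrative:

from sklearn.ensemble import GradientBoostingRegressor

def hybrid_forecast(long_mu, seasonal_mu, X_train, fast_resid_train, X_future):
    # Long-period and seasonal terms change little over a short horizon,
    # so use their estimated means directly.
    slow_part = long_mu + seasonal_mu
    # Short-period + covariate-related terms: fit an ML model on the
    # residuals left after removing the slow components.
    reg = GradientBoostingRegressor().fit(X_train, fast_resid_train)
    return slow_part + reg.predict(X_future)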

After building the time series prediction model in this way, we substitute it into the CausalImpact code by passing it as a custom time series model via the model parameter.

8. Summary

This article introduced a causal inference approach for evaluating the effect of a policy on time series data, using the BSTS state space model to predict counterfactual values, implemented with CausalImpact.

Unlike other causal inference frameworks, the method here performs Bayesian estimation of all hyperparameters and then draws Monte Carlo samples from the posterior to obtain counterfactual predictions. Its main advantage is that it makes the fullest use of all control groups as well as the structural characteristics of the treatment group itself, yielding counterfactual predictions with confidence intervals that measure the significance of the effect.

The method also targets structured time series data specifically: the BSTS model identifies the state values behind the observations and the transitions between states, eliminating as far as possible the influence of hidden states on the counterfactual predictions.

Note that before using the BSTS model, you should verify whether the data truly exhibit periodic characteristics and identify the structural elements (long versus short periods, etc.), then choose an appropriate time series model for training; the priors of the parameters likewise need to be set carefully so that the final effect estimate is as scientifically stable as possible.
