A pitfall to watch out for in data analysis: Simpson's Paradox

Simpson's Paradox, described by the British statistician E. H. Simpson in 1951, is the phenomenon in which two groups of data each show a certain property when examined separately, yet lead to the opposite conclusion once they are combined.

Consider two schools at an American university, the law school and the business school, during admissions for the new semester. Both schools are suspected of gender discrimination, so the following statistics were compiled:

[Figure: applicants and admissions by gender at the law school and the business school]

The data in the figure shows that the law school admitted 8 of 53 male applicants (15.1%) and 51 of 152 female applicants (33.6%). Similarly, the business school admitted 80.1% of its male applicants and 91.1% of its female applicants.

In both the law school and the business school, women were admitted at a higher rate than men. Can we infer that the university favors women in admissions?

[Figure: combined applicants and admissions by gender across both schools]

When the two schools are combined, however, the admission rate is 209/304 = 68.8% for men and 143/253 = 56.5% for women. Now men are admitted at a higher rate than women, and it is the women's turn to feel unfairly treated.
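The reversal can be checked with a few lines of Python. This is a minimal sketch: the law school counts and the combined totals are the ones quoted in the text, while the business school counts are an assumption, derived here as the combined totals minus the law school counts.

```python
# (admitted, applicants) per school and gender.
# Law school counts and combined totals are quoted in the text.
# Business school counts are DERIVED (combined minus law school), an assumption.
admissions = {
    "law": {"male": (8, 53), "female": (51, 152)},
    "business": {"male": (209 - 8, 304 - 53), "female": (143 - 51, 253 - 152)},
}


def rate(admitted, applicants):
    """Admission rate as a percentage, rounded to one decimal place."""
    return round(100 * admitted / applicants, 1)


def combined(gender):
    """Sum admitted and applicant counts for one gender across all schools."""
    admitted = sum(school[gender][0] for school in admissions.values())
    applicants = sum(school[gender][1] for school in admissions.values())
    return admitted, applicants


per_school = {
    name: {gender: rate(*counts) for gender, counts in school.items()}
    for name, school in admissions.items()
}
overall = {gender: rate(*combined(gender)) for gender in ("male", "female")}

for name, rates in per_school.items():
    print(name, rates)
print("combined", overall)
```

Under that assumption the per-school rates match the percentages quoted above, with women ahead in each school and men ahead in the combined figures. The business school has far more male applicants and a much higher admission rate, which is what pulls the combined male rate up.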

So the question is: does the university's admissions policy discriminate by gender at all? And if it does, is it against men or against women?

I will hold off on the conclusion for now. First, let's look at a case you may run into in practice.

A typical case at work:

A product has 10,000 users on Android devices and 5,000 users on iOS devices, and its overall paid conversion rate is 5%. Broken down by platform, the conversion rate is 4% on iOS and 5.5% on Android. A "smart" data analyst concludes: paid conversion on iOS is low, so we should give up developing for iOS.

In general, paid conversion on iOS tablets is much higher than on Android tablets, and conversion on iOS phones is also somewhat better. In this case device type is a confounding variable: if the figures are aggregated without regard to device type, that information is ignored entirely.

Now let's compare this set of data:

[Figure: conversion rates by platform (Android, iOS) and device type (phone, tablet)]

So on both tablets and phones, Android converts at a lower rate than iOS, which matches our usual expectations.

[Figure: conversion rates by platform across all devices]

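When all devices are counted together, the conversion rate is 550/10000 = 5.5% for Android but only 200/5000 = 4.0% for iOS. This is the root of the "smart" data analyst's conclusion that the iOS version should be taken offline.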
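The same pattern can be reproduced numerically. The original figures with the per-device breakdown are not available here, so the phone/tablet split below is hypothetical: it was chosen only to be consistent with the platform totals quoted in the text (550/10000 for Android, 200/5000 for iOS).

```python
# (paying users, total users) by platform and device type.
# HYPOTHETICAL split: only the platform totals (550/10000 and 200/5000)
# come from the text; the phone/tablet breakdown is invented for illustration.
conversions = {
    "Android": {"phone": (100, 5000), "tablet": (450, 5000)},
    "iOS": {"phone": (160, 4800), "tablet": (40, 200)},
}


def pct(paying, users):
    """Conversion rate as a percentage, rounded to two decimal places."""
    return round(100 * paying / users, 2)


# Conversion rate per platform and device type.
by_device = {
    platform: {device: pct(*counts) for device, counts in devices.items()}
    for platform, devices in conversions.items()
}

# Conversion rate per platform with device types pooled together.
overall = {
    platform: pct(
        sum(counts[0] for counts in devices.values()),
        sum(counts[1] for counts in devices.values()),
    )
    for platform, devices in conversions.items()
}

print(by_device)
print(overall)
```

With these assumed numbers iOS is ahead on phones and on tablets, yet behind once the two are pooled, because half of the Android users sit in the high-converting tablet segment while almost all iOS users are on phones.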

Causes and countermeasures

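The cause of the trap is simple to state: data with two dimensions, "value" (the rate) and "quantity" (the group size), is collapsed into the single dimension of "value" and then merged.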

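To avoid the trap that Simpson's Paradox sets for us, we need to weigh the individual groups carefully, applying coefficients to remove the effect of differences in the groups' base sizes. In the conversion example, this means segmenting with metrics such as ARPU and ARPPU, which look similar but in fact differ greatly.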
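One way to apply such weights is direct standardization: each platform's per-device rates are reweighted to the same reference device mix, so that differences in group size no longer drive the comparison. The article does not name a specific method, so this is a swapped-in technique, shown as a sketch on the same hypothetical phone/tablet split (only the platform totals come from the text).

```python
# (paying users, total users) by platform and device type.
# HYPOTHETICAL split, consistent with the totals quoted in the text.
conversions = {
    "Android": {"phone": (100, 5000), "tablet": (450, 5000)},
    "iOS": {"phone": (160, 4800), "tablet": (40, 200)},
}
device_types = ("phone", "tablet")

# Reference mix: total users of each device type across both platforms.
reference = {
    device: sum(conversions[platform][device][1] for platform in conversions)
    for device in device_types
}
total_users = sum(reference.values())


def standardized_rate(platform):
    """Conversion rate (%) the platform would show under the reference mix."""
    expected_paying = sum(
        conversions[platform][device][0]
        / conversions[platform][device][1]
        * reference[device]
        for device in device_types
    )
    return 100 * expected_paying / total_users


adjusted = {platform: round(standardized_rate(platform), 2) for platform in conversions}
print(adjusted)
```

Once both platforms are evaluated on the same device mix, the assumed data puts iOS ahead of Android, in line with the per-device comparison. The choice of reference mix is itself a judgment call; the pooled user base is used here only because it is the simplest option.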

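Likewise, to analyze a product's operation more objectively, we need to judge it from more angles. Taking the device conversion rate above as an example, at the product level distribution volume, user base, operating strategy, word of mouth, and so on are considered before conversion. Reaching the final conversion goal usually requires more preliminary goals to pave the way.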

Common preliminary goals

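  • User volume: a free product needs a very large user base to earn enough total revenue, because the conversion rate under this model is extremely low. These users usually come from all over the world and use many different types of devices. A single overall average across device types is meaningless.
  • LTV range: free products need a long monetization cycle. User spending is treated as a sign of whether players are happy, just as engagement and spending are closely linked, so it can serve as a segmentation criterion.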

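Most users never pay. The overall paid conversion rate of a free product is low because paying and non-paying players are lumped together, so any metric measured over free users comes out very low. Because most users do not pay, ARPU and ARPPU differ greatly.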

Points to note in A/B testing

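Turning to product operations in practice, a common example of misjudging an A/B test goes like this: a major release is run with 1% of users, the purchase rate of the test version turns out higher than that of the control version, and so the test version is declared better and ready to ship.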

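In fact, the test group we pick often consists of users who like to communicate, are enthusiastic about the product, or have a high payment rate and high stickiness, and comparing their data with that of all users is not objective. When the test version is finally released, it may actually degrade the user experience and even cause both retention and revenue to drop.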
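A small numeric sketch shows how a hand-picked test group can make a worse version look better. All rates and group sizes below are hypothetical, invented purely for illustration; the segment names `engaged` and `casual` are likewise assumptions.

```python
# HYPOTHETICAL purchase rates and segment sizes, for illustration only.
control_rate = {"engaged": 0.10, "casual": 0.02}   # current version
test_rate = {"engaged": 0.08, "casual": 0.016}     # test version, worse in both segments
population = {"engaged": 10_000, "casual": 90_000}


def blended(rates, mix):
    """Purchase rate of a group made up of the given segment sizes."""
    buyers = sum(rates[segment] * mix[segment] for segment in mix)
    return buyers / sum(mix.values())


# Biased design: the 1% test group is hand-picked from engaged users only
# and then compared against the whole user base on the control version.
biased_test_group = {"engaged": 1_000, "casual": 0}
naive_test = blended(test_rate, biased_test_group)
naive_control = blended(control_rate, population)

# Like-for-like comparison: the same segment mix on both sides.
fair_test = blended(test_rate, population)
fair_control = blended(control_rate, population)

print(f"naive: test {naive_test:.2%} vs control {naive_control:.2%}")
print(f"fair:  test {fair_test:.2%} vs control {fair_control:.2%}")
```

In the naive comparison the test version appears to win only because its group contains nothing but high-paying users. When both versions are evaluated on the same user mix, the assumed test version is worse in every segment and worse overall.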

Takeaways and summary

The key to avoiding Simpson's Paradox is to look at the full picture across different groups of users at the same time.

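First, accurate user segmentation is very important in data analysis, especially for free products, where the average user not only does not exist but is one of the factors that mislead development. The key is to segment users sensibly by their characteristics.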

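Second, within a specific product, catch-all data (such as a crude comparison of overall iOS and Android figures) has little reference value. It has to be broken down by specific device, country, acquisition channel, spending power, and so on before a comparison is worthwhile.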

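Third, weigh the individual groups carefully, applying coefficients to remove the effect of differences in the groups' base sizes, and also check whether other latent factors are at play in the situation and take them into account.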

Common abbreviations in user analysis

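  • DNU, Daily New Users: new users per day
  • AU, Active User: active users, the number of users who completed a specified action or met a specified metric within a given period
  • PU, Paying User: paying users
  • APA, Active Payment Account: number of active paying users
  • ARPU, Average Revenue Per User: average revenue per user, total revenue / AU
  • ARPPU, Average Revenue Per Paying User: average revenue per paying user, total revenue / APA
  • PUR, Pay User Rate: paying rate, APA / AU
  • LTV, Life Time Value: lifetime value, the total revenue a user brings in over the whole time they use the product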
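The three ratio metrics in the list can be computed directly from their definitions. The sketch below uses a hypothetical helper `revenue_metrics` and made-up figures (the revenue amount in particular is an assumption); only the formulas come from the list above.

```python
def revenue_metrics(total_revenue, active_users, active_paying_users):
    """Compute ARPU, ARPPU and PUR from the definitions in the list above."""
    return {
        "ARPU": total_revenue / active_users,           # total revenue / AU
        "ARPPU": total_revenue / active_paying_users,   # total revenue / APA
        "PUR": active_paying_users / active_users,      # APA / AU
    }


# HYPOTHETICAL figures: 15,000 active users, 750 of them paying (5%),
# and an invented total revenue of 30,000.
metrics = revenue_metrics(
    total_revenue=30_000.0,
    active_users=15_000,
    active_paying_users=750,
)
print(metrics)
```

Because ARPU = ARPPU × PUR, a low paying rate is enough to open a wide gap between the two revenue figures, which is why they should not be read interchangeably.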


Origin blog.csdn.net/weixin_33895695/article/details/90882933