PySpark: finding the variance of a DataFrame column

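from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

# Assumption: the original post does not show how `ss` was created;
# a default local SparkSession is used here.
ss = SparkSession.builder.getOrCreate()

# Sample data: one dict per row
tmplst = [
    {'id': '1', 'f1': 'fff1', 'f2': 0.8, 'y': 1},
    {'id': '2', 'f1': 'fff2', 'f2': 0.1, 'y': 0},
    {'id': '3', 'f1': 'fff2', 'f2': 0.9, 'y': 0},
    {'id': '4', 'f1': 'fff1', 'f2': 3.6, 'y': 1},
    {'id': '5', 'f1': 'fff2', 'f2': 0.5, 'y': 0},
    {'id': '6', 'f1': 'fff1', 'f2': 1.2, 'y': 1},
]
sfdd = ss.createDataFrame(tmplst)
sfdd.show()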

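# Variance
'''
variance() and var_samp() are the same: both return the unbiased (sample) variance of all values in a column, with n-1 in the denominator.
var_pop() returns the population variance of all values in a column, i.e. the biased variance, with n in the denominator.
The two differ by a factor of (n-1)/n. For a small sample, the unbiased variance is closer to the true population variance; for a large sample, the two are nearly equal. In general, when the data are a sample, the unbiased sample variance is the better estimate of the population variance.
'''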
sfdd.select(fn.variance("f2"), fn.var_samp("f2"), fn.var_pop("f2")) \
.show(truncate=False)
+----+---+---+---+
|  f1| f2| id|  y|
+----+---+---+---+
|fff1|0.8|  1|  1|
|fff2|0.1|  2|  0|
|fff2|0.9|  3|  0|
|fff1|3.6|  4|  1|
|fff2|0.5|  5|  0|
|fff1|1.2|  6|  1|
+----+---+---+---+

+------------------+------------------+------------------+
|var_samp(f2)      |var_samp(f2)      |var_pop(f2)       |
+------------------+------------------+------------------+
|1.5416666666666667|1.5416666666666667|1.2847222222222223|
+------------------+------------------+------------------+
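
To see where the two numbers come from, the same calculation can be done by hand. The sketch below is not part of the original post: it assumes plain Python without Spark, copies the f2 values from the table above, and uses made-up variable names (`sq_dev`, `sample_var`, `pop_var`). It divides the sum of squared deviations by n-1 and by n, then cross-checks against the standard-library `statistics` module.

```python
import statistics

# The f2 values from the example DataFrame above.
f2 = [0.8, 0.1, 0.9, 3.6, 0.5, 1.2]
n = len(f2)
mean = sum(f2) / n

# Sum of squared deviations from the mean.
sq_dev = sum((x - mean) ** 2 for x in f2)

sample_var = sq_dev / (n - 1)  # denominator n-1, as in variance() / var_samp()
pop_var = sq_dev / n           # denominator n, as in var_pop()

# Cross-check against the standard library.
assert abs(sample_var - statistics.variance(f2)) < 1e-12
assert abs(pop_var - statistics.pvariance(f2)) < 1e-12

print(round(sample_var, 6), round(pop_var, 6))  # 1.541667 1.284722
```

The results agree with the Spark output to the displayed precision, and pop_var equals sample_var multiplied by (n-1)/n, which here is 5/6.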

References:

https://www.jianshu.com/p/f50c4568f375
https://zhuanlan.zhihu.com/p/157799814

Origin blog.csdn.net/qq_42363032/article/details/123752151