SparkSQL: Correlation Analysis

As long as the data is in DataFrame format, calculating correlations in PySpark is very easy. The only difficulty is that the .corr(...) method currently supports only the Pearson correlation coefficient, and it can only compute pairwise correlations, as follows:
fraud_df.corr('balance', 'numTrans')
0.00044523140172659576
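
For context, here is a minimal sketch of how fraud_df and the numerical column list used below might be set up; the session creation, file name, and column names are assumptions inferred from the printed output, not shown in the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('fraud-corr').getOrCreate()

# Hypothetical input file; the original post does not show how fraud_df is loaded.
fraud_df = spark.read.csv('ccFraud.csv', header=True, inferSchema=True)

# Assumed numeric feature columns, inferred from the 3x3 matrix shown below.
numerical = ['balance', 'numTrans', 'numIntlTrans']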
In order to create a correlation matrix, you can use the script below:

n_numerical = len(numerical)
corr = []
for i in range(n_numerical):
    # Pad with None so each row lines up with the full list of columns.
    temp = [None] * i
    for j in range(i, n_numerical):
        # .corr() only works pairwise, so loop over the upper triangle.
        temp.append(fraud_df.corr(numerical[i], numerical[j]))
    corr.append(temp)

corr
[[1.0, 0.00044523140172659576, 0.00027139913398184604],
 [None, 1.0, -0.0002805712819816179],
 [None, None, 1.0]]

As you can see, there is almost no correlation between the features, so all of them can be used in our model.
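
As an alternative to the pairwise loop, Spark's pyspark.ml.stat.Correlation can compute the full matrix in a single pass. This is a sketch of that approach (not from the original post), reusing the assumed numerical column list from above:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Assemble the numeric columns into a single vector column.
assembler = VectorAssembler(inputCols=numerical, outputCol='features')
vector_df = assembler.transform(fraud_df).select('features')

# Correlation.corr returns a one-row DataFrame whose single cell holds the matrix.
matrix = Correlation.corr(vector_df, 'features', 'pearson').head()[0]
print(matrix.toArray())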

