A Matrix Method for Computing Pairwise Pearson Coefficients Among N Vectors

Copyright notice: authored by zzubqh https://blog.csdn.net/qq_36810544/article/details/81363176

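Goal: given N row vectors $[e_1, e_2, \ldots, e_n]$, compute the Pearson coefficient between every pair. The simplest approach is two nested for loops that compute each pair separately. That is fine when n is small, but once n gets large (typically around 100,000), the cost of looping becomes significant. In scientific computing, use matrix operations instead of loops wherever you can; loops are too slow!
PS: How are you supposed to show off if your code doesn't even contain a matrix operation? Haha
======================I'm a divider that just wants to write code in peace===================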
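The formula for the Pearson coefficient of two vectors:
[Image in the original post: Pearson coefficient formula]
There is a simpler formula:
$$\rho_{x,y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
The corresponding code, written from the formula: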

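 # Compute the Pearson coefficient of two sequences; a larger value means a stronger correlation.
 # The absolute value is returned, so the sign of the correlation is discarded.
 def get_distance(self, vector1, vector2):
     num1 = vector1 - np.average(vector1)
     num2 = vector2 - np.average(vector2)
     num = np.sum(num1 * num2)
     den = np.sqrt(np.sum(np.power(num1,2)) * np.sum(np.power(num2,2)))
     # A constant vector has zero variance; return 0.0 instead of dividing by zero
     if den == 0:
         return 0.0
     return np.abs(num/den)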
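
As a quick sanity check on the formula, the sketch below compares a signed version of it with NumPy's built-in `np.corrcoef`. The helper name `pearson` and the sample vectors are made up for illustration and are not part of the original code. It also shows what taking the absolute value hides: a perfectly negative correlation is -1 here, whereas `get_distance` would report 1.

```python
import numpy as np

def pearson(x, y):
    """Signed Pearson coefficient of two 1-D arrays (hypothetical helper)."""
    dx = x - np.mean(x)
    dy = y - np.mean(y)
    den = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    # A constant vector has zero variance; return 0.0 by convention, as get_distance does.
    return 0.0 if den == 0 else float(np.sum(dx * dy) / den)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r = pearson(x, y)
r_ref = float(np.corrcoef(x, y)[0, 1])   # NumPy's reference implementation
r_neg = pearson(x, -x)                   # perfectly anti-correlated pair
print(r, r_ref, r_neg)
```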

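Now rewrite the formula above in matrix form. Subtract from each row vector its own mean:
$$e - \bar{e} = \tilde{e} = \begin{bmatrix} e_1 - \bar{e}_1 \\ e_2 - \bar{e}_2 \\ \vdots \\ e_n - \bar{e}_n \end{bmatrix}$$

$$\tilde{e} = \begin{bmatrix} \tilde{e}_1 \\ \tilde{e}_2 \\ \vdots \\ \tilde{e}_n \end{bmatrix}$$
Then:
$$\tilde{e}\,\tilde{e}^T = \begin{bmatrix} \tilde{e}_1^2 & \tilde{e}_1\tilde{e}_2 & \cdots & \tilde{e}_1\tilde{e}_n \\ \tilde{e}_2\tilde{e}_1 & \tilde{e}_2^2 & \cdots & \tilde{e}_2\tilde{e}_n \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{e}_n\tilde{e}_1 & \tilde{e}_n\tilde{e}_2 & \cdots & \tilde{e}_n^2 \end{bmatrix}$$
Here $\tilde{e}_i\tilde{e}_j$ denotes the inner product of two centered row vectors. This is the numerator matrix. Looking at the denominator, its terms are exactly the diagonal of the matrix above.
Let:
$$dot = \begin{bmatrix} \tilde{e}_1^2 & \tilde{e}_2^2 & \cdots & \tilde{e}_n^2 \end{bmatrix}$$
Then the denominator is:
$$\sqrt{dot^T \cdot dot}$$
So, with the square root and the division taken element-wise:
$$\rho_{x,y} = \frac{\tilde{e}\,\tilde{e}^T}{\sqrt{dot^T \cdot dot}}$$
Code written from the formula: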

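def get_pairwise_distances(self, embeddings):
    """
        Compute the Pearson correlation coefficients between embedding vectors.
        Uses the TensorFlow 1.x API (tf.diag_part, tf.Session).
        Args:
            embeddings: tensor of shape (batch_size, embed_dim)
        Returns:
            pairwise_distances: array of shape (batch_size, batch_size)
    """
    avg_vec = tf.reduce_mean(embeddings, axis=1)
    # Center each row so that its mean E(x) = 0
    nomal_embed = embeddings - tf.expand_dims(avg_vec, 1)

    # Numerator matrix: sum((x-avg(x))*(y-avg(y))) for every pair of rows
    dot_product = tf.matmul(nomal_embed, tf.transpose(nomal_embed))

    # Denominator: sqrt(sum((x-avg(x))^2) * sum((y-avg(y))^2))
    square_norm = tf.diag_part(dot_product)
    square_norm = tf.matmul(tf.expand_dims(square_norm, 1), tf.expand_dims(square_norm, 0))

    # Unlike get_distance, there is no guard against a zero denominator here
    distance = dot_product / tf.sqrt(square_norm)
    # Close the session once the result has been evaluated
    with tf.Session() as sess:
        return sess.run(distance)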
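
The same matrix formula can also be sketched in plain NumPy, which avoids building a TensorFlow graph and session. This is an illustrative rewrite, not the author's code: the function name `pairwise_pearson` is hypothetical, and returning 0 for zero-variance rows is an assumption borrowed from `get_distance`.

```python
import numpy as np

def pairwise_pearson(embeddings):
    """Pairwise Pearson coefficients of the rows of a (batch_size, embed_dim) array."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)   # subtract each row's own mean
    gram = centered @ centered.T                   # numerator matrix
    sq_norm = np.diag(gram)                        # sum((x - avg(x))^2) for each row
    den = np.sqrt(np.outer(sq_norm, sq_norm))      # denominator matrix
    # Assumption: where a row is constant (den == 0), report 0 instead of NaN.
    out = np.zeros_like(gram)
    np.divide(gram, den, out=out, where=den > 0)
    return out

a = np.array([[2, 7, 18, 88, 157, 90, 177, 570],
              [3, 5, 15, 90, 180, 88, 160, 580],
              [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)
corr = pairwise_pearson(a)
print(corr)
```

Unlike `get_distance`, this sketch keeps the sign of each coefficient, so negatively correlated rows show up as negative entries.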

Let's verify:

 import time
 import tensorflow as tf
 import numpy as np

 data_helper = DataHelper()  # DataHelper: the class that holds the two methods above
 a = np.array([[2, 7, 18, 88, 157,90, 177, 570],
               [3, 5, 15, 90, 180, 88, 160, 580],
               [1,2,3,4,5,6,7,8]],dtype=float)
 start = time.time()
 dis = []
 for i in range(a.shape[0] - 1):
     for j in range(i + 1, a.shape[0]):
         dis.append(data_helper.get_distance(a[i,:],a[j,:]))
 end = time.time()
 for i in dis:
     print(i)
 print('for cost {0}s'.format(end - start))

 start = time.time()
 dis = data_helper.get_pairwise_distances(a)
 end = time.time()
 print(dis)
 print('matrix cost {0}s'.format(end - start))

The results:

0.9983487486440501
0.7993246094489326
0.7851394659823645
for cost 0.0s
2018-08-02 15:03:52.121404: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
[[1.         0.99834875 0.79932461]
 [0.99834875 1.         0.78513947]
 [0.79932461 0.78513947 1.        ]]
matrix cost 0.10700535774230957s
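
With only three vectors, the timing above probably measures fixed overhead more than the computation itself, so it says little about the large-n case this post is concerned with. One rough way to compare the two approaches at a larger size is sketched below in plain NumPy. The sizes (200 vectors of dimension 64), the random data, and the helper names `pearson_loop` and `pearson_matrix` are arbitrary choices for illustration; timings vary by machine, so none are quoted here.

```python
import time
import numpy as np

def pearson_loop(x):
    """Pairwise Pearson coefficients via two nested Python loops."""
    n = x.shape[0]
    out = np.eye(n)                      # each row correlates perfectly with itself
    for i in range(n - 1):
        di = x[i] - x[i].mean()
        for j in range(i + 1, n):
            dj = x[j] - x[j].mean()
            den = np.sqrt(np.sum(di ** 2) * np.sum(dj ** 2))
            out[i, j] = out[j, i] = np.sum(di * dj) / den
    return out

def pearson_matrix(x):
    """The same coefficients from a single matrix product."""
    c = x - x.mean(axis=1, keepdims=True)
    gram = c @ c.T
    norm = np.sqrt(np.diag(gram))
    return gram / np.outer(norm, norm)

rng = np.random.default_rng(0)           # fixed seed so runs are repeatable
data = rng.normal(size=(200, 64))        # 200 vectors of dimension 64 (arbitrary)

t0 = time.perf_counter()
r_loop = pearson_loop(data)
t1 = time.perf_counter()
r_mat = pearson_matrix(data)
t2 = time.perf_counter()

print('for cost {0:.4f}s'.format(t1 - t0))
print('matrix cost {0:.4f}s'.format(t2 - t1))
print('max abs difference:', np.abs(r_loop - r_mat).max())
```

Keep in mind that the full result is an n × n matrix, so memory grows quadratically: at n = 100,000 a float64 result alone would take about 80 GB, and the product would have to be computed in row blocks.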
