Copyright notice: authored by zzubqh https://blog.csdn.net/qq_36810544/article/details/81363176
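Goal: given N row vectors, compute the Pearson coefficient between every pair of them. The simplest approach is two nested for loops that handle each pair separately. That is fine while N is small, but once N gets large (around 100,000 is typical) the loops cost a lot of time. In scientific computing, use matrix operations instead of loops wherever you can; loops are too slow.
PS: and how are you supposed to show off if your code doesn't even contain a matrix operation? Haha
======================I am a divider that just wants to quietly write code===================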
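The Pearson coefficient of two vectors X and Y is defined as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
There is a simpler form for computation:
$$r = \frac{\sum_{k}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k}(x_k - \bar{x})^2 \, \sum_{k}(y_k - \bar{y})^2}}$$
The code that follows from this formula: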
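# Pearson coefficient of two sequences; the larger the value, the stronger the correlation
def get_distance(self, vector1, vector2):
    # center each vector on its own mean
    num1 = vector1 - np.average(vector1)
    num2 = vector2 - np.average(vector2)
    # numerator: sum((x - mean(x)) * (y - mean(y)))
    num = np.sum(num1 * num2)
    # denominator: sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
    den = np.sqrt(np.sum(np.power(num1, 2)) * np.sum(np.power(num2, 2)))
    # a constant vector has zero variance; return 0.0 instead of dividing by zero
    if den == 0:
        return 0.0
    # the absolute value is returned, so the sign of the correlation is dropped
    return np.abs(num / den)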
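As a quick sanity check of the formula, here is a minimal sketch in plain Python (no NumPy) that evaluates it on a few short vectors. The function name `pearson` and the sample data are made up for illustration and are not part of the original code.

```python
import math

def pearson(x, y):
    """Pearson coefficient of two equal-length sequences (signed, not absolute)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # deviations of each element from its own vector's mean
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # zero variance in either vector: the coefficient is undefined, return 0.0
    if den == 0:
        return 0.0
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
print(pearson(x, [2.0, 4.0, 6.0, 8.0]))  # perfectly linear, increasing
print(pearson(x, [8.0, 6.0, 4.0, 2.0]))  # perfectly linear, decreasing
print(pearson(x, [5.0, 5.0, 5.0, 5.0]))  # constant vector, zero variance
```

Unlike `get_distance`, this sketch keeps the sign, so an anti-correlated pair comes out near -1 rather than near 1.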
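Now rewrite the formula with matrices. Let $X$ be the $N \times d$ matrix whose rows are the vectors, and let $\tilde{X}$ be $X$ with each row's mean subtracted from that row:
$$\tilde{X}_{ik} = x_{ik} - \bar{x}_i$$
Then:
$$(\tilde{X}\tilde{X}^T)_{ij} = \sum_{k}(x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$$
This is the matrix of numerators. Looking at the denominator, the sums of squares it needs are exactly the diagonal of this matrix.
Let:
$$\mathrm{dot} = \operatorname{diag}(\tilde{X}\tilde{X}^T), \qquad \mathrm{dot}_i = \sum_{k}(x_{ik} - \bar{x}_i)^2$$
Then the denominator is:
$$D_{ij} = \sqrt{\mathrm{dot}_i \, \mathrm{dot}_j} = \sqrt{(\mathrm{dot} \cdot \mathrm{dot}^T)_{ij}}$$
So:
$$R_{ij} = \frac{(\tilde{X}\tilde{X}^T)_{ij}}{\sqrt{\mathrm{dot}_i \, \mathrm{dot}_j}}$$
The code that follows from this formula: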
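def get_pairwise_distances(self, embeddings):
    """
    Compute the Pearson correlation coefficients between embedding vectors.
    Args:
        embeddings: tensor of shape (batch_size, embed_dim)
    Returns:
        pairwise_distances: tensor of shape (batch_size, batch_size)
    """
    avg_vec = tf.reduce_mean(embeddings, axis=1)
    # center each row so that its mean E(x) = 0
    centered_embed = embeddings - tf.expand_dims(avg_vec, 1)
    # Gram matrix of the centered rows: entry (i, j) is sum((x - avg(x)) * (y - avg(y))), i.e. the numerator matrix
    dot_product = tf.matmul(centered_embed, tf.transpose(centered_embed))
    # denominator: sqrt(sum((x - avg(x))^2) * sum((y - avg(y))^2)), built from the diagonal
    square_norm = tf.diag_part(dot_product)
    square_norm = tf.matmul(tf.expand_dims(square_norm, 1), tf.expand_dims(square_norm, 0))
    # note: unlike get_distance, a constant row (zero variance) gives NaN here
    distance = dot_product / tf.sqrt(square_norm)
    # TensorFlow 1.x style: evaluate the graph in a session and close it afterwards
    with tf.Session() as sess:
        return sess.run(distance)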
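The same matrix formula can also be sketched in NumPy, which needs no session and makes each step easy to inspect. This is an illustrative equivalent rather than the author's code: the function name `pairwise_pearson` is hypothetical, and the result is cross-checked against NumPy's own `np.corrcoef`.

```python
import numpy as np

def pairwise_pearson(x):
    """Pearson coefficients between all pairs of rows of x, shape (n, n)."""
    x = np.asarray(x, dtype=float)
    # center each row on its own mean
    centered = x - x.mean(axis=1, keepdims=True)
    # numerator matrix: entry (i, j) is sum_k centered[i, k] * centered[j, k]
    dot_product = centered @ centered.T
    # the diagonal holds each row's sum of squared deviations
    square_norm = np.diag(dot_product)
    # denominator matrix: sqrt(square_norm[i] * square_norm[j])
    den = np.sqrt(np.outer(square_norm, square_norm))
    return dot_product / den

a = np.array([[2, 7, 18, 88, 157, 90, 177, 570],
              [3, 5, 15, 90, 180, 88, 160, 580],
              [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)
r = pairwise_pearson(a)
print(r)
# np.corrcoef treats each row as one variable by default
print(np.allclose(r, np.corrcoef(a)))
```

The result is symmetric with ones on the diagonal, so only the upper triangle is needed when comparing against the pair-by-pair loop.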
Verification:
import time
import tensorflow as tf
import numpy as np
# DataHelper is assumed to be the class that defines get_distance and get_pairwise_distances
data_helper = DataHelper()
a = np.array([[2, 7, 18, 88, 157, 90, 177, 570],
              [3, 5, 15, 90, 180, 88, 160, 580],
              [1, 2, 3, 4, 5, 6, 7, 8]], dtype=float)
# loop version: one call per pair (i, j) with i < j
start = time.time()
dis = []
for i in range(a.shape[0] - 1):
    for j in range(i + 1, a.shape[0]):
        dis.append(data_helper.get_distance(a[i, :], a[j, :]))
end = time.time()
for i in dis:
    print(i)
print('for cost {0}s'.format(end - start))
# matrix version: all pairs in a single call
start = time.time()
dis = data_helper.get_pairwise_distances(a)
end = time.time()
print(dis)
print('matrix cost {0}s'.format(end - start))
Results:
0.9983487486440501
0.7993246094489326
0.7851394659823645
for cost 0.0s
2018-08-02 15:03:52.121404: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
[[1. 0.99834875 0.79932461]
[0.99834875 1. 0.78513947]
[0.79932461 0.78513947 1. ]]
matrix cost 0.10700535774230957s
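With only three vectors the matrix version looks slower. Its measured time includes building the graph and starting a `tf.Session`, which is probably what dominates at this size, and the 0.0s for the loop suggests the timer was too coarse to register such a short run. The comparison only becomes meaningful for larger N. Below is a rough sketch for checking this in NumPy, using random data and hypothetical helper names (`loop_pearson`, `matrix_pearson`); timings vary by machine, so none are shown.

```python
import time
import numpy as np

def loop_pearson(x):
    """Pair-by-pair Pearson coefficients with two nested loops."""
    n = x.shape[0]
    out = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            a = x[i] - x[i].mean()
            b = x[j] - x[j].mean()
            out[i, j] = out[j, i] = np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))
    return out

def matrix_pearson(x):
    """All pairs at once via the Gram matrix of the centered rows."""
    centered = x - x.mean(axis=1, keepdims=True)
    dot_product = centered @ centered.T
    square_norm = np.diag(dot_product)
    return dot_product / np.sqrt(np.outer(square_norm, square_norm))

# random test data (an assumption for illustration): 200 vectors of length 64
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 64))

start = time.perf_counter()
r_loop = loop_pearson(data)
loop_cost = time.perf_counter() - start

start = time.perf_counter()
r_matrix = matrix_pearson(data)
matrix_cost = time.perf_counter() - start

print('for cost {0}s'.format(loop_cost))
print('matrix cost {0}s'.format(matrix_cost))
print(np.allclose(r_loop, r_matrix))
```

`time.perf_counter` is used here instead of `time.time` because it is meant for measuring short intervals.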