Code reference: https://github.com/ahmaurya/topics_over_time. If this post infringes any rights, please let me know and I will remove it.
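Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm from statistics. It draws an approximate sequence of samples from a multivariate probability distribution when sampling from that distribution directly is difficult. The sequence can be used to approximate the joint distribution, the marginal distribution of a subset of the variables, or an integral such as the expected value of one variable. Some variables may already be observed, in which case they do not need to be sampled.
Gibbs sampling is widely used in statistical inference, especially Bayesian inference. It is a randomized algorithm, in contrast to deterministic inference algorithms such as expectation-maximization (EM). Like other MCMC algorithms, it draws its samples from a Markov chain, and it can be viewed as a special case of the Metropolis–Hastings algorithm.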
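Gibbs sampling is one of the most commonly used MCMC methods. The basic idea is to use random simulation to collect the desired number of samples from the target distribution. These samples form a Markov chain, and from them we can estimate the parameters of the target distribution and any other posterior quantities of interest. The key question is how to draw the samples. Gibbs sampling applies when the target (joint) distribution cannot be sampled directly but the conditional distribution of each variable given all the others is known. Using these conditionals, which follow from Bayes' rule, we sample each variable of the target distribution in turn, updating one dimension at a time. Once every dimension has been updated, the resulting point counts as one transition of the chain, i.e. one new sample from the target distribution.
In principle the joint probability can be computed directly from the formula, but as the number of variables grows the computation becomes extremely complex, so we approximate the joint distribution by sampling instead. The basic procedure of Gibbs sampling is as follows. Suppose we have variables A, B, C and know the conditionals p(A|B,C), p(B|A,C), p(C|A,B).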
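step1: Assign random values to A, B, C to get an initial sample, e.g. (A0,B0,C0);
step2: Draw A1 from p(A|B0,C0);
step3: Draw B1 from p(B|A1,C0);
step4: Draw C1 from p(C|A1,B1);
This gives the sample (A1,B1,C1). Repeat step2 to step4, each time conditioning on the most recent values. After enough iterations the samples approximately follow the true distribution of the variables and can be used to approximate the joint distribution.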
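The four steps above can be sketched on a toy problem. The example below is only an illustration and is not part of the referenced repository: it assumes three binary variables with a made-up, strictly positive joint table, and the function name `gibbs_three_binary` is hypothetical. Because the joint is known here, the conditionals can be read off the table and the sampled frequencies can be compared with the true probabilities.

```python
import itertools
import random


def gibbs_three_binary(joint, num_iterations=20000, burn_in=1000, seed=0):
    """Gibbs-sample binary (A, B, C) from a strictly positive joint table.

    `joint` maps each tuple (a, b, c) of 0/1 values to its probability.
    """
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(3)]  # step1: random start
    counts = {key: 0 for key in joint}
    for iteration in range(num_iterations):
        for axis in range(3):  # step2 to step4: resample A, then B, then C
            weights = []
            for value in (0, 1):
                candidate = list(state)
                candidate[axis] = value
                weights.append(joint[tuple(candidate)])
            # P(variable = 1 | current values of the other two variables)
            p_one = weights[1] / (weights[0] + weights[1])
            state[axis] = 1 if rng.random() < p_one else 0
        if iteration >= burn_in:  # discard the early, not yet mixed samples
            counts[tuple(state)] += 1
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


# Made-up joint distribution over (A, B, C); the eight values sum to 1.
joint = dict(zip(itertools.product((0, 1), repeat=3),
                 [0.10, 0.05, 0.15, 0.10, 0.05, 0.20, 0.10, 0.25]))
estimate = gibbs_three_binary(joint)
max_error = max(abs(estimate[key] - joint[key]) for key in joint)
print(max_error)  # largest gap between sampled frequency and true probability
```

The burn-in length and iteration count are arbitrary choices for this toy case; a real model usually needs far more care in deciding when the chain has converged.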
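With this brief overview of Gibbs sampling in place, we can now build the Gibbs-sampling-based Topics over Time algorithm.
【Dataset】
1. allstopwords: stop-word list
2. alltimes: timestamp data
3. alltitles: document (title) data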
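【Defining the TopicsOverTime class】
The following packages need to be imported first:
import fileinput
import random
import scipy.special
import numpy as np
import scipy.stats
import copy

class TopicsOverTime:
    # The methods defined in the numbered steps below all belong to this class.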
1. Define the method that loads the corpus and the dictionary
    def GetPnasCorpusAndDictionary(self, documents_path, timestamps_path, stopwords_path):
        '''
        Load the PNAS corpus and dictionary (stop words removed).
        :param documents_path: path to the documents file (one document per line)
        :param timestamps_path: path to the timestamps file
        :param stopwords_path: path to the stop-words file
        :return: documents, timestamps and dictionary, each as a list
        '''
        documents = []
        timestamps = []
        dictionary = set()
        stopwords = set()
        for line in fileinput.input(stopwords_path):  # collect the stop words
            stopwords.update(set(line.lower().strip().split()))
        for doc in fileinput.input(documents_path):  # keep the non-stop words and add each distinct word to the dictionary
            words = [word for word in doc.lower().strip().split() if word not in stopwords]
            documents.append(words)
            dictionary.update(set(words))
        for timestamp in fileinput.input(timestamps_path):  # read the timestamps
            num_titles = int(timestamp.strip().split()[0])  # column 1: number of titles (documents) that share this timestamp
            timestamp = float(timestamp.strip().split()[1])  # column 2: the timestamp itself
            timestamps.extend([timestamp for title in range(num_titles)])  # repeat the timestamp once per title
        first_timestamp = timestamps[0]
        last_timestamp = timestamps[len(timestamps) - 1]
        # Rescale the timestamps to [0, 1]; this assumes the file is sorted by time and contains at least two distinct timestamps
        timestamps = [1.0 * (t - first_timestamp) / (last_timestamp - first_timestamp) for t in timestamps]
        dictionary = list(dictionary)
        assert len(documents) == len(timestamps)  # every document must have exactly one timestamp
        return documents, timestamps, dictionary
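To make the timestamp handling concrete, here is a small sketch of the expand-and-rescale step on made-up data. The helper name `expand_and_normalize` and the sample lines are assumptions for illustration only; the logic mirrors the loop above, where each line holds a count followed by a timestamp. Unlike the method above, the sketch also guards against a zero time span.

```python
def expand_and_normalize(lines):
    """Expand 'count timestamp' lines to one timestamp per document, scaled to [0, 1]."""
    timestamps = []
    for line in lines:
        count, stamp = line.split()[:2]
        timestamps.extend([float(stamp)] * int(count))
    first, last = timestamps[0], timestamps[-1]  # assumes the lines are sorted by time
    span = last - first
    if span == 0:  # a single distinct timestamp would otherwise divide by zero
        return [0.0] * len(timestamps)
    return [(t - first) / span for t in timestamps]


# Hypothetical input: 2 titles from 1991, 1 from 1996, 3 from 2001.
normalized = expand_and_normalize(["2 1991", "1 1996", "3 2001"])
print(normalized)  # [0.0, 0.0, 0.5, 1.0, 1.0, 1.0]
```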
2. Initialize the parameters
    def InitializeParameters(self, documents, timestamps, dictionary):
        '''
        Initialize the parameters.
        :param documents: list of documents (each a list of words)
        :param timestamps: list of timestamps, one per document
        :param dictionary: list of distinct words
        :return: dictionary holding all parameters
        '''
        par = {}  # dictionary that holds all parameters
        par['dataset'] = 'pnas'  # name of the dataset
        par['max_iterations'] = 10  # maximum number of Gibbs sampling iterations
        par['T'] = 10  # number of topics
        par['D'] = len(documents)  # number of documents
        par['V'] = len(dictionary)  # vocabulary size
        par['N'] = [len(doc) for doc in documents]  # number of words in each document
        par['alpha'] = [50.0 / par['T'] for _ in range(par['T'])]  # Dirichlet prior on the document-topic distribution: 50/T for every topic
        par['beta'] = [0.1 for _ in range(par['V'])]  # Dirichlet prior on the topic-word distribution: 0.1 for every word
        par['beta_sum'] = sum(par['beta'])  # sum of the beta values
        par['psi'] = [[1 for _ in range(2)] for _ in range(par['T'])]  # Beta parameters of each topic, initialized to [1, 1]
        par['betafunc_psi'] = [scipy.special.beta(par['psi'][t][0], par['psi'][t][1]) for t in range(par['T'])]  # value of the Beta function for each topic
        par['word_id'] = {dictionary[i]: i for i in range(len(dictionary))}  # mapping {word: index}
        par['word_token'] = dictionary  # index -> word
        par['z'] = [[random.randrange(0, par['T']) for _ in range(par['N'][d])] for d in range(par['D'])]  # random initial topic in [0, T) for every word
        par['t'] = [[timestamps[d] for _ in range(par['N'][d])] for d in range(par['D'])]  # timestamp of every word (the timestamp of its document)
        par['w'] = [[par['word_id'][documents[d][i]] for i in range(par['N'][d])] for d in range(par['D'])]  # word id of every word
        par['m'] = [[0 for t in range(par['T'])] for d in range(par['D'])]  # D x T document-topic counts, all zero
        par['n'] = [[0 for v in range(par['V'])] for t in range(par['T'])]  # T x V topic-word counts, all zero
        par['n_sum'] = [0 for t in range(par['T'])]  # number of words assigned to each topic, all zero
        np.set_printoptions(threshold=np.inf)  # print arrays in full, without eliding the middle
        np.seterr(divide='ignore', invalid='ignore')  # ignore division-by-zero and invalid floating-point warnings
        self.CalculateCounts(par)
        return par
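The CalculateCounts() method used during initialization is defined as follows:
    def CalculateCounts(self, par):
        '''
        Fill the count tables from the current topic assignments.
        :param par: dictionary holding all parameters
        :return:
        '''
        for d in range(par['D']):  # every document
            for i in range(par['N'][d]):  # every word position in document d
                topic_di = par['z'][d][i]  # topic id at position i of document d
                word_di = par['w'][d][i]  # word id at position i of document d
                par['m'][d][topic_di] += 1
                par['n'][topic_di][word_di] += 1
                par['n_sum'][topic_di] += 1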
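A tiny worked example may help to see what the three count tables hold. The sketch below uses a made-up corpus of two documents, two topics and three words; `count_assignments` is a hypothetical standalone version of the counting loop, written without the `par` dictionary so that it can be run on its own.

```python
def count_assignments(z, w, num_topics, vocab_size):
    """Return (m, n, n_sum) for topic assignments z and word ids w."""
    m = [[0] * num_topics for _ in z]                  # document-topic counts
    n = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_sum = [0] * num_topics                           # words per topic
    for d, (topics, words) in enumerate(zip(z, w)):
        for topic, word in zip(topics, words):
            m[d][topic] += 1
            n[topic][word] += 1
            n_sum[topic] += 1
    return m, n, n_sum


z = [[0, 1, 1], [1, 0]]  # topic of every word in two toy documents
w = [[2, 0, 2], [1, 2]]  # word id of every word
m, n, n_sum = count_assignments(z, w, num_topics=2, vocab_size=3)
print(m)      # [[1, 2], [1, 1]]
print(n)      # [[0, 0, 2], [1, 1, 1]]
print(n_sum)  # [2, 3]
```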
3. Collect the timestamps of each topic
    def GetTopicTimestamps(self, par):
        '''
        Collect the timestamps of the words assigned to each topic.
        :param par: dictionary holding all parameters
        :return: list with one list of timestamps per topic
        '''
        topic_timestamps = []
        for topic in range(par['T']):  # every topic
            # Select by topic assignment instead of filtering out zeros, so that words
            # whose (rescaled) timestamp is exactly 0 are not silently dropped
            current_topic_timestamps = [par['t'][d][i] for d in range(par['D']) for i in range(par['N'][d])
                                        if par['z'][d][i] == topic]
            assert current_topic_timestamps != []  # every topic must have at least one word assigned to it
            topic_timestamps.append(current_topic_timestamps)
        return topic_timestamps
4. Estimate psi
    def GetMethodOfMomentsEstimatesForPsi(self, par):
        '''
        Estimate psi with the method of moments.
        :param par: dictionary holding all parameters
        :return: psi
        '''
        topic_timestamps = self.GetTopicTimestamps(par)  # timestamps of each topic
        psi = [[1 for _ in range(2)] for _ in range(len(topic_timestamps))]  # one [1, 1] pair per topic
        for i in range(len(topic_timestamps)):  # every topic
            current_topic_timestamps = topic_timestamps[i]  # timestamps of the current topic
            timestamp_mean = np.mean(current_topic_timestamps)  # mean of these timestamps
            timestamp_var = np.var(current_topic_timestamps)  # variance of these timestamps
            if timestamp_var == 0:  # avoid dividing by a zero variance
                timestamp_var = 1e-6
            common_factor = timestamp_mean * (1 - timestamp_mean) / timestamp_var - 1  # factor shared by both parameters
            # Compute the two Beta parameters of this topic
            psi[i][0] = 1 + timestamp_mean * common_factor
            psi[i][1] = 1 + (1 - timestamp_mean) * common_factor
        return psi
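As a sanity check on the idea of moment matching, the sketch below fits a Beta distribution to synthetic samples with the textbook method-of-moments estimator. Note the assumption: this is the standard estimator for the usual parameterization (density proportional to x^(a-1) (1-x)^(b-1)), without the `1 +` offset used in the method above, and the function name `beta_method_of_moments` is hypothetical.

```python
import random
import statistics


def beta_method_of_moments(samples):
    """Textbook method-of-moments estimate (a, b) for Beta-distributed samples."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples)
    if var == 0:  # same guard as in the method above
        var = 1e-6
    common_factor = mean * (1 - mean) / var - 1
    return mean * common_factor, (1 - mean) * common_factor


rng = random.Random(0)
samples = [rng.betavariate(2.0, 5.0) for _ in range(20000)]  # true a=2, b=5
a_hat, b_hat = beta_method_of_moments(samples)
print(a_hat, b_hat)  # should land near 2 and 5
```

With enough samples the estimates approach the true parameters, which is why a few moments of the per-topic timestamps are enough to update psi after each sweep.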
5. Compute θ and φ
    def ComputePosteriorEstimatesOfThetaAndPhi(self, par):
        '''
        Compute theta and phi by normalizing the count tables.
        :param par: dictionary holding all parameters
        :return: theta and phi
        '''
        theta = copy.deepcopy(par['m'])  # deep copy of the document-topic counts
        phi = copy.deepcopy(par['n'])  # deep copy of the topic-word counts
        for d in range(par['D']):  # every document
            if sum(theta[d]) == 0:  # empty document: fall back to a uniform distribution
                theta[d] = np.asarray([1.0 / len(theta[d]) for _ in range(len(theta[d]))])
            else:
                theta[d] = np.asarray(theta[d])
                theta[d] = 1.0 * theta[d] / sum(theta[d])
        theta = np.asarray(theta)  # θ: one topic distribution per document
        for t in range(par['T']):  # every topic
            if sum(phi[t]) == 0:  # topic without words: fall back to a uniform distribution
                phi[t] = np.asarray([1.0 / len(phi[t]) for _ in range(len(phi[t]))])
            else:
                phi[t] = np.asarray(phi[t])
                phi[t] = 1.0 * phi[t] / sum(phi[t])
        phi = np.asarray(phi)  # φ: one word distribution per topic
        return theta, phi
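Both loops above do the same thing: turn each row of counts into a probability distribution, with a uniform fallback for all-zero rows. A minimal sketch of that row normalization on made-up counts follows; `normalize_rows` is a hypothetical helper, not a function from the referenced code.

```python
def normalize_rows(counts):
    """Normalize each row to sum to 1; all-zero rows become uniform."""
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


theta = normalize_rows([[1, 3], [0, 0]])
print(theta)  # [[0.25, 0.75], [0.5, 0.5]]
```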
6. Run Gibbs sampling
    def TopicsOverTimeGibbsSampling(self, par):
        '''
        Gibbs sampling for the Topics over Time model (a Markov chain Monte Carlo method).
        :param par: dictionary holding all parameters
        :return: theta, phi and psi
        '''
        for iteration in range(par['max_iterations']):  # Gibbs sampling iterations
            for d in range(par['D']):  # every document
                for i in range(par['N'][d]):  # every word position in document d
                    word_di = par['w'][d][i]
                    t_di = par['t'][d][i]
                    # Remove the current assignment of this word from the counts
                    old_topic = par['z'][d][i]
                    par['m'][d][old_topic] -= 1
                    par['n'][old_topic][word_di] -= 1
                    par['n_sum'][old_topic] -= 1
                    topic_probabilities = []  # unnormalized probability of each topic
                    for topic_di in range(par['T']):
                        psi_di = par['psi'][topic_di]
                        topic_probability = 1.0 * (par['m'][d][topic_di] + par['alpha'][topic_di])
                        # Beta density of the timestamp; here psi_di[0] is the exponent on (1 - t) and psi_di[1] the exponent on t
                        topic_probability *= ((1 - t_di) ** (psi_di[0] - 1)) * ((t_di) ** (psi_di[1] - 1))
                        topic_probability /= par['betafunc_psi'][topic_di]
                        topic_probability *= (par['n'][topic_di][word_di] + par['beta'][word_di])
                        topic_probability /= (par['n_sum'][topic_di] + par['beta_sum'])
                        topic_probabilities.append(topic_probability)
                    sum_topic_probabilities = sum(topic_probabilities)  # normalizing constant
                    if sum_topic_probabilities == 0:  # all weights are zero: fall back to a uniform distribution
                        topic_probabilities = [1.0 / par['T'] for _ in range(par['T'])]
                    else:
                        topic_probabilities = [p / sum_topic_probabilities for p in topic_probabilities]
                    # Draw the new topic and add the word back to the counts
                    new_topic = list(np.random.multinomial(1, topic_probabilities, size=1)[0]).index(1)
                    par['z'][d][i] = new_topic
                    par['m'][d][new_topic] += 1
                    par['n'][new_topic][word_di] += 1
                    par['n_sum'][new_topic] += 1
                if d % 1000 == 0:
                    print('Done with iteration {iteration} and document {document}'.format(iteration=iteration,
                                                                                           document=d))
            # After each pass over the corpus, re-estimate psi and refresh the cached Beta function values
            par['psi'] = self.GetMethodOfMomentsEstimatesForPsi(par)
            par['betafunc_psi'] = [scipy.special.beta(par['psi'][t][0], par['psi'][t][1]) for t in range(par['T'])]
        # After the last iteration, turn the counts into theta and phi (this overwrites par['m'] and par['n'])
        par['m'], par['n'] = self.ComputePosteriorEstimatesOfThetaAndPhi(par)
        return par['m'], par['n'], par['psi']
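Read off from the inner loop above, the unnormalized probability that the sampler assigns topic k to word i of document d can be written as a single expression. The notation is an assumption chosen to match the keys of `par`: m, n and n_sum are the count tables with the current word removed, w_{di} and t_{di} are the word id and timestamp, and B is the Beta function.

```latex
P(z_{di} = k \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{-di})
\;\propto\;
\left(m_{dk} + \alpha_k\right)
\cdot
\frac{n_{k,w_{di}} + \beta_{w_{di}}}{n^{\mathrm{sum}}_{k} + \sum_{v=1}^{V} \beta_v}
\cdot
\frac{(1 - t_{di})^{\psi_{k,0} - 1}\; t_{di}^{\psi_{k,1} - 1}}{B(\psi_{k,0}, \psi_{k,1})}
```

The first factor favors topics that are already common in the document, the second favors topics that already generate this word, and the third favors topics whose Beta distribution over time puts mass near the word's timestamp.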
【Running the Topics over Time algorithm】
'''
Run the Topics over Time algorithm
'''
from nlp_tot.tot import TopicsOverTime
import numpy as np
import pickle

if __name__ == "__main__":
    datapath = 'C:/topics_over_time-master/data/'
    resultspath = 'C:/topics_over_time-master/results/pnas_tot/'
    documents_path = datapath + 'alltitles'
    timestamps_path = datapath + 'alltimes'
    stopwords_path = datapath + 'allstopwords'
    tot_topic_vectors_path = resultspath + 'pnas_tot_topic_vectors.csv'
    tot_topic_mixtures_path = resultspath + 'pnas_tot_topic_mixtures.csv'
    tot_topic_shapes_path = resultspath + 'pnas_tot_topic_shapes.csv'
    tot_pickle_path = resultspath + 'pnas_tot.pickle'
    tot = TopicsOverTime()
    documents, timestamps, dictionary = tot.GetPnasCorpusAndDictionary(documents_path, timestamps_path, stopwords_path)
    par = tot.InitializeParameters(documents, timestamps, dictionary)
    theta, phi, psi = tot.TopicsOverTimeGibbsSampling(par)
    # Save the topic-word vectors, document-topic mixtures and Beta shapes as CSV
    np.savetxt(tot_topic_vectors_path, phi, delimiter=',')
    np.savetxt(tot_topic_mixtures_path, theta, delimiter=',')
    np.savetxt(tot_topic_shapes_path, psi, delimiter=',')
    # Save all parameters for the visualization step
    with open(tot_pickle_path, 'wb') as tot_pickle:
        pickle.dump(par, tot_pickle)
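Once phi is available, a quick way to inspect the result without plotting is to list the highest-probability words of each topic. The sketch below is an illustrative helper with made-up values, not part of the referenced code; `top_words` and the toy `phi` and `words` are assumptions. With the real output, `phi` and `par['word_token']` from the script above would take their place.

```python
def top_words(phi, words, k=3):
    """Return the k highest-probability words of every topic (row of phi)."""
    result = []
    for row in phi:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([words[i] for i in ranked[:k]])
    return result


phi = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]  # toy topic-word distributions
words = ['gene', 'protein', 'cell']       # toy vocabulary
top = top_words(phi, words, k=2)
print(top)  # [['gene', 'protein'], ['cell', 'protein']]
```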
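【Visualizing the topic-word distributions and the Beta distributions of topics over time】
The following packages need to be imported first:
import scipy.special
import math
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import pickle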
1. Visualize the topic-word distributions
def VisualizeTopics(phi, words, num_topics, viz_threshold=9e-3):
    '''
    Visualize the topic-word distributions as a heatmap.
    :param phi: topic-word distributions (num_topics x vocabulary size)
    :param words: list of words in the dictionary
    :param num_topics: number of topics (10)
    :param viz_threshold: minimum probability for a word to be displayed
    :return:
    '''
    phi_viz = np.transpose(phi)  # transpose: one row per word, one column per topic
    words_to_display = ~np.all(phi_viz <= viz_threshold, axis=1)  # keep a word if at least one topic gives it a probability above viz_threshold
    words_viz = [words[i] for i in range(len(words_to_display)) if words_to_display[i]]  # the words that are kept
    phi_viz = phi_viz[words_to_display]  # the matching rows of phi_viz
    # Plot
    fig, ax = plt.subplots()
    heatmap = plt.pcolor(phi_viz, cmap=plt.cm.Blues, alpha=0.8)
    plt.colorbar()  # add a color bar to the plot
    # fig.set_size_inches(8, 11)
    ax.grid(False)
    ax.set_frame_on(False)
    ax.set_xticks(np.arange(phi_viz.shape[1]) + 0.5, minor=False)
    ax.set_yticks(np.arange(phi_viz.shape[0]) + 0.5, minor=False)
    ax.invert_yaxis()
    ax.xaxis.tick_top()
    # plt.xticks(rotation=45)
    # Hide the tick marks (the tick1On / tick2On attributes are no longer available in recent Matplotlib versions)
    for t in ax.xaxis.get_major_ticks():
        t.tick1line.set_visible(False)
        t.tick2line.set_visible(False)
    for t in ax.yaxis.get_major_ticks():
        t.tick1line.set_visible(False)
        t.tick2line.set_visible(False)
    word_labels = words_viz  # ['Word ' + str(i) for i in range(1,1000)]
    topic_labels = ['Topic ' + str(i) for i in range(1, num_topics + 1)]
    ax.set_xticklabels(topic_labels, minor=False)  # topics along the x axis
    ax.set_yticklabels(word_labels, minor=False)  # words along the y axis
    plt.show()
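The word filter at the top of the function is the part that decides what ends up in the heatmap, so it is worth seeing in isolation. The sketch below re-creates it in plain Python on made-up numbers; `words_above_threshold` and the toy data are assumptions for illustration.

```python
def words_above_threshold(phi, words, threshold=9e-3):
    """Keep the words that at least one topic rates above the threshold."""
    kept = []
    for word_index, word in enumerate(words):
        if any(topic_row[word_index] > threshold for topic_row in phi):
            kept.append(word)
    return kept


phi = [[0.5, 0.495, 0.005], [0.9, 0.092, 0.008]]  # toy topic-word distributions
words = ['gene', 'protein', 'cell']
kept = words_above_threshold(phi, words)
print(kept)  # ['gene', 'protein']; 'cell' is below 9e-3 in every topic
```

Lowering the threshold shows more words but makes the heatmap harder to read on a real vocabulary.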
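2. Plot the Beta distributions of the topics over time
def VisualizeEvolution(psi):
    '''
    Plot the Beta distribution of every topic over (rescaled) time.
    :param psi: list of Beta parameter pairs, one per topic
    :return:
    '''
    xs = np.linspace(0, 1, num=1000)  # 1000 evenly spaced points between 0 and 1
    fig, ax = plt.subplots()
    for i in range(len(psi)):
        # Same density as in the sampler: psi[i][0] is the exponent on (1 - x), psi[i][1] the exponent on x
        ys = [math.pow(1-x, psi[i][0]-1) * math.pow(x, psi[i][1]-1) / scipy.special.beta(psi[i][0], psi[i][1]) for x in xs]
        ax.plot(xs, ys, label='Topic ' + str(i+1))
    ax.legend(loc='best', frameon=False)
    plt.show()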
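For large psi values the Beta function in the denominator can become extremely small, so evaluating the density in log space may be more robust. The sketch below is one possible way to do that with only the standard library; `tot_beta_density` is a hypothetical name, it follows the same exponent convention as the plotting code above, and it is restricted to points strictly inside (0, 1).

```python
import math


def tot_beta_density(x, psi):
    """Density (1-x)^(psi[0]-1) * x^(psi[1]-1) / B(psi[0], psi[1]), computed in log space."""
    a, b = psi
    if not 0.0 < x < 1.0:
        raise ValueError('x must lie strictly between 0 and 1')
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)  # log B(a, b)
    log_density = (a - 1) * math.log(1 - x) + (b - 1) * math.log(x) - log_beta
    return math.exp(log_density)


print(tot_beta_density(0.5, (2, 2)))   # 6 * 0.5 * 0.5 = 1.5
print(tot_beta_density(0.25, (2, 1)))  # 2 * (1 - 0.25) = 1.5
```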
3. Run the two functions above
if __name__ == "__main__":
    resultspath = 'C:/topics_over_time-master/results/pnas_tot/'
    tot_pickle_path = resultspath + 'pnas_tot.pickle'
    with open(tot_pickle_path, 'rb') as tot_pickle:
        par = pickle.load(tot_pickle)
    VisualizeTopics(par['n'], par['word_token'], par['T'])  # after sampling, par['n'] holds phi
    VisualizeEvolution(par['psi'])
4. The results are as follows:
(1) Topic-word distributions
(2) Beta distributions of the topics over time