Text vectorization: from vector to vector (tf-idf)

corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)  # Step 1: initialize a model


doc_bow = [(0, 1), (1, 1)]

print(tfidf[doc_bow])  # Step 2: use the model to transform a vector

 

[(0, 0.70710678), (1, 0.70710678)]
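To make the two steps less of a black box, here is a minimal pure-Python sketch of the weighting that is consistent with the vector above: raw term frequency times log2(N / df), followed by scaling to unit length. The `corpus` literal is an assumption (the nine-document toy corpus from the gensim tutorial, which is not shown in this section), and the helper names `document_frequencies` and `tfidf_vector` are made up for illustration; this is not gensim's own code.

```python
import math

# Assumed bag-of-words corpus (nine-document toy corpus from the gensim
# tutorial). It is not part of this section; treat it as an assumption.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def document_frequencies(corpus):
    """Count, for each term id, how many documents contain it."""
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_vector(doc_bow, df, num_docs):
    """Weight each term by tf * log2(N / df), then scale to unit length."""
    weighted = [(t, tf * math.log2(num_docs / df[t])) for t, tf in doc_bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(t, w / norm) for t, w in weighted]


df = document_frequencies(corpus)
doc_bow = [(0, 1), (1, 1)]
result = tfidf_vector(doc_bow, df, len(corpus))
print(result)  # both weights come out at roughly 0.7071
```

Terms 0 and 1 each occur in two of the nine documents, so they get the same idf, and normalizing two equal weights gives 1/sqrt(2) for each.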


#### Or apply the transformation to the whole corpus

corpus_tfidf = tfidf[corpus]

for doc in corpus_tfidf:
    print(doc)

 

[(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)]

[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)]

[(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)]

[(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)]

[(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)]

[(9, 1.0)]

[(9, 0.70710678118654746), (10, 0.70710678118654746)]

[(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)]

[(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)]
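Note that `tfidf[corpus]` does not convert the whole corpus at the moment it is called; in gensim the conversion happens document by document while you iterate. The sketch below imitates that behaviour with a hypothetical class, `LazyTransformedCorpus`, and a toy transformation. Both names are invented for illustration and are not gensim's implementation.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for the wrapper returned by model[corpus]."""

    def __init__(self, transform, corpus):
        self.transform = transform  # function applied to a single document
        self.corpus = corpus        # any re-iterable collection of documents

    def __iter__(self):
        # Documents are transformed one at a time, on every pass.
        for doc in self.corpus:
            yield self.transform(doc)


calls = []


def double_counts(doc_bow):
    """Toy transformation: record the call and double every count."""
    calls.append(doc_bow)
    return [(term_id, 2 * count) for term_id, count in doc_bow]


toy_corpus = [[(0, 1), (1, 1)], [(1, 2)]]
lazy = LazyTransformedCorpus(double_counts, toy_corpus)

calls_before_iteration = len(calls)  # nothing has been transformed yet
first_pass = list(lazy)              # the transformation runs here
second_pass = list(lazy)             # and runs again on a second pass

print(calls_before_iteration, first_pass, len(calls))
```

Because the work is redone on every pass, it is usually worth saving the transformed corpus first if you plan to iterate over it many times and the transformation is expensive.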

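Many models take tf-idf vectors as input, LSI for example. (LDA is often mentioned alongside it, although in gensim it is normally trained on plain bag-of-words counts.)

Here is an example: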


lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation

corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

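Here we use Latent Semantic Indexing (LSI) to transform our tf-idf corpus into a latent 2-D space (2-D because we set num_topics=2). Now you may wonder: what are these two latent dimensions? Let's inspect them with models.LsiModel.print_topics():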


lsi.print_topics(2)

 

topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"

topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"

This gives the topic-word distribution: the weight of each word in each topic.


for doc in corpus_lsi:  # here both the bow->tfidf and tfidf->lsi transformations are executed on the fly
    print(doc)

 

[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"

[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"

[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"

[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"

[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"

[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"

[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"

[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"

[(0, -0.617), (1, 0.054)] # "Graph minors A survey"

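There are 2 topics: index 0 is the first topic and index 1 is the second.

So this is the document-topic distribution (strictly speaking, LSI weights rather than probabilities): each row gives one document's weight on each topic.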
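The topic-word weights and the document-topic weights are tied together: a document's coordinate on a topic is consistent with the dot product of its tf-idf vector and that topic's word weights. The sketch below checks this for topic #0 on the last four documents, using only numbers printed above. The `id2word` mapping is an assumption (it is not shown in this section and was chosen because it matches the printed values), and `project_onto_topic` is a hypothetical helper, not a gensim function. Topic #1 is left out because its printed top-10 list does not contain every word these documents need.

```python
# Word weights of topic #0, copied from the print_topics() output above.
topic0 = {
    "trees": -0.703, "graph": -0.538, "minors": -0.402, "survey": -0.187,
    "system": -0.061, "response": -0.060, "time": -0.060, "user": -0.058,
    "computer": -0.049, "interface": -0.035,
}

# Assumed id -> word mapping for the ids used below (an assumption, chosen
# because it is consistent with the printed numbers).
id2word = {4: "survey", 9: "trees", 10: "graph", 11: "minors"}

# tf-idf vectors of the last four documents, copied from the output above.
tfidf_docs = [
    [(9, 1.0)],
    [(9, 0.70710678118654746), (10, 0.70710678118654746)],
    [(9, 0.50804290089167492), (10, 0.50804290089167492),
     (11, 0.69554641952003704)],
    [(4, 0.62825804686700459), (10, 0.45889394536615247),
     (11, 0.62825804686700459)],
]


def project_onto_topic(doc_tfidf, topic_weights, id2word):
    """Dot product of a sparse tf-idf vector with one topic's word weights."""
    return sum(w * topic_weights.get(id2word[t], 0.0) for t, w in doc_tfidf)


topic0_coords = [project_onto_topic(d, topic0, id2word) for d in tfidf_docs]
for coord in topic0_coords:
    print(round(coord, 2))
```

The results land within rounding error of the topic 0 values printed for these documents (-0.703, -0.877, -0.910, -0.617); the small differences come from the word weights being printed with only three decimals.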
