A Glimpse of Tang Poetry: Matching Lines of Tang Poems with Word Mover's Distance

Reposted from: https://www.cnblogs.com/combfish/p/8126857.html


Word Mover's Distance (WMD) is a measure of document similarity built on top of word vectors.
 
For a detailed introduction to WMD, see http://blog.csdn.net/qrlhl/article/details/78512598 or other material available online.
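For orientation, the standard formulation of WMD (as given by Kusner et al.) can be written as the transport problem below. The symbols are introduced here for illustration: d_i is the normalized frequency of word i in document d, x_i is its word vector, n is the vocabulary size, and T_ij is the amount of "mass" moved from word i in d to word j in d'.

```latex
\[
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \ \ \forall i, \qquad
\sum_{i=1}^{n} T_{ij} = d'_j \ \ \forall j .
\]
```

A smaller value means that the words of one document can be "moved" onto the words of the other at lower cost, i.e. the two documents are closer in the embedding space.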
 
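Here, WMD is used to measure how closely lines of Tang poetry are related. Why Tang poetry? Because a plain-text copy of the Complete Tang Poems (全唐诗) is easy to obtain; a quick search turns up a download. Link to the txt archive: https://files.cnblogs.com/files/combfish/%E5%85%A8%E5%94%90%E8%AF%97.zip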
 
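Steps:
1. Preprocess the corpus: split the poems into lines and segment them into words. Line splitting is based on punctuation marks; word segmentation relies on jieba.
2. Train a word-vector model and a WMD similarity model with gensim.
3. Query.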
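The punctuation-based splitting in step 1 could be sketched as follows. This is only a minimal illustration using the standard library: the function name `split_verses` and the set of boundary marks are choices made for this example, not taken from the original script. Each resulting verse would then be passed to `jieba.cut` for word segmentation.

```python
import re

# Punctuation treated as verse boundaries (an illustrative choice,
# covering common full-width and ASCII marks).
BOUNDARY = re.compile(r"[,。!?;、,.!?;]")


def split_verses(line):
    """Split one raw line of a poem into verses at punctuation marks."""
    parts = BOUNDARY.split(line.strip())
    # Drop empty pieces produced by trailing or consecutive punctuation.
    return [p.strip() for p in parts if p.strip()]


# Example with the opening couplet of Li Bai's "静夜思".
print(split_verses("床前明月光,疑是地上霜。"))
```

Splitting before segmentation keeps each verse as its own unit, so a query is matched against verses rather than against whole poems.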
 
Code:
import jieba
from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity

# Punctuation tokens (ASCII and full-width) to drop after segmentation.
useless = [',', '.', '(', ')', '!', '?', '\'', '"', ':', '<', '>',
           ',', '。', '(', ')', '!', '?', '‘', '’', '“', '”', ':',
           '《', '》', '[', ']', '【', '】']

print(20 * '*', 'loading data', 40 * '*')
corpus = []     # tokenized lines, used for training and matching
documents = []  # the same lines as raw text, used for display
with open('全唐诗.txt', encoding='utf-8') as f:
    for each in f:
        # Strip newlines, hyphens and all spaces.
        each = each.replace('\n', '').replace('-', '').strip().replace(' ', '')
        # Skip very short lines and volume headings (lines starting with '卷').
        if len(each) > 3 and each[0] != '卷':
            documents.append(each)
            corpus.append([w for w in jieba.cut(each) if w not in useless])

print(len(corpus))

print(20 * '*', 'training models', 40 * '*')
# `vector_size` is the gensim >= 4.0 name of this parameter; older releases call it `size`.
model = Word2Vec(corpus, workers=3, vector_size=100)

# Initialize WmdSimilarity on the trained word vectors.
# Note: gensim's WMD needs an extra optimal-transport package (which one depends on the gensim version).
num_best = 10
instance = WmdSimilarity(corpus, model.wv, num_best=num_best)

print(20 * '*', 'testing', 40 * '*')
while True:
    sent = input('Enter a query line: ')
    query = [w for w in jieba.cut(sent) if w not in useless]

    sims = instance[query]  # A query is simply a "look-up" in the similarity class.

    # Print the query and the retrieved documents, together with their similarities.
    print('Query:')
    print(sent)
    for doc_id, score in sims:  # at most num_best (index, similarity) pairs
        print()
        print('sim = %.4f' % score)
        print(documents[doc_id])
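To give a feel for what the printed similarity measures, here is a small standard-library sketch of the relaxed WMD, a cheaper lower bound on the exact distance in which every word simply moves all of its weight to the nearest word of the other text. It is not what gensim computes (gensim solves the full transport problem), and the 2-D vectors are invented for this example. The last line converts the distance to a similarity as 1 / (1 + distance), which, to my understanding, is how `WmdSimilarity` derives the scores it reports.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: token -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def one_sided_cost(src, dst, vectors):
    """Move each source word's whole weight to its nearest target word."""
    cost = 0.0
    for word, weight in nbow(src).items():
        nearest = min(math.dist(vectors[word], vectors[other]) for other in set(dst))
        cost += weight * nearest
    return cost


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-sided costs (a lower bound on WMD)."""
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))


# Hypothetical 2-D embeddings, invented for this example only.
toy_vectors = {
    "月": (0.0, 0.0), "月光": (0.0, 1.0),
    "霜": (3.0, 0.0), "雪": (3.0, 1.0),
}

d = relaxed_wmd(["月", "霜"], ["月光", "雪"], toy_vectors)
print(d)            # relaxed transport distance
print(1 / (1 + d))  # distance mapped to a similarity in (0, 1]
```

With these toy vectors each word has a neighbour at distance 1 in the other text, so the relaxed distance is 1.0 and the similarity is 0.5; identical texts give distance 0 and similarity 1.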

  

Results: from the results, we can see…
 
 
 
 
 
 
 
 
 


Reposted from blog.csdn.net/m0_37870649/article/details/80757408