02-NLP: A gensim Case Study on Chinese Text Processing

Training a Chinese word2vec model

 

1. Preparing and preprocessing the data

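First we need a reasonably large Chinese corpus. Chinese Wikipedia is a good candidate (the Sogou news corpus is another option). The Chinese Wikipedia dump is available at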
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

The Chinese Wikipedia dump is not very large: the compressed XML file is about 1 GB. First, process this compressed XML file with process_wiki_data.py by running: python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # process_wiki_data.py: parse the Wikipedia XML dump and convert it to plain text
    import logging
    import os.path
    import sys

    from gensim.corpora import WikiCorpus

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 3:
            print("usage: python %s <input.xml.bz2> <output.txt>" % program)
            sys.exit(1)
        inp, outp = sys.argv[1:3]
        space = " "
        i = 0

        output = open(outp, 'w')
        # lemmatize=False skips lemmatization; gensim 4.x no longer accepts
        # this argument, so drop it when running on a newer release
        wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
        for text in wiki.get_texts():
            # write one article per line, tokens separated by spaces
            output.write(space.join(text) + "\n")
            i = i + 1
            if i % 10000 == 0:
                logger.info("Saved " + str(i) + " articles")
        output.close()
        logger.info("Finished Saved " + str(i) + " articles")

This produces the following output:

 
    2016-08-11 20:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
    2016-08-11 20:40:08,329: INFO: Saved 10000 articles
    2016-08-11 20:40:45,501: INFO: Saved 20000 articles
    2016-08-11 20:41:23,659: INFO: Saved 30000 articles
    2016-08-11 20:42:01,748: INFO: Saved 40000 articles
    2016-08-11 20:42:33,779: INFO: Saved 50000 articles
    ......
    2016-08-11 20:55:23,094: INFO: Saved 200000 articles
    2016-08-11 20:56:14,692: INFO: Saved 210000 articles
    2016-08-11 20:57:04,614: INFO: Saved 220000 articles
    2016-08-11 20:57:57,979: INFO: Saved 230000 articles
    2016-08-11 20:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
    2016-08-11 20:58:16,622: INFO: Finished Saved 232894 articles

In Python, jieba can be used for word segmentation, producing the segmented file wiki.zh.text.seg.
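
The segmentation step itself is not shown in this post. A minimal sketch of it, assuming Python 3 and an installed jieba package, might look like the following. The helper names `segment_line` and `segment_file` and the `cut` parameter are made up for illustration; the only jieba call used is `jieba.cut`.

```python
import io


def segment_line(line, cut):
    """Join the tokens produced by `cut` with single spaces, dropping blank tokens."""
    return " ".join(tok for tok in cut(line.strip()) if tok.strip())


def segment_file(src, dst, cut=None):
    """Segment every line of `src` and write the result to `dst`.

    `cut` is any callable that maps a string to an iterable of tokens.
    When it is omitted, jieba.cut is used; the import is done lazily so the
    helper can also be exercised with a different tokenizer.
    Returns the number of lines written.
    """
    if cut is None:
        import jieba  # assumption: jieba is installed
        cut = jieba.cut
    count = 0
    with io.open(src, encoding="utf-8") as fin, \
            io.open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(segment_line(line, cut) + "\n")
            count += 1
    return count


# Usage, with the file names from the text above:
# segment_file("wiki.zh.text", "wiki.zh.text.seg")
```

Each output line stays one article, with tokens separated by spaces, which is the layout that LineSentence reads in the training script.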
Next, train the model with word2vec:
python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector

 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # train_word2vec_model.py: train the word2vec model
    import logging
    import multiprocessing
    import os.path
    import sys

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 4:
            print("usage: python %s <segmented corpus> <model file> <vector file>" % program)
            sys.exit(1)
        inp, outp1, outp2 = sys.argv[1:4]

        # 400-dimensional vectors, context window of 5, words seen fewer than
        # 5 times are ignored; gensim 4.x renames `size` to `vector_size`
        model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                         workers=multiprocessing.cpu_count())

        # trim unneeded model memory = use (much) less RAM
        # model.init_sims(replace=True)

        # save the full model, then the vectors alone in plain-text format;
        # newer gensim releases expose the latter as model.wv.save_word2vec_format
        model.save(outp1)
        model.save_word2vec_format(outp2, binary=False)

Run log:

 
    2016-08-12 09:50:02,586: INFO: running python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
    2016-08-12 09:50:02,592: INFO: collecting all words and their counts
    2016-08-12 09:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
    2016-08-12 09:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
    2016-08-12 09:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
    2016-08-12 09:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
    ...
    2016-08-12 09:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
    2016-08-12 09:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
    2016-08-12 09:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
    2016-08-12 09:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
    2016-08-12 09:52:13,672: INFO: total 278291 word types after removing those with count<5
    2016-08-12 09:52:13,673: INFO: constructing a huffman tree from 278291 words
    2016-08-12 09:52:29,323: INFO: built huffman tree with maximum node depth 25
    2016-08-12 09:52:29,683: INFO: resetting layer weights
    2016-08-12 09:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
    2016-08-12 09:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
    2016-08-12 09:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
    2016-08-12 09:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
    2016-08-12 09:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
    2016-08-12 09:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
    2016-08-12 09:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
    2016-08-12 09:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
    ......
    2016-08-12 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
    2016-08-12 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
    2016-08-12 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
    2016-08-12 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
    2016-08-12 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
    2016-08-12 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
    2016-08-12 19:22:13,884: INFO: not storing attribute syn0norm
    2016-08-12 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
    2016-08-12 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
    2016-08-12 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
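
The last log line refers to the plain-text vector file written with binary=False. That format starts with a header line holding the vocabulary size and the vector dimensionality, followed by one line per word with the word and its values. As a rough illustration of the layout, here is a small reader that does not depend on gensim. The function name `read_text_vectors` is hypothetical, and the sketch assumes Python 3, UTF-8 encoding, and words that contain no spaces.

```python
import io


def read_text_vectors(path, limit=None):
    """Read a word2vec text-format file into a dict of word -> list of floats.

    Returns (vocab_size, dim, vectors), where vocab_size and dim come from the
    header line. `limit` optionally stops after that many words, which is
    handy for peeking at a large file.
    """
    vectors = {}
    with io.open(path, encoding="utf-8") as f:
        header = f.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the declared dimensionality
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vocab_size, dim, vectors


# Usage, with the file name from the log above:
# vocab_size, dim, vecs = read_text_vectors("wiki.zh.text.vector", limit=1000)
```

For the run logged above, the header would be expected to report 278291 words and 400 dimensions, matching the "278291x400 projection weights" message.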

Testing the model:

 
    In [1]: import gensim
    In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
    In [3]: model.most_similar(u"足球")
    Out[3]:
    [(u'\u8054\u8d5b', 0.6553816199302673),
     (u'\u7532\u7ea7', 0.6530429720878601),
     (u'\u7bee\u7403', 0.5967546701431274),
     (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
     (u'\u4e59\u7ea7', 0.5840631723403931),
     (u'\u8db3\u7403\u961f', 0.5560152530670166),
     (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
     (u'allsvenskan', 0.5249762535095215),
     (u'\u4ee3\u8868\u961f', 0.5214947462081909),
     (u'\u7532\u7ec4', 0.5177896022796631)]
    In [4]: result = model.most_similar(u"足球")
    In [5]: for e in result:
       ....:     print e[0], e[1]
       ....:
    联赛 0.65538161993
    甲级 0.653042972088
    篮球 0.596754670143
    俱乐部 0.587228953838
    乙级 0.58406317234
    足球队 0.556015253067
    亚足联 0.530800580978
    allsvenskan 0.52497625351
    代表队 0.521494746208
    甲组 0.51778960228
    In [6]: result = model.most_similar(u"男人")
    In [7]: for e in result:
       ....:     print e[0], e[1]
       ....:
    女人 0.77537125349
    家伙 0.617369174957
    妈妈 0.567102909088
    漂亮 0.560832381248
    잘했어 0.540875017643
    谎言 0.538448691368
    爸爸 0.53660941124
    傻瓜 0.535608053207
    예쁘다 0.535151124001
    mc 0.529670000076
    In [8]: result = model.most_similar(u"女人")
    In [9]: for e in result:
       ....:     print e[0], e[1]
       ....:
    男人 0.77537125349
    我的某 0.589010596275
    妈妈 0.576344847679
    잘했어 0.562340974808
    美丽 0.555426716805
    爸爸 0.543958246708
    新娘 0.543640494347
    谎言 0.540272831917
    妞儿 0.531066179276
    老婆 0.528521537781
    In [10]: result = model.most_similar(u"青蛙")
    In [11]: for e in result:
       ....:     print e[0], e[1]
       ....:
    老鼠 0.559612870216
    乌龟 0.489831030369
    蜥蜴 0.478990525007
    0.46728849411
    鳄鱼 0.461885392666
    蟾蜍 0.448014199734
    猴子 0.436584025621
    白雪公主 0.434905380011
    蚯蚓 0.433413207531
    螃蟹 0.4314712286
    In [12]: result = model.most_similar(u"姨夫")
    In [13]: for e in result:
       ....:     print e[0], e[1]
       ....:
    堂伯 0.583935439587
    祖父 0.574735701084
    妃所生 0.569327116013
    内弟 0.562012672424
    早卒 0.558042645454
    0.553856015205
    胤祯 0.553288519382
    陈潜 0.550716996193
    愔之 0.550510883331
    叔父 0.550032019615
    In [14]: result = model.most_similar(u"衣服")
    In [15]: for e in result:
       ....:     print e[0], e[1]
       ....:
    鞋子 0.686688780785
    穿着 0.672499775887
    衣物 0.67173999548
    大衣 0.667605519295
    裤子 0.662670075893
    内裤 0.662210345268
    裙子 0.659705817699
    西装 0.648508131504
    洋装 0.647238850594
    围裙 0.642895817757
    In [16]: result = model.most_similar(u"公安局")
    In [17]: for e in result:
       ....:     print e[0], e[1]
       ....:
    司法局 0.730189085007
    公安厅 0.634275555611
    公安 0.612798035145
    房管局 0.597343325615
    商业局 0.597183346748
    军管会 0.59476184845
    体育局 0.59283208847
    财政局 0.588721752167
    戒毒所 0.575558543205
    新闻办 0.573395550251
    In [18]: result = model.most_similar(u"铁道部")
    In [19]: for e in result:
       ....:     print e[0], e[1]
       ....:
    盛光祖 0.565509021282
    交通部 0.548688530922
    批复 0.546967327595
    刘志军 0.541010737419
    立项 0.517836689949
    报送 0.510296344757
    计委 0.508456230164
    水利部 0.503531932831
    国务院 0.503227233887
    经贸委 0.50156635046
    In [20]: result = model.most_similar(u"清华大学")
    In [21]: for e in result:
       ....:     print e[0], e[1]
       ....:
    北京大学 0.763922810555
    化学系 0.724210739136
    物理系 0.694550514221
    数学系 0.684280991554
    中山大学 0.677202701569
    复旦 0.657914161682
    师范大学 0.656435549259
    哲学系 0.654701948166
    生物系 0.654403865337
    中文系 0.653147578239
    In [22]: result = model.most_similar(u"卫视")
    In [23]: for e in result:
       ....:     print e[0], e[1]
       ....:
    湖南 0.676812887192
    中文台 0.626506924629
    収蔵 0.621356606483
    黄金档 0.582251906395
    cctv 0.536769032478
    安徽 0.536752820015
    非同凡响 0.534517168999
    唱响 0.533438682556
    最强音 0.532605051994
    金鹰 0.531676828861
    In [26]: result = model.most_similar(u"林丹")
    In [27]: for e in result:
       ....:     print e[0], e[1]
       ....:
    黄综翰 0.538035452366
    蒋燕皎 0.52646958828
    刘鑫 0.522252976894
    韩晶娜 0.516120731831
    王晓理 0.512289524078
    王适 0.508560419083
    杨影 0.508159279823
    陈跃 0.507353425026
    龚智超 0.503159761429
    李敬元 0.50262516737
    In [28]: result = model.most_similar(u"语言学")
    In [29]: for e in result:
       ....:     print e[0], e[1]
       ....:
    社会学 0.632598280907
    人类学 0.623406708241
    历史学 0.618442356586
    比较文学 0.604823827744
    心理学 0.600066184998
    人文科学 0.577783346176
    社会心理学 0.575571238995
    政治学 0.574541330338
    地理学 0.573896467686
    哲学 0.573873817921
    In [30]: result = model.most_similar(u"计算机")
    In [31]: for e in result:
       ....:     print e[0], e[1]
       ....:
    自动化 0.674171924591
    应用 0.614087462425
    自动化系 0.611132860184
    材料科学 0.607891201973
    集成电路 0.600370049477
    技术 0.597518980503
    电子学 0.591316461563
    建模 0.577238917351
    工程学 0.572855889797
    微电子 0.570086717606
    In [32]: model.similarity(u"计算机", u"自动化")
    Out[32]: 0.67417196002404789
    In [33]: model.similarity(u"女人", u"男人")
    Out[33]: 0.77537125129824813
    In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    Out[34]: u'\u4e2d\u5fc3'
    In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    中心
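
The scores in this session are cosine similarities between word vectors. The toy sketch below, in pure Python 3 with made-up 3-dimensional vectors, mirrors what similarity, most_similar, and doesnt_match compute conceptually. It is not gensim's implementation, and every vector and function name in it is invented for illustration.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in v))


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(w, cosine(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the unit-length vectors."""
    units = {w: [a / norm(vectors[w]) for a in vectors[w]] for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: cosine(units[w], mean))


# Made-up vectors: three meal words point in a similar direction, one does not.
toy = {
    "早餐": [0.9, 0.1, 0.0],
    "午餐": [0.8, 0.2, 0.1],
    "晚餐": [0.85, 0.15, 0.05],
    "中心": [0.0, 0.2, 0.9],
}

print(most_similar(toy, "早餐", topn=2))
print(doesnt_match(toy, ["早餐", "晚餐", "午餐", "中心"]))
```

With these toy vectors the odd one out is 中心, the same answer the trained model gives in the session above, because its direction differs most from the average of the four.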

Reposted from www.cnblogs.com/Josie-chen/p/9096301.html