This article follows "Word Vectors (1): Training Chinese Word Vectors with Word2Vec on the Sogou News Dataset". Here we merge two corpora and train a better word vector model on the combined data.
References:
Training Chinese word vectors with word2vec on the Wikipedia Chinese corpus
Small projects (Gensim library): Wikipedia Chinese data processing
Download the Dataset
Click here to download the Wikipedia Chinese corpus (pick the dump for the date you want).
The downloaded file is zhwiki-latest-pages-articles.xml.bz2, about 1.75 GB (the download is quite slow).
Data Preprocessing
(A) Text extraction
There are two ways to extract the text from the compressed file:
- Use WikiCorpus from gensim.corpora to process the Wikipedia corpus directly, saving the result to wikiCorpus/wikiCorpus.txt:
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    print('Main program started...')
    input_file_name = 'zhwiki-latest-pages-articles.xml.bz2'
    output_file_name = 'wikiCorpus/wikiCorpus.txt'
    print('Reading wiki data...')
    input_file = WikiCorpus(input_file_name, lemmatize=False, dictionary={})
    print('Finished reading wiki data!')
    print('Processing started...')
    count = 0
    with open(output_file_name, 'wb') as output_file:
        for text in input_file.get_texts():
            output_file.write(' '.join(text).encode('utf-8'))
            output_file.write('\n'.encode('utf-8'))
            count = count + 1
            if count % 10000 == 0:
                print('Processed %d articles so far' % count)
    print('Processing finished!')
    print('Main program finished!')
- Download Wikipedia Extractor and use it to extract the text.
After processing, the dataset contains 350,000+ articles; the result looks like the screenshot below. Note that the punctuation has already been stripped, which is very convenient!
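As a quick sanity check on the extracted corpus, recall that WikiCorpus writes one article per line as space-separated tokens. The small sketch below (file name and sample data are stand-ins, not part of the original article) counts the tokens on the first few lines:

```python
import io

def corpus_stats(fileobj, max_lines=5):
    """Return (lines_checked, tokens_per_line) for the first max_lines lines."""
    token_counts = []
    for i, line in enumerate(fileobj):
        if i >= max_lines:
            break
        token_counts.append(len(line.split()))
    return len(token_counts), token_counts

# Stand-in sample: two "articles", one per line; in practice pass
# open('wikiCorpus/wikiCorpus.txt', encoding='utf-8') instead.
sample = io.StringIO('数学 是 研究 数量 结构 的 学科\n哲学 是 对 基本 问题 的 研究\n')
print(corpus_stats(sample))  # → (2, [7, 7])
```

If a line comes back with zero tokens, the extraction step likely dropped that article.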
(B) Convert Traditional to Simplified Chinese
Download OpenCC to perform the character conversion.
Copy wikiCorpus.txt into the OpenCC folder, cd into that folder on the command line, and run opencc -i wikiCorpus.txt -o wikiCorpusSim.txt -c t2s.json
to complete the conversion. Save the converted wikiCorpusSim.txt into the Python project; the result is shown below:
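What the t2s conversion does can be illustrated in plain Python with str.translate. The tiny mapping table below covers only a handful of characters and is purely illustrative; a real conversion should use OpenCC's full t2s dictionary as above:

```python
# Toy Traditional→Simplified table (illustrative only, not OpenCC's real data).
T2S = str.maketrans({'語': '语', '學': '学', '體': '体', '維': '维', '資': '资'})

def t2s(text):
    """Convert Traditional characters to Simplified using the toy table."""
    return text.translate(T2S)

print(t2s('維基 語言 學'))  # → 维基 语言 学
```

Characters missing from the table pass through unchanged, which is also how OpenCC treats characters that are identical in both scripts.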
(C) Word segmentation
If you have processed the Sogou dataset before, this step should be very familiar.
import jieba

filePath = 'wikiCorpus/wikiCorpusSim.txt'  # simplified text
fileSegWordDonePath = 'wikiCorpus/wikiCorpusSegDone.txt'  # segmented text

# Read every line of the file into a list
fileTrainRead = []
with open(filePath, encoding='utf-8') as fileTrainRaw:
    for line in fileTrainRaw:
        fileTrainRead.append(line)

# Segment each line with jieba
fileTrainSeg = []
file_userDict = 'dict.txt'  # custom user dictionary
jieba.load_userdict(file_userDict)
for i in range(len(fileTrainRead)):
    fileTrainSeg.append([' '.join(jieba.cut(fileTrainRead[i], cut_all=False))])
    if i % 100 == 0:  # print progress every 100 lines
        print(i)

# Write the segmented text to a file
with open(fileSegWordDonePath, 'wb') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0].encode('utf-8'))
        fW.write('\n'.encode('utf-8'))
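Reading the whole corpus into a list before segmenting uses a lot of memory on a multi-gigabyte file. A streaming variant processes one line at a time; the `segment_stream` helper below is a sketch (not from the original article) with a pluggable tokenizer, so in practice you would pass `jieba.cut`, while the demo substitutes a trivial character-level tokenizer:

```python
def segment_stream(lines, tokenize):
    """Yield each input line re-joined as space-separated tokens."""
    for line in lines:
        yield ' '.join(tokenize(line.strip()))

# Illustration with a character-level tokenizer standing in for jieba.cut:
print(list(segment_stream(['今天天气好'], tokenize=list)))  # → ['今 天 天 气 好']
```

Because it yields lines lazily, the output file can be written inside the same loop without ever holding the corpus in memory.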
(D) Merge the datasets
Merge the wiki dataset with the Sogou dataset processed earlier to generate one unbeatable Chinese corpus.
filePath1 = 'wikiCorpus/wikiCorpusSegDone.txt'  # segmented wiki corpus
filePath2 = 'sougouCorpus/sougouCorpusSegDone.txt'  # segmented Sogou corpus
filePath3 = 'corpusFinal.txt'  # final corpus

fileFinal = []
countS = 0  # Sogou line counter
countW = 0  # wiki line counter

# Read the Sogou corpus
with open(filePath2, encoding='utf-8') as ff1:
    print("--- Sogou corpus opened ---")
    for line in ff1:
        fileFinal.append(line)
        countS = countS + 1
        if countS % 10000 == 0:
            print("--------- %d Sogou articles read ---------" % countS)

# Read the wiki corpus
with open(filePath1, encoding='utf-8') as ff2:
    print("--- wiki corpus opened ---")
    for line in ff2:
        fileFinal.append(line)
        countW = countW + 1
        if countW % 10000 == 0:
            print("--------- %d wiki articles read ---------" % countW)

# Write everything to the final file, line by line
with open(filePath3, 'wb') as ff:
    for i in range(len(fileFinal)):
        ff.write(fileFinal[i].encode('utf-8'))
        if i % 10000 == 0:
            print("--------- %d articles merged ---------" % i)
    print("--------- %d lines read in total ---------" % len(fileFinal))
    print("--------- merge complete ---------")
The merged corpus has 2.12 million+ lines in total (710,000+ from wiki and 1.41 million from Sogou)
and is 3.35 GB in size.
Train the Model
We reuse the same training code as before:
import logging
import sys

import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger
# import smart_open


def train_word2vec(dataset_path, out_vector):
    # Configure logging output
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # Turn the corpus into an iterable of sentences
    sentences = LineSentence(dataset_path)
    # sentences = LineSentence(smart_open.open(dataset_path, encoding='utf-8'))  # or open with smart_open
    # Train the word2vec model (size = vector dimension, window = max context
    # distance, min_count = minimum word frequency, iter = number of training
    # epochs for stochastic gradient descent, sg=1 selects the Skip-Gram model)
    model = word2vec.Word2Vec(sentences, size=128, sg=1, window=6, min_count=5, workers=4, iter=6)
    # Save the model and the word vectors
    model.save("word2vec.model")
    model.wv.save_word2vec_format(out_vector, binary=False)


# Load a saved model
def load_word2vec_model(w2v_path):
    model = word2vec.Word2Vec.load(w2v_path)
    return model


# Print the words most similar to a given word
def calculate_most_similar(model, word):
    similar_words = model.wv.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])


# Print the similarity between two words
def calculate_words_similar(model, word1, word2):
    print(model.wv.similarity(word1, word2))


# Find the word that does not belong in the list
def find_word_dismatch(model, words):
    print(model.wv.doesnt_match(words))


if __name__ == '__main__':
    dataset_path = 'corpusFinal.txt'
    out_vector = 'corpusFinal.vector'
    train_word2vec(dataset_path, out_vector)  # train the model
    # model = load_word2vec_model("word2vec.model")  # load the model
    # calculate_most_similar(model, "病毒")  # nearest neighbours of "virus"
    # calculate_words_similar(model, "法律", "制度")  # similarity of two words
    # print(model.wv.__getitem__('男人'))  # raw vector for a word
    # words = ["早饭", "吃饭", "恰饭", "嘻哈"]
    # find_word_dismatch(model, words)
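The similarity queries above (most_similar, similarity) rank words by the cosine similarity between their vectors. A minimal stdlib sketch of that computation, on toy vectors rather than real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))  # → 0.7071
```

A value near 1 means the words appear in similar contexts; near 0 means they are unrelated.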
Words most similar to "病毒" (virus):
Reference articles
Training Chinese word vectors with word2vec on the Wikipedia Chinese corpus
Small projects (Gensim library): Wikipedia Chinese data processing