A word2vec training example with gensim: analyzing character relationships in Romance of the Three Kingdoms

Disclaimer: the author's knowledge is limited, and this post inevitably contains flaws and perhaps serious mistakes; corrections are welcome. My main goal in writing is to exchange ideas and learn, not to gain attention or spread widely. The road is long; let us encourage each other. https://blog.csdn.net/yexiaohhjk/article/details/85086503

Foreword

Everything can be embedded.

After finishing the second week of cs224N and the related papers, I find word2vec fascinating: from short context windows it learns to embed the semantics of words (entities) into a vector space, from which the correlation between two words (entities) can be judged. Since the excellent gensim library already provides the wheels, why not run some simple and interesting experiments before digging deeper.

I originally wanted to crawl Douban users' viewing history and build a recommender with word2vec, but I have recently entered an exam period and am busy preparing, so I will leave that as a plan for later; more skills never hurt. For now I use Romance of the Three Kingdoms as the corpus to practice gensim, learn some embeddings, and see what interesting conclusions come out.

Preparing the corpus

Using requests and BeautifulSoup, I crawled the historical novel Romance of the Three Kingdoms, which I have loved since elementary school, from a website (a chance to review my crawling skills). After simple cleaning and removal of punctuation, the text is segmented with jieba; because the novel is close to classical Chinese, the segmentation quality is not great.

def crawl(url = None):
    """
     Crawl《三国演义》from http://www.purepen.com/sgyy/ and save it to a local txt file
    :param url:
    :return:
    """
    print('Crawling the novel, please wait...')
    url  = 'http://www.purepen.com/sgyy/'
    contents = ''
    for num in range(1,121):
        num_str = str(num)
        if len(num_str) == 1 : num_str = '00'+ num_str
        if len(num_str) == 2 : num_str = '0'+ num_str
        urls = url + num_str + '.htm'
        html = requests.get(urls)
        html.encoding = html.apparent_encoding # make sure the Chinese text is decoded correctly
        soup =  BeautifulSoup(html.text,'lxml')
        title  = soup.find(align = 'center').text
        contents += title
        content = soup.find(face = '宋体').text
        contents += content

    with open('三国演义.txt','w') as f:
        f.write(contents)
        
def segment(document_path= '三国演义.txt'):
    """

    :param document_path:
    :return:
    """
    with open(document_path) as f:
        document = f.read()
        document = re.sub('[()::?“”《》,。!·、\d ]+', ' ', document)  # strip punctuation
        document_cut = jieba.cut(document)  # segment with jieba
        result = ' '.join(document_cut)
        with open('segement.txt', 'w') as f2:
            f2.write(result)
        print('Segmentation finished')

Training a word2vec model with gensim

Installing gensim is easy: "pip install gensim" is enough. Note, however, that gensim has requirements on the numpy version, so the installation may silently upgrade your numpy. On Windows, installing or upgrading numpy directly this way can be problematic; in that case uninstall numpy and reinstall an mkl build of numpy that satisfies gensim's requirements.
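
A quick sanity check after installation is simply to import both libraries and print their versions (nothing gensim-specific here, just confirming the environment is consistent):

import numpy
import gensim

print('numpy version:', numpy.__version__)
print('gensim version:', gensim.__version__)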
 
In gensim, the word2vec API lives in the package gensim.models.word2vec, and the algorithm's parameters are exposed by the class gensim.models.word2vec.Word2Vec; reading the official documentation is recommended.
The notable parameters are:

  • sentences: the corpus to analyze. It can be a list of tokenized sentences or an iterator that reads from a file; in the examples below we read from a file.

  • size: the dimensionality of the word vectors, default 100. A suitable value depends on the size of the corpus: for a small corpus, say under 100MB of text, the default is usually fine, while a larger corpus warrants a higher dimensionality.

  • window: the maximum distance between the target word and its context words, written as c in the write-up on the algorithm's principles. The larger the window, the more distant words count as context. The default is 5; in practice the window can be adjusted to the task, smaller for a small corpus, and a value in [5,10] is recommended for ordinary corpora.

  • sg: selects between the two word2vec models: 0 selects CBOW (the default), 1 selects Skip-Gram.

  • hs: selects between the two training objectives: 0 selects Negative Sampling (the default, provided negative > 0), 1 selects Hierarchical Softmax.

  • negative: the number of negative samples drawn when using Negative Sampling, default 5, recommended in [3,10]. Written as neg in the write-up on the algorithm's principles.

  • cbow_mean: only affects the CBOW projection. With 0, the projection x_w is the sum of the context word vectors; with 1 (the default) it is their average, which is how the principles write-up describes it. Keeping the default is recommended.

  • min_count: the minimum frequency a word must have to receive a vector. This filters out very rare words; the default is 5 and can be lowered for a small corpus.

  • iter: the maximum number of stochastic-gradient-descent iterations, default 5. It can be increased for a large corpus.

  • alpha: the initial learning rate of stochastic gradient descent, written as η in the principles write-up, default 0.025.

  • min_alpha: the minimum learning rate, since the algorithm gradually decays the step size during training; the learning rate of each iteration is derived from alpha, min_alpha, and iter together. This is not core to the word2vec algorithm, so it was not covered in the principles write-up. For a large corpus, alpha, min_alpha, and iter should be tuned jointly to pick suitable values (see the sketch after this list).
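
Below is a rough sketch of how several of these parameters fit together. The values are illustrative rather than tuned for this corpus, and segement.txt is assumed to be the segmented file produced earlier:

from gensim.models import word2vec

# Illustrative only: Skip-Gram with negative sampling, using several of the
# parameters described above (pre-4.0 gensim keyword names, matching the rest
# of this post).
sentences = word2vec.LineSentence('segement.txt')
model = word2vec.Word2Vec(
    sentences,
    size=100,          # dimensionality of the word vectors
    window=5,          # maximum context distance
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    hs=0,              # 0 = Negative Sampling
    negative=5,        # number of negative samples
    min_count=5,       # ignore words that appear fewer than 5 times
    iter=5,            # number of SGD passes over the corpus
    alpha=0.025,       # initial learning rate
    min_alpha=0.0001,  # final learning rate
)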
      
With the corpus preprocessed, train the model:

 sentences = word2vec.LineSentence(sentence_path)
 model = word2vec.Word2Vec(sentences, hs=1,min_count=1,window=3,size=100)

Save the model:

# Save the model for reuse.
model.save("test_01.model")
# Not readable in a text editor, but all training state is kept,
# so the model can be reloaded and trained further.

model.wv.save_word2vec_format('test_01.model.txt', binary=False)
# Saves the model in text format, but the vocabulary tree and other
# training state are lost, so further training is not possible.
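
The text-format vectors saved above can later be reloaded for querying; here is a minimal sketch using gensim's KeyedVectors (available since gensim 1.0; this loads only the vectors, so no further training is possible):

from gensim.models import KeyedVectors

# Load the plain-text vectors written by save_word2vec_format.
wv = KeyedVectors.load_word2vec_format('test_01.model.txt', binary=False)
print(wv.most_similar('周瑜'))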

Additional training:

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
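
On newer gensim releases (1.0 and later), train() requires the corpus size and number of epochs to be passed explicitly, and the vocabulary must be extended before training on new text. A sketch under that assumption, where more_sentences is an iterable of tokenized sentences:

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.build_vocab(more_sentences, update=True)  # add any new words to the vocabulary
model.train(more_sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)  # on some older versions the attribute is model.iter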

Load Model:

model = gensim.models.Word2Vec.load('/tmp/mymodel')

Using the model: analyzing character relationships with word vectors

  • To find the words most similar to a given word:
    model.most_similar()

Example:
Let's see what the word vectors trained on the Romance of the Three Kingdoms corpus have learned about the latent relationships between its heroes:

First, test my favorite character, 周瑜:
print('Nearest 周瑜:',model.most_similar('周瑜'))

The learned result is quite satisfying:

Nearest 周瑜:
 [('孙策', 0.6876850128173828), ('孔明', 0.6875529289245605), ('司马懿', 0.6733481287956238), ('孟获', 0.6705329418182373), ('先主', 0.6662196516990662), ('鲁肃', 0.6605409383773804), ('孙权', 0.6458742022514343), ('孙夫人', 0.643887996673584), ('姜维', 0.6326993703842163), ('有人', 0.6321758031845093)]

Ranked first, fittingly, is 孙策, Zhou Yu's close friend since childhood and the other of the "twin jades of Jiangdong". In history the two befriended each other when young, were kindred spirits, pacified Jiangdong together in troubled times, and achieved great deeds. They also married the famous pair of beauties 大乔 and 小乔, a story passed down ever since.

Second is 孔明, the other half of "既生瑜,何生亮". To make Zhuge Liang's endless clever schemes stand out, the novel blackens Zhou Yu into a narrow-minded villain. At the battle of Chibi the two, as the chief strategists of the Sun and Liu camps, contributed the most, so a close association between them is inevitable.

Third is 司马懿, the strategist with the highest intellect and political skill in later-period Cao Wei, and its commander-in-chief.

Further down, 鲁肃, 孙权, and 孙夫人 are also among the characters with the closest ties to Zhou Yu.

The model also learned some other results:

Nearest 刘备:, [('东吴', 0.7638486623764038), ('袁绍', 0.6992679238319397), ('刘表', 0.6835019588470459), ('吴侯', 0.6756551265716553), ('司马懿', 0.6602287888526917), ('曹', 0.6518967747688293), ('曹操', 0.6457493305206299), ('刘玄德', 0.6447073817253113), ('蜀', 0.6380304098129272), ('诸葛亮', 0.6250388026237488)]
Nearest 曹操: [('袁绍', 0.6900763511657715), ('刘备', 0.6457493901252747), ('孙策', 0.6446478962898254), ('司马懿', 0.6381756067276001), ('吴侯', 0.6193397641181946), ('孙权', 0.6192417144775391), ('蜀', 0.6191484928131104), ('周瑜', 0.6183933019638062), ('东吴', 0.6114454865455627), ('马超', 0.5959264039993286)]
Nearest 孙策: [('姜维', 0.6926037073135376), ('周瑜', 0.687684953212738), ('邓艾', 0.687220573425293), ('孙坚', 0.6793218851089478), ('司马懿', 0.6556568741798401), ('钟会', 0.6528347730636597), ('郭淮', 0.6527595520019531), ('孔明自', 0.6470344066619873), ('曹操', 0.6446478962898254), ('王平', 0.6399298906326294)]
Nearest 貂蝉: [('卓', 0.7048295140266418), ('允', 0.6404716968536377), ('身', 0.6323765516281128), ('妾', 0.6265878677368164), ('瑜', 0.6257222890853882), ('吴', 0.6242125034332275), ('父', 0.6216113567352295), ('众官', 0.6189900636672974), ('后主', 0.6172502636909485), ('干', 0.6154900789260864)]
Nearest 诸葛亮: [('亮', 0.7160214185714722), ('贤弟', 0.7146532535552979), ('子敬', 0.6765022277832031), ('此人', 0.6603602766990662), ('表曰', 0.6592696905136108), ('既', 0.6532598733901978), ('奈何', 0.6503086090087891), ('大王', 0.6495622992515564), ('吾主', 0.6492528915405273), ('玄德问', 0.6449695825576782)]
  • Pick the word that does not belong in a set, for example:
    print(model.wv.doesnt_match(u"周瑜 鲁肃 吕蒙 陆逊 诸葛亮".split()))

周瑜, 鲁肃, 吕蒙, and 陆逊 all served as commanders-in-chief of Eastern Wu, while only 诸葛亮 was the prime minister of Shu; the output is indeed 诸葛亮, so the learned vectors capture this relationship.

Similarly:

print(model.wv.doesnt_match(u"曹操 刘备 孙权 关羽".split())) #关羽不是主公
print(model.wv.doesnt_match(u"关羽 张飞 赵云 黄忠 马超 典韦".split())) #典韦不是蜀国五虎将
print(model.wv.doesnt_match(u"诸葛亮 贾诩  张昭 马超".split()))#马超是唯一武将

Of course, not every relationship can be learned; the results also depend on the quantity and quality of the training corpus. The novel is only about 1.8MB, so it cannot capture finer-grained relationships between more entities.

The complete code

#!/usr/bin/env python
# encoding: utf-8
'''
@author: MrYx
@github: https://github.com/MrYxJ
'''

import jieba
import jieba.analyse
from gensim.models import word2vec
import requests
from bs4 import BeautifulSoup
import re

def crawl(url = None):
    """
     Crawl《三国演义》from http://www.purepen.com/sgyy/ and save it to a local txt file
    :param url:
    :return:
    """
    print('Crawling the novel, please wait...')
    url  = 'http://www.purepen.com/sgyy/'
    contents = ''
    for num in range(1,121):
        num_str = str(num)
        if len(num_str) == 1 : num_str = '00'+ num_str
        if len(num_str) == 2 : num_str = '0'+ num_str
        urls = url + num_str + '.htm'
        html = requests.get(urls)
        html.encoding = html.apparent_encoding # make sure the Chinese text is decoded correctly
        soup =  BeautifulSoup(html.text,'lxml')
        title  = soup.find(align = 'center').text
        contents += title
        content = soup.find(face = '宋体').text
        contents += content

    with open('三国演义.txt','w') as f:
        f.write(contents)


def segment(document_path= '三国演义.txt'):
    """

    :param document_path:
    :return:
    """
    with open(document_path) as f:
        document = f.read()
        document = re.sub('[()::?“”《》,。!·、\d ]+', ' ', document)  # strip punctuation
        document_cut = jieba.cut(document)  # segment with jieba
        result = ' '.join(document_cut)
        with open('segement.txt', 'w') as f2:
            f2.write(result)
        print('Segmentation finished')


def train_model(sentence_path ,model_path):
    sentences = word2vec.LineSentence(sentence_path)
    model = word2vec.Word2Vec(sentences, hs=1,min_count=1,window=3,size=100)
    print('Word2Vec training finished!')
    model.save(model_path)

def analyse_wordVector(model_path):
    model =  word2vec.Word2Vec.load(model_path)
    print('Nearest 周瑜:',model.most_similar('周瑜'))
    print('Nearest 刘备:,',model.most_similar(['刘备']))
    print('Nearest 曹操:',model.most_similar(['曹操']))
    print('Nearest 孙策:',model.most_similar(['孙策']))
    print('Nearest 貂蝉:',model.most_similar(['貂蝉']))
    print('Nearest 诸葛亮:', model.most_similar(['诸葛亮']))
    print(model.wv.doesnt_match(u"周瑜 鲁肃 吕蒙 陆逊 诸葛亮".split()))
    print(model.wv.doesnt_match(u"曹操 刘备 孙权 关羽".split()))
    print(model.wv.doesnt_match(u"关羽 张飞 赵云 黄忠 马超 典韦".split()))
    print(model.wv.doesnt_match(u"诸葛亮 贾诩  张昭 马超".split()))
    print(model.wv.similarity('周瑜','孙策'))
    print(model.wv.similarity('周瑜','小乔'))
    print(model.wv.similarity('吕布', '貂蝉'))


if __name__ == '__main__':
    crawl()
    segment()
    train_model('segement.txt','model1')
    analyse_wordVector('model1')
