NLP Learning (8): Text vectorization with word2vec and a worked example in Python 3

Vectorization algorithms: word2vec

Bag of words model

The bag-of-words model is the earliest text vectorization method; it takes the word as the basic processing unit (a minimal Python sketch follows the list below):

  1. Build a dictionary from the words that appear in the corpus (each word gets a unique index)
  2. Count the frequency of each word to form a vector
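
To make this concrete, here is a minimal bag-of-words sketch in plain Python; the two toy sentences are invented purely for illustration:

docs = [['我', '喜欢', '自然', '语言', '处理'],
        ['语言', '模型', '和', '词', '向量']]

# 1. Build a dictionary: every distinct word gets a unique index
vocab = {}
for doc in docs:
    for w in doc:
        vocab.setdefault(w, len(vocab))

# 2. Count word frequencies to form one vector per document
vectors = []
for doc in docs:
    vec = [0] * len(vocab)
    for w in doc:
        vec[vocab[w]] += 1
    vectors.append(vec)

print(vectors)   # each document is now a |vocab|-dimensional count vector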

Problems

  1. The curse of dimensionality (one dimension per vocabulary word)
  2. Word order information is lost
  3. A semantic gap: the representation says nothing about how similar two words are

Neural Network Language Model (NNLM)

Features
Unlike traditional, count-based methods of estimating n-gram probabilities, NNLM estimates the conditional probability of an n-gram directly through a neural network.
Basic structure

General operation
Collect a set of text sequences of length n from the corpus. If D denotes this set of length-n sequences, the objective function of NNLM is to maximize
∑ log P(w_i | w_{i-(n-1)}, ..., w_{i-1})
where the sum runs over every length-n sequence w_{i-(n-1)}, ..., w_i in D.

Network model

  1. Input layer: each word is represented by a low-dimensional, dense word vector e(w); the word vectors of the n−1 history words are spliced together in order:
    x = [e(w_{i-(n-1)}); e(w_{i-(n-2)}); ...; e(w_{i-1})]

  2. Feed the resulting x into the hidden layer to obtain h:
    h = tanh(b + Hx)
    where H is the weight matrix from the input layer to the hidden layer, with dimension |h| × (n−1)|e|.

  3. Connect the hidden layer h to the output layer to obtain y:
    y = b + Uh
    where U is the weight matrix from the hidden layer to the output layer, with dimension |V| × |h|, and |V| is the size of the vocabulary.

  4. Normalize the output layer: a softmax function is applied to the output layer to convert y into the corresponding probability values:
    P(w_i | w_{i-(n-1)}, ..., w_{i-1}) = exp(y_{w_i}) / ∑_k exp(y_k), with k running over the |V| vocabulary words

  5. Training method: train with stochastic gradient descent; for each batch, a number of samples are drawn at random from the corpus D and the parameters are updated by:
    θ ← θ + α · ∂ log P(w_i | w_{i-(n-1)}, ..., w_{i-1}) / ∂θ
    where α is the learning rate and θ denotes all parameters of the model. (A numpy sketch of the full forward pass follows this list.)
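
The forward pass described above can be sketched in a few lines of numpy. This is only a toy illustration: the vocabulary size, dimensions and random weights are made up, and no training is performed.

import numpy as np

rng = np.random.default_rng(0)
V, e_dim, h_dim, n = 10, 4, 8, 3                 # |V|, |e|, |h|, n-gram length
E = rng.normal(size=(V, e_dim))                  # word-vector (embedding) matrix
H = rng.normal(size=(h_dim, (n - 1) * e_dim))    # input -> hidden weights
b_h = np.zeros(h_dim)
U = rng.normal(size=(V, h_dim))                  # hidden -> output weights
b_o = np.zeros(V)

context = [2, 5]                                 # indices of the n-1 history words
x = np.concatenate([E[w] for w in context])      # splice the word vectors in order
h = np.tanh(b_h + H @ x)                         # h = tanh(b + Hx)
y = b_o + U @ h                                  # y = b + Uh
p = np.exp(y - y.max()); p /= p.sum()            # softmax -> P(w_i | history)
print(p.argmax(), round(float(p.max()), 4))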

C&W model

Features
The goal of NNLM is to build a language (probability) model, whereas the goal of C&W is to generate word vectors directly. Its core mechanism is scoring: if an n-gram appears in the corpus, the phrase is given a high score; if it does not, it is given a low score.
Model structure
(figure omitted: C&W model structure)
Objective function
∑_{(w,c)∈D} ∑_{w'∈V} max(0, 1 − score(w, c) + score(w', c))
where (w, c) is a positive sample: an n-gram extracted from the corpus (n is odd), w is the target (middle) word and c is its context; w' is a word selected at random from the dictionary, so (w', c) is a negative sample.
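
The scoring idea can be illustrated with a hinge-style ranking loss; the numbers below are invented, and score_pos / score_neg stand in for the outputs of the scoring network:

def cw_ranking_loss(score_pos, score_neg):
    # push the corpus n-gram at least 1 point above the corrupted one
    return max(0.0, 1.0 - score_pos + score_neg)

print(cw_ranking_loss(2.5, 0.3))   # 0.0 -> the positive sample already wins by a margin
print(cw_ranking_loss(0.2, 0.4))   # 1.2 -> the corrupted n-gram scores too high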

CBOW model and Skip-gram model

CBOW model

Introduction
CBOW (Continuous Bag-of-Words) uses the middle word of a text window as the target word. Compared with NNLM it removes the hidden layer, which greatly improves running speed.
Model structure
(figure omitted: CBOW model structure)

CBOW's conditional probability formula for the target word
P(w | context(w)) = exp(e'(w) · x) / ∑_{w'∈V} exp(e'(w') · x)
where x is the average of the context word vectors and e'(w) is the output vector of word w.

CBOW objective function
L = ∑_{w∈C} log P(w | context(w)), where C is the corpus.
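
A minimal CBOW forward pass can be sketched in numpy, again with toy sizes and random, untrained weights:

import numpy as np

rng = np.random.default_rng(1)
V, dim = 12, 6
E_in = rng.normal(size=(V, dim))                  # input (context) word vectors
E_out = rng.normal(size=(V, dim))                 # output (target) word vectors

context = [3, 7, 8, 1]                            # indices of the words around the target
x = E_in[context].mean(axis=0)                    # average the context vectors (no hidden layer)
scores = E_out @ x                                # one score per vocabulary word
p = np.exp(scores - scores.max()); p /= p.sum()   # softmax -> P(w | context(w))
print(p.argmax())                                 # most probable target word under this toy model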

Skip-gram model

Introduction
Like CBOW, Skip-gram has no hidden layer. But where CBOW feeds in the averaged context word vectors to predict the middle word, Skip-gram works in the opposite direction: it uses the target word w to predict words selected from its context.
Model structure
(figure omitted: Skip-gram model structure)
Skip-gram objective function
L = ∑_{w∈C} ∑_{c∈context(w)} log P(c | w)
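
In gensim, the only difference between the two models is the sg switch. Below is a toy run on two invented sentences; parameter names follow gensim 3.x, matching the training script in section 1.2 (gensim 4.x renames size to vector_size):

from gensim.models import Word2Vec

sentences = [['我', '喜欢', '自然', '语言', '处理'],
             ['词', '向量', '表示', '自然', '语言']]

cbow = Word2Vec(sentences, sg=0, size=50, window=2, min_count=1)       # sg=0 -> CBOW
skip_gram = Word2Vec(sentences, sg=1, size=50, window=2, min_count=1)  # sg=1 -> Skip-gram
print(cbow.wv['自然'].shape, skip_gram.wv['自然'].shape)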

Hands-on practice: vectorizing web page text

Corpus: download from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2, or browse https://dumps.wikimedia.org/zhwiki/. The file contains only article titles and body text, without the links between entries, and is roughly 1.3 GB.

1. Word vector training
1.1 Chinese corpus preprocessing
Steps: extract plain text from the XML dump, convert Traditional Chinese to Simplified, then segment words with jieba.

# -*- coding: utf-8 -*-
from gensim.corpora import WikiCorpus
import jieba
from langconv import *   # langconv script: Traditional -> Simplified Chinese conversion

def my_function():
    space = ' '
    i = 0
    l = []
    zhwiki_name = './data/zhwiki-latest-pages-articles.xml.bz2'
    f = open('./data/reduce_zhiwiki.txt', 'w', encoding='utf-8')
    # Parse the bz2-compressed XML dump directly; each item is one article's text.
    # (Note: gensim 4.x removes the lemmatize argument.)
    wiki = WikiCorpus(zhwiki_name, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        for temp_sentence in text:
            # Convert Traditional Chinese to Simplified, then segment with jieba
            temp_sentence = Converter('zh-hans').convert(temp_sentence)
            seg_list = list(jieba.cut(temp_sentence))
            for temp_term in seg_list:
                l.append(temp_term)
        # One article per line, tokens separated by spaces
        f.write(space.join(l) + '\n')
        l = []
        i = i + 1

        if i % 200 == 0:
            print('Saved ' + str(i) + ' articles')
    f.close()

if __name__ == '__main__':
    my_function()

1.2 Use the gensim module to train word vectors

# -*- coding: utf-8 -*-
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def my_function():
    wiki_news = open('./data/reduce_zhiwiki.txt', 'r', encoding='utf-8')
    # sg=0 -> CBOW; size: vector dimensionality (vector_size in gensim 4.x);
    # window: context size; min_count: ignore rarer words; workers: training threads
    model = Word2Vec(LineSentence(wiki_news), sg=0, size=192, window=5, min_count=5, workers=9)
    model.save('zhiwiki_news.word2vec')
    wiki_news.close()

if __name__ == '__main__':
    my_function()
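
Once training finishes, the saved model can be sanity-checked by loading it back and asking for nearest neighbours; the query word below is just an arbitrary example and must exist in the model's vocabulary:

from gensim.models import Word2Vec

model = Word2Vec.load('zhiwiki_news.word2vec')
for word, score in model.wv.most_similar('数学', topn=5):
    print(word, score)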

2. Calculating web page similarity
2.1 Using word2vec to calculate web page similarity
Basic method: extract keywords from each text (TF-IDF keyword extraction from the jieba toolkit), look up the word vector of each keyword, sum these word vectors to obtain a single vector representing the whole text, and use that total vector to compute the similarity between texts.

# -*- coding: utf-8 -*-
import codecs
import numpy as np
import gensim
from jieba import analyse

wordvec_size = 192   # must match the dimensionality used when training the model

def keyword_extract(data, file_name):
    # TF-IDF keyword extraction from jieba
    tfidf = analyse.extract_tags
    keywords = tfidf(data)
    return keywords

def getKeywords(docpath, savepath):
    # Extract keywords line by line and write them out, one document per line
    with open(docpath, 'r', encoding='utf-8') as docf, open(savepath, 'w', encoding='utf-8') as outf:
        for data in docf:
            data = data.strip()
            keywords = keyword_extract(data, savepath)
            for word in keywords:
                outf.write(word + ' ')
            outf.write('\n')

def word2vec(file_name, model):
    # Sum the vectors of all keywords that exist in the model's vocabulary
    with codecs.open(file_name, 'r', encoding='utf-8') as f:
        word_vec_all = np.zeros(wordvec_size)
        for data in f:
            for word in data.split():
                if word in model.wv:
                    word_vec_all = word_vec_all + model.wv[word]
        return word_vec_all

def simlarityCalu(vector1, vector2):
    # Cosine similarity between the two document vectors
    vector1Mod = np.sqrt(vector1.dot(vector1))
    vector2Mod = np.sqrt(vector2.dot(vector2))
    if vector2Mod != 0 and vector1Mod != 0:
        simlarity = (vector1.dot(vector2)) / (vector1Mod * vector2Mod)
    else:
        simlarity = 0
    return simlarity

if __name__ == '__main__':
    # Path to the word2vec model saved in step 1.2
    model = gensim.models.Word2Vec.load('data/zhiwiki_news.word2vec')
    p1 = './data/P1.txt'
    p2 = './data/P2.txt'
    p1_keywords = './data/P1_keywords.txt'
    p2_keywords = './data/P2_keywords.txt'
    getKeywords(p1, p1_keywords)
    getKeywords(p2, p2_keywords)
    p1_vec = word2vec(p1_keywords, model)
    p2_vec = word2vec(p2_keywords, model)

    print(simlarityCalu(p1_vec, p2_vec))

Original article: blog.csdn.net/qq_30868737/article/details/108282496