Vectorization algorithm word2vec
Bag of words model
The earliest text vectorization method that uses words as the basic processing unit:
- Build a dictionary based on the words that appear (unique index)
- Count the frequency of each word to form a vector
Problems
- Curse of dimensionality: the vector has one dimension for every word in the dictionary
- Word order information is lost
- Semantic gap: the vectors say nothing about how close two words are in meaning
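The two steps above, and the word-order problem, can be seen in a small sketch. The function names `build_vocab` and `bow_vector` and the two example sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def build_vocab(docs):
    """Give every distinct word a unique index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bow_vector(doc, vocab):
    """Count how often each dictionary word occurs in doc."""
    counts = Counter(doc.split())
    vec = [0] * len(vocab)
    for word, idx in vocab.items():
        vec[idx] = counts[word]
    return vec


docs = ["the dog bites the man", "the man bites the dog"]
vocab = build_vocab(docs)
vectors = [bow_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Both sentences map to the same vector even though they mean different things, which is the loss of word order listed above.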
Neural Network Language Model (NNLM)
Features
Unlike traditional methods, which estimate $P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$ from n-gram counts, NNLM estimates the n-gram conditional probability directly with a neural network.
Basic structure
General operation
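Collect a series of text sequences of length $n$ from the corpus, $w_{i-(n-1)}, \ldots, w_{i-1}, w_i$. If $D$ denotes the set of these length-$n$ sequences, the objective function of NNLM, which training maximizes, is:
$$\sum_{D} \log P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$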
Network model
- Input layer: low-dimensional, dense word vectors. The vectors of the $n-1$ context words are concatenated in order:
  $x = [e(w_{i-(n-1)}); \ldots; e(w_{i-2}); e(w_{i-1})]$
- Hidden layer: feed $x$ into the hidden layer to obtain $h$:
  $h = \tanh(b^{(1)} + Hx)$
  where $H$ is the weight matrix from the input layer to the hidden layer, with dimension $|h| \times (n-1)|e|$, and $b^{(1)}$ is the hidden-layer bias.
- Output layer: connect $h$ to the output layer to obtain $y$:
  $y = b^{(2)} + Uh$
  where $U$ is the weight matrix from the hidden layer to the output layer, with dimension $|V| \times |h|$, $|V|$ is the size of the vocabulary, and $b^{(2)}$ is the output-layer bias.
- Normalization: apply the softmax function to the output $y$ to convert it into probabilities:
  $P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\exp(y(w_i))}{\sum_{k=1}^{|V|} \exp(y(v_k))}$
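The forward pass through the layers described above can be sketched in plain Python. The sizes, the random initialization, and the names `E`, `b1`, `b2`, `matvec`, and `nnlm_forward` are assumptions made for illustration; a real model would use a numerical library and trained weights.

```python
import math
import random

random.seed(0)

# Toy sizes (assumptions): vocabulary |V|, word vector size |e|, hidden size |h|, n-gram length n
V, e_dim, h_dim, n = 6, 3, 4, 3


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


E = rand_matrix(V, e_dim)                # word vector table, one row per word
H = rand_matrix(h_dim, (n - 1) * e_dim)  # input -> hidden, |h| x (n-1)|e|
U = rand_matrix(V, h_dim)                # hidden -> output, |V| x |h|
b1 = [0.0] * h_dim                       # hidden-layer bias
b2 = [0.0] * V                           # output-layer bias


def nnlm_forward(context_ids):
    """Return P(w | context) for every word w, given the n-1 context word ids."""
    x = [val for i in context_ids for val in E[i]]            # concatenate word vectors in order
    h = [math.tanh(b + a) for b, a in zip(b1, matvec(H, x))]  # h = tanh(b + Hx)
    y = [b + a for b, a in zip(b2, matvec(U, h))]             # y = b + Uh
    m = max(y)                                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [v / total for v in exps]                          # softmax


probs = nnlm_forward([1, 4])
print(probs)
```

The output has one probability per vocabulary word, and the probabilities sum to 1.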
Training method
Train with stochastic gradient descent. For each batch, randomly select a number of samples from the corpus $D$. The update rule is:
$$\theta \leftarrow \theta + \alpha \frac{\partial \log P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})}{\partial \theta}$$
where $\alpha$ is the learning rate and $\theta$ denotes all the parameters of the model.
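To make the update concrete, here is a minimal sketch of one step on a single sample. As a simplifying assumption it updates only the output-layer parameters $U$ and the output bias, and treats the hidden vector `h` as fixed; a full implementation would also propagate the gradient back to $H$ and the word vectors. All names and numbers are illustrative.

```python
import math


def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [v / total for v in exps]


def output_scores(U, b, h):
    # y = b + Uh
    return [bk + sum(u * x for u, x in zip(row, h)) for row, bk in zip(U, b)]


def log_prob(U, b, h, target):
    return math.log(softmax(output_scores(U, b, h))[target])


def sgd_step(U, b, h, target, alpha):
    """One step theta <- theta + alpha * d log P / d theta, for theta = (U, b) only."""
    p = softmax(output_scores(U, b, h))
    for k in range(len(b)):
        grad_y = (1.0 if k == target else 0.0) - p[k]  # d log P(target) / d y_k
        b[k] += alpha * grad_y
        for j in range(len(h)):
            U[k][j] += alpha * grad_y * h[j]


h = [0.5, -0.2]                             # hidden vector of one sample (toy values)
U = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.2]]   # 3-word vocabulary
b = [0.0, 0.0, 0.0]

before = log_prob(U, b, h, target=2)
sgd_step(U, b, h, target=2, alpha=0.1)
after = log_prob(U, b, h, target=2)
print(before, after)
```

The step adds the gradient because the objective is being maximized, so the log-probability of the observed word goes up after the update.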
C&W model
Features
The goal of NNLM is to build a probabilistic language model, whereas the goal of C&W is to generate word vectors.
Core mechanism: if an n-gram phrase has appeared in the corpus, the model gives it a high score; if it has not, the model gives it a lower score.
Model structure
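Objective function
$$\sum_{(w,c) \in D} \sum_{w' \in V} \max(0,\ 1 - \mathrm{score}(w,c) + \mathrm{score}(w',c))$$
where $(w,c)$ is a positive sample, an n-gram phrase extracted from the corpus; $n$ is odd, so that the target word $w$ has the same number of context words on each side; $c$ is the context of the target word; $w'$ is a word randomly selected from the dictionary; and $(w',c)$ is a negative sample. Training minimizes this sum.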
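The ranking idea behind this objective can be sketched with a pairwise hinge loss: a positive phrase should outscore a corrupted phrase by a margin of at least 1. The scoring function below is a hypothetical lookup table standing in for the C&W network, and skipping $w' = w$ in the sum is an assumption of this sketch.

```python
def pairwise_hinge(score_pos, score_neg, margin=1.0):
    """Zero once the positive sample outscores the negative one by at least `margin`."""
    return max(0.0, margin - score_pos + score_neg)


def cw_loss(score, target, context, vocab):
    """Sum the hinge loss over every replacement word w' (the target itself is skipped)."""
    pos = score(target, context)
    return sum(pairwise_hinge(pos, score(w, context)) for w in vocab if w != target)


# Hypothetical scores; a real model would compute them from word vectors.
toy_scores = {"sat": 2.5, "ran": 1.8, "blue": -0.5}


def toy_score(word, context):
    return toy_scores[word]


context = ("the", "cat", "on", "the")
loss = cw_loss(toy_score, "sat", context, list(toy_scores))
print(loss)
```

Here "blue" is already more than the margin below "sat" and contributes nothing, while "ran" is only 0.7 below and contributes the remaining 0.3.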
CBOW model and Skip-gram model
CBOW model
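Introduction
Uses the middle word of a text window as the target word.
Removes the hidden layer, which improves running speed.
Model structure
CBOW's conditional probability formula for the target word
$$P(w \mid c) = \frac{\exp(e'(w)^{\top} x)}{\sum_{w' \in V} \exp(e'(w')^{\top} x)}, \qquad x = \frac{1}{n-1} \sum_{w_j \in c} e(w_j)$$
where $e(w_j)$ is the input word vector of context word $w_j$ and $e'(w)$ is the output vector of word $w$.
CBOW objective function
$$\max \sum_{(w,c) \in D} \log P(w \mid c)$$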
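A CBOW-style conditional probability can be sketched as a softmax over dot products between each word's output vector and the averaged context vector. The two small tables `E_in` and `E_out` and the function name `cbow_prob` are illustrative assumptions, not trained values or a library API.

```python
import math


def cbow_prob(context_ids, E_in, E_out):
    """P(w | c) for every word w: softmax of e'(w) . x, with x the mean context vector."""
    dim = len(E_in[0])
    x = [sum(E_in[i][d] for i in context_ids) / len(context_ids) for d in range(dim)]
    scores = [sum(a * b for a, b in zip(row, x)) for row in E_out]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


# Toy 2-dimensional vectors for a 4-word vocabulary (made-up numbers)
E_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]   # input vectors e(w)
E_out = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0], [2.0, 0.0]]  # output vectors e'(w)

probs = cbow_prob([0, 1], E_in, E_out)
print(probs)
```

With context words 0 and 1 the averaged vector is (0.5, 0.5), so words 2 and 3 tie for the highest probability. The objective would then sum the log of the probability assigned to each observed target word.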
Skip-gram model
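Introduction
Like CBOW, Skip-gram has no hidden layer. The difference is the input: CBOW feeds in the average of the context word vectors, whereas Skip-gram selects one word from the context of the target word $w$ and uses its word vector as the context representation.
Model structure
Skip-gram objective function
$$\max \sum_{(w,c) \in D} \sum_{w_j \in c} \log P(w \mid w_j)$$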
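One way to picture Skip-gram training data is as a list of (context word, target word) pairs, one for each word that falls inside the window around a target. The helper below is a sketch under that reading; the name `skipgram_pairs` and the symmetric `window` parameter are assumptions, and real implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Pair every target word with each single word in its surrounding window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((tokens[j], target))  # (context word w_j, target word w)
    return pairs


pairs = skipgram_pairs(["a", "b", "c", "d"], window=1)
print(pairs)
```

Each pair would contribute one log-probability term to the objective, whereas CBOW would merge all context words of a target into a single averaged input.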
Hands-on: vectorizing web page text
Corpus: download from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2, or browse https://dumps.wikimedia.org/zhwiki/. This file contains only article titles and body text, without the link information between entries, and is about 1.3 GB in size.
1. Word vector training
1.1 Chinese corpus preprocessing
Convert the XML dump to plain text, convert Traditional Chinese to Simplified Chinese, and segment the text into words with jieba.
# -*- coding: utf-8 -*-
# Convert the zhwiki XML dump into one line of space-separated words per article.
from gensim.corpora import WikiCorpus
import jieba
from langconv import Converter  # langconv.py: third-party Traditional/Simplified conversion script


def my_function():
    zhwiki_name = './data/zhwiki-latest-pages-articles.xml.bz2'
    converter = Converter('zh-hans')  # target script: Simplified Chinese
    # lemmatize=False applies to gensim < 4.0; gensim 4.x no longer accepts it, so drop it there.
    wiki = WikiCorpus(zhwiki_name, lemmatize=False, dictionary={})
    with open('./data/reduce_zhiwiki.txt', 'w', encoding='utf-8') as f:
        # get_texts() yields each article as a list of token strings
        for i, text in enumerate(wiki.get_texts(), start=1):
            words = []
            for temp_sentence in text:
                temp_sentence = converter.convert(temp_sentence)
                words.extend(jieba.cut(temp_sentence))
            f.write(' '.join(words) + '\n')
            if i % 200 == 0:
                print('Saved ' + str(i) + ' articles')


if __name__ == '__main__':
    my_function()
1.2 Train word vectors with the gensim module
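# -*- coding: utf-8 -*-
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def my_function():
    # LineSentence reads one sentence per line, with words separated by whitespace
    sentences = LineSentence('./data/reduce_zhiwiki.txt')
    # sg=0 selects CBOW. The dimension argument is `size` in gensim < 4.0
    # and `vector_size` in gensim 4.x.
    model = Word2Vec(sentences, sg=0, size=192, window=5, min_count=5, workers=9)
    # Save under ./data, where the similarity script loads the model from
    model.save('./data/zhiwiki_news.word2vec')


if __name__ == '__main__':
    my_function()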
2. Calculate webpage similarity
2.1 Basic method for computing web page similarity with word2vec
Extract the keywords of the text (using the TF-IDF keyword extraction in the jieba toolkit), look up the word vector of each keyword, and add the vectors together. The sum serves as the vector representation of the text, and text similarity is computed between these summed vectors.
# -*- coding: utf-8 -*-
import gensim
import numpy as np
from jieba import analyse


def keyword_extract(data):
    # TF-IDF keyword extraction from jieba (top 20 keywords by default)
    tfidf = analyse.extract_tags
    keywords = tfidf(data)
    return keywords


def getKeywords(docpath, savepath):
    # Write the keywords of each input line as one space-separated output line
    with open(docpath, 'r', encoding='utf-8') as docf, open(savepath, 'w', encoding='utf-8') as outf:
        for data in docf:
            data = data.strip()
            keywords = keyword_extract(data)
            outf.write(' '.join(keywords) + '\n')


def word2vec(file_name, model):
    # Sum the vectors of all keywords in the file; out-of-vocabulary words are skipped
    word_vec_all = np.zeros(model.wv.vector_size)
    with open(file_name, 'r', encoding='utf-8') as f:
        for data in f:
            for word in data.split():
                if word in model.wv:
                    word_vec_all = word_vec_all + model.wv[word]
    return word_vec_all


def simlarityCalu(vector1, vector2):
    # Cosine similarity; defined as 0 when either vector is all zeros
    vector1Mod = np.sqrt(vector1.dot(vector1))
    vector2Mod = np.sqrt(vector2.dot(vector2))
    if vector2Mod != 0 and vector1Mod != 0:
        simlarity = (vector1.dot(vector2)) / (vector1Mod * vector2Mod)
    else:
        simlarity = 0
    return simlarity


if __name__ == '__main__':
    model = gensim.models.Word2Vec.load('data/zhiwiki_news.word2vec')
    p1 = './data/P1.txt'
    p2 = './data/P2.txt'
    p1_keywords = './data/P1_keywords.txt'
    p2_keywords = './data/P2_keywords.txt'
    getKeywords(p1, p1_keywords)
    getKeywords(p2, p2_keywords)
    p1_vec = word2vec(p1_keywords, model)
    p2_vec = word2vec(p2_keywords, model)
    print(simlarityCalu(p1_vec, p2_vec))
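The sum-then-cosine idea can be checked without a trained model. The sketch below uses a hypothetical table of 2-dimensional vectors in place of word2vec output, and the helper names `text_vector` and `cosine` are made up for illustration.

```python
import math

# Hypothetical 2-d word vectors, not taken from a trained model
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "bond": [0.1, 0.9],
}


def text_vector(keywords, vectors):
    """Sum the vectors of the keywords that have one; unknown words are skipped."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in keywords:
        if word in vectors:
            total = [t + v for t, v in zip(total, vectors[word])]
    return total


def cosine(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # same convention as the script above: no direction, no similarity
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


pets = text_vector(["cat", "dog", "unknown"], toy_vectors)
finance = text_vector(["stock", "bond"], toy_vectors)
print(cosine(pets, pets), cosine(pets, finance))
```

Texts whose keywords point in similar directions score close to 1, while the two unrelated keyword sets score close to 0. Because cosine similarity ignores vector length, summing and averaging the keyword vectors give the same result.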