On word2vec in natural language processing

Overview

Whether in artificial intelligence or data science, the core is mathematics. In machine learning, finding ways to turn everything in the world into numbers, so that problems can be solved with mathematical methods, is particularly important. Images are usually fed in as pixel values; how, then, should text be turned into numbers?

Text representation

Bag-of-words model

The most basic text representation is the bag-of-words model: each document is treated as a bag of words, ignoring the order in which the words appear. Each document can then be represented as a long vector in which every dimension corresponds to one word, and the weight of that dimension reflects how important the word is to the document. TF-IDF is commonly used to compute these weights.
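
As a quick illustration (my own addition, not from the original post), here is a minimal sketch of bag-of-words vectors with TF-IDF weights, assuming scikit-learn is available:

# Minimal bag-of-words / TF-IDF sketch (assumes scikit-learn; illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["i like dog", "i like cat", "dog is animal"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # shape: (num_documents, vocabulary_size)
print(vectorizer.get_feature_names_out())   # each dimension of the vector corresponds to one word
print(X.toarray())                          # TF-IDF weight of each word in each document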

N-gram model

The bag-of-words model above has an obvious problem. If the phrase "natural language processing" is split into three separate words, the meaning of the three words appearing independently is completely different from the meaning of the phrase appearing as a whole. Splitting a document word by word therefore loses much of the association information between words. To address this, phrases (N-grams) of n consecutive words (n <= N) can also be put into the vector representation as single features, giving the N-gram model; a small sketch follows below. One more note: in practice we usually apply word stemming, which maps the different inflected forms of a word onto a single stem.
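
For instance, a simple sketch (mine, not from the original post) of extracting the N-grams of a sentence, so that "natural language processing" can survive as a single feature:

# Extract all n-grams (n <= N) from a tokenized sentence; illustrative sketch only.
def ngrams(tokens, N=3):
    feats = []
    for n in range(1, N + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

print(ngrams("i like natural language processing".split()))
# ... 'natural language', 'language processing', ..., 'natural language processing'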

Topic model

This is another classic model from the early days of natural language processing. It is used to discover representative topics in a corpus and to obtain, for each topic, a distribution over words.
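
As an illustration (not part of the original post), a minimal topic-model sketch with gensim's LdaModel, assuming gensim is installed:

# Minimal LDA sketch with gensim: each topic is a weighted distribution over words.
from gensim import corpora
from gensim.models import LdaModel

texts = [doc.split() for doc in ["i like dog", "i like cat", "dog is animal"]]
dictionary = corpora.Dictionary(texts)            # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words representation
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())                         # word distribution of each topic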

Word embedding and deep learning model

Word embedding is the general term for models that vectorize words. The core idea is to map each word to a dense vector in a low-dimensional space (typically around 50 to 300 dimensions). Each dimension can also be regarded as an implicit topic, although these dimensions are not as interpretable as the topics in a topic model (which has a rigorous probabilistic derivation).

Word2vec

In 2013, Google proposed the Word2vec model, which has since been widely adopted and is still one of the most commonly used word embedding models. It is a shallow neural network with two architectures: Continuous Bag of Words (CBOW) and Skip-gram. Many other word embedding models have appeared after word2vec, which I will introduce in a later blog post. This post covers only word2vec and its principles.

Principle

Both CBOW and Skip-gram are shallow neural networks. Each word in the input layer is represented by a one-hot encoding. CBOW uses the surrounding words to predict the current word, while Skip-gram uses the current word to predict the surrounding words. After passing through the shallow network, the output layer is again an N-dimensional vector, and a softmax is applied to it to compute the generation probability of each word.
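
To make the two setups concrete, here is a small sketch (my own illustration, window size 1) of the training samples each architecture would see:

# CBOW predicts the center word from its context; Skip-gram predicts each
# context word from the center word. Window size 1; illustrative only.
sentence = ["i", "like", "natural", "language", "processing"]
for i in range(1, len(sentence) - 1):
    context = [sentence[i - 1], sentence[i + 1]]
    center = sentence[i]
    print("CBOW     :", context, "->", center)
    for c in context:
        print("Skip-gram:", center, "->", c)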

The difference and connection between word2vec and LDA

LDA uses the co-occurrence of words within documents to cluster words by topic; it can be understood as factorizing the "document-word" matrix into two probability distributions, "document-topic" and "topic-word". Word2vec, in contrast, learns from a "context-word" matrix, where the context consists of the few words surrounding a target word, so the resulting word vectors capture more contextual co-occurrence features. The biggest difference between topic models and word embedding methods, however, lies in the models themselves. A topic model is a generative model based on a probabilistic graphical model; its likelihood function can be written as a product of conditional probabilities that include latent variables (the topics). A word embedding model is generally expressed as a neural network; its likelihood is defined on the network's output, and the network weights must be learned to obtain the dense vector representation of each word.

Implementing word2vec in PyTorch

First, import the necessary packages:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

dtype = torch.FloatTensor

Next, construct a simple data set and build a vocabulary dictionary over all of its words.

sentences = [ "i like dog", "i like cat", "i like animal",
              "dog is animal", "cat is animal","dog like apple", "cat like fish",
              "dog like milk", "i like apple", "i hate apple",
              "i like movie", "i like book","i like music","cat hate dog", "cat like dog"]

word_sequence = " ".join(sentences).split()              # full token sequence of the corpus
word_list = list(set(word_sequence))                     # vocabulary (unique words)
word_dict = {w: i for i, w in enumerate(word_list)}      # word -> index

Next, set some basic parameters and define a function that samples a random training batch:

batch_size = 20
embedding_size = 5          # dimensionality of the word vectors
voc_size = len(word_list)   # vocabulary size

def random_batch(data, size):
    random_inputs = []
    random_labels = []
    # sample `size` (center, context) pairs without replacement
    random_index = np.random.choice(range(len(data)), size, replace=False)
    for i in random_index:
        random_inputs.append(np.eye(voc_size)[data[i][0]])  # one-hot vector of the center word
        random_labels.append(data[i][1])                    # index of the context word
    return random_inputs, random_labels

Now build the skip-gram training pairs with a window of size 1: for each word, the previous word and the next word are taken as its context.

skip_grams = []
for i in range(1, len(word_sequence) - 1):
    target = word_dict[word_sequence[i]]   # center word
    context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]  # its two neighbours
    for w in context:
        skip_grams.append([target, w])     # one (center, context) pair per neighbour
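
As a quick sanity check (my own addition, not in the original post), the first few pairs can be printed back as words:

# Map indices back to words to inspect the first (center, context) pairs.
id_to_word = {i: w for w, i in word_dict.items()}
for target, context_word in skip_grams[:6]:
    print(id_to_word[target], '->', id_to_word[context_word])
# e.g. like -> i, like -> dog, dog -> like, dog -> i, i -> dog, i -> like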

Next, build the model. We need two weight matrices for the shallow network: an input matrix that maps the one-hot input to the hidden (embedding) layer, and an output matrix that maps the hidden layer back to a vector whose dimension equals the vocabulary size:

class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # input->hidden weights, i.e. the word embedding matrix, initialized uniformly in (-1, 1]
        self.W = nn.Parameter(-2 * torch.rand(voc_size, embedding_size) + 1).type(dtype)
        # hidden->output weights
        self.WT = nn.Parameter(-2 * torch.rand(embedding_size, voc_size) + 1).type(dtype)

    def forward(self, X):
        # X : [batch_size, voc_size] (one-hot rows)
        hidden_layer = torch.matmul(X, self.W)              # [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.WT)  # [batch_size, voc_size]
        return output_layer

During training, the input has shape [batch_size, voc_size]; after the hidden layer it becomes [batch_size, embedding_size]; and after multiplying by the output matrix the final shape is again [batch_size, voc_size].
Next, instantiate the model and define the loss and optimizer.

model = Word2Vec()
criterion = nn.CrossEntropyLoss()   # applies softmax + negative log-likelihood internally
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5000):
    input_batch, target_batch = random_batch(skip_grams, batch_size)
    input_batch = torch.Tensor(input_batch)        # [batch_size, voc_size]
    target_batch = torch.LongTensor(target_batch)  # [batch_size]
    optimizer.zero_grad()
    output = model(input_batch)
    loss = criterion(output, target_batch)
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
    loss.backward()
    optimizer.step()

Finally, plot the first two dimensions of each word vector to see the result:

W, WT = model.parameters()
for i, label in enumerate(word_list):
    x, y = float(W[i][0]), float(W[i][1])   # first two embedding dimensions
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()

Results: (figure: scatter plot of the learned word vectors)
This is just a toy example: the data set is small, the model has not fully converged, and the final result is not very good.
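
Even so, the learned matrix W can already be queried. A small helper (my own addition, not from the original post) that ranks words by cosine similarity to a given word:

# Rank words by cosine similarity to a query word using the learned embedding matrix W.
import torch.nn.functional as F

W, WT = model.parameters()
emb = W.detach()                                      # [voc_size, embedding_size]

def most_similar(word, topn=3):
    v = emb[word_dict[word]].unsqueeze(0)             # [1, embedding_size]
    sims = F.cosine_similarity(v, emb)                # similarity to every word vector
    best = sims.argsort(descending=True)[1:topn + 1]  # skip the word itself
    return [(word_list[int(i)], float(sims[i])) for i in best]

print(most_similar('dog'))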

Word2vec in TensorFlow 2

So how do we do this in TensorFlow 2? Since the PyTorch section above already builds word2vec from the ground up, and the principle is the same, here we mainly show how to work with word embeddings in tf2 using the Keras Embedding layer.

import tensorflow as tf

docs = ["i like dog", "i like cat", "i like animal",
        "dog is animal", "cat is animal", "dog like apple", "cat like fish",
        "dog like milk", "i like apple", "i hate apple",
        "i like movie", "i like book", "i like music", "cat hate dog", "cat like dog"]
# only keep the 15 most common words
max_words = 15
# unified sequence length: longer sentences are truncated and shorter ones are zero-padded
# (none of the sentences here exceeds 3 words; by default padding is added at the front,
#  but it can be changed to pad at the end)
max_len = 3
# word embedding dimension
embedding_dim = 3
# tokenization
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words)
# fit_on_texts builds the vocabulary from the training texts
tokenizer.fit_on_texts(docs)
# word -> index dictionary
word_index = tokenizer.word_index
# convert each sentence to a sequence of word indices
sequences = tokenizer.texts_to_sequences(docs)
# pad to the unified sequence length
data = tf.keras.preprocessing.sequence.pad_sequences(sequences=sequences, maxlen=max_len)
# add an Embedding layer: vocabulary size, embedding dimension and maximum sentence length
model = tf.keras.models.Sequential()
embedding_layer = tf.keras.layers.Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len)
model.add(embedding_layer)
model.compile('rmsprop', 'mse')
out = model.predict(data)
# inspect the embedding weights
layer = model.get_layer('embedding')
print(layer.get_weights())

Finally, the output of model.predict has shape (15, 3, 3): 15 is the number of sentences, 3 is the number of words in each sentence (max_len), and the last 3 is the embedding dimension.
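
A quick check of these shapes (my own addition, assuming the variables from the block above):

print(out.shape)                      # (15, 3, 3): sentences x words per sentence x embedding_dim
print(layer.get_weights()[0].shape)   # (15, 3): max_words x embedding_dim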

Word2vec in Gensim

In practice, we mostly use the gensim package; code we write ourselves is not necessarily as good as gensim's implementation. gensim is very easy to use, and if you just want word embeddings it is more than enough. Here is how to use it:

from gensim.models import Word2Vec
import re

docs = ["i like dog", "i like cat", "i like animal",
        "dog is animal", "cat is animal", "dog like apple", "cat like fish",
        "dog like milk", "i like apple", "i hate apple",
        "i like movie", "i like book", "i like music", "cat hate dog", "cat like dog"]
sentences = []
# remove punctuation
stop = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+'
for doc in docs:
    doc = re.sub(stop, '', doc)
    sentences.append(doc.split())
# size: embedding dimension, window: context window size, workers: number of training threads
# words appearing fewer than min_count times are ignored
# sg=1 uses Skip-gram, otherwise CBOW
# (note: in gensim 4.x the size argument was renamed to vector_size)
model = Word2Vec(sentences, size=5, window=1, min_count=1, workers=4, sg=1)
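
Once trained, the vectors can be queried directly. A brief usage sketch (my own addition; model.wv holds the word vectors in both gensim 3.x and 4.x):

# Query the trained model.
print(model.wv['dog'])                # the 5-dimensional vector for "dog"
print(model.wv.most_similar('dog'))   # nearest words by cosine similarity
model.save('word2vec.model')          # persist for later use; reload with Word2Vec.load(...)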

With that, we can train word vectors very conveniently. Of course, gensim is not limited to word2vec: it also provides TF-IDF, LSA, LDA, similarity computation, information retrieval and more, and is a great tool for getting started with NLP.

Origin blog.csdn.net/qq_34523665/article/details/105685941