Detailed explanation of the principle of word embedding (Word Embedding)

  Word embedding is a general term for a family of language-modelling and representation-learning techniques in natural language processing (NLP). To process natural language, we need to map each word to a vector so that it can be used for model training. One-hot vectors can be used to represent words, but the length of a one-hot vector equals the size of the vocabulary, which is usually very large, and the similarity between any two different words is always 0. This does not match everyday experience: some words are semantically close and often appear together in text, while others are unrelated and appear far apart. Choosing a better way to represent words is therefore particularly important.

  The word embedding method addresses this problem. Its basic idea is: first encode the words (for example with one-hot vectors), then build a neural network that contains an embedding layer. The inputs and outputs of the model are generally the one-hot vectors of words that appear near each other in the text. After training, feeding a word's one-hot vector into the embedding layer gives the new embedding representation of that word. These vectors are generally much shorter than the one-hot vectors and can be used to measure similarity and analogy between words.

One: Basic principles

  Tutu first takes the following text as an example:

“rabbit are very lovely animals, and we should not hurt them  ”

  This sentence contains 11 distinct words, so if the words are arranged in a fixed order, each word can be represented by a vector of length 11.

  If the word list is [rabbit, are, very, lovely, animals, and, we, should, not, hurt, them], then the one-hot representation of 'rabbit' is [1,0,0,0,0,0,0,0,0,0,0].
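
  Such a representation is easy to build in code. A small sketch (not part of the original code) using the example sentence above:

import numpy as np

words="rabbit are very lovely animals and we should not hurt them".split()
vocab={w:i for i,w in enumerate(words)} # word -> index, 11 entries
one_hot=np.zeros(len(vocab))
one_hot[vocab["rabbit"]]=1
print(one_hot)   # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]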

  After that, we need to construct a model that takes adjacent words in the text as its input and output during training. However, the numbers of inputs and outputs of the model must be fixed. Here Tutu introduces the two models commonly used in Word2vec: the skip-gram model and the continuous bag-of-words (CBOW) model.

1. Skip-gram model

  The skip-gram model uses a word in the text to predict the words before and after it. For example, from 'rabbit' it might predict surrounding words such as 'a', 'is', 'eating', 'carrot'. When training the model, we select a number of continuous fixed-length word sequences from the text, take the word at a certain position in the middle as the input, and take some of the words before and after it as the output. A small sketch of how such training pairs could be generated is shown below.
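
  A minimal sketch (not the original code; the window size of 2 is an assumption) of generating (center word, context word) training pairs for the skip-gram model:

def skipgram_pairs(tokens,window=2):
    pairs=[]
    for i,center in enumerate(tokens):
        for j in range(max(0,i-window),min(len(tokens),i+window+1)):
            if j!=i:
                pairs.append((center,tokens[j])) # (input center word, output context word)
    return pairs

tokens="rabbit are very lovely animals".split()
print(skipgram_pairs(tokens)[:4])
# [('rabbit', 'are'), ('rabbit', 'very'), ('are', 'rabbit'), ('are', 'very')]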

2. Continuous bag-of-words model

  The continuous bag-of-words model is just the opposite of the skip-gram model: it predicts the central word from the surrounding words of a text sequence. When training the model, the surrounding words in the sequence are used as input and the central word is used as output, as in the sketch below.
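
  A matching sketch (again not the original code, with an assumed window of 2 words on each side) of generating (context words, center word) samples for the continuous bag-of-words model:

def cbow_samples(tokens,window=2):
    samples=[]
    for i in range(window,len(tokens)-window):
        context=tokens[i-window:i]+tokens[i+1:i+window+1] # surrounding words (input)
        samples.append((context,tokens[i]))               # center word (output)
    return samples

tokens="rabbit are very lovely animals".split()
print(cbow_samples(tokens))
# [(['rabbit', 'are', 'lovely', 'animals'], 'very')]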

  For word embedding, the embedding layer is the core part of the model and is generally the first layer of the whole network. PyTorch provides a dedicated nn.Embedding layer to implement this part, but in fact the structure of the embedding layer can be quite varied: the simplest choice, a single fully connected layer, works, and in theory the embedding layer can itself be a network with many layers. After training, feeding the one-hot vector of a word into the embedding layer gives the embedding representation of that word. (The number of input words is fixed during training, but at prediction time one word or several words can be fed in; this also shows that the embedding structure is somewhat special, and Tutu will describe it in detail later.)

Two: Method implementation

  In the following case Tutu uses the continuous bag-of-words model. Each sequence selected from the text has length 9, with 8 input words and 1 output word, and the central word sits in the middle of the sequence. Both nn.Embedding from PyTorch and a self-designed embedding layer are used, with a word-embedding dimension of 50. The text is excerpted from "The Tale of Peter Rabbit". The article contains 793 words in total and a vocabulary of 401 words; students who need it can download it from the resources.

  For the text data, we do not consider punctuation marks, and to avoid the same word differing only in case, all words are converted to lowercase.

Text data processing part:

import torch
import re
import numpy as np

txt=[] # text data: list of lowercased words
with open('peter_rabbit.txt',encoding='utf-8') as f:
    for line in f.readlines():
        l=line.strip()
        split_sentence=re.split(" |;|-|,|!|\'",l)   # split on spaces and some punctuation
        for w in split_sentence:
            if w !='':
                txt.append(w.lower())
vol=list(set(txt)) # vocabulary (list of distinct words)
n=len(vol) # vocabulary size
vol_dict=dict(zip(vol,np.arange(n))) # word -> index mapping

data=[]
label=[]

for i in range(784): # slide a 9-word window over the text
    in_words=txt[i:i+4]             # the 4 words before the center word
    in_words.extend(txt[i+5:i+9])   # the 4 words after the center word
    out_word=txt[i+4]               # the center word
    in_one_hot=np.zeros((8,n))
    out_one_hot=np.zeros((1,n))
    out_one_hot[0,vol_dict[out_word]]=1
    for j in range(8):
        in_one_hot[j,vol_dict[in_words[j]]]=1
    data.append(in_one_hot)
    label.append(out_one_hot)

class dataset:
    def __init__(self):
        self.n=784 # number of training samples
        # when running model1, build traindata with dtype=torch.long (nn.Embedding expects
        # integer input); model2 uses float32. trainlabel stays float32 in both cases.
        self.traindata=torch.tensor(np.array(data),dtype=torch.float32)
        self.trainlabel=torch.tensor(np.array(label),dtype=torch.float32)
    def __len__(self):
        return self.n
    def __getitem__(self, item):
        return self.traindata[item],self.trainlabel[item]

  Tutu saves this part of the code in the file dataset.py. It splits the original text, in order, into the input data (data) and output data (label) needed for training and represents them as one-hot vectors.
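
  As a quick sanity check (a small sketch assuming the dataset.py above), the shapes of one batch can be inspected:

from torch.utils.data import DataLoader
from dataset import dataset

loader=DataLoader(dataset(),batch_size=5,shuffle=True)
x,y=next(iter(loader))
print(x.shape,y.shape) # torch.Size([5, 8, 401]) torch.Size([5, 1, 401])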

1. Use nn.Embedding to build a model

  nn.Embedding() requires at least two parameters, num_embeddings and embedding_dim, which are the vocabulary size and the word-embedding dimension. Here the parameters are (401, 50).

  In theory, nn.Embedding should behave like a fully connected layer (without bias) whose input dimension is the vocabulary size and whose output dimension is the word-embedding dimension. We can verify this as follows.

embed=nn.Embedding(401,200)
print(list(embed.parameters())[0].data)
print(list(embed.parameters())[0].data.shape)
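
  As a further check (a small sketch), an embedding lookup returns the same result as multiplying a one-hot vector by this weight matrix, i.e. a fully connected layer without bias:

import torch
from torch import nn

embed=nn.Embedding(401,200)
idx=torch.tensor([3])        # the word whose vocabulary index is 3
one_hot=torch.zeros(1,401)
one_hot[0,3]=1
print(torch.allclose(embed(idx),one_hot@embed.weight)) # True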

  The parameters are a single 401×200 matrix, exactly the weight of a bias-free fully connected layer. The difference from an ordinary fully connected layer is that nn.Embedding does not multiply its input by this matrix: it treats every integer in the input as an index and returns the corresponding row, and it accepts input tensors with any number of leading dimensions, like several parallel fully connected layers sharing the same parameters (similar to channels in a convolutional neural network). In the code below the one-hot matrices themselves (converted to torch.long) are fed in, so every 0/1 entry is looked up and the output has shape [batch, num_input_word, one_hot_dim, embed_dim]; before it can be passed to the next fully connected layer it has to be flattened, which is why fc1 has an input size of 8*401*50 = 160400. This part of the code is saved in model1.py.

import torch
from torch import nn
from torch.utils.data import DataLoader
from dataset import dataset

class model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed=nn.Embedding(401,50)    # lookup table: 401 rows of 50-dim vectors
        self.fc1=nn.Linear(160400,100)     # 160400 = 8 words * 401 (one-hot length) * 50 (embed dim)
        self.act1=nn.ReLU()
        self.fc2=nn.Linear(100,401)        # scores over the 401-word vocabulary
        self.act2=nn.Sigmoid()
    def forward(self,input):               # input: LongTensor of shape (batch, 8, 401)
        b,_,_=input.shape
        out=self.embed(input).view(b,1,-1) # (batch, 8, 401, 50) flattened to (batch, 1, 160400)
        out=self.fc1(out)
        out=self.act1(out)
        out=self.fc2(out)
        out=self.act2(out)
        return out
if __name__=='__main__':
    model=model()
    optim=torch.optim.Adam(params=model.parameters())
    Loss=nn.MSELoss()
    traindata=DataLoader(dataset(),batch_size=5,shuffle=True)
    for i in range(100):
        print('the {} epoch'.format(i))
        for d in traindata:
            yp=model(d[0])
            loss=Loss(yp,d[1])
            optim.zero_grad()
            loss.backward()
            optim.step()
    torch.save(model,'model_1.pkl')
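
  The shape flow described above can be checked in a separate session (a small sketch, assuming model1.py has been saved as shown; importing it does not start training because of the __main__ guard):

import torch
from model1 import model

m=model()
x=torch.zeros(5,8,401,dtype=torch.long) # a dummy batch of 5 one-hot samples
print(m.embed(x).shape)                 # torch.Size([5, 8, 401, 50])
print(m(x).shape)                       # torch.Size([5, 1, 401])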

2. Construct the embedding yourself

  Here Tutu defines a two-layer fully connected network as the embedding layer; its effect is similar to that of nn.Embedding above. This part of the code is saved in model2.py.

import torch
from torch import nn
from torch.utils.data import DataLoader
from dataset import dataset
import numpy as np
class embedding(nn.Module):
    def __init__(self,in_dim,embed_dim):
        super().__init__()
        self.embed=nn.Sequential(nn.Linear(in_dim,200),
                                 nn.ReLU(),
                                 nn.Linear(200,embed_dim),
                                 nn.Sigmoid())
    def forward(self,input):
        b,c,_=input.shape                 # input: (batch, num_words, one_hot_dim)
        output=[self.embed(input[:,i]) for i in range(c)]   # one (batch, embed_dim) vector per word
        return torch.stack(output,dim=1)  # (batch, num_words, embed_dim); keeps gradients so the embedding is trained


class model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed=embedding(401,50)        # custom embedding: one-hot (401) -> 50-dim vector
        self.fc1=nn.Linear(400,4000)        # 400 = 8 input words * 50 embedding dims
        self.act1=nn.ReLU()
        self.fc2=nn.Linear(4000,401)        # scores over the 401-word vocabulary
        self.act2=nn.Sigmoid()
    def forward(self,input):                # input: float one-hot tensor of shape (batch, 8, 401)
        b,_,_=input.shape
        out=self.embed(input).reshape(b,-1) # (batch, 8, 50) flattened to (batch, 400)
        out=self.fc1(out)
        out=self.act1(out)
        out=self.fc2(out)
        out=self.act2(out)
        out=out.view(b,1,-1)
        return out
if __name__=='__main__':
    model=model()
    optim=torch.optim.Adam(params=model.parameters())
    Loss=nn.MSELoss()
    traindata=DataLoader(dataset(),batch_size=5,shuffle=True)
    for i in range(100):
        print('the {} epoch'.format(i))
        for d in traindata:
            yp=model(d[0])
            loss=Loss(yp,d[1])
            optim.zero_grad()
            loss.backward()
            optim.step()
    torch.save(model,'model_2.pkl')
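
  After training, the embedding representation of a single word can be obtained by passing its one-hot vector through the trained embedding block, as described in section one. A small sketch, appended inside the __main__ block of model2.py after torch.save (it assumes the words 'peter' and 'rabbit' survive the preprocessing above, and imports vol_dict and n from dataset.py):

    from dataset import vol_dict,n
    def word_vector(word): # one-hot -> trained embedding block -> 50-dim vector
        one_hot=np.zeros((1,1,n),dtype=np.float32)
        one_hot[0,0,vol_dict[word]]=1
        with torch.no_grad():
            return model.embed(torch.tensor(one_hot)).reshape(-1)
    v1,v2=word_vector('peter'),word_vector('rabbit')
    print(torch.dot(v1,v2)/(v1.norm()*v2.norm())) # cosine similarity of the two word vectors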

Three: Summary

  As a natural language processing technique, word embedding comes in many varieties and is not limited to the methods described in this article, but the underlying ideas of these methods are basically the same. In essence, word embedding has much in common with autoencoders: it reduces the dimensionality of the data and usually represents the data in a more reasonable way, so the choice of model structure and its optimization are important.
