【PyTorch】nn.Embedding

In machine learning tasks, sentences that humans understand often need to be converted into numerical symbols that machines can process. A common approach is to convert words or sentences into vectors. PyTorch provides the torch.nn.Embedding module to vectorize words in this way, and its use is demonstrated below.

1. Interface introduction

Official interface address: Embedding — PyTorch 1.13 documentation

CLASS
torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, device=None, dtype=None)

The torch.nn.Embedding module is essentially a lookup table: each index in the table corresponds to the embedding vector of one word. These vectors are randomly initialized (with values drawn from the standard normal distribution N(0, 1)), so they carry no meaning on their own and do not give the effect of training with word2vec or similar methods. However, the weights can be assigned first (for example, from pre-trained vectors) and then learned during training.
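As a minimal sketch of this idea (the pretrained tensor below is only a random stand-in for real word2vec-style vectors), nn.Embedding.from_pretrained can load existing vectors and keep them trainable:

import torch
import torch.nn as nn

# By default the weights are drawn from N(0, 1) and carry no meaning yet
embed = nn.Embedding(5, 3)
print(embed.weight)  # random values, requires_grad=True

# Pre-trained vectors (here a random stand-in) can be loaded instead and fine-tuned
pretrained = torch.randn(5, 3)
embed_pre = nn.Embedding.from_pretrained(pretrained, freeze=False)
print(embed_pre.weight)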

Common parameters of the interface

  • num_embeddings (int) – the number of words in the dictionary

  • embedding_dim (int) – dimension of the embedding vector

  • padding_idx (int, optional) – if given, the embedding vector at index padding_idx is initialized to zeros and is not updated by gradients during training (see the sketch after this list)
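
A minimal sketch of padding_idx (the index and sizes here are illustrative): the row at padding_idx starts as zeros and receives no gradient updates.

import torch
import torch.nn as nn

# Index 0 is reserved for padding: its vector starts as zeros
# and is not updated by gradients during training.
embed = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)
print(embed.weight[0])                   # all zeros
print(embed(torch.LongTensor([0, 2])))   # first row is the zero padding vector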

2. Interface usage

import torch
import torch.nn as nn

# vocabulary
words = ['A', 'B', 'C', 'D', 'E']
# word-to-index mapping
word_idx = {w: idx for idx, w in enumerate(words)}
# sentences
sentences = [
    'AABBC',
    'AADDE',
    'EECAA'
]

num_embed = len(words)  # number of words in the dictionary
embed_dim = 3           # dimension of the embedding vectors (user-defined)

embed = nn.Embedding(num_embed, embed_dim)

# convert each word in each sentence to its index
input = []
for s in sentences:
    s_idx = [word_idx[w] for w in s]
    input.append(s_idx)

# embed the sentences
input = torch.LongTensor(input)
print(embed(input))

Here, input is the index tensor obtained by numbering the words in each sentence, and the output is the embedded representation of each sentence:

Input:
tensor([[0, 0, 1, 1, 2],
        [0, 0, 3, 3, 4],
        [4, 4, 2, 0, 0]])

Embedding Result:
tensor([[[ 0.3174, -0.1958, -1.1196],
         [ 0.3174, -0.1958, -1.1196],
         [ 0.5466, -0.6627, -2.0538],
         [ 0.5466, -0.6627, -2.0538],
         [ 2.2772, -0.3313,  0.3458]],

        [[ 0.3174, -0.1958, -1.1196],
         [ 0.3174, -0.1958, -1.1196],
         [ 1.0199,  0.1071,  2.3243],
         [ 1.0199,  0.1071,  2.3243],
         [-0.0983, -1.4985, -0.1011]],

        [[-0.0983, -1.4985, -0.1011],
         [-0.0983, -1.4985, -0.1011],
         [ 2.2772, -0.3313,  0.3458],
         [ 0.3174, -0.1958, -1.1196],
         [ 0.3174, -0.1958, -1.1196]]], grad_fn=<EmbeddingBackward0>)

As the Embedding Result shows, each word has been converted into a vector of dimension 3, and since each sentence consists of 5 words, each sentence is mapped to a two-dimensional matrix of 5 word vectors. To look up the vector for a specific word, inspect embed.weight: each row corresponds to the word at that index, so the word vectors are stored in embed.weight and can be queried there directly.

print(embed.weight)
Parameter containing:
tensor([[ 0.3174, -0.1958, -1.1196],
        [ 0.5466, -0.6627, -2.0538],
        [ 2.2772, -0.3313,  0.3458],
        [ 1.0199,  0.1071,  2.3243],
        [-0.0983, -1.4985, -0.1011]], requires_grad=True)
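
As a quick check (continuing the example above), looking up an index simply returns the corresponding row of embed.weight:

# Looking up an index is just row indexing into the weight matrix
idx = torch.LongTensor([0])
print(torch.equal(embed(idx)[0], embed.weight[0]))  # True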

Source: blog.csdn.net/zzy_NIC/article/details/128015469