PyTorch Neural Network Practical Study Notes_24 Training a Language Model with a Recurrent Neural Network: Language Model Overview + Overview of the Multinomial Distribution in NLP

1 Language Model

Recurrent neural network models can learn from sequence fragments and discover sequential patterns between samples. This property makes them well suited to language processing.

1.1 Introduction to Language Models

Language models include grammar-based language models and statistical language models; the term generally refers to statistical language models.

1.1.1 Statistical language model

A statistical language model treats language (a sequence of words) as a random event and assigns it a probability that describes how likely it is to belong to a given language, thereby measuring how reasonable a sentence is. The higher the probability, the more the sentence resembles a natural one.

The role of a statistical language model is to determine a probability distribution P(w1,w2,...,wm) for a string of length m, representing the likelihood of its occurrence. Here w1 through wm denote, in order, the words of the text. Through this distribution the model can retain some word-order information and capture the context of a word.
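By the chain rule of probability, this joint probability can be decomposed word by word, which is exactly why such a model captures context:

P(w1,w2,...,wm) = P(w1) · P(w2|w1) · P(w3|w1,w2) · ... · P(wm|w1,...,wm-1)

Each factor P(wi|w1,...,wi-1) is the probability of the next word given all the words before it.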

2 Vocabulary and Word Vectors

2.1 Vocabulary and word vectors

    A vocabulary encodes each word (or character), that is, it represents the word (or character) with a number so that sentences can be fed into a neural network for processing.
    A simple vocabulary numbers each word (or character) sequentially, or represents that number with one-hot encoding. However, this simple numbering can only distinguish different words (or characters); it cannot express their internal meaning.

So people began to map words (or characters) to vectors, which can express more information. The vector used to represent each word is called a word vector (also called a word embedding). Word vectors can be understood as an upgraded version of one-hot encoding: they use multidimensional vectors to better describe the relationships between words.

2.2 The principle of word vectors

The biggest advantage of word vectors is that they can better represent contextual semantics.

2.2.1 The meaning of word vectors

A word vector maps the distance between words to the distance between vectors, so as to retain the original characteristics of each word (or character) to the greatest extent. It is based on the distributional hypothesis: a word's semantics is determined by its context, and words with similar contexts have similar semantics.

2.2.2 Composition of a word vector

(1) Choose a way to describe the context;

(2) Choose a model to describe the relationship between a target word and its context.

2.3 Word vectors and one-hot encoding

  One-hot encoding is essentially a kind of word vector: each word is represented as a very long vector whose dimension equals the size of the vocabulary. Only one dimension has the value 1 and all the others are 0; the position of that 1 identifies the current word.
    The only difference between one-hot encoding and a word vector is that one-hot encoding merely symbolizes the word without considering any semantic information. If each element of the one-hot encoding is converted from an integer to a floating-point type, and the original sparse, huge-dimensional representation is then compressed and embedded into a lower-dimensional space, the result is equivalent to a word vector.
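This equivalence can be checked directly: an embedding lookup gives the same result as multiplying a floating-point one-hot vector by a projection matrix. A minimal sketch (the vocabulary size and dimensions below are made up for illustration):

import torch
import torch.nn.functional as F

vocab_size, embed_dim = 5, 3                    # toy sizes, assumed for illustration
ids = torch.tensor([0, 2, 4])                   # three word indices

one_hot = F.one_hot(ids, vocab_size).float()    # integer ids -> floating-point one-hot
W = torch.randn(vocab_size, embed_dim)          # projection into a smaller space
via_one_hot = one_hot @ W                       # "compress and embed"

emb = torch.nn.Embedding.from_pretrained(W)     # embedding table with the same weights
via_embedding = emb(ids)                        # direct table lookup

print(torch.allclose(via_one_hot, via_embedding))  # True: the two paths agree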

2.4 Implementation of word vectors

In neural network implementations, word vectors are more often called word embeddings. The specific method is to map a two-dimensional tensor of word indices into a multi-dimensional space: the elements of the embedding are no longer words but multi-dimensional vectors converted from words, and all of these vectors have distance relationships between them.
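In PyTorch this mapping is provided by torch.nn.Embedding. A minimal sketch (toy sizes assumed): a two-dimensional tensor of word indices goes in, and a tensor of multi-dimensional vectors comes out.

import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # 10-word vocabulary, 4-dim vectors

word_ids = torch.tensor([[1, 2, 4], [4, 3, 9]])  # 2 sentences, 3 word indices each (2D tensor)
vectors = embedding(word_ids)                    # every index becomes a 4-dim vector
print(vectors.shape)                             # torch.Size([2, 3, 4])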

3 The Multinomial Distribution in NLP

In natural language, a given word in a sentence is not unique. For example, the last word "awesome" in the sentence "code doctor studio is awesome" could be replaced with "good" without affecting the semantics of the whole sentence.

3.1 Problems in the RNN model

In the RNN model, when a model trained on language samples is used to generate text, it turns out that the model always takes the word with the highest probability at the next time step. This amounts to realizing only one fixed wording, and the generated text loses the diversity of the language itself.

3.2 Solutions

To solve this problem, the final output of the RNN model is treated as a multinomial distribution, and the word of the next step is predicted by sampling from that distribution. Sentences generated this way are more in line with the characteristics of the language.

3.2.1 Multinomial distribution

The multinomial distribution is an extension of the binomial distribution.

A typical example of the binomial distribution is tossing a coin: if the probability of the coin landing heads is p, then the probability of getting k heads in n repeated tosses is given by the binomial distribution. Extending the binomial formula to multiple states yields the multinomial distribution.
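In formulas: with n independent tosses and heads probability p, the binomial probability of k heads is

P(X = k) = C(n, k) · p^k · (1 - p)^(n - k)

and extending to r states with probabilities p1,...,pr (summing to 1) and counts k1,...,kr (summing to n) gives the multinomial probability

P(X1 = k1, ..., Xr = kr) = n! / (k1! · ... · kr!) · p1^k1 · ... · pr^kr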

3.2.2 Application of the multinomial distribution in the RNN model

In the RNN model, the predicted result is no longer a specific word for the next step but a distribution over words; this is the core idea of applying the multinomial distribution in the RNN model. Once the multinomial distribution over words is obtained, a sampling operation on that distribution yields a concrete word. This method better matches the diversity of language itself in NLP tasks, namely that a given word in a sentence is not unique.

3.2.3 Implementation steps in the RNN model

    (1) The result predicted by the RNN model is converted, through a fully connected or convolutional layer, into an array with the same dimension as the dictionary.
    (2) This array is used to represent the multinomial distribution over the model's predicted results.
    (3) The torch.multinomial() function is used to sample from the prediction to obtain the actual predicted word, as sketched below.
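A minimal sketch of these three steps (the dictionary size is made up for illustration; torch.multinomial() itself is described in the next section):

import torch
import torch.nn.functional as F

vocab_size = 6                          # assumed toy dictionary size
logits = torch.randn(1, vocab_size)     # step (1): one score per dictionary word, e.g. from a fully connected layer

probs = F.softmax(logits, dim=-1)       # step (2): the array now represents a multinomial distribution
word_id = torch.multinomial(probs, 1)   # step (3): sample the actual predicted word index
print(word_id)                          # e.g. tensor([[3]])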

3.3 torch.multinomial()

torch.multinomial(input, num_samples, replacement=False, out=None) → LongTensor
  • Function: draws num_samples samples from each row of input; the output tensor contains the column index (subscript) in input of each sampled value.
  • input: a tensor of weights; each element represents its weight within its row. When sampling without replacement, an element with weight 0 will only be drawn after all non-zero elements have been drawn.
  • num_samples: the number of samples to draw from each row; when sampling without replacement it cannot be greater than the number of (non-zero) elements in each row, otherwise an error is reported.
  • replacement: whether to sample with replacement; True means with replacement, False means without replacement.

3.3.1 Code implementation (multinomial sampling gives different results each time)

import torch
# generate a batch of random numbers in [0, 1)
data = torch.rand(2,4)
print("generated data:", data)
# generated data: tensor([[0.8998, 0.6512, 0.9156, 0.8575],[0.8455, 0.4838, 0.6859, 0.2629]])
a = torch.multinomial(data,1)
print("first sampling result:", a)
# first sampling result: tensor([[0],[0]])
b = torch.multinomial(data,1)
print("second sampling result:", b)
# second sampling result: tensor([[0],[1]])

4 Implementation of Recurrent Neural Networks

4.1 The underlying classes of an RNN

The torch.nn.LSTM class and the torch.nn.GRU class are not single-layer network structures. They are essentially a secondary encapsulation of RNN cells, connecting basic RNN cells according to the specified parameters to form a complete recurrent network.

Conceptually, the torch.nn.LSTM class and the torch.nn.GRU class correspond to the torch.nn.LSTMCell class and the torch.nn.GRUCell class respectively: the cell classes implement a single time step, while the full classes process the whole sequence.
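A minimal sketch of this relationship (toy sizes assumed; note the two modules below are initialized with independent random weights, so their outputs are not numerically identical):

import torch
from torch import nn

input_size, hidden_size, seq_len, batch = 3, 5, 4, 2
x = torch.randn(seq_len, batch, input_size)      # a whole input sequence

lstm = nn.LSTM(input_size, hidden_size)          # sequence-level class
out, (h, c) = lstm(x)                            # processes all time steps at once

cell = nn.LSTMCell(input_size, hidden_size)      # single-time-step building block
hx = torch.zeros(batch, hidden_size)
cx = torch.zeros(batch, hidden_size)
for t in range(seq_len):                         # stepping the cell manually over time
    hx, cx = cell(x[t], (hx, cx))

print(out.shape, hx.shape)                       # torch.Size([4, 2, 5]) torch.Size([2, 5])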
