Deep Learning - Natural Language Processing (1)

The basics of natural language processing
This article is organized as brief notes based on the book "Analysis of Aliyun Tianchi Competition Questions - Deep Learning"; it is best read in conjunction with the book itself.
1.1 Word vector
Usually, each word is mapped to a real-valued vector that reflects its semantic features; such a vector is called a word vector. Commonly used word representations include the one-hot representation and the distributed representation.

(1) One-hot representation
One-hot encoding uses an N-dimensional vector of 0s and 1s to represent N states, with exactly one position set to 1 at any time.
One-hot encoding is equivalent to assigning a unique id to each word; it cannot reflect the underlying semantic information, and for a large vocabulary it takes up a large amount of memory.
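As a minimal sketch (using a tiny, hypothetical vocabulary and no particular NLP library), a one-hot vector can be built directly from a word-to-index mapping:

```python
import numpy as np

# Hypothetical toy vocabulary: each word gets a unique id.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def one_hot(word, vocab):
    """Return an N-dimensional 0/1 vector with a single 1 at the word's id."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab[word]] = 1.0
    return vec

print(one_hot("cat", vocab))   # [0. 1. 0. 0. 0.]
```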
(2) Distributed representation
Each word is expressed as a fixed-length dense vector that can reflect the semantic information behind the word. Since the dense vectors are not assigned arbitrarily, the sentence itself must also be modeled, and this is the role of the language model.
1.2 Language Model
A language model defines a probability distribution over token sequences in natural language. In plain terms, a language model models a sentence and computes the probability distribution of that sentence.
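More formally, for a token sequence w_1, w_2, ..., w_T, the language model assigns

P(w_1, w_2, ..., w_T) = P(w_1) · P(w_2 | w_1) · ... · P(w_T | w_1, ..., w_{T-1}),

that is, the probability of the whole sentence factors into a product of conditional probabilities of each token given the tokens before it.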
(1) Traditional language model
① Bag-of-words model
The bag-of-words model replaces the 0/1 entries of one-hot encoding with word frequencies.
The bag-of-words model ignores the order of words in the original sentence and only reflects word frequencies; in general, a word's frequency reflects its importance in the sentence.
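For example, a bag-of-words vector can be built by counting how often each vocabulary word occurs in a sentence (a minimal sketch, reusing the toy vocabulary above):

```python
from collections import Counter
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def bag_of_words(tokens, vocab):
    """Replace the 0/1 entries of one-hot encoding with word frequencies."""
    counts = Counter(tokens)
    vec = np.zeros(len(vocab), dtype=np.float32)
    for word, freq in counts.items():
        if word in vocab:
            vec[vocab[word]] = freq
    return vec

print(bag_of_words("the cat sat on the mat".split(), vocab))  # [2. 1. 1. 1. 1.]
```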
②n-gram model
Since the bag-of-words model cannot reflect the order of words in a sentence, the semantic information it captures is one-sided. To better capture semantics, the n-gram model was proposed.
The n-gram model uses the Markov assumption to simplify the calculation of word occurrence probabilities: the state at a given position is assumed to depend only on the (n-1) states immediately before it.
The bag-of-words model is a 1-gram (unigram) model, and the probabilities of an n-gram model are usually estimated by maximum likelihood estimation.
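Concretely, for a bigram (2-gram) model the maximum likelihood estimate reduces to counting: P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1}). A minimal sketch over a tiny hypothetical corpus:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

# Count unigrams and bigrams over the (tiny, hypothetical) corpus.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    """Maximum likelihood estimate: P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 0.666... = 2/3
```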
(2) Neural language model
A neural language model obtains the distributed representation of words, usually called word embeddings (Word Embedding), by training a neural network. In essence, the network is trained in an unsupervised way; after training is complete, the hidden-layer features in the middle of the network are extracted, and these hidden-layer features are the word vectors we want.
The neural language model is essentially a classification model, and its training efficiency can be greatly improved through negative sampling and its corresponding loss function.
① Skip-Gram model
The Skip-Gram model predicts the words in the context window from the center word. It takes a sentence that has already been converted to word indices as input, and an Embedding layer maps the indices to word vectors of shape (bs, len, dim), where bs is the batch size, len is the sentence length, and dim is the word-vector dimension.
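A minimal Skip-Gram sketch in PyTorch (the class name, vocabulary size, and embedding dimension are hypothetical, and the full softmax output layer shown here would typically be replaced by negative sampling in practice):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocabulary of 10,000 words, 128-dimensional embeddings.
vocab_size, dim = 10_000, 128

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.out = nn.Linear(dim, vocab_size)        # scores over context words

    def forward(self, center_ids):
        # center_ids: (bs, len) word indices -> (bs, len, dim) word vectors
        vecs = self.embed(center_ids)
        # Predict a distribution over the vocabulary for the context words.
        return self.out(vecs)                        # (bs, len, vocab_size)

model = SkipGram(vocab_size, dim)
batch = torch.randint(0, vocab_size, (4, 20))        # bs=4, len=20
print(model(batch).shape)                            # torch.Size([4, 20, 10000])
```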
② CBOW model
The CBOW model is the opposite of the Skip-Gram model: it predicts the center word from all the words in its context. Like Skip-Gram, it takes a sentence converted to word indices as input and uses an Embedding layer to map the indices to word vectors of shape (bs, len, dim).
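A corresponding CBOW sketch (again with hypothetical sizes), which averages the context vectors before scoring the center word:

```python
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 128   # hypothetical sizes, as above

class CBOW(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (bs, window) indices of the surrounding words
        vecs = self.embed(context_ids)          # (bs, window, dim)
        pooled = vecs.mean(dim=1)               # average the context vectors
        return self.out(pooled)                 # (bs, vocab_size): scores for the center word

model = CBOW(vocab_size, dim)
context = torch.randint(0, vocab_size, (4, 6))  # bs=4, window of 6 context words
print(model(context).shape)                     # torch.Size([4, 10000])
```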
1.3 Deep learning in natural language processing
(1) Convolutional neural network
A convolutional neural network is a type of neural network that contains convolutional computing units. A convolutional unit performs a weighted sum over the corresponding region of the data as the convolution kernel slides across positions. Common convolutional units include one-dimensional convolution (CNN1D), two-dimensional convolution (CNN2D), and three-dimensional convolution (CNN3D).
Computer vision mainly uses two-dimensional convolution, and natural language processing mainly uses one-dimensional convolution.
A classic model that applies convolutional neural networks to text classification is TextCNN.
CNN1D is very effective at mining information within a local context window of a sequence, but it has difficulty preserving long-distance context information (so it is not well suited to long text).
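A simplified TextCNN sketch in PyTorch (hypothetical sizes; real implementations usually also add dropout and padding handling), with 1D convolutions of several widths followed by max-pooling:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Simplified TextCNN: embed -> 1D convolutions of several widths -> max-pool -> classify."""
    def __init__(self, vocab_size=10_000, dim=128, num_classes=2,
                 kernel_sizes=(3, 4, 5), channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList([nn.Conv1d(dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, x):                       # x: (bs, len) word indices
        e = self.embed(x).transpose(1, 2)       # (bs, dim, len) for Conv1d
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = TextCNN()
print(model(torch.randint(0, 10_000, (4, 50))).shape)  # torch.Size([4, 2])
```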
(2) Recurrent neural network
A recurrent neural network applies the same recurrent-unit computation at every position of the sequence, which allows it to maintain long-distance context information; it naturally fits sequence tasks and is widely used in natural language processing.
Common structures are the LSTM (long short-term memory) and the GRU (gated recurrent unit), which alleviate the exploding and vanishing gradient problems of simple RNNs.
The GRU reduces computational complexity relative to the LSTM.
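For reference, a minimal sketch of how the two units are used in PyTorch (the shapes are hypothetical); the LSTM carries both a hidden state and a cell state, while the GRU carries only a hidden state:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch of 4 sentences, 20 tokens, 128-dim word vectors.
x = torch.randn(4, 20, 128)

lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=64, batch_first=True)

out_lstm, (h, c) = lstm(x)   # LSTM returns a hidden state h and a cell state c
out_gru, h_gru = gru(x)      # GRU returns only a hidden state, so it is cheaper

print(out_lstm.shape, out_gru.shape)   # torch.Size([4, 20, 64]) torch.Size([4, 20, 64])
```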
(3) Encoder-decoder framework and attention mechanism
The main tasks of natural language processing can be regarded as many-to-many tasks, that is, tasks with a sequence as input and a sequence as output (text classification can be viewed as outputting a sequence of length 1). The encoder-decoder framework therefore naturally fits the task requirements of natural language processing.
The encoder-decoder framework, also known as the Seq2Seq framework, can be viewed as a conditional language model. Encoders and decoders usually use recurrent neural networks.
A recurrent neural network sometimes has to preserve information over overly long spans, whereas the attention mechanism can retain information selectively.
Through the attention mechanism, the decoder can selectively access the encoder's hidden-state information, which improves training efficiency.
The attention mechanism can be regarded as a relationship between a query (Query) and key-value pairs (Key, Value).
In natural language processing, the keys and values are generally treated as identical, that is, K = V; furthermore, if the self-attention mechanism is adopted, Q = K = V.
The mechanisms above, which compute attention weights as dense vectors, are collectively called soft attention; mechanisms that compute attention weights as one-hot vectors are called hard attention.
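A minimal sketch of soft (scaled dot-product) attention with hypothetical tensor shapes; the same function covers encoder-decoder attention (K = V, taken from the encoder) and self-attention (Q = K = V):

```python
import torch
import torch.nn.functional as F

def soft_attention(query, key, value):
    """Soft (dot-product) attention: dense weights over all key-value pairs."""
    # query: (bs, lq, d), key/value: (bs, lk, d)
    scores = query @ key.transpose(1, 2) / key.size(-1) ** 0.5   # (bs, lq, lk)
    weights = F.softmax(scores, dim=-1)                          # dense attention weights
    return weights @ value                                       # (bs, lq, d)

# Encoder-decoder attention: K = V are the encoder hidden states, Q comes from the decoder.
enc_states = torch.randn(2, 10, 64)
dec_states = torch.randn(2, 5, 64)
print(soft_attention(dec_states, enc_states, enc_states).shape)  # torch.Size([2, 5, 64])

# Self-attention: Q = K = V.
print(soft_attention(enc_states, enc_states, enc_states).shape)  # torch.Size([2, 10, 64])
```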

Origin blog.csdn.net/weixin_47970003/article/details/123623865