"X" Embedding in NLP | An Introduction to Neural Networks and Language Model Embedding Vectors

In the "X" Embedding in NLP advanced series, we introduced the basic knowledge of natural language processing - Token, N-gram and bag-of-word language models in natural language. Today, we will continue to "practice" with you, delve into neural network language models, especially recurrent neural networks, and briefly understand how to generate Embedding vectors.

01. Deep understanding of neural networks

First, let's briefly review the components of neural networks: neurons, multi-layer networks, and the backpropagation algorithm. If you want to explore these basic concepts in more detail, you can refer to other resources, such as the CS231n course notes.

In machine learning, neurons are the basic units that make up all neural networks. Essentially, a neuron is a unit that takes a weighted sum of all of its inputs, plus an optional bias term. The equation is as follows:

$$z = \sum_{i} w_i x_i + b$$

Here, $x_i$ represents the output of the $i$-th neuron in the previous layer, $w_i$ represents the weight this neuron applies to that output, and $b$ is the bias term.

If a multi-layer neural network consisted only of the weighted sums in the above equation, we could collapse all of the terms into a single linear layer, which is far from ideal for modeling relationships between tokens or encoding complex text. This is why every neuron applies a nonlinear activation function after the weighted sum, the most familiar example being the rectified linear unit (ReLU):

$$\mathrm{ReLU}(z) = \max(0, z)$$

For most modern neural network language models, the Gaussian Error Linear Unit (GELU) activation function is more common:

$$\mathrm{GELU}(z) = z\,\Phi(z)$$

Here, $\Phi(z)$ represents the Gaussian cumulative distribution function; in practice it is often approximated as $z\,\sigma(1.702z)$. The activation function is applied after the weighted sum described above, so all in all a single neuron computes:

$$y = \mathrm{GELU}\Big(\sum_{i} w_i x_i + b\Big)$$
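As a minimal sketch (not from the original article; the dimensions and values are arbitrary), a single neuron can be written out in PyTorch like this:

import torch
import torch.nn.functional as F

# a single neuron: weighted sum of the inputs plus a bias, followed by GELU
x = torch.randn(5)       # outputs of the previous layer
w = torch.randn(5)       # this neuron's weights
b = torch.tensor(0.1)    # optional bias term

z = torch.dot(w, x) + b  # weighted sum
y = F.gelu(z)            # nonlinear activation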

To learn more complex functions, we can stack neurons on top of one another to form a layer. All neurons in the same layer receive the same input; the only difference between them is the weights W and the biases b. We can express the above equation in matrix notation for a single layer:

$$\mathbf{y} = f(W\mathbf{x} + \mathbf{b})$$

Here, W is a two-dimensional matrix containing all the weights applied to the input x; each row of the matrix corresponds to the weights of one neuron, and f is the activation function. This type of layer is often called a dense layer or a fully connected layer, because every input x is connected to every output y.
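In PyTorch, a dense layer is what nn.Linear implements. The following sketch (with arbitrary dimensions, added here for illustration) shows that it is exactly the matrix form above:

import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(in_features=5, out_features=3)  # W has shape (3, 5), b has shape (3,)
x = torch.randn(5)

# these two lines compute the same values
y1 = F.gelu(layer(x))
y2 = F.gelu(x @ layer.weight.T + layer.bias)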

We can stack two of these layers to create a basic feedforward network:

$$\mathbf{h}_1 = f(W_0\mathbf{x} + \mathbf{b}_0)$$
$$\mathbf{y} = W_1\mathbf{h}_1 + \mathbf{b}_1$$

Here we have introduced a new hidden layer h1, which is connected directly to neither the input x nor the output y. This layer increases the depth of the network and therefore the total number of parameters (there are now multiple weight matrices). One important point to note: as more hidden layers are added, the hidden (activation) values close to the input layer are more "similar" to x, while the activations close to the output are more similar to y.

We will discuss Embedding vectors based on this principle in subsequent articles. The concept of hidden layers is crucial to understanding vector search.

The parameters of the individual neurons in a feedforward network can be updated through a process called backpropagation, which is essentially the repeated application of the chain rule from calculus. There are entire courses dedicated to explaining why backpropagation is so effective for training neural networks; we won't go into the details here. The basic process is as follows:

  1. Feed a batch of data through a neural network.

  2. Calculate losses. This is typically the L2 loss (squared difference) for regression and the cross-entropy loss for classification.

  3. Use this loss to compute the gradient of the loss with respect to the weights of the last layer, i.e. $\partial L / \partial W_n$.

  4. Backpropagate the loss through the last hidden layer, i.e. compute $\partial L / \partial \mathbf{h}_{n-1}$.

  5. Backpropagate that quantity to the weights of the penultimate layer, i.e. $\partial L / \partial W_{n-1}$.

  6. Repeat steps 4 and 5 until the partial derivatives of all weights are calculated.

After the partial derivatives of the loss with respect to all weights in the network have been calculated, a single large weight update can be performed according to the optimizer and the learning rate. This process is repeated until the model converges or all epochs are completed.
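As a hedged sketch of what this loop looks like in PyTorch (the model, data, and hyperparameters below are placeholders, not from the original article):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 4))
criterion = nn.CrossEntropyLoss()                         # classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

fake_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))]  # stand-in for a real dataloader

for epoch in range(10):
    for x, labels in fake_batches:
        optimizer.zero_grad()
        loss = criterion(model(x), labels)  # steps 1-2: forward pass and loss
        loss.backward()                     # steps 3-6: backpropagation via the chain rule
        optimizer.step()                    # weight update based on the optimizer and learning rate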

02. Recurrent Neural Network

All forms of text and natural language are sequential in nature, meaning words/tokens are processed one after another. Seemingly minor changes, such as adding a word, swapping two consecutive tokens, or adding punctuation, can lead to huge differences in interpretation. For example, "let's eat, Charles" and "let's eat Charles" mean completely different things. Because of this sequential nature, recurrent neural networks (RNNs) naturally became a popular choice for language modeling.

Recurrence is a form of recursion in which the function being applied repeatedly is a neural network rather than code. RNNs also have biological origins: the human brain can be compared to an (artificial) neural network, and the words we read or speak are the result of sequential biological processing.

An RNN consists of two components: 1) a standard feedforward network and 2) a recurrent component. The feedforward network is the same one we discussed in the previous section. For the recurrent component, the last hidden state is fed back into the input so that the network can maintain the previous context. Thus, prior knowledge (in the form of the hidden layer from the previous time step) is injected into the network at every new time step.
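Written out explicitly (the notation here is chosen for illustration and is not taken from the original article), the recurrence at time step $t$ is:

$$\mathbf{h}_t = f(W_0\,[\mathbf{x}_t; \mathbf{h}_{t-1}] + \mathbf{b}_0), \qquad \mathbf{y}_t = W_1\mathbf{h}_t + \mathbf{b}_1$$

where $[\mathbf{x}_t; \mathbf{h}_{t-1}]$ denotes the concatenation of the current input with the previous hidden state, which is exactly what the torch.cat call does in the code below.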

Based on this high-level definition of an RNN, we can roughly see how it is implemented and why RNNs perform well at semantic modeling.

First, the recurrent structure of RNNs enables them to capture and process data sequentially, similar to how humans speak, read, and write. In addition, RNNs can effectively access "information" from earlier time steps, which lets them understand natural language better than n-gram models and pure feedforward networks.

You can try implementing an RNN in PyTorch yourself. Note that this requires a solid grasp of PyTorch basics; if you are not yet familiar with PyTorch, it is recommended that you work through an introductory tutorial first.

We will first define a simple feedforward network, then extend it into a simple RNN. Start by defining the layers:

from torch import Tensor
import torch.nn as nn

class BasicNN(nn.Module):
    def __init__(self, in_dims: int, hidden_dims: int, out_dims: int):
        super(BasicNN, self).__init__()
        self.w0 = nn.Linear(in_dims, hidden_dims)   # input -> hidden layer
        self.w1 = nn.Linear(hidden_dims, out_dims)  # hidden -> output layer

Note that since we are only outputting raw logits here, we have not defined a loss. During training, a criterion such as nn.CrossEntropyLoss can be attached, depending on the task.
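For reference, attaching such a criterion is a one-liner; the tensors below are placeholders added purely for illustration:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)          # raw outputs for a batch of 4 examples and 10 classes
labels = torch.randint(0, 10, (4,))  # ground-truth class indices
loss = criterion(logits, labels)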

Now, we can implement the forward pass:

    def forward(self, x: Tensor):
        h = self.w0(x)
        y = self.w1(h)
        return y

These two code snippets combined form a very basic feedforward neural network. To turn this into an RNN, we need to add a feedback loop from the last hidden state back to the input:

    def forward(self, x: Tensor, h_p: Tensor):
        h = self.w0(torch.cat((x, h_p), dim=-1))  # concatenate the input with the previous hidden state
        y = self.w1(h)
        return (y, h)

That is basically all there is to it. Since we have now increased the number of inputs to the neuron layer defined by w0, we need to update its definition in __init__. Let's do that and combine everything into a single snippet:

import torch
import torch.nn as nn
from torch import Tensor

class SimpleRNN(nn.Module):
    def __init__(self, in_dims: int, hidden_dims: int, out_dims: int):
        super(SimpleRNN, self).__init__()
        self.w0 = nn.Linear(in_dims + hidden_dims, hidden_dims)  # takes the current input plus the previous hidden state
        self.w1 = nn.Linear(hidden_dims, out_dims)

    def forward(self, x: Tensor, h_p: Tensor):
        h = self.w0(torch.cat((x, h_p), dim=-1))
        y = self.w1(h)
        return (y, h)

In each forward pass, the hidden-layer activations h are returned along with the output. These activations can then be passed back into the model together with each new token in the sequence. The process looks like this (the following code is for illustration only):

model = SimpleRNN(n_in, n_hidden, n_out)

...

h = torch.zeros(1, n_hidden)    # initial hidden state
for token in seq:
    (out, h) = model(token, h)  # feed the previous hidden state back in at every step

At this point, we have successfully defined a simple feedforward network and extended it into a simple RNN.
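Note that the loop above assumes each token has already been turned into a float vector of size n_in. One common way to do this, shown here purely as an illustrative assumption (the original article does not cover it), is a learned lookup table:

import torch
import torch.nn as nn

vocab_size, n_in = 1000, 32                        # hypothetical vocabulary and input sizes
lookup = nn.Embedding(vocab_size, n_in)            # maps a token id to a dense vector
token_ids = torch.tensor([4, 17, 99])              # a toy tokenized sequence
seq = [lookup(t).unsqueeze(0) for t in token_ids]  # vectors of shape (1, n_in), ready for the loop above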

03. Language model Embedding

The hidden layer in the example above effectively encodes everything that has been fed into the RNN (all of the tokens). More specifically, all the information needed to parse the text the RNN has seen should be contained in the activation values h. In other words, h encodes the semantics of the input sequence, and the ordered set of floating-point values defined by h is the Embedding vector, or Embedding for short.
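To make this concrete, here is a hedged sketch of treating the final hidden state as an Embedding and comparing two sequences. It reuses the model, seq, and n_hidden names from above; seq_a and seq_b stand in for two tokenized-and-vectorized sequences, and cosine similarity is a common choice rather than something prescribed by the article:

import torch
import torch.nn.functional as F

def embed(model: SimpleRNN, seq: list, n_hidden: int) -> torch.Tensor:
    # run the RNN over a sequence of input vectors and return the final hidden state
    h = torch.zeros(1, n_hidden)
    for token in seq:
        (_, h) = model(token, h)
    return h.squeeze(0)

emb_a = embed(model, seq_a, n_hidden)
emb_b = embed(model, seq_b, n_hidden)
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)  # higher means more semantically similar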

These vector representations broadly form the basis of vector search and vector databases. Although today's natural language Embeddings are generated by another class of machine learning models called Transformers rather than by RNNs, the underlying concept is the same: encoding text into Embedding vectors that computers can understand. We'll discuss using Embedding vectors in detail in our next blog post.

04. Summary

We implemented a simple recurrent neural network in PyTorch and briefly introduced language model Embeddings. Although recurrent neural networks are powerful tools for understanding language and can be used in a wide variety of applications (machine translation, classification, question answering, etc.), they are not the type of ML model used to generate Embedding vectors today.

In the next tutorial, we will use an open-source Transformer model to generate Embedding vectors and demonstrate the power of vectors by performing vector search and operations on them. We will also return to the bag-of-words model and see how the two can be used together to encode both vocabulary and semantics. Stay tuned!
