Do-it-yourself ChatGPT: analyzing input processing in the transformer model underlying GPT

We covered some basic concepts earlier. If you are not familiar with the fundamentals of deep learning, you can find more information here. Since there are already plenty of deep learning tutorials, I will not repeat them; instead, I will focus on analyzing the principles, implementation, and practice of the ChatGPT model. ChatGPT is built on a deep learning architecture called the transformer, which is made up of a series of components that we will analyze one by one. First, let's look at the basic structure of the transformer model:
(Figure: the transformer architecture diagram, with the encoder on the left and the decoder on the right.)

I wonder if you also feel that this picture has a cyberpunk, sci-fi look. Of the two large blocks, the one on the left is called the encoder. A very common pattern in deep learning is to run the input data through a series of operations and turn it into a specific vector, called a latent vector in the terminology; this vector records particular attributes of the input data. The block on the right is called the decoder; its job is to interpret the intermediate vector produced by the encoder and then generate a specific output. To give a concrete example: when the police investigate a case, they often ask a witness to describe the suspect's appearance. The witness acts like an encoder, and the features he describes, such as "round face, curly hair, high forehead", are like the encoder's output vector. A sketch artist from the police department then draws the suspect's face from these features. The drawn portrait usually differs quite a bit from the real suspect's appearance, but because it captures specific features, it is still a great help to the police in tracking the suspect down.
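As a rough, minimal sketch of this encoder-decoder split (using PyTorch's built-in nn.Transformer module; the names and sizes below are arbitrary and have nothing to do with ChatGPT's actual configuration), the following shows how a source sequence flows through the encoder and how the decoder combines the target sequence with the encoder's latent output:

import torch
import torch.nn as nn

# A toy encoder-decoder transformer; the sizes are arbitrary and only
# illustrate the data flow, not the configuration of any real model.
toy_model = nn.Transformer(d_model=32, nhead=4,
                           num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, 32)  # source sequence: 10 positions, batch of 1, 32-dim vectors
tgt = torch.rand(7, 1, 32)   # target sequence fed to the decoder

# The encoder turns `src` into latent vectors ("memory"); the decoder reads
# that memory while producing an output for each target position.
out = toy_model(src, tgt)
print(out.shape)  # torch.Size([7, 1, 32])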

First, let's look at the first step:
(Figure: the 'Inputs' and 'Input Embedding' blocks at the bottom of the transformer diagram.)
Inputs are the input data of the model; for ChatGPT, the input is a word or a sentence. Input embedding is a preprocessing step that converts the input word or sentence into a vector, and this step is an important topic in NLP. As we described earlier, any object that is hard to represent with traditional data structures can be represented by a vector. Once a word is converted into a vector in a multidimensional space, we can study how that vector is distributed in the space to understand the word's characteristics in the language. Likewise, if two words are each converted into vectors and their vectors lie close to each other in space, we consider the two words to be closely related.

First, let's look at how to convert words into vectors. Here we use the BERT model, one of Google's early foundational language models, which lets us convert words into vectors directly. Consider the following sentence:

The man is king and he loves dog, the woman is queen and she loves cat

There are several key word pairs in this sentence: (man, woman), (king, queen), (dog, cat). The two words in each pair are close in meaning, so it is reasonable to expect that if we convert them into vectors, the vectors for the words within each pair will also be close to each other in space. Let's check this with code:

from transformers import AutoTokenizer, AutoModel
import torch

# Load the BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# The six words whose embeddings we want to compare
words = ["man", "woman", "king", "queen", "cat", "dog"]
sentence = "The man is king and he loves dog, the woman is queen and she loves cat"
word1 = "man"
word2 = "woman"
word3 = "king"
word4 = "queen"
word5 = "cat"
word6 = "dog"

# Tokenize the sentence and words
tokens = tokenizer(sentence, return_tensors='pt')
word1_tokens = tokenizer(word1, return_tensors='pt')
word2_tokens = tokenizer(word2, return_tensors='pt')
word3_tokens = tokenizer(word3, return_tensors='pt')
word4_tokens = tokenizer(word4, return_tensors='pt')
word5_tokens = tokenizer(word5, return_tensors='pt')
word6_tokens = tokenizer(word6, return_tensors='pt')

# Get an embedding for the sentence and for each word by mean-pooling
# BERT's last hidden states over all tokens
sentence_embedding = model(**tokens).last_hidden_state.mean(dim=1).squeeze()
word1_embedding = model(**word1_tokens).last_hidden_state.mean(dim=1).squeeze()
word2_embedding = model(**word2_tokens).last_hidden_state.mean(dim=1).squeeze()
word3_embedding = model(**word3_tokens).last_hidden_state.mean(dim=1).squeeze()
word4_embedding = model(**word4_tokens).last_hidden_state.mean(dim=1).squeeze()
word5_embedding = model(**word5_tokens).last_hidden_state.mean(dim=1).squeeze()
word6_embedding = model(**word6_tokens).last_hidden_state.mean(dim=1).squeeze()

embeddings = [word1_embedding.detach().numpy().reshape(1, -1).flatten(), word2_embedding.detach().numpy().reshape(1, -1).flatten(),
              word3_embedding.detach().numpy().reshape(1, -1).flatten(), word4_embedding.detach().numpy().reshape(1, -1).flatten(),
              word5_embedding.detach().numpy().reshape(1, -1).flatten(), word6_embedding.detach().numpy().reshape(1, -1).flatten()]

The code above downloads the pretrained BERT model and splits the given sentence into tokens. Each token has a corresponding number (an id) in the vocabulary the model was trained on; the tokenizer looks up those numbers, which is what word1_tokens and the other *_tokens variables above hold, and feeding them into the model yields the corresponding vector, e.g. word1_embedding. We don't need to worry about the internal logic of this code; all that matters is that it turns words into vectors.
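As an optional aside, we can peek at the numbers the tokenizer assigns; the exact id values come from the bert-base-uncased vocabulary and do not matter for what follows.

# Optional: peek at the ids the tokenizer assigns.
# The concrete values come from BERT's vocabulary and are not important here.
print(tokenizer.tokenize(sentence))   # the sentence split into (sub-)word tokens
print(word1_tokens["input_ids"])      # ids for "man", wrapped between [CLS] and [SEP]

Now let's print one of the vectors the model produced and see what it looks like: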

import numpy as np
print(np.shape(word1_embedding.detach().numpy().reshape(1, -1).flatten()))
print(f"word embedding : {word1_embedding}")

The result after running the above code is as follows:
(Figure: the printed output, showing a vector shape of (768,) followed by the embedding values.)
You can see that after the word is converted into a vector, the result is a one-dimensional vector with 768 elements (the hidden size of the bert-base model). Next, let's plot the six word vectors to see how they relate to each other:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 768-dimensional embeddings down to 2 dimensions so they can be plotted
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
fig, ax = plt.subplots()
ax.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
for i, word in enumerate(words):
    ax.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
plt.show()

The result after running the above code is as follows:
(Figure: a 2D scatter plot of the six word embeddings after PCA; man/woman, king/queen, and cat/dog each appear near one another.)
You can see that the words in each of the pairs mentioned above sit close to each other, which means they are relatively close in meaning. Once words are vectorized, we can also compute their semantic similarity: as the figure suggests, (man, woman), (king, queen), and (cat, dog) are each semantically similar. Algorithmically, we judge how similar the objects behind two vectors are by computing the cosine of the angle between the vectors.
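For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths, a·b / (|a|·|b|); values close to 1 mean the vectors point in nearly the same direction. A minimal NumPy sketch of this computation (cosine_sim is just an illustrative helper) looks like this:

import numpy as np

def cosine_sim(a, b):
    # dot product divided by the product of the two vectors' lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Using the flattened numpy embeddings collected earlier: man vs. king
print(cosine_sim(embeddings[0], embeddings[2]))

The result should agree, up to floating-point precision, with torch.cosine_similarity, which we use below: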

man_king_sim = torch.cosine_similarity(word1_embedding, word3_embedding, dim=0)
print(f"sim for man and king: {man_king_sim}")

The output of the above code is:
sim for man and king: 0.861815869808197

woman_queen_sim = torch.cosine_similarity(word2_embedding, word4_embedding, dim=0)
print(f'sim for woman and queen: {woman_queen_sim}')

The code output is:
sim for woman and queen: 0.8941507935523987

man_woman_sim = torch.cosine_similarity(word1_embedding, word2_embedding, dim=0)
print(f'sim for man and woman: {man_woman_sim}')

The output is:
sim for man and woman: 0.9260298013687134

man_dog_sim = torch.cosine_similarity(word1_embedding, word6_embedding, dim=0)
print(f'sim for man and dog: {man_dog_sim}')

woman_cat_sim = torch.cosine_similarity(word2_embedding, word5_embedding, dim=0)
print(f'sim for woman and cat : {woman_cat_sim}')

dog_cat_sim = torch.cosine_similarity(word6_embedding, word5_embedding, dim=0)
print(f'sim for dog and cat: {dog_cat_sim}')

The output of the above code is:

sim for man and dog: 0.8304506540298462
sim for woman and cat : 0.876427948474884
sim for dog and cat: 0.900851309299469

From the output we can see that dog and cat are semantically close, and man and woman are semantically close. Within a specific text, another factor that affects how closely two words relate is their distance from each other. If "man" and "woman" appear very close together in a sentence, they may well refer to a couple; if they are far apart, the text may simply be talking about two unrelated people. So the distance between words in a text also shapes their combined meaning, and when ChatGPT processes text it takes word distance into account as well.

Going back to the architecture diagram above, you will see that the input part also contains something called positional encoding, as shown below:
(Figure: the 'Positional Encoding' block that is added to the 'Input Embedding' output in the transformer diagram.)

The essence of positional encoding is to encode the position of each word within the sentence. The result of encoding a position is also a one-dimensional vector, and its length must match that of the word vector, because the two are added together before the sum is passed on for the next processing step. So how does the model "encode" a word's position? It uses the following calculation:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

In the formula, pos is the position of the word in the sentence, i indexes the component pairs of the vector, and d_model is the length of the vector. When i = 0, PE(pos, 0) is the value of the 0th component and PE(pos, 2*0+1) is the value of the 1st component. Since the word vectors above have length 768, d_model here is also 768. Let's look at the implementation code:

import numpy as np

def positional_encoding(max_len, d_model):
    """
    Generates positional encoding for a given sequence length and model dimension.
    
    Args:
        max_len (int): Maximum sequence length.
        d_model (int): Model dimension.
        
    Returns:
        np.array: Positional encoding matrix of shape (max_len, d_model).
    """
    pos_enc = np.zeros((max_len, d_model))
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Compute positional encoding values
            div_term = np.power(10000, (2 * i) / d_model)
            pos_enc[pos, i] = np.sin(pos / div_term)
            pos_enc[pos, i + 1] = np.cos(pos / div_term)
    return pos_enc

# Note: for simplicity, the character length of the sentence is used as the
# number of positions here (a full implementation would count tokens instead)
max_len = len(sentence)
d_model = 768
pos_enc = positional_encoding(max_len, d_model)
print("Positional Encoding:\n", pos_enc)

After the above code is executed, the output is as follows:

Positional Encoding:
 [[ 0.00000000e+00  1.00000000e+00  0.00000000e+00 ...  1.00000000e+00
   0.00000000e+00  1.00000000e+00]
 [ 8.41470985e-01  5.40302306e-01  8.15250650e-01 ...  1.00000000e+00
   1.04913973e-08  1.00000000e+00]
 [ 9.09297427e-01 -4.16146837e-01  9.44236772e-01 ...  1.00000000e+00
   2.09827946e-08  1.00000000e+00]
 ...
 [-8.55519979e-01 -5.17769800e-01  8.57295439e-01 ...  1.00000000e+00
   7.02923619e-07  1.00000000e+00]
 [-8.97927681e-01  4.40143022e-01  9.16178088e-01 ...  1.00000000e+00
   7.13415016e-07  1.00000000e+00]
 [-1.14784814e-01  9.93390380e-01  2.03837160e-01 ...  1.00000000e+00
   7.23906413e-07  1.00000000e+00]]
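As a quick sanity check on the formula: for pos = 1 and i = 0 the divisor is 10000^0 = 1, so the first two entries of the second row should be sin(1) ≈ 0.8415 and cos(1) ≈ 0.5403, which matches the output above.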

Next, following the architecture diagram, we add each word's positional encoding vector to its word vector:

# Wrap each row of the positional-encoding matrix as a torch tensor
pe_tensor = []
for tensor in pos_enc:
  pe_tensor.append(torch.from_numpy(tensor))

# Look up each word's character offset in the sentence (a simplification)
# and add the corresponding positional-encoding row to its embedding
man_index = sentence.index("man")
man_pe_tensor = word1_embedding + pe_tensor[man_index]

woman_index = sentence.index("woman")
woman_pe_tensor = word2_embedding + pe_tensor[woman_index]

dog_index = sentence.index("dog")
dog_pe_tensor = word6_embedding + pe_tensor[dog_index]

cat_index = sentence.index("cat")
cat_pe_tensor = word5_embedding + pe_tensor[cat_index]

The resulting vectors now contain not only the semantics of each word but also its position in the sentence, and ChatGPT combines both kinds of information when recognizing a word. What effect does adding position information to a word vector have? One visible effect is that the farther apart two words are in the sentence, the lower their measured similarity should be; that is, once each word vector has its positional vector added, the cosine similarity computed between the results should drop accordingly. Let's experiment and see:

man_woman_pe = torch.cosine_similarity(man_pe_tensor, woman_pe_tensor, dim=0)
print(f'man woman sim with pe: {man_woman_pe}')  # without positional encoding: 0.9260298013687134

man_dog_pe = torch.cosine_similarity(man_pe_tensor, dog_pe_tensor, dim=0)
print(f'man dog sim with pe: {man_dog_pe}')  # without positional encoding: 0.8304506540298462

woman_cat_pe = torch.cosine_similarity(woman_pe_tensor, cat_pe_tensor, dim=0)
print(f'woman cat sim with pe: {woman_cat_pe}')  # without positional encoding: 0.876427948474884

dog_cat_pe = torch.cosine_similarity(dog_pe_tensor, cat_pe_tensor, dim=0)
print(f'dog cat sim with pe: {dog_cat_pe}')  # without positional encoding: 0.900851309299469

The result of running the above code is as follows:

man woman sim with pe: 0.8062191669897198
man dog sim with pe: 0.8047418505220771
woman cat sim with pe: 0.7858729168616464
dog cat sim with pe: 0.7830654688550558

From these results we can see that once the distance factor is taken into account, the similarity between the corresponding word pairs does indeed decrease. This completes our analysis of the input-processing part of the model. In the next section we will analyze the theoretical cornerstone of large language models: multi-head attention. It is precisely the introduction of this mechanism that makes the language generation ability of ChatGPT, built on the transformer architecture, so powerful.

Source: blog.csdn.net/tyler_download/article/details/130049392