Large Language Models (LLMs) Explained in Plain Terms

AI (artificial intelligence) is a technology that uses machines to simulate human cognitive abilities. Its core capability is to make judgments or predictions from given inputs: data are analyzed to uncover the internal patterns of the object under study. In practice, appropriate statistical, machine learning, and deep learning methods are applied to large volumes of collected data, which are computed, analyzed, summarized, and organized so that the value of the data can be exploited to the fullest.

Currently, AI applications fall into two major areas: computer vision and natural language processing.

   1. Computer vision application scenarios: face recognition, autonomous driving, vehicle recognition, medical imaging, industrial-robot image classification, image-based augmented reality, etc.

   2. Natural language processing application scenarios: intelligent customer service, speech recognition, machine translation, natural language generation, intelligent assistant, information extraction, etc.

NLP (Natural Language Processing) is the study of how to make computers understand human language, that is, how to convert natural human language into instructions that computers can process. NLP is a branch of both artificial intelligence and linguistics.

The LLM is an important component of NLP. It is mainly used to predict the probability distribution of the next word or character in natural language text, and can be regarded as a learned abstraction of the rules of a language.
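For intuition, here is a tiny runnable sketch of "predicting the probability distribution of the next word", using the small open GPT-2 model from Hugging Face as a stand-in for any language model; the choice of model and prompt is purely illustrative:

```python
# Show the top-5 next-token probabilities predicted by a small language model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]      # scores for the next token only
probs = torch.softmax(logits, dim=-1)           # turn scores into a probability distribution

top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)]):>10}  p={p.item():.3f}")
```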

This article mainly describes the LLM (Large Language Model).

An LLM is an artificial intelligence model designed mainly to understand and generate human language. It is trained on large amounts of text data and can perform a wide range of tasks, including text summarization, machine translation, and sentiment analysis; the most common applications are intelligent customer service, speech recognition, machine translation, and natural language generation.

LLMs are characterized by their large scale, containing tens or even hundreds of billions of parameters. Such models can capture the complex patterns of language, including syntax, semantics and some contextual information, and can therefore generate coherent and meaningful text.

At present there are many mature large models, both in China and abroad, as shown below:

Among them, ChatGLM is a bilingual conversational model developed by Zhipu AI, a company that commercializes Tsinghua University's research achievements. By number of training parameters, the family includes several models such as ChatGLM-130B, ChatGLM-6B, and ChatGLM2-6B (the parameter unit 1B = 1 billion).

In Stanford University's 2022 comprehensive evaluation of 30 mainstream large models worldwide, ChatGLM-130B was the only model from Asia to be selected. Its accuracy and maliciousness metrics are close to or on par with those of GPT-3-175B.

Among the ChatGLM products, ChatGLM-6B is an open-source, Chinese-English bilingual conversational language model. It is based on the GLM (General Language Model) architecture and has 6.2 billion parameters. ChatGLM-6B uses technology similar to ChatGPT and is optimized for Chinese question answering and dialogue. After bilingual Chinese-English training on roughly 1T tokens, supplemented by supervised fine-tuning, feedback bootstrapping, reinforcement learning from human feedback, and other techniques, the 6.2-billion-parameter ChatGLM-6B can already generate answers that align fairly well with human preferences. Moreover, ChatGLM-6B can be deployed locally on consumer-grade graphics cards (a minimum of 6 GB of video memory at the INT4 quantization level); even at FP16 precision, inference requires only about 13 GB of video memory. In this article an NVIDIA GeForce RTX 3090 GPU with 24 GB of video memory is used.
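To make the deployment requirements concrete, here is a minimal local-inference sketch with the Hugging Face transformers library, assuming the open-source THUDM/chatglm-6b weights; the quantize(4) call comes from the model repository's own code and corresponds to the INT4 level mentioned above:

```python
# Minimal local-inference sketch for ChatGLM-6B (assumes the THUDM/chatglm-6b
# weights and a CUDA-capable GPU).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    .quantize(4)      # drop this line to run at FP16 (~13 GB of video memory)
    .half()
    .cuda()
    .eval()
)

# The repository exposes a chat() helper that keeps the dialogue history.
response, history = model.chat(tokenizer, "What is GLM?", history=[])
print(response)
```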

At present, mainstream large models are mostly pre-trained based on LLaMA or ChatGLM, and several pre-training architectures have emerged, as follows:

The autoencoding model (AE), the autoregressive model (AR), and the encoder-decoder model (Seq2Seq) shown in the figure each have their own advantages and disadvantages, and none of them performs best across all three areas of natural language understanding (NLU), unconditional generation, and conditional generation. T5 once tried to use multi-task learning (MTL) to unify these frameworks, but the objectives of autoencoding and autoregression are inherently different, and a simple fusion cannot inherit the advantages of each framework.

In this three-way stalemate, GLM was born. The GLM model is based on autoregressive blank infilling and combines the ideas of the three pre-training approaches above. The main techniques GLM uses are bidirectional attention and an autoregressive blank-infilling objective; an embedding-gradient-shrinkage strategy also significantly improves training stability. (For details of the model structure, see the paper: GLM: General Language Model Pretraining with Autoregressive Blank Infilling.)

Next, I will explain the details using this "GLM mask principle" diagram.

1. What the schematic shows: how the prompt fed to the model is masked, and how the unidirectional and bidirectional attention mechanisms are implemented at the same time, so that the training objective and the GLM structure can be fully understood.

2. Setup: assume there is a piece of original data that is parsed into six spans after tokenization, namely x1-x6. Two spans are then randomly masked, namely x3 and the pair x5, x6, and they are marked with [MASK]. Formally, let the tokenized text be x = [x1, x2, ..., xn] and let the sampled spans be {s1, s2, ..., sm}; each selected span is replaced with a [MASK] token, yielding the corrupted text Xcorrupt. Let Zm denote all possible permutations of an index sequence of length m; the pre-training objective (as given in the GLM paper) can then be expressed as:

$$\max_{\theta}\;\mathbb{E}_{z\sim Z_m}\left[\sum_{i=1}^{m}\log p_{\theta}\left(s_{z_i}\mid x_{\mathrm{corrupt}},\,s_{z_{<i}}\right)\right]$$

3. Detailed explanation: from (a) and (b) in the figure, the original data is x1-x6. After sampling x3 and the pair x5, x6, the data is split into two parts, Part A and Part B. Part A is the corrupted text, in which each sampled span is replaced by a [MASK] token; Part B consists of the sampled spans themselves. Part A and Part B are then concatenated as the input to the model.

Figure (c) introduces the extra [S] and [E] tokens: each span is prefixed with [S] on the input side and suffixed with [E] on the output side. Notice also that x5, x6 and x3 have swapped positions: the sampled spans are placed in random order, which ensures that the model fully learns the dependencies between spans. The figure also shows the two position encodings, Position 1 and Position 2. Position 1 is the position of each token in the corrupted text; every token of a Part B span shares the position of its corresponding [MASK] in Part A. Position 2 is the position within the span to be filled; all Part A tokens are encoded as 0, while the tokens of the span [S] x5 x6, being a fragment to be filled in, are encoded as 1, 2, 3.
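To make the two position encodings concrete, here is a small hand-worked sketch in plain Python for the x1-x6 example (1-based positions, as in the figure, are an assumption for readability):

```python
# GLM's 2D position ids for the x1..x6 example: [M] = mask placeholder, [S] = span start.
part_a = ["x1", "x2", "[M]", "x4", "[M]"]   # x3 and (x5, x6) are masked out
part_b = ["[S]", "x5", "x6", "[S]", "x3"]   # sampled spans, in shuffled order

# Position 1: index in the corrupted text; every token of a Part B span
# reuses the position of its [M] placeholder in Part A.
pos1 = [1, 2, 3, 4, 5] + [5, 5, 5, 3, 3]

# Position 2: intra-span position; Part A tokens are all 0,
# tokens inside each Part B span count 1, 2, 3, ...
pos2 = [0, 0, 0, 0, 0] + [1, 2, 3, 1, 2]

for tok, p1, p2 in zip(part_a + part_b, pos1, pos2):
    print(f"{tok:>4}  position_1={p1}  position_2={p2}")
```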

Figure (d) shows that there is both bidirectional and unidirectional attention. The clever part is the mask: tokens in Part A can see each other, but cannot see any token in Part B; tokens in Part B can see all of Part A and the preceding tokens in Part B, but cannot see future tokens in Part B (in the figure, the blue and yellow regions versus the data circled in green).

To summarize: Part A works like an autoencoding MLM language model, which naturally uses bidirectional context, while each token in Part B has to be predicted autoregressively from left to right, so it can only see unidirectional information. The model thus learns a bidirectional encoder (Part A) and a unidirectional decoder (Part B) at the same time. In essence, GLM uses the unmasked context to autoregressively predict the masked information.
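The attention rule in figure (d) can also be written down directly as a mask matrix. Below is a tiny sketch for the same example, with 5 Part A tokens followed by 5 Part B tokens, where 1 means "may attend" and 0 means "masked":

```python
# Build the GLM attention mask for 5 Part A tokens followed by 5 Part B tokens.
len_a, len_b = 5, 5
n = len_a + len_b
mask = [[0] * n for _ in range(n)]

for i in range(n):
    for j in range(n):
        if j < len_a:
            # Every token can see all of Part A (bidirectional over the corrupted text).
            mask[i][j] = 1
        elif i >= len_a and j <= i:
            # Part B tokens additionally see earlier Part B tokens (autoregressive).
            mask[i][j] = 1
        # Part A tokens never see Part B, and no token sees future Part B tokens.

for row in mask:
    print(row)
```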

If you want to build a large model for a vertical domain, you need to do further pre-training on the knowledge of that industry. The current vertical-domain large models in various industries are as follows:

Let's give a brief introduction to the training methods in the figure above. There are currently two main ways to build a private, vertical-domain large model: the first is fine-tuning the model (LoRA, P-Tuning v2), and the second is a knowledge-base approach based on the LangChain framework.

1. LoRA (Low-Rank Adaptation of Large Language Models) is a technique introduced by Microsoft researchers, mainly to deal with the cost of fine-tuning large models: when adapting a large model to downstream tasks, good results can be achieved by training only a small number of parameters. Core principle: the pre-trained weights of the large language model are frozen, and a trainable low-rank decomposition matrix is injected into each layer of the Transformer architecture; operating on this low-rank matrix greatly reduces the number of trainable parameters.

In the figure above (paper: https://arxiv.org/abs/2106.09685), the blue block on the left represents the pre-trained model parameters, and the orange blocks on the right are the newly initialized matrices A and B. As we all know, directly training the blue original model is extremely resource-intensive: it requires something like eight A100 graphics cards (currently around 130,000 each), and training consumes a great deal of power (a single training run of OpenAI's GPT-3 costs roughly US$1.4 million). To save resources, the LoRA idea is to initialize the two orange matrices on the right with a Gaussian distribution and with zeros respectively, and to fix the parameters of the pre-trained language model (the blue part) during training: only the down-projection matrix A and the up-projection matrix B are trained, while the input and output dimensions of the model remain unchanged. At the output, BA is added to the pre-trained parameters. Because A is initialized with a random Gaussian distribution and B with a zero matrix, the newly added branch BA = 0 at the start of training and therefore has no impact on the model's results. At inference time, reparameterization can be used to merge AB into W, so no extra computation is introduced during inference.

Suppose full-parameter fine-tuning is performed on the original model; this adds an increment to the weights: W = W0 + ΔW. Building on this formula, LoRA freezes the original parameters W0 and shrinks the incremental part through a low-rank decomposition, ΔW = B*A. If the original parameter matrix has dimension d*d (the blue part), the parameter count after low-rank decomposition is 2*r*d. Because r is much smaller than d, this drastically reduces the number of fine-tuned parameters, and the formula becomes W = W0 + B*A.
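As an illustration of W = W0 + B*A, here is a minimal PyTorch sketch of a LoRA-style linear layer (not the official LoRA/PEFT implementation; the layer size d = 4096 and rank r = 8 are assumed only for the parameter count shown at the end):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A minimal LoRA wrapper: y = W0 x + B(A x), with W0 frozen."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False              # freeze pre-trained weight W0
        # A: random Gaussian init (down-projection), B: zeros (up-projection),
        # so B @ A == 0 at the start of training and the model output is unchanged.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2*r*d = 65,536 trainable parameters vs. d*d = 16,777,216 in the frozen W0
```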

In addition, if you are interested, you can also look into prompt tuning, P-Tuning v2 (prefix tuning) and other algorithms; I will summarize them when I have time.

2. LangChain knowledge base

The point of the knowledge base approach: to build a knowledge-base question-answering solution that is friendly to Chinese scenarios and open-source models and that can run entirely offline.

Based on the picture above, the knowledge base works in the following three steps:

  1. Split the document into multiple chunks, embed (vectorize) each chunk, and save the vectors to the vector store (the Vector Store in the figure).

  2. Take the user's question, embed it first to generate a query vector, then search the vector store for the top-K closest matches. Each match comes with a score, and low-scoring results can be filtered out by a threshold (similar to Elasticsearch).

  3. Assemble the retrieved K pieces of content together with the user's question into a prompt, call the LLM interface, and generate the answer (see the sketch after this list).
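Here is a minimal sketch of the three steps with the 2023-era LangChain API; the file name, the embedding model, the score threshold, and the chatglm_answer() helper are illustrative assumptions rather than fixed choices:

```python
# Knowledge-base QA sketch: split -> embed -> vector store -> retrieve -> prompt -> LLM.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Split the document into chunks, embed them and save them to a vector store.
text = open("knowledge.txt", encoding="utf-8").read()                       # assumed file
chunks = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20).split_text(text)
embeddings = HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")
store = FAISS.from_texts(chunks, embeddings)

# 2. Embed the question and retrieve the top-K most similar chunks; FAISS returns a
#    distance score (lower = more similar), so a threshold can filter out weak hits.
question = "What does Xiao Shi like to do?"
hits = store.similarity_search_with_score(question, k=3)
context = "\n".join(doc.page_content for doc, score in hits if score < 1.0)  # assumed threshold

# 3. Assemble the retrieved content and the question into a prompt and call the LLM.
prompt = (
    "Known information:\n" + context + "\n\n"
    "Answer the user's question concisely and professionally based on the known "
    "information above. The question is: " + question
)
# answer = chatglm_answer(prompt)   # e.g. model.chat(tokenizer, prompt, history=[])
```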

Here is an example to make this concrete:

Suppose there is a document containing the following paragraph:

    "Xiao Shi likes to play mobile phones, play games, and watch movies. Amin likes to complain about Xiao Shi."

After splitting, the paragraph is divided into two sentences:

    (1) "Xiao Shi likes to play mobile phones, play games, and watch movies."

    (2) "Amin likes to complain about Xiaoshi."

You can ask this question:

     "What does Xiao Shi like to do?"

Querying the text vector store with this question returns the following result based on similarity:

     "Xiao Shi likes to play mobile phones, play games, and watch movies."

This is because both this sentence and your question contain "Xiao Shi" and "like".

This paragraph and your question are then combined into a prompt:

"""Known information:

Xiao Shi likes to play mobile phones, play games and watch movies.

Answer the user's question concisely and professionally based on the known information above. If an answer cannot be derived from it, please say "The question cannot be answered based on the known information" or "Insufficient relevant information has been provided". Fabrication is not allowed in the answer. Please answer in Chinese. The question is: What does Xiao Shi like to do?

"""

Feed this prompt to the LLM, and the model will reorganize the language and give the answer!


Origin blog.csdn.net/SmallTenMr/article/details/133066350