Do you know the core technology behind ChatGPT?

Update: GPT-4 principles: https://yunyaniu.blog.csdn.net/article/details/129573291?spm=1001.2014.3001.5502

Over the weekend, I organized my notes on the core technology behind ChatGPT and the principles underlying it; I'm sharing them here for learning.

some test experiments

Workplace PUA: How to evaluate OpenAI's super dialogue model ChatGPT? - Young's answer - Zhihu https://www.zhihu.com/question/570189639/answer/2788083617

Writing a novel: How to evaluate OpenAI's super dialogue model ChatGPT? - Tian Yuandong's answer - Zhihu https://www.zhihu.com/question/570189639/answer/2787584375

Virtual Machine: How to evaluate OpenAI's super dialogue model ChatGPT? - Malt's answer - Zhihu https://www.zhihu.com/question/570189639/answer/2788647814


◎Source|Intelligent Learning and Thinking

Enter a few simple keywords and AI can help you generate a short story or even a professional paper. ChatGPT's recent strong performance in tasks such as email writing, text translation, and code writing has even led Elon Musk to say he feels the "danger" of AI. The computational logic of ChatGPT comes from an algorithm called the Transformer, which originates from the 2017 research paper "Attention Is All You Need". The paper originally focused on natural language processing, but thanks to its excellent interpretability and computational performance it has been widely adopted across AI and has become the most popular AI algorithm model of recent years. Both the paper and the Transformer model are a microcosm of how AI technology is developing today. With that premise, this article analyzes the core points and main innovations of the paper.

origin

From the proposal of the Transformer, to the birth of GPT (Generative Pre-Training), the "large-scale pre-training model", to GPT-2, whose iteration marked OpenAI's shift to a for-profit company, and then the "breakout" of GPT-3 and ChatGPT; looking across industries, important fields such as biomedicine and intelligent manufacturing have produced Transformer-based technologies. Under this wave, my thoughts are:

First, for a long time to come, the field of AI will experience rapid iterations of the cycle of "scientific research, computing power, infrastructure, engineering, data, and solutions"; this fluidity and innovation will not stabilize in the short term, but will instead grow stronger. We cannot wait until the technology is fully packaged and all of this knowledge is hidden away before polishing a product. **The teams that win future competition will be those that "balance productization with scientific research and engineering innovation".** Our usual understanding of R&D is really engineering, but the practical, scientific nature of AI requires teams to embrace this "fluidity". Understanding the full technology stack has therefore become a hard requirement for practitioners and anyone interested in AI.

Second, by walking through this paper we can see more intuitively what has happened on the research side, and at what speed and rhythm. Which results are milestones, the work of the Messis of the scientific world who step forward and lead us toward the truth? Which are micro-innovations, where the direction is clear but there is still plenty of room to expand? And which are more like alchemy, still being figured out, which may take a long time, or may stay that way?

Third, in the AI field, for technical reasons, more and more papers are released with open-source code. On the one hand this encourages more people to participate in improving and iterating them; on the other hand, research and engineering are seamlessly connected, and a single paper can drive value all the way from core code, to platforms, to the diffusion of a wide range of concrete applications. **A single paper can become a field, a race track, and even directly drive substantial growth in business and customer value.**

Fourth, AI technology spans many fields (perception and cognition; perception splits into images, speech, text, and so on, and cognition can also be divided into many layers), and the algorithmic logic of these fields used to differ greatly. **The emergence of the Transformer shows signs of pushing these fields toward convergence.** A clear introduction to this paper may help in grasping the whole picture. In addition, ChatGPT is a phenomenon-level application that everyone can experience directly, and the experience and update speed of this type of application will only get faster. Understanding the logic behind it helps us grasp this trend.

paper introduction

Let's get to the topic and start introducing the paper. This will involve some technical details and formulas, so you may want to read it carefully (bookmark it first, and ideally set aside 15-20 minutes). I believe that once you have read it, your understanding of AI will deepen considerably.

Overall grasp

The structure of this paper is very concise: raise the problem, analyze the problem, solve the problem, and give test data. Top-venue papers are compact, with descriptions, code, and results; the core is the figure below, in which the author team proposes the core Transformer architecture:

[Figure: the Transformer model architecture]

The entire article is explained around this figure. Due to space limitations, we focus on one main line: 1. What is the main problem the paper wants to solve? 2. How does it solve it? 3. Using the paper's solution as a case study to trigger broader thinking. So we simplify the content and focus on the core parts.

[Figure]

If you understand the content of this figure, then you have basically mastered 85% of this paper, and it is also the most critical part.

"Attention is all your need" was written mainly to consider NLP tasks, and it was completed by several Google researchers. One of the backgrounds is that Google is also promoting its own parallel computing chips and AI TensorFlow development platform. The main feature of the platform is parallel computing, and the algorithm in this article is also maximizing the realization of parallel computing. Let's take a simple example to string this algorithm together.

core content

The requirement: we need to train a Chinese-to-English translation model.

Background knowledge: this requirement turns "translate the Chinese sentence 我爱你 into 'I love you'" into a y = f(x) problem, where x is the Chinese sentence and y is the English sentence. We need to obtain f() through training; once f() is trained successfully, translation can be performed. What everyone competes on is whose training method is more accurate and efficient, and whose f() is better to use.

The main prior algorithm for natural language processing is the RNN (Recurrent Neural Network). Its core logic is that after each "word" is computed, the result is passed on to the next word. The drawback is that this requires a lot of serial computation and is inefficient; and when a relatively long sentence is encountered, the earlier information is easily diluted, making the model inaccurate, i.e. quality degrades on long sentences. This is the problem the paper is dedicated to solving; in other words, the paper offers a better way to train f(). Think of how ChatGPT can now produce essay-length output, and you can feel the difference.

In the Transformer, the authors propose computing each word against all the words in the sentence, calculating the correlation between that word and every other word, so as to pin down the word's more precise meaning within the sentence.

Here we start to get into some technical details. Before we begin, we need to become familiar with one core concept in machine learning: the "vector". In the digital age, the smallest unit of mathematical operation is usually a single number. In the AI age, this smallest unit becomes a vector. This is one of the most important differences between computing in the digital age and in the intelligent age.

For example, in a bank, to judge a person's credit limit, we can use a vector to represent that person:

[Figure: an example credit-limit vector built from 8 features]

A vector is a collection of numbers, which can also be imagined as a point in a high-dimensional space. A specific credit-limit vector is a point in a high-dimensional space composed of 8 features. Data in high-dimensional space exhibits more mathematical properties, such as linear separability, which makes it easier for us to capture hidden regularities.

Vector addition, subtraction, multiplication and division are the most important calculation logic when a computer trains on samples.

**The main significance of the Transformer model is that it finds an algorithm that, in three steps, progressively places a word into a high-dimensional space, giving the word richer information along the way than other algorithms do.** In many cases each high-dimensional space carries a different meaning. Once the information carried by this vector is more accurate and closer to the real situation, the subsequent machine learning work becomes much easier. Take the credit-limit vector example again:

[Figure: two credit-limit vectors, one with an additional "annual salary" feature]

These two vectors live in two different vector spaces. The main difference is that the former has one additional feature: "annual salary". Think about it: if you are judging a person's credit limit, isn't "annual salary" a very important factor?

The example above is still very simple, just adding one feature; in the Transformer things are much more complicated. It comprehensively combines the information of multiple vectors through matrix addition, subtraction, multiplication and division, so as to give a vector a new meaning.
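
To make the "vector" idea concrete, here is a minimal sketch in Python/NumPy. The feature names and numbers are made up for illustration; they are not from the paper.

```python
import numpy as np

# A hypothetical 8-feature "credit limit" vector (made-up values):
# age, accounts, defaults, utilization, tenure, dependents, risk flag, savings.
person = np.array([35, 2, 1, 0.8, 12, 3, 0.1, 50])

# Adding a 9th feature ("annual salary") places the same person
# in a different, higher-dimensional space.
person_with_salary = np.append(person, 200)   # now a 9-dimensional point

# The basic operations used during training are just vector/matrix arithmetic.
scaled = 0.5 * person_with_salary
print(person_with_salary.shape, scaled[:3])
```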

OK, now that we understand the importance of vectors, let's look back at the Transformer's three steps: 1. Embedding; 2. Positional Encoding; 3. Self-Attention.

For example, suppose we want to translate the sentence "Smart John is singing" into Chinese.

First, vectorize each word of the sentence.

Let's look at the word "John" first. We need to convert the letter sequence "John" into a 512-dimensional vector John, so that the computer can start to work with it. In other words, John becomes a point in this 512-dimensional space. This is the first step: embedding.
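
A minimal sketch of the embedding step, assuming a tiny vocabulary and a randomly initialized lookup table (real models learn this table during training; the 512-dimensional size follows the paper's d_model):

```python
import numpy as np

d_model = 512
vocab = {"smart": 0, "john": 1, "is": 2, "singing": 3}

# One row per word; random here for illustration, learned in a real model.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(word: str) -> np.ndarray:
    """Look up the 512-dimensional vector for a word."""
    return embedding_table[vocab[word.lower()]]

john = embed("John")
print(john.shape)  # (512,)
```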

Next, the second step: **positional encoding**, which uses the following formula (an innovation of this paper)

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

to nudge the word into a new high-dimensional space and generate a new vector.

[Figure: the word vector after adding positional information]

We don't need to dwell on this formula. Its core idea is: in this new vector, instead of representing each position with plain 0s and 1s, each dimension is filled with sin and cos values. The purpose is to use the regularity of sin and cos so that the new vector not only represents the meaning of the word John but also encodes John's position in the sentence "Smart John is singing". If you don't follow this, you can skip it; just remember that the second step adds John's position in the sentence to the "vector expressing the word John". John is no longer an isolated word, but a word in a specific sentence, even though the meanings of the other words in the sentence are not yet known.

If after the first step the computer understands what "John" is, then after the second step it understands "... John ...", i.e. John occupying a particular position in a sentence.
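
A minimal sketch of the sinusoidal positional encoding, following the formula above (d_model = 512 as in the paper; the implementation details are illustrative):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe

pe = positional_encoding(max_len=4, d_model=512)
# The position-aware vector for "John" (position 1 in "Smart John is singing")
# would simply be its embedding plus its positional encoding:
# john_with_position = embed("John") + pe[1]
print(pe.shape)  # (4, 512)
```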

Finally, the third step: the self-attention mechanism (Self-Attention). Through an Attention(Q, K, V) computation, John is mapped into yet another new space. We define:

Attention(Q, K, V) = softmax( Q·K^T / √d_k ) · V

This new vector contains not only the meaning of John and John's position in the sentence, but also the relationship and relative importance between John and every other word in the sentence. We can understand it this way: "John" as a word is generic, "Smart John" is much more specific, and "Smart John who is singing" is one step closer still. Moreover, the Attention(Q, K, V) computation does not revolve around a single word; it computes the word against all the words in the sentence and adjusts the word's position in space by that calculation.

This method has an advantage on very long sentences, and most importantly, it breaks through the barrier of sequential (time-series) computation in one stroke. The earlier split between image algorithms and NLP algorithms was largely due to NLP's obvious sequential character: each word has a clear temporal relationship with the word before and after it. The Transformer breaks this constraint; it cares more about the weighted value between a word and every word in the sentence. This is the main reason the Transformer can be used everywhere.
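
A minimal sketch of scaled dot-product self-attention as given by the formula above. The matrices here are tiny and random, purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with Q = X W_q, etc."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # each row sums to 1
    return weights @ V                          # new, context-aware vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 512, 64              # "Smart John is singing"
X = rng.normal(size=(seq_len, d_model))         # embeddings + positional encoding
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 64)
```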


calculation process

For the specific calculation process, let's use translating the Chinese sentence 我爱你 into "I love you" as an example (this sentence is simpler). First, vectorize the words and absorb the sentence position information to obtain the sentence's initial group of vectors.

[Figure: the initial vector group of the sentence]

(Because each sentence in the sample has a different length, every sentence is represented as a 512×512 matrix; where a sentence is too short, the remainder is padded with 0. This way, no matter how long a sentence is, it can be represented by a matrix of the same size during training. Of course, 512 is a hyperparameter that can be adjusted before training.)
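
A minimal sketch of this padding step, assuming a maximum length of 4 instead of 512 to keep the example readable:

```python
import numpy as np

def pad_sentence(vectors: np.ndarray, max_len: int) -> np.ndarray:
    """Pad a (seq_len, d_model) matrix with zero rows up to (max_len, d_model)."""
    seq_len, d_model = vectors.shape
    padded = np.zeros((max_len, d_model))
    padded[:seq_len] = vectors
    return padded

sentence = np.random.default_rng(0).normal(size=(3, 512))  # 我 爱 你
print(pad_sentence(sentence, max_len=4).shape)             # (4, 512)
```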

Next, the initial vector of each word is multiplied by three randomly initialized matrices WQ, Wk, Wv to obtain three quantities Qx, Kx, Vx. The figure below uses the word "I" as an example.

[Figure: obtaining Qx, Kx, Vx for the word "I"]

Then, compute the attention value of each word. For example, for the word "I", take its Q value and multiply it by the K value of every word in the sentence; the mathematical meaning of this product is to measure the similarity of the two vectors. Then, through a SoftMax conversion (you don't have to worry about how it is computed), obtain the weight between "I" and each word; the weights must sum to 1. Each weight is then multiplied by the corresponding V value, and all the products are added together to get the attention value.

[Figure: computing the attention value for the word "I"]

This attention value carries, in addition to the word "I"'s own information and position information, the correlation information between "I" and every other word in the sentence.
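
A step-by-step worked version of this computation for one word, with tiny made-up dimensions (3 words, 4-dimensional vectors) rather than the real 512, purely for illustration; W_q, W_k, W_v play the role of WQ, Wk, Wv in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 4
X = rng.normal(size=(3, d_model))                 # vectors for 我, 爱, 你
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v               # project every word

q_i = Q[0]                                        # the word "I" (我)
scores = K @ q_i / np.sqrt(d_k)                   # similarity with every word
weights = np.exp(scores) / np.exp(scores).sum()   # SoftMax: weights sum to 1
attention_i = weights @ V                         # weighted sum of the V values

print(round(weights.sum(), 3), attention_i.shape) # 1.0 (4,)
```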

You may notice that in all of this attention computation, the only unknowns are the initial matrices WQ, Wk, and Wv (these three matrices are shared by all words). We can therefore simplify the Transformer into an equation relating the input, the output, and these W matrices, where X is the input text and Y is the translation:

Y = f(X; WQ, Wk, Wv)

Here we need some machine learning basics: the Transformer is essentially a feed-forward neural network model. Ignoring the complex hidden layers, its basic training logic is: assume Y = f(x) = wx (the goal is still to obtain an f()), then randomly set a w0 and compute the cost of y = w0·x; change w0 to w1 and compute the cost of y = w1·x; and so on for many candidate w (not infinitely many, since the process converges), then pick the w with the smallest cost. That w is the f() we trained. In the Transformer, the three initial matrices play the role of this w0.
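
A minimal sketch of that training logic on the toy model y = w·x, using simple gradient descent on a squared-error cost (the data and learning rate are made up for illustration):

```python
import numpy as np

# Toy data generated by a "true" w of 3.0 (unknown to the model).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0                      # the random starting point (the "w0")
lr = 0.01                    # learning rate, a hyperparameter
for _ in range(200):
    cost = np.mean((w * x - y) ** 2)        # how wrong the current w is
    grad = np.mean(2 * (w * x - y) * x)     # direction that reduces the cost
    w -= lr * grad                          # move w toward a smaller cost
print(round(w, 3))           # converges close to 3.0
```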

Going back to the Transformer: after the attention values are computed, each word is mapped into a new high-dimensional space according to its semantic relationships. This is self-attention.

But in the Transformer, a word is not mapped into just one space; it is mapped into multiple high-dimensional spaces, which is called the multi-head attention mechanism (the paper does not give strong theoretical support for why multiple heads are needed); a small sketch follows below.

[Figure: the multi-head attention mechanism]

The main reason given is simply that it works well in training. This is also a feature of AI research papers: researchers often find directions with strong research intuition and sensitivity that do prove effective in tests, but they may not be able to give a complete theoretical justification. This often leaves follow-up researchers room for further improvement.

As it turns out, improving the efficiency of Attention(Q, K, V) has been the fastest-iterating part of the Transformer field. Later, the BERT algorithm proposed a pre-training mechanism and became mainstream; it will be introduced further below.

Of course, in hindsight we can understand it this way: the logical relationships within a sentence are placed into different high-dimensional spaces for training, with the goal of capturing more information. This part can give researchers a deeper appreciation of how such spaces are used.
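
A minimal sketch of multi-head attention, in the same spirit as the self-attention sketch above: each head gets its own W matrices and the head outputs are concatenated (the head count and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 512, 8
d_head = d_model // n_heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X):
    heads = []
    for _ in range(n_heads):                       # each head has its own space
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(one_head(X, W_q, W_k, W_v))
    return np.concatenate(heads, axis=-1)          # back to d_model columns

X = rng.normal(size=(4, d_model))                  # "Smart John is singing"
print(multi_head(X).shape)                         # (4, 512)
```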

Besides the above, there are other technical points such as the mask mechanism, layer normalization, and controlling the saturation region of the activation functions, which are not covered one by one here due to space and technical detail.

If you understand the multi-head self-attention mechanism, you have basically mastered 85% of the important content of this paper, and you now have a more intuitive grasp of the Transformer model, whose influence is still rapidly expanding.

insights and takeaways

From the perspective of theoretical research progress

1. The Transformer broke the logic of sequential computation and quickly spread beyond NLP; multiple previously independent AI fields began to converge technically. Looking deeper, **the Transformer's ability to break away from sequential processing matters because the parallel-computing mode makes more complex computation cost-effective. Further improvements in computing power will surely bring convergence across AI's subfields, and more infrastructure-level models and algorithms will continue to be launched.** Within AI, the professional boundaries between image and NLP, and between perception and cognition, will gradually blur.

2. AI research does have an experimental character. Beyond the core idea, there are indeed many technical choices that have been shown to work but still leave plenty of room for improvement. It is foreseeable that micro-innovations around the Transformer will keep accelerating and flourishing.

3. "Attention is all your need" is well-known in the industry, but if you take a closer look, you will find that a lot of content is also borrowed. For example, the most important Attention (Q, K, V) in Query, Key, Value is an Internet recommendation system The standard methodology; the entire Transformer algorithm is also a large neural network, and the algorithm is iteratively developed step by step on the basis of the predecessors, but the iteration speed is obviously accelerating.

From the perspective of theory, algorithm, architecture and engineering

4. The field of AI algorithm research is experiencing a flywheel of growth in algorithms, open source code, engineering, and computing power.

[Figure: the flywheel of algorithms, open-source code, engineering, and computing power]

The figure below shows the proportion of papers at top venues that release open-source code; this share has grown faster in recent years. The research process and the engineering process increasingly intersect, and the open-source community and open-source culture are themselves driving the rapid development of algorithms and engineering.

[Figure: the share of top-venue papers released with open-source code, growing in recent years]

More people participate, and people from more fields get involved. As the costs of computing power, AI infrastructure and code, and open knowledge sharing gradually fall, the boundary between research and engineering also blurs. It is like football: by the law of large numbers, as the football-playing population grows, the probability that a talent like Messi appears also grows.

From the perspective of data and subsequent development

5. ChatGPT's success comes from training on a large amount of data, but beyond simple dialogue or translation, large-scale answers and even paper-level answers still lack sample data (the sample data required for training must have a clear X and Y). Moreover, the Transformer needs more data than other algorithms, because it has to randomly initialize the three matrices and optimize them step by step. Besides the Transformer, another technique, BERT, is also a phenomenon-level algorithm that matters greatly for the technology's development. Its core is a simplified Transformer: BERT does not translate from A to B; it randomly masks some words or sentences in X and lets the algorithm learn to predict the masked parts. This line of thinking makes BERT the best partner for Transformer pre-training.

[Figure: BERT-style masked pre-training]

If pre-training is done with BERT, it is equivalent to giving the matrices prior knowledge (the earlier training logic gave the machine no hints and no basic knowledge of the rules), which improves the accuracy of the initial matrices for formal training and greatly improves the subsequent Transformer's computational efficiency and data requirements. In practical terms: if I previously wanted to train on the books of the National Library of China, I needed, for every book, labeled information such as an explanation of the book or the English edition corresponding to the Chinese one. Now we can first train on a large amount of unlabeled content, and then only fine-tune with labeled sample data through the Transformer. This gives ChatGPT a lot of room for improvement, and it is foreseeable that more such large models will spring up quickly.
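
A minimal sketch of the masked-prediction idea described above: randomly replace some tokens with a [MASK] symbol and ask the model to recover them. This shows only the data-preparation half; the prediction model itself is omitted, and the tokenization and mask rate are illustrative:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Return (masked_input, targets): the model must predict the hidden words."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok          # what the model should recover at position i
        else:
            masked.append(tok)
    return masked, targets

sentence = "smart john is singing in the rain".split()
masked, targets = mask_tokens(sentence)
print(masked)    # the sentence with some words replaced by [MASK]
print(targets)   # the positions and original words the model must predict
```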

6. Since the Transformer is a more advanced neural network deep learning algorithm, it has high data requirements, which in turn has spawned algorithms for quickly generating big data from small data, such as GANs (generative adversarial networks). These are core technologies in the AIGC field. To solve the problem of insufficient data, besides abstracting information from small data more efficiently, there are more and more methods that expand small data into big data, and these methods are maturing quickly.

7. We also find that machine learning algorithms have a large number of hyperparameters. In the Transformer, for example: how many heads N the multi-head mechanism should use, whether a word becomes a 512-dimensional vector or something larger, and the learning rate, all need to be set before training. Because training takes a long time and the parameters are complex, exploring for better results takes a very long time. This gave birth to AutoML. Taking the Transformer as an example, automated machine learning follows several routes, such as Bayesian optimization (finding the probability of better parameter configurations), reinforcement-learning ideas (greedy algorithms approaching the optimum quickly in an uncertain environment), and searching for new network architectures (combining Transformer, RNN, MLP, and so on); see the sketch below.
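
A minimal sketch of the simplest such route, random hyperparameter search: sample configurations, train-and-evaluate each, and keep the best. The evaluate function here is a stand-in for a real train/evaluate run, and the search space is made up for illustration:

```python
import random

search_space = {
    "n_heads":       [4, 8, 16],
    "d_model":       [256, 512, 1024],
    "learning_rate": [1e-3, 1e-4, 1e-5],
}

def evaluate(config):
    """Stand-in for training a model with this config and returning its score."""
    return random.random()   # a real system would return validation accuracy

random.seed(0)
best_config, best_score = None, float("-inf")
for _ in range(10):                                   # 10 random trials
    config = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score
print(best_config, round(best_score, 3))
```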

Research development emphasizes parameterization, while industrial development emphasizes automation. The two seem unified, but in actual practice they are often quite painful and contradictory. This is also an important arena for the balance between productization and research fluidity mentioned at the beginning.

Source: blog.csdn.net/sinat_36458870/article/details/129659344