Overview of Artificial Intelligence Principles - The Story Behind ChatGPT

Hello everyone, I am Bittao. If there is one thing that defined 2023, it is undoubtedly the AI wave led by ChatGPT. This year, whether in the media, in the projects we touch at work, or in the hot topics people discuss in daily life, AI has been everywhere. In fact, AI has been popular in the Internet industry ever since the advent of deep learning. But because the most widely monetized application of AI has been the recommendation algorithm, the general public had grown a little numb to the word. Then ChatGPT launched in November 2022 and broke into the mainstream within just two months, reaching 100 million monthly active users and becoming one of the most talked-about products in the world. Some say this is the singularity of AI technology and that AI will soon replace more jobs; others say it only talks in cliches and is merely a smarter chatbot. Either way, it is undeniable that artificial intelligence marks the beginning of the next technological revolution. AI will not eliminate humans; AI will only eliminate humans who cannot use AI.

1. History of artificial intelligence

Although AI has only recently entered the public eye, the related theories took shape as early as the last century.

  • In the 1940s, cybernetics emerged as the interdisciplinary study of regulatory systems: the structure, limitations and development of control systems, and how people, animals and machines control and communicate with one another.
  • In 1943, the American neuroscientists McCulloch and Pitts proposed the first mathematical model of the neuron, known as the MP model.
  • In 1950, with the development of computer science, neuroscience, and mathematics, Turing published a paper far ahead of its time, proposing a very philosophical test, the Imitation Game, also known as the Turing test. The general idea: if, over the course of a conversation between a human and a machine, the human cannot tell that the other party is a machine, the machine passes the Turing test.
  • In 1956, Marvin Minsky, John McCarthy, and Claude Shannon (the founder of information theory) held a conference: the Dartmouth Conference. The central question was whether machines can think like humans, and the term "artificial intelligence" (AI) was coined there.
  • In 1966, MIT built the chatbot ELIZA, an early system based on rule-driven pattern matching.
  • In 1997, IBM's Deep Blue defeated the world chess champion. Around this period, the three researchers later known as the godfathers of deep learning made their signature contributions: Geoffrey Hinton of the University of Toronto introduced the backpropagation (BP) algorithm into artificial intelligence; Yann LeCun of New York University is best known for the convolutional neural network (CNN); and Yoshua Bengio of the University of Montreal is known for the neural probabilistic language model and for work related to generative adversarial networks.
  • Around 2010, artificial neural networks, a branch of machine learning, began to shine.

2. Machine Learning

The common task of machine learning is to automatically discover the laws behind data through training algorithms, continuously improve the model, and then make predictions. There are many algorithms in machine learning, and one of the most classic is gradient descent. It can help us handle classification and regression problems: by fitting the linear formula y = wx + b, the prediction is pushed closer and closer to the correct value.

2.1 Prediction function

Suppose we have a set of sample points, each representing a pair of related variables: for example, the area and price of a house, or a person's height and stride length. Common sense tells us these are roughly proportional. The gradient descent algorithm starts with a modest goal, a prediction function that is a straight line through the origin, y = wx. Our task is to design an algorithm so that the machine can fit the data and work out the line's parameter w for us.
A simple way is to pick a random line through the origin and compute the deviation of every sample point from it, then adjust the slope w according to the size of the error. As the parameter is adjusted and the loss function shrinks, the prediction becomes more accurate. Here y = wx is the so-called prediction function.

2.2 Cost function

Finding the error means computing the cost function, which quantifies how far the data deviates from the prediction. The most common choice is the mean squared error (the average of the squared errors). If we call the cost e, then because it is a sum of squares in w, its graph is the parabola shown on the right side of the figure below. When e is at its lowest point, the error in the left figure is smallest, that is, the fit is most accurate.
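To make this concrete, here is a minimal sketch of the mean squared error as a function of the slope w, using assumed sample data:

```python
# A minimal sketch (with assumed sample data) of the mean squared error cost:
# for a candidate slope w, cost(w) is the average squared deviation between
# the predictions w * x and the observed values y.
import numpy as np

x = np.array([50.0, 80.0, 100.0, 120.0])    # e.g. house area (assumed sample data)
y = np.array([150.0, 245.0, 300.0, 365.0])  # e.g. house price (assumed sample data)

def cost(w):
    predictions = w * x           # prediction function y = wx
    errors = predictions - y      # deviation of each sample point
    return np.mean(errors ** 2)   # mean squared error

# The cost traces a parabola in w; it is smallest near the best-fitting slope.
print(cost(2.5), cost(3.0), cost(3.5))
```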

2.3 Gradient calculation

The goal of machine learning is to fit the straight line closest to the training data distribution, that is, to find the parameters that minimize the error cost, which corresponds to the lowest point of the cost function. This process of descending to the lowest point is called **gradient descent**.
Training this parameter with the gradient descent algorithm is very similar to the human process of learning and cognition: Piaget's theory of cognitive development, with its notions of assimilation and accommodation, closely mirrors how machine learning proceeds.
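A minimal sketch of that descent on the same kind of toy data, with an assumed learning rate:

```python
# A minimal sketch of gradient descent on toy data: the derivative of the mean
# squared error with respect to w tells us which way to adjust the slope.
import numpy as np

x = np.array([50.0, 80.0, 100.0, 120.0])
y = np.array([150.0, 245.0, 300.0, 365.0])

w = 0.0                 # start from an arbitrary slope
learning_rate = 1e-5    # assumed step size; too large and the descent diverges

for step in range(1000):
    errors = w * x - y                  # prediction minus target
    gradient = 2 * np.mean(errors * x)  # d/dw of mean((wx - y)^2)
    w -= learning_rate * gradient       # walk downhill on the cost curve

print(w)  # w converges toward the slope that best fits the samples
```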

3. Deep Learning

There was a lot of controversy in the early days about whether AI algorithms should be implemented in a brain-like way. Before deep learning took off, most computer scientists devoted themselves to research directions closer to pattern matching. In hindsight, that approach could hardly make machines as intelligent as humans, but we cannot judge the people of that time from today's perspective: data and computing power were both scarce, so naturally there was a whole set of theories refuting the idea of building intelligence on a brain-like model.
How could a computer possibly work on the same principles as a human brain? The prevailing view was that problems still had to be solved with traditional algorithms. This indirectly contributed to the stagnation of AI at the time, and for the Ph.D. students working in that direction, reality was cruel. Hence the saying: hard work matters, but so does choosing the right direction.
Back in 1943, neuroscientists were exploring the operating principles of the human brain. In the human brain, more than 10 billion neurons are connected in a network to judge and transmit information.
Each neuron has multiple inputs and a single output: it receives signals from multiple upstream neurons, processes them together, and, if necessary, passes a signal downstream. The output takes only two values, 0 or 1, very much like a computer. On this basis they proposed the MP model.
An artificial neural network is a mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed, parallel information processing. Deep learning is a family of algorithms that use artificial neural networks as the framework to perform representation learning on data.
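As a tiny illustration, an MP neuron can be sketched in a few lines; the weights and threshold below are assumed purely for the example:

```python
# A minimal sketch of an MP (McCulloch-Pitts) neuron: it sums weighted inputs and
# fires (outputs 1) only if the sum reaches a threshold. The weights and threshold
# here are assumed for illustration.
def mp_neuron(inputs, weights, threshold):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# With both weights set to 1 and threshold 2, the neuron behaves like a logical AND.
print(mp_neuron([1, 1], [1, 1], threshold=2))  # 1
print(mp_neuron([1, 0], [1, 1], threshold=2))  # 0
```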

3.1 Neural network

As shown in the figure below, each circle is a neuron, and these circles together form a neural network. Give the network enough data, tell it how well it is doing, and keep training it, and it will do better and better, eventually completing complex tasks such as recognizing images.
In fact, what a neuron computes is just a pile of additions and multiplications, but because there are so many of them the whole becomes very complex. A neuron may have multiple inputs and only one output, yet that single output can go on to activate multiple downstream neurons. The figure below shows one common activation function, the Sigmoid, whose output range is (0, 1).
If the task is only to judge whether an image is an X, one layer is enough; but in practice we want to understand speech and recognize images, so people built networks with multiple layers of neurons. As shown in the figure, there is one input layer, whose terminals connect to every neuron of the first hidden layer. The first hidden layer passes its outputs to the second hidden layer, and the second hidden layer's outputs enter the third. This is called a multilayer neural network. Between every two layers sits a large number of parameters, and we adjust all of them toward the optimum so that the final error function is minimized.
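A minimal sketch of a forward pass through such a multilayer network, with random weights standing in for the parameters that training would adjust:

```python
# A minimal sketch of a forward pass through a small multilayer network with
# sigmoid activations. The weights are random purely for illustration; training
# would adjust them to minimize the error function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes any value into the range (0, 1)

rng = np.random.default_rng(0)
x = rng.random(25)                  # e.g. a flattened 5x5 image as the input layer

W1 = rng.standard_normal((16, 25))  # input layer -> first hidden layer
W2 = rng.standard_normal((16, 16))  # first hidden layer -> second hidden layer
W3 = rng.standard_normal((1, 16))   # second hidden layer -> output neuron

h1 = sigmoid(W1 @ x)                # each layer: weighted sums, then activation
h2 = sigmoid(W2 @ h1)
y = sigmoid(W3 @ h2)

print(y)  # a value in (0, 1), e.g. the score for "this image is an X"
```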
Although the operations performed by a single neuron are not complicated, once the scene gets complex the numbers become very large. For example, a 5×5 image gives 25 neurons per layer, 625 weights between two layers, and over 2,000 across three layers; a color picture is even more complicated to recognize, and the computation becomes very slow. This is also why artificial intelligence was underestimated several times in the past: neither the computing power nor the algorithms could keep up. Later the backpropagation (BP) algorithm appeared: the last layer is adjusted first, and the adjustment then works its way back toward the earlier layers. The complexity of this algorithm is lower than what came before. BP mainly solves the problem of computing and distributing the error loss as information passes through the many layers of a neural network, and it led the third wave of artificial intelligence.

3.2 CNN

Here we take one of the classic neural network algorithms, the convolutional neural network (CNN), as an example. The process is similar to how an animal brain recognizes things: when an image reaches the brain, recognition goes from points to lines to shapes and finally to what the object is. Computers do the same, recognizing images through pixels, then edge directions, contours, details, and finally a judgment.
For example, suppose we want to identify whether a picture is the character X. To the computer, that picture is a two-dimensional array, say with black as 1 and white as 0, as shown below:
Once the picture is given to the computer, a training process can find a large set of parameters for judging whether it is an X. Finding the parameters with the smallest loss means the training has succeeded; from then on, those parameters can be used to judge whether any picture is an X.
Specifically, we use convolution kernels to extract features from the image by performing convolution operations. For example, one kernel might be a slanted stroke, which we consider one of the characteristic features of the letter X.
The convolution kernel (the slanted stroke) is slid over the picture, the operation is performed, and each result is placed at the center of the region the kernel covers. The combined results form the feature map. The larger the computed value, the more strongly that region expresses the feature.
Because the amount of computation would otherwise be too large, we use the convolution kernel to scan the picture region by region: multiply each pair of corresponding numbers and sum them up, so the numerical feature of each region is extracted. The data is then pooled: the maximum value in each area is taken, which concentrates the feature data, and the result is flattened and fed into a fully connected neural network. Because convolution operations are involved, the whole thing is called a convolutional neural network. The size of the convolution kernel, its stride, and the number of convolutional layers can all be set in advance. The value the machine outputs is compared with the preset target value; if it meets expectations, training has succeeded. If not, a series of calculations reversely adjusts the parameters of each stage (BP), the computation runs again, and this repeats until the output meets expectations. This is the principle of machine learning: convolution -> pooling -> activation.
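A minimal sketch of that convolution -> pooling -> activation pipeline on a tiny assumed 5×5 image, with a diagonal-stroke kernel chosen purely for illustration:

```python
# A minimal sketch of the convolution -> pooling -> activation pipeline on a tiny
# binary image. The 3x3 kernel (a diagonal stroke) and the image are assumed here
# purely for illustration.
import numpy as np

image = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=float)                  # a rough 5x5 "X"

kernel = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)                  # a diagonal-stroke feature

def convolve(img, k):
    h, w = img.shape[0] - 2, img.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):           # slide the kernel region by region
        for j in range(w):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * k)
    return out

feature_map = convolve(image, kernel)      # convolution
pooled = feature_map.max()                 # (very coarse) max pooling over the map
activated = 1.0 / (1.0 + np.exp(-pooled))  # sigmoid activation

print(feature_map)
print(pooled, activated)  # large values mean the diagonal feature is present
```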
Looking at the feature data after convolution, we can see that the closer a number is to 1, the better that region matches the feature of the convolution kernel.
The convolution kernels may be set by hand at the beginning, but later they are adjusted in reverse according to the data itself, much like the other parameters, so the most suitable kernels are found during training. Each convolution kernel produces one feature map, and stacking these feature maps together yields a three-dimensional volume.
The design is ingenious, coming remarkably close to simulating the human process of perception.
We give artificial intelligence a large amount of data, and it adjusts its convolution kernels and parameters on its own until it can finally distinguish different objects, even though we do not know exactly how it arrived at those kernels and parameters.

3.3 Model = black box

We now know that by continuously training a neural network we can make the recognition error smaller and smaller, until we obtain an intelligent model that can do practical work. But although we train the model ourselves, we do not actually know how it makes each specific decision; to us it is still a black box. It is a bit like Newton and the falling apple: he did not explain why the apple falls, but he established a mathematical model of gravity and expressed it quantitatively, while the underlying reason remained hard to put into words. The same is true of models trained by artificial intelligence. The features we see differ from the features the machine uses, both in number and in content: we may judge an object by 4 features while the computer uses 10, and the representations in our brains are hard to equate with the computer's 0s and 1s. Remember that a neural network adjusts and optimizes itself during training, so at the end it is difficult to say exactly how it did it. It is like teaching a child to tell cats from dogs: show the child enough cats and dogs and it will eventually tell them apart, but how exactly the child distinguishes them is hard to explain. This is why people say AI models are black boxes.

3.4 Graphics card = computing power

As mentioned above, although research on neural networks already had some foundation in the 1960s, it failed to develop because two things were missing: computing power and data. Each neuron in a neural network does not need very elaborate computation, but an enormous number of computations must run at the same time; without that capacity it is like trying to make bricks without straw. The calculations themselves are simple, all additions and multiplications, but the quantity is huge. For example, an 800 × 600 RGB picture has 800 × 600 × 3 = 1,440,000 values, and convolving it with a kernel over all three channels takes roughly 13 million multiplications plus 12 million additions. That was far beyond the CPUs of the time, and even today's CPUs are not well suited to it. This is where the GPU shows its strength. We know the GPU was built for graphics: a 4K video frame has roughly 8 million pixels, at around 30 frames per second. A CPU tops out at 64 or 128 cores, while a GPU can have tens of thousands of cores. The computation for each pixel is very simple, which is exactly what a massively parallel device like a GPU is good at. The picture below is a vivid analogy: the CPU is like a high-precision spray gun that paints exactly where it is pointed:
Thanks to its high degree of parallelism, the GPU can render the entire image almost instantly:
This is why we often hear that you need to buy a graphics card for AI: the training process needs a huge number of such parallel operations (the same reason GPUs are used for mining).
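As a rough sanity check of the operation counts quoted above, assuming a 3×3 kernel per channel (the kernel size is not stated in the text):

```python
# A back-of-the-envelope count of the operations in a single convolution pass over
# an 800 x 600 RGB image, assuming a 3x3 kernel per channel (the kernel size is an
# assumption for illustration).
height, width, channels = 600, 800, 3
kernel = 3 * 3

values = height * width * channels             # 1,440,000 numbers in the image
multiplications = values * kernel              # one multiply per kernel weight per position
additions = values * (kernel - 1)              # summing 9 products takes 8 additions

print(f"{values:,} values")                    # 1,440,000
print(f"{multiplications:,} multiplications")  # 12,960,000  (~13 million)
print(f"{additions:,} additions")              # 11,520,000  (~12 million)
```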
At present, AI training is essentially monopolized by Nvidia graphics cards, because Nvidia laid out its strategy very early. As far back as 2006, Nvidia launched CUDA, which made GPUs programmable. Previously, using a graphics card designed for 3D graphics for general computation would have required a large team of top engineers; with the CUDA libraries it became feasible for ordinary developers. Nvidia thus expanded its graphics cards from games and 3D image processing into the entire field of accelerated computing: aerospace, biopharmaceuticals, weather forecasting, energy exploration and so on. When deep learning matured around 2012, it naturally landed on Nvidia's platform. As a result, doing AI training has come to mean buying graphics cards, and buying graphics cards means buying Nvidia.

4. Principle of ChatGPT

Presumably everyone has used ChatGPT directly or indirectly by now. It is completely different from the Siri or Xiao AI assistants we are used to: chatting with those often feels less like artificial intelligence than artificial stupidity, while a conversation with ChatGPT can actually solve practical problems, for example analyzing the key technical points of an unfamiliar field, writing algorithm solutions, or finding bugs. So why did ChatGPT become so smart, and what technology lies behind it? Let's explore that below.

4.1 LLM

A language model is a natural language processing technique based on statistics and machine learning that is used to evaluate and predict the probability distribution of a given sequence, usually a sequence of words or characters. The main applications of language models are tasks such as text generation, machine translation, and speech recognition. In recent years, neural-network language models have reached hundreds of billions of parameters; to distinguish them from traditional language models, people have taken to calling them large language models (LLMs).

In machine learning, the recurrent neural network (RNN) was traditionally used to process text. It has to read word by word, cannot process many words at the same time, and sentences cannot be too long, or the beginning is forgotten by the time the end is reached.
Then in 2017, Google published a paper proposing a new learning architecture called the Transformer. It lets the machine learn from a large number of words at the same time, like the difference between serial and parallel processing. Many NLP models are now based on the Transformer; the T in Google's BERT and the T in ChatGPT both refer to it.
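As an illustration of why the Transformer can look at all positions at once, here is a minimal sketch of its core operation, scaled dot-product attention, with toy matrices standing in for learned projections:

```python
# A minimal sketch of scaled dot-product attention, the core Transformer operation
# that processes all positions of a sequence in parallel. The toy Q/K/V matrices
# stand in for learned projections of the input embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8              # 4 tokens, 8-dimensional representations
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))

scores = Q @ K.T / np.sqrt(d_model)  # every token attends to every other token at once
weights = softmax(scores, axis=-1)   # attention weights, one row per token
output = weights @ V                 # weighted mix of all positions

print(weights.shape, output.shape)   # (4, 4) (4, 8)
```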
Building on the Transformer, OpenAI published a paper in 2018 introducing a new language model, the Generative Pre-trained Transformer, or GPT. Large language models (LLMs) generate human-like text by predicting the likelihood of each word based on the words that came before it.
Earlier language-learning models basically required human supervision or manually designed labels. GPT needs very little of that: feed it a pile of data and, after a while, it has learned from it. Such a large language model depends mainly on the algorithm and on the number of parameters. With the same data it can learn faster than anything before, and that number of parameters demands an enormous amount of computation; to put it bluntly, it burns money. After GPT-3, reinforcement learning from human feedback was added, and every word it produces is computed from the relevance and context of the preceding text.

4.2 Generation process

We know that the core of ChatGPT is the LLM, the large language model. A large language model is a neural-network-based model trained on large amounts of text data to understand and generate human language. The model uses the training data to learn statistical patterns and relationships between words, and then uses that knowledge to predict subsequent words, one token at a time. The largest GPT-3.5 model has 175 billion parameters spread across 96 layers, making it one of the largest deep learning models ever built.
The input and output of the model are organized as tokens, which are numeric representations of words, or more precisely of parts of words. In effect, the model judges, from the context of each word in the sentence, which next word is the most suitable output.
Numbers are used instead of words to represent tokens because numbers can be processed more efficiently. GPT-3.5 was trained on a large amount of Internet data; the original data set contains roughly 500 billion tokens. In other words, the model was trained on hundreds of billions of words.
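To get a feel for what tokens look like, here is a small sketch; it assumes the open-source tiktoken tokenizer is installed, and the exact token IDs depend on the encoding chosen:

```python
# A small sketch of turning text into tokens and back. It assumes the open-source
# tiktoken tokenizer is installed (pip install tiktoken); the printed numbers are
# illustrative and depend on the encoding used.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("ChatGPT predicts the next token.")
print(tokens)              # a list of integers, roughly one per word or word piece
print(enc.decode(tokens))  # decoding the integers recovers the original text
```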
The model is trained to predict the next token given a sequence of input tokens, and it is able to generate structured text that is syntactically correct and semantically similar to the Internet data it was trained on.
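The generation loop itself can be sketched with a toy stand-in for the model: score the candidate next tokens, append the most likely one, and repeat:

```python
# A toy sketch of autoregressive generation: given the tokens so far, the model
# scores every candidate next token, the best one is appended, and the loop repeats.
# The tiny probability table below stands in for the real neural network.
toy_model = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"down": 0.9, "up": 0.1},
}

tokens = ["the"]
for _ in range(3):
    context = tuple(tokens)
    if context not in toy_model:
        break
    next_probs = toy_model[context]                      # probability of each next token
    tokens.append(max(next_probs, key=next_probs.get))   # greedy: take the most likely

print(" ".join(tokens))  # "the cat sat down"
```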

4.3 Training process

Although after the process above ChatGPT can already compose answers on its own, without proper guidance the model can also generate untrue or harmful output.
To make the model safer and able to ask and answer in a chatbot fashion, it was further fine-tuned, and the fine-tuned model became the version used in ChatGPT today. Fine-tuning turns a model that does not yet conform to human values into a controllable ChatGPT. This fine-tuning process is called reinforcement learning from human feedback (RLHF).
OpenAI has explained how it runs RLHF on the model. Fine-tuning GPT-3.5 with RLHF is like improving a chef's skills so the dishes taste better.
Initially, the chef is trained on a large dataset of recipes and cooking techniques. Sometimes, however, the chef does not know how to make a dish a particular customer has requested. To help with this, real user feedback is collected to create a new dataset. The first step is to build a comparison dataset: the chef prepares multiple dishes for a given request, and people rank them by taste and appearance. This helps the chef understand which dishes customers prefer.
The next step is reward modeling: the chef uses this feedback to create a reward model that acts as a guide to customer preferences; the higher the reward, the better the dish. Then the model is trained with PPO (Proximal Policy Optimization). In the analogy, the chef practices making dishes while following the reward model, comparing the current dish with a slightly different version and learning which one is better according to the reward model.
This process is repeated many times, with the chef refining their skills based on the latest customer feedback; with each iteration the chef gets better at preparing dishes that match customer preferences. Put back in technical terms: GPT-3.5 is fine-tuned with RLHF by collecting human feedback, building a reward model from those preferences, and then using PPO to iteratively improve the model. This is what enables GPT-3.5 to generate better responses to specific user requests.
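Stripping away the kitchen analogy, the three stages can be written as a structural sketch; every function and object called here (collect_human_ranking, train_reward_model, ppo_update, the model objects) is a hypothetical placeholder for a large training pipeline, not a real API:

```python
# A structural sketch of the three RLHF stages described above. Every function and
# object used here is a hypothetical placeholder, not a real library API; the point
# is only to show how the stages fit together.
def rlhf(pretrained_model, prompts, num_iterations=3):
    # Stage 1: comparison data - humans rank several candidate answers per prompt.
    rankings = []
    for prompt in prompts:
        candidates = [pretrained_model.generate(prompt) for _ in range(4)]
        rankings.append(collect_human_ranking(prompt, candidates))

    # Stage 2: reward modeling - learn a scalar score that reproduces those rankings.
    reward_model = train_reward_model(rankings)

    # Stage 3: PPO - generate answers, score them with the reward model, and nudge
    # the policy (the language model) toward higher-reward answers, repeatedly.
    policy = pretrained_model
    for _ in range(num_iterations):
        answers = [policy.generate(p) for p in prompts]
        rewards = [reward_model.score(p, a) for p, a in zip(prompts, answers)]
        policy = ppo_update(policy, prompts, answers, rewards)
    return policy
```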

4.4 Prompt

After the trainers have taught GPT, we can finally use ChatGPT. However, because the large language model behind GPT is so complex, expressing our needs precisely becomes very important. In other words, to have a better dialogue with the AI you need the "language" of prompts. There are now many tutorials online teaching how to use prompts to communicate with AI more efficiently.
The figure below shows the structure of a prompt. In general, the more precise the description, the more accurate the answer ChatGPT gives you.
Conceptually, prompting is as simple as feeding an input to the model and getting an output back. In reality it is more complicated. First, ChatGPT understands the context of the chat: the ChatGPT UI feeds the entire conversation back to the model every time a new prompt is entered.
This is called conversational prompt injection, and it is how ChatGPT stays context-aware.
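A sketch of what that looks like in code, assuming the official openai Python SDK and a configured API key; the model name and messages are only examples:

```python
# A minimal sketch of how a chat UI keeps the model context-aware: the whole
# conversation so far is resent with every new prompt. Assumes the official openai
# Python SDK (pip install openai) and an API key in the OPENAI_API_KEY variable.
from openai import OpenAI

client = OpenAI()

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=history,  # the full conversation, not just the latest prompt
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("What is a convolution kernel?"))
print(chat("Give me an example of one."))  # "one" resolves because the history is resent
```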

Second, ChatGPT includes implicit prompt content: instructions injected before and after the user's prompt to guide the model toward a conversational tone. These prompts are invisible to the user; for example, the tone and language of your input are analyzed in advance.
Third, prompts are passed to a moderation API to warn about or block certain types of unsafe content. Note: if your prompt is crafted carefully enough, you can in practice still get the model to output some special content.
The generated results may also be passed through the moderation API before being returned to the user.
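A corresponding sketch of a moderation check, again assuming the official openai Python SDK:

```python
# A minimal sketch of checking text with the moderation endpoint, assuming the
# official openai Python SDK; the field names follow its documented response shape.
from openai import OpenAI

client = OpenAI()

result = client.moderations.create(input="some user-generated text to screen")
flagged = result.results[0].flagged  # True if the content is classified as unsafe
print(flagged)
```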
Creating the models behind ChatGPT required an enormous amount of engineering, and the technology is still evolving, opening the door to new possibilities and reshaping the way we communicate. ChatGPT is changing the way software developers work, showing how it can enhance daily tasks and increase efficiency. To avoid falling behind, we should understand how to harness the power of ChatGPT and stay ahead in the fast-moving world of software development.

5. Summary

There have been several industrial revolutions in history, and each was built on scientific breakthroughs and the development of foundational technologies. In the first industrial revolution, breakthroughs in Newtonian classical mechanics and thermodynamics in the 18th century allowed Watt to improve the steam engine and lead mankind into the age of steam, making Britain the empire on which the sun never set. At the end of the 19th century and the beginning of the 20th, Faraday discovered electromagnetic induction and Maxwell explained electromagnetic waves; humans invented generators, electric motors and radio communication. This was the second industrial revolution, which made the United States the world's leading power. In the middle of the 20th century, the development of electronics and computing rapidly brought mankind into the electronic age, the third industrial revolution; seizing this opportunity, Japan emerged from the shadow of war to become one of the most developed countries in the world. China missed the first three industrial revolutions, and the world is now in the middle of a fourth, represented by the wireless Internet, artificial intelligence, new energy and biotechnology. This time China is not absent: whether in 5G, artificial intelligence, new energy or biotechnology, it has taken Chinese scientists and engineers more than 20 years to catch up, and in many new fields of science and technology they now stand at the forefront of the world.
