What exactly is ChatGPT doing? Why can it do this? (1)

That ChatGPT can automatically generate something that reads like human-written text is remarkable and unexpected. So how does it do it? And why does it work? Here I want to give a rough outline of what goes on inside ChatGPT, and then to explore why it can do so well at producing text we consider meaningful. I'll focus on the big picture, and while I'll mention some engineering details, I won't go deep into them. (And the substance of what I say below applies just as well to other current "large language models" (LLMs) as it does to ChatGPT.)

The first thing to explain is that what ChatGPT is fundamentally always trying to do is to produce a "reasonable continuation" of whatever text it's got so far, where "reasonable" means "what one might expect someone to write after seeing what people have written on billions of web pages".

So suppose we've got the text "The best thing about AI is its ability to". Imagine scanning billions of pages of human-written text (say on the web and in digitized books), finding all instances of this text, and seeing what word comes next what fraction of the time. ChatGPT effectively does something like this, except that (as I'll explain) it doesn't look at literal text; it looks for things that in a certain sense "match in meaning". But the end result is that it produces a ranked list of words that might come next, together with "probabilities".

Note that when ChatGPT does something like write an essay, what it's essentially doing is just asking over and over again "given the text so far, what should the next word be?", and each time adding one word. (More precisely, as I'll also explain below, it's adding a "token", which may be just part of a word; this is why it can sometimes "make up new words".)

At each step, then, it gets a list of words with probabilities. But which one should it pick to add to the essay (or whatever) it's writing? One might think it should be the "top" word (i.e. the one assigned the highest "probability"). But this is where a bit of magic starts to creep in. Because for some reason -- maybe one day we'll have a more scientific understanding of it -- if we always pick the highest-ranked word, we typically end up with a very "flat" essay that never seems to "show any creativity" (and sometimes even repeats itself word for word). But if we sometimes (at random) pick lower-ranked words, we get a "more interesting" essay.

This randomness means that if we use the same prompt several times, we're likely to get a different essay each time. And, in keeping with the spirit of this magic, there's a particular so-called "temperature" parameter that determines how often lower-ranked words get used. For essay generation, it turns out that a "temperature" of 0.8 seems to work best. (It's worth emphasizing that no "theory" is being used here; it's just what has been found to work in practice. The concept of "temperature" exists because exponential distributions familiar from statistical physics happen to be used, but there's no "physical" connection -- at least so far as we know.)
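To make the idea concrete, here is a minimal Python sketch of temperature-based sampling over a made-up list of next-word probabilities (the words and numbers are invented for illustration; the real system works over a large vocabulary of tokens):

```python
import numpy as np

# Hypothetical next-word probabilities (invented for illustration)
words = ["learn", "predict", "make", "understand", "do"]
probs = np.array([0.29, 0.18, 0.16, 0.12, 0.05])
probs = probs / probs.sum()  # normalize so the probabilities sum to 1

def sample_next_word(words, probs, temperature=0.8):
    """Re-weight probabilities by a 'temperature', then sample one word.

    temperature -> 0 approaches always picking the top word;
    temperature = 1 samples from the original distribution;
    larger temperatures flatten the distribution further.
    """
    scaled = np.log(probs) / temperature
    p = np.exp(scaled - scaled.max())
    p = p / p.sum()
    return np.random.choice(words, p=p)

print([sample_next_word(words, probs) for _ in range(5)])
```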

Before going any further, I should explain that for purposes of exposition I'm mostly not going to use the full system that's in ChatGPT; instead, I'll usually use the simpler GPT-2 system (https://resources.wolframcloud.com/NeuralNetRepository/resources/GPT2-Transformer-Trained-on-WebText-Data/), which has the nice feature of being small enough to run on a standard desktop computer. And so essentially everything I show can be presented as explicit Wolfram Language code that you could immediately run on your own computer.

For example, the following shows how to obtain the above probability table. First, we must retrieve the underlying "language model" neural net:

Later, we'll look inside this neural net and talk about how it works. But for now we can just apply the "net model" to our text as a black box, and ask for the five words that the model thinks are most likely to follow, together with their probabilities:

The result becomes an explicitly formatted "dataset":

Here's what happens if one repeatedly "applies the model" -- at each step adding the word that has the highest probability (specified in this code as the model's "decision"):
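(The original shows these steps as Wolfram Language code and output, which isn't reproduced here. As a rough stand-in, here is a sketch of the same two steps -- listing the top next-token candidates and then greedily appending the most probable token -- using the Hugging Face transformers implementation of GPT-2; this is my own illustrative substitute, not the author's code.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The best thing about AI is its ability to"
ids = tokenizer(prompt, return_tensors="pt").input_ids

# Top 5 candidates for the next token, with their probabilities
with torch.no_grad():
    logits = model(ids).logits[0, -1]
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(i))!r}: {p.item():.3f}")

# "Zero temperature": repeatedly append the single most probable token
for _ in range(30):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    next_id = torch.argmax(logits).reshape(1, 1)
    ids = torch.cat([ids, next_id], dim=1)
print(tokenizer.decode(ids[0]))
```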

What happens if one keeps going? In this ("zero temperature") case, what comes out soon becomes rather confused and repetitive:

But what if, instead of always picking the "top" word, we sometimes randomly pick lower-ranked words (with the "randomness" corresponding to a "temperature" of 0.8)? Then we can build up text like this:

And each time this is done, different random choices will be made and the text will be different -- as in these 5 examples:

It's worth pointing out that even at the first step there are a lot of possible "next words" to choose from (at temperature 0.8), though their probabilities fall off quite rapidly (the straight line on this log-log plot corresponds to an n^(-1) "power-law" decay that is very characteristic of the general statistics of language, https://www.wolframscience.com/nks/notes-8-8--zipfs-law/):
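(A quick way to see this kind of power-law behavior for yourself is to rank the words in a large piece of text by how often they occur; under Zipf's law, frequency falls off roughly like 1/rank. A minimal sketch, assuming you supply your own reasonably large text file -- the filename below is just a placeholder:)

```python
from collections import Counter
import re

# Any large plain-text file will do; "corpus.txt" is just a placeholder name.
with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words).most_common()
total = sum(c for _, c in counts)

# Under Zipf's law, probability * rank stays roughly constant.
for rank, (word, c) in enumerate(counts[:10], start=1):
    p = c / total
    print(f"{rank:2d}  {word:>10s}  p={p:.4f}  p*rank={p * rank:.4f}")
```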

So what happens if one keeps going? Here's a random example. It's better than using only the top word (zero temperature), but it's still a bit weird:

This was done with the simplest GPT-2 model (from 2019, https://resources.wolframcloud.com/NeuralNetRepository/resources/GPT2-Transformer-Trained-on-WebText-Data/). With the newer and bigger GPT-3 models (https://platform.openai.com/docs/model-index-for-researchers), the results are better. Here is the top-word (zero temperature) text produced with the same "prompt", but with the biggest GPT-3 model:

Here's a random example for "Temperature 0.8":

Where do probabilities come from?

ChatGPT always chooses the next word based on probability. But where do these probabilities come from? Let's start with a simpler problem. Let's consider generating English text one letter (rather than word) at a time. How can we figure out the probability of each letter?

A very simple approach is to take a sample of English text and count how often different letters appear in it. As an example, this counts the letters in the Wikipedia article on "cats":

And this does the same thing for the article on "dogs":

The results are similar, but not the same ("o" is no doubt more common in the "dogs" article because, after all, it occurs in the word "dog" itself). Still, if we take a large enough sample of English text, we can expect to eventually get at least fairly consistent results:

Here's an example of what we get if we generate a sequence of letters with these probabilities:
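Here is a small Python sketch of both steps -- counting letter frequencies in a sample of text and then generating letters at random with those probabilities (the short sample string stands in for a real corpus such as a Wikipedia article); it also treats the space as just another "letter", which is what the next paragraph describes:

```python
import random
from collections import Counter

# A short stand-in for a large sample of English text
sample = "the cat sat on the mat and the dog chased the cat around the garden"

letters = [ch for ch in sample if ch.isalpha()]
counts = Counter(letters)
total = sum(counts.values())
alphabet = list(counts)
weights = [counts[ch] / total for ch in alphabet]

# Generate letters independently at random with those probabilities
print("".join(random.choices(alphabet, weights=weights, k=60)))

# Treat the space as just another "letter" to break the output into "words"
alphabet_sp = alphabet + [" "]
weights_sp = weights + [sample.count(" ") / len(sample)]
print("".join(random.choices(alphabet_sp, weights=weights_sp, k=60)))
```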

We can break this up into "words" by adding spaces, as if they were letters with certain probabilities:

We can improve the result of generating "words" by forcing the distribution of "word lengths" to match the distribution in English:

We don't happen to get any "actual words" here, but the results look slightly better. To go further, though, we need to do more than just pick each letter separately at random. For example, we know that if we have a "q", the next letter basically has to be a "u":

Here is the probability plot for the letters themselves:

Here is a plot showing the probability of pairs of letters ("2-grams") in typical English text. The possible first letters are displayed horizontally and the second vertically:

For example, we see here that the "q" column is blank (zero probability) except in the "u" row. OK, so now, instead of generating our "words" one letter at a time, let's generate them using these "2-gram" probabilities, looking at two letters at a time. Here is a sample of the result -- which happens to include a few "actual words":
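A sketch of the 2-gram recipe in Python: estimate letter-pair counts from a sample text, then generate each new letter conditioned on the previous one (the sample string again stands in for a real corpus):

```python
import random
from collections import Counter, defaultdict

sample = ("the quick brown fox jumps over the lazy dog and then the dog "
          "chases the quick cat around the quiet garden")
letters = [ch for ch in sample if ch.isalpha()]

# Count how often each letter follows each other letter (2-gram counts)
following = defaultdict(Counter)
for a, b in zip(letters, letters[1:]):
    following[a][b] += 1

def generate_word(length=6):
    """Build a 'word' one letter at a time using the 2-gram probabilities."""
    word = random.choice(letters)
    while len(word) < length:
        options = following[word[-1]]
        if not options:
            break
        choices, weights = zip(*options.items())
        word += random.choices(choices, weights=weights)[0]
    return word

print([generate_word() for _ in range(5)])
```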

With a sufficiently large corpus of English text, we can get good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters. And if we generate "random words" with progressively longer n-gram probabilities, we find that they become progressively more likely to be "real" words:

But now suppose -- more or less as ChatGPT does -- that we're dealing with whole words, not letters. There are about 40,000 reasonably commonly used words in English (https://reference.wolfram.com/language/ref/WordList.html). By looking at a large corpus of English text (say a few million books, with altogether a few hundred billion words), we can estimate how common each word is (https://reference.wolfram.com/language/ref/WordFrequencyData.html). Using this, we can start generating "sentences" in which each word is independently picked at random, with the same probability that it appears in the corpus. Here's a sample of what we get:

Not surprisingly, this is nonsense. So how can we do better? Just as with letters, we can start taking into account not only probabilities for single words, but probabilities for pairs or longer n-grams of words. Doing this for pairs, here are 5 examples of what we get, in each case starting from the word "cat":
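The same recipe works at the level of words; here is a sketch that builds word-pair counts from a toy corpus and then generates short "sentences" starting from a chosen word (the toy corpus is invented for illustration):

```python
import random
from collections import Counter, defaultdict

# A toy corpus, invented for illustration
corpus = ("the cat sat on the mat . the cat chased the dog . "
          "the dog ran into the garden . the cat slept on the sofa").split()

# Count how often each word follows each other word
next_word = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    next_word[a][b] += 1

def generate_sentence(start="cat", length=8):
    """Pick each next word in proportion to how often it follows the current one."""
    words = [start]
    for _ in range(length - 1):
        options = next_word[words[-1]]
        if not options:
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

for _ in range(5):
    print(generate_sentence("cat"))
```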

The result "looks more reasonable". We can imagine that if we were able to use sufficiently long n-grams, we would basically "get a ChatGPT" - in the sense that what we get is the "correct overall probability of the paper" generating a paper-length sequence of words . But here's the problem: there isn't enough text in English to derive these probabilities.

In a crawl of the web there might be a few hundred billion words; in digitized books there might be another hundred billion or so. But with 40,000 common words, even the number of possible 2-grams is already 1.6 billion, and the number of possible 3-grams is 60 trillion. So there's no way we can estimate the probabilities of all of these from the text that's out there. And by the time we get to 20-word "essay fragments", the number of possibilities is larger than the number of particles in the universe, so in that sense they could never all be written down.

So what can we do? The big idea is to make a model that lets us estimate the probabilities with which sequences should occur -- even though we've never explicitly seen those sequences in the corpus of text we've looked at. And at the core of ChatGPT is precisely a so-called "large language model" (LLM) that's been built to estimate those probabilities well.

What is a model?

Say you want to know (as Galileo did back in the late 1500s, https://archive.org/details/bub_gb_49d42xp-USMC/page/404/mode/2up) how long it takes a cannonball dropped from each floor of the Tower of Pisa to hit the ground. Well, you could just measure it in each case and make a table of the results. Or you could do what is the essence of theoretical science: make a model that gives some kind of procedure for computing the answer, rather than just measuring and remembering each case.

Imagine we have (somewhat idealized) data on how long the cannonball takes to fall from various floors:

How do we figure out how long it's going to take to fall from a floor we don't explicitly have data about? In this particular case, we can use known laws of physics to work it out. But say all we've got is the data, and we don't know what underlying laws govern it. Then we might make a mathematical guess -- say that perhaps we should use a straight line as the model:

We could pick different straight lines, but this is the one that is on average closest to the data we're given. And from this straight line we can estimate the fall time for any floor.

How did we know to try a straight line here? At some level we didn't. It's just something that's mathematically simple, and we're used to the fact that lots of data we measure turns out to be well fit by mathematically simple things. We could try something mathematically more complicated -- say a + b x + c x^2 -- and in this case we do better:
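As an illustration of what's involved, here is a minimal numpy sketch that fits both a straight line and a quadratic to some invented fall-time data (the numbers are made up, not the data shown in the article):

```python
import numpy as np

# Invented (floor, fall-time) data, roughly following t ~ sqrt(height)
floors = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
times = np.sqrt(3.0 * floors / 9.81) + np.random.normal(0, 0.02, floors.size)

# Straight line: t = a + b*x (degree-1 least-squares fit)
b, a = np.polyfit(floors, times, 1)
print(f"line:      t = {a:.3f} + {b:.3f} x")

# Quadratic: t = a + b*x + c*x^2
c2, b2, a2 = np.polyfit(floors, times, 2)
print(f"quadratic: t = {a2:.3f} + {b2:.3f} x + {c2:.3f} x^2")

# Use the straight line to estimate the fall time from a floor with no data
print("estimated fall time from floor 10:", a + b * 10.0)
```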

But things can also go quite wrong. For example, here's the best we can do with a + b/x + c sin(x):

It's worth understanding that there is never a "model-less model". Any model you use has some particular underlying structure, and then a certain set of "knobs you can turn" (i.e. parameters you can set) to fit your data. And in the case of ChatGPT, a lot of such "knobs" are used -- in fact, 175 billion of them.

But the remarkable thing is that the underlying structure of ChatGPT -- with "just" that many parameters -- is sufficient to make a model that computes next-word probabilities "well enough" to give us reasonable essay-length pieces of text.

Models for human-like tasks

The example we gave above involves making a model for numerical data that essentially comes from simple physics -- where we've known for several centuries that "simple mathematics applies". But for ChatGPT we have to make a model of human-language text, the kind produced by a human brain. And for something like that we don't (at least yet) have anything like "simple mathematics". So what might such a model be like?

Before we talk about language, let's talk about another human-like task: image recognition. As a simple example, consider images of digits (this is a classic machine learning example):

We can get a bunch of sample images for each digit:

Then, to find out whether an image we're given corresponds to a particular digit, we could just do an explicit pixel-by-pixel comparison with the samples we have. But as humans we certainly seem to do something better, because we can still recognize digits even when they're handwritten, with all sorts of modifications and distortions:
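The naive pixel-by-pixel comparison just mentioned is easy to sketch: represent each image as an array of gray levels and classify a new image by whichever stored sample it's closest to. The tiny 3x3 "images" below are placeholders for real digit images:

```python
import numpy as np

# Tiny 3x3 stand-ins for real digit images (e.g. 28x28 grayscale arrays)
samples = {
    "1": np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]], dtype=float),
    "2": np.array([[1, 1, 1],
                   [0, 1, 0],
                   [1, 1, 1]], dtype=float),
}

def classify(image):
    """Return the label whose sample image is closest, pixel by pixel."""
    return min(samples, key=lambda label: np.sum((samples[label] - image) ** 2))

test = np.array([[0, 1, 0],
                 [1, 1, 0],
                 [0, 1, 0]], dtype=float)  # a slightly distorted "1"
print(classify(test))
```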

When we modeled the numerical data above, we took a given numerical value x and computed a + b x for particular a and b. So if we treat the gray level of each pixel here as some variable x_i, is there some function of all those variables that, when evaluated, tells us what digit the image is of? It turns out that it is possible to construct such a function. Not surprisingly, though, it isn't simple. A typical example might involve half a million mathematical operations.

But the end result is that if we feed the collection of pixel values for an image into this function, out will come a number specifying which digit the image is of. Later, we'll discuss how such a function can be constructed, and the idea of neural nets. For now, let's treat the function as a black box: we feed in, say, images of handwritten digits (as arrays of pixel values), and we get out the digits they correspond to:

But what exactly is going on here? Let's say we progressively blur a number. For a while at first, our function can still "recognize" that there is a "2". But soon it "loses" this ability and starts giving "wrong" results:

But why do we say this is a "wrong" result? In this case, we know we get all images by blurring a "2". But if our goal is to make a model for humans to recognize images, the real question to ask is what would humans do if they were confronted with these blurry images and didn't know their source.

If the results from our function agree with what a human would say, we have a "good model". And the scientific fact is that for image recognition tasks like this, we now basically know how to build such a function.

Can we "mathematically prove" that they work? cannot. Because to do that, we have to have a mathematical theory of what humans are doing. Taking the "2" image as an example, change a few pixels. We can imagine that if only a few pixels are "out of place", we should still consider this image to be "2". But to what extent? This is a question about human visual perception. And, for bees or octopuses, the answer would undoubtedly be different -- and for putative aliens, it might be quite different.

Neural Networks

So how do the typical models we use for tasks like image recognition actually work? Currently the most popular and most successful approach uses neural networks. Invented in the 1940s, in a form remarkably close to how they're used today, neural networks can be thought of as simple idealizations of how brains seem to work.

In the human brain there are about 100 billion neurons (nerve cells), each capable of producing an electrical pulse perhaps up to a thousand times a second. The neurons are connected in a complicated network, with each neuron having tree-like branches that allow it to pass electrical signals to perhaps thousands of other neurons. And as a rough approximation, whether any given neuron produces an electrical pulse at a given moment depends on the pulses it has received from other neurons -- with different connections contributing with different "weights".

What happens when we "see an image" is that when the image's photons land on cells ("photoreceptors") at the back of the eye, they generate electrical signals in nerve cells. These nerve cells connect to other nerve cells, and eventually the signal travels through an entire layer of neurons. And it is during this process that we "recognize" the image and eventually "form an idea" that we "see a 2" (and maybe end up with some other behavior, like saying the word "2" aloud).

The "black box" function in the previous section is a "mathematicalized" version of such a neural network. It has exactly 11 layers (though only 4 "core layers").

There is nothing "theoretically derived" about this neural net; it was just the result of an engineering project in 1998 (https://resources.wolframcloud.com/NeuralNetRepository/resources/LeNet-Trained-on-MNIST -Data/), and was found to be valid. (Of course, this is no different than when we describe our brains as having arisen through the process of biological evolution).

But how do neural networks like this "recognize things"? The key lies in the notion of attractors (https://www.wolframscience.com/nks/chap-6--starting-from-randomness#sect-6-7--the-notion-of-attractors). Suppose we have handwritten images of 1 and 2:

We want all 1's to be "drawn to one place" and all 2's to be "drawn to another place". Or, to put it another way, if an image is somehow "closer to 1" than to 2, we want it to end up "in the place of 1", and vice versa.

As a direct analogy, let's assume that there are certain locations on the plane, represented by points (in real life, they might be coffee shop locations). Then we can imagine that starting from any point on the plane, we always want to end at the nearest point (ie we always go to the nearest coffee shop). We can represent this by dividing the plane into regions ("attractor basins") separated by idealized "watersheds":

One can think of this as implementing a kind of "recognition task", in which we're not doing something like identifying which digit a given image "looks most like" -- rather, we're just seeing, quite directly, which dot a given point is closest to. (The "Voronoi diagram" shown here separates points in two-dimensional Euclidean space; the digit recognition task can be thought of as doing something very similar, but in the 784-dimensional space formed from the gray levels of all the pixels in each image.)

So, how do we make a neural network "do a recognition task"? Consider this very simple case:

Our goal is to take an "input" corresponding to a position {x, y}, and then to "recognize" it as whichever of the three points it is closest to. Or, in other words, we want the neural network to compute a function of {x, y} such as:
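The target function itself is easy to state directly; here is a plain (non-neural-net) Python sketch of it, which the neural network will then have to approximate (the three points are arbitrary coordinates chosen for illustration):

```python
import numpy as np

# The three "attractor" points (coordinates chosen arbitrarily for illustration)
points = np.array([[ 1.0,  1.0],
                   [-1.5,  0.5],
                   [ 0.5, -1.5]])

def nearest_point(x, y):
    """Return the index of the point closest to the input position {x, y}."""
    distances = np.linalg.norm(points - np.array([x, y]), axis=1)
    return int(np.argmin(distances))

print(nearest_point(0.9, 1.2))   # closest to point 0
print(nearest_point(-1.0, 0.0))  # closest to point 1
```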

So, how do we do this with neural networks? At the end of the day, a neural net is a connected collection of idealized "neurons" -- usually arranged in layers -- and here's a simple example:

Each "neuron" is programmed to perform a simple numeric function. To "use" this network, we simply feed in numbers at the top (such as our coordinates x and y), then let the neurons in each layer "operate their function" and feed the results forward through the network - eventually at the bottom produces the final result:

In the traditional (biologically inspired) setup, each neuron effectively has a certain set of "incoming connections" from the neurons on the previous layer, and each connection is assigned a certain "weight" (which can be a positive or a negative number). The value of a given neuron is determined by multiplying the values of the "previous neurons" by their corresponding weights, adding these up, adding a constant, and finally applying a "thresholding" (or "activation") function. In mathematical terms, if a neuron has inputs x = {x1, x2, ...}, then it computes f[w . x + b], where the weights w and constant b are in general chosen differently for each neuron in the network; the function f is usually the same.

Computing w . x + b is just a matter of matrix multiplication and addition. The "activation function" f introduces nonlinearity (and is ultimately what leads to nontrivial behavior). Various activation functions get used; here we'll just use Ramp (http://reference.wolfram.com/language/ref/Ramp.html, also known as ReLU):
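Here is a minimal numpy sketch of a single such neuron computing f(w . x + b) with Ramp/ReLU as f, and of stacking neurons into layers; the weights are arbitrary numbers for illustration, not trained values:

```python
import numpy as np

def ramp(x):
    """Ramp / ReLU activation: max(0, x), applied elementwise."""
    return np.maximum(0, x)

def neuron(x, w, b):
    """A single idealized neuron: f(w . x + b)."""
    return ramp(np.dot(w, x) + b)

def layer(x, W, b):
    """A whole layer of neurons: each row of W holds one neuron's weights."""
    return ramp(W @ x + b)

# Arbitrary (untrained) weights, just to show the mechanics
x = np.array([0.5, -1.0])            # input {x, y}
W1 = np.array([[ 1.0, -0.5],
               [ 0.3,  0.8],
               [-1.2,  0.4]])
b1 = np.array([0.1, -0.2, 0.0])
W2 = np.array([[0.7, -1.1, 0.5]])
b2 = np.array([0.2])

hidden = layer(x, W1, b1)            # a hidden layer of 3 neurons
output = W2 @ hidden + b2            # final layer (no activation here)
print(neuron(x, np.array([1.0, 2.0]), 0.5), hidden, output)
```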

For each task we want our neural network to perform (or, in other words, for each overall function we want it to evaluate), we'll have a different choice of weights. (And, as we'll discuss later, these weights are normally determined by using machine learning to "train" the neural network on examples of the output we want.)

Ultimately, each neural network corresponds to some overall mathematical function -- but it can be messy to write. For the example above, it would be:

ChatGPT's neural network just corresponds to such a mathematical function -- but the function actually has billions of terms.

Let's go back to individual neurons. Here are some examples of the functions a neuron with two inputs (representing coordinates x and y) can compute with various choices of weights and constants (and Ramp as the activation function):

But what about the larger network above? Here's what it computes:

The result is not quite "correct", but it is close to the "nearest point" function we showed above.

Let's see what happens with some other neural networks. In each case, as we'll explain later, we're using machine learning to find the best choice of weights. Then we show here what the neural network with those weights computes:

Bigger networks generally do better at approximating the target function. And "in the middle of each attractor basin" we typically get exactly the answer we want. But at the boundaries (https://www.wolframscience.com/nks/notes-10-12--memory-analogs-with-numerical-data/) -- where the neural network "has a hard time making up its mind" -- things can be messier.

With this simple mathematical-style "recognition task", it's clear what the "correct answer" is. But when it comes to recognizing handwritten digits, it's less clear. What if someone writes a "2" so badly that it looks like a "7", etc.? Still, we can ask how a neural network distinguishes the digits -- and this gives an indication:

Can we say "mathematically" how networks are differentiated? Does not. It's just "doing what neural networks do". But it turns out that it often seems to be fairly similar to the distinctions we humans make.

Let's take a more complicated example. Say we have images of cats and dogs, and we've trained a neural network to distinguish them. Here's what the network might do on some examples:

Now, what the "correct answer" is is even less clear. What about a dog in cat clothing? etc. No matter what input is given, the neural network will produce an answer. And, it does so in a way that is reasonably consistent with what humans might do. As I said above, this is not a fact that we can "deduce from first principles". It has only been found to be true empirically, at least in some areas. But this is a key reason why neural networks are useful: they somehow capture a "human-like" way of doing things.

Show yourself a picture of a cat and ask "Why is that a cat?". Maybe you'd say "Well, I see its pointy ears", etc. But it's not so easy to explain how you recognized the picture as a cat. Somehow your brain just figured it out. And for the brain, there's no way (at least not yet) to "go inside" and see how it figured it out. What about an (artificial) neural network? Well, when you show it a picture of a cat, it's straightforward to see what each "neuron" does. But even getting a basic visualization is usually very difficult.

There are 17 neurons in the final network we used to solve the "nearest point" problem above. In the network used to recognize handwritten digits, there are 2190. And in the network we use to recognize cats and dogs, there are 60,650 neurons. In general, it is quite difficult to visualize a space equivalent to 60,650 dimensions. But since this is a network set up to process images, its many neurons are organized into arrays like the array of pixels it looks at.

And if we take a typical cat image:

then we can represent the states of the neurons in the first layer by a collection of derived images -- many of which we can readily interpret as things like "the cat without its background" or "the outline of the cat":

By the 10th layer, it's harder to interpret what these are:

But in general we can say that the neural network is "picking out certain features" (perhaps pointy ears are among them), and using them to determine what the image is of. But are these features ones we have names for, like "pointy ears"? Mostly not.

Are our brains using similar features? Mostly we don't know. But it's notable that the first few layers of a neural network like the one shown here seem to pick out aspects of images (such as the edges of objects) that appear to be similar to the ones we know are picked out by the first level of visual processing in the brain.

But suppose we want a "theory of cat recognition" for neural networks. We can say: "this network can do it" -- and that immediately gives us some sense of "how hard the problem is" (for example, how many neurons or layers it might take). But at least so far we have no way to "give a narrative description" of what the network is doing. Perhaps that's because it really is computationally irreducible, and there's no general way to find what it does other than explicitly tracing each step. Or perhaps it's just that we haven't "figured out the science", and haven't identified the "natural laws" that would let us summarize what's going on.

We have the same problem when generating language with ChatGPT. And it's also not clear that there's a way to "summarize what it's doing". But the richness and detail of language (and our experience with it) may allow us to go much further here than we do with images.

This article is continued in part 2: "What exactly is ChatGPT doing? Why can it do this? (2)".

