What is ChatGPT, and how does it work? If you still don't understand after reading this article, I'll eat my screen on a live stream!

Table of contents


Foreword:

 1. Where do probabilities come from?

2. What is a model

3. Human-like task model

3. Neural network

4. Machine Learning and Neural Network Training

5. Practice and theory of neural network training

6. The concept of embedding

 7. Inside ChatGPT

8. ChatGPT training

9. Beyond Basic Training

10. What really makes ChatGPT work?

11. Meaningful space and the law of semantic movement

12. Semantic Grammar and the Power of Computational Language

13. So what is ChatGPT doing and why does it work?



 

Foreword:

Author: Stephen Wolfram, a British-American computer scientist and physicist. He is the chief designer of Mathematica and the author of A New Kind of Science.

ChatGPT's ability to automatically generate something that reads even superficially like human-written text is remarkable and unexpected. But how does it do it? Why does it work? My purpose here is to give a rough outline of what's going on inside ChatGPT, and then to explore why it works so well at producing what we consider meaningful text.

I'll start by saying that I'm going to focus on the big picture of what's going on, and while I'll mention some engineering details, I won't delve into them. (The substance of what I say applies just as well to other current "large language models" (LLMs) as it does to ChatGPT.)

The first thing to explain is that ChatGPT is fundamentally always trying to produce a "reasonable continuation" of whatever text it has so far, where "reasonable" means "what one might expect someone to write after seeing what people have written on billions of web pages and the like".

So, suppose we've got the text "The best thing about AI is its ability to". Imagine scanning billions of pages of human-written text (say on the web and in digitized books), finding all instances of this text, and seeing what word comes next what fraction of the time.

ChatGPT effectively does something similar, except (as I'll explain) it doesn't look at literal text; it looks for things that "match in meaning" in a certain sense. The end result is that it produces a ranked list of words that might come next, along with "probabilities".

It's worth noting that when ChatGPT does something like write an article, all it's essentially doing is repeatedly asking "given the text so far, what should the next word be?", and adding one word each time. (More precisely, as I'll explain, it adds a "token", which may be just part of a word, which is why it can sometimes "make up new words".)

At each step, it gets a list of words with probabilities. But which one should it actually choose to add to the article (or whatever) it's writing? One might think it should be the "top" word (i.e., the one assigned the highest "probability").

But this is where a bit of witchcraft starts to creep in. Because, for some reason -- maybe one day we'll have a scientific understanding of it -- if we always pick the top-ranked word, we usually end up with a very "flat" article that never seems to "show any creativity" (and sometimes even repeats itself verbatim). But if we sometimes (at random) pick lower-ranked words, we get a "more interesting" article.

The fact that there's randomness here means that if we use the same prompt multiple times, we're likely to get a different article each time. And, in keeping with the voodoo idea, there's a particular so-called "temperature" parameter that determines how often lower-ranked words are used, and for article generation it turns out that a "temperature" of 0.8 seems best. (It's worth emphasizing that no "theory" is being used here; it's just what has been found to work in practice. For example, the concept of "temperature" exists because exponential distributions familiar from statistical physics happen to be used, but there's no "physical" connection -- at least as far as we know.)
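
To make the role of "temperature" concrete, here is a minimal sketch (mine, not code from the article, which uses Wolfram Language) of re-weighting a made-up next-word probability list by a temperature before sampling; the word list and probabilities are invented for illustration:

```python
import numpy as np

# Hypothetical next-word probabilities for the prompt
# "The best thing about AI is its ability to" (made up for illustration).
words = ["learn", "predict", "make", "understand", "do"]
probs = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

def sample_with_temperature(probs, temperature, rng):
    """Re-weight a probability distribution by a temperature and sample one index."""
    if temperature == 0:                    # "zero temperature": always take the top word
        return int(np.argmax(probs))
    logits = np.log(probs)                  # recover log-probabilities
    scaled = logits / temperature           # temperature < 1 sharpens, > 1 flattens
    new_probs = np.exp(scaled - scaled.max())
    new_probs /= new_probs.sum()            # renormalize
    return int(rng.choice(len(probs), p=new_probs))

rng = np.random.default_rng(0)
for temperature in (0.0, 0.8, 2.0):
    picks = [words[sample_with_temperature(probs, temperature, rng)] for _ in range(10)]
    print(f"temperature {temperature}: {picks}")
```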

Before we go on, I should explain that for expository purposes I'm mostly not going to use the full system that's in ChatGPT; instead, I'll usually use the simpler GPT-2 system, which has the nice feature of being small enough to run on a standard desktop computer.

So for everything I show, including explicit Wolfram Language code, you can run it on your computer right away. (Click on any picture here to copy the code behind it - Translator's Note: Please check the "Original Link" at the end of the article, where you can click on the picture to get the code).

For example, here's how to obtain the above probability table. First, we must retrieve the underlying "language model" neural net:

Later we'll look inside this neural net and talk about how it works. But for now we can just apply this "net model" as a black box to our text so far, and ask for the top five words the model thinks should follow, together with their probabilities:
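
The code itself appears in the original only as clickable images, in Wolfram Language. As a rough stand-in under that caveat, here is a sketch of the same experiment using the Hugging Face transformers implementation of GPT-2 (my substitution, not the author's code):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the (smallest) GPT-2 language model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The best thing about AI is its ability to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits          # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]             # scores for whatever token comes next
probs = torch.softmax(next_token_logits, dim=-1)

top = torch.topk(probs, 5)                    # five most probable next tokens
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()]):>12s}  {p.item():.3f}")
```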

This turns this result into an explicitly formatted "dataset":

What happens if you repeatedly "apply the model" - adding the word with the highest probability (designated as the model's "decision" in this code) at each step:

What will happen if it continues? In this case ("zero temperature"), a rather confusing and repetitive situation quickly ensues:

But what if instead of always picking the "top" words, we sometimes pick the "non-top" words at random ("Randomness" corresponds to a "Temperature" of 0.8)? One can build text again:

And each time you do this, you'll have a different random selection and the text will be different -- like these 5 examples:
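
Under the same caveat (a transformers-based sketch standing in for the article's Wolfram Language images), greedy "zero temperature" decoding and sampling at temperature 0.8 can be compared like this:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The best thing about AI is its ability to",
                             return_tensors="pt")

# Greedy decoding ("zero temperature"): always take the most probable token.
greedy = model.generate(input_ids, max_new_tokens=40, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(greedy[0]))

# Sampling with temperature 0.8: sometimes pick lower-ranked tokens at random,
# so each run produces a different continuation.
for _ in range(5):
    sampled = model.generate(input_ids, max_new_tokens=40, do_sample=True,
                             temperature=0.8, top_k=0,
                             pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(sampled[0]))
```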

It's worth pointing out that even at the first step there are a lot of possible "next words" to choose from (at temperature 0.8), though their probabilities fall off quite quickly (and, yes, the straight line on this log-log plot corresponds to an n^-1 "power-law" decay that's very characteristic of the general statistics of language):

So what happens if it continues? Here's a random example. It's better than the case of the top word (zero temperature), but still a bit weird at best:

This was done with the simplest GPT-2 model (from 2019). With the newer and bigger GPT-3 models the results are better. Here is the top-word (zero temperature) text generated with the same "prompt", but with the biggest GPT-3 model:

Here's a random example when "temperature is 0.8":

 1.  Where do probabilities come from?

Well, ChatGPT always chooses the next word based on probability. But where do these probabilities come from? Let's start with a simpler problem. Let's consider generating English text one letter (rather than word) at a time. How can we figure out the probability of each letter?

A very simple thing we can do is take a sample of English text and count how often different letters appear in it. So, for example, this counts the letters in the Wikipedia article on "cat":

And this does the same for "dog":

The results are similar, but not the same ("o" is undoubtedly more common in "dogs" articles, since it appears in the word "dog" itself after all). Nevertheless, if we take a sufficiently large sample of English texts, we can expect to end up with at least fairly consistent results.

Here's a sample of what we get if we generate a sequence of letters with these probabilities:
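
As a rough illustration of this letter-frequency scheme (a sketch of mine; a short placeholder string stands in for the Wikipedia article text used in the original):

```python
import random
from collections import Counter

# Placeholder text standing in for the article on cats.
text = ("the cat is a domestic species of small carnivorous mammal "
        "and the only domesticated species in the family felidae")
letters = [c for c in text if c.isalpha()]

counts = Counter(letters)
total = sum(counts.values())
probs = {letter: n / total for letter, n in counts.items()}
print(sorted(probs.items(), key=lambda kv: -kv[1])[:10])   # ten most common letters

# Generate a "random text" by sampling letters with these probabilities.
population = list(probs.keys())
weights = list(probs.values())
sample = "".join(random.choices(population, weights=weights, k=80))
print(sample)
```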

We can break this up into "words" by adding spaces, as if they were letters with certain probabilities:

We can do a slightly better job of making "words" by forcing the distribution of "word lengths" to match the distribution in English:

We don't happen to get any "actual words" here, but the results look slightly better. To go further, though, we need to do more than randomly pick each letter individually. For example, we know that if we have a "q", the next letter basically must be a "u":

Here's a plot of the probabilities for letters on their own:

Here is a plot showing the probabilities of pairs of letters ("2-grams") in typical English text. The possible first letters are shown across the page, and the second letters down the page:

For example, we see here that, except for the "u" row, the "q" column is blank (with zero probability). Well, now instead of generating "words" one letter at a time, we use these "2-gram" probabilities and look at two letters at a time to generate them. Here's a sample of the results - which happen to include some "real words":

With a big enough sample of English text we can get good estimates not just of the probabilities of single letters or pairs of letters (2-grams), but of longer runs of letters as well. And if we generate "random words" with progressively longer n-gram probabilities, we see that they get progressively "more realistic":
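
Here is a minimal sketch of the letter n-gram idea, again with a placeholder string standing in for a real corpus:

```python
import random
from collections import Counter, defaultdict

def ngram_model(text, n):
    """Build a mapping from (n-1)-letter contexts to Counters of the next letter."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, n, length, seed_context):
    """Sample letters one at a time, conditioning on the last n-1 letters."""
    out = seed_context
    for _ in range(length):
        counts = model.get(out[-(n - 1):])
        if not counts:
            break
        letters, weights = zip(*counts.items())
        out += random.choices(letters, weights=weights)[0]
    return out

sample_text = "the quick brown fox jumps over the lazy dog " * 20   # placeholder corpus
for n in (2, 3, 4):
    model = ngram_model(sample_text, n)
    print(n, generate(model, n, 60, sample_text[: n - 1]))
```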

But for now let's assume - more or less like ChatGPT - that we're dealing with whole words, not letters. There are approximately 40,000 reasonably common words in the English language. By looking at a large corpus of English text (say millions of books with hundreds of billions of words in total), we can get an estimate of how common each word is. Using this, we can start generating "sentences" where each word is drawn independently at random with the same probability of occurrence as in the corpus. Here's a sample of what we got:

Obviously, this is nonsense. So how can we do better? As with letters, we can start to consider probabilities not just for single words, but for pairs or n-grams of longer words. In the pairwise case, here are the 5 examples we get, all starting with the word "cat":

It's become a little more "plausible". We might imagine that if we could use sufficiently long n-grams we'd basically "get a ChatGPT" -- in the sense that we'd get something that generates article-length sequences of words with the "correct overall article probabilities". But here's the problem: there just isn't enough English text to be able to deduce these probabilities.

In a crawl of the web, there may be hundreds of billions of words; in books that have been digitized, there may be another hundreds of billions of words. But with 40,000 common words, even the number of possible 2-grams is already 1.6 billion, and the number of possible 3-grams is 60 trillion.

So we have no way of estimating the probabilities of all of these from existing text. And by the time we get to the 20-word "fragment", there are more possibilities than there are particles in the universe, so in a sense they can never all be written down.

So what can we do? The big idea is to build a model that lets us estimate the probability of sequences occurring -- even if we've never explicitly seen those sequences in the text corpus we're looking at. At the heart of ChatGPT is a so-called "Large Language Model" (LLM), which is built to estimate these probabilities very well.

2. What is a model

Suppose you wanted to know (as Galileo did in the late 1500s) how long it takes a cannonball dropped from each floor of the Tower of Pisa to hit the ground. Well, you could measure it in each case and tabulate the results. Or you could do the essence of theoretical science: build a model that gives some kind of procedure for computing the answer, rather than just measuring and remembering each case.

Let's imagine we have (somewhat idealized) data on how long the cannonball takes to fall from various floors.

How do we work out how long it takes for it to fall from a floor for which we don't have definitive data? In this particular case, we can use known laws of physics to calculate. But if all we have is data, we don't know what fundamental laws govern it. Then we can make a mathematical guess, say, maybe we should use a straight line as a model.

We can choose different straight lines. But this is the one that, on average, comes closest to the data we were given. And according to this straight line, we can estimate the descent time of any floor.

How did we know to try a straight line here? In some sense we didn't. It's just something that's mathematically simple, and we're used to the fact that lots of data we measure turns out to be fitted quite well by mathematically simple things. We could try something mathematically more complicated -- say a + b x + c x^2 -- and in this case we do better:

Things can go quite wrong, though. For example, here's the best we can do with a + b/x + c sin(x):
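
As a small sketch of this model-fitting idea, with made-up fall-time data (the numbers are purely illustrative, not the article's):

```python
import numpy as np

# Idealized (made-up) data: floor number vs. time for the ball to reach the ground.
floors = np.array([1, 2, 3, 4, 5, 6, 7, 8])
times  = np.array([0.7, 1.1, 1.3, 1.5, 1.7, 1.9, 2.0, 2.1])   # seconds, for illustration

# Fit a straight line a + b*x, then a quadratic a + b*x + c*x^2.
line = np.polyfit(floors, times, deg=1)
quad = np.polyfit(floors, times, deg=2)

for floor in (2.5, 10):   # estimate fall times for floors we have no data for
    print(floor, np.polyval(line, floor), np.polyval(quad, floor))
```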

It's worth understanding that there is never a "model without a model". Any model you use has some specific underlying structure, and then has a set of "knobs you can turn" (i.e. parameters you can set) to fit your data. And in the case of ChatGPT, many of these "knobs" are used—in fact, 175 billion of them.

But what's striking is that the underlying structure of ChatGPT -- which has "only" so many parameters -- is good enough to make a model that computes next-word probabilities "good enough" to give us reasonable article-length text.

3. Human-like task model

The examples we gave above involved modeling numerical data that essentially came from simple physics - we've known for centuries that "simple math works". But for ChatGPT, we have to build a model for human language text, the kind produced by the human brain. And we don't (at least not yet) have anything like "simple math" for something like this. So what might its model look like?

Before we talk about language, let's talk about another human-like task: recognizing images. And as a simple example, let's consider an image of a number (yes, this is a classic machine learning example):

One thing we can do is get a bunch of sample images for each digit:

Then, to find out whether our input image corresponds to a certain number, we simply do an explicit pixel-by-pixel comparison with the samples we have. But as humans, we seem to be able to do better -- because we can still recognize numbers, even when they're handwritten, with all kinds of modifications and distortions.

When we build a model for the numerical data above, we are able to take a given numerical value x and then compute a + bx for a particular a and b.

So if we take the gray value of each pixel here as some variable xi, is there some function of all these variables that when evaluated tells us what number this image is? It turns out that it is possible to construct such a function. Not surprisingly, this isn't particularly simple. A typical example might involve half a million math operations.

But the end result is that if we feed the collection of pixel values of an image into this function, out comes a number specifying which digit our image is of. Later we'll discuss how such a function can be constructed, and the idea of neural networks. But for now let's treat the function as a black box: we feed in images of handwritten digits (as arrays of pixel values) and get out the digits they correspond to:

But what exactly is going on here? Let's say we progressively blur a number. For a while, our function still "recognizes" it, in this case a "2". But soon it "loses" and starts giving "wrong" results:

But why do we say this is a "wrong" result? In this case, we know we get all images by blurring a "2". But if our goal is to make a model for humans to recognize images, the real question to ask is what would humans do if they were confronted with these blurry images and didn't know their source.

If the results we get from our function typically agree with what a human would say, then we have a "good model". And the nontrivial scientific fact is that for an image-recognition task like this we now basically know how to construct such functions.

Can we "mathematically prove" that they work? Well, can't. Because to do that, we have to have a mathematical theory of what we humans are doing. Taking the "2" image as an example, change a few pixels. We can imagine that only a few pixels are "out of place", and we should still consider this image to be "2". But how far should this go? This is a question about human visual perception. And, yes, for a bee or an octopus, the answer would undoubtedly be different -- and probably quite different for a putative alien.

3. Neural network

Alright, so how do the typical models we use for tasks like image recognition actually work? The most popular and successful current approach is to use neural networks. Invented in the 1940s, in a form remarkably close to how they are used today, neural networks can be thought of as simple idealizations of how brains seem to work.

In the human brain, there are approximately 100 billion neurons (nerve cells), each capable of generating electrical impulses, perhaps a thousand times per second. These neurons are connected in a complex network, and each neuron has tree-like branches that allow it to pass electrical signals to potentially thousands of other neurons.

As a rough estimate, whether any given neuron generates an electrical spike at a certain moment depends on the spikes it receives from other neurons -- different connections have different "weights" to contribute.

What happens when we "see an image" is that when photons from the image land on cells ("photoreceptors") at the back of our eyes, they produce electrical signals in nerve cells. These nerve cells are connected to other nerve cells, and eventually the signals go through a whole sequence of layers of neurons. And it's in this process that we "recognize" the image and eventually "form the thought" that we're "seeing a 2" (and maybe in the end do something like say the word "two" out loud).

The "black box" function in the previous section is a "mathematicalized" version of such a neural network. It has exactly 11 layers (though only 4 "core layers").

There's nothing particularly "theoretical" about this neural net; it's just something that was built as a project in 1998 and found to work. (Of course, this is no different than when we describe our brains as having arisen through the process of biological evolution).

Okay, but how do neural networks like this "recognize things"? The key lies in the concept of attractors. Imagine we have handwritten images of 1 and 2:

We want all 1's to be "drawn to one place" and all 2's to be "drawn to another place". Or, to put it another way, if an image is somehow "closer to 1" than to 2, we want it to end up "in the place of 1", and vice versa.

As a direct analogy, let's assume that there are certain locations on the plane, represented by points (in real life, they might be coffee shop locations). Then we can imagine that starting from any point on the plane, we always want to end at the nearest point (ie we always go to the nearest coffee shop). We can represent this by dividing the plane into regions ("basins of attraction") separated by idealized "watersheds":

We can think of this as performing a kind of "recognition task", where instead of doing something like identifying which digit a given image "looks most like", we're just seeing which point a given point is closest to. (The "Voronoi diagram" setup shown here separates points in two-dimensional Euclidean space; the digit-recognition task can be thought of as doing something very similar, but in a 784-dimensional space formed from the gray levels of all the pixels in each image.)

So, how do we make a neural network "do a recognition task"? Let's consider this very simple case:

Our goal is to take an "input" corresponding to a position {x, y}, and then "recognize" it as whichever of the three points it's closest to. Or, in other words, we want the neural network to compute a function of {x, y} like this:
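
Written out directly, the function we want the network to reproduce is just "which of three fixed points is the input closest to". Here is a tiny sketch of it, with the points' coordinates chosen arbitrarily for illustration:

```python
import math

# Three fixed "attractor" points on the plane (coordinates made up for illustration).
points = [(-1.0, 1.0), (1.0, 1.0), (0.0, -1.2)]

def nearest_point(x, y):
    """The function we'd like a neural network to reproduce:
    which of the three points is the input closest to?"""
    distances = [math.hypot(x - px, y - py) for px, py in points]
    return distances.index(min(distances))

print(nearest_point(0.2, 0.9))     # -> index of the nearest point
print(nearest_point(-0.5, -1.0))
```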

So, how do we do this with neural networks? At the end of the day, a neural net is an idealized collection of "neurons" connected - usually arranged in layers - a simple example of which is:

Each "neuron" is effectively set up to evaluate a simple numeric function. To "use" this network, we simply feed in numbers at the top (like our coordinates x and y), then let the neurons in each layer "evaluate their capabilities" and feed the results forward through the network -- eventually at The bottom produces the final result.

In the traditional (biologically inspired) setup, each neuron effectively has a set of "incoming connections" from the neurons on the previous layer, with each connection assigned a certain "weight" (which can be a positive or negative number). The value of a given neuron is determined by multiplying the values of the "previous neurons" by their corresponding weights, then adding these up and adding a constant, and finally applying a "thresholding" (or "activation") function.

In mathematical terms, if a neuron has inputs x = {x1, x2, ...}, then we compute f[w . x + b], where the weights w and constant b are generally chosen differently for each neuron in the network; the function f is usually the same.

Computing w . x + b is just a matter of matrix multiplication and addition. The "activation function" f introduces nonlinearity (and is ultimately what leads to nontrivial behavior). Various activation functions get used; here we'll just use Ramp (or ReLU):
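
As a sketch of what a single neuron, and then a whole layer, computes with this setup (weights and biases chosen arbitrarily for illustration):

```python
import numpy as np

def ramp(z):
    """Ramp / ReLU activation: max(0, z)."""
    return np.maximum(0, z)

def neuron(x, w, b):
    """One neuron: weight the inputs, add a constant, apply the activation."""
    return ramp(np.dot(w, x) + b)

# A tiny two-input neuron.
x = np.array([0.5, -1.0])   # inputs (say, our coordinates x and y)
w = np.array([1.2, 0.7])    # weights
b = 0.3                     # constant ("bias")
print(neuron(x, w, b))

# A whole layer is just the same computation with a weight matrix:
W = np.array([[1.2, 0.7],
              [-0.4, 0.9],
              [0.1, -1.1]])   # three neurons, each with two incoming weights
B = np.array([0.3, 0.0, -0.2])
print(ramp(W @ x + B))        # outputs of the three neurons
```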

For each task we want the neural network to perform (or, rather, for each overall function we want it to evaluate), we will have a different choice of weights. (As we'll discuss later, these weights are typically determined by using machine learning to "train" the neural network on instances of our desired output).

Ultimately, each neural network corresponds to some overall mathematical function -- messily written though it may be. For the example above, it would be:

ChatGPT's neural network only corresponds to such a mathematical function - but in fact there are billions of terms.

But let's go back to individual neurons. Here are some examples of functions that a neuron with two inputs (representing coordinates x and y) can compute after choosing different weights and constants (and Ramp as activation function):

But what about that larger network above? Well, here's what it calculates:

This isn't quite "correct", but it's close to the "closest point" function we showed above.

Let's see what happens with some other neural networks. In each case, as we explain later, we are using machine learning to find the best choice of weights. We then show here the computation of the neural network with these weights:

Larger networks generally better approximate our objective function. And in the "middle of each attractor basin" we usually get the answers we want. But at the border—where neural networks "have a hard time making up their minds"—things can be messier.

In this simple math-style "identification task," it's clear what the "correct answer" is. But when it comes to recognizing handwritten digits, it's less clear. What if someone writes a "2" so badly that it looks like a "7", etc.? Still, we can ask how the neural network differentiates the numbers - which gives an indication:

Can we say "mathematically" how networks are differentiated? it's not true. It's just "doing what neural networks do". But it turns out that often seems to line up fairly well with the distinctions we humans make.

Let's take a more complicated example. Let's say we have images of cats and dogs. We have a neural network which is trained to distinguish them. Here's what it might do in some examples:

Now, what the "correct answer" is is even less clear. What about a dog in a cat suit? etc. No matter what input it is given, the neural network will produce an answer. And, it turns out, doing so in a way that is reasonably consistent with what humans might do.

As I said above, this is not a fact that we can "deduce from first principles". It has only been found to be true empirically, at least in some areas. But this is a key reason why neural networks are useful: they somehow capture a "human-like" way of doing things.

Show a picture of a cat and ask "Why is that a cat?". Maybe you'd start by saying "Well, I see its pointy ears, ...". But it's not very easy to explain how you recognized the picture as a cat. It's just something your brain somehow figured out. But for brains there's (at least so far) no way to "go inside" and see how they figured it out.

So what about an (artificial) neural network? Well, when you show a picture of a cat, you can directly see what each "neuron" does. However, it is often very difficult to get even a basic visualization.

There are 17 neurons in the final network we used to solve the "nearest point" problem above. In the network used to recognize handwritten digits, there are 2190. And in the network we use to identify cats and dogs, there are 60,650.

In general, it is quite difficult to visualize a space equivalent to 60,650 dimensions. But since this is a network set up to process images, its many layers of neurons are organized into arrays, like arrays of pixels it looks at.

If we take a typical cat image:

We can then represent the state of the neurons in the first layer with a set of derived images - many of which we can easily interpret as "cat without background", or "cat silhouette", etc.:

By the tenth layer it's harder to interpret what's going on:

But in general, we can say that the neural network is "picking out certain features" (perhaps pointy ears are among them), and using these features to determine what the image is. But are these traits we have names for, like "pointy ears"? Most of the time not.

Are our brains using similar features? Mostly we don't know. But it's notable that the first few layers of a neural network like the one we've shown here seem to pick out aspects of images (like the edges of objects) that appear to be similar to the ones we know are picked out by the first level of visual processing in brains.

But, suppose we want a "cat recognition theory" for neural networks. We can say "look, this particular network did it" - this immediately gives us some sense of "how hard the problem is" (eg, how many neurons or layers might be needed).

But at least until now, we have no way to have a "narrative description" of what the network is doing. Perhaps this is because it is indeed computationally irreducible, and there's no general way to find out what it's doing other than explicitly tracing each step. Or it could just be that we haven't "figured out the science", haven't identified the "laws of nature" that allow us to generalize what's going on.

We have the same problem when we talk about generating language with ChatGPT. And it's also not clear that there's a way to "summarize what it's doing". But the richness and detail of language (and our experience with it) may take us farther than images.

4. Machine Learning and Neural Network Training

So far we've been talking about neural networks that "already know" how to do a specific task. But what makes neural networks so useful (and presumably in the brain too) is that they can not only do a variety of tasks in principle, but that they can be gradually "trained by examples" to do them.

When we make a neural network to distinguish cats from dogs, we don't actually need to write a program that (say) explicitly finds whiskers; instead we just show lots of examples of what's a cat and what's a dog, and then have the network "machine learn" from these how to distinguish them.

The point is that a trained network "generalizes" from the particular examples it's shown. As we saw above, it isn't that the network simply recognizes the specific pixel patterns of the example cat images it was shown; rather, the neural network somehow manages to distinguish images on the basis of what we would consider some kind of general "catness".

So, how exactly does the training of the neural network work? Essentially, we're trying to find the weights that will allow the neural network to successfully reproduce the examples we've been given. We then rely on the neural network to "interpolate" (or "generalize") between these examples in a "reasonable" way.

Let's look at a simpler problem than the closest point problem above. Let's just try to get a neural network to learn the function:

For this task, we need a network with only one input and one output, such as:

But what weights etc should we use? Under each possible set of weights, the neural network computes some function. For example, here's what it does with several sets of randomly chosen weights:

Yes, we can clearly see that it doesn't even come close to reproducing the function we want in these cases. So, how do we find the weights that reproduce this function?

The basic idea is to "learn" by providing a large number of "input → output" examples - and then try to find weights that reproduce these examples. Here are the results with an increasing number of examples:

At each stage in this "training" the weights in the network are progressively adjusted -- and we see that eventually we end up with a network that successfully reproduces the function we want. So how do we adjust the weights? The basic idea is, at each stage, to see "how far away" we are from getting the function we want, and then to update the weights in a way that brings us closer.

To find out "how far away we are", we compute what's usually called a "loss function" (or sometimes a "cost function"). Here we're using a simple (L2) loss function, which is just the sum of the squares of the differences between the values we get and the true values.

What we see is that as our training process progresses, the loss function gradually decreases (following a certain "learning curve", which is different for different tasks) - until we reach a point where the network (at least a good approximation) successfully reproduces the function we want:
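
Here is a minimal sketch of such a training loop in PyTorch, with a squared-error (L2-style) loss; the target function is chosen purely for illustration and is not the one from the article:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Training examples of the function we want the network to learn
# (a simple bump-like target chosen here just for illustration).
x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.exp(-x**2)                      # target outputs

# A tiny network: one input, one hidden layer of ReLU units, one output.
net = nn.Sequential(nn.Linear(1, 20), nn.ReLU(), nn.Linear(20, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                    # the (L2) squared-error loss

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)             # how far are we from the examples?
    loss.backward()                       # work out how to adjust each weight...
    optimizer.step()                      # ...and nudge the weights that way
    if step % 500 == 0:
        print(step, loss.item())          # the loss gradually decreases
```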

Alright, the last essential piece to explain is how the weights are adjusted to reduce the loss function. As we've said, the loss function gives us a "distance" between the values we've got and the true values. But "the values we've got" are determined at each stage by the current version of the neural network and the weights in it. But now imagine that the weights are variables -- say w_i. We want to find out how to adjust the values of these variables to minimize the loss that depends on them.

For example, imagine (in an incredible simplification of typical neural networks used in practice) that we have just two weights, w1 and w2. Then we might have a loss that, as a function of w1 and w2, looks like this:

Numerical analysis provides a variety of techniques for finding the minimum in cases like this. But a typical approach is just to progressively follow the path of steepest descent from whatever w1, w2 we had before:

Like water flowing down a mountain, all that is guaranteed is that the process ends up at some local minimum on the surface ("a mountain lake"); it will most likely not reach the final global minimum.

It might not be obvious that it's even feasible to find the path of steepest descent on the "weight landscape". But calculus comes to the rescue. As we mentioned above, one can always think of a neural net as computing a mathematical function that depends on its inputs and its weights.

But now consider differentiating with respect to these weights. It turns out that the chain rule of calculus in effect lets us "unravel" the operations done by successive layers in the neural network. And the result is that we can -- at least in some local approximation -- "invert" the operation of the neural network, and progressively find weights that minimize the loss associated with the output.
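
As a toy illustration of "following the path of steepest descent", here is a sketch of plain gradient descent on a made-up two-weight loss surface (the Himmelblau function, chosen only because it has several separate "mountain lakes"):

```python
import numpy as np

def loss(w1, w2):
    """A made-up loss surface over two weights, just for illustration."""
    return (w1**2 + w2 - 11)**2 + (w1 + w2**2 - 7)**2   # the Himmelblau function

def grad(w1, w2):
    """Partial derivatives of the loss with respect to w1 and w2."""
    dw1 = 4 * w1 * (w1**2 + w2 - 11) + 2 * (w1 + w2**2 - 7)
    dw2 = 2 * (w1**2 + w2 - 11) + 4 * w2 * (w1 + w2**2 - 7)
    return np.array([dw1, dw2])

w = np.array([0.0, 0.0])      # starting weights
learning_rate = 0.01
for step in range(100):
    w = w - learning_rate * grad(*w)      # step "downhill" along the steepest descent
    if step % 20 == 0:
        print(step, w, loss(*w))

# Which minimum we end up in depends on where we started -- like water
# flowing downhill into whichever "mountain lake" happens to be nearest.
```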

The figure above shows the kind of minimization we might have to do in the unrealistically simple case of just 2 weights. But it turns out that even with many more weights (ChatGPT uses 175 billion), it's still possible to do the minimization, at least to some level of approximation. And in fact the big breakthrough in "deep learning" that occurred around 2011 was associated with the discovery that, in some sense, it can be easier to do (at least approximate) minimization when there are lots of weights involved than when there are fairly few.

In other words -- somewhat counterintuitively -- it can be easier to solve more complicated problems with neural networks than simpler ones. And the rough reason for this seems to be that when one has a lot of "weight variables", one has a high-dimensional space with "lots of different directions" that can lead one to the minimum, whereas with fewer variables it's easier to end up stuck in a local minimum (a "mountain lake") with no "direction to get out".

It's worth pointing out that in the typical case, there are many different sets of weights that all lead to neural networks with nearly the same performance. And in actual neural network training, there are usually many random choices that lead to "different but equivalent solutions", like these:

But each of these "different solutions" will behave at least slightly differently. If we ask, say, to "extrapolate" outside the region where we provide training examples, we can get wildly different results:

But which one is "correct"? There's really no way to tell. They are all "consistent with the observed data". But they all correspond to different "innate" ways of "thinking" about what to do "outside the box". To us humans, some may seem "more reasonable" than others.

5. Practice and theory of neural network training

Particularly over the past decade, there have been many advances in the art of training neural networks. And, yes, it is basically an art. Sometimes, especially in retrospect, one can see at least a glimmer of a "scientific explanation" for something that's being done. But mostly things have been discovered by trial and error, adding ideas and tricks that have progressively built up a significant lore about how to work with neural networks.

There are several key parts. First, there's the matter of what architecture of neural network to use for a particular task. Then there's the critical issue of how to get the data on which to train the network. And increasingly one isn't dealing with training a network from scratch: instead a new network can either directly incorporate another already-trained network, or at least use that network to generate more training examples for itself.

One might think that for each specific task one would need a different neural network architecture. But it was found that even for apparently different tasks, the same architecture seemed to work.

In a way this is reminiscent of the idea of universal computation (and my Principle of Computational Equivalence), but, as I'll discuss later, I think it more reflects the fact that the tasks we typically try to get neural networks to do are "human-like" ones, and neural networks can capture fairly general "human-like processes".

In the early days of neural networks, people tended to think that one should "make the neural network do as little as possible". For example, when converting speech to text, it is believed that the audio of the speech should first be analyzed, broken down into phonemes, and so on. But it's been found that, at least for "human-like tasks", it's often better to try to train a neural network on an "end-to-end problem", letting it "discover" the necessary intermediate features, encodings, etc. on its own.

There's also the idea that we should introduce complex individual components into neural networks that actually "explicitly implement specific algorithmic ideas." But again, this proves not worth it; instead, it's better to just deal with very simple components and let them "self-organize" (albeit usually in ways beyond our comprehension) to achieve (roughly) the equivalent of those algorithmic ideas .

That's not to say that there are no "structuring ideas" relevant to neural networks. So, for example, two-dimensional arrays of neurons with local connections seem to be very useful at least in the early stages of processing images. And having patterns of connections that concentrate on "looking back" in sequences seems useful -- as we'll see later -- when dealing with things like human language, for example in ChatGPT.

But an important feature of neural networks is that, like computers in general, they're ultimately just dealing with data. And current neural networks -- with current approaches to neural network training -- deal specifically with arrays of numbers. But in the course of processing, those arrays can be completely rearranged and reshaped. As an example, the network we used above for identifying digits starts with a two-dimensional "image-like" array, quickly "thickens" to many channels, but then "condenses" into a one-dimensional array that ultimately contains elements representing the different possible output digits:

But, well, how do you tell how big a neural net is for a particular task? It's an art. In a way, the key is to know "how hard is this task". But for human-like tasks, this is often difficult to estimate.

Yes, there may be a systematic way to accomplish tasks very "mechanically" by computers. But it's hard to know whether there are supposed tricks or shortcuts that make this task easier, at least on a "human-like level." It may be necessary to enumerate a huge game tree to play a certain game "mechanically"; but there may be an easier ("heuristic") way to achieve "human-level play".

When one is dealing with tiny neural networks and simple tasks, "you can't get there from here" can sometimes be clearly seen. For example, this is the best people seem to be able to achieve with a few small neural networks on the task in the previous section:

And in our case, if the network is too small, it just can't reproduce the function we want. But above some size it has no problem -- at least if one trains it for long enough, with enough examples. By the way, these pictures illustrate a piece of neural network lore: one can often get away with a smaller network if there's a "squeeze" in the middle that forces everything to go through a smaller intermediate number of neurons.

(It's worth mentioning that networks with "no intermediate layers" -- so-called "perceptrons" -- can only learn essentially linear functions; but as soon as there's even one intermediate layer it's in principle possible to approximate any function arbitrarily well, at least with enough neurons, though some kind of regularization or normalization is typically needed to make it feasibly trainable.)

Well, let's assume we've settled on some kind of neural network architecture. Now there is a problem, how to get data to train the network. Many practical challenges surrounding neural networks and machine learning in general center on obtaining or preparing the necessary training data. In many situations ("supervised learning"), one wishes to have explicit examples of inputs and desired outputs.

So, for example, one might want to label an image by what's in it, or some other attribute. Maybe we have to do it explicitly -- usually painstakingly. But many times, we can draw on work that has already been done, or use it as some kind of proxy.

So, for example, we might use the alt tags that have been provided for images on the web. Or, in a different domain, we might use closed captions created for videos. Or, for language translation training, we might use parallel versions of web pages or other documents that exist in different languages.

How much data do you need to show a neural network to train it to do a specific task? Again, this is difficult to estimate from first principles. Of course, the requirements can be greatly reduced by using "transfer learning" to "transfer" something like a list of important features that have already been learned in another network.

But in general, neural networks need to "see a lot of examples" to train well. And -- at least for some tasks -- it's an important piece of neural network lore that the examples can be incredibly repetitive. Indeed, it's a standard strategy just to show a neural network all the examples one has, over and over again. In each of these "training rounds" (or "epochs") the neural network will be in at least a slightly different state, and somehow "reminding" it of a particular example is useful in getting it to "remember that example". (And, yes, perhaps this is analogous to the usefulness of repetition in human memorization.)

But often just repeating the same example over and over again isn't enough. It's also necessary to show the neural network variations of the example. And it's a feature of neural network lore that those "data augmentation" variations don't have to be sophisticated to be useful. Just slightly modifying images with basic image processing can make them essentially "as good as new" for neural network training. And, similarly, when one runs out of actual video and so on for training self-driving cars, one can go on to get data from simulations in a model video-game-like environment without all the detail of actual real-world scenes.

How about something like ChatGPT? Well, it has the nice feature that it can do "unsupervised learning", which makes it much easier to get examples to train on. Recall that the basic task for ChatGPT is to figure out how to continue a piece of text it's been given. So to get a "training example", all we have to do is take a piece of text, mask out the end of it, and use that as the "input to train from", with the "output" being the complete, unmasked piece of text. We'll discuss this more later, but the main point is that, unlike, say, learning what's in images, no "explicit labeling" is needed; ChatGPT can in effect learn directly from whatever examples of text it's given.
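
Here is a minimal sketch of that idea -- turning a raw piece of text into (context, next-token) training pairs with the GPT-2 tokenizer. It illustrates the general scheme only, not ChatGPT's actual training pipeline:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Any piece of text can be turned into training examples: mask off the end,
# and ask the model to predict the token that was masked.
text = "The best thing about AI is its ability to learn from examples"
token_ids = tokenizer.encode(text)

examples = []
for i in range(1, len(token_ids)):
    context, target = token_ids[:i], token_ids[i]     # "input" and desired "output"
    examples.append((context, target))

for context, target in examples[:5]:
    print(repr(tokenizer.decode(context)), "->", repr(tokenizer.decode([target])))
```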

Ok, so what does the actual learning process of a neural network look like? At the end of the day, it's all about determining what weights best capture the given training examples. There are various detailed choices and "hyperparameter settings" (so called hyperparameters because the weights can be thought of as "parameters") that can be used to tune how this is done.

There are different choices of loss functions (sum of squares, sum of absolute values, etc.). There are different ways to do loss minimization (how far to move in weight space at each step, etc.). Then there are issues like how large a "batch" of examples to show to get each successive estimate of the loss you're trying to minimize. And, yes, one can apply machine learning (for example, what we do in the Wolfram Language) to automate machine learning -- automatically setting things like hyperparameters.

But in the end, the whole training process is characterized by seeing how the loss gradually decreases (like this progress monitor for small training in the Wolfram Language):

Typically one sees the loss decrease for a while, but eventually level off at some constant value. If that value is small enough, the training can be considered successful; otherwise it's probably a sign that one should try changing the network architecture.

Can one tell how long it will take for the "learning curve" to flatten out? Like so many other things, there seem to be approximate power-law scaling relationships that depend on the size of the neural network and the amount of data used. But the general conclusion is that training a neural network is hard and takes a lot of computational effort. And, as a practical matter, the vast majority of that effort is spent doing operations on arrays of numbers, which is what GPUs are good at -- which is why neural network training is typically limited by the availability of GPUs.

In the future, will there be fundamentally better ways to train neural networks, or to do neural network work in general? I think, almost certainly. The basic idea of ​​a neural network is to create a flexible "computational structure" out of a large number of simple (essentially identical) components, and to allow this "structure" to be incrementally modified in order to learn from examples.

In current neural networks, one basically uses ideas from calculus -- applied to real numbers -- to do this kind of incremental modification. But it's becoming increasingly clear that having high-precision numbers doesn't matter; 8 bits or fewer may be enough, even with current methods.

How a computing system such as a cellular automaton, which essentially operates on many individual bits in parallel, does this kind of incremental modification has never been clear, but there's no reason to think it's not possible. In fact, like the 2012 breakthrough in deep learning, this kind of incremental modification may be easier in more complex cases than in simpler ones.

Neural networks -- perhaps a bit like brains -- are set up to have an essentially fixed network of neurons, with what's modified being the strength ("weight") of the connections between them. (Perhaps in young brains significant numbers of wholly new connections can also grow.) But while this may be a convenient setup for biology, it's not at all clear that it's the best way to achieve the functionality we need. And something involving progressive network rewriting (perhaps reminiscent of our Physics Project) might well ultimately be better.

But even within the framework of existing neural networks there is currently a crucial limitation: neural network training as it's done today is fundamentally sequential, with the effects of each batch of examples being propagated back to update the weights. And indeed, with current computer hardware -- even taking GPUs into account -- most of a neural network is "idle" most of the time during training, with just one part being updated at a time. In a sense this is because our current computers tend to have memory that is separate from their CPUs (or GPUs). But in brains it's presumably different, with every "memory element" (i.e. neuron) also being a potentially active computational element. And if we could set up our future computer hardware this way, it might become possible to do training much more efficiently.

"Of course, a large enough network can do anything!"

Capabilities like ChatGPT seem impressive, and one might imagine that if one could "go ahead" and train larger and larger neural networks, they would eventually be able to "do anything". If people focus on things that are easy for humans to think about directly, then it is likely to be so. However, the lesson of science over the past few hundred years is that some things can be calculated through formal processes, but are not easily obtained by human direct thinking.

Nontrivial mathematics is one big example. But the general case is really computation. And the ultimate issue is the phenomenon of computational irreducibility. There are some computations one might think would take many steps to do, but which can in fact be "reduced" to something quite immediate. But the discovery of computational irreducibility implies that this doesn't always work. And instead there are processes -- perhaps like the one shown here -- where working out what happens inevitably requires tracing each computational step:

The kinds of things we normally do with our brains are presumably specifically chosen to avoid computational irreducibility. It takes special effort to do mathematics in one's brain. And in practice it's largely impossible to "think through" the steps in the operation of any nontrivial program just in one's brain.

Of course, we have computers for this. With computers, we can easily do long, computationally irreducible things. And the key point is that there are generally no shortcuts to these things.

Yes, we can remember many specific examples of what happened in a particular computing system. Perhaps we can even see some ("computationally reducible") patterns that allow us to generalize a bit. But the problem is that computational irreducibility means that we can never guarantee that accidents won't happen -- only by doing the computation explicitly can you know what actually happened in any given situation.

In the end there's a fundamental tension between learnability and computational irreducibility. Learning in effect involves compressing data by leveraging regularities. But computational irreducibility implies that ultimately there's a limit to what regularities there may be.

As a practical matter, one can imagine building small computational devices -- like cellular automata or Turing machines -- into trainable systems like neural networks. And such devices can indeed serve as good "tools" for a neural network, just as Wolfram|Alpha can be a good tool for ChatGPT. But computational irreducibility implies that we can't expect to "get inside" those devices and have them learn.

Or, in other words, there's an ultimate tradeoff between capability and trainability: the more you want a system to make "true use" of its computational capabilities, the more it will show computational irreducibility, and the less trainable it will be. And the more it's fundamentally trainable, the less it will be able to do sophisticated computation.

(For the ChatGPT of today the situation is actually much more extreme, because the neural network used to generate each token of output is a pure "feed-forward" network, without loops, and therefore has no ability to do any kind of computation with nontrivial "control flow".)

Of course, one might wonder whether it's actually important to be able to do irreducible computations. And indeed for much of human history it wasn't particularly important. But our modern technological world has been built on engineering that makes use of at least mathematical computations, and increasingly also more general computations. And if we look at the natural world, it's full of irreducible computation -- that we're slowly understanding how to emulate and use for our technological purposes.

Yes, a neural network can certainly notice the kinds of regularities in the natural world that we might also readily notice with "unaided human thinking". But if we want to work out things that are in the purview of mathematical or computational science, the neural network isn't going to be able to do it -- unless it effectively "uses as a tool" an "ordinary" computational system.

But there's something potentially confusing in all of this. In the past there were plenty of tasks -- including writing articles -- that we assumed were somehow "fundamentally too hard" for computers. And now that we see them done by the likes of ChatGPT, we tend to suddenly think that computers must have become vastly more powerful -- in particular surpassing things they were already basically able to do (like progressively computing the behavior of computational systems such as cellular automata).

But this isn't the right conclusion to draw. Computationally irreducible processes are still computationally irreducible, and are still fundamentally hard for computers -- even if computers can readily compute their individual steps. Instead, what we should conclude is that tasks -- like writing articles -- that we humans could do, but we didn't think computers could do, are actually in some sense computationally easier than we thought.

In other words, the neural network was able to successfully write an essay because writing an essay proved to be a "computationally shallower" problem than we thought. In a sense, this brings us closer to "having a theory" of how we humans do things like write essays, or process language in general.

If you have a big enough neural network, then, yes, you may be able to do whatever humans can readily do. But you won't capture what the natural world in general can do -- or what the tools we've fashioned from the natural world can do. And it's the use of those tools -- both practical and conceptual -- that has allowed us in recent centuries to transcend the boundaries of what's accessible to "pure unaided human thought", and to capture, for human purposes, much more of what's out there in the physical and computational universe.

6. The concept of embedding

Neural networks -- at least in their current setup -- are fundamentally numbers-based. So if we're going to use them for things like text, we need a way to represent our text numerically.

Of course, we could start (essentially as ChatGPT does) by just assigning a number to every word in the dictionary. But there's an important idea -- which is, for example, central to ChatGPT -- that goes further than this. It's the concept of "embeddings". One can think of an embedding as a way to try to represent the "essence" of something by an array of numbers, with the property that "nearby things" are represented by nearby numbers.

So, for example, we can think of a word embedding as trying to lay out words in a kind of "meaning space", in which words that are somehow "nearby in meaning" appear near each other in the embedding. The embeddings actually used in practice -- say in ChatGPT -- involve large lists of numbers. But if we project down to two dimensions, we can show examples of how the embedded words are laid out:

And, yes, what we see does remarkably well at capturing typical everyday impressions. But how can we construct such an embedding? Roughly, the idea is to look at large amounts of text (here 5 billion words from the web) and see "how similar" the "environments" are in which different words appear. So, for example, "alligator" and "crocodile" will often appear almost interchangeably in otherwise similar sentences, which means they'll be placed nearby in the embedding. But "radish" and "eagle" won't tend to appear in otherwise similar sentences, so they'll be placed far apart in the embedding.

But how do you actually implement something like this using a neural network? Let's start by talking about embeddings not for words, but for images. We want to find some way to describe images by lists of numbers so that "images we think are similar" are assigned to similar lists of numbers.

How do we tell if we should "think images are similar"? Well, if our images are, for example, handwritten digits, we might "think two images are similar" if they are the same digit. Earlier, we discussed a neural network trained to recognize handwritten digits. We can think of this neural network as being set up to put images into 10 different bins, one for each digit, in its final output.

But what if we "intercept" what's going on inside the neural network before making the final decision that "it's a '4'"? We might imagine that in a neural network, there are numbers that describe an image as "mostly 4, but a little bit is 2" or something like that. And the idea is to pick out such numbers as embedded elements.

So here's the idea. Rather than directly trying to characterize "what image is near what other image", we instead consider a well-defined task (in this case digit recognition) for which we can get explicit training data -- and then use the fact that in doing this task the neural network implicitly has to make what amount to "nearness decisions". So instead of ever explicitly having to talk about "nearness of images", we're just talking about the concrete question of what digit an image represents, and then we "leave it to the neural network" to implicitly determine what that implies about "nearness of images".

So how does this work in more detail for digit recognition networks? We can think of this network as consisting of 11 consecutive layers, and we can summarize it diagrammatically (activation functions are shown as separate layers):

At the beginning, we feed the first layer the actual image, represented as a 2D array of pixel values. At the last layer, we're given an array of 10 values, which we can think of as representing how "sure" the network is that the image corresponds to each digit from 0 to 9.

If we feed in an image of a handwritten 4, the values of the neurons in the last layer are:

In other words, at this point the neural network is "very sure" that this image is a 4, and to actually get the output "4" we just have to pick out the position of the neuron with the largest value.

But what if we go one step further? The last operation in the network is a so-called softmax, which tries to "enforce certainty". But before that, the value of the neuron is:

The neuron representing "4" still has the highest value. But there is also information in the values ​​of other neurons. We can expect that this list of numbers can in some sense be used to describe the "nature" of the image, thus providing something we can use as an embedding. So, for example, each of the 4's here has a slightly different "signature" (or "feature embedding") - all very different from the 8's:

Here we're essentially using 10 numbers to characterize our images. But it's often better to use much more than that. For example, in our digit-recognition network we can get an array of 500 numbers by tapping into the preceding layer. And this is probably a reasonable array to use as an "image embedding".

If we want to explicitly visualize the "image space" of handwritten digits, we need to "reduce dimensionality", effectively projecting our resulting 500-dimensional vector into, say, three-dimensional space:
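
Here is a hedged PyTorch sketch of this "tap the layer before the final decision" idea; the classifier below is an untrained stand-in (in practice it would be trained on handwritten digits as described), and the images are random placeholders:

```python
import torch
from torch import nn
from sklearn.decomposition import PCA

# A stand-in digit classifier (untrained here; in practice it would be trained
# on handwritten digit images as described in the text).
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 500),   # the layer whose 500 outputs we'll use as an embedding
    nn.ReLU(),
    nn.Linear(500, 10),        # final layer: one score per digit
)

images = torch.rand(100, 1, 28, 28)          # placeholder batch of "digit images"

# "Intercept" the network before its final decision: run everything except
# the last layer to get a 500-number feature vector per image.
embedder = classifier[:-1]
with torch.no_grad():
    embeddings = embedder(images)            # shape: (100, 500)

# Project the 500-dimensional embeddings down to 3D for visualization.
coords_3d = PCA(n_components=3).fit_transform(embeddings.numpy())
print(coords_3d.shape)                       # (100, 3)
```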

We just talked about creating a feature (and thus an embedding) for images, effectively based on identifying similarities between images, determining (according to our training set) whether they correspond to the same handwritten digit. If we have a training set that identifies, say, each image as belonging to 5000 common types of objects (cats, dogs, chairs...), we can do the same for images more generally.

In this way, we can make an image embedding that's "anchored" by our identification of common objects, but then "generalizes around that" according to the behavior of the neural network. The point is that insofar as that behavior aligns with how we humans perceive and interpret images, this will end up being an embedding that "seems right to us", and is useful in practice for doing "human-judgment-like" tasks.

OK, so how do we follow the same kind of approach to find embeddings for words? The key is to start from a task about words that we can readily train on. And the standard such task is "word prediction". Say we're given "the ___ cat". Based on a large corpus of text (say, the text content of the web), what are the probabilities for the different words that might "fill in the blank"? Or, alternatively, given "___ black ___", what are the probabilities for different "flanking" words?

How do we set this problem up for a neural network? Ultimately we have to formulate everything in terms of numbers. One way to do this is just to assign a unique number to each of the 50,000 or so common words in English. So, for example, "the" might be 914, and "cat" (with a space before it) might be 3542. (These are the actual numbers used by GPT-2.) So for the "the ___ cat" problem, our input might be {914, 3542}. What should the output be like? Well, it should be a list of 50,000 or so numbers that effectively give the probabilities for each of the possible "fill-in" words.
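
Here is a minimal sketch of just the input/output shapes of this setup. The token IDs 914 and 3542 are the ones quoted above; everything else (the embedding table, the "scoring") is a random, untrained stand-in, so the resulting probabilities are meaningless -- the point is only the shape of the problem:

```python
import numpy as np

VOCAB = 50_000                       # roughly the number of common words
context = np.array([914, 3542])      # "the", " cat" (the IDs quoted in the text)

rng = np.random.default_rng(0)
embed = rng.normal(size=(VOCAB, 64)) # untrained stand-in embedding table

# A real model would run these embeddings through many layers;
# here we just average them and score every word in the vocabulary.
hidden = embed[context].mean(axis=0)
scores = embed @ hidden
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.shape)                   # (50000,) -- one probability per word
print(probs.argsort()[-5:][::-1])    # IDs of the five "most likely" fillers
```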

Again, to find an embedding, we "intercept" the "inside" of the neural network before it "reaches its conclusion" - and then pick up the list of numbers that appear there, which we can think of as "the feature".

Okay, so what do these representations look like? Over the past 10 years a range of different systems have been developed (word2vec, GloVe, BERT, GPT, ...), each based on a different neural network approach. But in the end, all of them take words and characterize them by lists of hundreds to thousands of numbers.

In their raw form, these "embedding vectors" are fairly uninformative. For example, here are the raw embedding vectors produced by GPT-2 for three specific words:

If we do something like measure the distance between these vectors, then we can discover things like the "closeness" of words. We will discuss in more detail later what we might think of as the "cognitive" meaning of this embedding. But now the main point is that we have a way to efficiently turn words into "neural network friendly" collections of numbers.
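
For example, a common way to "measure the distance" between embedding vectors is cosine similarity. The sketch below uses random stand-in vectors just to show the computation; with real embeddings (say, GPT-2's), one would expect "cat" and "dog" to come out closer to each other than either does to "chair":

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 means 'pointing the same way', near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
cat, dog, chair = rng.normal(size=(3, 768))   # stand-ins for real embedding vectors

print(cosine_similarity(cat, dog))
print(cosine_similarity(cat, chair))
```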

But actually, we can go one step further and not just describe words as collections of numbers; we can also describe sequences of words, or entire blocks of text. In ChatGPT, that's how it handles things.

It takes the text it has so far and generates an embedding vector to represent it. Then, its goal is to find the probabilities of different words that might come next. It represents its answer as a list of numbers that basically gives probabilities for 50,000 or so possible words.

(Strictly speaking, ChatGPT does not deal with words, but with "tokens" -- convenient linguistic units that might be whole words, or might just be pieces like "pre" or "ing" or "ized". Working with tokens makes it easier for ChatGPT to handle rare, compound, and non-English words, and, sometimes, for better or worse, to invent new words.)

 7. Inside ChatGPT

Alright, we're finally ready to talk about what's inside ChatGPT. And, yes, ultimately it's a giant neural network -- currently a version of the so-called GPT-3 network, with 175 billion weights. In many ways it's very much like the other neural networks we've discussed. But it's a neural network specifically set up to deal with language. And its most notable feature is a neural network architecture called a "transformer".

In the first neural networks we discussed above, every neuron in any given layer was basically connected (with at least some weight) to every neuron in the layer before. But a fully connected network like that is (presumably) overkill if one is dealing with data that has particular, known structure. And so, for example, in the early stages of dealing with images, it's typical to use so-called convolutional neural networks ("convnets"), in which neurons are effectively laid out on a grid analogous to the pixels in the image -- and connected only to nearby neurons on the grid.

The idea of a transformer is to do something at least somewhat similar for the sequence of tokens that make up a piece of text. But instead of just defining a fixed region of the sequence over which connections can be made, the transformer introduces the notion of "attention" -- the idea of "paying attention" more to some parts of the sequence than to others. Maybe one day it will make sense to just start up a generic neural network and do all the customization through training. But at least for now, it seems to be critical in practice to "modularize" things, as transformers do, and probably as our brains do too.

Alright, so what does ChatGPT (or, rather, the GPT-3 network it's based on) actually do? Recall that its overall goal is to continue text in a "reasonable" way, based on what it's seen in its training (which involved looking at billions of pages of text from the web, etc.). So at any given point, it has a certain amount of text, and its goal is to come up with an appropriate choice for the next token to add.

It operates in three basic phases:

First, it takes the sequence of tokens corresponding to the text so far and finds an embedding (i.e. an array of numbers) representing those tokens. Second, it operates on this embedding in "standard neural network fashion", with values "passed through" successive layers in the network, producing a new embedding (i.e. a new array of numbers). Then, from the last part of this array, it generates an array of about 50,000 values, which become the probabilities of the different possible next tokens.

(And, yes, it so happens that the number of tokens used is about the same as the number of common words in English, though only about 3000 of the tokens are whole words, and the rest are fragments.) The key point is that every part of this pipeline is implemented by a neural network whose weights are determined by end-to-end training of the network. In other words, in effect nothing except the overall architecture is "explicitly engineered"; everything is just "learned" from the training data.
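
Here is a heavily scaled-down sketch of those three phases, with random untrained weights and a plain tanh layer standing in for the real attention-plus-MLP blocks -- so it shows only the shape of the pipeline, not actual ChatGPT behavior:

```python
import numpy as np

VOCAB, D_EMBED, N_LAYERS = 50_000, 128, 4    # scaled down (GPT-2 uses 768 dims, 12 blocks)
rng = np.random.default_rng(0)

embed_table = rng.normal(size=(VOCAB, D_EMBED)) * 0.02
layer_weights = [rng.normal(size=(D_EMBED, D_EMBED)) * 0.02 for _ in range(N_LAYERS)]
unembed = rng.normal(size=(D_EMBED, VOCAB)) * 0.02

def next_token_probs(token_ids):
    # Phase 1: tokens -> embeddings
    x = embed_table[np.array(token_ids)]
    # Phase 2: pass the embeddings through successive layers
    for W in layer_weights:
        x = np.tanh(x @ W)                    # stand-in for attention + MLP blocks
    # Phase 3: last position -> ~50,000 scores -> probabilities
    logits = x[-1] @ unembed
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(next_token_probs([914, 3542]).shape)    # (50000,)
```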

However, there are many details in how the architecture is set up, reflecting various experiences and lore of neural networks. And, while this is certainly getting into the weeds, I think it's useful to talk about some of the details, especially to understand what it takes to build something like ChatGPT.

First comes the embedding module. Here is a schematic Wolfram Language representation of it for GPT-2:

The input is a vector of n tokens (represented, as in the previous section, by integers from 1 to about 50,000). Each of these tokens is converted (by a single-layer neural network) into an embedding vector (of length 768 for GPT-2 and 12,288 for ChatGPT's GPT-3). Meanwhile, there is a "secondary pathway" that takes the sequence of (integer) positions of the tokens and creates another embedding vector from those integers. Finally, the embedding vectors from the token values and the token positions are added together -- producing the final sequence of embedding vectors from the embedding module.

Why just add the token-value and token-position embedding vectors together? I don't think there's any particular science to this. It's just that various different things have been tried, and this is one that seems to work. And it's part of the lore of neural networks that, in some sense, so long as the setup one has is "roughly right", it's usually possible to home in on the details just by doing sufficient training, without ever really needing to "understand at an engineering level" quite how the neural network ends up configuring itself.
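
A minimal sketch of what the embedding module computes, with random stand-ins for the two learned tables (the sizes follow the GPT-2 numbers quoted above):

```python
import numpy as np

VOCAB, N_CTX, D = 50_000, 1024, 768          # GPT-2-like sizes
rng = np.random.default_rng(0)
token_embed = rng.standard_normal((VOCAB, D), dtype=np.float32) * 0.02  # stand-in learned table
pos_embed = rng.standard_normal((N_CTX, D), dtype=np.float32) * 0.02    # stand-in learned table

tokens = np.array([914, 3542])               # "the", " cat"
positions = np.arange(len(tokens))           # 0, 1, ...

# The embedding module's output: token-value embedding + token-position embedding
x = token_embed[tokens] + pos_embed[positions]
print(x.shape)                               # (2, 768)
```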

Here is what the embedding module does, operating on a string of repeated "hello"s followed by repeated "bye"s:

The elements of the embedding vector for each token are shown down the page; across the page we see first a run of "hello" embeddings, followed by a run of "bye" ones. The second array above is the positional embedding -- its somewhat-random-looking structure being just what happened to be learned (in this case by GPT-2).

Well, after the embedding module comes the "main event" of the transformer: a sequence of so-called "attention blocks" (12 for GPT-2, 96 for ChatGPT's GPT-3). It's all quite complicated -- reminiscent of typical hard-to-understand large-scale engineered systems, or, for that matter, biological systems. But anyway, here is a schematic representation of a single "attention block" (for GPT-2):

Within each such attention block there is a collection of "attention heads" (12 for GPT-2, 96 for ChatGPT's GPT-3) -- each of which operates independently on a different chunk of the values in the embedding vector. (And, yes, we don't know any particular reason why it's a good idea to split up the embedding vector, or what the different parts of it "mean"; it's just one of those things that has been "found to work".)

Alright, so what do attention heads do? Basically, they are a way to "look back" in a sequence of tokens (that is, in the text produced so far), and "package" the past into a form that facilitates finding the next token.

In the first section above, we talked about using 2-gram probabilities to pick words based on their immediate predecessors. What the "attention" mechanism in the transformer does is allow "attention" to even earlier words -- so it is possible to capture the way, say, verbs can refer to nouns that appear many words before them in a sentence.

At a more detailed level, what an attention head does is to recombine the chunks of the embedding vectors associated with different tokens, with certain weights. So, for example, the 12 attention heads in the first attention block (of GPT-2) have the following ("look-back-all-the-way-to-the-beginning-of-the-sequence-of-tokens") patterns of "recombination weights" for the "hello, bye" string above:
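
Here is a minimal sketch of a single attention head of the standard scaled-dot-product kind, with random stand-in weight matrices; the "recombination weights" are the softmax-normalized scores, and the causal mask is what enforces "only look back":

```python
import numpy as np

def attention_head(x, d_head, rng):
    """One attention head: each position looks back over the sequence and
    recombines (a chunk of) the embedding vectors with learned weights.
    The weight matrices here are random stand-ins for trained ones."""
    d_model = x.shape[1]
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.02 for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d_head)
    # Causal mask: a position may only "look back", never forward.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # the "recombination weights"
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 768))        # 5 tokens, GPT-2-sized embeddings
out = attention_head(x, d_head=64, rng=rng)
print(out.shape)                     # (5, 64)
```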

After being processed by the attention heads, the resulting "re-weighted embedding vector" (of length 768 for GPT-2 and 12,288 for ChatGPT's GPT-3) is passed through a standard "fully connected" neural network layer. It's hard to get a handle on what this layer is doing. But here is a plot of the 768×768 weight matrix it uses (here for GPT-2):

With a 64×64 moving average, some (random walk-like) structure starts to emerge:

What determines this structure? Ultimately it's probably some "neural network encoding" of features of human language. But as of now, it's far from clear what those features might be. In effect, we're "opening up the brain of ChatGPT" (or at least GPT-2) and discovering that, yes, it's complicated in there, and we don't understand it -- even though in the end it produces recognizable human language.

Well, after going through one attention block, we get a new embedding vector - which is then passed successively to other attention blocks (GPT-2 has 12 in total; GPT-3 has 96). Each attention block has its own specific "attention" and "fully connected" weighting modes. Here is the sequence of attention weights for the "hello, bye" input of GPT-2, for the first attention head:

Here is the (moving average) "matrix" for the fully connected layer:

Curiously, although these "weight matrices" look similar in different attention blocks, the size distribution of the weights can be somewhat different (and not always Gaussian):

So what is the net effect of the transformer, after passing through all these attention blocks? Essentially, it transforms the original collection of embeddings for the sequence of tokens into a final collection. And the particular way ChatGPT works is then to pick up the last embedding in this collection and "decode" it to produce a list of probabilities for what the next token should be.

So that's an outline of what's inside ChatGPT. It may seem complicated (not least because of its many inevitably somewhat arbitrary "engineering choices"), but actually the ultimate elements involved are remarkably simple. Because in the end what we're dealing with is just a neural network made of "artificial neurons", each doing the simple operation of taking a collection of numerical inputs and combining them with certain weights.

The raw input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far), and when ChatGPT "runs" to produce a new token, all that happens is that these numbers "ripple through" the layers of the neural network, with each neuron "doing its thing" and passing the result on to the neurons in the next layer. There is no looping or "going back". Everything just "feeds forward" through the network.

This is a very different setup from a typical computing system -- a Turing machine -- where results are repeatedly "reprocessed" by the same computing elements. Here, each computational element (i.e., neuron) is used only once, at least when generating a specific output symbol.

But in a sense, even in ChatGPT there is still an "outer loop" that reuses computational elements. Because when ChatGPT is going to generate a new token, it always "reads" (i.e. takes as input) the whole sequence of tokens that came before it, including the tokens ChatGPT itself has "written" previously. And we can take this setup to mean that ChatGPT does -- at least at its outermost level -- involve a "feedback loop", albeit one in which every iteration is explicitly visible as a token appearing in the text it generates.
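
A minimal sketch of that "outer loop": one feed-forward pass per new token, with the chosen token appended and fed back in (the "model" here is just a uniform toy distribution):

```python
import numpy as np

def generate(model, prompt_tokens, n_new, rng):
    """The 'outer loop': each new token is produced by one feed-forward pass
    over the whole sequence so far, then appended and fed back in.
    `model` is any function mapping a token sequence to next-token probabilities."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = model(tokens)                    # one pass through the network
        next_token = int(rng.choice(len(probs), p=probs))
        tokens.append(next_token)                # the "feedback loop"
    return tokens

# Toy stand-in model over a 10-token vocabulary (uniform probabilities).
toy_model = lambda toks: np.full(10, 0.1)
print(generate(toy_model, [3, 7], n_new=5, rng=np.random.default_rng(0)))
```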

But let's come back to the heart of ChatGPT: the neural network that's used repeatedly to generate each token. At some level it's very simple: a whole collection of identical artificial neurons. Some parts of the network just consist of ("fully connected") layers of neurons in which every neuron on a given layer is connected (with some weight) to every neuron on the layer before. But particularly with its transformer architecture, ChatGPT has parts with more structure, in which only specific neurons on different layers are connected. (Of course, one could still say that "all neurons are connected" -- but some just have zero weight.)

Furthermore, there are aspects of the neural network in ChatGPT that aren't most naturally thought of as just consisting of "homogeneous" layers. For example, as the schematic above indicates, inside an attention block there are places where "multiple copies are made" of incoming data, with each copy then going through a different "processing path", potentially involving a different number of layers, before being recombined. But while this may be a convenient representation of what's going on, it's always at least in principle possible to think of the layers as "densely filled in", just with some weights being zero.

If we look at the longest path of ChatGPT, there are around 400 (core) layers involved - not a huge number in some ways. But there are millions of neurons -- 175 billion connections in total, so 175 billion weights. One thing to realize is that every time ChatGPT generates a new token, it does a calculation involving each of these weights.

In terms of implementation, these calculations can be organized "by layer" into highly parallel array operations, which can be conveniently done on the GPU. However, there are still 175 billion calculations (and a little more at the end) to be performed for each token produced - so yes, it's not surprising that generating a long piece of text with ChatGPT takes a while.

But in the end, the most remarkable thing is that all of these operations -- individually as simple as they are -- can somehow together manage to do such a good "human-like" job of generating text. It has to be emphasized again that (at least so far as we know) there's no "ultimate theoretical reason" why anything like this should work. And in fact, as we'll discuss, I think we have to view this as a potentially surprising scientific discovery: that somehow, in a neural network like ChatGPT, it's possible to capture the essence of what human brains manage to do in generating language.

8. ChatGPT training

OK, so we've now given an outline of how ChatGPT works once it has been set up. But how did it get set up? How were all those 175 billion weights in its neural network determined? Basically, they're the result of very large-scale training, based on a huge corpus of text -- on the web, in books, etc. -- written by humans.

As we said, it is not at all obvious whether a neural network will be able to successfully generate "human-like" text, even given all the training data. And, again, it seems like detailed engineering is required to make this happen. But the biggest surprise and discovery of ChatGPT is that it is possible. In fact, a neural network with "only" 175 billion weights can make a "reasonable model" of text written by humans.

In modern times, there are many human-written texts that exist in digital form. The public web has pages written by at least a few billion people, totaling perhaps a trillion words of text. These numbers are likely to be at least 100 times larger when including non-public pages. To date, there are more than 5 million digitized books available (out of a total of 100 million or so books ever published), and another 100 billion or so words.

As a personal comparison, my total lifetime output of published material is a bit under 3 million words; over the past 30 years I've written about 15 million words of email and altogether typed perhaps 50 million words; and in just the past couple of years I've spoken more than 10 million words on livestreams. (And, yes, I'll train a bot from all of that.)

But, OK, given all this data, how do we train a neural network from it? The basic process is very much as we discussed in the simple examples above. You present a batch of examples, and then you adjust the weights in the network to minimize the network's error ("loss") on those examples. The main thing that's expensive about "backpropagating" from the error is that each time you do this, every weight in the network will typically change at least a little, and there are just a huge number of weights to deal with. (The actual "backward computation" is typically only a small constant factor harder than the forward one.)
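
Here is the basic loss-minimization loop in miniature -- a single linear "network" trained by gradient descent on a made-up batch; real training differs enormously in scale, but the present-examples / compute-loss / nudge-every-weight cycle is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))              # a batch of 64 examples
y_true = X @ np.array([2.0, -1.0, 0.5])   # targets produced by a known rule

w = np.zeros(3)                           # the "weights" we are training
lr = 0.1
for step in range(200):
    y_pred = X @ w                        # forward pass
    loss = np.mean((y_pred - y_true) ** 2)
    grad = 2 * X.T @ (y_pred - y_true) / len(X)   # "backpropagated" gradient
    w -= lr * grad                        # nudge every weight a little
print(w.round(3), round(loss, 6))         # w approaches [2, -1, 0.5]
```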

With modern GPU hardware, it is trivial to compute results from thousands of examples in parallel. However, when it comes to actually updating the weights in a neural network, current methods require us to basically do it batch by batch. (Yes, it's possible that the actual brain - its combination of computing and memory elements - currently has at least one architectural advantage).

Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we found we often had to use millions of examples to successfully train a network, at least from scratch. So how many examples does this mean we'll need in order to train a "human-like-language" model? There doesn't seem to be any fundamental "theoretical" way to know. But in practice ChatGPT was successfully trained on a few hundred billion words of text.

Some text is entered multiple times, some only once. But somehow it "got what it needed" from the text it saw. But how large a network should it need to be to "learn well" given the amount of text it needs to learn? Again, we don't yet have a basic theoretical way to tell.

Ultimately -- and we'll discuss this further below -- human language, and what humans typically say with it, presumably has some kind of overall "algorithmic content". But the next question is how efficient a neural network will be at implementing a model based on that algorithmic content. We don't know that either -- although the success of ChatGPT suggests it's reasonably efficient.

In the end we can just note that ChatGPT does what it does using a couple hundred billion weights -- comparable in number to the total number of words (or tokens) of training data it was given. In some ways it's perhaps surprising (though it has also been empirically observed in smaller analogs of ChatGPT) that the "size of network" that seems to work well is so comparable to the "size of the training data". After all, it's certainly not that somehow "inside ChatGPT" all that text from the web, books and so on is "directly stored". Because what's actually inside ChatGPT is a bunch of numbers -- with a bit less than 10 digits of precision each -- that are some kind of distributed encoding of the aggregate structure of all that text.

In other words, we can ask what is the "effective information content" of human language, and what is usually said in it. Here is the original corpus of language instances. Then there is the representation in ChatGPT's neural network. This representation is likely to be far from the "algorithmically smallest" representation (as we discuss below). But it is a representation that is easily used by neural networks. In this representation, the "compression" of the training data appears to be low; on average, it appears that less than one neural network weight is needed to carry the "information content" of the training data for one word.

When we run ChatGPT to generate text, we're basically having to use each weight once. So if there are n weights, we've got of order n computational steps to do -- though in practice many of them can be done in parallel on GPUs. But if we need about n words of training data to set up those weights, then from what we've said above we can conclude that we'll need about n² computational steps to do the training of the network -- which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.
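
As a rough back-of-the-envelope version of that argument (taking the number of training tokens as comparable to the number of weights, as the text suggests):

```python
n_weights = 175_000_000_000          # ~175 billion weights
n_words   = 175_000_000_000          # training tokens, taken here as comparable to n

ops_per_token_generated = n_weights            # ~n operations per generated token
training_ops = n_weights * n_words             # ~n^2, as argued above

print(f"{ops_per_token_generated:.1e}")        # ~1.8e11 per generated token
print(f"{training_ops:.1e}")                   # ~3.1e22 for training
```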

9. Beyond Basic Training

Much of the effort of training ChatGPT is spent "showing it" large amounts of existing text from the web, books, etc. But it turns out there's another -- apparently rather important -- part as well.

Once it has finished "raw training" on the raw corpus it was shown, the neural network inside ChatGPT can start generating its own text, proceeding to prompts, etc. But while the results of doing so often seem plausible, they often — especially for longer texts — “loosen” in ways that are often quite inhuman. This is not something one can easily discover, say, by doing traditional statistics on the text. But it's something that's easy to notice for someone actually reading the text.

And a key idea in the construction of ChatGPT was to have another step after "passively reading" things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on "how to be a good chatbot".

But how does the neural network make use of that feedback? The first step is simply to have humans rate the results from the neural network. But then another neural network model is built to try to predict those ratings. And now this prediction model can be run -- essentially like a loss function -- on the original network, in effect allowing that network to be "tuned up" by the human feedback that has been given. And the results in practice seem to have a big effect on the success of the system in producing "human-like" output.
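
As a very rough sketch of just that idea -- fit a small model to human ratings, then use its predictions as a score for new outputs (real systems use a neural reward model and reinforcement-learning-style fine-tuning, none of which is shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend "features" of 200 model outputs, plus a human rating for each one.
output_features = rng.normal(size=(200, 16))
human_ratings = output_features @ rng.normal(size=16) + rng.normal(scale=0.1, size=200)

# Fit a tiny linear "reward model" that predicts the human ratings.
reward_weights, *_ = np.linalg.lstsq(output_features, human_ratings, rcond=None)

def predicted_reward(features):
    """Score a new output roughly the way a human rater would."""
    return float(features @ reward_weights)

# This predicted reward can then stand in for a loss signal when tuning the
# original network: prefer outputs with higher predicted reward.
print(predicted_reward(rng.normal(size=16)))
```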

In general, it's interesting how little "poking" the "originally trained" network seems to need to get it to usefully go in particular directions. One might have thought that to have the network behave as if it's "learned something new", one would have to run a training algorithm, adjust the weights, and so on.

But that's not the case. Instead, it seems to be basically sufficient to tell ChatGPT something once, as part of the prompt you give it, and then it can successfully make use of what you told it when it generates text. And once again, I think this is an important clue in understanding what ChatGPT is "really doing" and how it relates to the structure of human language and thought.

There's certainly something rather human-like about this: that at least once it's had all that pre-training, you can tell it something just once and it can "remember it" -- at least "long enough" to use it in generating a piece of text. So what's going on in a case like this?

It could be that "everything you could possibly tell it is already there" - you're just directing it to the right place. But that doesn't seem to be plausible. Instead, it seems more likely that, yes, the elements are already there, but the specifics are defined by something like "trajectories between these elements", which is what you tell it.

In fact, like humans, if you tell it something weird, unexpected, and totally out of the framework it knows, it doesn't seem to be able to successfully "integrate" this. It can only "integrate" it if it basically rides on top of the frame it already has in a fairly simple way.

It's also worth pointing out again that there are inevitably "algorithmic limits" to what the neural network can "pick up". Tell it "shallow" rules of the form "this goes to that", and the neural network will most likely be able to represent and reproduce them just fine -- and indeed what it "already knows" from language will give it an immediate pattern to follow.

But it won't work if you try to formulate an actual "deep" computational rule for it, involving many potentially irreducible computational steps. (Remember, at each step, it's always "feeding data forward" in its network; never looping, except to generate new tokens.)

Of course, the network could learn the answer to a specific "irreducible" computation. But as soon as there are combinatorial numbers of possibilities, no such "table-lookup-style" approach will work. And so, yes, just like humans, it's time for neural networks to "reach out" and use actual computational tools. (And, yes, Wolfram|Alpha and the Wolfram Language are uniquely suitable here, because they've been built to "talk about things in the world", just like the language-model neural networks.)

10. What really makes ChatGPT work?

Human language—and the thought processes that produce it—have always seemed to represent a pinnacle of complexity. In fact, it seems somewhat remarkable that the human brain — with a network of "only" 100 billion or so neurons (and maybe another 100 trillion connections) — is capable of doing the job. Perhaps, one might imagine, there is something more to the brain than a network of neurons, like some new undiscovered layer of physics.

But now with ChatGPT, we have an important new piece of information: We know that a purely artificial neural network with as many connections as the brain has neurons can generate human language surprisingly well.

And, yes, it's still a large and complex system -- with as many weights in neural networks as there are currently words in the world. But in a way it still seems hard to believe that all the richness of language and what it can talk about can be encapsulated in such a limited system.

Part of this no doubt reflects the ubiquitous phenomenon (which first became evident in the example of rule 30) that computational processes can in effect greatly amplify the apparent complexity of a system, even when its underlying rules are very simple. But, as we discussed above, neural networks of the kind used in ChatGPT tend to be specifically constructed to restrict the effect of this phenomenon -- and the computational irreducibility associated with it -- in the interest of making their training more tractable.

So how is it, then, that something like ChatGPT gets as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow much simpler than it seems. And this means that ChatGPT -- even with its ultimately straightforward neural network structure -- is successfully able to "capture the essence" of human language and the thinking behind it. Moreover, in its training, ChatGPT has somehow "implicitly discovered" whatever regularities in language (and thinking) make this possible.

In my opinion, the success of ChatGPT provides us with a fundamental and important piece of scientific evidence: it shows that we can expect significant new "laws of language" -- and effective "laws of thought" -- to be discovered there. In ChatGPT, being a neural network, these regularities are implicit at best. But if we could somehow make these laws explicit, it would be possible to do the various things that ChatGPT does in a more direct, efficient, and transparent way.

But okay, so what might those laws look like? Ultimately they must give us some kind of prescription for how language -- and the things we say with it -- can be put together. We'll discuss later how "observing ChatGPT" may give us some hints about this, and how what we learn from building computational language can inform our way forward. But first let's discuss two long-known examples of what amount to "laws of language" -- and how they relate to the operation of ChatGPT.

The first is the syntax of the language. Language is not just a random collection of words. Instead, there are (fairly) clear grammatical rules about how different kinds of words fit together: in English, for example, a noun can be preceded by an adjective and followed by a verb, but usually no two nouns can be right next to each other. Such grammatical structures can be captured (at least approximately) by a set of rules defining how to put together what amounts to a "parse tree":

ChatGPT does not have any explicit "knowledge" of such rules. But during training, it implicitly "discovered" these rules, and then seemed to be pretty good at following them. So, how does it work? On a "big picture" level, this is not clear. But for some inspiration, it might be instructive to look at a simpler example.

Consider a "language" consisting of sequences of () and () whose grammar dictates that parentheses should always be balanced, as represented by a parse tree:

Can we train a neural network to generate "grammatically correct" parenthesis sequences? There are many ways to process sequences in a neural network, but let's use a transformer network, like ChatGPT does. Given a simple transformer network, we can start by feeding it syntactically correct parenthesis sequences as training examples.

A subtlety (which actually also appears in ChatGPT's production of human language) is that in addition to our "content tokens" (here "(" and ")"), we have to include an "End" token, whose generation indicates that the output shouldn't continue any further (i.e. for ChatGPT, that we've reached the "end of the story").

If we set up a transformer net with just one attention block with 8 heads and feature vectors of length 128 (ChatGPT also uses feature vectors of length 128, but has 96 attention blocks, each with 96 heads), it seems impossible to get it to learn much about the parenthesis language. But with 2 attention blocks, the learning process seems to converge -- at least after it's been given 10 million or so examples (and, as is common with transformer nets, showing it yet more examples just seems to degrade its performance).
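
Here is a minimal sketch of what the training data for such an experiment might look like: randomly generated balanced-parenthesis sequences, each terminated with the extra "End" token discussed above (the generation scheme is my own simple choice, not necessarily the one used for the experiment described):

```python
import random

END = "End"   # the extra "end" token

def random_balanced(max_pairs, rng):
    """Generate one grammatically correct parenthesis sequence, plus the End token."""
    seq, open_count, remaining = [], 0, rng.randint(1, max_pairs)
    while remaining > 0 or open_count > 0:
        if remaining > 0 and (open_count == 0 or rng.random() < 0.5):
            seq.append("(")       # open a new pair
            open_count += 1
            remaining -= 1
        else:
            seq.append(")")       # close the most recent unmatched "("
            open_count -= 1
    return seq + [END]

rng = random.Random(0)
for _ in range(3):
    print("".join(t if t != END else " End" for t in random_balanced(6, rng)))
```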

So, for this network, we can do a similar job to ChatGPT and ask the probability of what the next token should be - in a sequence of parentheses:

In the first case, the network is "pretty sure" that the sequence can't end here -- which is good, because if it did, the parentheses would be left unbalanced. In the second case, it "correctly recognizes" that the sequence can end here, though it also "points out" that it's possible to "start again", putting down a "(", presumably to be followed later by a ")". But, oops, even with its 400,000 or so laboriously trained weights, it says there's a 15% probability of having ")" as the next token -- which isn't right, because that would necessarily lead to unbalanced parentheses.

And if we ask the network for its highest-probability completions of progressively longer sequences of parentheses, we get the following:

And, yes, up to a certain length the network does just fine. But then it starts failing. This is a pretty typical kind of thing to see in a "precise" situation like this with a neural network (or with machine learning in general). Cases that a human "can solve at a glance" the neural network can solve too. But cases that require doing something "more algorithmic" (like explicitly counting parentheses to see whether they're closed) the neural network tends to somehow be "too computationally shallow" to do reliably. (By the way, even the current full ChatGPT has a hard time correctly matching parentheses in long sequences.)
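
For contrast, the "explicitly algorithmic" check is trivial to write down as a program -- a running counter that a shallow, feed-forward network has no direct way to maintain over arbitrarily long sequences:

```python
def is_balanced(seq):
    """Explicitly count unclosed '(' characters -- the 'algorithmic' check."""
    depth = 0
    for ch in seq:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with nothing left to close
                return False
    return depth == 0

print(is_balanced("(())()"))       # True
print(is_balanced("(()"))          # False
print(is_balanced("())("))         # False
```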

So what does this mean for something like ChatGPT and the syntax of a language like English? The parenthesis language is austere -- and much more of an "algorithmic story". But in English it's much more realistic to be able to "guess" what's going to fit grammatically on the basis of local choices of words and other hints.

And, yes, the neural network is much better at this -- even though it might miss some "formally correct" cases that, well, humans might miss too. But the main point is that the fact that language has an overall syntactic structure -- with all the regularity that implies -- in a sense limits "how much" the neural network has to learn. And a key "science-like" observation is that the transformer architecture of neural networks like the one in ChatGPT seems to be able to successfully learn this kind of nested, tree-like syntactic structure.

Syntax provides one kind of constraint on language. But there are clearly more. A sentence like "Inquisitive electrons eat blue theories for fish" is grammatically correct, but it's not something one would normally expect to say, and it wouldn't be considered a success if ChatGPT generated it -- because, well, with the normal meanings of the words in it, it's basically meaningless.

However, is there a general way to tell if a sentence makes sense? There is no traditional holistic theory of this. However, we can argue that ChatGPT has implicitly "developed a theory" after being trained on billions of (possibly meaningful) sentences from the web.

What might this theory look like? Well, there's one little corner that's basically been known for two thousand years, and that's logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not.

So, for example, it is reasonable to say "All X are Y; this is not Y, so it is not X" (as in "All fish are blue; this is not blue, so it is not a fish"). And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going ("machine-learning-style") through lots of examples of rhetoric, so too one can imagine that in its training ChatGPT will have been able to "discover syllogistic logic" by looking at lots of text on the web, etc.

(And, yes, while we can therefore expect ChatGPT to produce text containing "correct inferences" based on things like syllogistic logic, the story is quite different when it comes to more sophisticated formal logic -- and I think we can expect it to fail here for the same kinds of reasons it fails at parenthesis matching.)

But beyond the narrow example of logic, what can be said about how to systematically construct (or recognize) even plausibly meaningful text? Yes, there are things like Mad Libs that use very specific "phrasal templates". But somehow ChatGPT implicitly has a much more general way to do it. And maybe there's nothing to be said about how it can be done beyond "it somehow happens when you have 175 billion neural network weights". But I strongly suspect there's a much simpler and stronger story.

11. Meaningful space and the law of semantic movement

We discussed above that inside ChatGPT any piece of text is effectively represented by an array of numbers, which we can think of as the coordinates of a point in some kind of "linguistic feature space". So when ChatGPT continues a piece of text, this corresponds to tracing out a trajectory in linguistic feature space. But now we can ask what it is that makes this trajectory correspond to text we consider meaningful. Might there perhaps be some kind of "semantic laws of motion" that define -- or at least constrain -- how points in linguistic feature space can move around while remaining "meaningful"?

So, what does this linguistic feature space look like? Here is an example of how individual words (here common nouns) are laid out if we project such a feature space onto a 2D space:

Another example we saw above was based on words representing plants and animals. But the point in both cases is that "semantically similar words" are placed nearby.

As another example, here is how words corresponding to different parts of speech get laid out:

Of course, a given word doesn't in general have just "one meaning" (or necessarily correspond to just one part of speech). And by looking at how sentences containing a word lay out in feature space, we can often "tease apart" the different meanings -- as in the example here for the word "crane" ("bird" or "machine"?):

OK, so it's at least plausible to think of this feature space as placing "words with similar meanings" close together in the space. But what kind of additional structure can we identify in it? Is there, for example, some notion of "parallel transport" that would reflect "flatness" in the space? One way to get a handle on that is to look at analogies:

And, yes, even when we project into two dimensions, there is often at least a "hint of flatness", although it's certainly not universally visible.
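
A toy version of the idea, with hand-made 2-D "embeddings" chosen so that the pairs differ by roughly the same offset (real word embeddings live in hundreds of dimensions, and the effect is only approximate there):

```python
import numpy as np

# Hand-made 2-D "embeddings", constructed so the pairs differ by roughly the
# same offset -- the kind of "flatness" the analogies above hint at.
emb = {
    "man":   np.array([1.0, 1.0]), "woman": np.array([1.0, 3.0]),
    "king":  np.array([4.0, 1.1]), "queen": np.array([4.1, 3.0]),
}

guess = emb["king"] - emb["man"] + emb["woman"]   # "man is to king as woman is to ?"
closest = min(emb, key=lambda w: np.linalg.norm(emb[w] - guess))
print(closest)                                    # -> "queen"
```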

So, what about the trajectory? We can look at the trajectory of ChatGPT's hints in feature space - and then we can look at how ChatGPT continues this trajectory:

There are of course no "geometrically obvious" laws of motion here. This isn't surprising at all; we can fully expect this to be a rather complicated story. And, for example, even if a "semantic law of motion" could be found, it is far from obvious what embedding (or, indeed, what "variables") it most naturally expresses.

In the image above, we show several steps in the "trajectory" - at each step we pick the word that ChatGPT thinks is the most likely (the "zero temperature" case). But we can also ask, at a certain point, which words can be "next" with what probability:
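
In code terms, the difference is just this (the words and probabilities below are made up for illustration):

```python
import numpy as np

# A made-up next-word distribution at one point along the trajectory.
words = ["the", "a", "its", "their", "this"]
probs = np.array([0.42, 0.27, 0.15, 0.10, 0.06])

# "Zero temperature": always take the single most probable word.
print(words[int(np.argmax(probs))])               # -> "the"

# The "fan": the handful of words that could plausibly come next, with probabilities.
for w, p in sorted(zip(words, probs), key=lambda wp: -wp[1]):
    print(f"{w:>6}  {p:.2f}")
```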

In this case, what we see is a "fan" of high probability words that seems to have a more or less clear direction in the feature space. What if we go any further? Here is the continuous "fan" that occurs as we "move" along the trajectory:

Here's a 3D representation with a total of 40 steps:

And, yes, this looks like a mess -- and it doesn't do anything to particularly encourage the idea that one can expect to identify "mathematical-physics-like" "semantic laws of motion" by empirically studying "what ChatGPT is doing inside". But maybe we're just looking at the "wrong variables" (or the wrong coordinate system), and if only we looked at the right ones, we'd immediately see that ChatGPT is doing something "mathematically simple" like following geodesics. But as of now, we're not ready to "empirically decode" from its "internal behavior" what ChatGPT has "discovered" about how human language is "put together".

12. Semantic Grammar and the Power of Computational Language

What does it take to produce "meaningful human language"? In the past, we might have thought it couldn't be a human brain. But now we know that ChatGPT's neural network can do this task very well. Still, maybe that's as far as we can go, and nothing simpler -- or more humanly comprehensible -- will work.

But what I strongly suspect is that the success of ChatGPT implicitly reveals an important "scientific" fact: that meaningful human language actually has much more structure and simplicity to it than we ever knew -- and that in the end there may even be fairly simple rules describing how such language can be put together.

As we mentioned above, syntactic grammar gives rules for how words corresponding to things like different parts of speech can be put together in human language. But to deal with meaning, we need to go further. And one version of how to do this is to think about not just a syntactic grammar for language, but also a semantic one.

For grammar purposes, we identify things like nouns and verbs. But for the purpose of semantics, we need a "finer level". So, for example, we can identify the notion of "moving", and the notion of "object" that "maintains an identity independent of location". Each of these "semantic concepts" has endless concrete examples.

But, for the purposes of our semantic grammar, we're going to just have some kind of general rule that basically says "objects" can "move". There's a lot to say (some of which I've said before) about how this all works. But I just want to say a few words here and point out some potential avenues for development.

It's worth mentioning that even if a sentence is perfectly OK according to the semantic grammar, that doesn't mean it has been (or even could be) realized in practice. "Elephants went to the Moon" would undoubtedly "pass" our semantic grammar, but it certainly hasn't been realized (at least yet) in our actual world -- though it's absolutely fair game for a fictional world.

When we start talking about a "semantic grammar", we quickly ask: "What's underneath it?" What "world model" does it assume? Syntactic grammar is really just a matter of building language out of words. However, a semantic grammar necessarily involves some kind of "world model" -- something that acts as a "skeleton" on which a language made of actual words can be layered.

Until recently, we might have imagined that (human) language would be the only general way of describing our "model of the world". The formalization of certain kinds of things, especially based on mathematics, has been going on for centuries. But now there is a more general formal approach: computational languages.

And, yes, this has been a big project of mine for more than forty years (now embodied in the Wolfram Language): to develop a precise symbolic representation that can talk as broadly as possible about things in the world, as well as the abstract things we care about. So, for example, we have symbolic representations for cities, molecules, images, and neural networks, and we have built-in knowledge about how to compute with those things.

And, after decades of work, we've covered a lot of ground this way. But in the past, we didn't deal with "everyday discourse" in particular. In "I bought two pounds of apples", we can easily represent (and perform nutritional and other calculations on) "two pounds of apples". But we don't (yet) have a symbolic representation of "I bought it".

And this is all very much tied in with the idea of a semantic grammar -- the goal of having a generic symbolic "construction kit" for concepts, which would give us rules for what can fit together with what, and thus rules for the "flow" of what might be turned into human language.

But suppose we have this "language of symbolic discourse". What will we do with it? We can start doing things like generating "locally meaningful text". But in the end we might want more "global sense" results -- meaning "compute" more about what actually exists or happens in the world (perhaps in some coherent fictional world).

Now in the Wolfram Language, we have a huge amount of built-in computational knowledge about many kinds of things. But for a complete symbolic discourse language, we have to build additional "computations" about things in the world in general: if an object moves from place A to place B, and from place B to place C, then it moves from place A to C, and so on.

Given a symbolic discourse language, we can use it to make "independent statements". But we can also use it to ask questions about the world, "Wolfram|Alpha style". Or we could use it to state that we "want it to be like this", presumably with some external enforcement mechanism. Or we can use it to make assertions -- maybe about the real world, maybe about a particular world we're considering, fictional or otherwise.

Human language is fundamentally imprecise, not only because it is not "tethered" to a concrete computational implementation, but its meaning is basically defined only by the "social contract" between its users. But a computational language, by its nature, has a certain fundamental precision -- because ultimately what it specifies can always be "unambiguously executed on a computer".

Human language can often get away with a certain vagueness. (When we say "planet", does it include exoplanets or not? Etc.) But in a computational language we have to be precise and explicit about all the distinctions we're making.

In computing languages, it is often convenient to take advantage of ordinary human language to make up names. But their meaning in the language of computing is necessarily precise, and may or may not cover some specific connotations in typical human language usage.

So how should we work out the fundamental "ontology" suitable for a general symbolic discourse language? Well, it's not easy. Which is perhaps why little has been done in this direction since the primitive beginnings Aristotle made more than two thousand years ago. But it really helps that today we know so much about how to think about the world computationally (and it doesn't hurt to have a "fundamental metaphysics" from our Physics Project and the idea of the ruliad).

But what does all this mean in the context of ChatGPT? From its training, ChatGPT has effectively "pieced together" a certain (rather impressive) quantity of what amounts to semantic grammar. But its very success gives us reason to think that it's going to be feasible to construct something more complete in computational-language form. And, unlike what we've so far figured out about the innards of ChatGPT, we can expect the computational language to be designed so that it's readily understandable to humans.

When we talk about semantic grammar, we can draw an analogy to syllogistic logic. At first, syllogistic logic was essentially just a collection of rules about statements expressed in human language. But (yes, two thousand years later) when formal logic was developed, the original basic constructs of syllogistic logic could be used to build huge "formal towers" that include, for example, the operation of modern digital circuitry. And so, we can expect, it will be with more general semantic grammar.

At first, it may just be able to deal with simple patterns, expressed, say, in text. But as its whole computational-language framework gets built out, we can expect that it will be able to be used to erect tall towers of "generalized semantic logic" that allow us to work in a precise and formal way with all sorts of things that have so far been accessible to us only through human language, with all its vagueness.

We can think of the construction of computational language -- and semantic grammar -- as representing a kind of ultimate compression in representing things. Because it allows us to talk about the essence of what's possible, without, for example, having to deal with all the "turns of phrase" that exist in ordinary human language. And we can view the great strength of ChatGPT as being something a bit similar: because it too has in a sense "drilled through" to the point where it can "put language together in a semantically meaningful way" without concern for different possible turns of phrase.

So, what happens if we apply ChatGPT to an underlying computing language? A computational language can describe what is possible. But what can still be added is a sense of "what's popular" - eg based on reading everything on the web.

But underneath, operating in computational language means that something like ChatGPT has immediate and fundamental access to what amounts to an ultimate tool for making use of potentially irreducible computations. And that makes it a system that can not only "generate reasonable text", but can be expected to work out whatever can be worked out about whether that text actually makes "correct" statements about the world -- or whatever it's supposed to be talking about.

13. So what is ChatGPT doing and why does it work?

The basic concept of ChatGPT is at some level rather simple. Start from a huge sample of human-created text from the web, books, etc. Then train a neural network to generate text that's "like this". And in particular, make it able to start from a "prompt" and then continue with text that's "like what it has been trained with".

As we can see, the actual neural network in ChatGPT is composed of very simple elements, albeit billions of them. The basic operation of the neural network is also very simple, mainly "passing the input once" through its elements (without any loops, etc.) for each new word (or part of a word) it generates.

But, unexpectedly, this process can produce text that's successfully "like" what's out there on the web, in books, etc. And not only is it coherent human language, it also "says things" that "follow its prompt", making use of content it has "read". It doesn't always say things that "globally make sense" (or correspond to correct computations) -- because (without, for example, access to the "computational superpowers" of Wolfram|Alpha) it's just saying things that "sound right" based on what things "sounded like" in its training material.

The specific engineering of ChatGPT makes it quite compelling. But in the end (at least until it can use external tools), ChatGPT "just" draws some "coherent textual threads" from the "statistics of conventional wisdom" it has amassed. But surprisingly, the result is so human-like. As I've discussed, this points to something very important, at least scientifically: that human language (and the mental models behind it) are somehow simpler and more "law-like" than we think. ChatGPT has implicitly spotted this. But we have the possibility to expose it explicitly through semantic grammar, computational language, etc.

ChatGPT does an impressive job at generating text, and the results are often very much like what we humans would produce. So, does this mean that ChatGPT works like a brain? Its underlying artificial neural network structure is ultimately modeled on an idealization of the brain. Moreover, many aspects of what seems likely to happen when we humans produce language are similar.

When it comes to training (a.k.a. learning), the different "hardware" of the brain and of current computers (as well as, perhaps, some undeveloped algorithmic ideas) forces ChatGPT to use a strategy that's probably rather different from (and in some ways much less efficient than) that of the brain. And there's something else as well: unlike even typical algorithmic computation, there's no "looping" or "recomputing on data" inside ChatGPT. And that inevitably limits its computational capability -- even with respect to current computers, but definitely with respect to the brain.

It's unclear how to "fix this" and still maintain the ability to train the system with reasonable efficiency. But doing so will presumably allow future ChatGPTs to do more "brain-like things." Of course, there are plenty of things that the brain doesn't do well -- especially when it comes to the equivalent of irreducible computation. For these, both the brain and things like ChatGPT must look to "external tools" -- such as the Wolfram Language.

But for now, it's exciting to see what ChatGPT has been able to do. In a way, it's a great example of the fundamental scientific fact that large numbers of simple computing elements can do extraordinary and unexpected things. But it also provides us with the best impetus we have had in two millennia to better understand a central feature of the human condition, the fundamental features and principles of human language and the thought processes that underlie it.


Follow the WeChat official account: Recharge your resources

Reply: Chat GPT
Charger will send you: free use of the Chinese version of GPT, no VPN needed
