What exactly is ChatGPT doing? Why can it do this? (2)

Machine Learning and Training of Neural Networks

So far we've been talking about neural networks that "already know" how to do particular tasks. But what makes neural networks so useful (presumably also in brains) is that not only can they in principle do all sorts of tasks, but they can be incrementally "trained from examples" to do them.

In making a neural network that distinguishes between cats and dogs, we don't actually need to write a program to (say) find whiskers explicitly; instead, we just show lots of examples of what is a cat and what is a dog, and let the network "machine learn" from these examples how to distinguish them.

The point is that the trained network "generalizes" from the specific examples it has seen. As we saw above, it is not simply that the network recognizes the particular pixel patterns of the cat images it was shown; rather, the network manages to distinguish images on the basis of some kind of more general features.

So, how exactly does the training of the neural network work? Essentially, we're trying to find the weights that will allow the neural network to successfully reproduce the examples we've been given. We then rely on the neural network to "interpolate" (or "generalize") between these examples in a "reasonable" way.

Let's look at an even simpler problem than the nearest-point one above: let's just try to get a neural network to learn the function:

For this task, we need a network with only one input and one output, such as:

But what weights, etc. should we use? With every possible set of weights, the neural network computes some function. For example, here's what it does with a few sets of randomly chosen weights:

We can clearly see that in these cases the results are not even close to reproducing our desired function. So, how do we find the weights that reproduce this function?

The basic idea is to "learn" by providing a large number of "input → output" examples - and then try to find weights that reproduce these examples. Here are the results obtained with an increasing number of examples:

Each stage in this "training" incrementally adjusts the weights in the network - eventually resulting in a network that successfully reproduces the function we want. So, how do you adjust the weights? The basic idea is to see at each stage "how far" we are from getting the function we want, and then update the weights in such a way that it gets closer.

In order to know "how far away we are", we compute what is usually called a "loss function" (or sometimes a "cost function"). Here we use a simple (L2) loss function, which is just the sum of the squares of the differences between the values we get and the true values. As our training process progresses, the loss function progressively decreases (following a certain "learning curve" that is different for different tasks) - until we reach a point where the network (at least to a good approximation) successfully reproduces the function we want:
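
As a small side illustration, here is a minimal Python sketch of an L2 loss computation of this kind; the particular "predicted" and "true" values are made up:

```python
import numpy as np

def l2_loss(predicted, target):
    """Sum of squared differences between network outputs and true values."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.sum((predicted - target) ** 2)

# Example: outputs of a (hypothetical) partially trained network vs. true values
print(l2_loss([0.1, 0.8, 0.3], [0.0, 1.0, 0.5]))  # prints roughly 0.09
```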

The last important piece to explain is how the weights are adjusted to reduce the loss function. As we have said, the loss function gives us a "distance" between the values we get and the true values. But the "values we get" are determined at each stage by the current version of the neural network and the weights in it. But now imagine that these weights are variables - say wi. We want to find out how to adjust the values of these variables so that the loss that depends on them is minimized.

For example, imagine (with a huge simplification of a typical neural network used in practice) that we only have two weights w1 and w2. Then, there could be a loss as a function of w1 and w2 that looks like this:

Numerical analysis provides a variety of techniques for finding the minimum in cases like this. But a typical approach is to start from whatever w1, w2 we previously had, and progressively follow the path of steepest descent:
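
Here is a minimal sketch of that procedure in Python, on a made-up two-weight loss surface (a real neural-network loss surface is of course not this simple, and the descent may end at a local rather than a global minimum):

```python
import numpy as np

# A made-up, illustrative loss surface over two weights w1, w2
def loss(w):
    w1, w2 = w
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2

def grad(w):
    w1, w2 = w
    return np.array([2.0 * (w1 - 1.0), 4.0 * (w2 + 0.5)])

w = np.array([3.0, 2.0])           # start from whatever weights we happened to have
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad(w)   # move a little in the direction of steepest descent

print(w, loss(w))                  # approaches the minimum at (1, -0.5)
```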

Like water flowing down a mountain, all that is guaranteed is that this procedure will end up at some local minimum of the surface ("a mountain lake"); it may well not reach the ultimate global minimum.

It might not seem feasible to find the path of steepest descent on the "weight landscape". But calculus comes to the rescue. As mentioned before, we can think of a neural network as computing a mathematical function that depends on its inputs and its weights. Now consider differentiating with respect to those weights. It turns out that the chain rule of calculus in effect lets us "untangle" the operations done by successive layers in the neural network. The result is that we can - at least in some local approximation - "invert" the operation of the neural network, and progressively find weights that minimize the loss associated with the output.
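
Here is a minimal sketch of this in Python, for a tiny one-hidden-layer network: the loss gradient is pushed back through each layer with the chain rule, and the weights are nudged downhill. The target function and the layer sizes are made-up illustrations, not the ones used in the article's figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple "wanted function" to reproduce (an assumption for illustration)
xs = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
ys = np.abs(xs)

# One input -> hidden layer of 10 ReLU neurons -> one output
W1 = rng.normal(0, 1, (1, 10)); b1 = np.zeros(10)
W2 = rng.normal(0, 1, (10, 1)); b2 = np.zeros(1)
lr = 0.05

for epoch in range(2000):
    # forward pass
    h_pre = xs @ W1 + b1
    h = np.maximum(h_pre, 0.0)          # ReLU activation
    pred = h @ W2 + b2

    # L2 loss gradient, pushed backwards layer by layer (the chain rule)
    d_pred = 2.0 * (pred - ys) / len(xs)
    dW2 = h.T @ d_pred;                 db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T * (h_pre > 0)   # chain rule through the ReLU
    dW1 = xs.T @ d_h;                   db1 = d_h.sum(axis=0)

    # gradient-descent update of the weights
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(float(np.mean((pred - ys) ** 2)))  # much smaller than at the start of training
```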

The picture above shows the kind of minimization we might have to do in the unrealistically simple case of just 2 weights. But it turns out that even with many more weights (ChatGPT uses 175 billion), it is still possible to do the minimization, at least to some level of approximation. In fact, the big breakthrough in "deep learning" that occurred around 2011 was associated with the discovery that, in some sense, it can be easier to do (at least approximate) minimization when there are lots of weights involved than when there are fairly few.

In other words - somewhat counterintuitively - it can be easier to solve more complicated problems with neural networks than simpler ones. And the rough reason for this seems to be that when one has many "weight variables", one has a high-dimensional space with "many different directions" that can lead to the minimum - whereas with fewer variables it is easier to get stuck in a local minimum ("mountain lake") from which there is no "direction to get out".

It is important to point out that, in general, there are many different sets of weights that can give a neural network nearly the same performance. And in actual neural network training, there are usually many random choices that lead to "different but equivalent solutions", like these:

But each of these "different solutions" will be at least slightly different. If we ask, say, to "extrapolate" outside the region where we provide training examples, we can get wildly different results:

But which one is "correct"? There's really no way to judge. They are all "consistent with the observed data". But they all correspond to different "innate" ways of "thinking" about what to do "outside the box". To us humans, some outcomes may seem "more plausible" than others.

Practice and Theory of Neural Network Training

Over the past decade, there have been many advances in the art of training neural networks. And, yes, it is basically an art. Sometimes, especially in retrospect, one can see at least a glimmer of a "scientific explanation" for something that is being done. But for the most part, things have been discovered by trial and error, adding ideas and tricks that have progressively built up a significant lore about how to work with neural networks.

There are several key parts. First, there is the question of what architecture of neural network to use for a particular task. Then there is the critical issue of how to get the data on which to train the neural network. And, increasingly, one is not training a network from scratch: instead, a new network can either directly incorporate another, already-trained network, or at least use such a network to generate more training examples for itself.

One might have thought that every specific task would require a different neural network architecture. But what has been found is that the same architecture often seems to work even for apparently quite different tasks. At some level this is reminiscent of the idea of universal computation (https://www.wolframscience.com/nks/chap-11--the-notion-of-computation#sect-11-3--the-phenomenon-of-universality, and my Principle of Computational Equivalence, https://www.wolframscience.com/nks/chap-12--the-principle-of-computational-equivalence/), but, as we will discuss later, I think it is more a reflection of the fact that the tasks we typically try to get neural networks to do are "human-like" ones, and that neural networks can capture quite general "human-like processes".

In the early days of neural networks, people tended to think that one should "make the neural network do as little as possible". For example, in converting speech to text (https://reference.wolfram.com/language/ref/SpeechRecognize.html), it was thought that one should first analyze the audio of the speech, break it into phonemes, and so on. But what was found is that, at least for "human-like tasks", it is usually better to try to train the neural network on the "end-to-end problem", letting it "discover" the necessary intermediate features, encodings, etc. for itself.

There is also the idea that we should introduce complex individual components into neural networks that "implement specific algorithmic ideas." But, it turns out that's not good either; instead, it's better to just deal with very simple components and let them "self-organize" (albeit often in ways beyond our comprehension) to achieve (roughly) the equivalent of those algorithmic ideas.

That's not to say that there are no "structuring ideas" relevant to neural networks. Thus, for example, two-dimensional arrays of neurons with local connections seem to be very useful at least in the early stages of image processing. And patterns of connectivity that concentrate on "looking back in sequences" seem to be useful for dealing with things like human language - as we will see later, for example, in ChatGPT.

But an important feature of neural networks is that - like computers in general - they are ultimately just processing data. And current neural networks - with current approaches to neural network training - deal specifically with arrays of numbers (https://reference.wolfram.com/language/guide/NetEncoderDecoder.html). But in the course of processing, those arrays can be completely rearranged and reshaped. As an example, the network we used above to identify digits (https://resources.wolframcloud.com/NeuralNetRepository/resources/LeNet-Trained-on-MNIST-Data/) starts with a two-dimensional "image-like" array, is quickly "thickened" to many channels, and is then "concentrated down" into a one-dimensional array (https://reference.wolfram.com/language/ref/AggregationLayer.html) that will ultimately contain elements representing the different possible output digits:

But how can one tell how big a neural network is needed for a particular task? It is something of an art. At some level, the key is to know "how hard the task is". But for human-like tasks that is typically very hard to estimate. Yes, there may be a systematic way to do the task very "mechanically" by computer. But it is hard to know whether there are tricks or shortcuts that make it possible to do the task, at least at a "human-like level", vastly more easily. It might take enumerating a huge game tree (https://writings.stephenwolfram.com/2022/06/games-and-puzzles-as-multicomputational-systems/) to play a certain game "mechanically"; but there may be a much easier ("heuristic") way to achieve "human-level play".

When one is dealing with tiny neural networks and simple tasks, one can sometimes explicitly see that "you can't get there from here". For example, here is the best one seems to be able to do on the task from the previous section with a few small neural networks:

And what we see is that if the network is too small, it just cannot reproduce the function we want. But above some size, it has no problem - at least if one trains it for long enough, with enough examples. And, by the way, these pictures illustrate a piece of neural-network lore: one can often get away with a smaller network if there is a "squeeze" in the middle that forces everything to go through a smaller intermediate number of neurons. (It is also worth mentioning that "no-intermediate-layer" - or so-called "perceptron" - networks can only learn essentially linear functions; but as soon as there is even one intermediate layer, it is in principle always possible to approximate any function arbitrarily well, at least if there are enough neurons, although some kind of regularization or normalization is usually needed to make it feasible to train, https://reference.wolfram.com/language/ref/BatchNormalizationLayer.html.)

OK, so let's say one has settled on a certain neural network architecture. Now there is the issue of getting data to train the network on. And many of the practical challenges around neural networks - and machine learning in general - center on acquiring or preparing the necessary training data. In many cases ("supervised learning"), one wants explicit examples of inputs and the desired outputs. Thus, for example, one might want images tagged by what is in them. But usually adding such tags takes a lot of effort. Very often, though, it turns out to be possible to piggyback on content that already exists, or to use it as a proxy for what one wants. For example, one can use image tags that already exist on the web. Or, in a different domain, one can use the closed captions created for videos. Or, for language translation training, one can use parallel versions of webpages or other documents that exist in different languages.

How much data does one need to show a neural network to train it for a particular task? Again, it is hard to estimate from first principles. Certainly the requirements can be dramatically reduced by using "transfer learning" to "transfer in" things like lists of important features that have already been learned in another network. But in general, neural networks need to "see a lot of examples" to train well. And at least for some tasks it is an important piece of lore that the examples can be incredibly repetitive. Indeed, it is a standard strategy to just show a neural network all the examples one has, over and over again. In each of these "training rounds" (or "epochs"), the neural network will be in at least a slightly different state, and somehow "reminding" it of a particular example is useful in getting it to "remember that example". (And, yes, perhaps this is analogous to the usefulness of repetition in human memorization.)

But often just repeating the same example over and over again is not enough. It is also necessary to show the neural network variations of the example. And it is a feature of neural-network lore that those "data augmentation" variations do not have to be sophisticated to be useful. Just slightly modifying images with basic image processing can make them essentially "as good as new" for neural network training. And, similarly, when one has run out of actual video, etc. for training self-driving cars, one can go on getting data from running simulations in a model video-game-like environment without all the detail of actual real-world scenes.
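
Here is a minimal Python sketch of the kind of simple "data augmentation" meant here, applied to a stand-in image array (the particular transformations are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def augment(image):
    """Produce slightly modified variants of one training image."""
    variants = [image]
    variants.append(np.fliplr(image))                     # mirror left-right
    variants.append(np.roll(image, shift=1, axis=0))      # shift down by one pixel
    variants.append(np.clip(image + rng.normal(0, 0.05, image.shape), 0, 1))  # add a little noise
    return variants

image = rng.random((28, 28))     # a stand-in for a real training image
print(len(augment(image)), [v.shape for v in augment(image)])
```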

How about something like ChatGPT? Well, it has the nice feature that it can do "unsupervised learning", which makes it much easier to get examples to train from. Recall that the basic task of ChatGPT is to figure out how to continue a piece of text it has been given. So to get a "training example", all one has to do is take a piece of text, mask out the end of it, and use that as the "input to train from" - with the "output" being the complete, unmasked piece of text. We will discuss this more later, but the main point is that - unlike, say, learning what is in an image - no "explicit tagging" is needed; ChatGPT can in effect learn directly from whatever examples of text it is given.
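
As a minimal sketch (with a made-up sentence), here is how such "input → next word" training pairs can be sliced out of raw text:

```python
# Mask off the end of a passage and use what comes next as the target.
def make_training_pairs(text, min_prefix=3):
    words = text.split()
    pairs = []
    for cut in range(min_prefix, len(words)):
        prefix = " ".join(words[:cut])      # "training input": the masked text
        target = words[cut]                 # what should come next
        pairs.append((prefix, target))
    return pairs

sample = "the cat sat on the mat and looked out of the window"
for prefix, target in make_training_pairs(sample)[:3]:
    print(repr(prefix), "->", repr(target))
```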

OK, so what is the actual learning process in a neural network like? In the end, it is all about determining what weights will best capture the training examples that have been given. And there are all sorts of detailed choices and "hyperparameter settings" (so called because the weights can be thought of as "parameters") that can be used to tweak how this is done. There are different choices of loss function (sum of squares, sum of absolute values, etc., https://reference.wolfram.com/language/ref/CrossEntropyLossLayer.html). There are different ways to do loss minimization (how far in weight space to move at each step, etc.). And then there are questions like how big a "batch" of examples to show in order to get each successive estimate of the loss one is trying to minimize. And, yes, one can apply machine learning (as we do, for example, in the Wolfram Language) to automate machine learning - and to automatically set things like hyperparameters.

But in the end, the whole training process is characterized by seeing how the loss gradually decreases (like this Wolfram Language progress monitor for small training, https://reference.wolfram.com/language/ref/NetTrain.html):

And what one typically sees is that the loss decreases for a while, but eventually flattens out at some constant value. If that value is small enough, then the training can be considered successful; otherwise, it is probably a sign that one should try changing the network architecture.

How long should one expect it to take for the "learning curve" to flatten out? Like so much else, there seem to be approximate power-law scaling relationships that depend on the size of the neural network and the amount of data used (https://arxiv.org/pdf/2001.08361.pdf). But the general conclusion is that training a neural network is hard and takes a lot of computational effort. And, in practice, the vast majority of that effort is spent doing operations on arrays of numbers, which is what GPUs are good at - which is why neural network training is typically limited by the availability of GPUs.

In the future, will there be fundamentally better ways to train neural networks, or better ways to do what neural networks do? Almost certainly, I think. The fundamental idea of neural networks is to create a flexible "computing fabric" from a large number of simple (essentially identical) components, and to have this "fabric" be incrementally modified to learn from examples. In current neural networks, one is essentially using the ideas of calculus - applied to real numbers - to do that incremental modification. But it is increasingly clear that having high-precision numbers does not matter; 8 bits or fewer may be enough even with current methods.

With computational systems like cellular automata (https://www.wolframscience.com/nks/chap-2--the-crucial-experiment#sect-2-1--how-do-simple-programs-behave) that basically operate in parallel on many individual bits, it has never been clear how to do this kind of incremental modification, but there is no reason to think it is impossible. And in fact, much as with the "deep-learning breakthrough of 2012" (https://en.wikipedia.org/wiki/AlexNet), it may be that such incremental modification is effectively easier in more complicated cases than in simpler ones.

Neural networks - perhaps a bit like brains - are set up to have an essentially fixed network of neurons, with what is modified being the strengths ("weights") of the connections between them. (Perhaps in young brains significant numbers of wholly new connections can also grow.) But while this might be a convenient setup for biology, it is not at all clear that it is the best way to achieve the functionality we need. And something involving progressive network rewriting (perhaps reminiscent of our Physics Project, https://www.wolframphysics.org/) might well ultimately be better.

But even within the framework of existing neural networks there is currently a crucial limitation: neural network training as it is now done is fundamentally sequential, with the effects of each batch of examples being propagated back to update the weights. And indeed, with current computer hardware - even taking GPUs into account - most of a neural network is "idle" most of the time during training, with just one part being updated at a time. In a sense, this is because our current computers tend to have memory that is separate from their CPUs (or GPUs). But in brains it is presumably different - with every "memory element" (i.e. neuron) also being a potentially active computational element. And if we could set up our future computer hardware this way, it might become possible to do training much more efficiently.

"As long as the scale is large enough, the network can do anything!"

The capabilities of something like ChatGPT seem so impressive that one might imagine that if one could just "keep going" and train larger and larger neural networks, they would eventually be able to "do everything". And if one is concerned with things that are readily accessible to immediate human thinking, it is quite possible that this is the case. But the lesson of the past several hundred years of science is that there are things that can be worked out by formal processes, but are not readily accessible to immediate human thinking.

Nontrivial mathematics is one big example. But the general case is really computation. And ultimately the issue is the phenomenon of computational irreducibility (https://www.wolframscience.com/nks/chap-12--the-principle-of-computational-equivalence#sect-12-6--computational-irreducibility). There are some computations that one might think would take many steps to do, but which can in fact be "reduced" to something quite immediate. But the discovery of computational irreducibility implies that this does not always work. And instead there are processes - probably like the one below - where working out what happens inevitably requires tracing through each computational step:

The kinds of things we normally do with our brains are presumably specifically chosen to avoid computational irreducibility. It takes special effort to do mathematics in one's brain. And in practice it is largely impossible to "think through" the steps in the operation of any nontrivial program just in one's brain.

Of course, we have computers for this. With computers, we can easily do long, computationally irreducible things. And the key point is that there are generally no shortcuts to these things.

Yes, we can remember many specific examples of what has happened in some particular computational system. And maybe we can even see some ("computationally reducible") patterns that allow us to do a bit of generalization. But the point is that computational irreducibility means we can never guarantee that the unexpected will not happen - and it is only by explicitly doing the computation that one can tell what actually happens in any given case.

Finally, there is a fundamental antagonism between learnability and computational irreducibility. Learning is actually about compressing data by exploiting regularity (https://www.wolframscience.com/nks/chap-10--processes-of-perception-and-analysis/). But computational irreducibility means that ultimately there is a limit to the regularities that can exist.

In practice, one can imagine building small computational devices - like cellular automata or Turing machines - into trainable systems like neural networks. And indeed such devices can serve as good "tools" for the neural network, just as Wolfram|Alpha can be a good tool for ChatGPT (https://writings.stephenwolfram.com/2023/01/wolframalpha-as-the-way-to-bring-computational-knowledge-superpowers-to-chatgpt/). But computational irreducibility implies that one cannot expect to "get inside" those devices and have them learn.

Or put another way, there is an ultimate tradeoff between capability and trainability: the more you want a system to make "true use" of its computational capabilities, the more it is going to show computational irreducibility, and the less trainable it is going to be. And the more it is fundamentally trainable, the less it is going to be able to do sophisticated computation. (For ChatGPT as it currently exists, the situation is actually much more extreme, because the neural network used to generate each token of output is a pure "feed-forward" network, without loops, and therefore has no ability to do any kind of computation with nontrivial "control flow".)

Of course, one might wonder whether it is actually important to be able to do irreducible computations. And indeed for much of human history it was not particularly important. But our modern technological world has been built on engineering that makes use of at least mathematical computations - and increasingly also more general computations. And if we look at the natural world, it is full of irreducible computation (https://www.wolframscience.com/nks/chap-8--implications-for-everyday-systems/) - which we are slowly understanding how to emulate and use for our technological purposes.

Yes, a neural network can certainly notice the kinds of regularities in the natural world that we might also readily notice with "unaided human thinking". But if we want to work out things that are in the purview of mathematical or computational science, the neural network is not going to be able to do it - unless it effectively "uses as a tool" an "ordinary" computational system.

But there is something potentially confusing in all of this. In the past there were plenty of tasks - including writing essays - that we assumed were somehow "fundamentally too hard" for computers. And now that we see them done by the likes of ChatGPT, it is tempting to conclude that computers must have become vastly more powerful - in particular, going beyond things they were already basically able to do (like progressively computing the behavior of computational systems such as cellular automata).

But this is not the correct conclusion. Computationally irreducible processes remain computationally irreducible, and remain intrinsically hard for computers—even if computers can easily compute their individual steps. Instead, we should conclude that tasks that we humans can do that we don't think computers can do, such as writing articles, are actually in some sense easier to compute than we think.

In other words, the neural network was able to successfully write an essay because writing an essay proved to be a "computationally shallower" problem than we thought. In a sense, this brings us closer to "having a theory" of how humans do things like write essays, or how humans process language.

If you have a big enough neural network, then, yes, you may be able to do whatever humans can readily do. But you will not capture what nature in general can do - or what the tools we have fashioned from nature can do. And it is the use of those tools - both practical and conceptual - that has allowed us in recent centuries to transcend the boundaries of what is accessible to "pure unaided human thought", and to capture for human purposes much more of what is out there in the physical and computational universe.

The Concept of Embeddings

Neural networks - at least as they are currently set up - are fundamentally based on numbers. So if we are going to use them to work on something like text, we need a way to represent our text with numbers (https://reference.wolfram.com/language/guide/NetEncoderDecoder.html). And certainly we could start (essentially as ChatGPT does) by just assigning a number to every word in the dictionary. But there is an important idea - which is, for example, central to ChatGPT - that goes beyond this. It is the idea of "embeddings". One can think of an embedding as a way of trying to represent the "essence" of something by an array of numbers - with the property that "nearby things" are represented by nearby numbers.

And so, for example, we can think of a word embedding as trying to lay out words in a kind of "meaning space" (https://reference.wolfram.com/language/ref/FeatureSpacePlot.html), in which words that are somehow "nearby in meaning" appear nearby in the embedding. The actual embeddings that are used - say in ChatGPT - tend to involve large lists of numbers. But if we project down to 2D, we can show examples of how the embedded words are laid out:

And, yes, what we see does remarkably well in capturing typical everyday impressions. But how can we construct such an embedding? Roughly, the idea is to look at large amounts of text (here 5 billion words from the web) and then see "how similar" the "environments" are in which different words appear. So, for example, "alligator" and "crocodile" will often appear almost interchangeably in otherwise similar sentences, and that means they will be placed nearby in the embedding. But "radish" and "eagle" will not tend to appear in similar sentences, so they will be placed far apart in the embedding.
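
Here is a minimal Python sketch of the underlying "similar environments" idea, using a tiny made-up corpus and simple co-occurrence counts (actual systems use billions of words and neural networks rather than raw counts):

```python
import numpy as np

# Tiny illustrative corpus (made up); real systems use billions of words.
corpus = [
    "the alligator swam in the river",
    "the crocodile swam in the river",
    "the alligator has sharp teeth",
    "the crocodile has sharp teeth",
    "the radish grew in the garden",
    "the eagle flew over the mountain",
]

vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears within a +-2 word window of each other word
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                counts[index[w], index[words[j]]] += 1

def similarity(a, b):
    va, vb = counts[index[a]], counts[index[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9)

print(similarity("alligator", "crocodile"))  # high: similar environments
print(similarity("alligator", "radish"))     # lower: different environments
```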

But how do you actually implement such a mechanism using neural networks? Let's skip word embeddings and discuss image embeddings first. We want to find some way to describe images by lists of numbers so that "images we think are similar" are assigned to similar lists of numbers.

How do we tell if images should be "considered similar"? If our images are handwritten digits, we might "think two images are similar" if they are the same digit. Earlier, we discussed a neural network trained to recognize handwritten digits. Think of this neural network as being set up to put the image into 10 different bins, one for each digit, in its final output.

But what about "intercepting" the process inside the neural network before making the final decision of "this is a '4'"? We might imagine that in a neural network, there are numbers that describe an image as "very much like a 4, but a little bit like a 2" or something like that. And the idea is to pick out such numbers as embedded elements.

So here is the idea. Instead of directly trying to characterize "what image is near what other image", we consider a well-defined task (in this case digit recognition) for which we can get explicit training data - and then exploit the fact that, in doing this task, the neural network implicitly has to make what amount to "nearness decisions". So we never have to explicitly talk about "nearness of images"; we just talk about the concrete question of what digit an image represents, and then we "leave it to the neural network" to implicitly determine what that implies about "nearness of images".
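
Here is a minimal Python sketch of just the mechanics: a toy "digit classifier" with random, untrained weights, where the point is only to show how one can read off the layer before the final decision and treat it as an embedding (the layer sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random (untrained) stand-in weights; only the data flow matters here.
W_hidden = rng.normal(0, 0.1, (784, 500))   # 28x28 image -> 500 hidden neurons
W_out = rng.normal(0, 0.1, (500, 10))       # 500 hidden neurons -> 10 digit scores

def embed(image_28x28):
    x = image_28x28.reshape(-1)
    hidden = np.tanh(x @ W_hidden)          # internal "characterization" of the image
    return hidden                           # <- used as the embedding vector

def classify(image_28x28):
    return np.argmax(embed(image_28x28) @ W_out)

image = rng.random((28, 28))
print(embed(image)[:5], classify(image))
```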

So how does this work in more detail for the digit recognition network? We can think of this network as consisting of 11 successive layers, which we can summarize like this (with activation functions shown as separate layers):

At the beginning, the actual image is input to the first layer, represented by a two-dimensional array of pixel values. At the last layer, you're given an array of 10 values, which you can think of as representing how "sure" the network is that the image corresponds to each digit from 0 to 9.

input image

The values of the neurons in the last layer are:

In other words, the neural network has been "very sure" that the image is 4 at this time, and in order to actually get the output "4", it only needs to pick the position of the neuron with the largest value.

But what if we look one step earlier? The last operation in the network is a so-called softmax (https://reference.wolfram.com/language/ref/SoftmaxLayer.html), which tries to "force certainty". But before that has been applied, the values of the neurons are:

The neuron representing "4" still has the highest value. But there is also information in the values of the other neurons. And we can expect that this list of numbers can in a sense be used to characterize the "essence" of the image - and thus to provide something we can use as an embedding. So, for example, each of the 4's here has a slightly different "signature" (or "feature embedding") - all quite different from the 8's:
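
For reference, here is what the softmax operation at the end of the network does, in a minimal Python sketch with made-up pre-softmax values in which the "4" neuron is largest:

```python
import numpy as np

def softmax(values):
    """Turn raw neuron values ("logits") into probabilities that sum to 1."""
    shifted = values - np.max(values)       # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Illustrative pre-softmax values for the 10 digit neurons (made up, "4" largest)
logits = np.array([-2.1, 0.3, 1.2, -0.5, 6.0, 0.1, -1.3, 0.8, 2.2, -0.4])
probs = softmax(logits)
print(probs.round(4))          # the "4" neuron dominates after softmax
print(int(np.argmax(probs)))   # 4
```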

Here we are essentially using 10 numbers to characterize our images. But it is often better to use much more than that. For example, in our digit recognition network, we can get an array of 500 numbers by tapping into the preceding layer. And this is probably a reasonable array to use as an "image embedding".

If we want to explicitly visualize the "image space" of handwritten digits, we need to "reduce dimensionality", effectively projecting our resulting 500-dimensional vector into, for example, a three-dimensional space:

What we have just talked about is creating a characterization (and thus an embedding) for images, based effectively on identifying the similarity of images by determining (according to our training set) whether they correspond to the same handwritten digit. And we can do the same thing much more generally for images if we have a training set that identifies, say, which of 5000 common types of object (cat, dog, chair, ...) each image is of. In this way, we can make an image embedding that is "anchored" by our identification of common objects, but then "generalizes around that" according to the behavior of the neural network. And the point is that insofar as that behavior aligns with how we humans perceive and interpret images, this will end up being an embedding that "seems right to us" and is useful in practice for making "human-like judgments".

OK, so how do we follow the same kind of approach to find embeddings for words? The key is to start from a task about words for which we can readily get training data. And the standard such task is "word prediction". Suppose we are given "the ___ cat". Based on a large corpus of text (say, the text content of the web), what are the probabilities for the different words that might "fill in the blank"? Or, alternatively, given "___ black ___", what are the probabilities for the different "flanking words"?

How do we set this problem up for a neural network? Ultimately we have to formulate everything in terms of numbers. And one way to do this is just to assign a unique number to each of the 50,000 or so common words in English. So, for example, "the" might be 914, and "cat" (with a space before it) might be 3542. (And these are the actual numbers used by GPT-2.) So for the "the ___ cat" problem, our input might be {914, 3542}. What should the output be like? Well, it should be a list of 50,000 or so numbers that effectively give the probabilities for each of the possible "fill-in" words. And once again, to find an embedding, we "intercept" the "insides" of the neural network just before it "reaches its conclusion" - and then pick up the list of numbers that occur there, which we can think of as "characterizing each word".
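
Here is a minimal Python sketch of just the data flow involved - token IDs in, a 50,000-long list of probabilities out, with an internal vector one could "intercept" as an embedding. The weights are random stand-ins, so the probabilities are meaningless; only the shapes matter here:

```python
import numpy as np

rng = np.random.default_rng(2)

VOCAB_SIZE = 50_000   # roughly the number of common English words
EMBED_DIM = 64        # illustrative internal vector size (an assumption)

# Token IDs as quoted in the text above ("the" -> 914, " cat" -> 3542)
prompt_ids = np.array([914, 3542])

# Random stand-ins for trained weights
embedding_table = rng.normal(0, 0.1, (VOCAB_SIZE, EMBED_DIM))
output_weights = rng.normal(0, 0.1, (EMBED_DIM, VOCAB_SIZE))

hidden = embedding_table[prompt_ids].mean(axis=0)    # "intercepted" internal vector
logits = hidden @ output_weights                     # one score per possible word
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape)    # (50000,) - a probability for every possible "fill-in" word
print(hidden.shape)   # (64,)    - the kind of vector one can use as an embedding
```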

OK, so what do those characterizations look like? Over the past decade or so, a sequence of different systems has been developed (word2vec, GloVe, BERT, GPT, ...), each based on a different neural network approach. But ultimately all of them take words and characterize them by lists of hundreds to thousands of numbers.

In their raw form, these "embedding vectors" contain little information. For example, here are the raw embedding vectors produced by GPT-2 for three specific words:

For example, if we measure the distance between these vectors, then we can find things like the "closeness" of words. We will discuss in more detail later what we might think of as the "cognitive" meaning of this embedding. But now the main point is that we have a way to efficiently turn words into "neural network friendly" collections of numbers.

But in fact, we can go one step further and not just describe words with collections of numbers; we can also describe sequences of words, or entire blocks of text. That's how ChatGPT handles things. It takes the text it has so far and generates an embedding vector to represent it. Then, its goal is to find the probabilities of different words that might come next. It represents its answer as a list of numbers that basically gives probabilities for 50,000 or so possible words.

(Strictly speaking, ChatGPT does not deal with words, but rather with "tokens" - convenient linguistic units that might be whole words, or might just be pieces like "pre" or "ing" or "ized". Working with tokens makes it easier for ChatGPT to handle rare, compound and non-English words, and, sometimes, for better or worse, to invent new words.)

Inside ChatGPT

We're finally ready to discuss what's inside ChatGPT. It's a huge neural network -- currently a version of the so-called GPT-3 network, with 175 billion weights. In many ways, this is very much like the other neural networks we've discussed. But it's a neural network specifically set up to handle language problems. Its most notable feature is a neural network architecture called a "transformer".

In the very first neural networks we discussed above, every neuron at any given layer was basically connected (with at least some weight) to every neuron on the layer before. But a fully connected network like that is (presumably) overkill if one is dealing with data that has particular, known structure. And thus, for example, in the early stages of dealing with images, it is typical to use so-called convolutional neural networks ("convnets", https://reference.wolfram.com/language/ref/ConvolutionLayer.html), in which neurons are effectively laid out on a grid analogous to the pixels in the image - and connected only to neurons nearby on the grid.

The idea of a transformer is to do something at least somewhat similar for the sequences of tokens that make up a piece of text. But instead of just defining a fixed region in the sequence over which there can be connections, transformers introduce the notion of "attention" (https://reference.wolfram.com/language/ref/AttentionLayer.html) - and the idea of "paying attention" more to some parts of the sequence than to others. Maybe one day it will make sense to just start up a generic neural network and do all the customization through training. But at least as of now it seems to be critical in practice to "modularize" things - as transformers do, and probably as our brains also do.

OK, so what does ChatGPT (or, rather, the GPT-3 network on which it is based) actually do? Recall that its overall goal is to continue text in a "reasonable" way, based on what it has seen in its training (which consisted of looking at billions of pages of text from the web, etc.). So at any given point, it has a certain amount of text, and its goal is to come up with an appropriate choice for the next token to add.

It operates in three basic stages. First, it takes the sequence of tokens that corresponds to the text so far, and finds an embedding (i.e. an array of numbers) that represents these. Then it operates on this embedding "in standard neural network fashion", with values "rippling through" successive layers in the network, to produce a new embedding (i.e. a new array of numbers). It then takes the last part of this array and generates from it an array of about 50,000 values that turn into probabilities for the possible next tokens. (And, yes, it so happens that the number of tokens used is about the same as the number of common words in English, though only about 3000 of the tokens are whole words, and the rest are fragments.) The critical point is that every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network. In other words, in effect nothing except the overall architecture is "explicitly engineered"; everything is just "learned" from the training data.

However, there are many details in how the architecture is set up, reflecting various experiences and knowledge of neural networks. And, as messy as it sounds, I think it's useful to discuss some of the details, especially to understand what it takes to build something like ChatGPT.

First comes the embedding module. Here is a Wolfram Language schematic representation of it for GPT-2:

The input is a vector of n tokens (https://reference.wolfram.com/language/ref/netencoder/SubwordTokens.html, represented as integers from 1 to about 50,000, as described in the previous section). Each of these tokens is converted (by a single-layer neural network, https://reference.wolfram.com/language/ref/EmbeddingLayer.html) into an embedding vector (of length 768 for GPT-2 and 12,288 for ChatGPT's GPT-3). Meanwhile, there is a "secondary pathway" that takes the sequence of (integer) positions of the tokens, and from these integers creates another embedding vector. And finally, the embedding vectors from the token values and the token positions are added together - to produce the final sequence of embedding vectors from the embedding module.
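
Here is a minimal Python sketch of that embedding module. The embedding length of 768 is GPT-2's, as quoted above; the vocabulary size and maximum sequence length here are small stand-ins, and the embedding tables are random rather than actual learned weights:

```python
import numpy as np

rng = np.random.default_rng(3)

VOCAB_SIZE = 5_000       # small stand-in; the real GPT-2 vocabulary has ~50,000 entries
EMBED_DIM = 768          # embedding length for GPT-2
MAX_POSITIONS = 1024     # maximum sequence length (an illustrative assumption)

token_embeddings = rng.normal(0, 0.02, (VOCAB_SIZE, EMBED_DIM))
position_embeddings = rng.normal(0, 0.02, (MAX_POSITIONS, EMBED_DIM))

def embedding_module(token_ids):
    """Token-value embeddings plus token-position embeddings, added elementwise."""
    positions = np.arange(len(token_ids))
    return token_embeddings[token_ids] + position_embeddings[positions]

tokens = np.array([914, 3542, 914, 3542])   # token IDs from the example earlier
print(embedding_module(tokens).shape)        # (4, 768): one length-768 vector per token
```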

Why does one just add together the token-value and token-position embedding vectors? I don't think there is any particular science to this. It is just that various different things have been tried, and this is one that seems to work. And it is part of the lore of neural networks that, in some sense, as long as the setup one has is "roughly right", it is usually possible to home in on the details just by doing sufficient training, without ever really needing to "understand" how the neural network has ended up configuring itself.

Here's what the embedding module does for the string "hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye bye":

The elements of the embedding vector for each token are shown down the page, and across the page we see first a run of "hello" embeddings, followed by a run of "bye" ones. The second array above is the positional embedding - whose somewhat-random-looking structure is just what "happened to be learned" (in this case by GPT-2).

After the embedding module comes the "main event" of the transformer: a sequence of so-called "attention blocks" (12 for GPT-2, 96 for ChatGPT's GPT-3). It is all quite complicated - and reminiscent of typical large, hard-to-understand engineering systems, or, for that matter, biological systems. Here is a schematic representation of a single "attention block" (for GPT-2):

Within each such attention block there is a collection of "attention heads" (12 for GPT-2, 96 for ChatGPT's GPT-3) - each of which operates independently on a different chunk of values in the embedding vector. (And, no, we don't know any particular reason why it is a good idea to split up the embedding vector, or what its different parts "mean"; this is just one of those things that has been "found to work".)

OK, so what do the attention heads do? Basically, they are a way of "looking back" in the sequence of tokens (i.e. in the text produced so far), and of "packaging up the past" in a form that is useful for finding the next token. In the earlier section we talked about using 2-gram probabilities to pick words based on their immediate predecessors. What the "attention" mechanism in transformers does is to allow "attention to" even much earlier words - thus potentially capturing the way that, say, verbs can refer to nouns that appear many words before them in a sentence.

At a more detailed level, what an attention head does is to recombine the chunks of the embedding vectors associated with different tokens, with certain weights. So, for example, the 12 attention heads in the first attention block (in GPT-2) have the following ("look-back-all-the-way-to-the-beginning-of-the-sequence-of-tokens") patterns of "recombination weights" for the "hello, bye" string above:
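
In code terms, here is a minimal Python sketch of what a single attention head of this kind computes: each token's chunk of the output is a weighted combination of (earlier) tokens' values. The projection weights are random stand-ins, not GPT-2's actual learned weights, and the head size of 64 is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

def attention_head(x, d_head=64):
    """One attention head: recombine each token's embedding chunk as a weighted
    sum over (earlier) tokens. The weight matrices are random stand-ins."""
    n, d_model = x.shape
    Wq = rng.normal(0, 0.02, (d_model, d_head))
    Wk = rng.normal(0, 0.02, (d_model, d_head))
    Wv = rng.normal(0, 0.02, (d_model, d_head))

    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_head)           # how much each token "attends" to each other token
    mask = np.triu(np.ones((n, n)), k=1).astype(bool)
    scores[mask] = -1e9                          # only look back, never forward
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # re-weighted combination of value vectors

sequence = rng.normal(0, 1, (6, 768))            # 6 tokens, embedding length 768 (as for GPT-2)
print(attention_head(sequence).shape)            # (6, 64): one recombined chunk per token
```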

After being processed by the attention heads, the resulting "re-weighted embedding vector" (of length 768 for GPT-2 and 12,288 for ChatGPT's GPT-3) is passed through a standard "fully connected" neural network layer (https://reference.wolfram.com/language/ref/LinearLayer.html). It is hard to get a handle on what this layer is doing. But here is a plot of the 768×768 matrix of weights it uses (here for GPT-2):

With a 64×64 moving average, some (random walk-like) structure starts to emerge:

What determines this structure? It may end up being some "neural network encoding" of the characteristics of human language. But until now, it's been pretty unclear what those features might be. In effect, we're "opening the brains of ChatGPT" (or at least GPT-2) and discovering that there's a lot of complexity inside, and we don't understand it -- although it eventually produces recognizable human language.

After going through one attention block, we get a new embedding vector - which is then successively passed through additional attention blocks (12 in total for GPT-2; 96 for GPT-3). Each attention block has its own particular pattern of "attention" and "fully connected" weights. Here, for GPT-2, is the sequence of attention weights for the "hello, bye" input, for the first attention head:

Here is the (moving average) "matrix" for the fully connected layer:

Curiously, although these "weight matrices" look similar in different attention blocks, the size distribution of the weights can be somewhat different (and not always Gaussian):

So, after going through all these attention blocks, what is the net effect of the transformer? Essentially, it is to transform the original collection of embeddings for the sequence of tokens into a final collection. And the particular way ChatGPT works is to then pick up the last embedding in this collection and "decode" it to produce a list of probabilities for what token should come next.

This is the general content of ChatGPT. It may seem complicated (especially since it has many unavoidable and somewhat arbitrary "engineering choices"), but in reality, the final elements involved are quite simple. Because ultimately what we're dealing with is just a neural network of "artificial neurons," each of which is doing the simple operation of taking a set of numeric inputs and combining them with some weights.

The raw input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far), and all that happens when ChatGPT "runs" to produce a new token is that these numbers "ripple through" the layers of the neural network, with each neuron "doing its thing" and passing the result to neurons on the next layer. There is no looping or "going back". Everything just "feeds forward" through the network.

This is a very different setup from a typical computing system -- a Turing machine -- where results are repeatedly "reprocessed" by the same computing elements. Here, each computational element (i.e. neuron) is used only once, at least when generating a specific output token.

But in a sense, even in ChatGPT, there is still an "outer loop" that reuses computational elements. Because when ChatGPT is going to generate a new token, it always "reads" (i.e. takes as input) the whole sequence of tokens that came before it, including the tokens that ChatGPT itself has "written" previously. And we can take this setup to mean that ChatGPT does - at least at its outermost level - involve a "feedback loop", albeit one in which every iteration is explicitly visible as a token that appears in the text it generates.

But let's come back to the core of ChatGPT: the neural network that is repeatedly being used to generate each token. At some level it is very simple: a whole collection of identical artificial neurons. And some parts of the network just consist of ("fully connected") layers of neurons (https://reference.wolfram.com/language/ref/LinearLayer.html) in which every neuron on a given layer is connected (with some weight) to every neuron on the layer before. But particularly because of its transformer architecture, ChatGPT has parts with more structure, in which only specific neurons on different layers are connected. (Of course, one could still say that "all neurons are connected" - but some just have zero weight.)

In addition, there are aspects of the neural network in ChatGPT that are not most naturally thought of as just consisting of "homogeneous" layers. For example, as the schematic summary above indicates, inside an attention block there are places where "multiple copies are made" of incoming data, each of which then goes through a different "processing path", potentially involving a different number of layers, before being recombined. But while this may be a convenient representation of what is going on, it is always at least in principle possible to think of "densely filling in" layers, just with some weights being zero.

If we look at the longest path of ChatGPT, about 400 (core) layers are involved - not a huge number. But there are millions of neurons -- 175 billion connections in total, so 175 billion weights. It is important to remember that each time ChatGPT generates a new token, it performs calculations involving each of these weights. In actual operation, these calculations can be organized into highly parallel array operations "by layer", which can be conveniently completed on GPU. However, there are still 175 billion calculations (and a little more in the end) to be performed for each token produced - so generating a long text with ChatGPT will take a while, as we expected.

The most remarkable thing is that all these operations - individually as simple as they are - can somehow together manage to do such a good "human-like" job of generating text. It has to be emphasized again that (at least so far as we know) there is no "ultimate theoretical reason" why anything like this should work. And in fact, as we will discuss below, I think we have to view this as a potentially surprising scientific discovery: that somehow, in a neural network like ChatGPT's, it is possible to capture the essence of what human brains manage to do in generating language.

ChatGPT training

OK, so we have now given an outline of how ChatGPT works once it has been set up. But how did it get set up? How were the 175 billion weights in its neural network determined? Basically, they are the result of very large-scale training, based on a huge corpus of text - on the web, in books, etc. - written by humans. As we have said, even given all that training data, it is certainly not obvious that a neural network would be able to successfully produce "human-like" text. And, once again, there seem to be detailed pieces of engineering needed to make that happen. But the big surprise and discovery of ChatGPT is that it is possible at all. And that, in effect, a neural network with "just" 175 billion weights can make a "reasonable model" of text that humans write.

In modern times, there is lots of text written by humans that exists in digital form. The public web has at least several billion human-written pages, with altogether perhaps a trillion words of text. And if one includes non-public pages, the numbers might be at least 100 times larger. So far, more than 5 million digitized books have been made available (out of the 100 million or so that have ever been published), giving another 100 billion or so words of text. (As a personal comparison, my total lifetime output of published material has been a bit under 3 million words; over the past 30 years I have written about 15 million words of email and typed perhaps 50 million words altogether; and over the past couple of years I have spoken more than 10 million words on livestreams. And, yes, I will train a bot from all of that.)

But given all this data, how do we train a neural network from it? The basic process is very similar to what we discussed in the simple example above. You present a batch of examples, and then you adjust the weights in the network so that the network's error ("loss") on those examples is minimized. The main problem with "backpropagating" from errors is that every time you do this, every weight in the network usually changes by at least a small amount, and that's a huge number of weights to process. (The actual "reverse calculation" is usually only a small constant factor harder than the forward calculation).

With modern GPU hardware, it is straightforward to compute the results from batches of thousands of examples in parallel. But when it comes to actually updating the weights in the neural network, current methods require one to do that basically batch by batch. (And, yes, this is probably where actual brains - with their combined computation and memory elements - have, for now, at least an architectural advantage.)

Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we found we often had to use millions of examples to successfully train a network, at least from scratch. So how many examples does that suggest we will need in order to train a "human-like language" model? There does not seem to be any fundamental "theoretical" way to know. But in practice ChatGPT was successfully trained on a few hundred billion words of text.

Some of the text it was fed several times, some of it only once. But somehow it "got what it needed" from the text it saw. But given that volume of text to learn from, how large a network should it require to "learn it well"? Again, we do not yet have a fundamental theoretical way to say. Ultimately - as we will discuss further below - there is presumably a certain "total algorithmic content" to human language and to what humans typically say with it. But the next question is how efficient a neural network will be at implementing a model based on that algorithmic content. And again we do not know - although the success of ChatGPT suggests it is reasonably efficient.

In the end we can just note that ChatGPT does what it does using a couple hundred billion weights - comparable in number to the total number of words (or tokens) of training data it has been given. In some ways it is perhaps surprising (though empirically observed also in smaller analogs of ChatGPT) that the "size of the network" that seems to work well is so comparable to the "size of the training data". After all, it is certainly not that somehow "inside ChatGPT" all that text from the web, books and so on is "directly stored". Because what is actually inside ChatGPT is a bunch of numbers - with a bit less than 10 digits of precision - that are some kind of distributed encoding of the aggregate structure of all that text.

Put another way, we can ask what the "effective information content" is of human language and what is typically said with it. There is the raw corpus of examples of language. And then there is the representation in ChatGPT's neural network. That representation is very likely far from the "algorithmically minimal" representation (as we will discuss below). But it is a representation that is readily usable by the neural network. And in this representation there ends up being rather little "compression" of the training data; on average it seems to take only a bit less than one neural network weight to carry the "information content" of a word of training data.

When we run ChatGPT to generate text, we basically have to use each weight once. So if there are n weights, we have on the order of n computational steps to do - though in practice many of them can typically be done in parallel on GPUs. But if we need about n words of training data to set up those weights, then from what we have said above we can conclude that we will need about n² computational steps to do the training of the network - which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.
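
Here is that back-of-the-envelope estimate written out (ignoring constant factors for backpropagation, batching and so on; the numbers are the ones quoted above, treated only as rough orders of magnitude):

```python
# If there are n weights, each training token costs on the order of n operations,
# and if we need on the order of n tokens of training data, the total is ~n^2.
n_weights = 175e9               # weights quoted in the text for GPT-3
ops_per_token = n_weights       # roughly one use of each weight per token processed
n_training_tokens = n_weights   # "comparable in number", as noted above
total_ops = ops_per_token * n_training_tokens
print(f"{total_ops:.1e}")       # ~3e22 - why training takes such enormous GPU effort
```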

Beyond Basic Training

The majority of the effort in training ChatGPT is spent "showing it" large amounts of existing text from the web, books, etc. But it turns out there is another - apparently rather important - part too.

As soon as it has finished its "raw training" from the original corpus of text it was shown, the neural network inside ChatGPT is ready to start generating its own text, continuing from prompts, etc. But while the results of doing this can often seem reasonable, they tend - particularly for longer pieces of text - to "wander off" in ways that are often rather non-human-like. This is not something one can readily detect by, say, doing traditional statistics on the text. But it is something that actual humans reading the text easily notice.

And a key idea in the construction of ChatGPT was to have another step after "passively reading" things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on "how to be a good chatbot". But how can the neural network use that feedback? The first step is just to have humans rate results from the neural network. But then another neural network model is built to try to predict those ratings. And now this prediction model can be run - essentially like a loss function - on the original network, in effect allowing that network to be "tuned up" by the human feedback that has been given. And the results in practice seem to have a big effect on the success of the system in producing "human-like" output.

In general, it is interesting how little "poking" the "originally trained" network seems to need to get it to usefully go in particular directions. One might have thought that to have the network behave as if it had "learned something new", one would have to go in and run a training algorithm, adjusting weights and so on.

But that is not the case. Instead, it seems sufficient to basically tell ChatGPT something once, as part of the prompt you give it, and then it can successfully make use of what you told it when it generates text. And once again, I think the fact that this works is an important clue in understanding what ChatGPT is "really doing" and how it relates to the structure of human language and thinking.

There is certainly something rather human-like about this: that at least once it has had all that pre-training, you can tell it something just once and it can "remember it" - at least "long enough" to generate a piece of text using it. So what is going on in a case like this? It could be that "everything you might tell it is already in there somewhere" - and you are just leading it to the right spot. But that does not seem plausible. Instead, what seems more likely is that, yes, the elements are already in there, but the specifics are defined by something like a "trajectory between those elements" - and that is what you are introducing when you tell it something.

In fact, like humans, if you tell it something weird, unexpected, and totally out of the framework it knows, it doesn't seem to be able to successfully "integrate" this. It can only "integrate" it if it basically rides on top of the frame it already has in a fairly simple way.

It is also worth pointing out again that there are inevitably "algorithmic limits" to what the neural network can "pick up". Tell it "shallow" rules of the form "this goes to that", etc., and the neural network will most likely be able to represent and reproduce these just fine - and indeed what it "already knows" from language will give it an immediate pattern to follow. But try to give it rules for an actual "deep" computation that involves many potentially computationally irreducible steps, and it just will not work. (Remember that at each step it is always just "feeding data forward" in its network, never looping except by virtue of generating new tokens.)

Of course, the network can learn the answers to specific examples of irreducible computations. But as soon as there are combinatorial numbers of possibilities, no such "lookup-table-style" approach will work. And so, yes, just like humans, it is at that point time for the neural network to "reach out" and use actual computational tools. (And, yes, Wolfram|Alpha and the Wolfram Language are uniquely suitable here, because they have been built to "talk about things in the world", just like language-model neural networks.)

What actually makes ChatGPT work?

Human language—and the thought processes that produce it—have always seemed to represent a pinnacle of complexity. In fact, it seems somewhat remarkable that the human brain — with a network of "only" 100 billion or so neurons (and maybe another 100 trillion connections) — is capable of doing the job. Perhaps, one might imagine, there is something more to the brain than a network of neurons, like some new undiscovered layer of physics. But now with ChatGPT, we have an important new piece of information: We know that a purely artificial neural network with as many connections as the brain has neurons can generate human language surprisingly well.

And, yes, it's still a big and complicated system -- with about as many neural-network weights as there are words of text currently available out there in the world. But at some level it still seems hard to believe that all the richness of language, and of the things it can talk about, can be encapsulated in such a finite system. Part of what's going on no doubt reflects the ubiquitous phenomenon (first made evident by the example of Rule 30) that computational processes can greatly amplify the apparent complexity of a system even when its underlying rules are very simple. But, as we discussed above, neural networks of the kind used in ChatGPT tend to be deliberately constructed to limit the effects of this phenomenon, and of the computational irreducibility associated with it, in order to make their training more tractable.

So how is it that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is, at a fundamental level, somehow simpler than it seems. And this means that ChatGPT -- even with its ultimately straightforward neural-network structure -- is successfully able to "capture" the essence of human language and the thinking behind it. Moreover, in its training, ChatGPT has somehow "implicitly discovered" whatever regularities in language (and thinking) make this possible.

In my opinion, the success of ChatGPT gives us a fundamental and important piece of scientific evidence: it suggests that we can expect significant new "laws of language" -- and effectively "laws of thought" -- to be discovered. In ChatGPT, built as it is as a neural network, those laws are at best implicit. But if we could somehow make the laws explicit, there's the potential to do the kinds of things ChatGPT does in far more direct, efficient, and transparent ways.

But OK, so what might these laws be like? Ultimately, they must give us some kind of prescription for how language -- and the things we say with it -- can be put together. We'll discuss later how "looking inside ChatGPT" may give us some hints about this, and how what we know from building computational languages suggests a path forward. But first let's discuss two long-known examples of what amount to "laws of language" -- and how they relate to the operation of ChatGPT.

The first is the syntax of the language. Language is not just a random collection of words. Instead, there are (fairly) clear grammatical rules about how different kinds of words fit together: in English, for example, a noun can be preceded by an adjective and followed by a verb, but usually no two nouns can be right next to each other. Such grammatical structures can be captured (at least approximately) by a set of rules defining how to put together what amounts to a "parse tree":
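To make the idea of "rules for how words fit together" concrete, here is a toy sketch; the grammar, categories, and word lists are invented for illustration and not taken from any real parser:

```python
# A toy context-free grammar that generates tiny sentences together with
# their parse trees.
import random

random.seed(1)

GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["the", "N"], ["the", "ADJ", "N"]],
    "VP":  [["V", "NP"]],
    "N":   [["cat"], ["dog"], ["bird"]],
    "ADJ": [["small"], ["curious"]],
    "V":   [["chases"], ["sees"]],
}

def expand(symbol):
    """Expand a symbol into a (nested) parse tree."""
    if symbol not in GRAMMAR:          # terminal word
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return (symbol, [expand(s) for s in production])

def leaves(tree):
    """Read the sentence off the leaves of the parse tree."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [w for child in children for w in leaves(child)]

tree = expand("S")
print(tree)
print(" ".join(leaves(tree)))   # e.g. "the small cat chases the dog"
```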

ChatGPT does not have any explicit "knowledge" of such rules. But during training, it implicitly "discovered" these rules, and then seemed to be pretty good at following them. So, how does it work? On a "big picture" level, this is not clear. But for some inspiration, it might be instructive to look at a simpler example.

Consider a "language" of ( and ) sequences whose grammar (https://www.wolframscience.com/nks/notes-7-9--nested-lists/) specifies that parentheses should always be balanced, as in The parse tree is represented as:

Can we train a neural network to produce "grammatically correct" parenthesis sequences? There are various ways to handle sequences in neural networks, but let's use a transformer network, as ChatGPT does. Given a simple transformer network, we can start feeding it grammatically correct parenthesis sequences as training examples. A subtlety (which actually also shows up in ChatGPT's generation of human language) is that in addition to our "content tokens" (here "(" and ")") we have to include an "End" token, whose generation indicates that the output shouldn't continue any further (i.e., for ChatGPT, that we've reached the "end of the story").
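Here is a sketch of what such training examples might look like; the token names and sequence lengths are arbitrary choices, not the actual setup used in the experiment described here:

```python
# Sketch of training data for the parenthesis "language": random balanced
# sequences of "(" and ")", each followed by an explicit end token.
import random

random.seed(0)
END = "<end>"

def balanced_sequence(n_pairs):
    """Generate one random balanced parenthesis sequence with n_pairs pairs."""
    seq, open_count = [], 0
    remaining_opens, remaining_closes = n_pairs, n_pairs
    while remaining_closes > 0:
        # Open if opens remain and nothing is open (or by coin flip); otherwise close.
        if remaining_opens > 0 and (open_count == 0 or random.random() < 0.5):
            seq.append("(")
            remaining_opens -= 1
            open_count += 1
        else:
            seq.append(")")
            remaining_closes -= 1
            open_count -= 1
    return seq + [END]

training_examples = [balanced_sequence(random.randint(1, 8)) for _ in range(5)]
for ex in training_examples:
    print("".join(ex[:-1]), ex[-1])
```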

If we set up a transformer net with just one attention block with 8 heads and feature vectors of length 128 (ChatGPT also uses feature vectors of length 128, but with 96 attention blocks, each with 96 heads), it seems impossible to get it to learn much about the parenthesis language. But with 2 attention blocks, the learning process seems to converge -- at least after 10 million or so examples have been given (and, as is common with transformer networks, showing even more examples just seems to degrade performance).

So, for this network, we can do the analog of what ChatGPT does, and ask for the probabilities of what the next token should be -- in a parenthesis sequence:

In the first case, the network is "pretty sure" that the sequence can't end here -- which is good, because if it did, the parentheses would be left unbalanced. In the second case, it "correctly recognizes" that the sequence can end here, though it also "points out" that it's possible to "start again", putting down a "(", presumably to be followed by a ")". But, oops, even with its 400,000 or so laboriously trained weights, it says there's a 15% probability of ")" being the next token -- which isn't right, because that would necessarily lead to unbalanced parentheses.

If we ask the network to provide the highest probability of completion for progressively longer ( ) sequences, we get the following:

Up to a certain length, the network does fine. But then it starts failing. This is a pretty typical kind of thing to see in a "precise" situation like this with neural networks (or with machine learning in general). Cases that a human "can solve at a glance" the neural network can solve too. But cases that require doing something "more algorithmic" (such as explicitly counting parentheses to see whether they're closed) the neural network tends to be somehow "too computationally shallow" to do reliably. (By the way, even the current full ChatGPT has a hard time correctly matching parentheses in long sequences.)
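For contrast, here is the explicitly "algorithmic" approach of counting parentheses, which is trivial for ordinary sequential code even though a fixed-depth feed-forward network struggles to do it reliably for long sequences (a minimal sketch):

```python
# Explicit counting: the "algorithmic" solution to checking balance.
def is_balanced(seq: str) -> bool:
    depth = 0
    for ch in seq:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closed more than we opened
                return False
    return depth == 0              # everything opened was closed

print(is_balanced("(()())"))   # True
print(is_balanced("(()"))      # False
```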

So what does this mean for things like the grammar of ChatGPT and of a language like English? The parenthesis language is "austere" -- and much more of an "algorithmic story". But in English it's much more realistic to be able to "guess" what's grammatically going to fit on the basis of local choices of words and other hints. And, yes, the neural network is much better at this -- even though it might miss some "formally correct" cases that, well, humans might miss too. But the main point is that the fact that language has an overall syntactic structure -- with all the regularity that implies -- in a sense limits "how much" there is for the neural network to learn. And a key "natural-science-like" observation is that transformer architectures of the kind in ChatGPT's neural network seem able to successfully learn the kind of nested, tree-like syntactic structure that seems to exist (at least in some approximation) in all human languages.

Syntax provides one kind of constraint on language. But there are clearly more. A sentence like "Inquisitive electrons eat blue theories for fish" is grammatically correct, but it's not something one would normally expect to say, and it wouldn't be considered a success if ChatGPT generated it -- because, well, with the normal meanings of the words in it, it basically doesn't make sense.

But is there a general way to tell whether a sentence is meaningful? There's no traditional overall theory of that. But one can think of ChatGPT as having implicitly "developed a theory" of it after being trained on billions of (presumably meaningful) sentences from the web and elsewhere.

What might this theory be like? Well, there's one little corner that's basically been known for two thousand years, and that's logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not. So, for example, it's reasonable to say "all X are Y; this is not Y, so it's not X" (as in "all fish are blue; this is not blue, so it's not a fish"). And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going ("machine-learning-style") through lots of examples of rhetoric, so too one can imagine that in its training ChatGPT was able to "discover syllogistic logic" by looking at lots of text on the web, etc. (And, yes, while we can therefore expect ChatGPT to produce text containing "correct inferences" based on things like syllogistic logic, the story is quite different when it comes to more sophisticated formal logic -- and I think we can expect it to fail here for the same reasons it fails at parenthesis matching.)
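As a toy rendering of that syllogistic pattern (the little "all fish are blue" world below is of course invented):

```python
# A toy check of the pattern: "all X are Y; this is not Y; therefore this is not X".
fish = {"herring", "trout"}
blue_things = {"herring", "trout", "sky"}   # a made-up world in which all fish are blue

def conclude_not_X(is_X, is_Y, thing):
    """Assuming 'all X are Y', not-Y(thing) lets us conclude not-X(thing)."""
    if not is_Y(thing):
        return "not X"      # the inference the syllogism licenses
    return "no conclusion"  # the pattern says nothing if the thing *is* Y

is_fish = lambda t: t in fish
is_blue = lambda t: t in blue_things

print(conclude_not_X(is_fish, is_blue, "rock"))  # "not X": a rock isn't blue, so it isn't a fish
print(conclude_not_X(is_fish, is_blue, "sky"))   # "no conclusion": the sky is blue
assert not is_fish("rock")   # sanity check that the conclusion holds in this toy world
```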

But beyond the narrow example of logic, what can be said about how to systematically construct (or recognize) even plausibly meaningful text? Yes, there are things like Mad Libs that use very specific "phrasal templates". But somehow ChatGPT implicitly has a much more general way of doing it. And maybe there's nothing to be said about how it can be done beyond "it somehow happens when you have 175 billion neural-network weights". But I strongly suspect there's a much simpler and stronger story.

Meaning Space and Semantic Laws of Motion

We discussed above that in ChatGPT any piece of text is effectively represented by an array of numbers, which we can think of as the coordinates of a point in some kind of "linguistic feature space". So when ChatGPT continues a piece of text, this corresponds to tracing out a trajectory in linguistic feature space. But now we can ask what it is that makes this trajectory correspond to text we consider meaningful. Might there be some kind of "semantic laws of motion" that define -- or at least constrain -- how points in linguistic feature space can move around while preserving "meaningfulness"?

So, what does this linguistic feature space look like? Here is an example of how individual words (here common nouns) are laid out if we project such a feature space onto a 2D space:

Another example we saw above was based on words representing plants and animals. But the point in both cases is that "semantically similar words" are placed nearby.
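Here is a minimal sketch of how such a 2D layout can be produced, with random placeholder vectors standing in for real word embeddings, and a simple SVD-based projection standing in for whatever dimension reduction was actually used:

```python
# Project word vectors down to 2D with PCA (via an SVD).
import numpy as np

rng = np.random.default_rng(0)
words = ["cat", "dog", "bird", "apple", "pear", "car", "truck"]

# Placeholder 64-dimensional "embeddings"; a real system would take these
# from a trained model, where semantically similar words end up nearby.
embeddings = rng.normal(size=(len(words), 64))

centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T        # project onto the top two principal axes

for word, (x, y) in zip(words, coords_2d):
    print(f"{word:>6s}  ({x:+.2f}, {y:+.2f})")
```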

As another example, here's how words corresponding to different parts of speech are laid out:

Of course, a given word doesn't in general have just "one meaning" (or necessarily correspond to just one part of speech). And by looking at how sentences containing a word lay out in feature space, we can often "tease apart" the different meanings -- as in the example here of the word "crane" (a bird, or a machine?):

OK, so it's at least plausible to think of this feature space as placing "words nearby in meaning" close together in the space. But what kind of additional structure can we identify in it? Is there, for example, some notion of "parallel transport" that would reflect "flatness" in the space? One way to get a handle on that is to look at analogies:

And, yes, even when we project into two dimensions, there tends to be at least a "hint of flatness", although it's certainly not universally visible.
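Here is a small sketch of the vector-arithmetic version of that idea, using tiny hand-made vectors (not real embeddings) purely for illustration:

```python
# The "analogy" idea as vector arithmetic: if the space were locally "flat",
# offsets like (paris - france) and (rome - italy) would be roughly parallel.
import numpy as np

vec = {
    "france": np.array([1.0, 0.0, 0.2]),
    "paris":  np.array([1.1, 1.0, 0.25]),
    "italy":  np.array([0.0, 0.1, 1.0]),
    "rome":   np.array([0.1, 1.1, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

capital_offset_1 = vec["paris"] - vec["france"]
capital_offset_2 = vec["rome"] - vec["italy"]
print(cosine(capital_offset_1, capital_offset_2))   # close to 1 => nearly "parallel"

# The usual analogy completion: france : paris :: italy : ?
query = vec["italy"] + capital_offset_1
best = max((w for w in vec if w != "italy"), key=lambda w: cosine(vec[w], query))
print(best)   # "rome", with these made-up vectors
```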

So, what about a trajectory? We can look at the trajectory that a prompt for ChatGPT follows in feature space -- and then we can see how ChatGPT continues it:

There's certainly no "geometrically obvious" law of motion here. And that's not at all surprising; we fully expect this to be a considerably more complicated story. And, for example, even if a "semantic law of motion" is there to be found, it's far from obvious what embedding (or, in effect, what "variables") it will most naturally be stated in.

In the image above, we show several steps in the "trajectory" - at each step we pick the word that ChatGPT thinks is the most likely (the "zero temperature" case). But we can also ask, at a certain point, which words can be "next" with what probability:

In this case, what we see is a "fan" of high-probability words that seems to go in a more or less definite direction in feature space. What happens if we go further? Here are the successive "fans" that occur as we "move along" the trajectory:

Here's a 3D representation with a total of 40 steps:

And, yes, this looks like a mess -- and it doesn't do anything to particularly encourage the idea that we can expect to identify "mathematical-physics-like" "semantic laws of motion" by empirically studying "what ChatGPT is doing inside". But perhaps we're just looking at the "wrong variables" (or the wrong coordinate system), and if only we looked at the right ones, we'd immediately see that ChatGPT is doing something "mathematically simple", like following geodesics. But as of now, we're not ready to "empirically decode" from its "internal behavior" what ChatGPT has "discovered" about how human language is "put together".
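For readers who want to poke at this kind of thing themselves, here is a minimal sketch of probing the "fan" of next-token probabilities step by step. It uses the small public GPT-2 model via the Hugging Face transformers library as a stand-in for ChatGPT (whose internals aren't directly inspectable this way), and assumes transformers and torch are installed:

```python
# Inspect the top next-token candidates at each step of a short trajectory,
# taking the most likely token each time (the "zero temperature" case).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The best thing about AI is its ability to"
for step in range(5):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=5)                        # the "fan" at this step
    candidates = [(tokenizer.decode(int(i)), float(p))
                  for i, p in zip(top.indices, top.values)]
    print(step, candidates)
    text += candidates[0][0]                            # append the most likely token
```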

Semantic Grammar and the Power of Computational Languages

What does it take to produce "meaningful human language"? In the past, we might have assumed it could be nothing short of a human brain. But now we know it can be done quite respectably by the neural network of ChatGPT. Still, maybe that's as far as we can go, and nothing simpler -- or more humanly understandable -- will work. But what I strongly suspect is that the success of ChatGPT implicitly reveals an important "scientific" fact: that there's actually a lot more structure and simplicity to meaningful human language than we ever knew, and that in the end there may even be fairly simple rules that describe how such language can be put together.

As we mentioned above, syntactic grammar gives rules for how words corresponding to things like different parts of speech can be put together in human language. But to deal with meaning, we need to go further. And one version of how to do this is to think about not just a syntactic grammar for the language, but also a semantic one.

For the purposes of syntactic grammar, we identify things like nouns and verbs. But for the purposes of semantics, we need "finer gradations". So, for example, we might identify the notion of "moving", and the notion of an "object" that "maintains an identity independent of location". There are endless specific examples of each of these "semantic concepts". But for the purposes of our semantic grammar, we'll just have some general kind of rule that basically says that "objects" can "move". There's a lot to say about how all this might work (some of which I've said before). But I'll content myself here with a few remarks that indicate some potential paths forward.

It's worth mentioning that even if a sentence is perfectly fine according to the semantic grammar, that doesn't mean it has been (or even could be) realized in practice. "The elephant traveled to the Moon" would doubtless "pass" our semantic grammar, but it certainly hasn't been realized in our actual world (at least yet) -- though it's absolutely fair game for a fictional world.

When we start talking about a "semantic grammar", we quickly ask: "What's underneath it?" What "world model" does it assume? Syntactic grammar is really just a matter of building language out of words. However, a semantic grammar necessarily involves some kind of "world model" -- something that acts as a "skeleton" on which a language made of actual words can be layered.

Until recently, we might have imagined that (human) language would be the only general way of describing our "model of the world". The formalization of certain kinds of things, especially based on mathematics, has been going on for centuries. But now there is a more general formal approach: computational languages.

Yes, this has been a big project of mine for more than forty years (now embodied in the Wolfram Language): to develop a precise symbolic representation that can talk as broadly as possible about things in the world, as well as about the abstract things we care about. So, for example, we have symbolic representations of cities, molecules, images, and neural networks, and we have built-in knowledge about how to compute with those things.

And, after decades of work, we've covered a lot of ground this way. But in the past, we didn't deal with "everyday discourse" in particular. In "I bought two pounds of apples", we can easily represent (and perform nutritional and other calculations on) "two pounds of apples". But we don't (yet) have a symbolic representation of "I bought it".

And that's really the point of the idea of a semantic grammar -- the goal is to have a general symbolic "construction kit" for concepts, which would give us rules for what can fit together with what, and thus rules for the "flow" of what we might turn into human language.

But suppose we have this "symbolic discourse language". What would we do with it? We could start out doing things like generating "locally meaningful text". But ultimately we're likely to want results that are more "globally meaningful" -- which means "computing" more about what can actually exist or happen in the world (or perhaps in some consistent fictional world).

Now, in the Wolfram Language we have a huge amount of built-in computational knowledge about many kinds of things. But for a complete symbolic discourse language, we would have to build in additional "calculi" about things in the world in general: if an object moves from place A to place B, and from place B to place C, then it has moved from place A to place C, and so on.
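As a toy sketch of that kind of rule (the place names and the encoding of "moves" as pairs are invented here):

```python
# A transitivity rule for movement: from A->B and B->C, conclude A->C.
def transitive_closure(moves):
    """moves: set of (from_place, to_place) facts; returns all implied moves."""
    implied = set(moves)
    changed = True
    while changed:
        changed = False
        for a, b in list(implied):
            for b2, c in list(implied):
                if b == b2 and (a, c) not in implied:
                    implied.add((a, c))
                    changed = True
    return implied

facts = {("A", "B"), ("B", "C")}
print(transitive_closure(facts))   # includes ("A", "C")
```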

Given a symbolic discourse language, we can use it to make "independent statements". But we can also use it to ask questions about the world, "Wolfram|Alpha style". Or we could use it to state that we "want it to be like this", presumably with some external enforcement mechanism. Or we can use it to make assertions -- maybe about the real world, maybe about a particular world we're considering, fictional or otherwise.

Human language is fundamentally imprecise, not least because it isn't "tethered" to a specific computational implementation, and its meaning is basically defined just by a "social contract" between its users. But computational language, by its nature, has a certain fundamental precision -- because in the end what it specifies can always be "unambiguously executed on a computer". Human language can usually get away with a certain vagueness. (When we say "planet", does it include exoplanets or not, etc.?) But in computational language we have to be precise and clear about all the distinctions we're making.

In computational language, it's often convenient to use ordinary human language to make up names. But what those names mean in the computational language is necessarily precise -- and may or may not cover some particular connotation that the words have in typical human-language usage.

How should we figure out the fundamental "ontology" suitable for a general symbolic discourse language? Well, it's not easy. Which is perhaps why little has been done on this since the primitive beginnings Aristotle made more than two thousand years ago. But it really helps that today we know so much about how to think about the world computationally (and it doesn't hurt to have a "fundamental metaphysics" from our Physics Project and the idea of the ruliad).

But what does all this mean in the context of ChatGPT? From its training, ChatGPT has effectively "pieced together" a certain (rather impressive) quantity of what amounts to semantic grammar. But its very success gives us reason to think that it's going to be feasible to construct something more complete in computational-language form. And, unlike what we've so far managed to figure out about the innards of ChatGPT, we can expect the computational language to be designed so that it's readily understandable to humans.

When we talk about semantic grammar, we can draw an analogy with syllogistic logic. At first, syllogistic logic was essentially just a collection of rules about sentences expressed in human language. But (yes, two thousand years later) when formal logic was developed, the original basic constructs of syllogistic logic could be used to build huge "formal towers" that include, for example, the operation of modern digital circuitry. And we can expect the same to be true of semantic grammar more generally. At first, it may just be able to deal with simple patterns, expressed, say, as text. But once its whole computational-language framework is built, we can expect it to be usable to erect tall towers of "generalized semantic logic" that allow us to work in a precise and formal way with all sorts of things that have previously only been accessible to us at a "lower level", through human language with all its vagueness.

We can think of the construction of computational language -- and of semantic grammar -- as representing a kind of ultimate compression in describing things. Because it allows us to talk about the essence of what's possible without, for example, having to deal with all the "turns of phrase" that exist in ordinary human language. And we can view the great strength of ChatGPT as something a bit similar: because it too has, in a sense, "drilled through" to the point where it can "put language together in a semantically meaningful way" without concern for different possible turns of phrase.

So what would happen if we applied ChatGPT to an underlying computational language? The computational language can describe what's possible. But what can still be added is a sense of "what's popular" -- based, say, on reading everything on the web. But then, underneath, operating with computational language means that something like ChatGPT has immediate and fundamental access to what amount to ultimate tools for harnessing potentially irreducible computations. And that makes it a system that can not only "generate plausible text", but can expect to work out whatever can be worked out about whether that text actually makes "correct" statements about the world -- or whatever it's supposed to be talking about.

So... what is ChatGPT doing? Why can it do this?

The basic concept of ChatGPT is at some level rather simple. Start with a huge sample of human-created text from the web, books, and so on. Then train a neural network to generate text that's "like this". In particular, make it able to start from a "prompt" and then go on to generate text "like what it's been trained with".

As we can see, the actual neural network in ChatGPT is composed of very simple elements, albeit billions of them. The basic operation of the neural network is also very simple, mainly "passing the input once" through its elements (without any loops, etc.) for each new word (or part of a word) it generates.
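Here is a cartoon of that generation loop (a minimal sketch: the probability table is invented, and "temperature" is simplified to just "take the most likely word" versus "sample from the distribution"):

```python
# Each new token comes from a single "forward pass" -- here, just one lookup
# in a made-up probability table; the only loop is the outer one over tokens.
import random

random.seed(0)

# Hypothetical next-word probabilities, standing in for a real network's output.
NEXT_WORD_PROBS = {
    "the":    [("cat", 0.5), ("dog", 0.3), ("theory", 0.2)],
    "cat":    [("sat", 0.6), ("ran", 0.4)],
    "dog":    [("ran", 0.7), ("sat", 0.3)],
    "sat":    [("<end>", 1.0)],
    "ran":    [("<end>", 1.0)],
    "theory": [("<end>", 1.0)],
}

def next_word(word, temperature):
    """Simplified: zero temperature means 'most likely word'; anything else means plain sampling."""
    candidates = NEXT_WORD_PROBS[word]
    if temperature == 0.0:
        return max(candidates, key=lambda wp: wp[1])[0]
    words, probs = zip(*candidates)
    return random.choices(words, weights=probs)[0]

tokens = ["the"]
while tokens[-1] != "<end>":
    tokens.append(next_word(tokens[-1], temperature=0.8))
print(" ".join(tokens[:-1]))   # e.g. "the dog ran"
```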

But, remarkably, this process can produce text that's successfully "like" what's out there on the web, in books, and so on. And not only is it coherent human language, it also "says things" that "follow its prompt" and make use of content it has "read". It doesn't always say things that "globally make sense" (or correspond to correct computations) -- because (without, for example, access to the "computational superpowers" of Wolfram|Alpha) it's just saying things that "sound right" based on what things "sounded like" in its training material.

The specific engineering of ChatGPT has made it quite compelling. But ultimately (at least until it can use external tools) ChatGPT is "merely" pulling out some "coherent thread of text" from the "statistics of conventional wisdom" it has accumulated. But it's amazing how human-like the results are. And as I've discussed, this points to something that's at least scientifically very important: that human language (and the patterns of thinking behind it) are somehow simpler and more "law-like" in their structure than we thought. ChatGPT has implicitly discovered this. But we potentially have the chance to expose it explicitly, with semantic grammar, computational language, and so on.

ChatGPT does an impressive job at generating text, and the results are often very much like what we humans would produce. So, does this mean that ChatGPT works like a brain? Its underlying artificial neural network structure is ultimately modeled on an idealization of the brain. Moreover, many aspects of what seems likely to happen when we humans produce language are similar.

When it comes to training (a.k.a. learning), the different "hardware" of the brain and of current computers (as well as, perhaps, some undeveloped algorithmic ideas) forces ChatGPT to use a strategy that's probably rather different (and in some ways much less efficient) than the brain's. And there's something else as well: unlike even typical algorithmic computation, ChatGPT has no internal "loops" or "recomputation on data". And that inevitably limits its computational capability -- even with respect to current computers, but certainly with respect to the brain.

It's unclear how to "fix this" and still maintain the ability to train the system with reasonable efficiency. But doing so will presumably allow future ChatGPTs to do more "brain-like things." Of course, there are plenty of things that the brain doesn't do well -- especially when it comes to computing that amounts to irreducible calculations. For these, both the brain and things like ChatGPT must look to "external tools" -- such as the Wolfram Language.

But for now, it's exciting to see what ChatGPT has already been able to do. At some level it's a great example of the fundamental scientific fact that large numbers of simple computational elements can do remarkable and unexpected things. But it also gives us the best impetus we've had in two thousand years to better understand the fundamental character and principles of that central feature of the human condition that is human language, and the processes of thinking that underlie it.

