Birth of a Transformer

The title is taken from the Meta AI paper discussed below.

The goal: to better understand the internal mechanisms of Transformer-based large language models (LLMs) in order to improve their reliability and interpretability.

As large language models (LLMs) see ever wider use and deployment, it becomes increasingly important to open these black boxes and understand their inner workings. A better understanding of how these models make decisions is critical to improving them and mitigating their failures, such as hallucinations or reasoning errors.

It is well known that an important factor in the recent success of LLMs is their ability to learn and reason from context. This in-context learning ability is generally attributed to the Transformer architecture, specifically its self-attention blocks, which let the model attend selectively to parts of the input sequence in order to reason about plausible next tokens. Moreover, predictions may also require global knowledge, such as grammar rules or general facts, which may not be present in the context and must instead be stored in the model's weights.

One cannot help but wonder: why are Transformer-based models so good at using their context to predict new tokens, and how does this ability arise during training? With these questions in mind, researchers from Meta AI studied the Transformer's learning mechanism in a synthetic setting. They reveal the balance between global and in-context learning and interpret weight matrices as associative memories, providing a basis for understanding and optimizing Transformers.

Paper: https://arxiv.org/pdf/2306.00802.pdf

The first question is how the Transformer develops these capabilities during training. To this end, the study introduces a synthetic dataset of sequences generated by a bigram language model. The model must rely on in-context learning to correctly predict sequence-specific bigrams, while global bigrams can be guessed from the global statistics of the current token. While a single-layer Transformer cannot reliably predict in-context bigrams, a two-layer Transformer succeeds by developing an induction head mechanism: a circuit of two attention heads that allows the Transformer to predict b from a context of the form [..., a, b, ..., a]. This induction head mechanism appears to be ubiquitous in Transformer language models.
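To make the mechanism concrete, here is a tiny pure-Python sketch of the prediction rule that an induction head implements (the rule itself, not the attention circuit that realizes it); the function name and example tokens are illustrative only.

```python
def induction_predict(tokens):
    """Induction-head rule: for the current (last) token a, find its most recent
    earlier occurrence and predict the token b that followed it."""
    a = tokens[-1]
    for j in range(len(tokens) - 2, -1, -1):  # scan earlier positions from right to left
        if tokens[j] == a:
            return tokens[j + 1]
    return None  # the current token has not appeared before


# A sequence of the form [..., a, b, ..., a] yields b:
print(induction_predict([5, 2, 9, 7, 1, 3, 2]))  # a = 2 was previously followed by 9 -> predicts 9
```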

Furthermore, to better understand how the in-context mechanism emerges during training, the study freezes some layers (including the embeddings and value matrices) at their random initialization, further simplifying the model. This shifts the focus to the attention and feed-forward mechanisms while avoiding the difficulty of learning representations. This simplification also introduces a natural model of an individual weight matrix as an associative memory, which stores input-output (key-value) embedding pairs via their outer products. Random high-dimensional embeddings are particularly well suited to this view because they are nearly orthogonal.

The contributions of this study can be summarized as follows:

  • This paper introduces a new synthetic setting to study global and in-context learning: sequences follow a bigram language model, where some bigrams vary across sequences and others do not.

  • This paper treats the Transformer's weight matrices as associative memories that learn to store specific pairs of embeddings, and uses this view to derive a simplified but more interpretable model.

  • This paper presents a careful empirical study of the training dynamics: the global bigrams are learned first, and the appropriate memories are then learned in a top-down fashion to form the induction head.

  • This paper gives theoretical insights into the training dynamics, showing how a few top-down gradient steps on the population loss can recover the desired associative memories by finding the signal in noisy inputs.

Method overview

The study then introduces the synthetic data setting, which makes it possible to closely study how the induction head mechanism develops during training and how the Transformer learns to exploit contextual information.

Bigram data model: sequences are generated from a generic bigram language model (i.e., a Markov chain), except that in each sequence the transitions from a few trigger tokens are modified to point to sequence-specific output tokens, so these bigrams can only be predicted from the context. Figure 2 below visualizes the attention maps on a test sequence, showing that the model has learned the induction head mechanism.
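As a rough illustration of this setup, the following numpy sketch generates such sequences under one reading of the description above; the vocabulary size, the number of triggers, and the way triggers and outputs are sampled are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, K = 40, 64, 3                    # vocab size, sequence length, number of triggers (illustrative)

# Global bigram model shared by all sequences: initial distribution pi, transition matrix P.
pi = rng.dirichlet(np.ones(V))
P = rng.dirichlet(np.ones(V), size=V)  # P[i] = global next-token distribution after token i

def sample_sequence():
    """One synthetic sequence: global bigram transitions, except that each trigger
    token is followed by a sequence-specific output token."""
    triggers = rng.choice(V, size=K, replace=False)
    outputs = rng.integers(0, V, size=K)
    output_of = {int(q): int(o) for q, o in zip(triggers, outputs)}
    seq = [int(rng.choice(V, p=pi))]
    for _ in range(T - 1):
        prev = seq[-1]
        if prev in output_of:          # in-context bigram: fixed within this sequence only
            seq.append(output_of[prev])
        else:                          # global bigram: sampled from the shared Markov chain
            seq.append(int(rng.choice(V, p=P[prev])))
    return seq, output_of

seq, output_of = sample_sequence()
print(output_of, seq[:16])
```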

The study then introduces the associative memory view of the Transformer: because the embeddings are nearly orthogonal, a weight matrix can behave as an associative memory that stores pairs of embeddings as a weighted sum of their outer products. The study introduces a simplified Transformer model with fixed random embeddings and uses this idea to obtain a precise understanding of the learning dynamics.
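The following small numpy sketch illustrates this view with illustrative dimensions (it is not the paper's code): key-value embedding pairs are stored as a sum of outer products in a single matrix, and retrieval is a single matrix-vector product that works because random high-dimensional embeddings are nearly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1024, 50                                    # embedding dimension, number of stored pairs
keys = rng.standard_normal((n, d)) / np.sqrt(d)    # random high-dimensional embeddings,
values = rng.standard_normal((n, d)) / np.sqrt(d)  # nearly orthogonal to one another

# Associative memory: one matrix storing all pairs as a sum of outer products v_i k_i^T.
W = values.T @ keys                                # equivalent to sum_i outer(values[i], keys[i])

# Retrieval: W @ key_i ~= value_i, since <key_i, key_j> ~= 0 for j != i.
i = 7
retrieved = W @ keys[i]
cos = values @ retrieved / (np.linalg.norm(values, axis=1) * np.linalg.norm(retrieved))
print(int(np.argmax(cos)), round(float(cos[i]), 3))  # -> 7, with cosine similarity close to 1
```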

Furthermore, the study proposes a useful view of the Transformer's weight matrices as associative memories over high-dimensional embedding vectors. The induction head mechanism can be obtained by using suitable outer-product matrices as memories, while all other weights are kept at their random initialization.
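The sketch below hand-builds such a circuit under stated assumptions: a two-layer attention-only model with frozen random token and positional embeddings, in which the first layer's key-query matrix is an outer-product memory that attends to the previous position and copies a remapped version of that token into the residual stream, and the second layer's key-query matrix matches the current token against those copies. The matrices M1, M2, the remapping Wv1, and all constants are illustrative choices, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
V, T, d = 10, 16, 512
wE = rng.standard_normal((V, d)) / np.sqrt(d)   # frozen random token embeddings
pos = rng.standard_normal((T, d)) / np.sqrt(d)  # frozen random positional embeddings
Wv1 = rng.standard_normal((d, d)) / np.sqrt(d)  # frozen random layer-1 value/output map

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Layer-1 memory: a query at position t matches the key at position t-1 ("attend to predecessor").
M1 = sum(np.outer(pos[t], pos[t - 1]) for t in range(1, T))
# Layer-2 memory: the current token's embedding matches the remapped copy of the
# predecessor token that layer 1 wrote into the residual stream.
M2 = sum(np.outer(wE[k], Wv1 @ wE[k]) for k in range(V))

def induction_head(tokens, beta=30.0):
    """Predict the next token for the last position using the hand-built two-layer circuit."""
    n = len(tokens)
    x = wE[tokens] + pos[:n]                    # residual stream, shape (n, d)
    # Layer 1: scores[t, s] = x_t^T M1 x_s are large when s == t - 1 (causal mask applied).
    s1 = beta * (x @ M1 @ x.T)
    s1[np.triu_indices(n, k=1)] = -np.inf
    x = x + softmax(s1) @ x @ Wv1.T             # add a remapped copy of each attended token
    # Layer 2: the last position attends to positions whose predecessor equals its own token.
    s2 = beta * (x[-1] @ M2 @ x.T)
    attended = softmax(s2) @ x
    logits = wE @ attended                      # output memory: embedding -> vocabulary logits
    return int(np.argmax(logits))

# [..., a, b, ..., a] -> b: token 7 was followed by 1 earlier, so the circuit predicts 1.
print(induction_head([3, 7, 1, 4, 9, 2, 7]))
```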

Figure 3 examines the effect on the training dynamics of freezing different layers for the first 300 iterations.

Global vs. in-context learning. As Figure 4 (left and right) shows, when all layers are trained jointly, the global bigram statistics tend to be learned faster than the induction head, as seen from the rapid decrease of the loss and KL divergence in early iterations.

Furthermore, as seen in Figure 4 (left), changes to the data distribution can have a significant impact on how quickly the context mechanism is learned. The study observes that in-context learning can be slowed by (i) a smaller number of triggers K, (ii) using only a few fixed triggers, and (iii) using random triggers instead of fixed triggers.

The study also shows in Figure 4 (middle) that changing the output-token distribution at training time to the bigram distribution reduces accuracy, suggesting that a more diverse training distribution yields a model with better generalization accuracy at only a small additional training cost.

