Stanford trains a Transformer-alternative model: 170 million parameters, debiasable, controllable, and interpretable


Paper address: https://arxiv.org/abs/2305.16765

Project address: https://backpackmodels.science


Is it better to carry words in a backpack than in a bag? In this paper, researchers at Stanford University propose the Backpack, an intervenable language model: by adjusting its sense vectors, one can intervene in the model's behavior and steer it toward desired outputs.


Large language models, of which GPT is the best-known representative, have achieved and will continue to achieve remarkable results, but they also have well-known problems, such as biases caused by imbalanced training data.

In response, researchers at Stanford University have proposed a new neural architecture, the Backpack, which regulates so-called sense vectors to intervene in a language model's behavior and steer it toward desired outputs. Both the code and the models for the project have been released.

John Hewitt, first author of the paper and a CS PhD student at Stanford University, says that Backpacks are an alternative to Transformers that scale up in expressivity while providing a new interface for interpretability through control: a Backpack learns k non-contextual sense vectors for each word, unsupervisedly decomposing the word's predictive uses.


Introduction

Suppose we are given the sentence prefix "The CEO believes that _", and our task is to debias the neural language model's gender distribution for its continuation. Intuitively, the gender bias in this prefix comes from the word "CEO": if "CEO" is replaced with "nurse", the bias flips to the other gender. To debias CEO, an intervention must be applied to the model in every context in which the word CEO appears.

Ideally, the intervention should apply regardless of the model's context, and its effect should be predictable. In general, for both interpretability and control, it is preferable to implement interventions through an easily manipulated interface, such as a non-contextual representation, that applies globally.

For Transformers, however, such interventions are difficult to achieve, because their contextual representations are monolithic functions of the input: any intervention on a Transformer has complex, non-linear effects that depend on the context. What we want instead are models that admit rich, precise interventions whose effects can be predicted in every context, while still remaining expressive; such models could then become a viable alternative to Transformers.

To address these challenges, the researchers propose a new neural architecture, the Backpack, whose predictions are log-linear combinations of non-contextual representations. Their approach represents each word in the vocabulary as a set of non-contextual sense vectors, which capture different learned aspects of the word.

For example, the sense vectors of the word "science" can encode types of science, the relationship to technology, recognized scientific notions, and different aspects of the scientific process (replication or experiment); see Table 1 below. Sense vectors do not learn classical word senses, but rather more general aspects of a word's potential roles across contexts; in fact, sense vectors can be viewed as a multi-vector generalization of classical word vectors.


Figure 1: Transformers are monolithic functions of sequences, whereas Backpack outputs are weighted sums of non-contextual aspects (senses) of the learned words.

So that interventions on sense vectors have predictable results across contexts, a Backpack represents each word in a sequence as a linear combination of the sense vectors of all the words in the sequence. The Backpack's expressive power comes from the network that computes the weights of this linear combination as a function of the entire sequence; in the experiments, this network is a Transformer. Since sense vectors are selected softly depending on context, they can specialize: each sense can learn to be predictively useful only in certain contexts, and that usefulness can be predicted. Senses then contribute to predictions log-linearly, which means that an intervention on a sense vector applies identically (up to a non-negative scalar weight) regardless of context.

The researchers' experiments show that the Backpack language model is indeed expressive, and that interventions on sense vectors help interpret and control the model. They trained Backpack language models on 50 billion tokens of OpenWebText; a Backpack whose contextualization network has 124 million parameters (plus 46 million parameters for sense vectors) matches the perplexity of a 124-million-parameter Transformer, so interpretability is paid for with a somewhat larger model. The researchers also show how sense vectors come to encode rich notions of word meaning.

In quantitative evaluations on four lexical similarity datasets (such as SimLex999), the sense vectors of a 170-million-parameter Backpack outperform the word embeddings of the 6-billion-parameter GPT-J-6B Transformer, and approach the state of the art among methods specialized for this task. The researchers also show that sense vectors provide a control mechanism for Backpack language models.

For example, for words with stereotypically gendered occupational associations (such as "CEO" or "nurse"), a sense vector associated with this gender bias is often learned; the researchers found that by scaling down the magnitude of that sense vector, gender disparity in contextual predictions is greatly reduced in a restricted setting.


Table 1: Left, examples of sense vectors for the word science, with rich domain-specific affinities; right, an example of editing a sense vector non-contextually (making MacBook relevant to HP), which changes the resulting contextual predictions.

Backpack Architecture

Below, we first define the general form of the Backpack architecture, then note that the continuous bag-of-words model word2vec (CBOW) and single-layer attention-only networks are in fact special cases of Backpacks.

  • General form of the Backpack

A Backpack is a parametric function that maps a sequence of symbols $x_{1:n} = (x_1, \dots, x_n)$ to a sequence of vectors $o_{1:n} = (o_1, \dots, o_n)$, where each symbol $x_i$ belongs to a finite vocabulary $\mathcal{V}$ and $o_i \in \mathbb{R}^d$. We call $o_i$ the Backpack representation of $x_i$ in the context of the sequence $x_{1:n}$.

Sense vectors. For each $x \in \mathcal{V}$, a Backpack constructs $k$ sense vectors:

$$C(x) = (C(x)_1, \dots, C(x)_k)$$

where $C : \mathcal{V} \to \mathbb{R}^{k \times d}$. Sense vectors are a multi-vector analog of classic non-contextual word representations such as word2vec or GloVe.

Weighted sum. For a sequence $x_{1:n}$, the representation $o_i$ of element $x_i$ is a weighted sum of the sense vectors of the words in its context: given contextualization weights $\alpha \in \mathbb{R}^{k \times n \times n}$,

$$o_i = \sum_{j=1}^{n} \sum_{\ell=1}^{k} \alpha_{\ell i j}\, C(x_j)_\ell. \qquad (1)$$

The contextualization weights $\alpha$ of a Backpack are in turn defined by a (non-linear) contextualization function of the entire sequence $x_{1:n}$:

$$\alpha = A(x_{1:n}), \quad \text{where } A : \mathcal{V}^n \to \mathbb{R}^{k \times n \times n}.$$
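To make the weighted sum concrete, here is a minimal PyTorch sketch of Equation (1); the tensor shapes are assumptions matching the definitions above, not the authors' released code:

```python
import torch

# Minimal sketch of Eq. (1): o_i = sum_j sum_l alpha[l, i, j] * C(x_j)_l.
# C: (n, k, d) sense vectors of the n tokens; alpha: (k, n, n) weights.
def backpack_representations(C: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    return torch.einsum("lij,jld->id", alpha, C)  # (n, d) representations o_{1:n}
```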

The name Backpack is inspired by the fact that a backpack is like a bag, but more organized. Like a bag-of-words, a Backpack representation is a weighted sum of non-contextual senses; but a Backpack is more organized, because the weights of this sum depend on the ordered sequence.

Backpack model. A Backpack model is a probabilistic model that defines probabilities over some output space $\mathcal{Y}$ as a log-linear function of a Backpack representation $o_{1:n}$:

$$p(y \mid x_{1:n}) = \mathrm{softmax}(E(o_{1:n})), \qquad (2)$$

where $E : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{Y}|}$ is a linear transformation. Because the Backpack model is log-linear in its representations, the sense vectors also contribute log-linearly to predictions. This lets us inspect a sense vector by projecting it onto the vocabulary via $E$ and observing exactly how it contributes to predictions in any context.

The function $A$ can be parameterized by common deep neural networks, including LSTMs and Transformers; these networks are not themselves Backpacks, because their output representations are (relatively) unconstrained functions of the entire sequence. By contrast, a Backpack's expressivity looks limited: its representations $o_i$ are scalar-weighted sums of non-contextual vectors, so contextual relationships between sequence elements can only be expressed through the weights $\alpha$. Nevertheless, the experiments show that an expressive contextualization weight network can represent complex functions through these weighted sums of sense vectors: for example, the newly proposed 170-million-parameter Backpack language model uses a 124-million-parameter Transformer to compute $\alpha$, and achieves the same loss as a 124-million-parameter Transformer language model.

The researchers show formally that both the continuous bag-of-words model and single-layer attention are special cases of the Backpack; we will not go into the details here and refer readers to the original paper.

Language Modeling with Backpack

The researchers parameterize a neural autoregressive language model with a Backpack. For the probability of the next token in the sequence, they use the standard softmax parameterization, with a weight matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$ that maps representations to logits:

$$p(x_i \mid x_{1:i-1}) = \mathrm{softmax}(E\, o_{i-1}). \qquad (3)$$

Recall that the Backpack representation $o_j$ is defined by the sense vectors $C(x)$ and the contextualization weights $\alpha_j$. Below, the parameterization of the sense function $C$ from Equation (1) is presented first, followed by the parameterization of the contextualization weight network $A$. When $o_j$ is parameterized by a Backpack, the model is called a Backpack language model.
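As a road map for the two parameterizations that follow, here is a minimal sketch of a Backpack LM forward pass; `senses` and `contextualizer` are assumed stand-ins for the networks defined in the next two subsections, not the authors' API:

```python
import torch
import torch.nn as nn

# Sketch of the Backpack LM forward pass (Eqs. 1 and 3), under assumed interfaces:
# senses(x) -> (n, k, d) sense vectors; contextualizer(emb) -> (k, n, n) weights.
def backpack_lm_logits(x: torch.Tensor, senses, contextualizer,
                       E: nn.Embedding) -> torch.Tensor:
    C = senses(x)                               # (n, k, d)
    alpha = contextualizer(E(x))                # (k, n, n)
    o = torch.einsum("lij,jld->id", alpha, C)   # (n, d) Backpack representations
    return o @ E.weight.T                       # (n, |V|); row j scores x_{j+1}
```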

  • Parameterizing senses

For the sense function $C : \mathcal{V} \to \mathbb{R}^{k \times d}$, each $x \in \mathcal{V}$ is embedded into $\mathbb{R}^d$, and the embeddings are passed through a feed-forward network $\mathrm{FF} : \mathbb{R}^d \to \mathbb{R}^{k \times d}$:

$$C(x) = \mathrm{FF}(E x), \qquad (4)$$

where the embedding/projection matrix $E$ is tied with the output matrix of Equation (3). One could instead define all $k \times |\mathcal{V}|$ sense vectors with a lookup table, but the parameter count becomes very large as $k$ grows. The approach taken here, embedding words into $\mathbb{R}^d$ and then blowing them up to $\mathbb{R}^{k \times d}$ with shared weights, may also explain the correlated sense roles observed across different word types.
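A minimal PyTorch sketch of Equation (4); the hidden width and activation are assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

# Sketch of the sense function C(x) = FF(E x) (Eq. 4). The 4*d hidden width
# and GELU activation are assumptions; E is tied with the output matrix.
class SenseNetwork(nn.Module):
    def __init__(self, embed: nn.Embedding, d: int, k: int):
        super().__init__()
        self.embed = embed  # shared with the softmax output matrix
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, k * d))
        self.k, self.d = k, d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n,) token ids -> (n, k, d) sense vectors
        return self.ff(self.embed(x)).view(-1, self.k, self.d)
```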

  • Parameterizing contextualization weights

The researchers parameterize $A$ with a standard Transformer followed by one layer of multi-headed key-query self-attention; that is, the embedded sequence is passed through a Transformer

$$h_{1:n} = \mathrm{Transformer}(E x_{1:n}) \qquad (5)$$

(with appropriate autoregressive masking and some position representation), and the weights are then computed as

$$\alpha^{(\ell)} = \mathrm{softmax}\!\left(h_{1:n} Q^{(\ell)} \left(h_{1:n} K^{(\ell)}\right)^{\!\top}\right) \qquad (6)$$

for each sense $\ell = 1, \dots, k$, where $Q^{(\ell)}, K^{(\ell)} \in \mathbb{R}^{d \times d/k}$.

The researchers view these $k$ senses as heads: for each head, the contextualization weights define a distribution of attention over the words.
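A minimal sketch of Equations (5)–(6), assuming a caller-supplied autoregressive Transformer; masking and attention-scaling details are omitted for brevity:

```python
import torch
import torch.nn as nn

# Sketch of the contextualization network (Eqs. 5-6): a Transformer encodes
# the embedded sequence, then one multi-headed key-query layer yields alpha.
class ContextualizationNetwork(nn.Module):
    def __init__(self, transformer: nn.Module, d: int, k: int):
        super().__init__()
        self.transformer = transformer             # assumed autoregressive, position-aware
        self.query = nn.Linear(d, d, bias=False)   # packs all k heads of width d // k
        self.key = nn.Linear(d, d, bias=False)
        self.k = k

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (n, d) embedded tokens -> alpha: (k, n, n)
        h = self.transformer(emb)                             # (n, d)
        n, d = h.shape
        q = self.query(h).view(n, self.k, d // self.k)
        kk = self.key(h).view(n, self.k, d // self.k)
        logits = torch.einsum("ilh,jlh->lij", q, kk)          # (k, n, n)
        return logits.softmax(dim=-1)                         # attention per sense/head
```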

Experiments on training the Backpack language model

This section summarizes the researchers' validation experiments, covering hyperparameters, data and optimization procedures, evaluations, and results for training Backpack and Transformer language models. We will not go into full detail here, but one highlighted finding is that learning k > 1 sense vectors is necessary for good language modeling performance.


Table 2: Language modeling performance on OWT; all models were trained for 100,000 steps with a batch size of 500,000 tokens. For the perplexity (PPL) metrics, lower is better; for accuracy metrics, higher is better. Note that parameter counts are not comparable across rows; each Backpack contains a Transformer of the corresponding size as its contextualization network.

Comparing each Backpack language model against a Transformer of the same specification as the Backpack's contextualization network, performance is roughly equivalent. Note that the Backpack has more parameters, mostly from the sense vectors. The researchers also found that Backpack language models take longer to converge than Transformers. Curiously, although the Small Backpack and Transformer reach almost the same OWT perplexity, the Backpack language model performs significantly better on LAMBADA and Wikitext, but worse on BLiMP.

Emergent structure in sense vectors

Below, qualitative and quantitative experiments examine how effective sense vectors are at computing lexical similarity and relatedness. The results suggest that sense vectors can serve as a high-level interface for interventions.

  • Visualizing senses

Empirically, trained Backpack models associate specific sense-vector indices with distinct predictive roles. The researchers interpret these roles by taking sense $\ell$ of a word $x$ and projecting it onto the word embeddings: $E\, C(x)_\ell \in \mathbb{R}^{|\mathcal{V}|}$. Note that this is exactly how (up to a scalar) the sense contributes to any prediction of the model. The role of a sense vector is then interpreted by reporting the highest-scoring words under this projection.
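A minimal sketch of this projection, assuming the SenseNetwork sketched earlier and a hypothetical tokenizer with a decode method:

```python
import torch

# Sketch: score every vocabulary item under E C(x)_l and report the top words.
# `senses`, `E_weight` ((|V|, d) output matrix) and `tokenizer` are assumptions.
def top_words_for_sense(word_id: int, l: int, senses, E_weight, tokenizer, topn=10):
    C = senses(torch.tensor([word_id]))      # (1, k, d) sense vectors of the word
    scores = E_weight @ C[0, l]              # (|V|,) logit contribution of sense l
    return [tokenizer.decode([i.item()]) for i in scores.topk(topn).indices]
```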

Table 3 below visualizes a few senses. For example, sense 12 seems to encode a broad notion of relatedness for almost every word; sense 3 encodes particular cases of the next-word (bigram) distribution given x; and sense 14 seems to encode the associated object for verbs, as well as the associated modifier dependents for nouns.


Table 3: Visualization of how the same sense index, across many words, encodes fine-grained notions of meaning, relatedness, and predictive usage.

  • Lexical relationship tests

As Table 4 below shows, sense 12 (the synonym sense) performs well across all datasets, comparable to or better than embeddings from GPT-2-1.5B and GPT-J-6B (the exception being GPT-J-6B on RG-65). Sense 14 (the verb-object sense) does well only on verb similarity (VerbSim3500), while taking the minimum similarity over senses performs especially well on noun similarity (SimLex999). This shows that the sense vectors encode a large amount of lexical information, and that the new method is comparable to the current state of the art despite its very different training objective.
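As an illustration of how sense vectors could be scored on such benchmarks, here is a minimal sketch of a per-sense cosine similarity and the minimum-over-senses variant mentioned above; the aggregation details are assumptions:

```python
import torch
import torch.nn.functional as F

# Sketch: similarity between two words from their (k, d) sense vectors, either
# for a single sense index l (e.g., 12 or 14) or the minimum over all k senses.
def sense_similarity(C_a: torch.Tensor, C_b: torch.Tensor, l=None):
    sims = F.cosine_similarity(C_a, C_b, dim=-1)  # (k,) per-sense cosine
    return sims[l] if l is not None else sims.min()
```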


Table 4: Lexical similarity evaluation results. All values are Spearman correlations; higher is better.

Sense vectors for control

Finally, the researchers present proofs of concept showing, through several specific cases, that a language model's behavior can be controlled using sense vectors.

  • Generating topic-specific content

In Figure 2 below, the topic of generation is controlled via a sense intervention in the Backpack, compared against PPLM applied to a Transformer.

Figure 2: Topic control via sense intervention in the Backpack, compared with PPLM on a Transformer.

  • Mitigating gender bias

The researchers found that sense vector 10 of many profession nouns (such as nurse, CEO, teacher) carries a gender stereotype, and that this stereotype is expressed coherently through pronouns. By scaling down sense 10 (multiplying it by a scalar less than 1), they found that the Backpack's gender bias on these profession nouns can be greatly reduced in a restricted setting.
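A minimal sketch of this intervention, assuming direct access to the (n, k, d) sense tensor; the sense index 10 follows the description above, while the default scaling factor is an assumption:

```python
import torch

# Sketch: scale down sense 10 of selected tokens (e.g., "nurse") before the
# weighted sum, shrinking the gendered component of the model's predictions.
def downscale_sense(C: torch.Tensor, positions, sense_index: int = 10,
                    factor: float = 0.5) -> torch.Tensor:
    C = C.clone()                          # C: (n, k, d) sense vectors
    C[positions, sense_index] *= factor    # factor in [0, 1]; 0 removes the sense
    return C
```

Sweeping the factor from 0 to 1 corresponds to the setting of Figure 3 below.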


Table 5: Reducing pronoun-based gender bias in a restricted setting.


Figure 3: For the sentence prefix "when the nurse walked into the room", how the conditional probability distribution of the Backpack language model changes as sense 10 of the word "nurse" is scaled from 0 (entirely removed) to 1 (the original model).

  • Knowledge editing

The researchers also explored using the new method for knowledge editing, i.e., editing the world knowledge encoded in a model's predictions. In particular, many words associated with a proper noun can be localized in that noun's sense vectors. In a qualitative proof-of-concept experiment, they edited the sense vectors of a target word (such as MacBook) to remove its association with one word (such as Apple), replacing it with an association with another word (such as HP). As expected, this intervention causes MacBook to be associated with HP in the model's predictions.
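One plausible reading of this edit, as a minimal sketch: project the "Apple" direction out of each MacBook sense vector and re-insert the removed component along the "HP" direction. The exact projection used in the paper may differ:

```python
import torch

# Sketch of the described edit: for each sense of the target word, remove the
# component along e_old ("Apple") and re-insert it along e_new ("HP").
def swap_association(C_target: torch.Tensor, e_old: torch.Tensor,
                     e_new: torch.Tensor) -> torch.Tensor:
    # C_target: (k, d) senses of "MacBook"; e_old, e_new: (d,) embedding rows
    u_old = e_old / e_old.norm()
    u_new = e_new / e_new.norm()
    coef = C_target @ u_old                          # (k,) components along "Apple"
    return C_target - torch.outer(coef, u_old) + torch.outer(coef, u_new)
```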


Table 6: Samples from the Backpack after projecting Apple out of the sense embeddings of MacBook and substituting HP in its place. The third sample is a similar edit involving an American football team and player. The prompt is in bold.



Origin blog.csdn.net/gzq0723/article/details/131407761