【Paper Notes】Controllable Natural Language Generation with Contrastive Prefixes

Controllable Natural Language Generation with Contrastive Prefixes



Conference : ACL2022 Findings

Task : Controlled Text Generation

Original : link

Abstract

This paper proposes a novel GPT2-based controllable text generation method that uses a series of small attribute-specific vectors, called prefixes, to guide natural language generation.

  • Different from previous work related to Prefix-Tuning, we take the relationship between prefixes into consideration and train multiple prefixes at the same time.
  • This paper proposes a novel supervised method, an unsupervised method, and novel training objectives for single-aspect control, and by combining the two methods it achieves multi-aspect control of GPT-2 generation.
  • Experimental results show that the method guides generation toward the desired attributes while maintaining language quality, in both single-aspect and multi-aspect control. The work also provides a unified perspective on single-aspect and multi-aspect control.


Motivation

  • Previous CTG models have various shortcomings, such as high training cost (CTRL), slow inference speed (PPLM), or a large number of additional parameters (GeDi).
  • Prefix-Tuning is a lightweight fine-tuning framework that introduces few additional parameters and maintains inference speed comparable to the original LM.
  • In controllable text generation, one aspect may include multiple attributes, and there may be relationships among those attributes. For example, sentiment control has two mutually opposite attributes, positive and negative, while topic control may have many attributes. We argue that such relationships can be exploited to improve the control ability of the prefixes.

Main idea and Framework

The prefix is a free parameter, denoted $H_\theta$, whose size is $N \times M \times D$. Here $N$ is the number of prefixes; for single-aspect control, $N$ equals the number of attributes (e.g., in the ordinary sentiment control task of this paper the attributes are positive and negative, so $N = 2$). $M$ is the prefix length, and $D = 2 \times L \times E$ is the dimension of the GPT-2 activations, where $L$ is the number of layers, $E$ is the hidden size, and the factor 2 accounts for the key and value vectors at each layer. As in the Prefix-Tuning work, this paper also uses a reparameterization, $H_\theta[i, j, :] = W_i H'_\theta[i, j, :]$; once training is complete, only $H_\theta$ needs to be saved and the rest can be discarded.
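To make the shapes concrete, below is a minimal PyTorch sketch of the prefix parameter and its reparameterization. This is not the authors' code: `n_attr`, `prefix_len`, `mid_dim`, and the MLP standing in for $W_i$ are illustrative assumptions (layer count and hidden size are set to GPT-2 medium's values).

```python
import torch
import torch.nn as nn

class ContrastivePrefix(nn.Module):
    """Sketch of the prefix parameter: one trainable prefix per attribute.

    Shapes follow the description above: H_theta has size N x M x D, with
    D = 2 * L * E (a key and a value vector for every GPT-2 layer).
    """
    def __init__(self, n_attr=2, prefix_len=10, n_layer=24, n_embd=1024, mid_dim=512):
        super().__init__()
        self.d = 2 * n_layer * n_embd                        # D = 2 * L * E
        # H'_theta: the smaller free parameter, shape (N, M, mid_dim)
        self.h_prime = nn.Parameter(torch.randn(n_attr, prefix_len, mid_dim))
        # Reparameterization (an MLP stands in for W_i here): H'_theta -> H_theta
        self.reparam = nn.Sequential(
            nn.Linear(mid_dim, mid_dim), nn.Tanh(), nn.Linear(mid_dim, self.d)
        )

    def forward(self, attr_idx: int) -> torch.Tensor:
        # Returns H_theta[attr_idx, :, :] of shape (M, D); after training this
        # tensor can be precomputed and saved, and h_prime / reparam discarded.
        return self.reparam(self.h_prime[attr_idx])
```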


Prefixes can be trained with supervised, semi-supervised, or unsupervised methods; the semi-supervised method is the combination of the supervised and unsupervised ones. This article only presents the supervised and unsupervised methods. For clarity, the methods are described under single-aspect control.

Supervised Method

Suppose the attribute set of the current control aspect is $Y$ and a training sample is $(x, y)$, where $x$ is the input text and $y \in Y$ is the attribute label; the label $y$ also indicates the index of the correct prefix in $H_\theta$.

  • discriminative loss

An additional discriminative loss is introduced so that multiple prefixes are trained at the same time; the final loss is a weighted sum of the language-model loss and the discriminative loss. Here $\log p(x_t \mid x_{<t}, y)$ is parameterized as $\log p_{\theta,\gamma}(x_t \mid x_{<t}, H_\theta[y, :, :])$, where $\gamma$ denotes the fixed GPT-2 parameters and $\theta$ denotes the learnable prefix parameters.

Each prefix could be trained independently with $L_{LM}$, which infuses the prefix with information that encourages generating $x$; however, we observe that in controllable NLG it also helps to infuse the prefix with information about what should not be generated. Given a sample $(x, y)$, the prefix $H_\theta[y, :, :]$ should be optimized to generate $x$, while the other prefixes should be discouraged from generating $x$. To achieve this, all prefixes must be trained simultaneously, which motivates the discriminative loss.

According to Equation 3, optimizing $L_d$ improves the attribute alignment $p(y \mid x)$ by increasing $p(x \mid y)$ and simultaneously decreasing $p(x \mid \overline{y})$ for $\overline{y} \in Y \setminus \{y\}$. Assuming a uniform prior, $p(y)$ and $p(y')$ cancel out.

[equation images omitted]
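Since the equation images above are not reproduced, the following is a hedged sketch of the supervised objective as described: the language-model loss conditions on the correct prefix, and the discriminative loss contrasts the sequence likelihood under the correct prefix against all prefixes via a log-sum-exp (the uniform prior makes $p(y)$ cancel). The per-prefix log-likelihoods and the weights `w1`, `w2` are placeholders, not the paper's exact formulation.

```python
import torch

def supervised_loss(seq_logprobs: torch.Tensor, y: int, w1: float = 1.0, w2: float = 1.0):
    """seq_logprobs: shape (N,), log p(x | H_theta[i]) for each of the N prefixes,
    obtained by running the frozen GPT-2 once per prefix; y: correct attribute index."""
    logp_correct = seq_logprobs[y]
    l_lm = -logp_correct                                         # language-model loss
    # Discriminative loss -log p(y|x): with a uniform prior p(y),
    # p(y|x) = p(x|y) / sum_{y'} p(x|y'), so the priors cancel.
    l_d = -(logp_correct - torch.logsumexp(seq_logprobs, dim=0))
    return w1 * l_lm + w2 * l_d

# Toy usage: two prefixes (e.g. positive / negative) and dummy sequence log-likelihoods.
seq_logprobs = torch.tensor([-42.3, -55.1], requires_grad=True)
loss = supervised_loss(seq_logprobs, y=0)
loss.backward()
```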

Unsupervised Method

The unsupervised training method is shown in the figure:

[figure omitted]

In the unsupervised scenario, suppose the attribute set $Y$ is known, but a training sample contains only the text $x$; the attribute label $y$ is unavailable, so the index of the prefix corresponding to $x$ is also unknown. In other words, the prefix index of $x$ is a latent variable $z$, whose posterior distribution has to be inferred.

  • discrete latent variable

Inspired by VQ-VAE, this paper regards the prefixes as discrete latent variables. (My understanding of "discrete": the latent space of prefixes has a fixed size, e.g. $K \times R$; the computed index $z$ is used to query this space and retrieve a $1 \times R$ vector as the prefix, much like the codebook lookup in VQ-VAE. See reference 1 and reference 2.)

An encoder is introduced to learn the categorical distribution $q(z \mid x)$ (my understanding: this distribution is close to one-hot, reflecting the discreteness). According to $q(z \mid x)$, a prefix index $z$ is chosen, and the corresponding prefix $H_\theta[z, :, :]$ is fed into the decoder to reconstruct the text $x$.

  • Gumbel-Softmax

Because the operation of selecting a prefix index is not differentiable, the Gumbel-Softmax (GS) relaxation is introduced.
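As an illustration (not the paper's code), PyTorch's built-in Gumbel-Softmax yields differentiable, near-one-hot samples over the prefix indices:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4)                     # scores over N = 4 prefix indices
z_soft = F.gumbel_softmax(logits, tau=0.5)     # relaxed, differentiable sample
# hard=True returns a one-hot vector in the forward pass while keeping the
# soft sample's gradient (straight-through estimator).
z_hard = F.gumbel_softmax(logits, tau=0.5, hard=True)
```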

  • Learning the Posterior Distribution

$q(z \mid x)$ is computed as follows, where $\tau$ is the GS temperature and $Enc$ is the encoding function; this paper uses a pretrained GPT-2 plus a linear layer as the encoder.

  • Personal understanding: the Euclidean distance between the two is minimized here, which resembles nearest-neighbor search (as in clustering): find the prefix index $z$ whose vector in the discrete latent space is closest (smallest distance) to the encoded representation of $x$, and then use this index $z$ at decoding time to look up the prefix vector space. What is being learned is this categorical distribution, i.e. an index $z$ that can discriminate the data. According to the author's figure, after obtaining $q(z \mid x)$, a matrix multiplication with the prefix space is performed, i.e. the prefix is retrieved according to the index $z$ (a code sketch is given after the equations below).
  • Another way to see it is as attention: $Enc(x)$ acts as the Query and $H_\theta$ provides the Keys and Values; the similarity between the Query and the Keys gives $q(z \mid x)$, and a matrix multiplication with the Values gives the final weighted result.

[equation images omitted]
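The equations themselves are not reproduced, but based on the description above (Euclidean distance between $Enc(x)$ and each prefix, a Gumbel-Softmax with temperature $\tau$, then a matrix multiplication with the prefix pool), a sketch could look like the following. The mean-pooled prefix representation and the exact distance/normalization are assumptions for illustration, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def select_prefix(enc_x: torch.Tensor, h_theta: torch.Tensor, tau: float = 0.5):
    """enc_x:   (B, D)     encoding of x (pretrained GPT-2 + linear layer),
                           assumed to be projected to the prefix dimension D.
    h_theta:    (N, M, D)  pool of N prefixes.
    Returns q(z|x) of shape (B, N) and the selected prefix of shape (B, M, D).
    """
    prefix_repr = h_theta.mean(dim=1)                    # (N, D) one vector per prefix (assumption)
    dist = torch.cdist(enc_x, prefix_repr) ** 2          # (B, N) squared Euclidean distances
    q_z = F.gumbel_softmax(-dist, tau=tau)               # (B, N) relaxed posterior q(z|x)
    # "Matrix multiplication with the prefix space": near-one-hot weighted lookup.
    prefix = torch.einsum("bn,nmd->bmd", q_z, h_theta)   # (B, M, D)
    return q_z, prefix
```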

  • Unsupervised Contrastive Loss

Analogous to the discriminative loss in the supervised setting, an unsupervised contrastive loss $L_c$ is introduced, as shown below, where $m$ is a preset margin. The contrastive loss aims to push $p(z \mid x)$ away from $p(\overline{z} \mid x)$ by at least this margin, so as to improve attribute alignment.

[equation images omitted]
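The exact formula is in the omitted equation image; what follows is only a plausible hinge-style sketch of the stated idea (push the selected $p(z \mid x)$ at least a margin $m$ above the competing $p(\overline{z} \mid x)$), and should be read as an assumption rather than the paper's definition.

```python
import torch

def contrastive_loss(q_z: torch.Tensor, m: float = 0.5) -> torch.Tensor:
    """q_z: (B, N) posterior q(z|x); m: preset margin.
    Penalizes samples whose top probability is not at least m above the runner-up."""
    top2 = q_z.topk(2, dim=-1).values                    # (B, 2): largest and second-largest
    return torch.clamp(m - (top2[:, 0] - top2[:, 1]), min=0.0).mean()
```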

Therefore, in order to train multiple prefixes simultaneously, the unsupervised loss function is a weighted sum of the following terms. To keep the posterior as accurate as possible, $L_{KL}$ is introduced, the KL divergence between $q(z \mid x)$ and the prior $p(z)$, which is assumed to be uniform; together with the reconstruction loss, these two terms constitute the VAE objective, and optimizing them improves the evidence lower bound of $\log p(x)$.

[equation image omitted]
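A sketch of the KL term (against a uniform prior $p(z) = 1/N$) and the weighted sum; the weights and the way $L_{LM}$ and $L_c$ are computed are assumptions carried over from the earlier sketches.

```python
import torch

def kl_to_uniform(q_z: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL( q(z|x) || Uniform(N) ), averaged over the batch."""
    n = q_z.size(-1)
    log_prior = -torch.log(torch.tensor(float(n)))
    return (q_z * (torch.log(q_z + eps) - log_prior)).sum(dim=-1).mean()

def unsupervised_loss(l_lm, l_c, q_z, w1=1.0, w2=1.0, w3=1.0):
    # Weighted sum of the reconstruction (LM), KL, and contrastive terms.
    return w1 * l_lm + w2 * kl_to_uniform(q_z) + w3 * l_c
```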

Two related papers :

Toward Controlled Generation of Text (ICML 2017): uses a VAE for controllable text generation.

Attribute alignment: Controlling text generation from pretrained language models.

Experiments

Unsupervised Setting

In the unsupervised setting, GPT-2 with prompt engineering shows good control, but this baseline does not perform well on the detoxification task.

The unsupervised method proposed in this paper performs well on the detoxification task, and ablation experiments show that the contrastive loss plays a key role. However, it does not perform as well on sentiment control: attribute alignment is poor when the target attribute is negative, although it is good when the target is positive. A possible reason: compared with the difference between toxic and neutral sentences, the difference between positive and negative sentiment is weaker and less obvious, so it is harder for the model to learn sentence-level sentiment distinctions. It is therefore more challenging for the GPT-2 encoder in the unsupervised model to accurately separate the unlabeled data into the two sentiments, so the implicit criterion the encoder uses to partition the input text may not be entirely sentiment. This also explains why, after removing the contrastive loss from the unsupervised objective, the alignment with the negative attribute becomes higher while the alignment with the positive attribute becomes lower.

Supervised Setting

In the supervised learning scenario, few-shot learning can still maintain a certain degree of control over the three tasks, demonstrating the robustness of our method to the size of the training data.

Ablation experiments show that the discriminative loss is important in the supervised method. Removing the discriminative loss amounts to applying Prefix-Tuning directly on GPT-2 medium; this still achieves reasonable results on sentiment control and topic control, but it is ineffective for the detoxification task. The reason may lie in the different nature of the tasks: detoxification requires the model to avoid generating certain words or phrases depending on the context, which is difficult for plain prefix-tuning to achieve, and this is exactly what the discriminative loss contributes.

On the DBPedia topic control task, the discriminative loss also greatly improves the model's attribute alignment. This is because topic control involves more attributes than the other tasks, so incorporating the discriminative loss helps each prefix capture the features that uniquely distinguish its topic.

Multi-Aspect Control

Data labeled with multiple aspects at once is difficult to obtain, but single-aspect labeled data is available for several aspects, and it can be used to achieve multi-aspect control. The methods are as follows:

  • Concatenation: with the supervised method, train a set of prefixes for each aspect separately, then concatenate the desired prefixes together (see the sketch after this list).
  • Semi-supervised: by treating each single-aspect labeled sample as partially labeled, prefixes for multiple aspects are trained at the same time; this connects supervised and unsupervised learning, and the model structure is the same as in the unsupervised method. The loss function is shown below, where the latent variable $z$ is the concatenation of indices from the supervised and unsupervised aspects, and an encoding loss is introduced because a labeled sample reveals the ground-truth prefix index for its labeled aspect, providing supervision for both the prefix and the encoder.
    [equation image omitted]
    Experimental results show that the concatenation method performs well on sentiment + topic control, and the order of the prefixes does not affect the results. The semi-supervised method can further improve attribute alignment without sacrificing much language quality. Similar to the single-aspect results, removing the discriminative loss significantly reduces the attribute alignment rate, especially for topic control, while removing the encoding loss yields a higher attribute alignment rate but significantly lower language quality.
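As referenced in the list above, a minimal sketch of the concatenation strategy: prefixes trained separately for each aspect are concatenated along the prefix-length dimension at inference time. All shapes and the choice of attributes here are illustrative.

```python
import torch

# Independently trained prefix pools, e.g. sentiment (N=2) and topic (N=4),
# each of shape (N_attr, M, D); the sizes below are illustrative.
sentiment_prefixes = torch.randn(2, 10, 2 * 24 * 1024)
topic_prefixes = torch.randn(4, 10, 2 * 24 * 1024)

# Multi-aspect control, e.g. "positive" sentiment plus one chosen topic: concatenate
# along the length dimension, giving a (2*M, D) prefix that is fed to GPT-2.
# The paper reports that the order of the concatenated prefixes does not matter.
multi_prefix = torch.cat([sentiment_prefixes[0], topic_prefixes[2]], dim=0)
```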


Origin blog.csdn.net/m0_47779101/article/details/129372290