Paper Translation: Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Abstract

Humans learn language by listening, speaking, writing, reading, and by interacting with the multimodal real world. While existing language pre-training frameworks demonstrate the effectiveness of text-only self-supervised learning, this paper explores the idea of a language model that uses vision as a supervisory signal. We find that the main impediment to this exploration is the large difference in size and distribution between visually grounded language datasets and pure language corpora. We therefore develop a technique called "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and is then applied to generate vokens for large language corpora. Visually supervised language models trained with these contextually generated vokens show consistent improvements on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.

1 Introduction

When learning language understanding, most people use multiple modalities, not just text and audio, and in particular take advantage of the visual modality. As claimed by Bloom (2002), visual pointing is an important step in learning the meaning of words for most children. However, existing language pre-training frameworks are driven by contextual learning and only use language context as self-supervision. For example, word2vec (Mikolov et al., 2013) uses the surrounding bag-of-words; ELMo (Peters et al., 2018) and GPT (Radford et al., 2018) use the subsequent context; BERT (Devlin et al., 2019) uses randomly masked tokens. While these self-supervised frameworks have made strong progress in understanding human language, they do not borrow grounding information from the external visual world (see the motivation in recent related work by Bender and Koller (2020) and Bisk et al. (2020)).

In this paper, we introduce a visually supervised language model that mimics human language learning with visual orientation (Bloom, 2002).

As shown in the figure below, the model takes language tokens as input and uses images associated with these tokens as visual supervision. We refer to these images as vokens (i.e., visualized tokens), since they act as visualizations of the corresponding tokens. Assuming a large token-voken aligned dataset exists, the model can learn from these vokens through a voken-prediction task.
[Figure 1: Tokens in a sentence paired with related images (vokens) that provide visual supervision for the language model.]
Unfortunately, no such token-voken aligned dataset exists yet, so creating a dataset with a one-to-one correspondence between vision and language remains a major challenge.

2 Visually Supervised Language Model

Contextual language representation learning is driven by self-supervision, without considering explicit grounding (connection) to the external world. In this section, we introduce the idea of a visually supervised language model and discuss the challenges of creating its visual supervision.

2.1 Vokens: Visual Tokens

To provide visual supervision for language models, we assume the existence of a text corpus where each token is aligned with a related image (although such voken annotations do not currently exist, we will attempt to generate vokens through the vokenization process in Section 3). These images can thus be considered as visualizations of the tokens, and we name them "vokens". Based on these vokens, we propose a new language pre-training task: voken classification.

2.2 Voken classification task

Most language backbone models (e.g., ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019)) output a localized feature representation $h_i$ for each token $w_i$ in a sentence $s = \{w_i\}$, so the voken classification task can be added without modifying the model architecture.

Suppose the vokens come from a finite set $\mathbb{X}$. We use a linear layer and a softmax layer to convert the hidden output $h_i$ into a probability distribution $p_i$; the voken classification loss is then the negative log probability of all corresponding vokens:

$$h_1, h_2, \ldots, h_l = \mathrm{lm}(w_1, w_2, \ldots, w_l)$$
$$p_i(v \mid s) = \mathrm{softmax}_v\{W h_i + b\}$$
$$\mathcal{L}_{\text{VOKEN-CLS}}(s) = -\sum_{i=1}^{l} \log p_i\big(v(w_i; s) \mid s\big)$$
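Below is a minimal PyTorch sketch of such a voken classification head, assuming a pre-computed voken index for every token; the module and argument names are illustrative, not the authors' implementation.

```python
# Sketch of a voken classification head (assumed shapes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VokenClassificationHead(nn.Module):
    def __init__(self, hidden_dim: int, num_vokens: int):
        super().__init__()
        # Linear layer (W, b) mapping each hidden state h_i to voken logits.
        self.proj = nn.Linear(hidden_dim, num_vokens)

    def forward(self, hidden_states: torch.Tensor, voken_ids: torch.Tensor) -> torch.Tensor:
        """hidden_states: (batch, seq_len, hidden_dim) from the language model lm(w_1..w_l).
        voken_ids: (batch, seq_len) index of the voken v(w_i; s) for every token.
        Returns the negative log-likelihood L_VOKEN-CLS averaged over tokens."""
        logits = self.proj(hidden_states)              # (batch, seq_len, num_vokens)
        log_probs = F.log_softmax(logits, dim=-1)      # p_i(v | s)
        return F.nll_loss(log_probs.flatten(0, 1), voken_ids.flatten())
```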

2.3 Example Implementation

This task can be easily integrated into current language pre-training frameworks. The figure below shows how it is incorporated into the pre-training model:
[Figure 2: Visually supervised BERT, trained jointly with the masked language model task (left) and the voken classification task (right).]

The figure above shows an example implementation of the voken classification task that provides visual supervision for BERT (Devlin et al., 2019). The original BERT pre-training objective mainly relies on the masked language model: part of the tokens are randomly replaced with a mask, and the model must predict these missing tokens from the language context.

For simplicity, we use $s$ and $\hat{s}$ to denote the set of tokens and the set of masked tokens, respectively. The unmasked part, i.e., the set difference of $s$ and $\hat{s}$, is written as $s \backslash \hat{s}$.

Let $q_i$ denote the conditional probability distribution for the $i$-th token. The masked language model (MLM) loss is the negative log-likelihood of the masked tokens:

$$\mathcal{L}_{\text{MLM}}(s, \hat{s}) = -\sum_{w_i \in \hat{s}} \log q_i\big(w_i \mid s \backslash \hat{s}\big)$$

Without changing the model or the model input, we compute the voken classification loss for all tokens (as shown on the right side of Figure 2):

$$\mathcal{L}_{\text{VOKEN-CLS}}(s, \hat{s}) = -\sum_{w_i \in s} \log p_i\big(v(w_i; s) \mid s \backslash \hat{s}\big)$$

The visually supervised masked language model takes a weighted sum of these two losses as the final pre-training loss:

$$\mathcal{L}_{\text{VLM}}(s, \hat{s}) = \mathcal{L}_{\text{VOKEN-CLS}}(s, \hat{s}) + \lambda\, \mathcal{L}_{\text{MLM}}(s, \hat{s})$$

where λ is a hyperparameter that controls the balance between the voken classification task and the masked language model task. This approach can improve the performance and generalization ability of language models.
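The weighted combination can be sketched as follows; `lm`, `voken_head`, and `mlm_head` are hypothetical modules standing in for the backbone and the two output heads, and only the loss combination mirrors the formula above.

```python
# Sketch of the joint objective L_VLM = L_VOKEN-CLS + lambda * L_MLM (assumed module names).
import torch
import torch.nn.functional as F

def vlm_loss(lm, voken_head, mlm_head, masked_input_ids, target_token_ids,
             voken_ids, mlm_mask, lam: float = 1.0):
    hidden = lm(masked_input_ids)                      # (batch, seq_len, hidden_dim)

    # Voken classification over *all* tokens.
    voken_logits = voken_head(hidden)                  # (batch, seq_len, num_vokens)
    l_voken = F.cross_entropy(voken_logits.flatten(0, 1), voken_ids.flatten())

    # Masked language modeling over the masked positions only.
    mlm_logits = mlm_head(hidden)                      # (batch, seq_len, vocab_size)
    mlm_targets = target_token_ids.masked_fill(~mlm_mask, -100)  # ignore unmasked tokens
    l_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_targets.flatten(),
                            ignore_index=-100)

    return l_voken + lam * l_mlm
```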

2.4 Two Challenges in Creating Vokens

[Figure 3: The vokenization process, assigning a relevant image (voken) to each token in a sentence.]
The previous sections introduced the possibility of using vokens as external supervision. However, we currently lack dense annotations from tokens to images. The concept closest to vokens is phrase localization (e.g., in Flickr30K Entities (Young et al., 2014; Plummer et al., 2017)). Because collecting phrase localizations is costly, their coverage and annotation volume cannot meet our requirements. Besides phrase localization, the most promising data sources are image captioning datasets with sentence-to-image mappings (or mappings mined from multimodal documents, as in Hessel et al. (2019)). Image captioning belongs to a particular type of language called grounded language (Roy and Pentland, 2002; Hermann et al., 2017), which has explicit grounding in external existence or physical actions. However, grounded language is very different from other types of natural language such as news, Wikipedia, and textbooks. To illustrate this, Table 1 lists key statistics for three image captioning datasets (MS COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and Conceptual Captions (Sharma et al., 2018)) and for three language corpora of other language types, namely Wiki103 (Merity et al., 2017), English Wikipedia, and CNN/Daily Mail (See et al., 2017).

3 Vokenization

In this section, we develop a framework for generating vokens. The basic idea is to learn a "vokenizer" from image-text datasets and use it to annotate a large language corpus (i.e., English Wikipedia), thus bridging the gap between grounded language and other types of natural language. We first give an overview of the vokenization process and then describe how to implement it.

3.1 Vokenization process

As shown in the figure above, vokenization is the process of assigning to each token $w_i$ in a sentence $s = (w_1, w_2, \ldots, w_l)$ a related image $v(w_i; s)$. We call this image $v(w_i; s)$ a "voken" (visualized token).

Instead of creating this image with a generative model, we retrieve it from a set of images $X = \{x_1, x_2, \ldots, x_n\}$ using a token-image relevance scoring function $r_\theta(w_i, x; s)$. This scoring function, parameterized by $\theta$, measures the relevance between the token $w_i$ and the image $x$, taking the sentence $s$ into account.

Assuming the optimal parameters of this scoring function are $\theta^*$, the voken $v(w_i; s)$ of token $w_i$ in sentence $s$ is realized as the image $x \in X$ that maximizes their relevance score $r_{\theta^*}$:

$$v(w_i; s) = \operatorname*{argmax}_{x \in X} r_{\theta^*}(w_i, x; s)$$

Since the image set $X$ effectively acts as a finite vocabulary for vokens, we can use the voken classification task to visually supervise the training of language models. Next, we discuss the detailed implementation of this vokenization process.

3.2 Contextual Token-Image Matching Model

The core of the vokenization process is a contextual token-image matching model.

The model takes a sentence $s$ and an image $x$ as input, where the sentence $s$ consists of a sequence of tokens $\{w_1, w_2, \ldots, w_l\}$.

The output $r_\theta(w_i, x; s)$ is the relevance score between the token $w_i \in s$ and the image $x$, taking the entire sentence $s$ into account as context. To model this relevance score function $r_\theta(w_i, x; s)$, we factorize it into the inner product of a language feature representation $f_\theta(w_i; s)$ and a visual feature representation $g_\theta(x)$:

$$r_\theta(w_i, x; s) = f_\theta(w_i; s)^{\top} g_\theta(x)$$

These two feature representations are generated by a language encoder and a visual encoder, respectively. The language encoder first uses the pre-trained BERT$_{\text{BASE}}$ (Devlin et al., 2019) model to contextually embed the discrete tokens $\{w_i\}$ into hidden output vectors $\{h_i\}$:

$$h_1, h_2, \ldots, h_l = \mathrm{bert}(w_1, w_2, \ldots, w_l)$$

We then apply a multi-layer perceptron (MLP) $w_{\mathrm{mlp}\theta}$ to reduce the dimensionality of the hidden output $h_i$. To simplify the retrieval process in Section 3.1, the final language feature is normalized to a unit vector by dividing it by its Euclidean norm:

$$f_\theta(w_i; s) = \frac{w_{\mathrm{mlp}\theta}(h_i)}{\lVert w_{\mathrm{mlp}\theta}(h_i) \rVert}$$

The visual encoder, on the other hand, first extracts a visual embedding $e$ with the pre-trained ResNeXt (Xie et al., 2017). Similar to the language encoder, an MLP layer $x_{\mathrm{mlp}\theta}$ and an L2 normalization layer are then applied:

$$e = \mathrm{ResNeXt}(x)$$
$$g_\theta(x) = \frac{x_{\mathrm{mlp}\theta}(e)}{\lVert x_{\mathrm{mlp}\theta}(e) \rVert}$$
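The following PyTorch sketch puts the two encoders together under some assumptions: `bert` is a Hugging Face-style encoder exposing `last_hidden_state`, `resnext` is any image backbone returning a pooled embedding, and the MLP sizes are illustrative rather than taken from the paper.

```python
# Sketch of the contextual token-image matching model (assumed backbones and dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenImageMatcher(nn.Module):
    def __init__(self, bert, resnext, lang_dim=768, vis_dim=2048, joint_dim=64):
        super().__init__()
        self.bert = bert
        self.resnext = resnext
        self.w_mlp = nn.Sequential(nn.Linear(lang_dim, joint_dim), nn.ReLU(),
                                   nn.Linear(joint_dim, joint_dim))
        self.x_mlp = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU(),
                                   nn.Linear(joint_dim, joint_dim))

    def lang_features(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        f = self.w_mlp(h)                                  # (batch, seq_len, joint_dim)
        return F.normalize(f, dim=-1)                      # f_theta(w_i; s), unit norm

    def visual_features(self, images):
        e = self.resnext(images)                           # (batch, vis_dim)
        g = self.x_mlp(e)                                  # (batch, joint_dim)
        return F.normalize(g, dim=-1)                      # g_theta(x), unit norm

    def relevance(self, input_ids, attention_mask, images):
        f = self.lang_features(input_ids, attention_mask)  # (batch, seq_len, d)
        g = self.visual_features(images)                   # (batch, d)
        # r_theta(w_i, x; s) = f^T g for every token against its paired image.
        return torch.einsum('bld,bd->bl', f, g)
```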

Training

Since dense annotations from tokens to images are lacking and hard to generate, we instead train the token-image matching model on weakly supervised image captioning datasets such as MS COCO (Lin et al., 2014). These datasets consist of sentence-image pairs $(s_k, x_k)$, where the sentence $s_k$ describes the visual content of the image $x_k$.

To establish the correspondence between tokens and images, we pair every token in the sentence $s_k$ with the image $x_k$. The model is then optimized by maximizing the relevance scores of these aligned token-image pairs over those of unaligned pairs.

Suppose $(s, x)$ is an image-text data point. We randomly sample another image $x'$ such that $x' \neq x$. We then use a hinge loss to optimize the weights $\theta$ so that the score of the positive token-image pair, $r_\theta(w_i, x; s)$, is higher than that of the negative pair, $r_\theta(w_i, x'; s)$, by at least a margin $M$:

$$\mathcal{L}_\theta(s, x, x') = \sum_{i=1}^{l} \max\big\{0,\; M - r_\theta(w_i, x; s) + r_\theta(w_i, x'; s)\big\}$$

Intuitively, when the score difference is smaller than the margin $M$, minimizing this hinge loss $\max\{0, M - \mathrm{pos} + \mathrm{neg}\}$ tries to increase the score of the positive pair and decrease the score of the negative pair. Otherwise (when the difference is at least $M$), both scores are left unchanged.
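A minimal sketch of one training step with this hinge loss, reusing the `TokenImageMatcher` sketched above; sampling the negative image by rolling the batch is a common simplification and an assumption here, not necessarily the paper's sampling strategy.

```python
# Sketch of the hinge-loss objective; `margin` corresponds to M in the formula above.
import torch

def hinge_loss(matcher, input_ids, attention_mask, images, margin: float = 0.5):
    # Positive pairs: each token scored against the image of its own caption.
    pos = matcher.relevance(input_ids, attention_mask, images)      # (batch, seq_len)
    # Negative pairs: roll the images within the batch so that x' != x (assumes batch > 1).
    neg_images = torch.roll(images, shifts=1, dims=0)
    neg = matcher.relevance(input_ids, attention_mask, neg_images)  # (batch, seq_len)
    # L = sum_i max{0, M - r(w_i, x; s) + r(w_i, x'; s)}
    per_token = torch.clamp(margin - pos + neg, min=0.0)
    return per_token.sum(dim=1).mean()
```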

Inference

Since the relevance score is factorized into the inner product of the feature representations $f_\theta(w_i; s)$ and $g_\theta(x)$, the retrieval problem in Section 3.1 can be formulated as maximum inner product search (Mussmann and Ermon, 2016). Moreover, because the vectors are normalized to unit norm, the vector with the largest inner product is also the closest vector in Euclidean space, i.e., the nearest neighbor (Knuth, 1973).
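As a sketch, with unit-norm features this retrieval reduces to a single matrix multiplication followed by an argmax; `image_bank` is an assumed pre-computed matrix of $g_\theta(x)$ vectors for the whole image set.

```python
# Sketch of voken retrieval as maximum inner product search over unit-norm vectors.
import torch

@torch.no_grad()
def retrieve_vokens(token_features: torch.Tensor, image_bank: torch.Tensor) -> torch.Tensor:
    """token_features: (seq_len, d) unit-norm f_theta(w_i; s) for one sentence.
    image_bank: (num_images, d) unit-norm g_theta(x) for all candidate images.
    Returns: (seq_len,) index of the voken v(w_i; s) = argmax_x f^T g."""
    scores = token_features @ image_bank.t()   # inner products = cosine similarities
    return scores.argmax(dim=-1)
```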
