Chinese-English translation of the Sentence-BERT paper

Full name of the Sentence-BERT paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT paper address: https://arxiv.org/abs/1908.10084
Sentence-BERT paper code: https://github.com/UKPLab/sentence-transformers

Abstract

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.
In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embedding methods.

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have achieved state-of-the-art performance on sentence-pair regression tasks such as semantic textual similarity (STS). However, both sentences must be fed into the network, which creates a large computational overhead: finding the most similar pair in a set of 10,000 sentences requires about 50 million inference computations with BERT (about 65 hours). The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks such as clustering.

In this paper, we propose Sentence-BERT (SBERT), a modification of the pre-trained BERT network that uses siamese and triplet network structures to obtain semantically meaningful sentence vectors (embeddings), which can be compared using cosine similarity. This reduces the workload of finding the most similar sentence pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while still maintaining the accuracy of BERT.

We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.

1. Introduction

In this publication, we present Sentence-BERT (SBERT), a modification of the BERT network using siamese and triplet networks that is able to derive semantically meaningful sentence embeddings. This enables BERT to be used for certain new tasks, which up to now were not applicable for BERT. These tasks include large-scale semantic similarity comparison, clustering, and information retrieval via semantic search.

BERT set new state-of-the-art performance on various sentence classification and sentence-pair regression tasks. BERT uses a cross-encoder: Two sentences are passed to the transformer network and the target value is predicted. However, this setup is unsuitable for various pair regression tasks due to too many possible combinations. Finding the pair with the highest similarity in a collection of n = 10,000 sentences requires n·(n−1)/2 = 49,995,000 inference computations with BERT. On a modern V100 GPU, this requires about 65 hours. Similarly, finding which of the over 40 million existing questions of Quora is the most similar to a new question could be modeled as a pair-wise comparison with BERT; however, answering a single query would require over 50 hours.
A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. Researchers have started to input individual sentences into BERT and to derive fixed-size sentence embeddings. The most commonly used approach is to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token). As we will show, this common practice yields rather bad sentence embeddings, often worse than averaging GloVe embeddings (Pennington et al., 2014).

In this paper, we propose Sentence-BERT (SBERT), a modification of the BERT network using siamese and triplet networks that yields semantically meaningful sentence vectors. This enables BERT to be used for certain new tasks that until now were not suitable for BERT. These new tasks include large-scale semantic similarity comparisons, clustering, and information retrieval through semantic search.

BERT sets new state-of-the-art performance on a variety of sentence classification and sentence-pair regression tasks. BERT uses a cross-encoder: two sentences are passed to the transformer network, and the target value is predicted. However, since there are too many possible combinations, this network structure is not suitable for various sentence-pair regression tasks. Finding the pair with the highest similarity in a set of n = 10,000 sentences requires n·(n−1)/2 = 49,995,000 inference computations with BERT. On a modern V100 GPU, this takes about 65 hours. Similarly, finding the question that is most similar to a new question among Quora's 40+ million existing questions can be modeled as a pairwise comparison with BERT; however, answering a single query would then take more than 50 hours.

A common approach to solving clustering and semantic search is to map each sentence into a vector space such that semantically similar sentences are close. Researchers have started to feed single sentences into BERT and derive fixed-size sentence embedding vectors. The most commonly used method is to average-pool the output layer of BERT (called BERT embeddings) or to use the output of the first token (the [CLS] token) as the sentence vector. As we will show, this common practice produces quite poor sentence vectors, often worse than averaged GloVe vectors (Pennington et al., 2014).

To alleviate this issue, we developed SBERT. The siamese network architecture enables fixed-size vectors for input sentences to be derived. Using a similarity measure like cosine-similarity or Manhattan / Euclidean distance, semantically similar sentences can be found. These similarity measures can be computed extremely efficiently on modern hardware, allowing SBERT to be used for semantic similarity search as well as for clustering. The complexity for finding the most similar sentence pair in a collection of 10,000 sentences is reduced from 65 hours with BERT to the computation of 10,000 sentence embeddings (~5 seconds with SBERT) and computing cosine-similarity (~0.01 seconds). By using optimized index structures, finding the most similar Quora question can be reduced from 50 hours to a few milliseconds (Johnson et al., 2017).
We fine-tune SBERT on NLI data, which creates sentence embeddings that significantly outperform other state-of-the-art sentence embedding methods like InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018). On seven Semantic Textual Similarity (STS) tasks, SBERT achieves an improvement of 11.7 points compared to InferSent and 5.5 points compared to Universal Sentence Encoder. On SentEval (Conneau and Kiela, 2018), an evaluation toolkit for sentence embeddings, we achieve an improvement of 2.1 and 2.6 points, respectively.
SBERT can be adapted to a specific task. It sets new state-of-the-art performance on a challenging argument similarity dataset (Misra et al., 2016) and on a triplet dataset to distinguish sentences from different sections of a Wikipedia article (Dor et al., 2018).
The paper is structured in the following way: Section 3 presents SBERT, section 4 evaluates SBERT on common STS tasks and on the challenging Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Section 5 evaluates SBERT on SentEval. In section 6, we perform an ablation study to test some design aspects of SBERT. In section 7, we compare the computational efficiency of SBERT sentence embeddings in contrast to other state-of-the-art sentence embedding methods.

To alleviate this problem, we developed SBERT. The Siamese network architecture makes it possible to derive a fixed-size vector of the input sentence. Semantically similar sentences can be found using similarity metrics such as cosine similarity or Manhattan/Euclidean distance. These similarity metrics can be executed very efficiently on modern hardware, allowing SBERT to be used for semantic similarity search as well as clustering. The complexity of finding the most similar sentence pairs in a collection of 10,000 sentences is reduced from 65 hours for BERT to the calculation of 10,000 sentence vectors (about 5 seconds for SBERT) and the calculation of cosine similarity (about 0.01 seconds). By using an optimized index structure, finding the most similar Quora questions can be reduced from 50 hours to a few milliseconds (Johnson et al., 2017).
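To make the efficiency gain concrete, the sketch below (our own illustration, not the paper's code) encodes a small corpus once with a bi-encoder and finds the most similar pair with a single cosine-similarity matrix. The model name is one of the NLI-trained SBERT models published in the sentence-transformers repository, and the toy sentences are made up.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Toy corpus; in the paper's example n = 10,000 sentences.
sentences = [
    "A man is playing a guitar.",
    "Someone is playing an instrument.",
    "The weather is cold today.",
]

model = SentenceTransformer("bert-base-nli-mean-tokens")  # an NLI-trained SBERT model from the repo
embeddings = model.encode(sentences)                      # n forward passes instead of n*(n-1)/2

# Normalize so the dot product equals cosine similarity, then compare all pairs at once.
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = embeddings @ embeddings.T

np.fill_diagonal(similarity, -1.0)                        # ignore self-similarity
i, j = np.unravel_index(similarity.argmax(), similarity.shape)
print(sentences[i], "<->", sentences[j], float(similarity[i, j]))
```

For very large collections, the resulting vectors can additionally be placed in an optimized index structure such as the one of Johnson et al. (2017) instead of the brute-force matrix above.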

We fine-tune SBERT on NLI data, which produces sentence vectors that significantly outperform other state-of-the-art sentence vector methods such as InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018). On the seven semantic textual similarity (STS) tasks, SBERT achieves an improvement of 11.7 points compared with InferSent and 5.5 points compared with Universal Sentence Encoder. On the sentence vector evaluation toolkit SentEval (Conneau and Kiela, 2018), we achieve improvements of 2.1 and 2.6 points, respectively.

SBERT can be adapted to specific tasks. It achieves new state-of-the-art performance on a challenging argument similarity dataset (Misra et al., 2016) and on a triplet dataset that distinguishes sentences from different sections of a Wikipedia article (Dor et al., 2018).

The paper is structured as follows: Section 3 introduces SBERT, and Section 4 evaluates SBERT on common STS tasks and the challenging Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Section 5 evaluates SBERT on SentEval. In Section 6, we conduct an ablation study to test some design aspects of SBERT. In Section 7, we compare the computational efficiency of SBERT sentence embeddings with other state-of-the-art sentence embedding methods.

2. Related Work

We first introduce BERT, then we discuss state-of-the-art sentence embedding methods.
BERT (Devlin et al., 2018) is a pre-trained transformer network (Vaswani et al., 2017), which set new state-of-the-art results for various NLP tasks, including question answering, sentence classification, and sentence-pair regression. The input for BERT for sentence-pair regression consists of the two sentences, separated by a special [SEP] token. Multi-head attention over 12 (base-model) or 24 layers (large-model) is applied and the output is passed to a simple regression function to derive the final label. Using this setup, BERT set a new state-of-the-art performance on the Semantic Textual Similarity (STS) benchmark (Cer et al., 2017). RoBERTa (Liu et al., 2019) showed that the performance of BERT can be further improved by small adaptations to the pre-training process. We also tested XLNet (Yang et al., 2019), but it led in general to worse results than BERT.
A large disadvantage of the BERT network structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT. To bypass this limitation, researchers passed single sentences through BERT and then derived a fixed-sized vector by either averaging the outputs (similar to average word embeddings) or by using the output of the special CLS token (for example: May et al. (2019); Zhang et al. (2019); Qiao et al. (2019)). These two options are also provided by the popular bert-as-a-service repository. To our knowledge, there is so far no evaluation of whether these methods lead to useful sentence embeddings.

We first introduce BERT and then discuss state-of-the-art sentence embedding (vector) methods.

BERT (Devlin et al., 2018) is a pre-trained transformer network (Vaswani et al., 2017) that achieves state-of-the-art results on a variety of NLP tasks, including question answering, sentence classification, and sentence-pair regression. The BERT input for the sentence-pair regression task consists of two sentences separated by a special [SEP] token. Multi-head attention over 12 layers (base model) or 24 layers (large model) is used, and the output is passed to a simple regression function to generate the final label. Using this architecture, BERT achieves a new state-of-the-art performance on the Semantic Textual Similarity (STS) benchmark dataset (Cer et al., 2017). RoBERTa (Liu et al., 2019) shows that the performance of BERT can be further improved by small adaptations to the pre-training process. We also tested XLNet (Yang et al., 2019), but overall the results were worse than BERT.

A big disadvantage of the BERT network structure is that no independent sentence vectors (embeddings) are calculated, which makes it difficult to derive sentence vectors from BERT. To get around this limitation, researchers pass a single sentence through BERT and then generate a fixed-size vector either by averaging the outputs (similar to average word embeddings) or by using the output of the special CLS token (e.g.: May et al. (2019); Zhang et al. (2019); Qiao et al. (2019)). Both options are also provided by the popular bert-as-a-service repository. To the best of our knowledge, it has not been evaluated whether these methods lead to useful sentence embeddings.

Sentence embeddings are a well-studied area with dozens of proposed methods. Skip-Thought (Kiros et al., 2015) trains an encoder-decoder architecture to predict the surrounding sentences. InferSent (Conneau et al., 2017) uses labeled data of the Stanford Natural Language Inference dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018) to train a siamese BiLSTM network with max-pooling over the output. Conneau et al. showed that InferSent consistently outperforms unsupervised methods like Skip-Thought. Universal Sentence Encoder (Cer et al., 2018) trains a transformer network and augments unsupervised learning with training on SNLI. Hill et al. (2016) showed that the task on which sentence embeddings are trained significantly impacts their quality. Previous work (Conneau et al., 2017; Cer et al., 2018) found that the SNLI datasets are suitable for training sentence embeddings. Yang et al. (2018) presented a method to train on conversations from Reddit using siamese DAN and siamese transformer networks, which yielded good results on the STS benchmark dataset.
Humeau et al. (2019) address the run-time overhead of the cross-encoder from BERT and present a method (poly-encoders) to compute a score between m context vectors and pre-computed candidate embeddings using attention. This idea works for finding the highest-scoring sentence in a larger collection. However, poly-encoders have the drawback that the score function is not symmetric and the computational overhead is too large for use-cases like clustering, which would require O(n²) score computations.
Previous neural sentence embedding methods started the training from a random initialization. In this publication, we use the pre-trained BERT and RoBERTa networks and only fine-tune them to yield useful sentence embeddings. This significantly reduces the needed training time: SBERT can be tuned in less than 20 minutes, while yielding better results than comparable sentence embedding methods.

Sentence embedding is a widely studied field and many methods have been proposed. Skip-Thought (Kiros et al., 2015) trains an encoder-decoder network structure to predict surrounding sentences. InferSent (Conneau et al., 2017) uses the labeled Stanford Natural Language Inference dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018) to train a siamese BiLSTM network with max pooling over the output. Conneau et al. show that InferSent consistently outperforms unsupervised methods such as Skip-Thought. Universal Sentence Encoder (Cer et al., 2018) trains a transformer network and augments unsupervised learning with SNLI training. Hill et al. (2016) show that the task on which sentence embeddings are trained significantly affects their quality. Previous research work (Conneau et al., 2017; Cer et al., 2018) found that the SNLI datasets are suitable for training sentence vectors. Yang et al. (2018) proposed a method for training on conversations from Reddit using siamese DAN and siamese transformer networks, which achieved good results on the STS benchmark dataset.

Humeau et al. (2019) address BERT's cross-encoder runtime overhead and propose a method (poly-encoders) to compute scores between m context vectors and pre-computed candidate vectors using attention. This idea works for finding the highest-scoring sentences in a larger set. However, the disadvantage of poly-encoders is that the scoring function is not symmetric, and the computational overhead is too large for use cases like clustering, which would require O(n²) score calculations.

Previous sentence vector methods were trained from random initialization. In this paper, we use the pre-trained BERT and RoBERTa networks and only fine-tune them to produce useful sentence vectors. This drastically reduces the time required for training: SBERT can be tuned in less than 20 minutes, while producing better results than comparable sentence vector methods.

3. Model

SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-size sentence embedding. We experiment with three pooling strategies: Using the output of the CLS-token, computing the mean of all output vectors (MEAN strategy), and computing a max-over-time of the output vectors (MAX strategy). The default configuration is MEAN.
In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine-similarity.
The network structure depends on the available training data. We experiment with the following structures and objective functions.
Classification Objective Function. We concatenate the sentence embeddings u and v with the element-wise difference |u − v| and multiply it with the trainable weight W_t ∈ R^(3n×k): o = softmax(W_t(u, v, |u − v|))
where n is the dimension of the sentence embeddings and k the number of labels. We optimize cross-entropy loss. This structure is depicted in Figure 1.
Regression Objective Function. The cosine-similarity between the two sentence embeddings u and v is computed (Figure 2). We use mean-squared-error loss as the objective function.
Triplet Objective Function. Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Mathematically, we minimize the following loss function: max(||s_a − s_p|| − ||s_a − s_n|| + ε, 0)
with s_x the sentence embedding for a/n/p, || · || a distance metric and margin ε. Margin ε ensures that s_p is at least ε closer to s_a than s_n. As metric we use Euclidean distance and we set ε = 1 in our experiments.

SBERT adds a pooling layer operation to the output of BERT/RoBERTa to derive a fixed-size sentence vector. We experiment with three pooling strategies: using the output of the CLS-token, computing the mean of all output vectors (MEAN strategy), and computing the maximum value of the output vectors (max strategy). The default configuration is MEAN.
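As an illustration of the three pooling strategies, the following sketch (our own, not taken from the paper's code) derives a sentence vector from the token embeddings of a Hugging Face BERT model; the tokenizer and model names are assumptions for the example.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentences, strategy="MEAN"):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = bert(**batch).last_hidden_state          # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()            # 1 for real tokens, 0 for padding
    if strategy == "MEAN":
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)   # mean over real tokens
    if strategy == "MAX":
        token_embeddings = token_embeddings.masked_fill(mask == 0, -1e9)
        return token_embeddings.max(dim=1).values                   # max-over-time
    return token_embeddings[:, 0]                                   # CLS-token output

print(sentence_embedding(["This is a test."]).shape)                # torch.Size([1, 768])
```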

To fine-tune BERT/RoBERTa, we created siamese and triplet networks (Schroff et al., 2015) to update weights so that the resulting sentence vectors are semantically meaningful and can be compared using cosine similarity.

The network structure depends on the available training data. We conduct experiments with the following structures and objective functions.

Classification objective function. We obtain the sentence vectors u and v, concatenate them together with the element-wise difference vector |u − v|, and multiply the concatenated vector by a trainable weight W_t ∈ R^(3n×k): o = softmax(W_t(u, v, |u − v|)), where n is the dimension of the sentence vectors and k is the number of labels. We optimize the cross-entropy loss. The structure is shown in Figure 1.

Figure 1: SBERT architecture with the classification objective function, e.g. for fine-tuning on the SNLI dataset. The two BERT networks have tied weights (siamese network structure).
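A minimal PyTorch sketch of this classification head, assuming u and v are SBERT sentence vectors of dimension n = 768 and k = 3 NLI labels (the random tensors below are stand-ins for real embeddings):

```python
import torch
import torch.nn as nn

class SoftmaxHead(nn.Module):
    """Classification objective: o = softmax(W_t (u, v, |u - v|))."""
    def __init__(self, embedding_dim: int, num_labels: int = 3):
        super().__init__()
        # W_t has shape (3n x k): 3n concatenated features, k labels.
        self.classifier = nn.Linear(3 * embedding_dim, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()   # cross-entropy over the k NLI labels

    def forward(self, u, v, labels=None):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        logits = self.classifier(features)
        return self.loss_fn(logits, labels) if labels is not None else logits

# u, v stand in for SBERT sentence vectors of the two input sentences.
u, v = torch.randn(16, 768), torch.randn(16, 768)
labels = torch.randint(0, 3, (16,))
print(SoftmaxHead(768)(u, v, labels))
```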

Regression objective function. The cosine similarity between the two sentence vectors u and v is computed (Figure 2). We use the mean-squared-error loss as the objective function.

Figure 2: Example of an inference SBERT architecture for computing similarity scores. This architecture is also used with regression objective functions.
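The regression objective can be sketched in a few lines of PyTorch; the gold scores here are illustrative stand-ins (e.g. STS labels rescaled to [0, 1]):

```python
import torch
import torch.nn.functional as F

def regression_loss(u, v, gold_scores):
    """Regression objective: mean-squared error between cos(u, v) and the gold similarity."""
    cosine = F.cosine_similarity(u, v, dim=-1)
    return F.mse_loss(cosine, gold_scores)

u, v = torch.randn(16, 768), torch.randn(16, 768)
gold = torch.rand(16)          # stand-in gold scores, e.g. STS labels rescaled to [0, 1]
print(regression_loss(u, v, gold))
```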

Triplet objective function. Given an anchor sentence a, a positive sentence p, and a negative sentence n, the model is tuned so that the distance between a and p is smaller than the distance between a and n. That is, we minimize the loss function
max(||s_a − s_p|| − ||s_a − s_n|| + ε, 0)
where || · || represents the distance between two samples; this paper uses the Euclidean distance. s_a, s_p, and s_n are the sentence embeddings of the corresponding samples. In our experiments, we set the margin ε to 1.
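PyTorch's built-in triplet loss with p = 2 (Euclidean distance) and margin 1 matches this objective; the sketch below uses random tensors as stand-ins for the sentence embeddings s_a, s_p, s_n:

```python
import torch
import torch.nn as nn

# Euclidean distance (p=2) and margin epsilon = 1 reproduce
# max(||s_a - s_p|| - ||s_a - s_n|| + epsilon, 0).
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

s_a = torch.randn(16, 768)   # anchor sentence embeddings (random stand-ins)
s_p = torch.randn(16, 768)   # positive sentence embeddings
s_n = torch.randn(16, 768)   # negative sentence embeddings
print(triplet_loss(s_a, s_p, s_n))
```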

3.1 Training Details

We train SBERT on the combination of the SNLI (Bowman et al., 2015) and the Multi-Genre NLI (Williams et al., 2018) datasets. SNLI is a collection of 570,000 sentence pairs annotated with the labels contradiction, entailment, and neutral. MultiNLI contains 430,000 sentence pairs and covers a range of genres of spoken and written text. We fine-tune SBERT with a 3-way softmax-classifier objective function for one epoch. We used a batch size of 16, the Adam optimizer with learning rate 2e−5, and a linear learning-rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.

We train SBERT on the combined SNLI (Bowman et al., 2015) and Multi-Genre NLI (Williams et al., 2018) datasets. SNLI contains 570,000 sentence pairs with contradiction, entailment, and neutral labels. MultiNLI contains 430,000 sentence pairs covering various genres of spoken and written text. We fine-tune SBERT for one epoch with a 3-way softmax-classifier objective function. We use the Adam optimizer with a batch size of 16 and a learning rate of 2e−5, and a linear learning-rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.
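A rough sketch of this training setup using the sentence-transformers library released with the paper is shown below; the exact argument names and model identifiers may differ across library versions, and the single training example is a made-up stand-in for the roughly one million SNLI/MultiNLI pairs:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# BERT encoder followed by a MEAN pooling layer.
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Stand-in for the SNLI/MultiNLI pairs; labels are 0/1/2 (contradiction/entailment/neutral).
train_examples = [InputExample(texts=["A man inspects a uniform.", "The man is sleeping."], label=0)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

train_loss = losses.SoftmaxLoss(model=model,
                                sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                                num_labels=3)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100,                 # the paper warms up over 10% of the training data
          optimizer_params={"lr": 2e-5})    # Adam with learning rate 2e-5
```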

4. Evaluation - Semantic Textual Similarity

We evaluate the performance of SBERT for common Semantic Textual Similarity (STS) tasks. State-of-the-art methods often learn a (complex) regression function that maps sentence embeddings to a similarity score. However, these regression functions work pair-wise and, due to the combinatorial explosion, they are often not scalable if the collection of sentences reaches a certain size. Instead, we always use cosine-similarity to compare the similarity between two sentence embeddings. We ran our experiments also with negative Manhattan and negative Euclidean distances as similarity measures, but the results for all approaches remained roughly the same.

We evaluate the performance of SBERT on common Semantic Textual Similarity (STS) tasks. State-of-the-art methods typically learn (complex) regression functions that map sentence embeddings to similarity scores. However, these regression functions work pair-wise, and due to combinatorial explosion they often do not scale once the set of sentences reaches a certain size. Instead, we always use cosine similarity to compare the similarity between two sentence embeddings. We also conducted experiments using negative Manhattan distance and negative Euclidean distance as similarity measures, but all methods gave basically the same results.

4.1 Unsupervised STS

We evaluate the performance of SBERT for STS without using any STS-specific training data. We use the STS tasks 2012–2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). These datasets provide labels between 0 and 5 on the semantic relatedness of sentence pairs. We showed in (Reimers et al., 2016) that Pearson correlation is badly suited for STS. Instead, we compute the Spearman's rank correlation between the cosine-similarity of the sentence embeddings and the gold labels. The setup for the other sentence embedding methods is equivalent; the similarity is computed by cosine-similarity. The results are depicted in Table 1.

We evaluate the performance of SBERT on STS without using any STS-specific training data. We use the 2012–2016 STS tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). These datasets provide labels between 0 and 5 for the semantic relatedness of sentence pairs. We found in (Reimers et al., 2016) that the Pearson correlation is poorly suited for STS. Instead, we compute the Spearman rank correlation between the cosine similarity of the sentence embeddings and the gold labels. The setup for the other sentence embedding methods is the same, and the similarity is calculated by cosine similarity. The results are shown in Table 1.
Table 1: Spearman rank correlation ρ between the cosine similarity of sentence representations and the gold labels for various Semantic Textual Similarity (STS) tasks. By convention, values are reported as ρ × 100. STS12–STS16: SemEval 2012–2016, STSb: STS benchmark, SICK-R: SICK relatedness dataset.
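The evaluation protocol (cosine similarity of the two embeddings compared against the gold labels with Spearman's ρ) can be sketched as follows; the model name, sentence pairs, and gold scores are illustrative stand-ins:

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")   # assumed SBERT model name

# Each pair (sentences1[i], sentences2[i]) has a human similarity rating in [0, 5].
sentences1 = ["A plane is taking off.", "A man is playing a flute."]
sentences2 = ["An air plane is taking off.", "A man is playing a guitar."]
gold_scores = [5.0, 1.6]                                    # illustrative values

emb1 = model.encode(sentences1)
emb2 = model.encode(sentences2)
cosine = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))

rho, _ = spearmanr(cosine, gold_scores)
print("Spearman rho x 100:", rho * 100)
```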

The results show that directly using the output of BERT leads to rather poor performance. Averaging the BERT embeddings achieves an average correlation of only 54.81, and using the CLS-token output only achieves an average correlation of 29.19. Both are worse than computing average GloVe embeddings.

It turns out that directly using BERT's output leads to rather poor performance. Average pooling of BERT's embeddings achieves an average correlation of only 54.81, and using the CLS-token output reaches only 29.19. Both of these are worse than computing averaged GloVe embeddings.

Using the described siamese network structure and fine-tuning mechanism substantially improves the correlation, outperforming both InferSent and Universal Sentence Encoder substantially. The only dataset where SBERT performs worse than Universal Sentence Encoder is SICK-R. Universal Sentence Encoder was trained on various datasets, including news, question-answer pages and discussion forums, which appears to be more suitable to the data of SICK-R. In contrast, SBERT was pre-trained only on Wikipedia (via BERT) and on NLI data.

Using the previously described siamese network structure and fine-tuning mechanism significantly improves the correlation, greatly outperforming both InferSent and Universal Sentence Encoder. The only dataset where SBERT performs worse than Universal Sentence Encoder is SICK-R. Universal Sentence Encoder is trained on a variety of datasets, including news, question-answer pages, and forums, which seems to be more suitable for the SICK-R data. In contrast, SBERT is only pre-trained on Wikipedia (via BERT) and NLI data.

While RoBERTa was able to improve the performance for several supervised tasks, we only observe minor difference between SBERT and SRoBERTa for generating sentence embeddings.

Although RoBERTa is able to improve the performance of some supervised tasks, we only observe a small difference between SBERT and SRoBERTa in generating sentence embeddings.

4.2 Supervised STS

The STS benchmark (STSb) (Cer et al., 2017) is a popular dataset to evaluate supervised STS systems. The data includes 8,628 sentence pairs from the three categories captions, news, and forums. It is divided into train (5,749), dev (1,500) and test (1,379). BERT set a new state-of-the-art performance on this dataset by passing both sentences to the network and using a simple regression method for the output.

The STS benchmark (STSb) (Cer et al., 2017) is a popular dataset for evaluating supervised STS systems. The data consist of 8,628 sentence pairs from three categories: captions, news, and forums. It is divided into a training set (5,749), a validation set (1,500), and a test set (1,379). BERT set a new state-of-the-art performance on this dataset by passing both sentences to the network and using a simple regression method to output the result.

We use the training set to fine-tune SBERT using the regression objective function. At prediction time, we compute the cosine-similarity between the sentence embeddings. All systems are trained with 10 random seeds to counter variances (Reimers and Gurevych, 2018).

We use the training set to fine-tune SBERT with the regression objective function. At prediction time, we compute the cosine similarity between the sentence vectors. All systems are trained with 10 random seeds to counteract variance (Reimers and Gurevych, 2018).
Table 2: Evaluation on the STS benchmark test set. The BERT system is trained with 10 random seeds and 4 epochs. SBERT is fine-tuned on the STSb dataset, and SBERT-NLI is pre-trained on the NLI dataset and then fine-tuned on the STSb dataset.

The results are depicted in Table 2. We experimented with two setups: Only training on STSb, and first training on NLI, then training on STSb. We observe that the latter strategy leads to a slight improvement of 1-2 points. This two-step approach had an especially large impact for the BERT cross-encoder, which improved the performance by 3-4 points. We do not observe a significant difference between BERT and RoBERTa.

The results are shown in Table 2. We experimented with two approaches: 1) training on STSb only, and 2) training on NLI first and then on STSb. We observe that the latter strategy gives a slight improvement of 1-2 points. This two-step approach had a particularly large impact on the BERT cross-encoder, improving performance by 3-4 points. We did not observe significant differences between BERT and RoBERTa.

4.3 Argument Facet Similarity

We evaluate SBERT on the Argument Facet Similarity (AFS) corpus by Misra et al. (2016). The AFS corpus annotated 6,000 sentential argument pairs from social media dialogs on three controversial topics: gun control, gay marriage, and death penalty. The data was annotated on a scale from 0 (“different topic”) to 5 (“completely equivalent”). The similarity notion in the AFS corpus is fairly different to the similarity notion in the STS datasets from SemEval. STS data is usually descriptive, while AFS data are argumentative excerpts from dialogs. To be considered similar, arguments must not only make similar claims, but also provide a similar reasoning. Further, the lexical gap between the sentences in AFS is much larger. Hence, simple unsupervised methods as well as state-of-the-art STS systems perform badly on this dataset (Reimers et al., 2019).

We evaluate SBERT on the Argument Facet Similarity (AFS) corpus of Misra et al. (2016). The AFS corpus annotates 6,000 sentential argument pairs from social media dialogs on three controversial topics: gun control, gay marriage, and the death penalty. The data are labeled on a scale from 0 ("different topic") to 5 ("completely equivalent"). The notion of similarity in the AFS corpus is quite different from the similarity notion in SemEval's STS datasets. STS data are usually descriptive, whereas AFS data are argumentative excerpts from dialogs. To be considered similar, arguments must not only make similar claims but also provide similar reasoning. Furthermore, the lexical gap between sentences in AFS is much larger. Therefore, simple unsupervised methods as well as state-of-the-art STS systems perform poorly on this dataset (Reimers et al., 2019).

We evaluate SBERT on this dataset in two scenarios: 1) As proposed by Misra et al., we evaluate SBERT using 10-fold cross-validation. A drawback of this evaluation setup is that it is not clear how well approaches generalize to different topics. Hence, 2) we evaluate SBERT in a cross-topic setup. Two topics serve for training and the approach is evaluated on the left-out topic. We repeat this for all three topics and average the results.

We evaluate SBERT on this dataset in two scenarios: 1) As proposed by Misra et al., we evaluate SBERT using 10-fold cross-validation. A disadvantage of this evaluation setup is that it is not clear how well the approach generalizes to different topics. Therefore, 2) we evaluate SBERT in a cross-topic setting. Two topics are used for training, and evaluation is performed on the held-out topic. We repeat this for all three topics and average the results.

SBERT is fine-tuned using the Regression Objective Function. The similarity score is computed using cosine-similarity based on the sentence embeddings. We also provide the Pearson correlation r to make the results comparable to Misra et al. However, we showed (Reimers et al., 2016) that Pearson correlation has some serious drawbacks and should be avoided for comparing STS systems. The results are depicted in Table 3.

SBERT is fine-tuned using the regression objective function. The similarity score is computed using cosine similarity based on the sentence embeddings. We also provide the Pearson correlation r to make the results comparable to those of Misra et al. However, we found (Reimers et al., 2016) that the Pearson correlation has some serious flaws and should be avoided when comparing STS systems. The results are shown in Table 3.
[Table 3: Pearson correlation r and Spearman rank correlation ρ on the Argument Facet Similarity (AFS) corpus.]

Unsupervised methods like tf-idf, average GloVe embeddings or InferSent perform rather badly on this dataset with low scores. Training SBERT in the 10-fold cross-validation setup gives a performance that is nearly on-par with BERT.

Unsupervised methods like tf-idf, averaged GloVe embeddings, or InferSent perform quite poorly on this dataset with low scores. Training SBERT in the 10-fold cross-validation setup gives a performance that is almost on par with BERT.

However, in the cross-topic evaluation, we observe a performance drop of SBERT by about 7 points Spearman correlation. To be considered similar, arguments should address the same claims and provide the same reasoning. BERT is able to use attention to compare directly both sentences (e.g. word-by-word comparison), while SBERT must map individual sentences from an unseen topic to a vector space such that arguments with similar claims and reasons are close. This is a much more challenging task, which appears to require more than just two topics for training to work on-par with BERT.

However, in the cross-topic evaluation, we observe a performance drop of about 7 Spearman correlation points for SBERT. To be considered similar, arguments should address the same claims and provide the same reasoning. BERT is able to use attention to directly compare both sentences (e.g. word-by-word), whereas SBERT has to map individual sentences from an unseen topic to a vector space such that arguments with similar claims and reasons are close together. This is a much more challenging task and appears to require more than just two topics for training to work on par with BERT.

4.4 Wikipedia Sections Distinction

Dor et al. (2018) use Wikipedia to create a thematically fine-grained train, dev and test set for sentence embeddings methods. Wikipedia articles are separated into distinct sections focusing on certain aspects. Dor et al. assume that sentences in the same section are thematically closer than sentences in different sections. They use this to create a large dataset of weakly labeled sentence triplets: The anchor and the positive example come from the same section, while the negative example comes from a different section of the same article. For example, from the Alice Arnold article: Anchor: Arnold joined the BBC Radio Drama Company in 1988., positive: Arnold gained media attention in May 2012., negative: Balding and Arnold are keen amateur golfers.

Dor et al. (2018) used Wikipedia to create a thematically fine-grained training, validation, and test set for sentence embedding methods. Wikipedia articles are divided into distinct sections, each focusing on a certain aspect. Dor et al. assume that sentences in the same section are thematically closer than sentences in different sections. They use this assumption to create a large dataset of weakly labeled sentence triplets: the anchor and the positive example come from the same section, while the negative example comes from a different section of the same article. For example, from the Alice Arnold article: Anchor: Arnold joined the BBC Radio Drama Company in 1988. Positive example: Arnold gained media attention in May 2012. Negative example: Balding and Arnold are keen amateur golfers.

We use the dataset from Dor et al. We use the Triplet Objective, train SBERT for one epoch on the about 1.8 Million training triplets and evaluate it on the 222,957 test triplets. Test triplets are from a distinct set of Wikipedia articles. As evaluation metric, we use accuracy: Is the positive example closer to the anchor than the negative example?

We use the dataset from Dor et al. Using the triplet objective, we train SBERT for one epoch on about 1.8 million training triplets and evaluate it on 222,957 test triplets. The test triplets come from a distinct set of Wikipedia articles. As an evaluation metric, we use accuracy: is the positive example closer to the anchor than the negative example?
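The accuracy metric is straightforward to sketch: for each triplet, check whether the positive example is closer (here, in Euclidean distance) to the anchor than the negative example. The embeddings below are random stand-ins:

```python
import numpy as np

def triplet_accuracy(anchor_emb, pos_emb, neg_emb):
    """Fraction of triplets where the positive is closer (Euclidean) to the anchor than the negative."""
    d_pos = np.linalg.norm(anchor_emb - pos_emb, axis=1)
    d_neg = np.linalg.norm(anchor_emb - neg_emb, axis=1)
    return float(np.mean(d_pos < d_neg))

# Random stand-ins; in the evaluation these come from encoding the 222,957 test triplets.
a, p, n = (np.random.randn(1000, 768) for _ in range(3))
print(triplet_accuracy(a, p, n))
```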
[Table 4: Accuracy on the Wikipedia section triplets dataset (Dor et al., 2018).]

Results are presented in Table 4. Dor et al. finetuned a BiLSTM architecture with triplet loss to derive sentence embeddings for this dataset. As the table shows, SBERT clearly outperforms the BiLSTM approach by Dor et al.

The results are shown in Table 4. Dor et al. use a triplet loss function to fine-tune the BiLSTM structure to generate sentence embeddings for this dataset. As shown in the table, SBERT significantly outperforms the BiLSTM of Dor et al.

5. Evaluation - SentEval

SentEval (Conneau and Kiela, 2018) is a popular toolkit to evaluate the quality of sentence embeddings. Sentence embeddings are used as features for a logistic regression classifier. The logistic regression classifier is trained on various tasks in a 10-fold cross-validation setup and the prediction accuracy is computed for the test-fold.

SentEval (Conneau and Kiela, 2018) is a popular toolkit for evaluating the quality of sentence vectors. Sentence vectors are used as features for a logistic regression classifier. In a 10-fold cross-validation setting, the logistic regression classifier is trained on various tasks, and the prediction accuracy is computed for the test fold.

The purpose of SBERT sentence embeddings is not to be used for transfer learning for other tasks. Here, we think fine-tuning BERT as described by Devlin et al. (2018) for new tasks is the more suitable method, as it updates all layers of the BERT network. However, SentEval can still give an impression of the quality of our sentence embeddings for various tasks.

SBERT sentence embeddings are not intended for transfer learning on other tasks. Here, we argue that fine-tuning BERT as described by Devlin et al. (2018) is a more appropriate approach for new tasks, since it updates all layers of the BERT network. However, SentEval can still give an impression of the quality of our sentence embeddings on a variety of tasks.

We compare the SBERT sentence embeddings to other sentence embeddings methods on the following seven SentEval transfer tasks:

• MR: Sentiment prediction for movie review snippets on a five-star scale (Pang and Lee, 2005).
• CR: Sentiment prediction of customer product reviews (Hu and Liu, 2004).
• SUBJ: Subjectivity prediction of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
• MPQA: Phrase level opinion polarity classification from newswire (Wiebe et al., 2005).
• SST: Stanford Sentiment Treebank with binary labels (Socher et al., 2013).
• TREC: Fine-grained question-type classification from TREC (Li and Roth, 2002).
• MRPC: Microsoft Research Paraphrase Corpus from parallel news sources (Dolan et al., 2004).

We use the following seven SentEval transfer tasks to compare SBERT sentence-vectors with other sentence-vector methods:
• MR: Sentiment prediction for movie review snippets on a five-star scale (Pang and Lee, 2005).
• CR: Sentiment prediction of customer product reviews (Hu and Liu, 2004).
• SUBJ: Subjectivity prediction of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
• MPQA: Phrase-level opinion polarity classification from newswire (Wiebe et al., 2005).
• SST: Stanford Sentiment Treebank with binary labels (Socher et al., 2013).
• TREC: Fine-grained question-type classification from TREC (Li and Roth, 2002).
• MRPC: Microsoft Research Paraphrase Corpus from parallel news sources (Dolan et al., 2004).
[Table 5: Evaluation of sentence embeddings on the seven SentEval transfer tasks (accuracy).]

The results can be found in Table 5. SBERT is able to achieve the best performance in 5 out of 7 tasks. The average performance increases by about 2 percentage points compared to InferSent as well as the Universal Sentence Encoder. Even though transfer learning is not the purpose of SBERT, it outperforms other state-of-the-art sentence embeddings methods on this task.

The results are shown in Table 5. SBERT is able to achieve the best performance in 5 out of 7 tasks. Compared with InferSent and Universal Sentence Encoder, the average performance improves by about 2 percentage points. Although transfer learning is not the purpose of SBERT, it outperforms other state-of-the-art sentence embedding methods on this task.

It appears that the sentence embeddings from SBERT capture well sentiment information: We observe large improvements for all sentiment tasks (MR, CR, and SST) from SentEval in comparison to InferSent and Universal Sentence Encoder.

SBERT's sentence embeddings seem to capture sentiment information well: we observe large improvements on all sentiment tasks (MR, CR, and SST) from SentEval compared to InferSent and Universal Sentence Encoder.

The only dataset where SBERT is significantly worse than Universal Sentence Encoder is the TREC dataset. Universal Sentence Encoder was pre-trained on question-answering data, which appears to be beneficial for the question-type classification task of the TREC dataset.

The only dataset where SBERT is significantly worse than Universal Sentence Encoder is the TREC dataset. Universal Sentence Encoder is pre-trained on question-answering data, which appears to be beneficial for the question-type classification task of the TREC dataset.

Average BERT embeddings or using the CLS-token output from a BERT network achieved bad results for the various STS tasks (Table 1), worse than average GloVe embeddings. However, for SentEval, average BERT embeddings and the BERT CLS-token output achieve decent results (Table 5), outperforming average GloVe embeddings. The reason for this is the different setups. For the STS tasks, we used cosine-similarity to estimate the similarities between sentence embeddings. Cosine-similarity treats all dimensions equally. In contrast, SentEval fits a logistic regression classifier to the sentence embeddings. This allows certain dimensions to have a higher or lower impact on the classification result.

For the various STS tasks (Table 1), the output of BERT using average pooling or the CLS-token achieved unsatisfactory results, even worse than averaged GloVe embeddings. However, for SentEval, both achieve relatively good results (Table 5) and are better than the averaged GloVe vector. The reason for this difference is the different setups. For the STS tasks, we use cosine similarity to evaluate the similarity between sentence vectors. Cosine similarity treats all dimensions equally. In contrast, SentEval fits a logistic regression classifier to the sentence vectors. This allows certain dimensions to have a higher or lower impact on the classification results.
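A sketch of this SentEval-style setup, using scikit-learn as a stand-in for the toolkit's own classifier: sentence embeddings serve as input features to a logistic regression classifier evaluated with 10-fold cross-validation, so each embedding dimension receives its own learned weight (the data below are random placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Random placeholders: X would be sentence embeddings (e.g. from model.encode),
# y the binary labels of a SentEval task such as MR or CR.
X = np.random.randn(500, 768)
y = np.random.randint(0, 2, size=500)

# Unlike cosine similarity, which weights all dimensions equally, the logistic
# regression classifier learns a separate weight per embedding dimension.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10)       # 10-fold cross-validation accuracy
print("mean accuracy:", scores.mean())
```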

We conclude that average BERT embeddings / the CLS-token output from BERT return sentence embeddings that are infeasible to use with cosine-similarity or with Manhattan / Euclidean distance. For transfer learning, they yield slightly worse results than InferSent or Universal Sentence Encoder. However, using the described fine-tuning setup with a siamese network structure on NLI datasets yields sentence embeddings that achieve a new state-of-the-art for the SentEval toolkit.

We conclude that the averaged BERT embeddings or the sentence vector returned by the CLS-token are not suitable for use with cosine similarity or Manhattan / Euclidean distance. For transfer learning, they produce somewhat worse results than InferSent or Universal Sentence Encoder. However, using the described siamese network fine-tuning on the NLI datasets, the resulting sentence embeddings achieve a new state-of-the-art for the SentEval toolkit.

6. Ablation Study

We have demonstrated strong empirical results for the quality of SBERT sentence embeddings. In this section, we perform an ablation study of different aspects of SBERT in order to get a better understanding of their relative importance.

We have demonstrated strong experimental results for the quality of SBERT sentence vectors. In this section, we conduct ablation studies on different aspects of SBERT to better understand their relative importance.

We evaluated different pooling strategies (MEAN, MAX, and CLS). For the classification objective function, we evaluate different concatenation methods. For each possible configuration, we train SBERT with 10 different random seeds and average the performances.

We evaluate different pooling strategies (MEAN, MAX, and CLS). For the classification objective function, we evaluate different concatenation methods. For each possible configuration, we train SBERT with 10 different random seeds and average their performance.
[Table 6: Ablation of pooling strategies and concatenation modes, measured with Spearman rank correlation on the STS benchmark development set.]

The objective function (classification vs. regression) depends on the annotated dataset. For the classification objective function, we train SBERTbase on the SNLI and the Multi-NLI dataset. For the regression objective function, we train on the training set of the STS benchmark dataset. Performances are measured on the development split of the STS benchmark dataset. Results are shown in Table 6.

The objective function (classification vs. regression) depends on the annotated dataset. For the classification objective function, we train SBERT-base on the SNLI and Multi-NLI datasets. For the regression objective function, we train on the training set of the STS benchmark dataset. Performance is measured on the development split of the STS benchmark dataset. The results are shown in Table 6.

When trained with the classification objective function on NLI data, the pooling strategy has a rather minor impact. The impact of the concatenation mode is much larger. InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018) both use (u, v, |u − v|, u ∗ v) as input for a softmax classifier. However, in our architecture, adding the element-wise u ∗ v decreased the performance.

When training with the classification objective function on NLI data, the pooling strategy has a rather minor impact, while the concatenation mode has a much larger impact. Both InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018) use (u, v, |u − v|, u ∗ v) as input to the softmax classifier. However, in our architecture, adding the element-wise product u ∗ v instead degrades the performance.

The most important component is the element-wise difference |u − v|. Note that the concatenation mode is only relevant for training the softmax classifier. At inference, when predicting similarities for the STS benchmark dataset, only the sentence embeddings u and v are used in combination with cosine-similarity. The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are further apart.

The most important part is the element-wise difference |u − v|. Note that the concatenation mode is only relevant for training the softmax classifier. At inference time, when predicting similarity on the STS benchmark dataset, only the sentence vectors u and v are used in conjunction with cosine similarity. The element-wise difference measures the distance between the dimensions of two sentence vectors, ensuring that similar pairs are closer and dissimilar pairs are farther apart.

When trained with the regression objective function, we observe that the pooling strategy has a large impact. There, the MAX strategy performs significantly worse than the MEAN or CLS-token strategy. This is in contrast to (Conneau et al., 2017), who found it beneficial for the BiLSTM-layer of InferSent to use MAX instead of MEAN pooling.

When training with the regression objective function, we observe that the pooling strategy has a large impact. Here, the MAX pooling strategy performs significantly worse than the MEAN pooling or CLS-token strategy. This is in contrast to the work of Conneau et al. (2017), who found that using MAX instead of MEAN pooling is more beneficial for InferSent's BiLSTM layer.

7. Computational Efficiency

Sentence embeddings potentially need to be computed for millions of sentences; hence, a high computation speed is desired. In this section, we compare SBERT to average GloVe embeddings, InferSent (Conneau et al., 2017), and Universal Sentence Encoder (Cer et al., 2018).

Sentence vectors may need to be calculated for millions of sentences, thus requiring high computational speed. In this section, we compare SBERT with averaged GloVe embeddings, InferSent (Conneau et al., 2017), and Universal Sentence Encoder (Cer et al., 2018).

For our comparison we use the sentences from the STS benchmark (Cer et al., 2017). We compute average GloVe embeddings using a simple for-loop with python dictionary lookups and NumPy. InferSent is based on PyTorch. For Universal Sentence Encoder, we use the TensorFlow Hub version, which is based on TensorFlow. SBERT is based on PyTorch. For improved computation of sentence embeddings, we implemented a smart batching strategy: Sentences with similar lengths are grouped together and are only padded to the longest element in a mini-batch. This drastically reduces computational overhead from padding tokens.

For comparison, we use sentences from the STS benchmark (Cer et al., 2017). We use Python dictionary lookups and NumPy in a simple for loop to compute the averaged GloVe sentence vectors. InferSent is implemented in PyTorch. For Universal Sentence Encoder, we use the TensorFlow Hub version, which is based on TensorFlow. SBERT is implemented in PyTorch. To speed up the computation of sentence embeddings, we implement a smart batching strategy: sentences of similar length are grouped together and only padded to the longest element in a mini-batch. This greatly reduces the computational overhead from padding tokens.
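A minimal sketch of such a smart batching strategy (our own illustration, not the paper's implementation): sentences are sorted by length, grouped into mini-batches, and each mini-batch is padded only to its own longest member:

```python
from typing import List
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def smart_batches(sentences: List[str], batch_size: int = 32):
    """Group sentences of similar length so padding only reaches the longest sentence per mini-batch."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        batch = [sentences[i] for i in order[start:start + batch_size]]
        # padding="longest" pads only up to the longest element of this mini-batch.
        yield tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

sentences = ["Hi.", "Short one.", "A somewhat longer sentence about semantic similarity search."]
for encoded in smart_batches(sentences, batch_size=2):
    print(encoded["input_ids"].shape)
```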
[Table 7: Computation speed (sentences per second) of sentence embedding methods on CPU and GPU.]

Performances were measured on a server with Intel i7-5820K CPU @ 3.30GHz, Nvidia Tesla V100 GPU, CUDA 9.2 and cuDNN. The results are depicted in Table 7.

Performance evaluation is performed on a server equipped with Intel i7-5820K CPU @ 3.30GHz, Nvidia Tesla V100 GPU, CUDA 9.2 and cuDNN. The results are shown in Table 7.

On CPU, InferSent is about 65% faster than SBERT. This is due to the much simpler network architecture. InferSent uses a single BiLSTM layer, while BERT uses 12 stacked transformer layers. However, an advantage of transformer networks is the computational efficiency on GPUs. There, SBERT with smart batching is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. Smart batching achieves a speed-up of 89% on CPU and 48% on GPU. Average GloVe embeddings is obviously by a large margin the fastest method to compute sentence embeddings.

On CPU, InferSent is about 65% faster than SBERT due to InferSent's much simpler network architecture. InferSent uses a single BiLSTM layer, while BERT uses 12 stacked transformer layers. However, one advantage of transformer networks is their computational efficiency on GPUs. Here, SBERT with smart batching is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. Smart batching achieves an 89% speedup on CPU and a 48% speedup on GPU. Averaged GloVe embeddings are clearly, by a large margin, the fastest way to compute sentence vectors.

8. Conclusion

We showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. The performance for seven STS tasks was below the performance of average GloVe embeddings.

We demonstrate that BERT's output embeddings directly map sentences to a vector space that is not well suited for use with commonly used similarity measures such as cosine similarity. The results on the seven STS tasks were all lower than those of averaged GloVe vectors.

To overcome this shortcoming, we presented Sentence-BERT (SBERT). SBERT fine-tunes BERT in a siamese / triplet network architecture. We evaluated the quality on various common benchmarks, where it could achieve a significant improvement over state-of-the-art sentence embeddings methods. Replacing BERT with RoBERTa did not yield a significant improvement in our experiments.

To overcome this shortcoming, we propose SBERT. SBERT fine-tunes BERT in a siamese / triplet network architecture. We evaluate the quality of SBERT on a variety of common benchmark datasets, where it achieves significant improvements over other state-of-the-art sentence embedding methods. Using RoBERTa instead of BERT did not yield significant improvements in our experiments.

SBERT is computationally efficient. On a GPU, it is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks which are computationally not feasible to be modeled with BERT. For example, clustering of 10,000 sentences with hierarchical clustering requires with BERT about 65 hours, as around 50 Million sentence combinations must be computed. With SBERT, we were able to reduce the effort to about 5 seconds.

SBERT is computationally efficient. On GPU, it is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks that are computationally infeasible to model with BERT. As an example: clustering 10,000 sentences with hierarchical clustering takes about 65 hours with BERT, because about 50 million sentence-pair combinations have to be computed, whereas with SBERT we were able to reduce the workload to about 5 seconds.
