Multimodal approach (under update)

TODO (still to code up):
MoCo
PCL

Semantic Representation for Dialogue Modeling

Abstract Meaning Representation (AMR) to help with conversation modeling

Dialogue-based sentence representation

https://zhuanlan.zhihu.com/p/437790124

Node-to-word relationship mapping: An AMR Aligner Tuned by Transition-based Parser

PCL: Peer-Contrastive Learning with Diverse Augmentations for Unsupervised Sentence Embeddings

Two families of augmentation for constructing positive samples in contrastive learning, discrete and continuous:

  • Discrete augmentation format: modify the text directly at the character or n-gram level, e.g. synonym replacement, word shuffling, word deletion, and back-translation;
  • Continuous augmentation format: perturb the hidden representations, e.g. two different dropout masks as in SimCSE.

Motivation:
Existing models use either discrete or continuous augmentation, but each relies on a single augmentation format (mono-augmenting) with a limited set of strategies, which encourages the "shortcut learning" described below. For example, a model trained only on dropout-based positive pairs tends to judge similarity by sentence length.

Background knowledge:

  • Shortcut Learning:
    Pre-trained language models such as BERT perform very well on many NLU tasks, but recent research shows that such models tend to exploit dataset biases and take "shortcuts" to score higher on the evaluation metric rather than truly understanding the language. This often leads to poor generalization on out-of-distribution (OOD) samples and poor robustness against adversarial attacks.
    Consider a classification task: given a sample $x$, the model learns a mapping $f(x)$ to predict the label $y$. During training, if certain words or phrases co-occur with a particular label $y$ more often than other words do, the model will latch onto those features for prediction. Under the i.i.d. assumption, the training, validation, and test sets are all drawn from the same distribution, so a model that relies on such shortcut features can still do well on the test set. However, on out-of-distribution samples and adversarial samples the model shows poor generalization and robustness, because those samples do not necessarily share the same shortcuts as the training data.

Method:
PCL augments positive samples with multiple methods at once (discrete + continuous). Using multiple augmentations is a double-edged sword, though: the quality of the augmented samples can no longer be guaranteed. The paper therefore proposes a peer-contrastive learning framework that performs not only the vanilla positive-negative contrast but also a positive-positive contrast.
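As a rough illustration of the idea (not the paper's implementation), the sketch below builds several "peer" positives per sentence: some from a discrete perturbation (random word deletion) and some that are just the raw sentence, which receives two different dropout masks when encoded twice in training mode; all views would then be contrasted against in-batch negatives.

```python
import random

def word_deletion(tokens, p=0.1):
    # discrete augmentation: drop each token with probability p
    kept = [t for t in tokens if random.random() > p]
    return kept or list(tokens)

def peer_views(sentence, n_discrete=2):
    # discrete views perturb the text itself; the two "continuous" views are
    # the unchanged sentence, which picks up two different dropout masks when
    # it is encoded twice by the same encoder with dropout active
    tokens = sentence.split()
    discrete = [" ".join(word_deletion(tokens)) for _ in range(n_discrete)]
    continuous = [sentence, sentence]
    return discrete + continuous

print(peer_views("contrastive learning builds sentence embeddings"))
```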


Axiomatic Attribution for Deep Networks

  • How humans make attributions
    Humans usually rely on counterfactual intuition when making attributions. When humans attribute some responsibility to a cause, they implicitly use the absence of that cause as a baseline for comparison. For example, if the reason for wanting to sleep is that you are sleepy, you will not want to sleep when you are not sleepy.

  • Deep network attribution
    Based on the principle of human attribution, deep network attribution also requires a baseline input to simulate the absence of causes. In many deep networks, there is a natural baseline in the input space. For example, in an object recognition network, a pure black image is a baseline. The formal definition of deep network attribution is given below:
    Suppose a function $F: R^n \to [0,1]$ represents a deep network. Let the input be $x = (x_1, \ldots, x_n) \in R^n$. The attribution of $x$ relative to a baseline input $x' \in R^n$ is a vector $A_F(x, x') = (a_1, \ldots, a_n) \in R^n$, where $a_i$ is the contribution of input $x_i$ to the prediction $F(x)$.

  • The significance of attribution
    First, when an image network is used to predict illness, attribution can help doctors see which region led the model to judge the patient as sick; second, deep-network attribution can provide insight for building rule-based systems; finally, attribution can also be used to inform recommender systems.

  • Applying the integrated gradients method
    1. Choosing the baseline
    1.1 The key step in applying integrated gradients is choosing a good baseline. The model's score on the baseline should ideally be close to 0, which makes the attribution results easier to interpret.
    1.2 The baseline must represent a completely uninformative sample, so that one can tell whether a cause comes from the input or from the baseline.
    1.3 For image tasks, a completely black image or a noise image can be chosen. For text tasks, an all-zero embedding is a better choice.
    1.4 Note that an all-black image is still a meaningful input, whereas an all-zero embedding in text carries no meaning at all.
    2. Computing the integrated gradients
    The integral can be approximated efficiently by a summation: it suffices to evaluate the gradients at sufficiently many evenly spaced points on the straight line from the baseline $x'$ to the input $x$ (a standard form of this approximation is given below).
    Here $m$ is the number of interpolation steps (the order of the approximation); the larger $m$ is, the better the approximation, but the larger the amount of computation.
    In practice, $m$ can be between 20 and 300.
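    The approximation formula itself appeared as an image in the original post; its standard form from the integrated-gradients paper is:
    $$\mathrm{IG}_i(x) \approx (x_i - x'_i) \times \frac{1}{m}\sum_{k=1}^{m} \frac{\partial F\big(x' + \tfrac{k}{m}(x - x')\big)}{\partial x_i}$$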

Shortcut learning behavior of NLU models

Towards Interpreting and Mitigating Shortcut Learning Behavior of NLU Models

https://zhuanlan.zhihu.com/p/363904438

MoCo: Momentum Contrast Unsupervised Learning

https://zhuanlan.zhihu.com/p/158023072
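Since MoCo is still on the to-code list above, here is a minimal sketch of its two core pieces, paraphrasing the pseudocode in the paper: the momentum update of the key encoder and the InfoNCE loss over a queue of negative keys. Names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # the key encoder's parameters slowly track the query encoder's
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data = m * p_k.data + (1.0 - m) * p_q.data

def moco_loss(q, k, queue, tau=0.07):
    # q: [B, D] queries, k: [B, D] positive keys from the momentum encoder,
    # queue: [K, D] negative keys from earlier batches; all rows L2-normalized
    l_pos = (q * k).sum(dim=-1, keepdim=True)             # [B, 1]
    l_neg = q @ queue.t()                                 # [B, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)                # positive is class 0
```

In training, `k` is produced under `torch.no_grad()` by the momentum encoder, and after each step the current batch's keys are enqueued while the oldest entries are dequeued.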

Deep Mutual Learning: when three walk together, there is always one I can learn from.

https://zhuanlan.zhihu.com/p/71192348

The model distillation algorithm was proposed by Hinton et al. in 2015. It uses a pre-trained large network as a teacher that provides the small network with extra knowledge, namely smoothed probability estimates. Experiments show that when the small network imitates the class probabilities estimated by the large network, optimization becomes easier and the small network reaches similar or even better performance than the large one. However, model distillation requires a large network that has been pre-trained in advance, and that large network stays fixed during learning: knowledge is transferred only one way, to the small network, and it is hard to use feedback from the small network's learning state to adjust and optimize the training process.

We explore a training mechanism that can produce stronger networks of both sizes: deep mutual learning, in which several networks are trained simultaneously. During training, each network is supervised not only by the ground-truth labels but also by the learning experience of its peer network, which further improves generalization. Throughout the process, the networks keep sharing what they have learned, teaching each other and improving together.

Our mutual learning algorithm also extends easily to multi-network and semi-supervised settings. With K networks, deep mutual learning treats the remaining K-1 networks as teachers when training each network; an alternative strategy is to fuse the remaining K-1 networks into a single teacher. In the semi-supervised setting, we compute both the supervised loss and the mutual (interaction) loss on labeled data, while on unlabeled data we compute only the mutual loss, which helps the networks mine more useful information from the training data.
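A minimal two-network sketch of the mutual learning objective, assuming the standard formulation of supervised cross-entropy plus a KL term toward the peer's predictions (the peer's symmetric loss is formed the same way):

```python
import torch.nn.functional as F

def dml_loss(logits_self, logits_peer, labels):
    # each network minimizes supervised CE plus KL toward its peer's predictions;
    # the peer's logits are detached because the peer is updated by its own loss
    ce = F.cross_entropy(logits_self, labels)
    kl = F.kl_div(F.log_softmax(logits_self, dim=-1),
                  F.softmax(logits_peer.detach(), dim=-1),
                  reduction="batchmean")
    return ce + kl
```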

Generally speaking, small networks benefit more from mutual learning training, such as Resnet-32 and MobileNet. Although the WRN-28-10 network has a large number of parameters, performance improvements can still be achieved through mutual learning training with other networks. Therefore, unlike model distillation algorithms that require pre-training of large networks to help small networks improve performance, the deep mutual learning algorithm we propose can also help large networks participating in training improve their performance.

We can see from Figure 3 that, under the mutual learning strategy, increasing the number of networks improves the performance of each individual network: more teacher networks provide more learning experience and help each network learn better features. On the other hand, multiple independent teachers (DML) outperform a fused teacher (DML_e) in multi-network mutual learning, which suggests that several distinct teacher networks provide more diverse learning experience and benefit each network's learning more.

What will be the effect of heterogeneous networks?

TRANS-ENCODER Self-supervised Sentence Bi & Cross Encoder


Motivation:

Self-supervised training has become so effective that it matches or even surpasses supervised learning for Bi-Encoders, and Cross-Encoders are generally stronger than Bi-Encoders, so a natural question is: why not combine the two and train a Cross-Encoder with self-supervised learning?

Problem:

Here comes the difficulty: there is no ready-made framework for self-supervised training of a Cross-Encoder. If we simply copy SimCSE/Mirror-BERT and feed a sentence concatenated with its augmentation into the Cross-Encoder, asking the model to judge whether the two are similar, the positives are far too easy for the model, and so are the negatives; it is hard for the model to learn anything useful from such a task. The reason this setup works for a Bi-Encoder is that the two sentences pass through the model separately and similarity is computed only on the final outputs, which makes the task suitably difficult for the Bi-Encoder.

Approach:

The solution combines knowledge distillation with self-supervised learning. Following the pipeline in the paper's figure, we first train a strong Bi-Encoder with self-supervised learning, then use that Bi-Encoder as a teacher and train a Cross-Encoder by knowledge distillation. Notably, although the Cross-Encoder is the student, its architecture has a higher upper bound than its Bi-Encoder teacher, so it can outdo its predecessor: after distillation the Cross-Encoder performs better than the Bi-Encoder. This is the first step: Bi-Encoder -> Cross-Encoder.

The second step is then straightforward. We now have a Cross-Encoder similarity model stronger than the Bi-Encoder, so, similar to Augmented SBERT, we use the Cross-Encoder as the teacher and distill its knowledge back into the Bi-Encoder. From here the road is clear: the two steps can be repeated and iterated, with the two models serving as each other's teacher, teaching each other and growing stronger together.

Some small details:

The paper's design is clever and simple. For loss functions, Bi-Encoder -> Cross-Encoder distillation uses binary cross-entropy (BCE), while Cross-Encoder -> Bi-Encoder distillation uses mean squared error (MSE). The network design is also elegant: the Bi-Encoder and Cross-Encoder share the same network, and the only difference is that the Bi-Encoder takes a single sentence as input while the Cross-Encoder takes [CLS] sent1 [SEP] sent2 [SEP]. The method achieves large improvements on a range of benchmarks.
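A minimal sketch of the two distillation objectives as I read them; the rescaling of the Bi-Encoder's cosine score into [0, 1] is my assumption rather than a detail confirmed by the paper.

```python
import torch.nn.functional as F

def bi_to_cross_loss(cross_logit, bi_cosine):
    # step 1: the Bi-Encoder is the teacher; its cosine similarity (rescaled
    # from [-1, 1] to [0, 1], an assumption here) is the soft target for the
    # Cross-Encoder's pairwise score, trained with binary cross-entropy
    target = (bi_cosine.detach() + 1) / 2
    return F.binary_cross_entropy_with_logits(cross_logit, target)

def cross_to_bi_loss(bi_cosine, cross_score):
    # step 2: the Cross-Encoder is the teacher; the Bi-Encoder's cosine
    # similarity regresses its score with a mean-squared-error loss
    return F.mse_loss(bi_cosine, cross_score.detach())
```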


Reference:
TRANS-ENCODER: Unsupervised sentence pair model for self-distillation and mutual distillation.
Paper sharing - Self-supervised Sentence Bi & Cross Encoder

Mirror-Bert

Related background

Contrastive learning uses the InfoNCE [1] loss as the training objective: within a batch it pulls the representations of sentences similar to the current sentence closer and pushes the representations of dissimilar sentences apart, measuring the distance between sentence representations with cosine similarity. In contrastive learning, constructing diverse and high-quality positive pairs is key. Figure 1 lists the positive-pair construction methods used by recent contrastive learning methods, which fall into two levels:

  • Modifications at the text input level
    Random deletion of words (ConSERT[2], CLEAR[3])
    Random deletion of consecutive words (ConSERT, CLEAR, Mirror-BERT)
    Disruption of input order (ConSERT, CLEAR)
    Synonym replacement (CLEAR)
  • Construct different perspectives at the feature level
    Randomly mask a certain dimension of the features (ConSERT)
    Two different dropout results (ConSERT, SimCSE, Mirror-BERT)
    Add noise perturbation (ConSERT)
    Two different models providing features from different perspectives (Self-Guided Contrastive Learning [4], CT [5])

Method introduction

Mirror-BERT mainly uses random deletion of consecutive spans together with the dropout strategy to construct positive examples. Dropout has been shown by other work to be a simple and effective way of building positives for contrastive learning. Judging from recent contrastive learning work, not damaging the original sentence too much keeps the quality of the constructed positives high. It has also been shown that, if training efficiency is not a concern, training an additional model to provide sentence representations from another perspective is effective.
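A minimal sketch of the data side under these assumptions: the discrete view is a random span deletion of the input, and the continuous view comes for free from encoding each view with dropout active (the paper's exact span-masking scheme may differ).

```python
import random

def random_span_deletion(tokens, span_len=2):
    # discrete view: delete one random contiguous span of tokens
    if len(tokens) <= span_len:
        return list(tokens)
    start = random.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + tokens[start + span_len:]

def mirror_views(sentence):
    # positive pair = (original sentence, span-deleted copy); the continuous
    # part comes from keeping dropout active when each view is encoded
    tokens = sentence.split()
    return sentence, " ".join(random_span_deletion(tokens))

print(mirror_views("the quick brown fox jumps over the lazy dog"))
```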
Focusing on the STS results: on average Mirror-BERT is not as good as SimCSE, though it is slightly better on a few tasks.
Mirror-BERT uses 10k sentences per task. My impression was that SimCSE uses more data; checking the paper confirms it: "We randomly sample $10^6$ sentences from English Wikipedia and fine-tune BERT-base with learning rate = 3e-5, N = 64. In all our experiments, no STS training sets are used." Still, the data-efficiency figure shows that Mirror-BERT reaches its best results on most tasks with only 10k-20k sentences.

In the ablation on the STS task, span masking contributes more, but the best results come from using both together. The drophead method mentioned there randomly prunes attention heads during MLM training as a regularization step.

Mirror-BERT Improves Isotropy? It seems so

From the data augmentation perspective, SimCSE is indeed a special case of Mirror-BERT, but SimCSE extends the contrastive approach to both supervised and unsupervised settings, and its results are indeed better than Mirror-BERT's. It is also work from Danqi Chen's group. If you only want to read one of the two papers, I recommend SimCSE.

Reference link:
Fast, Effective, and Self-Supervised:Mirror-BERT

Self-guided contrastive learning for BERT sentence representations

From Seoul National University. The question discussed: how can we use BERT's own internal information for contrast, without introducing external resources or explicit data augmentation, to obtain higher-quality sentence representations?

The contrast in this paper is between BERT's intermediate-layer representations and the final [CLS] representation. The model contains two BERTs. One BERT is frozen and used to compute the intermediate-layer representation, in two steps: (1) max-pooling over each layer to obtain a per-layer sentence vector; (2) uniformly sampling one of the N layers' vectors. The other BERT is fine-tuned and provides the sentence's [CLS] representation. The two representations of the same sentence, obtained from the two BERTs, form a positive pair; negatives are the intermediate-layer or final [CLS] representations of other sentences.
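A minimal sketch of how the frozen BERT's view could be computed, assuming the model is run with output_hidden_states=True; the layer-sampling range and pooling details are simplifications of the paper's setup.

```python
import random
import torch

def sampled_layer_view(hidden_states, attention_mask):
    # hidden_states: tuple of [batch, seq, dim] tensors (embeddings + N layers)
    # from the frozen BERT; attention_mask: [batch, seq] with 1 for real tokens
    layer = random.randrange(1, len(hidden_states))        # uniformly sample a layer
    h = hidden_states[layer]
    mask = attention_mask.unsqueeze(-1).bool()
    h = h.masked_fill(~mask, float("-inf"))                # ignore padding positions
    return h.max(dim=1).values                             # max-pooling over tokens
```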

Rather than starting from data augmentation, this paper leans toward improving the model side, focusing on mining the information inside the model itself. The main experiments are on STS and SentEval tasks. Judging from the results, SimCSE is still much better and simpler to use, but this paper does offer a different angle.

Label Denoise paper summary-Co-training series

In label-denoising research, one line of methods updates network parameters only on selected clean instances (a clean set). The common assumption is that samples with smaller loss are more reliable and can be treated as clean.
Reference link:
Label Denoise paper summary-Co-training series

CLEAR: Contrastive Learning for Sentence Representation

CLEAR uses masked language modeling to learn word-level features and contrastive learning to learn sentence-level features. The contrastive objective pulls together the augmented versions of the same sentence (positives) and pushes apart augmentations of different sentences (negatives). By bringing sentences with similar meanings closer together, sentence-level semantic information is learned better.

The contributions of this article are as follows:

1. Four data augmentation methods are designed: random word deletion, span deletion (deleting consecutive tokens), synonym substitution, and reordering.

2. Design a contrastive learning idea to better represent sentence-level semantics.

3. It has achieved good results on many downstream tasks.

Compared with the most recent methods, though, the results are only average.

ESimCSE

ESimCSE is an upgraded version of SimCSE, trained self-supervised with a contrastive loss (softmax cross-entropy) that learns the similarity of positive and negative samples. SimCSE learns text matching by passing the same sentence through dropout twice to produce a positive pair, with other in-batch sentences as negatives. ESimCSE addresses two problems left by SimCSE:
1. The positive pairs SimCSE builds via dropout always have exactly the same length (because of the Transformer's position embeddings), which biases the model toward judging sentences of the same or similar length as semantically more similar;
2. A larger batch size causes SimCSE's performance to drop.

ESimCSE constructs positive pairs with **Word Repetition** and expands the set of negative pairs with **Momentum Contrast** (momentum contrastive learning).
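A minimal sketch of the word-repetition augmentation; the duplication rate is a hyperparameter and the value here is only illustrative.

```python
import random

def word_repetition(tokens, dup_rate=0.32):
    # build the positive view by randomly duplicating a fraction of tokens,
    # so the positive no longer has exactly the same length as the anchor
    n_dup = max(1, int(len(tokens) * random.uniform(0, dup_rate)))
    dup_idx = set(random.sample(range(len(tokens)), min(n_dup, len(tokens))))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_idx:
            out.append(tok)
    return out

print(word_repetition("a photo of a cat sitting on a mat".split()))
```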

RankCSE

Unofficial code implementation: https://github.com/perceptiveshawty/RankCSE

Contrastive learning should consider not only which samples form positive and negative pairs but also more fine-grained similarity relations. When the anchor scores its positives and negatives, it is not enough to use InfoNCE to separate the positive representation from all negatives; ranking information needs to be added as well.

(1) standard contrastive learning objective (§4.2);
(2) ranking consistency loss which ensures ranking consistency between two representations with different dropout masks (§4.3);
(3) ranking distillation loss which distills listwise ranking knowledge from the teacher

Two kinds of ranking signal: ensuring consistency between the rankings produced by the two dropout passes, and distilling the between-sample ranking information from a teacher model into the student.

  • The InfoNCE objective:
    $\zeta_{infoNCE} = -\sum_{i=1}^{N} \log \frac{\exp(d(f(x_i), f(x_i)')/\tau_1)}{\sum_{j=1}^{N} \exp(d(f(x_i), f(x_j)')/\tau_1)}$

  • Ranking consistency: align the rankings induced by the two forward passes. Specifically, for each sample, compute the similarity (ranking) distribution between its first-pass embedding and the second-pass embeddings of all in-batch samples, and the corresponding distribution with the two passes swapped, then minimize the JS divergence between these two distributions.

  • Teacher ranking distillation: a trained SimCSE is used as the teacher. For each sample, the ranking distribution between its first-pass embedding and the second-pass embeddings of all in-batch samples serves as a soft label, and it is distilled into the student in a list-wise fashion. Note that since the scores between the anchor and its own positive are too high, those scores are discarded. During distillation, the labels from the two teacher models are mixed with weights. The ranking loss can be written as:
    $\zeta_{rank} = \sum_{i=1}^{N} \mathrm{rank}(S(x_i), S^{T}(x_i))$
    where N is the number of samples in the batch (not the number of documents associated with a particular query).
    Specifically, the distillation loss can use ListNet or ListMLE:

    • ListNet uses the simplified Top-1 version:
      $\zeta_{ListNet} = -\sum_{i=1}^{N} \mathrm{softmax}(S^{T}(x_i)/\tau_3) \cdot \log(\mathrm{softmax}(S(x_i)/\tau_2))$
    • ListMLE directly maximizes the likelihood of the ground-truth ordering, so it only uses the ranking obtained from the teacher's scores rather than the teacher's scores themselves:
      $\zeta_{ListMLE} = -\sum_{i=1}^{N} \log P(\pi_i^{T} \mid S(x_i), \tau_2)$
      where $\pi_i^{T}$ denotes the teacher model's ranking (see the definitions below). Also note that the model's prediction scores here are not normalized by softmax.

    Some of them are defined as follows:


The final training objective:
$\zeta_{final} = \zeta_{infoNCE} + \beta\,\zeta_{consistency} + \gamma\,\zeta_{rank}$
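A minimal sketch of the ListNet-style distillation term above (temperatures are illustrative; in the paper the anchor-positive scores are discarded before this step, which is omitted here):

```python
import torch.nn.functional as F

def listnet_distill_loss(student_scores, teacher_scores, tau_student=0.05, tau_teacher=0.05):
    # student_scores / teacher_scores: [N, N] similarity matrices between the
    # first-pass and second-pass embeddings of the N in-batch sentences
    p_teacher = F.softmax(teacher_scores.detach() / tau_teacher, dim=-1)
    log_p_student = F.log_softmax(student_scores / tau_student, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()
```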

Final Results

data augmentation

Conditional BERT Contextual Augmentation

CBERT's model structure is exactly the same as BERT's; the only differences lie in the input representation and the training procedure. In CBERT, label information is carried by the segment embedding: in the original BERT the segment embedding has only two values, and CBERT extends it to num_classes. In this way the label is injected into the MLM task, so that when predicting replacement words the model conditions not only on the context but also on the label. This is the Conditional MLM (CMLM) named in the paper.

The training process is very similar to BERT's, except that CBERT fine-tunes on the annotated training corpus with the CMLM task instead of the MLM task.

The final step is to augment the training data: run the Conditional MLM task with the fine-tuned CBERT over the training corpus. Note that when predicting replacement words you should not simply pick the highest-probability token; instead, randomly sample a replacement from the top-N candidates (or use another effective sampling method) to increase the diversity of the generated data.
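A minimal sketch of top-N replacement sampling at one masked position (the function name and the choice of sampling in proportion to the top-N probabilities are illustrative):

```python
import torch

def sample_replacement(mlm_logits, top_n=5):
    # mlm_logits: [vocab_size] MLM scores at one masked position.
    # Instead of taking the argmax, sample among the top-N candidates
    # (here in proportion to their probabilities) to diversify the data.
    top = torch.topk(mlm_logits, top_n)
    probs = torch.softmax(top.values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top.indices[choice].item()
```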
https://blog.csdn.net/weixin_44815943/article/details/124122407

Augmented SBERT

Using cross-encoder to weakly label all possible sentence pair combinations will result in huge overhead and may even lead to a decrease in model performance. Therefore, we need a suitable sampling strategy to reduce weakly labeled sentence pairs and improve model expressiveness.
(1) Random Sampling (RS)
(2) Kernel Density Estimation (KDE): The purpose is to ensure that the distribution of silver data and gold data remains consistent. To this end, we weakly annotate a large number of random sentence pairs, but only retain certain combinations. For example, for classification tasks, only positive sentence pairs are retained; for regression tasks, kernel density estimation (KDE) is used to estimate the continuous density functions Fgold(s) and Fsilver(s) for the score s.
However, the KDE sampling strategy is not computationally efficient and requires a large number of random samples. We did not use this method later.
(3) BM25 Sampling (BM25): use the Okapi BM25 algorithm (we use Elasticsearch) to retrieve the k most similar sentences for each sentence; these sentence pairs are then weakly annotated with the cross-encoder and used as silver data. This is very efficient and is the method the paper recommends (see the sketch after this list).
(4) Semantic Search Sampling (SS): one drawback of BM25 is that it only finds sentences with overlapping vocabulary, so synonymous sentences with little or no lexical overlap are never selected. In this method we use cosine similarity to select the k most similar sentences; Faiss can also be used.
(5) BM25 + Semantic Search Sampling (BM25- SS)
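A minimal sketch of the BM25-sampling plus weak-labeling step. The `bm25.top_k` retrieval call is a hypothetical placeholder for whatever index is used (e.g. Elasticsearch), while `cross_encoder.predict` follows the sentence-transformers CrossEncoder interface.

```python
def build_silver_data(sentences, bm25, cross_encoder, k=3):
    # `bm25.top_k(sent, k)` is a hypothetical retrieval call; replace it with
    # your Elasticsearch / rank_bm25 query of choice
    pairs = []
    for sent in sentences:
        for candidate in bm25.top_k(sent, k):
            if candidate != sent:
                pairs.append((sent, candidate))
    # weak labels from the cross-encoder (sentence-transformers CrossEncoder API)
    scores = cross_encoder.predict(pairs)
    return list(zip(pairs, scores))   # silver data for training the bi-encoder
```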

https://blog.csdn.net/zephyr_wang/article/details/119581505

ViLBERT

https://zhuanlan.zhihu.com/p/264488613

2.1 Co-attention Transformer layer

The co-attention transformer layer introduced in the paper is shown in Figure 1b. Given visual and linguistic features, the keys and values of the image modality are fed into the attention unit of the text modality (and vice versa), so each modality's attention unit produces attention-pooled features conditioned on the other modality: in the visual stream this is image-conditioned language attention, and in the language stream it is language-conditioned visual attention. As in BERT, the attended features are added back through a residual connection.
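A minimal sketch of one co-attention step under these assumptions (feed-forward sublayers, layer norms, and attention masks omitted); it only shows the key/value swap between the two streams.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    # each stream queries with its own features but attends over the
    # other modality's keys and values
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis_out, _ = self.attn_v(query=vis, key=txt, value=txt)  # image-conditioned language attention
        txt_out, _ = self.attn_t(query=txt, key=vis, value=vis)  # language-conditioned visual attention
        return vis + vis_out, txt + txt_out                      # residual add, as in BERT
```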

2.2 Image representation

Image region proposals and their visual features are produced by a pretrained object-detection network. Unlike words, image regions lack a natural ordering, so the paper encodes each region's position as a 5-dimensional vector: the coordinates of the top-left and bottom-right corners of the bounding box plus the fraction of the image area the region covers. This vector is projected to the visual feature dimension and summed with the visual feature. A special image token marks the beginning of the image sequence, and its output is used to represent the entire image.

2.3 Pre-training tasks
Two pre-training tasks are used when training ViLBERT:

(1) Masked multi-modal modelling

As in standard BERT, roughly 15% of the word and image-region inputs are masked, and the masked elements are predicted from the remaining input sequence. When an image region is masked, its visual features are zeroed out with probability 0.9 and left unchanged with probability 0.1; text masking follows BERT. ViLBERT does not directly regress the feature values of the masked image region; instead it predicts a distribution over semantic classes for that region, and the training objective is to minimize the KL divergence between this distribution and the output distribution of the pretrained object-detection model, which serves as the ground truth.

(2) Multi-modal alignment prediction

As shown in Figure 4b, the goal is to predict whether an image-text pair is aligned, i.e. whether the text correctly describes the image. The outputs at the IMG token (the start of the image feature sequence) and the CLS token (the start of the text sequence) serve as overall representations of the visual and linguistic inputs. Borrowing another common structure from vision-and-language models, the element-wise product of the IMG output and the CLS output is taken as the final joint representation, and a linear layer then predicts whether the image and text match.

faster r-cnn

https://zhuanlan.zhihu.com/p/31426458
Faster R-CNN can be divided into four main parts:

Conv layers. As a CNN-based object detection method, Faster R-CNN first uses a set of basic conv+relu+pooling layers to extract the image's feature maps. These feature maps are shared by the subsequent RPN and fully connected layers.
Region Proposal Network. The RPN generates region proposals. It uses softmax to decide whether each anchor is positive or negative, then applies bounding-box regression to refine the anchors into accurate proposals.
RoI Pooling. This layer takes the feature maps and the proposals, combines them to extract per-proposal feature maps, and sends these to the subsequent fully connected layers for classification.
Classification. The proposal feature maps are used to classify each proposal, while bounding-box regression is applied once more to obtain the final precise position of the detection box.


Multimodal feature fusion

Various operations of multi-modal fusion

Bilinear feature fusion

Detailed explanation, improvement and application of Bilinear Pooling
Bilinear Attention Networks Notes
Bilinear Attention Network Model "Bilinear Attention Networks"


contrastive learning

https://www.cnblogs.com/xyzhrrr/p/15864522.html
Application of contrastive learning in semantic representation: SBERT/SimCSE/ConSERT/ESimCSE recurrence

semantic computing

Comparison of the effects of sentence semantic representation methods

What is the current sota method for Sentence Embedding?


Origin blog.csdn.net/u014665013/article/details/129164320