Chen Danqi's team redefines the text similarity problem with C-STS, a task that even GPT-4 cannot solve well

This is strong work from Chen Danqi's group. Traditional text similarity reduces a sentence pair to a single score, but a pair can be similar in some respects and dissimilar in others, so conditioning the judgment on a stated aspect effectively redefines the task. In addition, the dataset is constructed largely with model assistance, which makes the pipeline convenient and fast. The final experiments show that even GPT-4 makes frequent mistakes on this task, so the direction is well worth further study and discussion.

Let's take a closer look at the authors' work.


Paper: C-STS: Conditional Semantic Textual Similarity
Link: https://arxiv.org/abs/2305.15093
Affiliations: Princeton, Allen AI, etc.


Semantic textual similarity (STS) has been a cornerstone task in NLP, measuring the degree of similarity between a pair of sentences, with applications in information retrieval, question answering, and embedding methods.

However, this is an inherently ambiguous task, and sentence similarity depends on specific aspects of interest.


We address this ambiguity by proposing a new task called Conditional STS (C-STS), which measures similarity with respect to an aspect (the condition) articulated in natural language.

For example, the sentences "NBA player shoots a 3-pointer" and "A person throws a tennis ball in the air" are more similar under the condition "the motion of the ball" (both balls travel upward) and less similar under "the size of the ball" (one is large, one is small).

C-STS has dual advantages: (1) it reduces the subjectivity and ambiguity of STS, and (2) different conditions enable fine-grained similarity evaluation.
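To make the conditioning idea concrete, here is a minimal bi-encoder sketch: append the condition to each sentence, encode both, and compare the condition-aware representations. The bag-of-words "encoder" below is purely illustrative (a real system would use a trained sentence encoder such as SimCSE); the `[SEP]` concatenation format is an assumption, not the paper's exact input scheme.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "encoder"; stands in for a real
    # sentence encoder such as SimCSE. Purely illustrative.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def conditional_similarity(s1: str, s2: str, condition: str) -> float:
    # Bi-encoder pattern: the condition is appended to each sentence
    # before encoding, so the same pair gets different scores
    # under different conditions.
    e1 = embed(f"{s1} [SEP] {condition}")
    e2 = embed(f"{s2} [SEP] {condition}")
    return cosine(e1, e2)

s1 = "NBA player shoots a 3-pointer"
s2 = "A person throws a tennis ball in the air"
print(conditional_similarity(s1, s2, "the motion of the ball"))
```

With a trained encoder, the score would rise under "the motion of the ball" and fall under "the size of the ball"; the toy encoder only demonstrates the input/output shape of the approach.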

[Figure: intelligent data construction pipeline]

C-STS contains nearly 20,000 instances from different domains. We evaluate several state-of-the-art models and show that even the best fine-tuned and in-context learning models (GPT-4, Flan, SimCSE) find the task challenging, with Spearman correlation scores below 50.
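As a reminder of the reported metric: Spearman correlation is Pearson correlation computed on the ranks of the two score lists, so it measures how well a model's ordering of pairs matches the human ordering, ignoring the scale of the raw scores. A minimal stdlib-only sketch (the score values are toy numbers, not from the paper):

```python
def ranks(xs):
    # Assign 1-based ranks; tied values get the average of their ranks,
    # the standard convention for Spearman correlation.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(pred, gold):
    # Pearson correlation computed on the rank vectors.
    rp, rg = ranks(pred), ranks(gold)
    n = len(rp)
    mp, mg = sum(rp) / n, sum(rg) / n
    cov = sum((a - mp) * (b - mg) for a, b in zip(rp, rg))
    vp = sum((a - mp) ** 2 for a in rp) ** 0.5
    vg = sum((b - mg) ** 2 for b in rg) ** 0.5
    return cov / (vp * vg)

# Toy example: model scores vs. gold C-STS labels (1-5 scale).
pred = [0.2, 0.9, 0.4, 0.7]
gold = [1, 5, 2, 4]
print(spearman(pred, gold))  # 1.0: the two rankings agree perfectly
```

A Spearman score is conventionally reported ×100, so "below 50" means the models' rankings capture well under half of the ordering signal in the human annotations.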


We encourage the community to evaluate their models on C-STS to provide a more comprehensive view of semantic similarity and natural language understanding.

Experiments and Analysis





Origin blog.csdn.net/qq_27590277/article/details/132074344