A Comprehensive Interpretation of Contrastive Learning Applied in NLP

Contrastive learning has a wide range of applications in Natural Language Processing (NLP). It is an unsupervised learning method that aims to pull similar samples together and push dissimilar samples apart. In NLP, the goal of contrastive learning is to learn vector representations that capture semantic similarity.

Here are some common applications of contrastive learning in NLP:

  1. Text Similarity Computation: Contrastive learning can learn to map semantically similar text pairs into nearby regions of the vector space. By calculating the similarity between text pairs, it can be used for tasks such as text matching, paraphrase detection, question answering systems, and information retrieval.

  2. Text classification: Contrastive learning can classify texts with similar semantics into the same category by learning vector representations of texts. This can be used for tasks like sentiment analysis, topic classification, spam classification, etc.

  3. Word meaning representation learning: Contrastive learning can learn to map words with similar meanings into similar vector spaces. By calculating the similarity between words, it can be used for tasks such as word meaning similarity calculation, word recommendation, and word meaning disambiguation.

  4. Sentence representation learning: Contrastive learning can learn the vector representation of sentences and express the semantic information of sentences. This is very useful for tasks like text generation, sentence similarity computation, sentence classification, etc.

  5. Language model pre-training: Contrastive learning has also been widely used in language model pre-training. Through contrastive learning, the model can learn a better representation of the context, which provides better language understanding and generation capabilities.

This article describes three relatively effective contrastive learning methods, SimCLR (2020), SimCSE (2021), and ArcCon (2022), from the underlying principles to code implementations.

Table of contents

1. SimCLR

1.1 SimCLR contrastive loss function

1.2 Code Implementation 1

1.3 Code implementation 2

2. SimCSE

2.1 SimCSE Contrastive Loss Function 

2.2 Code implementation

3. ArcCon

3.1 Loss function

3.2 Code implementation


1. SimCLR

The SimCLR approach was originally proposed in the 2020 paper "A Simple Framework for Contrastive Learning of Visual Representations" by Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. The paper details the principles and experimental results of the SimCLR method and demonstrates that self-supervised learning can approach supervised performance in the image domain.

The paper was published at ICML 2020 (International Conference on Machine Learning), one of the top machine-learning conferences. Due to the simplicity and power of the SimCLR method, the paper attracted a great deal of attention and has become an important milestone in the field of contrastive learning. SimCLR enables the model to learn feature representations with rich semantic information through key mechanisms such as data augmentation, a contrastive loss, and a temperature parameter. By clustering similar samples together and separating dissimilar samples, SimCLR achieved strong performance in computer vision, and its ideas have since been carried over to natural language processing.

1.1 SimCLR contrastive loss function

We randomly sample a mini-batch of N samples and define the contrastive prediction task on pairs of augmented samples derived from the mini-batch, resulting in 2N data points. In the most direct training setup, each of the N samples in a batch is augmented into two views, so the batch becomes 2N samples; for each anchor there is exactly 1 positive pair, while the other 2N − 2 augmented samples form the negative pairs.

sim(u,v)=u^{T}v/(\left \| u \right \|\left \| v \right \|) denotes the normalized dot product between u and v (i.e., cosine similarity).
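As a quick sanity check (a minimal sketch with random vectors), this definition matches PyTorch's built-in F.cosine_similarity, which the two implementations below rely on:

import torch
import torch.nn.functional as F

u, v = torch.randn(128), torch.randn(128)
sim_manual = torch.dot(u, v) / (u.norm() * v.norm())            # normalized dot product
sim_builtin = F.cosine_similarity(u.unsqueeze(0), v.unsqueeze(0)).squeeze()
print(torch.allclose(sim_manual, sim_builtin))                  # expected: True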

Then define the loss function of a pair of positive examples (i, j) as:

l_{i,j}=-\log\tfrac{\exp(sim(z_{i},z_{j})/\tau )}{\sum_{k=1}^{2N}\mathbb{I}_{[k\neq i]}\exp(sim(z_{i},z_{k})/\tau )}

The final loss is the arithmetic mean of the losses of all positive pairs in the batch:
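Written out in the notation of the SimCLR paper, where the two augmented views of the k-th original sample occupy positions 2k − 1 and 2k:

L=\tfrac{1}{2N}\sum_{k=1}^{N}\left [ l_{2k-1,2k}+l_{2k,2k-1} \right ]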

Example: consider the similarity matrix constructed with batch_size = 4, where [A, B, C, D] are four sample vectors and [A+, B+, C+, D+] are the four augmented samples. [A, A+] forms a positive pair, and the pairs on the main diagonal (each sample with itself) are excluded. The idea of contrastive learning is to use the loss function to pull positive pairs closer together and push negative pairs further apart.

1.2 Code Implementation 1

import torch
from torch import nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    def __init__(self, batch_size, device='cuda', temperature=0.5):
        super().__init__()
        self.batch_size = batch_size
        self.register_buffer("temperature", torch.tensor(temperature).to(device))  # temperature hyperparameter
        self.register_buffer("negatives_mask", (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool).to(device)).float())  # mask with 0 on the main diagonal and 1 everywhere else

    def forward(self, emb_i, emb_j):  # emb_i, emb_j are embeddings of the same image under two different augmentations
        z_i = F.normalize(emb_i, dim=1)  # normalize emb_i, giving z_i of shape (bs, dim)
        z_j = F.normalize(emb_j, dim=1)  # normalize emb_j, giving z_j of shape (bs, dim)

        representations = torch.cat([z_i, z_j], dim=0)  # concatenate z_i and z_j along the batch dimension, shape (2*bs, dim)
        similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)  # pairwise cosine similarities, shape (2*bs, 2*bs)

        sim_ij = torch.diag(similarity_matrix, self.batch_size)  # diagonal offset by +batch_size: sim(z_i, z_j) for each positive pair, shape (bs,)
        sim_ji = torch.diag(similarity_matrix, -self.batch_size)  # diagonal offset by -batch_size: sim(z_j, z_i) for each positive pair, shape (bs,)
        positives = torch.cat([sim_ij, sim_ji], dim=0)  # concatenate to get the positives, shape (2*bs,)

        nominator = torch.exp(positives / self.temperature)  # exponentiate the positives divided by the temperature, shape (2*bs,)
        denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)  # exponentiate the full similarity matrix divided by the temperature and mask out the diagonal, shape (2*bs, 2*bs)

        loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))  # per-anchor loss: -log(nominator / row sum of denominator), shape (2*bs,)
        loss = torch.sum(loss_partial) / (2 * self.batch_size)  # average over the 2*batch_size anchors
        return loss
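A minimal usage sketch for the class above, with random embeddings standing in for the two augmented views (batch size and dimension are arbitrary here):

torch.manual_seed(0)
batch_size, dim = 4, 16
loss_fn = ContrastiveLoss(batch_size=batch_size, device='cpu', temperature=0.5)

emb_i = torch.randn(batch_size, dim)  # embeddings of the first augmented view
emb_j = torch.randn(batch_size, dim)  # embeddings of the second augmented view
print(loss_fn(emb_i, emb_j))          # a single scalar NT-Xent loss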

1.3 Code implementation 2

import torch
from torch import nn
import torch.nn.functional as F
class ContrastiveLossELI5(nn.Module):
    def __init__(self, batch_size, temperature=0.5, verbose=True):
        super().__init__()
        self.batch_size = batch_size
        self.register_buffer("temperature", torch.tensor(temperature))
        self.verbose = verbose
            
    def forward(self, emb_i, emb_j):
        """
        emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
        z_i, z_j as per SimCLR paper
        """
        z_i = F.normalize(emb_i, dim=1)
        z_j = F.normalize(emb_j, dim=1)
 
        representations = torch.cat([z_i, z_j], dim=0)
        similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
        if self.verbose: print("Similarity matrix\n", similarity_matrix, "\n")
            
        def l_ij(i, j):
            z_i_, z_j_ = representations[i], representations[j]
            sim_i_j = similarity_matrix[i, j]
            if self.verbose: print(f"sim({i}, {j})={sim_i_j}")
                
            numerator = torch.exp(sim_i_j / self.temperature)
            one_for_not_i = torch.ones((2 * self.batch_size, )).to(emb_i.device).scatter_(0, torch.tensor([i], device=emb_i.device), 0.0)  # indicator vector 1[k != i]
            if self.verbose: print(f"1{{k!={i}}}", one_for_not_i)
            
            denominator = torch.sum(
                one_for_not_i * torch.exp(similarity_matrix[i, :] / self.temperature)
            )    
            if self.verbose: print("Denominator", denominator)
                
            loss_ij = -torch.log(numerator / denominator)
            if self.verbose: print(f"loss({i},{j})={loss_ij}\n")
                
            return loss_ij.squeeze(0)
 
        N = self.batch_size
        loss = 0.0
        for k in range(0, N):
            loss += l_ij(k, k + N) + l_ij(k + N, k)
        return 1.0 / (2*N) * loss
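Both classes implement the same NT-Xent loss, so they should agree on identical inputs (a small consistency check, assuming both classes defined above are in scope):

torch.manual_seed(0)
emb_i, emb_j = torch.randn(4, 16), torch.randn(4, 16)
loss_fast = ContrastiveLoss(batch_size=4, device='cpu', temperature=0.5)(emb_i, emb_j)
loss_eli5 = ContrastiveLossELI5(batch_size=4, temperature=0.5, verbose=False)(emb_i, emb_j)
print(torch.allclose(loss_fast, loss_eli5))  # expected: True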

2. SimCSE

SimCSE (Simple Contrastive Learning of Sentence Embeddings) is a contrastive learning method for learning sentence embedding representations, proposed in 2021 by Tianyu Gao, Xingcheng Yao, and Danqi Chen (EMNLP 2021). Here is a summary of the paper:

This paper introduces a simple yet effective contrastive learning method that trains a model by contrasting the embedding representations of sentence pairs. The core idea of SimCSE is to learn semantically meaningful sentence embeddings by maximizing the similarity of related sentence pairs and minimizing the similarity of unrelated sentence pairs.

To achieve this goal, two key technical strategies are proposed in the paper:

  1. Shared-encoder (Siamese) architecture: the two views of a sentence are fed through the same encoder with shared weights to produce their embedding representations; in the unsupervised setting of the paper, the two views are simply the same sentence passed through the encoder twice with different dropout masks (see the sketch after this list).
  2. Contrastive loss function: sentence pairs with similar semantics are treated as positives and contrasted against in-batch negative samples. This encourages similarity between positive pairs and helps distinguish negatives from positives.
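As a concrete illustration of point 1, here is a minimal sketch of how the unsupervised variant of SimCSE builds a positive pair: the same batch is passed through the encoder twice, and the two different dropout masks act as the "perturbation". ToyEncoder is a hypothetical stand-in for a BERT-like sentence encoder.

import torch
from torch import nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a BERT-like sentence encoder with dropout."""
    def __init__(self, vocab_size=30522, dim=128, p=0.1):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings per sentence
        self.dropout = nn.Dropout(p)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.proj(self.dropout(self.embed(token_ids)))

encoder = ToyEncoder()
encoder.train()  # keep dropout active: different masks per pass are the "augmentation"

token_ids = torch.randint(0, 30522, (4, 32))  # a batch of 4 tokenized sentences
emb_i = encoder(token_ids)  # first pass: one dropout mask
emb_j = encoder(token_ids)  # second pass: a different dropout mask -> positive pair with emb_i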

The paper demonstrates the superior performance of the SimCSE method on multiple natural language processing tasks, including text matching, text classification, and sentence retrieval, through a series of experiments. The simplicity and effectiveness of the SimCSE method make it one of the important techniques for learning sentence embedding representations.

2.1 SimCSE Contrastive Loss Function 

l_{i}=-\log\tfrac{\exp(sim(h_{i},h_{i}^{+})/\tau )}{\sum_{j=1}^{N}\left(\exp(sim(h_{i},h_{j}^{+})/\tau )+\exp(sim(h_{i},h_{j}^{-})/\tau )\right)}

The SimCSE similarity matrix is analogous: the diagonal elements are the positive sample pairs, and with batch_size = 4 each positive pair {A, A+} corresponds to N − 1 = 3 in-batch negative pairs.

2.2 Code implementation

import torch
from torch import nn

class ContrastiveLoss1(nn.Module):
    def __init__(self, batch_size, temperature):
        super().__init__()
        self.batch_size = batch_size
        self.register_buffer("temperature", torch.tensor(temperature))  # temperature hyperparameter

    def forward(self, emb_i, emb_j):  # emb_i: embeddings of the original sentences; emb_j: embeddings after perturbation (e.g. a second dropout pass)
        z_i = nn.functional.normalize(emb_i, dim=1)  # L2-normalize each row
        z_j = nn.functional.normalize(emb_j, dim=1)
        cos_matrix = torch.mm(z_i, z_j.T) / self.temperature  # cosine similarities scaled by the temperature, shape (bs, bs)

        pos = torch.diag(cos_matrix)  # diagonal entries are the positive pairs
        denominator = torch.sum(torch.exp(cos_matrix), dim=1)  # row-wise sum over the positive and the in-batch negatives
        loss = (torch.log(denominator) - pos).mean()  # -log(exp(pos) / denominator), averaged over the batch
        return loss
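A minimal usage sketch for ContrastiveLoss1; random embeddings stand in for the two dropout passes over the same batch, and τ = 0.05 follows the common SimCSE setting:

torch.manual_seed(0)
loss_fn = ContrastiveLoss1(batch_size=4, temperature=0.05)

emb_i = torch.randn(4, 128)                 # embeddings of the original sentences
emb_j = emb_i + 0.01 * torch.randn(4, 128)  # stand-in for the perturbed (second-pass) embeddings
print(loss_fn(emb_i, emb_j))                # scalar loss, close to zero since each row matches its own copy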

3. ArcCon

The paper "A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space" (2022) mainly proposes a framework for contrastive learning, which is used for pairwise comparison and triple-wise comparison in angular space. perspective, learning sentence representations. The following are the main ideas and methods of the paper:

  1. Introduction and Background: The paper points out that traditional contrastive learning methods have some limitations when learning sentence representations, such as treating pairs as simply similar or dissimilar in Euclidean space. To address these limitations, the paper proposes a contrastive learning method based on angular space, which can better capture the subtle differences between sentences.

  2. Angular-Space Contrastive Learning Framework: The paper proposes a framework for learning sentence representations from pairwise and triple-wise perspectives in angular space. The framework includes two key components: angle contrast and triple contrast.

    • Angle Contrast: Learn the angular relationship between sentences by maximizing the cosine similarity between positive samples and minimizing the cosine similarity between negative samples. By introducing an angular contrastive loss function, the angular relation is transformed into a contrastive problem of cosine similarity in feature space.

    • Triple Contrast: By constructing triplet samples, the sentence representation is further optimized by maximizing the cosine similarity between similar samples and minimizing the cosine similarity between heterogeneous samples. By introducing a triplet contrast loss function, the angular relationship of triplet samples is transformed into a cosine similarity contrast problem in feature space.

  3. Experimental design and results: The paper verifies the effectiveness of the method through experiments on multiple sentence similarity tasks and sentence classification tasks. Experimental results show that the angle-space contrastive learning method has better performance and generalization ability, can capture subtle differences in sentence semantics, and achieves superior performance on multiple tasks.

  4. Analysis and discussion: The paper further analyzes the characteristics and advantages of the angle-space contrastive learning method, and discusses its comparison and association with other related methods. The applicability and scalability of the method to different datasets and real-world applications are also explored.

  5. Conclusion and future work: The paper summarizes the advantages and contributions of angular-space contrastive learning and proposes possible future research directions, such as more refined angular-space and triplet contrast strategies, as well as extensions to a wider range of semantic tasks.

The main idea of the paper revolves around angular-space contrastive learning: sentence representations are learned by maximizing the cosine similarity of positive pairs and minimizing that of negative pairs. Through contrastive optimization of pairwise angles and triplet relations, the method can better capture subtle differences between sentences and achieves better performance on multiple natural language processing tasks.

 

3.1 Loss function

After obtaining the positive and negative sentence pairs, we plug them into a training objective for model fine-tuning. Currently the most widely used training objective is the NT-Xent loss (Chen et al., 2020; Gao et al., 2021), which has been used in previous sentence and image representation learning methods:

l_{i}=-\log\tfrac{\exp(sim(h_{i},h_{i}^{+})/\tau )}{\sum_{j=1}^{N}\exp(sim(h_{i},h_{j})/\tau )}

where sim(h_i, h_j) is the cosine similarity \frac{h_{i}^{T}h_j}{\left \| h_i \right \|\left \| h_j \right \|}, τ is the temperature hyperparameter, and N is the number of sentences in a batch. Although this training objective tries to pull representations with similar semantics closer and push dissimilar representations apart, the resulting representations may still not be discriminative enough and not very robust to noise. Denote the angle \theta _{i,j} as follows:

\theta _{i,j}=\arccos\left(\frac{h_{i}^{T}h_j}{\left \| h_i \right \|\left \| h_j \right \|}\right)
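Numerically, θ can be recovered from the cosine similarity with torch.acos; clamping the input just inside [−1, 1] avoids NaNs at the boundaries (a small sketch):

import torch
import torch.nn.functional as F

h_i, h_j = torch.randn(4, 128), torch.randn(4, 128)
cos = F.cosine_similarity(h_i, h_j, dim=1)          # cos(theta_{i,j}) for matched rows
theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # angles in radians, clamped for numerical safety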

Comparing the NT-Xent loss with the ArcCon loss: for a sentence representation h_i, we try to make \theta _{i,i^{*}} smaller and \theta _{i,j} larger, so optimization proceeds in those two directions. With an additional angular margin m, ArcCon is more discriminative and more resistant to noise. The ArcCon loss is:

l_{i}=-\log\tfrac{e^{\cos(\theta _{i,i^{*}}+m)/\tau }}{e^{\cos(\theta _{i,i^{*}}+m)/\tau }+\sum_{j\neq i}e^{\cos(\theta _{j,i})/\tau }}

3.2 Code implementation

import torch
from torch import nn
import torch.nn.functional as F

class ArcConLoss(nn.Module):
    def __init__(self, batch_size, temperature, margin):
        super().__init__()
        self.batch_size = batch_size
        self.temperature = temperature
        self.margin = margin  # angular margin m (in radians)

    def forward(self, emb_i, emb_j):  # emb_i: sentence embeddings; emb_j: embeddings of the corresponding positive sentences
        # Pairwise cosine similarity matrix between emb_i and emb_j, shape (bs, bs)
        similarity_matrix = F.cosine_similarity(emb_i.unsqueeze(1), emb_j.unsqueeze(0), dim=2)
        # Recover the angles theta_{i,j}; clamp to avoid NaNs from acos at exactly +/-1
        theta = torch.acos(similarity_matrix.clamp(-1 + 1e-7, 1 - 1e-7))

        # Positive pairs sit on the diagonal: add the angular margin m before taking the cosine
        pos = torch.exp(torch.cos(torch.diag(theta) + self.margin) / self.temperature)  # shape (bs,)

        # Negatives are the off-diagonal entries of each row, without the margin
        mask = ~torch.eye(theta.size(0), dtype=torch.bool, device=theta.device)
        neg = torch.exp(torch.cos(theta) / self.temperature) * mask  # zero out the diagonal
        neg_sum = neg.sum(dim=1)  # shape (bs,)

        # ArcCon loss, averaged over the batch
        loss = -torch.log(pos / (pos + neg_sum)).mean()
        return loss
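A minimal usage sketch for the class above. The margin is added to an angle, so it is expressed in radians; m = 0.1 is only an illustrative value here:

torch.manual_seed(0)
loss_fn = ArcConLoss(batch_size=4, temperature=0.05, margin=0.1)

emb_i = torch.randn(4, 128)
emb_j = emb_i + 0.01 * torch.randn(4, 128)  # stand-in for embeddings of the positive sentences
print(loss_fn(emb_i, emb_j))                # scalar ArcCon loss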

Source: blog.csdn.net/weixin_43734080/article/details/132415299