Artificial Intelligence Task 1 [NLP Series]: Sentence Embedding and Multi-Model Implementation

Hello everyone, I am Weixue AI. Today I will introduce Artificial Intelligence Task 1 in the NLP series: sentence embedding and its multi-model implementation. Sentence embedding maps a sentence to a fixed-dimensional vector representation and has a wide range of applications in natural language processing (NLP). By converting sentences into vectors, computers can better understand and process text data.

This article implements sentence embedding with multiple models, including Word2Vec, Doc2Vec, and BERT, and applies them to sentence embedding tasks. Pre-trained models such as BERT have learned rich semantic information from massive text data through large-scale self-supervised learning and can generate high-quality sentence embeddings.

Table of contents

  1. Introduction
  2. Project Background and Significance
  3. Sentence Embedding Basics
  4. Implementation Methods
    1. Word2Vec
    2. Doc2Vec
    3. BERT
  5. Project Practice and Code
    1. Data Preprocessing
    2. Sentence Embedding Implementation
  6. Summary

Introduction

With the development of artificial intelligence and big data, natural language processing (NLP) has been widely used in many fields, such as search engines, recommendation systems, and automatic translation. Among these technologies, sentence embedding is a key one: it converts natural language sentences into vectors that computers can understand, so that machines can process and understand natural language. This article introduces the applications of sentence embedding in NLP in detail, as well as several common implementations of sentence embedding for Chinese text.

Project Background and Significance

In natural language processing, the process of converting sentences into vectors is called sentence embedding. Computers cannot understand natural language directly; they operate on numerical data such as vectors. Sentence embedding can capture the semantic information of sentences and help machines understand and process natural language.

Sentence embedding has a wide range of applications, such as sentiment analysis, text classification, semantic search, machine translation, etc. For example, in sentiment analysis, sentence embedding can convert text into vectors, and then use machine learning models to predict the sentiment of the text. In machine translation, sentence embeddings help machines understand sentences in the source language and translate them into sentences in the target language.

The application of sentence embedding mainly includes the following aspects:

Text Classification/Sentiment Analysis: Sentence embeddings can be used for text classification tasks such as classifying movie reviews into positive and negative sentiment. Sentence embedding-based models can learn the semantic information of sentences and apply it to sentiment classification.

Semantic similarity: By computing the similarity between sentence embeddings, the semantic similarity between sentences can be measured. This is very useful in tasks such as question answering systems, recommender systems, etc., to help find other sentences that are most relevant to the input sentence.

Machine Translation: Sentence embeddings can be used for sentence alignment and translation modeling in machine translation tasks. By encoding source language sentences and target language sentences into embedding vectors, the correspondence and semantic information between sentences can be captured to improve translation quality.

Sentence Generation: Using pre-trained language models and sentence embeddings, coherent and semantically correct sentences can be generated. Sentence embeddings can be used as input for generative tasks, ensuring that the generated sentences are contextually relevant to the input.

Information retrieval/similar sentence lookup: By converting sentences into embedding vectors, indexing and fast similar sentence lookup can be done. This has important application value in fields such as search engines and knowledge graphs.
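
To make the semantic-similarity and retrieval use cases above concrete, here is a minimal sketch of comparing two sentence vectors with cosine similarity. The vectors are random placeholders standing in for real sentence embeddings produced by the models introduced later.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in practice these come from Word2Vec, Doc2Vec, or BERT
vec1 = np.random.rand(100)
vec2 = np.random.rand(100)
print(cosine_similarity(vec1, vec2))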

Sentence Embedding Basics

Sentence embedding is a technique for converting natural language sentences into fixed-length vectors of real numbers. Such a vector can capture the semantic information of the sentence, such as its topic, sentiment, and tone. Sentence embeddings are usually learned by neural network models, which range from shallow unsupervised models such as Word2Vec and Doc2Vec to deep pre-trained models such as BERT.

Implementation Methods

Next, we will introduce three common implementations of Chinese text sentence embedding.

Method 1: Word2Vec

Word2Vec is a common word embedding method that maps each word to a vector. A simple way to obtain a sentence vector from it is to average the vectors of all the words in the sentence.

There are two implementations of Word2Vec: CBOW (Continuous Bag-of-Words) and Skip-gram.

The CBOW model aims to predict the central word based on the context, while the Skip-gram model predicts the context based on the central word. Here's the basic math behind the two models:

CBOW model:

Suppose the center word is $w_t$ and the context window size is $m$. The context words can then be written as $w_{t-m}, w_{t-m+1}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}$.

The CBOW model tries to predict the center word from the context words; its goal is to maximize the conditional probability of the center word given the context.

Specifically, the CBOW model first averages the context word vectors to obtain the context representation $\mathbf{v} = \frac{1}{2m} \sum_{i=1}^{2m} \mathbf{v}_{w_{t_i}}$. The context representation $\mathbf{v}$ is then fed into a hidden layer, which outputs $\mathbf{h} = \sigma(\mathbf{W}\mathbf{v} + \mathbf{b})$. Finally, the hidden-layer output is compared with the one-hot representation of the center word $w_t$, and the softmax function produces a probability distribution $\hat{\mathbf{y}}$ over the vocabulary. The training objective is to maximize the log probability of the actual center word: $\max \log P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$.

Skip-gram model:

The Skip-gram model is the opposite of the CBOW model: it tries to predict the context words from the center word.

Specifically, the Skip-gram model feeds the word vector $\mathbf{v}_{w_t}$ of the center word $w_t$ into a hidden layer and obtains the output $\mathbf{h} = \sigma(\mathbf{W}\mathbf{v}_{w_t} + \mathbf{b})$ through a nonlinear function. The hidden-layer output is then compared in turn with the one-hot representations of the context words $w_{t-m}, w_{t-m+1}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}$, and the softmax function produces a probability distribution $\hat{\mathbf{y}}$ over the vocabulary. The training objective is to maximize the log probability of the actual context words: $\max \sum_{i=1}^{2m} \log P(w_{t_i} \mid w_t)$.

In the actual training process, Word2Vec uses negative sampling to approximate the softmax computation, which speeds up training and achieves good performance in practice.
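
As a concrete illustration, here is a minimal gensim sketch (assuming gensim 4.x and a tiny tokenized corpus defined only for demonstration): the sg parameter switches between CBOW (sg=0) and Skip-gram (sg=1), and negative controls negative sampling.

from gensim.models import Word2Vec

# Tiny tokenized demonstration corpus
corpus = [["我", "喜欢", "自然", "语言", "处理"],
          ["句子", "嵌入", "非常", "有用"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram; negative=5 enables negative sampling
cbow_model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0, negative=5)
skipgram_model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, negative=5)

# Each trained model provides a vector for every word in its vocabulary
print(cbow_model.wv["句子"].shape)  # (100,)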


Method 2: Doc2Vec

Doc2Vec is a method to directly obtain sentence vectors, which is an extension of Word2Vec. Doc2Vec not only considers the context of words, but also considers the global information of the document.

Suppose we have a corpus of N documents, each document consisting of a sequence of words. The goal of Doc2Vec is to generate a fixed-length vector representation for each document.

Doc2Vec uses two different models to achieve this goal: PV-DM (Distributed Memory) and PV-DBOW (Distributed Bag of Words).

For the PV-DM model, during training each document is mapped to a unique vector (the paragraph vector), and each word is also mapped to a vector. At each training step, the model takes the paragraph vector together with a window of context words and tries to predict the target word. The loss is computed from the difference between the prediction and the true word and is backpropagated to update both the document vector and the word vectors.

The PV-DBOW model, in contrast, ignores the order of words within a document and focuses on the overall representation of the document. In this model, a document is mapped to a vector, and the model is trained to predict words sampled from the document given only the document vector. Likewise, the loss is backpropagated to update the vector representations of documents and words.

Overall, Doc2Vec captures the semantic information of documents by representing each document as a fixed-length vector. These vectors can be used to measure similarity between documents, cluster documents, or as input for other tasks.

The details of Doc2Vec can be described with the following notation:

PV-DM model:

  • Input: a document $d$ consisting of a word sequence $(w_1, w_2, \ldots, w_n)$, where $n$ is the number of words in the document.
  • Document vector: $pv_{\text{dm}}(d)$, the vector representation of document $d$.
  • Word vectors: each word $w_i$ has a corresponding vector representation.
  • Prediction: given an input text fragment $(w_1, w_2, \ldots, w_k)$, the model tries to predict the missing word $w_{k+1}$.
  • Loss function: the difference between the prediction and the true value is measured with cross-entropy or another suitable loss.
  • Training: document vectors and word vectors are updated via backpropagation and gradient descent.

PV-DBOW model:

  • Input: a document $d$ consisting of a word sequence $(w_1, w_2, \ldots, w_n)$, where $n$ is the number of words in the document.
  • Document vector: $pv_{\text{dbow}}(d)$, the vector representation of document $d$.
  • Word vectors: each word $w_i$ has a corresponding vector representation.
  • Prediction: given document $d$, the model tries to predict words related to that document.
  • Loss function: the difference between the prediction and the true value is measured with cross-entropy or another suitable loss.
  • Training: document vectors and word vectors are updated via backpropagation and gradient descent.
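
To make the two variants concrete, here is a minimal gensim sketch (assuming gensim 4.x and a tiny tokenized corpus defined only for demonstration): the dm parameter switches between PV-DM (dm=1) and PV-DBOW (dm=0).

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Tiny tokenized demonstration corpus, each document tagged with an integer id
corpus = [["我", "喜欢", "自然", "语言", "处理"],
          ["句子", "嵌入", "非常", "有用"]]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]

# dm=1 trains PV-DM, dm=0 trains PV-DBOW
pv_dm = Doc2Vec(documents, vector_size=100, window=5, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(documents, vector_size=100, window=5, min_count=1, dm=0, epochs=40)

# Both models map each tagged document to a fixed-length vector
print(pv_dm.dv[0].shape, pv_dbow.dv[0].shape)  # (100,) (100,)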


Method 3: BERT

BERT is a Transformer-based deep learning model that can obtain deep semantic information of sentences.

The BERT model is built around two key pre-training tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

First, the input text sequence is represented as a series of token embeddings, and a learned positional embedding is added to each of them (BERT uses absolute, learned position embeddings rather than relative ones). Features are then extracted through multiple stacked Transformer layers.

In the MLM stage, BERT randomly masks a portion of the tokens in the input sequence, replacing them with the special token "[MASK]". The model then predicts these masked tokens from their surrounding context.

In the NSP stage, BERT will take two sentences as input and judge whether they are consecutive sentences in the original text. This task aims to help the model learn semantic information at the sentence level.

Specifically, the mathematical principle of the BERT model includes the following steps:

  1. Input embedding layer: the input is a sequence of token indices, which are mapped to embedding vectors.
  2. Positional encoding: a learned positional embedding is added to each input embedding so that the model can capture the order relationship between tokens.
  3. Transformer layers: features are extracted through multiple stacked Transformer layers, each consisting of a multi-head self-attention mechanism and a feedforward neural network.
  4. Masked Language Model (MLM): a portion of the tokens in the input sequence is masked, and the model predicts the masked tokens from their context (see the sketch after this list).
  5. Next Sentence Prediction (NSP): two sentences are taken as input, and the model judges whether they are consecutive sentences in the original text.
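
As a small illustration of the MLM objective, the following sketch uses the fill-mask pipeline from the transformers library with the pre-trained bert-base-chinese model; the masked sentence is an arbitrary example.

from transformers import pipeline

# Load a fill-mask pipeline backed by the Chinese BERT model
fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# BERT predicts the token hidden behind [MASK] from its surrounding context
results = fill_mask("今天天气很[MASK]。")
for r in results:
    print(r["token_str"], r["score"])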

Project Practice and Code

Next, we will use an example to show how to implement sentence embedding for Chinese text. We will use Python and related NLP libraries (such as jieba, gensim, torch, and transformers).

Data Preprocessing

First, we need to preprocess the data, including word segmentation, removing stop words, etc. The following is a simple data preprocessing code example:

import jieba

def preprocess_text(text):
    # Segment the text into words with jieba
    words = jieba.cut(text)

    # Remove stop words (assumes a local stop-word list, one word per line)
    stop_words = set(line.strip() for line in open('stop_words.txt', 'r', encoding='utf-8'))
    words = [word for word in words if word not in stop_words]

    return words
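
A quick usage check (the example sentence is arbitrary, and stop_words.txt is assumed to exist in the working directory):

# Segment an example sentence and drop stop words
tokens = preprocess_text("今天天气很好，我们一起去公园散步。")
print(tokens)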

Sentence Embedding Implementation

Next, we show how to implement sentence embeddings using the three methods described above.

Method 1: Word2Vec + Word Vector Averaging

from gensim.models import Word2Vec
import numpy as np

def sentence_embedding_word2vec(sentences, vector_size=100, window=5, min_count=5):
    # Train a Word2Vec model on the tokenized sentences (gensim 4.x API)
    model = Word2Vec(sentences, vector_size=vector_size, window=window, min_count=min_count)

    # Average the word vectors of each sentence to get the sentence vector
    sentence_vectors = []
    for sentence in sentences:
        vectors = [model.wv[word] for word in sentence if word in model.wv]
        if vectors:
            sentence_vectors.append(np.mean(vectors, axis=0))
        else:
            # Fall back to a zero vector when no word of the sentence is in the vocabulary
            sentence_vectors.append(np.zeros(vector_size))

    return sentence_vectors
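
A minimal usage sketch, assuming the sentences have already been tokenized (for example with preprocess_text above); min_count is lowered to 1 only because the demonstration corpus is tiny.

# Tiny tokenized demonstration corpus
corpus = [["我", "喜欢", "自然", "语言", "处理"],
          ["句子", "嵌入", "非常", "有用"]]
vectors = sentence_embedding_word2vec(corpus, vector_size=100, window=5, min_count=1)
print(len(vectors), vectors[0].shape)  # 2 (100,)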

Method 2: Doc2Vec

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def sentence_embedding_doc2vec(sentences, vector_size=100, window=5, min_count=5):
    # Wrap each tokenized sentence in a TaggedDocument with a unique integer tag
    documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]

    # Train a Doc2Vec model (gensim 4.x API)
    model = Doc2Vec(documents, vector_size=vector_size, window=window, min_count=min_count)

    # Look up the learned vector of each sentence by its tag
    sentence_vectors = [model.dv[i] for i in range(len(sentences))]

    return sentence_vectors
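
One practical note: the function above only returns vectors for the training sentences. For new, unseen sentences, gensim provides infer_vector on a trained Doc2Vec model. A hedged sketch, where doc2vec_model is a hypothetical name for any trained Doc2Vec instance (for example, the model built inside the function above if it is also returned):

# doc2vec_model is assumed to be a trained gensim Doc2Vec instance (hypothetical name)
new_tokens = preprocess_text("这是一条没有出现在训练集中的句子。")
new_vector = doc2vec_model.infer_vector(new_tokens)
print(new_vector.shape)  # (vector_size,)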

Method 3: BERT

import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained Chinese BERT model and its tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese')

# The sentence to be converted
sentence = "这是一个示例句子。"

# Split the sentence into tokens with the tokenizer
tokens = tokenizer.tokenize(sentence)

# Add the special tokens [CLS] and [SEP]
tokens = ['[CLS]'] + tokens + ['[SEP]']

# Convert tokens to their vocabulary ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Build the input tensor (a batch of size 1)
input_tensor = torch.tensor([input_ids])

# Run BERT to obtain the sentence embedding
with torch.no_grad():
    outputs = model(input_tensor)
    sentence_embedding = outputs[0][0][0]  # output of the [CLS] token of the first sentence, used as the sentence embedding

# Print the sentence embedding and its shape
print(sentence_embedding)
print(sentence_embedding.shape)
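
The [CLS] vector is a convenient sentence representation, but averaging the last hidden states of all tokens (mean pooling) is also a commonly used alternative. A short sketch reusing the outputs tensor from the code above:

# Mean pooling over all token positions as an alternative sentence embedding
mean_embedding = outputs[0][0].mean(dim=0)
print(mean_embedding.shape)  # torch.Size([768]) for bert-base-chinese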

Summary

This article has introduced the applications of sentence embedding in NLP in detail, along with several common implementations of sentence embedding for Chinese text. Through code examples, we have shown how to implement sentence embeddings using Word2Vec with word vector averaging, Doc2Vec, and BERT. I hope this article helps readers better understand sentence embedding and apply it in real projects.
