Python - a simple demo of sentenceSimilarity (testing sentence similarity)

1. What is sentenceSimilarity?

sentenceSimilarity falls within the field of machine learning.

The sentenceSimilarity library in Python is a tool library for calculating sentence similarity, mainly used in natural language processing applications. The library supports multiple models for calculating sentence similarity, including TF-IDF, LSI, and LDA.

When using this library, you first use a tokenizer to segment and preprocess the text, then pass the processed sentences into a SentenceSimilarity instance for training.

When calculating similarity, you simply pass the sentences to be compared to the similarity method as parameters.
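The core idea behind this kind of similarity lookup, weighting tokens by TF-IDF and ranking training sentences by cosine similarity to a query, can be sketched in plain Python. This is a simplified stand-in, not the library's actual implementation; all function names and the toy corpus below are illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Compute sparse TF-IDF vectors for a list of pre-tokenized sentences."""
    n = len(sentences)
    # Document frequency: number of sentences containing each token
    df = Counter()
    for tokens in sentences:
        df.update(set(tokens))
    # Smoothed IDF so tokens appearing in every sentence still get weight
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for tokens in sentences:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query_tokens, train_tokens):
    """Return (index, score) of the training sentence most similar to the query."""
    vectors, idf = tfidf_vectors(train_tokens)
    tf = Counter(query_tokens)
    # Unknown query tokens get zero weight
    qvec = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    scores = [cosine(qvec, v) for v in vectors]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]

train = [["the", "cat", "sat"], ["dogs", "bark", "loudly"], ["the", "cat", "purred"]]
idx, score = best_match(["a", "cat", "sat"], train)
print(idx, round(score, 3))
```

The library delegates this work to trained models (TF-IDF, LSI, LDA), but the query-against-corpus ranking shown here mirrors what the demo's `ss.similarity(...)` call does conceptually.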

Sentence similarity calculation is widely used in various tasks in natural language processing, such as machine translation, text classification, information retrieval, etc.

By calculating the similarity between different texts, applications such as automated text mining, information extraction, and knowledge management can be realized.

There are also extensive applications in social networks and recommender systems, such as collaborative filtering and content-based recommendation.

2. Code package

sentenceSimilarity-master.zip - Lanzout Cloud (file size: 27.6 K) | https://wwwf.lanzout.com/iblEj0wrt0sh

Just run demo.py directly.

3. demo code

#encoding=utf-8

from zhcnSegment import *
from fileObject import FileObj
from sentenceSimilarity import SentenceSimilarity
from sentence import Sentence

if __name__ == '__main__':
    # Read in the training set
    file_obj = FileObj(r"testSet/trainSet.txt")
    train_sentences = file_obj.read_lines()

    # Read in test set 1
    file_obj = FileObj(r"testSet/testSet1.txt")
    test1_sentences = file_obj.read_lines()

    # Read in test set 2
    # file_obj = FileObj(r"testSet/testSet2.txt")
    # test2_sentences = file_obj.read_lines()

    # Tokenizer based on jieba segmentation; the author adds a wrapper mainly to remove stop words
    seg = Seg()

    # Create a SentenceSimilarity instance and train the model
    ss = SentenceSimilarity(seg)
    ss.set_sentences(train_sentences)
    ss.TfidfModel()         # TF-IDF model
    # ss.LsiModel()         # LSI model
    # ss.LdaModel()         # LDA model

    # Compute sentence similarity against the training set
    right_count = 0
    for i, test_word in enumerate(test1_sentences):
        result = ss.similarity(test_word)
        score, idx = result.score, result.id
        print(f"【{i}】{test1_sentences[i]} => 【{idx}】{train_sentences[idx]}, score={score}")
        if score > 0.8:
            right_count += 1
    # Divide by the number of test sentences (not training sentences) to get the hit rate
    res = float(right_count) / len(test1_sentences) * 100
    print(f"Similarity hit rate: {res}%")

4. Running results


Sentence similarity calculation is an important task in natural language processing, and both model building and similarity computation rely on machine learning methods.

This process requires substantial data preprocessing, including word segmentation, stop-word removal, vocabulary construction, and computation of text feature values. These feature values are then used as the model's input, and traditional machine learning methods (such as LSI, LDA, and TF-IDF) are applied for training and prediction, thereby achieving sentence similarity calculation.
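The preprocessing steps named above (segmentation, stop-word removal, vocabulary construction, feature extraction) can be sketched in plain Python. The whitespace tokenizer and tiny stop-word list are illustrative stand-ins; Chinese text would use jieba for segmentation and a real stop-word file:

```python
# A hypothetical stop-word list; real pipelines load one for the target language
STOPWORDS = {"the", "is", "a", "of", "and"}

def preprocess(text):
    """Segment text into tokens (whitespace split here) and drop stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_vocab(token_lists):
    """Map each surviving token to a stable integer id."""
    vocab = {}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab

def bow_vector(tokens, vocab):
    """Count-based feature vector over the vocabulary (bag of words)."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

corpus = ["the cat sat", "a dog and a cat"]
token_lists = [preprocess(s) for s in corpus]
vocab = build_vocab(token_lists)
features = [bow_vector(t, vocab) for t in token_lists]
print(vocab, features)
```

These count vectors are the raw feature values that models like TF-IDF, LSI, or LDA then reweight or project before similarity is computed.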

Therefore, sentence similarity calculation combines machine learning with natural language processing, and these models are also widely used in text classification, sentiment analysis, and other fields.


Origin blog.csdn.net/Pan_peter/article/details/130785220