Building a power dispatching knowledge Q&A system with an inverted index (about 100 lines of code to implement an NLP question-answering system)

The data required by the Q&A system is already provided: every question has a corresponding answer, so each sample can be understood as a <question, answer> pair. The core of the system is that when a user enters a question, it first finds the stored question closest to the user's question and then directly returns the corresponding answer.
Since the author is an electrical engineering student, the Q&A system here is built on a power dispatching knowledge text.

In the original table, I prepared 205 question-answer pairs related to power dispatching.

Language: Python 3.7

Step 1: Read the data

import pandas as pd

import numpy as np
import jieba
import re

csv = '电力调度问答.csv'
file_txt = pd.read_csv(csv, header=0, encoding='gbk')  # [205 rows x 2 columns]
file_txt = file_txt.dropna()  # drop empty rows, still [205 rows x 2 columns]
print(file_txt.head())  # inspect the first 5 rows

Step 2: Filter out stop words, punctuation, and single-character words

The stop word list comes from a publicly available Chinese NLP stop word data set.


# function that removes every symbol except letters, digits, and Chinese characters
def remove_punctuation(line):
    line = str(line)
    if line.strip() == '':
        return ''
    rule = re.compile(u"[^a-zA-Z0-9\u4E00-\u9FA5]")
    line = rule.sub('', line)
    return line

# load the stop word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='gbk').readlines()]
    return stopwords

stopwords = stopwordslist("停用词.txt")

# remove punctuation
file_txt['clean_review'] = file_txt['问题'].apply(remove_punctuation)
# remove stop words and single-character words
file_txt['cut_review'] = file_txt['clean_review'].apply(lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords and len(w) > 1]))
print(file_txt.head())

The resulting cut_review column contains the keyword information of each question.

Step 3: Text vectorization
We take the question the user enters, find similar questions in the system, and output their answers. That requires computing similarity, and before that the text must be vectorized.
I use tf-idf for the representation, importing it directly from scikit-learn.

from sklearn.feature_extraction.text import CountVectorizer  # bag of words
from sklearn.feature_extraction.text import TfidfTransformer  # tf-idf

count_vect = CountVectorizer()
X = count_vect.fit_transform(file_txt['cut_review'])

# tf-idf
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X)
print(X_tfidf)
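
As noted in Step 7, the similarity computation later in this article does not actually reuse X_tfidf. For completeness, here is a minimal sketch, under my own assumptions, of how the fitted count_vect and tfidf_transformer could be used instead to match a user's question against all stored ones (the example keywords are illustrative):

# Sketch only: tf-idf matching as an alternative to the word-count cosine of Step 7.
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine

user_keywords = "中性点 接地 方式"  # space-joined keywords, same format as cut_review
user_vec = tfidf_transformer.transform(count_vect.transform([user_keywords]))
scores = sk_cosine(user_vec, X_tfidf).flatten()  # similarity to every stored question
print(scores.argsort()[::-1][:2])  # row indices of the two closest questions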

Step 4: The forward index
My forward index here is
{'question 1 ID': [keyword 1, keyword 2, ...], 'question 2 ID': [keyword 2, keyword 3, ...], ...}
where the ID is the question's row number, i.e. question 1 is the first question in the table.

for i in range(len(file_txt)):
    left, rights = i, file_txt.iloc[i]['cut_review'].split()
    print('left is ', left, 'rights is ', rights)

Because the full output is too long, I modify the code here and assume there are only 5 questions:

for i in range(len(file_txt.head())):
    left, rights = i,file_txt.iloc[i]['cut_review'].split()
    print('left is ',i,'rights is ',rights)

A forward index does not normally appear in the final code; I wrote it out only to make the inverted index easier to follow.

Step 5: Inverted index implementation
We need to compute the similarity between the user's question and the questions in the library, and then return the answer of the most similar one. If we traversed every question in the library and computed its similarity with the user's question, the time cost would be too great for a large data set.
That is why an inverted index is needed here.
The forward index described above is
{'question 1 ID': [keyword 1, keyword 2, ...], 'question 2 ID': [keyword 2, keyword 3, ...], ...}

The corresponding inverted index is
{'keyword 1': [question 1 ID], 'keyword 2': [question 1 ID, question 2 ID, ...], ...}

For the user's question, we first segment it and extract its keywords, then use the keywords to look up all question IDs that contain them, and only then compute similarities between those questions and the user's question.
Thanks to the inverted table, we no longer traverse every question in the library when computing similarities, only the questions that share a keyword with the user's question.

# inverted table: keyword -> IDs of the questions that contain it
result = {}
for i in range(len(file_txt)):
    left, rights = i, file_txt.iloc[i]['cut_review'].split()
    for right in rights:
        if right in result.keys():
            result[right].append(left)
        else:
            result[right] = [left]

In the same way, because the full data set is too large, let's assume there are only 5 questions and look at the resulting inverted index, to get a feel for what an inverted index is.

result = {}
for i in range(len(file_txt.head())):
    left, rights = i,file_txt.iloc[i]['cut_review'].split()
    for right in rights:
        if right in result.keys():
            result[right].append(left)
        else:
            result[right] = [left]

print(result)

As the output shows, with only 5 questions each keyword maps to the IDs of the questions that contain it: some keywords appear in a single question, others in several (e.g. questions 1, 3, ...).
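
As a side note, the same inverted table can be built more compactly with collections.defaultdict from the standard library; the behavior is identical to the loop above:

from collections import defaultdict

result = defaultdict(list)
for i in range(len(file_txt)):
    for keyword in file_txt.iloc[i]['cut_review'].split():
        result[keyword].append(i)  # keyword -> IDs of the questions containing it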

Step 6: Segment the user's question, extract its keywords, and find all matching question IDs

Suppose the user enters: sentence = "中性点接地方式有哪些" ("What are the neutral point grounding methods?").
The extracted keywords are ['中性点', '接地', '方式'] (neutral point, grounding, method).

sentence="中性点接地方式有哪些"
clean_reviewyonghu=remove_punctuation(sentence)#去除标点
cut_reviewyonghu=[w for w in list(jieba.cut(clean_reviewyonghu)) if w not in stopwords and len(w)>1]#去除停用词,单字词
#print(cut_reviewyonghu)
# ['中性点', '接地', '方式']
Problem_Id=[]
for j in cut_reviewyonghu:
    if j in result.keys():
       Problem_Id.extend(result[j])
id=(list(set(Problem_Id)))#去重之后的ID
print(id)

The resulting IDs show that 17 questions in the database share a keyword with the user's question.
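
Optionally, before computing any similarity, the candidates could be pre-ranked by how many of the user's keywords they contain. This is my own refinement, not part of the original pipeline; a minimal sketch with collections.Counter:

from collections import Counter

hits = Counter(Problem_Id)  # question ID -> number of keyword matches (before deduplication)
print(hits.most_common(5))  # the five candidates sharing the most keywords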

Step 7: Similarity calculation
Compute, one by one, the similarity between the user's question, "中性点接地方式有哪些", and the 17 questions found above.

There are many ways to calculate similarity. Note that the method I use below does not reuse the tf-idf vectorization from Step 3; it builds word-count vectors directly.

For other similarity measures, see my earlier blog post on text similarity calculation methods and their Python implementations.

# cosine similarity
def cosine_similarity(sentence1: str, sentence2: str) -> float:
    """
    :param sentence1: first sentence
    :param sentence2: second sentence
    :return: similarity of the two sentences
    """
    seg1 = [word for word in jieba.cut(sentence1) if word not in stopwords]
    seg2 = [word for word in jieba.cut(sentence2) if word not in stopwords]
    word_list = list(set([word for word in seg1 + seg2]))  # build the joint vocabulary
    word_count_vec_1 = []
    word_count_vec_2 = []
    for word in word_list:
        word_count_vec_1.append(seg1.count(word))  # term counts of sentence 1 over the vocabulary
        word_count_vec_2.append(seg2.count(word))  # term counts of sentence 2 over the vocabulary

    vec_1 = np.array(word_count_vec_1)
    vec_2 = np.array(word_count_vec_2)
    # cosine formula
    num = vec_1.dot(vec_2.T)
    denom = np.linalg.norm(vec_1) * np.linalg.norm(vec_2)
    if denom == 0:  # guard: one sentence had all of its words filtered out
        return 0.0
    cos = num / denom
    sim = 0.5 + 0.5 * cos  # rescale from [-1, 1] to [0, 1]

    return sim
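
A quick illustrative check of the function (the exact scores depend on the stop word list):

# Near-identical keyword sets score close to 1.0; questions with no shared words
# fall toward 0.5 because of the 0.5 + 0.5 * cos rescaling.
print(cosine_similarity("中性点接地方式有哪些", "中性点接地方式有几种"))
print(cosine_similarity("中性点接地方式有哪些", "变压器如何并列运行"))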

str1 = sentence  # the user's question
similarity = {}  # store the results
if len(id) == 0:
    print('The question is not in the database, please ask again')
else:
    for i in id:
        str2 = file_txt.iloc[i]['问题']
        sim1 = cosine_similarity(str1, str2)  # cosine similarity
        print('similarity between the user question and question {0} is {1}'.format(i, sim1))
        similarity[i] = sim1
print(similarity)




Step 8: Return the answers
Sort the similarity dict obtained in Step 7 and output the answers of the 2 most similar questions.

jieguo = sorted(similarity.items(), key=lambda d: d[1], reverse=True)[:2]  # descending order
print(jieguo)
print('the user question is:', sentence)

for i, j in jieguo:
    print('similar database question {0}, answer: {1}'.format(i, file_txt.iloc[i]['答案']))

The output is as follows: the answer to question 33 is the one we are looking for.

The complete code

import pandas as pd

import numpy as np
import jieba
import re


# function that removes every symbol except letters, digits, and Chinese characters
def remove_punctuation(line):
    line = str(line)
    if line.strip() == '':
        return ''
    rule = re.compile(u"[^a-zA-Z0-9\u4E00-\u9FA5]")
    line = rule.sub('', line)
    return line

# load the stop word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='gbk').readlines()]
    return stopwords

# cosine similarity
def cosine_similarity(sentence1: str, sentence2: str, stopwords) -> float:
    """
    :param sentence1: first sentence
    :param sentence2: second sentence
    :return: similarity of the two sentences
    """
    seg1 = [word for word in jieba.cut(sentence1) if word not in stopwords]
    seg2 = [word for word in jieba.cut(sentence2) if word not in stopwords]
    word_list = list(set([word for word in seg1 + seg2]))  # build the joint vocabulary
    word_count_vec_1 = []
    word_count_vec_2 = []
    for word in word_list:
        word_count_vec_1.append(seg1.count(word))  # term counts of sentence 1 over the vocabulary
        word_count_vec_2.append(seg2.count(word))  # term counts of sentence 2 over the vocabulary

    vec_1 = np.array(word_count_vec_1)
    vec_2 = np.array(word_count_vec_2)
    # cosine formula
    num = vec_1.dot(vec_2.T)
    denom = np.linalg.norm(vec_1) * np.linalg.norm(vec_2)
    if denom == 0:  # guard: one sentence had all of its words filtered out
        return 0.0
    cos = num / denom
    sim = 0.5 + 0.5 * cos  # rescale from [-1, 1] to [0, 1]

    return sim

def main():
    # read the data
    csv = '电力调度问答.csv'
    file_txt = pd.read_csv(csv, header=0, encoding='gbk')  # [205 rows x 2 columns]
    file_txt = file_txt.dropna()  # drop empty rows, still [205 rows x 2 columns]
    # load the stop words
    stopwords = stopwordslist("停用词.txt")

    # remove punctuation
    file_txt['clean_review'] = file_txt['问题'].apply(remove_punctuation)
    # remove stop words and single-character words
    file_txt['cut_review'] = file_txt['clean_review'].apply(
        lambda x: " ".join([w for w in list(jieba.cut(x)) if w not in stopwords and len(w) > 1]))

    # inverted table over all questions: keyword -> question IDs
    result = {}
    for i in range(len(file_txt)):
        left, rights = i, file_txt.iloc[i]['cut_review'].split()
        for right in rights:
            if right in result.keys():
                result[right].append(left)
            else:
                result[right] = [left]

    # the user's question
    sentence = input('Please enter a question: ')
    clean_reviewyonghu = remove_punctuation(sentence)  # remove punctuation
    cut_reviewyonghu = [w for w in list(jieba.cut(clean_reviewyonghu)) if
                        w not in stopwords and len(w) > 1]  # remove stop words and single-character words to get the keywords
    # print(cut_reviewyonghu)
    # look up the question IDs matching the user's keywords
    Problem_Id = []
    for j in cut_reviewyonghu:
        if j in result.keys():
            Problem_Id.extend(result[j])
    id = (list(set(Problem_Id)))  # deduplicated IDs

    # compute the cosine similarities
    str1 = sentence  # the user's question
    similarity = {}  # store the results
    if len(id) == 0:
        print('The question is not in the database, please ask again')
        return
    for i in id:
        str2 = file_txt.iloc[i]['问题']
        sim1 = cosine_similarity(str1, str2, stopwords)  # cosine similarity
        # print('similarity between the user question and question {0} is {1}'.format(i, sim1))
        similarity[i] = sim1

    # output the answers of the questions most similar to the user's
    jieguo = sorted(similarity.items(), key=lambda d: d[1], reverse=True)[:2]  # descending order
    print(jieguo)
    print('the user question is:', sentence)

    for i, j in jieguo:
        print('similar database question {0}, answer: {1}'.format(i, file_txt.iloc[i]['答案']))


if __name__ == '__main__':
    main()


Summary

This is a simple question-and-answer system. In real life the service would also include a voice front end: speech is first converted to text, the text is then corrected, and finally it is passed to the question-and-answer system.
What we need to maintain is the question-answer table behind the database: the more questions it contains, the better the system performs.
To improve accuracy and speed, you can adapt the stop word list (the one used here is generic, not tailored to electricity) so that the final keywords contain only power-domain terms.
You can also adjust the jieba segmentation so that domain terms are not split into single characters, for example by registering a custom dictionary, as sketched below.
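
A minimal sketch of that suggestion, using jieba's user-dictionary API; the domain terms below are illustrative examples, not an official word list:

import jieba

jieba.add_word('中性点接地')  # keep "中性点接地" as a single token
jieba.add_word('消弧线圈')
# Alternatively, keep the terms in a file, one "word [freq] [tag]" per line:
# jieba.load_userdict('电力词典.txt')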

The system could also be packaged as GUI software, but I won't demonstrate that here.
If you stop after the inverted-index step and simply display, one by one, the stored questions and answers that match the user's keywords, you get a search system instead (similar to Baidu: enter a sentence and a large pile of related results pops up).

A computer novice in electrical engineering: Yu Dengwu. Writing blog posts is not easy; if you find this article useful, please give it a like. Thank you.

I am an electrical engineering student; how did I end up understanding all this? Ugh.


Source: blog.csdn.net/kobeyu652453/article/details/108901289