Building a Knowledge-Graph-Based Question Answering System for Jingdong Mall in Python: User Question Preprocessing

This article covers the processing of user questions, that is, the pipeline from receiving a user's question to understanding the user's intent. It mainly involves named entity recognition (the task here is simple, so part-of-speech tagging is used instead), question classification, and question template filling. Some code is shown along the way, but it is not complete; for the full code, please refer to GitHub. The snippets here are meant to make the ins and outs of each function easy to follow when reading the code.
 

Key information extraction

This part obtains the computer name in the user's question through named entity recognition. In practice, part-of-speech tagging stands in for NER here, because the tagging tool can be made to label computer names directly. When jieba performs part-of-speech tagging with the custom dictionary described below, the tag has the following meaning:

nm 电脑名 (computer name)

Therefore, as long as we tag the parts of speech in the question and find the word tagged nm, that word is the computer name. This is the main idea of this part. Next, let's see how the code implements it.
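As a minimal sketch of this idea (the tagged pairs below are hand-written for illustration, not real jieba output):

```python
# Given (word, POS-tag) pairs such as jieba.posseg would produce,
# pick out the word tagged "nm" (the custom tag for computer names).
def extract_computer_name(tagged_pairs):
    """Return the first word whose POS tag is 'nm', or None."""
    for word, flag in tagged_pairs:
        if flag == "nm":
            return word
    return None

pairs = [("联想拯救者Y9000P", "nm"), ("的", "uj"), ("价格", "n")]
print(extract_computer_name(pairs))  # → 联想拯救者Y9000P
```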

  • User question part-of-speech tagging

This is implemented by the following method:

    # Requires: import re, jieba, jieba.posseg; self.raw_question holds the raw question
    def question_posseg(self):
        # Load the custom dictionary so computer names are kept whole
        jieba.load_userdict("./data/userdict3.txt")
        # Strip whitespace, ASCII punctuation, and full-width Chinese punctuation
        clean_question = re.sub(r"[\s+\.\!\/_,$%^*(+\"\')]+|[+——()?【】“”!,。?、~@#¥%……&*()]+", "", self.raw_question)
        self.clean_question = clean_question
        question_seged = jieba.posseg.cut(str(clean_question))
        result = []
        question_word, question_flag = [], []
        for w in question_seged:
            temp_word = f"{w.word}/{w.flag}"
            result.append(temp_word)
            # Collect the word and its POS tag separately
            word, flag = w.word, w.flag
            question_word.append(str(word).strip())
            question_flag.append(str(flag).strip())
        assert len(question_flag) == len(question_word)
        self.question_word = question_word
        self.question_flag = question_flag
        print(result)
        return result
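The cleanup step in the method above can be illustrated on its own; the same pattern strips whitespace, ASCII punctuation, and full-width Chinese punctuation from a sample question (the question string is just an example):

```python
import re

# The special-character cleanup used in question_posseg above.
pattern = r"[\s+\.\!\/_,$%^*(+\"\')]+|[+——()?【】“”!,。?、~@#¥%……&*()]+"
raw_question = "联想拯救者Y9000P的价格是多少?"
clean_question = re.sub(pattern, "", raw_question)
print(clean_question)  # → 联想拯救者Y9000P的价格是多少
```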

Jieba performs word segmentation and part-of-speech tagging in one pass, so if the segmentation is wrong, the tagging will usually be wrong too. For example, the computer name 拯救者 may be split into:

拯  救  者

In that case the entity 拯救者 will certainly not be recognized, so I added a custom dictionary containing all the computer names involved; jieba then consults this dictionary when segmenting and tagging. Each line of a jieba user dictionary has the format "word frequency POS-tag", where the frequency and the tag are both optional. The custom dictionary used here looks like this:

联想 小新14 15 nm
联想Yoga Pro14s 15 nm
联想小新Air14 15 nm
联想小新Pro14 15 nm
联想Yoga Pro 15 nm
联想小新Pro16 15 nm
联想拯救者Y9000P 15 nm
联想Yoga Pro 15 nm
联想Yoga 15 nm
联想拯救者R9000X 15 nm
联想Yoga 15 nm
联想拯救者Y9000X 15 nm
联想拯救者Y9000P 15 nm
联想Yoga Pro14s 15 nm
联想拯救者Y9000X 15 nm
拯救者 15 nm
小新 15 nm
联想拯救者Y7000P nm
联想小新Pro 13 15 nm
联想拯救者Y7000 15 nm
联想Yoga Pro14s 15 nm
联想小新Air14Plus 15 nm
联想Yoga 13s nm
小新 15 nm
联想拯救者Y7000P

With the dictionary in place, the computer names involved are no longer split apart by jieba's word segmentation. In addition, special characters in the question need to be removed. After these steps, segmentation and part-of-speech tagging can run, and the processed result is returned.
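To make the dictionary format concrete, here is a small parser for the "word frequency POS-tag" line layout (pure Python, independent of jieba; some lines above legitimately omit the frequency or the tag):

```python
def parse_userdict_line(line):
    """Split a jieba user-dictionary line into (word, freq, tag).
    Format is "word [freq] [tag]"; freq and tag are both optional."""
    parts = line.strip().split()
    word, freq, tag = parts[0], None, None
    for p in parts[1:]:
        if p.isdigit():
            freq = int(p)    # numeric token is the frequency
        else:
            tag = p          # non-numeric token is the POS tag
    return word, freq, tag

print(parse_userdict_line("联想拯救者Y9000P 15 nm"))  # → ('联想拯救者Y9000P', 15, 'nm')
print(parse_userdict_line("联想拯救者Y7000P"))        # → ('联想拯救者Y7000P', None, None)
```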

  • Question classification and template filling

The next step is to classify the user's question and retrieve the corresponding question template, so as to understand the user's intent. First, the question examples introduced earlier (written to match how users actually ask) are used to train a classifier; here I use sklearn's naive Bayes classifier.
The first step is to organize the training data:

    def read_train_data(self):
        train_x = []
        train_y = []
        # getfilelist is a helper that lists the files under the directory
        file_list = getfilelist("./data/question/")
        # Iterate over all files
        for one_file in file_list:
            # Extract the digits from the file name
            num = re.sub(r'\D', "", one_file)
            # If the file name contains digits, read the file
            if str(num).strip() != "":
                # The digits serve as the label for this file's data
                label_num = int(num)
                # Read the file contents
                with open(one_file, "r", encoding="utf-8") as fr:
                    data_list = fr.readlines()
                    for one_line in data_list:
                        word_list = list(jieba.cut(str(one_line).strip()))
                        # Add this line to the training set
                        train_x.append(" ".join(word_list))
                        train_y.append(label_num)
        return train_x, train_y
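The label for each file comes from the digits in its name, extracted with `re.sub(r'\D', "", ...)`. For example (the file name below is hypothetical, following the convention used by read_train_data):

```python
import re

# The class label is whatever digits remain after stripping non-digits
# from the file name.
filename = "./data/question/【3】简介.txt"
num = re.sub(r'\D', "", filename)  # remove every non-digit character
print(num)  # → "3"
```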

Next, train a multi-class naive Bayes model and return it:

    # Train the model (multinomial naive Bayes)
    def train_model_NB(self):
        X_train, y_train = self.train_x, self.train_y
        self.tv = TfidfVectorizer()

        train_data = self.tv.fit_transform(X_train).toarray()
        clf = MultinomialNB(alpha=0.01)
        clf.fit(train_data, y_train)
        return clf
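The code above relies on sklearn's TfidfVectorizer and MultinomialNB. As a rough, self-contained illustration of what the classifier does under the hood, here is a toy multinomial naive Bayes on raw word counts (this is not the author's code, and the TF-IDF weighting is omitted for brevity; alpha is the same smoothing parameter):

```python
import math
from collections import Counter, defaultdict

def train_nb(train_x, train_y, alpha=0.01):
    """Count words per class over space-separated documents."""
    vocab = set(w for doc in train_x for w in doc.split())
    counts = defaultdict(Counter)   # class -> word counts
    class_docs = Counter(train_y)   # class -> number of documents
    for doc, label in zip(train_x, train_y):
        counts[label].update(doc.split())
    return vocab, counts, class_docs, alpha

def predict_nb(model, question):
    """Pick the class with the highest smoothed log-likelihood."""
    vocab, counts, class_docs, alpha = model
    total_docs = sum(class_docs.values())
    best, best_score = None, float("-inf")
    for label, n_docs in class_docs.items():
        total_words = sum(counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for w in question.split():
            if w in vocab:
                score += math.log((counts[label][w] + alpha) /
                                  (total_words + alpha * len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Tiny made-up training set: class 1 = 价格 (price), class 5 = 介绍 (intro)
train_x = ["拯救者 价格", "小新 价格", "拯救者 介绍", "小新 介绍"]
train_y = [1, 1, 5, 5]
model = train_nb(train_x, train_y)
print(predict_nb(model, "拯救者 Y9000P 价格"))  # → 1
```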

Use the trained model to classify new questions:

    # Predict the category of a new question
    def predict(self,question):
        question=[" ".join(list(jieba.cut(question)))]
        test_data=self.tv.transform(question).toarray()
        y_predict = self.model.predict(test_data)[0]
        print("question type:",y_predict)
        return y_predict

The model returns the category number of the user's question; each number corresponds to a question template:

0:nm 评论
1:nm 价格
2:nm 类型
3:nm 简介
4:nm 配置
5:nnt 介绍
6:nnt 电脑重量
7:nnt 电脑评论数 大于 x
8:nnt 电脑好评度 大于 x
9:nnt 颜色
10:nnt 处理器
11:nnt 显卡
12:nnt 包邮
13:nnt 有货
14:nnt 快递
15:nnt 屏幕刷新率
16:nnt 固态
17:nnt 产地
18:nnt 厚度
19:nnt 屏幕色域

The correspondence between numbers and templates can be stored in a dictionary in advance; once a number has been predicted, it is used directly as the dictionary key to look up the question template. For example, if the predicted result is 2, the corresponding template nm 类型 means asking about the type of a certain computer. Combined with the computer name obtained in the previous stage, a new question can be formed, such as:

拯救者  类型
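Storing the number-to-template mapping and filling the abstract slot might look like this (a minimal sketch with a truncated template dictionary; the entries match the list above):

```python
# Mapping from predicted class number to question template (truncated).
templates = {0: "nm 评论", 1: "nm 价格", 2: "nm 类型", 5: "nnt 介绍"}

def fill_template(class_num, computer_name):
    """Replace the abstract slot (nm or nnt) with the extracted computer name."""
    template = templates[class_num]
    return template.replace("nm", computer_name).replace("nnt", computer_name)

print(fill_template(2, "拯救者"))  # → 拯救者 类型
```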

The key parts have now all been introduced. To summarize, when a new question arrives:

1. Obtain the raw question.
2. Extract the key information from the question, e.g. extract the computer name 拯救者Y9000P from the question "拯救者Y9000P介绍".
3. Use the classification model to predict the question category, i.e. the template number, then look up the template in the dictionary, e.g. dict[5]: nnt 介绍.
4. Replace the abstract part of the template to obtain: 拯救者Y9000P 介绍.
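The four steps above can be sketched end to end as follows; the tagging and classification steps are stubbed out (the real code uses jieba and the trained naive Bayes model), so this only shows how the pieces connect:

```python
def tag_question(question):
    # Stub standing in for jieba.posseg: returns (word, POS-tag) pairs.
    return [("拯救者Y9000P", "nm"), ("介绍", "v")]

def classify_question(question):
    # Stub standing in for the trained classifier; 5 maps to "nnt 介绍".
    return 5

templates = {5: "nnt 介绍"}

def answer_pipeline(raw_question):
    # Step 2: extract the nm-tagged computer name from the tagged question
    name = next(w for w, f in tag_question(raw_question) if f == "nm")
    # Step 3: predict the template number and look up the template
    template = templates[classify_question(raw_question)]
    # Step 4: substitute the abstract slot with the concrete name
    return template.replace("nnt", name)

print(answer_pipeline("拯救者Y9000P介绍"))  # → 拯救者Y9000P 介绍
```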

Origin blog.csdn.net/weixin_45897172/article/details/130998530