Intelligent campus knowledge graph question answering and recommendation system based on BERT + Attention + LSTM, an applied NLP project (complete Python source code, trained models, and data set included)



Preface

This project combines Google's BERT, an Attention-based model pre-trained on a large-scale corpus, with an LSTM named entity recognition network. The goal is to design a set of general question answering processing logic that supports intelligent question answering tasks.

First, we adopt the BERT model, a powerful pre-trained model in natural language processing. Its deep contextual understanding and information extraction capabilities help the system interpret complex natural language questions.

Next, we built an LSTM named entity recognition network. This network identifies named entities in text, such as names of people, places, and organizations. This is essential for the accuracy of the question answering system, as it pinpoints the entities mentioned in a question so that relevant information can be retrieved.

In designing the project, we focused on general processing logic so that the question answering system can adapt to questions across domains and topics. This logic covers question parsing, text understanding, named entity recognition, and answer generation.

Finally, we implemented an intelligent question answering system that accepts user questions, understands them with the BERT model and the LSTM named entity recognition network, and returns accurate answers. Its generality gives it broad potential across many fields, from answering frequently asked questions to handling knowledge queries in specialized domains.
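To make this processing logic concrete, here is a minimal sketch of the pipeline described above. Every name in it is an illustrative placeholder, not the project's actual API; each step is passed in as a callable so the sketch stays self-contained:

def answer_question(question, extract, correct, classify, build_cypher, run_query, prettify):
    """End-to-end QA pipeline sketch; all step names are hypothetical placeholders."""
    entities = correct(extract(question))                 # BERT+LSTM NER, then edit-distance correction
    q_type = classify(question, entities)                 # keyword-based question category retrieval
    records = run_query(build_cypher(q_type, entities))   # Cypher query against the Neo4j knowledge graph
    return prettify(q_type, records)                      # fill a templated natural-language answer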

Overall design

This part includes the overall system structure diagram and system flow chart.

Overall system structure diagram

The overall structure of the system is shown in the figure.

[Figure: overall system structure diagram]

System flow chart

The system flow is shown in the figure.

[Figure: system flow chart]
The Neo4j database process is shown in the figure.

[Figure: Neo4j database flow chart]

Operating environment

This section includes the Python environment and the server environment.

Python environment

Python 3.7 or above is required. On Windows, download Anaconda to complete the required Python configuration; the download address is https://www.anaconda.com/ . You can also run the code on Linux in a virtual machine. TensorFlow 1.0, NumPy, py-Levenshtein, jieba, and scikit-learn are installed according to requirement.txt in the root directory:

pip install -r requirement.txt -i https://pypi.tuna.tsinghua.edu.cn/simple 

Server environment

Mac/Windows 10 users can access the server through SSH (Secure Shell) directly from the terminal. Windows 7 users can install OpenSSH for access.

OpenSSH is a free and open-source implementation of the SSH protocol, used for remote control of computers and file transfer between them. Traditional tools for these tasks, such as Telnet (a terminal emulation protocol), RCP, FTP, rlogin, and RSH, transmit passwords in clear text and are therefore insecure.

OpenSSH provides a server-side daemon and client-side tools that encrypt the data exchanged during remote control and file transfer, replacing those older services. The download address is https://www.mls-software.com/opensshd.html . After downloading, install with the default options, then open a cmd command window for remote operation, as shown in the figure.

[Figure: remote operation in the cmd command window]
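For example, once OpenSSH is installed, a connection is opened from cmd with a command of the following form (the user name and host below are placeholders, not the project's actual server):

ssh username@server_address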

Module implementation

This project comprises five modules: constructing the data set, the recognition network, named entity error correction, question category retrieval, and query results. Each module's function and related code are given below.

1. Data set construction

The data is crawled from the Beijing University of Posts and Telecommunications library website and mainly includes information such as teachers' phone numbers, research directions, and genders, as well as course credits and semesters. The crawled information is assembled into questions following Chinese phrasing conventions via loop statements, and the constructed sentences are then annotated. Characters outside any entity are labeled O, and useful entities fall into three categories: TEA (teacher), COU (course), and DIR (research direction). Annotation follows the BIO scheme: the first character of an entity is labeled B- plus the entity category, and the remaining characters of the entity are labeled I- plus the entity category. The training set data is shown in the figure.
[Figure: a sample of the annotated training data]
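As a concrete illustration of this format (the sentence is invented for illustration; it asks for teacher Wang Hong's research direction), a labeled question is stored as one character and its tag per line, tab-separated:

王	B-TEA
红	I-TEA
老	O
师	O
的	O
研	O
究	O
方	O
向	O
是	O
什	O
么	O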
The relevant code for loading the training set is as follows:

@classmethod
def _read_data(cls, input_file):
    # Read the data set file: one character and its label per line, tab-separated
    with codecs.open(input_file, 'r', encoding='utf-8') as f:
        lines = []
        words = []
        labels = []
        for line in f:
            contends = line.strip()
            tokens = contends.split('\t')
            if len(tokens) == 2:
                words.append(tokens[0])
                labels.append(tokens[1])
            else:
                if len(contends) == 0:
                    # Blank line marks the end of a sentence: flush the accumulated
                    # characters and labels as one space-joined example
                    l = ' '.join([label for label in labels if len(label) > 0])
                    w = ' '.join([word for word in words if len(word) > 0])
                    lines.append([l, w])
                    words = []
                    labels = []
                    continue
            if contends.startswith("-DOCSTART-"):
                words.append('')
                continue
        return lines

# Load the training set
def get_train_examples(self, data_dir):
    return self._create_example(
        self._read_data(os.path.join(data_dir, "train.txt")), "train"
    )

# Load the validation set
def get_dev_examples(self, data_dir):
    return self._create_example(
        self._read_data(os.path.join(data_dir, "dev.txt")), "dev"
    )

# Load the test set
def get_test_examples(self, data_dir):
    return self._create_example(
        self._read_data(os.path.join(data_dir, "test.txt")), "test")
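For reference, each element of the returned lines list pairs the space-joined label sequence with the space-joined character sequence. On the invented sample shown earlier, the result would look like this:

# Illustrative output of _read_data (based on the invented sample above, not actual project data):
# [['B-TEA I-TEA O O O O O O O O O O', '王 红 老 师 的 研 究 方 向 是 什 么']]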

2. Recognition network

The recognition network combines Google's BERT with downstream LSTM model code, which is adapted and then trained.

def train_ner():  # Define training
    import os
    from bert_base.train.train_helper import get_args_parser
    from bert_base.train.bert_lstm_ner import train
    args = get_args_parser()
    if True:
        import sys
        param_str = '\n'.join(['%20s = %s' % (k, v) for k, v in sorted(vars(args).items())])
        print('usage: %s\n%20s   %s\n%s\n%s\n' % (' '.join(sys.argv), 'ARG', 'VALUE', '_' * 50, param_str))
    print(args)
    os.environ['CUDA_VISIBLE_DEVICES'] = args.device_map
    train(args=args)
# Data processing code
def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, output_dir, mode):
    # Process one sample: convert characters and labels to IDs and pack them
    # into an input-feature object
    label_map = {}
    # Index the labels starting from 1
    for (i, label) in enumerate(label_list, 1):
        label_map[label] = i
    # Save the label->index mapping
    if not os.path.exists(os.path.join(output_dir, 'label2id.pkl')):
        with codecs.open(os.path.join(output_dir, 'label2id.pkl'), 'wb') as w:
            pickle.dump(label_map, w)
    textlist = example.text.split(' ')
    labellist = example.label.split(' ')
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        # Tokenize; words not in BERT's vocab.txt get WordPiece treatment.
        # For pure character-level splitting this could be replaced with list(input).
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:  # This branch rarely triggers for Chinese characters
                labels.append("X")
    # tokens = tokenizer.tokenize(example.text)
    # Truncate the sequence; -2 leaves room for the sentence-start and sentence-end markers
    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]
        labels = labels[0:(max_seq_length - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # Mark the start of the sentence with [CLS]
    segment_ids.append(0)
    # append("O") or append("[CLS]")? Using O would reduce the number of labels,
    # but here sentence start and end are tagged with distinct markers
    label_ids.append(label_map["[CLS]"])
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label_map[labels[i]])
    ntokens.append("[SEP]")  # Mark the end of the sentence with [SEP]
    segment_ids.append(0)
    # append("O") or append("[SEP]")?
    label_ids.append(label_map["[SEP]"])
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)  # Convert the tokens (ntokens) to IDs
    input_mask = [1] * len(input_ids)
    # label_mask = [1] * len(input_ids)
    # Pad up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        label_ids.append(0)
        ntokens.append("**NULL**")
        # label_mask.append(0)
    # print(len(input_ids))
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length
    # assert len(label_mask) == max_seq_length
    # Print information for the first few samples
    if ex_index < 5:
        tf.logging.info("*** Example ***")
        tf.logging.info("guid: %s" % (example.guid))
        tf.logging.info("tokens: %s" % " ".join(
            [tokenization.printable_text(x) for x in tokens]))
        tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
        tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
        tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
        tf.logging.info("label_ids: %s" % " ".join([str(x) for x in label_ids]))
        # tf.logging.info("label_mask: %s" % " ".join([str(x) for x in label_mask]))
    # Pack everything into an InputFeatures object
    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids,
        # label_mask = label_mask
    )
    # Only takes effect when mode='test'
    write_tokens(ntokens, output_dir, mode)
    return feature
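For orientation, the following is a minimal sketch of the kind of LSTM layer that bert_lstm_ner places on top of BERT's sequence output, assuming TensorFlow 1.x; the project's actual code may differ (it typically adds dropout, a CRF layer, and training plumbing):

import tensorflow as tf  # TensorFlow 1.x assumed

def lstm_logits(bert_sequence_output, sequence_lengths, num_labels, hidden_size=128):
    # bert_sequence_output: [batch, max_seq_length, 768] token embeddings from BERT
    cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
    cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, bert_sequence_output,
        sequence_length=sequence_lengths, dtype=tf.float32)
    output = tf.concat([out_fw, out_bw], axis=-1)  # [batch, max_seq_length, 2*hidden_size]
    return tf.layers.dense(output, num_labels)     # per-token label scores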

3. Named entity error correction

Identified course entities are corrected against the full course names stored in course.txt, using shortest edit distance matching combined with an inclusion (substring) search.

import csv
import copy
import Levenshtein as ls  # py-Levenshtein, listed in requirement.txt

class Select_course:
    def __init__(self):
        self.f = csv.reader(open('QA/dict/course.txt', 'r'))
        self.course_name = [i[0].strip() for i in self.f]
        self.led = 3            # maximum edit distance considered
        self.limit_num = 10     # maximum number of candidates kept
        self.select_word = []
        self.is_same = False
        self.have_same_length = False
        self.input_word = ''
        self.is_include = False
        # print(self.course_name)
        # print('Course list created....')

    # Inclusion search: keep courses that contain every character of the input
    def select_first(self, input_word):
        self.select_word = []
        self.is_same = False
        self.is_include = False
        self.have_same_length = False
        self.input_word = input_word
        if input_word in self.course_name:
            self.is_same = True
            self.select_word.append(input_word)
        if self.is_same == False:
            for i in self.course_name:
                mark = True
                for one_word in input_word:
                    if not one_word in i:
                        mark = False
                if mark:
                    self.select_word.append(i)
            if len(self.select_word) != 0:
                self.is_include = True
        # print('First-round screening:')
        # print(self.select_word)

    # Fuzzy search: keep the candidates with the smallest edit distance
    def select_second(self):
        self.led = 3
        if self.is_same or self.is_include:
            return
        for name in self.course_name:
            ed = ls.distance(self.input_word, name)
            if ed <= self.led:
                self.led = ed
                self.select_word.append(name)
        select_word_copy1 = copy.deepcopy(self.select_word)
        for name in select_word_copy1:
            ed = ls.distance(self.input_word, name)
            if ed > self.led:
                self.select_word.remove(name)
            if ed == self.led and len(name) == len(self.input_word):
                self.have_same_length = True
        # print('Second-round screening:')
        # print(self.select_word)
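A minimal usage sketch (the course name below is invented for illustration; real names come from course.txt):

# Hypothetical usage: correct a possibly misspelled course name
sc = Select_course()
sc.select_first('高等数学')   # exact and inclusion matching first
sc.select_second()            # fall back to edit-distance matching
print(sc.select_word)         # candidate course names, best matches kept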

Identified teacher entities are corrected against the full names of all teachers stored in teacher.csv, again using shortest edit distance matching; the correction logic is designed to match the typical ways users mistype names.

class Select_name:
    def __init__(self):  # Initialization
        self.f = csv.reader(open('QA/dict/teacher.csv', 'r'))
        self.teacher_name = [i[0] for i in self.f]
        self.led = 3            # maximum edit distance considered
        self.limit_num = 10     # maximum number of candidates kept
        self.select_word = []
        self.have_same_length = False
        self.is_same = False
        self.input_word = ''
        # print(self.teacher_name)
        # print('Teacher list created....')

    def select_first(self, input_word):  # First-round screening by edit distance
        self.select_word = []
        self.have_same_length = False
        self.is_same = False
        self.input_word = input_word
        if input_word in self.teacher_name:
            self.is_same = True
            self.select_word.append(input_word)
        if self.is_same == False:
            for name in self.teacher_name:
                ed = ls.distance(self.input_word, name)
                if ed <= self.led:
                    self.led = ed
                    self.select_word.append(name)
            select_word_copy1 = copy.deepcopy(self.select_word)
            for name in select_word_copy1:
                ed = ls.distance(self.input_word, name)
                if ed > self.led:
                    self.select_word.remove(name)
                if ed == self.led and len(name) == len(self.input_word):
                    self.have_same_length = True
        # print('First-round screening:')
        # print(self.select_word)
        return

    def select_second3(self):  # Second round: same-length filter for 3-character names
        if self.is_same == True or len(self.input_word) != 3:
            return
        select_word_copy2 = copy.deepcopy(self.select_word)
        if self.have_same_length:
            for name in select_word_copy2:
                if len(self.input_word) != len(name):
                    self.select_word.remove(name)
        # print('Second-round screening:')
        # print(self.select_word)

    def select_third3(self):  # Third round: prefer names whose first and last characters match
        if self.is_same == True or len(self.input_word) != 3:
            return
        select_word_copy3 = copy.deepcopy(self.select_word)
        self.select_word = []
        for name in select_word_copy3:
            if name[0] == self.input_word[0] and name[2] == self.input_word[2]:
                self.select_word.append(name)
        for name in select_word_copy3:
            if not (name[0] == self.input_word[0] and name[2] == self.input_word[2]):
                self.select_word.append(name)
        # print('Third-round screening:')
        # print(self.select_word)

    def limit_name_num(self):  # Cap the candidate list size
        while len(self.select_word) > self.limit_num:
            self.select_word.pop()
        # print('List size limited:')
        # print(self.select_word)
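A minimal usage sketch matching the Wang Hong example discussed later (the candidate names shown are illustrative):

# Hypothetical usage: correct the user-entered name 王红 (Wang Hong)
sn = Select_name()
sn.select_first('王红')   # edit-distance screening
sn.select_second3()       # no effect here: the input is not 3 characters long
sn.select_third3()
sn.limit_name_num()
print(sn.select_word)     # e.g. candidates such as 王春红, 王晓红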

4. Question category retrieval

The keyword lists for the three categories are as follows (the actual keywords are Chinese; English glosses are shown here, with duplicates removed):

self.direction_qwds = ["engaged in", "expertise", "specialization", "interest", "direction", "aspect", "research", "scientific research"]

self.location_qwds = ["address", "location", "place", "where", "office", "find", "see"]

self.telephone_qwds = ["landline", "telephone", "number", "contact"]

Questions are classified based on the recognized entity categories and the matched keywords. The relevant code is as follows:

if self.check_words(self.direction_qwds, question) and ('teacher' in types):
    question_type = 'teacher_direction'
    question_types.append(question_type)
if self.check_words(self.location_qwds, question) and ('teacher' in types):
    question_type = 'teacher_location'
    question_types.append(question_type)
if self.check_words(self.telephone_qwds, question) and ('teacher' in types):
    question_type = 'teacher_telephone'
    question_types.append(question_type)
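check_words is not shown in the excerpt above; a plausible minimal implementation, assuming it simply tests for keyword occurrence in the question, is:

def check_words(self, wds, sent):
    # Return True if any keyword from wds occurs in the question sentence
    return any(wd in sent for wd in wds)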

5. Query results

The identified question category is translated into a database query statement. The relevant code is as follows:

if final_question_type == 'teacher_direction':  
	sql = "MATCH (m:Teacher) where m.name = '{0}' return m.name, m.research_direction".format(i)  
if final_question_type == 'teacher_location':  
	sql = "MATCH (m:Teacher) where m.name = '{0}' return m.name, m.office_location".format(i)  
if final_question_type == 'teacher_telephone':  
	sql = "MATCH (m:Teacher) where m.name = '{0}' return m.name, m.telephone".format(i)  
# Connect to the Neo4j database
def __init__(self):
    self.g = Graph(
        "http://10.3.55.50:7474/browser",
        user="********",
        password="********")
    self.num_limit = 30
# Run the queries and return templated answer sentences
def search_main(self, sqls, final_question_types):
    final_answers = []
    temp_data = []
    data = []
    for i in sqls:
        for one_sql in i:
            temp_data.append(self.g.run(one_sql).data()[0])
            # print(temp_data)
        data.append(temp_data)
        temp_data = []
    # print(data)
    temp_answer = []
    answer = []
    for i in zip(final_question_types, data):
        for one_type_and_data in zip(i[0], i[1]):
            temp_answer.append(self.answer_prettify(one_type_and_data[0], one_type_and_data[1]))
        answer.append(temp_answer)
        temp_answer = []
    return answer
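answer_prettify is referenced above but not shown; a minimal sketch of such a template function, assuming the record keys follow the return columns of the Cypher queries above and illustrative Chinese answer templates, might be:

def answer_prettify(self, question_type, record):
    # Fill a templated answer from one query record; keys follow the Cypher RETURN columns above
    if question_type == 'teacher_direction':
        # "Teacher {0}'s research direction is {1}"
        return '{0}老师的研究方向是{1}'.format(record['m.name'], record['m.research_direction'])
    if question_type == 'teacher_location':
        # "Teacher {0}'s office is at {1}"
        return '{0}老师的办公室在{1}'.format(record['m.name'], record['m.office_location'])
    if question_type == 'teacher_telephone':
        # "Teacher {0}'s telephone number is {1}"
        return '{0}老师的电话是{1}'.format(record['m.name'], record['m.telephone'])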

When a query is ambiguous, the system asks the user again to eliminate wrong alternatives. For example, if the teacher name recognized from the user's input is Wang Hong, but the query shows that Beijing University of Posts and Telecommunications has no Wang Hong, only Wang Chunhong and Wang Xiaohong, the user is asked a follow-up question to determine the unique entity.

ask_again = ''
final_question_types = []
for i in zip(tags, pre_words):
    # print(i)
    if len(i[1]) == 1:
        final_question_types.append(classifier.classify(text, i[0]))
        final_words.append(i[1][0])
    if len(i[1]) > 1:
        print('>1')
        if i[0] == 'teacher':
            # "Which teacher's information do you want to ask about: {0}"
            ask_again = '请问您要询问的是哪个老师的信息:{0}'.format(','.join(i[1]))
        if i[0] == 'course':
            # "Which course's information do you want to ask about: {0}"
            ask_again = '请问您要询问的是哪门课程的信息:{0}'.format(','.join(i[1]))
        # print(ask_again)
        answer_again = input(ask_again)
        final_words.append(answer_again)
        final_question_types.append(classifier.classify(text, i[0]))

System test

This part includes the named entity recognition network test and the overall test of the knowledge graph question and answer system.

1. Named entity recognition network test

Commonly used questions were entered. The test results show that the network can, on the whole, identify teacher and course entities. The model training results are shown in the figure.

[Figure: model training results]

2. Overall test of knowledge graph question answering system

Commonly used questions were entered. The answers returned by the question answering system show that it runs well and can, on the whole, answer the questions users raise. The effect is shown in the figure.

[Figure: sample answers from the question answering system]

Project source code download

For details, please see the resource download page of my blog.


Download other information

If you want to learn more about artificial intelligence learning routes and knowledge systems, you are welcome to read my other blog post, "Heavyweight | Complete Artificial Intelligence AI Learning - Basic Knowledge Learning Route"; all materials there can be downloaded directly from the network disk with no strings attached.
That post draws on well-known open-source platforms on GitHub, AI technology platforms, and experts in related fields, including Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang, and collects nearly 100 GB of related materials. I hope it helps all my friends.
