An assisted generation system for article abstracts, titles and keywords based on TextRank + Seq2Seq + PyQt5 (including all Python project source code) + training data set



Foreword

Using academic papers, Wikipedia and other data sets, this project applies and improves the TextRank and Seq2Seq algorithms to build an integrated system that assists in generating article abstracts, titles and keywords. To make it easy to use, we designed a graphical interface and packaged the entire program as an executable file that runs directly on a PC.

First, we use data sets such as academic papers and Wikipedia as training samples. With a tuned TextRank algorithm, we extract the key sentences and keywords of an article to form a concise abstract.

Second, we use an improved Seq2Seq model to raise the quality of the generated content. Seq2Seq is a sequence-to-sequence model based on recurrent neural networks (RNNs) that converts an input sequence into an output sequence. With it, we generate concise, attractive titles from the article content.

We also provide a graphical interface. Users enter the article content, and the system automatically generates the abstract, titles and keywords and displays them in the interface.

For convenience, the entire project is packaged as an executable file that runs directly on a PC, so users can open the program and use it without complicated configuration.

With this system, users can quickly generate concise abstracts, titles and keywords for an article, which helps improve writing efficiency. The graphical interface and the executable package make the system convenient and intuitive to use.

Overall design

This part includes the overall structure diagram of the system and the system flow chart.

System overall structure diagram

The overall structure of the system is shown in the figure.

[Figure: overall structure of the system]

System flow chart

The system flow is shown in the figure.

[Figure: system flow chart]

Operating environment

This part includes Python environment, TextRank environment, TensorFlow environment, PyQt5 and Qt Designer runtime environment.

Python environment

Version: Python 3.5.

TextRank environment

Download the numpy-1.9.3.tar.gz, networkx-2.4.tar.gz and math-0.5.tar.gz files from the Tsinghua mirror. After decompressing them locally, open a cmd console, switch to each package's directory, and run the following command to complete the installation.

python setup.py install 

TensorFlow environment

In the terminal, use the pip command to install the tensorflow, tarfile, matplotlib and jieba packages that the TensorFlow models depend on.

pip3 install <package name>

PyQt5 and Qt Designer operating environment

Use the pip command to install the PyQt5 toolkit for Python, and configure PyUIC5 and PyQt5-tools in the environment for rapid development and conversion of graphical interfaces. Add these tools to External Tools in the PyCharm editor.

Open Qt Designer from Tools > External Tools in the PyCharm editor. If it opens as shown in the figure below, Qt Designer has been installed successfully.

[Figure: Qt Designer tool window]

Module implementation

This project includes 6 modules: data preprocessing, summary extraction, model construction and compilation, model training and storage, graphical interface development, and application packaging. The function and related code of each module are given below.

1. Data preprocessing

The raw data can be downloaded from http://www.sogou.com/labs/resource/cs.php . The unprocessed raw data is shown in the figure below.

[Figure: unprocessed raw data]

The file is too large to open in an editor, so its encoding cannot be determined by inspection. It can be decoded as GB18030. The process is as follows.

1) Data extraction and division

Use regular expressions to extract the content of the data, split it into a training set and a validation set by proportion, discard records whose text length does not meet the requirements, and store the resulting titles and contents. Then use software to change the encoding of the generated files to UTF-8. The relevant code is as follows:

a) Regular expression matching:

import re

for_title='<contenttitle>(.*)</contenttitle>'  # match the title and strip the tags
for_content='<content>(.*)</content>'  # match the content and strip the tags
p1=re.compile(for_title)
p2=re.compile(for_content)
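
As a minimal sketch of how these two patterns behave, the following applies them to a made-up record. The sample string and the helper name extract_pair are illustrative assumptions, not part of the project's data or code.

```python
import re

# Same patterns as in the project code: capture the text between the tags.
P_TITLE = re.compile(r'<contenttitle>(.*)</contenttitle>')
P_CONTENT = re.compile(r'<content>(.*)</content>')

def extract_pair(record):
    """Return (title, content) from one raw record, or None if a tag is missing."""
    title = P_TITLE.findall(record)
    content = P_CONTENT.findall(record)
    if not title or not content:
        return None
    return title[0], content[0]

# Hypothetical sample record, for illustration only.
sample = ("<doc><contenttitle>Sample headline</contenttitle>"
          "<content>Sample body text.</content></doc>")
pair = extract_pair(sample)
```

Returning None for incomplete records lets the caller skip them instead of failing on an empty match list.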

b) Data filtering for writing process:

for i in range(4,len(data.values)+1,6):  # select the records by position
    n=p2.findall(str(data.values[i]))
    text=n[0]
    # assumption: jieba.posseg is imported as pseg, so each token exposes .word
    word=pseg.cut(text)
    result=''
    for w in word:
        result=result+str(w.word)+' '
    # replace unexpected characters
    result=result.replace(u'\u3000','').replace(u'\ue40c','')
    # discard the record if its length is unsuitable (too long or empty)
    if len(result)>=1024 or len(result)==0:
        id.append(i)
        continue
    if i<for_train:
        f_content_train.write(result+'\n')
    else:
        f_content_test.write(result+'\n')
    print((i/6)/len(range(3,len(data.values)+1,6)))  # progress
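
The project changes the encoding of the generated files with external software. As an alternative sketch, the same conversion could be done in Python by streaming the file in chunks, which avoids loading a very large file into memory. The function name convert_to_utf8, the chunk size and the default source encoding (assumed to be GB18030) are illustrative assumptions.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="gb18030", chunk_size=1 << 20):
    """Re-encode a text file to UTF-8 chunk by chunk."""
    # newline="" keeps the original line endings unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)  # reads up to chunk_size characters
            if not chunk:
                break
            dst.write(chunk)
```

A call such as convert_to_utf8("raw.txt", "raw_utf8.txt") (hypothetical file names) would write a UTF-8 copy next to the original.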

2) Replacement and word segmentation

Segment the extracted text into words and replace named entities with their tags. The relevant code is as follows:

def token(self, sentence):
    words = self.segmentor.segment(sentence)  # word segmentation
    words = list(words)
    postags = self.postagger.postag(words)  # part-of-speech tagging
    postags = list(postags)
    netags = self.recognizer.recognize(words, postags)  # named entity recognition
    netags = list(netags)
    result = []
    for i, j in zip(words, netags):
        if j in ['S-Nh', 'S-Ni', 'S-Ns']:
            result.append(j)  # replace the entity with its tag
            continue
        result.append(i)
    return result

Running the above code produces 4 files: 2 for the training set and 2 for the test set. The same line in each pair of files holds a title and its content, respectively. In all texts, named entities are replaced with tags and the words are segmented.
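
The tag replacement step can be illustrated without the segmentation toolkit. The sketch below assumes the words and entity tags have already been produced; the function name replace_entities and the sample lists are hypothetical, and reading the three tags as single-word person, organisation and place names is an assumption based on the tag names.

```python
# Tags treated as named entities, as in the token() method.
ENTITY_TAGS = {'S-Nh', 'S-Ni', 'S-Ns'}

def replace_entities(words, netags):
    """Replace each word tagged as a named entity with its tag; keep other words."""
    return [tag if tag in ENTITY_TAGS else word
            for word, tag in zip(words, netags)]

# Hypothetical pre-segmented input, for illustration only.
words = ['张三', '访问', '了', '北京']
netags = ['S-Nh', 'O', 'O', 'S-Ns']
result = replace_entities(words, netags)
```

Replacing rare names with a small set of tags keeps the vocabulary compact, which is presumably why the project applies this step before training.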

3) Data reading

Read the data from the generated files. The relevant code is as follows:

def read_data(source_path, target_path, max_size=None):
    data_set = [[] for _ in buckets]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            # read the source file and the target file line by line
            source, target = source_file.readline(), target_file.readline()
            counter = 0
            while source and target and (not max_size or counter < max_size):
                counter += 1
                if counter % 10000 == 0:
                    print("reading data line%d" % counter)  # progress output
                    sys.stdout.flush()
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(data_utils.EOS_ID)  # append the end-of-sequence marker
                for bucket_id, (source_size, target_size) in enumerate(buckets):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set

The code reads the two files line by line and stores each pair in the first bucket that fits it. The deep learning model is then trained on the contents and their titles. The files produced during preprocessing can be read with Python's codecs module.
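
The bucket selection rule can be sketched in isolation as follows. The bucket sizes, the EOS id and the function name assign_bucket are assumptions chosen for illustration; the project's actual values are not given in this section.

```python
# Hypothetical bucket sizes: (maximum source length, maximum target length).
BUCKETS = [(5, 3), (10, 5), (20, 8)]
EOS_ID = 2  # assumed end-of-sequence id

def assign_bucket(source_ids, target_ids, buckets=BUCKETS):
    """Return the index of the first bucket that fits the pair, or None."""
    target_ids = target_ids + [EOS_ID]  # EOS is appended before the size check
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        if len(source_ids) < source_size and len(target_ids) < target_size:
            return bucket_id
    return None  # the pair is too long for every bucket and is dropped
```

Because the buckets are ordered from small to large, each pair lands in the smallest bucket that can hold it, which limits the amount of padding.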

2. Extract summary

Most papers run to tens of thousands of words, so training and testing the model directly on the full text would consume a large amount of computing resources. Therefore, the important content is first extracted by ranking the text. The algorithm is as follows.

1) Sort iterative algorithm

First, build a two-dimensional list in which each sub-list is a sentence and its elements are words. Second, determine the edges by checking whether two words appear in the same window. After all words have been added to the graph, iterate with the PageRank algorithm until the word PR values are stable. Finally, obtain the list of important words.

def sort_words(vertex_source, edge_source, window=2, pagerank_config={'alpha': 0.85, }):
    """Sort words by importance.
    vertex_source: 2-D list; each sub-list is a sentence whose elements are words, used to build the PageRank nodes
    edge_source: 2-D list; each sub-list is a sentence whose elements are words, used to build the PageRank edges from word positions
    window: every pair of words among `window` adjacent words in a sentence is connected by an edge
    pagerank_config: PageRank settings
    """
    sorted_words = []
    word_index = {}
    index_word = {}
    _vertex_source = vertex_source
    _edge_source = edge_source
    words_number = 0
    for word_list in _vertex_source:  # process each sentence (a list of words)
        for word in word_list:
            if not word in word_index:
                # if the word is not in the dictionary yet, add it with its index
                word_index[word] = words_number
                index_word[words_number] = word  # reverse mapping
                words_number += 1
    # build a words_number * words_number matrix for the graph computation
    graph = np.zeros((words_number, words_number))
    for word_list in _edge_source:
        for w1, w2 in combine(word_list, window):
            if w1 in word_index and w2 in word_index:
                # connect words that appear in the same window
                index1 = word_index[w1]
                index2 = word_index[w2]
                graph[index1][index2] = 1.0
                graph[index2][index1] = 1.0
    nx_graph = nx.from_numpy_matrix(graph)  # build the graph from the adjacency matrix
    scores = nx.pagerank(nx_graph, **pagerank_config)  # iterate with PageRank
    sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for index, score in sorted_scores:
        item = AttrDict(word=index_word[index], weight=score)
        sorted_words.append(item)
    return sorted_words
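
The same idea can be sketched without numpy or networkx. In the version below, combine is a guess at the helper used by sort_words (it pairs words that lie within the window), and rank_words replaces nx.pagerank with a plain power iteration. It is a simplified illustration, not the project's implementation, and the fixed iteration count is an assumption.

```python
def combine(word_list, window=2):
    """Yield pairs of words that are at most window - 1 positions apart."""
    for offset in range(1, max(window, 2)):
        for w1, w2 in zip(word_list, word_list[offset:]):
            yield w1, w2

def rank_words(sentences, window=2, alpha=0.85, iterations=50):
    """Rank words by PageRank score on an undirected co-occurrence graph."""
    neighbours = {}
    for sentence in sentences:
        for word in sentence:
            neighbours.setdefault(word, set())
        for w1, w2 in combine(sentence, window):
            if w1 != w2:
                neighbours[w1].add(w2)
                neighbours[w2].add(w1)
    n = len(neighbours)
    scores = {w: 1.0 / n for w in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbours:
            # each neighbour passes on an equal share of its current score
            incoming = sum(scores[v] / len(neighbours[v]) for v in neighbours[w])
            new_scores[w] = (1 - alpha) / n + alpha * incoming
        scores = new_scores
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rank_words([['a', 'b'], ['b', 'c'], ['b', 'd']])
```

In this toy input the word 'b' co-occurs with every other word, so it ends up with the highest score.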

2) Sentence similarity algorithm

When TextRank is used to extract sentences, each node is a sentence, and the weight of the edge between two nodes is the similarity of the two sentences. The relevant code is as follows:

import math

def get_similarity(word_list1, word_list2):  # similarity of two sentences
    """Default function for computing the similarity of two sentences.
    word_list1, word_list2: the two sentences, each a list of words
    """
    words = list(set(word_list1 + word_list2))
    # count how often each word occurs in each sentence
    vector1 = [float(word_list1.count(word)) for word in words]
    vector2 = [float(word_list2.count(word)) for word in words]
    vector3 = [vector1[x] * vector2[x] for x in range(len(vector1))]
    vector4 = [1 for num in vector3 if num > 0.]
    co_occur_num = sum(vector4)  # numerator: number of shared words
    if abs(co_occur_num) <= 1e-12:
        return 0.
    denominator = math.log(float(len(word_list1))) + math.log(float(len(word_list2)))  # denominator
    if abs(denominator) < 1e-12:
        return 0.
    return co_occur_num / denominator  # similarity of the two sentences
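
To make the formula concrete, the sketch below computes the same quantity (number of shared distinct words divided by the sum of the logarithms of the two sentence lengths) on two made-up sentences. The function name sentence_similarity and the sample sentences are illustrative assumptions.

```python
import math

def sentence_similarity(words1, words2):
    """Shared distinct words divided by log(len(words1)) + log(len(words2))."""
    shared = len(set(words1) & set(words2))
    if shared == 0:
        return 0.0
    denominator = math.log(len(words1)) + math.log(len(words2))
    if abs(denominator) < 1e-12:  # both sentences contain a single word
        return 0.0
    return shared / denominator

s1 = ['the', 'model', 'generates', 'a', 'title']
s2 = ['the', 'model', 'extracts', 'keywords']
score = sentence_similarity(s1, s2)  # 2 shared words / (log 5 + log 4)
```

The logarithms in the denominator dampen the advantage that long sentences would otherwise get from simply containing more words.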

3. Model construction and compilation

After the data set is prepared, build the model, define its inputs, and determine the loss function.

1) Model building

The model is based on the one provided by TensorFlow, and its parameters are passed through a configuration class:

class LargeConfig(object):  # network hyperparameters
    learning_rate = 1.0  # learning rate
    init_scale = 0.04
    learning_rate_decay_factor = 0.99  # learning rate decay
    max_gradient_norm = 5.0
    num_samples = 4096  # number of samples for sampled softmax
    batch_size = 64
    size = 256  # number of units per layer
    num_layers = 4  # number of layers
    vocab_size = 50000

# build the model
def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
    return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
        encoder_inputs,  # input sentence
        decoder_inputs,  # output sentence
        cell,  # cell to use, LSTM or GRU
        num_encoder_symbols=source_vocab_size,  # size of the source vocabulary
        num_decoder_symbols=target_vocab_size,  # size of the target vocabulary
        embedding_size=size,  # embedding size
        output_projection=output_projection,  # depends on the vocabulary size
        feed_previous=do_decode,  # training (False) or decoding (True)
        dtype=tf.float32)

2) Define the model input

The model accepts its input in buckets, so a placeholder must be constructed for each position of the largest bucket.

    # inputs (under Python 3, xrange assumes "from six.moves import xrange")
    self.encoder_inputs = []
    self.decoder_inputs = []
    self.target_weights = []
    for i in xrange(buckets[-1][0]):
      self.encoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                                name="encoder{0}".format(i)))
    # one placeholder per position, named encoder0, encoder1, ...
    for i in xrange(buckets[-1][1] + 1):
      self.decoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                                name="decoder{0}".format(i)))
      self.target_weights.append(tf.placeholder(tf.float32, shape=[None],
                                                name="weight{0}".format(i)))
    # target_weights has the same shape as decoder_outputs;
    # positions beyond the length of the target sequence are filled with 0
    # the targets are the decoder inputs shifted by one position
    targets = [self.decoder_inputs[i + 1]
               for i in xrange(len(self.decoder_inputs) - 1)]
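
The relationship between decoder inputs, shifted targets and target weights can be shown on a single sequence in plain Python. The special ids PAD_ID and GO_ID and the function name build_decoder_example are assumptions for illustration; the project prepares these values in batches inside the model class.

```python
PAD_ID, GO_ID = 0, 1  # assumed special token ids

def build_decoder_example(target_ids, decoder_size):
    """Pad one target sequence and derive its shifted targets and weights."""
    pad = [PAD_ID] * (decoder_size - len(target_ids) - 1)
    decoder_input = [GO_ID] + target_ids + pad
    # The target at step i is the decoder input at step i + 1.
    targets = decoder_input[1:] + [PAD_ID]
    # Positions whose target is padding do not contribute to the loss.
    weights = [0.0 if t == PAD_ID else 1.0 for t in targets]
    return decoder_input, targets, weights

# Target sequence 7, 8 followed by an end marker with id 2, in a decoder of size 6.
dec_in, tgt, w = build_decoder_example([7, 8, 2], decoder_size=6)
```

Here the decoder sees the GO marker first and is trained to predict 7, 8 and the end marker, while the padded positions carry zero weight.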

3) Determine the loss function

For the loss function, TensorFlow's sampled_softmax_loss() function is used.

def sampled_loss(labels, inputs):  # sampled softmax (candidate sampling) loss
  labels = tf.reshape(labels, [-1, 1])
  # compute sampled_softmax_loss in 32-bit floats to avoid numerical instability
  # w_t: transposed output projection weights, assumed to be defined alongside b
  local_w_t = tf.cast(w_t, tf.float32)
  local_b = tf.cast(b, tf.float32)
  local_inputs = tf.cast(inputs, tf.float32)
  return tf.cast(
      tf.nn.sampled_softmax_loss(  # loss function
          weights=local_w_t,
          biases=local_b,
          labels=labels,
          inputs=local_inputs,
          num_sampled=num_samples,
          num_classes=self.target_vocab_size), tf.float32)

4. Model training and storage

After the model structure is set, define the training function, then import and call the trained model.

1) Define the model training function

Define model training functions and related operations.

def train():
  # prepare the headline data
  print("Preparing Headline data in %s" % FLAGS.data_dir)
  src_train, dest_train, src_dev, dest_dev, _, _ = data_utils.prepare_headline_data(
      FLAGS.data_dir, FLAGS.vocab_size)
  # process the data: build the vocabulary, convert words to IDs, return the file paths
  config = tf.ConfigProto(device_count={"CPU": 4},
                          inter_op_parallelism_threads=1,
                          intra_op_parallelism_threads=2)
  with tf.Session(config=config) as sess:
    print("Creating %d layers of %d units." % (FLAGS.num_layers, FLAGS.size))
    model = create_model(sess, False)  # create the model
    print("Reading development and training data (limit: %d)."
          % FLAGS.max_train_data_size)
    dev_set = read_data(src_dev, dest_dev)
    train_set = read_data(src_train, dest_train, FLAGS.max_train_data_size)
    train_bucket_sizes = [len(train_set[b]) for b in xrange(len(buckets))]
    train_total_size = float(sum(train_bucket_sizes))
    trainbuckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                          for i in xrange(len(train_bucket_sizes))]
    # training loop
    step_time, loss = 0.0, 0.0
    current_step = 0
    previous_losses = []
    while True:
      # randomly choose a bucket for this step
      random_number_01 = np.random.random_sample()
      bucket_id = min([i for i in xrange(len(trainbuckets_scale))
                       if trainbuckets_scale[i] > random_number_01])
      start_time = time.time()
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          train_set, bucket_id)
      _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                   target_weights, bucket_id, False)
      step_time += (time.time() - start_time) / FLAGS.steps_per_checkpoint
      loss += step_loss / FLAGS.steps_per_checkpoint
      current_step += 1
      if current_step % FLAGS.steps_per_checkpoint == 0:
        perplexity = math.exp(float(loss)) if loss < 300 else float("inf")
        print("global step %d learning rate %.4f step-time %.2f perplexity "
              "%.2f" % (model.global_step.eval(), model.learning_rate.eval(),
                        step_time, perplexity))  # report statistics
        # decay the learning rate if the loss is worse than the last 3 checkpoints
        if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
          sess.run(model.learning_rate_decay_op)
        previous_losses.append(loss)
        # save a checkpoint
        checkpoint_path = os.path.join(FLAGS.train_dir, "headline_large.ckpt")
        model.saver.save(sess, checkpoint_path, global_step=model.global_step)
        step_time, loss = 0.0, 0.0
        # evaluate on the development set
        for bucket_id in xrange(len(buckets)):
          if len(dev_set[bucket_id]) == 0:
            print("  eval: empty bucket %d" % (bucket_id))
            continue
          # encoder inputs, decoder inputs and target weights
          encoder_inputs, decoder_inputs, target_weights = model.get_batch(
              dev_set, bucket_id)
          _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
          eval_ppx = math.exp(float(eval_loss)) if eval_loss < 300 else float("inf")
          print("  eval: bucket %d perplexity %.2f" % (bucket_id, eval_ppx))  # report perplexity
        sys.stdout.flush()
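
Three small rules inside the training loop can be isolated and checked on their own: bucket sampling in proportion to bucket size, the capped perplexity, and the learning-rate decay test. The sketch below restates them as standalone functions; the function names are hypothetical and the numbers in the usage lines are made up.

```python
import math

def pick_bucket(bucket_sizes, random_number):
    """Choose a bucket with probability proportional to its size."""
    total = float(sum(bucket_sizes))
    scale = [sum(bucket_sizes[:i + 1]) / total for i in range(len(bucket_sizes))]
    # the first cumulative share above the random number in [0, 1) selects the bucket
    return min(i for i in range(len(scale)) if scale[i] > random_number)

def perplexity(loss):
    """exp(loss), reported as infinity for very large losses to avoid overflow."""
    return math.exp(float(loss)) if loss < 300 else float("inf")

def should_decay(previous_losses, loss):
    """True when the loss is worse than each of the last three checkpoints."""
    return len(previous_losses) > 2 and loss > max(previous_losses[-3:])

bucket = pick_bucket([10, 30, 60], 0.25)  # cumulative shares are 0.1, 0.4, 1.0
```

With the sizes above, a random number of 0.25 falls into the second bucket, since it is the first one whose cumulative share exceeds 0.25.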

2) Model import and call

Put the generated model inside the /ckpt folder and load it at run time. After the program obtains a sentence, it performs the following processing:

while sentence:
  sen = tf.compat.as_bytes(sentence)
  sen = sen.decode('utf-8')
  token_ids = data_utils.sentence_to_token_ids(sen, vocab,
                                               normalize_digits=False)
  print(token_ids)  # print the token IDs
  # choose a suitable bucket
  bucket_id = min([b for b in xrange(len(buckets)) if buckets[b][0] > len(token_ids)])
  print("current bucket id" + str(bucket_id))
  encoder_inputs, decoder_inputs, target_weights = model.get_batch(
      {bucket_id: [(token_ids, [])]}, bucket_id)
  # get the output of the model
  _, _, output_logits_batch = model.step(sess, encoder_inputs,
                                         decoder_inputs, target_weights,
                                         bucket_id, True)
  # greedy decoder
  output_logits = []
  for item in output_logits_batch:
    output_logits.append(item[0])
  print(output_logits)
  print(len(output_logits))
  print(output_logits[0])
  outputs = [int(np.argmax(logit)) for logit in output_logits]
  print(output_logits)
  # cut the output at the end-of-sequence marker added by the program
  if data_utils.EOS_ID in outputs:
    outputs = outputs[:outputs.index(data_utils.EOS_ID)]
  print(" ".join([tf.compat.as_str(rev_vocab[output]) for output in outputs]))
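
The greedy decoding step can be shown on hand-written scores without TensorFlow or numpy. In this sketch the vocabulary, the EOS id and the logits are made up, and greedy_decode is a hypothetical helper that mirrors the argmax-then-truncate logic.

```python
EOS_ID = 2  # assumed end-of-sequence id

def greedy_decode(output_logits, rev_vocab):
    """Pick the highest-scoring token at each step and cut the result at EOS."""
    outputs = [max(range(len(logit)), key=logit.__getitem__)
               for logit in output_logits]
    if EOS_ID in outputs:
        outputs = outputs[:outputs.index(EOS_ID)]
    return " ".join(rev_vocab[i] for i in outputs)

# Hypothetical vocabulary and per-step scores, for illustration only.
rev_vocab = ["_PAD", "_GO", "_EOS", "new", "title"]
logits = [
    [0.1, 0.0, 0.2, 2.0, 0.3],  # step 1: "new" scores highest
    [0.0, 0.1, 0.2, 0.3, 1.5],  # step 2: "title"
    [0.0, 0.0, 3.0, 0.1, 0.2],  # step 3: end of sequence
    [0.0, 0.0, 0.0, 1.0, 0.0],  # ignored, comes after EOS
]
headline = greedy_decode(logits, rev_vocab)
```

Greedy decoding keeps only the single best token per step; it is simple and fast, though a beam search would explore more candidate titles.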

5. Development of graphical interface

To improve usability, the command-line workflow is turned into a graphical one. The graphical interface of the project is built with Qt Designer and PyQt5.

1) Interface design

Open the configured Qt Designer from PyCharm's External Tools, create the main window, and use the Widget Box to lay out the components, as shown in the figure below.

[Figure: main window layout in Qt Designer]

The native components look plain, so the style of each component is customized. In the Object Inspector, choose to edit the style sheet of the corresponding component, and style each component and the interface by adding CSS (Cascading Style Sheets) rules. The figure below shows the CSS code of the "Enter Program" button. The base style, the pressed style and the hover style are set separately so that the button behaves the way users expect.

[Figure: CSS code of the "Enter Program" button]

Preview the main window and each component with the modified style sheets, as shown in the figure below.

[Figure: home page design preview (button shown in hover state)]

2) Code conversion

Save the above interface design as a .ui file, and process it with the configured PyUIC5 tool to get the converted .py code.

Then adjust the parts of the generated code that do not render when the program runs. For example, change icons that were referenced through the .qrc resource file so that they are loaded from a relative path. The relevant code is as follows:

from PyQt5 import QtCore, QtGui, QtWidgets, Qt  # import the required libraries
from PyQt5.QtWidgets import *
import PreEdit

class Ui_MainWindow_home(QtWidgets.QMainWindow):  # interface class definition
    def __init__(self):
        super(Ui_MainWindow_home,self).__init__()
        self.setupUi(self)
        self.retranslateUi(self)

    def setupUi(self, MainWindow_home):  # set up the interface
        MainWindow_home.setObjectName("MainWindow_home")  
        MainWindow_home.resize(900, 650)  
        MainWindow_home.setMinimumSize(QtCore.QSize(900, 650))  
        MainWindow_home.setMaximumSize(QtCore.QSize(900, 650))  
        MainWindow_home.setBaseSize(QtCore.QSize(900, 650))  
        font = QtGui.QFont()  
        font.setFamily("黑体")  
        font.setPointSize(12)  
        MainWindow_home.setFont(font)  # set the font
        MainWindow_home.setStyleSheet("QMainWindow#MainWindow_home{\n"  
            "background:#FFFEF8\n}")  
        self.centralwidget = QtWidgets.QWidget(MainWindow_home)  
        self.centralwidget.setStyleSheet("")  # set the style sheet
        self.centralwidget.setObjectName("centralwidget")
        self.pushButton_openfile=QtWidgets.QPushButton(self.centralwidget)
        self.pushButton_openfile.setGeometry(QtCore.QRect(320,328,258,51))
        font = QtGui.QFont()  
        font.setFamily("等线")  
        font.setPointSize(11)  
        font.setBold(True)  
        font.setWeight(75)  
        self.pushButton_openfile.setFont(font)  # font of the button that opens the file browser
        self.pushButton_openfile.setCursor(QtGui.QCursor(QtCore.Qt.PointingHandCursor))  
        self.pushButton_openfile.setStyleSheet("QPushButton#pushButton_openfile{  \n"  
            "border: 1px solid #9a8878;  \n"  
            "background-color:#ffffff;\n"  
            "border-style: solid;  \n"  
            "border-radius:0px;  \n"  
            "width: 40px; \n"  
            "height:20px;  \n"  
            "padding:0 0px;  \n"  
            "margin:0 0px;  \n"  
            "}  \n"  
            "\n"  
            "QPushButton#pushButton_openfile:pressed{\n"  
            "background-color:#FBF7F6;\n"  
            "border:0.5px solid #DDCFC2;\n"  
            "}\n"  
            "\n"  
            "QPushButton#pushButton_openfile:hover{\n"  
            "border:0.5px solid #DDCFC2;\n"
            "}")  
        icon = QtGui.QIcon()  # icon settings
        icon.addPixmap(QtGui.QPixmap(r".\icon\enter2.png"),
                       QtGui.QIcon.Normal, QtGui.QIcon.Off)
        icon.addPixmap(QtGui.QPixmap(r".\icon\enter2.png"),
                       QtGui.QIcon.Normal, QtGui.QIcon.On)
        self.pushButton_openfile.setIcon(icon)  
        self.pushButton_openfile.setCheckable(False)  
        self.pushButton_openfile.setObjectName("pushButton_openfile")  
        self.label_maintitle_shadow=QtWidgets.QLabel(self.centralwidget)
        self.label_maintitle_shadow.setGeometry(QtCore.QRect(331,188,241, 61))
        font = QtGui.QFont()  # font settings for the interface
        font.setFamily("微软雅黑")  
        font.setPointSize(36)  
        font.setBold(True)  
        font.setWeight(75)  
        self.label_maintitle_shadow.setFont(font)  
        self.label_maintitle_shadow.setStyleSheet("QLabel#label_maintitle_shadow{\n"  
            " color:#847c74\n"  
            "}")  # set the style sheet
        self.label_maintitle_shadow.setAlignment(QtCore.Qt.AlignCenter)
        self.label_maintitle_shadow.setObjectName("label_shadow")
        self.label_format = QtWidgets.QLabel(self.centralwidget)  
        self.label_format.setGeometry(QtCore.QRect(325, 395, 251, 20))  
        font = QtGui.QFont()  
        font.setFamily("黑体")  
        font.setPointSize(10)  
        self.label_format.setFont(font)  # set the font of the format label
        self.label_format.setStyleSheet("QLabel#label_format{\n"  
            "color:#3A332A\n"  
            "}")  
        self.label_format.setObjectName("label_format")  
        self.label_maintitle = QtWidgets.QLabel(self.centralwidget)  
        self.label_maintitle.setGeometry(QtCore.QRect(331, 189, 241, 61))
        font = QtGui.QFont()  
        font.setFamily("微软雅黑")  
        font.setPointSize(35)  
        font.setBold(True)  
        font.setWeight(75)  
        self.label_maintitle.setFont(font)  
        self.label_maintitle.setStyleSheet("QLabel#label_maintitle{\n"  
            "color:#3A332A\n"  
            "}")  # style of the main title label
        self.label_maintitle.setAlignment(QtCore.Qt.AlignCenter)  
        self.label_maintitle.setObjectName("label_maintitle")  
        self.label_author = QtWidgets.QLabel(self.centralwidget)  
        self.label_author.setGeometry(QtCore.QRect(328, 600, 251, 20))  
        font = QtGui.QFont()  
        font.setFamily("等线")  
        font.setPointSize(8)  
        self.label_author.setFont(font)  
        self.label_author.setStyleSheet("QLabel#label_author{\n"  
            "color:#97846c\n"  # set the style sheet
            "}")  
        self.label_author.setAlignment(QtCore.Qt.AlignCenter)  
        self.label_author.setObjectName("label_author")  
        MainWindow_home.setCentralWidget(self.centralwidget)  
        self.menubar = QtWidgets.QMenuBar(MainWindow_home)  
        self.menubar.setGeometry(QtCore.QRect(0, 0, 900, 23))  
        self.menubar.setObjectName("menubar")  
        MainWindow_home.setMenuBar(self.menubar)  # set the menu bar of the main window
        self.statusbar = QtWidgets.QStatusBar(MainWindow_home)
        self.statusbar.setObjectName("statusbar")
        MainWindow_home.setStatusBar(self.statusbar)  # set the status bar of the main window
        self.retranslateUi(MainWindow_home)
        QtCore.QMetaObject.connectSlotsByName(MainWindow_home)

    def retranslateUi(self, MainWindow_home):
        _translate = QtCore.QCoreApplication.translate
        MainWindow_home.setWindowTitle(_translate("MainWindow_home",
            "MainWindow"))  # set the window title
        self.pushButton_openfile.setText(_translate("MainWindow_home",
            "进入程序"))  # "Enter Program"
        self.label_maintitle_shadow.setText(_translate("MainWindow_home",
            "论文助手"))  # "Paper Assistant"
        self.label_format.setText(_translate("MainWindow_home",
            "支持扩展名: .pdf  .doc  .docx  .txt"))  # "Supported extensions: ..."
        self.label_maintitle.setText(_translate("MainWindow_home",
            "论文助手"))  # "Paper Assistant"
        self.label_author.setText(_translate("MainWindow_home",
            "Designed by Hu Tong & Li Shuolin"))

    def openfile(self):
        # Qt name filters separate patterns with spaces; '选择文件' means "Select file"
        openfile_name=QFileDialog.getOpenFileName(self,'选择文件',
            '','files (*.doc *.docx *.pdf *.txt)')

3) Interface interaction

After the design is complete, set up the navigation between the windows. Two methods are tried here: one is to define a jump function; the other is to bind a slot function to a button to perform the jump.

a) Define the jump function
Define the jump function and related operations.

#Jumpmain2pre.py
from PyQt5 import QtCore, QtGui, QtWidgets
from home import Ui_MainWindow_home        # window that contains the jump button
from PreEdit import Ui_Form                # window to jump to

class Ui_PreEdit(Ui_Form):                 # wrapper class for the target window
    def __init__(self):
        super(Ui_PreEdit,self).__init__()
        self.setupUi(self)

# main window
class Mainshow(Ui_MainWindow_home):
    def __init__(self):
        super(Mainshow,self).__init__()
        self.setupUi(self)

    # button handler
    def loginEvent(self):
        self.hide()
        self.dia = Ui_PreEdit()            # class of the window to jump to
        self.dia.show()

def homeshow():                            # call this function to run the program
    import sys
    app=QtWidgets.QApplication(sys.argv)
    first=Mainshow()
    first.show()
    first.pushButton_openfile.clicked.connect(first.loginEvent)  # bind the jump to the button
    sys.exit(app.exec_())

b) Slot function for binding buttons
Set a click event for the button that needs to complete the jump function, and use the slot function to bind the event. Here is an example of binding the showwaiting() function:

self.pushButton_create.clicked.connect(self.showwaiting)  

Separately define the event to be bound to the button as a function:

def showwaiting(self):  
    import sys  
    self.MainWindow = QtWidgets.QMainWindow()  
    self.newshow = Ui_MainWindow_sumcreating()  # create the "generating" window
    self.newshow.setupUi(self.MainWindow)        # set up its interface
    self.hide()  
    self.MainWindow.show()  
    print('生成中…')  # "Generating..."

c) Example: read local files in GUI

Define opening a file, choosing the save path, and saving the content as three functions, so that they can be called through slots when the buttons are clicked. The relevant code is as follows:

def open_event(self):  # open-file event
    _translate = QtCore.QCoreApplication.translate
    # filter labels: Word documents, text files, pdf
    directory1 = QFileDialog.getOpenFileName(None, "选择文件", "C:/",
        "Word文档 (*.docx *.doc);;文本文件(*.txt);;pdf(*.pdf)")
    print(directory1)  # print the selected path and filter
    path = directory1[0]
    self.open_path_text.setText(_translate("Form", path))
    if path:  # the path is an empty string if the dialog was cancelled
        with open(file=path, mode='r+', encoding='utf-8') as file:
            self.text_value.setPlainText(file.read())

def save_event(self):  # choose-save-path event
    global save_path
    _translate = QtCore.QCoreApplication.translate
    fileName2, ok2 = QFileDialog.getSaveFileName(None, "文件保存", "C:/",
        "Text Files (*.txt)")
    print(fileName2)  # print the full save path (including file name and extension)
    save_path = fileName2
    self.save_path_text.setText(_translate("Form", save_path))

def save_text(self):  # save the text
    global save_path
    if save_path:  # skip saving if no path has been chosen
        with open(file=save_path, mode='a+', encoding='utf-8') as file:
            file.write(self.text_value.toPlainText())
        print('已保存!')  # "Saved!"

Bind the click signals of the open and save buttons to the functions defined above through slots. In the same way, the paths obtained in open_event() and save_event() are shown in the two path display boxes. The relevant code is as follows:

def retranslateUi(self, Form):  
    _translate = QtCore.QCoreApplication.translate  
    Form.setWindowTitle(_translate("Form", "Form"))  
    self.label_preview.setText(_translate("MainWindow_preview","预览"))  
    self.open_path_text.setPlaceholderText(_translate("Form","打开"))  
    self.open_path_but.setText(_translate("Form", "浏览"))  
    self.save_path_but.setText(_translate("Form", "浏览"))  
    self.save_path_text.setPlaceholderText(_translate("Form","保存"))  
    self.save_but.setText(_translate("Form", "保存"))  
    self.open_path_but.clicked.connect(self.open_event)  
    self.save_path_but.clicked.connect(self.save_event)  
    self.save_but.clicked.connect(self.save_text)  
    self.pushButton_create.clicked.connect(self.showwaiting)  
    self.pushButton_create.setText(_translate("Main_preview","生成"))  

The running program is shown in the following two figures.

[Figure: program in operation (1)]

[Figure: program in operation (2)]

4) Program docking

To combine the model with the graphical interface, docking points must be reserved in the code. In this project there are four key connection points between the interface and the model: calling the model, detecting the end of model processing, displaying the results, and saving the output.

a) call the model

So that the model is called when the Generate button is clicked, the call is placed in the showwaiting() function at the end of PreEdit.py, and the model runs when the summary generation page pops up. The relevant code is as follows:

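def showwaiting(self):  
    import sys  
    self.MainWindow = QtWidgets.QMainWindow()  
    self.newshow = Ui_MainWindow_sumcreating()  # create the window
    self.newshow.setupUi(self.MainWindow)        # set up the interface
    self.hide()  
    # docking point: read the previously saved file (its path is set in save_event),
    # then call the model to generate the output and save it
    self.MainWindow.show()  
    print('生成中…')  # "Generating..."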

b) End of model processing

After model processing finishes, the result display page must be shown, so a check is added to the main function of PaperMain.py. Once the model is judged to have finished, the resultshow() function is called to show the results. The relevant code is as follows:

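def main():
    homeshow()
    # Integration point: the model is called in showwaiting(), the last function in PreEdit.py.
    # Integration point: once processing is judged complete, continue to the result display page.
    resultshow()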
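One way to express "continue only after the model has finished" is a completion flag set by a worker thread. The sketch below uses only the standard library and is an assumption about how the check could be written, not the project's actual code; `run_in_background` and `fake_model` are hypothetical names.

```python
import threading
import time


def run_in_background(task, on_done):
    """Run task() in a worker thread and call on_done(result) when it finishes.

    Returns an Event that is set once the worker has finished.
    Note that on_done is called from the worker thread.
    """
    finished = threading.Event()

    def worker():
        try:
            on_done(task())
        finally:
            finished.set()

    threading.Thread(target=worker, daemon=True).start()
    return finished


if __name__ == "__main__":
    def fake_model():
        time.sleep(0.1)  # stands in for model inference
        return "generated summary"

    done = run_in_background(fake_model, lambda r: print("result ready:", r))
    done.wait(timeout=5)  # main() would call resultshow() after this point
```

Blocking on `wait()` in the GUI thread would freeze the window, so in a PyQt5 application the same idea is usually expressed with a QThread that emits a signal when the work is done.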

c) Results display

After the model finishes processing, the abstract, titles, and keywords need to be displayed on the result display page, which corresponds to the result.py file. Before integration, the results shown on the interface are fixed strings; to integrate, it is only necessary to save the model's output as strings and substitute them for the fixed content. The relevant code is as follows:

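# Integration point: store the model's output in a few strings and substitute them for the placeholder text below.
# Replace the abstract (placeholder text: "generated abstract")
self.plainTextEdit_summary.setPlainText(_translate("MainWindow_result", "生成的摘要"))
# Replace title 1 (placeholder text: "Title 1")
self.lineEdit_title1.setText(_translate("MainWindow_result", "标题1"))
# Replace title 2 (placeholder text: "Title 2")
self.lineEdit_title2.setText(_translate("MainWindow_result", "标题2"))
# Replace title 3 (placeholder text: "Title 3")
self.lineEdit_title3.setText(_translate("MainWindow_result", "标题3"))
# Replace the keywords (placeholder text: "keywords")
self.lineEdit_keywords.setText(_translate("MainWindow_result", "关键词"))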
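The substitution described above can be collected into one helper. The following is a rough sketch that assumes the widget names from the listing above; `apply_results`, `FakeEdit` and `make_fake_ui` are hypothetical, and the stand-in widget class exists only so the example runs without PyQt5.

```python
from types import SimpleNamespace


class FakeEdit:
    """Stand-in for QLineEdit/QPlainTextEdit so the sketch runs without PyQt5."""

    def __init__(self):
        self.text = ""

    def setText(self, value):
        self.text = value

    # QPlainTextEdit uses setPlainText; the stand-in treats both the same way.
    setPlainText = setText


def apply_results(ui, summary, titles, keywords):
    """Replace the fixed placeholder strings with model output."""
    ui.plainTextEdit_summary.setPlainText(summary)
    title_edits = (ui.lineEdit_title1, ui.lineEdit_title2, ui.lineEdit_title3)
    shown = list(titles[:3])
    # Pad with empty strings when the model returns fewer than three titles.
    shown += [""] * (len(title_edits) - len(shown))
    for edit, title in zip(title_edits, shown):
        edit.setText(title)
    ui.lineEdit_keywords.setText(", ".join(keywords))


def make_fake_ui():
    """Build an object with the same widget attribute names as the result page."""
    names = ("plainTextEdit_summary", "lineEdit_title1", "lineEdit_title2",
             "lineEdit_title3", "lineEdit_keywords")
    return SimpleNamespace(**{name: FakeEdit() for name in names})
```

In the real page, `ui` would be the result window object itself, since it already exposes these widgets.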

d) Save the output

When the "Save" button is clicked, the output of the model is saved directly to the local machine. This is done by the file.write() call inside the save_text() function of result.py. Before integration, file.write() writes a fixed string; to integrate, replace that string with the content output by the model. The relevant code is as follows:

def save_text(self):
    global save_path
    if save_path is not None:
        with open(file=save_path, mode='a+', encoding='utf-8') as file:
            # Integration point: concatenate the model's output strings and write them here.
            file.write("hello,Tibbarr")
        print('已保存!')  # "Saved!"
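Replacing the fixed string with model output might look like the sketch below. The layout of the saved file and the names `format_results` and `save_results` are assumptions for illustration; the append mode ('a+') and UTF-8 encoding follow the listing above.

```python
def format_results(summary, titles, keywords):
    """Join the model outputs into one block of text for the result file."""
    lines = ["Titles:"]
    lines += [f"{i}. {title}" for i, title in enumerate(titles, start=1)]
    lines += ["", "Abstract:", summary, "", "Keywords:", ", ".join(keywords)]
    return "\n".join(lines) + "\n"


def save_results(save_path, summary, titles, keywords):
    """Append the formatted results to save_path; return False if no path is set."""
    if not save_path:
        return False
    with open(save_path, mode="a+", encoding="utf-8") as file:
        file.write(format_results(summary, titles, keywords))
    return True
```

Because the file is opened in append mode, saving twice to the same path adds a second copy of the results; opening with mode "w" instead would overwrite the earlier content.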

6. Application Packaging

To make the system more convenient and lower the barrier to use, the project is packaged as a single application. Because most thesis writing is done on a PC, PyInstaller is used to package the project as an .exe application.

1) Install PyInstaller

Download PyInstaller-3.6.tar.gz from the Tsinghua mirror. After extracting it locally, open a cmd console, switch to the extracted directory, and run the following command to complete the installation:

python setup.py install

2) Package the program as an .exe file

Open a command window, switch to the directory containing papermain.py, and enter the command:

pyinstaller -F -w papermain.py

The details are shown in the figure below.

insert image description here

PyInstaller packages the project successfully, as shown in the figure below.

insert image description here
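The same build can also be started from Python, which makes the meaning of the two flags explicit: -F is the short form of --onefile and -w is the short form of --windowed. The sketch below is illustrative only; `build_pyinstaller_args` and `build` are hypothetical helpers, and `build` assumes PyInstaller is installed and calls its documented `PyInstaller.__main__.run` entry point.

```python
def build_pyinstaller_args(script, onefile=True, windowed=True):
    """Return the argument list equivalent to `pyinstaller -F -w <script>`."""
    args = [script]
    if onefile:
        args.append("--onefile")   # long form of -F: bundle into a single executable
    if windowed:
        args.append("--windowed")  # long form of -w: do not open a console window
    return args


def build(script):
    """Run PyInstaller on the given script (requires PyInstaller to be installed)."""
    # Imported lazily so the helper above works without PyInstaller installed.
    import PyInstaller.__main__
    PyInstaller.__main__.run(build_pyinstaller_args(script))
```

Calling `build("papermain.py")` from the directory containing the script should have the same effect as the command shown above.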

3) View the .exe file

After the program is packaged successfully, a dist folder is generated in the directory containing papermain.py. It contains the generated .exe file, which can be run by double-clicking. This completes the packaging.

System test

This section covers training perplexity, testing performance, and model application.

1. Training perplexity

In the Seq2Seq model, perplexity is used to evaluate the final result; the smaller the value, the better the language model. This project trains a large network for a total of 48,000 steps, and after a period of training the loss stops decreasing. The start of training and network construction is shown in the figure below.

insert image description here

During training, the model's perplexity trends downward: the model's uncertainty about the language gradually decreases as training progresses, and the model gradually improves. By 30,000 steps the downward trend has flattened and the perplexity essentially stops decreasing; by 47,000 steps the perplexity fluctuates within a small range, and its lowest value is 232.62, as shown below.

insert image description here
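For reference, perplexity is the exponential of the average negative log-likelihood per token, which is why it falls together with the training loss. The sketch below shows the calculation under the assumption that the loss is a natural-log cross-entropy averaged per token; the function names are illustrative and are not taken from the project code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)
```

Under that assumption, the reported minimum perplexity of 232.62 corresponds to a mean cross-entropy of ln(232.62), roughly 5.45 nats per token. A model that assigns probability 0.25 to every token would have a perplexity of 4.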

2. Test effect

Load the trained model and input relevant text for testing. The title is generated from the Seq2Seq output, as shown in the image below.

insert image description here

The output shows that the model's title generation ability is still limited: it produces good titles only for simple content, and its accuracy on difficult texts needs to be improved.

In the abstract extraction and keyword extraction parts, the TextRank algorithm is used to train the model, and the trained model produces the output for the given text. After many tests, good results are obtained, as shown in the figure below.

insert image description here
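To make the keyword extraction step more concrete, here is a generic TextRank-style sketch: words become nodes in a co-occurrence graph and are ranked with PageRank iterations. It is not the project's implementation. The window size, the damping factor of 0.85, the iteration count, the tiny stopword list and the English-only tokenizer are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "and", "for", "are", "with", "that", "this", "from", "can", "was"}


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank scores on a word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 2 and w not in STOPWORDS]

    # Undirected graph: two words are linked if they occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Each word's score is fed by its neighbors, split evenly over their links.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Sentence-level TextRank for abstract extraction follows the same pattern, with sentences as nodes and a sentence similarity measure as the edge weight.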

3. Model application

Because the program has been packaged as an executable file, download the .exe file to the computer and double-click it to run. The initial interface of the application is shown in the figure below.

insert image description here

The home page shows the project name, the "Enter Program" button, and a description of the supported file formats. Click the "Enter Program" button to enter the file reading page.

On the file reading page, click the "Browse" buttons next to the open path and the save path to select the document to be processed and to set the save path and file name for the results, as shown in the figure below.

insert image description here

After the file is read in, the preview and edit page appears, as shown in Figure 1 below. Preview and modify the read-in content here, then click the "Save" button to temporarily save the modified document. Click the "Generate" button and the model starts to process the text while the program switches to the waiting page, as shown in Figure 2 below.

insert image description here

Figure 1 Preview and edit page

insert image description here

Figure 2 Processing waiting page

After the model finishes processing, close the current window and the program automatically jumps to the result display page. From top to bottom, the page displays three alternative titles, the paper abstract, and the keywords. Users can copy the results directly from the interface, or select a path at the bottom of the page and save all the results to the local machine, as shown in the figure below.

insert image description here

After selecting the save path and clicking the "Save" button, the program will save all the results to the specified path and jump to the download success page, as shown in the figure below.

insert image description here

After processing is complete, the user can close the program directly or click the "Back to Home" button to return to the home page and process other files. The program was tested on a PC; the input file content, the output content, and the output result file are shown in Figures 3 to 5 below.

insert image description here

Figure 3 Input file content

insert image description here

Figure 4 Output content

insert image description here

Figure 5 Output result file

Project source code download

See my blog resource download page for details


Other information download

To learn more about artificial intelligence learning routes and knowledge systems, see my other blog post, "Heavy | Complete artificial intelligence AI learning-basic knowledge learning route, all materials can be downloaded directly from the network disk without paying attention to routines".
That post draws on well-known open-source projects on GitHub, AI technology platforms, and experts in related fields, including Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang. It collects about 100 GB of related materials, which I hope will be helpful.


Origin blog.csdn.net/qq_31136513/article/details/131635775