CSDN Question and Answer Module Title Recommendation Task (2) - Effect Optimization



Team Blog: CSDN AI Group


1. Problem Background

This article follows the previous article, "CSDN Question and Answer Module Title Recommendation Task (1) - Basic Framework Construction". In short, the task is to improve the titles that users write for their questions in the CSDN Q&A module by recommending more reasonable and informative titles. For the detailed task background, please refer to the previous article. This article mainly introduces the effect optimization strategies for this task.

2. Effect optimization methodology

The optimizations are based mainly on the problems identified by analyzing the error cases in Section 2.3 of the previous article, combined with opinions and suggestions from other people. The specific optimization methods are as follows.

2.1 Invalid title detection

In the previous article, a title was recommended for every user question. However, most of the original titles are already valid (about 91%), and in a small number of cases the recommended title may be worse than the valid original. Therefore, to make title recommendation more efficient and to avoid replacing a good title with a worse one, invalid titles must be detected first, and titles are recommended only for those.

Invalid title detection mainly includes the following two strategies:

2.1.1 Keyword matching strategy

Most invalid titles contain certain keywords. For example, the titles below contain words such as "big brother", "help", "emergency!", and "thanks"; a title made up of such words cannot express the meaning of the question itself.
[Image: examples of invalid titles made up of such keywords]

This article manually collected 64 keywords and uses string matching to decide whether a title is invalid. All collected keywords are listed below (translated from Chinese, so some distinct keywords share the same English rendering):

urgent!, urgent, help, uncle, emergency, help, please, ball, please, help, kneel, beg, master, boss, advice, advice, answer, Xiaobai, new kid, help, brother, sister, sister, little sister, brother, brother, novice, rookie, rookie, brother, master, beginner, self-study, thank you, dear man Homework is crazy, teacher, good man, this question, source code, doing the question, ah, ah, waiting online, it's too difficult, family, brothers, little brother, this question, this question, this question, exam question, one question, source code, programming question, how to write the question, how to do the question, seeking the source code, haha, course design

On a random sample of 2,000 records, the current coverage of invalid title identification is 98.32%.
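As a rough illustration of the keyword matching strategy, the sketch below flags a title as invalid when it contains any keyword. The keyword subset and the function name `contains_invalid_keyword` are assumptions made for this example; the real list has 64 entries and is matched against Chinese titles.

```python
# Illustrative subset of the keyword list above (the full list has 64 entries).
INVALID_KEYWORDS = ("urgent", "help", "please", "novice", "rookie", "waiting online")


def contains_invalid_keyword(title: str) -> bool:
    """Return True if the title contains any keyword (plain substring match)."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in INVALID_KEYWORDS)


print(contains_invalid_keyword("Urgent! Waiting online, please help"))          # True
print(contains_invalid_keyword("NameError when calling capitalize in Python"))  # False
```

Plain substring matching is cheap, but in English it can overmatch (for example "help" inside "helper"), so a production version would probably need word boundaries or a stricter keyword list.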

2.1.2 Stop word removal strategy

Some invalid titles contain no obvious keywords, yet the title as a whole carries no information. For such titles, this article removes stop words from the title using a stop word table and then checks whether anything is left. If the result is empty, the title is invalid.

For example, the following title consists entirely of question marks:
[Image: a title made up only of question marks]

Another example is the following titles, which consist entirely of words that carry no useful information:
[Image: titles made up only of uninformative words]
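The stop word check can be sketched as follows. The stop word table here is a small hypothetical one (the article does not publish its table), and stop words are removed as substrings for simplicity; a real implementation would more likely segment the title into words first.

```python
import re

# Hypothetical stop word table; the table used in the article is not published.
STOP_WORDS = ("请问", "怎么办", "这个", "问题", "一下", "的", "了", "啊", "吗")

# Anything that is not a digit, an ASCII letter, or a CJK character is treated as noise.
_NOISE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fa5]+")


def is_empty_after_stop_words(title: str) -> bool:
    """Return True if nothing is left once punctuation and stop words are removed."""
    text = _NOISE.sub("", title)
    # Remove longer stop words first so shorter ones do not split them.
    for word in sorted(STOP_WORDS, key=len, reverse=True):
        text = text.replace(word, "")
    return text == ""


print(is_empty_after_stop_words("???"))                # True
print(is_empty_after_stop_words("python 列表怎么去重"))  # False
```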

2.2 OCR module: ensuring information completeness

In some user questions, the key information is contained in the uploaded images, or the question body consists only of a few images with no text description. For example:
[Image: two example questions whose content is given only as pictures]
This article uses paddle_ocr's image-to-text module to recognize the text contained in the images. The recognized text is then passed on to the rule module and the Text_Rank module.

In addition, because OCR is time-consuming, the OCR module should be called as rarely as possible. If the existing information is already enough to recommend a reasonable title, OCR is not run on all of the images.
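One way to keep OCR calls to a minimum is to try the text first and recognize images one at a time only when no title has been found yet. The sketch below is an assumption about how this could be wired together: `recommend` and `run_ocr` are hypothetical callables standing in for the rule/Text_Rank pipeline and the paddle_ocr call, which are passed in rather than implemented here.

```python
from typing import Callable, Iterable, Optional


def recommend_with_lazy_ocr(
    text: str,
    image_paths: Iterable[str],
    recommend: Callable[[str], Optional[str]],  # hypothetical: returns a title or None
    run_ocr: Callable[[str], str],              # hypothetical: image path -> recognized text
) -> Optional[str]:
    """Try the plain text first; OCR images one by one only while no title is found."""
    title = recommend(text)
    if title:
        return title  # the text alone was enough, so OCR is never called
    for path in image_paths:
        text = f"{text}\n{run_ocr(path)}"
        title = recommend(text)
        if title:
            return title  # stop early and skip the remaining images
    return None
```

With this structure the slow OCR step runs only for questions whose text is insufficient, and it stops at the first image that yields a usable title.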

2.3 Rule module: improving precision

For questions with relatively distinctive features, such as program errors, exercise questions, and questions about knowledge points, this article uses rules to recall titles directly. The details are as follows:

2.3.1 Error information extraction module

In some users' questions, the text or images contain explicit error messages. The error message is the key information for this type of question, so it can be extracted directly as the title of the question.
[Image: example question containing an error message]
This article uses regular expressions to extract error messages, with both forward rules and reverse rules: the forward rules extract error information, and the reverse rules prevent false extractions.

import re

# Forward rule: patterns that indicate an error message
error_word_pattern = re.compile(r'(exception|error\srequestid|errormessage|errorcode|errmsg|errcode|error|no .*?(detected|found)|unavailable|undefined|系统找不到|烫烫烫|不包含[a-z_\.].*?的定义|执行[a-z_\.].*?时出错|报错)')
# Reverse rule: patterns that look like errors but are not (used to exclude false matches)
error_word_reverse_pattern = re.compile(r'((except|import|catch|throw)[a-z_ \.\(\)\:]{0,20}exception|([1-9]|logger|self)[\. ]error|error[\((]s[\))][0-9 \,]{1,5}warning[\((]s[\))]|error:function|exception.{1,3}details|onerror|conda config|(catch|throw)[a-z_ \.\(\)\:]{0,20}error)')
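How the two patterns are combined is not spelled out, so the following is only a sketch under stated assumptions: the text is lowercased (the patterns only use lowercase letters), it is scanned line by line, and the first line that matches the forward rule without matching the reverse rule is returned. The function name `extract_error_line` is hypothetical.

```python
import re
from typing import Optional

# The two patterns are copied from the article.
error_word_pattern = re.compile(r'(exception|error\srequestid|errormessage|errorcode|errmsg|errcode|error|no .*?(detected|found)|unavailable|undefined|系统找不到|烫烫烫|不包含[a-z_\.].*?的定义|执行[a-z_\.].*?时出错|报错)')
error_word_reverse_pattern = re.compile(r'((except|import|catch|throw)[a-z_ \.\(\)\:]{0,20}exception|([1-9]|logger|self)[\. ]error|error[\((]s[\))][0-9 \,]{1,5}warning[\((]s[\))]|error:function|exception.{1,3}details|onerror|conda config|(catch|throw)[a-z_ \.\(\)\:]{0,20}error)')


def extract_error_line(text: str) -> Optional[str]:
    """Return the first line that hits the forward rule and avoids the reverse rule."""
    for line in text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()  # assumption: matching is done on lowercased text
        if not lowered:
            continue
        if error_word_pattern.search(lowered) and not error_word_reverse_pattern.search(lowered):
            return stripped
    return None


sample = "my code fails\nlogger.error('boom')\nNameError: name 'capitalize' is not defined"
print(extract_error_line(sample))
```

In this sample the `logger.error(...)` line is rejected by the reverse rule because it is source code rather than an error report, so the `NameError` line is the one extracted.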

2.3.2 Exercise recognition module

Some users ask how to solve exercise questions from their courses. These questions also contain fairly obvious signals, such as the keywords "programming question" and "practice question".

Regular expressions are used for matching here as well, with one pattern applied to the title and one applied to the body. The regular expressions are as follows:

# Title-based pattern
title_exercise_words_pattern = re.compile(r'(题目|作业|编程题|练习题|写.*?程序)')
# Body-based pattern
body_exercise_words_pattern = re.compile(r'(^(1|A)(、| |\.).{3,}?$|^(①|②|③|④|⑤)|作业|问题[0-9]|[一二三四五]是:|题目描述)')
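A minimal sketch of how these two patterns might be applied is shown below. It assumes that a hit on either the title or any single line of the body is enough, and that the body pattern is applied line by line (it uses `^` and `$` without a multiline flag). The function name `looks_like_exercise` is hypothetical.

```python
import re

# The two patterns are copied from the article.
title_exercise_words_pattern = re.compile(r'(题目|作业|编程题|练习题|写.*?程序)')
body_exercise_words_pattern = re.compile(r'(^(1|A)(、| |\.).{3,}?$|^(①|②|③|④|⑤)|作业|问题[0-9]|[一二三四五]是:|题目描述)')


def looks_like_exercise(title: str, body: str) -> bool:
    """Assumed combination: the title matches, or any body line matches."""
    if title_exercise_words_pattern.search(title):
        return True
    return any(
        body_exercise_words_pattern.search(line.strip())
        for line in body.splitlines()
    )


print(looks_like_exercise("求一道编程题的解法", ""))                      # True
print(looks_like_exercise("pandas merge 结果为空", "两个表的 key 类型不同"))  # False
```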

2.3.3 Inquiry knowledge point module

Some users ask about very specific knowledge points, so the knowledge point can be extracted directly as the title of the question.

Regular expressions are again used for matching, mainly against the information in the title, as follows:

ask_words_pattern = re.compile(r'((?:求助|关于)[\u4e00-\u9fa5a-zA-Z0-9_]*?(?:问题|贴)|[^,。?!;]+?(?:的理解|的区别|的特性|的疑惑))')
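The pattern can be tried out as follows. The wrapper function `extract_knowledge_point` is a hypothetical helper that simply returns the first match in the title, or `None` when the title does not look like a knowledge point question.

```python
import re
from typing import Optional

# The pattern is copied from the article.
ask_words_pattern = re.compile(r'((?:求助|关于)[\u4e00-\u9fa5a-zA-Z0-9_]*?(?:问题|贴)|[^,。?!;]+?(?:的理解|的区别|的特性|的疑惑))')


def extract_knowledge_point(title: str) -> Optional[str]:
    """Return the matched knowledge point phrase, if any."""
    match = ask_words_pattern.search(title)
    return match.group(1) if match else None


print(extract_knowledge_point("新手求教,关于死循环的问题"))  # 关于死循环的问题
print(extract_knowledge_point("list和tuple的区别"))          # list和tuple的区别
print(extract_knowledge_point("大佬帮帮忙"))                 # None
```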

2.3.4 Add title header

To make the field of the question clearer (for example Python, C++, or artificial intelligence) and to enrich the information in the title, this article uses the question's own tags together with a few templates to generate a title header. Each of the three rule modules has its own template. Combined with the title extracted by the rule, the final rule-based titles look like this:

  • Error information extraction module: Question about #tag#

Original title: A newbie asks the experts for a favor!
Recommended title: Question about #Deep Learning#: NameError: name 'capitalize' is not defined

  • Exercise recognition module: Question about #tag#

Original title: God who knows how to script, please help~~~
Recommended title: About #c语言#

  • Inquiry knowledge point module: Ask about #tag#'s knowledge points

Original title: A novice asking for advice, about infinite loops.
Recommended title: Ask about #java#'s knowledge points: Questions about infinite loops
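Generating the title header amounts to filling a per-module template with the tag and the extracted text. The sketch below uses English renderings of the templates based on the examples above; the exact template strings, the dictionary keys, and the function name `build_title` are assumptions, and the exercise template in particular is a guess.

```python
# Hypothetical English renderings of the three templates.
TITLE_TEMPLATES = {
    "error": "Question about #{tag}#: {extracted}",
    "exercise": "Question about #{tag}#: {extracted}",  # guessed; the source example is incomplete
    "knowledge": "Ask about #{tag}#'s knowledge points: {extracted}",
}


def build_title(module: str, tag: str, extracted: str) -> str:
    """Fill the template of the given rule module with the tag and extracted text."""
    return TITLE_TEMPLATES[module].format(tag=tag, extracted=extracted)


print(build_title("error", "Deep Learning", "NameError: name 'capitalize' is not defined"))
# Question about #Deep Learning#: NameError: name 'capitalize' is not defined
```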

2.4 Text_Rank module: improving recall

The rule module focuses on high precision, so its recall is low. Its current recall is about 33.5%, so the remaining 66.5% of questions need to be recalled with Text_Rank.
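The division of labor between the two modules can be pictured as a simple fallback chain. This is only a sketch: the order in which the rule modules are tried and the names `recommend_title`, `rule_modules`, and `text_rank` are assumptions, not details taken from the article.

```python
from typing import Callable, Optional, Sequence


def recommend_title(
    question: str,
    rule_modules: Sequence[Callable[[str], Optional[str]]],  # hypothetical rule functions
    text_rank: Callable[[str], str],                         # hypothetical Text_Rank wrapper
) -> str:
    """Return the first rule-based title; fall back to Text_Rank if no rule fires."""
    for rule in rule_modules:
        title = rule(question)
        if title:
            return title
    return text_rank(question)
```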

Text_Rank is an extractive text summarization model. Its usage strategy was briefly introduced in the previous article: question sentences are given priority. Because extraction depends heavily on the quality of the input text, the text must be carefully preprocessed to remove noise that is useless for extraction, mainly code segments, escape characters (which are converted), URLs, image tag links, and useless HTML tags. Words and phrases that carry no valid information are also removed, such as "big brother", "help me", and "Xiaobai asks for guidance".
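A minimal preprocessing sketch covering the noise types listed above might look like this. The regular expressions, the filler phrase list, and the function name `preprocess` are all assumptions made for illustration; the article's actual cleaning rules are not published.

```python
import html
import re

_CODE = re.compile(r"<pre\b.*?</pre>|<code\b.*?</code>", re.S | re.I)    # code segments
_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)|<img\b[^>]*>", re.I)          # image links and tags
_HTML_TAG = re.compile(r"<[^>]+>")                                       # remaining HTML tags
_URL = re.compile(r"https?://[^\s<>\"]+")                                # bare URLs
# Hypothetical filler phrases (roughly "big brother", "help me", "newbie asks for guidance").
_FILLER = re.compile(r"大佬|帮帮我|小白求指教")


def preprocess(raw: str) -> str:
    """Strip noise that would mislead sentence extraction."""
    text = _CODE.sub(" ", raw)
    text = _IMAGE.sub(" ", text)
    text = _HTML_TAG.sub(" ", text)   # tags go before URLs so href values vanish with the tag
    text = _URL.sub(" ", text)
    text = html.unescape(text)        # convert escapes such as &amp; after tags are gone
    text = _FILLER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


raw = '<p>大佬帮帮我 my script fails &amp; exits</p><pre><code>print(1)</code></pre><img src="a.png"> see https://example.com/x for details'
print(preprocess(raw))  # my script fails & exits see for details
```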

After the above processing, the title generated by Text_Rank will be more accurate and more readable.
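For readers unfamiliar with the algorithm, here is a compact, self-contained sketch of TextRank-style sentence extraction. It is not the implementation used in this project: tokenization is a simple `\w+` split suited to English, similarity is word overlap normalized by log lengths, and "questions are given priority" is read as "return the best-ranked sentence that ends with a question mark, if there is one".

```python
import math
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on common English and Chinese sentence terminators, keeping them."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def similarity(a: str, b: str) -> float:
    """Word overlap normalized by the log of each sentence's vocabulary size."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # single-word sentences are skipped, which also avoids a zero denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences: List[str], damping: float = 0.85, iterations: int = 30) -> List[float]:
    """Run a fixed number of PageRank-style updates over the similarity graph."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extract_title(text: str) -> str:
    """Return the best-ranked question sentence, else the best-ranked sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    scores = rank_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    for i in ranked:
        if sentences[i].endswith(("?", "？")):  # assumed meaning of "question priority"
            return sentences[i]
    return sentences[ranked[0]]


text = ("I am training a model with PyTorch. The training loss becomes NaN after a few epochs. "
        "Why does the training loss become NaN in PyTorch? Thanks.")
print(extract_title(text))  # Why does the training loss become NaN in PyTorch?
```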

3. Summary and next steps

With the rules + model strategy described above, the title recommendation effectiveness has risen from the previous article's baseline of 47.92% to 86.0%, reaching an initially usable state.

However, because user questions are highly colloquial and complex, some details of the title recommendation task still need further optimization. Next steps include:

  • Expand the coverage of the rules and resources to raise the recall of the rule module;
  • Further optimize the Text_Rank algorithm to improve the quality of extracted titles;
  • Text_Rank extracts a sentence verbatim from the body, which can read stiffly when used directly as a question title. The next step is to consider template-based or generative methods to make titles more readable and closer to the style of a question.

P.S.

This series of articles will continue to be updated. I hope that colleagues, teachers, and experts in NLP and other fields can offer valuable suggestions. Thank you!


Origin blog.csdn.net/u010280923/article/details/118074512