CSDN Q&A Tag Skill Tree (1): Building the Basic Framework


Team Blog: CSDN AI Group


1 Problem Definition

1.1 Background

Questions in the CSDN Q&A module are currently classified only coarsely, into categories such as Python, Java, and C. They are not mapped to specific knowledge points within a category. For example, the question shown below belongs to the topic of data visualization in Python.
Question Example 1
Fine-grained categorization lets askers see more clearly where their questions sit in the knowledge system, and lets the system recommend relevant learning and reference material more accurately.

To address this, this article first builds a programming language skill tree for each category, then maps previously accepted questions to specific nodes in the skill tree. Finally, a new question is matched to the most similar node in the tree, and the accepted questions on that node are recommended.

2 Solution

2.1 Knowledge Collection

Building a programming skill tree starts with collecting the relevant knowledge. This article uses the Python programming language as its first example.

Online searching and research turned up the following two sources:

  • Crawl book directories (tables of contents) from a certain website

    • Search the website for the keyword "python" and select the top N books by sales volume
    • Extract the directory field from each book's details page to obtain the raw, unprocessed directory
      Question example 2
  • Learning paths on the site forum:

2.2 Construction of the Skill Tree

Once the knowledge resources have been collected, they need to be stored in a tree structure. This article uses the treelib package for that.

To make merging the trees in the next section easier, this article limits each directory to a four-level structure:

  • Part title. Example: Part 1
  • Chapter title. Example: Chapter 1
  • Section title. Example: 1.1
  • Subsection title. Example: 1.1.1

The resulting tree is shown in the figure below:

Picture example 4
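As a rough illustration of the four-level structure described above, here is a minimal sketch that turns raw directory lines into a tree. It uses plain dictionaries instead of treelib so that it runs without extra dependencies. The regular expressions, the function names heading_level and build_tree, and the sample titles are all assumptions made for this example; real book directories would need their own rules.

```python
import re

# Hypothetical heading patterns for the four levels (assumed formats).
LEVEL_PATTERNS = [
    (1, re.compile(r"^Part\s+\S+")),       # part title, e.g. "Part 1"
    (2, re.compile(r"^Chapter\s+\d+")),    # chapter title, e.g. "Chapter 1"
    (4, re.compile(r"^\d+\.\d+\.\d+\s")),  # subsection title, e.g. "1.1.1"
    (3, re.compile(r"^\d+\.\d+\s")),       # section title, e.g. "1.1"
]


def heading_level(line):
    """Return the level (1-4) of a directory line, or None if unrecognised."""
    for level, pattern in LEVEL_PATTERNS:
        if pattern.match(line):
            return level
    return None


def build_tree(root_name, lines):
    """Build a nested-dict tree from an ordered list of directory lines."""
    root = {"title": root_name, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recently added node
    for raw in lines:
        line = raw.strip()
        level = heading_level(line)
        if level is None:
            continue  # skip lines that are not recognised headings
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": line, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


toc = [
    "Part 1 Basics",
    "Chapter 1 Getting Started",
    "1.1 Setting Up",
    "1.1.1 Installing Python",
    "1.2 Hello World",
    "Chapter 2 Variables",
]
tree = build_tree("python", toc)
```

With treelib, each dictionary node would instead be created through the library's own tree object; the stack-based level handling stays the same.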

2.3 Merging of Skill Trees

After a skill tree has been built from each source of directories and knowledge-system resources, the separate trees must be merged into a single Python skill tree.

When merging trees, this article mainly considers the following aspects:

  • Merge layer by layer, starting from the root node
    • Use a recursive method to merge multiple trees
  • Merge similar nodes within the same layer
    • Divide the nodes into clusters using a heuristic clustering method (the number of clusters does not need to be fixed in advance)
    • Compute similarity during clustering as the longest common subsequence ratio plus the Levenshtein ratio (edit distance ratio)
    • Use the longest common subsequence of the clustered titles as the value of the merged node. For example, the three nodes "use if statement", "if statement processing list", and "set the format of if statement" have the longest common subsequence "if statement", so "if statement" becomes the value of the merged node.
  • Remove useless nodes
    • Use tree pruning plus a dictionary to remove useless nodes from the skill tree, such as "chapter summary", "extended reading", "project", and similar chapter nodes.
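The clustering step in the list above can be sketched as follows. This is an illustration under several assumptions, not the article's implementation: titles are split into whitespace-separated words (for Chinese titles, characters would be the natural unit), the two ratios are combined by a simple average, the Levenshtein ratio is taken as 1 minus the distance divided by the longer length, and the heuristic clustering is a single-pass threshold scheme with an arbitrary threshold of 0.5. All function names are hypothetical.

```python
def lcs_tokens(a, b):
    """Longest common subsequence of two token lists, by dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], m, n
    while i and j:  # walk back through the table to recover the tokens
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def levenshtein(s, t):
    """Character-level edit distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed combination: average of the LCS ratio and the Levenshtein ratio."""
    ta, tb = a.split(), b.split()
    lcs_ratio = len(lcs_tokens(ta, tb)) / max(len(ta), len(tb), 1)
    lev_ratio = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    return (lcs_ratio + lev_ratio) / 2


def cluster(titles, threshold=0.5):
    """Single-pass clustering: join the first cluster whose first title is
    similar enough, otherwise start a new cluster (no preset cluster count)."""
    clusters = []
    for title in titles:
        for group in clusters:
            if similarity(title, group[0]) >= threshold:
                group.append(title)
                break
        else:
            clusters.append([title])
    return clusters


def merged_label(group):
    """Fold the LCS pairwise over the group (an approximation of the LCS of
    all titles at once); fall back to the first title if nothing is shared."""
    common = group[0].split()
    for title in group[1:]:
        common = lcs_tokens(common, title.split())
    return " ".join(common) or group[0]


label = merged_label([
    "use if statement",
    "if statement processing list",
    "set the format of if statement",
])
```

Here label holds the shared words of the three example titles. Whether a given set of titles ends up in one cluster depends entirely on the assumed similarity formula and threshold.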

The merged skill tree is shown in the figure below:

Picture example 5
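The recursive layer-by-layer merge and the dictionary-based pruning can be sketched on the same kind of nested-dict tree. To keep the example short, sibling nodes are grouped by an exact, case-insensitive title match, which stands in for the similarity clustering and is an assumption of this sketch. The names merge_trees, prune, and USELESS_TITLES are hypothetical, the dictionary contains only the examples named in the text, and dropping a useless node together with its whole subtree is also an assumption.

```python
# Hypothetical dictionary of useless titles, taken from the examples in the text.
USELESS_TITLES = {"chapter summary", "extended reading", "project"}


def merge_trees(trees, key=lambda title: title.lower()):
    """Merge several trees layer by layer, starting from their roots.

    Children of all roots are grouped by key(title); each group is merged
    recursively into one node that keeps the first title of the group.
    """
    groups = {}
    for tree in trees:
        for child in tree.get("children", []):
            groups.setdefault(key(child["title"]), []).append(child)
    children = [merge_trees(group, key) for group in groups.values()]
    return {"title": trees[0]["title"], "children": children}


def prune(node, useless=USELESS_TITLES):
    """Return a copy of the tree without useless nodes and their subtrees."""
    kept = []
    for child in node.get("children", []):
        if any(term in child["title"].lower() for term in useless):
            continue  # drop the node and everything below it
        kept.append(prune(child, useless))
    return {"title": node["title"], "children": kept}


book_a = {"title": "python", "children": [
    {"title": "If Statements", "children": [
        {"title": "Conditional tests", "children": []},
    ]},
]}
book_b = {"title": "python", "children": [
    {"title": "if statements", "children": [
        {"title": "The else branch", "children": []},
    ]},
    {"title": "Chapter Summary", "children": []},
]}

merged = merge_trees([book_a, book_b])
skill_tree = prune(merged)
```

Substring matching against the dictionary is deliberately crude: it also removes any node whose title merely contains one of the listed words, so a real dictionary would need to be curated.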

2.4 Matching of Questions and Skill Trees

Once the skill tree is built, all accepted questions in the Python category need to be mapped to their corresponding nodes. A new question is then matched to the most similar node in the tree, and the accepted questions on that node are recommended.

The matching algorithm used in this article is the Levenshtein ratio (edit distance ratio). The Levenshtein ratio between the question and each node is computed, and the node with the highest ratio is taken as the best match.
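A minimal sketch of this matching step is shown below. The ratio is assumed to be 1 minus the edit distance divided by the length of the longer string. Libraries that provide a Levenshtein ratio may normalize differently, so absolute scores are not comparable across implementations. The function names and the sample node titles are hypothetical.

```python
def levenshtein(s, t):
    """Character-level edit distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(s, t):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    return 1 - levenshtein(s, t) / max(len(s), len(t), 1)


def best_node(question, node_titles):
    """Return (title, score) of the node whose title is closest to the question."""
    scored = [(title, levenshtein_ratio(question.lower(), title.lower()))
              for title in node_titles]
    return max(scored, key=lambda pair: pair[1])


nodes = ["if statements", "data visualization", "installing python"]
node, score = best_node("if statement", nodes)
```

A full-sentence question is usually much longer than a node title, so the length difference alone pushes the ratio down for every node; the sketch only shows the mechanics of picking the highest score.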

3 Summary and Next Steps

Summary

This article implements the construction and merging of programming language skill trees, and the matching of questions to nodes in the skill tree. Only preliminary functionality has been realized so far, and the results need further optimization. The main problems at present are:

  • Useless nodes are not removed thoroughly enough
  • Both the similarity calculation used in clustering and the use of the longest common subsequence as the value of a merged node are unreasonable in some cases. For example, "Python version running" and "Python code fragment" are placed in the same cluster and merged into "Python"
  • Questions and skill tree nodes are written in very different styles: one is a question, the other a knowledge point. Using the Levenshtein ratio (edit distance ratio) to compute their similarity during matching is therefore unreasonable.
  • ……

Next steps

Given the current problems, the next steps under consideration are:

  • Further improving the quality of the constructed skill tree
  • Improving the matching between questions and the tree

Origin blog.csdn.net/u010280923/article/details/117403746