Keyword extraction in the project iteration process

The project requirement is to parse science and technology policies, quantifying the textual information in the policy body so that it can be associated with other data.

My approach is to quantify science and technology policies using natural language processing techniques. Many scholars have designed their own theoretical models for quantifying policy. Once policies are quantified, we can identify adjustment periods of policy within a particular area and, combined with the relevant data in the "Economic Statistics Yearbook", study the impact of science and technology policy on economic and technological performance.

My mentor's specific suggestion for parsing a science and technology policy was to pick out its key sentences: since policies are normative documents, a key sentence is usually an important source of support for a theory or for data. I did not follow this approach, for the following reasons:

  1. Natural language processing technology is not yet mature enough, and sentence-level understanding is quite difficult.
  2. Keyword extraction is comparatively well studied, with relevant papers and studies available, so keyword extraction via text mining is a viable option.
  3. Clustering policy titles and combining that with keyword extraction gives a rough picture of where a policy sits relative to others, which can substitute for modeling the associations between sentences.

Compared to pursuing the ultimate algorithm, I think software engineering is more concerned with how to use existing technology to design a reasonable business process, thereby reducing the project's dependence on technologies that are not yet mature. In our project, keyword extraction uses a machine learning algorithm, which is considerably better than using TF-IDF or TextRank directly. It could be optimized further, but what we care about more is designing a reasonable classification model that allocates these keywords sensibly, so that they deliver value beyond the keywords themselves.

The following describes in detail how I did the keyword extraction.

First, science and technology policies contain a lot of specialized vocabulary. Existing general-purpose lexicons barely cover it, and there is no authoritative policy lexicon, so the program needs to be able to discover and recognize new words in policy documents.

By segmenting the text and computing each candidate word's internal mutual information and left/right neighbor entropy, we can discover new words: https://github.com/zhanzecheng/Chinese_segment_augment
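The rough idea is that a character sequence is a good new-word candidate if it is internally cohesive (high mutual information between its parts) and appears in varied contexts (high left and right neighbor entropy). Below is a minimal, self-contained sketch of that scoring, not the linked repository's actual code; the thresholds and parameter names are illustrative only.

```python
# Sketch of new-word discovery by mutual information + left/right entropy.
import math
from collections import Counter, defaultdict

def discover_new_words(text, max_len=4, min_count=5, min_pmi=3.0, min_entropy=1.5):
    total = len(text)
    ngram_counts = Counter()
    left_neighbors = defaultdict(Counter)
    right_neighbors = defaultdict(Counter)

    # Count all n-grams up to max_len and record their neighboring characters.
    for n in range(1, max_len + 1):
        for i in range(total - n + 1):
            w = text[i:i + n]
            ngram_counts[w] += 1
            if n > 1:
                if i > 0:
                    left_neighbors[w][text[i - 1]] += 1
                if i + n < total:
                    right_neighbors[w][text[i + n]] += 1

    def prob(w):
        return ngram_counts[w] / total

    def entropy(counter):
        s = sum(counter.values())
        return -sum(c / s * math.log(c / s) for c in counter.values()) if s else 0.0

    candidates = {}
    for w, c in ngram_counts.items():
        if len(w) < 2 or c < min_count:
            continue
        # Cohesion: minimum pointwise mutual information over all binary splits.
        pmi = min(
            math.log(prob(w) / (prob(w[:i]) * prob(w[i:])))
            for i in range(1, len(w))
        )
        # Freedom: the candidate should have varied contexts on both sides.
        freedom = min(entropy(left_neighbors[w]), entropy(right_neighbors[w]))
        if pmi >= min_pmi and freedom >= min_entropy:
            candidates[w] = (c, pmi, freedom)
    return candidates
```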

In this way, we discovered about 4,300 new words from the crawled corpus; after manually filtering out the wrong ones, the new words were added to a custom dictionary.

This process was quite interesting; quite a few of the new words we found were hard even for a human to judge.


After building this capability into the program, whenever a new science and technology policy comes in we can discover its new words and, through interaction with the user, persist them into the dictionary.
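A minimal sketch of what that interaction could look like, assuming the discovery function sketched above and jieba as the segmenter; the dictionary file name is hypothetical.

```python
# Sketch: let the user confirm discovered words, then persist the approved ones.
import jieba

def review_and_persist(candidates, dict_path="policy_userdict.txt"):
    """candidates: dict of word -> (count, pmi, entropy) from discovery."""
    approved = []
    for word, (count, pmi, freedom) in sorted(
            candidates.items(), key=lambda kv: -kv[1][0]):
        answer = input(f"Keep '{word}' (count={count})? [y/N] ").strip().lower()
        if answer == "y":
            approved.append(word)
    # Append approved words to the custom dictionary file (one word per line).
    with open(dict_path, "a", encoding="utf-8") as f:
        f.writelines(w + "\n" for w in approved)
    for w in approved:
        jieba.add_word(w)   # takes effect immediately in this session
    return approved

# On later runs, the accumulated dictionary is loaded once at start-up:
# jieba.load_userdict("policy_userdict.txt")
```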


Second, a policy usually covers several areas of adjustment, so the extracted keywords need to be highly specialized, representative, and discriminative.

After adding the newly discovered words to the dictionary, the algorithm first takes a weighted average of TF-IDF, Doc2Vec, and TextRank scores to select the top 30 candidate keywords for each document. The task is then treated as binary classification: keywords are labeled 1 and non-keywords 0, and for each candidate word we extract features such as TF-IDF, TextRank, part of speech, position, TF, LDA, Word2Vec, Doc2Vec, and IDF. A two-layer MLP is trained on the training set; on the test set, the 7 candidates with the highest predicted probability in each document are taken as its keywords.
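As a rough illustration of the classification step (not the project's actual code), the sketch below trains a two-layer MLP on per-candidate features and takes the top 7 candidates per document by predicted probability; the feature names, layer sizes, and helper signatures are assumptions.

```python
# Sketch: supervised re-ranking of keyword candidates with a 2-layer MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed feature order for each candidate word's feature vector.
FEATURES = ["tfidf", "textrank", "pos_id", "position", "tf",
            "lda_sim", "word2vec_sim", "doc2vec_sim", "idf"]

def train_keyword_ranker(X_train, y_train):
    # Binary classifier: label 1 = keyword, 0 = non-keyword.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
    )
    model.fit(X_train, y_train)
    return model

def top_k_keywords(model, doc_candidates, k=7):
    """doc_candidates: list of (word, feature_vector) for one document."""
    words = [w for w, _ in doc_candidates]
    X = np.array([f for _, f in doc_candidates])
    proba = model.predict_proba(X)[:, 1]        # probability of being a keyword
    order = np.argsort(proba)[::-1][:k]         # top-k by predicted probability
    return [words[i] for i in order]
```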

In the end, the results were quite good.


The main task of the next stage is to use these keywords to classify and quantify the policies.
