Overview of Chinese word segmentation and keyword extraction

This article is based on my experience implementing a duplicate-checking feature for a question bank and on studying "NLP: Natural Language Processing Principles and Practice". There are bound to be shortcomings; please point them out.

Introduction

Chinese word segmentation is the first problem natural language processing (NLP) has to solve in a Chinese-language environment. The main difficulty is that, unlike English, Chinese has no explicit separators (such as spaces) to mark word boundaries, and different segmentations of the same string can all be linguistically valid. For example:

  1. (already) married / monk / not (yet) married
  2. (already) married / and / not yet married

Both are plausible segmentations of the same Chinese sentence: one reading yields "monk" (和尚), the other the conjunction "and" (和).

Basic concepts

Evaluation metrics

Chinese word segmentation is generally evaluated along three dimensions: Precision, Recall, and F-score, and F-score is usually the one we care most about. The figure below (source: https://github.com/lancopku/pkuseg-python) shows these three metrics for three Chinese word segmentation tools, which makes it easy to compare their segmentation results on a particular data set.
[Figure: Precision / Recall / F-score comparison of three Chinese word segmentation tools, from the pkuseg-python repository]

F-score is sometimes also written as F1-Measure; the two refer to the same concept.

Model Evaluation Criteria

For models (including semantic models, classification/clustering models, etc.), four metrics are generally used to judge model quality (a small sketch of how they are computed follows this list):

  1. Accuracy: the proportion of correctly judged samples out of the total number of samples.
  2. Recall: the proportion of correctly judged samples out of the total number of samples that are actually correct (relevant).
  3. Precision: the proportion of correctly judged samples out of the total number of returned results.
  4. F1-Measure: the harmonic mean of Precision and Recall; it ties together the number of correct judgments, the number of wrong judgments, and the total number of returned results at the same time.
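
For reference, a minimal sketch of my own (not from the referenced book) showing how these metrics are computed from true positives (tp), false positives (fp), and false negatives (fn):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from raw counts.

    tp: true positives  (correct judgments)
    fp: false positives (wrong judgments among the returned results)
    fn: false negatives (relevant items that were missed)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of Precision and Recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(precision_recall_f1(tp=80, fp=20, fn=10))  # (0.8, 0.888..., 0.842...)
```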

Common solutions

There are two main approaches to Chinese word segmentation:

  1. Semantics-based implementations
  2. Statistical probability-based implementations

Influenced by factors such as the history of programming languages, the surrounding ecosystem, and market demand, today's mainstream Chinese word segmentation tools are mostly written in Python and Java. Common tools are listed below; the introductions copied all over the internet are omitted here. Personally, I think that as long as custom dictionaries and word frequencies are supported, these projects, after years of iteration in the open source community, can cover most language analysis scenarios that are not particularly complex.

  1. jieba ("stuttering" segmentation): natively Python; there are also Go, Java, PHP, and Node.js ports, but they are not officially maintained. Taking the PHP port as an example, in actual use it shows slow updates, code bugs, and a lack of flexibility and extensibility.
  2. HanLP: developed in Java.
  3. funNLP: developed in Python.
  4. sego: developed in Go.
  5. scws: originally written in PHP; its last GitHub update was in 2016, and on mixed Chinese/English/numeric text its real-world results are far worse than those of the PHP port of jieba.

For HMM and the Viterbi algorithm in Chinese word segmentation, see my other blog post on HMM, Viterbi, and Chinese word segmentation. Comments and discussion are very welcome.

Applications

Sensitive word detection

Sensitive word detection is generally implemented in two ways.

  1. Based on a lexicon + a deterministic finite automaton (DFA)

    Build a prefix trie (prefix index forest) from the lexicon, then scan the text and match sensitive words against it DFA-style (a sketch follows at the end of this subsection).

  2. Based on a lexicon + a Chinese word segmentation tool

    Compared with the DFA approach, the key point here is that only the words produced by segmentation are checked against the sensitive word list. Its accuracy therefore depends on the quality of the segmentation (the training data and word frequencies behind it). On the other hand, it does not blindly match without regard to context the way a DFA does, which makes it "smarter". When the detection result is fed back to the user in some form, this method tends to give a better user experience than the DFA-based one.

Whichever of the two methods above you rely on, the lexicon problem has to be solved first. There are some privately uploaded sensitive word lists on GitHub, and you can also try to find free sensitive word lists released by teams at companies such as Weibo, Baidu, and Tencent. Regardless of the source, sensitive words are always time-sensitive and require continuous maintenance.
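
A minimal sketch of the first (lexicon + DFA) approach, with a hard-coded example lexicon; the function names and words below are illustrative only:

```python
def build_trie(words):
    """Build a prefix trie from the sensitive word lexicon."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['#end'] = True  # marks the end of a complete sensitive word
    return root


def find_sensitive(text, trie):
    """Scan the text and return all sensitive words matched DFA-style."""
    hits = []
    for i in range(len(text)):
        node, j = trie, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '#end' in node:          # reached a terminal state
                hits.append(text[i:j])
    return hits


trie = build_trie(["badword", "worse word"])   # hypothetical lexicon
print(find_sensitive("this text has a badword in it", trie))  # ['badword']
```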

Keyword extraction

Keyword extraction is really a narrow, specialized application; its higher-level uses mostly appear in keyword retrieval over long texts (this is not the same as the full-text retrieval principle of ES, where both the index-time and query-time segmentation depend on the chosen segmentation tool), text similarity retrieval, and automatic summarization. Common algorithms include TF-IDF and TextRank, and jieba supports both.
The accuracy of the keywords depends on the accuracy of the segmentation. Regarding TF-IDF, the basic idea of keyword extraction is: the more often a word appears in the current text and the less often it appears in other texts, the more likely it is to be a keyword. Therefore, in practice, when working with a limited text collection or a collection from a particular vertical domain, first build an IDF dictionary from that collection or from domain samples, and then extract keywords against it; this can noticeably improve Precision.
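
As a rough sketch with the Python version of jieba (the IDF file name "my_domain_idf.txt" is a placeholder you would generate from your own corpus):

```python
import jieba.analyse

# Optional: point jieba at an IDF dictionary built from your own corpus or
# vertical-domain samples (one "word idf_value" per line).
# "my_domain_idf.txt" is a placeholder file name.
jieba.analyse.set_idf_path("my_domain_idf.txt")

text = "题库查重需要先做中文分词，再基于 TF-IDF 或 TextRank 提取关键词。"

# TF-IDF based keywords
print(jieba.analyse.extract_tags(text, topK=10, withWeight=True))

# TextRank based keywords
print(jieba.analyse.textrank(text, topK=10, withWeight=True))
```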

Similar text deduplication

There are many approaches to deduplicating similar texts; only a few key ideas are given here. In practice, factors such as algorithmic efficiency, data volume, implementation difficulty, and the required form of output should all be weighed.

  1. Based on keyword extraction, compute a similarity-preserving hash (such as SimHash) from the keywords. When two texts are similar, their extracted keywords are also similar or identical, so the computed hash values should likewise be similar or identical; the problem of judging whether texts are similar or equal is thereby converted into judging whether their hash values are similar or equal. This is common for duplicate checking over massive data, but it can never be 100% reliable. (A sketch follows this list.)
  2. Brute force: compute the Hamming distance or edit distance between each text and every other text in the collection, and judge two texts as similar when the ratio of the distance to the total length is below a certain threshold. When the texts are not too complex and the data volume is small, the implementation cost is low, the algorithm is simple, and after tuning for the characteristics of the specific texts the Precision can be very high. Obviously, for a collection of size n this requires on the order of n² comparisons.
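
A simplified SimHash sketch of the first idea (my own illustration: it uses MD5 per keyword and hand-picked weights; in practice you would feed in the real keyword weights from extraction):

```python
import hashlib


def simhash(weighted_keywords, bits=64):
    """Compute a SimHash fingerprint from (keyword, weight) pairs."""
    v = [0.0] * bits
    for word, weight in weighted_keywords:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    # Bits with a positive accumulated weight become 1
    return sum(1 << i for i in range(bits) if v[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


h1 = simhash([("select", 2.0), ("correct", 1.5), ("option", 1.0)])
h2 = simhash([("select", 2.0), ("incorrect", 1.5), ("option", 1.0)])
print(hamming(h1, h2))  # small distance => the texts are likely similar
```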

In fact, in my view the biggest difficulty with similar-text detection is not performance but Precision. Take the following texts as an example:

  1. Please select the correct option below.
  2. Please select the incorrect option below.

The two differ by only a single word, yet their meanings are completely different. This kind of problem is probably even more common when comparing short texts for similarity. I do not yet have a good solution for it, and suggestions from readers are welcome. My temporary workaround: although the two texts are similar, their SimHash values are still not equal (segmentation can tell "correct" from "incorrect"), so as long as the SimHash values are not equal the texts are not grouped together for now. This obviously lowers Recall.

Frequently asked questions

Q: Is word segmentation with a Chinese word segmentation tool very slow?

When the texts are not very long, segmentation performance is high. Taking jieba as an example: in a PHP 7.3 environment, for 1000 texts with an average length of about 200 characters, fetching the data + segmentation + keyword extraction + SimHash computation + storage (batch update) takes only about 1 second in total, and for a single text of no more than 300 characters, segmentation itself takes less than 1 ms. Modern CPUs are very fast; unless there are many concurrent tasks and limited CPU resources, what usually matters is not segmentation performance but the time complexity of your own code, database read efficiency (e.g., whether MySQL SELECTs are optimized), database write efficiency, and whether external requests are made and how long they take. Of course, each case should be analyzed on its own; the possible causes of performance problems are by no means limited to the points mentioned here.

Q: Do Chinese word segmentation tools work out of the box?

My experience: if it is just for learning, they can be considered out of the box. But if you want to use them for sensitive word detection or other vertical domains (including but not limited to text deduplication and text retrieval), the suggestion is to collect as many reliable domain word lists as possible, build a dictionary from the data you hold, and load it into the segmentation tool (open source segmentation tools generally provide an interface for loading user-defined dictionaries); this can effectively improve the F-score of both segmentation and keyword extraction. In addition, in practice, when the text contains many special or useless symbols, it is recommended to add preprocessing logic first (e.g., for HTML entities, tags, and redundant invisible characters in escaped text), which also helps improve the F-score (see the sketch below).
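
A rough Python sketch of this preprocessing plus user dictionary loading with jieba ("userdict.txt" is a placeholder; each line of a jieba user dictionary is "word [frequency] [POS tag]"):

```python
import html
import re

import jieba

# Load a user-defined dictionary; "userdict.txt" is a placeholder
# for your own domain dictionary.
jieba.load_userdict("userdict.txt")


def preprocess(text):
    """Strip HTML entities/tags and invisible characters before segmentation."""
    text = html.unescape(text)                                # &amp; -> &, &nbsp; -> U+00A0, ...
    text = re.sub(r"<[^>]+>", " ", text)                      # drop HTML tags
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)    # drop zero-width characters
    return re.sub(r"\s+", " ", text).strip()                  # collapse whitespace


cleaned = preprocess("<p>请选择下面&nbsp;正确的选项</p>")
print(list(jieba.cut(cleaned)))
```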

Q: For NLP-based features, how high an F-score is appropriate?

Honestly, I do not know what F-score counts as "appropriate"; it is easier to answer in the context of the actual business scenario (for academic research, the higher the better). My experience is that once the F-score exceeds about 90%, every additional 1% comes at an exponentially higher cost (repeatedly adjusting the word selection strategy, improving the dictionary and word frequency files, and re-collecting statistics and re-analyzing the domain), and each adjustment is not necessarily general. So I personally recommend judging against the actual scenario. For example, the question bank has 13 million+ records, and Precision has been raised to 90%+ (for specific question types in specific subjects it can even be considered to reach 99%+). At that point the misjudged records are estimated to be a very small fraction of the total (perhaps less than 1%). These detected duplicates are temporarily hidden (soft deleted) and restored later if they turn out to be false positives; handled this way, the impact on the overall product experience is not noticeable.

Q: In practical applications, what should I do if the Precision requirements are very high?

This kind of problem is more common in production. For example, with 10 million+ records, even a 1% misjudgment rate means 100,000 wrong records, which may cause significant losses. In that case, actively analyze the characteristics of the text collection and improve the fit in a targeted way: modifying the tool's code, supplementing the dictionary, adding stop words, adjusting word frequencies, and so on (see the earlier question about working out of the box). This may cause overfitting, but if you are only dealing with a known and limited data set, overfitting is not necessarily a bad thing.
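
For illustration, some of the corresponding knobs in the Python version of jieba (the file name and example words are placeholders, not real project data):

```python
import jieba
import jieba.analyse

# Supplement the dictionary with a domain term (word, frequency, optional POS tag).
jieba.add_word("题库", freq=20000, tag="n")

# Nudge the tokenizer's word frequencies: force "中将" to be split into "中 / 将".
jieba.suggest_freq(("中", "将"), tune=True)

# Use a custom stop word list for keyword extraction ("stopwords.txt" is a placeholder).
jieba.analyse.set_stop_words("stopwords.txt")

print(jieba.analyse.extract_tags("请选择下面正确的选项", topK=5))
```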


Origin blog.csdn.net/qq_23937195/article/details/102586257