Natural Language Processing from Entry to Application - Evaluation Metrics

Category: General Catalog of "Natural Language Processing from Entry to Application"

Related Articles:
A Deeper Understanding of Machine Learning - Performance Metrics for Machine Learning Models


Because natural language processing tasks are diverse and their evaluation is often subjective, it is difficult for a single metric to measure performance across all tasks, so different types of tasks are usually evaluated with different methods. A firm grasp of these evaluation methods helps in understanding the various natural language processing tasks in depth. Accuracy is the simplest and most intuitive metric and is commonly applied to problems such as text classification. It is computed as:

$$\text{ACC}^{\text{CLS}} = \frac{\text{number of correctly classified texts}}{\text{total number of test texts}}$$

Sequence tagging problems such as part-of-speech tagging can also be evaluated by accuracy, namely:
$$\text{ACC}^{\text{POS}} = \frac{\text{number of correctly tagged words}}{\text{total number of words in the test text}}$$
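As a minimal sketch (the label lists below are purely illustrative), accuracy for classification or tagging can be computed by comparing predicted labels against gold labels position by position:

```python
# A minimal sketch of accuracy: the fraction of positions where the predicted
# label matches the gold label. The gold/pred lists below are illustrative.
def accuracy(gold, pred):
    assert len(gold) == len(pred), "gold and predicted labels must align"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold_labels = ["sports", "finance", "sports", "tech", "finance"]
pred_labels = ["sports", "finance", "tech",   "tech", "sports"]
print(accuracy(gold_labels, pred_labels))  # 3 of 5 correct -> 0.6
```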

However, not every sequence tagging problem can be evaluated with accuracy. For example, once sequence segmentation problems such as word segmentation and named entity recognition are converted into sequence tagging problems, accuracy is no longer an appropriate measure. Take named entity recognition: if accuracy is computed over words, the many words that are not part of any named entity (those labeled O) are also counted. Moreover, if some words in an entity are mislabeled, the recognized entity as a whole is wrong, yet word-level accuracy still credits the remaining words as correctly classified. In the example shown in the figure below, counting by words (here, Chinese characters), only one of the 8 input characters ("San") is predicted incorrectly, giving an accuracy of $\frac{7}{8} = 0.875$, which is clearly unreasonable. Other sequence segmentation problems, such as word segmentation, suffer from the same issue.

[Figure: An example of named entity recognition evaluation]
So how can performance on sequence segmentation problems be evaluated more reasonably? This requires introducing the F-value (F-Measure or F-Score), the weighted harmonic mean of precision (Precision) and recall (Recall). The formula is:

$$\text{F-value} = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

Here $\beta$ is the weighting parameter, $P$ is the precision, and $R$ is the recall. When $\beta = 1$, precision and recall are weighted equally, and the F-value is called the F1 value, computed as:

$$F_1 = \frac{2PR}{P + R}$$
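A minimal sketch of the F-value as a function of precision and recall (the sample values passed in are illustrative):

```python
# Weighted harmonic mean of precision and recall; beta=1 gives the F1 value.
def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))          # F1 = 0.5
print(f_measure(0.8, 0.4, beta=2))  # F2, which weights recall more heavily
```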

In named entity recognition problems, precision and recall are defined as:
$$P = \frac{\text{number of correctly identified named entities}}{\text{total number of identified named entities}}$$

$$R = \frac{\text{number of correctly identified named entities}}{\text{total number of named entities in the test text}}$$

Returning to the example in the figure above, the number of correctly identified named entities is 1 ("Harbin"), the total number of identified named entities is 2 ("Zhang" and "Harbin"), and the total number of named entities in the test text is 2 ("Zhang San" and "Harbin"). Precision and recall are therefore both $\frac{1}{2} = 0.5$, and the final $F_1 = 0.5$. Compared with the word-level accuracy (0.875), this value is far more reasonable.
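A sketch of entity-level evaluation, assuming both gold and predicted entities are represented as (start, end, type) spans; the spans below are a hypothetical reconstruction of the figure's example, not taken from the original data:

```python
# Entity-level precision / recall / F1: an entity counts as correct only if
# its boundaries and type both match exactly (a common convention, assumed here).
def entity_prf(gold_entities, pred_entities):
    gold, pred = set(gold_entities), set(pred_entities)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Hypothetical spans: gold = {"Zhang San", "Harbin"}, predicted = {"Zhang", "Harbin"}.
gold = [(0, 2, "PER"), (4, 7, "LOC")]
pred = [(0, 1, "PER"), (4, 7, "LOC")]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```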

Once the difference and connection between accuracy and the F-value are understood, it becomes easy to choose an appropriate metric for a natural language processing task. For example, when evaluating dependency parsing (whose output is a syntactic dependency tree), the correct annotation assigns exactly one parent node to each word, so word-level accuracy can be used to measure what proportion of words were assigned the correct parent. This metric is usually not called accuracy directly; instead it is the UAS (Unlabeled Attachment Score), i.e. the accuracy with which the parent node of each word is identified. When the relation between a word and its parent is also considered, the LAS (Labeled Attachment Score) is used, i.e. the accuracy with which both the parent node and the syntactic relation to it are identified correctly; a sketch of both is given below. For the semantic dependency graph task, the number of parent nodes per word is not fixed, so accuracy cannot be used; instead, the F-value is computed with the arcs of the graph as units: precision and recall of arc recognition are computed and then combined into an F-value. As with dependency parsing, this F-value comes in two variants, with and without the semantic relation taken into account. Similarly, phrase-structure parsing cannot be evaluated with accuracy; it is evaluated with the F-value over the phrases in the syntactic structure (taking into account both the phrase type and the span it covers).
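A sketch of UAS and LAS computation, assuming the parser output is given as one (head, relation) pair per word; the toy arrays below are illustrative:

```python
# UAS: fraction of words whose predicted head is correct.
# LAS: fraction of words whose predicted head AND dependency relation are both correct.
def uas_las(gold_heads, gold_rels, pred_heads, pred_rels):
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(gh == ph and gr == pr
              for gh, gr, ph, pr in zip(gold_heads, gold_rels, pred_heads, pred_rels)) / n
    return uas, las

# Toy example: 4 words, heads indexed from 1 (0 = root).
gold_heads, gold_rels = [2, 0, 4, 2], ["nsubj", "root", "amod", "obj"]
pred_heads, pred_rels = [2, 0, 2, 2], ["nsubj", "root", "amod", "iobj"]
print(uas_las(gold_heads, gold_rels, pred_heads, pred_rels))  # (0.75, 0.5)
```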

Accuracy and the F-value work for tasks with relatively clear-cut reference answers, but the answers to many natural language processing problems are not clear-cut, or not unique. Consider the language model problem introduced in "Natural Language Processing from Entry to Application - Language Model (Language Model, LM) for Natural Language Processing": when predicting the next word for a given history, many words other than the one actually observed in the corpus are also reasonable, so accuracy cannot simply be used, and the metric of perplexity is introduced instead. Evaluating machine translation systems is similar: the reference translations in the test data are not the only correct answers, and as long as the output preserves the meaning of the source text, its wording can vary greatly. BLEU is the most commonly used automatic metric for machine translation. It counts the proportion of N-grams in the machine translation that also appear in the (one or more) reference translations, out of all N-grams in the machine translation, i.e. the N-gram precision. N should be neither too large nor too small: if N is too large, few N-grams co-occur between the machine translation and the references; if N is too small, the word-order information of the translation is not captured. N is therefore usually set to at most 4. In addition, because this measure only considers precision and ignores recall, it favors shorter translations, so BLEU introduces a length (brevity) penalty that encourages the machine translation to be about as long as the reference. The final BLEU value ranges from 0 to 1, and the higher the score, the better the translation quality.
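A simplified sentence-level BLEU sketch (modified N-gram precision up to 4-grams combined with the brevity penalty). Real evaluations are usually done at corpus level with standard tools, so treat this only as an illustration of the computation described above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(references, hypothesis, max_n=4):
    """Simplified BLEU: geometric mean of clipped 1..max_n-gram precisions
    times a brevity penalty. `references` is a list of token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        if not hyp_counts:
            return 0.0
        # Clip each n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref_counts[g] = max(max_ref_counts[g], c)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        if clipped == 0:  # no smoothing in this sketch
            return 0.0
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    # Brevity penalty: penalize hypotheses shorter than the closest reference.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references), key=lambda l: (abs(l - hyp_len), l))
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(sum(log_precisions) / max_n)

refs = [["the", "cat", "is", "on", "the", "mat"]]
hyp = ["the", "cat", "is", "on", "a", "mat"]
print(sentence_bleu(refs, hyp))  # roughly 0.54 for this toy pair
```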

For human-computer dialogue systems, one could in principle evaluate with historical human-human dialogue data and metrics such as BLEU, but because replies are open-ended it is difficult to guarantee that such automatic evaluation is fair and objective. As in machine translation, a dialogue system's reply has no single standard answer; the evaluation is even harder than for machine translation, because the reply is not even constrained to have the same meaning as the input, i.e. the answer is open-ended. Moreover, because dialogue is interactive, the system cannot be judged from a single round of exchange. All of these issues make automatic evaluation of dialogue systems very challenging. Dialogue systems are therefore usually evaluated manually: after several rounds of dialogue between a human and the system, a subjective score is given, either as a single overall score or along several dimensions (fluency, relevance, accuracy, etc.). Because scoring is subjective, the consistency of manual evaluation is often low, meaning different people may give very different scores; to smooth out these differences, multiple annotators score the system and their scores are averaged. Manual evaluation is therefore expensive and hard to repeat many times during system development. In summary, how to evaluate human-computer dialogue systems remains a difficult open problem in natural language processing that has not yet been solved well.



Origin blog.csdn.net/hy592070616/article/details/131031773