NLP Road to Frozen Hands (3) - Loading and Using Evaluation Metrics (Metric, with BLEU and GLUE as Examples)


✅ Study notes from a complete beginner in NLP research



Previous article link: NLP Road to Frozen Hands (2) - Downloading and Operating on Text Datasets (Datasets)


1. Required environment

Python 3.7+ and PyTorch 1.10+ are required.

● This article is based on the Hugging Face Transformers library, official documentation: https://huggingface.co/docs/transformers/index [a very good open-source project that provides extensive integration around the Transformer framework, currently at 72.3k ⭐️ on GitHub]

● To install the Hugging Face Transformers library, simply run pip install transformers in the terminal [this is the pip installation method]; if you use conda, run conda install -c huggingface transformers

● In addition, this article needs the dataset-processing package named datasets; simply run pip install datasets in the terminal [this is the pip installation method]; if you use conda, run conda install -c huggingface -c conda-forge datasets

● This article also needs the evaluation library named evaluate; simply run pip install evaluate in the terminal [this is the pip installation method, requires Python 3.7+]
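● After installation, a quick sanity check can confirm that all three packages import correctly (a minimal sketch; the printed version numbers depend on your environment):

# Sanity check: the three required packages are importable
import transformers
import datasets
import evaluate

print(transformers.__version__)
print(datasets.__version__)
print(evaluate.__version__)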



2. Loading evaluation metrics

import evaluate
metrics_list = evaluate.list_evaluation_modules()
print("Total number of metrics: ", len(metrics_list))
print("All metrics: ", metrics_list)

● The output prints the total number of available evaluation modules and their names; the commonly used ones (accuracy, bleu, glue, rouge, squad, etc.) are among them:
[Figure: list of available evaluation modules, with the commonly used metrics highlighted]
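A minimal sketch of how the returned list can be used, assuming the canonical metric names appear in it as plain strings, to check whether particular metrics are available:

import evaluate

# Check whether a few commonly used metrics are present in the returned list
metrics_list = evaluate.list_evaluation_modules()
for name in ["accuracy", "bleu", "glue", "rouge", "squad"]:
    print(name, "available:", name in metrics_list)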


● To load a metric, take BLEU as an example; it is generally loaded directly as follows (cache_dir, optional str: path to store temporary predictions and references, by default '~/.cache/huggingface/evaluate/'):

my_metric = evaluate.load("bleu", cache_dir="./")
print(my_metric)  # print the description of this metric

Output:
EvaluationModule(name: "bleu", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: 
Computes BLEU score of translated segments against one or more references.
Args:
    predictions: list of translations to score.
    references: list of lists of or just a list of references for each translation.
    tokenizer : approach used for tokenizing `predictions` and `references`.
        The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
        This can be replaced by any function that takes a string as input and returns a list of tokens as output.
    max_order: Maximum n-gram order to use when computing BLEU score.
    smooth: Whether or not to apply Lin et al. 2004 smoothing.
Returns:
    'bleu': bleu score,
    'precisions': geometric mean of n-gram precisions,
    'brevity_penalty': brevity penalty,
    'length_ratio': ratio of lengths,
    'translation_length': translation_length,
    'reference_length': reference_length
Examples:

    >>> predictions = ["hello there general kenobi", "foo bar foobar"]
    >>> references = [
    ...     ["hello there general kenobi", "hello there!"],
    ...     ["foo bar foobar"]
    ... ]
    >>> bleu = evaluate.load("bleu")
    >>> results = bleu.compute(predictions=predictions, references=references)
    >>> print(results["bleu"])
    1.0
, stored examples: 0)

However, this method requires an Internet connection to download the evaluation script from the Hugging Face Hub. Generally a proxy (VPN) is needed for this to succeed, and even then it may fail if the local network is poor (some metrics can be downloaded without a proxy because their code is hosted somewhere directly accessible, but others sit behind Google services and cannot be reached).

● If you cannot use a proxy at the moment, you can skip ahead to "3. Using evaluation metrics (BLEU and GLUE as examples)" and work with the glue metric, which does not require one.

● I spent nearly three hours looking for a way that works without a proxy, but could not find one. The small piece of content below is part of that search, and it ended in failure... If anyone has managed to solve this problem, you are welcome to leave a message in the comment area!

The fallback idea is to clone the complete evaluate library locally and then call the BLEU evaluation function from the local copy.
● The process is as follows:
  ① Open the corresponding project folder, open a command line directly from it, and git clone the evaluate repository from GitHub. The full command is below.

git clone https://github.com/huggingface/evaluate.git

  ② Then rename this folder to evaluate_1 (so it does not share the name evaluate with the installed package), and then run the following code to print the description of the squad metric. (In particular, after some effort I found that even this way the source code for bleu still cannot be loaded.)

import evaluate

my_metric = evaluate.load('./evaluate_1/metrics/squad')
print(my_metric)
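Once the local squad module loads, it can be used like any other metric; a minimal sketch using the input format from the metric's own documentation (the id value here is just an illustrative string):

import evaluate

# Compute the SQuAD metric from the locally cloned script
squad_metric = evaluate.load('./evaluate_1/metrics/squad')
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
print(squad_metric.compute(predictions=predictions, references=references))
# expected output: {'exact_match': 100.0, 'f1': 100.0}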


3. Using evaluation metrics (BLEU and GLUE as examples)

● As noted above, some evaluation metrics (e.g. bleu) need access to the external network, while others (e.g. glue) do not. Next, I use each of them as an example.

● First, take bleu as an example. Suppose the text predicted by the model (i.e. the candidate translation) is the cat sat on the mat, and there are two reference translations, look at! one cat sat on the mat and there is a cat on a mat. How is the BLEU score computed?

● Specific method: count the number of occurrences of each word in the candidate translation (candidate), and count the number of occurrences of each word in each reference translation (reference). Max(ref_1, ref_2) is the maximum count over all reference translations, and Min is the minimum of the candidate count and Max(ref_1, ref_2), i.e. the clipped count. The 1-gram statistics table is below, followed by the computed precision.

1-gram   candidate   ref_1   ref_2   Max(ref_1, ref_2)   Min(candidate, Max)
the      2           1       0       1                   1
cat      1           1       1       1                   1
sat      1           1       0       1                   1
on       1           1       1       1                   1
mat      1           1       1       1                   1

● Sum the Min values over all words, sum the occurrence counts of all words in the candidate translation, and divide the first sum by the second to get the 1-gram precision, which is the first value of precisions in the output of the code below. Then compute the 2-gram, 3-gram, and 4-gram precisions in the same way, take their logarithms, average them with equal weights, exponentiate, and finally multiply by the brevity penalty. For details, see the BLEU calculation process of the machine translation evaluation metric (Reference 3).
$P = \frac{1+1+1+1+1}{2+1+1+1+1} = \frac{5}{6} = 0.833333$
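The same clipped 1-gram precision can be reproduced by hand in a few lines of Python (a minimal sketch that, like the table above, simply splits on whitespace rather than using the metric's internal tokenizer):

from collections import Counter

# Reproduce the clipped 1-gram precision from the table above
candidate = "the cat sat on the mat".split()
ref_1 = "look at! one cat sat on the mat".split()
ref_2 = "there is a cat on a mat".split()

cand_counts = Counter(candidate)
max_ref_counts = Counter(ref_1) | Counter(ref_2)  # element-wise maximum over the references
clipped = {w: min(c, max_ref_counts[w]) for w, c in cand_counts.items()}

precision_1gram = sum(clipped.values()) / sum(cand_counts.values())
print(precision_1gram)  # 5 / 6 = 0.8333...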

import evaluate
my_metric = evaluate.load('bleu')
print(my_metric)
predictions = ["the cat sat on the mat"]
references = [["look at! one cat sat on the mat", "there is a cat on a mat"]]
results = my_metric.compute(predictions=predictions, references=references)
print("results:", results)

Output:
results: {'bleu': 0.6431870218238024, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 0.846481724890614, 'length_ratio': 0.8571428571428571, 'translation_length': 6, 'reference_length': 7}
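When predictions are produced batch by batch (for example inside an evaluation loop), the metric object can also accumulate them first and compute once at the end; a minimal sketch, reusing the sentences above plus the pair from the BLEU docstring:

import evaluate

# Accumulate prediction/reference batches, then compute the metric once
bleu = evaluate.load('bleu')
batches = [
    (["the cat sat on the mat"], [["look at! one cat sat on the mat", "there is a cat on a mat"]]),
    (["hello there general kenobi"], [["hello there general kenobi", "hello there!"]]),
]
for preds, refs in batches:
    bleu.add_batch(predictions=preds, references=refs)
print(bleu.compute())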

● For the larger glue benchmark, I simply use its mrpc subtask, which measures whether two sentences are semantically equivalent or similar.

import evaluate
my_metric = evaluate.load('glue', 'mrpc')
predictions = [0, 1, 0, 1, 0]  # note: only binary labels can be compared here
references = [0, 1, 0, 1, 1]
results = my_metric.compute(predictions=predictions, references=references)
print("results:", results)  # only the last of the five labels differs, so accuracy = 4/5 = 0.8

Output:
results: {'accuracy': 0.8, 'f1': 0.8}
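In practice the binary labels usually come from a model's output logits; a minimal sketch (the logits array here is made up purely for illustration) of converting them with argmax before calling the metric:

import numpy as np
import evaluate

# Hypothetical sentence-pair logits; argmax turns them into the binary labels glue/mrpc expects
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 1.5,  0.2],
                   [-2.0,  1.1],
                   [ 0.3,  0.1]])
predictions = np.argmax(logits, axis=-1)  # -> [0, 1, 0, 1, 0]
references = [0, 1, 0, 1, 1]

mrpc_metric = evaluate.load('glue', 'mrpc')
print(mrpc_metric.compute(predictions=predictions, references=references))
# same result as above: {'accuracy': 0.8, 'f1': 0.8}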


4. Summary

● Working with evaluation metrics is not difficult; a rough read-through and one hands-on pass are enough.


5. Supplementary Notes

Previous article link: NLP Road to Frozen Hands (2) - Downloading and Operating on Text Datasets (Datasets)

● If anything here is wrong, or if you have any questions, please feel free to discuss it in the comments.

● Reference video: HuggingFace concise tutorial - BERT Chinese model practical example, NLP pre-trained models, the Transformers library, and a quick start to the datasets library.

● Reference 1: Evaluate: a detailed introduction to the Hugging Face evaluation metric module

● Reference 2: A detailed explanation of the principle and calculation of BLEU

● Reference 3: The detailed BLEU calculation process of the machine translation evaluation metric



Origin blog.csdn.net/Wang_Dou_Dou_/article/details/127495110