1. Required environment

● Python 3.7+ and PyTorch 1.10+ are required.

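● The libraries used in this post are built around Hugging Face Transformers. Official documentation: https://huggingface.co/docs/transformers/index [An excellent open-source project that bundles a large collection of Transformer-based models and tooling; 72.3k ⭐️ on GitHub at the time of writing.]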

● To install Hugging Face Transformers, run pip install transformers in a terminal (the pip method); if you use conda, run conda install -c huggingface transformers instead.

● You also need datasets, a package for loading and processing datasets: run pip install datasets (pip), or conda install -c huggingface -c conda-forge datasets (conda).

● Finally, install the evaluate library, which provides the metrics: run pip install evaluate (pip; requires Python 3.7+).
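Before going further, it can help to confirm that the interpreter and packages listed above are actually visible. The following is a minimal sketch using only the standard library; the function name check_environment and the package tuple are my own choices for illustration, and the check only tells you whether a package can be found, not whether its version meets the requirement.

```python
import sys
from importlib import util

# Import names of the packages this post relies on (PyTorch is imported as "torch").
REQUIRED = ("torch", "transformers", "datasets", "evaluate")


def check_environment(packages=REQUIRED, min_python=(3, 7)):
    """Report whether Python is new enough and whether each package can be found."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed;
        # it does not import the package itself.
        report[name] = util.find_spec(name) is not None
    return report


report = check_environment()
for key, ok in report.items():
    print(f"{key}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING can be installed with the pip or conda command given in the corresponding bullet above.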

2. Loading evaluation metrics

import evaluate
metrics_list = evaluate.list_evaluation_modules()
print("Number of metrics:", len(metrics_list))
print("All metrics:", metrics_list)

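● Running it prints the number of available modules followed by their names:
[Screenshot: the printed list, with the most commonly used metrics boxed in red]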

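● To load a metric, take BLEU as an example. Usually the call below is all you need. The optional cache_dir argument (str) is the path where temporary predictions and references are stored (default: ~/.cache/huggingface/evaluate/):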

my_metric = evaluate.load("bleu", cache_dir="./")
print(my_metric)  # print the metric's description and usage notes

Output:
EvaluationModule(name: "bleu", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: 
Computes BLEU score of translated segments against one or more references.
Args:
    predictions: list of translations to score.
    references: list of lists of or just a list of references for each translation.
    tokenizer : approach used for tokenizing `predictions` and `references`.
        The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
        This can be replaced by any function that takes a string as input and returns a list of tokens as output.
    max_order: Maximum n-gram order to use when computing BLEU score.
    smooth: Whether or not to apply Lin et al. 2004 smoothing.
Returns:
    'bleu': bleu score,
    'precisions': geometric mean of n-gram precisions,
    'brevity_penalty': brevity penalty,
    'length_ratio': ratio of lengths,
    'translation_length': translation_length,
    'reference_length': reference_length
Examples:

    >>> predictions = ["hello there general kenobi", "foo bar foobar"]
    >>> references = [
    ...     ["hello there general kenobi", "hello there!"],
    ...     ["foo bar foobar"]
    ... ]
    >>> bleu = evaluate.load("bleu")
    >>> results = bleu.compute(predictions=predictions, references=references)
    >>> print(results["bleu"])
    1.0
, stored examples: 0)

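● However, this approach downloads the metric script over the network (through Hugging Face), which usually requires a proxy or VPN to get past the firewall; on a restricted local network the call may fail. (Some metrics load without a proxy because their code is reachable, but others depend on files hosted on Google's side, which cannot be downloaded.)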

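● If you cannot get past the firewall right now, skip ahead to Section 3, "Using evaluation metrics (BLEU and GLUE as examples)", and use the glue metric, which does not need a proxy.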

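● I spent nearly three hours looking for a way to load BLEU without a proxy and did not find one. What follows is part of what I tried, and it ended in failure... If anyone has solved this, please leave a comment!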

The idea is to download the complete evaluate repository locally and then call the BLEU metric from the local copy.
● The steps are:
① Open a terminal in your project folder and clone the evaluate repository from GitHub:
git clone https://github.com/huggingface/evaluate.git
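② Rename the cloned folder to evaluate_1 (any name that does not clash with evaluate will do), then run the code below to print the description of the squad metric. (Note: in my tests, BLEU's source still could not be loaded this way.)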
my_metric = evaluate.load('./evaluate_1/metrics/squad')
print(my_metric)
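The local-loading attempt above can be wrapped in a small helper that first checks whether the metric script is present in the clone. This is only a sketch: the helper names local_metric_script and load_local_metric are hypothetical, the folder layout metrics/<name>/<name>.py is assumed from the repository clone, and the helper does not solve the BLEU problem described above; it only makes a missing script easier to diagnose.

```python
from pathlib import Path


def local_metric_script(repo_dir, metric_name):
    """Return metrics/<name>/<name>.py inside a local clone, or None if absent."""
    script = Path(repo_dir) / "metrics" / metric_name / f"{metric_name}.py"
    return script if script.is_file() else None


def load_local_metric(repo_dir, metric_name):
    """Load a metric from a local clone of the evaluate repository."""
    script = local_metric_script(repo_dir, metric_name)
    if script is None:
        raise FileNotFoundError(
            f"no script for '{metric_name}' under {Path(repo_dir) / 'metrics'}"
        )
    import evaluate  # imported lazily so the path check works without the library

    # evaluate.load accepts the directory that contains the metric script.
    return evaluate.load(str(script.parent))
```

With the clone renamed as in step ②, load_local_metric("./evaluate_1", "squad") corresponds to the call shown above. A script may still try to fetch extra files when it is loaded, which would be consistent with the BLEU failure reported here.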

3. Using evaluation metrics (BLEU and GLUE as examples)

● Some metrics need a connection to the outside internet every time they are used, for example bleu, while others such as glue do not. I use both as examples below.

● First, BLEU. Suppose the machine output (the candidate translation) is the cat sat on the mat, and there are two reference translations: look at! one cat sat on the mat and there is a cat on a mat. How is the BLEU score computed?

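● Method: count how many times each word occurs in the candidate, then how many times it occurs in each reference. Max is the largest count across all references, and Min is the smaller of the candidate count and Max(ref_1, ref_2). The 1-gram counts are tabulated below, followed by the resulting precision.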

1-gram	candidate	ref_1	ref_2	Max(ref_1,ref_2)	Min(candidate, reference_Max)
the	2	1	0	1	1
cat	1	1	1	1	1
sat	1	1	0	1	1
on	1	1	1	1	1
mat	1	1	1	1	1

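● Sum the Min column, sum the candidate counts, and divide the former by the latter to get the 1-gram precision; it is the first value of precisions in the output of the code below. The 2-gram, 3-gram, and 4-gram precisions are computed the same way. Finally, take the logarithm of each precision, average them with equal weights, exponentiate, and multiply by the brevity penalty. For details, see the article "Detailed calculation of the BLEU machine translation metric" (机器翻译评价指标之BLEU详细计算过程).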
$p_1=\frac{1+1+1+1+1}{2+1+1+1+1}=\frac{5}{6}\approx 0.833333$
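The counting procedure in the table can be reproduced in a few lines of plain Python, and the same routine extends to 2-, 3-, and 4-grams. This is a sketch rather than the library's implementation: it assumes simple whitespace tokenization (the bleu metric defaults to tokenizer_13a, which happens to give the same matches for this example), and the function names are my own.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand = ngram_counts(candidate.split(), n)
    max_ref = Counter()
    for ref in references:
        # "|" keeps the larger count per n-gram: the Max column of the table.
        max_ref |= ngram_counts(ref.split(), n)
    # "&" keeps the smaller count per n-gram: the Min column of the table.
    clipped = cand & max_ref
    return sum(clipped.values()), sum(cand.values())


candidate = "the cat sat on the mat"
references = ["look at! one cat sat on the mat", "there is a cat on a mat"]

for n in range(1, 5):
    matched, total = clipped_precision(candidate, references, n)
    print(f"{n}-gram precision: {matched}/{total} = {matched / total:.6f}")
```

For this example the four ratios are 5/6, 4/5, 3/4, and 2/3.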

import evaluate
my_metric = evaluate.load('bleu')
print(my_metric)
predictions = ["the cat sat on the mat"]
references = [["look at! one cat sat on the mat", "there is a cat on a mat"]]
results = my_metric.compute(predictions=predictions, references=references)
print("results:", results)

Output:
results: {'bleu': 0.6431870218238024, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 0.846481724890614, 'length_ratio': 0.8571428571428571, 'translation_length': 6, 'reference_length': 7}
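These numbers can be checked by hand. The derivation below is a sketch that assumes the standard BLEU definition with uniform weights $w_n = 1/4$; here $c = 6$ is translation_length, $r = 7$ is reference_length, and $p_n$ are the four values in precisions. The symbols are my notation, not the library's.

```latex
\begin{aligned}
\mathrm{BP} &=
  \begin{cases}
    1, & c > r \\
    e^{\,1 - r/c}, & c \le r
  \end{cases}
  \;=\; e^{\,1 - 7/6} \approx 0.8465 \\[4pt]
\mathrm{BLEU} &= \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{4} w_n \ln p_n\Big)
  = \mathrm{BP} \cdot \Big(\tfrac{5}{6} \cdot \tfrac{4}{5} \cdot \tfrac{3}{4} \cdot \tfrac{2}{3}\Big)^{1/4}
  = \mathrm{BP} \cdot \Big(\tfrac{1}{3}\Big)^{1/4} \\[4pt]
  &\approx 0.8465 \times 0.7598 \approx 0.6432
\end{aligned}
```

The candidate is shorter than the reference (length_ratio is 6/7, about 0.857), so the brevity penalty is below 1 and pulls the score under the geometric mean of the precisions.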

● For the GLUE benchmark, I simply use its mrpc subtask, which measures whether two sentences are semantically equivalent.

import evaluate
my_metric = evaluate.load('glue', 'mrpc')
predictions = [0, 1, 0, 1, 0]  # note: labels must be binary (0 or 1)
references = [0, 1, 0, 1, 1]
results = my_metric.compute(predictions=predictions, references=references)
print("results:", results)  # only the last label (index 4) differs, so accuracy is 4/5 = 0.8

Output:
results: {'accuracy': 0.8, 'f1': 0.8}
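Both values can be verified without the library. The sketch below is a hand-rolled check, not the metric's own code; it assumes label 1 is the positive class for F1, and the function name accuracy_and_f1 is mine.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 for two equal-length label lists."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


predictions = [0, 1, 0, 1, 0]
references = [0, 1, 0, 1, 1]
print(accuracy_and_f1(predictions, references))
```

Here there are 2 true positives, 0 false positives, and 1 false negative, so precision is 1, recall is 2/3, and F1 is 2 · 1 · (2/3) / (1 + 2/3) = 0.8, which happens to equal the accuracy of 4/5.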