✅ Study notes from a beginner in NLP research
● Previous article: NLP Road to Frozen Hands (2) - Downloading and working with text datasets (Datasets)
1. Required environment
● python 3.7+ and pytorch 1.10+ are required.
● This article is based on the Hugging Face Transformers library, official documentation: https://huggingface.co/docs/transformers/index [a very good open-source project that integrates a great deal around the transformer framework; currently 72.3k ⭐️ on GitHub].
● To install the Hugging Face Transformers library, just run pip install transformers in the terminal [this is the pip installation method]; if you use conda, run conda install -c huggingface transformers.
● In addition, this article needs the dataset-processing package named datasets: run pip install datasets in the terminal [pip installation method]; with conda, run conda install -c huggingface -c conda-forge datasets.
● This article also needs the evaluation library named evaluate: run pip install evaluate in the terminal [pip installation method, requires python 3.7+].
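Before moving on, it can help to confirm that the interpreter meets the version requirement and that the three packages above are importable. The sketch below uses only the standard library; the function name `environment_report` is my own, not part of any of these packages:

```python
import sys
from importlib.util import find_spec

def environment_report(min_version=(3, 7),
                       packages=("transformers", "datasets", "evaluate")):
    """Return (python_ok, {package: available}) without importing the packages."""
    python_ok = sys.version_info >= min_version
    # find_spec returns None when a package is not installed
    availability = {name: find_spec(name) is not None for name in packages}
    return python_ok, availability

python_ok, availability = environment_report()
print("python 3.7+:", python_ok)
for name, ok in availability.items():
    print(f"{name}: {'installed' if ok else 'missing, see the pip/conda commands above'}")
```

If any package is reported missing, install it with the pip or conda command listed above before continuing.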
2. Loading evaluation metrics
import evaluate
metrics_list = evaluate.list_evaluation_modules()
print("Total number of metrics: ", len(metrics_list))
print("All metrics: ", metrics_list)
● The running results are as follows (in the original screenshot, the commonly used metrics are circled in red):
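The returned list is a plain Python list of names, so it can be filtered with ordinary list operations. A small sketch (the sample list and the helper `find_metrics` are mine for illustration; in practice you would pass the result of `evaluate.list_evaluation_modules()`):

```python
def find_metrics(names, keyword):
    """Return the metric names that contain the given keyword."""
    return [n for n in names if keyword in n.lower()]

# Hypothetical sample of names; the real list comes from list_evaluation_modules()
sample = ["accuracy", "bleu", "bertscore", "f1", "glue", "sacrebleu", "squad"]
print(find_metrics(sample, "bleu"))  # → ['bleu', 'sacrebleu']
```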
● To load a metric, take BLEU as an example; it can generally be loaded directly as follows (cache_dir, an optional str, is the path used to store temporary predictions and references; the default is '~/.cache/huggingface/evaluate/'):
my_metric = evaluate.load("bleu", cache_dir="./")
print(my_metric)  # print the description of this metric
Output:
EvaluationModule(name: "bleu", module_type: "metric", features: [{
'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, {
'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage:
Computes BLEU score of translated segments against one or more references.
Args:
predictions: list of translations to score.
references: list of lists of or just a list of references for each translation.
tokenizer : approach used for tokenizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
This can be replaced by any function that takes a string as input and returns a list of tokens as output.
max_order: Maximum n-gram order to use when computing BLEU score.
smooth: Whether or not to apply Lin et al. 2004 smoothing.
Returns:
'bleu': bleu score,
'precisions': geometric mean of n-gram precisions,
'brevity_penalty': brevity penalty,
'length_ratio': ratio of lengths,
'translation_length': translation_length,
'reference_length': reference_length
Examples:
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [
... ["hello there general kenobi", "hello there!"],
... ["foo bar foobar"]
... ]
>>> bleu = evaluate.load("bleu")
>>> results = bleu.compute(predictions=predictions, references=references)
>>> print(results["bleu"])
1.0
, stored examples: 0)
● However, this method needs an Internet connection to download the evaluation script from Hugging Face, which in mainland China generally requires getting over the firewall; even then, it may fail if the local network is poor. (Some metrics work without a VPN because their code is hosted somewhere accessible, but others are hosted on Google and cannot be downloaded.)
● If you currently cannot get over the firewall, you can skip ahead to "3. Using evaluation metrics (BLEU and GLUE as examples)" and work with the glue metric, which does not need one.
● After nearly three hours of effort, I still could not find a solution that works without getting over the firewall. The short section below records part of my exploration, which ended in failure... If anyone manages to solve this problem, please leave a message in the comment area~!
In that case, you would need to download the complete evaluate repository locally and then call the BLEU evaluation function from the local copy.
● The process is as follows:
① Open the corresponding project folder, open a command line there, and git clone the evaluate repository from GitHub. The full command is: git clone https://github.com/huggingface/evaluate.git
② Rename the cloned folder to evaluate_1 (so that it does not share the name evaluate with the installed package), then run the code below, which prints the description of the squad metric. (Note: even this way, after much effort, I still could not load the source code for bleu.)
my_metric = evaluate.load('./evaluate_1/metrics/squad')
print(my_metric)
3. Using evaluation metrics (BLEU and GLUE as examples)
● Some evaluation metrics, such as bleu, need access to the external network, while others, such as glue, do not. Below I use each of them as an example.
● First, take bleu as an example. Suppose the text predicted by the model (the candidate translation) is the cat sat on the mat, and suppose there are two reference translations: look at! one cat sat on the mat and there is a cat on a mat. How is the bleu score calculated?
● Specific method: count how many times each word occurs in the candidate translation, and how many times it occurs in each reference translation. Max(ref_1, ref_2) takes the larger count over the two references, and Min takes the smaller of the candidate count and Max(ref_1, ref_2). The 1-gram statistics table and the resulting precision are below.
| 1-gram | candidate | ref_1 | ref_2 | Max(ref_1, ref_2) | Min(candidate, Max) |
| --- | --- | --- | --- | --- | --- |
| the | 2 | 1 | 0 | 1 | 1 |
| cat | 1 | 1 | 1 | 1 | 1 |
| sat | 1 | 1 | 0 | 1 | 1 |
| on | 1 | 1 | 1 | 1 | 1 |
| mat | 1 | 1 | 1 | 1 | 1 |
● Add up the Min values of all the words, add up the occurrences of each word in the candidate translation, and divide the first sum by the second: this gives the 1-gram precision, which is the first precisions value in the output of the code below. Then compute the 2-gram, 3-gram, and 4-gram precisions in turn, take their logarithms, compute a weighted average, exponentiate, and finally multiply by the brevity penalty. For details, see the reference on the BLEU calculation process for machine translation evaluation.
$P = \frac{1+1+1+1+1}{2+1+1+1+1} = \frac{5}{6} = 0.833333$
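The 1-gram precision above can be reproduced in a few lines of plain Python. This is only a sketch of the clipped (modified) unigram precision; the real bleu metric also handles higher-order n-grams, proper tokenization, and the brevity penalty. The function name `modified_unigram_precision` is mine:

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clipped 1-gram precision: each candidate word counts at most
    as often as it appears in the most favourable reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = [Counter(ref.split()) for ref in references]
    clipped = sum(
        min(count, max(rc[word] for rc in ref_counts))
        for word, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())

candidate = "the cat sat on the mat"
references = ["look at! one cat sat on the mat", "there is a cat on a mat"]
print(modified_unigram_precision(candidate, references))  # 5/6 ≈ 0.8333
```

"the" occurs twice in the candidate but at most once in either reference, so it is clipped to 1; the other four words all contribute 1, giving 5/6.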
import evaluate
my_metric = evaluate.load('bleu')
print(my_metric)
predictions = ["the cat sat on the mat"]
references = [["look at! one cat sat on the mat", "there is a cat on a mat"]]
results = my_metric.compute(predictions=predictions, references=references)
print("results:", results)
Output:
results: {'bleu': 0.6431870218238024, 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666], 'brevity_penalty': 0.846481724890614, 'length_ratio': 0.8571428571428571, 'translation_length': 6, 'reference_length': 7}
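The final bleu value in this output can be reconstructed from its parts: the geometric mean of the four n-gram precisions, multiplied by the brevity penalty exp(1 - reference_length/translation_length), which applies because the candidate (length 6) is shorter than the closest reference (length 7). A quick check with the standard library:

```python
import math

# The four n-gram precisions from the output above, as exact fractions
precisions = [5 / 6, 4 / 5, 3 / 4, 2 / 3]
translation_length, reference_length = 6, 7

# Geometric mean of the n-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: penalise candidates shorter than the reference
if translation_length > reference_length:
    bp = 1.0
else:
    bp = math.exp(1 - reference_length / translation_length)

bleu = bp * geo_mean
print(bleu)  # ≈ 0.6431870218238024, matching the library output above
```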
● For the large glue metric, I will simply use its small mrpc task, which measures whether two sentences are semantically equivalent or similar.
import evaluate
my_metric = evaluate.load('glue', 'mrpc')
predictions = [0, 1, 0, 1, 0]  # note: only binary labels are compared
references = [0, 1, 0, 1, 1]
results = my_metric.compute(predictions=predictions, references=references)
print("results:", results)  # only the last value differs, so accuracy = 4/5 = 0.8
Output:
results: {'accuracy': 0.8, 'f1': 0.8}
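Both numbers can be verified by hand. A stdlib-only sketch of accuracy and binary F1 (positive class = 1) for the same labels; the helper `accuracy_and_f1` is mine, not part of the glue metric:

```python
def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1 (positive class = 1) for equal-length label lists."""
    correct = sum(p == r for p, r in zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(predictions), f1

print(accuracy_and_f1([0, 1, 0, 1, 0], [0, 1, 0, 1, 1]))  # accuracy 0.8, f1 0.8
```

Here tp = 2, fp = 0, fn = 1, so precision = 1 and recall = 2/3, giving F1 = 0.8, the same values the glue metric returned.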
4. Summary
● Working with evaluation metrics is not difficult; a rough read-through and one hands-on pass is enough.
5. Supplementary Notes
● Previous article: NLP Road to Frozen Hands (2) - Downloading and working with text datasets (Datasets)
● If anything here is wrong, or if you have questions, feel free to discuss in the comments.
● Reference video: HuggingFace concise tutorial: hands-on examples with the BERT Chinese model, NLP pre-trained models, and a quick start to the Transformers and datasets libraries.
● Reference 1: Evaluate: a detailed introduction to the Hugging Face evaluation metric module
● Reference 2: A detailed explanation of the principle and calculation of BLEU
● Reference 3: The detailed BLEU calculation process for machine translation evaluation