Navigating the NLP Ocean: A Quick Start with HuggingFace


Foreword

Hugging Face is a technology company focused on natural language processing (NLP), as well as an open-source community and platform that provides a wealth of NLP models, tools, and resources. Its goal is to drive community and innovation in NLP by giving developers and researchers open-source tools, pre-trained models, and datasets. Hugging Face's open-source libraries are widely used across NLP tasks, including text classification, named entity recognition, sentiment analysis, and machine translation.

1. Introduction to HuggingFace

1-1. Introduction to HuggingFace

Hugging Face is an open source community and technology company dedicated to the field of Natural Language Processing (NLP). They provide an extensive platform of NLP tools and resources designed to help developers and researchers quickly build, train and deploy various NLP models.
Through Hugging Face you can use its open-source libraries and tools, such as transformers, tokenizers, and datasets, to process text data, load pre-trained Transformer models, and perform fine-tuning and transfer learning. These tools support common NLP tasks such as text classification, named entity recognition, and sentiment analysis.

The main libraries of Hugging Face are:

  • Transformers model library: load and call various pre-trained models
  • Datasets dataset library: load and process datasets
  • Tokenizers tokenization library: tokenization tools

Official Documentation: https://huggingface.co/docs

1-2. Installation

# Install the transformers and datasets packages (here via the Baidu PyPI mirror)
pip install transformers  -i https://mirror.baidu.com/pypi/simple

pip install datasets  -i https://mirror.baidu.com/pypi/simple

2. Tokenizers tokenization library: tokenization tools

2-0. Load BertTokenizer: pass in the name of a pre-trained model

from transformers import BertTokenizer

# Load the pre-trained vocabulary and tokenization method
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-chinese',  # name or path of a pre-trained model on the Hugging Face Hub (here bert-base-chinese)
    cache_dir=None,  # local directory for downloaded files; use cache_dir to control where they are stored
    force_download=False,
)

2-1. Use Tokenizer to encode sentences:

sents = [
    '选择珠江花园的原因就是方便。',
    '笔记本的键盘确实爽。',
    '房间太小。其他的都一般。',
    '今天才知道这书还有第6卷,真有点郁闷.',
    '机器背面似乎被撕了张什么标签,残胶还在。',
]
# Encode two sentences at once
out = tokenizer.encode(
    text=sents[0],
    text_pair=sents[1],  # encodes a sentence pair; without text_pair only one sentence is encoded

    # truncate when the sequence is longer than max_length
    truncation=True,

    # always pad up to max_length
    padding='max_length',   # pad when shorter than max_length
    add_special_tokens=True,
    max_length=30,  # maximum length of 30
    return_tensors=None,  # None (default) returns a Python list; other options are 'tf', 'pt', 'np'
)

print(out)
print(tokenizer.decode(out))

Output: the sequence starts with the special token [CLS], the two sentences are separated by [SEP], the second sentence also ends with [SEP], and the sequence is padded to max_length with [PAD]:

[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]
[CLS] The reason for choosing Pearl River Garden is convenience. [SEP] The keyboard of the notebook is really cool. [SEP] [PAD] [PAD] [PAD]

2-2. Use the enhanced encoding function to encode a sentence pair:

sents = [
    '选择珠江花园的原因就是方便。',
    '笔记本的键盘确实爽。',
    '房间太小。其他的都一般。',
    '今天才知道这书还有第6卷,真有点郁闷.',
    '机器背面似乎被撕了张什么标签,残胶还在。',
]

# Enhanced encoding function
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],

    # truncate when the sequence is longer than max_length
    truncation=True,

    # always pad up to max_length
    padding='max_length',
    max_length=30,
    add_special_tokens=True,

    # can be 'tf', 'pt' or 'np'; the default None returns a list
    return_tensors=None,

    # return token_type_ids
    return_token_type_ids=True,

    # return attention_mask
    return_attention_mask=True,

    # return special_tokens_mask, which marks the special tokens
    return_special_tokens_mask=True,

    # return offset_mapping, the start/end position of each token; only supported by BertTokenizerFast
    # return_offsets_mapping=True,

    # return length
    return_length=True,
)

print(out)   # a dictionary
print(tokenizer.decode(out['input_ids']))

Output :
{'input_ids': [101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'length': 30}
[CLS] The reason for choosing Pearl River Gardens is convenience. [SEP] The keyboard of the notebook is really cool. [SEP] [PAD] [PAD] [PAD]

Fields explained in detail (see the small sketch after this list):

  • input_ids: the encoded tokens, i.e. each token in the sentence mapped to its vocabulary ID
  • token_type_ids: 0 at positions belonging to the first sentence and its special tokens, 1 at positions belonging to the second sentence (including its trailing [SEP])
  • special_tokens_mask: 1 at the positions of special tokens, 0 elsewhere
  • attention_mask: 0 at [PAD] positions, 1 elsewhere
  • length: the length of the sequence
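
To see how these fields line up token by token, you can map the IDs back to tokens with convert_ids_to_tokens. A minimal sketch based on the encode_plus output above:

# Map each ID back to its token and print it next to the returned masks
tokens = tokenizer.convert_ids_to_tokens(out['input_ids'])
for token, type_id, mask in zip(tokens, out['token_type_ids'], out['attention_mask']):
    print(token, type_id, mask)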

2-3. Batch encoding of single sentences:

sents = [
    '选择珠江花园的原因就是方便。',
    '笔记本的键盘确实爽。',
    '房间太小。其他的都一般。',
    '今天才知道这书还有第6卷,真有点郁闷.',
    '机器背面似乎被撕了张什么标签,残胶还在。',
]

# Batch-encode individual sentences
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=[sents[0], sents[1]],  # encodes two sentences as a batch (the only difference from the enhanced encoding function above)

    # truncate when the sequence is longer than max_length
    truncation=True,

    # always pad up to max_length
    padding='max_length',
    max_length=15,
    add_special_tokens=True,

    # can be 'tf', 'pt' or 'np'; the default None returns a list
    return_tensors=None,

    # return token_type_ids
    return_token_type_ids=True,

    # return attention_mask
    return_attention_mask=True,

    # return special_tokens_mask, which marks the special tokens
    return_special_tokens_mask=True,

    # return offset_mapping, the start/end position of each token; only supported by BertTokenizerFast
    # return_offsets_mapping=True,

    # return length
    return_length=True,
)

print(out)  # a dictionary
print(tokenizer.decode(out['input_ids'][0]))

Output :
{'input_ids': [[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 102], [101, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'special_tokens_mask': [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]], 'length': [15, 12], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}
[CLS] The reason for choosing Pearl River Garden is convenience [SEP]

2-4. Add new tokens:

# Get the vocabulary
zidian = tokenizer.get_vocab()

# Add new tokens
tokenizer.add_tokens(new_tokens=['月光', '希望'])

# Add a new special token
tokenizer.add_special_tokens({'eos_token': '[EOS]'})   # End Of Sentence

# Get the updated vocabulary
zidian = tokenizer.get_vocab()

# Encode a sentence containing the newly added tokens
out = tokenizer.encode(
    text='月光的新希望[EOS]',
    text_pair=None,

    # truncate when the sequence is longer than max_length
    truncation=True,

    # always pad up to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=8,

    return_tensors=None,
)

print(out)

print(tokenizer.decode(out))

Output: the encoded IDs now include the entries assigned to the new tokens, and the decoded string shows 月光, 希望 and [EOS] each as a single token.

2-5. The difference between AutoTokenizer and BertTokenizer

AutoTokenizer is a general-purpose class that adapts to whichever pre-trained model you load.

The main differences are :

  • Automatic selection of the tokenizer class : AutoTokenizer automatically picks the appropriate tokenizer based on the pre-trained model name or type you pass in, so you don't have to specify the tokenizer class manually when switching between models. For example, AutoTokenizer.from_pretrained("bert-base-uncased") automatically selects a tokenizer that fits the BERT model.

  • Specific model tokenizer : BertTokenizer is a tokenizer for the BERT model, which is based on the WordPiece tokenization algorithm. It splits the input text into small word units (tokens) and assigns each token a unique ID. BertTokenizer also provides other useful methods, such as getting the ID of a special token (such as [CLS] and [SEP]), converting text to the input format required by the model, etc.

  • Support for more model types : AutoTokenizer can automatically select tokenizers suitable for various pre-trained models, not limited to BERT. It supports various models including GPT, RoBERTa, XLNet, etc. This enables you to switch and compare between different models using the same tool.

In general, AutoTokenizer is a tool for automatically selecting a tokenizer suitable for a task, while BertTokenizer is a tokenizer specifically for the BERT model. AutoTokenizer provides greater flexibility and versatility, and can be applied to many different pre-trained models.
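
A minimal sketch of the difference in practice (both calls load the same vocabulary for bert-base-chinese; note that AutoTokenizer prefers the fast tokenizer variant by default):

from transformers import AutoTokenizer, BertTokenizer

# AutoTokenizer reads the checkpoint's config and picks the matching tokenizer class
auto_tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
# BertTokenizer has to be chosen by hand and only fits BERT-style checkpoints
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

print(type(auto_tokenizer))   # BertTokenizerFast
print(type(bert_tokenizer))   # BertTokenizer
print(auto_tokenizer('选择珠江花园的原因就是方便。')['input_ids'])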

3. Datasets dataset library: using datasets

3-1. Dataset usage

Load a dataset from the Hub, save it to disk, and reload it from disk:

# 1. Load a dataset from the Hub
from datasets import load_dataset

dataset = load_dataset(path='lansinuote/ChnSentiCorp')
# Save the dataset to disk
dataset.save_to_disk(dataset_dict_path='./data/ChnSentiCorp')

# 2. Load the dataset from disk
from datasets import load_from_disk

dataset = load_from_disk('./data/ChnSentiCorp')

Problem : it is easy to hit an error saying the data cannot be found: *ConnectionError: Couldn't reach 'seamew/ChnSentiCorp' on the Hub (ConnectionError)*

Solution : open https://huggingface.co/datasets/seamew/ChnSentiCorp and download the data files manually; they can then be loaded locally, as in the sketch below.
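
If you download the data files manually, you can load them with the generic loaders instead of the Hub name. A hedged sketch, assuming the downloaded files are CSVs named train.csv / dev.csv / test.csv placed under ./data/ChnSentiCorp:

from datasets import load_dataset

# Assumed file names; adjust them to whatever you actually downloaded from the dataset page
dataset = load_dataset('csv', data_files={
    'train': './data/ChnSentiCorp/train.csv',
    'validation': './data/ChnSentiCorp/dev.csv',
    'test': './data/ChnSentiCorp/test.csv',
})
print(dataset)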

3-2. Dataset operations
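
A loaded DatasetDict supports the usual operations such as indexing, sort, shuffle, filter, map and train_test_split. A short sketch, assuming the ChnSentiCorp splits loaded above with text and label columns:

train_dataset = dataset['train']

# Inspect one sample
print(train_dataset[0])

# Sort by label / shuffle with a fixed seed
sorted_dataset = train_dataset.sort('label')
shuffled_dataset = train_dataset.shuffle(seed=42)

# filter: keep only reviews that mention 酒店 (hotel)
hotel_dataset = train_dataset.filter(lambda example: '酒店' in example['text'])

# map: add a text_length field to every sample
with_length = train_dataset.map(lambda example: {'text_length': len(example['text'])})

# Split off 10% of the training split as a test set
splits = train_dataset.train_test_split(test_size=0.1)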

3-3. Evaluation function

List all available evaluation metrics:

from datasets import list_metrics

# List the evaluation metrics
metrics_list = list_metrics()

print(metrics_list)

Output :
[‘accuracy’, ‘bertscore’, ‘bleu’, ‘bleurt’, ‘brier_score’, ‘cer’, ‘character’, ‘charcut_mt’, ‘chrf’, ‘code_eval’, ‘comet’, ‘competition_math’, ‘coval’, ‘cuad’, ‘exact_match’, ‘f1’, ‘frugalscore’, ‘glue’, ‘google_bleu’, ‘indic_glue’, ‘mae’, ‘mahalanobis’, ‘mape’, ‘mase’, ‘matthews_correlation’, ‘mauve’, ‘mean_iou’, ‘meteor’, ‘mse’, ‘nist_mt’, ‘pearsonr’, ‘perplexity’, ‘poseval’, ‘precision’, ‘r_squared’, ‘recall’, ‘rl_reliability’, ‘roc_auc’, ‘rouge’, ‘sacrebleu’, ‘sari’, ‘seqeval’, ‘smape’, ‘spearmanr’, ‘squad’, ‘squad_v2’, ‘super_glue’, ‘ter’, ‘trec_eval’, ‘wer’, ‘wiki_split’, ‘xnli’, ‘xtreme_s’, ‘AlhitawiMohammed22/CER_Hu-Evaluation-Metrics’, ‘BucketHeadP65/confusion_matrix’, ‘BucketHeadP65/roc_curve’, ‘Drunper/metrica_tesi’, ‘Felipehonorato/eer’, ‘He-Xingwei/sari_metric’, ‘JP-SystemsX/nDCG’, ‘Josh98/nl2bash_m’, ‘Kyle1668/squad’, ‘Muennighoff/code_eval’, ‘NCSOFT/harim_plus’, ‘Natooz/ece’, ‘NikitaMartynov/spell-check-metric’, ‘Pipatpong/perplexity’, ‘Splend1dchan/cosine_similarity’, ‘Viona/fuzzy_reordering’, ‘Viona/kendall_tau’, ‘Vipitis/shadermatch’, ‘Yeshwant123/mcc’, ‘abdusah/aradiawer’, ‘abidlabs/mean_iou’, ‘abidlabs/mean_iou2’, ‘andstor/code_perplexity’, ‘angelina-wang/directional_bias_amplification’, ‘aryopg/roc_auc_skip_uniform_labels’, ‘brian920128/doc_retrieve_metrics’, ‘bstrai/classification_report’, ‘chanelcolgate/average_precision’, ‘ckb/unigram’, ‘codeparrot/apps_metric’, ‘cpllab/syntaxgym’, ‘dvitel/codebleu’, ‘ecody726/bertscore’, ‘fschlatt/ner_eval’, ‘giulio98/codebleu’, ‘guydav/restrictedpython_code_eval’, ‘harshhpareek/bertscore’, ‘hpi-dhc/FairEval’, ‘hynky/sklearn_proxy’, ‘hyperml/balanced_accuracy’, ‘ingyu/klue_mrc’, ‘jpxkqx/peak_signal_to_noise_ratio’, ‘jpxkqx/signal_to_reconstruction_error’, ‘k4black/codebleu’, ‘kaggle/ai4code’, ‘langdonholmes/cohen_weighted_kappa’, ‘lhy/hamming_loss’, ‘lhy/ranking_loss’, ‘lvwerra/accuracy_score’, ‘manueldeprada/beer’, ‘mfumanelli/geometric_mean’, ‘omidf/squad_precision_recall’, ‘posicube/mean_reciprocal_rank’, ‘sakusakumura/bertscore’, ‘sma2023/wil’, ‘spidyidcccc/bertscore’, ‘tialaeMceryu/unigram’, ‘transZ/sbert_cosine’, ‘transZ/test_parascore’, ‘transformersegmentation/segmentation_scores’, ‘unitxt/metric’, ‘unnati/kendall_tau_distance’, ‘weiqis/pajm’, ‘ybelkada/cocoevaluate’, ‘yonting/average_precision_score’, ‘yuyijiong/quad_match_score’]

Use an evaluation metric :

from datasets import load_metric

# Load an evaluation metric
metric = load_metric('glue', 'mrpc')

# Compute the metric on some predictions and references
predictions = [0, 1, 0]
references = [0, 1, 1]

final_score = metric.compute(predictions=predictions, references=references)
print(final_score)

4. Model

The models can be browsed on the Hugging Face Hub: https://huggingface.co/models.

4-1. Common models for various NLP tasks

Here are some commonly used models for each task (the pipeline sketch after this list shows how to try one):

  • Text classification: finbert, roberta-base-go_emotions, twitter-roberta-base-sentiment-latest
  • Question answering: roberta-base-squad2, xlm-roberta-large-squad2, distilbert-base-cased-distilled-squad
  • Zero-shot classification: bart-large-mnli, mDeBERTa-v3-base-mnli-xnli
  • Translation: t5-base, opus-mt-zh-en, translation_en-zh
  • Summarization: bart-large-cnn, led-base-book-summary
  • Text generation: Baichuan-13B-Chat, falcon-40b, starcoder
  • Text similarity: all-MiniLM-L6-v2, text2vec-large-chinese, all-mpnet-base-v2
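
The quickest way to try any of these checkpoints is the pipeline API. A small sketch for sentiment classification (the full Hub id of the listed model is assumed to be cardiffnlp/twitter-roberta-base-sentiment-latest):

from transformers import pipeline

# Downloads the checkpoint on first use
classifier = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment-latest')
print(classifier('Hugging Face makes NLP so much easier!'))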

5. A hands-on case study

5-1. Kaggle Competition Real or Not? NLP with Disaster Tweets Text Classification

5-1-1. Data introduction

Data source : kaggle competition - Natural Language Processing with Disaster Tweets

Task : based on a tweet's location, keyword, and text, judge whether the tweet is about a real disaster. With such a model, the relevant agencies could detect possible disasters as early as possible and respond quickly to minimize losses.

Data scale: Training Set Shape: (7613, 5); Test Set Shape: (3263, 4)


Notes from exploratory data analysis (a preprocessing sketch follows this list):

  • The location column has many missing values, so it is simply dropped.
  • The keyword column has very few missing values, and the analysis shows a visible correlation between keywords and labels.
  • The labels are evenly distributed, so they can be used directly to train the model.
  • The data analysis stage is omitted here; we go straight to model training. If you are interested in the complete process, see my other article, Kaggle competition Real or Not? NLP with Disaster Tweets Text Classification.
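
The training code below assumes a final_text column plus text_values and labels arrays that are not shown in this excerpt. A hedged preprocessing sketch along those lines (building final_text by prepending keyword to text is an assumption, not necessarily what the original notebook did):

import pandas as pd

# train.csv / test.csv come from the Kaggle competition data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# Drop location (too many missing values); keep keyword and text
for df in (df_train, df_test):
    df['keyword'] = df['keyword'].fillna('')
    df['final_text'] = (df['keyword'] + ' ' + df['text']).str.strip()

text_values = df_train['final_text'].values
labels = df_train['target'].values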

5-2. Training the model

Encode final_text and return the input IDs as a single tensor:

import torch
from transformers import BertTokenizer

# Tokenizer matching the bert-base-uncased model loaded below
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_fn(text_list):
    all_input_ids = []
    for text in text_list:
        input_ids = tokenizer.encode(text, add_special_tokens=True, max_length=180, pad_to_max_length=True, return_tensors='pt')
        all_input_ids.append(input_ids)
    all_input_ids = torch.cat(all_input_ids, dim=0)
    return all_input_ids

Divide the data into training and validation sets

from torch.utils.data import TensorDataset, DataLoader, random_split

epochs = 4
batch_size = 32

# Convert the data into the tensors the model expects
all_input_ids = encode_fn(text_values)
labels = torch.tensor(labels)

# TensorDataset pairs the input tensor and the label tensor element by element
# Split into training and validation sets
dataset = TensorDataset(all_input_ids, labels)
train_size = int(0.90 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create train and validation dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

Load the BERT model

from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup

# Load the pretrained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2, output_attentions=False, output_hidden_states=False)
device = torch.device('cuda')  # the training and prediction loops below move batches to this device
model.cuda()

# Create the optimizer and learning rate schedule
optimizer = AdamW(model.parameters(), lr=2e-5)
# len(train_dataloader) is the number of batches; each batch holds 32 samples
total_steps = len(train_dataloader) * epochs
# get_linear_schedule_with_warmup creates a learning-rate scheduler: during the warm-up phase the learning
# rate increases step by step to help the model settle into a reasonable parameter range, after which it
# decreases linearly. With num_warmup_steps=0 there is effectively no warm-up here.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

Accuracy calculation

import numpy as np
from sklearn.metrics import accuracy_score

def flat_accuracy(preds, labels):
    """A function for calculating accuracy scores"""
    # Take the argmax over the logits and flatten to a 1-D array of predicted labels
    pred_flat = np.argmax(preds, axis=1).flatten()
    # Flatten the true labels to a 1-D array
    labels_flat = labels.flatten()
    # Accuracy of the flattened predictions against the flattened true labels
    return accuracy_score(labels_flat, pred_flat)

Training and validation

for epoch in range(epochs):
    model.train()
    total_loss, total_val_loss = 0, 0
    total_eval_accuracy = 0
    for step, batch in enumerate(train_dataloader):
        model.zero_grad()
        loss, logits = model(batch[0].to(device), token_type_ids=None, attention_mask=(batch[0]>0).to(device), labels=batch[1].to(device))
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step() 
        scheduler.step()
        
    model.eval()
    for i, batch in enumerate(val_dataloader):
        with torch.no_grad():
            loss, logits = model(batch[0].to(device), token_type_ids=None, attention_mask=(batch[0]>0).to(device), labels=batch[1].to(device))
                
            total_val_loss += loss.item()
            
            logits = logits.detach().cpu().numpy()
            label_ids = batch[1].to('cpu').numpy()
            total_eval_accuracy += flat_accuracy(logits, label_ids)
    
    avg_train_loss = total_loss / len(train_dataloader)
    avg_val_loss = total_val_loss / len(val_dataloader)
    avg_val_accuracy = total_eval_accuracy / len(val_dataloader)
    
    print(f'Train loss     : {avg_train_loss}')
    print(f'Validation loss: {avg_val_loss}')
    print(f'Accuracy: {avg_val_accuracy:.2f}')
    print('\n')

Train loss : 0.441781023875452
Validation loss: 0.34831519580135745
Accuracy: 0.86
Train loss : 0.3275374324204257
Validation loss: 0.3286557973672946
Accuracy: 0.88
Train loss : 0.2503694619696874
Validation loss: 0.355623895690466
Accuracy: 0.86
Train loss : 0.19663514375973207
Validation loss: 0.3806843503067891
Accuracy: 0.86

Prediction

# Create the test data loader
text_values = df_test['final_text'].values
all_input_ids = encode_fn(text_values)
pred_data = TensorDataset(all_input_ids)
pred_dataloader = DataLoader(pred_data, batch_size=batch_size, shuffle=False)
model.eval()
preds = []
for i, (batch,) in enumerate(pred_dataloader):
    with torch.no_grad():
        outputs = model(batch.to(device), token_type_ids=None, attention_mask=(batch>0).to(device))
        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        preds.append(logits)

final_preds = np.concatenate(preds, axis=0)
final_preds = np.argmax(final_preds, axis=1)

# Create submission file
submission = pd.DataFrame()
submission['id'] = df_test['id']
submission['target'] = final_preds
submission.to_csv('submission.csv', index=False)


Summary

Just want to be happy.
