Hugging Face Transformers: A Comprehensive Guide from Principles to Practice

1. Introduction

 

Earlier, I introduced the basic principles of ChatGPT and the development history of pre-trained large language models. What sits at the core of ChatGPT and of all pre-trained large language models? The Transformer. The popularity of Hugging Face is inseparable from its open-source Transformers library, which lets us call tens of thousands of pre-trained models directly; in many scenarios these open-source models are already enough for our needs. Below, we introduce Hugging Face Transformers, starting from the Transformer architecture and then moving on to concrete examples.

2. Transformer architecture

Transformer is a neural network model for natural language processing and other sequence-to-sequence tasks. It was proposed by Vaswani et al. in 2017. The core of the Transformer is the self-attention mechanism (Self-Attention), which captures the dependencies between the elements of a sequence.

If there were only the word vectors themselves, without the attention mechanism and positional encoding, the language model could not disambiguate words. For example, it would not know whether the word "apple" refers to a fruit or a technology company. Only after context-related information is captured through the self-attention mechanism can the language model build a contextual representation of the entire sequence.

The self-attention mechanism is composed of three parts (a minimal code sketch follows this list):

  • Query, Key, Value: The input to the self-attention mechanism is a sequence in which every element is a vector, i.e. a word vector. For each element we compute three vectors, Query, Key, and Value, by multiplying the element's word vector by the Query, Key, and Value parameter matrices. In other words, through matrix multiplication we derive three additional vectors from the word vector itself: Query, Key, and Value.

  • Attention Scores: For each query vector, we compute a similarity score between it and every key vector; this is called the attention score. It is calculated as the dot product (Dot Product) of the query vector and the key vector, scaled in practice by the square root of the key dimension.

  • Attention Weights: After the attention scores are obtained, they are normalized with Softmax to produce the attention weights. The output of the self-attention mechanism is then the weighted average of the value vectors, using these weights. After this series of calculations, the word vectors in the sequence go from being completely unconnected to carrying rich relationships with one another, and the output vectors can be used by the Transformer for further analysis and processing.

  • Multi-Head Attention: In practice, all of this attention is multi-headed. Multi-head attention is a small improvement on the basic mechanism: the input vectors are linearly transformed into several sets of Query, Key, and Value, the attention for each head is computed in parallel, and the results of all heads are then aggregated into a new vector representation. The advantage is that the model can attend to information at different positions and different semantic levels at the same time, better capturing both the global and the local features of the sequence, so multi-head attention performs better on complex sequence data.
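
To make the Query/Key/Value computation described above concrete, here is a minimal, self-contained sketch of single-head scaled dot-product attention in PyTorch. The sequence length, dimensions, and random weights are purely illustrative and do not come from any particular model.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)                    # 4 toy "word vectors", each of dimension 8

# Parameter matrices that project each word vector into Query, Key and Value
W_q, W_k, W_v = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: dot product of every query with every key, scaled by sqrt(d_k)
scores = Q @ K.T / (K.shape[-1] ** 0.5)  # shape (4, 4)

# Attention weights: Softmax-normalized scores; each row sums to 1
weights = F.softmax(scores, dim=-1)

# Output: a weighted average of the value vectors, one output vector per input position
output = weights @ V                     # shape (4, 8)
print(weights.shape, output.shape)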

In addition to self-attention and multi-head attention, the Transformer also contains an encoder (Encoder) and a decoder (Decoder). The Encoder maps the input sequence to a set of hidden states via the attention mechanism, and the Decoder then maps those hidden states to the output sequence; this is the basic mechanism of the Transformer. These hidden states are built up through multiple stacked self-attention layers and feed-forward neural network layers, forming a fairly complex, highly parallel structure.

Inside the Transformer, an input text sequence is first processed by the encoder (Encoder), and the resulting information then flows into the decoder (Decoder).

Inside the Transformer, the Encoder and the Decoder are each built from multiple stacked modules. In the Encoder, each Transformer module contains two sub-layers:

  • Multi-Head Self-Attention

  • Feed-Forward Neural Network

The Decoder likewise contains stacked sub-layers, but with two different attention layers: a multi-head self-attention layer (Self-Attention, attention over the Decoder's own sequence) and an Encoder-Decoder attention layer, which combines the Encoder's output with the Decoder's own input. The vector sequence is therefore processed by a stack of Transformer modules: each Encoder module performs self-attention followed by a feed-forward network, the result is passed to the Decoder, and each Decoder module performs self-attention, then Encoder-Decoder attention, then its own Feed Forward step. Layer by layer, the Transformer learns more and more of the dependencies between the input and output sequences, which lets it capture sequence-to-sequence semantics effectively.
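
As a rough illustration of this stacked structure, PyTorch ships ready-made modules for it. The sketch below builds a small stack of encoder blocks, each consisting of multi-head self-attention followed by a feed-forward network; all hyperparameters are arbitrary and chosen only for illustration.

import torch
import torch.nn as nn

# One encoder block = multi-head self-attention + a feed-forward network;
# the block is then stacked num_layers times (hyperparameters are illustrative)
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, dim_feedforward=256, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.randn(2, 10, 64)   # (batch, sequence_length, d_model)
hidden_states = encoder(src)   # one hidden state per input position
print(hidden_states.shape)     # torch.Size([2, 10, 64])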

3. The most influential Transformers

Here are some key milestones in the (brief) history of the Transformer model:

The Transformer architecture was introduced in June 2017. The original research focused on the translation task. Several influential models followed, including:

  • June 2018: GPT, the first pre-trained Transformer model, used for various NLP tasks with excellent results

  • October 2018: BERT, another large pre-trained model, designed to produce better representations of sentences

  • February 2019: GPT-2, an improved (and larger) version of GPT, not immediately released publicly due to ethical concerns

  • October 2019: DistilBERT, a distilled version of BERT that is 60% faster and uses 40% less memory while retaining 97% of BERT's performance

  • October 2019: BART and T5, two large pre-trained models using the same architecture as the original Transformer model (the first to do so)

  • May 2020: GPT-3, an even larger version of GPT-2 that performs well on a variety of tasks without fine-tuning (so-called zero-shot learning)

Among them, the most influential is probably BERT, proposed by Google in 2018 and still one of the most popular natural language processing models. It uses a bidirectional Transformer encoder to learn context-dependent word representations. After BERT appeared, many people began to improve on it to see whether they could find an even better Transformer:

  • RoBERTa: Proposed by Facebook, it further trains the BERT language model and changes some internal structures and training procedures, improving the model's expressiveness. In practice, RoBERTa and BERT each have their strengths on downstream tasks; which one is better mainly depends on your specific task. Some tasks still work well with BERT, while on others RoBERTa is slightly better.

  • ALBERT: A lightweight language model based on BERT, proposed by the Google and Toyota teams in 2019. It reduces the model size and training time through parameter sharing and factorized embedding techniques, while maintaining performance similar to BERT, making it the lighter model.

  • DistilBERT: Also a lightweight language model based on BERT, released by the Hugging Face team in 2019. It uses knowledge distillation to compress BERT to roughly half its parameters while maintaining similar expressiveness, so DistilBERT can be regarded as a compact model with relatively high efficiency.

4. Hugging Face Transformers

Hugging Face is a company, and Transformers is its open-source library. Through the APIs that Hugging Face provides, we can download almost all of the pre-trained large models mentioned above, together with their parameters and related information. You can think of these models as being essentially open-sourced on Hugging Face; we only need to take them and fine-tune or retrain them. In official terms, Hugging Face Transformers is a Python library for natural language processing that provides pre-trained language models and tools, enabling researchers and engineers to easily train and share state-of-the-art NLP models, including BERT, GPT, RoBERTa, XLNet, DistilBERT, and more.

Through Transformers, these pre-trained models can be easily used for NLP tasks such as text classification, named entity recognition, machine translation, and question answering systems. This library also provides a convenient API, sample code, and documentation, making it very easy for us to use these models or learn from them.

4.1 Transformers Pipeline

The basic functions of Pipeline

Let's take a look at what the open-source Transformers library can do. The code below uses open-source models directly and needs GPU compute, so it is best to run it in Colab; don't forget to change the Runtime type to GPU.

from transformers import pipeline

classifier = pipeline(task="sentiment-analysis", device=0)
preds = classifier("I am really happy today!")
print(preds)

Output result:

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9998762607574463}]

This code is very simple. The first line defines a pipeline whose task is sentiment-analysis, i.e. a classifier for sentiment analysis. device=0 tells Transformers to use GPU resources; if you want it to use the CPU instead, set device=-1. We then call this classifier to run sentiment analysis on a piece of text. The output shows the correct POSITIVE prediction along with a concrete prediction score. Because we did not specify any model, Transformers automatically selected a default one, namely the distilbert-base-uncased-finetuned-sst-2-english model seen in the log.

Looking at the name, we can tell that this model is meant for English. If we want to support Chinese, we can try another model.

classifier = pipeline(model="uer/roberta-base-finetuned-jd-binary-chinese", task="sentiment-analysis", device=0)
preds = classifier("这家店有点黑,鱼香肉丝也太难吃了。")
print(preds)

Output result:

[{'label': 'negative (stars 1, 2 and 3)', 'score': 0.934112012386322}]

Here, by specifying the model name, we can switch to another model for sentiment analysis. This time we chose uer/roberta-base-finetuned-jd-binary-chinese. RoBERTa is based on BERT with some design modifications, and the finetuned-jd-binary-chinese suffix indicates that the model was fine-tuned on JD.com review data.

Pipeline is a core feature of the Transformers library: it wraps inference for the models hosted on Hugging Face behind a single entry point. You don't need to care about each model's architecture or its expected input format; you only need to specify the model with the model parameter and the task type with the task parameter, run it, and get the result directly.

For example, if we don't want to do sentiment analysis now, but want to do English-Chinese translation, we just need to replace the task with translation_en_to_zh, and then choose a suitable model.

translation = pipeline(task="translation_en_to_zh", model="Helsinki-NLP/opus-mt-en-zh", device=0)

text = "Artificial intelligence is really amazing. I believe you will fall in love with it."
translated_text = translation(text)
print(translated_text)

Output result:

[{'translation_text': '人工智能真的太神奇啦,我相信你会喜欢上它'}]

Here, we chose the University of Helsinki's opus-mt-en-zh model for English-Chinese translation. After running it, we can see that the English input has been translated into Chinese. But how do we know which model to choose, and where do we find a model name like Helsinki-NLP/opus-mt-en-zh in the first place? The answer is the Hugging Face model hub (https://huggingface.co/models), where models can be searched and filtered by task and language, which brings us to Hugging Face itself.

5. Hugging Face in Practice

Hugging Face is an AI community dedicated to sharing machine learning models and datasets. Its main products include Hugging Face Dataset, Hugging Face Tokenizer, Hugging Face Transformer and Hugging Face Accelerate.

  • Hugging Face Dataset is a library for easily accessing and sharing datasets for audio, computer vision, and natural language processing (NLP) tasks. Load datasets with just one line of code and use powerful data processing methods to quickly prepare datasets for training in deep learning models. Powered by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency.

  • Hugging Face Tokenizer is a library for converting text into a numerical representation. It supports the tokenizers of a variety of models, including BERT, GPT-2, etc., and provides advanced alignment methods that can map the relationship between the original string (characters and words) and the token space.

  • Hugging Face Transformer is a library for Natural Language Processing (NLP) tasks. It provides a variety of pre-trained models, including BERT, GPT-2, etc., and provides some advanced features, such as controlling the length of the generated text, temperature, etc.

  • Hugging Face Accelerate is a library for accelerating training and inference. It supports various hardware accelerators, such as GPU, TPU, etc., and provides some advanced functions, such as mixed precision training, gradient accumulation, etc.

5.1 Hugging Face Dataset

Hugging Face Dataset is a public dataset repository for easy access and sharing of datasets for audio, computer vision, and natural language processing (NLP) tasks. Load datasets with just one line of code and use powerful data processing methods to quickly prepare datasets for training in deep learning models.

Powered by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. Hugging Face Dataset is also deeply integrated with Hugging Face Hub, allowing you to easily load and share datasets with the wider machine learning community.

It is often helpful to quickly get some general information about a dataset before spending time downloading it. Dataset information is stored in DatasetInfo , which can contain information such as dataset description, features, and dataset size.

Use the load_dataset_builder() function to load the dataset builder and inspect the properties of the dataset without submitting a download:

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")

# Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.

# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}

Once you are happy with the dataset, load it with load_dataset() :

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
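
Once loaded, the split behaves like a list of dictionaries with the features shown above. A quick, illustrative way to inspect it (the exact text printed depends on the example):

# Inspect the size of the split, its features, and the first example
print(len(dataset))        # number of rows in the "train" split
print(dataset.features)    # {'text': Value(...), 'label': ClassLabel(...)}
print(dataset[0])          # e.g. {'text': '...', 'label': 1}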

5.2 Hugging Face Tokenizer

Tokenizers provides implementations of today's most commonly used tokenizers with an emphasis on performance and versatility. These tokenizers are also used in Transformers.

The Tokenizer preprocesses a text sequence before it is fed into the model; it is essentially the data-preprocessing step, because a model cannot read raw text directly. The text must first be segmented into tokens. Each model, such as BERT or GPT, requires its own Tokenizer with its own dictionary, because each model is trained on a different corpus, so the tokens, dictionary size, and token format all differ. Overall, the Tokenizer segments the text into tokens and then encodes each token as an integer ID such as 1, 2, 3, 4, 5, 6. That is the role of the Tokenizer.

Therefore, the task of the Tokenizer is to convert the input text into tokens one by one; it can also clean, truncate, and pad the text sequence. In short, it produces the format required by the specific model.

Main features:

  • Train new vocabularies and tokenize using today's most commonly used tokenizers.

  • Very fast (training and tokenization) due to the Rust implementation, tokenizing 1GB of text takes less than 20 seconds on the server CPU.

  • Easy to use, but also very versatile.

  • Intended for use in research and production.

  • Full alignment tracking. Even with destructive normalization, it is always possible to recover the part of the original sentence that corresponds to any token.

  • All preprocessing is performed: truncation, padding, addition of special tokens required by the model.

Here's how to instantiate a Tokenizer class using a BPE model:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
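
In everyday use you will more often load a ready-made tokenizer through the Transformers library than train one from scratch. The sketch below is illustrative: it uses the bert-base-cased tokenizer that also appears later in this article, and the exact IDs and sub-word tokens it prints depend on that model's vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded = tokenizer("Artificial intelligence is really amazing.")
print(encoded["input_ids"])                                   # integer IDs, including special tokens such as [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding sub-word tokens
print(tokenizer.decode(encoded["input_ids"]))                 # decode the IDs back into text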

5.3 Hugging Face Transformer

Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pre-trained models can reduce computational costs, carbon footprint, and save time and resources required to train models. These models support common tasks in different modalities, such as:

  • Natural language processing : text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.

  • Computer Vision : Image Classification, Object Detection, and Segmentation.

  • Audio : Automatic speech recognition and audio classification.

  • Multimodal : Table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use different frameworks at each stage of the model; train a model with three lines of code in one framework and load it for inference in another framework. Models can also be exported to formats such as ONNX and TorchScript for deployment in production environments.

# Import the necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Initialize the tokenizer and the model (two labels for binary classification)
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load a dataset and encode a few texts into the tensor format the model expects
dataset = load_dataset("imdb")
inputs = tokenizer(dataset["train"]["text"][:10], padding=True, truncation=True, return_tensors="pt")

# Feed the encoded tensors into the model for prediction
outputs = model(**inputs)

# Take the predicted class (argmax over the logits)
predictions = outputs.logits.argmax(dim=-1)

5.4 Hugging Face Accelerate

Accelerate is a library that lets you run the same PyTorch code in any distributed configuration by adding just four lines of code! In short, training and inference at scale becomes simple, efficient, and adaptable.

from accelerate import Accelerator

accelerator = Accelerator()

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)
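
Here model, optimizer, training_dataloader, and scheduler are assumed to be ordinary PyTorch objects defined elsewhere; prepare() wraps them for whatever hardware configuration is active. The training loop that typically follows looks roughly like the sketch below (loss_function is a placeholder for your loss); the only substantive change compared with plain PyTorch is calling accelerator.backward(loss) instead of loss.backward().

# Sketch of the training loop after accelerator.prepare()
for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)   # lets Accelerate handle devices, gradient scaling and mixed precision
    optimizer.step()
    scheduler.step()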

5.5 Example of text classification based on Hugging Face Transformer

Install the necessary Hugging Face libraries:

pip install torch
pip install transformers
pip install datasets

# Import the necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Define the dataset name and task type
dataset_name = "imdb"
task = "sentiment-analysis"

# Download the dataset and shuffle it
dataset = load_dataset(dataset_name)
dataset = dataset.shuffle()

# Initialize the tokenizer and the model
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode the texts into the tensor format the model expects
inputs = tokenizer(dataset["train"]["text"][:10], padding=True, truncation=True, return_tensors="pt")

# Feed the encoded tensors into the model for prediction
outputs = model(**inputs)

# Get the predictions and the true labels
predictions = outputs.logits.argmax(dim=-1)
labels = dataset["train"]["label"][:10]

# Print predictions and labels ("正面评论" = positive review, "负面评论" = negative review)
for i, (prediction, label) in enumerate(zip(predictions, labels)):
    prediction_label = "正面评论" if prediction == 1 else "负面评论"
    true_label = "正面评论" if label == 1 else "负面评论"
    print(f"Example {i+1}: Prediction: {prediction_label}, True label: {true_label}")

Output result:

100%|██████████| 3/3 [00:00<00:00, 65.66it/s]
Downloading model.safetensors: 100%|██████████| 436M/436M [00:19<00:00, 22.0MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Example 1: Prediction: 正面评论, True label: 正面评论
Example 2: Prediction: 正面评论, True label: 负面评论
Example 3: Prediction: 正面评论, True label: 正面评论
Example 4: Prediction: 正面评论, True label: 负面评论
Example 5: Prediction: 正面评论, True label: 负面评论
Example 6: Prediction: 正面评论, True label: 正面评论
Example 7: Prediction: 正面评论, True label: 正面评论
Example 8: Prediction: 负面评论, True label: 正面评论
Example 9: Prediction: 正面评论, True label: 负面评论
Example 10: Prediction: 正面评论, True label: 负面评论

Judging from the above results, the performance is actually not very good. That is expected: we did not do any task-specific training and used the bert-base-cased model directly for sentiment analysis, so the predictions are naturally poor. The running log also reminds us of this: the model should be trained on a downstream task before it can be used for prediction and inference.
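
To close that gap, you would fine-tune the model on the IMDB training set. The following is only a minimal sketch using the Trainer API, reusing the tokenizer, model, and dataset objects from the example above; the hyperparameters and the small training/evaluation subsets are arbitrary choices to keep the run short.

from transformers import TrainingArguments, Trainer

# Tokenize the whole dataset once, in batches
def tokenize_fn(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize_fn, batched=True)

# Minimal training configuration; tune these values for real experiments
args = TrainingArguments(output_dir="bert-imdb", per_device_train_batch_size=8, num_train_epochs=1)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small subset for speed
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()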

6. Reference

Origin: blog.csdn.net/FrenzyTechAI/article/details/131958687