"Transformers Natural Language Processing Series Tutorial" Chapter 1: Introduction to Transformers

In 2017, researchers at Google published a paper proposing a novel neural network architecture for sequence modeling. Known as the Transformer, this architecture outperformed recurrent neural networks (RNNs) on machine translation tasks in both translation quality and training cost.

Meanwhile, an effective transfer learning method called ULMFiT showed that pretraining a long short-term memory (LSTM) network on a very large and diverse corpus could produce a state-of-the-art text classifier with very little labeled data.

These advances were the catalysts for two of the most famous transformers today: the Generative Pretrained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT). By combining the Transformer architecture with unsupervised learning, these models eliminated the need to train task-specific architectures from scratch and broke almost every benchmark in NLP by a significant margin. Since the release of GPT and BERT, a large number of transformer models have emerged; a timeline of the most prominent milestones is shown in Figure 1-1:
Figure 1-1. Timeline of transformers
But to understand what is novel about transformers, we first need to explain:

  • Encoder-decoder framework
  • Attention mechanism
  • Transfer learning

In this chapter, we introduce the core concepts behind the transformer's generality, look at some of the tasks it excels at, and explain Hugging Face's tools and libraries.

Let's first explore the encoder-decoder framework and the architectures that preceded the rise of transformers.

1. Encoder-decoder framework

Recurrent architectures like LSTMs were the state of the art in NLP before the advent of transformers. These architectures incorporate a feedback loop in the network connections, allowing information to propagate from one step to another, making them ideal for modeling sequential data like text. As shown on the left side of Figure 1-2, an RNN takes some input (which can be words or characters), feeds it through the network, and outputs a vector called a hidden state.
Figure 1-2. Unrolling an RNN in time
At the same time, the model feeds some information back to itself through a feedback loop, which can then be used in the next step. This can be seen more clearly if we "unroll" the loop, as shown on the right side of Figure 1-2: the RNN passes state information at each step to the next operation in the sequence. This allows the RNN to keep track of information from previous steps and use it for its output predictions.
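To make this concrete, here is a minimal sketch (in plain NumPy, with made-up toy dimensions) of the state update that a vanilla RNN performs at each unrolled step. It is only an illustration of the idea, not the exact formulation used by any particular library.

import numpy as np

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden state
W_xh = np.random.randn(3, 4)   # input-to-hidden weights
W_hh = np.random.randn(3, 3)   # hidden-to-hidden weights (the feedback loop)
b = np.zeros(3)

def rnn_step(x_t, h_prev):
    """One unrolled step: combine the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(3)                      # initial hidden state
for x_t in np.random.randn(6, 4):    # a sequence of 6 token vectors
    h = rnn_step(x_t, h)             # the state is passed on to the next step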

These architectures were (and continue to be) widely used for NLP tasks, speech processing, and time series. For a readable introduction, see Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks".

One area where RNNs have played an important role is in the development of machine translation systems, whose goal is to map a sequence of words in one language to a sequence in another. Such tasks are usually handled with encoder-decoder or sequence-to-sequence architectures, which are well suited to situations where both the input and the output are sequences of arbitrary length. The job of the encoder is to encode the information in the input sequence into a numerical representation, usually called the last hidden state. This state is then passed to the decoder, which generates the output sequence.

In general, the encoder and decoder components can be any kind of neural network architecture that can model sequences. Figure 1-3 illustrates this with a pair of RNNs, where the English sentence "Transformers are great!" is encoded into a hidden state vector, which is then decoded to produce the German translation "Transformer sind grossartig!". The input words are fed sequentially through the encoder, and the output words are generated one at a time, from top to bottom.

Figure 1-3. Encoder-decoder architecture with a pair of RNNs (in general, more recurrent layers than shown here)
Despite its simplicity, this architecture has a drawback: the final hidden state of the encoder creates an information bottleneck, because it must represent the meaning of the entire input sequence, and it is all the decoder has access to when generating the output. This is especially challenging for long sequences, where information at the beginning of the sequence may be lost in the process of compressing everything into a single fixed representation.

Fortunately, there is a way out of this bottleneck: allow the decoder to access all of the encoder's hidden states. The general mechanism for doing so is called attention, and it is a key component of many modern neural network architectures. Understanding how attention was developed for RNNs will put us in good shape to understand the Transformer architecture. Let's take a deeper look at the attention mechanism.

2. Attention mechanism

The main idea behind attention is that instead of producing a single hidden state for the entire input sequence, the encoder outputs a hidden state at each step that the decoder can access. However, using all the states at the same time would create a huge input for the decoder, so some mechanism is needed to prioritize which states to use. This is where attention comes in: it lets the decoder assign a different weight to each encoder state at every decoding timestep. This process is illustrated in Figure 1-4, which shows the role of the attention mechanism in predicting the second token in the output sequence.
Figure 1-4. Encoder-decoder architecture with an attention mechanism for a pair of RNNs
By focusing on which input tokens are most relevant at each timestep, these attention-based models are able to learn meaningful alignments between the words in a generated translation and the words in the source sentence. For example, Figure 1-5 visualizes the attention weights of an English-to-French translation model, where each pixel represents a weight.
Figure 1-5. RNN encoder-decoder alignment of words in English and the generated French translation

The figure shows how the decoder is able to correctly align the words "zone" and "Area", which are in different orders in the two languages.
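As a rough sketch of what "assigning weights to encoder states" means, the snippet below computes simple dot-product attention with NumPy. The shapes and random vectors are purely illustrative; real systems typically add learned projection layers on top of this idea.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.randn(5, 8)   # one hidden state per input token
decoder_state = np.random.randn(8)       # current decoder hidden state

scores = encoder_states @ decoder_state  # one relevance score per input token
weights = softmax(scores)                # attention weights, summing to 1
context = weights @ encoder_states       # weighted sum passed to the decoder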

While attention mechanisms are able to produce better translations, there is still a major disadvantage of using recurrent models for encoders and decoders: the computation is inherently sequential and cannot be parallelized across the input sequence.

With the Transformer, a new modeling paradigm was introduced: dispense with recurrence entirely, and rely instead on a special form of attention called self-attention. We will cover self-attention in more detail in Chapter 3, but the basic idea is to allow attention to operate on all the states in the same layer of the neural network. As shown in Figure 1-6, both the encoder and the decoder have their own self-attention mechanisms, whose outputs are fed into feed-forward neural networks (FF NNs).
Figure 1-6. Encoder-decoder architecture of the original Transformer

This architecture can be trained much faster than recurrent models and has paved the way for many recent breakthroughs in NLP.
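For a first impression of what self-attention computes, here is a minimal NumPy sketch of single-head, scaled dot-product self-attention over a toy sequence. All dimensions are illustrative; Chapter 3 covers the real mechanism in detail.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

X = np.random.randn(5, 8)                       # embeddings of a 5-token sequence
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(K.shape[-1])         # every token attends to every token
weights = softmax(scores, axis=-1)
Z = weights @ V                                 # updated representation per token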

In the original Transformer paper, the translation model was trained from scratch on a large corpus of sentence pairs in various languages. However, in many practical NLP applications we do not have access to large amounts of labeled text data to train our models on. This brings us to the final piece that kicked off the Transformer revolution: transfer learning.

3. Transfer learning in NLP

In computer vision, it is now common practice to use transfer learning: train a convolutional neural network such as ResNet on one task and then adapt or fine-tune it on a new task. This allows the network to leverage the knowledge learned from the original task. Architecturally, this involves splitting the model into a body and a head, where the head is a task-specific network. During training, the weights of the body learn broad features of the source domain, and these weights are then used to initialize a new model for the new task. This approach typically produces high-quality models that can be trained more efficiently on a variety of downstream tasks with much less labeled data than traditional supervised learning. A comparison of the two approaches is shown in Figure 1-7.
Figure 1-7. Comparison of traditional supervised learning (left) and transfer learning (right)
In computer vision, these models are first trained on large-scale datasets, such as ImageNet, which contain millions of images. This process is called pre-training , and its main purpose is to teach the model to learn basic features of images, such as edges or colors. These pretrained models can then be fine-tuned on downstream tasks such as classifying flower species with relatively few labeled examples (typically several hundred per class). Fine-tuned models typically achieve higher accuracy than supervised models trained from scratch on the same amount of labeled data.
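The body/head split described above can be sketched in a few lines of PyTorch. This is only an illustration under stated assumptions (torch and torchvision installed, ResNet-18 and ten output classes chosen arbitrarily), not a complete fine-tuning recipe.

import torch
from torchvision import models

model = models.resnet18(pretrained=True)        # body weights learned on ImageNet

for param in model.parameters():                # freeze the pretrained body
    param.requires_grad = False

num_classes = 10                                # e.g. ten flower species
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # new task-specific head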

Although transfer learning became the standard approach in computer vision, for many years it was not clear what the analogous pretraining process was for NLP. As a result, NLP applications typically required large amounts of labeled data to achieve high performance, and even then that performance did not compare to what was being achieved in vision.

In 2017 and 2018, several research groups proposed new approaches that finally made transfer learning work in NLP. It started with an insight from researchers at OpenAI, who obtained strong performance on a sentiment classification task by using features extracted from unsupervised pretraining. This was followed by ULMFiT, which introduced a general framework for adapting pretrained LSTM models to various tasks.

As shown in Figure 1-8, ULMFiT consists of three main steps:

Figure 1-8. The three steps of ULMFiT

  • Pre-training
    The initial training objective is simple: predict the next word based on the previous words. This task is known as language modeling. The elegance of this approach is that no labeled data is required, and one can leverage the wealth of text available from sources such as Wikipedia. (A minimal code sketch of this objective follows the list below.)
  • Domain transfer
    Once a language model has been pretrained on a large-scale corpus, the next step is to adapt it to an in-domain corpus (for example, from Wikipedia to the IMDb corpus of movie reviews, as shown in Figure 1-8). This stage still uses language modeling, but now the model has to predict the next word in the target corpus.
  • Fine-tuning
    In this step, the language model is fine-tuned with a classification layer for the target task (for example, classifying the sentiment of movie reviews in Figure 1-8).
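As a minimal sketch of the language modeling objective used in pretraining (and in the domain-transfer step), the snippet below lets a causal language model compute its own next-word prediction loss. The "gpt2" checkpoint is used purely for illustration.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are the", return_tensors="pt")
# With labels equal to the input ids, the model computes the next-word
# (cross-entropy) loss internally.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)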

By introducing a viable framework for pretraining and transfer learning in NLP, ULMFiT paved the way for transformers to take off. In 2018, two transformer models combining self-attention with transfer learning were released:

  • GPT uses only the decoder part of the Transformer architecture, together with the same language modeling approach as ULMFiT. GPT was pretrained on a corpus of 7,000 unpublished books spanning genres such as adventure, fantasy, and romance.

  • BERT uses the encoder part of the Transformer architecture, together with a special form of language modeling called masked language modeling. The objective of masked language modeling is to predict randomly masked words in a text. For example, given a sentence like "I looked at my [MASK] and saw that [MASK] was late.", the model must predict the most likely candidates for the masked words denoted by [MASK]. BERT was pretrained on the BookCorpus and English Wikipedia.

GPT and BERT pioneered a new level of state-of-the-art on various NLP benchmarks and ushered in the era of Transformers.

However, with different research labs releasing their models in incompatible frameworks (PyTorch or TensorFlow), it was not always easy for NLP practitioners to port these models to their own applications. With the release of Hugging Face Transformers, a unified API across more than 50 architectures was progressively built. This library catalyzed an explosion of research on transformers and was quickly adopted by NLP practitioners, making these models easy to integrate into many real-world applications today. Let's take a look!

4. Hugging Face Transformers

Applying a new machine learning architecture to a new task is a complex undertaking that usually involves the following steps:

  1. Implement the model architecture in code, usually based on PyTorch or TensorFlow.
  2. Load the pretrained weights from a server (if available).
  3. Preprocess the inputs, pass them through the model, and apply some task-specific post-processing.
  4. Implement dataloaders, and define loss functions and optimizers to train the model.

Each step requires custom logic for each model and task. Usually when a research group releases a new paper, they also release the code and model weights. However, this code is rarely standardized and often requires days of engineering to adapt to new use cases.

This is where Hugging Face Transformers comes to the rescue of NLP practitioners! It provides a standardized interface to a wide range of transformer models, along with code and tools to adapt these models to new use cases. The library currently supports three major deep learning frameworks (PyTorch, TensorFlow, and JAX) and allows you to switch between them easily. Additionally, it provides task-specific heads so you can easily fine-tune transformers on downstream tasks such as text classification, named entity recognition, and question answering. This reduces the time it takes a practitioner to train and test a handful of models from a week to an afternoon!
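For example, loading a model with a sequence classification head attached takes just a couple of lines. This is a sketch; the checkpoint name and number of labels are only illustrative.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)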

Alibaba DAMO Academy has launched a "Chinese version" of Hugging Face Transformers called ModelScope.
ModelScope aims to be a next-generation, open-source model-as-a-service sharing platform that provides flexible, easy-to-use, and low-cost one-stop model services to AI developers, making model application easier.
Although this series is based on Hugging Face, you are strongly encouraged to try ModelScope as well! Its framework and interfaces are very similar to Hugging Face's, it hosts more Chinese-language models, and its documentation is friendlier to native Chinese speakers.

In the next section, we'll show that Hugging Face Transformers can be applied to some of the most common NLP tasks with just a few lines of code.

5. A tour of Transformer applications

Every NLP task starts with a piece of text, such as the following customer feedback about an online order:

text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

Depending on your application, the text you're working with could be a legal contract, a product description, or something else entirely. In the case of customer feedback, you might wonder whether the feedback is positive or negative. This task is known as sentiment analysis and is part of a broader topic of text classification that we will explore in Chapter 2. Now, let's see how we can use Huggingface Transformers to extract sentiment from our text.

5.1 Text Classification

As we'll see in later chapters, Hugging Face Transformers has a layered API that lets you interact with the library at different levels of abstraction. In this chapter we'll start with pipelines, which abstract away all the steps needed to turn raw text into a set of predictions from a fine-tuned model.

In Hugging Face Transformers, we instantiate a pipeline by calling the pipeline() function and providing the name of the task we are interested in:

from transformers import pipeline

classifier = pipeline("text-classification")

The first time you run this code, you'll see several progress bars as the pipeline automatically downloads the model weights from the Hugging Face Hub. The second time you instantiate the pipeline, the library will notice that the weights are already downloaded and will use the cached version instead. By default, the text-classification pipeline uses a model designed for sentiment analysis, but it also supports multi-class and multi-label classification.
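If you prefer to be explicit, you can also pass a model identifier from the Hub yourself. The checkpoint below is, to the best of our knowledge, the default sentiment model used by this pipeline, and is given purely as an example:

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)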

Now that we have our pipeline, let's make some predictions! Each pipeline takes a text string (or a list of strings) as input and returns a list of predictions. Each prediction is a Python dictionary, so we can display it nicely as a DataFrame using Pandas:

import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)   
	label	score
0	NEGATIVE	0.901546

In this case the model is very confident that the text has negative sentiment, which makes sense given that we're dealing with a complaint from an angry customer! Note that for sentiment analysis tasks the pipeline only returns one of the POSITIVE or NEGATIVE labels, since the other can be inferred by computing 1 - score.

Now let's look at another common task, named entity recognition in text.

5.2 Named entity recognition

Predicting the sentiment of customer feedback is a good first step, but you often want to know whether the feedback is about a particular item or service. In NLP, real-world objects such as products, places, and people are called named entities, and extracting them from text is called named entity recognition (NER). We can apply NER by loading the corresponding pipeline and feeding it our customer feedback text:

ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)    
	entity_group	score	word	start	end
0	ORG	0.879010	Amazon	5	11
1	MISC	0.990859	Optimus Prime	36	49
2	LOC	0.999755	Germany	90	97
3	MISC	0.556569	Mega	208	212
4	PER	0.590256	##tron	212	216
5	ORG	0.669692	Decept	253	259
6	MISC	0.498350	##icons	259	264
7	MISC	0.775361	Megatron	350	358
8	MISC	0.987854	Optimus Prime	367	380
9	PER	0.812096	Bumblebee	502	511

You can see that the pipeline detected all the entities and assigned each one a category such as ORG (organization), LOC (location), or PER (person). Here we used the aggregation_strategy argument to group words according to the model's predictions; for example, the entity "Optimus Prime" consists of two words but is assigned a single MISC (miscellaneous) category. The scores tell us how confident the model was about each entity it recognized. We can see that it was least confident about "Decepticons" and the first occurrence of "Megatron", both of which it also failed to group as a single entity.

See those odd hash symbols (#) in the word column of the previous table? They are produced by the model's tokenizer, which splits words into atomic units called tokens. You will learn all about tokenization in Chapter 2.
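If you are curious, you can inspect these subword tokens yourself with a tokenizer. This is just a quick sketch, and the checkpoint name is only an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Megatron and the Decepticons"))
# prints a list of word pieces, with continuation pieces prefixed by ##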

Extracting all named entities in text is fine, but sometimes we want to ask more specific questions. This is where we can use question answering.

5.3 Question Answering (Machine Reading Comprehension)

In question answering, we provide the model with a passage of text called the context, together with a question whose answer we would like to extract. The model then returns the span of text corresponding to the answer. This is sometimes called the machine reading comprehension task. Let's see what we get when we ask a specific question about the customer feedback:

reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])
	score	start	end	answer
0	0.631291	335	358	an exchange of Megatron

We can see that, along with the answer, the pipeline also returns start and end integers that correspond to the character indices where the answer span was found (just as with NER tagging). We will examine several flavors of question answering in Chapter 7, but this particular kind is called extractive question answering because the answer is extracted directly from the text.
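Because start and end are character offsets into the original text, we can double-check the answer span with simple slicing:

print(text[outputs["start"]:outputs["end"]])   # 'an exchange of Megatron' in this example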

Using this method, you can quickly read and extract relevant information from your customers' feedback. But what if you get tons of long-winded complaints and don't have time to read them? Let's see if a summary model can help!

5.4 Summarization

The goal of text summarization is to take a long text as input and produce a short version containing all the relevant facts. This is a much more complicated task than the previous ones, since it requires the model to generate coherent text. We can instantiate a summarization pipeline as follows:

summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])
Bumblebee ordered an Optimus Prime action figure from your online store in
Germany. Unfortunately, when I opened the package, I discovered to my horror
that I had been sent an action figure of Megatron instead.

That summary isn't too bad! Although parts of the original text were copied verbatim, the model captured the essence of the problem and correctly identified "Bumblebee" (who appears at the end) as the author of the complaint. In this example you can also see that we passed some keyword arguments to the pipeline, such as max_length and clean_up_tokenization_spaces; these let us adjust the output at runtime.

But what happens when you get feedback in a language you don't understand? You can use Google Translate, or you can use your own transformer which will translate it for you!

5.5 Translation

Like summarization, translation is a task whose output consists of generated text. Let's use a translation pipeline to translate an English text into German:

translator = pipeline("translation_en_to_de", 
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])
Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus
Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete,
entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von
Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich
hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere
einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt.
Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von
Ihnen zu hören. Aufrichtig, Bumblebee.

Again, the model produced a very good translation, correctly using German formal pronouns such as "Ihrem" and "Sie". Here we've also shown how to override the pipeline's default model in order to pick the best one for your application; you can find models for thousands of language pairs on the Hugging Face Hub. Before we step back and look at the whole Hugging Face ecosystem, let's examine one final application.

5.6 Text generation

Let's say you want to respond to customer feedback faster with the help of an automated system. With a text generation model, you could do it as follows:

generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])
Dear Amazon, last week I ordered an Optimus Prime action figure from your online
store in Germany. Unfortunately, when I opened the package, I discovered to my
horror that I had been sent an action figure of Megatron instead! As a lifelong
enemy of the Decepticons, I hope you can understand my dilemma. To resolve the
issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered.
Enclosed are copies of my records concerning this purchase. I expect to hear
from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The order was
completely mislabeled, which is very common in our online store, but I can
appreciate it because it was my understanding from this site and our customer
service of the previous day that your order was not made correct in our mind and
that we are in a process of resolving this matter. We can assure you that your
order

Well, we probably wouldn't want to use this completion to calm Bumblebee down, but you get the general idea.

Now that you've seen some cool applications of Transformer models, you might be wondering where model training happens. All the models we use in this chapter are publicly available and have been fine-tuned for the task at hand. However, in general, you need to adapt the model to your own data, and in the next chapters, you will learn how to do this.

But training a model is only a small part of any NLP project — being able to process data efficiently, share results with colleagues, and make your work reusable are also key components. Fortunately, there is a practical ecosystem around Hugging Face Transformers, and these tools support many modern machine learning workflows. Let's take a look.

6. The Hugging Face ecosystem

Starting from Hugging Face Transformers, a whole ecosystem has rapidly grown up to accelerate your NLP and machine learning projects. The Hugging Face ecosystem consists mainly of two parts, a family of libraries and the Hub, as shown in Figure 1-9.
Figure 1-9: Overview of the Hugging Face ecosystem

The libraries provide the code, while the Hub provides pretrained model weights, datasets, scripts for evaluation metrics, and more. In this section we briefly introduce the various components; we will see more of them throughout the book.

6.1 Hugging Face Hub

As mentioned earlier, transfer learning is one of the key factors driving the success of Transformers, as it makes it possible to reuse pre-trained models in new tasks. Therefore, it is critical to be able to quickly load pre-trained models and run experiments with them.

The Hugging Face Hub hosts over 20,000 freely available models. As shown in Figure 1-10, there are filters for tasks, frameworks, datasets, and more, designed to help you navigate the Hub and quickly find promising candidates.
Figure 1-10. The Models page of the Hugging Face Hub, with filters on the left and a list of models on the right

As we saw with the pipelines, loading a promising model is then literally just one line of code away. This makes experimenting with a wide range of models simple, and lets you focus on the domain-specific parts of your project.
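For instance, downloading and instantiating a pretrained model really is a one-liner (a sketch; the checkpoint name is only an example):

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")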

In addition to model weights, the Hub also hosts datasets and scripts for computing metrics, which allow you to reproduce published results or leverage other data for your applications.

The Hub also provides model cards and dataset cards that document the contents of models and datasets and help you make an informed decision about whether they are right for you. One of the coolest features of the Hub is that you can try out any model directly through task-specific interactive widgets, as shown in Figure 1-11.

Figure 1-11. An example model card from the Hugging Face Hub; the inference widget on the right lets you interact with the model

6.2 Hugging Face Tokenizers

Behind every pipeline example you've seen in this chapter is a tokenization step that splits the raw text into smaller pieces called tokens. We'll go into the details of how this works in Chapter 2; for now it is enough to understand that tokens may be words, parts of words, or just characters such as punctuation marks. Transformer models are trained on numerical representations of these tokens, so getting this step right is very important for the whole NLP project!
Hugging Face Tokenizers provides many tokenization strategies and, thanks to its Rust backend, is extremely fast at tokenizing text. It also takes care of all the pre- and post-processing steps, such as normalizing the inputs and converting the model outputs into the required format. With Hugging Face Tokenizers we can load a tokenizer in the same way that we load pretrained model weights with Hugging Face Transformers.
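As a small sketch of what that looks like in practice (the checkpoint name is only an example), a tokenizer can be loaded and a sentence round-tripped through it like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer("Transformers are great!")
print(encoded["input_ids"])                    # numerical representation of the tokens
print(tokenizer.decode(encoded["input_ids"]))  # back to (normalized) text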

We need a dataset and metrics to train and evaluate the model, so let's take a look at Hugging Face Datasets, which takes care of this.

6.3 Hugging Face Datasets

Loading, processing, and storing datasets can be a tedious process, especially when the dataset is too large to fit in your laptop's memory. Also, you usually need to implement various scripts to download and convert the data into a standard format.

Hugging Face Datasets simplifies this process by providing a standard interface to thousands of datasets that can be found on the Hub. It also provides smart caching (so you don't have to redo the preprocessing every time you run your code) and avoids RAM limitations by using a special mechanism called memory mapping, which stores the contents of a file in virtual memory and enables multiple processes to modify the file more efficiently. The library also interoperates with popular frameworks like Pandas and NumPy, so you can keep using your favorite data manipulation tools.
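A minimal sketch of loading a Hub dataset (the "imdb" dataset, mentioned in the ULMFiT example above, serves only as an illustration):

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")         # downloaded once, then cached and memory-mapped
print(dataset[0]["text"][:100], dataset[0]["label"])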

However, a good dataset and a powerful model are worthless if you cannot reliably measure performance. Unfortunately, classic NLP metrics come with many different implementations that can vary slightly and lead to deceptive results. By providing scripts for many metrics, Hugging Face Datasets helps make experiments more reproducible and results more trustworthy.
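As a sketch of the metrics interface (note that newer releases move this functionality into the separate evaluate library), a metric can be loaded and applied like this:

from datasets import load_metric

accuracy = load_metric("accuracy")
print(accuracy.compute(predictions=[0, 1, 1], references=[0, 1, 0]))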

With the Hugging Face Transformers, Tokenizers, and Datasets libraries, we have everything we need to train our own transformer models! However, as we will see in Chapter 10, there are situations where we need fine-grained control over the training loop. This is where the last library of the ecosystem comes into play: Hugging Face Accelerate.

6.4 Hugging Face Accelerate

If you've ever had to write your own training scripts in PyTorch, you've probably run into some trouble trying to port code that runs on your laptop to code that runs on your organization's cluster.

Hugging Face Accelerate adds an abstraction layer to your normal training loops that takes care of all the custom logic needed for the training infrastructure. This literally accelerates your workflow by simplifying changes of infrastructure when necessary.
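The sketch below shows, on a deliberately tiny toy model and dataset, the few lines Accelerate asks you to change in a plain PyTorch training loop. Everything apart from the Accelerator calls is ordinary PyTorch and is only there to make the example self-contained.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = torch.nn.Linear(4, 2)                                   # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
dataloader = DataLoader(dataset, batch_size=8)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)                                  # replaces loss.backward()
    optimizer.step()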

This sums up the core components of Hugging Face's open source ecosystem. But before wrapping up this chapter, let's take a look at some of the common challenges that come with trying to deploy Transformers in the real world.

7. The main challenges faced by Transformers

In this chapter, we took a brief look at various NLP tasks that can be handled with transformer models. Reading the headlines of media coverage, it sometimes sounds as if their capabilities are limitless. However, as useful as transformers are, they are far from a panacea. Here are some of the challenges associated with them, which we will explore throughout the book:

  • Language
    NLP research is predominantly in English. There are some models for other languages, but pretrained models for rare or low-resource languages are hard to find. In Chapter 4 we explore multilingual transformers and their ability to perform zero-shot cross-lingual transfer.

  • Data availability
    Although we can use transfer learning to dramatically reduce the amount of labeled training data our models need, it is still a lot compared to how much a human needs to perform the task. Chapter 9 deals with scenarios where you have little to no labeled data.

  • Working with long documents
    Self-attention works extremely well on paragraph-long texts, but it becomes very expensive when we move to longer texts such as whole documents. Approaches to mitigate this are discussed in Chapter 11.

  • Opacity
    Like other deep learning models, transformers are to a large extent opaque. It is hard or impossible to explain "why" a model made a certain prediction. This is an especially difficult challenge when these models are deployed to make critical decisions. We will explore some methods for probing the errors of transformer models in Chapters 2 and 4.

  • Bias
    Transformer models are predominantly pretrained on text data from the internet, which imprints any biases present in that data into the models. Making sure these are neither racist, sexist, nor worse is a challenging task. We discuss some of these issues in more detail in Chapter 10.

8. Summary

Hopefully now you're excited to learn how to start training models and integrating these general models into your own applications! You've seen in this chapter that with just a few lines of code, you can use state-of-the-art models for classification, named entity recognition, question answering, translation, and summarization, but that's really just the "tip of the iceberg."

In the following chapters, you'll learn how to adapt Transformers to a wide range of use cases, such as building text classifiers, or lightweight models for production, or even training language models from scratch. We'll take a hands-on approach to explaining each concept.

Now that we've grasped the basic concepts behind Transformers, it's time to master our first application: text classification. This is the subject of the next chapter!


Origin blog.csdn.net/u011239443/article/details/127783573