Deep Learning for Natural Language Processing: BERT

Natural Language Processing (NLP) includes natural language understanding and natural language generation. Applications of natural language understanding include semantic analysis, automated customer service, speech recognition, machine translation, etc.

The transformer deep network architecture plays a pivotal role in the field of NLP. BERT is a natural language model based on the transformer, as is the GPT-3 language model. The transformer was first published by the Google research team in 2017 in the paper "Attention Is All You Need", which led to major advances in the field of NLP. BERT has many variant architectures: RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, SciBERT, BioBERT, MobileBERT, TinyBERT and CamemBERT, all of which are based on Google's open-source BERT project on GitHub. Their core structure is also attention; see the relevant sections.

Transformer (BERT) performance on NLP tasks

Download links for all the source code are given in this blog post; the following analysis is based on the BERT model.

NLP task: judging whether a sentence is positive or negative

This can be applied to any work, product, etc.: for example, deciding how widely to schedule a newly released film based on its reviews, judging a book's quality among similar books based on its reviews, or increasing the exposure of an e-commerce product based on its favorable reviews. The example below uses BERT to analyze whether the input sentence is positive or negative.

#Copy right [email protected]. All rights reserved.
from transformers import pipeline
import textwrap
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

#Classifying whole sentences
sentence = 'Both of these choices are good if you’re just starting to work with deep learning frameworks. Mathematicians and experienced researchers will find PyTorch more to their liking. Keras is better suited for developers who want a plug-and-play framework that lets them build, train, and evaluate their models quickly. Keras also offers more deployment options and easier model export.'
classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
c = classifier(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print(f"\nThis sentence is classified with a {
      
      c[0]['label']} sentiment")

The classification output for the above sentence is:

This sentence is classified with a POSITIVE sentiment

Classification of words in sentences

It classifies the words or compound words in a sentence as organizations, person names or place names (named entity recognition). ORG is the abbreviation for organization.

sentence = "Both platforms enjoy sufficient levels of popularity that they offer plenty of learning resources. Keras has excellent access to reusable code and tutorials, while PyTorch has outstanding community support and active development."
ner = pipeline('token-classification', model='dbmdz/bert-large-cased-finetuned-conll03-english', grouped_entities=True)
ners = ner(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print('\n')
for n in ners:
  print(f"{
      
      n['word']} -> {
      
      n['entity_group']}")

Its output is as follows:

Keras -> ORG
PyTorch -> ORG

Question answering

This application is used in the Google search engine. The Asian Games will be held in Hangzhou; if you search Google for information about them, you can see that the results page gives document links and document fragments, and highlights the core answer on the search page. The answers are very accurate, and the presentation is also very user-friendly.
[Figure: Google search results page with the answer highlighted]
In the example below, the original text uses the word "released" for the release date; to verify the reliability of the model, the question uses the word "announced" instead, which tests the robustness of question answering. In practice, whether the question says "released" or "announced", the result is the correct value, 2015.

#question Answering
context = '''
TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.
TensorFlow was developed by the Google Brain team for internal Google use in research and production. The initial version was released under the Apache License 2.0 in 2015. Google released the updated version of TensorFlow, named TensorFlow 2.0, in September 2019.
TensorFlow can be used in a wide variety of programming languages, most notably Python, as well as Javascript, C++, and Java. This flexibility lends itself to a range of applications in many different sectors. '''


question = 'When was TensorFlow initially announced?'

print('Text:')
print(wrapper.fill(context))
print('\nQuestion:')
print(question)


qa = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')

print('\nQuestion:')
print(question + '\n')
print('Answer:')
a = qa(context=context, question=question)
print(a['answer'])

Its output is:

Text:
 TensorFlow is a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a
particular focus on training and inference of deep neural networks. TensorFlow
was developed by the Google Brain team for internal Google use in research and
production. The initial version was released under the Apache License 2.0 in
2015. Google released the updated version of TensorFlow, named TensorFlow 2.0,
in September 2019. TensorFlow can be used in a wide variety of programming
languages, most notably Python, as well as Javascript, C++, and Java. This
flexibility lends itself to a range of applications in many different sectors.

Question:
When was TensorFlow initially announced?

Question:
When was TensorFlow initially announced?

Answer:
2015

Summarization

This is most useful when a summary of a large body of text is required, such as news articles, video or story transcripts, training summaries, etc.

#Text summarization
review = '''
While both Tensorflow and PyTorch are open-source, they have been created by two different wizards. Tensorflow is based on Theano and has been developed by Google, whereas PyTorch is based on Torch and has been developed by Facebook.
 The most important difference between the two is the way these frameworks define the computational graphs. While Tensorflow creates a static graph, PyTorch believes in a dynamic graph. So what does this mean? In Tensorflow, you first have to define the entire computation graph of the model and then run your ML model. But in PyTorch, you can define/manipulate your graph on-the-go. This is particularly helpful while using variable length inputs in RNNs.
 Tensorflow has a more steep learning curve than PyTorch. PyTorch is more pythonic and building ML models feels more intuitive. On the other hand, for using Tensorflow, you will have to learn a bit more about it’s working (sessions, placeholders etc.) and so it becomes a bit more difficult to learn Tensorflow than PyTorch.
 Tensorflow has a much bigger community behind it than PyTorch. This means that it becomes easier to find resources to learn Tensorflow and also, to find solutions to your problems. Also, many tutorials and MOOCs cover Tensorflow instead of using PyTorch. This is because PyTorch is a relatively new framework as compared to Tensorflow. So, in terms of resources, you will find much more content about Tensorflow than PyTorch.
 This comparison would be incomplete without mentioning TensorBoard. TensorBoard is a brilliant tool that enables visualizing your ML models directly in your browser. PyTorch doesn’t have such a tool, although you can always use tools like Matplotlib. Although, there are integrations out there that let you use Tensorboard with PyTorch. But it’s not supported natively.'''

print('\nOriginal text:\n')
print(wrapper.fill(review))
summarize = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')
summarized_text = summarize(review)[0]['summary_text']
print('\nSummarized text:')
print(wrapper.fill(summarized_text))

The output is as follows:

Summarized text:
 While Tensorflow creates a static graph, PyTorch believes in a dynamic graph .
This is particularly helpful while using variable length inputs in RNNs .
TensorBoard is a brilliant tool that enables visualizing your ML models directly
in your browser . Pytorch is more pythonic and building ML models feels more
intuitive .

Fill in the blank (masked word prediction)

This is most commonly used in editors, such as the suggestions and spell checking offered when writing documents in Word; some code editors also provide similar functions.

#Fill in the blanks
sentence = 'It is the national <mask> of China'
mask = pipeline('fill-mask', model='distilroberta-base')
masks = mask(sentence)
for m in masks:
  print(m['sequence'])

The outputs are ordered by probability: "anthem" has the highest probability, and the rest follow in order. The output is as follows:

It is the national anthem of China
It is the national treasure of China
It is the national motto of China
It is the national pride of China
It is the national capital of China

Translation

Translation between different languages is a very wide application. Since the built-in models do not include translation between Chinese and English, here we choose translating English into German as an example.

#Translation
english = '''I like artificial intelligence very much!'''

translator = pipeline('translation_en_to_de', model='t5-base')
german = translator(english)
print('\nEnglish:')
print(english)
print('\nGerman:')
print(german[0]['translation_text'])

The output is as follows:

English:
I like artificial intelligence very much!
German:
Ich mag künstliche Intelligenz sehr!

History of Transformers

In 2017, a major challenge for large-scale natural language models was labeling datasets, which requires a lot of manpower and time. The ULMFiT model proposed afterwards did not need labeled datasets, which meant that high-quality text such as Wikipedia and published books could be used directly for model training. In June 2018, OpenAI launched GPT (the Generative Pre-Training model), which can be used for various NLP tasks. A few months later, the Google research team released the BERT model, which after fine-tuning achieved the highest level at the time. The previous section gave examples of BERT's commercial applications. In February 2019, the OpenAI team released a larger model with better results, GPT-2, but did not open-source its details. In the second half of the same year, the Facebook team released BART and the Google team released T5, both large pre-trained models using the same architecture as the original transformer. In the same year there were also smaller models: DistilBERT is only 60% of BERT's size, but reaches 95% of its performance. In 2020, the OpenAI team proposed GPT-3, a language model with an even larger size and higher accuracy, and opened some APIs, but did not release the training dataset or a free model. The practicality and model size of these developments are shown in the figure:
[Figure: model sizes of transformer-based language models over time]

BERT model

BERT is the abbreviation of Bidirectional Encoder Representations from Transformers. Its training data consists of English Wikipedia, with 2.5 billion words, and published books, with 800 million words. BERT models come in two categories, case-sensitive (cased) and case-insensitive (uncased). If you want to calculate the size of the case-sensitive BERT, you can do so through the model checkpoint officially provided by BERT:

#Imports needed for this snippet to run on its own
from transformers import AutoModel
import torch

def get_model_size(checkpoint='bert-base-cased'):
    '''
    Usage:
        checkpoint - NLP model checkpoint with its configuration and associated weights
        returns the number of parameters of the NLP model
    '''
    model = AutoModel.from_pretrained(checkpoint)
    return sum(torch.numel(param) for param in model.parameters())


checkpoint = 'bert-base-cased'
print(f"The number of parameters for {
      
      checkpoint} is : {
      
      get_model_size(checkpoint)}")

The output is:

The number of parameters for bert-base-cased is : 108310272

From the number of parameters, the memory required for model inference can be roughly estimated. Each parameter is a four-byte floating-point number, and the number of parameters shown above is about 108 million, so inference requires roughly 4 × 108 million ≈ 432 MB of memory. To run the 175-billion-parameter GPT-3 model, about 175 billion × 4 bytes ≈ 700 GB of memory is needed.
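As a quick check, this estimate can be reproduced with a small helper function (a minimal sketch; it assumes fp32 weights, i.e. 4 bytes per parameter, and ignores activation and optimizer memory):

#Rough inference-memory estimate (assumes fp32 weights, 4 bytes per parameter)
def estimate_inference_memory(num_params, bytes_per_param=4):
    return num_params * bytes_per_param

print(f"bert-base-cased : {estimate_inference_memory(108_310_272) / 1e6:.0f} MB")     # ~433 MB
print(f"GPT-3 175B      : {estimate_inference_memory(175_000_000_000) / 1e9:.0f} GB") # 700 GB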

Transformer architecture and BERT

The transformer architecture diagram given in the "Attention is all you need" paper is as follows:
[Figure: transformer architecture diagram from "Attention Is All You Need"]
The left side is the encoder part and the right side is the decoder part; Nx denotes the number of stacked encoder and decoder layers. In this figure there are six encoder layers and six decoder layers. The encoder-decoder architecture is well suited to tasks that take an input and produce an output, such as translation and summarization; BART and T5 are language models of this architecture. For tasks that only need to understand the input, such as judging whether a sentence expresses praise or criticism, only the encoder part is needed; BERT, RoBERTa and DistilBERT belong to this type of model. Language models containing only the decoder part are suited to text generation, such as the GPT series of models released by OpenAI.
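All three families can be loaded through the same transformers API; the following is a minimal sketch (the checkpoint names are just common public models chosen for illustration, not the only options):

#Encoder-only, encoder-decoder and decoder-only models share the same loading API
from transformers import AutoModel

for name in ['bert-base-cased',   # encoder-only (BERT family)
             't5-small',          # encoder-decoder (BART/T5 style)
             'gpt2']:             # decoder-only (GPT family)
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: model type '{model.config.model_type}', {n_params:,} parameters")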
The core of the encoder and decoder that make up the transformer is self-attention. The encoder structure is the core structure of BERT, and attention is the core structure of both the encoder and the decoder. The transformer architecture relies entirely on self-attention to draw the global dependencies between input and output.
The self-attention formula is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{n}}\right)V$$
where Q is the query, K is the key and V is the value; the product $QK^T$ measures the similarity between word vectors. The formula looks abstract and obscure, so a simple worked example follows.

The following example illustrates the process of self-attention:
1. Input
2. Weight initialization
3. Derivation key, query and value
4. Calculate attention score for input 1
5. Calculate softmax
6. Weight (multiply) the values by the softmax scores
7. Sum the weighted values from step 6 to get output 1
8. For input 2 and input 3, repeat steps 4-7

  • Step 1: Prepare the input
    Here, three four-dimensional vectors are used as the input:
Input 1: [0, 1, 0, 1] 
Input 2: [3, 0, 4, 1]
Input 3: [1, 2, 1, 1]


  • Weight initialization
    Each input vector is represented by three vectors: key (yellow), query (red) and value (purple). If the dimension of each of these three representations is 3, the weight matrices have dimension 4×3.

To obtain the key, query and value representations, the weight parameters of the three projections are assumed to have been initialized as follows (in deep neural networks the weights are usually floating-point numbers initialized from a random normal distribution; parameter initialization is covered in the TensorFlow model optimization and tuning examples article):

//key
 [
  [1, 0, 2],
  [1, 0, 0],
  [0, 1, 2],
  [0, 1, 1]
]
 //query
[
  [1, 1, 0],
  [1, 2, 0],
  [1, 0, 1],
  [0, 0, 1]
]
 //value
 [
  [0, 1, 0],
  [1, 3, 0],
  [1, 0, 2],
  [1, 2, 0]
]
  • Calculation of key, query, and value
    The key of input 1 is calculated as follows:
               [1, 0, 2]
[0, 1, 0, 1] x [1, 0, 0] = [1, 1, 1]
               [0, 1, 2]
               [0, 1, 1]

The key values of input 2 and input 3 are calculated similarly:

//key2
               [1, 0, 2]
[3, 0, 4, 1] x [1, 0, 0] = [3, 5, 15]
               [0, 1, 2]
               [0, 1, 1]
//key3
               [1, 0, 2]
[1, 2, 1, 1] x [1, 0, 0] = [3, 2, 5]
               [0, 1, 2]
               [0, 1, 1]

In matrix form, the key calculation for all three inputs is:

               [1, 0, 2]
[0, 1, 0, 1]   [1, 0, 0]   [1, 1, 1]
[3, 0, 4, 1] x [0, 1, 2] = [3, 5, 15]
[1, 2, 1, 1]   [0, 1, 1]   [3, 2, 5]

The value and query representations are calculated similarly (in practice there are also bias terms):

//value
               [0, 1, 0]
[0, 1, 0, 1]   [1, 3, 0]   [2, 5, 0] 
[3, 0, 4, 1] x [1, 0, 2] = [5, 5, 8]
[1, 2, 1, 1]   [1, 2, 0]   [4, 9, 2]
//query
               [1, 1, 0]
[0, 1, 0, 1]   [1, 2, 0]   [1, 2, 1]
[3, 0, 4, 1] x [1, 0, 1] = [7, 3, 5]
[1, 2, 1, 1]   [0, 0, 1]   [4, 5, 2]


With these three sets of values, the attention score can be calculated.

  • Attention score for input 1
    To calculate the score for input 1, take the dot product of its query with all of the keys (besides dot-product scoring, scaled dot product, additive or concatenation-based scoring can also be used). This gives the attention scores shown in blue.
            [1, 3, 3]
[1, 2, 1] x [1, 5, 2] = [4, 28, 12]
            [1, 15, 5]


  • Calculation of softmax values
    Apply softmax to all of the (blue) score values; for convenience the results are rounded to one decimal place:
softmax([4, 28, 12]) ≈ [0.0, 1.0, 0.0]
  • Multiply the scores by the values
    Multiply each softmaxed attention score by its corresponding value vector; this yields three weighted value vectors (yellow).
1: 0.0 * [2, 5, 0] = [0.0, 0.0, 0.0]
2: 1.0 * [5, 5, 8] = [5.0, 5.0, 8.0]
3: 0.0 * [4, 9, 2] = [0.0, 0.0, 0.0]
  • Sum the weighted values to get the output
  [0.0, 0.0, 0.0]
+ [5.0, 5.0, 8.0]
+ [0.0, 0.0, 0.0]
-----------------
= [5.0, 5.0, 8.0]

The process from input 1 to output 1 is shown in the figure below.
[Figure: the calculation from input 1 to output 1]

Output 1 is obtained by adding the weighted value vectors element-wise; that is, output 1 is based on the query representation of input 1 itself and on the keys and values of all inputs.

  • Repeat the same operations for input 2 and input 3
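The whole hand calculation above can be reproduced in a few lines of NumPy. This is a minimal sketch; like the worked example, it omits the 1/√n scaling and the bias terms:

#Self-attention on the three example inputs above (no scaling, no biases)
import numpy as np

x = np.array([[0, 1, 0, 1],
              [3, 0, 4, 1],
              [1, 2, 1, 1]], dtype=float)                    # three inputs of dimension 4

w_key   = np.array([[1, 0, 2], [1, 0, 0], [0, 1, 2], [0, 1, 1]], dtype=float)
w_query = np.array([[1, 1, 0], [1, 2, 0], [1, 0, 1], [0, 0, 1]], dtype=float)
w_value = np.array([[0, 1, 0], [1, 3, 0], [1, 0, 2], [1, 2, 0]], dtype=float)

keys    = x @ w_key      # [[1,1,1], [3,5,15], [3,2,5]]
queries = x @ w_query    # [[1,2,1], [7,3,5],  [4,5,2]]
values  = x @ w_value    # [[2,5,0], [5,5,8],  [4,9,2]]

scores  = queries @ keys.T                                             # row i holds the scores for input i
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
outputs = weights @ values

print(outputs[0])        # ~[5. 5. 8.], matching output 1 above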

BERT classification

Fine-tune BERT using the open-source movie review dataset IMDB. The dataset has two columns: the first is the movie review, and the second is the label indicating whether the review is positive or negative, where 1 means a positive review and 0 means a negative review.
The situation of the IMDB dataset is as follows:

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Training on the complete IMDB dataset takes a lot of time, so a subset of 2,000 movie reviews can be used to demonstrate fine-tuning. In addition, a validation dataset is split off from the original data to check whether the model is overfitting or underfitting during training.
Download link for the code in this blog post.

The article "TensorFlow model optimization and tuning examples", another post in this deep learning series, discusses model optimization methods: you can change the batch size and the number of epochs, or switch to the distilbert-base-cased model, and compare the resulting accuracy and inference time against the source code in the link.
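A minimal fine-tuning sketch along these lines is shown below. It assumes the datasets and transformers libraries are installed; the 2,000-review subset, the 500-review validation split and the hyperparameters (batch size, epochs, sequence length) are illustrative values, not the settings from the downloadable code:

#Fine-tune BERT on a small IMDB subset (sketch)
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

imdb = load_dataset('imdb')
small_train = imdb['train'].shuffle(seed=42).select(range(2000))
small_eval  = imdb['test'].shuffle(seed=42).select(range(500))    # used as a validation set

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

small_train = small_train.map(tokenize, batched=True)
small_eval  = small_eval.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

args = TrainingArguments(output_dir='imdb-bert',
                         num_train_epochs=2,
                         per_device_train_batch_size=16,
                         evaluation_strategy='epoch')

trainer = Trainer(model=model, args=args,
                  train_dataset=small_train, eval_dataset=small_eval)
trainer.train()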


Origin blog.csdn.net/shichaog/article/details/125159999