The difference between BERT and T5

The main difference between BERT and T5 is the size of the masked target used in prediction. BERT predicts a target consisting of a single word (single-token masking), whereas T5 can predict spans of multiple words, as shown in the figure above. This gives the model more flexibility in what it can learn during pre-training.
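To make the contrast concrete, here is a minimal sketch using the Hugging Face transformers pipelines. The checkpoints bert-base-uncased and t5-small are common public models chosen here as an assumption, not something taken from the article: BERT fills in exactly one [MASK] token, while T5 fills in sentinel placeholders such as <extra_id_0> that may stand for several words.

```python
# Rough sketch of the masking difference, assuming the `transformers`
# library and the public checkpoints below are available.
from transformers import pipeline

# BERT: single-token masking -- exactly one [MASK] is filled in.
bert_fill = pipeline("fill-mask", model="bert-base-uncased")
print(bert_fill("Today I am writing an [MASK] about search engines."))

# T5: span prediction -- each sentinel (<extra_id_0>, <extra_id_1>, ...)
# can stand for a span of several tokens, not just a single word.
t5_fill = pipeline("text2text-generation", model="t5-small")
print(t5_fill("Today I am writing <extra_id_0> about search engines."))
```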

The Transformer is a deep learning model that uses the self-attention mechanism. Self-attention works by establishing a degree of importance, or relationship, between a given word and the words around it. Before going into details, remember that a word embedding is a real-valued numerical representation of a word that encodes its meaning, which is useful for checking which other words have a similar encoding. A similar encoding means the words are closely related to each other. Back to self-attention!
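As a minimal sketch of what "similar encoding" means, the toy vectors below are made up purely for illustration; in practice they would come from a trained embedding model. Cosine similarity is one common way to compare two embeddings: vectors pointing in similar directions score close to 1.

```python
import numpy as np

# Toy, hand-made embeddings (purely illustrative, not from a real model).
embeddings = {
    "article": np.array([0.9, 0.1, 0.3]),
    "essay":   np.array([0.8, 0.2, 0.4]),
    "river":   np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words should get a higher similarity score than unrelated ones.
print(cosine_similarity(embeddings["article"], embeddings["essay"]))  # high
print(cosine_similarity(embeddings["article"], embeddings["river"]))  # lower
```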

"Today I am writing an article about search engines."

Suppose I want to compute the self-attention of the word "article".

SA('article') = the degree of relation between the word "article" and the other words in the sentence (SA = self-attention).

Each arrow represents the attention between the word "article" and another word in the sentence. In other words, each arrow indicates how related the two words are to each other. Note that this is the attention for only one word; we repeat this step for every other word in the sentence.

At the end of the process, we get a vector for each word containing numeric values that represent the word and its relationship to the other words.
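To make the computation concrete, here is a minimal sketch of scaled dot-product self-attention over toy embeddings for the example sentence. The embedding size and the random vectors are assumptions for illustration only; a real Transformer uses learned embeddings and learned query/key/value projections.

```python
import numpy as np

np.random.seed(0)

sentence = "Today I am writing an article about search engines".split()
d = 8                                  # toy embedding size (assumption)
X = np.random.randn(len(sentence), d)  # stand-in word embeddings

# Learned projections in a real Transformer; random here for illustration.
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: each row of `weights` says how strongly
# one word attends to every other word in the sentence.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
output = weights @ V  # one context-aware vector per word

idx = sentence.index("article")
for word, w in zip(sentence, weights[idx]):
    print(f"SA('article' -> '{word}') = {w:.3f}")
```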

Why did they create a self-attention mechanism?
The self-attention mechanism was created to address limitations found in earlier word-representation models.

For example, skip-gram is a model that generates word embeddings. During training, skip-gram learns to predict a certain number of surrounding words given a single word as input. Usually we specify the window size, i.e. how many surrounding words are considered for each input word.

But the main limitation of the model is that the prediction for a given word is based only on a limited number of surrounding words. Self-attention, on the other hand, not only looks at all the other words in the sentence but also assigns each of them a degree of importance.
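For comparison, here is a minimal skip-gram sketch using the gensim library. The tiny corpus and the hyperparameters are made up for illustration: the `window` parameter caps how many surrounding words ever influence each embedding, which is exactly the limitation described above.

```python
# Minimal skip-gram sketch, assuming the `gensim` library is installed.
# The tiny corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    "today i am writing an article about search engines".split(),
    "the bank of a river can flood in spring".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=32,  # embedding dimension
    window=2,        # only 2 words on each side are ever considered
    sg=1,            # sg=1 selects the skip-gram architecture
    min_count=1,
)

# The embedding for "bank" was learned from at most 2 neighbours per side,
# unlike self-attention, which can weigh every word in the sentence.
print(model.wv["bank"][:5])
```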

Example: how an ML model predicts the word "river" in the sentence "bank of a (river)".

Origin blog.csdn.net/qq_39970492/article/details/131212486