A brief description of the KVCache principle

During inference, GPT predicts the next token (i.e., a probability distribution over the vocabulary) based on the portion of the full question-and-answer sequence generated so far.

For example, suppose the question is [The king of heaven covers the tiger on earth,] and the answer is [The pagoda suppresses the river demon.].

So in the first step, GPT generates [Treasure] based on [The king of heaven covers the tiger on earth,], then generates [Tower] based on [The king of heaven covers the tiger on earth, Treasure], and so on until it produces the end-of-sequence token.
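
To make the repetition concrete, here is a minimal sketch of this naive decoding loop. The model is a hypothetical toy stand-in (any callable that maps a token sequence to next-token logits would do); the function names and toy vocabulary are illustrative, not from the original post.

```python
import torch

# Hypothetical stand-in for GPT: any callable mapping a token sequence
# to next-token logits would do for this illustration.
def toy_lm(token_ids: torch.Tensor) -> torch.Tensor:
    vocab_size = 100
    torch.manual_seed(int(token_ids.sum()))  # deterministic toy logits
    return torch.randn(vocab_size)

def generate_naive(prompt_ids, max_new_tokens=5, eos_id=0):
    """Naive decoding: every step re-feeds the WHOLE sequence,
    so Q/K/V for the question tokens are recomputed again and again."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_lm(torch.tensor(tokens))  # full sequence every step
        next_id = int(torch.argmax(logits))    # greedy choice of next token
        tokens.append(next_id)
        if next_id == eos_id:                  # stop at the terminator
            break
    return tokens

print(generate_naive([5, 6, 7, 8, 9]))         # question tokens, then the generated answer
```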

The Q, K, and V of the question [The king of heaven covers the tiger on earth,] are therefore recomputed many times. Because GPT uses unidirectional (causal) attention, each layer's K and V for the question tokens are computed only from the previous layer's outputs for the question tokens (or, at the first layer, from the question's embedding vectors), and never from the K or V of any answer token. They can therefore be cached to avoid this repeated computation.
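
This causality property can be checked directly. Below is a minimal single-head causal self-attention sketch (the weights and embeddings are random toys, purely for illustration): the outputs at the question positions do not change when answer tokens are appended, which is exactly what makes their K and V safe to cache layer by layer.

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: position i can only attend to
    positions <= i, so its output never depends on later tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d = 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
question = torch.randn(5, d)            # 5 question tokens (toy embeddings)
answer = torch.randn(3, d)              # 3 answer tokens appended later

out_q_only = causal_self_attention(question, w_q, w_k, w_v)
out_full = causal_self_attention(torch.cat([question, answer]), w_q, w_k, w_v)

# The question rows are identical (up to floating-point rounding): appending
# answer tokens never changes them, so their K/V can be cached layer by layer.
print(torch.allclose(out_q_only, out_full[:5]))   # True
```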

As shown below:

[Figure: illustration of KV caching during inference]

With the cache in place, GPT generates [Treasure] from [The king of heaven covers the tiger on earth,] and, as a by-product, KV(The king of heaven covers the tiger on earth,). It then generates [Tower] and KV(The king of heaven covers the tiger on earth, Treasure) from KV(The king of heaven covers the tiger on earth,) plus the single new token [Treasure], and so on.
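
Here is a sketch of one attention layer's decoding step with a KV cache, under the same toy setup as above: the question's K and V are computed once ("prefill", which would also produce the first generated token), and each later step feeds in only the newest token while appending its K and V to the cache. All names are illustrative.

```python
import torch

def attend_one_token(x_new, k_cache, v_cache, w_q, w_k, w_v):
    """One decoding step of single-head attention with a KV cache:
    only the newest token's Q/K/V are computed; earlier K/V come from
    the cache. No causal mask is needed, because the newest (last)
    position is allowed to attend to every earlier position."""
    q_new = x_new @ w_q                              # (1, d): Q for the new token only
    k_cache = torch.cat([k_cache, x_new @ w_k])      # append the new K row
    v_cache = torch.cat([v_cache, x_new @ w_v])      # append the new V row
    scores = q_new @ k_cache.T / k_cache.shape[-1] ** 0.5
    out_new = torch.softmax(scores, dim=-1) @ v_cache
    return out_new, k_cache, v_cache

torch.manual_seed(0)
d = 8
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

# Prefill: compute and cache K/V for the question once.
question = torch.randn(5, d)                         # toy question embeddings
k_cache, v_cache = question @ w_k, question @ w_v

# Decode: each step passes in only the latest token's representation.
for _ in range(3):
    x_new = torch.randn(1, d)                        # toy embedding of the newest token
    out, k_cache, v_cache = attend_one_token(x_new, k_cache, v_cache, w_q, w_k, w_v)
```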

As for why Q is not cached: in the inference scenario we only need the prediction for the last position, so each layer only has to output HS[-1]. HS[-1] is computed from all of V together with the last row A[-1] of the attention matrix; A[-1] is computed from Q[-1] and all of K; and Q[-1] is computed only from the last input token X[-1].
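
This claim can be verified numerically: the last row of the full attention output equals what you get from Q[-1] alone combined with all of K and V, so the older rows of Q are never needed again. A small check under the same toy setup (all names illustrative):

```python
import torch

torch.manual_seed(0)
n, d = 8, 8
x = torch.randn(n, d)                                # toy layer inputs X
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Full causal attention for every position.
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores = (q @ k.T / d ** 0.5).masked_fill(mask, float("-inf"))
hs_full = torch.softmax(scores, dim=-1) @ v          # HS for all positions

# Only the last position: A[-1] from Q[-1] and all of K (the last row of the
# causal mask is empty, so no masking is needed), then HS[-1] from A[-1] and V.
a_last = torch.softmax(q[-1:] @ k.T / d ** 0.5, dim=-1)
hs_last = a_last @ v

print(torch.allclose(hs_full[-1:], hs_last))         # True: old Q rows are not needed
```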

So we keep K and V complete by passing in the KV cache, and we feed in only the last input token, i.e., the token GPT generated in the previous step.
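
In practice, inference libraries expose this cache directly. As one hedged example (assuming the Hugging Face transformers and torch packages are installed; the "gpt2" checkpoint is just a convenient placeholder), the forward pass returns past_key_values, which is fed back in together with only the newest token:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

input_ids = tokenizer("The heavenly king covers the earth tiger,",
                      return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: run the whole prompt once and keep the per-layer K/V cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(5):
        # Decode: feed ONLY the newest token plus the cache.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```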

Origin: blog.csdn.net/wizardforcel/article/details/133131845