StreamingLLM - Handling infinite-length input

About StreamingLLM

Efficient Streaming Language Models with Attention Sinks


Deploying large language models (LLMs) in streaming applications such as multi-turn conversations is highly desirable, but this poses two main challenges.
First, during the decoding stage, caching the Key and Value (KV) states of previous tokens consumes a large amount of memory.
Second, popular LLMs cannot generalize to texts longer than their training sequence length.
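
To get a rough sense of the first problem, note that KV-cache memory grows linearly with the number of cached tokens. The sketch below uses illustrative model dimensions (roughly a 7B-parameter decoder in fp16); they are assumptions for the example, not figures from the paper.

```python
# Back-of-the-envelope KV-cache size: 2 tensors (K and V) per layer,
# each holding n_heads * head_dim values per token, stored in fp16 (2 bytes).
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
    return seq_len * per_token

for n in (4_096, 32_768, 262_144):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:.1f} GiB")
```

For these assumed dimensions the cache is about 2 GiB at 4K tokens and 128 GiB at 256K tokens, which is why caching every previous token does not work for streaming inputs.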

Window attention, which caches only the most recent KV states, is a natural approach, but it fails once the text length exceeds the cache size.
We observe an interesting phenomenon, which we call attention sink: keeping the KV of the initial tokens largely recovers the performance of window attention.
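
A minimal sketch of this cache policy (not the authors' implementation; the tensor layout `[batch, n_heads, seq_len, head_dim]` and the parameter names `n_sink` and `window` are assumptions made for the example):

```python
import torch

def evict_kv(past_k: torch.Tensor, past_v: torch.Tensor,
             n_sink: int = 4, window: int = 1024):
    """Keep the first `n_sink` tokens (attention sinks) plus the most recent
    `window` tokens; evict everything in between."""
    seq_len = past_k.size(2)
    if seq_len <= n_sink + window:
        return past_k, past_v  # cache still fits, nothing to evict

    def keep(t: torch.Tensor) -> torch.Tensor:
        # initial "sink" tokens + sliding window of recent tokens
        return torch.cat([t[:, :, :n_sink, :], t[:, :, -window:, :]], dim=2)

    return keep(past_k), keep(past_v)
```

Setting `n_sink = 0` reduces this to plain window attention, the variant that breaks down once the first tokens are evicted; keeping a few sink tokens (the paper uses four) is what restores stable performance.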

In this paper, we first demonstrate that the attention sink emerges because the model assigns strong attention scores to the initial tokens, treating them as a "sink" even when they are not semantically important.
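
One way to see why some tokens end up absorbing this excess attention (a standard softmax-attention identity, written here in generic notation rather than notation from the paper): the attention weights of each query are normalized to sum to one, so even when no cached key matches the query well, the probability mass must land somewhere, and the always-visible initial tokens become that landing spot.

\[
a_{t,i} \;=\; \frac{\exp\!\left(q_t^\top k_i / \sqrt{d}\right)}{\sum_{j=1}^{t} \exp\!\left(q_t^\top k_j / \sqrt{d}\right)},
\qquad \sum_{i=1}^{t} a_{t,i} \;=\; 1 .
\]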

Origin blog.csdn.net/lovechris00/article/details/133604848