New work from Jia Jiaya and Song Han's teams: two lines of code greatly extend a large model's context window | Trending on GitHub


Cressy, from Aofeisi
Reprinted from: QbitAI (量子位)

With just two lines of code and about 11 hours of fine-tuning, a large model's 4k context window can be extended to 32k.

At the extreme, the window can be extended to 100,000 tokens, enough to read several chapters of a long novel, or an entire novella, in one go.

This new LoRA-based fine-tuning method for large models, proposed by Jia Jiaya and Song Han's teams, has made the GitHub trending list.


This method is called LongLoRA and is jointly produced by an all-Chinese team from the Chinese University of Hong Kong and MIT.

On a single machine with 8 A100s, it extends the window length several times faster than full fine-tuning.

Netizens who saw it couldn't help remarking that the efficiency really is impressive:


So, what changes will happen to the model after fine-tuning with LongLoRA?

Read a novel in one sitting

The model used in the research team's experiments was Llama 2.

After fine-tuning with LongLoRA, the window length of Llama 2-7B can be increased to a maximum of 100,000 tokens.

In actual tests, the fine-tuned model could read an entire novel in one go and then answer all kinds of questions about it.

For example, asking it to summarize the central idea Liu Cixin expresses in the third volume of "The Three-Body Problem", a task one level harder than merely summarizing the plot.

The answers given by the model include the dangers of first contact with extraterrestrial civilizations, the difficulty of interstellar travel and the fragility of human civilization, as well as the importance of unity and cooperation.

Indeed, every point is reflected in the original work, and the coverage is fairly comprehensive.


Besides summarizing the entire work, you can of course also ask about specific parts of it.

The model can also answer questions about individual characters fluently, for example how Sun Wukong grows and matures in "Journey to the West".

The model tells us that Sun Wukong is clever but mischievous at heart, and that he matures over the course of escorting Tang Sanzang on the journey to fetch the scriptures.

That summary is right on point.


And it is not just single characters: the model can also keep the complex relationships between different characters straight.

The question can be put bluntly, simply asking it to describe the relationships between the characters in the book ("Harry Potter").

The model's answer centers on Harry Potter, introducing his friends Weasley and Hermione, his enemy Malfoy, Professor Dumbledore, and other characters.


Besides novels, the LongLoRA-tuned Llama can also read papers, an instant boost to productivity.

Whether asked for an overall summary or about a specific detail, the fine-tuned model gives accurate answers:

(Screenshots of the paper Q&A demo; the Chinese portions were translated with Google Translate.)

To get a broader picture of the model's performance, the research team evaluated it on the following datasets:

  • PG19: a long-document dataset drawn from books, used to test language modeling.

  • Proof-pile: a dataset of mathematical papers from arXiv, used to test language modeling.

  • LongQA: a long-sequence question-answering dataset built by the authors for supervised fine-tuning.

  • LongChat: a long-conversation understanding dataset built by a third party, used to test comprehension of long narratives.

The results show that LongLoRA's perplexity on PG19 and Proof-pile is close to that of full fine-tuning.
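
As a rough illustration of how such long-context perplexity numbers are typically measured, here is a minimal sketch assuming a Hugging Face checkpoint and simple non-overlapping chunking; the model name and window size are placeholders, not the authors' evaluation script.

```python
# Minimal sketch of long-context perplexity evaluation over a long document.
# Model name, window size, and the non-overlapping chunking are assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"      # stand-in for a LongLoRA-extended checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def perplexity(text: str, window: int = 32768) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.shape[1], window):
        chunk = ids[:, start : start + window]
        if chunk.shape[1] < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss   # mean NLL over predicted tokens
        total_nll += loss.item() * (chunk.shape[1] - 1)
        total_tokens += chunk.shape[1] - 1
    return math.exp(total_nll / total_tokens)
```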


On the question-answering datasets, the LongLoRA-fine-tuned model also performed very well, even reaching SOTA on long-text understanding.


Of course, the significance of LongLoRA is not just that it increases the window length; the key is that it does so at far lower cost.

Take the 7B-parameter Llama 2 as an example: with full fine-tuning, extending the window from 4k to 32k on a single 8×A100 machine takes about five days.

With LongLoRA, the same job finishes in just 11.3 hours, less than half a day, a nearly tenfold speedup.

Going up to 65k, full fine-tuning would take more than 1,000 hours, while LongLoRA needs only 52.4 hours.


So how does LongLoRA do it?

"Big and small" reduces the amount of calculation

LongLoRA builds on LoRA and introduces a mechanism called "shift short attention".

This mechanism only requires two lines of code to implement:

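As a rough, self-contained PyTorch sketch of the idea (the function name, the plain scaled-dot-product attention call, and the omission of the causal mask are my own simplifications and assumptions, not the authors' exact code):

```python
# Hedged sketch of "shift short attention": shift half of the attention heads by
# half a group, compute attention only inside groups, then shift back.
import torch
import torch.nn.functional as F

def shift_short_attention(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    """qkv: (B, S, 3, H, D) packed queries/keys/values; S must be divisible by group_size."""
    B, S, _, H, D = qkv.shape
    G = group_size

    # "Line 1": shift the second half of the heads by G/2 tokens, then fold the
    # sequence into groups of G tokens each.
    qkv = torch.cat((qkv[..., : H // 2, :],
                     qkv[..., H // 2 :, :].roll(-G // 2, dims=1)), dim=3)
    qkv = qkv.reshape(B * S // G, G, 3, H, D)

    # Standard self-attention, restricted to each group of G tokens.
    q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))  # (B*S/G, H, G, D)
    out = F.scaled_dot_product_attention(q, k, v)
    out = out.transpose(1, 2).reshape(B, S, H, D)

    # "Line 2": roll the shifted half of the heads back into place.
    out = torch.cat((out[..., : H // 2, :],
                     out[..., H // 2 :, :].roll(G // 2, dims=1)), dim=2)
    return out

# Tiny smoke test: 1,024 tokens, 4 heads of dim 32, groups of 256 tokens.
x = torch.randn(1, 1024, 3, 4, 32)
print(shift_short_attention(x, group_size=256).shape)  # torch.Size([1, 1024, 4, 32])
```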

The core of the Transformer architecture is the self-attention computation.

Short attention splits the training text into groups and computes self-attention separately within each group, which is what reduces the amount of computation.

In the process, the attention heads are also divided into groups, and by shifting some of the heads, information is exchanged between groups.

The shifted groups overlap with the unshifted ones, ensuring that information can still flow across the entire text.

This way, each attention computation only operates on the tokens inside one group, and the overall cost drops sharply.
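
To put rough numbers on the saving (back-of-the-envelope arithmetic, not a figure from the paper): full self-attention over S tokens computes on the order of S² score entries, while attention restricted to S/G groups of G tokens computes (S/G)·G² = S·G, i.e. S/G times fewer.

```python
# Illustrative arithmetic only: attention-score entries with and without grouping.
S, G = 8192, 2048                 # sequence length, group size (hypothetical values)
full = S * S                      # full self-attention
grouped = (S // G) * G * G        # attention computed inside each group
print(full // grouped)            # 4 -> S/G times fewer score entries
```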


Besides splitting the input, LongLoRA also differs from plain LoRA in making the embedding layer and the normalization layers trainable.

These two parts account for only a tiny share of the parameters: in Llama 2-7B, the embedding layer makes up just 1.94%, and the normalization layers less than 0.004%.

Ablation results show that, alongside the core attention layers, these two small components also play an important role.
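
As a hedged sketch of how that setup could be reproduced with the Hugging Face PEFT library (the module names follow Llama-style checkpoints in Transformers; the rank and the exact module list are illustrative assumptions, not the authors' configuration):

```python
# Sketch: LoRA adapters on the attention projections, with the embedding and
# RMSNorm layers kept fully trainable via `modules_to_save`. Illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    modules_to_save=["embed_tokens", "input_layernorm",
                     "post_attention_layernorm", "norm"],       # small but important extras
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # shows the tiny fraction of trainable weights
```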


Beyond the core short-attention mechanism, the research team also brought in DeepSpeed and FlashAttention to further cut training cost.
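
For orientation, a generic way to wire those two pieces into a Hugging Face training run looks roughly like the snippet below; the ZeRO-2 settings and batch sizes are placeholder values, not the configuration the LongLoRA authors shipped.

```python
# Generic illustration of enabling FlashAttention and DeepSpeed ZeRO with
# Hugging Face Transformers; values are placeholders, not the authors' setup.
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",   # requires the flash-attn package
)

deepspeed_config = {
    "zero_optimization": {"stage": 2, "overlap_comm": True},  # shard optimizer state
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="longlora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=deepspeed_config,                # hands memory sharding to DeepSpeed
)
```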

LongLoRA-fine-tuned Llama 2 checkpoints with various parameter counts and window lengths have already been open-sourced; interested readers can find them on the GitHub project page.

Paper address:
https://arxiv.org/abs/2309.12307
GitHub project page:
https://github.com/dvlab-research/LongLoRA
