Shock! An RNN achieving dialogue quality beyond GPT, maybe even beyond LLaMA? Nearly 10,000 stars on GitHub

Hello everyone, I am zenRRan. Recently a group member shared an article I found truly shocking: a pure RNN architecture matching or even surpassing the performance of GPT-based large language models. At first I suspected it was the work of a crank, but on closer inspection the author has more than 100,000 followers on Zhihu.


The project's GitHub repository is called The RWKV Language Model [1], and it has nearly 10,000 stars.


Project introduction:

RWKV is an RNN with Transformer-level LLM performance, and it can also be trained directly like a GPT transformer (parallelizable). It is 100% attention-free: you only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode of the model to quickly compute the hidden state for the "RNN" mode.

So it combines the best of RNN and Transformer: great performance, fast inference, low VRAM usage, fast training, "unlimited" ctx_len, and free sentence embeddings (using the final hidden state).
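To make the "state at t gives the state at t+1" idea concrete, here is a minimal sketch of constant-memory recurrent inference. To be clear, this uses a plain tanh RNN cell as a stand-in, not RWKV's actual time-mixing and channel-mixing equations; it only illustrates the inference pattern described above (a fixed-size state per step, with the final state usable as a sentence embedding).

```python
import numpy as np

# Minimal sketch of RNN-style inference: a fixed-size hidden state is all
# that is carried from step t to step t+1. The update rule here is a plain
# tanh RNN cell, NOT RWKV's actual time-mix/channel-mix equations.
rng = np.random.default_rng(0)
vocab, d = 1000, 64
emb = rng.normal(0, 0.02, (vocab, d))    # token embeddings (random, for illustration)
W_x = rng.normal(0, 0.02, (d, d))        # input -> state
W_h = rng.normal(0, 0.02, (d, d))        # state -> state
W_o = rng.normal(0, 0.02, (d, vocab))    # state -> next-token logits

def step(state, token_id):
    """One recurrent step: the new state depends only on (old state, current token)."""
    new_state = np.tanh(emb[token_id] @ W_x + state @ W_h)
    logits = new_state @ W_o
    return new_state, logits

tokens = [1, 42, 7, 99]                  # some token ids
state = np.zeros(d)                      # memory cost is O(d), independent of length
for t in tokens:
    state, logits = step(state, t)

sentence_embedding = state               # "free" sentence embedding: the final hidden state
next_token = int(np.argmax(logits))      # greedy pick of the next token
```

This loop is also why inference is fast and VRAM-friendly: nothing grows with the context length, unlike a transformer's attention cache.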

Let’s take a look at the author’s Zhihu article below~


Zhihu: PENG Bo
Address: https://zhuanlan.zhihu.com/p/619721229


An introduction to all current RWKV models is here [2] (note that RWKV is 100% RNN, and I am the only one on earth who can do this with an RNN).

Below is 7B Raven-v7-ChnEng running in ChatRWKV v2 (no edits, no retries):

[Screenshots: 7B Raven-v7-ChnEng chat transcripts in ChatRWKV v2]

As you can see, the 7B model sometimes omits details and needs a bit of guidance. In fact, with a well-written front end that lets you edit the model's replies and add rich detail to its early answers, it can maintain a detailed, vivid style throughout. Note that the Chinese training data is currently only about 20 GB of general text plus 200 GB of web text, and even the vocabulary is English-oriented (many Chinese characters take two or three tokens), so the upcoming Chinese base models of RWKV will be much stronger.
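As a quick illustration of the vocabulary point, here is a tiny sketch (an assumption about English-centric, byte-level BPE vocabularies in general, not a description of RWKV's actual tokenizer): each Chinese character is three bytes in UTF-8, so a vocabulary built mostly from English text tends to spend two or three tokens per character.

```python
# Rough illustration: UTF-8 byte cost of Chinese vs. English text.
# An English-centric BPE vocabulary tends to fall back toward bytes for
# Chinese, so one character often costs two or three tokens.
for s in ["hello", "你好", "修仙"]:
    print(s, "->", len(s), "chars,", len(s.encode("utf-8")), "UTF-8 bytes")
# hello -> 5 chars, 5 UTF-8 bytes
# 你好 -> 2 chars, 6 UTF-8 bytes
# 修仙 -> 2 chars, 6 UTF-8 bytes
```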

In addition, you can add world settings. For example, with the following prompt I used + to have the model generate a variety of openings, all of which were quite good:

[Screenshots: story openings generated by the model]

Prompt: Please play a text adventure game with me; I am the protagonist. This is a cultivation-fantasy (xianxia) world with four sects. I will enter my actions; please show the result of each action and describe the environment in detail. My first action is "wake up"; please start the story.

[Screenshots: the resulting text-adventure transcript]

Now look at 14B Raven-v7-Eng, which is strong; these are Discord logs from overseas users (no edits, no retries):

[Screenshots: 14B Raven-v7-Eng Discord chat logs]

In addition, the Raven models can handle all kinds of tasks. For example, here is code written by 7B Raven-v7-Eng (with top-p = 0.8 it tends to make small mistakes; lowering top-p makes it more accurate):

[Screenshots: code generated by 7B Raven-v7-Eng]
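For readers unfamiliar with the top-p knob mentioned above, here is a minimal sketch of generic nucleus (top-p) sampling, not ChatRWKV's exact implementation: a smaller p keeps only the most probable tokens, which is why lowering it reduces small slips in generated code.

```python
import numpy as np

def sample_top_p(logits, p=0.8, temperature=1.0, rng=np.random.default_rng()):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample from that set."""
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalize inside the nucleus
    return int(rng.choice(keep, p=kept))

# Smaller p -> fewer candidate tokens -> more deterministic, fewer slips in code.
logits = np.array([4.0, 3.5, 1.0, 0.5, -1.0])
print(sample_top_p(logits, p=0.8))   # samples from the top two tokens
print(sample_top_p(logits, p=0.2))   # nucleus shrinks to a single token
```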

If all of the above had been done with GPT, it would not be surprising at all; the technical novelty would be zero.

But these are done with an RNN. As mentioned, I am the only one on the planet who can do this with an RNN.

RWKV gets stronger as it scales up, and it can handle long context lengths (ctx_len). Moreover, the RWKV algorithm is extremely simple, which makes it a better fit for hardware and custom chips.

So over the next few years I will use RWKV to comprehensively outclass transformers (I have been gradually accumulating resources), replace transformers, and become the infrastructure for all of humanity's large AI models.

Further evidence: other teams' current designs, whether the state-space family or Mega, are converging toward RWKV's exponential-moving-average approach. It seems RWKV is the right answer for now.
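To unpack the "exponential moving average" remark: the shared idea is that information from past tokens is blended in with exponentially decaying weights, which can be computed as a cheap recurrence. The sketch below is the generic per-channel EMA, not the exact formulation used by RWKV, the state-space models, or Mega.

```python
import numpy as np

def ema_over_tokens(x, decay):
    """Exponential moving average along the time axis.

    x:     (seq_len, d) token features
    decay: (d,) per-channel decay in (0, 1); larger = longer memory
    Closed form: y_t = sum_{i<=t} decay**(t-i) * (1-decay) * x_i
    """
    y = np.zeros_like(x)
    state = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]   # O(1) state per step
        y[t] = state
    return y

x = np.random.default_rng(0).normal(size=(5, 4))
decay = np.array([0.9, 0.5, 0.99, 0.1])  # each channel forgets at its own rate
print(ema_over_tokens(x, decay))
```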

Zero-shot results for RWKV vs. GPT trained on the same corpus:

[Chart: zero-shot benchmark results, RWKV vs. GPT on the same training corpus]

All of RWKV's design, R&D, and optimization, training models from 0.1B up to 14B, data cleaning, promotion, and user support (the most time-consuming part, lol) have been done by me alone. I will keep scaling it step by step to 100B (on Pile v2, 1.7T) and take down LLaMA first.

Zhihu trolls love the myth of OpenAI. As I have said, as long as I have high-quality data and compute, I can take on OpenAI by myself.

This is not because I know something special, but because what OpenAI does is brain-dead simple. Right now everyone is picking the brainless low-hanging fruit (just pile up data, compute, and manual labor, no thinking required), and nobody is tackling the genuinely hard problems. When ChatGPT came out, I said many times that the GPT line is brain-dead research with zero technical novelty. This is not just my view; every expert in the world knows it, and if you don't, you are not an expert. Even Baidu and the like can catch up (given enough investment).

In my opinion, ensuring a truly open AI requires a non-profit foundation, as with Linux. In fact, the comparison between Stable Diffusion and DALL·E 2 shows that the power of the open-source community is stronger than any closed organization (and this open-source ecosystem can and should include many commercial companies; VC investment is welcome).

Another reason open AI is necessary is that the arms race between East and West keeps escalating. I have spent a long time online, and the way people are misled online is very simple: they are led to believe that China is an evil empire (which is why I often say that the most effective way to maintain rule over humans is to fabricate imaginary enemies).

I think a globally open-source AI helps maintain mutual trust and reduces these risks. As for the risk of AGI itself, as I have said before, it may be a test that humanity must pass.


In fact, RWKV should make it into the textbooks first; I named it to stand alongside LSTM and the like.

[Screenshot: RWKV cited by other researchers]

Note that this does not mean RWKV is anything special. I think RWKV is a silly model; the whole design is too simple, with no real math to speak of. Luckily I started early, so I was the first to build this silly model.

Why post a picture like this? Because there are too many trolls now. Trolls have no judgment of their own and only believe authorities and con artists, so RWKV has to be endorsed by experts; there is no way around it.

Also, I often talk about "laning" against others, because the real boss is not OpenAI but AGI. The future "AGI" will be the crystallization of the thoughts of eight billion people around the world. I am prepared to face off against eight billion people. If you dare not lane against eight billion people, your only options are to surrender or to join the Adventists (in the Three-Body Problem sense).


That is the full Zhihu post. While we're at it, let's take a look at the comments.

[Screenshots: reader comments]

That's all for this article. In the end, time will tell.

Please share your views in the comments ~



References

[1] RWKV: https://github.com/BlinkDL/RWKV-LM

[2] Introduction of all models: https://zhuanlan.zhihu.com/p/618011122

Source: https://blog.csdn.net/qq_27590277/article/details/130652685