I turned myself into a digital clone and now you can chat with me

Besides flying a plane, cooking the perfect rib roast, getting six-pack abs, and making the company a lot of money, one of the things I've always wanted to do is implement a chatbot.

From the "little yellow chicken" that replied through simple keyword matching years ago, to today's ChatGPT, which approaches human-level intelligence, chat AI has kept improving. But these systems are all somewhat different from what I have in mind.

I chat with a lot of people on WeChat; with some I chat a lot, with others less. I talk in groups, I write blog and public account posts, I leave comments in many places, and I post on Weibo. These are the traces I have left in the online world. To some extent, they constitute the world's perception of me, and from that perspective, they also constitute me. Feed all of this data, my replies to different messages, every article I have written, every sentence, every Weibo post, into a neural network model to update its parameters, and in theory you could obtain a digital copy of me.

In principle, this is different from telling ChatGPT, "Please act as a person named Xiao Wang, whose experience is XXX." With ChatGPT's intelligence, that kind of acting is effortless and can be convincing, but its parameters do not change: not one of its hundreds of billions of parameters is updated. It just extracts some information from your prompt and uses its intelligence to deal with you. This is "acting" rather than "reshaping."


I like to work metaphors that aren't very useful into my articles, and I like to end with some kind of conclusion. When chatting with people, I like to brush things off with a perfunctory "yeah" and express surprise with a "卧槽." Sometimes I'm reticent, sometimes eloquent. These are the traits I can perceive in myself.

Beyond these, there are more ingrained habits that I can't detect myself, and these subtle, vague things are exactly what I can't describe to ChatGPT. It's like introducing yourself: you can say a great deal, yet still be a thousand miles from the real you, sometimes even the opposite, because when we are aware of ourselves, we are actually performing ourselves. Only when we are unselfconscious and immersed in life are we our true selves.

After ChatGPT was released, studying the technical principles of large language models out of personal interest felt like joining the Nationalist army in 1949: for an individual enthusiast, the possibility of surpassing ChatGPT in any respect, even in a small vertical niche, no longer exists. And since it isn't open source, there's nothing to do with it except use it.

But the open-source pre-trained language models that have appeared in the past two months, such as the well-known LLaMA and ChatGLM-6B, set my idea of cloning myself in motion again. Last week, I decided to give it a try.

First of all, I needed data: enough of it, and all of it generated by me. The simplest sources were my WeChat chat history and my blog. Since I have never completely cleared my chat history, from 2018 to now WeChat has taken up 80 GB of storage on my phone, which always gave me the feeling of someone occupying a plot of my land. If I could now put that data to use, the 80 GB and I could finally settle our score.


I had backed up my WeChat chat history a few years ago, and I dug up the tool I used back then: an open-source project on GitHub called WechatExporter (the link is at the end of the article). With it, you can back up all the WeChat chat records from an iPhone onto a Windows computer and export them as plain text. This is an operation that requires patience, because you first have to back up the entire phone to the computer, after which the tool reads the WeChat records out of the backup file and exports them.

The backup took about four hours; the export afterwards was quick. The result was many text files, one per conversation.


This includes group chats and one-on-one chats.

Then I started cleaning the data. In most groups I mostly lurked, so I screened out the groups where I was relatively active, and also picked out the one-on-one chats with people I talked to a lot, who were happy for me to use our records this way. In the end, about 50 chat text files were enough for my purposes.

I wrote a Python script to walk through these text files, find every message I sent together with the message that preceded it, turn each pair into a dialogue format, and save it all as JSON. With that, I had my own WeChat chat dataset, as sketched below.

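A minimal sketch of what that cleaning script might look like. The exact layout of WechatExporter's text output varies; the "name: message" parsing and the MY_NAME value below are illustrative assumptions, not the script I actually ran.

```python
import json
from pathlib import Path

MY_NAME = "DK"  # hypothetical display name as it appears in the exports

pairs = []
for txt_file in Path("exports").glob("*.txt"):
    prev_speaker, prev_text = None, None
    for line in txt_file.read_text(encoding="utf-8").splitlines():
        if ":" not in line:
            continue  # skip timestamps and blank lines
        speaker, text = line.split(":", 1)
        speaker, text = speaker.strip(), text.strip()
        # Keep only (their message -> my reply) pairs.
        if speaker == MY_NAME and prev_speaker not in (None, MY_NAME):
            pairs.append({"prompt": prev_text, "response": text})
        prev_speaker, prev_text = speaker, text

Path("wechat_dataset.json").write_text(
    json.dumps(pairs, ensure_ascii=False, indent=2), encoding="utf-8"
)
print(f"saved {len(pairs)} dialogue pairs")
```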

Around the same time, I asked a colleague to use a crawler to grab all of my own blog posts. Only after he had finished and sent them to me did I realize that I could simply have used the export function built into the blog admin panel. The blog data was very clean, but at first I didn't know how to use it: what I wanted to train was a chat model, and blog posts are long passages of prose, not chat. So the first training run used only the pure WeChat chat records.

I chose ChatGLM-6B as the pre-trained model. For one thing, its Chinese performance is already good enough; for another, at 6 billion parameters, my machine can run it without too much strain. There is one more reason: several fine-tuning solutions for it already exist on GitHub (I list them at the end of the article). On top of that, it can be abbreviated to 6B, and 6pen, which I built, is also named after 6, which made me all the more inclined to use it.

Since I ended up with about 100,000 WeChat dialogue pairs, I set a relatively low learning rate and increased the number of epochs. One night a few days ago, before going to bed, I finished the training script and kicked off the run, hoping it would be done by the time I woke up. That night I woke up roughly every hour.
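The post doesn't show the training script, but the fine-tuning repos listed at the end mostly follow the same pattern. Here is a schematic sketch of that pattern using Hugging Face plus PEFT LoRA; all hyperparameters are placeholders, and a real run would also mask the prompt tokens in the labels, batch the data, and handle fp16 more carefully.

```python
import json
import torch
from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

# Train small LoRA adapters instead of all 6 billion parameters.
lora = LoraConfig(r=8, lora_alpha=32, target_modules=["query_key_value"], lora_dropout=0.1)
model = get_peft_model(model, lora)

pairs = json.load(open("wechat_dataset.json", encoding="utf-8"))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # deliberately low LR

model.train()
for epoch in range(10):  # more epochs to compensate for the low learning rate
    for pair in pairs:
        ids = tokenizer(pair["prompt"] + pair["response"],
                        return_tensors="pt").input_ids.cuda()
        loss = model(input_ids=ids, labels=ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```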

When I woke up in the morning, training had finished. Unfortunately, the loss hadn't come down well, which meant the model from this 12-hour run wasn't good. But I'm a rookie at deep learning, and finishing a run without a single error already had me thanking the heavens, so I wasn't disappointed and started having conversations with this model.

To add a sense of ceremony, I didn't want to use a Jupyter notebook or chat in a dark terminal. I found an open-source chat front-end page, modified it slightly, then deployed the model, wrapped it in an API, and had the front-end page call that API. That gave me a proper chat experience.
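The wrapper can be very thin. A minimal sketch, assuming ChatGLM-6B's documented model.chat interface and a hypothetical /chat endpoint (the actual routes depend on whatever the open-source front-end expects):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = (AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
         .half().cuda().eval())

app = FastAPI()

class ChatRequest(BaseModel):
    query: str
    history: list = []  # prior (user, bot) turns

@app.post("/chat")
def chat(req: ChatRequest):
    # ChatGLM-6B ships a chat() helper that applies its prompt template.
    response, history = model.chat(tokenizer, req.query, history=req.history)
    return {"response": response, "history": history}
```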

Please don't laugh at me. This is the model trained on 100,000 of my own WeChat messages, and below is the first conversation between me and him (or it?).

[Screenshot: my first conversation with the model]

I tried a few more times, and the results were still not great. I'm not the kind of person who can't show something until it's polished to perfection, so without any shyness I sent it straight to a few friends. The feedback they gave me was: kind of like you. And they sent back screenshots of their conversations.

[Screenshots: friends' conversations with the model]

Even in this first version, the model does resemble me in places. I can't say exactly how, but it has a feeling of me.

If you ask it where it went to university or where its hometown is, it won't give accurate information, and what it says is bound to be wrong, because few people in my chat history ever asked me those questions. In a sense, the model doesn't know me; it is more like a clone.

When I receive a WeChat message with content A and reply with B, there are reasons for that, some of them stored among the tens of billions of neurons in my physical head. In theory, if I had generated enough data, perhaps hundreds of billions of samples, an AI model with enough parameters could come very close to my brain. 100,000 samples may be a little few, but it's enough to shift a portion of the model's 6 billion parameters and make it somewhat closer to me than the original pre-trained model.

It also has a bigger drawback: it can hardly squeeze out more than a few words, and its answers are very brief. In many cases that does match my WeChat chatting style, but it's not what I want. I want it to say more.

Then I suddenly thought of my blog. How could I turn those posts into questions and answers? I thought of ChatGPT. With a carefully constructed prompt, it successfully turned a passage from one of my blog posts into multi-turn Q&A:

[Screenshot: ChatGPT turning a blog excerpt into Q&A pairs]
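The post doesn't reveal the exact prompt, so the one below is illustrative. A sketch of the conversion call, using the OpenAI chat completions API as it existed at the time:

```python
import json
import openai

openai.api_key = "sk-..."  # your API key

PROMPT = (
    "Turn the following blog excerpt into several question-answer pairs, "
    "keeping the author's tone in the answers. Reply with only a JSON array "
    'of objects shaped like {"prompt": "...", "response": "..."}.\n\n'
)

def text_to_qa(excerpt: str) -> list:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT + excerpt}],
    )
    return json.loads(resp.choices[0].message.content)
```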

Sometimes ChatGPT returns content that doesn't conform to the format, so I wrote a proofreading script that rewrites any non-conforming responses into standard JSON, keeping the field names unchanged.
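The post doesn't describe the proofreading rules; one guess at the gist is to cut the JSON array out of whatever text surrounds it and re-parse:

```python
import json
import re

def proofread(raw: str) -> list:
    # Grab the first [...] span in the reply and try to parse it as JSON.
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
```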

Then I wrapped this up as an interface, put it on a server in Hong Kong, and wrote a script on my computer to split my blog posts into 500-character chunks and convert them to Q&A in batches, as sketched below. Limited by the speed of the ChatGPT API, it took almost another whole night to convert my two-hundred-odd blog posts into nearly 5,000 dialogue samples.
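A sketch of that batch step, reusing the hypothetical text_to_qa helper from above; the 500-character chunk size is from the post, the file paths are not:

```python
import json
from pathlib import Path

dataset = []
for post in Path("blog_posts").glob("*.txt"):
    text = post.read_text(encoding="utf-8")
    # Convert each ~500-character slice into Q&A pairs.
    for i in range(0, len(text), 500):
        try:
            dataset.extend(text_to_qa(text[i:i + 500]))
        except Exception:
            continue  # malformed replies go to the proofreading script instead

Path("blog_dataset.json").write_text(
    json.dumps(dataset, ensure_ascii=False, indent=2), encoding="utf-8"
)
```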

At this point I faced a choice. If I mixed the blog dialogues into the WeChat dataset and trained on both, the blog dialogues would make up such a small proportion that their influence might be tiny, meaning the result would hardly differ from the previous model. The other option was to train a brand-new model on the article data alone.

I asked 6pen's algorithm guy for help, and after confirming that model weights can be merged, and managing to coax the merging script out of him, I went with the second option.

With only 5,000 Q&A pairs, training was fast: an hour or two was enough. I spent the afternoon writing documentation while glancing at the training progress, and once it finished before the end of the workday, I started merging the models, fusing the earlier one trained on WeChat chat records with the one trained on my blog.

The two models' weights can be combined in any ratio, and I tried quite a few. Since the loss still rebounded a bit as the models converged, I also tried checkpoints saved at different steps.
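I can't show the colleague's script, but weight merging of this kind usually amounts to a weighted average of the two checkpoints' parameters. A minimal sketch, with illustrative file paths:

```python
import torch

chat_sd = torch.load("chatglm-wechat/pytorch_model.bin", map_location="cpu")
blog_sd = torch.load("chatglm-blog/pytorch_model.bin", map_location="cpu")

w_chat, w_blog = 0.7, 0.2  # the 7:2 ratio the post eventually settled on

# Blend each parameter tensor of the two fine-tuned models.
merged = {name: w_chat * chat_sd[name] + w_blog * blog_sd[name] for name in chat_sd}
torch.save(merged, "chatglm-merged/pytorch_model.bin")
```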

I spent whole nights talking to these models, trying to find the one that worked best, but it turned out to be hard to tell. The models behave differently: some are more irritable, some fawning and clingy, some particularly cold, some very enthusiastic. Then I realized that, to some extent, these may all be different sides of me. People who work in deep learning and know the principles will surely sneer at this notion, but that doesn't make it any less romantic.


In the end, I found that merging the chat model and the article model at a weight ratio of 7 to 2, using the checkpoint saved at step 6600, worked better more often than not. Admittedly, it may have been two in the morning by then and my judgment somewhat impaired, but in any case I settled on him as the final model.

I talked to him a lot.

[Screenshots: conversations with the final merged model]

Obviously, he is still very far from ChatGPT. He can't help me write code or copy, and he isn't smart enough. Because the training data contained no multi-turn dialogues, his grasp of multi-turn conversation is even worse. He also doesn't know me very well: beyond knowing his own name (which is also my name), he can't accurately answer much else about me. Yet he often says short, simple phrases that give me a sense of familiarity. Maybe it's an illusion; who knows.

In general, all of today's well-known large language models are trained on massive data, and the training process tries to take in as much of the information humanity has ever produced as possible. That information continuously optimizes the model's hundreds of billions of parameters, say, increasing parameter 2,043,475 by 4 and decreasing parameter 9,047,113,456 by 17, until a smarter neural network model emerges.

These models keep getting smarter, but they are more like humanity than like any individual human. When I retrained a model with my own data, I got something entirely different: a model closer to an individual. Neither the amount of data I have generated, nor the parameter count and architecture of the pre-trained model I used, can likely support a model truly similar to my brain, but the attempt is still very interesting.

I redeployed the web page and put a serverless layer in front for protection, so now anyone can try chatting with this digital version of me. The service is provided by my ancestral V100 server, and there is only one of it, so if many people show up, all kinds of problems may occur. The link is at the bottom.

On the positive side, the more data you sincerely produce, the closer the digital copy you may one day obtain. That may raise moral or even ethical issues, but it is very likely to happen anyway. When I have accumulated more data, or better pre-trained models and training methods appear, I may try training again at any time. This isn't for profit, nor is it a project with any business aim; to some extent, it is a way for me to find myself.

Thinking about it this way, life seems to be less lonely.

Appendix

Chat live with my digital clone: https://ai.greatdk.com

You can also try it by tapping "Read the original" at the bottom. Because only one ancestral V100 graphics card is serving inference, I've set a request limit; even so, it may go down. I restart the service every 10 minutes, so if you're really interested and find it down, try again in a while.

Projects I used and referenced:

  • WechatExporter: https://github.com/BlueMatthew/WechatExporter

  • ChatGLM-6B: https://github.com/THUDM/ChatGLM-6B

  • zero_nlp: https://github.com/yuanzhoulvpi2017/zero_nlp

  • chatglm_finetuning: https://github.com/ssbuild/chatglm_finetuning

  • MoeChat: https://github.com/Fzoss/MoeChat

  • Alpaca: https://crfm.stanford.edu/2023/03/13/alpaca.html

  • LLaMA: https://github.com/facebookresearch/llama


Origin: blog.csdn.net/coderising/article/details/130050942