Notes on domain-migration work with large language models

I have recently been doing some research on large language models, and materials and open-source models keep appearing in a steady stream, so I am recording the current state of things here and will keep this note updated.

LLaMA

  • Size: 7B-65B
  • Contributor: Meta

The LLaMA model is open source, but you have to submit an access application. HF-format conversions of the weights have been released by contributors on the Hugging Face Hub (treated here purely as academic discussion; commercial and licensing concerns are set aside, the same below).
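
A converted checkpoint can then be loaded like any other causal LM in transformers. This is only a minimal sketch; the repo id below is one community conversion and is an assumption, not an endorsement, so substitute whatever converted weights you actually have access to.

```python
# Minimal sketch: load an HF-converted LLaMA-7B checkpoint and generate a few tokens.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "decapoda-research/llama-7b-hf"  # assumed example id, replace with your own conversion
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision so 7B fits on a single GPU
    device_map="auto",          # requires the accelerate package
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```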

On top of this, Stanford used a 52K corpus produced with self-instruct to fine-tune LLaMA and open-sourced Alpaca, with LoRA-based reproductions following quickly. Building on this recipe, a series of Alpaca-style models appeared, such as Luotuo, Camel Bell, etc. This is also the open-source model that, in my view, has a comparatively good ecosystem.
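
For reference, here is a sketch of how each of the 52K self-instruct records (instruction/input/output fields) is typically rendered into a training prompt. The template follows the widely used Alpaca convention and build_prompt is an illustrative helper, so check the exact template of whichever repo you follow.

```python
# Render an {"instruction", "input", "output"} record into an Alpaca-style training string.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(example: dict) -> str:
    """Return the full training string: rendered prompt plus the reference answer."""
    if example.get("input"):
        prompt = PROMPT_WITH_INPUT.format(**example)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    return prompt + example["output"]

print(build_prompt({
    "instruction": "Name three common machine learning tasks.",
    "input": "",
    "output": "Classification, regression, and clustering.",
}))
```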

BLOOM

  • Size: 7B-176B
  • Contributor: BigScience

Lianjia Technology has open-sourced BELLE, an instruction-fine-tuned model based on BLOOM, together with several Chinese instruction datasets.
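
As a small sketch of what those Chinese instruction datasets look like, the records are (to the best of my knowledge) Alpaca-style instruction/input/output entries; the file name below is a placeholder for whichever released file you download.

```python
# Write one example record to disk and load it back with the datasets library.
import json
from datasets import load_dataset

record = {
    "instruction": "请将下面的句子翻译成英文。",
    "input": "今天天气很好。",
    "output": "The weather is very nice today.",
}

with open("belle_sample.json", "w", encoding="utf-8") as f:  # placeholder file name
    json.dump([record], f, ensure_ascii=False, indent=2)

dataset = load_dataset("json", data_files="belle_sample.json", split="train")
print(dataset[0]["instruction"], "->", dataset[0]["output"])
```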

GLM

  • Size: 6B-130B
  • Contributor: Tsinghua University

Tsinghua has open-sourced ChatGLM-6B, a bilingual (Chinese-English) dialogue language model, together with the GLM base models.
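
A minimal usage sketch, following the THUDM/chatglm-6b model card: the checkpoint ships its own tokenizer and a chat() method via trust_remote_code.

```python
# Load ChatGLM-6B and hold a short multi-turn conversation.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "请用一句话介绍你自己。", history=history)
print(response)
```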

For Chinese dialogue and downstream applications, it is the best chat model I have tested so far (as of March 31). On top of ChatGLM-6B, many people have contributed fine-tuning and application code, for example:

  • ChatGLM-Tuning: fine-tunes ChatGLM-6B with LoRA (a minimal configuration sketch follows this list). Similar projects include Humanable ChatGLM/GPT Fine-tuning | ChatGLM Fine-tuning.
  • langchain-ChatGLM: a ChatGLM application over local knowledge bases, built on LangChain.
  • ChatGLM-Finetuning: fine-tunes ChatGLM-6B on specific downstream tasks using Freeze, LoRA, P-tuning, etc., and compares the experimental results.
  • InstructGLM: instruction tuning on top of ChatGLM-6B; it collects open-source Chinese and English instruction data, fine-tunes on it with LoRA, releases the LoRA weights obtained from Alpaca and BELLE fine-tuning, and fixes the web_demo repetition problem.
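
In the spirit of ChatGLM-Tuning, here is a rough sketch of attaching LoRA adapters to ChatGLM-6B with the peft library. The rank/alpha values and the fused attention module name "query_key_value" are assumptions; verify them against the project you actually follow.

```python
# Wrap ChatGLM-6B with LoRA adapters so that only the adapter matrices are trained.
from transformers import AutoModel
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,                # low-rank dimension of the adapter matrices
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # assumed fused QKV projection in the GLM blocks
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```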

At present I have followed the first project above to fine-tune ChatGLM-6B with LoRA for information extraction tasks:

  • With a single instruction (one prompt per dataset) and roughly 3,000+ samples, the NER validation scores come out around 60-80. Testing the fine-tuned model on general tasks, however, shows clear catastrophic forgetting: its general abilities are worse than the original model's.
  • With about 100 different information-extraction instructions (NER only) produced with ChatGPT and by hand, the extraction quality improves slightly but not significantly (judged subjectively on a dozen or so cases), while the general ability is preserved much better than in the previous experiment, almost at the level of the original model. A sketch of the two data-construction strategies follows this list.
  • Fine-tuning ChatGLM-6B on the GPT-4-generated 52K dataset shows little change before and after fine-tuning. My guess is that ChatGLM may itself have been fine-tuned on a Chinese corpus translated from the Alpaca data.
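
The sketch below illustrates the two data-construction strategies compared above: a single fixed NER instruction versus a pool of paraphrased instructions (in the second experiment the pool came from ChatGPT plus manual editing). The instruction texts and the build_record helper are made up purely for illustration.

```python
# Build instruction/input/output records for NER, either with one fixed prompt or
# by sampling from a pool of paraphrased prompts.
import random

SINGLE_INSTRUCTION = "Extract all person, location and organization entities from the text."

INSTRUCTION_POOL = [
    "Extract all person, location and organization entities from the text.",
    "Find every named entity (PER/LOC/ORG) mentioned in the following passage.",
    "List the people, places and organizations that appear in the sentence below.",
    # ... in practice, around 100 paraphrases
]

def build_record(text: str, entities: dict, diversify: bool = False) -> dict:
    """Turn one labelled sentence into an instruction/input/output record."""
    instruction = random.choice(INSTRUCTION_POOL) if diversify else SINGLE_INSTRUCTION
    output = "; ".join(f"{etype}: {', '.join(names)}" for etype, names in entities.items())
    return {"instruction": instruction, "input": text, "output": output}

print(build_record(
    "Tim Cook visited Apple's new office in Singapore.",
    {"PER": ["Tim Cook"], "ORG": ["Apple"], "LOC": ["Singapore"]},
    diversify=True,
))
```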

Chinese tokenization

Below is a rough count of the vocabularies of the three pre-trained models, broken down into the categories in the table; the vocabulary composition alone already hints at why ChatGLM supports Chinese so well. A small counting sketch follows the table.

| Model       | English | Chinese | Punctuation | Other  | Total  |
|-------------|---------|---------|-------------|--------|--------|
| llama-7b-hf | 24120   | 700     | 1167        | 5990   | 31977  |
| Belle-7B-2M | 89974   | 28585   | 1827        | 130223 | 250609 |
| chatglm-6b  | 63775   | 61345   | 1469        | 3660   | 130249 |
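
A rough sketch of reproducing such a breakdown by classifying every entry in a tokenizer's vocabulary (assuming the tokenizer exposes get_vocab()). The category rules here are my own approximation, so the exact counts will not match the table.

```python
# Count vocabulary entries per character category for a given tokenizer.
from transformers import AutoTokenizer

def classify(token: str) -> str:
    token = token.replace("▁", "")  # drop the sentencepiece word-boundary marker
    if any("\u4e00" <= ch <= "\u9fff" for ch in token):
        return "Chinese"
    if token and all(ch.isascii() and ch.isalpha() for ch in token):
        return "English"
    if token and all(not ch.isalnum() for ch in token):
        return "punctuation"
    return "other"

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
counts = {"English": 0, "Chinese": 0, "punctuation": 0, "other": 0}
for token in tokenizer.get_vocab():
    counts[classify(token)] += 1
print(counts, "total:", sum(counts.values()))
```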

Discussion welcome

  • When will a few genuinely usable Chinese models be open-sourced?
  • Sharing experience on domain transfer and information extraction
  • Anything else

Source: blog.csdn.net/qq_23590921/article/details/130137336