A week of open-source AI SOTA: LMSYS Org open-sources LongChat, the legal large language model ChatLaw, and the Chinese medical dialogue model BianQue

1. LMSYS Org releases LongChat, whose long-context handling crushes 64K open-source models

Recently, LMSYS Org, a team led by UC Berkeley, held a large language model ranking competition that refreshed everyone's understanding of the best-known open-source and "closed-source" chatbots.

Related reading: UC Berkeley's LLM leaderboard is updated again! GPT-4 ranks first, and Vicuna-33B tops the open-source models

On June 29, researchers from LMSYS Org released two open-source large models, LongChat-7B and LongChat-13B, that support a 16K-token context length, and tested the actual performance of several large models that claim long-context capabilities.

Open-source large models supporting long contexts already include MPT-7B-storywriter (65K) and ChatGLM2-6B (32K), alongside closed-source models such as Claude-100K and GPT-4-32K, but the LMSYS Org researchers still chose to verify through testing whether each of them is the genuine "Li Kui" or an impostor "Li Gui" (that is, the real thing or a look-alike).

How can one quickly and effectively confirm whether a newly trained model can actually handle the expected context length?

To address this question, evaluations can be based on tasks that require LLMs to process long contexts, such as text generation, retrieval, summarization, and information association over long text sequences.

The researchers therefore designed LongEval, a long-context test suite containing two tasks of different difficulty, which provides a simple and fast way to measure and compare long-context performance.

Task 1: Coarse-grained topic retrieval

The research team used a topic retrieval task to simulate a long conversation in which the discussion jumps between multiple topics.

This task asks the chatbot to retrieve the first topic of a long conversation made up of multiple topics, testing whether the model can locate a piece of text in the long context and associate it with the correct topic name.


Task 2: Fine-grained line retrieval

To further test the model's ability to locate and associate text within long conversations, the researchers introduced a more fine-grained line retrieval test, in which the chatbot must precisely retrieve a number from a specific line of a long document rather than a topic from a long conversation.

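To make the two tasks concrete, here is a minimal Python sketch of how such test prompts could be generated. The topic list, filler text, and "REGISTER_CONTENT" line format are illustrative assumptions inspired by the LongEval description, not the suite's exact prompts.

```python
import random

def make_topic_retrieval_prompt(topics, sentences_per_topic=20):
    """Coarse-grained task: bury the first topic inside a long multi-topic chat,
    then ask the model to recall it (illustrative, not the exact LongEval prompt)."""
    lines = []
    for topic in topics:
        lines.append(f"USER: Let's talk about {topic}.")
        lines.extend(
            f"ASSISTANT: Here is some discussion about {topic}. (filler {i})"
            for i in range(sentences_per_topic)
        )
    conversation = "\n".join(lines)
    question = "What was the first topic we discussed? Answer with the topic name only."
    return f"{conversation}\n\n{question}", topics[0]

def make_line_retrieval_prompt(num_lines=300):
    """Fine-grained task: ask for the number stored on one specific line."""
    records = {i: random.randint(10000, 99999) for i in range(1, num_lines + 1)}
    document = "\n".join(f"line {i}: REGISTER_CONTENT is <{v}>" for i, v in records.items())
    target = random.randint(1, num_lines)
    question = f"What is the REGISTER_CONTENT on line {target}? Answer with the number only."
    return f"{document}\n\n{question}", records[target]

if __name__ == "__main__":
    prompt, expected = make_topic_retrieval_prompt(
        ["rock climbing", "sourdough baking", "quantum error correction"]
    )
    print(prompt[:120], "...\nexpected first topic:", expected)

    prompt, expected = make_line_retrieval_prompt()
    print("expected register content:", expected)
```

Accuracy is then simply the fraction of prompts for which the model's answer matches the expected topic or number, measured at increasing context lengths.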

Researchers at LMSYS Org evaluated four open-source and two closed-source large models.

Caption: Table 1. Model specifications

From the coarse-grained topic retrieval results (shown in the figure below), one can see that:

  • The open-source long-context models do not appear to perform as well as advertised. For example, MPT-7B-storywriter claims a context length of 84K, but barely reaches 50% accuracy even at one-fifth of that claimed length (16K).
  • ChatGLM2-6B cannot reliably retrieve the first topic even at a length of 6K (only 46% accuracy), and its accuracy drops to nearly 0% once the context length exceeds 10K.
  • LongChat-13B-16K reliably retrieves the first topic, with accuracy comparable to gpt-3.5-turbo.
  • The closed-source commercial long-context models are very strong: on the long-range topic retrieval task, gpt-3.5-turbo-16K and Anthropic's Claude achieve nearly perfect accuracy.

Caption (Task 1, coarse-grained topic retrieval): comparing LongChat with other models on the long-range topic retrieval task

In the more fine-grained line retrieval test, one can see that:

  • MPT-7B-storywriter performs even worse than in the coarse-grained case, with accuracy dropping from about 50% to about 30%.
  • ChatGLM2-6B also degrades, performing poorly (32% accuracy) even at the shortest tested length (a 5K context).
  • In contrast, LongChat-13B-16K performs reliably, coming close to gpt-3.5/Anthropic Claude within a 12K context length.

Caption (Task 2, fine-grained line retrieval): accuracy on the long-range line retrieval task

LongChat was obtained by fine-tuning LLaMA-7B and LLaMA-13B on user-shared conversations collected from ShareGPT, using a condensed (compressed) rotary position embedding technique. The evaluation results show that LongChat-13B achieves up to 2x higher long-range retrieval accuracy than other long-context models, including MPT-7B-storywriter (65K), MPT-30B-chat (8K), and ChatGLM2-6B (32K).
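
The condensing idea can be illustrated with a short sketch: positions in the extended 16K window are divided by a condensing ratio (16384 / 2048 = 8 for a LLaMA model pretrained with a 2K window), so the rotary embedding only ever sees position indices in the range it was trained on. The function names and exact formulation below are an illustrative sketch, not LongChat's actual implementation.

```python
import torch

def rotary_angles(positions: torch.Tensor, dim: int, base: float = 10000.0,
                  condense_ratio: float = 8.0) -> torch.Tensor:
    """Rotary embedding angles with position condensation (illustrative sketch).

    With condense_ratio = 8, position 10000 in the long context is treated as
    position 1250, which a model pretrained on 2K contexts has already seen.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    condensed = positions.float() / condense_ratio   # key step: squeeze positions
    return torch.outer(condensed, inv_freq)          # (seq_len, dim/2) angle matrix

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors by the (condensed) angles."""
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: 16K positions mapped back into the original 2K rotary range
angles = rotary_angles(torch.arange(16384), dim=128, condense_ratio=8.0)
q = torch.randn(16384, 128)
q_rot = apply_rope(q, angles)
```

Fine-tuning on long ShareGPT conversations with these condensed positions then teaches the model to actually use the extended window.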

The LongChat models perform well on long-range retrieval tasks, but does this come at the cost of a significant drop in human-preference quality?

The researchers tested whether LongChat still aligns with human preferences using MT-Bench, which is scored by GPT-4 (a sketch of this judge-style scoring follows the list below). It turns out that:

  • LongChat-13B-16K does see a slight drop in MT-Bench score compared with its closest counterpart, Vicuna-13B, but the drop stays within an acceptable range, suggesting that the long-range capability does not significantly sacrifice short-range ability.
  • LongChat-13B-16K is also competitive with other models of the same size (Baize-v2-13B, Nous-Hermes-13B, Alpaca-13B).
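
For context, MT-Bench-style evaluation asks a strong model to grade answers. Below is a minimal sketch of such LLM-as-judge scoring, assuming the OpenAI Python client and an OPENAI_API_KEY in the environment; the prompt wording and the 1-10 scale are illustrative, not the official MT-Bench judge template.

```python
# Sketch of GPT-4-as-judge scoring in the spirit of MT-Bench (illustrative prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str) -> str:
    """Ask GPT-4 to rate one model answer on a 1-10 scale."""
    prompt = (
        "Please act as an impartial judge and rate the quality of the assistant's "
        f"answer on a scale of 1 to 10.\n\n[Question]\n{question}\n\n[Answer]\n{answer}\n\n"
        "Reply with the rating only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(judge("Explain overfitting to a beginner.", "Overfitting is when a model memorizes ..."))
```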

Caption: Table 2. MT-Bench scores of LongChat-13B compared with other models of similar scale

2. Peking University team releases ChatLaw, a legal large language model

The Peking University team released ChatLaw, the first deployed Chinese legal large-model product, aiming to provide inclusive legal services to the public. The product supports file and voice output, as well as legal document drafting, legal consultation, and legal-aid recommendations.

ChatLaw is a legal large language model that can integrate external knowledge bases. It is trained on top of Ziya-13B ("Jiang Ziya") and Anima-33B and has strong logical reasoning ability.

Currently, three models are open-sourced: ChatLaw-13B, ChatLaw-33B, and ChatLaw-Text2Vec.

  • ChatLaw-13B is an academic demo version. It performs well on general Chinese tasks, but it struggles with logically complex legal Q&A, which calls for a model with more parameters.
  • ChatLaw-33B is also an academic demo version. Its logical reasoning ability is greatly improved, but because the corpus is small, English text sometimes appears in its output.
  • ChatLaw-Text2Vec is a similarity matching model based on BERT, trained on a dataset of 930,000 court judgment cases; it can match a user's question to the corresponding legal provisions (see the retrieval sketch after this list).
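
A minimal sketch of how such question-to-statute matching could look, using a generic Chinese BERT encoder and cosine similarity over mean-pooled embeddings. The encoder name, pooling choice, and example statutes are assumptions for illustration; ChatLaw-Text2Vec ships its own fine-tuned weights and interface.

```python
# Sketch of BERT-based question-to-statute matching (illustrative, not ChatLaw's code).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-chinese"  # stand-in encoder; ChatLaw-Text2Vec uses its own fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(texts):
    """Mean-pool the last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

statutes = [
    "借款合同是借款人向贷款人借款，到期返还借款并支付利息的合同。",
    "劳动者享有取得劳动报酬的权利。",
]
question = ["朋友借钱不还怎么办？"]

sims = torch.nn.functional.cosine_similarity(embed(question), embed(statutes))
best = int(sims.argmax())
print(f"most relevant statute: {statutes[best]} (score={sims[best]:.3f})")
```

In the real product, the retrieved provisions would then be fed into the ChatLaw language model as external knowledge.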

Paper address: https://arxiv.org/abs/2306.16092
Open source address: https://github.com/PKU-YuanGroup/ChatLaw
Official address: https://www.chatlaw.cloud/

Caption: ChatLaw legal large language model (demo video)

3. BianQue: a large medical dialogue model fine-tuned jointly on instructions and multi-turn inquiry dialogues

BianQue is a Chinese medical dialogue model with two released versions so far, BianQue-1.0 and BianQue-2.0. Compared with common open-source medical Q&A models, BianQue pays more attention to the fact that users often describe their condition insufficiently, so it emphasizes multi-turn interaction, defines a questioning chain, and strengthens its ability to give advice and query knowledge.

  • BianQue-1.0 is a large medical dialogue model fine-tuned on instructions and multi-turn inquiry dialogues. It was trained on a mixed dataset of more than 9 million samples of Chinese medical Q&A instructions and multi-turn inquiry dialogues.
  • BianQue-2.0 is built on the BianQue health corpus BianQueCorpus. It uses ChatGLM-6B as the initialization model and is obtained through full-parameter instruction fine-tuning; the training data is expanded with drug package-insert instructions, medical encyclopedia knowledge instructions, and ChatGPT-distilled instructions, which strengthens the model's advice and knowledge-query capabilities (a minimal usage sketch follows this list).
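
Since BianQue-2.0 is initialized from ChatGLM-6B, it can presumably be loaded through the ChatGLM-style chat interface in Hugging Face Transformers. The checkpoint id below and the assumption that the fine-tuned model keeps that interface are illustrative; check the BianQue repository for the official loading code.

```python
# Sketch of multi-turn chat with a ChatGLM-style checkpoint (assumed interface;
# see https://github.com/scutcyr/BianQue for the official usage).
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "scutcyr/BianQue-2"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).half().cuda().eval()

history = []  # list of (user, assistant) turns, letting the model ask follow-up questions
for user_turn in ["我最近总是失眠", "大概持续两周了，白天也没精神"]:
    response, history = model.chat(tokenizer, user_turn, history=history)
    print("User:", user_turn)
    print("BianQue:", response)
```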


Open source address: https://github.com/scutcyr/BianQue
HuggingFace address: https://huggingface.co/spaces/scutcyr/BianQue

This project was initiated by the College of Future Technology at South China University of Technology together with the Guangdong Provincial Key Laboratory of Digital Twins. It has open-sourced ProactiveHealthGPT, a base of large models for proactive health in Chinese living spaces, including: (1) BianQue, a large health dialogue model fine-tuned on tens of millions of Chinese health-dialogue instruction samples; and (2) SoulChat, a large mental-health model fine-tuned on Chinese long-text instructions combined with multi-turn empathetic dialogue data from the psychological counseling domain.

Caption: ProactiveHealthGPT, a base of large models for proactive health in Chinese living spaces

The open-source links of the models are as follows:

BianQue: https://github.com/scutcyr/BianQue
SoulChat: https://github.com/scutcyr/SoulChat


You are welcome to follow my personal WeChat official account: HsuDan, where I share more of my learning experience, pitfall-avoidance notes, interview experience, and the latest AI news.

Reference:
https://lmsys.org/blog/2023-06-29-longchat/
https://www.zhihu.com/question/610072848/answer/3101663890
https://www.chatlaw.cloud/
https://www.163.com/dy/article/I70BJ9U00552UJUX.html
https://github.com/scutcyr/BianQue
https://www.ppmy.cn/news/52419.html
