Hello, I am Baichuan Large Model | The secrets of Baichuan2, China's open-source large model that is free for commercial use

"  Baichuan Intelligent has released a new generation of language model Baichuan2. Compared with the previous first generation, the performance of the new version has been greatly improved in various subject areas, especially in mathematics, science, and security. The capabilities of Baichuan2 have been significantly enhanced. The open source method of releasing it to the outside world provides new choices and possibilities for the field of large models.


01

Yesterday afternoon, Baichuan Intelligent announced exciting news: it has officially open-sourced Baichuan2-7B, Baichuan2-13B, the fine-tuned Baichuan2-13B-Chat, and their 4-bit quantized versions. All of them are completely free for commercial use; the only step required is a simple registration.

Domestic open-source, commercially usable large models now have one more option. Open-source address:

https://github.com/baichuan-inc/Baichuan2
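
For those who want to try it right away, the repository README demonstrates basic usage through Hugging Face transformers. A minimal sketch along those lines (the model ID and the custom chat helper follow the repo's examples; exact arguments may change over time, so treat them as assumptions):

```python
# Minimal sketch of loading Baichuan2-13B-Chat, following the repository README.
# The `chat` helper is custom code shipped with the model (hence trust_remote_code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

model_id = "baichuan-inc/Baichuan2-13B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the 4-bit quantized variants load the same way
    device_map="auto",
    trust_remote_code=True,
)
model.generation_config = GenerationConfig.from_pretrained(model_id)

messages = [{"role": "user", "content": "Which is the highest mountain in the world?"}]
print(model.chat(tokenizer, messages))
```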

Baichuan2 is a comprehensive upgrade of the open-source "Baichuan" series. According to the official introduction, compared with the first generation, Baichuan2's abilities in both the humanities and the sciences have improved significantly.

Relative to the first generation, Baichuan2-13B-Base improves mathematical ability by 49%, coding by 46%, safety by 37%, logical reasoning by 25%, and semantic understanding by 15%.

The current results of Baichuan2 (Baichuan2-13B) on authoritative large-model benchmark test sets are as follows:

[Image: Baichuan2-13B benchmark results]

Descriptions of the basic evaluation datasets (a small scoring sketch follows this list):

  • C-Eval is a comprehensive Chinese foundation-model evaluation dataset covering 52 subjects and four difficulty levels.

  • MMLU is an English evaluation dataset containing 57 tasks, covering elementary mathematics, American history, computer science, law, and more, with difficulty ranging from high-school level to expert level. It is currently the mainstream LLM evaluation dataset.

  • CMMLU is a comprehensive Chinese evaluation benchmark containing 67 topics, designed specifically to evaluate the knowledge and reasoning capabilities of language models in a Chinese context.

  • Gaokao is a dataset that evaluates large language models with questions from China's college entrance examination, testing both language ability and logical reasoning. Baichuan retained only the single-choice questions and split them randomly.

  • AGIEval is designed to assess a model's general abilities on cognition- and problem-solving-related tasks. Baichuan retained only the four-option multiple-choice questions and split them randomly.

  • BBH is a subset of the challenging Big-Bench benchmark. Big-Bench currently includes 204 tasks on topics including linguistics, child development, mathematics, common-sense reasoning, biology, physics, social bias, software development, and more. BBH consists of those Big-Bench tasks on which large models perform poorly.
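
Most of these benchmarks (C-Eval, MMLU, CMMLU, Gaokao, AGIEval) are multiple-choice, so scoring reduces to comparing the option letter the model picks against the gold answer. A rough, framework-agnostic sketch of that loop (the prompt template, field names, and the `ask_model` callable are illustrative assumptions, not the actual OpenCompass implementation):

```python
# Illustrative sketch of scoring a multiple-choice benchmark.
# `ask_model` is a hypothetical callable returning the model's raw answer text;
# the item schema (question/options/gold) is an assumption for illustration.
import re
from typing import Callable, Optional

def build_prompt(item: dict) -> str:
    options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return f"Question: {item['question']}\n{options}\nAnswer:"

def extract_choice(answer: str) -> Optional[str]:
    match = re.search(r"[ABCD]", answer.upper())
    return match.group(0) if match else None

def accuracy(items: list, ask_model: Callable[[str], str]) -> float:
    hits = sum(extract_choice(ask_model(build_prompt(it))) == it["gold"] for it in items)
    return hits / len(items)
```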

The benchmark performance looks decent, but how the model behaves in real applications, outside these test sets, is another question.

Let's try it with the question used for large-model testing in the earlier article "ChatALL: The amazing AI robot that discovers the best answers!".

"A hunter walked one mile south, one mile east, and one mile north when he arrived back where he started. He saw a bear and shot it. What color was the bear? ?”‍

[Image: Baichuan2's answer to the bear riddle]

Even after the error was pointed out, the model still insisted the path was a square.

[Image: the model's follow-up answer after the correction]

"There are 10 birds in the tree. If you shoot and kill one, how many will be left in the tree?"

[Image: Baichuan2's answer to the birds question]

The reasoning in this answer is completely correct, and it even takes into account that the gunshot would scare the other birds away.

02

Alignment

The official technical report (https://baichuan-paper.oss-cn-beijing.aliyuncs.com/Baichuan2-technical-report.pdf) mentions that, in addition to optimizations and improvements in training, Baichuan2 introduces an alignment process with two main components: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

In the supervised fine-tuning phase, human annotators label prompts collected from various data sources. Each prompt is marked as helpful or harmless, and any batch that fails to meet quality standards is rejected. In this way, more than 100k supervised fine-tuning samples were collected, and Baichuan's base model was trained on them.

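The report does not include training code, but the SFT step it describes is conceptually the standard one: fine-tune the base model on prompt/response pairs, computing the loss only on the response tokens. A minimal sketch of that masking idea (tokenizer usage is generic; the field names and max length are assumptions):

```python
# Conceptual sketch of preparing one supervised fine-tuning (SFT) example.
# The key detail: prompt tokens are masked with -100 so that only response
# tokens contribute to the cross-entropy loss during training.
import torch

def build_sft_example(tokenizer, prompt: str, response: str, max_len: int = 2048):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]  # -100 = ignored by the loss
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}
```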

As discussed in the earlier article "Is Artificial Intelligence Safe? OpenAI is 'aligning' large models with humans", alignment is an important part of training today's large models: it makes them more capable while ensuring adequate safety.

In the alignment stage, Baichuan built a red-team process covering 6 attack types and 100+ fine-grained safety value categories. A 10-person expert annotation team with traditional Internet security experience initialized the safety-alignment prompts; relevant snippets were retrieved from the pre-training dataset to create responses, yielding roughly 1K annotated examples for initialization.

The expert annotation team then guided a 50-person outsourced annotation team in red-blue adversarial rounds against the initialized alignment model, generating 200K attack prompts.

A specialized multi-value supervised sampling approach was used to generate responses at different safety levels, maximizing the use of the attack data.
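
The RLHF stage outlined in the report follows the usual recipe: train a reward model on preference pairs, then optimize the chat model against it (typically with PPO). For orientation, here is the standard pairwise (Bradley-Terry) reward-model loss in isolation; `reward_model` is a hypothetical module that returns one scalar score per sequence:

```python
# Sketch of the standard pairwise loss for reward-model training:
# the preferred ("chosen") response should score higher than the rejected one.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids: torch.Tensor, rejected_ids: torch.Tensor):
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```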

Interested readers can consult Baichuan's official technical report, which contains many more implementation details.

03

Baichuan also describes how it benchmarked several currently hot application domains: law, medicine, mathematics, coding, and multilingual translation.

Legal domain: the JEC-QA dataset is used; it is drawn from China's National Judicial Examination.

Medical domain: the medicine-related subjects of the general-domain datasets (C-Eval, MMLU, CMMLU) are used, together with MedQA and MedMCQA.

The MedQA dataset comes from medical licensing examinations in the United States and China. The USMLE and MCMLE subsets were tested, in the five-candidate-answer format.

The MedMCQA dataset is derived from the entrance examinations of Indian medical colleges; only the multiple-choice questions were tested.

Mathematics domain: using the OpenCompass evaluation framework, 4-shot tests were conducted on the GSM8K and MATH datasets.

GSM8K, released by OpenAI, consists of 8.5K high-quality, linguistically diverse grade-school math word problems that typically require multi-step arithmetic reasoning.

The MATH dataset contains 12,500 problems (7,500 in the training set and 5,000 in the test set), collected from mathematics competitions such as AMC 10, AMC 12, and AIME.
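
A k-shot test simply prepends k worked examples to each question before asking the model to answer. A sketch of assembling such a 4-shot GSM8K-style prompt (the exemplar and template here are placeholders; OpenCompass's actual templates may differ):

```python
# Illustrative assembly of a 4-shot prompt for GSM8K-style word problems.
# FEW_SHOT would hold four (question, worked solution) exemplars in practice.
FEW_SHOT = [
    ("There are 3 apples and you buy 2 more. How many apples are there?",
     "3 + 2 = 5. The answer is 5."),
    # ... three more exemplars omitted for brevity
]

def build_fewshot_prompt(question: str) -> str:
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in FEW_SHOT)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"
```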

Code domain: the HumanEval and MBPP datasets are used. With OpenCompass, HumanEval was tested 0-shot and MBPP 3-shot.

HumanEval's programming tasks span language understanding, reasoning, algorithms, and simple mathematics; they evaluate functional correctness and measure a model's problem-solving ability.

MBPP is a dataset of 974 short Python programming problems, each with a textual description and test cases for checking functional correctness.
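
Code benchmarks such as HumanEval and MBPP are usually reported as pass@k: the probability that at least one of k sampled solutions passes all test cases. The unbiased estimator from the HumanEval paper (Chen et al., 2021), given n generated samples of which c pass, is:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: total samples generated, c: samples passing all tests, k: sample budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```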

Multilingual translation domain: the Flores-101 dataset was used to evaluate the model's multilingual capability. Flores-101 covers 101 languages from around the world, with data drawn from news, travel guides, books, and other sources.

The official languages of the United Nations (Arabic, Chinese, English, French, Russian, and Spanish) plus German and Japanese were selected as test languages. OpenCompass was used to run 8-shot tests on seven Flores-101 subtasks: Chinese-English, Chinese-French, Chinese-Spanish, Chinese-Arabic, Chinese-Russian, Chinese-Japanese, and Chinese-German.

As can be seen, the Baichuan model's target application scenarios in China are quite focused.

By now, essentially every later-arriving large-model vendor has chosen open source with commercial licensing: earlier there was ChatGLM from the Tsinghua team, abroad there is Meta's Llama, and now Baichuan.

Benefits of doing this:

  • Expanding influence in the large-model race. Models developed later lack the advantage of mindshare; open-sourcing for commercial use gets the product widely known.

  • Building brand trust. Large models are a new business, and users know little about their capabilities, safety, and usability. Open source lets users see the work behind the model's creation, training, fine-tuning, alignment, and evaluation, giving them a comprehensive understanding of it.

  • Collecting customer feedback quickly and accelerating model iteration. After open-sourcing, more users can adopt the model, generating broader and richer feedback that can be folded into subsequent versions, making them more targeted and more general-purpose.

References

https://www.baichuan-ai.com/

https://github.com/baichuan-inc/Baichuan2

https://baichuan-paper.oss-cn-beijing.aliyuncs.com/Baichuan2-technical-report.pdf

Reading recommendations

What is the "intelligence emergence" of AI, and why understanding it is of great value to entrepreneurs, practitioners, and ordinary people

Prompt attacks strike large models again: a hypnotized ChatGPT may leak important information, a hidden risk of large models

The world's largest open-source translation model! Produced by Meta, supporting speech and text in 100 languages!

8.23 Notes on China’s Big Model “Top Group Chat”

Is artificial intelligence safe? OpenAI is "aligning" large models with humans - ensuring ChatGPT is smarter than humans while still following human intentions

How to run fine-tuning experiments on large models: a record of a fine-tuning experiment based on ChatGLM-6B

ReAct: combining reasoning and action in language models, enabling them to solve a variety of linguistic reasoning and decision-making tasks

Play with a PDF chatbot in 5 minutes! A super-simple LangChain + ChatGPT implementation guide

Why does telling large language models such as ChatGPT and ChatGLM "You are an expert in such-and-such a field" make their answers much better? (Part 2)

Embrace the future and learn AI skills! Follow me and receive free AI learning resources.
