"2023 Big Language Model Comprehensive Ability Evaluation Report" is released: Domestic products represented by Wen Xinyiyan are about to break through.

Recently, a series of favorable domestic policies related to artificial intelligence have been released, and relevant meetings held by the central government have emphasized that "in the future, we must attach great importance to the development of general artificial intelligence and create an innovation ecosystem." "Beijing's Several Measures to Promote the Innovative Development of General Artificial Intelligence (2023-2025) (Draft for Comments)" proposes 21 specific measures across five major directions, including "carrying out research on innovative large model algorithms and key technologies" and "strengthening the research and development of tools for collecting and governing large model training data", and also aims to expand application scenarios in government services, medical care, scientific research, finance, autonomous driving, urban governance, and other fields, so as to seize the development opportunities of large models and promote innovation leadership in general artificial intelligence. China's large model technology industry has thus ushered in an unprecedented wave of development opportunities, and many domestic companies, such as Baidu, Alibaba, and Huawei, have quickly deployed related businesses and launched their own large-scale artificial intelligence products.

In addition, the large model field worldwide currently enjoys a high density of talent and strong capital support. In terms of talent, the publicly announced backgrounds of several large model R&D teams show that their members come from top international universities or have top-tier research experience. In terms of capital, take Amazon and Google as examples: their capital expenditures in this area reached US$58.3 billion and US$31.5 billion respectively, and are still trending upward. According to the latest data disclosed by Google, the ideal training cost of its large model with 175 billion training parameters exceeds US$9 million.

When a field has a high density of capital and talent, it tends to develop faster. Many people feel that the emergence of ChatGPT, a phenomenal product, kicked off the vigorous development of large language model technology. In fact, since the birth of large language models in 2017, technology giants such as OpenAI, Microsoft, Google, Facebook, Baidu, and Huawei have continued to explore the field; ChatGPT only pushed large language model technology into its explosive stage. At present, the large model product landscape has taken on a new shape: foreign vendors have deep accumulation in foundation models, while domestic vendors have prioritized the application side.

 

To this end, the InfoQ Research Center reviewed a large amount of literature and information using three research methods (desk research, expert interviews, and scientific analysis) and interviewed more than ten technical experts in the field. Focusing on four major dimensions (language model accuracy, data foundation, model and algorithm capabilities, and security and privacy), further divided into twelve sub-dimensions (semantic understanding, grammatical structure, knowledge Q&A, logical reasoning, coding ability, context understanding, context awareness, multilingual ability, multimodal ability, data foundation, model and algorithm capabilities, and security and privacy), it evaluated ChatGPT (gpt-3.5-turbo), Claude-instant, Sage (gpt-3.5-turbo), Tiangong 3.5, Wenxin Yiyan V2.0.1, Tongyi Qianwen V1.0.1, iFlytek Spark Cognitive Large Model, Moss-16B, ChatGLM-6B, and Vicuna-13B on more than 3,000 questions, and released the "Large Language Model Comprehensive Ability Evaluation Report 2023" (hereinafter, the "Report") based on the evaluation results.

To ensure the objectivity and fairness of the report and the accuracy of the calculated results, the InfoQ Research Center created a sample-based scoring method. Through actual testing, the answers of each model to 300 questions were collected and scored: completely correct answers receive 2 points, partially correct answers receive 1 point, completely incorrect answers receive 0 points, and refusals to answer receive -1 point. The calculation formula is "a model's score rate in a sub-category = the model's score / the maximum available score for that category". For example, if model A scores a total of 10 points in a category of 7 questions, and the maximum available score for that category is 7 * 2 = 14, then model A's score rate in that category is 10 / 14 = 71.43%.
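As a rough illustration of the scoring scheme described above, the Python sketch below computes a score rate for one question category. The answer labels, the example data, and the function name are illustrative assumptions for this sketch, not details taken from the Report.

```python
# Illustrative sketch of the Report's scoring scheme; label names are assumptions.
POINTS = {
    "correct": 2,      # completely correct answer
    "partial": 1,      # partially correct answer
    "incorrect": 0,    # completely incorrect answer
    "refused": -1,     # model declines to answer
}

def score_rate(labels):
    """Score rate = total points earned / maximum available points (2 per question)."""
    earned = sum(POINTS[label] for label in labels)
    maximum = 2 * len(labels)
    return earned / maximum

# Example matching the text: 7 questions, 14 points available, 10 points earned.
example = ["correct"] * 4 + ["partial"] * 2 + ["incorrect"]
print(f"{score_rate(example):.2%}")  # 71.43%
```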

Based on the above evaluation methods, the Report draws a number of conclusions worth your attention. We hope the interpretation of the core conclusions below can provide direction for your own practice and exploration of large language model technology.

1. The scale of tens of billions of parameters is the “ticket” for large model training, and the large model technology revolution has begun.

Developing a large model product requires three major elements at once: data resources, algorithms and models, and capital and other resources. By analyzing the characteristics of products currently on the market, the InfoQ Research Center found that data resources, funding, and other resources are the basic elements of large model R&D, while algorithms and models are currently the core elements that differentiate R&D capability. Model richness, model accuracy, and the emergence of capabilities shaped by algorithms and models have become the core indicators for judging the quality of a large language model. It should be noted that although data and funding requirements set a high threshold for developing large language models, they are less of a challenge for large, well-resourced enterprises.

 

A careful study of the core elements of large model products shows that large model training needs to be "large enough", and a parameter scale in the tens of billions is the "ticket". Data from GPT-3 and LaMDA show that when the model parameter scale is in the range of roughly 10 to 68 billion, many capabilities of large models (such as arithmetic ability) are still close to zero. At the same time, the enormous amount of computation triggers the "alchemy mechanism": according to the appendix of an NVIDIA paper, one training iteration requires about 4.5 ExaFLOPs of computation, a complete training run requires 9,500 iterations, and the total computation for a complete run is about 430 ZettaFLOPs (equivalent to a single A100 computing for about 43.3 years).
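As a rough sanity check on the "single A100 for about 43.3 years" figure above, the sketch below converts a total training compute budget into single-GPU years. It assumes an A100 dense BF16 peak of roughly 312 TFLOPS and 100% utilization; both are simplifying assumptions, and the exact conversion used in the cited figure may differ slightly.

```python
# Back-of-the-envelope conversion from total training FLOPs to single-GPU years.
# Assumptions (not from the Report): A100 dense BF16 peak ~= 312 TFLOPS, 100% utilization.
A100_PEAK_FLOPS = 312e12                 # floating-point operations per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def gpu_years(total_flops, peak_flops=A100_PEAK_FLOPS):
    """Years a single GPU running at peak throughput would need for total_flops operations."""
    return total_flops / peak_flops / SECONDS_PER_YEAR

total_training_flops = 430e21            # ~430 ZettaFLOPs, the figure cited above
print(f"{gpu_years(total_training_flops):.1f} A100-years")  # roughly 43-44 years
```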

 

Data source: "Sparks of Artificial General Intelligence: Early experiments with GPT-4"

Looking at the parameter scales of large models around the world, according to Minsheng Securities Research Institute and Wikipedia data, the inferred parameter scale of the internationally leading model GPT-4 may exceed 5 trillion, while a number of domestic large models exceed 10 billion parameters. Among them, ERNIE, developed by Baidu, and Pangu, developed by Huawei, are currently the parameter-scale leaders among domestic large models with disclosed data.

 

The InfoQ Research Center conducted comprehensive tests on the large language models and found that ChatGPT is indeed strong across the various capabilities, ranking first. Notably, Baidu's Wenxin Yiyan broke into the top three, ranking second; its overall score trails ChatGPT by only 2.15 points and is far ahead of third-place Claude.

 

Data description: the evaluation results cover only the models listed above; the evaluation cutoff date was June 25, 2023.

Throughout the research process, the InfoQ Research Center found that the algorithm and training-method level dominates the performance of large language models. From the base model, to the engineering of training methods, to specific model training techniques, the differences in the choices each vendor on this track makes at every step lead to differences in the final performance of their large language models.

 

The product capabilities of individual vendors may differ, but because enough players are involved in building large model technology, their continued exploration lets us see the prospect of a successful large model technology revolution. At a time when large model products are blooming everywhere, large language models have expanded computer capabilities from "search" to "cognition and learning" and on to "action and solutions", and the core capabilities of large language models now show a pyramid structure.

 

2. “Writing ability” and “sentence understanding ability” are the top two abilities that large language models are currently good at.

According to the InfoQ Research Center's evaluation results, security and privacy are the consensus and the bottom line for large language model development, ranking first among the ability scores. The basic capabilities of large language models also perform relatively well overall. Programming, reasoning, and context understanding, all of which involve logical reasoning, still leave considerable room for improvement, and multimodality remains the unique advantage of only a few large language models.

 

At the level of basic abilities, the large language models demonstrated excellent Chinese creative writing. Across the six writing sub-categories, their performance was relatively outstanding; interview outlines and email writing both scored close to full marks. By comparison, video script writing remains relatively unfamiliar territory for large language model products, with a score rate of only 75% in that sub-category.

 

On literary questions, the ability of the large language models declines as the difficulty of the writing increases. The best-performing sub-category was simple writing questions, with a score rate of 91%; and although many models handled couplet questions well, some performed poorly on them, giving couplets the lowest overall score rate at 55%.

 

In terms of semantic understanding, however, today's large language models are not so "smart". Across the four question categories of dialect understanding, keyword extraction, semantic similarity judgment, and "what to do" questions, the models showed a highly differentiated distribution: "what to do" questions received the highest score rate at 92.5%, while Chinese dialect understanding stumped the models, with an overall accuracy of only 40%.

The Report from the InfoQ Research Center shows that on questions involving Chinese knowledge, domestic models perform significantly better than international models. Among the ten models, Wenxin Yiyan had the highest knowledge score at 73.33%, followed by ChatGPT at 72.67%. Apart from IT knowledge Q&A, domestic large model products performed close to or better than international products in the other eight question categories in the Chinese knowledge setting.

In fact, whether it is Chinese creative writing, semantic understanding, or Chinese knowledge Q&A, these questions mainly reflect a large language model product's basic cognition and learning of text. The evaluation results clearly show that Baidu's Wenxin Yiyan performs well across the board, with every ability score ranking in the top two. What we see is not only Wenxin Yiyan's technical capability, but also the strong technological breakthroughs and significant progress of domestic large language models.

3. Domestic products still have a lot of room for improvement in cross-language translation, and logical reasoning remains a major challenge overall.

In recent years, as the state and domestic manufacturers have increased their investment in artificial intelligence year by year, domestic large language models have progressed rapidly. These technical achievements are encouraging, but when we look at the development of large language model technology more objectively, we find that in some respects there is still considerable room for improvement compared with the international state of the art.

For example, the Report released by the InfoQ Research Center shows that the programming capabilities of foreign products are significantly higher than those of domestic products. Among the ten models, Claude had the highest programming score, at 73.47%; the best-performing domestic product, Wenxin Yiyan, scored 68.37%, still some distance behind Claude. Across the four question categories, foreign products clearly surpassed domestic products on Android-related questions. Surprisingly, however, on code auto-completion questions the domestic product Wenxin Yiyan surpassed the foreign products, which suggests it is only a matter of time before domestic products catch up with the international level.

In addition, the model with the highest translation score among the ten was also Claude, at 93.33%. The highest-scoring domestic large language models were Wenxin Yiyan and Tiangong 3.5, but a gap with the international level remains. Translation questions mainly reflect a large language model product's language understanding ability. Across the three question categories evaluated this time by InfoQ, namely "programming translation questions", "English writing", and "English reading comprehension", the models showed a widely dispersed distribution: among all the models tested, English writing received the highest score rate at 80%, while English reading comprehension received only 46%. This means domestic products still need to keep working and iterating on cross-language translation.

The gap still exists, but there is no need for self-deprecation; the technological evolution of large models is ongoing. According to the Report, large language models as a whole currently face relatively large challenges in logical reasoning. To assess the understanding and judgment of large language models, the InfoQ Research Center set up logical reasoning questions along multiple dimensions. Across the five question categories of business tabulation, mathematical calculation, mathematical application, humor, and reasoning with Chinese characteristics, the overall scores of the large language models were lower than on the basic abilities. One reason is that business tabulation questions require not only collecting and recognizing content but also logically classifying and sorting it, so their overall difficulty is relatively high. Logical reasoning will be a main direction of attack for future large language model products.

 

Among the ten models evaluated by the InfoQ Research Center, Wenxin Yiyan and iFlytek Spark scored highest among domestic models on logical reasoning questions, both at 60%, only 1.43 percentage points behind ChatGPT, the top scorer. In some sub-categories domestic products still performed very well: on reasoning questions with Chinese characteristics, the domestic models led the international models in score, and their familiarity with Chinese content and logic is likely the core reason for this result.

Judging from the evaluation results above, we can see the gap between domestic and foreign products: domestic large language models are close to GPT-3.5 level in capability, but there is still a sizable gap with GPT-4. Looking across the entire field of large language models, however, it is clear that the development thresholds and challenges of the technology remain very high; the chip threshold, the threshold of accumulated practical experience, and the threshold of data and corpora all require the major domestic and foreign manufacturers to work together to achieve breakthroughs.

Judging from the InfoQ Research Center's evaluation results, Wenxin Yiyan's overall score is very close to ChatGPT's. In China's latest wave of the Internet revolution, Wenxin Yiyan can be called the domestic AIGC product most likely to catch up with the international level in the short term. Its team, which includes many AI experts, has consistently maintained a rigorous attitude toward technological exploration and is working hard to narrow the gap. Wenxin Yiyan's next breakthrough is not far away and is worth looking forward to.


Source: blog.csdn.net/mockuai_com/article/details/131660405