"Domestic LLM Product Test Report" is released!

More than 100 large-model products are now available on the domestic market. In response, Xinhuanet and an authoritative organization jointly released the "Domestic LLM Product Test Report", which offers the industry five dimensions for choosing a large model: content security, common-sense question answering, mathematical operations, reading comprehension, and subjective question answering.

The report evaluates four well-known large models, including Wenxin Yiyan and GPT-3.5. The results show that Baidu's Wenxin Yiyan has the highest comprehensive score, surpassing GPT-3.5 and ranking first among domestic large models.

Content values are an important factor when enterprises choose a large model

Large models are versatile and generalize well, so ordinary people can get the services and product functions they want through simple questions and answers. However, countries and regions differ in legal culture, social customs, and ethics. The same answer from a large model may therefore draw different social reactions, positive in one place and controversial in another, and some cultural prejudices may even lead to conflict between groups.

Content is therefore an important consideration in choosing a large model. Xinhuanet's evaluation report devotes two dimensions to content. The first is content safety question answering, which covers areas such as ideology and illegal or pornographic content. The second is common-sense question answering, which covers general knowledge of Chinese culture, history, geography, and daily life. Ge Zhenbin, technical director of the Internet of Things at Xinhuanet, said, "The content generated by a large model must comply with local laws and social moral requirements. It can be said that every country needs a large language model that is 'more suitable for its own history and culture'."

Content also matters a great deal to industry. Some companies operate in sectors vital to the national economy and people's livelihoods, while others rely on "inherited formulas" for their unique competitiveness. Zhao Zizhong, dean of the New Media Research Institute of Communication University of China, said, "This tests the service capabilities of large models in terms of information security, data security, and customization. Large models must have industry-oriented and scenario-based service capabilities to meet the requirements of different companies."

Wenxin Yiyan is "the most suitable for China"

At present, organizations ranging from government agencies to enterprises urgently need standards and methods for judging the suitability of large models.

Ge Zhenbin, technical director of the Internet of Things at Xinhuanet, believes that five dimensions are very important for evaluating large models. The first is the ability to control the security of generated content, which involves areas such as ideology, political system, and illegal or pornographic content, and which represents the bottom line of a society's civilization. The second is the ability to reason about and apply common sense across many fields, such as nature, culture, geography, history, and daily life; a model needs a thorough grasp of this common sense to avoid generating inappropriate content. The third is semantic understanding of text, which tests whether the content produced by the large model is correct, reasonable, and persuasive. The fourth is the ability to perform mathematical operations and mathematical reasoning. The fifth is the ability of subjective thinking, which tests whether the large model can accurately understand local customs and traditional culture.

Xinhuanet's evaluation report shows that Wenxin Yiyan has clear advantages in security, common sense, mathematics, and reading, owing to its strengths in Chinese search engines and algorithm models. The comprehensive score is the average of the scores on the five dimensions. Wenxin Yiyan's comprehensive score is 94.7 points, ranking first and exceeding GPT-3.5's 76.9 points. According to the report, this shows that Wenxin Yiyan currently surpasses the GPT-3.5 model in overall ability on Chinese-language processing.
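As a rough illustration of how a comprehensive score of this kind can be computed, here is a minimal Python sketch. It assumes an unweighted mean over the five dimensions; the report does not say whether any weighting is applied. The dimension keys, the function name comprehensive_score, and the example values are hypothetical and are not taken from the report.

```python
from statistics import mean
from typing import Mapping

# The five evaluation dimensions named in the report (key names are our own).
DIMENSIONS = (
    "content_security",
    "common_sense_qa",
    "mathematical_operation",
    "reading_comprehension",
    "subjective_qa",
)


def comprehensive_score(scores: Mapping[str, float]) -> float:
    """Return the unweighted mean of the five dimension scores (assumption)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


# Placeholder values for illustration only; these are NOT scores from the report.
example = dict.fromkeys(DIMENSIONS, 80.0)
example["subjective_qa"] = 90.0
print(comprehensive_score(example))
```

With an unweighted mean, every dimension counts equally, so a weak result on a single dimension such as content security lowers the overall score by the same amount as a weak result on any other.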

(Xinhuanet test report: Baidu Wenxin ranked No. 1 in comprehensive score)

With this performance, Wenxin Yiyan has taken the lead among domestic large models on being "most suitable for China".

Zhao Zizhong, dean of the New Media Research Institute of Communication University of China, suggested that entrepreneurs, developers, and small and medium-sized enterprises do not need to build their own large models from scratch. They can build intelligent applications on top of the Wenxin large models, avoid reinventing the wheel, and focus on the innovation they are good at. Whoever first delivers an application that meets users' needs will seize the opportunity for development.

Origin blog.csdn.net/yaxuan88521/article/details/132354971