Test | Xunfei Xinghuo has been upgraded: its capabilities are much improved, and it no longer talks nonsense with a straight face

On June 9, the upgraded Xunfei Xinghuo Cognitive Large Model was released. For details, see:

The press release states:

The Xunfei Xinghuo Cognitive Large Model has once again strengthened its core capabilities, with comprehensive upgrades in three areas: open-ended knowledge Q&A, logic and mathematics, and multi-turn dialogue.

But is it true? We'll have to test it out to find out.

1. Test content

When Xunfei Xinghuo first launched, I wrote a series of evaluation articles:

Comprehensive comparison test of Wenxin Yiyan and Xunfei Xinghuo: (1) Language comprehension ability

Comprehensive comparison test of Wenxin Yiyan and Xunfei Xinghuo: (2) Task completion

Comprehensive comparison test of Wenxin Yiyan and Xunfei Xinghuo: (3) Common sense questions

Comprehensive comparison test of Wenxin Yiyan and Xunfei Xinghuo: (4) Logical mathematics

Comprehensive comparison test of Wenxin Yiyan and Xunfei Xinghuo: (5) Programming ability

These articles tested the two major domestic models across five areas. Today I reuse some cases from that round, especially the ones Xunfei Xinghuo handled poorly, to see whether it has improved and by how much.

2. Language comprehension ability

Example:

Based on the following sentence: Mingming clearly understands that Baibai likes him, but Mingming just won't say so.

May I ask: Who likes whom?

version 1.0

According to the description, "clearly likes him" means that one person likes another. "Mingming" here is a person's name, and "Mingming clearly likes him" means that this person very clearly likes the other. Therefore, it is "Mingming" who likes "him".

version 1.5

The one who likes him is "Mingming", because the sentence says "Mingming clearly likes him".

Conclusion: Version 1.5 still gets it wrong this time

Example:

The eldest uncle went to the second uncle's house to find the third uncle and said that the fourth uncle stole 100 yuan from the fifth uncle's cabinet in the sixth uncle's house. I would like to ask: 1. Who is the thief? 2. Who did the money originally belong to?

version 1.0

Here is an interesting story from which we can find some clues to answer these two questions:

1. Who is the thief? According to the description, the fourth uncle went to the fifth uncle's house and took 1,000 yuan from the seventh uncle's cabinet. Therefore, the fourth uncle is the thief.

2. Who did the money originally belong to? Since the fourth uncle took the money from the seventh uncle's cabinet, the money originally belonged to the seventh uncle. However, the fourth uncle acted dishonestly to get the money, so the money was not actually returned to the seventh uncle.

version 1.5

According to the description, the fourth uncle was accused of stealing the money. So, 1. The fourth uncle is the thief.

2. The money originally belonged to the fifth uncle.

Conclusion: Version 1.5 gets it right this time, a clear improvement

Example:

An article from People's Daily on May 9:

"I hope that young people across the country will keep in mind the teachings of the party, aspire to national rejuvenation, live up to their youth, the times, and the people, run hard on the track of youth, and strive to achieve the best results of contemporary youth!"

......

......

"Everything in life must be done by oneself, and the steps of the country are boundless." Setting out on the new journey of the new era, contemporary Chinese youth will unswervingly follow the party, release their youthful passion and pursue their youthful ideals in the struggle, and through their youth and struggle they will pave the way and build bridges for national rejuvenation and contribute to the construction of the motherland.

Please write an abstract of 150 words based on the above text.

version 1.0

Sorry, I haven't learned enough on this topic to provide relevant information. You can choose other questions and I will try to answer them for you.

version 1.5

Sorry, I haven't learned enough on this topic to provide relevant information. You can choose other questions and I will try to answer them for you.

Conclusion: Xunfei Xinghuo is clearly applying an avoidance strategy here

The test confirmed that it refuses questions that Xunfei considers unsuitable to answer. Even so, the results in this section show that version 1.5 is a clear improvement.

3. Task completion

Example:

version 1.0

version 1.5

Version 1.5 directly returned a Python program. Running it generated the following data:

Only a sequence number was appended to the product name, the receiving and delivery times were random, and everything else was unchanged.

Conclusion: Although 1.0 did not output 20 lines, the 5 lines it did output were all correct, which shows it understood what was being asked. 1.5 clearly did not understand the question and performed worse than 1.0.

Example:

Process a sentence as requested:

I will give you a sentence. Please repeat it twice: the first time, reverse it completely, word for word; the second time, reverse the result of the first pass completely, word for word. The sentence is: "I am a trainee who studied GPT for 2 and a half days and likes to sing, dance and rap"

version 1.0

version 1.5

Conclusion: No significant improvement
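For reference, the expected result of this task can be produced mechanically. The sketch below is not the output of either model version. It assumes "word for word" means reversing whitespace-separated words, as in the English rendering of the prompt; if the original prompt meant character by character, `sentence[::-1]` would be the equivalent step. The helper name `reverse_words` is hypothetical.

```python
def reverse_words(sentence: str) -> str:
    """Return the sentence with its whitespace-separated words in reverse order."""
    return " ".join(reversed(sentence.split()))


sentence = ("I am a trainee who studied GPT for 2 and a half days "
            "and likes to sing, dance and rap")

first = reverse_words(sentence)   # first pass: fully reversed
second = reverse_words(first)     # second pass: reverses the first result

print(first)
print(second)
```

Because the second pass undoes the first, a correct answer to the second part is simply the original sentence.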

Example:

You are now a poet. Please compose an acrostic poem on "Two and a half years trainee is the best", praising a singer for his hard work, superb skills, and excellence in both character and artistry.

version 1.0

version 1.5

Conclusion: Some improvements, but still not complete

The tests show that version 1.5 improves little on task completion.

4. Common sense questions

Example:

Which is the third highest mountain on earth?

version 1.0

version 1.5

The third highest mountain on earth is Kanchenjunga, which is located on the border of India and Nepal.

Conclusion: Significant improvement

Example:

In ancient China, who was Song Shenzong, and in what year did he die according to historical records?

version 1.0

version 1.5

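Emperor Shenzong of Song, Zhao Xu (赵顼, May 25, 1048 to April 1, 1085), originally named Zhao Zhongzhen, was the eldest son of Emperor Yingzong of Song, Zhao Shu. His birth mother was Empress Gao (Xuanren Shenglie), and he was the sixth emperor of the Northern Song dynasty (reigned January 25, 1067 to April 1, 1085). In the first year of Zhiping (1064), he was made Prince of Ying. In the twelfth month of the third year of Zhiping (1066), he was made crown prince. He succeeded to the throne in the first month of the fourth year of Zhiping (1067).

In the eighth year of Yuanfeng (1085), Emperor Shenzong died of melancholy in Funing Hall at the age of 38. His temple name was Shenzong and his posthumous title was Emperor Yingwen Liewu Shengxiao. He was buried in the Yongyu Mausoleum, and his son Zhao Xu (赵煦) succeeded him.

Conclusion: Correct this time. Interestingly, the answer is almost word for word the same as the entry on Baidu Baike.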

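Example:

There is a famous event in Chinese history: "Guan Gong fights Qin Qiong". In what year did this event take place, and what were the main events?

version 1.0

version 1.5

"Guan Gong Fights Qin Qiong", also called "Tang Han Zheng", is a two-person crosstalk (xiangsheng) piece. The mainstream account is that it was written in the 1930s by the crosstalk performer Zhang Jieyao. In the 1960s, Hou Baolin adapted it from Zhang Jieyao's performance script.

"Guan Gong Fights Qin Qiong" tells how Han Fuju, then chairman of Shandong Province, held a birthday celebration for his father. The ignorant old man picked the operas at the private performance and demanded that Guan Yu, the famous Han-dynasty general from Shanxi, and Qin Qiong, the Tang-dynasty hero from Shandong, fight it out. The actors had to improvise on stage, caught between laughter and tears and barely able to carry on, and the whole thing turned into a farce.

Guan Gong is Guan Yu, a general of Shu in the Three Kingdoms period. Qin Qiong was a general of the late Sui and early Tang. "Guan Gong fights Qin Qiong" is actually a mistaken premise, because the two were born in different eras and could not have fought each other. The crosstalk piece is therefore highly satirical, and the phrase "Guan Gong fights Qin Qiong" has been passed down as a jab at people who pretend to know what they do not, ignore reality, and give orders blindly.

Conclusion: On "factual error" questions, version 1.5 really has improved a lot. It also handles questions such as "Lin Daiyu uproots a weeping willow", "Lu Zhishen wreaks havoc in Heaven", and "Zhang Chaoyang founded Sina" well.

On common sense questions, 1.5 is a huge improvement over 1.0. In particular, the problem of the large model talking nonsense with a straight face is much reduced in 1.5.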

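5. Logical mathematics

Example:

1, 2, 4, 8, 16, ... Following this pattern, what is the 11th number?

version 1.0

version 1.5

Conclusion: Correct this time, a clear improvement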
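As a reference for checking the answer, each term is double the previous one, so the n-th term is 2^(n-1). A minimal sketch, assuming the pattern is pure doubling starting from 1 (the function name `nth_term` is hypothetical, and this is not the model's output):

```python
def nth_term(n: int) -> int:
    """Return the n-th term (1-indexed) of the doubling sequence 1, 2, 4, 8, 16, ..."""
    return 2 ** (n - 1)


# The first five terms reproduce the sequence given in the prompt.
print([nth_term(i) for i in range(1, 6)])  # [1, 2, 4, 8, 16]
print(nth_term(11))                        # 1024
```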

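Example:

The students stand in a row. Counting from the left, Huahua is the 5th person; counting from the right, Huahua is the 4th person. How many people are in the row?

version 1.0

version 1.5

Conclusion: Correct this time, a clear improvement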
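The expected answer here follows from one observation: Huahua is counted once from each side, so the two positions overlap by one person. A small sketch of that arithmetic (the function name `row_length` is hypothetical):

```python
def row_length(from_left: int, from_right: int) -> int:
    """Length of a row given one person's position counted from each end."""
    # The person appears in both counts, so remove the double count.
    return from_left + from_right - 1


print(row_length(5, 4))  # 8
```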

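Example:

There are five natural numbers A, B, C, D, and E, where A>B, E>C>D, D>B, and E>A. Arrange these numbers from smallest to largest.

version 1.0

version 1.5

Conclusion: Still wrong this time. Judging by the content of the answers, 1.0 at least knew that it was supposed to arrange them from smallest to largest.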
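One way to check an answer to this question is to enumerate every ordering of the five letters and keep those that satisfy the stated inequalities. The sketch below assumes the five numbers are distinct and uses only the constraints as given in the prompt; under those assumptions the constraints fix B as the smallest and E as the largest, but they do not say where A sits relative to C and D, so more than one ordering passes. The function name `consistent_orders` is hypothetical.

```python
from itertools import permutations


def consistent_orders() -> list[str]:
    """List every smallest-to-largest ordering consistent with the constraints."""
    orders = []
    for perm in permutations("ABCDE"):
        # rank[x] is the position of x in the ordering (0 = smallest).
        rank = {name: i for i, name in enumerate(perm)}
        a, b, c, d, e = (rank[x] for x in "ABCDE")
        if a > b and e > c > d and d > b and e > a:
            orders.append(" < ".join(perm))
    return orders


for order in consistent_orders():
    print(order)
```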

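Example:

You have 1000 kg of mushrooms with a water content of 99%. You dry them in the sun for a few days until the water content is 98%. How much water has to be dried off?

version 1.0

version 1.5

Conclusion: Progress. The step-by-step reasoning has improved; if marks were awarded per step, it would earn at least half of them.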
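The step-by-step solution to this question rests on the fact that the dry matter does not change while the water evaporates. A minimal sketch of that reasoning, using exact fractions to avoid rounding (the function name `water_to_remove` is hypothetical, and this is a reference calculation, not the model's answer):

```python
from fractions import Fraction


def water_to_remove(total_kg: int, start_pct: int, end_pct: int) -> Fraction:
    """Kilograms of water lost when water content drops from start_pct to end_pct."""
    # Step 1: the dry matter is whatever is not water at the start.
    dry = Fraction(total_kg * (100 - start_pct), 100)
    # Step 2: afterwards, the same dry matter makes up (100 - end_pct)% of the total.
    final_total = dry * 100 / (100 - end_pct)
    # Step 3: the difference in total weight is the water that was dried off.
    return total_kg - final_total


print(water_to_remove(1000, 99, 98))  # 500
```

The counterintuitive part is that a one-point drop in water content halves the total weight, because the 10 kg of dry matter goes from 1% to 2% of the total.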

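Example:

Xiaoming went to play at the dairy farm run by his grandfather. Xiaoming asked: "Grandpa, how many dairy cows are there here?" Grandpa said: "If you count this herd by fours, 3 are left over; by sixes, 5 are left over; by fifteens, 14 are left over; and the number of cows is between 150 and 200." Work out how many cows are in the herd.

version 1.0

version 1.5

Conclusion: Correct, and the approach it took is just how a primary school child would solve it.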
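The herd size can be confirmed by brute force over the stated range. A short sketch, assuming the range 150 to 200 is inclusive (the function name `herd_sizes` is hypothetical):

```python
def herd_sizes(low: int = 150, high: int = 200) -> list[int]:
    """Return every herd size in [low, high] matching Grandpa's three remainders."""
    return [
        n
        for n in range(low, high + 1)
        if n % 4 == 3 and n % 6 == 5 and n % 15 == 14
    ]


print(herd_sizes())  # [179]
```

Each remainder is one short of the divisor, so the herd size plus one is a common multiple of 4, 6, and 15, that is, a multiple of 60. The only such value in range is 180, which gives 179 cows.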

Judging by the test results, version 1.5 has improved a lot in mathematical ability.

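6. Programming ability

Example:

B2048 Calculate postage

Write programs in Python and C++ to calculate postage. The rules are as follows:

Postage is calculated from the weight of the mail and whether the user chooses express delivery.

For a weight up to and including 1000 grams, the base fee is 8 yuan;

For the part over 1000 grams, an overweight fee of 4 yuan is added for every 500 grams, and any remainder under 500 grams is counted as 500 grams;

If the user chooses express delivery, 5 yuan more is charged.

Input format:

One line containing a positive integer x (greater than 1 and less than 10e6) and a character c (either y or n), separated by a single space, giving the weight and whether express delivery is chosen.

If the character is y, express delivery is chosen; if the character is n, it is not.

Output format:

Output one line with a positive integer, the postage.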

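version 1.0

Xunfei Xinghuo's Python program scored 20%. The main problem was that the overweight calculation did not subtract the first 1000 g. The C++ code had faulty variable definitions and would not compile, and its later logic contained the same error as the Python version.

version 1.5

The result this time was unexpected. Last time the programs were at least complete: the Python version scored 20%, and the C++ version was at least structurally complete. This time it stopped halfway through writing.

Conclusion: On this problem, 1.5 has regressed.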

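Example:

I am writing a Python program to calculate postage. The rules are as follows: postage is calculated from the weight of the mail and whether the user chooses express delivery. For a weight up to and including 1000 grams, the base fee is 8 yuan; for the part over 1000 grams, an overweight fee of 4 yuan is added for every 500 grams, and any remainder under 500 grams is counted as 500 grams; if the user chooses express delivery, 5 yuan more is charged. Input format: one line containing a positive integer x (greater than 1 and less than 10e6) and a character c (either y or n), separated by a single space, giving the weight and whether express delivery is chosen. If the character is y, express delivery is chosen; if the character is n, it is not. Output format: output one line with a positive integer, the postage. The program I wrote is as follows:

    weight, is_urgent = input().split()
    weight = int(weight)
    basic_fee = 8
    if weight > 1000:
        overweight = weight - 1000
        extra_fee = (overweight // 500 + 1) * 4
    else:
        overweight = 0
        extra_fee = 0
    if is_urgent == 'y':
        total_fee = basic_fee + extra_fee + 5
    else:
        total_fee = basic_fee + extra_fee
    print(total_fee)

However, when the overweight part is exactly divisible by 500, the calculation is wrong. Please fix the program and output the corrected program.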

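version 1.0

It made no changes at all. After a reminder:

It pointed out the problem but still did not fix it.

version 1.5

The result:

Before the change:

After the change:

It did fix the bug behind test point 8, but it also changed other parts of the program.

Conclusion: It still did not get everything right, but at least it made an effort, so this counts as an improvement.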
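The bug described in the prompt comes from `overweight // 500 + 1`, which adds a 500 g block even when the overweight is an exact multiple of 500. The sketch below shows one possible fix using ceiling division. It is not the output of either model version, and the function names `postage_buggy` and `postage` are hypothetical; the input parsing is left out so the fee logic can be compared directly.

```python
def postage_buggy(weight: int, urgent: bool) -> int:
    """Fee as computed by the program in the prompt (bug kept on purpose)."""
    extra_blocks = (weight - 1000) // 500 + 1 if weight > 1000 else 0
    return 8 + extra_blocks * 4 + (5 if urgent else 0)


def postage(weight: int, urgent: bool) -> int:
    """Fee with partial 500 g blocks rounded up via ceiling division."""
    overweight = max(weight - 1000, 0)
    extra_blocks = (overweight + 499) // 500  # integer ceil(overweight / 500)
    return 8 + extra_blocks * 4 + (5 if urgent else 0)


# Compare both versions around the boundaries; they differ at exact multiples.
for w in (1000, 1001, 1500, 1501, 2000):
    print(w, postage_buggy(w, False), postage(w, False))
```

At 1500 g the buggy version charges 16 yuan while the ceiling-division version charges 12, which is the case the prompt asks the model to repair.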

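Judging by these tests, version 1.5 has made little progress in programming ability.

7. Summary

Today I tested Xunfei Xinghuo version 1.5. The questions were mainly ones that version 1.0 handled poorly, so there is no real way to test for regressions (where the previous answer was very bad, there was basically no room to get worse). From the tests: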

Language comprehension: 3 questions tested; one was unchanged, one improved, and one it still refused to answer.

Task completion: 3 questions tested; one got worse, one showed no significant improvement, and one improved slightly.

Common sense questions: 3 questions tested, and all improved clearly; recognizing questions built on a false premise improved the most.

Logic and mathematics: 5 questions tested; 3 improved significantly, 1 improved slightly, and 1 got worse.

Programming ability: 2 questions tested; one got slightly worse and the other slightly better, so there was little change overall.

To sum up, Xunfei Xinghuo version 1.5 has improved significantly, especially in common sense questions and mathematical logic ability.

When Xunfei Xinghuo was released, it publicly set the goal of reaching the level of GPT4.0 on October 24th, and I look forward to that day.

Origin blog.csdn.net/m0_37771865/article/details/131215240