Wenxin Yiyan and Xunfei Xinghuo comprehensive comparison test: (3) Common-sense questions

Previously in this series:

In part (1), the language comprehension test, we mainly tested the two large language models' understanding of complex semantics, recognition of an article's emotional tone, article summarization, and extraction of key elements from an article. The results showed that, in terms of language comprehension, apart from a few questions it refused to answer, Xunfei Xinghuo clearly outperformed Wenxin Yiyan and came very close to the level of ChatGPT 3.5.

In part (2), the task completion test, we tested the models' ability to process tables, write acrostic poems, and draw up a running plan. Overall, both models performed only moderately in this area.

Today we will test the two models' performance on common-sense questions.

1. Test content design

Knowledge-based testing reflects the large knowledge store and understanding behind a large model, which can directly help people answer questions quickly. Here we include both simpler common sense and more complex professional knowledge. For professional knowledge in particular, we also describe certain phenomena and ask the model to explain them using that knowledge; this reverse test reflects how well the model actually understands what it knows.

Tests include:

1. Objective common-sense facts (at the level of high-school and college students): relatively objective factual questions, graded mainly on right or wrong

2. Subjective common sense: relatively subjective questions, graded mainly on the rationality, logic, and quality of the model's responses

3. Causal inference: simple causal relationships

4. Factual errors: the question itself contains an error; we check whether the model can spot it

2. Objective facts

I selected some small questions from geography, history, astronomy, literature, and other fields I am relatively familiar with, to test accuracy.

Use case 1: Which is the third-highest mountain on Earth?

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

As far as I can tell, only Wenxin Yiyan got this right (the third-highest mountain on Earth is Kangchenjunga); the other two were wrong.

Use case 2: In ancient China, who does "Song Shenzong" refer to, and in what year does history record that he died?

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

Song Shenzong refers to Zhao Xu, the sixth emperor of the Northern Song. Wenxin Yiyan got everything right; Xunfei Xinghuo got the year wrong; and ChatGPT, although it got the year right, solemnly fabricated a "Decree of the First Year of Renzong Zhiping".

Use case 3: How many planets are there in the solar system? List them in order of distance from the sun, from nearest to farthest.

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

Not bad, not bad, all three got it right! (There are eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.)

3. Subjective common sense

Use case 1: Who do you think is the best football player today?

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

All three answers are good. Xunfei Xinghuo's training data is evidently relatively recent, because it mentioned Haaland.

4. Causal inference

Use case 1: This is a civil-service examination question:

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

Here, I feel that Wenxin Yiyan and Xunfei Xinghuo did not understand the question, while ChatGPT's answer was correct.

Use case 2: Another logical-reasoning question from the civil-service examination:

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

In this round, Wenxin Yiyan and ChatGPT passed the test, but Xunfei Xinghuo did not.

5. Factual errors

This test is quite special: the questioner's question is itself wrong, and the point is to see whether the AI can point this out, so that the user is not led astray from the very start.

Use case 1: Guan Gong fights Qin Qiong (a deliberate anachronism: Guan Yu lived in the Three Kingdoms era, while Qin Qiong lived in the Sui-Tang era)

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

Use case 2: Lin Daiyu uproots a weeping willow (a deliberate mix-up: uprooting the willow is Lu Zhishen's feat in Water Margin, while Lin Daiyu is the frail heroine of Dream of the Red Chamber)

Wenxin Yiyan:

Xunfei Xinghuo:

ChatGPT:

The three answers were roughly the same; the difference is that in Wenxin Yiyan's version she actually pulled the willow up, while the other two kept her recognizably in character as Sister Lin.

6. Summary

  1. Today's test covered some general-knowledge questions, some logical-reasoning questions, and the AIs' responses to questions that are themselves wrong.

  2. On common-sense questions, the answers were not satisfactory, and I do not know the specific reasons. In logical reasoning, Wenxin Yiyan and Xunfei Xinghuo are far behind ChatGPT. Finally, faced with erroneous questions, the AIs operate on the principle of "if you talk nonsense, I will talk even bigger nonsense", basically piling error upon error.


Origin blog.csdn.net/m0_37771865/article/details/131040840