Is Wenxinyiyan 4.0 really as good as GPT-4?

 

7210d9d17b5547141b6be5cfdf8bfe7a.gif

On October 17, at Baidu World 2023, Robin Li delivered a keynote titled "Teaching You Step by Step to Make AI Native Applications" and released version 4.0 of the Wenxin large model.

Today, let's get straight to the point. This time I want to test Wenxin Yiyan 4.0, the large model released yesterday.

The reason we want to test it is what Robin Li said at the event yesterday: "Its overall capability is no less than GPT-4's!"

 

a22759b61574ff47fb237118a5eb50c6.png

As soon as these words came out, many people became excited.

According to Robin Li, Wenxin 4.0 has made rapid progress in four areas: memory, understanding, logic and generation.

Although he personally demonstrated many cases on site, many users did not buy it at all.

Many people joked: "Fooling your buddies is fine, just don't fool yourself too."

 

4ace7565489b905c7bbfd9478f64939e.jpeg


This time, Shichao, as an industry insider, was lucky enough to get early access to the internal beta.

Since it is billed as no worse than GPT-4, let's pit the two against each other.

After getting access, Shichao tested it for a whole day. I won't keep you in suspense this time; here is the conclusion up front:

Overall, GPT-4 still wins, but Wenxin Yiyan 4.0 unexpectedly outperforms it in some areas.

 

b5ecc152954aea4b0d9cf71a42a5cabd.jpeg

For this test, Shichao picked several common evaluation angles so that the results reflect the models' abilities as comprehensively and realistically as possible. The difficulty matches that of our earlier GPT-4 evaluation.

In the first round of this competition, let's test something that everyone likes to see.

Let's start with the easier Ruozhiba questions (Ruozhiba, literally "Retarded Bar", is a Baidu Tieba forum known for absurd wordplay) and semantic trap questions, which test logic and comprehension.

Many large models have been specially trained on this kind of question, though, so a lot of the questions I tried failed to trip them up. After persistent effort, Shichao still found a weak spot.

I asked a very classic Ruozhiba question: If there are really "dragons" in the world, then I have been served by a "dragon" somewhere.

Let's look at GPT-4 first. Since it didn't know what the two "dragons" referred to, it started making up random historical allusions.

 

 

f62cbe37c4f7ea52d57338d631eded1e.jpeg

Wenxin wasn't much smarter and just invented a "humorous" explanation.

Shichao even gave it another chance and asked: Are the two dragons the same dragon?

Wenxin still confidently gave me a completely wrong answer.

 

 

cda832898385044af36b186d0757f111.jpeg

By the second question, however, GPT-4 rose to the occasion.

I asked: "The company is a warm family; no wonder I am always treated like a grandson."

Wenxin took it at face value and was still stuck on the "warm company" with "no hierarchy".

 

05b604349e276cbc157b8492b7a22c59.jpeg

13958def2cd97b4fae686a3a20d4f7c3.jpeg

Shichao then added another question about workplace leaders. This time the situation reversed completely, and Wenxin won outright.

Shichao gave them a popular joke, "When the leader picks up food, you turn the table; when the leader drinks water, you brake", and asked them to write a few more in the same style.

Both AIs produced neatly parallel sentences, but GPT-4 got the meaning completely backwards. Its lines flattered the leader perfectly, which unfortunately made every answer wrong.

 

 

671cb3b758c05b8116b45cee2430e04d.jpeg

The answers given by Wenxin are really in tune with how today's young people deal with their leaders.

A friendly reminder, though: in real life, you had better follow GPT-4's version.

 

f628fa18d6cfb985e1ff24ec8dde8723.jpeg

At the end of the first round, Wenxin and GPT-4 are tied at 1:1.

It seems the claim that Wenxin has made rapid progress was not entirely bragging.

In the second round, Shichao wanted to keep things fun and test the AIs' ability to interpret memes.

When GPT-4 first launched, its ability to interpret images amazed people for quite a while.

Because the previous round was all about Chinese semantics, Shichao felt that was a bit unfair to GPT-4, so he deliberately picked a meme with both Chinese and English captions.

Just like my life

I don’t know what I’m busy with

 

82912478f003e19c4d23d6de90c15e97.jpeg

Maybe the English captions helped, but this time GPT-4 was far better at interpreting the meme.

It not only recognized that the dog is the key character in this meme, but also understood that the punchline lies in the contrast between "helping in earnest" and "having no effect".

 

85cbda87c6ab145d6dd2579ec730ff2a.jpeg

Wenxin, however, was still treating the meme as a reading comprehension exercise...

And it is quite stubborn. You tell it the picture is funny, and it insists that there is nothing funny about it and that it doesn't understand what you are laughing at.

 

47a3fd374cb141c071f07e156aa45263.jpeg

Wenxin may not be good at explaining image memes, but when it came to Chinese Internet memes, it immediately regained the upper hand.

Shichao asked about the recent lonely meme of Teacher Wanyan Huide, a new Internet celebrity.

 

b85094a26359445ef5e667b0de22a4d8.jpeg

As a result, Wenxin not only pointed out the source of the meme, but also correctly explained that it was a homophonic meme.

It was a pity that, at the very end, it mistook "ethics" for "theory" and fell just short at the final step.

 

 

dd9e29ce004ef7ee9204062e12428c1b.jpeg

But if Wenxin fell short of full marks, then GPT-4 outright failed.

It not only missed the joke but also got the source wrong, telling you to look for the answer in a large-scale documentary called "The Legend of Wanyan Huide".

 

47e0b727738e60545774be20ef83a7b6.jpeg

After the two small tests in this second round, each side has its own strengths, so it is a draw. Wenxin keeps up with trending memes faster, while GPT-4 is stronger at interpreting images.

After two rounds, there is still no clear winner, and the score is deadlocked at 2:2.

The first two rounds of semantic understanding were fairly basic, so let's test professional skills next. The third round goes straight to GPT-4's great strength: coding questions.

I don't know if anyone still remembers, but GPT-4 once built a complete Snake game in 60 seconds, which stunned the whole world.

Now let's use the same test and let Wenxin try it.

Because the code is fairly long, it is not shown in full here. Scroll down to see the final result.

 

703619c7bc22a4adff17dff9279169b5.jpeg

43564cf04726187308134fb81ffbb68e.jpeg

Let's look at big brother GPT-4 first. It is as steady as ever: in a few dozen seconds it produced a complete, playable Snake game, including the snake's movement, randomly appearing dots, and the snake growing after it eats.

 

84d1e5dff6c7204e806914b2e2629b02.gif

However, on Wenxin's side, it failed completely.

No, this is not an animated picture that froze.

It is Wenxin's game that simply doesn't move.

 

f434706f5a9094c0f87a1e1f774ec811.jpeg

However, this does not mean that Wenxin is weak. Such a huge gap is really because GPT-4's coding ability is freakishly strong.

If we lower the difficulty a little and have them build a website from a sketch, Wenxin handles it easily.

 

c64ef54cd8c55d2f4f45f270c0719c8a.jpeg

Even so, comparing the two resulting websites below, GPT-4's is still better looking and more complete.

Wenxinyiyan

 

6086c7c53f8405f4b8fd242ac809945e.jpeg

GPT-4

 

857a71e212711da57def4c7960039412.jpeg

GPT-4 undoubtedly won this third round. The gap has now opened up: Wenxin VS GPT-4 = 2:3.

To be fair, since the previous round tested one of GPT-4's strengths, the next round tests an ability that Wenxin is said to be better at: memory.

Shichao found a transcript from an earlier interview with people involved with guide dogs. The whole document runs to more than 13,000 words.

 

c28ae9cf47babc6c5bb36accd63faa90.jpeg

After throwing this large document to these two AIs, I asked the simplest question:

Why are guide dogs a scam?

Somewhat surprisingly, although GPT-4's answer is correct, its analysis is wrong.

 

e361b8ffb456320493a34324b14d8df4.jpeg

By contrast, Wenxin understood it very accurately. It answered that the cost is high, the publicity is exaggerated, and the prospects are not as good as those of guide equipment, which are exactly the key points.

 

6b41f9a1a6669284a82058f12821db83.jpeg

Wenxin is indeed quite solid in memory and comprehension. It pulled off a comeback and brought the score back to a 3:3 tie.

Since things are so deadlocked, let's try a more interesting question in this last round.

As mentioned before about the vision version of GPT-4, this generation of GPT-4 has strong image recognition capabilities: it can label individual people in group photos, sort images, and so on.

 

9311ec1a8c57d4cb7cf0d64e999ce52f.jpeg

Several previous test questions have shown that Wenxin's image recognition ability is not weak either. So, for this last question, let's have them square off on a picture.

Shichao threw in a dental X-ray and asked both to act as doctors and diagnose the condition.

 

49ff5f0c376488fafe8bc21b6ce297a8.jpeg

Wenxin Yiyan also spotted the impacted wisdom tooth and pointed out other possible problems, but GPT-4's answer was more accurate and more to the point.

 

615bd7699a41a06f1fbed2ef7b8dc4ac.jpeg

At the end of these five rounds, Wenxin Yiyan still lost to GPT-4, 3:4. On code it took a real beating. But in Chinese semantic understanding and memory, Wenxin has indeed improved a lot, as Baidu said.
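For anyone who wants to double-check the scoreboard, here is a small sketch that replays the results reported above. The question labels are my own shorthand, and treating the "dragon" question as a point for nobody is my reading of the 1:1 score after round one.

```python
# (round, question, winner) as reported in this article.
# None means neither model scored (an assumption for the "dragon" question).
RESULTS = [
    (1, "dragon wordplay", None),
    (1, "company is a warm family", "GPT-4"),
    (1, "leader jokes", "Wenxin"),
    (2, "bilingual meme image", "GPT-4"),
    (2, "Wanyan Huide meme", "Wenxin"),
    (3, "code (Snake, website)", "GPT-4"),
    (4, "long interview document", "Wenxin"),
    (5, "dental X-ray", "GPT-4"),
]


def score_after_each_round(results):
    """Return {round: (wenxin_points, gpt4_points)} as a running total."""
    wenxin = gpt4 = 0
    by_round = {}
    for round_no, _question, winner in results:
        if winner == "Wenxin":
            wenxin += 1
        elif winner == "GPT-4":
            gpt4 += 1
        by_round[round_no] = (wenxin, gpt4)  # later questions overwrite earlier ones
    return by_round


scores = score_after_each_round(RESULTS)
for round_no, (wenxin, gpt4) in scores.items():
    print(f"After round {round_no}: Wenxin {wenxin} : {gpt4} GPT-4")
```

The running totals come out as 1:1, 2:2, 2:3, 3:3 and finally 3:4 in GPT-4's favor, matching the scores quoted round by round.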

Beyond the basic tests above, Wenxin Yiyan also launched several plug-in features this time.

Examples include Yi Jing Liu Ying (video generation), Shuo Tu Jie Hua (picture interpretation), and E Yan Yi Tu (visualized data analysis).

 

4ce3a27ed2775f1f8fa392e7c6cacee8.jpeg

For example, ask in one sentence for a video of a golden retriever climbing stairs, and a video complete with sound is ready in a few minutes.

However, it is far from perfect at present, and it often fails to generate a video because there is not enough source material.

It's quite fun as a toy, but it falls a bit short as a productivity tool.

 

669512c738b7aec48dc2ce1cac64e799.gif

Even so, the performance of Wenxin 4.0 has genuinely impressed me.

 

f50d3ef2c4e505830f7b86b3a1fc5ba3.jpeg

Against such a strong opponent, it is easy for all your efforts to look like they were in vain...

Although it still lost this time, at least you can see where it has improved and where it does better.

Finally, it must be stressed that Shichao's test only compares the two large models from a conventional angle. It can give you a first taste, but it cannot fully represent the models' strength.

What it is really worth will only become clear once it is fully open to the public. Only by trying it yourself can you get a deeper feel for it.

 

 

 

 


Origin blog.csdn.net/m0_68662723/article/details/133928140