Understanding Google Gemini: a comprehensive CMU evaluation finds Gemini Pro trailing GPT 3.5 Turbo

Just how capable is Google's Gemini? How does it compare with OpenAI's GPT models? This CMU paper makes it clear.

Some time ago, Google released Gemini, a large model positioned as a direct competitor to OpenAI's GPT series. It comes in three versions: Ultra (the most capable), Pro, and Nano. Results published by Google's research team show the Ultra version outperforming GPT-4 on many tasks, while the Pro version is roughly on par with GPT-3.5.

Although these comparisons matter greatly for large language model research, the exact evaluation details and model predictions have not been made public, which limits reproduction and scrutiny of the results and makes deeper analysis difficult.

To understand Gemini's true strength, researchers from Carnegie Mellon University and BerriAI conducted an in-depth study of the model's language understanding and generation capabilities.

They tested the text understanding and generation capabilities of Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on ten datasets. Specifically, they evaluated knowledge-based question answering on MMLU; reasoning on BIG-Bench Hard; mathematical problem solving on datasets such as GSM8K; machine translation on FLORES; code generation on datasets such as HumanEval; and the models' ability to act as instruction-following agents on WebArena.

Table 1 below presents the main results of the comparison. Overall, as of the paper's publication date, Gemini Pro comes close to OpenAI's GPT 3.5 Turbo in accuracy on all tasks but remains slightly behind it. The authors also found that both Gemini and GPT outperform the open-source competitor Mixtral.

In the paper, the authors provide an in-depth description and analysis of each task. All results and reproducible code are available at: https://github.com/neulab/gemini-benchmark

Paper link: https://arxiv.org/pdf/2312.11444.pdf

Experimental setup

The authors selected four models as the systems under test: Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral.


Because previous studies differed in their evaluation settings, the authors re-ran all experiments with exactly the same prompts and evaluation protocol for every model to ensure a fair comparison. For most evaluations they used prompts and scoring rules from standard repositories, such as those released with the datasets and the Eleuther AI evaluation harness. These prompts typically include the query, the input, a small number of in-context examples, and chain-of-thought reasoning. For a few evaluations the authors found that minor adjustments to standard practice were necessary; these adjustments are documented in the accompanying code repository (see the original paper for details).
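To make the setup concrete, the comparison boils down to a loop of the following shape. This is a minimal sketch under assumed interfaces: `build_prompt`, `query_model`, and `score` are hypothetical placeholders, not the paper's actual code, which lives in the linked repository.

```python
# Minimal sketch of a shared evaluation loop; the callables passed in are hypothetical
# placeholders for prompt construction, provider API wrappers, and answer grading.
from typing import Callable, Dict, List

def evaluate(models: Dict[str, Callable[[str], str]],
             examples: List[dict],
             build_prompt: Callable[[dict], str],
             score: Callable[[str, dict], bool]) -> Dict[str, float]:
    """Send the same prompt to every model and grade every answer with the same rule."""
    results = {}
    for name, query_model in models.items():
        correct = 0
        for ex in examples:
            prompt = build_prompt(ex)          # identical few-shot / chain-of-thought prompt for all models
            answer = query_model(prompt)       # thin wrapper around each provider's API
            correct += int(score(answer, ex))  # identical answer extraction and matching
        results[name] = correct / len(examples)
    return results
```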

The objectives of this study are as follows:

1. Provide a third-party, objective comparison of the capabilities of OpenAI's GPT and Google's Gemini models, backed by reproducible code and fully transparent results.

2. Examine the evaluation results in depth and analyze in which areas each model stands out.

Knowledge-based QA

The authors used MMLU, which consists of 57 knowledge-based multiple-choice question answering subtasks covering topics from STEM to the humanities and social sciences. MMLU has 14,042 test samples in total and is widely used to assess the overall knowledge of large language models.
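As a rough illustration of how such prompts are assembled (this is not the exact template from the standard repository, and the `rationale` and `answer` fields are assumptions), a 5-shot chain-of-thought MMLU prompt might be built like this:

```python
# Sketch of a 5-shot chain-of-thought prompt for an MMLU-style multiple-choice question.
# The record format (question, choices, rationale, answer) is an illustrative assumption.
def format_question(ex: dict) -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", ex["choices"]))
    return f"Question: {ex['question']}\n{options}\nAnswer:"

def build_mmlu_prompt(few_shot: list, target: dict) -> str:
    parts = []
    for ex in few_shot[:5]:
        # Each demonstration shows the reasoning chain before the final letter.
        parts.append(f"{format_question(ex)} {ex['rationale']} The answer is {ex['answer']}.")
    parts.append(format_question(target))  # the model continues from here
    return "\n\n".join(parts)
```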

The authors compared the four models' overall performance on MMLU (shown in the figure below), their performance on individual subtasks, and the effect of output length on performance.

Figure 1: Overall accuracy of each model on MMLU with 5-shot prompting and chain-of-thought prompting.

As the figure shows, Gemini Pro's accuracy is lower than GPT 3.5 Turbo's and much lower than GPT 4 Turbo's. With chain-of-thought prompting, the models differ little in performance. The authors speculate that this is because MMLU consists mainly of knowledge-based questions, which may not benefit much from stronger reasoning-oriented prompts.

It is worth noting that all MMLU questions are multiple choice, with four candidate answers labeled A to D. The figure below shows the proportion of each answer option selected by each model. Gemini's answer distribution is heavily skewed toward the last option, D, in contrast to the more balanced distributions of the GPT models. This may indicate that Gemini has not undergone extensive instruction tuning for multiple-choice formats, leaving it with a positional bias over the answer options.

Figure 2: Proportion of multiple-choice answer options predicted by each tested model.
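A distribution like the one in Figure 2 can be recomputed from released predictions in a few lines; the prediction format below is an assumption.

```python
# Count how often each answer letter is predicted (the format of `predictions` is assumed).
from collections import Counter

def option_distribution(predictions):
    """predictions: iterable of predicted letters, e.g. ["A", "D", "D", ...]."""
    counts = Counter(p.strip().upper() for p in predictions if p)
    total = sum(counts.values())
    return {letter: counts.get(letter, 0) / total for letter in "ABCD"}

# A heavily skewed result such as {"A": 0.05, "B": 0.07, "C": 0.08, "D": 0.80}
# would indicate the positional bias described above.
```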

The figure below shows the models' performance on the MMLU subtasks. Gemini Pro performs worse than GPT 3.5 on most subtasks, and chain-of-thought prompting reduces the variance across subtasks.

Figure 3: Accuracy of the tested models on each MMLU subtask.

The authors take a closer look at Gemini Pro's strengths and weaknesses. As Figure 4 shows, Gemini Pro lags behind GPT 3.5 on human_sexuality (social sciences), formal_logic (humanities), elementary_mathematics (STEM), and professional_medicine (specialized fields). Even on the two tasks where Gemini Pro is better, its lead is slim.

Figure 4: MMLU subtasks where Gemini Pro or GPT 3.5 holds the largest advantage.

Gemini Pro's weak performance on certain tasks can be attributed to two factors. First, there are cases where Gemini fails to return an answer: on most MMLU subtasks its API response rate exceeds 95%, but it drops sharply on a morality-related subtask (response rate 85%) and on human_sexuality (response rate 28%). This suggests that some of Gemini's lower scores stem from content filters blocking responses. Second, Gemini Pro is slightly weaker at the basic mathematical reasoning needed for the formal_logic and elementary_mathematics tasks.
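Response rates of this kind are straightforward to reproduce from raw API logs; a minimal sketch, assuming each record carries the subtask name and the (possibly empty) response:

```python
# Sketch: per-subtask API response rate, treating empty or blocked outputs as non-responses.
from collections import defaultdict

def response_rates(records):
    """records: iterable of dicts like {"subtask": str, "response": str | None} (assumed format)."""
    answered, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        if r["response"]:                      # None or "" means the API returned nothing usable
            answered[r["subtask"]] += 1
    return {t: answered[t] / total[t] for t in total}
```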

The authors also analyzed how output length under chain-of-thought prompting affects model performance, as shown in Figure 5. In general, more powerful models perform more complex reasoning and therefore produce longer answers. Gemini Pro has a notable advantage here: its accuracy degrades less as outputs grow longer, and when the output length exceeds 900 it even outperforms GPT 3.5. However, compared with GPT 4 Turbo, both Gemini Pro and GPT 3.5 Turbo rarely produce long reasoning chains.


Figure 5: Output length analysis of the tested models on MMLU.
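An analysis like Figure 5 amounts to bucketing outputs by length and averaging correctness per bucket; a minimal sketch, with the record format assumed:

```python
# Sketch: bucket chain-of-thought outputs by length and compute accuracy per bucket.
def accuracy_by_length(records, bucket_size=300):
    """records: iterable of dicts like {"output": str, "correct": bool} (assumed format)."""
    buckets = {}
    for r in records:
        b = (len(r["output"]) // bucket_size) * bucket_size   # e.g. 0, 300, 600, 900, ...
        hits, n = buckets.get(b, (0, 0))
        buckets[b] = (hits + int(r["correct"]), n + 1)
    return {b: hits / n for b, (hits, n) in sorted(buckets.items())}
```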

General-purpose Reasoning

The authors evaluate general reasoning ability on the BIG-Bench Hard test set, which contains 27 diverse reasoning tasks covering arithmetic, symbolic and multilingual reasoning, factual knowledge understanding, and more. Most tasks consist of 250 question-answer pairs; a few have slightly fewer.

Figure 6 shows the overall accuracy of the tested models: Gemini Pro is slightly below GPT 3.5 Turbo and far below GPT 4 Turbo, while Mixtral's accuracy is much lower still.

Figure 6: Overall accuracy of the tested models on BIG-Bench-Hard.

The authors dig deeper into why Gemini's general reasoning performance is weaker overall. First, they examined accuracy by question length. As Figure 7 shows, Gemini Pro performs poorly on longer, more complex questions. The GPT models, especially GPT 4 Turbo, degrade very little even on very long questions, indicating robustness in understanding longer and more complex queries; GPT 3.5 Turbo is moderately robust. Mixtral is stable across question lengths but has lower overall accuracy.

Figure 7: Accuracy of the tested models on BIG-Bench-Hard by question length.

The authors also analyzed whether the tested models differ in accuracy on specific BIG-Bench-Hard tasks. Figure 8 shows the tasks on which GPT 3.5 Turbo performs better than Gemini Pro.

Gemini Pro performs particularly poorly on the "tracking shuffled objects" tasks, which involve people exchanging items while the model tracks who ends up holding what; Gemini Pro often fails to keep the order straight.


Figure 8: BIG-Bench-Hard subtasks on which GPT 3.5 Turbo outperforms Gemini Pro.

Gemini Pro also falls behind Mixtral on tasks such as arithmetic problems requiring multi-step solutions and spotting errors in translations.

There are also tasks where Gemini Pro beats GPT 3.5 Turbo. Figure 9 shows the six tasks with the largest margins in Gemini Pro's favor. These tasks are heterogeneous, including ones that require world knowledge (sports_understanding), manipulating symbol stacks (dyck_languages), sorting words alphabetically (word_sorting), and parsing tables (penguins_in_a_table).

Figure 9: BIG-Bench-Hard subtasks on which Gemini Pro outperforms GPT 3.5.

The authors further analyzed the robustness of the tested models across answer types, as shown in Figure 10. Gemini Pro performs worst on the "Valid/Invalid" answer type, which comes from the formal_fallacies task; interestingly, 68.4% of the questions in that task received no response. On the other answer types (covering the word_sorting and dyck_languages tasks), however, Gemini Pro outperforms all GPT models and Mixtral, i.e. it is particularly good at rearranging words and generating symbols in the correct order. For multiple-choice answers, 4.39% of questions had their responses blocked by Gemini Pro; the GPT models excel here, and Gemini Pro struggles to keep up.


Figure 10: Accuracy of the tested models on BIG-Bench-Hard by answer type.

All in all, no single model leads on every task. When running general reasoning tasks, it may therefore be worth trying both the Gemini and GPT models before deciding which to use.

Mathematical ability

To evaluate the mathematical reasoning capabilities of the tested models, the authors selected four math benchmark sets:

(1) GSM8K: a grade-school math benchmark;

(2) SVAMP: tests the robustness of reasoning by generating question variants with altered word order;

(3) ASDIV: questions with diverse language patterns and problem types;

(4) MAWPS: arithmetic and algebra word problems.

The authors compared the accuracy of Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the four math test sets, examining overall performance, performance at different question complexities, and performance at different chain-of-thought depths.
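Scoring these benchmarks typically means extracting the final number from a chain-of-thought answer and comparing it with the reference. The sketch below shows one common recipe; it is not necessarily the paper's exact extraction rule.

```python
# Sketch: pull the last number out of a chain-of-thought answer and compare it with
# the reference answer (a common way to score GSM8K-style outputs).
import re

def last_number(text: str):
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    pred, gold = last_number(model_output), last_number(reference_answer)
    try:
        return pred is not None and float(pred) == float(gold)
    except (TypeError, ValueError):
        return False
```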

Figure 11 presents the overall results. On GSM8K, SVAMP, and ASDIV, which feature varied language patterns, Gemini Pro's accuracy is slightly below GPT 3.5 Turbo's and far below GPT 4 Turbo's. On MAWPS, all tested models exceed 90% accuracy, yet Gemini Pro still trails the GPT models slightly; on this task GPT 3.5 Turbo narrowly beats GPT 4 Turbo. Mixtral's accuracy is much lower than that of the other models.


Figure 11: Overall accuracy of the tested models on the four mathematical reasoning test sets.

Figure 12 shows each model's robustness to question length. As with the BIG-Bench Hard reasoning tasks, the models' accuracy drops on longer questions. GPT 3.5 Turbo does better than Gemini Pro on shorter questions but degrades faster; on longer questions Gemini Pro approaches GPT 3.5 Turbo's accuracy while still lagging slightly behind.


Figure 12: Accuracy of the tested models by question length on the four mathematical reasoning test sets.

Additionally, the authors observed differences in accuracy when the answer requires a longer chain of thought. As Figure 13 shows, GPT 4 Turbo remains very robust even with long reasoning chains, whereas GPT 3.5 Turbo, Gemini Pro, and Mixtral weaken as the chain-of-thought length grows. The analysis also shows that Gemini Pro outperforms GPT 3.5 Turbo on complex examples with chains longer than 100, but does worse on shorter examples.


Figure 13: Accuracy of each model on GSM8K for different chain-of-thought lengths.

Figure 14 shows the accuracy of the tested models as a function of the number of digits in the answer. The authors created three buckets according to whether the answer has 1, 2, or 3+ digits (except for MAWPS, which has no answers longer than two digits). As the figure shows, GPT 3.5 Turbo is more robust to multi-digit math problems, while Gemini Pro degrades as the number of digits grows.


Figure 14: Accuracy of each model on the four mathematical reasoning test sets as the number of answer digits varies.
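A digit-bucket breakdown like Figure 14 can be reproduced with a small grouping step; a sketch, assuming each record holds the reference answer and a correctness flag:

```python
# Sketch: group accuracy by the number of digits in the reference answer (record format assumed).
from collections import defaultdict

def accuracy_by_digits(records):
    """records: iterable of dicts like {"answer": "135", "correct": True}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        digits = len(str(abs(int(float(r["answer"])))))   # digits in the integer part of the answer
        bucket = "1 digit" if digits == 1 else "2 digits" if digits == 2 else "3+ digits"
        totals[bucket] += 1
        hits[bucket] += int(r["correct"])
    return {b: hits[b] / totals[b] for b in totals}
```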

Code generation

In this section, the authors use two code generation datasets, HumanEval and ODEX, to examine the models' coding ability. The former tests basic code understanding with a limited set of functions from the Python standard library, while the latter tests the ability to use a broader range of libraries from the Python ecosystem. Both provide task instructions written in English (usually with test cases) as input, and they probe language understanding, grasp of algorithms, and elementary mathematics. In total, HumanEval has 164 test samples and ODEX has 439.

First, the overall results in Figure 15 show that Gemini Pro's Pass@1 on both tasks is lower than GPT 3.5 Turbo's and far lower than GPT 4 Turbo's, suggesting that Gemini's code generation still has room for improvement.


Figure 15: Overall accuracy of each model on the code generation tasks.
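Pass@k is the standard metric from the Codex paper: the unbiased estimate of the probability that at least one of k samples passes the unit tests, given that c of n generated samples pass. The snippet below is a generic implementation, not necessarily the paper's exact scoring code.

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n samples were generated per problem and c of them pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # not enough failing samples to fill k draws, so at least one passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem (n=1, k=1), pass@1 is simply the fraction of
# problems whose generated solution passes all tests.
```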

Second, the authors analyzed the relationship between gold solution length and model performance in Figure 16 (a); solution length is a rough proxy for the difficulty of the code generation task. They found that Gemini Pro matches GPT 3.5's Pass@1 when the gold solution is shorter than 100 (the easier cases), but falls significantly behind as solutions get longer. This is an interesting contrast with the previous sections, where Gemini Pro was generally robust to longer inputs and outputs on English-language tasks.


The authors also analyzed, in Figure 16 (b), how the libraries required by each solution affect performance. For most libraries, such as mock, pandas, numpy, and datetime, Gemini Pro does worse than GPT 3.5. On matplotlib cases, however, it beats both GPT 3.5 and GPT 4, suggesting a stronger ability to produce plotting and visualization code.

Finally, the authors show several concrete failure cases where Gemini Pro did worse than GPT 3.5 at code generation. First, they noticed that Gemini is slightly weaker at selecting the correct functions and arguments from the Python API. For example, given the following prompt:

[prompt shown as an image in the paper]

Gemini Pro generated code that raised a type mismatch error:

[Gemini Pro's generated code, shown as an image in the paper]

In contrast, GPT 3.5 Turbo produced code that achieved the intended result:

[GPT 3.5 Turbo's generated code, shown as an image in the paper]

In addition, a higher proportion of Gemini Pro's errors are cases where the generated code runs and is syntactically correct but fails to match a more complex intent. For example, given the following prompt:

[prompt shown as an image in the paper]

Gemini Pro produced an implementation that extracts only the unique numbers, rather than removing the numbers that appear more than once:

[Gemini Pro's generated code, shown as an image in the paper]

Machine translation

This set of experiments uses the FLORES-200 machine translation benchmark to evaluate the models' multilingual capabilities, specifically translation between various language pairs. The authors focus on a subset of 20 languages used in Robinson et al.'s (2023) analysis, covering varying degrees of resource availability and translation difficulty. They evaluate all selected language pairs on the 1,012 sentences of the test set.

In Tables 4 and 5, the authors compare Gemini Pro, GPT 3.5 Turbo, and GPT 4 Turbo against mature systems such as Google Translate. They also benchmark NLLB-MoE, a leading open-source machine translation model known for its broad language coverage. The results show that Google Translate generally outperforms the other systems, doing best on 9 languages, followed by NLLB, which performs best on 6-8 languages depending on the 0-shot or 5-shot setting. The general-purpose language models are competitive but have not yet surpassed dedicated machine translation systems when translating into non-English languages.


Table 4: Machine translation performance (chrF (%)) of each model across all languages with 0-shot prompting. The best score is shown in bold and the second-best is underlined.

Table 5: Machine translation performance (chrF (%)) of each model across all languages with 5-shot prompting. The best score is shown in bold and the second-best is underlined.
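chrF is a character n-gram F-score. It can be computed, for example, with the sacrebleu library; the snippet below is a usage sketch, not the paper's exact scoring code.

```python
# Sketch: corpus-level chrF with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["Das ist ein kleines Haus."]     # model translations, one per source sentence
references = [["Das ist ein kleines Haus."]]   # one list per reference stream

chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(round(chrf.score, 1))                    # chrF score in percent
```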

Figure 17 compares the general-purpose language models across language pairs. Compared with GPT 3.5 Turbo and Gemini Pro, GPT 4 Turbo's deviation from NLLB is more consistent, and it shows large improvements on low-resource languages, while on high-resource languages the LLMs perform similarly. Gemini Pro outperforms both GPT 3.5 Turbo and GPT 4 Turbo on 8 of the 20 languages and achieves the best score on 4 languages. However, Gemini Pro shows a strong tendency to block responses on roughly 10 language pairs.


Figure 17: Machine translation performance (chrF (%)) by language pair.

Figure 18 shows that Gemini Pro's lower performance on these languages is partly because it tends to block responses in lower-confidence scenarios: a response is counted as "blocked" if Gemini Pro returns a "Blocked Response" error in either the 0-shot or the 5-shot configuration.


Figure 18: Number of samples blocked by Gemini Pro.
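Counting blocked samples per language pair, as in Figure 18, reduces to a filter over the raw responses; a sketch with an assumed record format:

```python
# Sketch: count blocked responses per language pair (the error-detection rule is an assumption).
from collections import Counter

def blocked_counts(records):
    """records: iterable of dicts like {"lang_pair": "eng-hau", "error": "Blocked Response" or None}."""
    return Counter(r["lang_pair"] for r in records if r.get("error") == "Blocked Response")
```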

A closer look at Figure 19 shows that, on the unblocked samples where it is more confident, Gemini Pro slightly outperforms GPT 3.5 Turbo and GPT 4 Turbo: it scores 1.6 and 2.6 chrF above GPT 4 Turbo in the 5-shot and 0-shot settings respectively, and 2.7 and 2 chrF above GPT 3.5 Turbo. However, the authors' preliminary analysis of GPT 4 Turbo's and GPT 3.5 Turbo's performance on the blocked samples suggests that these samples are generally harder to translate. Gemini Pro does not handle these particular samples well, and notably, some samples are blocked in the 0-shot setting but not in the 5-shot setting, and vice versa.


Figure 19: chrF performance (%) on blocked and unblocked samples.

Across the analysis, the authors observed that few-shot prompting generally yields moderate improvements in average performance, with variance increasing in the order GPT 4 Turbo < GPT 3.5 Turbo < Gemini Pro. Although Gemini Pro's 5-shot results improve over its 0-shot results on high-confidence languages, for some languages, such as hau_Latn, the model's confidence drops markedly and responses are blocked (see Table 5).

Figure 20 shows clear trends by language family and script. Gemini Pro is competitive with the other models on Cyrillic-script languages but underperforms on other scripts. GPT-4 performs outstandingly across scripts, beating the other models, with few-shot prompting proving especially effective. This effect is particularly evident for languages written in Devanagari script.

Figure 20: Performance of each model on different scripts (chrF (%)).

WebAgent

Finally, the authors examine each model's ability to act as a web-navigating agent, a task that requires long-horizon planning and complex data understanding. They use WebArena, a simulation environment in which success is measured by execution results. Tasks assigned to the agent include information seeking, site navigation, and content and configuration manipulation, across a variety of websites including e-commerce platforms, social forums, collaborative software development platforms (such as GitLab), content management systems, and online maps.

The authors tested Gemini Pro's overall success rate, success rate on different task types, response length, trajectory steps, and its tendency to predict that a task is infeasible. Table 6 lists the overall performance: Gemini Pro is close to, but slightly worse than, GPT 3.5 Turbo. Like GPT 3.5 Turbo, Gemini Pro does better when the prompt mentions that the task may not be completable (the UA hint); with the UA hint, its overall success rate is 7.09%.

Table 6: Performance of each model on WebArena.

Broken down by website type, as shown in Figure 21, Gemini Pro performs worse than GPT 3.5 Turbo on GitLab and maps, while it is close to GPT 3.5 Turbo on the shopping admin, Reddit, and shopping sites. On multi-site tasks Gemini Pro outperforms GPT 3.5 Turbo, consistent with the earlier finding that Gemini does relatively better on more complex subtasks across the benchmarks.


Figure 21: Web agent success rate of each model on different types of websites.

As shown in Figure 22, Gemini Pro generally predicts that more tasks are infeasible, especially when the UA hint is given: with the hint, Gemini Pro declares 80.6% of tasks infeasible, versus 47.7% for GPT 3.5 Turbo. Since only 4.4% of the tasks in the dataset are actually unachievable, both models vastly overestimate the number of unachievable tasks.


Figure 22: Number of tasks predicted as unachievable (UA).
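The over-prediction of unachievable (UA) tasks can be quantified by comparing predicted and actual UA rates; a sketch with an assumed record format:

```python
# Sketch: compare how often a model declares a task unachievable vs. how often it truly is.
def ua_rates(records):
    """records: list of dicts like {"predicted_unachievable": bool, "actually_unachievable": bool}."""
    n = len(records)
    predicted = sum(r["predicted_unachievable"] for r in records) / n
    actual = sum(r["actually_unachievable"] for r in records) / n
    return {"predicted_ua_rate": predicted, "actual_ua_rate": actual}
```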

The authors also observed that Gemini Pro tends to respond with shorter phrases and to take fewer steps before reaching a conclusion. As shown in Figure 23 (a), more than half of Gemini Pro's trajectories have fewer than 10 steps, while most trajectories of GPT 3.5 Turbo and GPT 4 Turbo take between 10 and 30 steps. Likewise, most of Gemini's replies are under 100 characters, while most replies from GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral exceed 300 characters (Figure 23 (b)). Gemini tends to predict actions directly, whereas the other models reason first and then predict an action.


Figure 23: Model behavior on WebArena. 
