ChatGPT falls out of favor, Llama 2 takes pride in being open source, and OpenAI pushes back!


Author | Zeng Haochen Editor | Tang Xiaoyin, Yuan Gungun

Produced by | CSDN (ID: CSDNnews)


Llama 2, both open source and free, has been a hit since its release and has become the most popular open-source alternative to ChatGPT. Many developers and companies, in China and abroad, have followed up with model research and commercial development; legendary OpenAI scientist Andrej Karpathy, for example, built a lightweight implementation of Llama 2 inference in pure C. GPT-4, by contrast, long hailed as the ceiling of large models, has been disappointing lately, mired in a whirlpool of complaints about its declining "IQ."


When did ChatGPT stop being smart?

Ever since GPT-4's release in March this year, developers and users have been reporting on the OpenAI forum that ChatGPT produces incoherent responses, unnatural language, and flawed reasoning. Opinions differ on the root cause: some researchers suspect OpenAI's system changes and upgrades, which cut costs and boost efficiency by dialing back compute, are to blame. But because ChatGPT is closed source, the real reason is hard to pin down.


OpenAI community posts discussing GPT-4 performance are particularly lively

The debate over GPT's supposed IQ decline culminated in the paper "How Is ChatGPT's Behavior Changing Over Time?", which tested the March and June versions of GPT-3.5 and GPT-4 on the same set of tasks and found significant performance differences, or drift, between versions.

The first task is the one programmers care about most: code generation. Even when explicitly told not to add comments, the newer versions of GPT-3.5 and GPT-4 still wrap their answers in extra non-code text and comments, making them long-winded. Code quality also slipped, so the share of generations that were directly executable fell sharply (for GPT-4, from 52% in March to 10% in June). A programmer grinding LeetCode problems now has much better odds of getting the answer right than ChatGPT does.
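As a concrete illustration (this is not the paper's actual harness), a minimal sketch of how a "directly executable" check can work: save the raw model output to a .py file and try to run it. Answers padded with markdown fences and prose fail the check, which is consistent with the drop described above; the helper name and timeout below are assumptions.

```python
import subprocess
import sys
import tempfile

def runs_directly(generation: str, timeout: int = 10) -> bool:
    """Write the raw model output to a .py file and try to execute it.
    Markdown fences or explanatory prose around the code make this fail."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generation)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

fence = "`" * 3  # three backticks, i.e. a markdown code fence
bare_code = "print(sum(range(10)))"
verbose_answer = f"Sure! Here is the code:\n{fence}python\nprint(sum(range(10)))\n{fence}"

print(runs_directly(bare_code))       # True: valid Python as-is
print(runs_directly(verbose_answer))  # False: the prose and fences are not Python
```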


On math problems, GPT-4's accuracy at identifying prime numbers collapsed from nearly all correct in March to 2.4% in June, while GPT-3.5's success rate soared to 86.8%. The authors suspect that GPT-3.5 now follows chain-of-thought (Chain-of-Thought) instructions better than GPT-4, and that the new GPT-4 may break down and make mistakes partway through its reasoning.
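For intuition, here is a small sketch of how one such primality item could be prompted and graded: compute the ground truth by trial division and compare it with the model's final verdict. The chain-of-thought wording and the grading rule are illustrative assumptions, not the paper's exact setup.

```python
def is_prime(n: int) -> bool:
    """Ground truth by trial division -- cheap at the scale of such tests."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def cot_prompt(n: int) -> str:
    """A chain-of-thought style prompt (wording is an illustrative guess)."""
    return (f"Is {n} a prime number? Think step by step: check divisors up to "
            f"the square root of {n}, then answer '[Yes]' or '[No]' on the last line.")

def grade(model_answer: str, n: int) -> bool:
    """Score the model's final verdict against the ground truth."""
    said_yes = "[yes]" in model_answer.lower()
    return said_yes == is_prime(n)

print(cot_prompt(7919))  # 7919 is prime
print(grade("7919 has no divisors below its square root. [Yes]", 7919))  # True
```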


On answering sensitive questions, the new GPT-3.5 is bolder than the March version, with its answer rate rising from 4% to 8%, while the new GPT-4 is more conservative, dropping from 21% to 5%. GPT-4's responses also shrank from over 600 characters to about 140: it refuses more tersely and offers shorter explanations, and GPT-3.5 shows a similar pattern. This suggests the new versions of ChatGPT may be safer, but also more timid and less willing to explain themselves.


The final task is visual reasoning. Both the new GPT-4 and GPT-3.5 improved slightly over three months earlier, but accuracy remains low: 27.4% for GPT-4 and 12.2% for GPT-3.5. Notably, despite the better overall numbers, the new GPT-4 got some queries wrong that the March version had answered correctly, underscoring the need to monitor drift in critical applications.


The authors never explicitly claim that the new versions of ChatGPT are worse than the old ones; they only describe the drift they observed, stress the need to continuously evaluate LLM behavior in production applications, and recommend that users and companies implement monitoring and analysis along the lines of the four tasks above to keep their systems running smoothly.
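A minimal sketch of what that kind of monitoring might look like, assuming a tiny fixed evaluation set with known answers, OpenAI's chat completions HTTP endpoint, and a JSONL log file; the prompts, scoring rule, and file names are illustrative, not the paper's harness.

```python
import datetime
import json
import os

import requests  # any HTTP client works; the endpoint below is OpenAI's chat API

API_URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# A tiny fixed evaluation set with known answers (illustrative only).
EVAL_SET = [
    {"prompt": "Is 7919 a prime number? Answer Yes or No.", "expected": "yes"},
    {"prompt": "Is 7917 a prime number? Answer Yes or No.", "expected": "no"},
]

def ask(model: str, prompt: str) -> str:
    """Send one prompt to a specific model version and return its reply."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def run_drift_check(model: str) -> None:
    """Score the model on the fixed set and append the result to a JSONL log,
    so accuracy can be compared across dates and model snapshots."""
    correct = 0
    for item in EVAL_SET:
        answer = ask(model, item["prompt"]).strip().lower()
        correct += answer.startswith(item["expected"])
    record = {
        "date": datetime.date.today().isoformat(),
        "model": model,
        "accuracy": correct / len(EVAL_SET),
    }
    with open("drift_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

run_drift_check("gpt-4-0314")  # the pinned March snapshot mentioned below
```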

Zou: "We don't fully understand what causes these changes in ChatGPT responses because the models are opaque. Tuning a model to improve its performance in some domains may have unintended side effects, making it less effective on other tasks." worse."

Jim Fan, a former student of Fei-Fei Li and a senior AI scientist at NVIDIA, also weighed in on the paper and ChatGPT's "upgrade in reverse." In his view, OpenAI spent most of its energy from March to June on cutting inference load, losing some capability along the way; safety alignment (Safety Alignment), meanwhile, makes coding answers more verbose and developers' lives harder, and the cost cutting may well have hurt model performance.


OpenAI's response: GPT's IQ has not declined!

Faced with all this discussion, OpenAI denied that ChatGPT's performance has regressed. "We didn't make GPT-4 dumber. Quite the opposite: we made each new version smarter than the previous one," OpenAI VP of Product Peter Welinder said in a tweet, adding, "The more heavily you use it, the more you notice problems you hadn't seen before," and encouraging everyone to send him screenshots of GPT degrading so the team can analyze them.


Judging from what OpenAI has released, the new versions are just routine updates made every three months to ensure developers always have access to the best model. At the same time, OpenAI has acknowledged that a three-month cadence is too fast: even with a three-month deprecation window, developers don't have time to upgrade their applications. It has therefore extended support for the gpt-3.5-turbo-0301 and gpt-4-0314 snapshots in the OpenAI API by a year, to June 13, 2024, and says that cases where the models regress can often be resolved by sending a more detailed prompt.
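In practice that advice comes down to two knobs: pin a dated snapshot instead of the moving alias, and spell the desired output format out in the prompt. The payloads below are a hypothetical before-and-after; the prompt text is just one example of "more detailed."

```python
# Two hypothetical request payloads, reusing the ask()/requests pattern from
# the monitoring sketch above.

terse = {
    "model": "gpt-4",  # moving alias: silently follows each quarterly update
    "messages": [{"role": "user", "content": "Write code to reverse a string."}],
}

pinned_and_detailed = {
    "model": "gpt-4-0314",  # frozen March snapshot, supported until June 13, 2024
    "messages": [{
        "role": "user",
        "content": (
            "Write a Python function reverse(s: str) -> str that returns the "
            "reversed string. Output only the code, with no comments, no "
            "markdown fences, and no explanation."
        ),
    }],
}
```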


OpenAI is also working through issues the community has reported. OpenAI technical spokesperson Logan Kilpatrick, for example, recently announced that the new version of ChatGPT will no longer always open with "As a large language model trained by OpenAI, ...". Developers get to the actual answer more directly, and ChatGPT sheds a bit of boilerplate overhead in the process.


Is open source the answer?

Interestingly, Chen et al.'s paper testing ChatGPT came out at almost the same moment as Llama 2, which anyone can download and use for free, regardless of purpose or who they are. "OSS LLMs won't be this secretive. We can strictly version them, track regressions, and diagnose and fix all these issues as a community," Jim Fan tweeted.

Ever since ChatGPT appeared, people have been eagerly calling for it to be open-sourced, yet nothing has come of it. Even when OpenAI founder Sam Altman was asked point-blank about open source, his answer deftly sidestepped whether GPT itself would ever be opened up, saying only, "We will have more open-source models in the future, but there is no specific model or timeline." That is also key to why Llama 2 has so quickly won over developers and companies around the world. For those building on a closed large language model like ChatGPT, with its uncertainty around safety, more continuous and transparent communication and maintenance remain developers' most urgent needs.

Reference link:

https://twitter.com/DrJimFan/status/1681716564335394817

https://arxiv.org/abs/2307.09009

https://www.theregister.com/2023/07/20/gpt4_chatgpt_performance/?td=rt-3a

https://community.openai.com/t/experiencing-decreased-performance-with-chatgpt-4/234269

https://twitter.com/OfficialLoganK

https://twitter.com/OpenAI



Source: https://blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/131950140