Is cost-cutting behind the decline in GPT-4's text and code quality?

At first only a handful of users voiced doubts, but soon large numbers of netizens said they had noticed it too, posting plenty of evidence.

Some reported burning through GPT-4's entire quota of 25 messages per 3 hours in one sitting without solving their coding problem.

They reluctantly fell back to GPT-3.5, which then solved it.

Summing up the feedback, the main complaints are as follows:

  • GPT-4 used to write correct code; now it is riddled with bugs
  • Answers have less depth and analysis than before
  • Responses come back faster than they used to

This has led many to wonder: is OpenAI starting to cut corners to save costs?

Two months ago GPT-4 was the world's greatest writing assistant; a few weeks ago it slid into mediocrity. I suspect they cut its compute or made it less intelligent.

This inevitably recalls Microsoft's new Bing, which "peaked at its debut" and then seemingly underwent a "frontal lobotomy" that blunted its abilities…

As netizens compared experiences, "it started getting worse a few weeks ago" became the consensus.

A storm of public opinion formed simultaneously in technical communities such as Hacker News, Reddit, and Twitter.

Eventually the company could no longer sit still.

OpenAI developer advocate Logan Kilpatrick stepped forward to answer one netizen's question:

The API does not change without us notifying you. The models there are static.

Still uneasy, netizens pressed for confirmation, "So GPT-4 has been static since its release on March 14, right?", and got an affirmative answer from Logan.

The follow-up "I've noticed inconsistent performance on some prompts; is that just the inherent instability of large models?" likewise got a "Yes".

As of now, however, two questions asking whether the web version of GPT-4 has been downgraded remain unanswered, even though Logan has posted other content in the meantime.

So how do things actually stand? Better to test for ourselves.

Since netizens most often complained that GPT-4's code-writing had deteriorated, we ran a simple experiment.

Hands-on test: has GPT-4's "alchemy" (neural-network training) ability declined?

In late March, we had GPT-4 try its hand at "alchemy": writing a multilayer perceptron in Python to implement an XOR gate.

(ShareGPT screenshot; the interface looks slightly different)

When we then asked GPT-4 to use plain numpy instead of a framework, its first attempt was wrong.

It took two rounds of fixes before the code ran correctly: first changing the number of hidden neurons, then switching the activation function from sigmoid to tanh.

On June 2 we gave GPT-4 the same task again, this time with the prompt in Chinese.

This time GPT-4 skipped the framework on its first attempt, but the code it produced was still wrong.

In the follow-up, a single fix yielded the correct result: the brute-force approach of simply increasing the number of training epochs and the learning rate. A working numpy-only solution is sketched below.
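
For reference, here is a minimal numpy-only sketch of the kind of code that ends up working: one tanh hidden layer trained by plain gradient descent with a generous epoch count and learning rate. This is our own illustration, not GPT-4's verbatim output, and the hyperparameters (4 hidden neurons, lr = 0.5, 10,000 epochs) are assumptions.

```python
# Minimal numpy-only MLP that learns the XOR gate.
# Our own sketch, not GPT-4's output; hyperparameters are assumptions.
import numpy as np

np.random.seed(0)

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer: tanh hidden activation, sigmoid output
hidden = 4
W1 = np.random.randn(2, hidden)
b1 = np.zeros((1, hidden))
W2 = np.random.randn(hidden, 1)
b2 = np.zeros((1, 1))

lr = 0.5
for epoch in range(10000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))

    # backward pass for a mean-squared-error loss
    d_out = (out - y) * out * (1 - out)      # sigmoid derivative
    d_h = (d_out @ W2.T) * (1 - h ** 2)      # tanh derivative

    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3))  # should approach [[0], [1], [1], [0]]
```

The zero-centered tanh hidden layer tends to converge more reliably on XOR than a small sigmoid network, which is consistent with the fix that worked in the March run.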

We observed no clear decline in the quality of the textual part of the answers, though the responses did feel faster.

Time allowed for only this one experiment, and given the inherent randomness of AI, we cannot rule out what netizens have observed.

Someone gave feedback as early as April 19

Searching OpenAI's official Discord server, we found that scattered users had been reporting since late April that GPT-4 was getting worse.

But this feedback sparked no widespread discussion and drew no official response.

The turning point came on May 31, when large numbers of netizens on both Hacker News and Twitter began discussing the issue on the same day.

One Hacker News user claimed GPT-4 was stronger back when its avatar was still black; the current purple-avatar version drops a few lines of code when asked to modify it.

Matt Shumer, CEO of HyperWrite (a writing tool built on the GPT API), had raised the question on Twitter even earlier.

His tweet resonated with many netizens, and the replies from OpenAI employees were directed at it.

Those responses did not satisfy everyone, however; on the contrary, the discussion spread wider and wider.

For example, a Reddit post complained that GPT-4, which used to handle code questions, now cannot even tell which part is the code and which part is the question.

Pressed by other netizens, the post's author walked through what had happened and attached the chat logs with GPT.

As for OpenAI's claim that the model has not changed since March, the public record indeed contains nothing to the contrary.

ChatGPT's release notes mention updates to the model itself on January 9, January 30, and February 13, covering improvements to factual accuracy and mathematical ability.

Since GPT-4's release on March 14, however, no model updates are mentioned, only web-app feature adjustments and additions such as browsing mode, plugin mode, and the Apple app.

Assuming, as OpenAI says, that the GPT-4 model itself has not changed, why do so many people feel its performance has deteriorated?

Many people have offered their own conjectures.

The first possible explanation is psychological.

François Chollet, the creator of Keras, argued that GPT's performance has not deteriorated; rather, the initial period of amazement has passed and people's expectations have risen.

Some Hacker News users shared this view, adding that people's focus has shifted, making them more sensitive to GPT's mistakes.

Psychology aside, some suspect that the API version and the web version may not be the same model, though there is no hard evidence.

Another guess is that when plugins are enabled, the extra prompt text the plugins inject may pollute the question being asked.

(Extra prompt text injected by the WebPilot plugin)

In this netizen's view, GPT's performance began deteriorating after the plugin feature entered public beta. A toy illustration of the idea follows.
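
As a rough illustration, suppose (this is a guess; OpenAI has not published the mechanism) that plugin tool descriptions are prepended to the conversation context. The model then sees unrelated instructions alongside every question. The wording below is invented for illustration and is not WebPilot's real description.

```python
# Hypothetical illustration of plugin text occupying the context.
# Neither the injection mechanism nor this wording is confirmed by OpenAI;
# WebPilot's actual description text differs.
plugin_preamble = (
    "Tool: WebPilot. When the user supplies a URL, call visitWebPage(url) "
    "and ground the answer in the returned content. Always cite sources."
)
user_question = "Why does my binary search loop forever when the array has duplicates?"

# Every turn, unrelated plugin instructions compete with the real question
# for the model's attention.
context = f"{plugin_preamble}\n\n{user_question}"
print(context)
```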

Others asked OpenAI employees a different question: even if the model itself is unchanged, have the inference parameters changed?

Qubit also happened to coax out of ChatGPT that its system prompt on iOS differs from the web version's:

  • If you start a conversation on the phone, it knows it is interacting with you through a phone.
  • It keeps responses to one or two sentences unless lengthy reasoning is required.
  • It will not use emoji unless you explicitly ask it to.

(Coaxing this out does not always succeed; it frequently refuses to answer)

So if, without realizing it, you continue a conversation started in the iOS app over in the web version, you may find GPT-4's answers becoming noticeably terser. How much a system prompt alone changes behavior is easy to check over the API, as sketched below.
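
A minimal sketch, assuming the 2023-era openai Python SDK (pre-1.0, which reads OPENAI_API_KEY from the environment); the "iOS-style" wording is reconstructed from the bullets above, not OpenAI's verbatim prompt.

```python
# Compare GPT-4's verbosity under two system prompts via the API.
# The "iOS-style" wording is our reconstruction, not OpenAI's actual prompt.
import openai  # openai < 1.0; picks up OPENAI_API_KEY from the environment

ios_style = (
    "You are ChatGPT, talking to a user on a mobile phone. "
    "Keep responses to one or two sentences unless lengthy reasoning is required. "
    "Do not use emoji unless explicitly asked."
)
web_style = "You are ChatGPT, a large language model trained by OpenAI."

question = {"role": "user", "content": "Explain what RLHF is."}

for name, system in (("ios", ios_style), ("web", web_style)):
    reply = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system}, question],
    )
    text = reply["choices"][0]["message"]["content"]
    print(f"[{name}] {len(text)} chars: {text[:120]}...")
```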

In short, whether GPT-4 has gotten dumber since release remains an unsolved mystery.

But one thing is certain:

The GPT-4 everyone has been playing with since March 14 was, from the very beginning, not as strong as the one in the paper.

Aligning with humans makes AI less capable

Microsoft Research's viral 150-plus-page paper "Sparks of Artificial General Intelligence: Early experiments with GPT-4" states plainly:

They obtained test access long before GPT-4's development was complete, and tested it over an extended period.

Later, netizens found that many of the paper's astonishing examples could not be reproduced with the public version of GPT-4.

A view now circulating in academia holds that while the later RLHF training made GPT-4 better aligned with humans, that is, better at following human instructions and conforming to human values, it also degraded the model's own reasoning and other abilities.

Microsoft scientist Zhang Yi, one of the paper's authors, said as much in episode S7E11 of the Chinese podcast "What's Next|科技早知道":

That version of the model was stronger than the GPT-4 anyone outside can get now. Much, much stronger.

For example, the Microsoft team describes in the paper how they periodically asked GPT-4 to draw a unicorn using TikZ in LaTeX, as a way of tracking changes in its capability over time.

The final result shown in the paper is drawn quite well.

But Sébastien Bubeck, the paper's first author, revealed more in a later talk at MIT:

Later, once OpenAI began focusing on safety, subsequent versions got worse and worse at this task.

Training methods that align AI with humans without lowering the ceiling of its own capabilities have become a research direction for many teams, though such work is still in its infancy.

Beyond professional research teams, AI-watching netizens are tracking capability changes with methods of their own.

One person has GPT-4 draw a unicorn once a day and records the results publicly on a website.

Since April 12, nothing resembling the general shape of a unicorn has appeared.

To be fair, the site's author has GPT-4 draw in SVG format, which differs from the TikZ format used in the paper.

And the April drawings look about as bad as the current ones, with no obvious regression. A minimal sketch of how such a daily tracker might work is below.
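
Assuming the same 2023-era openai Python SDK, a daily tracker like this needs only a few lines. The prompt wording and file layout below are our guesses, not gpt-unicorn's actual code.

```python
# Sketch of a daily "draw a unicorn" tracker (our guess at the approach;
# the real code behind gpt-unicorn.adamkdean.co.uk is not reproduced here).
import datetime
import openai  # openai < 1.0; picks up OPENAI_API_KEY from the environment

prompt = "Draw a unicorn. Reply with a single self-contained SVG document and nothing else."
reply = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
svg = reply["choices"][0]["message"]["content"]

# One file per day; comparing the files over time tracks capability drift.
path = f"unicorn-{datetime.date.today().isoformat()}.svg"
with open(path, "w") as f:
    f.write(svg)
print(f"saved {path}")
```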

Finally, over to you: are you a GPT-4 user? Have you felt GPT-4's abilities decline in recent weeks? Tell us in the comments.

Bubeck's talk: www.youtube.com/watch?v=qbI…
Zhang Yi's interview: xyzfm.link/s/UfTan0
A GPT-4 unicorn a day: gpt-unicorn.adamkdean.co.uk

Reference links:
[1] news.ycombinator.com/item?id=361…
[2] twitter.com/nabeelqu/st…
[3] twitter.com/OfficialLog…
[4] discord.com/channels/97…
[5] twitter.com/mattshumer_…
[6] www.reddit.com/r/ChatGPT/c…
[7] help.openai.com/en/articles…
[8] twitter.com/fchollet/st…
[9] news.ycombinator.com/item?id=361…
