Breaking: OpenAI officially launches the multimodal GPT-4

Hello everyone. Today's article was translated by my good friend Alpha Rabbit, who stayed up into the early morning studying OpenAI's newly launched GPT-4. She has read through essentially all the key points of the published material and shares them here; hopefully they offer some inspiration.

Author | OpenAI & The Verge & TechCrunch

Translation & Analysis | Alpha Rabbit

01

Highlights

  • This article is about 6,000 words.

  • GPT-4 can accept image and text input, while GPT-3.5 only accepts text.

  • GPT-4 achieves "human-level" performance on various professional and academic benchmarks; for example, it passed a simulated bar exam with a score in the top 10% of test takers.

  • It took OpenAI 6 months to iteratively refine GPT-4, using lessons from its adversarial testing program and from ChatGPT.

  • In simple chat, the difference between GPT-3.5 and GPT-4 may be subtle, but once the complexity of a task reaches a sufficient threshold, the difference shows: GPT-4 is more reliable and more creative than GPT-3.5, and can handle more nuanced instructions.

  • GPT-4 can describe and interpret relatively complex images, such as identifying a Lightning Cable adapter in a photo of one plugged into an iPhone (pictured below).

  • Image understanding is not yet available to all of OpenAI's customers; OpenAI is testing it with its partner Be My Eyes.

  • OpenAI admits that GPT-4 is not perfect: it still confabulates facts, makes some reasoning errors, and is occasionally overconfident.

  • OpenAI is open-sourcing OpenAI Evals, a framework for creating and running benchmarks that evaluate models like GPT-4 while inspecting their performance sample by sample.

02

Official Announcement

OpenAI has officially launched GPT-4, the latest milestone in OpenAI's effort to scale up deep learning. GPT-4 is a large multimodal model (it accepts image and text inputs and emits text outputs). While it is less capable than humans in many real-world scenarios, it exhibits human-level performance on a variety of professional and academic benchmarks.

For example, GPT-4 passed a simulated bar exam with a score around the top 10% of test takers; by contrast, GPT-3.5's score was around the bottom 10%. OpenAI spent 6 months iteratively aligning GPT-4, using lessons from its adversarial testing program as well as from ChatGPT. The result is GPT-4's best-ever performance on factuality, steerability, and refusing to go outside of guardrails, though it is still far from perfect.

Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, OpenAI trained GPT-3.5 as a first "test run" of the whole system; we found and fixed some bugs and improved our theoretical foundations. As a result, the GPT-4 training run was (for us, at least!) unprecedentedly stable, making it our first large model whose training performance we could accurately predict ahead of time. As we continue to focus on reliable scaling, an intermediate goal is to hone methods that help OpenAI keep predicting and preparing for future capabilities, something we believe is critical for safety.

We are releasing GPT-4's text input capability via ChatGPT and the API (there is a waitlist you can join), and we are collaborating closely with a partner to start making the image input capability more widely available. We also plan to open source OpenAI Evals, our framework for automated evaluation of AI model performance, so that anyone can report shortcomings in our models and help guide further improvements.

03

Capabilities

In casual conversation, the difference between GPT-3.5 and GPT-4 can be hard to spot. However, once the complexity of a task reaches a sufficient threshold, the difference emerges: GPT-4 is more reliable and more creative than GPT-3.5 and can handle much more nuanced instructions.

To understand the difference between the two models, we tested them on a variety of benchmarks, including simulated exams originally designed for humans. We used the most recent publicly available tests (for the Olympiads, AP exams, and so on) or purchased 2022-2023 editions of practice exams. We did no specific training for these exams; a minority of the problems in them were seen by the model during training, but we believe the results below are representative.

[Figure: GPT-4 vs. GPT-3.5 performance on simulated exams]

We also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 considerably outperforms existing large language models, and matches or exceeds most state-of-the-art (SOTA) models, which often rely on benchmark-specific crafting or additional training protocols.

[Figure: GPT-4 performance on traditional ML benchmarks]

Since most existing ML benchmarks are written in English, to get an initial sense of capabilities in other languages we used Azure Translate to translate the MMLU benchmark (a suite of 14,000 multiple-choice questions spanning 57 subjects) into a variety of languages. In 24 of the 26 languages tested, GPT-4 outperformed the English-language performance of GPT-3.5 and of other large models (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili.

[Figure: GPT-4 MMLU accuracy across the 26 translated languages]
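To make the methodology concrete, here is a minimal sketch (our own illustration, not OpenAI's code) of evaluating a translated multiple-choice benchmark. The `translate` and `ask_model` helpers are hypothetical stand-ins for an Azure Translate call and a chat-completion request.

```python
# Hypothetical sketch of scoring a translated multiple-choice benchmark.
# `translate` stands in for an Azure Translate call and `ask_model` for a
# chat-completion request; both are assumptions, not real APIs.
def evaluate_translated_mmlu(questions, target_lang, translate, ask_model):
    correct = 0
    for q in questions:  # each q: {"question": str, "choices": [str]*4, "answer": "A".."D"}
        prompt = translate(q["question"], target_lang) + "\n"
        for letter, choice in zip("ABCD", q["choices"]):
            prompt += f"{letter}. {translate(choice, target_lang)}\n"
        prompt += "Answer with a single letter."
        reply = ask_model(prompt).strip().upper()
        correct += reply[:1] == q["answer"]  # simplified exact-letter scoring
    return correct / len(questions)  # accuracy in the target language
```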

We have been using GPT-4 internally and have found it has a large impact on functions such as support, sales, content moderation, and programming. We are also using it to assist humans in evaluating AI outputs, beginning the second phase of our alignment strategy.

04

Visual Input

GPT-4 can accept prompts containing both text and images, which, parallel to the text-only setting, lets the user specify any vision or language task. It generates text outputs (natural language, code, etc.) given inputs consisting of interspersed text and images: documents with text and photographs, diagrams, or screenshots. On such inputs, GPT-4 exhibits capabilities similar to those it shows on text-only inputs, and it can also be augmented with the test-time techniques developed for text-only language models, including few-shot and chain-of-thought (CoT) prompting. Note that image input is still a research preview, with no public consumer product yet.

The image below shows the packaging of a "Lightning Cable" adapter in three panels. Asked what is funny about it, GPT-4 responded:

[Image: three-panel photo of the "Lightning Cable" adapter packaging]

Panel 1: A smartphone with a VGA connector (the large, blue, 15-pin connector typically used for computer monitors) plugged into its charging port.

Panel 2: The packaging for the "Lightning Cable" adapter, with a picture of a VGA connector on it.

Panel 3: A close-up of the VGA connector, ending in a small Lightning connector (used to charge iPhones and other Apple devices).

The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port.
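Image input had no public API at the time of this announcement, so the following is purely an illustrative sketch: the content-list message format and the `gpt-4-vision-preview` model name follow what OpenAI exposed later, and are assumptions here rather than part of this launch.

```python
# Illustrative only: image input was a research preview with no public API
# at launch. The content-list format and "gpt-4-vision-preview" model name
# follow what OpenAI exposed later; both are assumptions relative to this
# announcement. Assumes OPENAI_API_KEY is set in the environment.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4-vision-preview",  # hypothetical at launch time
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is funny about this image? Describe it panel by panel."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/lightning-adapter.jpg"}},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```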

We preview GPT-4's performance by evaluating it on a narrow suite of standard academic vision benchmarks. These numbers do not, however, fully represent the scope of its capabilities, as we keep discovering new and exciting tasks the model can handle. OpenAI plans to release further analyses and evaluation numbers soon, along with a thorough investigation of the effect of test-time techniques.

05

Steerability

We have been working on each aspect of the plan outlined in our post on defining AI behavior, including steerability. Rather than the fixed verbosity, tone, and style of the classic ChatGPT personality, developers (and soon all ChatGPT users) can now prescribe their AI's style and task by describing those directions in a "system" message. System messages let API users significantly customize the user experience within bounds, and we will continue to make improvements here.
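As a minimal sketch of steering via a system message (using the ChatCompletions API from the pre-1.0 `openai` Python library; the Socratic-tutor instruction is adapted from OpenAI's own examples, and the script assumes `OPENAI_API_KEY` is set in the environment):

```python
# Minimal sketch: a "system" message prescribes the assistant's style and task.
# Uses the pre-1.0 openai library; OPENAI_API_KEY is read from the environment.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a Socratic tutor. Never give the student the "
                       "answer; always respond with a guiding question instead.",
        },
        {"role": "user", "content": "How do I solve the equation 3x + 5 = 14?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```

The same request with a different system message yields a completely different persona, which is the "customization within bounds" described above.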

06

Limitations

Despite its impressive capabilities, GPT-4 has limitations similar to those of earlier GPT models. Most importantly, it is still not fully reliable: it can "hallucinate" facts and make reasoning errors. Great care should be taken when using language-model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matched to the needs of the specific use case.

While still a real issue, GPT-4 significantly reduces hallucinations (confidently stated fabrications, in the worst case outright nonsense) relative to previous models, which have themselves kept improving. On our internal, adversarially designed factuality evaluations, GPT-4 scores 40% higher than our latest GPT-3.5.

[Figure: GPT-4 vs. earlier ChatGPT models on internal adversarial factuality evaluations]


On external benchmarks like TruthfulQA, which tests the ability to separate fact from an adversarially chosen set of incorrect statements, the GPT-4 base model is only slightly better than GPT-3.5; after RLHF post-training (applying the same process we used with GPT-3.5), however, the gap is large. The model can still exhibit various biases in its outputs; we have made progress on these, but there is more to do. Per our recent blog post, our goal is for the AI systems we build to have sensible default behaviors that reflect a wide range of users' values, to allow those systems to be customized within broad bounds, and to get public input on what those bounds should be.

GPT-4 generally lacks knowledge of events after September 2021, when the vast majority of its data cuts off, and it does not learn from experience. It can make simple reasoning errors that seem out of line with its competence across so many domains, and it can be overly gullible in accepting obviously false statements from a user. It can also fail at hard problems the same way humans do, such as introducing security vulnerabilities into the code it produces, and it can be confidently wrong in its predictions.

07

Risks and Mitigations

We have been iterating on GPT-4 from the very start of training to make it safer and more aligned. Our efforts include selecting and filtering the pre-training data, evaluations, engaging outside experts, model safety improvements, and monitoring and enforcement.

GPT-4 poses risks similar to those of previous models, such as generating harmful advice, buggy code, or inaccurate information. However, GPT-4's additional capabilities also create new risk surfaces. To understand these risks in detail, we engaged more than 50 experts in fields such as AI alignment risks, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model. Their participation let us probe the model's behavior in high-risk domains that require expertise to evaluate, and their feedback and data fed into our mitigations and improvements to the model. For example, we gathered additional data to improve GPT-4's ability to refuse requests about how to synthesize dangerous chemicals.

To reduce harmful outputs (as defined by our usage guidelines), GPT-4 incorporates an additional safety reward signal during RLHF training, teaching the model to refuse requests for such content. The reward is provided by a GPT-4 zero-shot classifier that judges safety boundaries and completion style on safety-related prompts. To prevent the model from refusing valid requests, we collect a diverse dataset from various sources (e.g., labeled production data, human red-teaming, model-generated prompts) and apply the safety reward signal (with a positive or negative value) on both allowed and disallowed categories.
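Conceptually (this is our own reconstruction in pseudocode-style Python, not OpenAI's implementation), the combined signal might look like the sketch below, where `preference_reward` is the standard RLHF reward model and `safety_classifier` is the GPT-4-based zero-shot judge described above:

```python
# Our own illustrative reconstruction of the idea, not OpenAI's code: add a
# classifier-based safety signal on top of the learned preference reward.
def combined_reward(prompt, completion, preference_reward, safety_classifier):
    r = preference_reward(prompt, completion)        # standard RLHF reward model
    verdict = safety_classifier(prompt, completion)  # hypothetical GPT-4-based judge
    if verdict == "refused_disallowed":
        r += 1.0  # positive signal: correctly refused disallowed content
    elif verdict == "refused_allowed":
        r -= 1.0  # negative signal: over-refusal of a valid request
    elif verdict == "complied_disallowed":
        r -= 1.0  # negative signal: produced disallowed content
    return r
```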

Our mitigations significantly improve many of GPT-4's safety properties relative to GPT-3.5: we have decreased the model's tendency to respond to requests for disallowed content by 82%, and GPT-4 responds to sensitive requests (such as medical advice and self-harm) in accordance with our policies 29% more often.

Overall, our model-level interventions make it harder to elicit bad behavior, but "jailbreaks" that produce content violating our usage guidelines still exist. As the risks posed by AI systems grow, achieving extremely high reliability in these interventions will become critical; for now, it is important to complement these model-level limits with deployment-time safety techniques such as monitoring for abuse.

GPT-4 and successor models have the potential to influence society in both beneficial and harmful ways. We are collaborating with external researchers to improve how we understand and assess potential impacts, and to build evaluations for dangerous capabilities that may emerge in future systems. We will soon share more of our thinking on the potential social and economic impacts of GPT-4 and other AI systems.

08

Training Process

Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document, using both publicly available data (such as internet data) and data we have licensed. This web-scale corpus includes correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and a great variety of ideologies and ideas.

Thus, when prompted with a question, the underlying model can respond in a variety of ways that may be far from what the user intended. To align it with the user's intent, we fine-tune the model's behavior using reinforcement learning with human feedback (RLHF).

Note that the model's capabilities seem to come primarily from the pre-training process: RLHF does not improve exam performance (and without active effort it actually degrades it). Steering of the model, however, comes from the post-training process; the base model requires prompt engineering even to know that it should answer the question.

09

Predictable Scaling

A major focus of the GPT-4 project was building a deep learning stack that scales predictably, chiefly because for very large training runs like GPT-4's it is not feasible to do extensive model-specific tuning. We developed infrastructure and optimization methods that behave very predictably across multiple scales. To verify this, we accurately predicted in advance GPT-4's final loss on our internal codebase (not part of the training set) by extrapolating from models trained with the same methodology but using 10,000 times less compute.
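The announcement does not give the exact functional form, but the flavor of such an extrapolation can be shown with a simple power-law fit. The sketch below is our own illustration with made-up numbers, not OpenAI's method.

```python
# A self-contained sketch (not OpenAI's actual code): fit a power law with an
# irreducible floor, L(C) = a * C^(-b) + floor, to the final losses of small
# training runs, then extrapolate to a much larger compute budget.
# All numbers below are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, a, b, floor):
    # Loss decays as a power of (normalized) compute toward a floor.
    return a * c ** (-b) + floor

# Hypothetical (compute in FLOPs, final loss) pairs from small runs.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.10, 2.62, 2.28, 2.04])

x = compute / compute.min()  # normalize to keep the fit well-conditioned
params, _ = curve_fit(scaling_law, x, loss, p0=[1.5, 0.2, 1.5], maxfev=10_000)

# Extrapolate four orders of magnitude beyond the largest small run.
target = 1e25 / compute.min()
print(f"Predicted final loss at 1e25 FLOPs: {scaling_law(target, *params):.3f}")
```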

We believe that accurately predicting future machine learning capabilities is an important part of safety that has received far too little attention relative to its potential impact (though we have been encouraged by efforts at several institutions). We are scaling up our work on methods that give society better guidance about what to expect from future systems, and we hope this becomes a common goal in the field.

10

OpenAI Evals

We are open-sourcing OpenAI Evals, our software framework for creating and running benchmarks that evaluate models like GPT-4 while inspecting their performance sample by sample. We use Evals to guide the development of our models (identifying shortcomings and preventing regressions), and our users can apply it to track performance across model versions (which will now be released regularly) and evolving product integrations. For example, Stripe has used Evals to complement its human evaluations and measure the accuracy of its GPT-powered documentation tool.

Because the code is open source, Evals supports writing new classes to implement custom evaluation logic. In our own experience, however, many benchmarks follow one of a few "templates," so we also include the templates that have been most useful internally (including a template for "model-graded evals"; we have been impressed by GPT-4's ability to check its own work). In general, the most effective way to build a new eval is to instantiate one of these templates and provide data, as sketched below. We're excited to see what others build with these templates and with Evals more broadly.
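For a flavor of what instantiating a template looks like, here is a rough sketch based on the public Evals repository at release; the eval name `arithmetic` and the file paths are our own illustrative choices. A basic exact-match eval needs little more than a JSONL file of samples and a registry entry.

```python
# Rough sketch (names/paths illustrative, based on the public Evals repo):
# write a samples file and a registry entry for the built-in Match template,
# then run the eval with the oaieval CLI.
from pathlib import Path

# One chat-formatted sample per line; "ideal" is the expected completion.
samples = '''\
{"input": [{"role": "user", "content": "What is 2+2? Answer with just the number."}], "ideal": "4"}
{"input": [{"role": "user", "content": "What is 7*6? Answer with just the number."}], "ideal": "42"}
'''

# Registry entry pointing the built-in exact-match class at the samples file.
registry = '''\
arithmetic:
  id: arithmetic.dev.v0
  metrics: [accuracy]
arithmetic.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic/samples.jsonl
'''

Path("evals/registry/data/arithmetic").mkdir(parents=True, exist_ok=True)
Path("evals/registry/data/arithmetic/samples.jsonl").write_text(samples)
Path("evals/registry/evals/arithmetic.yaml").write_text(registry)

# Then, from the repo root:  oaieval gpt-3.5-turbo arithmetic
```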

We hope Evals becomes a vehicle for sharing and crowdsourcing benchmarks that represent the widest possible set of failure modes and difficult tasks. As an example to follow, we have created a logic-puzzles eval containing ten prompts where GPT-4 fails. Evals is also compatible with existing benchmarks; we have included several notebooks implementing academic benchmarks, plus a few variations integrating small subsets of CoQA, as examples.

We invite everyone to test our models with Evals and submit your most interesting examples. We believe Evals will be an integral part of the process of using and building on our models, and we welcome direct contributions, questions, and feedback.

11

ChatGPT Plus

ChatGPT Plus subscribers will get GPT-4 access on chat.openai.com with a usage cap. We will adjust the exact cap based on real-world demand and system performance, but we expect capacity to be severely constrained (though we will scale up and optimize over the coming months).

Depending on the traffic patterns we see, we may introduce a new subscription tier for higher-volume GPT-4 usage; we also hope at some point to offer some amount of free GPT-4 queries so that those without a subscription can try it too.

API

To get access to the GPT-4 API (which uses the same ChatCompletions API as gpt-3.5-turbo), please sign up for OpenAI's official waitlist.
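A minimal sketch of a call (using the pre-1.0 `openai` Python library current at the time; assumes your account has been granted access from the waitlist and `OPENAI_API_KEY` is set in the environment):

```python
# Because GPT-4 uses the same ChatCompletions API as gpt-3.5-turbo, an
# existing integration only needs the model name changed once access is
# granted. Pre-1.0 openai library; OPENAI_API_KEY read from the environment.
import openai

def ask(question: str, model: str = "gpt-3.5-turbo") -> str:
    response = openai.ChatCompletion.create(
        model=model,  # swap in "gpt-4" once off the waitlist
        messages=[{"role": "user", "content": question}],
        temperature=0.2,  # lower temperature for more deterministic answers
        max_tokens=256,
    )
    return response["choices"][0]["message"]["content"]

print(ask("What kinds of input does GPT-4 accept?", model="gpt-4"))
```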

12

Conclusion

We look forward to GPT-4 becoming a valuable tool that improves people's lives by powering many applications. There is still a lot of work to do, and we look forward to improving the model through the collective efforts of the community building on top of it, exploring it, and contributing to it.

Reposted from "Alpha Rabbit Research Notes"

References:

1. https://openai.com/research/gpt-4
2. https://techcrunch.com/2023/03/14/openai-releases-gpt-4-ai-that-it-claims-is-state-of-the-art/
3. https://www.theverge.com/2023/3/14/23638033/openai-gpt-4-chatgpt-multimodal-deep-learning
