Available for free! ChatGPT's strongest competitor is here


Source: Machine Heart (机器之心)

This release brings not only a wave of capability upgrades for Claude 2 but, more importantly, public access: everyone can now use it.

Today, Claude, the AI system that many netizens call "ChatGPT's strongest competitor," received a major version update.

Claude 2 is officially released!

According to Anthropic, Claude 2's abilities in writing code, analyzing text, and mathematical reasoning have all been enhanced, and it can generate longer responses.

What's more, users can try it out for free on the new beta site, and the commercial Claude 2 API is priced the same as version 1.3.
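For developers, moving to the new model is largely a matter of swapping the model name. Below is a minimal sketch of a completion call using the official anthropic Python SDK as it existed around the Claude 2 launch; the prompt text and token limit are illustrative, and the SDK interface may have changed since then.

```python
# Minimal sketch of calling Claude 2 through the anthropic Python SDK
# (as of the Claude 2 launch); requires ANTHROPIC_API_KEY in the environment.
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # reads the API key from ANTHROPIC_API_KEY

completion = client.completions.create(
    model="claude-2",              # same per-token pricing as claude-1.3
    max_tokens_to_sample=300,      # illustrative output limit
    prompt=f"{HUMAN_PROMPT} Summarize the key changes in Claude 2.{AI_PROMPT}",
)
print(completion.completion)
```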


Machine Heart has covered Claude many times in previous articles. It was created by Anthropic, a company founded by former OpenAI employees. Two months after ChatGPT's release, the company quickly rolled out Claude, which can perform tasks such as summarization, search, writing assistance, Q&A, and coding.

Anthropic continued to upgrade the model after that, expanding Claude's context window from 9K tokens to 100K tokens with the "100K Context Windows" release in May.

Now the model has finally received a major version update. Anthropic says Claude 2 has been improved based on feedback from users of earlier versions.

Next, let's look at the capability improvements in detail.

In what ways has Claude 2 been enhanced?

Overall, Claude 2 focuses on improving the following abilities:

  • Anthropic has worked to improve Claude's capabilities as a coding assistant, and Claude 2 has significantly improved performance on coding benchmarks and human feedback evaluations.

  • The long-context model is especially useful for handling long documents and few-shot prompts, and for following complex instructions and specifications. Claude's context window has been extended from 9K to 100K tokens (Claude 2 itself has been extended to 200K tokens, but the current release only supports 100K).

  • Previous models were trained to write fairly short responses, but many users requested longer outputs. Claude 2 is trained to generate coherent documents of up to 4000 tokens, which equates to about 3000 words.

  • Claude is commonly used to convert long, complex natural-language documents into structured data formats. Claude 2 is trained to be better at producing correctly formatted output in JSON, XML, YAML, code, and Markdown (see the sketch after this list).

  • Although Claude's training data is still mainly English, the proportion of non-English data in Claude 2's training data has increased significantly.

  • Claude 2's training data includes updated data from 2022 and early 2023. This means it is aware of more recent events, but it may still produce confabulations.
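As a rough illustration of the structured-output use case above, the hypothetical snippet below asks the model for JSON and validates the reply locally. The prompt wording and field names are invented for the example; it reuses the completion call sketched earlier.

```python
# Hypothetical example: ask Claude 2 for structured JSON output and validate
# it with the standard library. The prompt and field names are illustrative.
import json
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()

prompt = (
    f"{HUMAN_PROMPT} Extract the product name, price, and currency from this "
    f"sentence as a JSON object with keys 'name', 'price', 'currency':\n"
    f"'The new Widget Pro sells for 49.99 US dollars.'\n"
    f"Reply with JSON only.{AI_PROMPT}"
)
completion = client.completions.create(
    model="claude-2", max_tokens_to_sample=200, prompt=prompt
)

record = json.loads(completion.completion)  # raises ValueError if the reply is not valid JSON
print(record["name"], record["price"], record["currency"])
```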

Anthropic ran a series of evaluation experiments to measure Claude 2's performance, covering two parts: alignment evaluation and capability evaluation.

For model alignment, the study assessed three key requirements for large models: generating useful content that follows instructions (helpfulness), generating harmless content (harmlessness), and generating accurate and truthful content (honesty).

Human Feedback Evaluation

A large model should follow the instructions humans provide, so that its outputs meet the requirements and are practically useful. To measure this, the study evaluated Claude 2, Claude 1.3, and Claude Instant 1.1 with human feedback, scored using the classic game-rating metric, the Elo score. The results are shown in Figure 1 below:

[Figure 1: Elo scores from human feedback evaluation of Claude 2, Claude 1.3, and Claude Instant 1.1]
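For readers unfamiliar with Elo ratings, the sketch below shows the generic Elo update rule, which turns pairwise human preferences ("model A's answer was preferred over model B's") into per-model ratings. This is the textbook formula rather than Anthropic's exact procedure, and the starting ratings and comparison stream are made up.

```python
# Minimal sketch of turning pairwise human preferences into Elo ratings;
# generic Elo update rule, not Anthropic's exact evaluation procedure.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """Update both ratings after one human preference comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Example: start both models at 1000 and replay a made-up comparison stream.
ratings = {"claude-2": 1000.0, "claude-1.3": 1000.0}
comparisons = [("claude-2", "claude-1.3", True),
               ("claude-2", "claude-1.3", False),
               ("claude-2", "claude-1.3", True)]
for a, b, a_won in comparisons:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)
print(ratings)
```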

Bias Assessment

The Bias Benchmark for QA (BBQ) is a commonly used benchmark for evaluating a model's bias against demographic groups. The study evaluated the models on BBQ; the results are shown in Figure 2 below:

[Figure 2: BBQ bias scores for the evaluated models]

Figure 3 below shows the models' accuracy on BBQ questions in the disambiguated context. Notably, the Claude models' accuracy is lower than that of the Helpful-Only model because they refuse to answer some biased questions.

[Figure 3: Accuracy on BBQ questions in the disambiguated context]

Factual Assessment

Large models sometimes generate false or misleading information, so it is important to test the factuality of their outputs. TruthfulQA is a benchmark for evaluating the accuracy and truthfulness of language model output in an adversarial setting. The results are shown in Figure 4 below:

[Figure 4: TruthfulQA results for the evaluated models]

Overall, Claude 2's performance on the HHH (helpfulness, harmlessness, honesty) evaluation is shown in Figure 6 below:

[Figure 6: HHH evaluation results]

For capability evaluation, the study tested Claude 2 on multilingual translation, long-context use, standard benchmarks, and qualification-level examinations.

Multilingual Translation

The study chose FLORES-200, a translation benchmark covering more than 200 languages (including low-resource languages), to evaluate Claude 2's multilingual translation capabilities. The results for Claude 2, Claude 1.3, and Claude Instant 1.1 are shown in Figure 7 below:

[Figure 7: FLORES-200 multilingual translation results]

Context Window

Earlier this year, the research team expanded Claude's context window from 9K tokens to 100K tokens, and Claude 2 further extends the context window to 200K tokens, equivalent to roughly 150,000 words.

To show that Claude 2 actually makes use of the full context, the study measured the loss at each token position, averaged over 1,000 long documents, as shown in Figure 8 below:

[Figure 8: Average loss by token position over 1,000 long documents]

However, the research team said that the currently released version supports only a 100K-token context window; the full context window will be rolled into their products later.
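As a rough illustration of this kind of measurement, the sketch below computes per-position token loss for an open causal language model with Hugging Face Transformers and averages it across documents. The model name is a stand-in; this is not Anthropic's internal evaluation code.

```python
# Sketch: average per-position token loss for a causal LM over a set of long
# documents, in the spirit of the Figure 8 measurement. "gpt2" is a placeholder
# open model; Anthropic's actual evaluation code is not public.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def per_position_loss(text: str, max_len: int = 1024) -> torch.Tensor:
    ids = tok(text, return_tensors="pt", truncation=True, max_length=max_len).input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Negative log-likelihood of each token given all preceding tokens.
    return torch.nn.functional.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none"
    )

docs = ["Long document text goes here."]  # replace with real long documents
losses = [per_position_loss(d) for d in docs]
min_len = min(len(l) for l in losses)
avg_by_position = torch.stack([l[:min_len] for l in losses]).mean(dim=0)
print(avg_by_position)
```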

Standard Benchmark Evaluation

The study evaluated Claude 2, Claude Instant 1.1, and Claude 1.3 on several standard benchmarks: Codex HumanEval for Python function synthesis, GSM8k for grade-school math problems, MMLU for multidisciplinary Q&A, QuALITY for Q&A on stories, ARC-Challenge for science questions, TriviaQA for reading comprehension, and RACE-H for reading comprehension and reasoning at the high-school level. The results are shown in the table below:

[Table: Standard benchmark results for Claude 2, Claude Instant 1.1, and Claude 1.3]

Notably, Claude 2's code generation has improved significantly: its score on Codex HumanEval rose from 56% to 71.2%.
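For context, Codex HumanEval scores are computed by executing each generated function against the problem's unit tests and counting the fraction of problems solved. The toy harness below sketches that idea with a made-up problem; the real benchmark runs untrusted generated code in a sandbox, which is omitted here.

```python
# Toy sketch of HumanEval-style functional-correctness scoring: a completion
# counts as passing only if the problem's unit tests all succeed when run
# against it. The problem and completion below are made up for illustration.
problems = [{
    "prompt": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n",
    "completion": "    return a + b\n",
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}]

def passes(problem: dict) -> bool:
    namespace: dict = {}
    try:
        # Execute the candidate function, then its unit tests.
        exec(problem["prompt"] + problem["completion"], namespace)
        exec(problem["test"], namespace)
        return True
    except Exception:
        return False

pass_rate = sum(passes(p) for p in problems) / len(problems)
print(f"pass@1 = {pass_rate:.1%}")
```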

Qualification-Level Exams

The study also tested Claude 2's practical ability on several common qualification-level exams.

First, Claude 2 scored 76.5 percent on the Bar Exam multiple-choice test, higher than Claude 1.3's 73.0 percent.

[Figure: Bar Exam multiple-choice results for Claude 2 and Claude 1.3]

Second, the research team also tested Claude 2 on the Graduate Record Examination (GRE). Claude 2 scored above the 90th percentile on the GRE reading and writing exams, and performed around the median of GRE test-takers on quantitative reasoning.

[Figure: GRE results for Claude 2]

Finally, the study also tested Claude 2 on questions from the United States Medical Licensing Examination (USMLE):

[Figure: USMLE question results for Claude 2]

Anthropic said that companies such as AI writing platform Jasper and code navigation tool Sourcegraph have begun incorporating Claude 2 into their operations.

Official examples and trial experience

Let's first look at some official examples provided by Anthropic.

1. Coding ability: adding interactive data to a static map.

2. Text processing ability: summarizing documents and outputting tables. Here Claude 2 makes use of its 100K-token context, and hundreds of pages of documents can be uploaded in the prompt window.

Beyond the official examples, Machine Heart also tried some examples of text analysis, mathematical reasoning, and code writing.


Trial address: http://claude.ai

First, we asked Claude 2 to summarize the main points of the Claude 2 technical documentation in outline form. The summary is very detailed and was helpful for writing this article.

[Screenshot: Claude 2's outline-style summary of the technical documentation]

Next, two math reasoning questions, both of which Claude 2 solved in a single pass.

[Screenshots: Claude 2's answers to the two math reasoning questions]

Finally, we tested some coding questions: generating, checking, and completing code.

[Screenshots: Claude 2 generating, checking, and completing code]

However, Claude 2 still lacks multimodal capabilities and cannot generate images.

[Screenshot: Claude 2 declining an image-generation request]
