Mistral AI releases 7.3 billion parameter model, "crushing" Llama 2 13B

French artificial intelligence startup Mistral AI announced the launch of its first large language model, Mistral 7B, which it claims is the most powerful language model of its size to date. The model is open source under the Apache 2.0 license and can be used completely free of charge, without any restrictions.

Mistral AI, a six-month-old startup, raised $118 million in seed funding in June, said to be the largest seed round in European history. Mistral 7B has 7.3 billion parameters. The company claims that Mistral 7B performs significantly better than Llama 2 7B and 13B, and on par with Llama 34B, on benchmarks covering a range of tasks.

On the Massive Multitask Language Understanding (MMLU) test, which covers 57 subjects including mathematics, US history, computer science, and law, Mistral 7B scored 60.1% accuracy, while Llama 2 7B and 13B scored 44.4% and 55.6% respectively.

Mistral 7B also outperformed both Llama 2 models in accuracy on commonsense reasoning and reading comprehension tests.

The only area where Llama 2 13B was on par with Mistral 7B is the world knowledge benchmark, which Mistral says "may be due to the limited number of parameters in Mistral 7B, which limits the amount of knowledge it can compress."

On coding tasks, although Mistral claims that Mistral 7B's performance is greatly improved, the benchmark results show that it still does not surpass the fine-tuned CodeLlama 7B. On 0-shot HumanEval and 3-shot MBPP, CodeLlama 7B scored 31.1% and 52.5% respectively, while Mistral 7B scored 30.5% and 47.5%.

Mistral AI says that Mistral 7B uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle longer sequences at lower cost.
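
To make the GQA idea concrete, here is a minimal sketch of grouped-query attention in PyTorch. The tensor sizes and head counts are illustrative assumptions, not Mistral 7B's actual configuration; the point is that several query heads share a single key/value head, which shrinks the key/value cache and speeds up inference.

```python
# A minimal sketch of grouped-query attention (GQA): several query heads share
# one key/value head, which shrinks the key/value cache and speeds up decoding.
# Sizes below are illustrative assumptions, not Mistral 7B's real configuration.
import torch

batch, seq_len, d_model = 1, 8, 64
n_q_heads, n_kv_heads = 8, 2            # 4 query heads share each KV head
head_dim = d_model // n_q_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head so that a whole group of query heads can reuse it.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)    # (batch, n_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)                          # torch.Size([1, 8, 8, 8])
```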

"Mistral 7B employs SWA, where each layer focuses on the previous 4096 hidden states. The main improvement, and the reason for the original study, is the linear computational cost of O(sliding_window.seq_len). In practical applications, this is done with FlashAttention and xFormers The change resulted in a 2x speedup with a sequence length of 16k and a window of 4k."

In addition, the company plans to build on this work and release a larger model, capable of better reasoning and supporting multiple languages, in 2024.

More details can be found in the official announcement.


Source: www.oschina.net/news/259954/mistral-ai-mistral-7b