One evaluation model + 10 questions: finding out the "family assets" of Pangu, Tongyi Qianwen, Wenxin Yiyan, and ChatGPT

Data intelligence industry innovation service media

—— Focusing on digital intelligence and business transformation


There is no doubt that a global arms race around large models is already underway, and no self-respecting technology giant wants to be absent. Yesterday, Alibaba opened internal testing of Tongyi Qianwen; today, Huawei announced the latest progress on its Pangu model. Not long ago, Baidu unveiled Wenxin Yiyan, and 360 has also announced a large-model product. ByteDance, Tencent, JD.com, NetEase, and others are all actively entering this race.

It is foreseeable that in 2023 we will witness the release of many large-model products, and may even get to try several of them ourselves. With so many similar products, which is better? The industry currently lacks a scientific, reasonable evaluation standard. For this reason, Data Ape has attempted to build an evaluation system for large-model products to assess the capabilities of comparable offerings.

Core Factors Affecting Large Model Performance

To make the evaluation system scientific and reasonable, we first need to understand which core factors affect the performance of a large-model product, and how those factors shape its final performance. On that basis we can construct a scoring system.

Assessing the capabilities of a large model requires weighing multiple considerations. The main evaluation factors are the following:

Dataset

The quality of the dataset directly affects the knowledge the model learns and its generalization ability. A high-quality dataset should be diverse, balanced, and of sufficient scale: diversity means the dataset contains texts from different domains, styles, and types; balance means the number of samples in each category is relatively even; scale refers to the overall size of the dataset.

A dataset is like the content of a course taught by a teacher. A high-quality course gives students a comprehensive grasp of knowledge across many fields, while a poor one may leave them familiar with only a few areas, resulting in an uneven knowledge structure.

Although many companies obtain their data from public sources, they may filter, clean, and enrich it to build datasets with their own characteristics.
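To make this concrete, here is a rough sketch in Python (with hypothetical field names, not any vendor's actual pipeline) of how the balance and diversity of a labeled corpus might be checked:

```python
from collections import Counter
import math

def dataset_stats(samples):
    """Report per-domain counts and a normalized entropy score for a corpus.

    `samples` is a list of dicts with hypothetical keys "text" and "domain".
    Normalized entropy is 1.0 for a perfectly balanced corpus and approaches
    0 when one domain dominates.
    """
    counts = Counter(s["domain"] for s in samples)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return {
        "domains": dict(counts),
        "balance": entropy / max_entropy,  # 1.0 = perfectly balanced
    }

corpus = [
    {"text": "Quarterly earnings rose...", "domain": "news"},
    {"text": "def train(model): ...",      "domain": "code"},
    {"text": "Once upon a time...",        "domain": "fiction"},
    {"text": "Stocks fell sharply...",     "domain": "news"},
]
print(dataset_stats(corpus))  # balance ~0.95: slightly news-heavy
```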

Model Architecture

The model architecture determines the model's basic structure and computation. It is like the structural design of a building: different designs yield different functions and performance. For example, the Transformer architecture provides a powerful ability to process long sequences of data, enabling models to better understand and generate language.

Different companies may adjust and optimize the architecture for their own needs and scenarios. For example, some may adopt a more efficient architecture to maintain good performance while reducing computing resource consumption.
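To make the mechanism concrete, here is a minimal, didactic NumPy sketch of single-head scaled dot-product self-attention, the core building block of the Transformer. Real implementations add multiple heads, masking, and learned projections inside much larger layers:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise position similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                               # each position mixes in the rest

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```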

Algorithm Optimization

The optimization algorithm adjusts the model's parameters during training to minimize the loss function. A suitable optimizer accelerates convergence and improves final model performance.

Different companies may adopt different fine-tuning strategies and objectives. Training-data selection, loss-function design, and the choice of optimization method in the fine-tuning stage all affect performance on specific tasks. Some companies also hold exclusive techniques and patents, such as model parallelism and gradient accumulation, which improve the efficiency and performance of model training.
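As an illustration of one such technique, the sketch below shows gradient accumulation in PyTorch: gradients from several micro-batches are summed before a single optimizer step, simulating a large batch on limited memory. This is a generic sketch with a toy model, not any company's internal code:

```python
import torch
from torch import nn

model = nn.Linear(128, 10)                       # toy stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                                  # effective batch = 4 micro-batches

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 128)                      # micro-batch of 8 samples
    y = torch.randint(0, 10, (8,))
    loss = loss_fn(model(x), y) / accum_steps    # scale so the sum is an average
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per 4 micro-batches
        optimizer.zero_grad()
```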

Parameter Scale

The parameter scale determines the complexity and learning capacity of the model. More parameters help the model learn more knowledge and features, but can also lead to overfitting.

Parameter scale is like a person's memory: the stronger the memory, the more knowledge can be retained. But if a person only memorizes mechanically and cannot apply knowledge flexibly, that memory is of little use. An appropriate parameter scale lets a model learn rich knowledge while keeping good generalization ability.
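For a sense of where such parameter counts come from, a GPT-style Transformer's size can be estimated from its depth and width. The sketch below uses the standard approximation of roughly 12·n_layers·d_model² weights in the attention and feed-forward blocks, plus embeddings, and reproduces GPT-3's published 175-billion figure from its 96 layers and hidden size of 12288:

```python
def transformer_params(n_layers, d_model, vocab_size=50257):
    """Rough parameter count for a GPT-style decoder-only Transformer.
    Per layer: ~4*d^2 for attention projections + ~8*d^2 for a 4x-wide FFN.
    """
    per_layer = 12 * d_model**2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# GPT-3's published configuration: 96 layers, hidden size 12288.
print(f"{transformer_params(96, 12288) / 1e9:.0f}B parameters")  # ~175B
```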

Computing Resources

Computing resources have a great impact on training speed and scalability: the more computing resources available, the faster a model can be trained. Training large models places heavy demands on chips, usually requiring high-performance hardware designed for deep learning, such as GPUs (graphics processing units) or TPUs (tensor processing units). For a model on the scale of 100 billion parameters, training may require hundreds to thousands of high-performance GPUs (such as NVIDIA V100s or A100s).

The consumption of computing resources is closely tied to parameter scale, dataset size, batch size, and the number of training epochs. Models with more parameters need more memory to store them and more computation per training step. The larger the dataset, the more data the model must process, increasing the total amount of training computation. Batch size is the number of samples fed to the model in each training iteration; a larger batch makes better use of GPU/TPU parallelism and speeds up training, but also consumes more GPU memory or RAM, so choosing an appropriate batch size is the key to balancing resource consumption against training speed. More training epochs mean more iterations, and resource consumption grows accordingly.
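The arithmetic behind these constraints is easy to sketch. The snippet below estimates the training-state memory of a model using the commonly cited rule of thumb of about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 optimizer states); the exact figure varies with the training setup, and activation memory, which scales with batch size and sequence length, is extra:

```python
def training_memory_gb(n_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam optimizer states under
    mixed precision (~2 + 2 + 12 bytes per parameter). Activations are extra.
    """
    return n_params * bytes_per_param / 1024**3

for name, n in [("BERT-large", 340e6), ("T5-11B", 11e9), ("GPT-3", 175e9)]:
    print(f"{name:10s} ~{training_memory_gb(n):8.0f} GB of training state")
# GPT-3's ~2,600 GB of training state alone explains why hundreds of
# 40-80 GB GPUs and model parallelism are needed.
```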

To sum up, from a technical point of view, the dataset, model architecture, parameter scale, algorithm optimization, and computing resources all have an important impact on a model's final performance. Model training can be compared to cooking: the dataset is the ingredients, and high-quality ingredients make a more delicious dish; the architecture is the cooking method, and the right method brings out the ingredients' character; the fine-tuning strategy is the seasoning, giving the dish its distinctive flavor; and proprietary techniques and patents are a chef's signature skills, letting a high-level dish be produced in less time.

Taking ChatGPT as an example, it has been optimized across dataset, model architecture, parameter scale, algorithm optimization, and computing resources, which is what gives it such striking performance. On the data side, in addition to large-scale web datasets, OpenAI's GPT-series models also incorporate domain-specific datasets to broaden knowledge coverage; in the fine-tuning stage, more refined task-specific datasets are used, such as dialogue data or domain-specific text. OpenAI also holds proprietary techniques in distributed training, model compression, and model optimization. Training at this scale likewise builds on techniques such as Megatron-LM (a large-scale training approach published by NVIDIA), which improves training speed through model parallelism and pipeline parallelism.

An Evaluation System for Large-Model Capability

Based on the above analysis, we attempt to build an evaluation system that assesses the capability of a large model in a more scientific and reasonable way.

We divide the main influencing factors into the following aspects and assign a weight to each (on a 100-point scale):

Dataset quality (25 points)

Coverage: whether the domains and topics covered by the dataset are comprehensive (10 points)

Diversity: whether the text styles and types in the dataset are rich (10 points)

Cleaning: how thoroughly noise, duplicates, and irrelevant content have been removed from the dataset (5 points)

Model architecture and algorithm optimization (25 points)

Architecture innovation: whether the model architecture is distinctive and advantageous (10 points)

Optimization method: whether the optimization algorithms adopted effectively improve model performance (10 points)

Parameter scale: the balance between the model's parameter count and its performance (5 points)

Fine-tuning strategy and task adaptability (25 points)

Fine-tuning dataset selection: the quality of the datasets selected for task-specific fine-tuning (10 points)

Loss function and optimization method: the design of the loss function and the choice of optimizer during fine-tuning (10 points)

Task adaptability: the model's adaptability and generalization across tasks (5 points)

Performance and computing resource consumption (25 points)

Accuracy: the model's accuracy across various tasks and datasets (10 points)

Practicality: the model's practicality and scalability in real application scenarios (10 points)

Computing resource consumption: resource consumption during training and inference (5 points)

For a newly launched large model, we can refer to this evaluation model and assign scores according to its performance in each aspect, which may require consulting relevant literature, test reports, and practical application cases. Once each factor is scored, the scores are summed to obtain an overall score for the model.
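As a minimal Python sketch of this scoring procedure (the rubric weights come from the system above; the data layout is our own):

```python
# Rubric weights from the evaluation system above: 4 aspects x 3 criteria.
RUBRIC = {
    "dataset_quality":      {"coverage": 10, "diversity": 10, "cleaning": 5},
    "architecture_and_opt": {"innovation": 10, "optimization": 10, "param_scale": 5},
    "finetuning":           {"dataset_selection": 10, "loss_and_opt": 10, "adaptability": 5},
    "performance_and_cost": {"accuracy": 10, "practicality": 10, "resource_cost": 5},
}

def total_score(scores):
    """Validate each criterion against its rubric maximum and sum to /100."""
    total = 0
    for aspect, criteria in RUBRIC.items():
        for criterion, max_pts in criteria.items():
            pts = scores[aspect][criterion]
            assert 0 <= pts <= max_pts, f"{aspect}/{criterion} out of range"
            total += pts
    return total

# GPT-3's scores as assigned later in this article.
gpt3 = {
    "dataset_quality":      {"coverage": 10, "diversity": 10, "cleaning": 2},
    "architecture_and_opt": {"innovation": 5, "optimization": 10, "param_scale": 5},
    "finetuning":           {"dataset_selection": 10, "loss_and_opt": 7, "adaptability": 5},
    "performance_and_cost": {"accuracy": 10, "practicality": 5, "resource_cost": 5},
}
print(total_score(gpt3))  # 84
```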

Of course, this evaluation model is only Data Ape's preliminary proposal; the actual evaluation process may need to be adjusted and optimized for specific situations.

Now that we have the evaluation model, we will try to apply it to some large-model products on the market. Note that although Baidu, Alibaba, and Huawei are all developing large-model products in China, and some have begun internal testing, the publicly available information is too sparse to support a complete evaluation.

Therefore, we can only select foreign large-model products with sufficient published data. For now we take GPT-3, BERT, and T5 as samples to trial-run our evaluation model. Below, we apply the model and score each indicator for GPT-3, BERT, and T5 based on the public information we could collect.

1. GPT-3 (OpenAI)

Dataset quality: 22 points

Coverage: 10 points. GPT-3 was trained on a huge volume of text, including the Common Crawl dataset, covering many domains and topics.

Diversity: 10 points. The dataset contains many types of text, such as news, blogs, and forum posts.

Cleaning: 2 points. GPT-3's preprocessing performed a degree of data cleaning, but some noise and irrelevant content remain.

Model architecture and algorithm optimization: 20 points

Architecture innovation: 5 points. GPT-3 follows the basic architecture of GPT-2, but greatly increases the parameter scale.

Optimization method: 10 points. GPT-3 builds on techniques such as the autoregressive architecture and multi-head attention.

Parameter scale: 5 points. GPT-3's 175 billion parameters deliver a significant performance improvement, but also drive up computing resource consumption.

Fine-tuning strategy and task adaptability: 22 points

Fine-tuning dataset selection: 10 points. GPT-3 can use more refined datasets in the fine-tuning stage to adapt to specific tasks.

Loss function and optimization method: 7 points. GPT-3 adopts a multi-task learning strategy, but some tasks may require further tuning of the loss function and optimizer.

Task adaptability: 5 points. GPT-3 performs well on a wide variety of tasks, though on some it can suffer from issues such as over-long or over-short generated text.

Performance and computing resource consumption: 20 points

Accuracy: 10 points. GPT-3 performs well on multiple benchmarks, though it can deviate on some specific tasks.

Practicality: 5 points. GPT-3 has broad application potential, but its huge parameter count may limit deployment on resource-constrained devices.

Computing resource consumption: 5 points. Training and inference for GPT-3 require massive computing resources, which can mean high costs.

Total score: GPT-3 scores 84 points.

2. BERT (Google)

Dataset quality: 18 points

Coverage: 8 points. BERT was trained on Wikipedia and BookCorpus, covering many domains and topics.

Diversity: 8 points. The dataset contains several types of text, but is weighted toward informational articles and books.

Cleaning: 2 points. BERT's preprocessing performed a degree of data cleaning, but some noise and irrelevant content may remain.

Model architecture and algorithm optimization: 18 points

Architecture innovation: 6 points. BERT adopts the Transformer architecture with self-attention, which was innovative relative to earlier models.

Optimization method: 8 points. BERT uses a bidirectional training strategy, which effectively improves model performance.

Parameter scale: 4 points. BERT comes in several sizes; the largest has 340 million parameters, which improves performance at a correspondingly higher computing cost.

Fine-tuning strategy and task adaptability: 20 points

Fine-tuning dataset selection: 8 points. BERT can be adapted during fine-tuning with datasets from many domains and tasks.

Loss function and optimization method: 7 points. By adjusting the loss function and optimizer, BERT achieves good performance on many tasks.

Task adaptability: 5 points. BERT performs well across a range of tasks, but is weaker on generation tasks.

Performance and computing resource consumption: 18 points

Accuracy: 9 points. BERT performs well on multiple benchmarks, though it can deviate on some specific tasks.

Practicality: 5 points. BERT has broad application potential, but deployment on resource-constrained devices may be limited by its parameter scale.

Computing resource consumption: 4 points. BERT's training and inference require considerable computing resources, which can mean high costs.

Total score: BERT scores 74 points.

3. T5 (Google)

Dataset quality: 20 points

Coverage: 9 points. T5 uses multiple datasets, including Common Crawl and Wikipedia, covering many domains and topics.

Diversity: 9 points. The dataset contains many types of text, such as news, blogs, and forum posts.

Cleaning: 2 points. T5's preprocessing performed a degree of data cleaning, but some noise and irrelevant content remain.

Model architecture and algorithm optimization: 19 points

Architecture innovation: 6 points. T5 is based on the Transformer architecture with self-attention, similar to BERT.

Optimization method: 9 points. T5 adopts a sequence-to-sequence training strategy that casts every task as text generation, giving it strong generalization ability.

Parameter scale: 4 points. T5 comes in several sizes; the largest has 11 billion parameters, which improves performance at a correspondingly higher computing cost.

Fine-tuning strategy and task adaptability: 23 points

Fine-tuning dataset selection: 9 points. T5 can be adapted during fine-tuning with datasets from many domains and tasks.

Loss function and optimization method: 8 points. By adjusting the loss function and optimizer, T5 achieves good performance on many tasks.

Task adaptability: 6 points. T5 performs well across a wide range of tasks and adapts to new ones readily.

Performance and computing resource consumption: 19 points

Accuracy: 10 points. T5 performs well on several benchmarks and has achieved many state-of-the-art results.

Practicality: 5 points. T5 has broad application potential, but deployment on resource-constrained devices may be limited by its parameter scale.

Computing resource consumption: 4 points. T5's training and inference require considerable computing resources, which can mean high costs.

Total score: T5 scores 81 points.

According to the scoring results, the final scores of the three models and their performance on each category are summarized below:

Category (max points)                                  GPT-3   BERT   T5
Dataset quality (25)                                     22     18    20
Model architecture and algorithm optimization (25)       20     18    19
Fine-tuning strategy and task adaptability (25)          22     20    23
Performance and computing resource consumption (25)      20     18    19
Total (100)                                              84     74    81

Chart: Data Ape

It should be pointed out that the ratings above are only an example, not an absolutely accurate assessment; the actual performance of these models may vary with specific tasks and scenarios. We also hope that Baidu, Huawei, and Alibaba will publish more performance data for their large models, so that the outside world can understand them more fully.

10 questions to find out the "family assets" of Alibaba Tongyi Qianwen and Baidu Wenxin Yiyan

The evaluation model above helps us understand a large model's technical capability more systematically. But it has a prerequisite: the model's developer must disclose sufficiently detailed data. It is also biased toward a technical perspective.

As a user, how can you intuitively judge the strengths and weaknesses of a large-model product? The most direct way is to ask it questions. To this end, Data Ape designed 10 test questions for large-model products, aiming to probe a product's capabilities, and especially the boundaries of those capabilities.

Here are 10 questions we suggest:

Question 1: Please explain the core contradiction between relativity and quantum mechanics.

Rationale: tests the model's understanding of basic scientific knowledge.

Question 2: Why is the sky blue?

Rationale: tests the accuracy of the model's explanations of natural phenomena.

Question 3: Please write a Tetris application in Python.

Rationale: tests the model's knowledge and practical skill in programming.

Question 4: Please write a poem about love in the style of Li Bai.

Rationale: tests the model's language generation ability and its understanding of Chinese culture.

Question 5: Please briefly introduce the core working principle of large pre-trained models.

Rationale: tests the model's understanding of emerging technologies and concepts.

Question 6: Please analyze the character traits of the five main characters in Journey to the West.

Rationale: tests the model's ability to understand and analyze literary works.

Question 7: Based on current mainstream economic theory, discuss the possibility of the RMB replacing the US dollar.

Rationale: tests the model's grasp of economics and its analysis of current affairs.

Question 8: Will large-model technology lead to mass unemployment, and in which industries will it mainly affect employment?

Rationale: tests the model's knowledge and understanding of industry applications.

Question 9: Please compare, in table form, the GDP of the world's top 10 countries over the past five years, with data through 2022, and produce an analysis chart from the data.

Rationale: tests the model's data analysis and presentation capabilities, as well as the cutoff date of its training data.

Question 10: Do you think artificial intelligence will pose a threat to human beings, and would you sacrifice your own interests for the benefit of humanity?

Rationale: tests the model's ability to reason about complex issues and its understanding of ethical and social questions.

Through these questions we can test a large model's knowledge and application ability across many domains, and discover where it shines, where it is mediocre, and where it clearly falls short.
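For readers who want to run such a question set programmatically, here is a minimal harness written against the pre-1.0 openai Python package that was current when ChatGPT launched; the API key is a placeholder, the question list is abbreviated, and other vendors' models would need their own SDKs:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: a valid OpenAI key

QUESTIONS = [
    "Please explain the core contradiction between relativity and quantum mechanics.",
    "Why is the sky blue?",
    # ... the remaining eight questions from the list above
]

def ask_all(questions, model="gpt-3.5-turbo"):
    """Send each test question as a fresh single-turn conversation,
    so one answer cannot contaminate the next."""
    answers = []
    for q in questions:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": q}],
        )
        answers.append(resp.choices[0].message.content)
    return answers

for q, a in zip(QUESTIONS, ask_all(QUESTIONS)):
    print(f"Q: {q}\nA: {a[:200]}...\n")
```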

Next, we will put these 10 questions to ChatGPT, Baidu Wenxin Yiyan, and Alibaba Tongyi Qianwen in turn and compare their actual performance.

Here are ChatGPT's answers to these 10 questions:

[Screenshots of ChatGPT's answers; original images not reproduced]

The following are Wenxin Yiyan's answers to these 10 questions:

[Screenshots of Wenxin Yiyan's answers; original images not reproduced]

Here are Alibaba Tongyi Qianwen’s answers to these 10 questions:

[Screenshots of Tongyi Qianwen's answers; original images not reproduced]

Text: Misty Rain / Data Ape

