Data intelligence industry innovation service media
Focused on digital intelligence and business transformation
There is no doubt that the world is already engaged in a large-scale AI arms race, and the big-name technology giants will not sit it out. Yesterday, Alibaba began internal testing of Tongyi Qianwen, and today Huawei announced the latest progress on its Pangu model. Not long ago, Baidu unveiled Wenxin Yiyan, and 360 also announced its own large-model products. ByteDance, Tencent, JD.com, NetEase and others are all actively entering this race as well.
It is foreseeable that in 2023 we will witness the release of multiple large-model products, and may even get to try several of them. With so many similar products, which one is better? At present, the industry has no scientific, reasonable evaluation standard. For this reason, Data Ape has tried to establish an evaluation system for large-model products, to assess the capabilities of competing offerings.
Core Factors Affecting Large Model Performance
In order to make the evaluation system more scientific and reasonable, we need to figure out which core factors affect the performance of a large-model product, and how those factors shape its final performance. On this basis, a scoring system can be constructed.
Assessing the capabilities of a large model requires multiple considerations. The following are some of the main evaluation factors:
Dataset
The quality of the dataset directly affects the knowledge the model learns and its ability to generalize. A high-quality dataset should have diversity, balance and sufficient scale. Diversity means the dataset contains texts from different domains, styles and types; balance means the number of samples in each category is relatively even; scale refers to the overall size of the dataset.
A dataset is like the content of a course taught by a teacher. A high-quality course gives students a comprehensive understanding of knowledge across fields, while a poor-quality course may leave students understanding only certain fields, resulting in an uneven knowledge structure.
Although many enterprises obtain datasets from public sources, they may filter, clean and enrich the data to build datasets with their own characteristics.
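As a minimal illustration of the balance criterion, the sketch below computes how evenly a labeled corpus is distributed across categories. The category labels are hypothetical, and the `balance_ratio` helper is our own shorthand rather than a standard industry metric.

```python
from collections import Counter

def balance_ratio(labels):
    """Ratio of the rarest category's count to the most common category's count.
    1.0 means perfectly balanced; values near 0 signal heavy imbalance."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

# Hypothetical corpus: one category tag per document
labels = ["news", "news", "news", "blog", "blog", "forum"]
distinct = len(set(labels))    # crude diversity proxy: number of categories
ratio = balance_ratio(labels)  # 1/3 here: "forum" is underrepresented
```

A real data pipeline would of course track many more signals (duplication rate, language mix, domain coverage), but even this crude ratio flags when one category dominates the corpus.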
Model Architecture
The model architecture determines the basic structure and computation pattern of the model. It is like the structural design of a building: different designs yield different functions and performance. For example, the Transformer architecture provides a powerful ability to process long sequences of data, enabling models to better understand and generate language.
Different companies may adjust and optimize the model architecture according to their own needs and scenarios. For example, some enterprises may adopt a more efficient model architecture to maintain good performance while reducing computing resource consumption.
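To make the Transformer point concrete, here is a minimal sketch of the scaled dot-product self-attention at the heart of that architecture, in plain Python. It uses toy dimensions and omits batching and learned projection matrices, so it is a simplification of the real mechanism, not a production implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence.
    queries/keys/values: lists of equal-length vectors (lists of floats)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Output is the weight-averaged value vector
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Toy 3-token sequence with 2-dimensional embeddings
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)  # self-attention: Q = K = V = x
```

Because every token attends to every other token directly, sequence length does not degrade the path between distant positions, which is the property that makes Transformers strong on long sequences.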
Algorithm Optimization
The optimization algorithm is responsible for adjusting the parameters of the model during training to minimize the loss function. Appropriate optimization algorithms can accelerate model convergence and improve model performance.
Different companies may adopt different fine-tuning strategies and goals. Factors such as training data selection, loss function design, and optimization methods in the fine-tuning stage will affect the performance of the model on specific tasks. Some companies may have exclusive technologies and patents, such as model parallelization and gradient accumulation, which can improve the efficiency and performance of model training.
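The idea of "adjusting parameters to minimize the loss" can be shown on a toy problem. The sketch below runs plain gradient descent on a single-parameter quadratic loss; this is our own illustrative example, not any particular company's training procedure.

```python
def train(w, lr=0.1, steps=100):
    """Gradient descent on the toy loss L(w) = (w - 3)^2.
    Each step moves w against the gradient dL/dw = 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of the loss at the current w
        w -= lr * grad      # update rule: w <- w - lr * gradient
    return w

w_final = train(w=0.0)  # converges toward the loss minimum at w = 3
```

Real optimizers such as Adam add momentum and per-parameter learning rates on top of this basic update, which is what "accelerating convergence" means in practice.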
Parameter Scale
The parameter scale determines the complexity and learning ability of the model. It should be noted that more parameters can help the model learn more knowledge and features, but at the same time may lead to overfitting.
Parameter scale is like a person's memory: the stronger the memory, the more knowledge one can retain. However, if a person only memorizes mechanically and cannot apply knowledge flexibly, that memory is of limited use. An appropriate parameter scale ensures good generalization while still allowing the model to learn rich knowledge.
Computing Resources
Computing resources have a great impact on the training speed and scalability of a model: the more computing resources available, the faster training proceeds. Training large models places high demands on chips, usually requiring high-performance processors designed for deep learning, such as GPUs (graphics processing units) or TPUs (tensor processing units). For a model with 100 billion parameters, for example, training may require hundreds to thousands of high-performance GPUs (such as NVIDIA V100s or A100s).
The consumption of computing resources is closely tied to the model's parameter scale, the dataset size, the batch size, and the number of training epochs. Models with more parameters need more memory to store them and more computation during training. The larger the dataset, the more data the model must process, increasing the training workload. Batch size is the number of samples fed to the model in each training iteration; a larger batch size better exploits the parallelism of GPUs and TPUs and speeds up training, but it also consumes more video memory or RAM, so choosing an appropriate batch size is key to balancing resource consumption against training speed. Finally, more training epochs mean more iterations, and correspondingly more compute.
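The rough arithmetic behind these memory pressures can be sketched as follows. The 8x training multiplier used here is a commonly cited rule of thumb for mixed-precision Adam-style training (weights plus gradients plus optimizer state), not an exact figure.

```python
def param_memory_gb(n_params, bytes_per_param=2):
    """GiB needed just to hold the weights (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1024**3

# A 175-billion-parameter model (GPT-3 scale) stored in fp16:
weights_gb = param_memory_gb(175e9)  # roughly 326 GiB for the weights alone
training_gb = weights_gb * 8         # rough training budget: weights,
                                     # gradients and optimizer moments
```

Even the weights alone exceed the memory of any single accelerator, which is why training at this scale is necessarily distributed across many GPUs or TPUs.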
To sum up, from a technical point of view, the dataset, model architecture, parameter scale, algorithm optimization and computing resources all have an important impact on a model's final performance. We can compare model training to cooking: the dataset is like the ingredients, and high-quality ingredients make dishes more delicious; the model architecture is like the cooking method, and an appropriate method brings out the best in the ingredients; fine-tuning strategies are like seasoning, and the right seasoning gives a dish its character; proprietary technology and patents are like unique cooking techniques that let a chef produce high-level dishes in a short time.
Taking ChatGPT as an example, it has been optimized across datasets, model architecture, parameter scale, algorithm optimization and computing resources, which is what makes its performance so striking. On the dataset side, in addition to large-scale web data, OpenAI's GPT-series models also draw on domain-specific datasets to expand their knowledge coverage, and the fine-tuning stage uses more refined, task-specific data such as dialogue datasets or domain texts. OpenAI also has proprietary techniques in distributed training, model compression and model optimization. A related example is the "Megatron" large-scale model training technique (developed at NVIDIA), which improves training speed through model parallelism and pipeline parallelism.
Evaluation System of Large Model Ability
Based on the above analysis, we try to build an evaluation system to evaluate the ability of a large model in a more scientific and reasonable way.
We divide the main influencing factors into the following aspects and assign weights to each aspect (on a 100-point scale):
Dataset Quality (25 points)
Coverage: whether the domains and topics covered by the model are comprehensive (10 points)
Diversity: whether the text styles and types contained in the dataset are rich (10 points)
Cleanliness: how thoroughly noise, duplicates and irrelevant content have been removed from the dataset (5 points)
Model architecture and algorithm optimization (25 points)
Architecture innovation: whether the model architecture is distinctive and advantageous (10 points)
Optimization method: whether the optimization algorithms adopted effectively improve model performance (10 points)
Parameter scale: the balance struck between the model's parameter count and its performance (5 points)
Fine-tuning strategy and task adaptability (25 points)
Fine-tuning dataset selection: the quality of the datasets selected for specific tasks (10 points)
Loss function and optimization method: the design of the loss function and the choice of optimization method during fine-tuning (10 points)
Task adaptability: the model's adaptability and generalization ability across a variety of tasks (5 points)
Performance and computing resource consumption (25 points)
Accuracy: the model's accuracy across various tasks and datasets (10 points)
Practicality: the model's practicality and scalability in real application scenarios (10 points)
Computing resource consumption: computing resources consumed during training and inference (5 points)
For a newly launched large model, we can refer to the above evaluation model and assign scores based on its performance in each area. This may require consulting relevant literature, test reports and real-world application cases. Once each factor has been scored, the scores can be summed to obtain an overall score for the model.
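The rubric above can be written down directly as data, which makes the summing step mechanical. A minimal sketch follows; the category and criterion keys are our own shorthand for the rubric items.

```python
# The four rubric categories with each sub-criterion's point cap
RUBRIC = {
    "dataset_quality":       {"coverage": 10, "diversity": 10, "cleanliness": 5},
    "architecture_and_algo": {"innovation": 10, "optimization": 10, "param_scale": 5},
    "finetuning":            {"dataset": 10, "loss_and_opt": 10, "adaptability": 5},
    "performance_and_cost":  {"accuracy": 10, "practicality": 10, "compute": 5},
}

def total_score(scores):
    """Sum the assigned sub-scores, clipping each at its rubric cap."""
    return sum(min(scores[cat][item], cap)
               for cat, items in RUBRIC.items()
               for item, cap in items.items())
```

Encoding the caps in one place keeps every evaluated model on the same 100-point scale and prevents a sub-score from exceeding its weight.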
Of course, this evaluation model is only a preliminary proposal from Data Ape, and the actual evaluation process may need to be adjusted and refined for specific situations.
Now that we have the evaluation model, we will try to use it to assess some large-model products on the market. It should be noted that although Baidu, Alibaba and Huawei in China are all developing large-model products, and some have started internal testing, the publicly available information is too sparse to support a complete evaluation of them.
Therefore, we can only select foreign large-model products for which sufficient data has been published. For now, we take GPT-3, BERT and T5 as samples and try out our evaluation model. Below, we apply the model and score GPT-3, BERT and T5 on each indicator based on the public information we could collect.
1. GPT-3 (OpenAI)
Dataset Quality: 22 points
Coverage: 10 points, GPT-3 uses a large amount of text data, including the Common Crawl dataset, covering multiple fields and topics.
Diversity: 10 points, the dataset contains various types of text, such as news, blogs, forums, etc.
Cleanliness: 2 points, GPT-3's data preprocessing includes a degree of cleaning, but some noise and irrelevant content remain.
Model architecture and algorithm optimization: 20 points
Architecture innovation: 5 points, GPT-3 follows the basic architecture of GPT-2, but the parameter scale is greatly increased.
Optimization method: 10 points, GPT-3 builds on an autoregressive architecture with a multi-head attention mechanism.
Parameter scale: 5 points, GPT-3's 175 billion parameters bring a significant performance improvement, but also increase computing resource consumption.
Fine-tuning strategy and task adaptability: 22 points
Fine-tuning data set selection: 10 points, GPT-3 can use more refined data sets in the fine-tuning stage to adapt to specific tasks.
Loss function and optimization method: 7 points, GPT-3 adopts a multi-task learning strategy, but in some tasks it may be necessary to further optimize the loss function and optimization method.
Task adaptability: 5 points, GPT-3 performs well on a variety of tasks, but may be affected by problems such as too long or too short generated text on some tasks.
Performance and computing resource consumption: 20 points
Accuracy: 10 points, GPT-3 performs well in multiple benchmarks, but there may be deviations in some specific tasks.
Practicality: 5 points, GPT-3 has broad application potential, but its large parameter scale may limit the practicality of deployment on resource-constrained devices.
Computing resource consumption: 5 points, the training and reasoning process of GPT-3 requires a lot of computing resources, which may lead to higher costs.
Total score: GPT-3 scored 84 points.
2. BERT (Google)
Dataset Quality: 18 points
Coverage: 8 points, BERT uses Wikipedia and BookCorpus datasets, covering many fields and topics.
Diversity: 8 points, the dataset contains various types of texts, but mainly focuses on informative articles and books.
Cleanliness: 2 points, BERT's data preprocessing includes a degree of cleaning, but some noise and irrelevant content may remain.
Model architecture and algorithm optimization: 18 points
Architecture innovation: 6 points, BERT adopts the Transformer architecture and realizes the self-attention mechanism, which is innovative compared with the previous model.
Optimization method: 8 points, BERT uses a bidirectional training strategy, which effectively improves the performance of the model.
Parameter scale: 4 points, BERT comes in several sizes; the largest version has 340 million parameters, which improves performance but also increases computing resource consumption.
Fine-tuning strategy and task adaptability: 20 points
Fine-tuning dataset selection: 8 points, BERT can use datasets from various fields and tasks for adaptation during the fine-tuning stage.
Loss function and optimization method: 7 points, BERT can achieve good performance on multiple tasks by adjusting the loss function and optimization method.
Task adaptability: 5 points, BERT performs well on a variety of tasks, but may not perform well on generation tasks.
Performance and computing resource consumption: 18 points
Accuracy: 9 points, BERT performs well in multiple benchmarks, but may be biased on some specific tasks.
Practicality: 5 points, BERT has a wide range of application potential, but deployment on resource-constrained devices may be limited by the scale of parameters.
Computing resource consumption: 4 points, BERT's training and reasoning process requires more computing resources, which may lead to higher costs.
Total score: BERT scored 74 points.
3. T5 (Google)
Dataset Quality: 20 points
Coverage: 9 points, T5 uses multiple data sets including Common Crawl and Wikipedia, covering multiple fields and topics.
Diversity: 9 points, the dataset contains various types of text, such as news, blogs, forums, etc.
Cleanliness: 2 points, T5's data preprocessing includes a degree of cleaning, but some noise and irrelevant content remain.
Model architecture and algorithm optimization: 19 points
Architecture innovation: 6 points, T5 is based on the Transformer architecture and implements a self-attention mechanism, similar to BERT.
Optimization method: 9 points, T5 adopts a sequence-to-sequence training strategy, treating every task as a text-to-text task, which gives it strong generalization ability.
Parameter scale: 4 points, T5 comes in several sizes; the largest version has 11 billion parameters, which improves performance but also increases computing resource consumption.
Fine-tuning strategy and task adaptability: 23 points
Fine-tuning dataset selection: 9 points, T5 can use datasets from various fields and tasks for adaptation in the fine-tuning stage.
Loss function and optimization method: 8 points, T5 achieved good performance on multiple tasks by adjusting the loss function and optimization method.
Task adaptability: 6 points, T5 performs well in various tasks and has good adaptability.
Performance and computing resource consumption: 19 points
Accuracy: 10 points, T5 performed well in several benchmarks, achieving many leading results.
Practicality: 5 points, T5 has broad application potential, but deployment on resource-constrained devices may be limited by parameter scale.
Computing resource consumption: 4 points, the training and inference process of T5 requires more computing resources, which may lead to higher costs.
Total score: T5 gets 81 points.
Based on these scores, the final totals for the three models and their performance on each sub-indicator are as follows.
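The category subtotals assigned above can be tallied and ranked programmatically. A quick sketch using exactly the scores from this article:

```python
# Category subtotals from the scoring above, in rubric order:
# dataset quality, architecture & algorithms, fine-tuning, performance & cost
scores = {
    "GPT-3": [22, 20, 22, 20],
    "BERT":  [18, 18, 20, 18],
    "T5":    [20, 19, 23, 19],
}
totals = {name: sum(parts) for name, parts in scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)  # best first
```

This yields the totals reported above (GPT-3: 84, T5: 81, BERT: 74) and makes it easy to re-rank the models if any sub-score is revised.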
[Chart: scoring results for GPT-3, BERT and T5 across the evaluation indicators]
It should be pointed out that the above rating is only an example, not an absolutely accurate assessment. Actual performance of models may vary depending on specific tasks and scenarios. At the same time, it is hoped that Baidu, Huawei, and Alibaba in China will announce more performance data of their large models, so that the outside world can have a more comprehensive understanding of them.
10 Questions to Probe the Real Capabilities of Alibaba Tongyi Qianwen and Baidu Wenxin Yiyan
With the help of the above evaluation model, we can understand a large model's technical capabilities more systematically. But the model has a prerequisite: the developer of the large model must disclose sufficiently detailed data. Moreover, it leans toward a technical perspective.
As a user, how can you intuitively evaluate the pros and cons of a large-model product? The most direct way is to ask it questions. To this end, Data Ape designed 10 test questions for large-model products, trying to probe their capabilities through these questions, and especially the boundaries of those capabilities.
Here are 10 questions we suggest:
Question 1: Please explain the core contradiction between relativity and quantum mechanics.
Rationale: To test the model's understanding of basic scientific knowledge.
Question 2: Why is the sky blue?
Rationale: To test the accuracy of models in explaining natural phenomena.
Question 3: Please write a Tetris application in Python.
Rationale: To test the knowledge and application ability of the model in the field of programming.
Question 4: Please imitate Li Bai and write a poem about love.
Rationale: To test the model's language generation ability and understanding of Chinese culture.
Question 5: Please briefly introduce the core working principle of the large-scale pre-training model.
Rationale: To test the model's understanding of emerging technologies and concepts.
Question 6: Please analyze the character traits of the five main characters in Journey to the West.
Rationale: To test the model's ability to understand and analyze literary works.
Question 7: Based on the current mainstream economic theory, please talk about the possibility of RMB replacing the US dollar.
Rationale: Test the model's understanding of economics and analysis of current events.
Question 8: Will large-model technology lead to mass unemployment, and which industries' employment will be most affected?
Rationale: To test the model's knowledge and understanding of industry applications.
Question 9: Please compare, in table form, the GDP of the world's top 10 countries over the past five years, with data through 2022, and produce an analysis chart based on the data.
Rationale: Test the data analysis and presentation capabilities of the model, and the latest dataset update date of the model.
Question 10: Do you think artificial intelligence will pose a threat to human beings, and will you sacrifice your own interests for the benefit of human beings?
Rationale: To test the model's ability to think about and generate ideas about complex issues, as well as its understanding of ethical and social issues.
Through these questions, we can comprehensively test a large model's knowledge and application abilities across fields, discovering both its strengths and its obvious shortcomings.
Next, we will use these 10 questions to test ChatGPT, Baidu Wenxin Yiyan, and Alibaba Tongyi Qianwen respectively, and compare their actual performance.
The following are Wenxin Yiyan's answers to these 10 questions:
Here are Alibaba Tongyi Qianwen’s answers to these 10 questions:
Text: Misty Rain / Data Ape