Interpretation: BloombergGPT, the first public large language model for the financial field

The following content is from Zhihu link:

https://zhuanlan.zhihu.com/p/619444812

Author: Century See. Reprinted with the permission of the author.

Typesetting: Artificial Intelligence Quantification Laboratory
Public Account: Algorithm Engineering Notes

BloombergGPT is the language model introduced in the paper "BloombergGPT: A Large Language Model for Finance", which Bloomberg published on arXiv on March 30, 2023. It is also the first large language model (hereinafter "LLM") for the financial field to be described in a public paper.

Below we walk through the paper chapter by chapter.

Main points

  • BloombergGPT is a large language model for finance (LLM for Finance) trained by Bloomberg.

  • The model has 50 billion parameters and was trained on a financial data set of 363 billion tokens plus a general data set of 345 billion tokens.

  • The hidden layer dimension is 7680, and the number of attention heads is 40.

  • The model uses a Unigram tokenizer and the AdamW optimizer.

  • The model was trained for 53 days on 64 AWS p4d.24xlarge instances, each of which contains 8 A100 GPUs with 40 GB of memory.

  • The evaluation of BloombergGPT consists of two parts: financial domain evaluation and general domain evaluation.

  • The other large language models evaluated for comparison are GPT-NeoX, OPT, BLOOM, and GPT-3.

  • On financial tasks, BloombergGPT has the best overall performance; on general tasks, its overall score is also better than that of other models of similar parameter count, and on some tasks it scores higher than models with more parameters.

  • BloombergGPT achieves good results in the financial field without sacrificing general capabilities.

  • A qualitative evaluation suggests that BloombergGPT can improve work efficiency.

  • For security reasons, the BloombergGPT model will not be made public, but the experience and lessons from training and evaluating it are shared.

  • The authors believe the three factors that contribute most to model performance (sorted from high to low impact) are carefully cleaned data sets, a suitable tokenizer, and an up-to-date model architecture.

1 Introduction

BloombergGPT is a 50-billion-parameter LLM based on the BLOOM model. It is trained with a mixed approach that targets both general capabilities and the financial domain.

To support this approach, the authors constructed the largest financial data set to date from 40 years of data accumulated by Bloomberg.

The main contributions of the article are as follows:

  • Training on a mixed data set yields a model that performs well not only on domain-specific tasks but also on general NLP benchmarks.

  • Unlike the usual web-crawled corpora, the data used in the paper contains a very large amount of carefully cleaned data from trusted sources.

  • The paper reports evaluation results not only on public benchmarks but also on Bloomberg's internal tasks.

  • A 50-billion-parameter LLM is trained on 569 billion tokens drawn from a corpus of more than 700 billion tokens.

  • Tokenization uses a Unigram model instead of the commonly used subword tokenizers based on greedy merging, which allows more intelligent tokenization at inference time.

  • The authors build on BLOOM's approach to training large models and also share their own experience from training BloombergGPT.

2. Dataset

The authors first built FinPile, a financial data set of English documents such as news, filings, press releases, and web-crawled financial documents, and combined it with general-purpose public data sets.

2.1 Financial field data set

The financial field data set contains a total of 363 billion tokens, accounting for 51.27% of the total data set token volume (the sum of the component shares listed below). It is composed of the following parts:

  • Web pages related to the financial field, 298 billion tokens, accounting for 42.01%

  • News from well-known sources in the financial field, 38 billion tokens, accounting for 5.31%

  • Company financial reports (filings), 14 billion tokens, accounting for 2.04%

  • Press releases from finance-related companies, 9 billion tokens, accounting for 1.21%

  • Bloomberg's own content, 5 billion tokens, accounting for 0.7%

Because it contains some paid and private data, this data set will not be made public, but the model training method is disclosed in the article.

2.2 General data sets

The general data set contains a total of 345 billion tokens, accounting for 48.73% of the total data set token volume. It is divided into the following parts:

  • The Pile data set, 184 billion tokens, accounting for 25.9%

  • C4 data set, 138 billion tokens, accounting for 19.48%

  • Wikipedia data set, 24 billion tokens, accounting for 3.35%
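As a quick cross-check of the two breakdowns above, the per-source shares can simply be summed. This is a minimal sketch; the dictionary keys are shorthand labels chosen here, not names from the paper, and the numbers are the percentages quoted in the lists.

```python
# Shares (in % of all training tokens) as quoted in the two lists above.
# The keys are informal labels chosen for this sketch.
financial = {"web": 42.01, "news": 5.31, "filings": 2.04, "press": 1.21, "bloomberg": 0.70}
general = {"the_pile": 25.90, "c4": 19.48, "wikipedia": 3.35}

financial_share = round(sum(financial.values()), 2)
general_share = round(sum(general.values()), 2)
total_share = round(financial_share + general_share, 2)

print(f"financial: {financial_share}%  general: {general_share}%  total: {total_share}%")
```

The financial components add up to roughly 51.27% and the general components to 48.73%, so the two parts together cover the whole corpus.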

The data set is tokenized with a Unigram tokenizer. The authors made two changes to the usual pipeline (for details, see "2.3 Tokenization" in the original paper):

  • In the pretokenization step, each digit is treated as its own token, and multi-word tokens (phrases) are allowed in order to increase information density and shorten sequences.

  • A divide-and-conquer approach is used to make Unigram tokenizer training feasible on a large data set, and the final vocabulary size is kept at about 130,000 tokens.
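To make the difference from greedy-merge tokenizers more concrete, here is a minimal sketch of how a Unigram model segments text at inference time: it picks the split with the highest total log-probability using dynamic programming (Viterbi). The vocabulary and log-probabilities below are invented for illustration; a real Unigram tokenizer learns them from data, and this is not the authors' implementation.

```python
import math

def unigram_tokenize(text, logp, max_piece_len=8):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, end = [], n
    while end > 0:                 # walk the back-pointers from the end
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]

# Hypothetical toy vocabulary with made-up log-probabilities.
TOY_LOGP = {"bond": -4.0, "bonds": -5.0, "market": -4.5, "markets": -5.5,
            "mark": -5.0, "et": -4.0, "s": -3.0}

tokens = unigram_tokenize("bondmarkets", TOY_LOGP)
print(tokens)
```

With these toy scores, "bond" + "markets" beats "bond" + "market" + "s" because its summed log-probability is higher. The segmentation is chosen globally rather than by replaying a fixed sequence of merges.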


3. Model

3.1 Model structure

The model follows the autoregressive architecture of BLOOM and consists of 70 transformer decoder layers.

Some other details are as follows (see "3.1 Architecture" for details):

  • The nonlinearity in the feedforward network (FFN) is GELU.

  • Positional information is encoded with ALiBi.

  • The model has an additional layer normalization after the token embeddings.

3.2 Model scale

In this part, the authors start from a fixed compute budget (a total of 1.3 million GPU hours on A100 GPUs with 40 GB of memory) and set aside about 25% of it for the overhead of activation checkpointing.

According to the Chinchilla scaling laws, the compute-optimal configuration for the remaining budget is about 50 billion parameters and 1,100+ billion tokens.

Because financial tokens should make up more than 50% of the total and the available financial data could not be expanded for the time being, the final choice was a model of 50 billion parameters trained on a corpus of 700+ billion tokens.

The hidden dimension D is then derived from the number of decoder layers. Here it works out to 7680, with 40 attention heads.

[Figure: BloombergGPT hyperparameters]
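The quoted shape (70 layers, hidden dimension 7680, 40 heads) can be sanity-checked against the 50-billion-parameter figure with the common rule of thumb of about 12·L·D² weights for the transformer blocks plus V·D for the token embeddings. This is a rough sketch: the rule ignores biases and layer-norm weights, and the vocabulary size of 131,072 (2**17) is an assumption standing in for "on the order of 130,000".

```python
n_layers = 70
d_model = 7680
n_heads = 40
vocab_size = 131_072  # assumption: 2**17, consistent with "about 130,000"

head_dim = d_model // n_heads

# Rule-of-thumb count: 4*D^2 for attention + 8*D^2 for a 4x-wide MLP, per layer.
block_params = 12 * n_layers * d_model ** 2
embedding_params = vocab_size * d_model
total_params = block_params + embedding_params

print(f"head dimension: {head_dim}")
print(f"approx. parameters: {total_params / 1e9:.1f}B")
```

The estimate lands slightly above 50 billion, which is consistent with the model size stated in the text.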

3.3 Training configuration

This part of the original paper is quite detailed; see "3.3 Training Configuration". Here is a brief summary:

  • The authors append a special token <|endoftext|> to the end of each document, and the sequence length used during training is 2048 tokens.

  • The optimizer is AdamW, with beta1, beta2, and weight decay set to 0.9, 0.95, and 0.1 respectively. The peak learning rate is 6e-5, with linear warmup followed by cosine decay.

  • The model parameters are initialized from a normal distribution with mean 0 and standard deviation 0.006588, and the second MLP layer and the attention output layer are rescaled.

  • On training instability, the paper does not describe a dedicated method used for BloombergGPT; it only reviews related progress in the area.

  • For hardware, 64 AWS p4d.24xlarge instances were used, each with 8 A100 GPUs with 40 GB of memory (512 GPUs in total).
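Two of the numbers in the list above can be illustrated with a short sketch. First, the standard deviation 0.006588 is what sqrt(1 / (3·D)) gives for D = 7680, which appears to be how the value was chosen. Second, a linear-warmup plus cosine-decay schedule peaking at 6e-5 could look like the function below. The warmup length, total step count, and minimum learning rate are placeholder assumptions, not values taken from this article.

```python
import math

def init_std(d_model):
    """Standard deviation sqrt(1 / (3 * D)) for weight initialization."""
    return math.sqrt(1.0 / (3.0 * d_model))

def learning_rate(step, peak_lr=6e-5, warmup_steps=1_000,
                  total_steps=100_000, min_lr=6e-6):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_steps, total_steps and min_lr are hypothetical placeholders.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(f"init std for D=7680: {init_std(7680):.6f}")
for s in (0, 500, 1_000, 50_000, 100_000):
    print(s, f"{learning_rate(s):.2e}")
```

The schedule reaches its peak exactly at the end of warmup and falls smoothly to the minimum at the last step. In a real training run the same shape would normally be passed to the optimizer through the framework's scheduler interface.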

3.4 Methods adopted for large-scale optimization

In this part, the authors describe the optimizations used to train at scale: ZeRO optimization, MiCS, activation checkpointing, mixed precision training, and fused kernels.

For details, see "3.4 Large-scale Optimization".

With these optimizations, the hardware above reaches an average of 102 TFLOPs, and each training step takes 32.5 seconds.

3.5 Training process

The loss curve over the course of training is shown in the figure below:

[Figure: Loss versus number of training steps]

The paper reports that the model was trained for a total of 139,200 steps, about 0.8 epochs, over 53 days. Training stopped before a full epoch because the loss on the validation set had stopped decreasing.

The training process went as follows:

  • Training started with a batch size of 1024; after a 7,200-step warm-up the batch size was raised to 2048.

  • After 115,500 steps, the validation loss stopped decreasing, so the authors reduced the learning rate to 2/3 of its value.

  • After 129,900 steps, the learning rate was reduced to 1/2 of the previous value, and dropout was added.

  • After 137,100 steps, the learning rate was again reduced to 1/2 of the previous value.

  • Training finally ended at 146,000 steps. The authors selected the checkpoint from step 139,200 as the final model.
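The step counts above tie together several figures quoted elsewhere in this article (569 billion training tokens, about 0.8 epochs, 53 days). The back-of-the-envelope sketch below uses only numbers from the text. It assumes every step takes the average 32.5 seconds, which is a simplification, since the early steps used a smaller batch.

```python
SEQ_LEN = 2_048                      # tokens per sequence
WARMUP_STEPS, WARMUP_BATCH = 7_200, 1_024
FINAL_STEP, FULL_BATCH = 139_200, 2_048
SECONDS_PER_STEP = 32.5              # average step time quoted in section 3.4
CORPUS_TOKENS = (363 + 345) * 10**9  # financial + general data sets

sequences_seen = (WARMUP_STEPS * WARMUP_BATCH
                  + (FINAL_STEP - WARMUP_STEPS) * FULL_BATCH)
tokens_seen = sequences_seen * SEQ_LEN
epochs = tokens_seen / CORPUS_TOKENS
days = FINAL_STEP * SECONDS_PER_STEP / 86_400

print(f"tokens seen: {tokens_seen / 1e9:.1f}B")
print(f"epochs: {epochs:.2f}")
print(f"wall-clock estimate: {days:.1f} days")
```

The result is close to 569 billion tokens and 0.8 epochs, and the time estimate of a little over 52 days is in line with the reported 53 days.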

Sections 3.3 and 3.4 of the original paper are worth reading in full, since their description of the training method is a useful reference for large-model training.

4. Evaluation

The evaluation of BloombergGPT in the paper is divided into two parts: financial tasks and general tasks. The purpose is intuitive: to verify that a model pre-trained with domain data performs well in that domain while its performance on general tasks is not much worse.

The overall evaluation tasks are distributed as follows:

[Figure: Evaluation task classification]

The paper compares the performance of BloombergGPT, GPT-NeoX, OPT, BLOOM, and GPT-3 on the different tasks. Note that because the GPT-3 model is not available, it is only included for some of the general tasks.

The number of tokens, parameters, and calculations used by each model are as follows:

[Figure: Comparison of the number of tokens, parameters, and compute used by each model]

The authors evaluated each model independently, using the same standard prompts and the same few-shot examples for every model in each task, with no task descriptions and no chain-of-thought (CoT) prompting, to keep the comparison fair.

For multiple-choice tasks, the paper uses likelihood-based classification; for the other tasks, it uses greedy decoding.
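Likelihood-based classification means the model is not asked to generate an answer. Instead, each candidate answer is scored by the log-probability the model assigns to it after the prompt, and the highest-scoring candidate wins. The sketch below shows only that selection step. `toy_logprob` and its score table are hypothetical stand-ins for a real language model, and practical implementations may additionally normalize scores by answer length.

```python
def classify_by_likelihood(prompt, candidates, logprob_fn):
    """Pick the candidate whose continuation has the highest log-probability."""
    scores = {c: logprob_fn(prompt, c) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical stand-in for log p(candidate | prompt) from a language model.
TOY_SCORES = {"positive": -0.4, "neutral": -1.6, "negative": -2.3}

def toy_logprob(prompt, candidate):
    return TOY_SCORES[candidate]

label, scores = classify_by_likelihood(
    "Shares rallied after the earnings beat. Sentiment:",
    ["positive", "neutral", "negative"],
    toy_logprob,
)
print(label, scores)
```

Because only the fixed candidate set is scored, the model cannot produce an out-of-vocabulary label, which makes results easier to compare across models.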

4.1 Holdout loss

The authors first evaluate each model by bits per byte on held-out samples from the FinPile data set.

Bits per byte is a common metric for evaluating language models. It is similar to perplexity: the smaller the value, the better the model. The calculation is explained in "How to compute bits per character (BPC)?".

The bits per byte of each model on each type of data are as follows:

[Figure: Bits per byte comparison]

It can be seen that BloombergGPT's bits per byte on the financial corpus is lower (better) than that of the other models, and the gap is largest in the financial reports (Filings) category. This is in line with expectations; otherwise there would be little point in comparing the downstream tasks.
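For readers unfamiliar with the metric, bits per byte can be computed from a model's per-token negative log-likelihoods: sum them, convert from nats to bits, and divide by the number of UTF-8 bytes in the text. Normalizing by bytes rather than tokens is what makes models with different tokenizers comparable. The function name and the toy numbers below are illustrative only.

```python
import math

def bits_per_byte(token_nll_nats, text):
    """Total negative log-likelihood in bits divided by the UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Toy example: a 4-byte string split into 2 tokens, each predicted with p = 1/4.
example = bits_per_byte([math.log(4), math.log(4)], "abcd")
print(f"{example:.3f} bits per byte")
```

In the toy example each token costs 2 bits, so 4 bits are spread over 4 bytes, giving 1 bit per byte.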

4.2 Tasks in the financial field

There are 6 types of tasks in the financial field: 3 discriminative tasks and 3 generative tasks. The format of the specific tasks is as follows:

[Figure: Financial field task templates]

The paper further divides the financial tasks into external tasks and Bloomberg internal tasks. For each task, in addition to the task score itself, the authors also report the win rate (WR) from pairwise comparisons between the results of the different models on that task.

4.3 External tasks

The main external tasks are as follows:

  • ConvFinQA: question answering with reasoning over S&P 500 earnings reports

  • FiQA SA: aspect-based sentiment classification of financial news and microblog headlines into three classes (positive, neutral, negative)

  • FPB: sentence-level sentiment classification of financial news into three classes (positive, neutral, negative)

  • Headline: binary classification of news headlines against predefined tags

  • NER: named entity recognition on data collected for credit risk assessment

BloombergGPT achieved the best result on 4 of these 5 tasks and came second on the remaining one. It also had the highest win rate in the pairwise model comparisons and was far ahead on the ConvFinQA task.

The specific scores are as follows:

[Figure: Comparison of scores on external financial tasks]

4.4 Sentiment analysis of Bloomberg’s internal tasks

The sentiment analysis in these tasks is aspect-specific. The content of each data set can be roughly inferred from its task name.

On the four internal data sets, BloombergGPT is significantly ahead of the other models.

The specific results are as follows:

[Figure: Results on internal sentiment analysis tasks]

4.5 Exploratory Task: NER

Note that NER here only involves three types of entities: ORG, PER, and LOC.

The exploratory task NER+NED goes one step further: after identifying an entity, it links the entity to the ticker symbol of the listed company. For example, for "AAPL announced that they will stop using Intel chips in future products.", the result of NER is "AAPL, Intel", and the result of NER+NED is "AAPL, INTC".
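The difference between the two outputs in the example can be shown with a toy sketch. It uses a tiny hand-written lookup table instead of a language model, so it only illustrates the expected input and output format of NER versus NER+NED. The table contents and function names are assumptions made for this illustration.

```python
import re

# Hypothetical gazetteer: surface form -> ticker symbol.
TICKERS = {"AAPL": "AAPL", "Intel": "INTC"}

def ner(text):
    """Return the known company mentions in order of appearance."""
    pattern = r"\b(" + "|".join(map(re.escape, TICKERS)) + r")\b"
    return re.findall(pattern, text)

def ner_ned(text):
    """Recognize mentions, then link each one to its ticker symbol."""
    return [TICKERS[mention] for mention in ner(text)]

sentence = "AAPL announced that they will stop using Intel chips in future products."
print(ner(sentence))      # mentions as they appear in the text
print(ner_ned(sentence))  # mentions resolved to tickers
```

In the paper's setting the language model has to produce both steps from a prompt, which is harder than the dictionary lookup used here because mentions can be ambiguous or absent from any fixed list.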

These two types of tasks are evaluated on 7 data sets: BN (content on the Bloomberg BN wire), BFW (content on Bloomberg First Word), Filings (financial report content), Headlines (Bloomberg news headlines), Premium (third-party news content collected by Bloomberg), Transcripts (transcripts of company press conferences), and Social Media.

On the NER task, BloombergGPT scored highest only on the Headlines data set; on the NER+NED task, however, it ranked first on every data set except Social Media.

The specific results are as follows:

[Figure: NER task performance]

4.6 General tasks

The paper makes many comparisons on general tasks. Here we only give a brief description of the task types and results. For details, see Sections 5.4 to 5.7 of the paper.

The authors tested the models on BIG-bench Hard (a subset of BIG-bench containing only tasks on which current models cannot outperform humans), common-sense tests (no background material is provided, so the model can rely only on what it saw in training), reading comprehension, linguistic tasks (disambiguation, grammar recognition, entailment, and so on), and other tasks.

On BIG-bench Hard, BloombergGPT scores lower than PaLM and BLOOM, which have more parameters. However, compared with GPT-NeoX and OPT-66B, which are of similar size, BloombergGPT's performance is closer to BLOOM's. This suggests that developing a finance-specific large language model does not significantly sacrifice general capabilities.

On the common-sense tests, BloombergGPT ranked first on one task and second on the remaining three (GPT-3 is not considered here).

On reading comprehension, GPT-3 ranked first on all tasks, and BloombergGPT ranked second on 5 of the 6 tasks, with scores much higher than the BLOOM model.

On linguistic tasks, GPT-3 ranks first overall and BloombergGPT second, with an overall score higher than the BLOOM model.

4.7 Evaluation summary

On financial tasks, BloombergGPT has the best overall performance.

On general tasks, BloombergGPT's overall score is better than that of other models of similar parameter count, and on some tasks it scores higher than models with more parameters.

Together, these results show that a finance-specific large language model can achieve good results in the financial field without sacrificing general capabilities.

This conclusion is also instructive for other specialized fields, where dedicated large language models could be developed in the same way.

4.8 Qualitative assessment

In Chapter 6 of the paper, the authors also show examples from a qualitative evaluation of BloombergGPT to demonstrate how the model can help in professional work.

These examples include:

  • BQL (Bloomberg Query Language) generation, which turns natural language into Bloomberg database queries, similar to NL2SQL

  • News headline suggestions, which help reporters write short headlines

  • Financial Q&A

[Figure: BQL sample]

[Figure: News headline generation example]

[Figure: Financial Q&A sample]

5. Related work

This chapter surveys work related to large language model training from 7 angles. Only the following figure is given here as a summary.

[Figure: Large model related work]

For details, see "7. Related Work" in the paper.

6. Ethics, limitations and research significance

There is not much to add about this chapter. It mainly emphasizes that current large language models may generate harmful or biased content, and that prompt injection may lead to information leakage. Bloomberg will apply risk controls before and after deploying the large language model to ensure the accuracy of generated content.

The BloombergGPT model itself will not be made public, but the experience and lessons from training and evaluating it are shared.

7. Summary and outlook

The paper presents BloombergGPT, a leading LLM for the financial field, and makes the following contributions to training large language models for specific domains:

  • Training on both domain data and general data lets the model achieve balanced results in both areas.

  • The model size was chosen with reference to the Chinchilla scaling laws.

  • The relevant training details are disclosed.

Next, the authors plan to continue research in the following directions:

  • Fine-tuning for financial tasks

  • Making the model's language less harmful and less biased

  • Studying the impact of tokenization methods on model results

Finally, the authors attribute the model's current performance to the following three factors (ordered from high to low influence):

  • Carefully cleaned internal data sets

  • Choice of tokenizer

  • An up-to-date model architecture


Origin blog.csdn.net/FrankieHello/article/details/131218850