Recommended: FinGLM, an open-source large model project for financial analysis

" The SMP Financial Large Model Challenge incubated the FinGLM project, a conversational interactive intelligent system focused on in-depth analysis of annual reports of listed companies. FinGLM aims to use large language models (LLM) to achieve expert-level financial analysis and process financial text The professional terminology and implicit information. The data preparation and model fine-tuning of the FinGLM project are important steps in building a question answering system. They include converting PDF data into a processable text format, data segmentation and processing, and fine-tuning the model to adapt needs in the financial sector.


01

Large language models (LLMs) have made significant progress in text generation, but their application to more complex and challenging financial scenarios still needs improvement.

Today I'd like to recommend FinGLM, a large model project open-sourced from the Financial Large Model Challenge.

Open source address:

https://github.com/MetaGLM/FinGLM

Project Introduction:

FinGLM is a conversational intelligent system designed for in-depth analysis of the annual reports of listed companies. Faced with the professional terminology and implicit information in financial texts, the project is committed to using large language models (LLMs) to achieve expert-level financial analysis.

Background:

The project grew out of the SMP Financial Large Model Challenge. The SMP 2023 ChatGLM Financial Large Model Challenge (The Evaluation of Large Model of Finance Technology, SMP2023-ELMFT) was sponsored by the Social Media Processing Committee (SMP) of the Chinese Information Processing Society of China. Zhipu AI, Anshuo Information, Alibaba Cloud, the ModelScope (Moda) community, and Beijing Jiaotong University jointly organized the event, with the Tianchi platform as the designated competition platform.


Challenge address:

https://tianchi.aliyun.com/competition/entrance/532126

Challenge purpose:

Can capabilities be enhanced on the basis of existing open-source small models, and can the performance of AI models in financial analysis be improved through advanced methods such as fine-tuning large models, combining large and small models, and vector databases?

Specifically: is it possible to train a large model that can answer investment-related questions based on the annual report data of listed companies?

Open-source dataset:

The dataset covers the annual reports of some listed companies from 2019 to 2021. It contains 11,588 detailed PDF files with a total size of 69 GB.

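Q&A Demo (translated from the Chinese originals; company and personal names are kept as-is):

{"id": 0, "question": "Which listed company had the 12th-highest other current assets in 2021?", "answer": "The company with the 12th-highest other current assets in 2021 is 苏美达股份有限公司."}
{"id": 1, "question": "Among listed companies registered in Chongqing, how many had operating revenue above 500 million in 2021?", "answer": "In 2021, 4 companies registered in Chongqing had operating revenue above 500 million."}
{"id": 2, "question": "What was the total number of employees of 广东华特气体股份有限公司 in 2021?", "answer": "In 2021, 广东华特气体股份有限公司 had 1044 employees in total."}
{"id": 3, "question": "Rounded to two decimal places, please calculate the current liabilities ratio of 金钼股份 in 2019.", "answer": "The 2019 current liabilities ratio of 金钼股份 is 61.10%, where current liabilities are 1068418275.97 yuan and total liabilities are 1748627619.69 yuan."}
{"id": 4, "question": "Which listed company had the highest total liabilities in 2019?", "answer": "The company with the highest total liabilities in 2019 is 上海汽车集团股份有限公司."}
{"id": 5, "question": "Which five listed companies had the highest total assets in 2019?", "answer": "The five companies with the highest total assets in 2019 are 上海汽车集团股份有限公司, 中远海运控股股份有限公司, 国投电力控股股份有限公司, 华域汽车系统股份有限公司, and 广州汽车集团股份有限公司."}
{"id": 6, "question": "Which 3 listed companies that were ever registered in Ningbo had the highest operating revenue in 2020, and what were the amounts?", "answer": "Among companies registered in Ningbo, the 3 with the highest operating revenue in 2020 are 宁波均胜电子股份有限公司 with 47889837616.15 yuan; 宁波建工股份有限公司 with 19796854240.57 yuan; and 宁波继峰汽车零部件股份有限公司 with 15732749552.37 yuan."}
{"id": 7, "question": "Among listed companies registered in Suzhou, how many had total profit above 500 million in 2020?", "answer": "In 2020, 2 companies registered in Suzhou had total profit above 500 million."}
{"id": 8, "question": "What was the receivables financing of 浙江运达风电股份有限公司 in 2019, in yuan?", "answer": "In 2019, the receivables financing of 浙江运达风电股份有限公司 was 51086824.07 yuan."}
{"id": 9, "question": "What was the registered address of 神驰机电股份有限公司 in 2020?", "answer": "In 2020, the registered address of 神驰机电股份有限公司 was No. 200, Tongxing North Road, Tongjiaxi Town, Beibei District, Chongqing."}
{"id": 10, "question": "What were the non-operating expenses and non-operating income of 山东惠发食品股份有限公司 in 2019, in yuan?", "answer": "In 2019, the non-operating income of 山东惠发食品股份有限公司 was 1018122.97 yuan and its non-operating expenses were 2513885.46 yuan."}
{"id": 11, "question": "What exactly is the growth rate of financial expenses mentioned in the 2020 annual report of 福建广生堂药业股份有限公司?", "answer": "The 2020 growth rate of financial expenses of 福建广生堂药业股份有限公司 is 34.33%, where financial expenses are 7766850.48 yuan and the previous year's financial expenses are 5781839.51 yuan."}
{"id": 12, "question": "Is the legal representative of 华灿光电股份有限公司 in 2021 the same as in the previous year?", "answer": "No. The legal representative of 华灿光电股份有限公司 was 俞信华 in 2020 and 郭瑾 in 2021."}
{"id": 13, "question": "Please describe specifically whether the controlling shareholder of 仲景食品 changed in 2020.", "answer": "In 2020, the controlling shareholder of 仲景食品 did not change."}
{"id": 14, "question": "What are other debt investments?", "answer": "Other debt investments refer to fixed income that enterprises or institutional investors obtain by purchasing financial products such as bonds, loans, and time deposits. These financial products are usually issued by governments, companies, or other institutions and carry a certain credit rating and risk.\n\nOther debt investments are part of the investment portfolio of an enterprise or institution and are usually used to stabilize returns and diversify risk. Compared with stock investments, other debt investments carry lower risk but also relatively lower returns.\n\nThe management and investment strategies for other debt investments are similar to those for other asset classes, including diversification, risk control, and return maximization. However, because other debt investments come in many varieties, investing in and managing them also has certain particularities."}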
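
Records 3 and 11 above contain computed answers, so their arithmetic can be checked directly. The sketch below assumes the formulas implied by the answers themselves: the current liabilities ratio as current liabilities divided by total liabilities, and the growth rate as the change relative to the previous year's value. The function names are illustrative and are not taken from the FinGLM code.

```python
def current_liabilities_ratio(current_liabilities: float, total_liabilities: float) -> float:
    """Share of total liabilities that is current (assumed formula)."""
    return current_liabilities / total_liabilities


def growth_rate(current: float, previous: float) -> float:
    """Change relative to the previous year's value (assumed formula)."""
    return (current - previous) / previous


# Figures copied from the answers of records 3 and 11.
ratio = current_liabilities_ratio(1068418275.97, 1748627619.69)
growth = growth_rate(7766850.48, 5781839.51)

print(f"{ratio:.2%}")   # 61.10%
print(f"{growth:.2%}")  # 34.33%
```

Both values match the percentages quoted in the sample answers, which suggests the answers were produced by computing the metric from extracted figures and not by free-form generation alone.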

Contributors:

The finalist teams include Mantou Technology, Nan Nadu Team, Marriage Buying a House and Daidaijiucai, nsddd, Chatglm Anti-roll General Administration, Xiaodaxiaowao, Northeast Big Potato, Anshuoshuo Eye Exploration Enterprise, and others. They have fully open-sourced their final solutions, code, and models for FinGLM.

02

Process

Going from annual report PDFs to an intelligent financial analysis Q&A system takes multiple steps, including data preparation and model fine-tuning. The finished system is then presented to end users through a Q&A interface.

1. Data preparation stage

    • PDF to TXT:

      • Convert to TXT format.

      • Keep the table and merge the cells.

    • Data segmentation:

      • Basic information: such as company name, etc.

      • Financial data: such as balance sheet, etc.

      • Comprehensive information: such as financial indicators, etc.

    • Data processing:

      • Basic calculation formulas: such as operating cost ratio, etc.

      • Calculate the growth rate.

      • Calculate industry averages and rankings.

    • Save to database:

      • Store in SQL, MongoDB, and Elasticsearch (ES).

      • Including table creation and storage.
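
The processing and storage steps above can be sketched with Python's built-in sqlite3 module standing in for the SQL store. The table name, column names, company names, and figures below are all hypothetical and chosen only for illustration; the project's real schema may differ.

```python
import sqlite3

# Hypothetical figures for illustration only; they are not taken from the dataset.
rows = [
    # (company, industry, year, revenue)
    ("Company A", "auto", 2019, 100.0),
    ("Company A", "auto", 2020, 120.0),
    ("Company B", "auto", 2019, 200.0),
    ("Company B", "auto", 2020, 180.0),
    ("Company C", "food", 2019, 50.0),
    ("Company C", "food", 2020, 75.0),
]

# Table creation and storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE financials ("
    "company TEXT, industry TEXT, year INTEGER, revenue REAL, "
    "PRIMARY KEY (company, year))"
)
conn.executemany("INSERT INTO financials VALUES (?, ?, ?, ?)", rows)


def revenue_growth(company, year):
    """Growth rate versus the previous year, or None if either year is missing."""
    cur = conn.execute(
        "SELECT year, revenue FROM financials WHERE company = ? AND year IN (?, ?)",
        (company, year, year - 1),
    )
    by_year = dict(cur.fetchall())
    if year not in by_year or year - 1 not in by_year:
        return None
    return (by_year[year] - by_year[year - 1]) / by_year[year - 1]


def industry_average(industry, year):
    """Mean revenue across companies in one industry and year."""
    cur = conn.execute(
        "SELECT AVG(revenue) FROM financials WHERE industry = ? AND year = ?",
        (industry, year),
    )
    return cur.fetchone()[0]


def revenue_ranking(year):
    """Companies ordered from highest to lowest revenue in a given year."""
    cur = conn.execute(
        "SELECT company FROM financials WHERE year = ? ORDER BY revenue DESC",
        (year,),
    )
    return [name for (name,) in cur]


print(revenue_growth("Company A", 2020))  # growth of Company A from 2019 to 2020
print(industry_average("auto", 2020))     # 150.0
print(revenue_ranking(2020))              # ['Company B', 'Company A', 'Company C']
```

Precomputing growth rates, averages, and rankings in this way means that ranking-style questions, such as the "top five by total assets" sample, can be answered with a simple lookup or ORDER BY query.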

2. Model fine-tuning stage

    • Data classification: such as SQL data, ES data, etc.

    • Select a fine-tuning strategy: such as P-Tuning v2, LoRA, etc.

    • Perform fine-tuning: based on selected strategy.
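
To give a feel for one of the strategies named above, here is a dependency-free sketch of the general LoRA idea: the pretrained weight W0 stays frozen, and only two small matrices A and B are trained, so the layer computes h = W0 x + (alpha / r) * B (A x). This illustrates the technique in general and is not the FinGLM training code; the toy matrices and the alpha value are made up.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W0, A, B, x, alpha):
    """Compute h = W0 x + (alpha / r) * B (A x), the LoRA-adapted layer output.

    W0 is the frozen d x k weight; A (r x k) and B (d x r) are the only
    trainable matrices, and r is the rank (number of rows of A).
    """
    r = len(A)
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d, k, r):
    """Parameters trained under full fine-tuning versus LoRA for one d x k weight."""
    return d * k, r * (d + k)


# Toy 2 x 3 layer with rank r = 1; all values are made up for illustration.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]           # r x k
B_zero = [[0.0], [0.0]]         # d x r, LoRA's usual zero initialisation
B_trained = [[0.5], [-0.5]]     # pretend values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W0, A, B_zero, x, alpha=2.0))     # [7.0, 2.0]
print(lora_forward(W0, A, B_trained, x, alpha=2.0))  # [13.0, -4.0]
print(trainable_params(4096, 4096, 8))               # (16777216, 65536)
```

With B initialised to zero the adapted layer reproduces the frozen model exactly, and for a 4096 x 4096 weight at rank 8 the trainable parameter count drops from about 16.8 million to 65,536, which is why such strategies suit teams with limited GPU memory.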

3. Q&A process

    • Input question: the user enters a question.

    • Prompt preparation: generate a prompt based on the question.

    • Generate query statements: select the generation method based on GPU usage.

    • Query the database: run the query and return the results.

    • Answer generation: combine questions and query results to generate answers.
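
The five steps above can be traced end to end in a small sketch. Everything here is an assumption made for illustration: the staff table, the company figures, the prompt wording, and above all generate_query, where a regular expression stands in for the model call that would normally produce the query statement.

```python
import re
import sqlite3

# Hypothetical table and figures, standing in for the project's real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (company TEXT, year INTEGER, employees INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Company A", 2021, 1200), ("Company B", 2021, 800)],
)


def build_prompt(question):
    """Prompt preparation: wrap the user question in an instruction."""
    return (
        "Table staff(company, year, employees). "
        f"Write a SQL query that answers: {question}"
    )


def generate_query(prompt):
    """Query generation: a regex stand-in for the model call (assumption)."""
    match = re.search(r"How many employees did (.+) have in (\d{4})\?", prompt)
    if match is None:
        return None
    company, year = match.group(1), int(match.group(2))
    sql = "SELECT employees FROM staff WHERE company = ? AND year = ?"
    return sql, (company, year)


def answer(question):
    """Run the full flow: question, prompt, query, database lookup, answer text."""
    prompt = build_prompt(question)
    query = generate_query(prompt)
    if query is None:
        return "Sorry, this question type is not supported by the sketch."
    sql, params = query
    row = conn.execute(sql, params).fetchone()
    if row is None:
        return "No matching record was found."
    company, year = params
    return f"In {year}, {company} had {row[0]} employees."


print(answer("How many employees did Company A have in 2021?"))
# In 2021, Company A had 1200 employees.
```

Keeping query generation behind a single function makes it straightforward to swap a rule-based generator for a fine-tuned model, which may be one way to read the "select the generation method based on GPU usage" step.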


03

Open-source license

The FinGLM project's data, solutions, code, and models are all open source. These resources are intended for research and exchange only, and commercial use is generally not recommended. If you use them commercially, you bear any resulting legal risk yourself.

Any commercial use of the model must comply with the ChatGLM model usage agreement.

References:

https://github.com/MetaGLM/FinGLM

https://mp.weixin.qq.com/s/FML3mx7McW735Qt0pgy6TQ


Origin blog.csdn.net/fogdragon/article/details/133366807