[AI Medicine] Fine-tuning and pre-training projects for large models in the medical field

Keywords: AI medicine, large medical model, instruction fine-tuning, PubMed

Previous posts in this series:
• OpenGPT, a framework for generating domain instruction fine-tuning samples, and the medical large model NHS-LLM
• ChatDoctor: building a medical-domain large model from fine-tuning samples generated with a medical knowledge base

An earlier post sorted out two representative large models in the AI medical field, focusing on how fine-tuning data is generated for domain models and how fine-tuning is carried out on top of general-purpose large models. This article continues in that direction and surveys current projects on fine-tuning and pre-training large AI medical models.

Medical fine-tuned models

1. MedicalGPT-zh

A Chinese medical general-purpose model instruction fine-tuned from ChatGLM-6B with 16-bit LoRA. From Chinese medical consensus statements and clinical guideline texts covering 28 departments in total, the project generates a high-quality instruction dataset with broad coverage of medical knowledge and accurate answers.

Project address : https://github.com/MediaBrain-SJTU/MedicalGPT-zh

  • The format of samples generated from medical-guideline knowledge documents is as follows (a conversion sketch follows this list):
{
"指南": "患者获益;相反减少热卡摄入会导致营养不良,尤其是晚期CKD患者。因此,减轻体重干预措施的利弊需进一步研究,并未作为临床推荐。CKD合并糖尿病患者的降糖治疗一些新型降糖药物已获准应用于临床,包括钠葡萄糖共转运蛋白2抑制剂(SGLT2i),胰高血糖素样肽-1受体激动剂(GLP-1RA)和二肽基肽酶4抑制剂(DPP4i)。多项大型临床试验证实新型降糖药物能有效改善2型糖尿病患者心血管和肾脏临床预后(图4)。这些试验为治疗提供了证据。在对已发表的研究成果进行严格的系统回顾后,工作组经过综合考量,为CKD患者合并2型糖尿病推荐了首选的治疗方案。1型糖尿病合并CKD患者,仍以胰岛素降糖主为,缺乏新型降糖药物疗效的对照研究,故在本指南未涉及。指南推荐CKD患者合并2型糖尿病患者在生活方式干预的基础上,以二甲双胍联合SGLT2i作为一线降糖治疗方案,必要时再联合其他降糖药物(图5)。对于大多数eGFR≥30ml/(min·1.73m2)的CKD合并糖尿病患者可以从二甲双胍和SGLT2i治疗中获益。二甲双胍是一种价格低廉且耐受性良好的药物,可以有效降低血糖;SGLT2i有益于降低CKD及CVD患者心血管",
        "qa_pairs": [
            {
                "question": "CKD患者减轻体重的干预措施可能会导致什么后果?有哪些风险需要注意?",
                "answer": "CKD患者减轻体重的干预措施可能会导致以下风险和后果:\n\n1. 营养不良:如果减重不当,可能会导致营养不良和不良的健康后果。\n\n2. 肌肉损伤:不适当的减重可能会导致肌肉损伤,尤其是在白天饮食受限的情况下。\n\n3. 血糖水平变化:CKD患者减重可能会影响血糖水平的稳定性。\n\n4. 肾脏功能恶化:不适当的减重可能会影响肾脏功能的稳定性。\n\n5. 药物副作用增加:减重可能会影响药物的代谢和吸收,从而增加药物副作用的风险。\n\n6. 微量元素缺失:减重可能会导致某些微量元素的摄入不足,例如钙、铁和维生素D等。\n\n因此,CKD患者在减轻体重时需要注意上述风险,并在医生或营养师的指导下制定合理的减肥计划。"
            },
            {
                "question": "新型降糖药物对于CKD并发糖尿病的患者能够产生怎样的治疗效果?这些效果有哪些证据支持?",
                "answer": "一些新型降糖药物针对CKD并发糖尿病的患者具有良好的治疗效果,包括以下:\n\n1. 糖皮质激素受体拮抗剂(SGLT2抑制剂):这类药物可以通过促进糖尿病患者的尿液排泄,从而降低血糖水平。目前已有多项研究证实,这类药物可以在CKD患者中降低血糖水平,并且有助于减少肾脏疾病的进展。\n\n2. 胰岛素类药物:这类药物可以增强糖尿病患者的胰岛素功能,从而促进血糖的吸收和利用。尽管胰岛素类药物在CKD患者中使用可能存在一些风险,但已有研究表明,一些胰岛素类药物在CKD并发糖尿病的患者中可以有效控制血糖水平。\n\n3. 胰岛素样增效因子-1受体激动剂(GLP-1RA):这类药物可以增强患者的胰岛素分泌和对胰岛素的敏感度,从而降低血糖水平。现有的研究证据表明,GLP-1RA对于CKD患者并发糖尿病的治疗具有潜在的益处。\n\n需要注意的是,不同类型的药物可能具有不同的治疗效果和副作用,应根据患者的具体情况进行选择和调整治疗方案。此外,任何药物的使用都应在医生的监护下进行。"
            },
            {
                "question": "在CKD合并糖尿病的治疗中,二甲双胍和SGLT的联合使用能够带来哪些益处?这些药物适用于哪些CKD患者?",
                "answer": "二甲双胍是一种口服药物,通常用于治疗2型糖尿病,能够降低血糖和减少体重。SGLT(Sodium glucose co-transporter 2)抑制剂是一种新型口服降糖药物,能够通过抑制肾脏中的SGLT2,促使糖尿病患者在尿液中排出更多的葡萄糖,从而降低血糖。\n\n对于合并糖尿病的CKD患者,二甲双胍和SGLT抑制剂的联合使用可能会带来以下益处:\n\n1. 降低血糖:二甲双胍和SGLT抑制剂的联合使用可以更加有效地降低血糖,相比于单一药物的治疗,联合使用可能会更加有效。\n\n2. 保护肾脏:SGLT抑制剂具有减轻肾小管负担的作用,可以减少肾脏的损伤和疾病进展的风险。此外,二甲双胍的使用也可以降低糖尿病患者肾脏疾病的风险。\n\n3. 减轻体重:二甲双胍和SGLT抑制剂的联合使用可以通过减少摄入的糖分和促进葡萄糖的代谢而减少体重。\n\n适用于哪些CKD患者?\n\n二甲双胍和SGLT抑制剂的联合使用适用于合并糖尿病的CKD患者。然而,这些药物并不适用于所有CKD患者,因此应该根据患者的具体情况,在专业医生的建议下使用这些药物。此外,需要注意二甲双胍和SGLT抑制剂的副作用和药物相互作用,以避免不良反应。"
            }
        ]
}
  • The Chinese medical consensus and clinical-guideline texts cover roughly 32k text segments across 28 departments; their distribution is shown in the figure below.
    [Figure: distribution of the 32k text segments across the 28 departments]
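
To make the use of such records concrete, here is a minimal sketch that flattens one guideline record like the JSON sample above into (instruction, input, output) triples of the kind commonly used for instruction fine-tuning. Only the "指南"/"qa_pairs"/"question"/"answer" keys come from the sample; the file name and the flat output schema are assumptions.

import json

def guideline_to_instructions(path: str) -> list:
    """Flatten one guideline record's qa_pairs into instruction-tuning samples."""
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    samples = []
    for pair in record["qa_pairs"]:
        samples.append({
            "instruction": pair["question"],  # question generated from the guideline text
            "input": "",
            "output": pair["answer"],
        })
    return samples

if __name__ == "__main__":
    print(len(guideline_to_instructions("guideline_sample.json")), "instruction samples")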

2. DoctorGLM

Based on ChatGLM-6B, the project applies LoRA and P-Tuning v2 respectively, introducing a Chinese medical dialogue dataset of nearly 800k question-answer pairs to fine-tune ChatGLM-6B.

Project address : https://github.com/xionghonglin/DoctorGLM

- The Chinese medical dialogue dataset contains 6 folders, with question-answer pair counts as follows:

In total: 6 folders, 792,099 question-answer pairs:
<Andriatria_男科> (andrology): 94,596 QA pairs
<IM_内科> (internal medicine): 220,606 QA pairs
<OAGD_妇产科> (obstetrics & gynecology): 183,751 QA pairs
<Oncology_肿瘤科> (oncology): 75,553 QA pairs
<Pediatric_儿科> (pediatrics): 101,602 QA pairs
<Surgical_外科> (surgery): 115,991 QA pairs

- The data is in CSV format; an example is as follows:

Columns: Department, Title, Ask, Answer

Department: Cardiology
Title: Can hypertensive patients eat Codonopsis pilosula?
Ask: I have high blood pressure. My son-in-law visited these past two days and brought me some Codonopsis pilosula to soak in water. Hello, can I take Codonopsis pilosula with high blood pressure?
Answer: Patients with high blood pressure can take Codonopsis pilosula orally. Codonopsis pilosula helps lower blood lipids and blood pressure and can clear waste from the blood, so it has a certain stabilizing and preventive effect for patients with coronary heart disease and cardiovascular disease. In addition, Codonopsis pilosula not only replenishes qi and nourishes the blood, calms the central nervous system, and regulates digestive function, but also strengthens the spleen and lungs. Thank you for your consultation; I hope this explanation helps.

Department: Gastroenterology
Title: Which hospital can treat gastric reflux?
Ask: Heartburn, hiccups, cough, and low-grade fever for more than 4 years.
Answer: It is recommended that you take omeprazole together with domperidone (Motilium) or mosapride or Zhuansheng Liwei, and you can also add Daxi tablets.
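
As a rough illustration of the LoRA path (the P-Tuning v2 path is analogous), the sketch below wraps ChatGLM-6B with LoRA adapters using the Hugging Face peft library and turns one CSV row into a training text. This is a hedged sketch, not DoctorGLM's actual training script; the hyperparameters and the prompt template are assumptions, and only the column names come from the example above.

import pandas as pd
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half()

# Wrap the base model with LoRA adapters; "query_key_value" is the fused
# attention projection inside ChatGLM's transformer blocks.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable

# Turn one row of the dialogue CSV (Department, Title, Ask, Answer) into a
# single training text; the template itself is an assumption.
def row_to_text(row: pd.Series) -> str:
    return f"问:[{row['Department']}] {row['Ask']}\n答:{row['Answer']}"

Training then proceeds as ordinary causal-LM fine-tuning on these texts, with only the LoRA parameters updated.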

3. Huatuo-Llama-Med-Chinese & ChatGLM-Med

Instruction fine-tuning of LLaMA-7B on Chinese medical knowledge. The instruction samples are generated by running the Chinese medical knowledge graph CMeKG through ChatGPT; fine-tuning is also carried out on ChatGLM-6B, yielding a new model, ChatGLM-Med (6B). The fine-tuning set totals nearly 8k samples.

Project address : https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese

The project does not give the sample-generation prompt in detail. For going from a structured knowledge graph to fine-tuning samples with the help of ChatGPT, the ChatDoctor and DoctorGLM projects are useful references.
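
A hypothetical sketch of that knowledge-graph-to-sample step, using the legacy openai SDK: a CMeKG-style triple is handed to ChatGPT with a prompt asking for a question-answer sample. The prompt wording, the triple format, and the output schema are all assumptions, since the project does not publish them.

import openai  # legacy openai<1.0 SDK interface

def triple_to_sample(head: str, relation: str, tail: str) -> str:
    """Ask ChatGPT to turn one knowledge-graph triple into an instruction sample."""
    prompt = (
        "根据下面的医学知识三元组,生成一条问答形式的指令微调样本,"
        '以 {"instruction": ..., "output": ...} 的 JSON 格式返回。\n'
        f"三元组: ({head}, {relation}, {tail})"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(triple_to_sample("肝癌", "检查", "甲胎蛋白测定"))  # illustrative triple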

Medical pre-trained language models

1. BioMedLM (2.7B)

Stanford CRFM continued pre-training a GPT-2-architecture model on the abstracts and full texts of PubMed biomedical papers. The pre-training data totals 300B tokens, and the model reaches a score of 50.3 on the MedQA task.

Project address : https://github.com/stanford-crfm/BioMedLM
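
A minimal continued-pre-training sketch in this style, using the Hugging Face Trainer on a plain-text file of PubMed abstracts (one abstract per line). The "gpt2" checkpoint stands in for BioMedLM's actual GPT-2-style model and tokenizer; the file name and hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One PubMed abstract per line in a plain-text file (assumed layout).
dataset = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biomed-continued-pt",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # mlm=False => ordinary causal (next-token) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()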

2. PMC-LLaMA (7B)

Starting from the LLaMA model, the project adds 4.9M medically relevant academic papers from PubMed Central, over 75B tokens, and continues pre-training LLaMA. Like BioMedLM it pre-trains on PubMed data; the differences are that it builds on LLaMA and applies its own screening logic for selecting medically relevant papers.

Paper title : PMC-LLaMA: Further Finetuning LLaMA on Medical Papers
Paper address : https://arxiv.org/abs/2304.14454
Project address : https://github.com/chaoyi-wu/PMC-LLaMA

The paper fine-tunes LLaMA-7B with both full-parameter and PEFT approaches. Compared with the original model, performance on the evaluation sets improves, showing that continued pre-training on domain data is effective for the model's domain capability, although it still falls short of ChatGPT.
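
One common automated way to obtain such evaluation numbers on multiple-choice sets like MedQA is likelihood scoring: score each candidate answer by the model's average log-likelihood and pick the best. The sketch below illustrates the idea; the "gpt2" checkpoint and the prompt format are placeholders, not the paper's exact protocol.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def pick_option(question: str, options: list) -> int:
    """Return the index of the option with the highest average log-likelihood."""
    scores = []
    for option in options:
        ids = tokenizer(f"Question: {question}\nAnswer: {option}",
                        return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
        scores.append(-loss.item())
    return max(range(len(options)), key=scores.__getitem__)

print(pick_option("Which vitamin deficiency causes scurvy?",
                  ["Vitamin A", "Vitamin C", "Vitamin D"]))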


3. BioMedGPT (1.6B)

OpenBioMed: an open-source toolkit for multimodal representation learning for AI-driven biomedical research. The project focuses on multimodal information, such as knowledge graphs and biomedical texts for drugs, proteins, and single cells, and on a wide range of applications, including drug-target interaction prediction, molecular property prediction, cell type prediction, molecule-text retrieval, molecule-text generation, and drug response prediction. Researchers can use a number of deep learning models, including BioMedGPT-1.6B and CellLM, to facilitate downstream tasks, and the project provides easy-to-use APIs and commands to accelerate life science research.

Project address : https://github.com/BioFM/OpenBioMed

Summary and reflection

Looking across the medical large models above, work on domain large models falls mainly into two camps: one continues pre-training a generative language model on massive domain data; the other introduces domain data for instruction fine-tuning on top of a general-purpose base model (general large-model base + domain-data instruction tuning).

Continued pre-training of a generative language model demands far more data and compute, so most current projects concentrate on instruction fine-tuning general models with domain data. Within instruction fine-tuning, the work differs mainly in "domain sample generation" (e.g., the various self-instruct style schemes for generating samples) and "low-resource training" (e.g., the various PEFT methods that fine-tune only a subset of parameters).

Building large models for the medical field is necessary. Because medical data is sensitive, external cloud services are often not usable, so building a private medical large model and deploying it locally has real application scenarios. Although current ChatGPT-like systems already show considerable capability in medical Q&A, the particularities of the usage scenarios, such as automated processing of hospital electronic medical records or construction of patient diagnosis-and-treatment timelines, mean that players in the medical industry still need to build and refine their own locally deployable domain models.

At present, public data in the Chinese medical field is still scarce, and data generated with ChatGPT as the teacher carries bias and uncertainty. High-quality data is crucial to improving model performance, so medical data resources need to be pooled at a higher level to raise the quality of standardized data. At the same time, a medical-domain evaluation set that can measure large-model capability is also necessary: most open-source projects today still evaluate generated results manually, at an impressionistic level. Establishing unified evaluation methods and automated evaluation tools for domain large models is likewise important for future development.

Thanks to the open source community for their contributions to large models & AI medicine!



----------END----------
