Tianchi Traditional Chinese Medicine Instructions Entity Recognition Challenge Champion Solution Open Source (1): Solution and Model Principles



Introduction

Artificial intelligence plays an important role in accelerating the inheritance and innovative development of traditional Chinese medicine (TCM). In particular, information extraction from TCM texts is the core component of TCM knowledge graph construction, which lays the foundation for upper-level applications such as clinical decision support systems (CDSS).

In this NER challenge, the goal is to extract key information from TCM drug instructions, covering 13 entity types such as drugs, drug ingredients, diseases, symptoms, and syndromes, and to build a TCM drug knowledge base.

By applying natural language processing and deep learning techniques, we can perform semantic analysis and entity extraction on TCM drug instructions. A trained model can identify and extract important information such as drug names, drug ingredients, related diseases, symptoms, and syndromes, and store it in the TCM drug knowledge base.

Building a TCM drug knowledge base is of great significance for advancing TCM research and clinical practice. Such a knowledge base gives TCM researchers and clinicians convenient, quick reference information and helps them gain a deeper understanding of the characteristics, efficacy, and scope of application of TCM drugs. It also provides the data foundation for intelligent applications in the TCM field, such as CDSS and other medical decision support systems.

The project open-sourced here is Chinese-DeepNER-Pytorch, the champion solution of the Tianchi Traditional Chinese Medicine Instructions Entity Recognition Challenge.

Background

Task details

Artificial intelligence plays an important role in the inheritance, innovation, and development of traditional Chinese medicine. In particular, information extraction from TCM texts is the core component of TCM knowledge graph construction and provides the basis for upper-level applications such as clinical decision support systems (CDSS). The goal of this Named Entity Recognition (NER) challenge is to extract key information from TCM drug instructions, covering 13 entity types such as drugs, drug ingredients, diseases, symptoms, and syndromes, in order to build a TCM drug knowledge base.

Data exploration and analysis

The training data for this competition has three characteristics:

  • Most TCM drug instructions are long texts.


  • Medical scenarios often face the challenge of insufficient labeled samples.


  • Label distributions in the medical domain are usually imbalanced.

Core idea

Data preprocessing

First, the instruction text needs pre-cleaning and long-text segmentation. The pre-cleaning stage filters out invalid characters to ensure the accuracy and usability of the text. For the long-text problem, a two-level sentence-splitting strategy is adopted: by segmenting long texts into shorter sentences, the content can be processed and understood better.

Since the segmented sentences may be too short, short pieces are then merged, while keeping each merged text under a preset maximum length. This maintains the coherence and integrity of the text, as sketched below.
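The following minimal sketch illustrates this split-and-merge idea in Python; the function names, the punctuation sets, and the `max_len` value are illustrative assumptions rather than the original competition code.

```python
import re

def two_level_split(text, max_len=512):
    """Level 1: split on sentence-ending punctuation; level 2: split still-long pieces on weaker separators."""
    coarse = re.split(r'(?<=[。！？])', text)            # level 1: sentence enders
    pieces = []
    for seg in coarse:
        if len(seg) <= max_len:
            pieces.append(seg)
        else:
            pieces.extend(re.split(r'(?<=[；;，,])', seg))  # level 2: commas / semicolons
    return [p for p in pieces if p]

def merge_short_pieces(pieces, max_len=512):
    """Greedily merge adjacent short pieces so each chunk stays under max_len."""
    chunks, buf = [], ''
    for p in pieces:
        if len(buf) + len(p) <= max_len:
            buf += p
        else:
            if buf:
                chunks.append(buf)
            buf = p
    if buf:
        chunks.append(buf)
    return chunks

# usage: chunks = merge_short_pieces(two_level_split(raw_text, 510), 510)
```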

In addition, all labeled data can be used to build an entity knowledge base that serves as a prior dictionary for the domain. Such a knowledge base provides information and context about entities and offers strong support and reference for subsequent entity extraction, as sketched below.
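A small sketch of how such a prior dictionary could be collected, assuming each labeled example is a dict with an `entities` list of `(span_text, entity_type)` pairs (this data format is an assumption for illustration):

```python
from collections import defaultdict

def build_entity_dictionary(examples):
    """Collect every labeled surface form per entity type as a prior lexicon."""
    lexicon = defaultdict(set)
    for ex in examples:  # ex: {"text": ..., "entities": [(span_text, entity_type), ...]}
        for span_text, ent_type in ex["entities"]:
            lexicon[ent_type].add(span_text)
    return {ent_type: sorted(spans) for ent_type, spans in lexicon.items()}
```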

Baseline: BERT-CRF

[Figure: BERT-CRF baseline architecture]

  • Baseline details

    • Pre-trained model: UER-large (24 layers) [1]; this RoBERTa-wwm-large model is further pre-trained on a large, high-quality Chinese corpus under the UER framework and ranked first among single models on the CLUE tasks
    • Differential learning rates: 2e-5 for the BERT layers, 2e-3 for the other layers (sketched below)
    • Parameter initialization: the non-BERT modules of the model use the same initialization scheme as BERT
    • Sliding parameter averaging: the weights of the last few epochs are averaged to obtain a smoother, better-performing model (sketched below)
  • Baseline bad-case analysis

[Figure: baseline bad-case analysis]
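The two training tricks above can be sketched roughly as follows, assuming a Hugging Face style model whose encoder parameters are prefixed with `bert`; the parameter naming and the averaging helper are illustrative, not the original code.

```python
import torch
from torch.optim import AdamW

def build_optimizer(model, bert_lr=2e-5, other_lr=2e-3):
    """Differential learning rates: small LR for the pretrained encoder, larger LR for new layers."""
    bert_params, other_params = [], []
    for name, param in model.named_parameters():
        (bert_params if name.startswith('bert') else other_params).append(param)
    return AdamW([
        {'params': bert_params, 'lr': bert_lr},
        {'params': other_params, 'lr': other_lr},
    ])

def average_checkpoints(state_dicts):
    """Sliding parameter averaging: element-wise mean of the last few epoch checkpoints."""
    avg = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    return avg
```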

Optimization 1: Adversarial training

  • Motivation: use adversarial training to alleviate the model's limited robustness and improve its generalization ability
  • Adversarial training is a training method that injects noise; it regularizes the parameters and improves robustness and generalization (an FGM sketch follows this list)
    • Fast Gradient Method (FGM): add a perturbation to the embedding layer along the gradient direction
    • Projected Gradient Descent (PGD) [2]: perturb iteratively, projecting each perturbation back into the allowed range
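For reference, the commonly used PyTorch formulation of FGM looks roughly like this; it assumes the embedding parameters contain `word_embeddings` in their name, and the training-step comments are only a sketch:

```python
import torch

class FGM:
    """Fast Gradient Method: perturb the word-embedding weights along the gradient direction."""
    def __init__(self, model, epsilon=1.0, emb_name='word_embeddings'):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# One training step (sketch):
# loss = model(**batch).loss; loss.backward()      # gradients on clean inputs
# fgm.attack(); model(**batch).loss.backward()     # extra gradients on perturbed embeddings
# fgm.restore(); optimizer.step(); optimizer.zero_grad()
```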

Optimization 2: Mixed precision training (FP16)

  • Motivation: adversarial training reduces computational efficiency, so mixed precision training is used to cut the training time (a sketch follows this list)
  • Mixed precision training
    • Store and multiply in FP16 in memory to speed up computation
    • Accumulate in FP32 to avoid rounding errors
  • Loss scaling
    • Scale the loss up by 2^k before backpropagation so that small FP16 gradients do not underflow
    • Un-scale the weight gradients back after backpropagation
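In recent PyTorch versions this whole recipe (FP16 compute, FP32 accumulation, dynamic loss scaling) is packaged in `torch.cuda.amp`; a minimal training-step sketch, assuming `model`, `optimizer`, and `train_loader` are defined elsewhere and the model returns a Hugging Face style output with a `.loss` field:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                    # handles loss scaling / un-scaling automatically

for batch in train_loader:
    optimizer.zero_grad()
    with autocast():                     # forward pass runs in FP16 where safe
        loss = model(**batch).loss
    scaler.scale(loss).backward()        # scale the loss before backprop to avoid FP16 underflow
    scaler.step(optimizer)               # un-scales gradients, then runs the optimizer step
    scaler.update()                      # adjusts the scale factor dynamically
```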

Optimization 3: Multi-model Fusion

  • Motivation: the baseline's errors are concentrated on ambiguous cases, so a multi-level medical named entity recognition system is adopted to disambiguate

  • Method: a differentiated, multi-level model fusion system

    • Model framework differentiation: BERT-CRF & BERT-SPAN & BERT-MRC
    • Training data differentiation: vary the random seed and the sentence segmentation length (256 vs. 512)
    • Multi-level model fusion strategy
  • Fusion Model 1 - BERT-SPAN

    • Replace the CRF module with a span pointer head to speed up training
    • Predict the start and end positions of entities with a half-pointer, half-label structure, assigning the entity category in the labels
    • Use strict decoding: among overlapping entities, keep the one with the largest logits to ensure precision
    • Use label smoothing to alleviate overfitting (a sketch of the span head follows the figure)

[Figure: BERT-SPAN architecture]
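A sketch of such a span head, assuming `num_types` entity categories plus a "none" label at index 0, a recent PyTorch (for the `label_smoothing` argument), and the Hugging Face `transformers` library; the layer names are illustrative:

```python
import torch.nn as nn
from transformers import BertModel

class BertSpanNER(nn.Module):
    """For every token, predict the entity type that starts / ends there (0 = none)."""
    def __init__(self, bert_name, num_types, label_smoothing=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.start_fc = nn.Linear(hidden, num_types + 1)
        self.end_fc = nn.Linear(hidden, num_types + 1)
        self.loss_fct = nn.CrossEntropyLoss(label_smoothing=label_smoothing)

    def forward(self, input_ids, attention_mask, start_labels=None, end_labels=None):
        seq_out = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        start_logits = self.start_fc(seq_out)
        end_logits = self.end_fc(seq_out)
        if start_labels is not None:
            num_labels = start_logits.size(-1)
            loss = (self.loss_fct(start_logits.view(-1, num_labels), start_labels.view(-1))
                    + self.loss_fct(end_logits.view(-1, num_labels), end_labels.view(-1)))
            return loss, start_logits, end_logits
        return start_logits, end_logits
```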

  • Fusion Model 2 - BERT-MRC
    • Treats NER as a machine reading comprehension (MRC) task
      • query: a natural-language description of the entity type
      • doc: the sentence-segmented original text
    • One sample is constructed per entity type; this produces a large number of negative samples during training, so 30% of them are randomly kept and the rest are discarded for efficiency
    • At prediction time a sample must be constructed for every category, and the decoding output is not restricted, to guarantee recall
    • Use label smoothing to alleviate overfitting
    • MRC's precision on this dataset is poor and its training and inference efficiency is low; it serves only to raise recall. The provided code is for learning purposes and is not recommended for regular use (a sample-construction sketch follows the figure)

[Figure: BERT-MRC architecture]
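A sketch of how the MRC samples could be assembled, one query per entity type; the query texts, the entity tuple format `(start, end, type)`, and the helper names are written out here for illustration only:

```python
import random

# Hypothetical natural-language queries, one per entity type (illustrative only)
TYPE_QUERIES = {
    'DRUG': 'Find the drug names mentioned in the text',
    'SYMPTOM': 'Find the symptoms mentioned in the text',
    # ... one query for each of the 13 entity types
}

def build_mrc_samples(text, entities, keep_neg_ratio=0.3):
    """One (query, doc) sample per entity type; keep only ~30% of empty (negative) samples."""
    samples = []
    for ent_type, query in TYPE_QUERIES.items():
        spans = [(s, e) for s, e, t in entities if t == ent_type]
        if not spans and random.random() > keep_neg_ratio:
            continue  # discard most negative samples to keep training efficient
        samples.append({'query': query, 'doc': text, 'spans': spans})
    return samples

# At prediction time, build one sample per entity type without any filtering.
```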

  • Multi-level fusion strategy
    • Level 1 (probability fusion): for each of CRF/SPAN/MRC, the logits of the five cross-validation fold models are averaged and then decoded into entities
    • Level 2 (voting fusion): the probability-fused CRF/SPAN/MRC outputs are combined by voting to obtain the final result (see the sketch below)

[Figure: multi-level fusion strategy]
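A sketch of the two fusion levels, assuming each fold model already produces logits of the same shape and that decoded entities are hashable `(start, end, type)` tuples; the helper names and the voting threshold are illustrative:

```python
from collections import Counter

import numpy as np

def probability_fusion(fold_logits):
    """Level 1: average the logits from the five cross-validation fold models."""
    return np.mean(np.stack(fold_logits, axis=0), axis=0)

def vote_fusion(entity_sets, min_votes=2):
    """Level 2: keep an entity if at least `min_votes` of the fused systems predict it."""
    votes = Counter(ent for ents in entity_sets for ent in set(ents))
    return sorted(ent for ent, n in votes.items() if n >= min_votes)

# Usage sketch (decode() is model-specific and not shown here):
# crf_entities  = decode(probability_fusion(crf_fold_logits))
# span_entities = decode(probability_fusion(span_fold_logits))
# final = vote_fusion([crf_entities, span_entities, mrc_entities])
```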

Optimization 4: Semi-supervised learning

  • Motivation: to alleviate the scarcity of labeled corpora in medical scenarios, semi-supervised learning (pseudo-labeling) is used to make full use of the 500 unlabeled preliminary-round test samples
  • Strategy: dynamic pseudo-labels
    • First train a base model M on the original labeled data
    • Use M to predict the preliminary test set to obtain pseudo-labels
    • Add the pseudo-labeled data to the training set with a dynamic, learnable weight (alpha in the figure) and train it together with the truly labeled data to obtain model M'

[Figure: dynamic pseudo-label training with weight alpha]
  • Tips: using the multi-model fusion system as the base model reduces pseudo-label noise; the weight can also be kept fixed, and which option works better takes experimentation. In essence, the method lowers the loss weight of pseudo-labeled samples, which alleviates pseudo-label noise (see the sketch below).
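A sketch of the weighted pseudo-label loss, assuming the per-sample losses are already computed and each batch carries an `is_pseudo` boolean mask; the class name and the initial alpha value are illustrative:

```python
import torch
import torch.nn as nn

class PseudoLabelLoss(nn.Module):
    """Down-weight the per-sample loss of pseudo-labeled examples by a factor alpha."""
    def __init__(self, alpha=0.4, learnable=True):
        super().__init__()
        alpha = torch.tensor(float(alpha))
        if learnable:
            self.alpha = nn.Parameter(alpha)   # dynamic, learnable weight
        else:
            self.register_buffer('alpha', alpha)  # fixed weight variant

    def forward(self, per_sample_loss, is_pseudo):
        # per_sample_loss: (batch,) float tensor; is_pseudo: (batch,) boolean mask
        weight = torch.where(is_pseudo,
                             self.alpha.clamp(0.0, 1.0).expand_as(per_sample_loss),
                             torch.ones_like(per_sample_loss))
        return (weight * per_sample_loss).mean()
```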

Other Attempts with No Significant Improvement

  • Dynamically weighting the outputs of the last four BERT layers brought no obvious improvement
  • Adding a BiLSTM / IDCNN module after the BERT output overfits severely and greatly slows down training
  • Data augmentation by randomly replacing entities with similar entity words to expand the training data
  • Using focal loss / dice loss in the BERT-SPAN / MRC models to alleviate label imbalance
  • Using the constructed domain dictionary to revise the model output

The final online score was 72.90%, ranking 1st in both the semi-final and the final.

References

[1] Zhao et al., UER: An Open-Source Toolkit for Pre-training Models, EMNLP-IJCNLP, 2019.
[2] Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR, 2018.
[3] Tianchi Traditional Chinese Medicine Instructions Entity Recognition Challenge Champion Solution Open Source

Other materials for download

If you want to continue learning about AI-related learning routes and knowledge systems, you are welcome to read my other blog post "Heavy | Complete Artificial Intelligence AI Learning - Basic Knowledge Learning Route, All Materials Downloadable Directly from the Network Disk Without Any Tricks".
That post draws on well-known open-source platforms on GitHub, AI technology platforms, and experts in related fields, including Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang. There are roughly 100 GB of related materials, and I hope they help everyone.
